feat(sdk): add bench placement knob (preferred default, strict opt-in) by epappas · Pull Request #460 · one-covenant/basilica

epappas · 2026-05-04T22:52:24Z

Summary

Mirrors the operator-side BenchPlacement enum on the SDK wire types so users can opt into strict bench-Pod pinning when honest measurement matters more than runnability.

This is the SDK companion to basilica-backend PR #444 (operator-side CRD field + render branching). The two PRs are interdependent and should ship together. The operator side is forward-compatible: it always accepts the field, and older SDKs that don't emit it get the operator-side default, Preferred.
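The forward-compat contract can be illustrated with a small sketch. This is not the operator's code (that lives in basilica-backend); `resolve_placement`, `DEFAULT_PLACEMENT`, and the payload shapes are hypothetical names chosen for illustration, assuming only that an absent `placement` key is treated as `preferred`.

```python
from typing import Any, Dict

# Hypothetical stand-ins for the operator-side default and accepted tokens.
DEFAULT_PLACEMENT = "preferred"
VALID_PLACEMENTS = ("preferred", "strict")


def resolve_placement(bench_spec: Dict[str, Any]) -> str:
    """Return the effective placement for a bench spec received on the wire.

    An absent key (older SDKs, or newer SDKs using the default) resolves to
    the default, so both SDK generations behave the same way.
    """
    placement = bench_spec.get("placement", DEFAULT_PLACEMENT)
    if placement not in VALID_PLACEMENTS:
        raise ValueError(f"unknown placement: {placement!r}")
    return placement


old_sdk_payload = {"mode": "on-start"}  # SDK that predates the field
new_sdk_payload = {"mode": "on-start", "placement": "strict"}

print(resolve_placement(old_sdk_payload))
print(resolve_placement(new_sdk_payload))
```

Because the default lives on the receiving side, the SDK never has to emit `preferred` explicitly, which is what keeps the wire format unchanged for existing callers.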

What changed

Rust SDK (basilica-sdk):

New DistributedBenchPlacement { Preferred, Strict } enum (lowercase serde, matches operator's BenchPlacement).
DistributedBenchSpec.placement: Option<DistributedBenchPlacement> with skip_serializing_if = Option::is_none so the field is omitted on the wire when unset (forward-compat with pre-placement operators).
2 round-trip serde tests lock the lowercase token + None-elision contract.

Python SDK (basilica-sdk-python):

deploy_distributed(...) and deploy_distributed_async(...) gain bench_placement: str = "preferred".
Default emits no placement key on the wire; only "strict" opts into the new field.
Docstring documents the silently-wrong > no-number trade-off.
3 pytest cases cover default-omits-field, strict-emits-token, invalid-placement-rejected.
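The wire behaviour listed above can be sketched as follows. This is a simplified illustration, not the SDK's `_build_distributed_request`: `build_bench_dict` is a hypothetical helper, and the `ValidationError` class is a local stand-in whose `field`/`value` attributes are assumed from the review text.

```python
from typing import Any, Dict


class ValidationError(ValueError):
    """Local stand-in for the SDK's ValidationError (shape assumed)."""

    def __init__(self, message: str, field: str, value: Any) -> None:
        super().__init__(message)
        self.field = field
        self.value = value


def build_bench_dict(bench: str, bench_placement: str = "preferred") -> Dict[str, Any]:
    """Hypothetical helper: build the `bench` part of the request payload."""
    # Reject unknown tokens before anything is put on the wire.
    if bench_placement not in ("preferred", "strict"):
        raise ValidationError(
            f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
            field="bench_placement",
            value=bench_placement,
        )
    bench_dict: Dict[str, Any] = {"mode": bench}
    # Only the non-default value is emitted; "preferred" leaves the key out.
    if bench == "on-start" and bench_placement == "strict":
        bench_dict["placement"] = "strict"
    return bench_dict


print(build_bench_dict("on-start"))
print(build_bench_dict("on-start", "strict"))
```

The three cases the pytest suite is described as covering map directly onto this sketch: the default call has no `placement` key, the `"strict"` call carries the token, and any other string raises with `field == "bench_placement"`.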

CLI (basilica-cli):

distributed-train handler updated to construct the new BenchSpec shape (placement: None, semantics unchanged).

Architecture rationale

The architecturally honest measurement requires the bench Pod pair to land on the exact nodes the worker StatefulSet runs on (issue #441). On hardware with multiple GPUs per node, that pin is unambiguously correct. On the current production cluster (1 GPU per node fleet-wide), the strict pin produces Pending Pods because the workers occupy the only GPU per node.

The claim "silently-wrong > no-number" is false: a silently wrong number is not better than no number. Strict is therefore the architecturally correct mode for honest measurement. Preferred is a deliberate usability concession to the current single-GPU-per-node fleet: it lets the bench fall back off the worker pair so the example stays runnable. Users who need a truthful number on multi-GPU/node hardware (research papers, SLA evidence, capacity planning) opt into strict explicitly.
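The difference between the two modes can be made concrete with a toy model. This is a deliberately simplified sketch, not the Kubernetes scheduler or the operator's render logic; `schedule_bench_pod`, the node names, and the free-GPU map are all hypothetical.

```python
from typing import Dict, Optional


def schedule_bench_pod(
    worker_node: str,
    free_gpus: Dict[str, int],
    placement: str,
) -> Optional[str]:
    """Pick a node for one bench Pod, or return None if it would stay Pending.

    `free_gpus` maps node name -> GPUs left over once the workers are placed.
    """
    if free_gpus.get(worker_node, 0) > 0:
        # Co-located with the worker: the bench measures the worker's own link.
        return worker_node
    if placement == "strict":
        # Refuse to measure a different node; the Pod stays Pending.
        return None
    # "preferred": fall back to any node that still has a spare GPU.
    for node, free in sorted(free_gpus.items()):
        if free > 0:
            return node
    return None


# Single-GPU-per-node fleet: the worker already holds node-a's only GPU.
single_gpu_fleet = {"node-a": 0, "node-b": 0, "node-c": 1}
print(schedule_bench_pod("node-a", single_gpu_fleet, "strict"))
print(schedule_bench_pod("node-a", single_gpu_fleet, "preferred"))

# Multi-GPU node: a spare GPU sits next to the worker, so both modes agree.
multi_gpu_fleet = {"node-a": 1}
print(schedule_bench_pod("node-a", multi_gpu_fleet, "strict"))
```

In the single-GPU case `strict` yields no placement while `preferred` lands on a node the worker is not running on, which is exactly the runnable-but-possibly-mismeasured outcome the rationale describes.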

What is NOT in this PR

No version bump (lands as part of the in-flight 0.29.x SDK release).
No changes to the read-side bench surface (DistributedTraining.bench, BenchResult).
No changes to admission/quota logic — placement is purely a render-time knob.

Test plan

cargo test -p basilica-sdk --lib (79 passed, includes 2 new placement tests)
cargo clippy --workspace --tests -- -D warnings clean
cargo fmt --all -- --check clean
pytest tests/test_distributed.py (49 passed, includes 3 new placement tests)
End-to-end live verification on the cluster: covered in the operator-side PR #444 (basilica-backend) transcripts.

Linked PR

basilica-backend #444: operator CRD field + render branching (must merge together).

Summary by CodeRabbit

New Features
- Added bench_placement parameter to distributed deployment methods, enabling users to configure benchmark placement strategy
- Supports two placement modes: "preferred" (default) and "strict" for control over benchmark resource allocation
- Parameter is optional and maintains backward compatibility with existing deployments

Mirrors the operator-side `BenchPlacement` enum on the wire types so SDK users can opt into strict bench-Pod pinning when honest measurement matters more than runnability:

- `DistributedBenchPlacement` enum (Preferred | Strict, lowercase serde) added next to `DistributedBenchMode` in `basilica-sdk`.
- `DistributedBenchSpec.placement: Option<DistributedBenchPlacement>` with `skip_serializing_if = Option::is_none` so older operators ignore the absent field and treat it as the operator-side default (Preferred).
- 2 new round-trip serde tests lock the lowercase wire token plus the None-elision contract.
- Python facade `deploy_distributed(...)` and `deploy_distributed_async(...)` gain `bench_placement: str = "preferred"`. Default emits no `placement` key (wire-compat); only `"strict"` opts into the new field. Docstring documents the silently-wrong > no-number trade-off.
- 3 new pytest cases cover default-omits-field, strict-emits-token, and invalid-placement-rejected.
- basilica-cli's distributed-train handler updated for the new BenchSpec field; semantics unchanged (placement: None).

No version bump; this lands as part of the in-flight SDK release.

coderabbitai · 2026-05-04T22:52:36Z

Walkthrough

This PR adds a bench_placement parameter to distributed deployment methods, enabling configuration of benchmark placement behavior. A new DistributedBenchPlacement enum is introduced in the Rust SDK types with Preferred and Strict variants, an optional placement field is added to DistributedBenchSpec, and the Python client threads this parameter through with validation and conditional request payload construction.

Changes

Bench Placement Configuration

| Layer / File(s) | Summary |
|---|---|
| Type Definition (`crates/basilica-sdk/src/types.rs`) | Introduces `DistributedBenchPlacement` enum with lowercase serde tokens (`preferred`, `strict`), and adds optional `placement: Option<DistributedBenchPlacement>` field to `DistributedBenchSpec` with skip-if-none serialization. Serde contract tests validate lowercase round-trip and field omission behavior. |
| CLI Handler Integration (`crates/basilica-cli/src/cli/handlers/train.rs`) | `handle_up` now populates `DistributedBenchSpec` with explicit `placement: None` alongside `mode: bench_mode` when constructing the benchmark spec. |
| Python SDK Client (`crates/basilica-sdk-python/python/basilica/__init__.py`) | `deploy_distributed` and `deploy_distributed_async` methods accept new `bench_placement: str = "preferred"` parameter. `_build_distributed_request` validates placement values (`"preferred"` or `"strict"`), conditionally includes `placement: "strict"` in bench dict only when `bench == "on-start"` and `bench_placement == "strict"`. |
| Python SDK Tests (`crates/basilica-sdk-python/tests/test_distributed.py`) | Three new unit tests in `TestBuildDistributedRequest` validate `bench_placement` wire behavior: default `"preferred"` omits placement field, `"strict"` emits `placement: "strict"`, and invalid values raise `ValidationError` with `field == "bench_placement"`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

one-covenant/basilica#447: Directly foundational to this PR; introduces distributed-training types and CLI/client handler plumbing that this change extends with bench placement configuration.

Poem

🐰 A placement most preferred,
Or strict as the rules deserve,
The benchmark now knows its place,
In distributed space—with grace!
Hops of validation, bounds so tight. 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a bench placement configuration option to the SDK with 'preferred' as the default and 'strict' as an opt-in choice, which aligns with the changeset across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/basilica-sdk-python/python/basilica/__init__.py`:
- Around line 1150-1163: The bench_placement validation currently runs
unconditionally and raises even when bench == "off"; update the validation so it
only executes when bench is enabled (e.g., bench == "on-start" or any non-"off"
value) by moving the check for bench_placement inside the branch that handles
bench being enabled (the code block that reads/handles the bench variable), or
alternatively change the condition to short-circuit when bench == "off". Locate
the existing bench_placement validation logic (the code that inspects/raises for
bench_placement) and wrap it with a guard referencing bench, and apply the same
change to the duplicate validation around the lines analogous to 1380-1387.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 40eb65e0-4574-48dc-99eb-fe1f7d17deb8

📥 Commits

Reviewing files that changed from the base of the PR and between 65514ac and db68a94.

📒 Files selected for processing (4)

crates/basilica-cli/src/cli/handlers/train.rs
crates/basilica-sdk-python/python/basilica/__init__.py
crates/basilica-sdk-python/tests/test_distributed.py
crates/basilica-sdk/src/types.rs

coderabbitai · 2026-05-04T22:54:57Z

+            bench_placement: Placement policy for the bench Pod pair on
+                multi-tenant clusters. `"preferred"` (default) lets the
+                bench fall back off the worker pair when those nodes have
+                no spare GPU — bench always schedules but may measure a
+                different pair than the workers. `"strict"` keeps the
+                bench pinned to the worker pair's nodes — bench measures
+                the worker pair's link or stays Pending; never silently
+                mismeasures. Architecture doc § 11.1: silently-wrong
+                outranks no-number, so `strict` is the architecturally
+                honest mode for research papers / SLA evidence on multi-
+                GPU/node hardware. Default is `"preferred"` for
+                runnability on the current single-GPU/node fleet.
+                Ignored when `bench == "off"`.
            rendezvous_backend: One of `etcd-v2 | c10d | static`. Default


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

bench_placement is documented as ignored for bench="off" but is still validated.

Right now, invalid bench_placement raises even when bench is off. Either move placement validation under bench == "on-start" or adjust the docstring to say it is always validated.

💡 Suggested fix (align code with current docstring)

 if bench in ("on-start", "off"):
     bench_dict: Dict[str, Any] = {"mode": bench}
@@
-    if bench_placement not in ("preferred", "strict"):
-        raise ValidationError(
-            f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
-            field="bench_placement",
-            value=bench_placement,
-        )
-    if bench == "on-start" and bench_placement == "strict":
-        bench_dict["placement"] = "strict"
+    if bench == "on-start":
+        if bench_placement not in ("preferred", "strict"):
+            raise ValidationError(
+                f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
+                field="bench_placement",
+                value=bench_placement,
+            )
+        if bench_placement == "strict":
+            bench_dict["placement"] = "strict"
     distributed["bench"] = bench_dict

Also applies to: 1380-1387
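To make the behavioural difference of the suggested fix concrete, here is a minimal standalone sketch of the guarded variant. `build_bench_dict_guarded` is a hypothetical helper and the `ValidationError` class is a local stand-in (its `field`/`value` attributes are assumed); neither is the SDK's actual code.

```python
from typing import Any, Dict


class ValidationError(ValueError):
    """Local stand-in for the SDK's ValidationError (shape assumed)."""

    def __init__(self, message: str, field: str, value: Any) -> None:
        super().__init__(message)
        self.field = field
        self.value = value


def build_bench_dict_guarded(bench: str, bench_placement: str = "preferred") -> Dict[str, Any]:
    """Hypothetical helper: validate placement only when the bench is enabled."""
    bench_dict: Dict[str, Any] = {"mode": bench}
    if bench == "on-start":
        if bench_placement not in ("preferred", "strict"):
            raise ValidationError(
                f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
                field="bench_placement",
                value=bench_placement,
            )
        if bench_placement == "strict":
            bench_dict["placement"] = "strict"
    return bench_dict


# With the guard, an invalid placement is ignored while the bench is off,
# matching the "Ignored when bench == 'off'" wording in the docstring.
print(build_bench_dict_guarded("off", "bogus"))
```

The trade-off is that a typo in `bench_placement` goes unnoticed until the bench is switched on; the alternative the review offers, keeping unconditional validation and adjusting the docstring, surfaces the typo earlier.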


coderabbitai Bot reviewed May 4, 2026


epappas merged commit c1d1c0a into main May 4, 2026
14 checks passed

epappas mentioned this pull request May 5, 2026

feat(sdk): expose UD phase + wait_until_complete + UDTerminalState (refs #445) #461

Merged

15 tasks

epappas merged 1 commit into main from sdk/bench-placement
