feat(sdk): add bench placement knob (preferred default, strict opt-in)#460
Conversation
Mirrors the operator-side `BenchPlacement` enum on the wire types so SDK users can opt into strict bench-Pod pinning when honest measurement matters more than runnability: - `DistributedBenchPlacement` enum (Preferred | Strict, lowercase serde) added next to `DistributedBenchMode` in `basilica-sdk`. - `DistributedBenchSpec.placement: Option<DistributedBenchPlacement>` with `skip_serializing_if = Option::is_none` so older operators ignore the absent field and treat it as the operator-side default (Preferred). - 2 new round-trip serde tests lock the lowercase wire token plus the None-elision contract. - Python facade `deploy_distributed(...)` and `deploy_distributed_async(...)` gain `bench_placement: str = "preferred"`. Default emits no `placement` key (wire-compat); only `"strict"` opts into the new field. Docstring documents the silently-wrong > no-number trade-off. - 3 new pytest cases cover default-omits-field, strict-emits-token, and invalid-placement-rejected. - basilica-cli's distributed-train handler updated for the new BenchSpec field; semantics unchanged (placement: None). No version bump; this lands as part of the in-flight SDK release.
WalkthroughThis PR adds a ChangesBench Placement Configuration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/basilica-sdk-python/python/basilica/__init__.py`:
- Around line 1150-1163: The bench_placement validation currently runs
unconditionally and raises even when bench == "off"; update the validation so it
only executes when bench is enabled (e.g., bench == "on-start" or any non-"off"
value) by moving the check for bench_placement inside the branch that handles
bench being enabled (the code block that reads/handles the bench variable), or
alternatively change the condition to short-circuit when bench == "off". Locate
the existing bench_placement validation logic (the code that inspects/raises for
bench_placement) and wrap it with a guard referencing bench, and apply the same
change to the duplicate validation around the lines analogous to 1380-1387.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 40eb65e0-4574-48dc-99eb-fe1f7d17deb8
📒 Files selected for processing (4)
crates/basilica-cli/src/cli/handlers/train.rscrates/basilica-sdk-python/python/basilica/__init__.pycrates/basilica-sdk-python/tests/test_distributed.pycrates/basilica-sdk/src/types.rs
| bench_placement: Placement policy for the bench Pod pair on | ||
| multi-tenant clusters. `"preferred"` (default) lets the | ||
| bench fall back off the worker pair when those nodes have | ||
| no spare GPU — bench always schedules but may measure a | ||
| different pair than the workers. `"strict"` keeps the | ||
| bench pinned to the worker pair's nodes — bench measures | ||
| the worker pair's link or stays Pending; never silently | ||
| mismeasures. Architecture doc § 11.1: silently-wrong | ||
| outranks no-number, so `strict` is the architecturally | ||
| honest mode for research papers / SLA evidence on multi- | ||
| GPU/node hardware. Default is `"preferred"` for | ||
| runnability on the current single-GPU/node fleet. | ||
| Ignored when `bench == "off"`. | ||
| rendezvous_backend: One of `etcd-v2 | c10d | static`. Default |
There was a problem hiding this comment.
bench_placement is documented as ignored for bench="off" but is still validated.
Right now, invalid bench_placement raises even when bench is off. Either move placement validation under bench == "on-start" or adjust the docstring to say it is always validated.
💡 Suggested fix (align code with current docstring)
if bench in ("on-start", "off"):
bench_dict: Dict[str, Any] = {"mode": bench}
@@
- if bench_placement not in ("preferred", "strict"):
- raise ValidationError(
- f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
- field="bench_placement",
- value=bench_placement,
- )
- if bench == "on-start" and bench_placement == "strict":
- bench_dict["placement"] = "strict"
+ if bench == "on-start":
+ if bench_placement not in ("preferred", "strict"):
+ raise ValidationError(
+ f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
+ field="bench_placement",
+ value=bench_placement,
+ )
+ if bench_placement == "strict":
+ bench_dict["placement"] = "strict"
distributed["bench"] = bench_dictAlso applies to: 1380-1387
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/basilica-sdk-python/python/basilica/__init__.py` around lines 1150 -
1163, The bench_placement validation currently runs unconditionally and raises
even when bench == "off"; update the validation so it only executes when bench
is enabled (e.g., bench == "on-start" or any non-"off" value) by moving the
check for bench_placement inside the branch that handles bench being enabled
(the code block that reads/handles the bench variable), or alternatively change
the condition to short-circuit when bench == "off". Locate the existing
bench_placement validation logic (the code that inspects/raises for
bench_placement) and wrap it with a guard referencing bench, and apply the same
change to the duplicate validation around the lines analogous to 1380-1387.
Summary
Mirrors the operator-side
BenchPlacementenum on the SDK wire types so users can opt into strict bench-Pod pinning when honest measurement matters more than runnability.This is the SDK companion to basilica-backend PR #444 (operator-side CRD field + render branching). The two PRs are interdependent and should ship together; the operator wins forward-compat (it always accepts the field; older SDKs that don't emit it default to
Preferredoperator-side).What changed
Rust SDK (
basilica-sdk):DistributedBenchPlacement { Preferred, Strict }enum (lowercase serde, matches operator'sBenchPlacement).DistributedBenchSpec.placement: Option<DistributedBenchPlacement>withskip_serializing_if = Option::is_noneso the field is omitted on the wire when unset (forward-compat with pre-placement operators).Python SDK (
basilica-sdk-python):deploy_distributed(...)anddeploy_distributed_async(...)gainbench_placement: str = "preferred".placementkey on the wire; only"strict"opts into the new field.CLI (
basilica-cli):distributed-trainhandler updated to construct the newBenchSpecshape (placement: None, semantics unchanged).Architecture rationale
The architecturally honest measurement requires the bench Pod pair to land on the exact nodes the worker StatefulSet runs on (issue #441). On hardware with multiple GPUs per node, that pin is unambiguously correct. On the current production cluster (1 GPU per node fleet-wide), the strict pin produces
PendingPods because the workers occupy the only GPU per node.silently-wrong > no-numberis FALSE; therefore strict is the architecture-correct mode for honest measurement.Preferredis a deliberate usability concession to the current single-GPU-per-node fleet — it lets the bench fall back off the worker pair so the example stays runnable. Users who need a truthful number on multi-GPU/node hardware (research papers, SLA evidence, capacity planning) opt intostrictexplicitly.What is NOT in this PR
DistributedTraining.bench,BenchResult).Test plan
cargo test -p basilica-sdk --lib(79 passed, includes 2 new placement tests)cargo clippy --workspace --tests -- -D warningscleancargo fmt --all -- --checkcleanpytest tests/test_distributed.py(49 passed, includes 3 new placement tests)Linked PR
Summary by CodeRabbit
bench_placementparameter to distributed deployment methods, enabling users to configure benchmark placement strategy