feat(sdk): add bench placement knob (preferred default, strict opt-in) #460

Merged
epappas merged 1 commit into main from sdk/bench-placement
May 4, 2026


@epappas epappas commented May 4, 2026

Summary

Mirrors the operator-side BenchPlacement enum on the SDK wire types so users can opt into strict bench-Pod pinning when honest measurement matters more than runnability.

This is the SDK companion to basilica-backend PR #444 (operator-side CRD field + render branching). The two PRs are interdependent and should ship together. The operator side carries the compatibility guarantee: it always accepts the field, and requests from older SDKs that do not emit it default to Preferred operator-side.

What changed

Rust SDK (basilica-sdk):

  • New DistributedBenchPlacement { Preferred, Strict } enum (lowercase serde, matches operator's BenchPlacement).
  • DistributedBenchSpec.placement: Option<DistributedBenchPlacement> with skip_serializing_if = Option::is_none so the field is omitted on the wire when unset (forward-compat with pre-placement operators).
  • 2 round-trip serde tests lock the lowercase token + None-elision contract.
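
The wire contract those two serde tests lock down can be sketched in Python for readers who do not have the Rust types at hand. This is a rough analogue, not the SDK's code: the `BenchPlacement` and `BenchSpec` classes and their `to_wire` / `from_wire` methods are hypothetical names introduced here for illustration only.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class BenchPlacement(Enum):
    # Lowercase values stand in for the lowercase serde tokens.
    PREFERRED = "preferred"
    STRICT = "strict"


@dataclass
class BenchSpec:
    # Hypothetical Python mirror of DistributedBenchSpec.
    mode: str
    placement: Optional[BenchPlacement] = None

    def to_wire(self) -> Dict[str, Any]:
        wire: Dict[str, Any] = {"mode": self.mode}
        # Analogue of skip_serializing_if = Option::is_none:
        # the key is omitted entirely when placement is unset.
        if self.placement is not None:
            wire["placement"] = self.placement.value
        return wire

    @classmethod
    def from_wire(cls, wire: Dict[str, Any]) -> "BenchSpec":
        raw = wire.get("placement")
        placement = BenchPlacement(raw) if raw is not None else None
        return cls(mode=wire["mode"], placement=placement)


unset = BenchSpec(mode="on-start")
strict = BenchSpec(mode="on-start", placement=BenchPlacement.STRICT)

print(json.dumps(unset.to_wire()))   # {"mode": "on-start"}
print(json.dumps(strict.to_wire()))  # {"mode": "on-start", "placement": "strict"}
```

A receiver that predates the field never sees a `placement` key for unset specs, which is the property the None-elision test is described as protecting.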

Python SDK (basilica-sdk-python):

  • deploy_distributed(...) and deploy_distributed_async(...) gain bench_placement: str = "preferred".
  • Default emits no placement key on the wire; only "strict" opts into the new field.
  • Docstring documents the silently-wrong > no-number trade-off.
  • 3 pytest cases cover default-omits-field, strict-emits-token, invalid-placement-rejected.
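
A minimal sketch of the behaviour the three pytest cases describe, assuming the request builder works roughly as the walkthrough summarizes it. The function name `build_bench_dict` and the `ValidationError` class below are stand-ins written for this example; the real SDK's `_build_distributed_request` and error type may differ in shape.

```python
from typing import Any, Dict


class ValidationError(ValueError):
    """Stand-in for the SDK's ValidationError (assumed to carry field and value)."""

    def __init__(self, message: str, field: str, value: Any) -> None:
        super().__init__(message)
        self.field = field
        self.value = value


def build_bench_dict(bench: str = "on-start",
                     bench_placement: str = "preferred") -> Dict[str, Any]:
    """Hypothetical helper: build only the `bench` part of the request."""
    bench_dict: Dict[str, Any] = {"mode": bench}
    # Validation runs regardless of the bench mode in this sketch.
    if bench_placement not in ("preferred", "strict"):
        raise ValidationError(
            f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
            field="bench_placement",
            value=bench_placement,
        )
    # Only the strict opt-in reaches the wire; the default stays implicit.
    if bench == "on-start" and bench_placement == "strict":
        bench_dict["placement"] = "strict"
    return bench_dict


print(build_bench_dict())                          # {'mode': 'on-start'}
print(build_bench_dict(bench_placement="strict"))  # {'mode': 'on-start', 'placement': 'strict'}
```

Keeping `"preferred"` off the wire means a default call produces the same payload as an SDK that predates the parameter.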

CLI (basilica-cli):

  • distributed-train handler updated to construct the new BenchSpec shape (placement: None, semantics unchanged).

Architecture rationale

The architecturally honest measurement requires the bench Pod pair to land on the exact nodes the worker StatefulSet runs on (issue #441). On hardware with multiple GPUs per node, that pin is unambiguously correct. On the current production cluster (1 GPU per node fleet-wide), the strict pin produces Pending Pods because the workers occupy the only GPU per node.

The proposition "silently-wrong > no-number" (a silently wrong measurement is preferable to no measurement) is FALSE; therefore strict is the architecture-correct mode for honest measurement. Preferred is a deliberate usability concession to the current single-GPU-per-node fleet: it lets the bench fall back off the worker pair so the example stays runnable. Users who need a truthful number on multi-GPU/node hardware (research papers, SLA evidence, capacity planning) opt into strict explicitly.
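
The difference between the two modes can be illustrated with a toy placement model. This is a deliberately simplified sketch and not the operator's render or scheduling logic: the function `place_bench`, the node names, and the "free GPU" map are all hypothetical, and real Kubernetes scheduling decides per Pod rather than per pair.

```python
from typing import Dict, List, Optional


def place_bench(worker_nodes: List[str],
                free_gpus: Dict[str, int],
                placement: str = "preferred") -> Optional[List[str]]:
    """Return the nodes for the bench Pod pair, or None if it would stay Pending."""
    # Both modes try the worker pair's own nodes first.
    if all(free_gpus.get(node, 0) >= 1 for node in worker_nodes):
        return list(worker_nodes)
    if placement == "strict":
        # Strict never measures a different pair: no spare GPU means Pending.
        return None
    # Preferred falls back to any two nodes that still have a spare GPU.
    spare = [node for node, free in sorted(free_gpus.items()) if free >= 1]
    return spare[:2] if len(spare) >= 2 else None


workers = ["node-a", "node-b"]

# Single-GPU-per-node fleet: the workers hold the only GPU on their nodes.
single_gpu = {"node-a": 0, "node-b": 0, "node-c": 1, "node-d": 1}
print(place_bench(workers, single_gpu, "strict"))     # None
print(place_bench(workers, single_gpu, "preferred"))  # ['node-c', 'node-d']

# Multi-GPU nodes: one GPU is still free next to each worker.
multi_gpu = {"node-a": 1, "node-b": 1}
print(place_bench(workers, multi_gpu, "strict"))      # ['node-a', 'node-b']
```

In the single-GPU case the preferred result is runnable but describes the link between `node-c` and `node-d`, not the link the workers actually use, which is the mismeasurement the rationale warns about.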

What is NOT in this PR

  • No version bump (lands as part of the in-flight 0.29.x SDK release).
  • No changes to the read-side bench surface (DistributedTraining.bench, BenchResult).
  • No changes to admission/quota logic — placement is purely a render-time knob.

Test plan

  • cargo test -p basilica-sdk --lib (79 passed, includes 2 new placement tests)
  • cargo clippy --workspace --tests -- -D warnings clean
  • cargo fmt --all -- --check clean
  • pytest tests/test_distributed.py (49 passed, includes 3 new placement tests)
  • End-to-end live verification on the cluster: covered in the transcripts of the operator-side PR (basilica-backend #444).

Linked PR

Summary by CodeRabbit

  • New Features
    • Added bench_placement parameter to distributed deployment methods, enabling users to configure benchmark placement strategy
    • Supports two placement modes: "preferred" (default) and "strict" for control over benchmark resource allocation
    • Parameter is optional and maintains backward compatibility with existing deployments

Mirrors the operator-side `BenchPlacement` enum on the wire types so
SDK users can opt into strict bench-Pod pinning when honest measurement
matters more than runnability:

- `DistributedBenchPlacement` enum (Preferred | Strict, lowercase
  serde) added next to `DistributedBenchMode` in `basilica-sdk`.
- `DistributedBenchSpec.placement: Option<DistributedBenchPlacement>`
  with `skip_serializing_if = Option::is_none` so older operators
  ignore the absent field and treat it as the operator-side default
  (Preferred).
- 2 new round-trip serde tests lock the lowercase wire token plus the
  None-elision contract.
- Python facade `deploy_distributed(...)` and
  `deploy_distributed_async(...)` gain `bench_placement: str =
  "preferred"`. Default emits no `placement` key (wire-compat); only
  `"strict"` opts into the new field. Docstring documents the
  silently-wrong > no-number trade-off.
- 3 new pytest cases cover default-omits-field, strict-emits-token,
  and invalid-placement-rejected.
- basilica-cli's distributed-train handler updated for the new
  BenchSpec field; semantics unchanged (placement: None).

No version bump; this lands as part of the in-flight SDK release.

coderabbitai Bot commented May 4, 2026

Walkthrough

This PR adds a bench_placement parameter to distributed deployment methods, enabling configuration of benchmark placement behavior. A new DistributedBenchPlacement enum is introduced in the Rust SDK types with Preferred and Strict variants, an optional placement field is added to DistributedBenchSpec, and the Python client threads this parameter through with validation and conditional request payload construction.

Changes

Bench Placement Configuration

| Layer / File(s) | Summary |
|---|---|
| Type Definition (`crates/basilica-sdk/src/types.rs`) | Introduces DistributedBenchPlacement enum with lowercase serde tokens (preferred, strict), and adds optional placement: Option<DistributedBenchPlacement> field to DistributedBenchSpec with skip-if-none serialization. Serde contract tests validate lowercase round-trip and field omission behavior. |
| CLI Handler Integration (`crates/basilica-cli/src/cli/handlers/train.rs`) | handle_up now populates DistributedBenchSpec with explicit placement: None alongside mode: bench_mode when constructing the benchmark spec. |
| Python SDK Client (`crates/basilica-sdk-python/python/basilica/__init__.py`) | deploy_distributed and deploy_distributed_async methods accept new bench_placement: str = "preferred" parameter. _build_distributed_request validates placement values ("preferred" or "strict"), conditionally includes placement: "strict" in bench dict only when bench == "on-start" and bench_placement == "strict". |
| Python SDK Tests (`crates/basilica-sdk-python/tests/test_distributed.py`) | Three new unit tests in TestBuildDistributedRequest validate bench_placement wire behavior: default "preferred" omits placement field, "strict" emits placement: "strict", and invalid values raise ValidationError with field == "bench_placement". |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • one-covenant/basilica#447: Directly foundational to this PR; introduces distributed-training types and CLI/client handler plumbing that this change extends with bench placement configuration.

Poem

🐰 A placement most preferred,
Or strict as the rules deserve,
The benchmark now knows its place,
In distributed space—with grace!
Hops of validation, bounds so tight. 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a bench placement configuration option to the SDK with 'preferred' as the default and 'strict' as an opt-in choice, which aligns with the changeset across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/basilica-sdk-python/python/basilica/__init__.py`:
- Around line 1150-1163: The bench_placement validation currently runs
unconditionally and raises even when bench == "off"; update the validation so it
only executes when bench is enabled (e.g., bench == "on-start" or any non-"off"
value) by moving the check for bench_placement inside the branch that handles
bench being enabled (the code block that reads/handles the bench variable), or
alternatively change the condition to short-circuit when bench == "off". Locate
the existing bench_placement validation logic (the code that inspects/raises for
bench_placement) and wrap it with a guard referencing bench, and apply the same
change to the duplicate validation around the lines analogous to 1380-1387.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 40eb65e0-4574-48dc-99eb-fe1f7d17deb8

📥 Commits

Reviewing files that changed from the base of the PR and between 65514ac and db68a94.

📒 Files selected for processing (4)
  • crates/basilica-cli/src/cli/handlers/train.rs
  • crates/basilica-sdk-python/python/basilica/__init__.py
  • crates/basilica-sdk-python/tests/test_distributed.py
  • crates/basilica-sdk/src/types.rs

Comment on lines +1150 to 1163
bench_placement: Placement policy for the bench Pod pair on
multi-tenant clusters. `"preferred"` (default) lets the
bench fall back off the worker pair when those nodes have
no spare GPU — bench always schedules but may measure a
different pair than the workers. `"strict"` keeps the
bench pinned to the worker pair's nodes — bench measures
the worker pair's link or stays Pending; never silently
mismeasures. Architecture doc § 11.1: a silently wrong number
is worse than no number, so `strict` is the architecturally
honest mode for research papers / SLA evidence on multi-
GPU/node hardware. Default is `"preferred"` for
runnability on the current single-GPU/node fleet.
Ignored when `bench == "off"`.
rendezvous_backend: One of `etcd-v2 | c10d | static`. Default
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

bench_placement is documented as ignored for bench="off" but is still validated.

Right now, invalid bench_placement raises even when bench is off. Either move placement validation under bench == "on-start" or adjust the docstring to say it is always validated.

💡 Suggested fix (align code with current docstring)
         if bench in ("on-start", "off"):
             bench_dict: Dict[str, Any] = {"mode": bench}
@@
-            if bench_placement not in ("preferred", "strict"):
-                raise ValidationError(
-                    f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
-                    field="bench_placement",
-                    value=bench_placement,
-                )
-            if bench == "on-start" and bench_placement == "strict":
-                bench_dict["placement"] = "strict"
+            if bench == "on-start":
+                if bench_placement not in ("preferred", "strict"):
+                    raise ValidationError(
+                        f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
+                        field="bench_placement",
+                        value=bench_placement,
+                    )
+                if bench_placement == "strict":
+                    bench_dict["placement"] = "strict"
             distributed["bench"] = bench_dict

Also applies to: 1380-1387
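
For reference, the gated variant proposed in the suggested fix can be written as a small standalone function. This is a sketch under stated assumptions: `build_bench_dict_gated` and the `ValidationError` stand-in are hypothetical names for this example, and only the `bench` sub-dictionary is modelled, not the SDK's full request builder.

```python
from typing import Any, Dict


class ValidationError(ValueError):
    """Stand-in for the SDK's ValidationError (assumed to carry field and value)."""

    def __init__(self, message: str, field: str, value: Any) -> None:
        super().__init__(message)
        self.field = field
        self.value = value


def build_bench_dict_gated(bench: str,
                           bench_placement: str = "preferred") -> Dict[str, Any]:
    """Hypothetical helper: placement is validated only when the bench runs."""
    bench_dict: Dict[str, Any] = {"mode": bench}
    if bench == "on-start":
        if bench_placement not in ("preferred", "strict"):
            raise ValidationError(
                f"bench_placement must be 'preferred' or 'strict', got {bench_placement!r}",
                field="bench_placement",
                value=bench_placement,
            )
        if bench_placement == "strict":
            bench_dict["placement"] = "strict"
    return bench_dict


# With bench off, the placement argument is truly ignored, as the docstring says.
print(build_bench_dict_gated("off", bench_placement="bogus"))  # {'mode': 'off'}
```

The alternative named in the comment, validating always and rewording the docstring, trades this leniency for earlier detection of typos in `bench_placement`.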

@epappas epappas merged commit c1d1c0a into main May 4, 2026
14 checks passed