feat(sdk): migrate examples 20/21/22 to canonical decorator + bench bool + context-manager surface (fixes one-covenant/basilica-backend#664) #488

Merged
epappas merged 1 commit into main from sdk-s5-examples-664
May 18, 2026

@epappas epappas commented May 18, 2026

Summary

S5 of the SDK API simplification plan (docs/plans/SDK-API-SIMPLIFICATION-PLAN.md on basilica-backend). Migrates the three flagship distributed-training examples to the canonical surface delivered by S1-S4 (PRs #483, #484, #485, #486, #487).

Examples now use:

  • @basilica.distributed(...) decorator (or basilica.distributed(command=...) factory) — no _managed suffix
  • bench=True bool (not bench='on-start' string)
  • with training: context manager for orchestration + auto-cleanup
  • lazy training.bench for the bench result; training.bench_diagnostics for the rare debug case
  • no wait_until_bench_complete, no bench_status.phase ceremony
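The shape of that surface can be sketched without the SDK itself. The block below is a toy stand-in, not Basilica's implementation: `distributed`, `Training`, and every attribute on them are hypothetical local definitions that only mirror the names used in this PR (decorator, `bench=True`, `with ... as training:`, lazy `training.bench`), and the bench payload is a placeholder.

```python
from typing import Any, Callable, Optional


class Training:
    """Toy handle; stands in for whatever the real SDK returns."""

    def __init__(self, entrypoint: Callable[[], Any], bench: bool) -> None:
        self._entrypoint = entrypoint
        self._bench_enabled = bench
        self._bench_result: Optional[dict] = None
        self.completed = False
        self.deleted = False

    def __enter__(self) -> "Training":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.deleted = True  # cleanup runs on success or exception
        return False         # never swallow an exception

    def wait_until_complete(self, timeout: float) -> None:
        # A real client would poll a remote API; here we just run the function.
        self._entrypoint()
        self.completed = True

    @property
    def bench(self) -> Optional[dict]:
        # Lazy read: None when bench was not requested, computed once otherwise.
        if not self._bench_enabled:
            return None
        if self._bench_result is None:
            self._bench_result = {"status": "ok"}  # placeholder payload
        return self._bench_result


def distributed(*, bench: bool = False):
    """Hypothetical decorator factory mirroring the bool opt-in."""
    def decorate(fn: Callable[[], Any]) -> Callable[[], Training]:
        def launch() -> Training:
            return Training(fn, bench)
        return launch
    return decorate


@distributed(bench=True)
def train() -> None:
    pass  # the per-rank workload would live here


with train() as training:
    training.wait_until_complete(timeout=1800)
    result = training.bench
```

The caller never touches a phase enum or a separate wait-for-bench call in this sketch; it enters the block, waits, and reads one attribute.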

Per-example diff

examples/20_distributed_diloco.py

  • bench="on-start" (str, deprecated by S2) -> bench=True (bool)
  • Removed time import + manual poll loop after the decorator call
  • Removed training.wait_until_bench_complete(timeout=1200) + 4-phase enum check
  • Added with train() as training: block (auto-cleanup on success OR exception)
  • Added training.wait_until_complete(timeout=1800) for terminal-phase block
  • Added canonical training.bench / training.bench_diagnostics reads
  • topology_spread="provider-aware" -> topology_spread="pack" (mesh-throughput recommendation per the user runbook)
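A minimal sketch of what a str-to-bool `bench` normalisation with a deprecation warning could look like. `normalise_bench` is a hypothetical helper, not an SDK function, and the mapping of `"on-start"` to `True` is an assumption based only on the first bullet above; the real S2 logic may differ.

```python
import warnings
from typing import Union


def normalise_bench(bench: Union[bool, str]) -> bool:
    """Hypothetical helper: accept the bool form silently, warn on the str form."""
    if isinstance(bench, bool):
        return bench
    warnings.warn(
        f"bench={bench!r} (str) is deprecated; pass bench=True",
        DeprecationWarning,
        stacklevel=2,
    )
    # Assumption: "on-start" was the string that enabled the bench.
    return bench == "on-start"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure nothing is filtered out
    new_style = normalise_bench(True)
    warnings_after_bool = len(caught)
    old_style = normalise_bench("on-start")
    warnings_after_str = len(caught)
```

Both calls enable the bench here, but only the string form leaves a `DeprecationWarning` behind.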

examples/21_distributed_torchrun.py

  • client.deploy_distributed_managed(command=[...]) (S1-deprecated) -> basilica.distributed(command=[...]) (S3 factory shape)
  • Dropped explicit BasilicaClient() instantiation (factory creates one lazily; mirrors decorator semantics)
  • All other behaviour unchanged: same with training: block, same scale(target=3), same wait_until_target_world(timeout=300), same logs(tail=30)
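One way a single entry point could serve both the factory shape and the decorator shape is to dispatch on whether `command` is given. The sketch below is an illustration under that assumption only: `distributed` and `Handle` are hypothetical local names, the decorator branch builds a placeholder command, and `train.py` is a made-up script name.

```python
from typing import Callable, List, Optional, Union


class Handle:
    """Toy context-manager handle that records the command it was given."""

    def __init__(self, command: List[str]) -> None:
        self.command = command

    def __enter__(self) -> "Handle":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        return False


def distributed(*, command: Optional[List[str]] = None) -> Union[Handle, Callable]:
    if command is not None:
        # Factory mode: the command is kept verbatim and a handle comes back directly.
        return Handle(list(command))

    def decorate(fn: Callable[[], None]) -> Callable[[], Handle]:
        # Decorator mode: calling the decorated function yields the handle.
        # The command built here is a placeholder, not how an SDK would ship code.
        return lambda: Handle(["python", "-c", fn.__name__ + "()"])

    return decorate


training = distributed(command=["torchrun", "--nproc_per_node=1", "train.py"])
with training:
    launched = training.command
```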

examples/22_distributed_with_bench.py

  • client.deploy_distributed_managed(source=<inline str>, bench="on-start") -> @basilica.distributed(..., bench=True) decorator (S1+S2+S4 collapse)
  • Removed inline source= string with embedded torch script
  • Per-rank entrypoint is now the decorated workload() function body (Callable shape; the canonical input per S4)
  • Removed wait_until_bench_complete + 4-phase enum checks + bench_status.phase != "Succeeded" branch
  • Added training.wait_until_complete(timeout=1800) for terminal-phase block
  • training.bench is the only bench read; training.bench_diagnostics is shown for the rare None case

Why this matters

The plan calls out three anti-patterns the SDK exposes today: (1) deploy_distributed_managed overlapping with @distributed; (2) bench: str modes plus a 4-phase BenchStatus enum plus 3 access paths for one diagnostic; (3) source: Union[str, Path, Callable] accepting three forms for one input. After S1-S4 landed (PRs #483-#487), the canonical surface existed, but the flagship examples still showed users the deprecated forms. S5 closes the gap: the examples now demonstrate the same surface that the user runbook (docs/runbooks/USER-RUNBOOK-DISTRIBUTED-NCCL.md on basilica-backend) describes.

Verification

Pre-fix DeprecationWarning capture

Captured before any changes (saved in basilica-backend data/evidence/sdk-simpl-664/01-deprecation-warnings-pre-fix.txt):

  • ex20 emitted: bench='on-start' (str) is deprecated (S2), wait_until_bench_complete is deprecated (S2)
  • ex21 emitted: BasilicaClient.deploy_distributed_managed is deprecated (S1)
  • ex22 emitted: BasilicaClient.deploy_distributed_managed is deprecated (S1), wait_until_bench_complete is deprecated (S2), source='str'-typed value is deprecated (S4)

Post-fix DeprecationWarning audit

Captured after the migration (saved in basilica-backend data/evidence/sdk-simpl-664/02-deprecation-warnings-post-fix.txt):

=== ex20 (post-fix) ===
[probe] ex20 import-time SDK DeprecationWarnings: 0
[probe] ex20 bench=True normalisation SDK DeprecationWarnings: 0
[probe] ex20 training.bench + bench_diagnostics SDK DeprecationWarnings: 0

=== ex21 (post-fix) ===
[probe] ex21 basilica.distributed(command=...) SDK DeprecationWarnings: 0

=== ex22 (post-fix) ===
[probe] ex22 import-time SDK DeprecationWarnings: 0
[probe] ex22 bench=True normalisation SDK DeprecationWarnings: 0
[probe] ex22 training.bench + bench_diagnostics SDK DeprecationWarnings: 0
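A probe of this kind can be built on the standard `warnings` module. The sketch below is not the script that produced the evidence files; `count_deprecations`, `legacy_call`, and `canonical_call` are hypothetical names, and the two callables stand in for the pre-fix and post-fix example code.

```python
import warnings
from typing import Callable


def count_deprecations(fn: Callable[[], object]) -> int:
    """Run fn and return how many DeprecationWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # disable the once-per-location filter
        fn()
    return sum(1 for w in caught if issubclass(w.category, DeprecationWarning))


def legacy_call() -> None:
    # Stand-in for a deprecated SDK path.
    warnings.warn(
        "deploy_distributed_managed is deprecated", DeprecationWarning, stacklevel=2
    )


def canonical_call() -> None:
    # Stand-in for the migrated path; emits nothing.
    pass


pre_fix = count_deprecations(legacy_call)
post_fix = count_deprecations(canonical_call)
print(f"[probe] pre-fix DeprecationWarnings: {pre_fix}")
print(f"[probe] post-fix DeprecationWarnings: {post_fix}")
```

A real audit would also filter recorded warnings by originating module so that third-party deprecations are not counted as SDK ones.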

SDK test suite

pytest tests/ -q on the full SDK Python test suite (excluding test_dns_propagation_e2e.py, which needs httpx, not installed in the venv): 211 passed, 76 warnings (all from existing S1-S4 acceptance tests that exercise the deprecated paths). No regression.

Runtime end-to-end (ex21 against api.basilica.ai)

Ran python examples/21_distributed_torchrun.py against the live API:

Deployed: f01dd43a-760b-428b-abdd-c43e20d6dce2
Namespace: u-github-434149
World: WorldStatus(ready=2, target=3, min=2, max=4, below_minimum=False)
Scaled to target=3; world now: WorldStatus(ready=1, target=3, min=2, max=4, below_minimum=True)
BelowMinimumWorld: ... ready=2, required_min=3 ...

Surface-validated:

  • basilica.distributed(command=[...]) factory deployed successfully
  • mid-run training.scale(target=3) succeeded
  • with block's __exit__ ran delete() on the BelowMinimumWorld exception (post-run basilica deploy status returned "Deployment not found")
  • Zero DeprecationWarning in the runtime stdout/stderr under -W default

The BelowMinimumWorld is a provider-capacity event (the 3rd rank could not be scheduled within 300s), not an SDK/API issue. The pre-S5 example would have hit the same outcome.
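The cleanup-on-exception behaviour follows from ordinary context-manager semantics. The sketch below reproduces that control flow with local stand-ins: `Deployment` and this `BelowMinimumWorld` class are hypothetical definitions named after the objects in the run above, not the SDK's own types.

```python
class BelowMinimumWorld(RuntimeError):
    """Local stand-in for the exception named in the log above."""


class Deployment:
    def __init__(self) -> None:
        self.deleted = False

    def delete(self) -> None:
        self.deleted = True

    def wait_until_target_world(self, timeout: float) -> None:
        # Simulate the capacity shortfall: the third rank never becomes ready.
        raise BelowMinimumWorld("ready=2, required_min=3")

    def __enter__(self) -> "Deployment":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.delete()  # runs whether or not the body raised
        return False   # re-raise so the caller still sees the failure


training = Deployment()
caught = None
try:
    with training:
        training.wait_until_target_world(timeout=300)
except BelowMinimumWorld as err:
    caught = err
```

Returning `False` from `__exit__` is the design choice that matters: the deployment is deleted, yet the capacity failure still reaches the caller instead of being hidden.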

Runtime gap

ex22 is verda-pinned per the existing example contract; verda had no A100 capacity at PR time (a stale dlc-example-bench UD from May 4 is still occupying the slot). ex20 needs 4 ranks of A100/H100. Neither example was executed end-to-end here; both were validated via module-import-time DeprecationWarning capture (which exercises every SDK surface the example touches except the actual network call).

What is NOT in this PR

  • No SDK source changes (S5 is examples-only; S1-S4 already shipped the canonical surface)
  • No version bump (pyproject.toml + Cargo.toml stay at 0.29.7)
  • No CHANGELOG entry (the changelog tracks API surface changes; this is an examples-only refresh)
  • No removal of the deprecated APIs (that is SDK-S7's job, on the next major bump)

Summary by CodeRabbit

  • Documentation
    • Updated distributed training examples to showcase simplified deployment patterns with improved resource lifecycle management.
    • Enhanced example documentation demonstrating cleaner usage patterns for benchmark result handling and distributed job monitoring.
    • Examples now guide users through best practices for graceful cleanup and system state inspection.

…l + context-manager surface (fixes one-covenant/basilica-backend#664)

Examples 20, 21, 22 now use the canonical SDK surface delivered by S1-S4:

- ex20 (DiLoCo): drops bench="on-start" str -> bench=True bool. Drops the
  legacy wait_until_bench_complete + bench_status.phase ceremony for the
  lazy training.bench / training.bench_diagnostics reads. Wraps the deploy
  in `with train() as training:` so the context manager's __exit__ does
  cleanup on success OR exception.

- ex21 (BYO torchrun): migrates from client.deploy_distributed_managed
  to the basilica.distributed(command=[...]) factory shape (per PR #486).
  No _managed suffix on the surface. Same `with training:` block, same
  scale-mid-run + wait_until_target_world + logs flow.

- ex22 (per-UD bench): converts from inline source=<str> +
  deploy_distributed_managed + wait_until_bench_complete + 4-phase enum
  checks to the canonical decorator + bench=True bool + lazy
  training.bench. The runbook's "99% of users only read training.bench"
  pattern is now the example pattern.

All three examples now read cleanly against the user-facing runbook in
basilica-backend (docs/runbooks/USER-RUNBOOK-DISTRIBUTED-NCCL.md).

Verification:
- 211/211 SDK tests still pass (no regression).
- pre-fix probe: ex20/21/22 emitted SDK DeprecationWarnings from
  bench="on-start", deploy_distributed_managed, source=<str>,
  wait_until_bench_complete (captured in
  basilica-backend data/evidence/sdk-simpl-664/01).
- post-fix probe: all three examples emit ZERO SDK DeprecationWarning
  on the surfaces they exercise (import-time decorator/factory
  application + bench-bool normalisation + training.bench /
  bench_diagnostics reads). Captured in
  basilica-backend data/evidence/sdk-simpl-664/02.
- ex21 ran end-to-end against api.basilica.ai: factory deploy returned
  UD f01dd43a-...; mid-run scale(target=3) succeeded; wait_until_target_world
  raised BelowMinimumWorld on capacity (not API/SDK); with-block __exit__
  ran delete() (post-run lookup returned "Deployment not found"). Zero
  DeprecationWarning in the runtime stdout/stderr under -W default.

No public-API change; S5 is examples-only. No version bump (pyproject /
Cargo stay at 0.29.7).

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bd703f79-f55d-4064-856b-701701d53150

📥 Commits

Reviewing files that changed from the base of the PR and between 3f7a568 and 8669c3e.

📒 Files selected for processing (3)
  • examples/20_distributed_diloco.py
  • examples/21_distributed_torchrun.py
  • examples/22_distributed_with_bench.py

Walkthrough

Three distributed training examples are refactored to use a canonical @basilica.distributed(...) decorator surface with context-manager cleanup, boolean bench=True opt-in, and post-completion training.bench result reading. The patterns converge across DiLoCo, torchrun, and bench use cases.

Changes

Distributed Examples Canonical Decorator Surface

| Layer / File(s) | Summary |
| --- | --- |
| DiLoCo context manager and bench reading (examples/20_distributed_diloco.py) | Decorator configured with bench=True and topology_spread="pack". Docstring documents with train() as training: as primary pattern. Main block refactored to use context manager for automatic cleanup, wait_until_complete(timeout=1800) for blocking on completion, and training.bench for canonical result reading instead of prior manual polling and explicit bench-wait phases. |
| Torchrun BYO-command factory mode (examples/21_distributed_torchrun.py) | Module and comments updated to describe basilica.distributed(command=...) factory mode where the command is passed verbatim. Context creation switches from with client.deploy_distributed_managed(...) to training = basilica.distributed(...); with training: pattern. Elastic scaling (scale(target=3), wait_until_target_world()) and rendezvous handling remain unchanged. |
| Bench example workload entrypoint and deployment (examples/22_distributed_with_bench.py) | New top-level workload() entrypoint decorated with @basilica.distributed(bench=True) initializes NCCL per rank. Main function switches from BasilicaClient.deploy_distributed_managed(...) to with workload() as training:, waits via wait_until_complete(), then reads training.bench. Removed prior bench-phase/status branching, wait_until_bench_complete() call, and nonzero-exit-on-bench-failure logic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • one-covenant/basilica#483: SDK changes that collapse the bench API to boolean opt-in (bench=True) and add lazy training.bench_diagnostics surface, which these examples now consume.
  • one-covenant/basilica#472: Earlier migration of examples 21 and 22 to client.deploy_distributed_managed(...) scope-exit cleanup, directly preceding this refactor to decorator-based factory mode.

Poem

🐰 Three examples hop in sync,
From client calls to decorator links,
Context managers clean the way,
Bench results in training.bench stay,
A canonical surface, sleek and bright!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: migrating three examples to the new canonical SDK surface with decorator pattern, boolean bench configuration, and context-manager cleanup. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

