feat(sdk): migrate examples 20/21/22 to canonical decorator + bench bool + context-manager surface (fixes one-covenant/basilica-backend#664) by epappas · Pull Request #488 · one-covenant/basilica

epappas · 2026-05-18T15:36:14Z

Summary

S5 of the SDK API simplification plan (docs/plans/SDK-API-SIMPLIFICATION-PLAN.md on basilica-backend). Migrates the three flagship distributed-training examples to the canonical surface delivered by S1-S4 (PRs #483, #484, #485, #486, #487).

Examples now use:

@basilica.distributed(...) decorator (or basilica.distributed(command=...) factory) — no _managed suffix
bench=True bool (not bench='on-start' string)
with training: context manager for orchestration + auto-cleanup
lazy training.bench for the bench result; training.bench_diagnostics for the rare debug case
no wait_until_bench_complete, no bench_status.phase ceremony
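
As a rough illustration of the two shapes listed above, the sketch below uses a local stand-in rather than the real SDK. The `Training` class, the dispatch on `command=`, and the placeholder command are all assumptions made for illustration; only the surface names (`distributed`, `bench=True`, `with ... as training`) come from this PR.

```python
# Local stand-in only: this does NOT import or reproduce the real basilica SDK.
import functools


class Training:
    """Hypothetical handle; the PR refers to the real object as `training`."""

    def __init__(self, command=None, bench=False):
        self.command = command
        self.bench_enabled = bench
        self.deleted = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.deleted = True  # stands in for the cleanup the PR describes
        return False         # do not swallow exceptions


def distributed(*, command=None, bench=False):
    """Factory when command= is given, decorator otherwise (assumed dispatch)."""
    if command is not None:
        return Training(command=command, bench=bench)

    def decorate(fn):
        @functools.wraps(fn)
        def make_training():
            return Training(bench=bench)
        return make_training

    return decorate


@distributed(bench=True)
def train():
    """Per-rank body; never executed locally in this sketch."""


# Decorator shape (ex20/ex22 style): calling the decorated function yields the handle.
with train() as decorated:
    pass

# Factory shape (ex21 style): the command list here is a placeholder.
factory_made = distributed(command=["torchrun", "train.py"])
with factory_made:
    pass
```

Both shapes end in the same kind of handle, which is presumably why the examples can share one `with` pattern regardless of how the workload is specified.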

Per-example diff

`examples/20_distributed_diloco.py`

bench="on-start" (str, deprecated by S2) -> bench=True (bool)
Removed time import + manual poll loop after the decorator call
Removed training.wait_until_bench_complete(timeout=1200) + 4-phase enum check
Added with train() as training: block (auto-cleanup on success OR exception)
Added training.wait_until_complete(timeout=1800) for terminal-phase block
Added canonical training.bench / training.bench_diagnostics reads
topology_spread="provider-aware" -> topology_spread="pack" (mesh-throughput recommendation per the user runbook)
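
The str-to-bool change for `bench` could be pictured with a small normalisation helper like the one below. This is a hypothetical sketch: the function name `normalise_bench` and the handling of values other than `"on-start"` are assumptions, since the PR only states that the string form is deprecated and that `bench=True` is canonical.

```python
import warnings


def normalise_bench(value):
    """Hypothetical helper: map the deprecated string form onto the bool form."""
    if isinstance(value, bool):
        return value  # canonical path, no warning
    if value == "on-start":
        warnings.warn(
            "bench='on-start' (str) is deprecated (S2); pass bench=True",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    # The PR does not describe other string modes, so this sketch rejects them.
    raise ValueError(f"unsupported bench value in this sketch: {value!r}")


canonical = normalise_bench(True)
disabled = normalise_bench(False)
```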

`examples/21_distributed_torchrun.py`

client.deploy_distributed_managed(command=[...]) (S1-deprecated) -> basilica.distributed(command=[...]) (S3 factory shape)
Dropped explicit BasilicaClient() instantiation (factory creates one lazily; mirrors decorator semantics)
All other behaviour unchanged: same with training: block, same scale(target=3), same wait_until_target_world(timeout=300), same logs(tail=30)

`examples/22_distributed_with_bench.py`

client.deploy_distributed_managed(source=<inline str>, bench="on-start") -> @basilica.distributed(..., bench=True) decorator (S1+S2+S4 collapse)
Removed inline source= string with embedded torch script
Per-rank entrypoint is now the decorated workload() function body (Callable shape; the canonical input per S4)
Removed wait_until_bench_complete + 4-phase enum checks + bench_status.phase != "Succeeded" branch
Added training.wait_until_complete(timeout=1800) for terminal-phase block
training.bench is the only bench read; training.bench_diagnostics is shown for the rare None case
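
One way to picture a "lazy" `training.bench` with a `bench_diagnostics` fallback is a cached lookup that runs only on first access. The sketch below is a stand-in, not the SDK's implementation: the record shape, the `fetch_bench` callable, and the placeholder diagnostic string are assumptions.

```python
from functools import cached_property


class Training:
    """Hypothetical handle with a lazily fetched bench record."""

    def __init__(self, fetch_bench):
        self._fetch_bench = fetch_bench  # callable standing in for an API call
        self.fetch_count = 0

    @cached_property
    def _bench_record(self):
        # Runs once, on the first read of bench or bench_diagnostics.
        self.fetch_count += 1
        return self._fetch_bench()

    @property
    def bench(self):
        return self._bench_record.get("result")

    @property
    def bench_diagnostics(self):
        return self._bench_record.get("diagnostics")


ok = Training(lambda: {"result": {"passed": True}, "diagnostics": None})
missing = Training(lambda: {"result": None, "diagnostics": "placeholder diagnostic"})

for training in (ok, missing):
    if training.bench is None:
        print("no bench result:", training.bench_diagnostics)
    else:
        print("bench:", training.bench)
```

The caller only branches on `None`, which matches the PR's description of `bench_diagnostics` as something shown for the rare `None` case.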

Why this matters

The plan calls out three anti-patterns the SDK exposes today: (1) deploy_distributed_managed overlapping with @distributed; (2) bench: str modes plus a 4-phase BenchStatus enum plus 3 access paths for one diagnostic; (3) source: Union[str, Path, Callable] taking three forms for one input. After S1-S4 landed (PRs #483-#487), the canonical surface existed, but the flagship examples were still showing the deprecated forms to users. S5 closes the gap: the examples now demonstrate the same surface the user runbook (docs/runbooks/USER-RUNBOOK-DISTRIBUTED-NCCL.md on basilica-backend) describes.

Verification

Pre-fix DeprecationWarning capture

Captured before any changes (saved in basilica-backend data/evidence/sdk-simpl-664/01-deprecation-warnings-pre-fix.txt):

ex20 emitted: bench='on-start' (str) is deprecated (S2), wait_until_bench_complete is deprecated (S2)
ex21 emitted: BasilicaClient.deploy_distributed_managed is deprecated (S1)
ex22 emitted: BasilicaClient.deploy_distributed_managed is deprecated (S1), wait_until_bench_complete is deprecated (S2), source='str'-typed value is deprecated (S4)

Post-fix DeprecationWarning audit

Captured after the migration (saved in basilica-backend data/evidence/sdk-simpl-664/02-deprecation-warnings-post-fix.txt):

=== ex20 (post-fix) ===
[probe] ex20 import-time SDK DeprecationWarnings: 0
[probe] ex20 bench=True normalisation SDK DeprecationWarnings: 0
[probe] ex20 training.bench + bench_diagnostics SDK DeprecationWarnings: 0

=== ex21 (post-fix) ===
[probe] ex21 basilica.distributed(command=...) SDK DeprecationWarnings: 0

=== ex22 (post-fix) ===
[probe] ex22 import-time SDK DeprecationWarnings: 0
[probe] ex22 bench=True normalisation SDK DeprecationWarnings: 0
[probe] ex22 training.bench + bench_diagnostics SDK DeprecationWarnings: 0
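
A probe of this kind can be approximated with the standard library's `warnings.catch_warnings(record=True)`. The sketch below is not the probe used for the evidence files; `legacy_call` and `canonical_call` are hypothetical stand-ins, and it counts every `DeprecationWarning` rather than filtering to SDK-originated ones.

```python
import warnings


def count_deprecations(fn):
    """Run fn and return how many DeprecationWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        fn()
    return sum(issubclass(w.category, DeprecationWarning) for w in caught)


def legacy_call():
    # Stand-in for a deprecated surface; message text borrowed from the PR.
    warnings.warn(
        "wait_until_bench_complete is deprecated (S2)",
        DeprecationWarning,
        stacklevel=2,
    )


def canonical_call():
    # Stand-in for a canonical surface that emits nothing.
    pass


print("[probe] legacy DeprecationWarnings:", count_deprecations(legacy_call))
print("[probe] canonical DeprecationWarnings:", count_deprecations(canonical_call))
```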

SDK test suite

pytest tests/ -q on the full SDK Python test suite (excluding test_dns_propagation_e2e.py, which needs httpx, a package not installed in the venv): 211 passed, 76 warnings (all from existing S1-S4 acceptance tests that exercise the deprecated paths). No regression.

Runtime end-to-end (ex21 against api.basilica.ai)

Ran python examples/21_distributed_torchrun.py against the live API:

Deployed: f01dd43a-760b-428b-abdd-c43e20d6dce2
Namespace: u-github-434149
World: WorldStatus(ready=2, target=3, min=2, max=4, below_minimum=False)
Scaled to target=3; world now: WorldStatus(ready=1, target=3, min=2, max=4, below_minimum=True)
BelowMinimumWorld: ... ready=2, required_min=3 ...

Surface-validated:

basilica.distributed(command=[...]) factory deployed successfully
mid-run training.scale(target=3) succeeded
with block's __exit__ ran delete() on the BelowMinimumWorld exception (post-run basilica deploy status returned "Deployment not found")
Zero DeprecationWarning in the runtime stdout/stderr under -W default

The BelowMinimumWorld is a provider-capacity event (the 3rd rank could not be scheduled within 300 s), not an SDK/API issue. The pre-S5 example would have hit the same outcome.
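
The cleanup-on-exception behaviour observed in this run follows the ordinary Python context-manager protocol. The sketch below reproduces it with local stand-ins: the `Training` and `BelowMinimumWorld` classes here are hypothetical substitutes for the SDK objects, and the error text is abbreviated from the output above.

```python
class BelowMinimumWorld(Exception):
    """Local stand-in for the exception named in the run output."""


class Training:
    """Hypothetical handle whose __exit__ always cleans up."""

    def __init__(self):
        self.deleted = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.delete()  # runs on success and on exception alike
        return False   # let the original exception propagate

    def delete(self):
        self.deleted = True

    def wait_until_target_world(self, timeout):
        # Simulates the capacity failure seen in the run.
        raise BelowMinimumWorld("ready=2, required_min=3")


training = Training()
try:
    with training:
        training.wait_until_target_world(timeout=300)
except BelowMinimumWorld as err:
    outcome = f"propagated: {err}"
```

Returning `False` from `__exit__` matters here: the caller still sees the capacity error, while the deployment has already been torn down.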

Runtime gap

ex22 is verda-pinned per the existing example contract; verda has no A100 capacity at PR time (a stale dlc-example-bench UD from May 4 is still occupying the slot). ex20 needs 4 ranks of A100/H100. Neither was executed end-to-end here; both were validated via module-import-time DeprecationWarning capture (which exercises every SDK surface the example touches except the actual network call).

What is NOT in this PR

No SDK source changes (S5 is examples-only; S1-S4 already shipped the canonical surface)
No version bump (pyproject.toml + Cargo.toml stay at 0.29.7)
No CHANGELOG entry (the changelog tracks API surface changes; this is an examples-only refresh)
No removal of the deprecated APIs (that is SDK-S7's job, on the next major bump)

Summary by CodeRabbit

Documentation
- Updated distributed training examples to showcase simplified deployment patterns with improved resource lifecycle management.
- Enhanced example documentation demonstrating cleaner usage patterns for benchmark result handling and distributed job monitoring.
- Examples now guide users through best practices for graceful cleanup and system state inspection.

…l + context-manager surface (fixes one-covenant/basilica-backend#664)

Examples 20, 21, 22 now use the canonical SDK surface delivered by S1-S4:

- ex20 (DiLoCo): drops bench="on-start" str -> bench=True bool. Drops the legacy wait_until_bench_complete + bench_status.phase ceremony for the lazy training.bench / training.bench_diagnostics reads. Wraps the deploy in `with train() as training:` so the context manager's __exit__ does cleanup on success OR exception.
- ex21 (BYO torchrun): migrates from client.deploy_distributed_managed to the basilica.distributed(command=[...]) factory shape (per PR #486). No _managed suffix on the surface. Same `with training:` block, same scale-mid-run + wait_until_target_world + logs flow.
- ex22 (per-UD bench): converts from inline source=<str> + deploy_distributed_managed + wait_until_bench_complete + 4-phase enum checks to the canonical decorator + bench=True bool + lazy training.bench. The runbook's "99% of users only read training.bench" pattern is now the example pattern.

All three examples now read cleanly against the user-facing runbook in basilica-backend (docs/runbooks/USER-RUNBOOK-DISTRIBUTED-NCCL.md).

Verification:

- 211/211 SDK tests still pass (no regression).
- pre-fix probe: ex20/21/22 emitted SDK DeprecationWarnings from bench="on-start", deploy_distributed_managed, source=<str>, wait_until_bench_complete (captured in basilica-backend data/evidence/sdk-simpl-664/01).
- post-fix probe: all three examples emit ZERO SDK DeprecationWarning on the surfaces they exercise (import-time decorator/factory application + bench-bool normalisation + training.bench / bench_diagnostics reads). Captured in basilica-backend data/evidence/sdk-simpl-664/02.
- ex21 ran end-to-end against api.basilica.ai: factory deploy returned UD f01dd43a-...; mid-run scale(target=3) succeeded; wait_until_target_world raised BelowMinimumWorld on capacity (not API/SDK); with-block __exit__ ran delete() (post-run lookup returned "Deployment not found"). Zero DeprecationWarning in the runtime stdout/stderr under -W default.

No public-API change; S5 is examples-only. No version bump (pyproject / Cargo stay at 0.29.7).

coderabbitai · 2026-05-18T15:37:38Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bd703f79-f55d-4064-856b-701701d53150

📥 Commits

Reviewing files that changed from the base of the PR and between 3f7a568 and 8669c3e.

📒 Files selected for processing (3)

examples/20_distributed_diloco.py
examples/21_distributed_torchrun.py
examples/22_distributed_with_bench.py

Walkthrough

Three distributed training examples are refactored to use a canonical @basilica.distributed(...) decorator surface with context-manager cleanup, boolean bench=True opt-in, and post-completion training.bench result reading. The patterns converge across DiLoCo, torchrun, and bench use cases.

Changes

Distributed Examples Canonical Decorator Surface

| Layer / File(s) | Summary |
| --- | --- |
| DiLoCo context manager and bench reading `examples/20_distributed_diloco.py` | Decorator configured with `bench=True` and `topology_spread="pack"`. Docstring documents `with train() as training:` as primary pattern. Main block refactored to use context manager for automatic cleanup, `wait_until_complete(timeout=1800)` for blocking on completion, and `training.bench` for canonical result reading instead of prior manual polling and explicit bench-wait phases. |
| Torchrun BYO-command factory mode `examples/21_distributed_torchrun.py` | Module and comments updated to describe `basilica.distributed(command=...)` factory mode where the command is passed verbatim. Context creation switches from `with client.deploy_distributed_managed(...)` to `training = basilica.distributed(...); with training:` pattern. Elastic scaling (`scale(target=3)`, `wait_until_target_world()`) and rendezvous handling remain unchanged. |
| Bench example workload entrypoint and deployment `examples/22_distributed_with_bench.py` | New top-level `workload()` entrypoint decorated with `@basilica.distributed(bench=True)` initializes NCCL per rank. Main function switches from `BasilicaClient.deploy_distributed_managed(...)` to `with workload() as training:`, waits via `wait_until_complete()`, then reads `training.bench`. Removed prior bench-phase/status branching, `wait_until_bench_complete()` call, and nonzero-exit-on-bench-failure logic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

one-covenant/basilica#483: SDK changes that collapse the bench API to boolean opt-in (bench=True) and add lazy training.bench_diagnostics surface, which these examples now consume.
one-covenant/basilica#472: Earlier migration of examples 21 and 22 to client.deploy_distributed_managed(...) scope-exit cleanup, directly preceding this refactor to decorator-based factory mode.

Poem

🐰 Three examples hop in sync,
From client calls to decorator links,
Context managers clean the way,
Bench results in training.bench stay,
A canonical surface, sleek and bright!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: migrating three examples to the new canonical SDK surface with decorator pattern, boolean bench configuration, and context-manager cleanup. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


epappas merged commit da5b95b into main May 18, 2026
11 checks passed

epappas merged 1 commit into main from sdk-s5-examples-664
