feat(sdk): migrate examples 20/21/22 to canonical decorator + bench bool + context-manager surface (fixes one-covenant/basilica-backend#664)#488
Conversation
…l + context-manager surface (fixes one-covenant/basilica-backend#664) Examples 20, 21, 22 now use the canonical SDK surface delivered by S1-S4: - ex20 (DiLoCo): drops bench="on-start" str -> bench=True bool. Drops the legacy wait_until_bench_complete + bench_status.phase ceremony for the lazy training.bench / training.bench_diagnostics reads. Wraps the deploy in `with train() as training:` so the context manager's __exit__ does cleanup on success OR exception. - ex21 (BYO torchrun): migrates from client.deploy_distributed_managed to the basilica.distributed(command=[...]) factory shape (per PR #486). No _managed suffix on the surface. Same `with training:` block, same scale-mid-run + wait_until_target_world + logs flow. - ex22 (per-UD bench): converts from inline source=<str> + deploy_distributed_managed + wait_until_bench_complete + 4-phase enum checks to the canonical decorator + bench=True bool + lazy training.bench. The runbook's "99% of users only read training.bench" pattern is now the example pattern. All three examples now read cleanly against the user-facing runbook in basilica-backend (docs/runbooks/USER-RUNBOOK-DISTRIBUTED-NCCL.md). Verification: - 211/211 SDK tests still pass (no regression). - pre-fix probe: ex20/21/22 emitted SDK DeprecationWarnings from bench="on-start", deploy_distributed_managed, source=<str>, wait_until_bench_complete (captured in basilica-backend data/evidence/sdk-simpl-664/01). - post-fix probe: all three examples emit ZERO SDK DeprecationWarning on the surfaces they exercise (import-time decorator/factory application + bench-bool normalisation + training.bench / bench_diagnostics reads). Captured in basilica-backend data/evidence/sdk-simpl-664/02. - ex21 ran end-to-end against api.basilica.ai: factory deploy returned UD f01dd43a-...; mid-run scale(target=3) succeeded; wait_until_target_world raised BelowMinimumWorld on capacity (not API/SDK); with-block __exit__ ran delete() (post-run lookup returned "Deployment not found"). Zero DeprecationWarning in the runtime stdout/stderr under -W default. No public-API change; S5 is examples-only. No version bump (pyproject / Cargo stay at 0.29.7).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughThree distributed training examples are refactored to use a canonical ChangesDistributed Examples Canonical Decorator Surface
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
S5 of the SDK API simplification plan (docs/plans/SDK-API-SIMPLIFICATION-PLAN.md on basilica-backend). Migrates the three flagship distributed-training examples to the canonical surface delivered by S1-S4 (PRs #483, #484, #485, #486, #487).
Examples now use:
@basilica.distributed(...)decorator (orbasilica.distributed(command=...)factory) — no_managedsuffixbench=Truebool (notbench='on-start'string)with training:context manager for orchestration + auto-cleanuptraining.benchfor the bench result;training.bench_diagnosticsfor the rare debug casewait_until_bench_complete, nobench_status.phaseceremonyPer-example diff
examples/20_distributed_diloco.pybench="on-start"(str, deprecated by S2) ->bench=True(bool)timeimport + manual poll loop after the decorator calltraining.wait_until_bench_complete(timeout=1200)+ 4-phase enum checkwith train() as training:block (auto-cleanup on success OR exception)training.wait_until_complete(timeout=1800)for terminal-phase blocktraining.bench/training.bench_diagnosticsreadstopology_spread="provider-aware"->topology_spread="pack"(mesh-throughput recommendation per the user runbook)examples/21_distributed_torchrun.pyclient.deploy_distributed_managed(command=[...])(S1-deprecated) ->basilica.distributed(command=[...])(S3 factory shape)BasilicaClient()instantiation (factory creates one lazily; mirrors decorator semantics)with training:block, samescale(target=3), samewait_until_target_world(timeout=300), samelogs(tail=30)examples/22_distributed_with_bench.pyclient.deploy_distributed_managed(source=<inline str>, bench="on-start")->@basilica.distributed(..., bench=True)decorator (S1+S2+S4 collapse)source=string with embedded torch scriptworkload()function body (Callable shape; the canonical input per S4)wait_until_bench_complete+ 4-phase enum checks +bench_status.phase != "Succeeded"branchtraining.wait_until_complete(timeout=1800)for terminal-phase blocktraining.benchis the only bench read;training.bench_diagnosticsis shown for the rareNonecaseWhy this matters
The plan calls out three anti-patterns the SDK exposes today:
deploy_distributed_managedoverlapping with@distributed,bench: strmodes + 4-phaseBenchStatusenum + 3 access paths for one diagnostic,source: Union[str, Path, Callable]taking three forms for one input. After S1-S4 landed (PRs #483-#487), the canonical surface exists but the flagship examples were still showing the deprecated forms to users. S5 closes the gap: the examples now demonstrate the same surface the user runbook (docs/runbooks/USER-RUNBOOK-DISTRIBUTED-NCCL.mdon basilica-backend) describes.Verification
Pre-fix DeprecationWarning capture
Captured before any changes (saved in basilica-backend
data/evidence/sdk-simpl-664/01-deprecation-warnings-pre-fix.txt):bench='on-start' (str) is deprecated(S2),wait_until_bench_complete is deprecated(S2)BasilicaClient.deploy_distributed_managed is deprecated(S1)BasilicaClient.deploy_distributed_managed is deprecated(S1),wait_until_bench_complete is deprecated(S2),source='str'-typed value is deprecated(S4)Post-fix DeprecationWarning audit
Captured after the migration (saved in basilica-backend
data/evidence/sdk-simpl-664/02-deprecation-warnings-post-fix.txt):SDK test suite
pytest tests/ -qon the full SDK Python test suite (excludingtest_dns_propagation_e2e.pywhich needshttpxnot in the venv): 211 passed, 76 warnings (all from existing S1-S4 acceptance tests that exercise the deprecated paths). No regression.Runtime end-to-end (ex21 against api.basilica.ai)
Ran
python examples/21_distributed_torchrun.pyagainst the live API:Surface-validated:
basilica.distributed(command=[...])factory deployed successfullytraining.scale(target=3)succeededwithblock's__exit__randelete()on the BelowMinimumWorld exception (post-runbasilica deploy statusreturned "Deployment not found")-W defaultThe
BelowMinimumWorldis a provider-capacity event (3rd rank could not be scheduled in 300s), not a SDK/API issue. The pre-S5 example would have hit the same outcome.Runtime gap
ex22 is verda-pinned per the existing example contract; verda has no A100 capacity at PR time (a stale
dlc-example-benchUD from May-4 is still occupying the slot). ex20 needs 4 ranks A100/H100. These were not executed end-to-end here; both were validated via module-import-time DeprecationWarning capture (which exercises every SDK surface the example touches except the actual network call).What is NOT in this PR
pyproject.toml+Cargo.tomlstay at 0.29.7)Summary by CodeRabbit