fix(sdk): expose status.distributed (and other fields) through PyO3 DeploymentResponse (closes #449)#451
Conversation
Issue #449. The basilica-api correctly returns status.distributed in the DeploymentResponse JSON (wired via PR #433/#447 server-side), but the SDK silently dropped it because basilica-sdk::types::DeploymentResponse had no `distributed` field. Round-tripping through serde would discard the block before it ever reached the PyO3 binding or the Python facade. Changes: - Extend `DistributedStatus` to mirror the canonical `basilica-api::DistributedStatus` byte-for-byte at the JSON layer (issue #431 / #449): `conditions`, `rendezvous`, `worldSizeHistory`, `lastResize`, `milestones`, `originalMax`, `rankLossResets`, `preflight`, `preflightDeprecationWarned`. - Add new sub-types: `DistributedCondition`, `DistributedRendezvousStatus`, `DistributedWorldSizeTransition`, `DistributedResizeRecord`, `DistributedMilestone`, `DistributedRankLossReset`, `DistributedPreflightEstimate`. All use `#[serde(rename_all = "camelCase")]` and `Option<...>` for nullable fields, matching the canonical shape exactly. - Add `pub image: String` (defaulted) and `pub distributed: Option<DistributedStatus>` to `DeploymentResponse`. Both use `#[serde(default)]` / `skip_serializing_if = "Option::is_none"` so older API responses without these keys continue to deserialize correctly. - 4 new serde tests in `types::tests`: - Full round-trip with `distributed` populated (worldSize + 2 ranks + bench result with f64 fields). - Backwards-compat: `distributed = None` omits the JSON key. - Wire-shape lock: every JSON key is camelCase (`belowMinimum`, `worldSize`, `lastResize`, `originalMax`, `worldSizeHistory`, `rankLossResets`, `preflightDeprecationWarned`). - BenchResult sub-shape round-trips a populated probe. Existing test `test_distributed_status_camelcase` updated to use `..Default::default()` for the newly added struct fields. Out of scope (parallel PRs / separate fixes): - No basilica-api changes (server-side already correct via #433). - No PyO3 binding changes yet (next commit). - No Python facade changes yet (next commit after that). - No `__init__.py` /workspace/ -> /tmp/ change (#448's scope). - No version bump or release.
Issue #449. The PyO3 `DeploymentResponse` mirror was missing two fields that the basilica-api has always returned: - `image` (container image) - `distributed` (the full `status.distributed` mirror added by #433 server-side and #447 to the SDK Rust types in the previous commit) Effect: `_coerce_to_dict(client.get(...))` in distributed.py saw only 4 attributes (`namespace`, `state`, `url`, `userId`). Every read property on `DistributedTraining` (`world`, `ranks`, `bench`, `metrics`, `wait_until_min_world`) returned zeros / empty / None on a healthy cluster. Changes: - Add `pub image: String` with `#[pyo3(get)]` (one-liner mirror of the SDK type added in the previous commit). - Add `pub distributed: Option<serde_json::Value>` (no `#[pyo3(get)]`) + `#[getter] fn distributed(&self, py)` that uses `pythonize` to hand the JSON to Python as a camelCase dict. The Python facade walks the dict directly; this matches the existing `pythonize`-passthrough pattern used by `get_deployment_events`. - Wire both fields through `From<SdkDeploymentResponse>`. - Update `_basilica.pyi` type stubs accordingly. - Add `serde_json` to the crate's dependencies (via workspace). The Python facade changes that consume this binding land in the next commit (Milestone 3) so the bug fix is bisectable.
Issue #449. Now that the PyO3 binding exposes `image` + `distributed`, update the Python facade's `_coerce_to_dict` helper to walk those attributes onto the camelCase output dict that `DistributedTraining.world` / `.ranks` / `.bench` / `.metrics` read from. Changes: - Walk `instance_name`, `user_id`, `namespace`, `image`, `state`, `url`, `created_at`, `updated_at`, `phase`, `message`, `share_token`, `share_url`, `public_metadata`. Each maps to its camelCase wire key via `_to_camel`. - Pass `distributed` through verbatim when present (it is already camelCase from the PyO3 `pythonize` conversion). - Type-check the `distributed` attribute against `dict` so MagicMock auto-attributes (truthy but not dicts) do not silently leak in and shadow the `_distributed_status` test-mock fallback. - Keep the `_distributed_status` fallback path for existing test mocks that pre-date the PyO3 binding fix. The four read properties (`world`, `ranks`, `bench`, `metrics`) were already correctly walking `cached_status["distributed"]["worldSize"] [...]` (camelCase) -- they did not need to change. They now return real values instead of zeros once the dict carries the operator's status block end-to-end.
Issue #449. Six new tests under `TestIssue449DeploymentResponseDistributed` mock the PyO3 `DeploymentResponse` shape that lands in the previous commits and assert every read property returns the operator's actual values (not zeros). - test_world_returns_real_values_not_zeros: t.world -> WorldStatus(ready=2, target=2, min=2, max=3, below_minimum=False). Pre-#449 returned WorldStatus(0, 0, 0, 0). - test_ranks_returns_two_pods_with_provider_and_region: t.ranks -> 2 RankStatus with the right pod_name / node / provider / region. Pre-#449 returned []. - test_bench_returns_populated_result: t.bench -> BenchResult(busbw_gbps_p50=0.00897, ...). Pre-#449 returned None. - test_metrics_returns_real_world_size: t.metrics() carries non-zero world_ready/world_target. Pre-#449 returned zeros. - test_wait_until_min_world_returns_immediately_when_at_min: t.wait_until_min_world(timeout=10) returns cleanly. Pre-#449 raised BelowMinimumWorld(ready=0, required_min=0) on a healthy cluster. NEGATIVE TEST (the regression lock): - test_NEGATIVE_coerce_to_dict_carries_distributed_and_image: asserts `_coerce_to_dict` produces a dict with `instanceName`, `userId`, `namespace`, `image`, `state`, `url`, `createdAt`, `updatedAt`, `phase`, `distributed`. A future regression that drops the PyO3 binding's `distributed` getter (or stops walking it in `_coerce_to_dict`) would fail this test loudly, not silently produce zeros across every read property. The fixture uses an explicit `FakeDeployment` instance (not a plain MagicMock) because post-#449 `_coerce_to_dict` checks `isinstance(d, dict)` to distinguish a real PyO3 attribute from a MagicMock auto-generated attribute. All 36 distributed tests pass; the 30 pre-existing tests are unaffected.
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 32 minutes and 24 seconds.Comment |
Bug analysis (issue #449)
The basilica-api correctly returns 11 fields on
DeploymentResponseJSON including the fulldistributed: {worldSize, ranks, bench, ...}block (verified live by raw HTTP today on production, server-side wired by PR #433). But when read through the SDK,_coerce_to_dict(client.get(name))returned only 4 fields:[namespace, state, url, userId].Root cause was at TWO layers:
crates/basilica-sdk/src/types.rs:952(DeploymentResponse) — missingimageanddistributedfields. Serde silently dropped both during JSON deserialization because the struct didn't declare them.crates/basilica-sdk-python/src/types.rs:1449(PyO3 mirror) — even if the Rust struct had carried them, the#[pyclass]mirror had no#[pyo3(get)]for them either, so the Python facade saw nothing.Effect: every read property on
DistributedTrainingreturned zeros / empty / None on a healthy cluster:t.world->WorldStatus(0, 0, 0, 0)t.ranks->[]t.bench->Nonet.metrics()-> zerost.wait_until_min_world(timeout=N)-> raisedBelowMinimumWorld(ready=0, required_min=0)immediatelyThe SDK's writer paths (
deploy_distributed,scale,delete) were unaffected because they only POST/DELETE; they don't deserialize the response body.What this PR changes
Four bisectable commits, each landing one milestone:
1.
fix(sdk): mirror status.distributed wire shape on DeploymentResponse(42519a1)DistributedStatusincrates/basilica-sdk/src/types.rsto mirror the canonicalbasilica-api::DistributedStatus(basilica-api/src/api/routes/deployments/types.rs:610) byte-for-byte at the JSON layer. New sub-types:DistributedCondition,DistributedRendezvousStatus,DistributedWorldSizeTransition,DistributedResizeRecord,DistributedMilestone,DistributedRankLossReset,DistributedPreflightEstimate(mirrors of basilica-api'sDistributedConditionline 465,RendezvousStatusline 482,WorldSizeTransitionline 504,ResizeRecordline 514,Milestoneline 525,RankLossResetline 535,PreflightEstimateline 544).pub image: Stringandpub distributed: Option<DistributedStatus>toDeploymentResponse(line 952). Both#[serde(default)]/skip_serializing_if = "Option::is_none"so older API responses without these keys continue to deserialize correctly.types::tests:distributedpopulated (worldSize + 2 ranks + bench result with f64 fields).distributed = Noneomits the JSON key.2.
fix(sdk-python): expose image + status.distributed via PyO3(9953ba4)pub image: Stringwith#[pyo3(get)]on the PyO3DeploymentResponse.pub distributed: Option<serde_json::Value>(no#[pyo3(get)]) +#[getter] fn distributed(&self, py)that converts to a Python dict viapythonize. Uses the existingpythonize-passthrough pattern already established byget_deployment_events._basilica.pyitype stubs.serde_jsonto the crate's deps (via workspace).3.
fix(sdk-python): walk full DeploymentResponse in _coerce_to_dict(4ad64c1)instance_name,user_id,namespace,image,state,url,created_at,updated_at,phase,message,share_token,share_url,public_metadata-> camelCase wire keys.distributedthrough verbatim (already camelCase from PyO3pythonize).dictso MagicMock auto-attrs do not silently leak in._distributed_statustest-mock fallback for back-compat.4.
test(sdk-python): lock end-to-end status.distributed read path(60145f9)Six new tests under
TestIssue449DeploymentResponseDistributed:test_world_returns_real_values_not_zerostest_ranks_returns_two_pods_with_provider_and_regiontest_bench_returns_populated_resulttest_metrics_returns_real_world_sizetest_wait_until_min_world_returns_immediately_when_at_mintest_NEGATIVE_coerce_to_dict_carries_distributed_and_image— the regression lock; asserts the dict carriesinstanceName,userId,namespace,image,state,url,createdAt,updatedAt,phase,distributed.Test transcript
Rust serde tests (
cargo test -p basilica-sdk --lib)Python tests (
pytest crates/basilica-sdk-python/tests/test_distributed.py -v)Pre-push validation
cargo fmt --all -- --checkcleancargo clippy --workspace --tests -- -D warningscleancargo test -p basilica-sdk --lib77/77 passmaturin develop --quietsucceededpytest crates/basilica-sdk-python/tests/test_distributed.py -v36/36 passVerified by mocked Python integration test; live verification deferred to a post-merge re-smoke (the user's call) since that requires API token + GPU capacity.
What is NOT in this PR
deploy_distributed,scale,deletealways worked.__init__.py:1220/workspace/->/tmp/change — that's bug(sdk): deploy_distributed source= hardcodes /workspace, fails on uid=1000 pods #448's scope, parallel agent.pyproject.tomlnorCargo.tomlwas touched. No PyPI upload, no release tag, nomaturin publish. The user explicitly said "do not release the sdk or cli".Closes #449.