Skip to content

fix(sdk): expose status.distributed (and other fields) through PyO3 DeploymentResponse (closes #449)#451

Merged
epappas merged 4 commits into
mainfrom
fix/sdk-pyo3-deploymentresponse-distributed-449
May 3, 2026
Merged

fix(sdk): expose status.distributed (and other fields) through PyO3 DeploymentResponse (closes #449)#451
epappas merged 4 commits into
mainfrom
fix/sdk-pyo3-deploymentresponse-distributed-449

Conversation

@epappas
Copy link
Copy Markdown
Contributor

@epappas epappas commented May 3, 2026

Bug analysis (issue #449)

The basilica-api correctly returns 11 fields on DeploymentResponse JSON including the full distributed: {worldSize, ranks, bench, ...} block (verified live by raw HTTP today on production, server-side wired by PR #433). But when read through the SDK, _coerce_to_dict(client.get(name)) returned only 4 fields: [namespace, state, url, userId].

Root cause was at TWO layers:

  1. crates/basilica-sdk/src/types.rs:952 (DeploymentResponse) — missing image and distributed fields. Serde silently dropped both during JSON deserialization because the struct didn't declare them.
  2. crates/basilica-sdk-python/src/types.rs:1449 (PyO3 mirror) — even if the Rust struct had carried them, the #[pyclass] mirror had no #[pyo3(get)] for them either, so the Python facade saw nothing.

Effect: every read property on DistributedTraining returned zeros / empty / None on a healthy cluster:

  • t.world -> WorldStatus(0, 0, 0, 0)
  • t.ranks -> []
  • t.bench -> None
  • t.metrics() -> zeros
  • t.wait_until_min_world(timeout=N) -> raised BelowMinimumWorld(ready=0, required_min=0) immediately

The SDK's writer paths (deploy_distributed, scale, delete) were unaffected because they only POST/DELETE; they don't deserialize the response body.

What this PR changes

Four bisectable commits, each landing one milestone:

1. fix(sdk): mirror status.distributed wire shape on DeploymentResponse (42519a1)

  • Extends DistributedStatus in crates/basilica-sdk/src/types.rs to mirror the canonical basilica-api::DistributedStatus (basilica-api/src/api/routes/deployments/types.rs:610) byte-for-byte at the JSON layer. New sub-types: DistributedCondition, DistributedRendezvousStatus, DistributedWorldSizeTransition, DistributedResizeRecord, DistributedMilestone, DistributedRankLossReset, DistributedPreflightEstimate (mirrors of basilica-api's DistributedCondition line 465, RendezvousStatus line 482, WorldSizeTransition line 504, ResizeRecord line 514, Milestone line 525, RankLossReset line 535, PreflightEstimate line 544).
  • Adds pub image: String and pub distributed: Option<DistributedStatus> to DeploymentResponse (line 952). Both #[serde(default)] / skip_serializing_if = "Option::is_none" so older API responses without these keys continue to deserialize correctly.
  • 4 new serde tests in types::tests:
    • Full round-trip with distributed populated (worldSize + 2 ranks + bench result with f64 fields).
    • Backwards-compat: distributed = None omits the JSON key.
    • Wire-shape lock: every JSON key is camelCase.
    • BenchResult sub-shape round-trips a populated probe.

2. fix(sdk-python): expose image + status.distributed via PyO3 (9953ba4)

  • Adds pub image: String with #[pyo3(get)] on the PyO3 DeploymentResponse.
  • Adds pub distributed: Option<serde_json::Value> (no #[pyo3(get)]) + #[getter] fn distributed(&self, py) that converts to a Python dict via pythonize. Uses the existing pythonize-passthrough pattern already established by get_deployment_events.
  • Updates _basilica.pyi type stubs.
  • Adds serde_json to the crate's deps (via workspace).

3. fix(sdk-python): walk full DeploymentResponse in _coerce_to_dict (4ad64c1)

  • Walks instance_name, user_id, namespace, image, state, url, created_at, updated_at, phase, message, share_token, share_url, public_metadata -> camelCase wire keys.
  • Passes distributed through verbatim (already camelCase from PyO3 pythonize).
  • Type-checks against dict so MagicMock auto-attrs do not silently leak in.
  • Keeps the legacy _distributed_status test-mock fallback for back-compat.

4. test(sdk-python): lock end-to-end status.distributed read path (60145f9)

Six new tests under TestIssue449DeploymentResponseDistributed:

  • test_world_returns_real_values_not_zeros
  • test_ranks_returns_two_pods_with_provider_and_region
  • test_bench_returns_populated_result
  • test_metrics_returns_real_world_size
  • test_wait_until_min_world_returns_immediately_when_at_min
  • test_NEGATIVE_coerce_to_dict_carries_distributed_and_image — the regression lock; asserts the dict carries instanceName, userId, namespace, image, state, url, createdAt, updatedAt, phase, distributed.

Test transcript

Rust serde tests (cargo test -p basilica-sdk --lib)

test types::tests::test_deployment_response_with_distributed_round_trips ... ok
test types::tests::test_deployment_response_without_distributed_omits_key ... ok
test types::tests::test_distributed_status_json_is_camelcase ... ok
test types::tests::test_distributed_status_bench_result_round_trips ... ok
test types::tests::test_distributed_status_camelcase ... ok
[...77 total...]
test result: ok. 77 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Python tests (pytest crates/basilica-sdk-python/tests/test_distributed.py -v)

TestIssue449DeploymentResponseDistributed::test_world_returns_real_values_not_zeros PASSED
TestIssue449DeploymentResponseDistributed::test_ranks_returns_two_pods_with_provider_and_region PASSED
TestIssue449DeploymentResponseDistributed::test_bench_returns_populated_result PASSED
TestIssue449DeploymentResponseDistributed::test_metrics_returns_real_world_size PASSED
TestIssue449DeploymentResponseDistributed::test_wait_until_min_world_returns_immediately_when_at_min PASSED
TestIssue449DeploymentResponseDistributed::test_NEGATIVE_coerce_to_dict_carries_distributed_and_image PASSED
[...36 total...]
============================== 36 passed in 0.12s ==============================

Pre-push validation

  • cargo fmt --all -- --check clean
  • cargo clippy --workspace --tests -- -D warnings clean
  • cargo test -p basilica-sdk --lib 77/77 pass
  • maturin develop --quiet succeeded
  • pytest crates/basilica-sdk-python/tests/test_distributed.py -v 36/36 pass

Verified by mocked Python integration test; live verification deferred to a post-merge re-smoke (the user's call) since that requires API token + GPU capacity.

What is NOT in this PR

Closes #449.

epappas added 4 commits May 3, 2026 23:18
Issue #449. The basilica-api correctly returns status.distributed in the
DeploymentResponse JSON (wired via PR #433/#447 server-side), but the SDK
silently dropped it because basilica-sdk::types::DeploymentResponse had
no `distributed` field. Round-tripping through serde would discard the
block before it ever reached the PyO3 binding or the Python facade.

Changes:
- Extend `DistributedStatus` to mirror the canonical
  `basilica-api::DistributedStatus` byte-for-byte at the JSON layer
  (issue #431 / #449): `conditions`, `rendezvous`, `worldSizeHistory`,
  `lastResize`, `milestones`, `originalMax`, `rankLossResets`,
  `preflight`, `preflightDeprecationWarned`.
- Add new sub-types: `DistributedCondition`,
  `DistributedRendezvousStatus`, `DistributedWorldSizeTransition`,
  `DistributedResizeRecord`, `DistributedMilestone`,
  `DistributedRankLossReset`, `DistributedPreflightEstimate`. All use
  `#[serde(rename_all = "camelCase")]` and `Option<...>` for
  nullable fields, matching the canonical shape exactly.
- Add `pub image: String` (defaulted) and
  `pub distributed: Option<DistributedStatus>` to
  `DeploymentResponse`. Both use `#[serde(default)]` /
  `skip_serializing_if = "Option::is_none"` so older API responses
  without these keys continue to deserialize correctly.
- 4 new serde tests in `types::tests`:
  - Full round-trip with `distributed` populated
    (worldSize + 2 ranks + bench result with f64 fields).
  - Backwards-compat: `distributed = None` omits the JSON key.
  - Wire-shape lock: every JSON key is camelCase
    (`belowMinimum`, `worldSize`, `lastResize`, `originalMax`,
    `worldSizeHistory`, `rankLossResets`,
    `preflightDeprecationWarned`).
  - BenchResult sub-shape round-trips a populated probe.

Existing test `test_distributed_status_camelcase` updated to use
`..Default::default()` for the newly added struct fields.

Out of scope (parallel PRs / separate fixes):
- No basilica-api changes (server-side already correct via #433).
- No PyO3 binding changes yet (next commit).
- No Python facade changes yet (next commit after that).
- No `__init__.py` /workspace/ -> /tmp/ change (#448's scope).
- No version bump or release.
Issue #449. The PyO3 `DeploymentResponse` mirror was missing two fields
that the basilica-api has always returned:
- `image` (container image)
- `distributed` (the full `status.distributed` mirror added by #433
  server-side and #447 to the SDK Rust types in the previous commit)

Effect: `_coerce_to_dict(client.get(...))` in distributed.py saw only
4 attributes (`namespace`, `state`, `url`, `userId`). Every read
property on `DistributedTraining` (`world`, `ranks`, `bench`,
`metrics`, `wait_until_min_world`) returned zeros / empty / None on a
healthy cluster.

Changes:
- Add `pub image: String` with `#[pyo3(get)]` (one-liner mirror of the
  SDK type added in the previous commit).
- Add `pub distributed: Option<serde_json::Value>` (no `#[pyo3(get)]`)
  + `#[getter] fn distributed(&self, py)` that uses `pythonize` to
  hand the JSON to Python as a camelCase dict. The Python facade
  walks the dict directly; this matches the existing
  `pythonize`-passthrough pattern used by `get_deployment_events`.
- Wire both fields through `From<SdkDeploymentResponse>`.
- Update `_basilica.pyi` type stubs accordingly.
- Add `serde_json` to the crate's dependencies (via workspace).

The Python facade changes that consume this binding land in the next
commit (Milestone 3) so the bug fix is bisectable.
Issue #449. Now that the PyO3 binding exposes `image` + `distributed`,
update the Python facade's `_coerce_to_dict` helper to walk those
attributes onto the camelCase output dict that
`DistributedTraining.world` / `.ranks` / `.bench` / `.metrics` read
from.

Changes:
- Walk `instance_name`, `user_id`, `namespace`, `image`, `state`,
  `url`, `created_at`, `updated_at`, `phase`, `message`,
  `share_token`, `share_url`, `public_metadata`. Each maps to its
  camelCase wire key via `_to_camel`.
- Pass `distributed` through verbatim when present (it is already
  camelCase from the PyO3 `pythonize` conversion).
- Type-check the `distributed` attribute against `dict` so MagicMock
  auto-attributes (truthy but not dicts) do not silently leak in and
  shadow the `_distributed_status` test-mock fallback.
- Keep the `_distributed_status` fallback path for existing test
  mocks that pre-date the PyO3 binding fix.

The four read properties (`world`, `ranks`, `bench`, `metrics`) were
already correctly walking `cached_status["distributed"]["worldSize"]
[...]` (camelCase) -- they did not need to change. They now return
real values instead of zeros once the dict carries the operator's
status block end-to-end.
Issue #449. Six new tests under
`TestIssue449DeploymentResponseDistributed` mock the PyO3
`DeploymentResponse` shape that lands in the previous commits and
assert every read property returns the operator's actual values
(not zeros).

- test_world_returns_real_values_not_zeros: t.world ->
  WorldStatus(ready=2, target=2, min=2, max=3, below_minimum=False).
  Pre-#449 returned WorldStatus(0, 0, 0, 0).
- test_ranks_returns_two_pods_with_provider_and_region: t.ranks ->
  2 RankStatus with the right pod_name / node / provider / region.
  Pre-#449 returned [].
- test_bench_returns_populated_result: t.bench ->
  BenchResult(busbw_gbps_p50=0.00897, ...). Pre-#449 returned None.
- test_metrics_returns_real_world_size: t.metrics() carries non-zero
  world_ready/world_target. Pre-#449 returned zeros.
- test_wait_until_min_world_returns_immediately_when_at_min:
  t.wait_until_min_world(timeout=10) returns cleanly. Pre-#449 raised
  BelowMinimumWorld(ready=0, required_min=0) on a healthy cluster.

NEGATIVE TEST (the regression lock):
- test_NEGATIVE_coerce_to_dict_carries_distributed_and_image:
  asserts `_coerce_to_dict` produces a dict with
  `instanceName`, `userId`, `namespace`, `image`, `state`, `url`,
  `createdAt`, `updatedAt`, `phase`, `distributed`. A future
  regression that drops the PyO3 binding's `distributed` getter (or
  stops walking it in `_coerce_to_dict`) would fail this test
  loudly, not silently produce zeros across every read property.

The fixture uses an explicit `FakeDeployment` instance (not a plain
MagicMock) because post-#449 `_coerce_to_dict` checks
`isinstance(d, dict)` to distinguish a real PyO3 attribute from a
MagicMock auto-generated attribute.

All 36 distributed tests pass; the 30 pre-existing tests are
unaffected.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 3, 2026

Warning

Rate limit exceeded

@epappas has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 32 minutes and 24 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bd5a259f-57db-4928-8086-540cdf275d5b

📥 Commits

Reviewing files that changed from the base of the PR and between 7b34f67 and 60145f9.

📒 Files selected for processing (6)
  • crates/basilica-sdk-python/Cargo.toml
  • crates/basilica-sdk-python/python/basilica/_basilica.pyi
  • crates/basilica-sdk-python/python/basilica/distributed.py
  • crates/basilica-sdk-python/src/types.rs
  • crates/basilica-sdk-python/tests/test_distributed.py
  • crates/basilica-sdk/src/types.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sdk-pyo3-deploymentresponse-distributed-449

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 32 minutes and 24 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(sdk): PyO3 DeploymentResponse drops most fields incl. 'distributed' (SDK readers blind)

1 participant