fix(sdk): wait_until_target_world races stale status.worldSize.target (#495) by epappas · Pull Request #464 · one-covenant/basilica

epappas · 2026-05-08T15:23:24Z

Summary

DistributedTraining.scale(target=N) patches spec.distributed.worldSize.target only; the operator mirrors spec → status on its next reconcile pass. The SDK's wait_until_target_world checked ws.ready >= ws.target against the stale status.worldSize.target, so on the first loop iteration after scale() it saw ready == stale_target and returned — claiming N ranks ready while only the original count had ever been scheduled.

Track the requested target on the DistributedTraining instance and gate the success predicate so a pending scale forces the loop to wait until both status.worldSize.target reflects the patch AND ready >= requested. Also return a WorldStatus from scale() that reports the requested target, so example output matches reality.

Closes #495.

Layer responsibility

Per investigation in docs/incidents/2026-05-08-scale-silent-failure-rca.md (basilica-private):

SDK is the load-bearing fix. wait_until_target_world had no way to distinguish "operator hasn't yet observed my scale request" from "world is settled at the current target."
API/operator is a contributor but per the documented contract (crates/basilica-api/.../handlers.rs:1480-1489) POST /scale-distributed is intentionally async and the SDK wait-helpers are the documented synchronization point. No separate operator-side issue filed.

Test plan

`pytest tests/test_distributed.py` — 70 tests pass (66 existing + 4 new).
`pytest tests/` (full SDK suite minus e2e) — 102 pass.
Anti-regression check: with the prod fix reverted, 3 of the 4 new `issue_495` tests fail (the load-bearing `does_not_shortcircuit_on_stale_target` test fails with the exact bug pattern).
Live re-run of `examples/21_distributed_torchrun.py` (paid; deferred to reviewer).

What's NOT in this PR

No changes to the API or operator. `POST /scale-distributed` remains async; the resize-observed acknowledgement is not part of this change.
No changes to `examples/21_distributed_torchrun.py`. The example was the canary; with the SDK fix, its existing `wait_until_target_world(timeout=300)` will block correctly until the new rank joins.

Live evidence (the bug)

ex21-r2, 2026-05-08T12:52Z, exit 0 in 50s wall-clock:

```
World: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
Scaled to target=3; world now: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
All 3 ranks ready: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
```

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed distributed training scaling operations to correctly report requested scale targets and prevent premature timeouts. The system now properly tracks scaling requests through operator reconciliation.
Tests
- Added comprehensive regression tests covering distributed training scaling and wait-until-target operations.

… (refs #495) scale(target=N) patches spec.distributed.worldSize.target only; the operator mirrors spec to status on its next reconcile pass. The SDK's wait_until_target_world checked ws.ready >= ws.target, where ws.target came from the stale status. On the first iteration after scale(), the loop saw ready==stale_target and returned, claiming N ranks ready while only the original count had ever been scheduled. Track the requested target on the DistributedTraining instance and gate the success predicate so a pending scale forces the loop to wait until status.worldSize.target reflects the patch AND ready >= requested. Also return a WorldStatus from scale() that reports the requested target, so example output matches reality. Tests: - test_issue_495_wait_until_target_world_does_not_shortcircuit_on_stale_target (load-bearing): scale + stale refresh + delayed reconcile, asserts the loop sleeps and only returns when status reflects the patched target. - test_issue_495_scale_returns_world_status_with_requested_target. - test_issue_495_wait_until_target_world_calls_refresh_before_first_check. - test_issue_495_wait_until_target_world_times_out_with_pending_required_min. Live evidence: ex21-r2 (2026-05-08T12:52Z) — exit 0 in 50s, "All 3 ranks ready" while ws.target=2 ws.ready=2.

coderabbitai · 2026-05-08T15:23:34Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c15e1f1b-ca22-4952-bf16-a9d38738b2e4

📥 Commits

Reviewing files that changed from the base of the PR and between d8f06bb and a5378cd.

📒 Files selected for processing (2)

crates/basilica-sdk-python/python/basilica/distributed.py
crates/basilica-sdk-python/tests/test_distributed.py

Walkthrough

DistributedTraining fixes stale operator target reflection by adding pending-scale tracking. The scale(target=N) method now records the requested target and immediately returns a WorldStatus with that target value. Both wait_until_target_world methods use pending-aware predicates to gate success and report accurate required targets on timeout. New tests verify the complete flow.

Changes

Distributed Training Stale Target Resolution

Layer / File(s)	Summary
Pending Scale State Initialization `crates/basilica-sdk-python/python/basilica/distributed.py`	`DistributedTraining.__init__` adds `_pending_scale_target` field to track the most recently requested scale target.
Scale Method & Pending-Aware Helpers `crates/basilica-sdk-python/python/basilica/distributed.py`	`scale(target)` sets `_pending_scale_target`, refreshes status, and returns `WorldStatus` with target forced to requested value. New internal helpers `_target_world_reached(ws)` and `_effective_required_target(ws)` implement pending-gated success predicates and error-target computation.
Synchronous Wait with Pending Awareness `crates/basilica-sdk-python/python/basilica/distributed.py`	`wait_until_target_world(timeout)` uses pending-gated predicate to prevent short-circuit on stale target, clears `_pending_scale_target` on success, and reports effective required target on timeout.
Asynchronous Wait with Pending Awareness `crates/basilica-sdk-python/python/basilica/distributed.py`	`wait_until_target_world_async(timeout)` mirrors synchronous behavior with pending-aware logic and state clearing on success or terminal error.
Regression Test Coverage for Issue `#495` `crates/basilica-sdk-python/tests/test_distributed.py`	`TestIssue495ScaleStaleTargetGate` verifies `scale()` returns requested target, `wait_until_target_world` does not short-circuit on stale status, refresh occurs before first comparison, and timeout errors surface the requested (not stale) required target.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

one-covenant/basilica#461: Both PRs modify DistributedTraining.scale() and wait_until_target_world() (and async variant) in python/basilica/distributed.py.

Poem

🐰 A pending target marks the way,
No stale reflections slow our day!
Scale requests are now preserved,
Until the operator's noticed what we served.
Waiting wisely, we finally see,
What the distributed world should be! ✨

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/sdk-scale-stale-target-495

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

epappas marked this pull request as ready for review May 8, 2026 15:53

epappas merged commit 2abc1cb into main May 8, 2026
14 checks passed

epappas deleted the fix/sdk-scale-stale-target-495 branch May 8, 2026 15:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sdk): wait_until_target_world races stale status.worldSize.target (#495)#464

fix(sdk): wait_until_target_world races stale status.worldSize.target (#495)#464
epappas merged 1 commit into
mainfrom
fix/sdk-scale-stale-target-495

epappas commented May 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Review failed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

epappas commented May 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Layer responsibility

Test plan

What's NOT in this PR

Live evidence (the bug)

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

epappas commented May 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 8, 2026 •

edited

Loading