fix(sdk): wait_until_target_world races stale status.worldSize.target (#495) #464
… (refs #495)

scale(target=N) patches spec.distributed.worldSize.target only; the operator mirrors spec to status on its next reconcile pass. The SDK's wait_until_target_world checked ws.ready >= ws.target, where ws.target came from the stale status. On the first iteration after scale(), the loop saw ready == stale_target and returned, claiming N ranks ready while only the original count had ever been scheduled.

Track the requested target on the DistributedTraining instance and gate the success predicate so a pending scale forces the loop to wait until status.worldSize.target reflects the patch AND ready >= requested. Also return a WorldStatus from scale() that reports the requested target, so example output matches reality.

Tests:
- test_issue_495_wait_until_target_world_does_not_shortcircuit_on_stale_target (load-bearing): scale + stale refresh + delayed reconcile; asserts the loop sleeps and only returns when status reflects the patched target.
- test_issue_495_scale_returns_world_status_with_requested_target.
- test_issue_495_wait_until_target_world_calls_refresh_before_first_check.
- test_issue_495_wait_until_target_world_times_out_with_pending_required_min.

Live evidence: ex21-r2 (2026-05-08T12:52Z): exit 0 in 50s, "All 3 ranks ready" while ws.target=2 ws.ready=2.
Walkthrough

DistributedTraining fixes stale operator target reflection by adding pending-scale tracking.
Summary
`DistributedTraining.scale(target=N)` patches `spec.distributed.worldSize.target` only; the operator mirrors `spec → status` on its next reconcile pass. The SDK's `wait_until_target_world` checked `ws.ready >= ws.target` against the stale `status.worldSize.target`, so on the first loop iteration after `scale()` it saw `ready == stale_target` and returned, claiming `N` ranks ready while only the original count had ever been scheduled.

Track the requested target on the `DistributedTraining` instance and gate the success predicate so a pending scale forces the loop to wait until both `status.worldSize.target` reflects the patch AND `ready >= requested`. Also return a `WorldStatus` from `scale()` that reports the requested target, so example output matches reality.

Closes #495.
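The gated predicate described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's actual code: `world_is_settled`, `wait_until_settled`, and the `refresh`/`sleep`/`clock` parameters are hypothetical names, the `WorldStatus` stand-in carries only the two fields the predicate needs, and "reflects the patch" is assumed to mean `status.target == requested`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class WorldStatus:
    # Stand-in for the SDK type; only the fields the predicate reads.
    ready: int
    target: int


def world_is_settled(ws: WorldStatus, requested: Optional[int]) -> bool:
    """Success predicate. `requested` is the target last passed to scale(), if any."""
    if requested is None:
        # No scale pending: the status target is the only target there is.
        return ws.ready >= ws.target
    # Pending scale: status must reflect the patch AND enough ranks must be ready.
    return ws.target == requested and ws.ready >= requested


def wait_until_settled(
    refresh: Callable[[], WorldStatus],
    requested: Optional[int],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> WorldStatus:
    """Poll `refresh` until the predicate holds or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        ws = refresh()  # always refresh before the first check
        if world_is_settled(ws, requested):
            return ws
        if clock() >= deadline:
            raise TimeoutError(f"world not settled: {ws}, requested={requested}")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters is a choice made here so the loop can be driven by a fake status sequence in tests without real waiting.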
Layer responsibility
Per investigation in `docs/incidents/2026-05-08-scale-silent-failure-rca.md` (basilica-private):

- `wait_until_target_world` had no way to distinguish "operator hasn't yet observed my scale request" from "world is settled at the current target."
- (`crates/basilica-api/.../handlers.rs:1480-1489`) `POST /scale-distributed` is intentionally async and the SDK wait-helpers are the documented synchronization point. No separate operator-side issue filed.

Test plan
What's NOT in this PR
Live evidence (the bug)
ex21-r2, 2026-05-08T12:52Z, exit 0 in 50s wall-clock:
```
World: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
Scaled to target=3; world now: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
All 3 ranks ready: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
```
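To see why this log ends in a false "All 3 ranks ready", the logged status can be replayed against both predicates. The `WorldStatus` dataclass below is a stand-in whose fields are taken from the log output; the new predicate assumes "reflects the patch" means `target == requested`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldStatus:
    # Stand-in type; field names and order copied from the log lines above.
    ready: int
    target: int
    min: int
    max: int
    below_minimum: bool


# The status printed on all three lines of the ex21-r2 log.
stale = WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
requested = 3  # the target passed to scale()

# Old check: compares ready against the stale status target.
old_predicate = stale.ready >= stale.target
# Gated check: status must show the patched target AND enough ready ranks.
new_predicate = stale.target == requested and stale.ready >= requested

print(old_predicate)  # True: the loop returned immediately
print(new_predicate)  # False: the loop keeps waiting for the reconcile
```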