Skip to content

fix(sdk): wait_until_target_world races stale status.worldSize.target (#495)#464

Merged
epappas merged 1 commit into
mainfrom
fix/sdk-scale-stale-target-495
May 8, 2026
Merged

fix(sdk): wait_until_target_world races stale status.worldSize.target (#495)#464
epappas merged 1 commit into
mainfrom
fix/sdk-scale-stale-target-495

Conversation

@epappas
Copy link
Copy Markdown
Contributor

@epappas epappas commented May 8, 2026

Summary

DistributedTraining.scale(target=N) patches spec.distributed.worldSize.target only; the operator mirrors spec → status on its next reconcile pass. The SDK's wait_until_target_world checked ws.ready >= ws.target against the stale status.worldSize.target, so on the first loop iteration after scale() it saw ready == stale_target and returned — claiming N ranks ready while only the original count had ever been scheduled.

Track the requested target on the DistributedTraining instance and gate the success predicate so a pending scale forces the loop to wait until both status.worldSize.target reflects the patch AND ready >= requested. Also return a WorldStatus from scale() that reports the requested target, so example output matches reality.

Closes #495.

Layer responsibility

Per investigation in docs/incidents/2026-05-08-scale-silent-failure-rca.md (basilica-private):

  • SDK is the load-bearing fix. wait_until_target_world had no way to distinguish "operator hasn't yet observed my scale request" from "world is settled at the current target."
  • API/operator is a contributor but per the documented contract (crates/basilica-api/.../handlers.rs:1480-1489) POST /scale-distributed is intentionally async and the SDK wait-helpers are the documented synchronization point. No separate operator-side issue filed.

Test plan

  • `pytest tests/test_distributed.py` — 70 tests pass (66 existing + 4 new).
  • `pytest tests/` (full SDK suite minus e2e) — 102 pass.
  • Anti-regression check: with the prod fix reverted, 3 of the 4 new `issue_495` tests fail (the load-bearing `does_not_shortcircuit_on_stale_target` test fails with the exact bug pattern).
  • Live re-run of `examples/21_distributed_torchrun.py` (paid; deferred to reviewer).

What's NOT in this PR

  • No changes to the API or operator. `POST /scale-distributed` remains async; the resize-observed acknowledgement is not part of this change.
  • No changes to `examples/21_distributed_torchrun.py`. The example was the canary; with the SDK fix, its existing `wait_until_target_world(timeout=300)` will block correctly until the new rank joins.

Live evidence (the bug)

ex21-r2, 2026-05-08T12:52Z, exit 0 in 50s wall-clock:

```
World: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
Scaled to target=3; world now: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
All 3 ranks ready: WorldStatus(ready=2, target=2, min=2, max=4, below_minimum=False)
```

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed distributed training scaling operations to correctly report requested scale targets and prevent premature timeouts. The system now properly tracks scaling requests through operator reconciliation.
  • Tests

    • Added comprehensive regression tests covering distributed training scaling and wait-until-target operations.

… (refs #495)

scale(target=N) patches spec.distributed.worldSize.target only; the
operator mirrors spec to status on its next reconcile pass. The SDK's
wait_until_target_world checked ws.ready >= ws.target, where ws.target
came from the stale status. On the first iteration after scale(), the
loop saw ready==stale_target and returned, claiming N ranks ready while
only the original count had ever been scheduled.

Track the requested target on the DistributedTraining instance and gate
the success predicate so a pending scale forces the loop to wait until
status.worldSize.target reflects the patch AND ready >= requested. Also
return a WorldStatus from scale() that reports the requested target, so
example output matches reality.

Tests:
- test_issue_495_wait_until_target_world_does_not_shortcircuit_on_stale_target
  (load-bearing): scale + stale refresh + delayed reconcile, asserts the
  loop sleeps and only returns when status reflects the patched target.
- test_issue_495_scale_returns_world_status_with_requested_target.
- test_issue_495_wait_until_target_world_calls_refresh_before_first_check.
- test_issue_495_wait_until_target_world_times_out_with_pending_required_min.

Live evidence: ex21-r2 (2026-05-08T12:52Z) — exit 0 in 50s, "All 3 ranks
ready" while ws.target=2 ws.ready=2.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 8, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c15e1f1b-ca22-4952-bf16-a9d38738b2e4

📥 Commits

Reviewing files that changed from the base of the PR and between d8f06bb and a5378cd.

📒 Files selected for processing (2)
  • crates/basilica-sdk-python/python/basilica/distributed.py
  • crates/basilica-sdk-python/tests/test_distributed.py

Walkthrough

DistributedTraining fixes stale operator target reflection by adding pending-scale tracking. The scale(target=N) method now records the requested target and immediately returns a WorldStatus with that target value. Both wait_until_target_world methods use pending-aware predicates to gate success and report accurate required targets on timeout. New tests verify the complete flow.

Changes

Distributed Training Stale Target Resolution

Layer / File(s) Summary
Pending Scale State Initialization
crates/basilica-sdk-python/python/basilica/distributed.py
DistributedTraining.__init__ adds _pending_scale_target field to track the most recently requested scale target.
Scale Method & Pending-Aware Helpers
crates/basilica-sdk-python/python/basilica/distributed.py
scale(target) sets _pending_scale_target, refreshes status, and returns WorldStatus with target forced to requested value. New internal helpers _target_world_reached(ws) and _effective_required_target(ws) implement pending-gated success predicates and error-target computation.
Synchronous Wait with Pending Awareness
crates/basilica-sdk-python/python/basilica/distributed.py
wait_until_target_world(timeout) uses pending-gated predicate to prevent short-circuit on stale target, clears _pending_scale_target on success, and reports effective required target on timeout.
Asynchronous Wait with Pending Awareness
crates/basilica-sdk-python/python/basilica/distributed.py
wait_until_target_world_async(timeout) mirrors synchronous behavior with pending-aware logic and state clearing on success or terminal error.
Regression Test Coverage for Issue #495
crates/basilica-sdk-python/tests/test_distributed.py
TestIssue495ScaleStaleTargetGate verifies scale() returns requested target, wait_until_target_world does not short-circuit on stale status, refresh occurs before first comparison, and timeout errors surface the requested (not stale) required target.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • one-covenant/basilica#461: Both PRs modify DistributedTraining.scale() and wait_until_target_world() (and async variant) in python/basilica/distributed.py.

Poem

🐰 A pending target marks the way,
No stale reflections slow our day!
Scale requests are now preserved,
Until the operator's noticed what we served.
Waiting wisely, we finally see,
What the distributed world should be! ✨

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sdk-scale-stale-target-495

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@epappas epappas marked this pull request as ready for review May 8, 2026 15:53
@epappas epappas merged commit 2abc1cb into main May 8, 2026
14 checks passed
@epappas epappas deleted the fix/sdk-scale-stale-target-495 branch May 8, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant