feat(sdk): expose UD phase + wait_until_complete + UDTerminalState (refs #445)#461

Merged
epappas merged 1 commit into main from feat/sdk-distributed-completion-semantics-445 on May 5, 2026

Conversation


@epappas epappas commented May 5, 2026

Summary

Refs #445 (closed by the matching basilica-backend PR). Exposes the operator's Succeeded terminal state to Python users via:

  • DistributedTraining.phase — operator-driven phase string (pending ... succeeded | failed | cancelled).
  • DistributedTraining.is_terminal — convenience for the terminal triple.
  • DistributedTraining.rank_exits: List[RankExit] — per-rank exit diagnostics, populated when the UD reaches a terminal state.
  • wait_until_complete(timeout) — sync + async; blocks until any terminal state, returns WorldStatus on succeeded, raises BelowMinimumWorld (with per-rank exit detail) on failed, raises UDTerminalState if the UD is already terminal at entry.
  • wait_until_target_world(timeout) — returns cleanly on succeeded (training is done by definition); raises BelowMinimumWorld on failed / cancelled.
  • t.scale(target=N) — raises UDTerminalState BEFORE the API call when the UD is already terminal. The operator-side defense in depth (UDTerminalState Warning Event) covers kubectl edit mutations that bypass the SDK.
  • New UDTerminalState(DistributedError) exception with phase and requested_target attributes.
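
The pre-flight guard on `t.scale(target=N)` can be pictured with a minimal sketch. It uses stand-in classes rather than the real SDK: `FakeTraining`, its `api_calls` list, and the `UDTerminalState` constructor signature are assumptions made for illustration. Only the `phase` and `requested_target` attributes and the terminal triple come from this PR.

```python
from typing import List, Optional

TERMINAL_PHASES = frozenset({"succeeded", "failed", "cancelled"})


class DistributedError(Exception):
    """Stand-in for the SDK's base distributed error."""


class UDTerminalState(DistributedError):
    """Raised when a mutation targets a UD that is already terminal."""

    def __init__(self, phase: str, requested_target: Optional[int] = None):
        super().__init__(f"UD is in terminal phase '{phase}'")
        self.phase = phase
        self.requested_target = requested_target


class FakeTraining:
    """Hypothetical stand-in for DistributedTraining with a cached phase."""

    def __init__(self, phase: str):
        self.phase = phase
        self.api_calls: List[int] = []  # targets that reached the (fake) API

    @property
    def is_terminal(self) -> bool:
        return self.phase in TERMINAL_PHASES

    def scale(self, target: int) -> None:
        # The guard runs BEFORE any API call, so a terminal UD never sees a patch.
        if self.is_terminal:
            raise UDTerminalState(self.phase, requested_target=target)
        self.api_calls.append(target)
```

Checking the phase client-side keeps the error local and immediate; the operator-side Warning Event remains the backstop for mutations that never pass through the SDK.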

Per-deliverable status

Rust wire types

  • DistributedRankExit struct mirrors the operator's RankExit.
  • DistributedStatus.rank_exits: Vec<DistributedRankExit> field.
  • cargo check -p basilica-sdk clean.

Python facade

  • RankExit dataclass.
  • phase / is_terminal / rank_exits properties on DistributedTraining.
  • wait_until_complete (sync + async) with BelowMinimumWorld (carrying per-rank exit context) on failed, UDTerminalState on already-terminal-at-entry, BelowMinimumWorld on timeout.
  • wait_until_target_world (sync + async) returns clean on succeeded.
  • scale raises UDTerminalState on terminal phase BEFORE the API call.
  • UDTerminalState exception added to basilica.exceptions.
  • All re-exported in basilica/__init__.py.
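
As a rough illustration of the `rank_exits` property, the sketch below parses a camelCase `rankExits` list into dataclass instances. The field names follow the Rust struct described in this PR; the JSON keys `exitCode`, `terminationReason`, and `restartCount`, the defaults, and the helper name `parse_rank_exits` are assumptions, not the SDK's confirmed wire format.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class RankExit:
    rank: int
    exit_code: int
    termination_reason: Optional[str] = None
    restart_count: int = 0


def parse_rank_exits(distributed_status: Dict[str, Any]) -> List[RankExit]:
    """Turn the camelCase `rankExits` list into RankExit objects.

    Returns an empty list when the key is absent, as for a non-terminal UD.
    """
    return [
        RankExit(
            rank=int(item["rank"]),
            exit_code=int(item["exitCode"]),
            termination_reason=item.get("terminationReason"),
            restart_count=int(item.get("restartCount", 0)),
        )
        for item in distributed_status.get("rankExits") or []
    ]
```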

Tests

  • 11 new tests in tests/test_distributed.py::TestPhase5bCompletionSemantics:
    1. phase property reads succeeded from cached status.
    2. phase property reads non-terminal ready; is_terminal False.
    3. rank_exits parses camelCase JSON correctly.
    4. rank_exits empty when no rankExits key in distributed status.
    5. scale(target=N) on succeeded raises UDTerminalState; PyO3 patch NOT issued.
    6. scale(target=N) on failed raises UDTerminalState.
    7. wait_until_complete on UD already terminal at entry raises UDTerminalState.
    8. wait_until_complete returns WorldStatus when phase flips to succeeded mid-wait.
    9. wait_until_complete raises BelowMinimumWorld with rank 1: exit_code=137 reason=OOMKilled detail on mid-wait failed.
    10. wait_until_target_world returns clean on succeeded even with ready=0 (StatefulSet scaled to 0 by operator).
    11. wait_until_complete_async exists and is a coroutine function (SDK arch § 9 parity).
  • Full distributed test suite passes (60 tests; 49 prior + 11 Phase 5b).

Architecture doc

  • SDK doc § 6 / § 8 / § 11 updates land in the matching backend PR (the doc lives in basilica-private/docs/architecture/, not in this repo).

Test plan

What is NOT in this PR

  • Backend-side operator/reaper/billing-daemon changes (in the matching basilica-backend PR).
  • A typed Phase enum on the Python side. We use the snake_case string straight from the operator's status.phase because the SDK's existing phase: Optional[String] field on DeploymentResponse is already string-typed; introducing a new enum would have churned existing callers without buying type safety the existing surface lacks.
  • Auto-cleanup of Succeeded UDs in the SDK. The user's only entry point for cleanup is t.delete(), matching the operator-side contract.

…efs #445)

SDK-side surface for the distributed UD completion semantics shipped in
basilica-backend (#445). Exposes the operator's terminal phase machine
to Python users: `Pending -> Ready -> {Succeeded | Failed | Cancelled}`.

Rust wire types (crates/basilica-sdk/src/types.rs):
- DistributedRankExit struct mirrors the operator's RankExit, populated
  by basilica-api on `status.distributed.rankExits` when a UD reaches a
  terminal state.
- DistributedStatus.rank_exits field carries them end-to-end.

Python facade (crates/basilica-sdk-python/python/basilica/distributed.py):
- New `RankExit` dataclass mirrors the wire type.
- `DistributedTraining.phase` property reads `DeploymentResponse.phase`
  as a snake_case string (succeeded | failed | cancelled | ...).
- `DistributedTraining.is_terminal` convenience derives membership in
  the terminal triple.
- `DistributedTraining.rank_exits` reads the camelCase `rankExits` JSON
  list as `List[RankExit]`. Empty while the UD is non-terminal.
- `wait_until_complete(timeout)` blocks until any terminal state.
  Returns the final `WorldStatus` on `succeeded`. Raises `BelowMinimumWorld`
  on `failed` (with per-rank exit context). Raises `UDTerminalState`
  if the UD is already terminal at entry.
- `wait_until_target_world(timeout)` returns cleanly on `succeeded`
  (training is done by definition). Raises `BelowMinimumWorld` on
  `failed` / `cancelled`.
- `scale(target)` raises `UDTerminalState` BEFORE the API call when the
  UD is in a terminal phase. Operator-side defense in depth (UDTerminalState
  Warning Event) covers kubectl-edit mutations that bypass the SDK.
- Async parity: `wait_until_complete_async`.
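
For readers unfamiliar with the async side, here is a minimal sketch of a non-blocking polling loop. It is not the SDK's `wait_until_complete_async`: the `fetch_phase` callable, the `poll_interval` parameter, and the use of the built-in `TimeoutError` are assumptions chosen to keep the example self-contained.

```python
import asyncio
import time
from typing import Awaitable, Callable

TERMINAL_PHASES = frozenset({"succeeded", "failed", "cancelled"})


async def wait_for_terminal_phase(
    fetch_phase: Callable[[], Awaitable[str]],
    timeout: float,
    poll_interval: float = 0.01,
) -> str:
    """Poll `fetch_phase` until it reports a terminal phase or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        # The phase is checked before the deadline, so timeout=0 still
        # performs one refresh and can observe a terminal phase.
        phase = await fetch_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{phase}' after {timeout}s")
        # asyncio.sleep yields to the event loop; time.sleep would block it.
        await asyncio.sleep(poll_interval)
```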

Exceptions (python/basilica/exceptions.py):
- New `UDTerminalState(DistributedError)` carrying `phase` and
  `requested_target` attributes.

Re-exports updated in `python/basilica/__init__.py`.

Tests (crates/basilica-sdk-python/tests/test_distributed.py):
- 11 new tests in `TestPhase5bCompletionSemantics` cover: phase property
  reads succeeded / running, rank_exits camelCase parsing, scale rejects
  on succeeded / failed, wait_until_complete entry-guard, wait_until_complete
  flip mid-wait, wait_until_complete failed-mid-wait carries OOMKilled
  detail, wait_until_target_world returns clean on succeeded, async parity.
- Full distributed test suite passes (60 tests).

Pairs with `feat(operator,api,billing-daemon): Succeeded terminal state
for distributed UDs (closes #445)` in basilica-backend.

coderabbitai Bot commented May 5, 2026

Walkthrough

This PR adds Phase 5b terminal-phase support to the distributed training SDK: terminal phase detection, per-rank exit diagnostics, and wait-until-complete semantics. It introduces the RankExit and UDTerminalState types, properties for reading the phase and terminal status, scale/wait methods that respect terminal states, and Python test coverage of the new behavior alongside the Rust wire-type changes.

Changes

Phase 5b Terminal-Phase Support

| Layer | File(s) | Summary |
|---|---|---|
| Data Shape (Rust) | `crates/basilica-sdk/src/types.rs` | New DistributedRankExit struct (rank, exit_code, optional termination_reason, restart_count) added to model per-rank termination diagnostics. DistributedStatus extended with rank_exits: Vec<DistributedRankExit> field (camelCase serialized, defaults to empty, skipped when empty). |
| Data Shape (Python) | `crates/basilica-sdk-python/python/basilica/distributed.py` | RankExit dataclass mirrors Rust type; TERMINAL_PHASES constant defines terminal states ("succeeded", "failed", "cancelled"). |
| Terminal-Phase Properties | `crates/basilica-sdk-python/python/basilica/distributed.py` | DistributedTraining.phase reads top-level deployment phase from cached status; is_terminal checks membership in TERMINAL_PHASES; rank_exits parses distributed.rankExits into RankExit objects. |
| Mutation Guard | `crates/basilica-sdk-python/python/basilica/distributed.py` | DistributedTraining.scale() updated to raise UDTerminalState if current phase is terminal, preventing scaling on completed UDs. |
| Completion Polling | `crates/basilica-sdk-python/python/basilica/distributed.py` | New wait_until_complete() and wait_until_complete_async() methods poll until terminal state, raising UDTerminalState if already terminal at entry, returning WorldStatus on "succeeded", and raising BelowMinimumWorld (with per-rank exit details) on "failed"/"cancelled". |
| Phase Transition Handling | `crates/basilica-sdk-python/python/basilica/distributed.py` | wait_until_target_world() and async variant reworked with Phase 5b semantics: return immediately on phase == "succeeded"; raise BelowMinimumWorld when phase becomes "failed"/"cancelled" before target condition met. |
| Exception Support | `crates/basilica-sdk-python/python/basilica/exceptions.py` | New UDTerminalState(DistributedError) exception with code="DISTRIBUTED_UD_TERMINAL_STATE", retryable=False; stores phase and optional requested_target attributes. |
| Module Exports | `crates/basilica-sdk-python/python/basilica/__init__.py` | RankExit imported from .distributed and UDTerminalState from .exceptions added to package-level exports. |
| Tests & Coverage | `crates/basilica-sdk-python/tests/test_distributed.py` | TestPhase5bCompletionSemantics validates cached phase reading, is_terminal correctness, rank_exits parsing, scale() rejection on terminal phases, wait_until_complete() behaviors (already-terminal entry, phase flip to succeeded, mid-wait failure with rank diagnostics), wait_until_target_world() regression coverage, and async parity. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant DistributedTraining
    participant StatusCache
    participant PhaseMonitor
    participant ExceptionHandler

    Client->>DistributedTraining: wait_until_complete()
    DistributedTraining->>StatusCache: refresh()
    StatusCache-->>DistributedTraining: cached status
    DistributedTraining->>PhaseMonitor: check phase
    
    alt Already Terminal
        PhaseMonitor-->>DistributedTraining: phase ∈ {succeeded, failed, cancelled}
        DistributedTraining->>ExceptionHandler: raise UDTerminalState
        ExceptionHandler-->>Client: UDTerminalState
    else Poll for Completion
        loop Until Terminal or Timeout
            PhaseMonitor->>StatusCache: refresh status
            StatusCache-->>PhaseMonitor: updated phase
            
            alt Phase == "succeeded"
                PhaseMonitor-->>DistributedTraining: return WorldStatus
                DistributedTraining-->>Client: WorldStatus
            else Phase == "failed" or "cancelled"
                PhaseMonitor->>ExceptionHandler: extract rank_exits
                ExceptionHandler-->>DistributedTraining: raise BelowMinimumWorld(with rank diagnostics)
                DistributedTraining-->>Client: BelowMinimumWorld
            else Non-Terminal
                Note over PhaseMonitor: continue polling
            end
        end
    end
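
The flow in the diagram can be approximated in plain Python. This is a sketch under stated assumptions, not the SDK source: the exception classes are local stand-ins, `refresh_phase` is a hypothetical callable that replaces `refresh()` plus the `phase` property, and the function returns the phase string where the real method returns a `WorldStatus`.

```python
import time
from typing import Callable

TERMINAL_PHASES = frozenset({"succeeded", "failed", "cancelled"})


class UDTerminalState(Exception):
    """Stand-in: the UD was already terminal when the wait started."""


class BelowMinimumWorld(Exception):
    """Stand-in: the UD failed, was cancelled, or the wait timed out."""


def wait_until_complete(
    refresh_phase: Callable[[], str],
    timeout: float,
    poll_interval: float = 0.01,
) -> str:
    # Entry guard: an already-terminal UD is reported, not waited on.
    phase = refresh_phase()
    if phase in TERMINAL_PHASES:
        raise UDTerminalState(phase)

    deadline = time.monotonic() + timeout
    while True:
        phase = refresh_phase()
        if phase == "succeeded":
            return phase
        if phase in ("failed", "cancelled"):
            raise BelowMinimumWorld(f"terminal phase '{phase}'")
        # Terminal phases are checked before the deadline, so a failure seen
        # on the last refresh is never reported as a plain timeout.
        if time.monotonic() >= deadline:
            raise BelowMinimumWorld(f"timed out after {timeout}s (phase={phase})")
        time.sleep(poll_interval)
```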

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • one-covenant/basilica#447: Directly related — modifies the same distributed SDK surfaces (distributed.py, exceptions.py) and package exports (__init__.py) to add Phase 5b terminal-phase semantics and RankExit/UDTerminalState types.

Poem

🐰 A hop through phases, terminal and wise,
RankExits surface before our eyes,
No scaling when the UD's reached the end,
On completion's call, let us descend,
Phase 5b brings closure—hip, hip, hooray! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.41% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately captures the main changes: exposing UD phase, wait_until_complete functionality, and the UDTerminalState exception. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/basilica-sdk-python/python/basilica/distributed.py (1)

575-584: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve failed/cancelled semantics after the final refresh.

The loop handles terminal failure states correctly, but the post-deadline path only checks for "succeeded". If the UD flips to "failed" or "cancelled" on the last refresh — or timeout=0 skips the loop entirely — callers get a generic timeout message instead of the terminal-state error this method promises.

Suggested fix
         self.refresh()
         ws = self.world
         if self.phase == "succeeded":
             return
+        if self.phase in ("failed", "cancelled"):
+            raise BelowMinimumWorld(
+                f"wait_until_target_world: UD '{self.name}' reached "
+                f"terminal phase '{self.phase}' before world reached target",
+                ready=ws.ready,
+                required_min=ws.target,
+                timeout=timeout,
+            )
         if ws.ready < ws.target or ws.target == 0:
             raise BelowMinimumWorld(
                 f"wait_until_target_world timed out after {timeout}s "
                 f"(ready={ws.ready}, required_min={ws.target})",
                 ready=ws.ready,
                 required_min=ws.target,
                 timeout=timeout,
             )

Apply the same final-state branch to wait_until_target_world_async().

Also applies to: 609-618
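
The decision order after the final refresh can be isolated as a small pure function, which makes the `timeout=0` case easy to reason about. This is only a sketch: `Outcome` and `classify_final_state` are hypothetical names, and the real method raises `BelowMinimumWorld` rather than returning an enum.

```python
from enum import Enum


class Outcome(Enum):
    DONE = "done"                          # return cleanly
    TERMINAL_FAILURE = "terminal_failure"  # raise with the terminal-phase message
    TIMEOUT = "timeout"                    # raise with the timed-out message


def classify_final_state(phase: str, ready: int, target: int) -> Outcome:
    """Decide what the post-deadline path should do with the last refresh."""
    if phase == "succeeded":
        return Outcome.DONE
    # Checked before the world-size test so failures are not masked as timeouts.
    if phase in ("failed", "cancelled"):
        return Outcome.TERMINAL_FAILURE
    if ready < target or target == 0:
        return Outcome.TIMEOUT
    return Outcome.DONE
```

Because the function has no I/O, the same logic could back both the sync and async variants without duplicating the branch order.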

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/basilica-sdk-python/python/basilica/distributed.py` around lines 575 -
584, The post-timeout path in wait_until_target_world_async (and the analogous
block at the other location) only checks for "succeeded" and therefore returns a
generic timeout even if the last refresh set self.phase to "failed" or
"cancelled"; update the final-state handling so after the loop (or when
timeout=0) you re-check self.phase and raise the same terminal-state exceptions
used in the loop for "failed" and "cancelled" (the same branches/exception types
used earlier in wait_until_target_world_async/ wait_until_target_world to
preserve semantics) before falling back to the BelowMinimumWorld timeout error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 15d6dca2-e7bd-4312-9331-20a13fd6221b

📥 Commits

Reviewing files that changed from the base of the PR and between c1d1c0a and 6ee257f.

📒 Files selected for processing (5)
  • crates/basilica-sdk-python/python/basilica/__init__.py
  • crates/basilica-sdk-python/python/basilica/distributed.py
  • crates/basilica-sdk-python/python/basilica/exceptions.py
  • crates/basilica-sdk-python/tests/test_distributed.py
  • crates/basilica-sdk/src/types.rs

Comment on lines +680 to +691
        # Timeout: surface as `BelowMinimumWorld` with the live shape.
        self.refresh()
        ws = self.world
        if self.phase == "succeeded":
            return ws
        raise BelowMinimumWorld(
            f"wait_until_complete: timed out after {timeout}s "
            f"(phase={self.phase}, ready={ws.ready}, target={ws.target})",
            ready=ws.ready,
            required_min=ws.min,
            timeout=timeout,
        )

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return terminal failure details here too, not just inside the polling loop.

wait_until_complete*() promises to surface failed / cancelled with rank-exit context, but the final refresh path only checks for "succeeded". A terminal failure that lands on the last refresh is currently reported as a timeout, which loses the new diagnostics this PR adds.

Suggested fix
         self.refresh()
         ws = self.world
         if self.phase == "succeeded":
             return ws
+        if self.phase in ("failed", "cancelled"):
+            exits = self.rank_exits
+            bad = [
+                f"rank {e.rank}: exit_code={e.exit_code}"
+                + (f" reason={e.termination_reason}" if e.termination_reason else "")
+                for e in exits
+                if e.exit_code != 0
+            ]
+            detail = "; ".join(bad) if bad else "no per-rank exits recorded"
+            raise BelowMinimumWorld(
+                f"wait_until_complete: UD '{self.name}' reached "
+                f"terminal phase '{self.phase}' ({detail})",
+                ready=ws.ready,
+                required_min=ws.min,
+                timeout=timeout,
+            )
         raise BelowMinimumWorld(
             f"wait_until_complete: timed out after {timeout}s "
             f"(phase={self.phase}, ready={ws.ready}, target={ws.target})",
             ready=ws.ready,
             required_min=ws.min,
             timeout=timeout,
         )

Mirror the same branch in wait_until_complete_async().

Also applies to: 730-740
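
Since the same per-rank detail string is needed in the polling loop, the final-refresh branch, and the async variant, one option is to factor it into a helper. The sketch below lifts the formatting from the suggested fix; `describe_failed_ranks` is a hypothetical name and the `RankExit` here is a trimmed stand-in for the SDK dataclass.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RankExit:
    rank: int
    exit_code: int
    termination_reason: Optional[str] = None


def describe_failed_ranks(exits: Iterable[RankExit]) -> str:
    """Summarise non-zero exits as 'rank N: exit_code=C reason=R; ...'."""
    bad = [
        f"rank {e.rank}: exit_code={e.exit_code}"
        + (f" reason={e.termination_reason}" if e.termination_reason else "")
        for e in exits
        if e.exit_code != 0  # ranks that exited cleanly are omitted
    ]
    return "; ".join(bad) if bad else "no per-rank exits recorded"
```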

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/basilica-sdk-python/python/basilica/distributed.py` around lines 680 -
691, the timeout path in wait_until_complete (and the async variant
wait_until_complete_async) treats every non-succeeded final state as a timeout,
losing terminal failure diagnostics. After self.refresh(), check for the
terminal phases "failed" and "cancelled" and raise BelowMinimumWorld with the
rank-exit context (the same exception the polling loop raises), including the
live world (self.world) and ready/required_min/timeout. Raise the generic
timeout BelowMinimumWorld only when the UD is still non-terminal, and mirror
the change in wait_until_complete_async.

name: str = "dlc-done",
namespace: str = "u-test",
phase: str = "succeeded",
rank_exits: list = None,

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the implicit Optional occurrence and whether Ruff config enables RUF013.
rg -n 'rank_exits:\s*list\s*=\s*None' crates/basilica-sdk-python/tests/test_distributed.py
fd -HI 'pyproject.toml' | xargs -r rg -n 'ruff|RUF013|select|extend-select|ignore'

Repository: one-covenant/basilica

Length of output: 161


🏁 Script executed:

fd -HI 'pyproject.toml' --max-results 5 | head -10

Repository: one-covenant/basilica

Length of output: 147


🏁 Script executed:

cat -n pyproject.toml | head -100

Repository: one-covenant/basilica

Length of output: 112


🏁 Script executed:

cat -n crates/basilica-sdk-python/pyproject.toml | grep -A 20 -B 5 'ruff\|RUF013'

Repository: one-covenant/basilica

Length of output: 47


🏁 Script executed:

cat crates/basilica-sdk-python/pyproject.toml

Repository: one-covenant/basilica

Length of output: 1949


🏁 Script executed:

find . -name '.ruff.toml' -o -name 'ruff.toml' 2>/dev/null

Repository: one-covenant/basilica

Length of output: 77


🏁 Script executed:

cd crates/basilica-sdk-python && find . -name '*.py' -path '*/tests/*' | head -5 && cd - && echo "---" && rg 'from typing import|Optional' crates/basilica-sdk-python/tests/test_distributed.py | head -20

Repository: one-covenant/basilica

Length of output: 283


🏁 Script executed:

cat -n ./ruff.toml

Repository: one-covenant/basilica

Length of output: 1404


🏁 Script executed:

sed -n '1250,1280p' crates/basilica-sdk-python/tests/test_distributed.py

Repository: one-covenant/basilica

Length of output: 1055


🏁 Script executed:

grep 'requires-python' crates/basilica-sdk-python/pyproject.toml

Repository: one-covenant/basilica

Length of output: 92


🏁 Script executed:

sed -n '1250,1310p' crates/basilica-sdk-python/tests/test_distributed.py | cat -n

Repository: one-covenant/basilica

Length of output: 2424


🏁 Script executed:

rg -n '_make_terminal_training' crates/basilica-sdk-python/tests/test_distributed.py | head -20

Repository: one-covenant/basilica

Length of output: 481


Use explicit optional typing for rank_exits.

rank_exits: list = None violates Ruff RUF013 (implicit Optional). Change to rank_exits: list | None = None to make the optionality explicit in the type annotation.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 1254-1254: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/basilica-sdk-python/tests/test_distributed.py` at line 1254, Change
the implicit optional annotation for the parameter/variable named rank_exits
from "rank_exits: list = None" to an explicit optional type like "rank_exits:
list | None = None" so the type clearly indicates Optional[List] (satisfying
Ruff RUF013); locate the declaration of rank_exits in the affected function or
test and update the type annotation accordingly.

"max": 2,
"belowMinimum": True,
},
"rankExits": rank_exits or [

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve explicit empty rank_exits in the fixture helper.

rank_exits or [...] replaces an intentionally passed empty list with defaults. That makes it impossible to model terminal states with zero exits.

Suggested fix
-        "rankExits": rank_exits or [
+        "rankExits": rank_exits if rank_exits is not None else [
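
The difference between the two forms is easy to see in isolation. The sketch below uses hypothetical helper names and a made-up default list; it is not the fixture from `test_distributed.py`.

```python
from typing import Any, Dict, List, Optional

DEFAULT_EXITS: List[Dict[str, Any]] = [{"rank": 0, "exitCode": 0}]


def exits_with_or(
    rank_exits: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    # Buggy form: [] is falsy, so an explicit empty list is replaced.
    return rank_exits or DEFAULT_EXITS


def exits_with_none_check(
    rank_exits: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    # Fixed form: only a missing argument (None) falls back to the defaults.
    return rank_exits if rank_exits is not None else DEFAULT_EXITS
```

With the `None` check, a test can pass `[]` to model a terminal UD that recorded no exits, which the `or` form silently turns back into the default list.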
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/basilica-sdk-python/tests/test_distributed.py` at line 1275, The
fixture helper currently uses "rank_exits or [...]" which treats an explicitly
passed empty list as falsy and replaces it with defaults; change the assignment
to use an explicit None check (e.g., use "rank_exits if rank_exits is not None
else [...]") so that an empty list passed into the fixture helper is preserved;
adjust the initialization where "rankExits": rank_exits or [...] appears to use
the None-check form and keep the default list only when rank_exits is actually
None.

Comment on lines +1611 to +1617
/// Phase 5b (#445): per-rank exit diagnostics. Empty while the UD is
/// non-terminal. On transition to `Succeeded` / `Failed` the operator
/// snapshots each worker pod's container `terminated` state and
/// persists it here so the SDK can surface them after the worker
/// StatefulSet has been scaled to `replicas: 0`.
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub rank_exits: Vec<DistributedRankExit>,

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Update explicit DistributedStatus literals for the new field.

Adding rank_exits here makes every manual DistributedStatus { ... } initializer require it. In this file, Line 2730’s _sample_deployment_with_distributed() literal is still missing rank_exits, so this crate will not compile until that initializer is updated (or switched to ..Default::default() where appropriate).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/basilica-sdk/src/types.rs` around lines 1611 - 1617, The
DistributedStatus struct gained a new field rank_exits so any manual
DistributedStatus literal must include it; update the
_sample_deployment_with_distributed() test/sample initializer to set rank_exits
(e.g., an empty Vec::new()) or replace the explicit DistributedStatus literal
with DistributedStatus { ..Default::default() } (or otherwise fill the new
field) so the crate compiles; target the initializer in the
_sample_deployment_with_distributed() function and ensure the DistributedStatus
literal includes rank_exits: Vec::new().

@epappas epappas merged commit 8bc2037 into main May 5, 2026
12 of 14 checks passed
epappas added a commit that referenced this pull request May 8, 2026
PR #461 added rank_exits: Vec<DistributedRankExit> to DistributedStatus
but missed initializing the field in the
_sample_deployment_with_distributed test fixture, breaking
`cargo check --tests` on every PR opened against main.

The other DistributedStatus initializer in the same test module uses
..Default::default() and is unaffected.