Cross-repo follow-up — SDK basilica.mesh_wait(timeout=60) (Phase 4 of #419)
basilica-backend #605 shipped the operator-side etcd-key gate and the trainer-image entrypoint shim that polls it. BYO-image users do not get the entrypoint shim — they need an SDK helper.
Etcd key contract (consumer-side)
The per-UD rendezvous etcd Pod (always present on a distributed UD; BASILICA_RDZV_ENDPOINT env carries <svc>.<ns>.svc.cluster.local:<port>, default port 2379) carries one of two keys per worker host:
| Key | Set by | Meaning |
| --- | --- | --- |
| /basilica/mesh-ready/<host-node-name> | operator, when MeshState::Ready | Mesh fabric converged. Caller can init_process_group() and expect direct WG transport. |
| /basilica/mesh-degraded/<host-node-name> | operator, when MeshState::Degraded (operator-side timeout) | Mesh fabric did NOT converge in time; operator fail-opened. Caller can proceed but transport falls back to hub-relay. |
The etcd Pod speaks the v2 KV API (--enable-v2=true; same path _etcd_barrier in basilica_distributed_trainer/basilica_bench.py already uses).
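As a rough illustration of the contract above, the two keys map onto v2 keys-API URLs as sketched below. The function name `mesh_key_urls` is hypothetical, and the degraded URL shape is assumed to mirror the ready one.

```python
from typing import Dict
from urllib.parse import quote


def mesh_key_urls(endpoint: str, host: str) -> Dict[str, str]:
    """Map each observable mesh state to its etcd v2 keys-API URL.

    `endpoint` is the value of BASILICA_RDZV_ENDPOINT (host:port) and
    `host` is the worker's node name. The function name is hypothetical.
    """
    base = f"http://{endpoint}/v2/keys/basilica"
    node = quote(host, safe="")  # node names are DNS-safe; quoting is defensive
    return {
        "ready": f"{base}/mesh-ready/{node}",
        "degraded": f"{base}/mesh-degraded/{node}",
    }


urls = mesh_key_urls("rdzv.example-ns.svc.cluster.local:2379", "node-a")
print(urls["ready"])
print(urls["degraded"])
```

A GET on either URL that returns HTTP 200 means the operator has written that key for this host.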
Expected SDK helper shape (Python)
```python
import basilica

# Block until the operator releases this rank's host node, or fail-open
# after `timeout` seconds. Call BEFORE torch.distributed.init_process_group().
basilica.mesh_wait(timeout=60)

# Optional: introspect what was observed.
status = basilica.mesh_status()  # "ready" | "degraded" | "timeout"
```
Behaviour to mirror (read crates/basilica-distributed-trainer/wait_for_mesh.sh on main for the canonical reference)
- Read NODE_NAME (or KUBERNETES_NODE_NAME) and BASILICA_RDZV_ENDPOINT from env.
- If either is absent: skip the wait (non-distributed pod) and return immediately.
- Poll the v2 KV API every 2s; success on HTTP 200 for either prefix.
- Always exit success (return None): the operator's DegradedMesh=True condition is the authoritative alert. SDK MUST NOT raise on timeout; that would break workloads on a slow rollout.
- Stderr-style logging is fine (mesh-wait: BEGIN ... / mesh-wait: OK after=<n>s / mesh-wait: DEGRADED ... / mesh-wait: TIMEOUT ...).
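The behaviour listed above could be sketched roughly as follows, using only the standard library. This is not the canonical implementation (wait_for_mesh.sh remains the reference). The injectable `probe`, `sleep`, `clock` and `env` parameters, the module-level `_last_status`, the 1s per-request timeout, and the log fields after each `mesh-wait:` prefix are all assumptions made for illustration and testability.

```python
import os
import sys
import time
import urllib.request
from typing import Optional

POLL_INTERVAL_S = 2.0  # poll cadence from the behaviour list
_last_status: Optional[str] = None  # hypothetical state backing mesh_status()


def _key_exists(url: str) -> bool:
    """True when the etcd v2 keys API answers HTTP 200 for `url`."""
    try:
        with urllib.request.urlopen(url, timeout=1.0) as resp:
            return resp.status == 200
    except Exception:  # 404, DNS failure, refused connection: never raise
        return False


def _log(msg: str) -> None:
    print(msg, file=sys.stderr, flush=True)


def mesh_status() -> Optional[str]:
    """Last observed outcome: "ready", "degraded", "timeout", or None if skipped."""
    return _last_status


def mesh_wait(timeout: float = 60.0, *, probe=_key_exists, sleep=time.sleep,
              clock=time.monotonic, env=None) -> None:
    """Block until a mesh key appears or `timeout` elapses. Always returns None."""
    global _last_status
    env = os.environ if env is None else env
    host = env.get("NODE_NAME") or env.get("KUBERNETES_NODE_NAME")
    endpoint = env.get("BASILICA_RDZV_ENDPOINT")
    if not host or not endpoint:
        _last_status = None  # non-distributed pod: skip the wait
        return

    base = f"http://{endpoint}/v2/keys/basilica"
    start = clock()
    deadline = start + timeout
    _log(f"mesh-wait: BEGIN host={host} timeout={timeout:g}s")
    while True:
        elapsed = clock() - start
        if probe(f"{base}/mesh-ready/{host}"):
            _last_status = "ready"
            _log(f"mesh-wait: OK after={elapsed:.0f}s")
            return
        if probe(f"{base}/mesh-degraded/{host}"):
            _last_status = "degraded"
            _log(f"mesh-wait: DEGRADED after={elapsed:.0f}s")
            return
        if clock() >= deadline:
            _last_status = "timeout"
            _log(f"mesh-wait: TIMEOUT after={elapsed:.0f}s")
            return
        # Sleep one interval, but never past the deadline.
        sleep(min(POLL_INTERVAL_S, deadline - clock()))
```

Passing `probe`, `sleep` and `clock` as keyword arguments lets a unit test drive the loop with a fake clock instead of waiting out the real window; production callers would just call `mesh_wait(timeout=60)`.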
Implementation pointer
python-etcd (already a trainer-image dep) exposes the v2 API; alternatively httpx / requests straight against http://${ETCD}/v2/keys/basilica/mesh-ready/${HOST} works. Either is fine — the helper is a sub-1s read of two known keys.
Test plan
- Unit: a requests-mock test asserting GET against both URL shapes; success when either returns 200; timeout when both 404 for the full window.
- Integration (manual): spawn a BYO-image UD that calls basilica.mesh_wait(), confirm it blocks until the operator writes /basilica/mesh-ready/<host> and proceeds to init_process_group().
Where this fits
This PR closes #419 Phase 4's "SDK surface for BYO-image flows" deliverable. The operator + trainer image path is shipped in basilica-backend PR https://github.com/one-covenant/basilica-backend/pull/605.