SDK: basilica.mesh_wait(timeout=60) for BYO-image distributed UDs (#419 Phase 4 follow-up) #475

@epappas

Cross-repo follow-up — SDK basilica.mesh_wait(timeout=60) (Phase 4 of #419)

basilica-backend #605 shipped the operator-side etcd-key gate and the trainer-image entrypoint shim that polls it. BYO-image users do not get the entrypoint shim — they need an SDK helper.

Etcd key contract (consumer-side)

The per-UD rendezvous etcd Pod is always present on a distributed UD. Its address is exposed in the BASILICA_RDZV_ENDPOINT env var as <svc>.<ns>.svc.cluster.local:<port> (default port 2379). For each worker host, the Pod holds one of two keys:

| Key | Set by | Meaning |
|---|---|---|
| /basilica/mesh-ready/<host-node-name> | operator, when MeshState::Ready | Mesh fabric converged. Caller can init_process_group() and expect direct WG transport. |
| /basilica/mesh-degraded/<host-node-name> | operator, when MeshState::Degraded (operator-side timeout) | Mesh fabric did NOT converge in time; operator fail-opened. Caller can proceed but transport falls back to hub-relay. |

The etcd Pod speaks the v2 KV API (--enable-v2=true; same path _etcd_barrier in basilica_distributed_trainer/basilica_bench.py already uses).
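
To make the key contract concrete, here is a minimal sketch of how the two per-host URLs might be derived from BASILICA_RDZV_ENDPOINT and the node name. The function name mesh_key_urls is hypothetical, and the plain http:// scheme is assumed from the URL shape given under "Implementation pointer"; it is not taken from the backend code.

```python
from urllib.parse import quote


def mesh_key_urls(endpoint: str, host: str) -> dict[str, str]:
    """Build the v2 keys-API URLs for the two per-host mesh keys.

    `endpoint` is the value of BASILICA_RDZV_ENDPOINT (host:port).
    `host` is the Kubernetes node name of the worker.
    Hypothetical helper: the name and return shape are illustrative only.
    """
    # Assumption: the endpoint carries no scheme and etcd is served over plain HTTP.
    base = endpoint if "://" in endpoint else f"http://{endpoint}"
    base = base.rstrip("/")
    node = quote(host, safe="")  # escape any characters unsafe in a URL path segment
    return {
        "ready": f"{base}/v2/keys/basilica/mesh-ready/{node}",
        "degraded": f"{base}/v2/keys/basilica/mesh-degraded/{node}",
    }
```

A GET that returns HTTP 200 on either URL means the operator has released that host; a 404 means the key has not been written yet.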

Expected SDK helper shape (Python)

import basilica

# Block until the operator releases this rank's host node, or fail-open
# after `timeout` seconds. Call BEFORE torch.distributed.init_process_group().
basilica.mesh_wait(timeout=60)

# Optional: introspect what was observed.
status = basilica.mesh_status()   # "ready" | "degraded" | "timeout"

Behaviour to mirror (read crates/basilica-distributed-trainer/wait_for_mesh.sh on main for the canonical reference)

  • Read NODE_NAME (or KUBERNETES_NODE_NAME) and BASILICA_RDZV_ENDPOINT from env.
  • If either is absent: skip the wait (non-distributed pod) and return immediately.
  • Poll the v2 KV API every 2s; success on HTTP 200 for either prefix.
  • Always exit success (return None) — the operator's DegradedMesh=True condition is the authoritative alert. SDK MUST NOT raise on timeout; that would break workloads on a slow rollout.
  • Stderr-style logging is fine (mesh-wait: BEGIN ... / mesh-wait: OK after=<n>s / mesh-wait: DEGRADED ... / mesh-wait: TIMEOUT ...).
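
The behaviour listed above could be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions, not the canonical logic in wait_for_mesh.sh. The probe, env, sleep and clock parameters are hypothetical test seams. The log fields after each mesh-wait: prefix are illustrative. Returning None from mesh_status() when the wait was skipped is also an assumption, since the issue only names "ready", "degraded" and "timeout".

```python
import os
import sys
import time
import urllib.error
import urllib.request

_last_status = None  # hypothetical module state backing mesh_status()


def _http_status(url, timeout=1.0):
    """GET `url`; return the HTTP status code, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 404 while the key has not been written yet
    except Exception:
        return None  # unreachable endpoint etc.; the helper must never raise


def mesh_wait(timeout=60, interval=2.0, probe=_http_status, env=None,
              sleep=time.sleep, clock=time.monotonic):
    """Block until a mesh key appears or `timeout` elapses. Always returns None."""
    global _last_status
    env = os.environ if env is None else env
    host = env.get("NODE_NAME") or env.get("KUBERNETES_NODE_NAME")
    endpoint = env.get("BASILICA_RDZV_ENDPOINT")
    if not host or not endpoint:
        _last_status = None  # non-distributed pod: skip the wait entirely
        return None

    base = f"http://{endpoint}/v2/keys/basilica"
    print(f"mesh-wait: BEGIN host={host} timeout={timeout}s", file=sys.stderr)
    start = clock()
    while True:
        for state in ("ready", "degraded"):
            if probe(f"{base}/mesh-{state}/{host}") == 200:
                label = "OK" if state == "ready" else "DEGRADED"
                elapsed = int(clock() - start)
                print(f"mesh-wait: {label} after={elapsed}s", file=sys.stderr)
                _last_status = state
                return None
        if clock() - start >= timeout:
            break
        sleep(interval)

    # Fail open: the operator's DegradedMesh condition is the authoritative alert.
    print(f"mesh-wait: TIMEOUT after={timeout}s", file=sys.stderr)
    _last_status = "timeout"
    return None


def mesh_status():
    """Return what the last mesh_wait() observed."""
    return _last_status
```

Injecting the probe and the clock keeps the polling loop testable without a live etcd Pod or real sleeps.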

Implementation pointer

python-etcd (already a trainer-image dep) exposes the v2 API; alternatively httpx / requests straight against http://${ETCD}/v2/keys/basilica/mesh-ready/${HOST} works. Either is fine — the helper is a sub-1s read of two known keys.

Test plan

  • Unit: a requests-mock test asserting GET against both URL shapes; success when either returns 200; timeout when both 404 for the full window.
  • Integration (manual): spawn a BYO-image UD that calls basilica.mesh_wait(), confirm it blocks until the operator writes /basilica/mesh-ready/<host> and proceeds to init_process_group().
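
As one possible stand-in for requests-mock in the unit test, the sketch below uses a small in-process fake of the v2 keys endpoint built on the standard library. FakeEtcdV2, start_fake_etcd and get_status are hypothetical names, and the response bodies are simplified placeholders rather than the full etcd v2 payload; only the 200 / 404 status codes matter for the helper.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FakeEtcdV2(BaseHTTPRequestHandler):
    """Answers GET /v2/keys/<key> with 200 if <key> is in `keys`, else 404."""

    keys = set()  # e.g. {"/basilica/mesh-ready/n1"}

    def do_GET(self):
        key = self.path.removeprefix("/v2/keys")
        found = key in self.keys
        # Simplified bodies; real etcd returns a richer JSON document.
        body = json.dumps({"key": key} if found else {"errorCode": 100}).encode()
        self.send_response(200 if found else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_fake_etcd():
    """Start the fake on an ephemeral localhost port; caller should shutdown()."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), FakeEtcdV2)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def get_status(url):
    """GET `url` without any proxy and return the HTTP status code."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
```

A test can point BASILICA_RDZV_ENDPOINT at 127.0.0.1:<port>, add a key to FakeEtcdV2.keys partway through, and check that the helper returns once either key answers 200.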

Where this fits

This work closes #419 Phase 4's "SDK surface for BYO-image flows" deliverable. The operator and trainer-image path has shipped in basilica-backend PR https://github.com/one-covenant/basilica-backend/pull/605.
