Prerequisites
Code of Conduct
Feature Summary
Enable preflight health checks to target only the subset of nodes required by a workload's parallelism configuration, rather than the full cluster.
Problem/Use Case
Preflight health checks today require full gang scheduling across the entire cluster, even when a workload's parallelism configuration requires only a small subset of nodes to communicate with each other. For example, a workload with EP=32 across 4 nodes (4x8xB300) only needs those 4 nodes to be validated. Holding up the full cluster gang adds unnecessary overhead and delays job startup.
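To make the sizing in the example concrete, here is a minimal Python sketch of the arithmetic. The function name `nodes_required` and the one-rank-per-GPU mapping are assumptions for illustration only. They are not part of the existing preflight code.

```python
import math


def nodes_required(parallel_degree: int, gpus_per_node: int) -> int:
    """Smallest number of nodes that can host `parallel_degree` ranks.

    Assumes one rank per GPU (an assumption made for this sketch).
    """
    if parallel_degree <= 0 or gpus_per_node <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(parallel_degree / gpus_per_node)


# EP=32 on nodes with 8 GPUs each needs 4 nodes, matching the 4x8 example.
print(nodes_required(32, 8))  # 4
```

Under these assumptions, only the resulting node count needs to take part in the preflight gang, regardless of how many nodes the cluster has.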
Proposed Solution
Run preflight health checks against only the subset of nodes relevant to a workload's parallelism configuration, using pod group gangs to coordinate scheduling across that subset. This requires changes at the scheduler level to support sub-cluster gang coordination.
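One way to picture the proposed behavior is sketched below in Python. The names `PreflightGang`, `gang_for_workload`, and `gang_ready` are hypothetical, and the sketch models only the readiness rule (the gang is satisfied once its own members are ready). It does not model any real scheduler API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class PreflightGang:
    """Hypothetical description of a sub-cluster preflight gang."""
    workload: str
    members: FrozenSet[str]

    @property
    def min_member(self) -> int:
        # Every node in the subset must participate in the check.
        return len(self.members)


def gang_for_workload(workload: str, assigned_nodes: Iterable[str]) -> PreflightGang:
    """Build a gang that covers only the nodes assigned to one workload."""
    members = frozenset(assigned_nodes)
    if not members:
        raise ValueError("a gang needs at least one node")
    return PreflightGang(workload, members)


def gang_ready(gang: PreflightGang, ready_nodes: Iterable[str]) -> bool:
    """True once all gang members are ready; other cluster nodes are ignored."""
    return gang.members <= set(ready_nodes)


gang = gang_for_workload("ep32-job", ["node-0", "node-1", "node-2", "node-3"])
print(gang.min_member)                                               # 4
print(gang_ready(gang, ["node-0", "node-1", "node-2", "node-3"]))    # True
print(gang_ready(gang, ["node-0", "node-1", "node-2", "node-9"]))    # False
```

In this sketch the readiness of nodes outside the gang has no effect on the result, which is the property the proposal asks the scheduler to support.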
Component
Preflight