[Feature]: Parallelism-aware preflight checks #1354

Description

@kwombach12

Prerequisites

  • I searched existing issues

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Feature Summary

Enable preflight health checks to target only the subset of nodes required by a workload's parallelism configuration, rather than the full cluster.

Problem/Use Case

Preflight health checks today require full gang scheduling across the entire cluster, even when a workload's parallelism configuration only requires a small subset of nodes to communicate with each other. For example, a workload with EP=32 across 4 nodes (4x8xB300) only needs those 4 nodes to be validated. Holding up the full cluster gang adds unnecessary overhead and delays job startup.
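
To make the sizing in that example concrete, here is a minimal sketch of the arithmetic. It assumes one rank per GPU and 8 GPUs per node (my reading of the 4x8xB300 layout); the function name `nodes_required` is hypothetical and not part of NVSentinel.

```python
def nodes_required(parallel_degree: int, gpus_per_node: int) -> int:
    """Smallest number of nodes that can host `parallel_degree` ranks.

    Assumption (not from NVSentinel): one rank per GPU, identical nodes.
    """
    if parallel_degree <= 0 or gpus_per_node <= 0:
        raise ValueError("parallel_degree and gpus_per_node must be positive")
    # Integer ceiling division, so a partially filled node still counts.
    return (parallel_degree + gpus_per_node - 1) // gpus_per_node


# EP=32 on 8-GPU nodes needs 4 nodes, regardless of total cluster size.
print(nodes_required(32, 8))
```

Under these assumptions the preflight gang for this workload would need 4 members, not one per node in the cluster.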

Proposed Solution

Run preflight health checks against only the subset of nodes relevant to a workload's parallelism configuration, using pod group gangs to coordinate scheduling across that subset. This requires changes at the scheduler level to support sub-cluster gang coordination.
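
As a rough illustration of what sub-cluster gang coordination could look like, the sketch below plans a preflight gang over only the nodes a workload needs. Everything here is hypothetical: `PreflightGang`, `plan_preflight_gang`, and the `min_member` field are illustrative names, not NVSentinel or scheduler APIs, and the node selection (first N candidates) stands in for whatever placement the scheduler would actually decide.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PreflightGang:
    """Hypothetical description of a gang scoped to a node subset."""
    name: str
    nodes: Tuple[str, ...]
    min_member: int  # all members must be schedulable before checks start


def plan_preflight_gang(
    workload: str, candidate_nodes: Sequence[str], needed: int
) -> PreflightGang:
    """Pick `needed` nodes from the candidates and size the gang to match."""
    if needed <= 0:
        raise ValueError("needed must be positive")
    if len(candidate_nodes) < needed:
        raise ValueError(
            f"need {needed} nodes, only {len(candidate_nodes)} candidates"
        )
    # Placeholder policy: take the first `needed` candidates in given order.
    chosen = tuple(candidate_nodes[:needed])
    return PreflightGang(
        name=f"{workload}-preflight", nodes=chosen, min_member=needed
    )


cluster = [f"node-{i}" for i in range(16)]
gang = plan_preflight_gang("moe-train", cluster, needed=4)
print(gang.nodes, gang.min_member)
```

The point of the sketch is that the gang size follows the workload's requirement (4 here) instead of the cluster size (16), so the remaining nodes are not held while the checks run.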

Component

Preflight
