
[Bug]: NIC monitor emits false FATAL for unprovisioned Ethernet/RoCE Aux ports on cloud shapes #1361

Description

@KaivalyaMDabhadkar

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Bug Description

What happened

On cloud bare-metal GPU instances (e.g., OCI BM.GPU.H100.8), the frontend NIC adapter
is a dual-port card where only one port (Prime) is provisioned and active. The second
port (Aux) is intentionally disabled by the cloud provider — it has no VNIC attached,
no IP address, no routes, and phys_state: Disabled.

The NIC health monitor emits a FATAL REPLACE_VM event for this Aux port on every
first poll, because it sees state: DOWN and has no way to determine that the port is
intentionally unprovisioned.

Why existing safety nets don't catch it

  1. Default-route exclusion correctly removes the active Prime port (which carries
    the host's default route) from monitoring by classifying it as Management. But this
    leaves the Aux port as a singleton in the Storage role group.

  2. Card homogeneity requires ≥2 cards in a role group to infer expected-down
    patterns. With only one card in the Storage group, the check is skipped entirely.
    The ExpectedDownCards set is empty, so first-poll suppression never fires.

  3. The monitor has no concept of "administratively disabled / unprovisioned port."
    It treats all state: DOWN Ethernet/RoCE ports identically, regardless of whether
    phys_state is Disabled (intentional) or Polling (searching for link).
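
The interaction of the three points above can be sketched with a toy model. This is not NVSentinel's code: the `Port` type, the function names, and the rule used when two or more cards are present are assumptions made for illustration. Only the default-route exclusion, the two-card threshold, and the fact that phys_state is ignored come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str                # hypothetical port label
    card: str                # physical adapter the port belongs to
    state: str               # "ACTIVE" or "DOWN"
    phys_state: str          # "LinkUp", "Disabled", "Polling", ...
    has_default_route: bool


def classify(ports):
    """Default-route exclusion: the port carrying the default route is Management."""
    groups = {"management": [], "storage": []}
    for p in ports:
        role = "management" if p.has_default_route else "storage"
        groups[role].append(p)
    return groups


def expected_down_cards(group):
    """Card homogeneity: skipped entirely when the group has fewer than two cards."""
    cards = {p.card for p in group}
    if len(cards) < 2:
        return set()
    # Placeholder rule (assumption, not the real inference): cards are
    # expected-down only if every card in the group is fully down.
    down = {c for c in cards
            if all(p.state == "DOWN" for p in group if p.card == c)}
    return down if down == cards else set()


def first_poll_events(ports):
    """Emit a FATAL event for every monitored DOWN port that is not suppressed."""
    storage = classify(ports)["storage"]
    suppressed = expected_down_cards(storage)
    events = []
    for p in storage:
        # phys_state is never consulted here, mirroring point 3.
        if p.state == "DOWN" and p.card not in suppressed:
            events.append((p.name, "FATAL", "REPLACE_VM"))
    return events


ports = [
    Port("prime", "card0", "ACTIVE", "LinkUp", has_default_route=True),
    Port("aux", "card0", "DOWN", "Disabled", has_default_route=False),
]
events = first_poll_events(ports)
print(events)  # [('aux', 'FATAL', 'REPLACE_VM')]
```

In this model the Aux port is the only member of the storage group, the suppression set is empty, and the FATAL event is emitted even though phys_state is Disabled.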

What I expected

The monitor should not emit FATAL events for ports that are intentionally disabled and
have no provisioning evidence (no admin-up, no carrier history, no routes, no
bond/bridge membership, no configured VFs). These ports should be excluded from
monitoring entirely.
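
A minimal sketch of what such a skip rule could look like is shown here. The `PortEvidence` type, its field names, and the choice to treat a non-zero carrier-change count as "carrier history" are assumptions for illustration, not an existing NVSentinel API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortEvidence:
    """Hypothetical snapshot of the provisioning evidence for one port."""
    phys_state: str          # e.g. "Disabled", "Polling", "LinkUp"
    admin_up: bool           # interface is administratively up
    carrier_changes: int     # assumed proxy for "carrier history"
    route_count: int         # routes that reference the interface
    in_bond_or_bridge: bool  # member of a bond or bridge
    configured_vfs: int      # number of configured SR-IOV VFs


def is_unprovisioned(ev):
    """True only for a Disabled port with no provisioning evidence at all."""
    if ev.phys_state != "Disabled":
        return False  # e.g. Polling means the port is searching for a link
    has_evidence = (
        ev.admin_up
        or ev.carrier_changes > 0
        or ev.route_count > 0
        or ev.in_bond_or_bridge
        or ev.configured_vfs > 0
    )
    return not has_evidence


aux = PortEvidence("Disabled", False, 0, 0, False, 0)
polling = PortEvidence("Polling", True, 0, 0, False, 0)
disabled_but_routed = PortEvidence("Disabled", False, 0, 2, False, 0)

print(is_unprovisioned(aux))                  # True
print(is_unprovisioned(polling))              # False
print(is_unprovisioned(disabled_but_routed))  # False
```

Requiring every signal to be absent keeps the rule conservative: a Disabled port that still has routes, bond/bridge membership, or VFs would continue to be monitored.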

Impact

  • False REPLACE_VM recommendation on healthy nodes
  • Healthy nodes may be unnecessarily drained/cordoned
  • The false positive is currently masked by a separate platform-connector entity-matching
    bug (tracked independently), so the node condition appears healthy for the wrong reasons.
    Fixing the connector bug alone would surface this false positive on all affected nodes.

Component

Health Monitor

Steps to Reproduce

  1. Deploy NVSentinel on a cloud bare-metal GPU node where the frontend NIC is a
    dual-port adapter with only one port provisioned (e.g., Prime active, Aux disabled).
  2. Verify on the host that the Aux port has:
    • phys_state: Disabled
    • No IP address
    • No routes
    • No cloud VNIC attachment
    • operstate: down
  3. Check nic-health-monitor logs:
    • "Default route NIC will be excluded as management" is logged for the Prime port
    • NIC classification summary shows the Aux port as the only entry in storage
    • No "Suppressing first-poll" or "Card homogeneity anomaly" lines appear for the Aux port
  4. Observe FATAL health event emitted with RecommendedAction: REPLACE_VM for the
    unprovisioned Aux port.
  5. Verify on a sibling node of the same shape — the Aux port is identically disabled,
    confirming this is by-design cloud topology, not a per-node fault.
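
Part of the host check in step 2 can be scripted against standard Linux sysfs and procfs files. The sketch below is only a starting point: the interface name `eth1` and RDMA device name `mlx5_1` are placeholders, not values from an affected node, and it covers only phys_state, operstate, and IPv4 routes. IP addresses and cloud VNIC attachment still need to be checked separately.

```python
from pathlib import Path


def read_text(path):
    """Return stripped file contents, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def route_ifaces(proc_root):
    """Interfaces that appear in the kernel IPv4 routing table."""
    text = read_text(Path(proc_root) / "net" / "route") or ""
    rows = text.splitlines()[1:]  # first line is the column header
    return {row.split()[0] for row in rows if row.split()}


def aux_port_report(netdev, ib_dev, ib_port=1, sys_root="/sys", proc_root="/proc"):
    """Collect phys_state, operstate, and IPv4 route presence for one port."""
    sys_root = Path(sys_root)
    phys = read_text(
        sys_root / "class/infiniband" / ib_dev / "ports" / str(ib_port) / "phys_state"
    )
    return {
        # sysfs reports values such as "3: Disabled"; keep only the name.
        "phys_state": phys.split(":", 1)[-1].strip() if phys else None,
        "operstate": read_text(sys_root / "class/net" / netdev / "operstate"),
        "has_ipv4_route": netdev in route_ifaces(proc_root),
    }


if __name__ == "__main__":
    # Placeholder names; substitute the Aux port's netdev and RDMA device.
    print(aux_port_report("eth1", "mlx5_1"))
```

For an unprovisioned Aux port as described in step 2, the report would be expected to show phys_state Disabled, operstate down, and no IPv4 route. Running the same script on a sibling node supports the comparison in step 5.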

Environment

  • NVSentinel version: v1.8.0

Logs/Output

No response
