Bug Description
What happened
On cloud bare-metal GPU instances (e.g., OCI BM.GPU.H100.8), the frontend NIC adapter
is a dual-port card where only one port (Prime) is provisioned and active. The second
port (Aux) is intentionally disabled by the cloud provider — it has no VNIC attached,
no IP address, no routes, and phys_state: Disabled.
The NIC health monitor emits a FATAL REPLACE_VM event for this Aux port on every
first poll, because it sees state: DOWN and has no way to determine that the port is
intentionally unprovisioned.
Why existing safety nets don't catch it
- Default-route exclusion correctly removes the active Prime port (which carries
the host's default route) from monitoring by classifying it as Management. But this
leaves the Aux port as a singleton in the Storage role group.
- Card homogeneity requires ≥2 cards in a role group to infer expected-down
patterns. With only one card in the Storage group, the check is skipped entirely.
The ExpectedDownCards set is empty, so first-poll suppression never fires.
- The monitor has no concept of "administratively disabled / unprovisioned port."
It treats all state: DOWN Ethernet/RoCE ports identically regardless of whether
phys_state is Disabled (intentional) vs Polling (searching for link).
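The way these three gaps combine can be modeled in a few lines. This is an illustrative Python sketch, not NVSentinel's implementation: the `Port` type, the `role_of`, `expected_down_cards`, and `first_poll_events` functions, the card name `nic0`, and the event tuple shape are all hypothetical. The rule used for groups with two or more cards is also an assumption made only so the model is complete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    card: str                       # hypothetical card identifier
    name: str                       # "prime" or "aux"
    state: str                      # "ACTIVE" or "DOWN"
    phys_state: str                 # "LinkUp", "Disabled", "Polling", ...
    has_default_route: bool = False


def role_of(port):
    # Default-route exclusion: the port carrying the default route is
    # classified as management and is not monitored.
    return "management" if port.has_default_route else "storage"


def expected_down_cards(group):
    down_by_card = {}
    for p in group:
        names = down_by_card.setdefault(p.card, set())
        if p.state == "DOWN":
            names.add(p.name)
    # Card homogeneity is skipped when the group has fewer than two cards.
    if len(down_by_card) < 2:
        return set()
    # ASSUMPTION (not from the issue): cards are expected-down when every
    # card in the group shows the same non-empty set of down ports.
    patterns = list(down_by_card.values())
    if patterns[0] and all(pat == patterns[0] for pat in patterns):
        return set(down_by_card)
    return set()


def first_poll_events(ports):
    storage = [p for p in ports if role_of(p) == "storage"]
    expected = expected_down_cards(storage)
    # phys_state is never consulted here, which mirrors the third gap.
    return [
        (p.card, p.name, "FATAL", "REPLACE_VM")
        for p in storage
        if p.state == "DOWN" and p.card not in expected
    ]


frontend = [
    Port("nic0", "prime", "ACTIVE", "LinkUp", has_default_route=True),
    Port("nic0", "aux", "DOWN", "Disabled"),
]
events = first_poll_events(frontend)
```

In this model the Prime port is excluded, the Aux port is left alone in its group, the expected-down set stays empty, and the Aux port produces a FATAL event even though its phys_state is Disabled.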
What I expected
The monitor should not emit FATAL events for ports that are intentionally disabled and
have no provisioning evidence (no admin-up, no carrier history, no routes, no
bond/bridge membership, no configured VFs). These ports should be skipped from
monitoring entirely.
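One possible shape for such a check is sketched below in Python. It is a sketch under stated assumptions rather than a proposed patch: the `PortEvidence` type, its field names, and `is_unprovisioned` are hypothetical, and how each piece of evidence would actually be collected is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortEvidence:
    # All field names are hypothetical; they mirror the evidence listed above.
    phys_state: str           # e.g. "Disabled", "Polling", "LinkUp"
    admin_up: bool            # interface administratively up
    carrier_ever_up: bool     # any carrier history
    has_routes: bool          # any route via this port
    in_bond_or_bridge: bool   # member of a bond or bridge
    configured_vfs: int       # number of configured VFs


def is_unprovisioned(ev):
    """Return True only for a Disabled port with no provisioning evidence."""
    if ev.phys_state != "Disabled":
        # A Polling port is searching for link, so it stays monitored.
        return False
    has_evidence = (
        ev.admin_up
        or ev.carrier_ever_up
        or ev.has_routes
        or ev.in_bond_or_bridge
        or ev.configured_vfs > 0
    )
    return not has_evidence


aux = PortEvidence("Disabled", False, False, False, False, 0)
polling = PortEvidence("Polling", False, False, False, False, 0)
bonded = PortEvidence("Disabled", False, False, False, True, 0)

skip_aux = is_unprovisioned(aux)
skip_polling = is_unprovisioned(polling)
skip_bonded = is_unprovisioned(bonded)
```

Requiring both conditions keeps the skip narrow: a Disabled port that was ever provisioned, or a port that is merely Polling, would still be monitored and could still raise an event.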
Impact
- False REPLACE_VM recommendation on healthy nodes
- Healthy nodes may be unnecessarily drained/cordoned
- The false positive is currently masked by a separate platform-connector entity-matching
bug (tracked independently), so the node condition appears healthy for the wrong reasons.
Fixing the connector bug alone would surface this false positive on all affected nodes.
Component
Health Monitor
Steps to Reproduce
- Deploy NVSentinel on a cloud bare-metal GPU node where the frontend NIC is a
dual-port adapter with only one port provisioned (e.g., Prime active, Aux disabled).
- Verify on the host that the Aux port has:
  - phys_state: Disabled
  - No IP address
  - No routes
  - No cloud VNIC attachment
  - operstate: down
- Check nic-health-monitor logs:
  - "Default route NIC will be excluded as management" for the Prime port
  - "NIC classification summary" shows the Aux port as the only entry in storage
  - No "Suppressing first-poll" or "Card homogeneity anomaly" lines for the Aux port
- Observe FATAL health event emitted with RecommendedAction: REPLACE_VM for the
unprovisioned Aux port.
- Verify on a sibling node of the same shape — the Aux port is identically disabled,
confirming this is by-design cloud topology, not a per-node fault.
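The host-side part of the verification can be scripted. The Python sketch below reads the port state from sysfs, assuming the Aux port is exposed through the Linux RDMA subsystem under /sys/class/infiniband/<device>/ports/<port>/ and has a matching network interface under /sys/class/net/. The helper names are hypothetical, and the device and interface names must be supplied for the node being checked. IP addresses and routes can be checked separately with `ip addr show dev <iface>` and `ip route show dev <iface>`; the cloud VNIC attachment has to be checked with the provider's own tooling.

```python
from pathlib import Path


def read_value(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def port_snapshot(ib_dev, port, iface, sysfs_root="/sys"):
    """Collect state, phys_state, and operstate for one port."""
    root = Path(sysfs_root)
    port_dir = root / "class" / "infiniband" / ib_dev / "ports" / str(port)
    return {
        "state": read_value(port_dir / "state"),
        "phys_state": read_value(port_dir / "phys_state"),
        "operstate": read_value(root / "class" / "net" / iface / "operstate"),
    }


def looks_disabled(snapshot):
    # sysfs reports phys_state as "<number>: <name>", e.g. "3: Disabled".
    phys = snapshot["phys_state"] or ""
    return phys.endswith("Disabled") and snapshot["operstate"] == "down"
```

Running `port_snapshot` with the same arguments on a sibling node of the same shape and comparing the two dictionaries gives a quick check that the disabled Aux port is a property of the shape rather than of one node.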
Environment
- NVSentinel version: v1.8.0
Logs/Output
No response