[Bug]: NIC monitor emits false FATAL for unprovisioned Ethernet/RoCE Aux ports on cloud shapes #1361

### Prerequisites

- [x] I searched existing issues
- [x] I can reproduce this issue

### Code of Conduct

- [x] I agree to follow NVSentinel's Code of Conduct

### Bug Description

## What happened

On cloud bare-metal GPU instances (e.g., OCI BM.GPU.H100.8), the frontend NIC adapter
is a dual-port card where only one port (Prime) is provisioned and active. The second
port (Aux) is intentionally disabled by the cloud provider: it has no VNIC attached,
no IP address, no routes, and reports `phys_state: Disabled`.

The NIC health monitor emits a **FATAL `REPLACE_VM`** event for this Aux port on the
first poll, every time, because it sees `state: DOWN` and has no way to tell that the
port is intentionally unprovisioned.

## Why existing safety nets don't catch it

1. **Default-route exclusion** correctly removes the active Prime port (which carries
   the host's default route) from monitoring by classifying it as Management. But this
   leaves the Aux port as a **singleton** in the Storage role group.

2. **Card homogeneity** requires ≥2 cards in a role group to infer expected-down
   patterns. With only one card in the Storage group, the check is skipped entirely.
   The `ExpectedDownCards` set is empty, so first-poll suppression never fires.

3. The monitor has no concept of an "administratively disabled / unprovisioned port."
   It treats all `state: DOWN` Ethernet/RoCE ports identically, regardless of whether
   `phys_state` is `Disabled` (intentional) or `Polling` (searching for link).
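
The interaction of the three points above can be modeled with a small sketch. This is not NVSentinel's code: the `Port` type, the function names, and the stand-in homogeneity rule ("expected-down only if every peer in the group is down") are assumptions made for illustration. Only the ≥2-card threshold and the default-route exclusion come from the description above.

```python
from dataclasses import dataclass

# From the issue text: the homogeneity check needs at least 2 cards in a role group.
MIN_CARDS_FOR_HOMOGENEITY = 2


@dataclass(frozen=True)
class Port:
    name: str
    role: str        # "management" or "storage" in this sketch
    state: str       # "ACTIVE" or "DOWN"
    phys_state: str  # e.g. "LinkUp", "Disabled", "Polling" (ignored below, as in point 3)


def expected_down_cards(group: list[Port]) -> set[str]:
    """Toy homogeneity check: infer expected-down ports only when peers exist."""
    if len(group) < MIN_CARDS_FOR_HOMOGENEITY:
        return set()  # singleton group: check skipped, nothing is expected-down
    down = {p.name for p in group if p.state == "DOWN"}
    # Hypothetical stand-in rule: expected-down only if every peer is down too.
    return down if len(down) == len(group) else set()


def first_poll_events(ports: list[Port]) -> list[tuple[str, str]]:
    """Return (port, recommended action) pairs for FATAL events on the first poll."""
    # Default-route exclusion: management ports are not monitored.
    monitored = [p for p in ports if p.role != "management"]
    groups: dict[str, list[Port]] = {}
    for p in monitored:
        groups.setdefault(p.role, []).append(p)
    events = []
    for group in groups.values():
        suppressed = expected_down_cards(group)
        for p in group:
            # phys_state is never consulted, so Disabled and Polling look the same.
            if p.state == "DOWN" and p.name not in suppressed:
                events.append((p.name, "REPLACE_VM"))
    return events


ports = [
    Port("prime", "management", "ACTIVE", "LinkUp"),
    Port("aux", "storage", "DOWN", "Disabled"),
]
print(first_poll_events(ports))  # [('aux', 'REPLACE_VM')]
```

With the Prime port excluded, the Aux port is alone in its group, the expected-down set stays empty, and the `DOWN` state goes straight to a FATAL event.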

## What I expected

The monitor should not emit FATAL events for ports that are intentionally disabled and
have no provisioning evidence (no admin-up, no carrier history, no routes, no
bond/bridge membership, no configured VFs). These ports should be excluded from
monitoring entirely.
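
One possible shape for that check, as a minimal sketch: the `PortEvidence` type, its field names, and `is_unprovisioned` are hypothetical and do not exist in NVSentinel. The evidence list mirrors the one in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortEvidence:
    phys_state: str          # e.g. "Disabled", "Polling", "LinkUp"
    admin_up: bool           # interface is administratively up
    carrier_seen: bool       # any carrier history on the link
    has_routes: bool         # any route references the interface
    in_bond_or_bridge: bool  # member of a bond or bridge
    configured_vfs: int      # number of configured virtual functions


def is_unprovisioned(p: PortEvidence) -> bool:
    """True when the port is Disabled and shows no sign of being provisioned."""
    has_evidence = (
        p.admin_up
        or p.carrier_seen
        or p.has_routes
        or p.in_bond_or_bridge
        or p.configured_vfs > 0
    )
    return p.phys_state == "Disabled" and not has_evidence


aux = PortEvidence("Disabled", False, False, False, False, 0)
searching = PortEvidence("Polling", True, False, False, False, 0)
print(is_unprovisioned(aux), is_unprovisioned(searching))  # True False
```

Requiring both conditions keeps a port that is `Disabled` but still has routes or bond membership under monitoring, so a genuinely failed port that was once in use would still be reported.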

## Impact

- False `REPLACE_VM` recommendation on healthy nodes
- Healthy nodes may be unnecessarily drained/cordoned
- The false positive is currently masked by a separate platform-connector entity-matching
  bug (tracked independently), so the node condition appears healthy for the wrong reasons.
  Fixing the connector bug alone would surface this false positive on all affected nodes.

### Component

Health Monitor

### Steps to Reproduce

1. Deploy NVSentinel on a cloud bare-metal GPU node where the frontend NIC is a
   dual-port adapter with only one port provisioned (e.g., Prime active, Aux disabled).
2. Verify on the host that the Aux port has:
   - `phys_state: Disabled`
   - No IP address
   - No routes
   - No cloud VNIC attachment
   - `operstate: down`
3. Check `nic-health-monitor` logs:
   - `Default route NIC will be excluded as management` for the Prime port
   - `NIC classification summary` shows the Aux port as the only entry in `storage`
   - No `Suppressing first-poll` or `Card homogeneity anomaly` lines for the Aux port
4. Observe FATAL health event emitted with `RecommendedAction: REPLACE_VM` for the
   unprovisioned Aux port.
5. Verify on a sibling node of the same shape that the Aux port is identically
   disabled, confirming this is by-design cloud topology, not a per-node fault.
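
Part of the host-side check in step 2 can be scripted. The sketch below reads the standard Linux sysfs and procfs files for link state and IPv4 routes. It assumes the RoCE port is exposed under `/sys/class/infiniband`, it does not cover IP addresses, IPv6 routes, or the cloud VNIC attachment, and the device names `eth1` and `mlx5_1` are placeholders rather than values from an affected node.

```python
from pathlib import Path


def read_file(path: Path) -> str:
    """Return stripped file contents, or "" if the file is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def port_facts(netdev: str, ibdev: str, port: int = 1, root: Path = Path("/")) -> dict[str, str]:
    """Collect link-state facts for one port from sysfs."""
    net = root / "sys/class/net" / netdev
    ib = root / "sys/class/infiniband" / ibdev / "ports" / str(port)
    return {
        "operstate": read_file(net / "operstate"),
        # These files read like "3: Disabled" or "1: DOWN"; keep only the label.
        "phys_state": read_file(ib / "phys_state").partition(": ")[2],
        "state": read_file(ib / "state").partition(": ")[2],
    }


def has_ipv4_route(netdev: str, root: Path = Path("/")) -> bool:
    """True if any row in /proc/net/route names this interface (IPv4 only)."""
    rows = read_file(root / "proc/net/route").splitlines()[1:]  # skip header row
    return any(row.split()[0] == netdev for row in rows if row.strip())


if __name__ == "__main__":
    # Placeholder device names; substitute the Aux port's netdev and RDMA device.
    print(port_facts("eth1", "mlx5_1"))
    print("has IPv4 route:", has_ipv4_route("eth1"))
```

The `root` parameter exists only so the functions can be pointed at a copied or mocked directory tree. On a real host the default `/` applies.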

### Environment

- NVSentinel version: v1.8.0


### Logs/Output

_No response_
