[Bug]: NIC monitor emits non-fatal events for expected-down ports after reboot #1379

Description

@KaivalyaMDabhadkar

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Bug Description

What happened

After a node reboot, the NIC health monitor's first-poll suppression logic correctly
identifies expected-down cards via card homogeneity and downgrades them from FATAL to
non-fatal. However, the downgraded event is still emitted as an external HealthEvent
with is_fatal=false, is_healthy=false, recommended_action=NONE.

This creates noisy Kubernetes node events (InfiniBandStateCheckIsNotHealthy) for ports
that are intentionally disabled, which pollutes NIC health correlation workflows and can
be mistaken for actual NIC degradation.

Root cause

The issue is in portTransitionEvents() in
pkg/checks/state/ibstate.go:

func (c *InfiniBandStateCheck) portTransitionEvents(
    st *ibPollState, firstPoll, baselineRun bool, expectedCards map[string]struct{},
) []*pb.HealthEvent {
    var events []*pb.HealthEvent

    for _, port := range st.ibPorts {
        key := portKey(port.Device, port.Port)
        prev, hasPrev := c.previousPorts[key]

        evt := c.evaluatePortTransition(st.currentPorts[key], prev, hasPrev, baselineRun)
        if evt == nil {
            continue
        }

        if firstPoll && !evt.IsHealthy {
            c.applyFirstPollSuppression(evt, port, st.portCard[key], expectedCards)
        }

        events = append(events, evt)  // <-- event is always appended, even after suppression
    }

    return events
}

applyFirstPollSuppression() mutates the event in-place:

func (c *InfiniBandStateCheck) applyFirstPollSuppression(
    evt *pb.HealthEvent, port discovery.IBPort, card string, expected map[string]struct{},
) {
    if _, ok := expected[card]; !ok {
        return
    }

    slog.Info("Suppressing first-poll fatal for expected-down card",
        "device", port.Device, "port", port.Port, "card", card)

    evt.IsFatal = false
    evt.RecommendedAction = pb.RecommendedAction_NONE
}

The suppression only downgrades IsFatal and RecommendedAction. It does not prevent
the event from being appended to the result slice. The event is still published via
monitor.runChecks() → pub.Publish(), creating a Kubernetes node event through the
platform connector's non-fatal event path (which writes !IsHealthy && !IsFatal events
as Kubernetes Events).

The same pattern exists in EthernetStateCheck.unhealthyEvent() in
pkg/checks/state/ethstate.go,
where expected-down first-poll suppression also only downgrades but does not drop the event.
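
The InfiniBand behaviour described above can be modelled in isolation. The following is a minimal sketch, not the NVSentinel code: HealthEvent, downgrade, collect and isNonFatalUnhealthy are hypothetical stand-ins, and the string-valued RecommendedAction replaces the real pb enum. It only illustrates that a downgraded event still satisfies the !IsHealthy && !IsFatal condition.

```go
package main

import "fmt"

// HealthEvent is a hypothetical stand-in for pb.HealthEvent with only the
// fields this issue discusses.
type HealthEvent struct {
	Entity            string
	IsHealthy         bool
	IsFatal           bool
	RecommendedAction string
}

// downgrade mirrors what applyFirstPollSuppression does today: it mutates the
// event in place when the card is expected-down and returns nothing.
func downgrade(evt *HealthEvent, card string, expected map[string]struct{}) {
	if _, ok := expected[card]; !ok {
		return
	}
	evt.IsFatal = false
	evt.RecommendedAction = "NONE"
}

// collect mirrors the current loop shape: every event is appended, whether or
// not it was downgraded.
func collect(cards []string, expected map[string]struct{}) []*HealthEvent {
	var events []*HealthEvent
	for _, card := range cards {
		evt := &HealthEvent{Entity: card, IsFatal: true, RecommendedAction: "REPLACE_VM"}
		downgrade(evt, card, expected)
		events = append(events, evt)
	}
	return events
}

// isNonFatalUnhealthy is the condition the issue says the platform connector
// uses to write a Kubernetes Event.
func isNonFatalUnhealthy(evt *HealthEvent) bool {
	return !evt.IsHealthy && !evt.IsFatal
}

func main() {
	expected := map[string]struct{}{"card0": {}}
	for _, evt := range collect([]string{"card0", "card1"}, expected) {
		fmt.Println(evt.Entity, "reaches non-fatal path:", isNonFatalUnhealthy(evt))
	}
}
```

In this model the expected-down card0 is the one that ends up on the non-fatal path, which matches the noisy node event reported here.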

Expected behavior

When applyFirstPollSuppression() matches an expected-down card, the event should be
dropped entirely — not emitted as a non-fatal unhealthy event. The suppressed port should
produce no external HealthEvent, no Kubernetes node event, and no datastore record.

A structured debug log is sufficient for observability.

Actual behavior

The event is downgraded to non-fatal but still emitted:

Health event sent  check=InfiniBandStateCheck  is_fatal=false  is_healthy=false  recommended_action=NONE
  entities="NIC=mlx5_1, NICPort=1"  message="Port mlx5_1 port 1: state DOWN, phys_state Disabled"

Platform connector writes this as a Kubernetes node event:

InfiniBandStateCheck  InfiniBandStateCheckIsNotHealthy  nic-health-monitor
  NIC:mlx5_1 NICPort:1 Port mlx5_1 port 1: state DOWN, phys_state Disabled Recommended Action=NONE;

Impact

  • Noisy Kubernetes node events for expected topology
  • Pollutes MongoDB event history if datastore is enabled

Suggested Fix

In portTransitionEvents(), check whether the port's card is in the expectedCards set
before appending. If it is, drop the event instead of downgrading and appending it
(shown here with a helper named isExpectedDown()):

if firstPoll && !evt.IsHealthy {
    if c.isExpectedDown(st.portCard[key], expectedCards) {
        slog.Info("Dropping event for expected-down card",
            "device", port.Device, "port", port.Port)
        continue  // do not append
    }
}

events = append(events, evt)

Or have applyFirstPollSuppression() return a boolean indicating whether the event
should be dropped:

if firstPoll && !evt.IsHealthy {
    if c.applyFirstPollSuppression(evt, port, st.portCard[key], expectedCards) {
        continue
    }
}

Apply the same fix to the Ethernet path (EthernetStateCheck.unhealthyEvent() in
pkg/checks/state/ethstate.go).
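
The boolean-returning variant could look roughly like the sketch below. It is self-contained and uses hypothetical stand-ins (HealthEvent, downPort, suppressFirstPoll, firstPollEvents) rather than the real pb and discovery types, so treat it as an illustration of the control flow only. It also assumes the suppression no longer needs to mutate the event once the event is dropped.

```go
package main

import (
	"fmt"
	"log/slog"
)

// HealthEvent is a hypothetical stand-in for pb.HealthEvent.
type HealthEvent struct {
	Device            string
	Port              int
	IsHealthy         bool
	IsFatal           bool
	RecommendedAction string
}

// downPort is a hypothetical, flattened view of a DOWN port and its card.
type downPort struct {
	Device string
	Port   int
	Card   string
}

// suppressFirstPoll reports whether the event for this port should be dropped.
// Unlike the current applyFirstPollSuppression, it does not mutate an event.
func suppressFirstPoll(p downPort, expected map[string]struct{}) bool {
	if _, ok := expected[p.Card]; !ok {
		return false
	}
	slog.Info("Suppressing first-poll fatal for expected-down card",
		"device", p.Device, "port", p.Port, "card", p.Card)
	return true
}

// firstPollEvents builds one fatal event per DOWN port and skips the ports
// whose card is expected-down.
func firstPollEvents(ports []downPort, expected map[string]struct{}) []*HealthEvent {
	var events []*HealthEvent
	for _, p := range ports {
		if suppressFirstPoll(p, expected) {
			continue // dropped: never reaches the publisher
		}
		events = append(events, &HealthEvent{
			Device: p.Device, Port: p.Port,
			IsFatal: true, RecommendedAction: "REPLACE_VM",
		})
	}
	return events
}

func main() {
	expected := map[string]struct{}{"card0": {}}
	ports := []downPort{
		{Device: "mlx5_1", Port: 1, Card: "card0"}, // expected-down
		{Device: "mlx5_4", Port: 1, Card: "card2"}, // unexpected
	}
	for _, evt := range firstPollEvents(ports, expected) {
		fmt.Println(evt.Device, evt.Port, evt.IsFatal, evt.RecommendedAction)
	}
}
```

Returning a boolean keeps the drop decision and the log line in one place, so the caller cannot append a suppressed event by accident.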

Acceptance Criteria

  • Rebooting a node with expected-down cards logs Suppressing first-poll fatal for expected-down card
  • No Health event sent with is_healthy=false is emitted for those suppressed ports
  • No Kubernetes node event is created for suppressed expected-down ports
  • Actually unexpected DOWN/Disabled ports (not in the expected-down set) still emit
    fatal HealthEvents with RecommendedAction=REPLACE_VM
  • Healthy baseline events after reboot still work for genuinely healthy ports and counters
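
The event-related criteria above can be written down as a small table of cases. This is a sketch against a hypothetical model (firstPollEvent, acceptanceCase, runAcceptance are illustrative names, and the card names are made up), not a test of the real check; a real test would drive InfiniBandStateCheck with the project's own fixtures.

```go
package main

import "fmt"

// HealthEvent is a hypothetical stand-in for pb.HealthEvent.
type HealthEvent struct {
	IsHealthy         bool
	IsFatal           bool
	RecommendedAction string
}

// firstPollEvent models the proposed behaviour for a single port on first
// poll: healthy ports emit a healthy event, expected-down ports emit nothing,
// and any other DOWN port emits a fatal event.
func firstPollEvent(card string, portUp bool, expected map[string]struct{}) *HealthEvent {
	if portUp {
		return &HealthEvent{IsHealthy: true}
	}
	if _, ok := expected[card]; ok {
		return nil // suppressed: no external HealthEvent
	}
	return &HealthEvent{IsFatal: true, RecommendedAction: "REPLACE_VM"}
}

// acceptanceCase pairs one port scenario with the outcome the criteria require.
type acceptanceCase struct {
	name      string
	card      string
	portUp    bool
	wantEvent bool
	wantFatal bool
}

// runAcceptance returns a description of every case whose outcome differs
// from what the criteria require.
func runAcceptance(expected map[string]struct{}, cases []acceptanceCase) []string {
	var failures []string
	for _, tc := range cases {
		evt := firstPollEvent(tc.card, tc.portUp, expected)
		switch {
		case (evt != nil) != tc.wantEvent:
			failures = append(failures, tc.name+": wrong emission")
		case evt != nil && evt.IsFatal != tc.wantFatal:
			failures = append(failures, tc.name+": wrong severity")
		}
	}
	return failures
}

func main() {
	expected := map[string]struct{}{"card0": {}}
	cases := []acceptanceCase{
		{name: "expected-down port", card: "card0", portUp: false, wantEvent: false},
		{name: "unexpected-down port", card: "card1", portUp: false, wantEvent: true, wantFatal: true},
		{name: "healthy baseline port", card: "card1", portUp: true, wantEvent: true, wantFatal: false},
	}
	fmt.Println("failures:", runAcceptance(expected, cases))
}
```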

Component

Health Monitor

Steps to Reproduce

  1. Deploy NVSentinel on a node where the IB NICs are dual-port cards with only one
    port active per card (common on Forge and cloud bare-metal GPU nodes).
  2. Reboot the node.
  3. Wait for the NIC health monitor pod to start and complete its first poll.
  4. Observe the monitor logs:
    • First-seen IB port with healthy=false for the disabled ports
    • Suppressing first-poll fatal for expected-down card
    • Health event sent with is_fatal=false, is_healthy=false for the same ports
  5. Observe the Kubernetes node events:
    • InfiniBandStateCheckIsNotHealthy events for the expected-down ports

Environment

  • NVSentinel version: v1.8.0

Metadata

Labels

bug (Something isn't working), priority/P1 (Max fix SLA: 183 days)
