Bug Description
What happened
After a node reboot, the NIC health monitor's first-poll suppression logic correctly
identifies expected-down cards via card homogeneity and downgrades them from FATAL to
non-fatal. However, the downgraded event is still emitted as an external HealthEvent
with is_fatal=false, is_healthy=false, recommended_action=NONE.
This creates noisy Kubernetes node events (InfiniBandStateCheckIsNotHealthy) for ports
that are intentionally disabled, which pollutes NIC health correlation workflows and can
be mistaken for actual NIC degradation.
Root cause
The issue is in portTransitionEvents() in
pkg/checks/state/ibstate.go:
    func (c *InfiniBandStateCheck) portTransitionEvents(
        st *ibPollState, firstPoll, baselineRun bool, expectedCards map[string]struct{},
    ) []*pb.HealthEvent {
        var events []*pb.HealthEvent
        for _, port := range st.ibPorts {
            key := portKey(port.Device, port.Port)
            prev, hasPrev := c.previousPorts[key]
            evt := c.evaluatePortTransition(st.currentPorts[key], prev, hasPrev, baselineRun)
            if evt == nil {
                continue
            }
            if firstPoll && !evt.IsHealthy {
                c.applyFirstPollSuppression(evt, port, st.portCard[key], expectedCards)
            }
            events = append(events, evt) // <-- event is always appended, even after suppression
        }
        return events
    }
applyFirstPollSuppression() mutates the event in-place:
    func (c *InfiniBandStateCheck) applyFirstPollSuppression(
        evt *pb.HealthEvent, port discovery.IBPort, card string, expected map[string]struct{},
    ) {
        if _, ok := expected[card]; !ok {
            return
        }
        slog.Info("Suppressing first-poll fatal for expected-down card",
            "device", port.Device, "port", port.Port, "card", card)
        evt.IsFatal = false
        evt.RecommendedAction = pb.RecommendedAction_NONE
    }
The suppression only downgrades IsFatal and RecommendedAction. It does not prevent
the event from being appended to the result slice. The event is still published via
monitor.runChecks() → pub.Publish(), creating a Kubernetes node event through the
platform connector's non-fatal event path (which writes !IsHealthy && !IsFatal events
as Kubernetes Events).
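To make the consequence concrete, here is a minimal sketch of the predicate described above. The `event` type and the `writtenAsNodeEvent` name are hypothetical stand-ins, not the connector's real code; only the `!IsHealthy && !IsFatal` condition comes from this report, and the connector's fatal path is not modeled.

```go
package main

import "fmt"

// event is a simplified stand-in for pb.HealthEvent (assumption: only the
// two fields relevant to the connector's non-fatal path are modeled).
type event struct {
	IsHealthy bool
	IsFatal   bool
}

// writtenAsNodeEvent is a hypothetical helper mirroring the condition this
// report attributes to the platform connector's non-fatal event path.
func writtenAsNodeEvent(e event) bool {
	return !e.IsHealthy && !e.IsFatal
}

func main() {
	// A DOWN port on first poll starts out unhealthy and fatal.
	evt := event{IsHealthy: false, IsFatal: true}
	fmt.Println("matches non-fatal path before suppression:", writtenAsNodeEvent(evt))

	// The current suppression only clears IsFatal, which makes the event
	// match the non-fatal path instead of removing it.
	evt.IsFatal = false
	fmt.Println("matches non-fatal path after suppression:", writtenAsNodeEvent(evt))
}
```

Under these assumptions, the downgrade is what makes the event eligible for the Kubernetes-event path, so suppression in its current form relocates the noise rather than removing it.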
The same pattern exists in EthernetStateCheck.unhealthyEvent() in
pkg/checks/state/ethstate.go,
where expected-down first-poll suppression also only downgrades but does not drop the event.
Expected behavior
When applyFirstPollSuppression() matches an expected-down card, the event should be
dropped entirely — not emitted as a non-fatal unhealthy event. The suppressed port should
produce no external HealthEvent, no Kubernetes node event, and no datastore record.
A structured debug log is sufficient for observability.
Actual behavior
The event is downgraded to non-fatal but still emitted:
    Health event sent check=InfiniBandStateCheck is_fatal=false is_healthy=false recommended_action=NONE
    entities="NIC=mlx5_1, NICPort=1" message="Port mlx5_1 port 1: state DOWN, phys_state Disabled"
Platform connector writes this as a Kubernetes node event:
    InfiniBandStateCheck InfiniBandStateCheckIsNotHealthy nic-health-monitor
    NIC:mlx5_1 NICPort:1 Port mlx5_1 port 1: state DOWN, phys_state Disabled Recommended Action=NONE;
Impact
- Noisy Kubernetes node events for expected topology
- Pollutes MongoDB event history if datastore is enabled
Suggested Fix
In portTransitionEvents(), check on the first poll whether the port's card is in the
expectedCards set, and if it is, drop the event instead of appending it. In the snippet
below, isExpectedDown() stands for a membership check on expectedCards:
    if firstPoll && !evt.IsHealthy {
        if c.isExpectedDown(st.portCard[key], expectedCards) {
            slog.Info("Dropping event for expected-down card",
                "device", port.Device, "port", port.Port)
            continue // do not append
        }
    }
    events = append(events, evt)
Or have applyFirstPollSuppression() return a boolean indicating whether the event
should be dropped:
    if firstPoll && !evt.IsHealthy {
        if c.applyFirstPollSuppression(evt, port, st.portCard[key], expectedCards) {
            continue
        }
    }
Apply the same fix to the equivalent first-poll suppression path in EthernetStateCheck
(pkg/checks/state/ethstate.go).
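The drop-instead-of-downgrade behavior can be sketched in isolation as follows. This is a self-contained illustration under stated assumptions, not the monitor's real code: `healthEvent`, `shouldDropOnFirstPoll`, and `filterEvents` are hypothetical names, and the card labels are made up for the example.

```go
package main

import "fmt"

// healthEvent is a simplified stand-in for pb.HealthEvent (assumption: only
// the fields needed to illustrate suppression are modeled).
type healthEvent struct {
	Device    string
	Port      int
	IsHealthy bool
	IsFatal   bool
}

// shouldDropOnFirstPoll reports whether an event should be dropped rather
// than emitted: only unhealthy events seen on the first poll whose card is
// in the expected-down set qualify.
func shouldDropOnFirstPoll(evt healthEvent, firstPoll bool, card string, expected map[string]struct{}) bool {
	if !firstPoll || evt.IsHealthy {
		return false
	}
	_, ok := expected[card]
	return ok
}

// filterEvents returns the events that would still be published. cards[i]
// is the card that evts[i] belongs to.
func filterEvents(evts []healthEvent, cards []string, firstPoll bool, expected map[string]struct{}) []healthEvent {
	var out []healthEvent
	for i, evt := range evts {
		if shouldDropOnFirstPoll(evt, firstPoll, cards[i], expected) {
			continue // suppressed: nothing is emitted for this port
		}
		out = append(out, evt)
	}
	return out
}

func main() {
	expected := map[string]struct{}{"card0": {}}
	evts := []healthEvent{
		{Device: "mlx5_0", Port: 1, IsHealthy: true},                 // active port on card0
		{Device: "mlx5_1", Port: 1, IsHealthy: false, IsFatal: true}, // expected-down port on card0
		{Device: "mlx5_2", Port: 1, IsHealthy: false, IsFatal: true}, // unexpected down port on card1
	}
	cards := []string{"card0", "card0", "card1"}

	for _, e := range filterEvents(evts, cards, true, expected) {
		fmt.Printf("%s port %d healthy=%t fatal=%t\n", e.Device, e.Port, e.IsHealthy, e.IsFatal)
	}
}
```

Keeping the decision in a function that returns a boolean means the caller owns the `continue`, which is the shape of the second option above; the unexpected down port keeps its fatal flag untouched.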
Acceptance Criteria
- Rebooting a node with expected-down cards logs "Suppressing first-poll fatal for expected-down card"
- No "Health event sent" with is_healthy=false is emitted for those suppressed ports
- No Kubernetes node event is created for suppressed expected-down ports
- Actually unexpected DOWN/Disabled ports (not in the expected-down set) still emit
  fatal HealthEvents with RecommendedAction=REPLACE_VM
- Healthy baseline events after reboot still work for genuinely healthy ports and counters
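The event-emission criteria above could be captured as a table of cases. The following is a rough sketch with hypothetical names (`portCase`, `emits`, `acceptanceCases`, `runCases`) that models only the emit-or-drop decision for an event that was already produced; it does not exercise the real check, logging, or the Kubernetes path.

```go
package main

import "fmt"

// portCase describes one scenario and whether an event should be emitted.
type portCase struct {
	name      string
	firstPoll bool
	healthy   bool
	card      string
	wantEmit  bool
}

// emits is a hypothetical model of the proposed behavior: an unhealthy
// first-poll event for an expected-down card is dropped, everything else
// is emitted unchanged.
func emits(firstPoll, healthy bool, card string, expected map[string]struct{}) bool {
	if firstPoll && !healthy {
		if _, ok := expected[card]; ok {
			return false
		}
	}
	return true
}

// acceptanceCases lists scenarios derived from the criteria above; the
// card names are made up for the example.
func acceptanceCases() []portCase {
	return []portCase{
		{"expected-down port on first poll", true, false, "card0", false},
		{"unexpected down port on first poll", true, false, "card1", true},
		{"healthy baseline on first poll", true, true, "card0", true},
		{"expected card goes down after first poll", false, false, "card0", true},
	}
}

// runCases returns the names of the cases whose outcome differs from wantEmit.
func runCases(expected map[string]struct{}, cases []portCase) []string {
	var failures []string
	for _, tc := range cases {
		if got := emits(tc.firstPoll, tc.healthy, tc.card, expected); got != tc.wantEmit {
			failures = append(failures, tc.name)
		}
	}
	return failures
}

func main() {
	expected := map[string]struct{}{"card0": {}}
	fmt.Println("failing cases:", runCases(expected, acceptanceCases()))
}
```

The last case is not listed in the criteria but follows from the `firstPoll` guard in the existing code: suppression applies only to the first poll, so a later transition to DOWN on the same card is still reported.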
Component
Health Monitor
Steps to Reproduce
- Deploy NVSentinel on a node where the IB NICs are dual-port cards with only one
  port active per card (common on Forge and cloud bare-metal GPU nodes).
- Reboot the node.
- Wait for the NIC health monitor pod to start and complete its first poll.
- Observe the monitor logs:
  - First-seen IB port with healthy=false for the disabled ports
  - Suppressing first-poll fatal for expected-down card
  - Health event sent with is_fatal=false, is_healthy=false for the same ports
- Observe the Kubernetes node events:
  - InfiniBandStateCheckIsNotHealthy events for the expected-down ports
Environment
- NVSentinel version: v1.8.0
Logs/Output
No response