[Bug]: Platform connector OR-based entity matching silently clears unrelated NIC failures #1360

Description

@KaivalyaMDabhadkar

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Bug Description

When the NIC health monitor reports a FATAL port-level event for a NIC, the platform connector
can immediately clear it if a HEALTHY event arrives from a different NIC that has the same port number.

This happens because removeImpactedEntitiesMessagesScoped in the platform connector uses OR logic
when matching entities: if any single entity tag from the healthy event matches a stored failure
message, the failure is cleared.

NIC port events carry two entity tags:

  • NIC=<device>
  • NICPort=<port_number>

Because every NIC has a port 1, the NICPort=1 tag is shared across all NICs. A healthy event from NIC-B
with NICPort=1 will match and clear a stored FATAL from NIC-A that also has NICPort=1,
even though they are completely different devices.

Expected behavior: The failure should only be cleared when all entities match (AND logic),
not when any single entity matches (OR logic). [NIC=A, NICPort=1] and [NIC=B, NICPort=1]
are different composite identities and should not cross-clear.
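
The difference between the two matching strategies can be sketched as follows. This is a minimal illustration, not the actual body of removeImpactedEntitiesMessagesScoped: the Entity type, the matchesAny and matchesAll helpers, and the device names mlx5_0 and mlx5_1 are all assumptions made for the example.

```go
package main

import "fmt"

// Entity is a hypothetical stand-in for one entity tag on a health event,
// for example {Type: "NIC", Value: "mlx5_0"}.
type Entity struct {
	Type  string
	Value string
}

// toSet collapses a slice of entities into a set.
func toSet(es []Entity) map[Entity]struct{} {
	set := make(map[Entity]struct{}, len(es))
	for _, e := range es {
		set[e] = struct{}{}
	}
	return set
}

// matchesAny models the OR behavior described in this issue:
// a single shared tag is enough to count as a match.
func matchesAny(stored, healthy []Entity) bool {
	s := toSet(stored)
	for _, h := range healthy {
		if _, ok := s[h]; ok {
			return true
		}
	}
	return false
}

// matchesAll models the expected AND behavior: the two events must
// carry the same composite identity, meaning the same set of tags.
func matchesAll(stored, healthy []Entity) bool {
	s, h := toSet(stored), toSet(healthy)
	if len(s) != len(h) {
		return false
	}
	for e := range h {
		if _, ok := s[e]; !ok {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative device names only.
	nicA := []Entity{{"NIC", "mlx5_0"}, {"NICPort", "1"}}
	nicB := []Entity{{"NIC", "mlx5_1"}, {"NICPort", "1"}}

	fmt.Println("OR, different NIC clears failure:", matchesAny(nicA, nicB))
	fmt.Println("AND, different NIC clears failure:", matchesAll(nicA, nicB))
	fmt.Println("AND, same NIC clears failure:", matchesAll(nicA, nicA))
}
```

Under OR matching the shared NICPort=1 tag alone is enough to clear the failure; under AND matching only an event with the same NIC and the same port does.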

Second variant of the same bug: A healthy counter check event for a NIC can also clear
a state check failure for the same NIC, because the two events share the same entity tags and the
healthy event carries no ErrorCode (an empty ErrorCode matches everything). "Counters are healthy" should
not clear "link state is DOWN".
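
One possible guard against this second variant is to also require that the healthy event comes from the same check as the stored failure. The sketch below is an assumption about how that could look, not the project's implementation: the Event type, its fields, the check name EthernetCounterCheck, the error code LINK_DOWN, and the device name mlx5_0 are all hypothetical. Only EthernetStateCheck appears in this report.

```go
package main

import "fmt"

// Entity and Event are hypothetical stand-ins for the real event types.
type Entity struct{ Type, Value string }

type Event struct {
	CheckName string
	ErrorCode string // empty on healthy events
	Healthy   bool
	Entities  []Entity
}

// sameEntities reports whether both slices hold the same set of tags.
func sameEntities(a, b []Entity) bool {
	as := make(map[Entity]struct{}, len(a))
	for _, e := range a {
		as[e] = struct{}{}
	}
	bs := make(map[Entity]struct{}, len(b))
	for _, e := range b {
		bs[e] = struct{}{}
		if _, ok := as[e]; !ok {
			return false
		}
	}
	return len(as) == len(bs)
}

// clearsLoose models the behavior described above: an empty ErrorCode
// acts as a wildcard and the originating check is not compared.
func clearsLoose(stored, healthy Event) bool {
	codeOK := healthy.ErrorCode == "" || healthy.ErrorCode == stored.ErrorCode
	return healthy.Healthy && codeOK && sameEntities(stored.Entities, healthy.Entities)
}

// clearsScoped additionally requires the same check name, so a healthy
// counter check cannot clear a state check failure.
func clearsScoped(stored, healthy Event) bool {
	return stored.CheckName == healthy.CheckName && clearsLoose(stored, healthy)
}

func main() {
	nic := []Entity{{"NIC", "mlx5_0"}, {"NICPort", "1"}}
	linkDown := Event{CheckName: "EthernetStateCheck", ErrorCode: "LINK_DOWN", Entities: nic}
	countersOK := Event{CheckName: "EthernetCounterCheck", Healthy: true, Entities: nic}
	stateOK := Event{CheckName: "EthernetStateCheck", Healthy: true, Entities: nic}

	fmt.Println("loose, counters clear link down:", clearsLoose(linkDown, countersOK))
	fmt.Println("scoped, counters clear link down:", clearsScoped(linkDown, countersOK))
	fmt.Println("scoped, state clears link down:", clearsScoped(linkDown, stateOK))
}
```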

Component

Health Monitor

Steps to Reproduce

  1. Have a node with multiple Mellanox NICs monitored by the NIC health monitor
  2. One NIC has a port in DOWN state — the monitor emits a FATAL EthernetStateCheck event
    with entities [NIC=<broken-nic>, NICPort=1]
  3. Any other NIC on the same node reports a HEALTHY event (state or counter check)
    with entities [NIC=<healthy-nic>, NICPort=1]
  4. Observe the node condition EthernetStateCheck — it shows False / "No Health Failures"
    despite the NIC still being down
  5. The platform connector log will show both the FATAL and HEALTHY events being processed
    in the same batch, with the healthy event clearing the failure

Environment

  • NVSentinel version: v1.8.0

Logs/Output

No response

Metadata

Labels

  • bug (Something isn't working)
  • priority/P1 (Max fix SLA: 183 days)
