Bug Description
When the NIC health monitor reports a FATAL port-level event for a NIC, the platform connector
can immediately clear it if a HEALTHY event arrives from a different NIC on the same port number.
This happens because removeImpactedEntitiesMessagesScoped in the platform connector uses OR logic
when matching entities: if any single entity tag from the healthy event matches a stored failure
message, the failure is cleared.
NIC port events carry two entity tags:
NIC=<device>
NICPort=<port_number>
Since every NIC has port 1, NICPort=1 is shared across all NICs. A healthy event from NIC-B
with NICPort=1 will match and clear a stored FATAL from NIC-A that also has NICPort=1,
even though they are completely different devices.
Expected behavior: The failure should only be cleared when all entities match (AND logic),
not when any single entity matches (OR logic). [NIC=A, NICPort=1] and [NIC=B, NICPort=1]
are different composite identities and should not cross-clear.
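
The difference between the two matching rules can be sketched in a few lines. This is a minimal illustration, not the platform connector's actual code: the function names `matches_any` and `matches_all` and the dictionary representation of entity tags are assumptions made for the example, and "AND logic" is interpreted here as requiring the full set of entity tags to be identical.

```python
from typing import Dict

# Hypothetical representation: entity type -> entity value,
# e.g. {"NIC": "NIC-A", "NICPort": "1"}.
Entities = Dict[str, str]


def matches_any(stored: Entities, healthy: Entities) -> bool:
    """OR logic (the reported behaviour): one shared tag is enough to match."""
    return any(stored.get(key) == value for key, value in healthy.items())


def matches_all(stored: Entities, healthy: Entities) -> bool:
    """AND logic (the expected behaviour): the whole composite identity must match."""
    return stored == healthy


fatal_nic_a = {"NIC": "NIC-A", "NICPort": "1"}
healthy_nic_b = {"NIC": "NIC-B", "NICPort": "1"}
healthy_nic_a = {"NIC": "NIC-A", "NICPort": "1"}

# OR logic cross-clears because NICPort=1 is shared by both NICs.
print(matches_any(fatal_nic_a, healthy_nic_b))  # True
# AND logic keeps the NIC-A failure until NIC-A itself reports healthy.
print(matches_all(fatal_nic_a, healthy_nic_b))  # False
print(matches_all(fatal_nic_a, healthy_nic_a))  # True
```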
Second variant of the same bug: A healthy counter check event for a NIC can also clear
a state check failure for the same NIC, because they share the same entity tags and the
healthy event carries no ErrorCode (an empty ErrorCode matches everything). "Counters are healthy"
should not clear "link state is DOWN".
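
One way to picture the second variant is to model the clear decision with and without scoping by check. This is a sketch under stated assumptions rather than the connector's implementation: the `Event` type, the function names, the check name `EthernetCounterCheck`, and the error code `LINK_DOWN` are all hypothetical; only `EthernetStateCheck` and the "no ErrorCode matches everything" behaviour come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    check_name: str                    # e.g. "EthernetStateCheck"
    entities: Tuple[str, ...]          # e.g. ("NIC=NIC-A", "NICPort=1")
    error_code: Optional[str] = None   # None models "no ErrorCode"


def clears_unscoped(stored: Event, healthy: Event) -> bool:
    """Model of the reported behaviour: a missing ErrorCode acts as a wildcard."""
    same_entities = stored.entities == healthy.entities
    code_matches = healthy.error_code is None or healthy.error_code == stored.error_code
    return same_entities and code_matches


def clears_scoped(stored: Event, healthy: Event) -> bool:
    """Possible fix: additionally require that both events come from the same check."""
    return clears_unscoped(stored, healthy) and stored.check_name == healthy.check_name


nic_a = ("NIC=NIC-A", "NICPort=1")
state_fatal = Event("EthernetStateCheck", nic_a, error_code="LINK_DOWN")  # hypothetical code
counter_healthy = Event("EthernetCounterCheck", nic_a)                    # hypothetical check name
state_healthy = Event("EthernetStateCheck", nic_a)

print(clears_unscoped(state_fatal, counter_healthy))  # True: counters clear a link-down failure
print(clears_scoped(state_fatal, counter_healthy))    # False: different check, failure is kept
print(clears_scoped(state_fatal, state_healthy))      # True: the state check itself recovered
```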
Component
Health Monitor
Steps to Reproduce
- Have a node with multiple Mellanox NICs monitored by the NIC health monitor
- One NIC has a port in DOWN state; the monitor emits a FATAL EthernetStateCheck event
with entities [NIC=<broken-nic>, NICPort=1]
- Any other NIC on the same node reports a HEALTHY event (state or counter check)
with entities [NIC=<healthy-nic>, NICPort=1]
- Observe the node condition EthernetStateCheck: it shows False / "No Health Failures"
despite the NIC still being down
- The platform connector log will show both the FATAL and HEALTHY events being processed
in the same batch, with the healthy event clearing the failure
Environment
- NVSentinel version: v1.8.0
Logs/Output
No response