Prerequisites
Code of Conduct
Feature Summary
This feature would allow nodes that were already cordoned before NVSentinel triggered the breakfix pipeline to remain cordoned when NVSentinel releases ownership of the node after it recovers from a fatal unhealthy event.
Problem/Use Case
Currently, the fault-quarantine module does not track whether a node was already cordoned when a fatal unhealthy event arrives. As a result, if a node was cordoned by an external system and then experiences a fatal unhealthy event from which it later recovers, fault-quarantine will uncordon the node. Nodes that were isolated for maintenance, or for faults that NVSentinel does not detect, can therefore become schedulable again.
Proposed Solution
Add a new node annotation, managed by the fault-quarantine module, that indicates whether a node was already cordoned when fault-quarantine received a fatal unhealthy event. If this annotation is set when the last unhealthy event is cleared in fault-quarantine, fault-quarantine should release ownership of the node but keep the node cordoned.
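As a rough illustration of the proposed behavior, here is a minimal Go sketch. It does not use the real fault-quarantine code or the Kubernetes client: the `Node` struct, the `Quarantine` and `Release` functions, and both annotation keys are hypothetical names chosen for this example only.

```go
package main

import "fmt"

// Hypothetical annotation keys; the actual names would be chosen by the
// fault-quarantine module and are not specified in this issue.
const (
	ownedAnnotation       = "example.com/fault-quarantine-owned"
	preCordonedAnnotation = "example.com/cordoned-before-quarantine"
)

// Node is a simplified stand-in for a Kubernetes node object.
type Node struct {
	Name          string
	Unschedulable bool // true when the node is cordoned
	Annotations   map[string]string
}

// Quarantine is called when a fatal unhealthy event arrives. On the first
// event it records whether the node was already cordoned, then takes
// ownership and cordons the node.
func Quarantine(n *Node) {
	if n.Annotations == nil {
		n.Annotations = map[string]string{}
	}
	if _, owned := n.Annotations[ownedAnnotation]; owned {
		return // already quarantined; keep the originally recorded state
	}
	if n.Unschedulable {
		n.Annotations[preCordonedAnnotation] = "true"
	}
	n.Annotations[ownedAnnotation] = "true"
	n.Unschedulable = true
}

// Release is called when the last unhealthy event is cleared. It always
// releases ownership, but uncordons only if the node was not cordoned
// before the quarantine began.
func Release(n *Node) {
	if _, owned := n.Annotations[ownedAnnotation]; !owned {
		return // not owned by fault-quarantine; leave the node untouched
	}
	_, preCordoned := n.Annotations[preCordonedAnnotation]
	delete(n.Annotations, preCordonedAnnotation)
	delete(n.Annotations, ownedAnnotation)
	if !preCordoned {
		n.Unschedulable = false
	}
}

func main() {
	maintenance := &Node{Name: "node-a", Unschedulable: true} // cordoned externally
	healthy := &Node{Name: "node-b"}

	for _, n := range []*Node{maintenance, healthy} {
		Quarantine(n)
		Release(n)
		fmt.Printf("%s cordoned after recovery: %v\n", n.Name, n.Unschedulable)
	}
}
```

In this sketch the pre-cordon state is recorded only when ownership is first taken, so additional fatal events on an already quarantined node do not overwrite it.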
Component
Health Monitor