[Feature]: Do not uncordon nodes cordoned independently of NVSentinel #1424

Description

@natherz97

Prerequisites

  • I searched existing issues

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Feature Summary

Nodes that were already cordoned before NVSentinel triggered the breakfix pipeline should remain cordoned when NVSentinel releases ownership of the node after it recovers from a fatal unhealthy event.

Problem/Use Case

Currently, the fault-quarantine module does not track whether a node was already cordoned when a fatal unhealthy event arrives. As a result, if a node was cordoned by an external system and then experiences a fatal unhealthy event from which it later recovers, fault-quarantine uncordons the node. Nodes that were isolated for maintenance, or for faults not detected by NVSentinel, can therefore become schedulable again.

Proposed Solution

Add a new node annotation, managed by the fault-quarantine module, that indicates whether a node was already cordoned when fault-quarantine received a fatal unhealthy event. If this annotation is set when the last unhealthy event is cleared, fault-quarantine should release ownership of the node but leave it cordoned.
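A minimal sketch of the proposed behavior, to make the intended state transitions concrete. Everything here is an assumption for illustration: the annotation key, the simplified `Node` struct (a stand-in for a Kubernetes Node, not the client-go type), the `Quarantined` field (a stand-in for however fault-quarantine tracks ownership), and the `Quarantine`/`Release` function names. None of these come from the NVSentinel codebase.

```go
package main

import "fmt"

// preCordonedAnnotation is a hypothetical key; the issue does not name one.
const preCordonedAnnotation = "nvsentinel.example/cordoned-before-quarantine"

// Node is a simplified stand-in for a Kubernetes Node object.
type Node struct {
	Name          string
	Unschedulable bool              // mirrors spec.unschedulable (cordoned)
	Annotations   map[string]string // node annotations
	Quarantined   bool              // hypothetical: fault-quarantine owns the node
}

// Quarantine models the arrival of a fatal unhealthy event.
// It records whether the node was already cordoned before taking ownership.
func Quarantine(n *Node) {
	if n.Quarantined {
		return // already owned; keep the originally recorded cordon state
	}
	if n.Annotations == nil {
		n.Annotations = map[string]string{}
	}
	if n.Unschedulable {
		n.Annotations[preCordonedAnnotation] = "true"
	}
	n.Unschedulable = true
	n.Quarantined = true
}

// Release models the last unhealthy event being cleared.
// It always releases ownership, but only uncordons nodes that
// fault-quarantine itself cordoned.
func Release(n *Node) {
	if !n.Quarantined {
		return
	}
	_, preCordoned := n.Annotations[preCordonedAnnotation]
	delete(n.Annotations, preCordonedAnnotation)
	n.Quarantined = false
	if !preCordoned {
		n.Unschedulable = false
	}
}

func main() {
	// Cordoned by an external system before the fault: stays cordoned.
	external := &Node{Name: "node-a", Unschedulable: true}
	Quarantine(external)
	Release(external)
	fmt.Printf("%s unschedulable=%t\n", external.Name, external.Unschedulable)

	// Schedulable before the fault: uncordoned on recovery, as today.
	fresh := &Node{Name: "node-b"}
	Quarantine(fresh)
	Release(fresh)
	fmt.Printf("%s unschedulable=%t\n", fresh.Name, fresh.Unschedulable)
}
```

Running this prints `node-a unschedulable=true` and `node-b unschedulable=false`. The sketch records the prior cordon state only on the first fatal event, so later events on an already quarantined node do not overwrite it. It does not cover the case where an external system cordons the node while it is already quarantined; whether that should also be preserved is left open here.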

Component

Health Monitor
