Mark device as healthy if checkHealth function does not receive unhealthy event #1211
Thanks @Zeel-Patel!
Is it possible to listen to nvmlEventTypeGpuRecoveryAction to determine if a device is available?
The following new functionality is exposed on NVIDIA display drivers version 565 Production or later.
Thanks, this is better.
cc @klueska
We are in the process of pulling a new component into the GPU Operator that has a gRPC interface to stream GPU health events to the plugin in a more comprehensive and reliable way than using NVML directly. This event stream also contains metadata specifying what remediation must be taken to recover from an error detected on the GPU. The plan is to leverage this component, once available, to do comprehensive error detection and remediation. We are just waiting for its approval to be open-sourced.
@klueska any ETA for the component?
stop chan interface{}
socket string
server *grpc.Server
healthy chan *rm.Device
Instead of introducing a new channel, is there any benefit in having a single channel that accepts a device and the desired status?
We could even mark the device as healthy or unhealthy at the point where we send the device on the channel and then keep the health channel as is.
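The single-channel idea from the two comments above could look roughly like the sketch below. It is only an illustration: `Device` is a hypothetical stand-in for `rm.Device`, and `healthUpdate` and `applyUpdates` are names invented here, not part of the plugin.

```go
package main

import "fmt"

// Device is a hypothetical stand-in for rm.Device; only the fields
// needed for this sketch are modeled.
type Device struct {
	ID     string
	Health string
}

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// healthUpdate (hypothetical) pairs a device with the status the sender
// wants applied, so one channel can carry both transitions.
type healthUpdate struct {
	device *Device
	status string
}

// applyUpdates drains the channel until it is closed, applies each update,
// and returns how many updates actually changed a device's health.
func applyUpdates(updates <-chan healthUpdate) int {
	changed := 0
	for u := range updates {
		if u.device.Health != u.status {
			u.device.Health = u.status
			changed++
		}
	}
	return changed
}

func main() {
	d := &Device{ID: "GPU-0", Health: Healthy}
	updates := make(chan healthUpdate, 2)
	updates <- healthUpdate{device: d, status: Unhealthy}
	updates <- healthUpdate{device: d, status: Healthy}
	close(updates)
	fmt.Println(applyUpdates(updates), d.Health)
}
```

With this shape the receiver has a single case to handle, and redundant updates (a healthy device reported healthy again) can be dropped before anything is sent to the kubelet.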
@@ -174,6 +164,18 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
		continue
	}

	if e.EventType != nvml.EventTypeXidCriticalError {
		klog.Infof("Skipping non-nvmlEventTypeXidCriticalError event: %+v", e)
		healthy <- d
Are Xid events mutually exclusive? Why are we treating non-critical (or skipped) events as indicators of device health?
Solves: 1014
Currently, if a device is marked as unhealthy, its status remains unhealthy and the kubelet does not show the resource on the node.
The ListAndWatch stream receiver in the kubelet device manager checks device health and updates the node resources accordingly.
This change makes the checkHealth function send a healthy signal for a device (over a new healthy channel) to the server's ListAndWatch, which marks the device as healthy again when no unhealthy event for that device is received.
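The flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the plugin's code: `Device` stands in for `rm.Device`, and the `listAndWatch` signature and `send` callback are hypothetical replacements for the gRPC stream that reports the device list to the kubelet.

```go
package main

import "fmt"

// Device is a hypothetical stand-in for rm.Device.
type Device struct {
	ID     string
	Health string
}

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// listAndWatch (hypothetical signature) reports the device list once, then
// re-reports it whenever a device changes health. The send callback stands
// in for streaming a response to the kubelet.
func listAndWatch(stop <-chan struct{}, unhealthy, healthy <-chan *Device,
	devices []*Device, send func([]*Device)) {
	send(devices) // initial report
	for {
		select {
		case <-stop:
			return
		case d := <-unhealthy:
			d.Health = Unhealthy
			send(devices)
		case d := <-healthy:
			if d.Health == Healthy {
				continue // already healthy, nothing to report
			}
			d.Health = Healthy
			send(devices)
		}
	}
}

func main() {
	d := &Device{ID: "GPU-0", Health: Healthy}
	stop := make(chan struct{})
	unhealthy := make(chan *Device)
	healthy := make(chan *Device)
	done := make(chan struct{})

	var reports []string
	go func() {
		defer close(done)
		listAndWatch(stop, unhealthy, healthy, []*Device{d}, func(ds []*Device) {
			reports = append(reports, ds[0].Health)
		})
	}()

	unhealthy <- d // device fails
	healthy <- d   // device recovers
	close(stop)
	<-done
	fmt.Println(reports)
}
```

The early `continue` in the healthy case matters if checkHealth signals healthy on every skipped event: without it, each such event would trigger a redundant report to the kubelet.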