Mark device as healthy if checkHealth function does not receive unhealthy event #1211

Open
wants to merge 3 commits into
base: main

@Zeel-Patel Zeel-Patel commented Mar 24, 2025

Solves: 1014
Currently, once a device is marked as unhealthy, its status remains unhealthy and the kubelet does not show the resource on the node.

The ListAndWatch stream receiver in the kubelet device manager checks the health of each device and updates the node resources accordingly.

This change makes the checkHealth function send a device-healthy signal (over a new healthy channel) to the server's ListAndWatch, so that the device is marked healthy again when no unhealthy event for that device is received.
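
As a rough illustration of the flow described above, the following sketch shows a ListAndWatch-style loop that listens on both an unhealthy and a healthy channel and re-sends the device list whenever a device's status changes. The `Device` type, `listAndWatch`, `runDemo`, and the `send` callback are simplified stand-ins invented for this example; they are not the plugin's actual types or signatures.

```go
package main

import "fmt"

// Health strings as used by the kubelet device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a simplified, hypothetical stand-in for the plugin's rm.Device.
type Device struct {
	ID     string
	Health string
}

// listAndWatch is a hypothetical stand-in for the server's ListAndWatch loop.
// It calls send with the full device list once at start-up and again every
// time a device's health actually changes.
func listAndWatch(stop <-chan struct{}, unhealthy, healthy <-chan *Device,
	devices []*Device, send func([]*Device)) {
	send(devices) // initial device list
	for {
		select {
		case <-stop:
			return
		case d := <-unhealthy:
			d.Health = Unhealthy
			send(devices)
		case d := <-healthy:
			if d.Health == Healthy {
				continue // already healthy; nothing to report
			}
			d.Health = Healthy
			send(devices)
		}
	}
}

// runDemo drives one device through unhealthy and back to healthy and
// returns the health value observed in each update.
func runDemo() []string {
	d := &Device{ID: "gpu-0", Health: Healthy}
	stop := make(chan struct{})
	unhealthy := make(chan *Device)
	healthy := make(chan *Device)
	updates := make(chan string)
	done := make(chan struct{})

	go func() {
		defer close(done)
		listAndWatch(stop, unhealthy, healthy, []*Device{d}, func(devs []*Device) {
			updates <- devs[0].Health
		})
	}()

	seen := []string{<-updates} // initial list
	unhealthy <- d
	seen = append(seen, <-updates)
	healthy <- d
	seen = append(seen, <-updates)

	close(stop)
	<-done
	return seen
}

func main() {
	fmt.Println(runDemo()) // prints: [Healthy Unhealthy Healthy]
}
```

The sketch only forwards an update when the status actually changes, which is one way to avoid flooding the kubelet with identical device lists; whether the PR does the same is not shown in the excerpt here.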

copy-pr-bot bot commented Mar 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jslouisyou

Thanks @Zeel-Patel!
Could you please let me know when the next version of k8s-device-plugin will be released?
If possible, I can test it on my side.

@ppd324

ppd324 commented Apr 24, 2025

Is it possible to listen to nvmlEventTypeGpuRecoveryAction to determine if a device is available?

Changes between v560 and v565

The following new functionality is exposed on NVIDIA display drivers version 565 Production or later.
• Added new event type nvmlEventTypeGpuRecoveryAction.
• Added new fieldId to query GPU recovery action NVML_FI_DEV_GET_GPU_RECOVERY_ACTION.

@Zeel-Patel
Author

> Is it possible to listen to nvmlEventTypeGpuRecoveryAction to determine if a device is available?
>
> Changes between v560 and v565
>
> The following new functionality is exposed on NVIDIA display drivers version 565 Production or later.
> • Added new event type nvmlEventTypeGpuRecoveryAction.
> • Added new fieldId to query GPU recovery action NVML_FI_DEV_GET_GPU_RECOVERY_ACTION.

Thanks, this is a better approach.
Hi @elezar, I can see that this event is not publicly available. Is there any plan to expose it in the open-source version of the NVML library?

@Zeel-Patel
Author

cc @klueska

@klueska
Contributor

klueska commented May 16, 2025

We are in the process of pulling a new component into the GPU Operator that has a gRPC interface to stream GPU health events to the plugin in a more comprehensive and reliable way than using NVML directly. This event stream also contains metadata specifying what remediation must be taken to recover from an error detected on the GPU. The plan is to leverage this component, once available, to do comprehensive error detection and remediation. We are just waiting for its approval to be open-sourced.

@amitaekbote

@klueska any ETA for the component?

	stop    chan interface{}
	socket  string
	server  *grpc.Server
	healthy chan *rm.Device
Member

Instead of introducing a new channel, is there any benefit in having a single channel that accepts a device and the desired status?
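
One way to read this suggestion is a single channel whose messages pair a device with the status it should have. The sketch below is only an illustration of that idea: `healthUpdate`, `applyUpdates`, and the simplified `Device` type are hypothetical names made up for the example, not code from the plugin.

```go
package main

import "fmt"

// Device is a simplified, hypothetical stand-in for the plugin's rm.Device.
type Device struct {
	ID     string
	Health string
}

// healthUpdate pairs a device with the status the sender wants it to have,
// so one channel can carry both "healthy" and "unhealthy" transitions.
type healthUpdate struct {
	Device *Device
	Health string
}

// applyUpdates consumes updates until the channel is closed and returns how
// many of them actually changed a device's status.
func applyUpdates(updates <-chan healthUpdate) int {
	changed := 0
	for u := range updates {
		if u.Device.Health == u.Health {
			continue // duplicate report; no state change
		}
		u.Device.Health = u.Health
		changed++
	}
	return changed
}

func main() {
	d := &Device{ID: "gpu-0", Health: "Healthy"}

	updates := make(chan healthUpdate, 3)
	updates <- healthUpdate{Device: d, Health: "Unhealthy"}
	updates <- healthUpdate{Device: d, Health: "Unhealthy"} // duplicate, ignored
	updates <- healthUpdate{Device: d, Health: "Healthy"}
	close(updates)

	n := applyUpdates(updates)
	fmt.Println(n, d.Health) // prints: 2 Healthy
}
```

With this shape the receiver needs only one `select` case for health changes, at the cost of changing the element type of the existing channel.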

Member

We could even mark the device as healthy or unhealthy at the point where we send the device on the channel and then keep the health channel as is.
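
A minimal sketch of this alternative, under the assumption that the sender stamps the status on a copy of the device before sending it on the existing channel. The `report` and `drain` helpers and the simplified `Device` type are hypothetical; copying the device is a choice made for this example so that the receiver never observes a later mutation, and is not something the review comment prescribes.

```go
package main

import "fmt"

// Device is a simplified, hypothetical stand-in for the plugin's rm.Device.
type Device struct {
	ID     string
	Health string
}

// report marks the status at the point of sending. It sends a copy so that a
// later report for the same device cannot change a message already queued.
func report(health chan<- *Device, d *Device, status string) {
	marked := *d
	marked.Health = status
	health <- &marked
}

// drain stands in for the receiving side: it no longer decides the status,
// it only reads what the sender set.
func drain(health <-chan *Device) []string {
	var out []string
	for d := range health {
		out = append(out, d.ID+"="+d.Health)
	}
	return out
}

func main() {
	d := &Device{ID: "gpu-0", Health: "Healthy"}

	health := make(chan *Device, 2) // the existing channel, element type unchanged
	report(health, d, "Unhealthy")
	report(health, d, "Healthy")
	close(health)

	fmt.Println(drain(health)) // prints: [gpu-0=Unhealthy gpu-0=Healthy]
}
```

If the sender mutated the shared device in place instead of copying it, the receiver could read the field while the sender writes it, so a real implementation would need to consider synchronization.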

@@ -174,6 +164,18 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
			continue
		}

		if e.EventType != nvml.EventTypeXidCriticalError {
			klog.Infof("Skipping non-nvmlEventTypeXidCriticalError event: %+v", e)
			healthy <- d
Member

Are xid events mutually exclusive? Why are we treating non-critical (or skipped events) as indicators of device health?
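
The concern can be illustrated with a deliberately simplified model. It assumes, based only on the diff excerpt above, that any skipped (non-critical) event results in the device being reported healthy; `event`, `prPolicy`, and `stickyPolicy` are hypothetical names for this sketch and do not use the NVML bindings.

```go
package main

import "fmt"

// event is a hypothetical, reduced view of an NVML event: all that matters
// here is whether it is a critical Xid error.
type event struct {
	critical bool
}

// prPolicy models the behaviour in the diff excerpt as read here: a critical
// event marks the device unhealthy, and any other event marks it healthy.
func prPolicy(events []event) string {
	health := "Healthy"
	for _, e := range events {
		if e.critical {
			health = "Unhealthy"
		} else {
			health = "Healthy"
		}
	}
	return health
}

// stickyPolicy treats non-critical events as carrying no information about
// device health, so an earlier critical error is not cleared by them.
func stickyPolicy(events []event) string {
	health := "Healthy"
	for _, e := range events {
		if e.critical {
			health = "Unhealthy"
		}
	}
	return health
}

func main() {
	// A critical Xid error followed by an unrelated, non-critical event.
	seq := []event{{critical: true}, {critical: false}}
	fmt.Println(prPolicy(seq), stickyPolicy(seq)) // prints: Healthy Unhealthy
}
```

Under the first policy an unrelated event arriving after a critical error flips the device back to healthy, even though nothing indicates the error was resolved, which is the situation the question points at.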

6 participants