Mark device as healthy if checkHealth function does not receive unhealthy event #1211

Zeel-Patel · 2025-03-24T12:41:06Z

Solves: 1014
Currently, if a device is marked as unhealthy, its status remains unhealthy and the kubelet does not show the resource on the node.

The ListAndWatch stream receiver of the kubelet device manager checks the health of each device and updates node resources accordingly.

This change makes the checkHealth function send a device-healthy signal (over a healthy channel) to the server's ListAndWatch, which marks the device as healthy again when no unhealthy event is received for it.
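A rough sketch of the idea, not the actual plugin code: the receiving loop listens on both an unhealthy and a healthy channel and flips the device state in either direction. `Device`, `watchHealth`, and the channel names here are simplified, hypothetical stand-ins for `rm.Device` and the ListAndWatch loop; the real loop would also resend the device list to the kubelet on each change.

```go
package main

import "fmt"

// Health states, mirroring the strings used by the device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a hypothetical, trimmed-down stand-in for rm.Device.
type Device struct {
	ID     string
	Health string
}

// watchHealth is a hypothetical stand-in for the ListAndWatch loop.
// It updates device state from both channels until stop is closed and
// returns how many state changes occurred (each of which would trigger
// a resend of the device list to the kubelet in the real plugin).
func watchHealth(stop <-chan struct{}, healthy, unhealthy <-chan *Device) int {
	changes := 0
	for {
		select {
		case <-stop:
			return changes
		case d := <-healthy:
			if d.Health != Healthy {
				d.Health = Healthy
				changes++
			}
		case d := <-unhealthy:
			if d.Health != Unhealthy {
				d.Health = Unhealthy
				changes++
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	healthy := make(chan *Device)
	unhealthy := make(chan *Device)
	done := make(chan int)
	go func() { done <- watchHealth(stop, healthy, unhealthy) }()

	gpu := &Device{ID: "GPU-0", Health: Healthy}
	unhealthy <- gpu // a critical event marks the device unhealthy
	healthy <- gpu   // a later healthy signal restores it
	close(stop)
	changes := <-done
	fmt.Println(gpu.ID, gpu.Health, changes)
}
```

The channels are unbuffered in this sketch so that each send completes only once the loop has received it, which keeps the ordering of state changes easy to reason about.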

…lthy event

copy-pr-bot · 2025-03-24T12:41:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jslouisyou · 2025-03-25T01:26:26Z

Thanks @Zeel-Patel!
Could you please let me know when the next version of k8s-device-plugin will be released?
If possible then I can test it on my side.

ppd324 · 2025-04-24T05:54:56Z

Is it possible to listen to nvmlEventTypeGpuRecoveryAction to determine if a device is available?

Changes between v560 and v565

The following new functionality is exposed on NVIDIA display drivers version 565 Production or later.
• Added new event type nvmlEventTypeGpuRecoveryAction.
• Added new fieldId to query GPU recovery action NVML_FI_DEV_GET_GPU_RECOVERY_ACTION.

Zeel-Patel · 2025-05-07T18:56:55Z

Is it possible to listen to nvmlEventTypeGpuRecoveryAction to determine if a device is available?

Changes between v560 and v565

The following new functionality is exposed on NVIDIA display drivers version 565 Production or later.
• Added new event type nvmlEventTypeGpuRecoveryAction.
• Added new fieldId to query GPU recovery action NVML_FI_DEV_GET_GPU_RECOVERY_ACTION.

Thanks, this is better.
Hi @elezar, I can see that this event is not publicly available. Is there any plan to expose it in the open source version of the NVML library?

Zeel-Patel · 2025-05-07T18:57:11Z

cc @klueska

klueska · 2025-05-16T11:00:31Z

We are in the process of pulling a new component into the GPU Operator that has a gRPC interface to stream GPU health events to the plugin in a more comprehensive / reliable way than using NVML directly. This event stream also contains metadata specifying what remediation must be taken to recover from an error detected on the GPU. The plan is to leverage this component, once available, to do comprehensive error detection and remediation. We are just waiting for its approval to be open-sourced.

amitaekbote · 2025-06-26T21:07:57Z

@klueska any ETA for the component?

elezar · 2025-06-27T06:23:42Z

internal/plugin/server.go

-	stop   chan interface{}
+	socket    string
+	server    *grpc.Server
+	healthy   chan *rm.Device


Instead of introducing a new channel, is there any benefit in having a single channel that accepts a device and the desired status?

We could even mark the device as healthy or unhealthy at the point where we send the device on the channel and then keep the health channel as is.
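One way to picture the single-channel suggestion, as a sketch only: each message carries the device together with the desired status, so one receiver handles both transitions. The `healthUpdate` type and `applyUpdates` function are hypothetical names invented for illustration, and `Device` is a simplified stand-in for `rm.Device`.

```go
package main

import "fmt"

// Health states, mirroring the strings used by the device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a hypothetical, trimmed-down stand-in for rm.Device.
type Device struct {
	ID     string
	Health string
}

// healthUpdate is a hypothetical message type pairing a device with the
// status the sender wants it to have.
type healthUpdate struct {
	Device *Device
	Health string
}

// applyUpdates drains the channel until it is closed, applies each
// requested status, and returns the number of actual state changes.
func applyUpdates(updates <-chan healthUpdate) int {
	changes := 0
	for u := range updates {
		if u.Device.Health != u.Health {
			u.Device.Health = u.Health
			changes++
		}
	}
	return changes
}

func main() {
	updates := make(chan healthUpdate, 2)
	gpu := &Device{ID: "GPU-0", Health: Healthy}

	updates <- healthUpdate{Device: gpu, Health: Unhealthy}
	updates <- healthUpdate{Device: gpu, Health: Healthy}
	close(updates)

	changes := applyUpdates(updates)
	fmt.Println(gpu.ID, gpu.Health, changes)
}
```

The alternative mentioned in the comment, setting the health field on the device before sending it on the existing channel, would avoid the wrapper type entirely; which of the two fits the plugin better is a design decision for the maintainers.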

elezar · 2025-06-27T06:27:49Z

internal/rm/health.go

@@ -174,6 +164,18 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
 			continue
 		}

+		if e.EventType != nvml.EventTypeXidCriticalError {
+			klog.Infof("Skipping non-nvmlEventTypeXidCriticalError event: %+v", e)
+			healthy <- d


Are xid events mutually exclusive? Why are we treating non-critical (or skipped events) as indicators of device health?

Zeel-Patel added 2 commits March 24, 2025 18:09

Mark device as healthy if checkHealth function does not receive unhea…

b88248e

…lthy event

Create workspace.xml

d56a95f

Delete workspace.xml

45b9133

Zeel-Patel mentioned this pull request Mar 24, 2025

GPU resources are not recovered even XID error is resolved #1014

Open

elezar reviewed Jun 27, 2025
