Description
Hello, NVIDIA team.
I recently ran into an issue where the GPU resources (nvidia.com/gpu) reported by the kubelet are not recovered (e.g. 7 -> 8) even after the XID error that caused the drop has been resolved.
The nvidia-device-plugin-daemonset is deployed by gpu-operator, and I'm using gpu-operator v23.9.2.
Here are more details:
I found that only 7 GPU cards were shown in Kubernetes, even though the H100 node has 8 GPU cards:
Capacity:
cpu: 128
ephemeral-storage: 7441183616Ki
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2113276288Ki
nvidia.com/gpu: 8
pods: 110
Allocatable:
cpu: 128
ephemeral-storage: 6857794809152
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2062682496Ki
nvidia.com/gpu: 7 <=========== here
pods: 110
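For reference, the output above comes from kubectl describe node; a quick way to check just the GPU resource counts is something like the following (the node name is a placeholder for the H100 node):

# Placeholder node name; prints the nvidia.com/gpu line from both Capacity and Allocatable.
$ kubectl describe node <h100-node-name> | grep nvidia.com/gpu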
nvidia-device-plugin-daemonset reports that an XID 94 error occurred on one of the GPU cards:
I1025 02:19:08.002792 1 health.go:151] Skipping non-nvmlEventTypeXidCriticalError event: {Device:{Handle:0x7f0dcf40bdf8} EventType:2 EventData:0 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048144 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048185 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.048239 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.049436 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.049451 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.049483 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.059938 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.059948 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.059980 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.074343 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.074366 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.074389 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
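For what it's worth, the same XID events can be cross-checked on the node itself, since the NVIDIA driver also writes them to the kernel ring buffer (this assumes shell access to the node; the exact message format depends on the driver version):

# Assumes shell access on the H100 node; -T prints human-readable timestamps.
$ dmesg -T | grep -i xid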
But after some time, the XID error seems to be resolved (I think the application was restarted or removed). I can't find any XID error from nvidia-smi:
$ nvidia-smi
Fri Oct 25 11:35:11 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:1A:00.0 Off | 2 |
| N/A 33C P0 71W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:40:00.0 Off | 0 |
| N/A 31C P0 70W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:53:00.0 Off | 0 |
| N/A 31C P0 74W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:66:00.0 Off | 0 |
| N/A 33C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9C:00.0 Off | 0 |
| N/A 35C P0 71W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:C0:00.0 Off | 0 |
| N/A 32C P0 68W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:D2:00.0 Off | 0 |
| N/A 34C P0 70W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:E4:00.0 Off | 0 |
| N/A 31C P0 71W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
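One note: GPU 0 above still shows a Volatile Uncorr. ECC count of 2. If it helps, the per-GPU ECC details can be dumped with something like the following (the UUID is taken from the device-plugin log above; -i also accepts the GPU index):

# ECC error details for the GPU that was marked unhealthy.
$ nvidia-smi -q -d ECC -i GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41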
But even though the XID error is resolved, nvidia-device-plugin-daemonset does not fetch the new status of the GPU cards and report it to the kubelet, so the kubelet still thinks that only some of the GPU cards can be used.
After I restarted the nvidia-device-plugin-daemonset pod (restart command sketched after the output below), it reported to the kubelet that 8 GPU cards can be used (the nvidia.com/gpu count in Allocatable changed back):
Capacity:
cpu: 128
ephemeral-storage: 7441183616Ki
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2113276288Ki
nvidia.com/gpu: 8
pods: 110
Allocatable:
cpu: 128
ephemeral-storage: 6857794809152
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2062682496Ki
nvidia.com/gpu: 8 <=========== here is changed
pods: 110
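The restart itself was nothing special, roughly the following (the namespace and label are assumptions based on a default gpu-operator install and may differ):

# Delete the device-plugin pod(s) so the DaemonSet recreates them,
$ kubectl -n gpu-operator delete pod -l app=nvidia-device-plugin-daemonset
# or restart the whole DaemonSet:
$ kubectl -n gpu-operator rollout restart daemonset/nvidia-device-plugin-daemonset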
I think nvidia-device-plugin-daemonset should re-fetch the GPU status once the error is cleared and report it to the kubelet, instead of requiring a manual pod restart.
Could you please take a look at this issue?
Thanks.