Prerequisites
Code of Conduct
Bug Description
NVSentinel doesn't work in IPv6-only K8s clusters. There are a few reasons why:
1. DCGM client can't connect via hostnames that resolve to IPv6 addresses
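To confirm which address families a DCGM hostname actually resolves to, a small diagnostic along these lines may help. It is only a sketch using Python's standard `socket` module, not NVSentinel code; `resolved_families` is a hypothetical helper name, and the port is whatever the DCGM service listens on in your cluster.

```python
import socket


def resolved_families(host, port):
    """Return the names of the address families that `host` resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {socket.AddressFamily(info[0]).name for info in infos}
```

If this returns only `{"AF_INET6"}` for the DCGM service hostname, any client that assumes IPv4 addresses will fail to connect to it.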
2. The metrics server only binds to 0.0.0.0, so liveness probes that call it can't reach it in IPv6-only clusters.
The metrics server inside gpu-health-monitor pods is started by calling start_http_server from prometheus_client without a value for the optional addr argument, so it binds to the default, 0.0.0.0, which is IPv4 only. The kubelet's liveness and readiness probes target the pod's IPv6 address, so they are refused.
Inside the gpu-health-monitor pods:
root@gpu-health-monitor-dcgm-4:/# netstat -tlpn | grep 2112
tcp 0 0 0.0.0.0:2112 0.0.0.0:* LISTEN 1/python3
From the pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 16m (x206 over 5h10m) kubelet spec.containers{gpu-health-monitor}: Liveness probe failed: Get "http://[2600:1f18:b39:ec10::611b]:2112/metrics": dial tcp [2600:1f18:b39:ec10::611b]:2112: connect: connection refused
Warning Unhealthy 6m40s (x868 over 5h11m) kubelet spec.containers{gpu-health-monitor}: Readiness probe failed: Get "http://[2600:1f18:b39:ec10::611b]:2112/metrics": dial tcp [2600:1f18:b39:ec10::611b]:2112: connect: connection refused
Normal Killing 5m14s (x72 over 5h9m) kubelet spec.containers{gpu-health-monitor}: Container gpu-health-monitor failed liveness probe, will be restarted
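One possible fix for the metrics server is to pass an explicit bind address through to `start_http_server`. The following is a minimal sketch, not the project's actual code: the `METRICS_BIND_ADDRESS` environment variable and both helper names are hypothetical, and the start function is passed in as a parameter so the snippet runs without `prometheus_client` installed.

```python
import os


def resolve_bind_addr(env=None, default="::"):
    """Pick the bind address, preferring a (hypothetical) env override."""
    env = os.environ if env is None else env
    return env.get("METRICS_BIND_ADDRESS") or default


def start_metrics_server(port, start_fn, env=None):
    """Start the metrics server on the resolved address and return it.

    `start_fn` is expected to accept (port, addr=...), as
    prometheus_client.start_http_server does.
    """
    addr = resolve_bind_addr(env)
    start_fn(port, addr=addr)
    return addr
```

In the monitor this would be called as `start_metrics_server(2112, prometheus_client.start_http_server)`. On Linux with the default `net.ipv6.bindv6only=0`, a socket bound to `::` usually accepts IPv4 connections as well, so defaulting to `::` should not break IPv4 or dual-stack clusters, but that is worth verifying on the target nodes.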
3. No built-in way to make DCGM pods listen on [::] instead of 0.0.0.0
The only way I could find to make DCGM pods bind on IPv6 interfaces was to extract the hard-coded args from the DCGM pod image and change the 0.0.0.0 arg to :: in the GPU operator Helm chart:
dcgm:
  args: ["-n", "-b", "::", "--log-level", "NONE", "-f", "-"]
This workaround does force DCGM pods to bind on IPv6 interfaces, but it's not particularly pretty. It would be nice if this was made easier, or just worked out of the box.
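To check whether a DCGM pod is reachable over IPv6 after changing its bind address, a plain TCP connect from another pod is usually enough. This sketch uses only the standard library; `can_connect` is a hypothetical helper, and the port used in the usage note (5555) is an assumption that should be replaced with the port your DCGM host engine listens on.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection resolves the host and tries each returned
        # address (IPv6 or IPv4) until one connects.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("2001:db8::1", 5555)` with the DCGM pod's IPv6 address in place of the documentation-range placeholder should return `True` once the pod binds to `::`.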
The end result of these issues is that it's not possible to use NVSentinel in IPv6-only clusters.
Component
Health Monitor
Steps to Reproduce
- Set up an IPv6-only cluster
- Install the NVIDIA GPU Operator and NVSentinel
- Observe that multiple pods are stuck in CrashLoopBackOff due to the issues above.
Environment
- NVSentinel version: v1.9.0
- Kubernetes version: v1.35.2
- Deployment method: Helm
Logs/Output
No response