[Bug]: NVSentinel doesn't work in IPv6-only clusters #1407

Description

@bfbachmann

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Bug Description

NVSentinel doesn't work in IPv6-only K8s clusters. There are a few reasons why:

1. The DCGM client can't connect via hostnames that resolve to IPv6 addresses
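
To confirm that a DCGM hostname resolves only to IPv6 addresses, a quick check with the Python standard library may help. This is only a diagnostic sketch: the function name `resolved_families` is hypothetical, and it does not reproduce the DCGM client's own connection logic.

```python
import socket


def resolved_families(host: str, port: int) -> set[str]:
    """Return the address families ("IPv4", "IPv6") that host resolves to."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    # getaddrinfo returns one tuple per candidate address; the first
    # element of each tuple is the address family.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return {names[family] for family, *_ in infos if family in names}
```

Calling `resolved_families(<dcgm-hostname>, <dcgm-port>)` from inside a gpu-health-monitor pod should show whether the name resolves to IPv6 only; the hostname and port are whatever your deployment uses.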

2. The metrics server binds only to 0.0.0.0, so liveness and readiness probes can't reach it in IPv6-only clusters.
The metrics server inside gpu-health-monitor pods is started by calling start_http_server from prometheus_client without passing a value for the optional addr argument. It therefore defaults to 0.0.0.0, which is IPv4-only, so the kubelet's probes against the pod's IPv6 address are refused.

Inside the gpu-health-monitor pods:

root@gpu-health-monitor-dcgm-4:/# netstat -tlpn | grep 2112
tcp        0      0 0.0.0.0:2112            0.0.0.0:*               LISTEN      1/python3

From the pod events:

Events:
  Type     Reason                           Age                      From     Message
  ----     ------                           ----                     ----     -------
  Warning  Unhealthy                        16m (x206 over 5h10m)    kubelet  spec.containers{gpu-health-monitor}: Liveness probe failed: Get "http://[2600:1f18:b39:ec10::611b]:2112/metrics": dial tcp [2600:1f18:b39:ec10::611b]:2112: connect: connection refused
  Warning  Unhealthy                        6m40s (x868 over 5h11m)  kubelet  spec.containers{gpu-health-monitor}: Readiness probe failed: Get "http://[2600:1f18:b39:ec10::611b]:2112/metrics": dial tcp [2600:1f18:b39:ec10::611b]:2112: connect: connection refused
  Normal   Killing                          5m14s (x72 over 5h9m)    kubelet  spec.containers{gpu-health-monitor}: Container gpu-health-monitor failed liveness probe, will be restarted
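
One possible fix is to choose the bind address from the pod's IP family and pass it as addr. The sketch below is not NVSentinel's actual code: `metrics_bind_address` and `start_metrics_server` are hypothetical helpers, and the pod IP is assumed to be available to the process (for example through the downward API). It also assumes the installed prometheus_client version selects the IPv6 address family when given "::", which is worth verifying for the pinned version.

```python
import ipaddress


def metrics_bind_address(pod_ip: str) -> str:
    """Pick a wildcard bind address that matches the pod IP's family."""
    if ipaddress.ip_address(pod_ip).version == 6:
        return "::"
    return "0.0.0.0"


def start_metrics_server(port: int, pod_ip: str) -> None:
    """Start the Prometheus metrics endpoint on the matching wildcard address."""
    # Imported lazily so the helper above is usable without prometheus_client.
    from prometheus_client import start_http_server

    start_http_server(port, addr=metrics_bind_address(pod_ip))
```

With the pod IP from the events above, `metrics_bind_address("2600:1f18:b39:ec10::611b")` yields "::", so netstat would be expected to show a listener on :::2112 rather than 0.0.0.0:2112. Exposing the bind address as a Helm value would be an alternative to detecting it.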

3. There is no built-in way to make DCGM pods listen on [::] instead of 0.0.0.0

The only way I could find to make DCGM pods bind on IPv6 interfaces was to extract the hard-coded args from the DCGM pod image and change the 0.0.0.0 arg to :: in the GPU operator Helm chart:

dcgm:
  args: ["-n", "-b", "::", "--log-level", "NONE", "-f", "-"]

This workaround does force DCGM pods to bind on IPv6 interfaces, but it's not particularly pretty. It would be nice if this was made easier, or just worked out of the box.

The end result of these issues is that it's not possible to use NVSentinel in IPv6-only clusters.

Component

Health Monitor

Steps to Reproduce

  1. Set up an IPv6-only cluster
  2. Install the NVIDIA GPU operator and NVSentinel
  3. Notice that multiple pods are stuck in CrashLoopBackOff due to the above issues.

Environment

  • NVSentinel version: v1.9.0
  • Kubernetes version: v1.35.2
  • Deployment method: Helm

Logs/Output

No response
