[Bug]: NVSentinel doesn't work in IPv6-only clusters #1407

### Prerequisites

- [x] I searched existing issues
- [x] I can reproduce this issue

### Code of Conduct

- [x] I agree to follow NVSentinel's Code of Conduct

### Bug Description

NVSentinel doesn't work in IPv6-only K8s clusters. There are a few reasons why:

**1. [DCGM client can't connect via hostnames that resolve to IPv6 addresses](https://github.com/NVIDIA/DCGM/issues/301)**

**2. The metrics server only binds to `0.0.0.0`, so liveness and readiness probes can't reach it in IPv6-only clusters**

The metrics server inside the `gpu-health-monitor` pods is started by calling `start_http_server` from `prometheus_client` without passing a value for the [optional `addr` argument](https://github.com/prometheus/client_python/blob/master/prometheus_client/exposition.py#L235), so it defaults to `0.0.0.0`. That is the IPv4 wildcard address, so the kubelet's probes, which target the pod's IPv6 address, get "connection refused".

Inside the `gpu-health-monitor` pods:
```
root@gpu-health-monitor-dcgm-4:/# netstat -tlpn | grep 2112
tcp        0      0 0.0.0.0:2112            0.0.0.0:*               LISTEN      1/python3
```

From the pod events:
```
Events:
  Type     Reason                           Age                      From     Message
  ----     ------                           ----                     ----     -------
  Warning  Unhealthy                        16m (x206 over 5h10m)    kubelet  spec.containers{gpu-health-monitor}: Liveness probe failed: Get "http://[2600:1f18:b39:ec10::611b]:2112/metrics": dial tcp [2600:1f18:b39:ec10::611b]:2112: connect: connection refused
  Warning  Unhealthy                        6m40s (x868 over 5h11m)  kubelet  spec.containers{gpu-health-monitor}: Readiness probe failed: Get "http://[2600:1f18:b39:ec10::611b]:2112/metrics": dial tcp [2600:1f18:b39:ec10::611b]:2112: connect: connection refused
  Normal   Killing                          5m14s (x72 over 5h9m)    kubelet  spec.containers{gpu-health-monitor}: Container gpu-health-monitor failed liveness probe, will be restarted
```
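
A possible fix, as a minimal sketch rather than the actual NVSentinel code: pass `addr` explicitly so the server binds to the IPv6 wildcard. The `METRICS_BIND_ADDR` environment variable and both function names below are hypothetical, and this assumes a `prometheus_client` version that picks the socket family from `addr`. Whether a `::` listener also accepts IPv4 connections depends on the node's `net.ipv6.bindv6only` setting.

```python
import ipaddress
import os


def resolve_bind_addr(env=os.environ, default="::"):
    """Return the IP literal the metrics server should bind to.

    METRICS_BIND_ADDR is a hypothetical variable name used for illustration;
    it is not an existing NVSentinel setting.
    """
    # Accept the bracketed form "[::]" as well as the bare "::".
    addr = env.get("METRICS_BIND_ADDR", default).strip("[]")
    # Raises ValueError if the value is not a valid IPv4 or IPv6 literal.
    ipaddress.ip_address(addr)
    return addr


def start_metrics_server(port=2112):
    # Imported here so resolve_bind_addr() works without prometheus_client installed.
    from prometheus_client import start_http_server

    # Without addr, start_http_server falls back to "0.0.0.0" (IPv4 only).
    start_http_server(port, addr=resolve_bind_addr())
```

With something like this in place, `netstat -tlpn | grep 2112` inside the pod should show the listener on `:::2112` instead of `0.0.0.0:2112`.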

**3. No built-in way to make DCGM pods listen on `[::]` instead of `0.0.0.0`**

The only way I could find to make DCGM pods bind on IPv6 interfaces was to extract the hard-coded args from the DCGM pod image and override them in the GPU Operator Helm chart values, changing the `0.0.0.0` arg to `::`:
```yaml
dcgm:
  args: ["-n", "-b", "::", "--log-level", "NONE", "-f", "-"]
```

This workaround does force DCGM pods to bind on IPv6 interfaces, but it's not particularly pretty, since it means duplicating the image's hard-coded args in the chart values. It would be nice if this were easier to configure, or just worked out of the box.

The end result of these issues is that it's not possible to use NVSentinel in IPv6-only clusters.

### Component

Health Monitor

### Steps to Reproduce

1. Set up an IPv6-only cluster
2. Install the NVIDIA GPU Operator and NVSentinel
3. Notice that multiple pods are stuck in `CrashLoopBackOff` due to the above issues.

### Environment

- NVSentinel version: v1.9.0
- Kubernetes version: v1.35.2
- Deployment method: Helm


### Logs/Output

_No response_
