Parsing default IMEX info fails for legacy images #797

Open
astefanutti opened this issue Nov 14, 2024 · 8 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments

@astefanutti

Since the latest 1.17.x versions, containers using images detected as "legacy" that do not have the NVIDIA_IMEX_CHANNELS environment variable set fail to start with the following error:

Error: container create failed: time="2024-11-13T16:24:41Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: error parsing IMEX info: unsupported IMEX channel value: all\n" 

It seems the NVIDIA_IMEX_CHANNELS environment variable defaults to all here for "legacy" images:

return NewVisibleDevices("all")

That value cannot be parsed by https://github.com/NVIDIA/libnvidia-container/blob/63d366ee3b4183513c310ac557bf31b05b83328f/src/cli/common.c#L446.

One occurrence of this issue has been reported in pytorch/test-infra#5852, for example.

This case should ideally be handled more gracefully.
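
For illustration only, here is a minimal Go sketch (with hypothetical names, not the toolkit's actual API) of the kind of graceful handling requested: accept comma-separated numeric channel IDs, and treat "all" or an empty value as "no explicit IMEX channels" instead of aborting container start.

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parseIMEXChannels converts an NVIDIA_IMEX_CHANNELS value into channel IDs.
    // "all" and "" are treated as "no explicit channels requested" rather than
    // being rejected, so legacy images without the variable keep starting.
    func parseIMEXChannels(value string) ([]int, error) {
        if value == "" || value == "all" {
            return nil, nil
        }
        var channels []int
        for _, field := range strings.Split(value, ",") {
            id, err := strconv.Atoi(strings.TrimSpace(field))
            if err != nil || id < 0 {
                return nil, fmt.Errorf("unsupported IMEX channel value: %q", field)
            }
            channels = append(channels, id)
        }
        return channels, nil
    }

    func main() {
        // "all" and "" parse to no channels; only malformed values error out.
        for _, v := range []string{"all", "", "0", "0,1", "bogus"} {
            ids, err := parseIMEXChannels(v)
            fmt.Printf("%q -> %v, err=%v\n", v, ids, err)
        }
    }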

@higi

higi commented Nov 15, 2024

Can anyone help me with this?

2024-11-15T11:55:12Z create container gshaibi/gpu-burn:latest
2024-11-15T11:55:13Z latest Pulling from gshaibi/gpu-burn
2024-11-15T11:55:13Z Digest: sha256:ed07993b0581228c2bd7113fae0ed214549547f0fa91ba50165bc2473cfaf979
2024-11-15T11:55:13Z Status: Image is up to date for gshaibi/gpu-burn:latest
2024-11-15T11:55:14Z start container for gshaibi/gpu-burn:latest: begin
2024-11-15T11:55:14Z error starting container: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown
2024-11-15T11:55:30Z start container for gshaibi/gpu-burn:latest: begin
2024-11-15T11:55:30Z error starting container: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown

Tested on NVIDIA Container Toolkit CLI version 1.17.1.

@markjolah

Possible workaround (WAR): set NVIDIA_IMEX_CHANNELS to 0 or an empty string.

docker run ... -e NVIDIA_IMEX_CHANNELS=0 ...

Or, for a Kubernetes Pod spec, set:

    env:
    - name: NVIDIA_IMEX_CHANNELS
      value: "0"

@elezar
Member

elezar commented Nov 15, 2024

We have just released v1.17.2 that should address this issue. Please let us know if the problem persists.
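
If you are unsure which version is installed, on apt-based systems you can check and upgrade roughly like this (package names assume the standard NVIDIA apt repository setup):

    apt list --installed 2>/dev/null | grep nvidia-container-toolkit
    sudo apt-get update && sudo apt-get install --only-upgrade nvidia-container-toolkit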

elezar added the bug label on Nov 15, 2024
@higi

higi commented Nov 16, 2024

> We have just released v1.17.2 that should address this issue. Please let us know if the problem persists.

I am now on v1.17.2, but I think this is a different problem. nvidia-smi shows all GPUs without any problem.

2024-11-16T09:26:40.108376033Z Failed to initialize NVML: Unknown Error
2024-11-16T09:26:40.207661945Z terminate called after throwing an instance of 'std::string'
2024-11-16T09:26:40.302441418Z No CUDA devices
2024-11-16T09:26:45.770657675Z Failed to initialize NVML: Unknown Error
2024-11-16T09:26:45.855912077Z terminate called after throwing an instance of 'std::string'
2024-11-16T09:26:45.957591526Z No CUDA devices

@higi

higi commented Nov 16, 2024

Fixed it by editing /etc/nvidia-container-runtime/config.toml (sudo vim) and changing no-cgroups to false, then saving.
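
For reference, the relevant setting lives under the [nvidia-container-cli] section of that file; after the change it should look roughly like this (other settings omitted):

    [nvidia-container-cli]
    no-cgroups = false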

I think version 1.17.2 fixed the problem with the IMEX channel. Many thanks for the quick fix!

@elezar
Member

elezar commented Nov 29, 2024

@higi do you know why no-cgroups was set to true?

@higi

higi commented Nov 29, 2024

> @higi do you know why no-cgroups was set to true?

I think it was set just for an NVML test, for AI tools. The NVML test doesn't work without setting this.

This script fixed it: wget https://raw.githubusercontent.com/jjziets/vasttools/main/nvml_fix.py. Anyway, the IMEX channel error was fixed by your fix.

@liatamax

liatamax commented Feb 4, 2025

I'd like to confirm that this IMEX issue still exists with my nvidia-container-toolkit version 1.17.4-1; I ran into it while installing NVIDIA Cloud Native Stack v14.

And /etc/nvidia-container-runtime/config.toml has "no-cgroups = false" commented out.

(base) user@h100:~$ apt list --installed | grep toolkit

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-container-toolkit-base/unknown,now 1.17.4-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.17.4-1 amd64 [installed]

I am following the instructions from NVIDIA - https://github.com/NVIDIA/cloud-native-stack/blob/master/install-guides/Ubuntu-22-04_Server_Developer-x86-arm64_v14.0.md#Validate-NVIDIA-Cloud-Native-Stack-with-an-application-from-NGC

And the second validation task runs into this IMEX error with the Kubernetes YAML file suggested at the link above:

(base) user@h100:~$ cat k8-pod-cuda-vector-add.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-imex
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"

And the pod fails with the following error:

user@h100:~/Downloads/NVIDIA/Github/ACE/workflows/tokkio/scripts/one-click/baremetal$ kubectl describe pod cuda-vector-add
Name:             cuda-vector-add
Namespace:        default
Priority:         0
Service Account:  default
Node:             h100/172.30.1.74
Start Time:       Tue, 04 Feb 2025 01:24:44 +0000
Labels:
Annotations:      cni.projectcalico.org/containerID: fcd7cde8524532b96fa03d662cc2fbd7cb1fbcb7d2e70b53d03c2a77917094d4
                  cni.projectcalico.org/podIP: 192.168.35.61/32
                  cni.projectcalico.org/podIPs: 192.168.35.61/32
Status:           Running
IP:               192.168.35.61
IPs:
  IP:  192.168.35.61
Containers:
  cuda-vector-add:
    Container ID:  containerd://0fa1cd4479d54d83b96fafc2d26000eb3001bd7cef2ef8d83bc89423bbf06c99
    Image:         k8s.gcr.io/cuda-vector-add:v0.1
    Image ID:      k8s.gcr.io/cuda-vector-add@sha256:0705cd690bc0abf54c0f0489d82bb846796586e9d087e9a93b5794576a456aea
    Port:
    Host Port:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: error parsing IMEX info: unsupported IMEX channel value: all: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Tue, 04 Feb 2025 01:25:28 +0000
    Ready:          False
    Restart Count:  3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cctr4 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-cctr4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:        BestEffort
Node-Selectors:
Tolerations:      node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                  node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ---                ----               -------
  Normal   Scheduled  60s                default-scheduler  Successfully assigned default/cuda-vector-add to h100
  Normal   Pulled     16s (x4 over 59s)  kubelet            Container image "k8s.gcr.io/cuda-vector-add:v0.1" already present on machine
  Normal   Created    16s (x4 over 59s)  kubelet            Created container cuda-vector-add
  Warning  Failed     16s (x4 over 59s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: error parsing IMEX info: unsupported IMEX channel value: all: unknown
  Warning  BackOff    1s (x6 over 57s)   kubelet            Back-off restarting failed container cuda-vector-add in pod cuda-vector-add_default(431d0013-c58e-4ece-b69a-a79cfb579438)

The fix is to add the following env section to the Kubernetes YAML file, under the container entry:

    env:
    - name: NVIDIA_IMEX_CHANNELS
      value: "0"
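
Putting it together, the patched manifest would look roughly like this (the same Pod as above, with only the env section added):

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add-imex
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "k8s.gcr.io/cuda-vector-add:v0.1"
        env:
        - name: NVIDIA_IMEX_CHANNELS
          value: "0"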

(Side note unrelated to this project: this particular pod still fails with exit code 137 when I launch it, executing the /usr/local/cuda-8.0/samples/0_Simple/vectorAdd command.)

root@cuda-vector-add-imex:/usr/local/cuda-8.0/samples/0_Simple/vectorAdd# env | grep NVIDIA
NVIDIA_IMEX_CHANNELS=0
root@cuda-vector-add-imex:/usr/local/cuda-8.0/samples/0_Simple/vectorAdd# command terminated with exit code 137
