Parsing default IMEX info fails for legacy images #797
Comments
Can anyone help me with this?
2024-11-15T11:55:12Z create container gshaibi/gpu-burn:latest
Tested on NVIDIA Container Toolkit CLI version 1.17.1
Possible WAR: set the NVIDIA_IMEX_CHANNELS environment variable explicitly when launching the container. Or for a k8s Pod Spec, set it there (see the sketch below):
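A minimal sketch of what such a Pod Spec fragment could look like, assuming the workaround is to set NVIDIA_IMEX_CHANNELS to an empty value; the container name is only illustrative and the image is taken from the earlier comment:

```yaml
# Hypothetical Pod Spec fragment: explicitly set NVIDIA_IMEX_CHANNELS
# so the runtime does not fall back to the default it injects for legacy images.
spec:
  containers:
    - name: cuda-container            # illustrative name
      image: gshaibi/gpu-burn:latest
      env:
        - name: NVIDIA_IMEX_CHANNELS
          value: ""                   # assumption: an empty value avoids the unparsable default
```

For plain Docker, the equivalent would be passing the same variable with `-e` when launching the container.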
We have just released v1.17.2, which should address this issue. Please let us know if the problem persists.
Now I am on the new version, but I think this is another problem. nvidia-smi shows all GPUs without a problem, yet:
2024-11-16T09:26:40.108376033Z Failed to initialize NVML: Unknown Error
Fixed it by:
I think version 1.17.2 fixed the problem with the IMEX channel. Many thanks for the quick fix!!
@higi do you know why
I think it's just for the NVML test, for AI tools. The NVML test doesn't work without setting this. This script fixed it: wget https://raw.githubusercontent.com/jjziets/vasttools/main/nvml_fix.py. Anyway, the IMEX channel error was fixed by your fix.
I'd like to confirm that this issue with IMEX still exists in my nvidia-container-toolkit version 1.17.4-1, which I ran into while installing the NVIDIA Cloud Native Stack v14. /etc/nvidia-container-runtime/config.toml has "no-cgroups = false" commented out.

(base) user@h100:~$ apt list --installed | grep toolkit
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
nvidia-container-toolkit-base/unknown,now 1.17.4-1 amd64 [installed,automatic]

I am following the instructions from NVIDIA: https://github.com/NVIDIA/cloud-native-stack/blob/master/install-guides/Ubuntu-22-04_Server_Developer-x86-arm64_v14.0.md#Validate-NVIDIA-Cloud-Native-Stack-with-an-application-from-NGC

The 2nd validation task runs into this IMEX error with the suggested k8s yaml file from the link above:

(base) user@h100:~$ cat k8-pod-cuda-vector-add.yaml

The pod then fails with the following error:

user@h100:~/Downloads/NVIDIA/Github/ACE/workflows/tokkio/scripts/one-click/baremetal$ kubectl describe pod cuda-vector-add
Normal Scheduled 60s default-scheduler Successfully assigned default/cuda-vector-add to h100

The fix is to add the NVIDIA_IMEX_CHANNELS env section to the k8 yaml file (a sketch follows this comment). (Side note unrelated to this project: this particular pod still fails with error code 137 when I launch it to execute the /usr/local/cuda-8.0/samples/0_Simple/vectorAdd command.)

root@cuda-vector-add-imex:/usr/local/cuda-8.0/samples/0_Simple/vectorAdd# env | grep NVIDIA
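The exact section added is not quoted above; below is a sketch of what the amended k8-pod-cuda-vector-add.yaml might look like. The pod name, image, restart policy, and resource limits are assumptions based on the usual cuda-vector-add sample; only the env block is the change relevant to this issue:

```yaml
# Sketch of an amended cuda-vector-add pod spec (details assumed, not taken
# from the original comment); the env block is the workaround for the IMEX
# channel parse failure.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-imex
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"   # assumed sample image
      env:
        - name: NVIDIA_IMEX_CHANNELS             # explicit value overrides the legacy default
          value: ""
      resources:
        limits:
          nvidia.com/gpu: 1
```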
Since the latest 1.17.x versions, containers with images considered "legacy" that do not have the NVIDIA_IMEX_CHANNELS environment variable set fail to start with the following error:

It seems the NVIDIA_IMEX_CHANNELS environment variable is defaulted to "all" for "legacy" images here:
nvidia-container-toolkit/internal/config/image/cuda_image.go
Line 145 in 1995925
which cannot be parsed by https://github.com/NVIDIA/libnvidia-container/blob/63d366ee3b4183513c310ac557bf31b05b83328f/src/cli/common.c#L446.

An occurrence of that issue has been reported here, for example: pytorch/test-infra#5852.

That case should ideally be handled more gracefully.