
NVIDIA_DRIVER_CAPABILITIES=graphics is broken on Jetson devices (1.17.1 or later) #795

Open
yeongrokgim opened this issue Nov 13, 2024 · 7 comments


yeongrokgim commented Nov 13, 2024

Summary

On Jetson (aarch64, Tegra SoC) devices, version 1.17.1 does not create containers properly if the environment variable NVIDIA_DRIVER_CAPABILITIES contains any of the values display, graphics, or all.

This can be mitigated by overriding the container environment, for example: docker run -e NVIDIA_DRIVER_CAPABILITIES=compute nvcr.io/....
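
For example, a full invocation that works around the issue (a sketch using the ubuntu:22.04 image from the reproduction steps below and one of the capability sets listed as Good in the results table):

    docker run -it --rm \
        -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
        -e NVIDIA_VISIBLE_DEVICES=all \
        ubuntu:22.04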

Steps to reproduce

  1. Get a Jetson device. I tested with Xavier AGX and Orin AGX DevKits as a reference.

  2. Install Docker runtime and nvidia-container-runtime=1.17.1-1

  3. Ensure the NVIDIA container runtime is configured (a verification snippet follows these steps). To configure, run
    sudo nvidia-ctk runtime configure --set-as-default

  4. Try running a container. For example, the l4t-base image can be used:

    docker run -it --rm \
        -e NVIDIA_DRIVER_CAPABILITIES=all \
        nvcr.io/nvidia/l4t-base:r36.2.0

    Or, even with a non-Jetson base image:

    docker run -it --rm \
        -e NVIDIA_DRIVER_CAPABILITIES=display \
        -e NVIDIA_VISIBLE_DEVICES=all \
        ubuntu:22.04
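
To verify the configuration from step 3 (a minimal check assuming the default Docker setup, where nvidia-ctk writes /etc/docker/daemon.json and Docker runs under systemd):

    sudo systemctl restart docker     # apply the configuration if it was just changed
    docker info | grep -i runtime     # nvidia should be listed and shown as the default runtime
    cat /etc/docker/daemon.json       # should contain the nvidia runtime entry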

Result

Example of error message

$ docker run -it --rm -e NVIDIA_DRIVER_CAPABILITIES=display -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="2024-11-13T17:38:55+09:00" level=info msg="Symlinking /var/lib/docker/overlay2/8af1b1d84ee57db598be489bb9ad58fb2d139b77604aead77526787d18a02900/merged/etc/vulkan/icd.d/nvidia_icd.json to /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json"
time="2024-11-13T17:38:55+09:00" level=error msg="failed to create link [/usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json /etc/vulkan/icd.d/nvidia_icd.json]: failed to create symlink: failed to remove existing file: remove /var/lib/docker/overlay2/8af1b1d84ee57db598be489bb9ad58fb2d139b77604aead77526787d18a02900/merged/etc/vulkan/icd.d/nvidia_icd.json: device or resource busy": unknown.

| Hardware | JetPack | nvidia-container-toolkit | NVIDIA_DRIVER_CAPABILITIES | Result |
|---|---|---|---|---|
| Orin AGX | 6.1 | 1.14.2 | all | Good |
| Orin AGX | 6.1 | 1.17.1 | all | Error |
| Orin AGX | 6.1 | 1.17.1 | compute,utility | Good |
| Orin AGX | 6.1 | 1.17.1 | display | Error |
| Orin AGX | 6.1 | 1.17.1 | graphics | Error |
| Xavier AGX | 5.1.2 | 1.16.1 | all | Good |
| Xavier AGX | 5.1.2 | 1.16.1 | graphics | Good |
| Xavier AGX | 5.1.2 | 1.17.1 | all | Error |
| Xavier AGX | 5.1.2 | 1.17.1 | compute | Good |
| Xavier AGX | 5.1.2 | 1.17.1 | display | Error |
| Xavier AGX | 5.1.2 | 1.17.1 | graphics | Error |

robcowie commented Nov 13, 2024

I can confirm this behaviour on the following additional env:

| Hardware | JetPack | nvidia-container-toolkit | NVIDIA_DRIVER_CAPABILITIES | Result |
|---|---|---|---|---|
| Orin AGX | 5.1 (L4T 35.2.1) | 1.17.1 | all | Error |
| Orin AGX | 5.1 (L4T 35.2.1) | 1.16.2 | all | Good |

Both on Ubuntu 20.04, Docker version 27.3.1.

The failing symlink happens to be the first sym declaration in /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv. Removing that entry causes the container run to fail at the next symlink, suggesting the fault lies not with that specific file but with something more fundamental.
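
For reference, the symlink entries the hook processes can be listed straight from that file (assuming the stock l4t.csv prefixes its symlink declarations with sym, as described above):

grep '^sym' /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv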

I suspect that somewhere in v1.16.2...v1.17.1 there is a change to symlink handling that has broken this functionality.


YasharSL commented Nov 18, 2024

Facing the same issue with:

| Hardware | JetPack | L4T | nvidia-container-toolkit |
|---|---|---|---|
| Orin AGX | 5.1.1 | 35.3.1 | 1.17.2 |

Also, I think it's worth mentioning that I have both CUDA 11.8 and 11.4 on my Jetson. When I run the nvcr.io/nvidia/pytorch:22.12-py3 image (which has CUDA 11.8 support) with the NVIDIA container toolkit runtime, it works fine, but my other images, which used to work with previous versions and include CUDA 11.4, show this same error.

Temporary Fix

Downgraded the container toolkit to 1.16.2 with the following steps:

sudo apt purge nvidia-container-toolkit

sudo apt-get install -y --allow-downgrades nvidia-container-toolkit-base=1.16.2-1

sudo apt-get install -y --allow-downgrades nvidia-container-toolkit=1.16.2-1
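
Optionally, holding the packages keeps unattended upgrades from pulling the toolkit back up to a broken version (an extra step on top of the downgrade itself):

sudo apt-mark hold nvidia-container-toolkit nvidia-container-toolkit-base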

yeongrokgim changed the title from "1.17.1 - NVIDIA_DRIVER_CAPABILITIES=graphics is broken on Jetson devices" to "NVIDIA_DRIVER_CAPABILITIES=graphics is broken on Jetson devices (1.17.1 or later)" on Nov 19, 2024
@mcasasola

I am also experiencing this issue on my Jetson device. Here are the details of my setup:

Hardware: Jetson Orin 16GB
JetPack Version: 5.1.1 (L4T 35.3.1)
NVIDIA Container Toolkit Version: 1.17.2-1
When I attempt to run a container using the NVIDIA runtime, I receive the following error message:

sudo docker run --rm --runtime=nvidia shibenyong/devicequery ./deviceQuery
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="2024-11-21T16:37:37-03:00" level=info msg="Symlinking /mnt/storage/docker/overlay2/9289f0d60214918d874fb047d047dd9b8fa01f89d8332a26c25ba071a9af599d/merged/etc/vulkan/icd.d/nvidia_icd.json to /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json"
time="2024-11-21T16:37:37-03:00" level=error msg="failed to create link [/usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json /etc/vulkan/icd.d/nvidia_icd.json]: failed to create symlink: failed to remove existing file: remove /mnt/storage/docker/overlay2/9289f0d60214918d874fb047d047dd9b8fa01f89d8332a26c25ba071a9af599d/merged/etc/vulkan/icd.d/nvidia_icd.json: device or resource busy": unknown.

After downgrading to nvidia-container-toolkit version 1.15.0-1, the container runs successfully:


sudo docker run --rm --runtime=nvidia shibenyong/devicequery ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.4 / 10.2
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 15389 MBytes (16136331264 bytes)
  (rest of the output)
Result = PASS

To resolve the issue, I only needed to purge nvidia-container-toolkit-base and nvidia-container-toolkit, and install version 1.15.0-1 of both. Here are the steps I followed:

sudo apt-get remove --purge nvidia-container-toolkit nvidia-container-toolkit-base
sudo apt-get install nvidia-container-toolkit=1.15.0-1 nvidia-container-toolkit-base=1.15.0-1
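
(For reference, the versions available from the configured apt repositories can be listed with apt-cache madison, which is useful when choosing a version to pin.)

apt-cache madison nvidia-container-toolkit nvidia-container-toolkit-base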

After downgrading, the containers are running correctly using the NVIDIA runtime.

Summary of my findings:

Hardware: Jetson Orin 16GB
JetPack Version: 5.1.1
NVIDIA Container Toolkit:
1.17.2-1: Error when running containers with the NVIDIA runtime
1.15.0-1: Works correctly when running containers with the NVIDIA runtime
It appears that the issue persists in version 1.17.2-1 on JetPack 5.1.1. Downgrading to an earlier version of the NVIDIA Container Toolkit resolves the problem. Note that it's sufficient to downgrade only nvidia-container-toolkit and nvidia-container-toolkit-base to version 1.15.0-1; there's no need to purge or downgrade other NVIDIA packages.

I hope this information helps in identifying and fixing the bug.

elezar self-assigned this Nov 25, 2024
elezar added the bug label Nov 25, 2024

Chao-Yao commented Dec 8, 2024

Facing the same issue.
Hardware: Jetson Orin NX
JetPack 5.1.1
nvidia-container-toolkit 1.17.2-1


rgobbel commented Jan 12, 2025

I don't think the problem is in the environment variable. I can run a vanilla Ubuntu image with any value of NVIDIA_DRIVER_CAPABILITIES and it runs without a problem. It seems more likely that the issue is with the specific symlink to nvidia_icd.json. I see in the changelogs that there were similar errors related to symlinks to other files, fixed in 1.17.1.

I suspect that a more general fix to symlink handling is needed to prevent the issue from just moving on to the next file that needs a symlink.


rgobbel commented Jan 13, 2025

Using git bisect, I narrowed the problem down to commit 7e0cd45b, "Check for valid paths in create-symlinks hook". At the revision immediately prior to that, I don't see the problem, but obviously (given the message on that problematic commit) there's something not quite right in that whole series, which branched off from main at d78868cd, then was merged back in at 8c9d3d8f.
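
For anyone who wants to reproduce the bisection, a sketch of the steps (assuming a local clone of nvidia-container-toolkit and the release tags v1.16.2 and v1.17.1 reported as good/bad earlier in this thread):

git clone https://github.com/NVIDIA/nvidia-container-toolkit
cd nvidia-container-toolkit
git bisect start
git bisect bad v1.17.1     # first version reported broken in this issue
git bisect good v1.16.2    # last version reported working in this issue
# for each revision git checks out: build and install the toolkit, re-run the
# failing docker command from the issue description, then mark the revision:
git bisect good            # or: git bisect bad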


rgobbel commented Jan 14, 2025

I chased this a bit harder, put a bunch of print statements into the createLink code in create-symlinks.go, and got this:

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running createContainer hook #1: exit status 1, stdout: 

in createLink, containerRoot=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged, targetPath=libgstnvcustomhelper.so.1.0.0, link=/usr/lib/aarch64-linux-gnu/nvidia/libgstnvcustomhelper.so, linkPath=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/usr/lib/aarch64-linux-gnu/nvidia/libgstnvcustomhelper.so
in linkExists, target=libgstnvcustomhelper.so.1.0.0, link=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/usr/lib/aarch64-linux-gnu/nvidia/libgstnvcustomhelper.so, currentTarget=libgstnvcustomhelper.so.1.0.0

in createLink, containerRoot=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged, targetPath=libgstnvdsseimeta.so.1.0.0, link=/usr/lib/aarch64-linux-gnu/nvidia/libgstnvdsseimeta.so, linkPath=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/usr/lib/aarch64-linux-gnu/nvidia/libgstnvdsseimeta.so
in linkExists, target=libgstnvdsseimeta.so.1.0.0, link=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/usr/lib/aarch64-linux-gnu/nvidia/libgstnvdsseimeta.so, currentTarget=libgstnvdsseimeta.so.1.0.0

in createLink, containerRoot=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged, targetPath=/usr/lib/aarch64-linux-gnu/nvidia/nvidia_icd.json, link=/etc/vulkan/icd.d/nvidia_icd.json, linkPath=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/etc/vulkan/icd.d/nvidia_icd.json
in linkExists, target=/usr/lib/aarch64-linux-gnu/nvidia/nvidia_icd.json, link=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/etc/vulkan/icd.d/nvidia_icd.json, currentTarget=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/etc/vulkan/icd.d/nvidia_icd.json
, stderr: time="2025-01-14T01:01:21-08:00" level=info msg="Symlinking /var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/etc/vulkan/icd.d/nvidia_icd.json to /usr/lib/aarch64-linux-gnu/nvidia/nvidia_icd.json"
time="2025-01-14T01:01:21-08:00" level=error msg="failed to create link [/usr/lib/aarch64-linux-gnu/nvidia/nvidia_icd.json /etc/vulkan/icd.d/nvidia_icd.json]: failed to create symlink: failed to remove existing file: remove /var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/etc/vulkan/icd.d/nvidia_icd.json: device or resource busy, resolvedLinkPath=/var/lib/docker/overlay2/73904084e3cc987bb8889cacf556f5b94a3133da3460001807dfb1318d040e08/merged/etc/vulkan/icd.d/nvidia_icd.json": unknown

For some reason, the symlink that fails has a target that's just the symlink itself, rather than the real target. My understanding stops at that point.
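
One way to see the difference between a raw symlink target and a fully resolved path from the shell (an illustration only; the overlay path is taken from the log above and <overlay-id> is a placeholder that differs per container):

readlink /etc/vulkan/icd.d/nvidia_icd.json
# prints the raw target if the host path is a symlink; prints nothing for a regular file

readlink -f /var/lib/docker/overlay2/<overlay-id>/merged/etc/vulkan/icd.d/nvidia_icd.json
# prints the fully resolved path; for a regular (e.g. bind-mounted) file this is the path
# itself, which would match the currentTarget printed in the failing case above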
