LXC hook doesn't seem to work on OpenSUSE Leap #711

Open
javiertoledos opened this issue Sep 24, 2024 · 1 comment

@javiertoledos

Context of the issue here: https://discuss.linuxcontainers.org/t/nvidia-hook-not-working-with-opensuse-leap-15-6/21686

Basically, when Incus invokes the Nvidia hook defined in LXC, the hook returns a non-zero exit status. After configuring /etc/nvidia-container-runtime/config.toml to output debugging information, I get the following logs:

-- WARNING, the following logs are for debugging purposes only --

I0924 03:55:25.997466 4 nvc.c:393] initializing library context (version=1.16.1, build=4c2494f16573b585788a42e9c7bee76ecd48c73d)
I0924 03:55:25.997530 4 nvc.c:364] using root /
I0924 03:55:25.997544 4 nvc.c:365] using ldcache /etc/ld.so.cache
I0924 03:55:25.997557 4 nvc.c:366] using unprivileged user 0:0
I0924 03:55:25.997586 4 nvc.c:410] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0924 03:55:25.997669 4 nvc.c:412] dxcore initialization failed, continuing assuming a non-WSL environment
I0924 03:55:25.998002 21 rpc.c:71] starting driver rpc service
I0924 03:55:26.004017 4 rpc.c:135] driver rpc service terminated with signal 15
I0924 03:55:26.004082 4 nvc.c:452] shutting down library context

The log says only that the driver rpc service terminated with signal 15 (SIGTERM), and nothing more. I'm not sure how to troubleshoot this, and at this point I cannot tell whether it's a bug in the Nvidia container toolkit or a problem with how the toolkit and the Nvidia hook work together.
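For anyone checking a longer debug log for the same symptom, here is a minimal Python sketch that pulls out "terminated with signal N" lines and maps the number to its signal name. The regular expression and the function name `find_signal_terminations` are assumptions made for illustration, based only on the log format pasted above; they are not part of the Nvidia container toolkit.

```python
import re
import signal

# Assumed pattern, derived from the log line format shown above.
TERMINATED_RE = re.compile(
    r"\] (?P<service>.+?) terminated with signal (?P<signum>\d+)"
)


def find_signal_terminations(log_text):
    """Return (service, signal number, signal name) for each matching line."""
    results = []
    for line in log_text.splitlines():
        match = TERMINATED_RE.search(line)
        if match is None:
            continue
        signum = int(match.group("signum"))
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Number does not correspond to a signal known on this platform.
            name = "unknown"
        results.append((match.group("service"), signum, name))
    return results


# Two lines copied from the log in this issue.
sample = """\
I0924 03:55:25.998002 21 rpc.c:71] starting driver rpc service
I0924 03:55:26.004017 4 rpc.c:135] driver rpc service terminated with signal 15
"""

print(find_signal_terminations(sample))
# → [('driver rpc service', 15, 'SIGTERM')]
```

This only confirms which signal was reported. It does not show which process sent it or why.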

I tried using Podman with CDI and succeeded in running CUDA workloads inside a container, so it doesn't seem to be a driver problem in the first place. I also tried different driver versions.

@javiertoledos
Author

javiertoledos commented Sep 26, 2024

I discovered that when the hook is run from Incus, the toolkit returns an insufficient permissions error:

 exec nvidia-container-cli --debug=/tmp/nvidia.log --user configure --no-cgroups --ldconfig=@/sbin/ldconfig --compute --utility /usr/lib64/lxc/rootfs
nvidia-container-cli: initialization error: nvml error: insufficient permissions
