
Precompiled Driver Container for Linux Kernel other than 5.15 does not exist #1203

Open
utsumi-fj opened this issue Jan 16, 2025 · 5 comments


@utsumi-fj

The Precompiled Driver Containers documentation (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#limitations-and-restrictions) describes the following limitations and restrictions:

Limitations and Restrictions

Support for deploying the driver containers with precompiled drivers is limited to hosts with the Ubuntu 22.04 operating system and x86_64 architecture.

Although no restriction on the Linux kernel version is documented, precompiled driver containers for kernels other than 5.15 do not exist in https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags.

According to the kernel release schedule at https://ubuntu.com/kernel/lifecycle, newer Linux kernel versions, e.g. 6.8, are available for Ubuntu 22.04.

[Image: Ubuntu kernel release schedule from https://ubuntu.com/kernel/lifecycle]

Precompiled driver containers are useful because they avoid installing a local package repository when installing the GPU Operator in an air-gapped environment, so it would be better if precompiled driver containers for newer Linux kernels existed.
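
For context, the deployment that hits this is the precompiled-driver path of the GPU Operator Helm chart. A minimal install sketch, assuming the documented `driver.usePrecompiled` and `driver.version` chart values; the release name, namespace, and the `550` driver branch are placeholders:

```shell
# Minimal sketch: install the GPU Operator with precompiled drivers enabled.
# Release name, namespace, and the "550" driver branch below are placeholders;
# check the chart values for your GPU Operator version.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.usePrecompiled=true \
  --set driver.version="550"
```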

@justinthelaw

@utsumi-fj When you try to deploy the precompiled driver containers to a newer Linux kernel, like 6.8.x or 6.9.x, what happens?

@utsumi-fj
Author

@justinthelaw A non-existent container image, such as nvcr.io/nvidia/driver:550-6.8.0-50-generic-ubuntu22.04, is referenced; the pull fails and the driver pod ends up in an ImagePullBackOff error.
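
For anyone reproducing this, a quick way to see the resolved image reference that fails to pull; the `gpu-operator` namespace and the `app=nvidia-driver-daemonset` label are assumptions and may differ per install:

```shell
# Sketch: show the ImagePullBackOff status and the resolved driver image tag.
# Namespace and label selector are assumptions; adjust to your install.
kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset
kubectl -n gpu-operator describe pod -l app=nvidia-driver-daemonset | grep "Image:"
```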

@justinthelaw

justinthelaw commented Feb 25, 2025

Tell me if this is a slightly different problem, but the gpu-operator mutates the final image tag, which causes it to be incorrect. This can be seen in the final driver daemonset resource when it is described. The resolved reference basically follows this pattern (sketched below from the example tag above):
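
```shell
# Resolved image reference pattern (inferred from the example tag above; <...> are placeholders):
nvcr.io/nvidia/driver:<DRIVER_BRANCH>-<KERNEL_VERSION>-<OS_TAG>
# e.g. nvcr.io/nvidia/driver:550-6.8.0-50-generic-ubuntu22.04
```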

I have been working around this problem by directly patching the tag back to the actual image tag and then restarting the pod that has the image pull error, roughly as sketched below. I am still testing things, as I am currently producing a kernel-mismatch workaround, so this may not be the right way to do things.
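
A minimal sketch of that workaround, assuming the `nvidia-driver-daemonset` daemonset and `nvidia-driver-ctr` container names the GPU Operator typically creates (verify with `kubectl get ds -n gpu-operator`; the operator may also revert the patch on its next reconcile):

```shell
# Sketch: point the driver daemonset back at an image tag that actually exists,
# then delete the failing pod so it gets recreated with the corrected image.
# Daemonset/container names and namespace are assumptions; the <...> values are placeholders.
kubectl -n gpu-operator set image daemonset/nvidia-driver-daemonset \
  nvidia-driver-ctr=<registry>/driver:<existing-tag>
kubectl -n gpu-operator delete pod -l app=nvidia-driver-daemonset
```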

As for the kernel mismatch, I am building the precompiled driver container myself and modifying the Dockerfile and nvidia-driver script that NVIDIA publishes (via git sparse checkout and sed, for now; see the sketch below). Here is my upstream issue regarding the kernel mismatch: https://gitlab.com/nvidia/container-images/driver/-/issues/56
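
Roughly, the checkout-and-edit step looks like this. The `ubuntu22.04/precompiled` path is an assumption about the upstream repo layout, and the sed line is only a placeholder for whatever substitutions your kernel/driver combination needs:

```shell
# Sketch: sparse-checkout only the Ubuntu 22.04 bits of the upstream driver image repo,
# then edit the Dockerfile / nvidia-driver script for the target kernel.
git clone --filter=blob:none --sparse https://gitlab.com/nvidia/container-images/driver.git
cd driver
git sparse-checkout set ubuntu22.04
# Placeholder edit only -- the real substitutions depend on your target kernel:
sed -i 's/<old-kernel-string>/<target-kernel-string>/g' ubuntu22.04/precompiled/Dockerfile
```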

@justinthelaw

CC: @tariq1890 @cdesiniotis Do you have any further insight into why these two issues show up in the gpu-operator's precompiled drivers deployment?

@justinthelaw

justinthelaw commented Mar 20, 2025

@utsumi-fj As a final follow-up, I got everything working on two Linux kernels other than 5.15.x, namely 6.8.0-51 and 6.8.1-52, on Ubuntu 22.04.x. Please refer to the upstream driver image repository instructions/code (https://gitlab.com/nvidia/container-images/driver) and my related issue regarding Linux kernels in their scripting: https://gitlab.com/nvidia/container-images/driver/-/issues/56
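
For completeness, a build-and-push sketch in the spirit of the upstream repo's instructions. The Dockerfile location and the `DRIVER_BRANCH`, `KERNEL_VERSION`, and `CUDA_VERSION` build args are assumptions; confirm the exact names and paths in the repo's README before relying on this:

```shell
# Sketch: build and push a precompiled driver image for a specific Ubuntu 22.04 kernel.
# Build-arg names, Dockerfile path, and registry are assumptions/placeholders.
cd driver/ubuntu22.04/precompiled
docker build \
  --build-arg DRIVER_BRANCH=550 \
  --build-arg KERNEL_VERSION=6.8.0-51-generic \
  --build-arg CUDA_VERSION=12.4.1 \
  -t <your-registry>/driver:550-6.8.0-51-generic-ubuntu22.04 .
docker push <your-registry>/driver:550-6.8.0-51-generic-ubuntu22.04
```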
