
Precompiled Driver Container for Linux Kernel other than 5.15 does not exist #1203

Open
utsumi-fj opened this issue Jan 16, 2025 · 5 comments


@utsumi-fj

The Precompiled Driver Containers documentation (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#limitations-and-restrictions) describes the following limitations and restrictions:

Limitations and Restrictions

Support for deploying the driver containers with precompiled drivers is limited to hosts with the Ubuntu 22.04 operating system and x86_64 architecture.

Although no restriction on the Linux kernel version is documented, precompiled driver containers for kernels other than 5.15 do not exist in https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags.

According to the kernel release schedule at https://ubuntu.com/kernel/lifecycle, newer Linux kernel versions, e.g. 6.8, are available for Ubuntu 22.04.

[Image: Ubuntu kernel release schedule from https://ubuntu.com/kernel/lifecycle]

Precompiled driver containers are useful because they avoid installing a local package repository when installing the GPU Operator in an air-gapped environment, so it would be better if precompiled driver containers for newer Linux kernels existed.
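
For context, the deployment that hits this is the precompiled-driver path of the GPU Operator Helm chart. A minimal install sketch, assuming the documented `driver.usePrecompiled` and `driver.version` chart values; the release name, namespace, and the `550` driver branch are placeholders:

```shell
# Minimal sketch: install the GPU Operator with precompiled drivers enabled.
# Release name, namespace, and the "550" driver branch below are placeholders;
# check the chart values for your GPU Operator version.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.usePrecompiled=true \
  --set driver.version="550"
```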

@justinthelaw

@utsumi-fj When you try to deploy the precompiled driver containers to a newer Linux kernel, like 6.8.x or 6.9.x, what happens?

@utsumi-fj
Author

@justinthelaw A non-existent container image, such as nvcr.io/nvidia/driver:550-6.8.0-50-generic-ubuntu22.04, is referenced; the pull fails and the driver pod ends up in an ImagePullBackOff error.
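
For anyone reproducing this, a quick way to see the resolved image reference that fails to pull; the `gpu-operator` namespace and the `app=nvidia-driver-daemonset` label are assumptions and may differ per install:

```shell
# Sketch: show the ImagePullBackOff status and the resolved driver image tag.
# Namespace and label selector are assumptions; adjust to your install.
kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset
kubectl -n gpu-operator describe pod -l app=nvidia-driver-daemonset | grep "Image:"
```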

@justinthelaw

justinthelaw commented Feb 25, 2025

Tell me if this is a slightly different problem, but the gpu-operator mutates the final image tag, which causes it to be incorrect. This can be seen in the final driver daemonset resource when it is described. The resolved reference basically follows this pattern (sketched below from the example tag above):
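
```shell
# Resolved image reference pattern (inferred from the example tag above; <...> are placeholders):
nvcr.io/nvidia/driver:<DRIVER_BRANCH>-<KERNEL_VERSION>-<OS_TAG>
# e.g. nvcr.io/nvidia/driver:550-6.8.0-50-generic-ubuntu22.04
```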

I have been working around this problem by directly patching the tag back to the actual image tag and then restarting the pod that has the image pull error, roughly as sketched below. I am still testing things, as I am currently producing a kernel-mismatch workaround, so this may not be the right way to do things.
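
A minimal sketch of that workaround, assuming the `nvidia-driver-daemonset` daemonset and `nvidia-driver-ctr` container names the GPU Operator typically creates (verify with `kubectl get ds -n gpu-operator`; the operator may also revert the patch on its next reconcile):

```shell
# Sketch: point the driver daemonset back at an image tag that actually exists,
# then delete the failing pod so it gets recreated with the corrected image.
# Daemonset/container names and namespace are assumptions; the <...> values are placeholders.
kubectl -n gpu-operator set image daemonset/nvidia-driver-daemonset \
  nvidia-driver-ctr=<registry>/driver:<existing-tag>
kubectl -n gpu-operator delete pod -l app=nvidia-driver-daemonset
```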

As for the kernel mismatch, I am building the precompiled driver container myself and modifying the Dockerfile and nvidia-driver script that NVIDIA publishes (via git sparse checkout and sed, for now; see the sketch below). Here is my upstream issue regarding the kernel mismatch: https://gitlab.com/nvidia/container-images/driver/-/issues/56
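
Roughly, the checkout-and-edit step looks like this. The `ubuntu22.04/precompiled` path is an assumption about the upstream repo layout, and the sed line is only a placeholder for whatever substitutions your kernel/driver combination needs:

```shell
# Sketch: sparse-checkout only the Ubuntu 22.04 bits of the upstream driver image repo,
# then edit the Dockerfile / nvidia-driver script for the target kernel.
git clone --filter=blob:none --sparse https://gitlab.com/nvidia/container-images/driver.git
cd driver
git sparse-checkout set ubuntu22.04
# Placeholder edit only -- the real substitutions depend on your target kernel:
sed -i 's/<old-kernel-string>/<target-kernel-string>/g' ubuntu22.04/precompiled/Dockerfile
```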

@justinthelaw

CC: @tariq1890 @cdesiniotis Do you have any further insight into why these two issues show up in the gpu-operator's precompiled drivers deployment?

@justinthelaw

justinthelaw commented Mar 20, 2025

@utsumi-fj As a final follow-up, I got everything working on two Linux kernels other than 5.15.x, namely 6.8.0-51 and 6.8.1-52, on Ubuntu 22.04.x. Please refer to the upstream driver image repository instructions/code (https://gitlab.com/nvidia/container-images/driver) and my related issue regarding Linux kernels in their scripting: https://gitlab.com/nvidia/container-images/driver/-/issues/56
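
For completeness, a build-and-push sketch in the spirit of the upstream repo's instructions. The Dockerfile location and the `DRIVER_BRANCH`, `KERNEL_VERSION`, and `CUDA_VERSION` build args are assumptions; confirm the exact names and paths in the repo's README before relying on this:

```shell
# Sketch: build and push a precompiled driver image for a specific Ubuntu 22.04 kernel.
# Build-arg names, Dockerfile path, and registry are assumptions/placeholders.
cd driver/ubuntu22.04/precompiled
docker build \
  --build-arg DRIVER_BRANCH=550 \
  --build-arg KERNEL_VERSION=6.8.0-51-generic \
  --build-arg CUDA_VERSION=12.4.1 \
  -t <your-registry>/driver:550-6.8.0-51-generic-ubuntu22.04 .
docker push <your-registry>/driver:550-6.8.0-51-generic-ubuntu22.04
```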
