diff --git a/container-toolkit/cdi-support.md b/container-toolkit/cdi-support.md index 92dcb2fb0..105c0cdff 100644 --- a/container-toolkit/cdi-support.md +++ b/container-toolkit/cdi-support.md @@ -1,7 +1,6 @@ % Date: November 11 2022 -% Author: elezar (elezar@nvidia.com) -% Author: ArangoGutierrez (eduardoa@nvidia.com) +% Author: elezar % headings (h1/h2/h3/h4/h5) are # * = - @@ -30,99 +29,7 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f - You installed an NVIDIA GPU Driver. -### Automatic CDI Specification Generation - -As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service: - -- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when NVIDIA drivers are installed or upgraded -- Runs automatically on system boot to ensure the specification is up to date - -```{note} -The automatic CDI refresh service does not handle: -- Driver removal (the CDI file is intentionally preserved) -- MIG device reconfiguration - -For these scenarios, you may still need to manually regenerate the CDI specification. See [Manual CDI Specification Generation](#manual-cdi-specification-generation) for instructions. -``` - -#### Customizing the Automatic CDI Refresh Service - -You can customize the behavior of the `nvidia-cdi-refresh` service by adding environment variables to `/etc/nvidia-container-toolkit/cdi-refresh.env`. This file is read by the service and allows you to modify the `nvidia-ctk cdi generate` command behavior. - -Example configuration file: -```bash -# /etc/nvidia-container-toolkit/cdi-refresh.env -NVIDIA_CTK_DEBUG=1 -# Add other nvidia-ctk environment variables as needed -``` - -For a complete list of available environment variables, run `nvidia-ctk cdi generate --help` to see the command's documentation. 
- -```{important} -After modifying the environment file, you must reload the systemd daemon and restart the service for changes to take effect: - -```console -$ sudo systemctl daemon-reload -$ sudo systemctl restart nvidia-cdi-refresh.service -``` - -#### Managing the CDI Refresh Service - -The `nvidia-cdi-refresh` service consists of two systemd units: - -- `nvidia-cdi-refresh.path` - Monitors for changes to driver files and triggers the service -- `nvidia-cdi-refresh.service` - Executes the CDI specification generation - -You can manage these services using standard systemd commands: - -```console -# Check service status -$ sudo systemctl status nvidia-cdi-refresh.path -● nvidia-cdi-refresh.path - Trigger CDI refresh on NVIDIA driver install / uninstall events - Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.path; enabled; preset: enabled) - Active: active (waiting) since Fri 2025-06-27 06:04:54 EDT; 1h 47min ago - Triggers: ● nvidia-cdi-refresh.service - -$ sudo systemctl status nvidia-cdi-refresh.service -○ nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file - Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.service; enabled; preset: enabled) - Active: inactive (dead) since Fri 2025-06-27 07:17:26 EDT; 34min ago -TriggeredBy: ● nvidia-cdi-refresh.path - Process: 1317511 ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml (code=exited, status=0/SUCCESS) - Main PID: 1317511 (code=exited, status=0/SUCCESS) - CPU: 562ms - -Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi" -Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump" -Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced" -Jun 27 00:04:30 
ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-control as /usr/bin/nvidia-cuda-mps-control" -Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-server as /usr/bin/nvidia-cuda-mps-server" -Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=warning msg="Could not locate nvidia-imex: pattern nvidia-imex not found" -Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=warning msg="Could not locate nvidia-imex-ctl: pattern nvidia-imex-ctl not found" -Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Generated CDI spec with version 1.0.0" -Jun 27 00:04:30 ipp2-0502 systemd[1]: nvidia-cdi-refresh.service: Succeeded. -Jun 27 00:04:30 ipp2-0502 systemd[1]: Started Refresh NVIDIA CDI specification file. -``` - -You can enable/disable the automatic CDI refresh service using the following commands: - -```console -$ sudo systemctl enable --now nvidia-cdi-refresh.path -$ sudo systemctl enable --now nvidia-cdi-refresh.service -$ sudo systemctl disable nvidia-cdi-refresh.service -$ sudo systemctl disable nvidia-cdi-refresh.path -``` - -You can also view the service logs to see the output of the CDI generation process. - -```console -# View service logs -$ sudo journalctl -u nvidia-cdi-refresh.service -``` - -### Manual CDI Specification Generation - -If you need to manually generate a CDI specification, for example, after MIG configuration changes or if you are using a Container Toolkit version before v1.18.0, follow this procedure: +### Procedure Two common locations for CDI specifications are `/etc/cdi/` and `/var/run/cdi/`. The contents of the `/var/run/cdi/` directory are cleared on boot. @@ -132,10 +39,10 @@ However, the path to create and use can depend on the container engine that you 1. 
Generate the CDI specification file: ```console - $ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml + $ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml ``` - The sample command uses `sudo` to ensure that the file at `/var/run/cdi/nvidia.yaml` is created. + The sample command uses `sudo` to ensure that the file at `/etc/cdi/nvidia.yaml` is created. You can omit the `--output` argument to print the generated specification to `STDOUT`. *Example Output* @@ -170,8 +77,6 @@ You must generate a new CDI specification after any of the following changes: - You use a location such as `/var/run/cdi` that is cleared on boot. A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded. - -**Note**: As of NVIDIA Container Toolkit v1.18.0, the automatic CDI refresh service handles most of these scenarios automatically. ``` ## Running a Workload with CDI diff --git a/container-toolkit/install-guide.md b/container-toolkit/install-guide.md index 4a413c98c..411de8f53 100644 --- a/container-toolkit/install-guide.md +++ b/container-toolkit/install-guide.md @@ -229,6 +229,7 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/ For Podman, NVIDIA recommends using [CDI](./cdi-support.md) for accessing NVIDIA devices in containers. 
+ ## Next Steps - [](./sample-workload.md) \ No newline at end of file diff --git a/container-toolkit/release-notes.md index 66938b5d7..e3ab5bfbd 100644 --- a/container-toolkit/release-notes.md +++ b/container-toolkit/release-notes.md @@ -255,7 +255,7 @@ The following packages are included: - `libnvidia-container-tools 1.17.2` - `libnvidia-container1 1.17.2` -The following `container-toolkit` conatiners are included: +The following `container-toolkit` containers are included: - `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2-ubi8` - `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2-ubuntu20.04` (also as `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2`) diff --git a/container-toolkit/sample-workload.md index fe6be7444..3b19550a7 100644 --- a/container-toolkit/sample-workload.md +++ b/container-toolkit/sample-workload.md @@ -21,7 +21,7 @@ you can verify your installation by running a sample workload. ## Running a Sample Workload with Podman -After you install and configura the toolkit (including [generating a CDI specification](cdi-support.md)) and install an NVIDIA GPU Driver, +After you install and configure the toolkit (including [generating a CDI specification](cdi-support.md)) and install an NVIDIA GPU Driver, you can verify your installation by running a sample workload. - Run a sample CUDA container:
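
The `cdi-support.md` hunks above revolve around the file that `sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml` writes. For reviewers unfamiliar with CDI, a generated `nvidia.yaml` is shaped roughly like the following — an abbreviated, illustrative sketch, not a literal generated file; the actual device nodes, mounts, hooks, and version strings vary by driver and toolkit release:

```yaml
# Illustrative sketch of a CDI specification such as /etc/cdi/nvidia.yaml.
# Real output from `nvidia-ctk cdi generate` contains more edits (hooks,
# library mounts, symlinks) specific to the installed driver.
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: "0"                      # first GPU; MIG devices get compound names
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
  - name: all                      # convenience device covering every GPU
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
containerEdits:                    # edits applied for any requested device
  deviceNodes:
    - path: /dev/nvidiactl
    - path: /dev/nvidia-uvm
  mounts:
    - hostPath: /usr/bin/nvidia-smi
      containerPath: /usr/bin/nvidia-smi
```

A CDI-aware engine resolves a request such as Podman's `--device nvidia.com/gpu=all` against this file and applies the matching `containerEdits` to the container, which is why the sample-workload instructions above require the specification to be generated before running a workload with Podman.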