Commit e3ff414

Add documentation for the systemd nvidia-container-toolkit.service
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
1 parent 5211d31 commit e3ff414

4 files changed

+98
-7
lines changed

container-toolkit/cdi-support.md

Lines changed: 96 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,7 @@
11
% Date: November 11 2022
22

3-
% Author: elezar
3+
% Author: elezar ([email protected])
4+
% Author: ArangoGutierrez ([email protected])
45

56
% headings (h1/h2/h3/h4/h5) are # * = -
67

@@ -29,7 +30,96 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f
2930

3031
- You installed an NVIDIA GPU Driver.
3132

32-
### Procedure
33+
### Automatic CDI Specification Generation
34+
35+
As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service:
36+
37+
- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when NVIDIA drivers are installed or upgraded
38+
- Runs automatically on system boot to ensure the specification is up to date
39+
40+
```{note}
41+
The automatic CDI refresh service does not handle:
42+
- Driver removal (the CDI file is intentionally preserved)
43+
- MIG device reconfiguration
44+
45+
For these scenarios, you may still need to manually regenerate the CDI specification. See [Manual CDI Specification Generation](#manual-cdi-specification-generation) for instructions.
46+
```
47+
48+
#### Customizing the Automatic CDI Refresh Service
49+
50+
You can customize the behavior of the `nvidia-cdi-refresh` service by adding environment variables to `/etc/nvidia-container-toolkit/cdi-refresh.env`. This file is read by the service and allows you to modify the `nvidia-ctk cdi generate` command behavior.
51+
52+
Example configuration file:
53+
```bash
54+
# /etc/nvidia-container-toolkit/cdi-refresh.env
55+
NVIDIA_CTK_DEBUG=1
56+
# Add other nvidia-ctk environment variables as needed
57+
```
58+
59+
For a complete list of available environment variables, run `nvidia-ctk cdi generate --help` to see the command's documentation.
60+
61+
```{important}
62+
After modifying the environment file, you must reload the systemd daemon and restart the service for changes to take effect:
63+
64+
```console
65+
$ sudo systemctl daemon-reload
66+
$ sudo systemctl restart nvidia-cdi-refresh.service
67+
```
68+
69+
#### Managing the CDI Refresh Service
70+
71+
The `nvidia-cdi-refresh` service consists of two systemd units:
72+
73+
- `nvidia-cdi-refresh.path` - Monitors for changes to driver files and triggers the service
74+
- `nvidia-cdi-refresh.service` - Executes the CDI specification generation
75+
76+
You can manage these services using standard systemd commands:
77+
78+
```console
79+
# Check service status
80+
$ sudo systemctl status nvidia-cdi-refresh.path
81+
● nvidia-cdi-refresh.path - Trigger CDI refresh on NVIDIA driver install / uninstall events
82+
Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.path; enabled; preset: enabled)
83+
Active: active (waiting) since Fri 2025-06-27 06:04:54 EDT; 1h 47min ago
84+
Triggers: ● nvidia-cdi-refresh.service
85+
86+
$ sudo systemctl status nvidia-cdi-refresh.service
87+
○ nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file
88+
Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.service; enabled; preset: enabled)
89+
Active: inactive (dead) since Fri 2025-06-27 07:17:26 EDT; 34min ago
90+
TriggeredBy: ● nvidia-cdi-refresh.path
91+
Process: 1317511 ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml (code=exited, status=0/SUCCESS)
92+
Main PID: 1317511 (code=exited, status=0/SUCCESS)
93+
CPU: 562ms
94+
95+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi"
96+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump"
97+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced"
98+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-control as /usr/bin/nvidia-cuda-mps-control"
99+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-server as /usr/bin/nvidia-cuda-mps-server"
100+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=warning msg="Could not locate nvidia-imex: pattern nvidia-imex not found"
101+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=warning msg="Could not locate nvidia-imex-ctl: pattern nvidia-imex-ctl not found"
102+
Jun 27 00:04:30 ipp2-0502 nvidia-ctk[1623461]: time="2025-06-27T00:04:30-04:00" level=info msg="Generated CDI spec with version 1.0.0"
103+
Jun 27 00:04:30 ipp2-0502 systemd[1]: nvidia-cdi-refresh.service: Succeeded.
104+
Jun 27 00:04:30 ipp2-0502 systemd[1]: Started Refresh NVIDIA CDI specification file.
105+
106+
# Enable/disable the automatic CDI refresh service
107+
$ sudo systemctl enable --now nvidia-cdi-refresh.path
108+
$ sudo systemctl enable --now nvidia-cdi-refresh.service
109+
$ sudo systemctl disable nvidia-cdi-refresh.service
110+
$ sudo systemctl disable nvidia-cdi-refresh.path
111+
```
112+
113+
You can also view the service logs to see the output of the CDI generation process.
114+
115+
```console
116+
# View service logs
117+
$ sudo journalctl -u nvidia-cdi-refresh.service
118+
```
119+
120+
### Manual CDI Specification Generation
121+
122+
If you need to manually generate a CDI specification, for example, after MIG configuration changes or if you are using a Container Toolkit version before v1.18.0, follow this procedure:
33123

34124
Two common locations for CDI specifications are `/etc/cdi/` and `/var/run/cdi/`.
35125
The contents of the `/var/run/cdi/` directory are cleared on boot.
@@ -39,10 +129,10 @@ However, the path to create and use can depend on the container engine that you
39129
1. Generate the CDI specification file:
40130

41131
```console
42-
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
132+
$ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
43133
```
44134

45-
The sample command uses `sudo` to ensure that the file at `/etc/cdi/nvidia.yaml` is created.
135+
The sample command uses `sudo` to ensure that the file at `/var/run/cdi/nvidia.yaml` is created.
46136
You can omit the `--output` argument to print the generated specification to `STDOUT`.
47137

48138
*Example Output*
@@ -77,6 +167,8 @@ You must generate a new CDI specification after any of the following changes:
77167
- You use a location such as `/var/run/cdi` that is cleared on boot.
78168
79169
A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.
170+
171+
**Note**: As of NVIDIA Container Toolkit v1.18.0, the automatic CDI refresh service handles most of these scenarios automatically.
80172
```
81173

82174
## Running a Workload with CDI
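
The `cdi-support.md` changes above describe a refresh service that writes the CDI specification to `/var/run/cdi/nvidia.yaml`. A minimal sketch of a spot check for that file follows. The function name `cdi_spec_status` and its messages are hypothetical and not part of the NVIDIA Container Toolkit; only the default path and the commented `systemctl` and `journalctl` commands come from the documentation in this commit.

```shell
#!/usr/bin/env bash
# Sketch only: report whether a CDI specification file exists and is non-empty.
# cdi_spec_status is a hypothetical helper, not a toolkit command.

cdi_spec_status() {
    # Default to the path the documentation gives for the refresh service.
    local spec="${1:-/var/run/cdi/nvidia.yaml}"
    if [ -s "$spec" ]; then
        echo "present: $spec"
    else
        echo "missing or empty: $spec"
        return 1
    fi
}

# If the file is missing on a real host, the documented commands can help:
#   sudo systemctl status nvidia-cdi-refresh.path
#   sudo journalctl -u nvidia-cdi-refresh.service
```

Calling `cdi_spec_status` with no argument checks the default path; passing a path such as `/etc/cdi/nvidia.yaml` checks a manually generated specification instead. The check only confirms the file exists and has content, not that the specification is valid.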

container-toolkit/install-guide.md

Lines changed: 0 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -224,7 +224,6 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/
224224

225225
For Podman, NVIDIA recommends using [CDI](./cdi-support.md) for accessing NVIDIA devices in containers.
226226

227-
228227
## Next Steps
229228

230229
- [](./sample-workload.md)

container-toolkit/release-notes.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -255,7 +255,7 @@ The following packages are included:
255255
- `libnvidia-container-tools 1.17.2`
256256
- `libnvidia-container1 1.17.2`
257257

258-
The following `container-toolkit` conatiners are included:
258+
The following `container-toolkit` containers are included:
259259

260260
- `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2-ubi8`
261261
- `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2-ubuntu20.04` (also as `nvcr.io/nvidia/k8s/container-toolkit:v1.17.2`)

container-toolkit/sample-workload.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -21,7 +21,7 @@ you can verify your installation by running a sample workload.
2121

2222
## Running a Sample Workload with Podman
2323

24-
After you install and configura the toolkit (including [generating a CDI specification](cdi-support.md)) and install an NVIDIA GPU Driver,
24+
After you install and configure the toolkit (including [generating a CDI specification](cdi-support.md)) and install an NVIDIA GPU Driver,
2525
you can verify your installation by running a sample workload.
2626

2727
- Run a sample CUDA container:
