Improve support for container runtime configurations #712

The goal of this enhancement is to provide better support for configuring container runtimes such as containerd or CRI-O. The existing approach in the toolkit container has the following shortcomings:

  • We rely on a single config file to determine the current configuration. This may not reflect the active config, since it ignores drop-in files or command-line options, for example.
  • We have no mechanism to update the NVIDIA Container Toolkit configuration based on the extracted config. This is an issue in cases where the low-level runtime has an explicit path set, or where it is not runc or crun.

The work required includes the following:

Refactor the toolkit container tooling to prepare for the changes

In order to address the shortcomings above, it is recommended that we refactor the toolkit container tooling. The primary goal is to stop invoking toolkit, containerd, crio, or docker as shell commands to apply the required changes, and instead expose this functionality through Go APIs. #704 shows this for the toolkit executable, with the toolkit.Install function being used to perform the toolkit installation. In each case, the command-line arguments should be added to the top-level command, and common options could be moved to the top-level options structure.
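As a rough sketch of what this could look like, the snippet below wires a top-level urfave/cli command directly to a Go-level install function instead of shelling out to a separate executable. The toolkitOptions struct, the install function, and the flag names are illustrative placeholders rather than the actual API introduced in #704:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/urfave/cli/v2"
)

// toolkitOptions stands in for the options structure that a Go-level
// toolkit.Install would accept; the fields here are illustrative only.
type toolkitOptions struct {
	toolkitRoot     string
	lowLevelRuntime string
}

// install is a placeholder for an entry point such as toolkit.Install from
// #704: the top-level command calls it directly instead of invoking a
// separate `toolkit` executable as a shell command.
func install(o *toolkitOptions) error {
	fmt.Printf("installing toolkit to %q (low-level runtime %q)\n",
		o.toolkitRoot, o.lowLevelRuntime)
	return nil
}

func main() {
	opts := &toolkitOptions{}

	app := &cli.App{
		Name:  "nvidia-toolkit",
		Usage: "install and configure the NVIDIA Container Toolkit",
		// Common options live on the top-level command so that the
		// toolkit, containerd, crio, and docker steps all share them.
		Flags: []cli.Flag{
			&cli.StringFlag{
				Name:        "toolkit-root",
				Value:       "/usr/local/nvidia/toolkit",
				Destination: &opts.toolkitRoot,
				EnvVars:     []string{"TOOLKIT_ROOT"},
			},
			&cli.StringFlag{
				Name:        "low-level-runtime",
				Value:       "runc",
				Destination: &opts.lowLevelRuntime,
				EnvVars:     []string{"LOW_LEVEL_RUNTIME"},
			},
		},
		Action: func(c *cli.Context) error {
			// Call the Go API rather than shelling out.
			return install(opts)
		},
	}

	if err := app.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}
```

With this structure, the docker, containerd, and crio configuration steps can be exposed as further Go functions operating on the same options structure, rather than as separate shell invocations.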

Note that some changes may be required in the GPU Operator to ensure that the correct options are being passed. None are expected, since we should be relying on envvars for most options, but this should be audited.

The following could be used as a rough checklist:

  • Use Go API for toolkit installation #704
  • Use Go API for docker config
  • Use Go API for containerd config
  • Use Go API for crio config
  • Audit the GPU Operator usage of the API
  • Move the nvidia-toolkit command (possibly renaming it) to the cmd folder and include it in the operator-extensions package.

Refactor runtime configuration logic

As mentioned, the current implementation assumes a single config file as the reference for the modifications required to enable the NVIDIA Container Runtime for a container runtime such as Docker, containerd, or CRI-O. In order to extend this to other sources, it would make sense to refactor the logic for extracting the current config into something more generic. This could then be extended with other mechanisms for determining the current config, such as querying a gRPC API or invoking a shell command. #643 already demonstrates this by adding the concept of a TOML source that can be used for runtimes such as containerd or CRI-O. (This PR needs to be iterated on.)
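As an illustration only (none of the names below are taken from #643), a generic config source could be a small interface with one implementation per mechanism, for example a file-backed source and a command-backed source:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"

	"github.com/pelletier/go-toml/v2"
)

// tomlSource is a hypothetical abstraction over the places a runtime config
// can be read from: a file on disk, the output of a command, or (with a
// further implementation) a gRPC API.
type tomlSource interface {
	Load() (map[string]interface{}, error)
}

// fileSource reads the config from a single file, matching the current
// behaviour of the toolkit container.
type fileSource struct {
	path string
}

func (s fileSource) Load() (map[string]interface{}, error) {
	contents, err := os.ReadFile(s.path)
	if err != nil {
		return nil, fmt.Errorf("failed to read %v: %w", s.path, err)
	}
	var cfg map[string]interface{}
	if err := toml.Unmarshal(contents, &cfg); err != nil {
		return nil, fmt.Errorf("failed to parse %v: %w", s.path, err)
	}
	return cfg, nil
}

// commandSource reads the effective config from the output of a command,
// which also reflects drop-in files and command-line overrides.
type commandSource struct {
	args []string
}

func (s commandSource) Load() (map[string]interface{}, error) {
	var stdout bytes.Buffer
	cmd := exec.Command(s.args[0], s.args[1:]...)
	cmd.Stdout = &stdout
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("failed to run %v: %w", s.args, err)
	}
	var cfg map[string]interface{}
	if err := toml.Unmarshal(stdout.Bytes(), &cfg); err != nil {
		return nil, fmt.Errorf("failed to parse command output: %w", err)
	}
	return cfg, nil
}

func main() {
	// Either source satisfies the same interface, so the code that applies
	// the NVIDIA runtime modifications does not care where the config came from.
	var src tomlSource = commandSource{args: []string{"containerd", "config", "dump"}}
	cfg, err := src.Load()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d top-level config keys\n", len(cfg))
}
```

A gRPC-backed source could then be added as another implementation of the same interface without touching the code that consumes the extracted config.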

The following could be used as a rough checklist:

Add new functionality

With the refactoring in place we can extend the existing functionality to add support for new use cases and configurations. The primary changes are:

Add support for new config (i.e. TOML) sources that allow for the current config to be extracted

This is already included in #686, but should be focused to address only the additional config source.

Update the toolkit configuration flow to allow for a config that is runtime dependent

The current flow when installing the toolkit and configuring the runtimes is linear, and there is no dependence between the steps (see the existing `func Run(c *cli.Context, o *options) error` entry point). Instead of installing the toolkit and then configuring and restarting the runtime, we need to define a new flow (sketched after the list below) that:

  1. Extracts the current config from the selected runtime.
  2. Installs the toolkit with the relevant config modifications (e.g. the low-level runtime path).
  3. Updates the config for the selected runtime.
  4. Restarts the selected runtime.
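The sketch below illustrates this dependent flow. All of the function names, the runtimeConfig type, and the example low-level runtime path are hypothetical; they only show how the output of step 1 would feed into steps 2 and 3:

```go
package main

import (
	"fmt"
	"log"
)

// runtimeConfig is a hypothetical, simplified view of the extracted runtime
// configuration; only the field needed for this sketch is shown.
type runtimeConfig struct {
	lowLevelRuntimePath string
}

// The functions below are placeholders for the Go APIs the refactored flow
// would call; their names and signatures are illustrative only.

func extractConfig(runtime string) (*runtimeConfig, error) {
	// e.g. read the config file, drop-in files, or the output of a
	// command such as `containerd config dump` for the selected runtime.
	return &runtimeConfig{lowLevelRuntimePath: "/usr/bin/crun"}, nil
}

func installToolkit(cfg *runtimeConfig) error {
	// The toolkit installation now depends on the extracted config, e.g.
	// to point the NVIDIA Container Runtime at the configured low-level runtime.
	fmt.Printf("installing toolkit against low-level runtime %q\n", cfg.lowLevelRuntimePath)
	return nil
}

func updateRuntimeConfig(runtime string, cfg *runtimeConfig) error {
	// Add the NVIDIA runtime entries to the extracted config and write it back.
	return nil
}

func restartRuntime(runtime string) error {
	return nil
}

func configureRuntime(runtime string) error {
	// 1. Extract the current config from the selected runtime.
	cfg, err := extractConfig(runtime)
	if err != nil {
		return fmt.Errorf("failed to extract config: %w", err)
	}
	// 2. Install the toolkit with config-dependent modifications.
	if err := installToolkit(cfg); err != nil {
		return fmt.Errorf("failed to install toolkit: %w", err)
	}
	// 3. Update the config for the selected runtime.
	if err := updateRuntimeConfig(runtime, cfg); err != nil {
		return fmt.Errorf("failed to update config: %w", err)
	}
	// 4. Restart the selected runtime.
	return restartRuntime(runtime)
}

func main() {
	if err := configureRuntime("containerd"); err != nil {
		log.Fatal(err)
	}
}
```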

Note that moving to the Go API means that, when un-configuring the NVIDIA runtime, the extracted config can be used as a reference to revert any toolkit-specific changes that were made at a global level.

One problem to solve is that config changes could theoretically be made between the config extraction, the toolkit installation, and the runtime configuration. As a starting point, we could assume that the risk of this is low, but it may be good to add checks to confirm that the config read in step 1 is equivalent to the config being modified in step 3.
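One possible shape for such a check, assuming the extracted config is held as a generic map and ignoring how it is re-read, is sketched below; reflect.DeepEqual and the reload callback are illustrative choices, not an existing API:

```go
package main

import (
	"fmt"
	"reflect"
)

// verifyUnchanged re-reads the runtime config via the supplied reload
// function and compares it with the config extracted in step 1. If the two
// differ, the update in step 3 is aborted instead of being applied to a
// stale view of the config.
func verifyUnchanged(extracted map[string]interface{}, reload func() (map[string]interface{}, error)) error {
	current, err := reload()
	if err != nil {
		return fmt.Errorf("failed to re-read config: %w", err)
	}
	if !reflect.DeepEqual(extracted, current) {
		return fmt.Errorf("runtime config changed since it was extracted; refusing to modify it")
	}
	return nil
}

func main() {
	extracted := map[string]interface{}{"version": int64(2)}
	reload := func() (map[string]interface{}, error) {
		return map[string]interface{}{"version": int64(2)}, nil
	}
	if err := verifyUnchanged(extracted, reload); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("config unchanged; safe to apply the update")
}
```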
