Improve support for container runtime configurations #712

The goal of this enhancement is to provide better support for configuring container runtimes such as containerd or CRI-O. The existing approach in the toolkit container has the following shortcomings:

  • We rely on a single config file to determine the current configuration. This may not reflect the active config, since it ignores drop-in files or command-line options, for example.
  • We have no mechanism to update the NVIDIA Container Toolkit configuration based on the extracted config. This is an issue in cases where the low-level runtime has an explicit path set, or where it is not runc or crun.

The work required includes the following:

Refactor the toolkit container tooling to prepare for the changes

In order to address the shortcomings above, it is recommended that we refactor the toolkit container tooling. The primary goal is to stop invoking toolkit, containerd, crio, or docker as shell commands to apply the required changes, and instead expose this functionality through Go APIs. #704 shows this for the toolkit executable, with the toolkit.Install function being used to perform the toolkit installation. In each case, the command-line arguments should be added to the top-level command, and common options could be moved to the top-level options structure.
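As a rough sketch of what this could look like, the snippet below wires a top-level urfave/cli command directly to a Go-level install function instead of shelling out to a separate executable. The toolkitOptions struct, the install function, and the flag names are illustrative placeholders rather than the actual API introduced in #704:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/urfave/cli/v2"
)

// toolkitOptions stands in for the options structure that a Go-level
// toolkit.Install would accept; the fields here are illustrative only.
type toolkitOptions struct {
	toolkitRoot     string
	lowLevelRuntime string
}

// install is a placeholder for an entry point such as toolkit.Install from
// #704: the top-level command calls it directly instead of invoking a
// separate `toolkit` executable as a shell command.
func install(o *toolkitOptions) error {
	fmt.Printf("installing toolkit to %q (low-level runtime %q)\n",
		o.toolkitRoot, o.lowLevelRuntime)
	return nil
}

func main() {
	opts := &toolkitOptions{}

	app := &cli.App{
		Name:  "nvidia-toolkit",
		Usage: "install and configure the NVIDIA Container Toolkit",
		// Common options live on the top-level command so that the
		// toolkit, containerd, crio, and docker steps all share them.
		Flags: []cli.Flag{
			&cli.StringFlag{
				Name:        "toolkit-root",
				Value:       "/usr/local/nvidia/toolkit",
				Destination: &opts.toolkitRoot,
				EnvVars:     []string{"TOOLKIT_ROOT"},
			},
			&cli.StringFlag{
				Name:        "low-level-runtime",
				Value:       "runc",
				Destination: &opts.lowLevelRuntime,
				EnvVars:     []string{"LOW_LEVEL_RUNTIME"},
			},
		},
		Action: func(c *cli.Context) error {
			// Call the Go API rather than shelling out.
			return install(opts)
		},
	}

	if err := app.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}
```

With this structure, the docker, containerd, and crio configuration steps can be exposed as further Go functions operating on the same options structure, rather than as separate shell invocations.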

Note that some changes may be required in the GPU Operator to ensure that the correct options are being passed. None are expected, since we should be relying on envvars for most options, but this should be audited.

The following could be used as a rough checklist:

  • Use Go API for toolkit installation #704
  • Use Go API for docker config
  • Use Go API for containerd config
  • Use Go API for crio config
  • Audit the GPU Operator usage of the API
  • Move the nvidia-toolkit command (possibly renaming it) to the cmd folder and include it in the operator-extensions package.

Refactor runtime configuration logic

As mentioned, the current implementation assumes a single config file as the reference for the modifications required to enable the NVIDIA Container Runtime for a container runtime such as Docker, containerd, or CRI-O. In order to extend this to other sources, it would make sense to refactor the logic for extracting the current config into something more generic. This could then be extended with other mechanisms for determining the current config, such as querying a gRPC API or invoking a shell command. #643 already demonstrates this by adding the concept of a TOML source that can be used for runtimes such as containerd or CRI-O. (This PR needs to be iterated on.)
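As an illustration only (none of the names below are taken from #643), a generic config source could be a small interface with one implementation per mechanism, for example a file-backed source and a command-backed source:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"

	"github.com/pelletier/go-toml/v2"
)

// tomlSource is a hypothetical abstraction over the places a runtime config
// can be read from: a file on disk, the output of a command, or (with a
// further implementation) a gRPC API.
type tomlSource interface {
	Load() (map[string]interface{}, error)
}

// fileSource reads the config from a single file, matching the current
// behaviour of the toolkit container.
type fileSource struct {
	path string
}

func (s fileSource) Load() (map[string]interface{}, error) {
	contents, err := os.ReadFile(s.path)
	if err != nil {
		return nil, fmt.Errorf("failed to read %v: %w", s.path, err)
	}
	var cfg map[string]interface{}
	if err := toml.Unmarshal(contents, &cfg); err != nil {
		return nil, fmt.Errorf("failed to parse %v: %w", s.path, err)
	}
	return cfg, nil
}

// commandSource reads the effective config from the output of a command,
// which also reflects drop-in files and command-line overrides.
type commandSource struct {
	args []string
}

func (s commandSource) Load() (map[string]interface{}, error) {
	var stdout bytes.Buffer
	cmd := exec.Command(s.args[0], s.args[1:]...)
	cmd.Stdout = &stdout
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("failed to run %v: %w", s.args, err)
	}
	var cfg map[string]interface{}
	if err := toml.Unmarshal(stdout.Bytes(), &cfg); err != nil {
		return nil, fmt.Errorf("failed to parse command output: %w", err)
	}
	return cfg, nil
}

func main() {
	// Either source satisfies the same interface, so the code that applies
	// the NVIDIA runtime modifications does not care where the config came from.
	var src tomlSource = commandSource{args: []string{"containerd", "config", "dump"}}
	cfg, err := src.Load()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d top-level config keys\n", len(cfg))
}
```

A gRPC-backed source could then be added as another implementation of the same interface without touching the code that consumes the extracted config.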

The following could be used as a rough checklist:

Add new functionality

With the refactoring in place we can extend the existing functionality to add support for new use cases and configurations. The primary changes are:

Add support for new config (i.e. TOML) sources that allow for the current config to be extracted

This is already included in #686, but should be focused to address only the additional config source.

Update the toolkit configuration flow to allow for a config that is runtime dependent

The current flow when installing the toolkit and configuring the runtimes is linear, and there is no dependence between the steps (see the existing `func Run(c *cli.Context, o *options) error` entry point). Instead of installing the toolkit and then configuring and restarting the runtime, we need to define a new flow (sketched after the list below) that:

  1. Extracts the current config from the selected runtime.
  2. Installs the toolkit with the relevant config modifications (e.g. the low-level runtime path).
  3. Updates the config for the selected runtime.
  4. Restarts the selected runtime.
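The sketch below illustrates this dependent flow. All of the function names, the runtimeConfig type, and the example low-level runtime path are hypothetical; they only show how the output of step 1 would feed into steps 2 and 3:

```go
package main

import (
	"fmt"
	"log"
)

// runtimeConfig is a hypothetical, simplified view of the extracted runtime
// configuration; only the field needed for this sketch is shown.
type runtimeConfig struct {
	lowLevelRuntimePath string
}

// The functions below are placeholders for the Go APIs the refactored flow
// would call; their names and signatures are illustrative only.

func extractConfig(runtime string) (*runtimeConfig, error) {
	// e.g. read the config file, drop-in files, or the output of a
	// command such as `containerd config dump` for the selected runtime.
	return &runtimeConfig{lowLevelRuntimePath: "/usr/bin/crun"}, nil
}

func installToolkit(cfg *runtimeConfig) error {
	// The toolkit installation now depends on the extracted config, e.g.
	// to point the NVIDIA Container Runtime at the configured low-level runtime.
	fmt.Printf("installing toolkit against low-level runtime %q\n", cfg.lowLevelRuntimePath)
	return nil
}

func updateRuntimeConfig(runtime string, cfg *runtimeConfig) error {
	// Add the NVIDIA runtime entries to the extracted config and write it back.
	return nil
}

func restartRuntime(runtime string) error {
	return nil
}

func configureRuntime(runtime string) error {
	// 1. Extract the current config from the selected runtime.
	cfg, err := extractConfig(runtime)
	if err != nil {
		return fmt.Errorf("failed to extract config: %w", err)
	}
	// 2. Install the toolkit with config-dependent modifications.
	if err := installToolkit(cfg); err != nil {
		return fmt.Errorf("failed to install toolkit: %w", err)
	}
	// 3. Update the config for the selected runtime.
	if err := updateRuntimeConfig(runtime, cfg); err != nil {
		return fmt.Errorf("failed to update config: %w", err)
	}
	// 4. Restart the selected runtime.
	return restartRuntime(runtime)
}

func main() {
	if err := configureRuntime("containerd"); err != nil {
		log.Fatal(err)
	}
}
```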

Note that moving to the Go API means that, when un-configuring the NVIDIA runtime, the extracted config can be used as a reference to revert any toolkit-specific changes that were made at a global level.

One problem to solve is that config changes could theoretically be made between the config extraction, the toolkit installation, and the runtime configuration. As a starting point, we could assume that the risk of this is low, but it may be good to add checks to confirm that the config read in step 1 is equivalent to the config being modified in step 3.
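One possible shape for such a check, assuming the extracted config is held as a generic map and ignoring how it is re-read, is sketched below; reflect.DeepEqual and the reload callback are illustrative choices, not an existing API:

```go
package main

import (
	"fmt"
	"reflect"
)

// verifyUnchanged re-reads the runtime config via the supplied reload
// function and compares it with the config extracted in step 1. If the two
// differ, the update in step 3 is aborted instead of being applied to a
// stale view of the config.
func verifyUnchanged(extracted map[string]interface{}, reload func() (map[string]interface{}, error)) error {
	current, err := reload()
	if err != nil {
		return fmt.Errorf("failed to re-read config: %w", err)
	}
	if !reflect.DeepEqual(extracted, current) {
		return fmt.Errorf("runtime config changed since it was extracted; refusing to modify it")
	}
	return nil
}

func main() {
	extracted := map[string]interface{}{"version": int64(2)}
	reload := func() (map[string]interface{}, error) {
		return map[string]interface{}{"version": int64(2)}, nil
	}
	if err := verifyUnchanged(extracted, reload); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("config unchanged; safe to apply the update")
}
```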
