Improve support for container runtime configurations #712

Open
7 tasks
elezar opened this issue Sep 24, 2024 · 0 comments
elezar commented Sep 24, 2024

The goal of this enhancement is to provide better support for configuring container runtimes such as containerd or CRI-O. The existing approach in the toolkit container has the following shortcomings:

  • We rely on a single config file to determine the current configuration. This may not reflect the active config, as it ignores drop-in files or command-line options, for example.
  • We have no mechanism to update the NVIDIA Container Toolkit configuration based on the extracted config. This is an issue for cases where the low-level runtime has an explicit path set, or where it is not runc or crun.

The work required includes the following:

Refactor the toolkit container tooling to prepare for the changes

In order to address the shortcomings above, it is recommended that we refactor the toolkit container tooling. The primary goal here is to move from invoking toolkit, containerd, crio, or docker as shell commands to apply the required changes, and instead expose this functionality through Go APIs. #704 demonstrates this for the toolkit executable, with the toolkit.Install function being used to perform the toolkit installation. In each case, the command-line arguments should be added to the top-level command, and common options could be moved to the top-level option structure.

Note that some changes may be required in the GPU Operator to ensure that the correct options are being passed. No changes are expected, since we should be relying on environment variables for most options, but this should be audited.
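To make the intent concrete, the sketch below shows what moving from a shell invocation to a Go API call might look like. All names here (the Options struct and its fields) are illustrative assumptions, not the actual NVIDIA Container Toolkit API; #704 defines the real toolkit.Install signature.

```go
package main

import "fmt"

// Options mirrors the command-line flags that were previously passed to
// the `toolkit` shell command. The fields are hypothetical examples.
type Options struct {
	ToolkitRoot string // installation root for the toolkit
	DriverRoot  string // root of the driver installation
}

// Install replaces an exec.Command("toolkit", ...) invocation with a
// direct function call, so errors are returned as Go values rather than
// being parsed out of process exit codes and log output.
func Install(opts *Options) error {
	if opts.ToolkitRoot == "" {
		return fmt.Errorf("toolkit root must be specified")
	}
	// ... perform the installation using opts ...
	return nil
}

func main() {
	if err := Install(&Options{ToolkitRoot: "/usr/local/nvidia/toolkit"}); err != nil {
		fmt.Println("install failed:", err)
		return
	}
	fmt.Println("install succeeded")
}
```

A caller such as the GPU Operator can then construct Options programmatically and handle the returned error directly, instead of auditing shell arguments.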

The following could be used as a rough checklist:

  • Use Go API for toolkit installation #704
  • Use Go API for docker config
  • Use Go API for containerd config
  • Use Go API for crio config
  • Audit the GPU Operator usage of the API
  • Move the nvidia-toolkit command (possibly renaming it) to the cmd folder and include in the operator-extensions package.

Refactor runtime configuration logic

As mentioned, the current implementation assumes a single config file as the reference for the modifications required to enable the NVIDIA Container Runtime for a container runtime such as Docker, containerd, or CRI-O. In order to extend this to other sources, it would make sense to refactor the logic for extracting the current config into something more generic. This could then be extended with other mechanisms for determining the current config, such as querying a gRPC API or invoking a shell command. #643 already demonstrates this by adding the concept of a TOML source that can be used for runtimes such as containerd or CRI-O. (This PR needs to be iterated on.)
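One way to picture the "more generic" extraction logic is a small interface that hides where the config bytes come from. This is a sketch under assumed names, not the actual types from #643:

```go
package main

import "fmt"

// ConfigSource abstracts where the effective runtime configuration comes
// from: a single file, drop-in directories, a gRPC query, or the output
// of a shell command. All type names here are illustrative.
type ConfigSource interface {
	// Contents returns the effective configuration as raw TOML bytes.
	Contents() ([]byte, error)
}

// fileSource models the current behaviour: one config file on disk.
type fileSource struct{ raw []byte }

func (f *fileSource) Contents() ([]byte, error) { return f.raw, nil }

// commandSource models asking the runtime itself for its merged config
// (e.g. something like `containerd config dump`), so drop-in files and
// command-line overrides are reflected.
type commandSource struct{ dump func() ([]byte, error) }

func (c *commandSource) Contents() ([]byte, error) { return c.dump() }

// loadConfig is agnostic to where the config came from.
func loadConfig(src ConfigSource) (string, error) {
	raw, err := src.Contents()
	if err != nil {
		return "", fmt.Errorf("failed to load config: %w", err)
	}
	return string(raw), nil
}

func main() {
	src := &commandSource{dump: func() ([]byte, error) {
		return []byte("version = 2\n"), nil
	}}
	cfg, _ := loadConfig(src)
	fmt.Print(cfg)
}
```

New mechanisms (a gRPC query, for example) would then only need to implement ConfigSource, leaving the downstream modification logic untouched.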

The following could be used as a rough checklist:

Add new functionality

With the refactoring in place we can extend the existing functionality to add support for new use cases and configurations. The primary changes are:

Add support for new config (i.e. TOML) sources that allow for the current config to be extracted

This is already included in #686 but should be focused to only address the additional config source.

Update the toolkit configuration flow to allow for a config that is runtime dependent

The current flow when installing the toolkit and configuring the runtimes is linear, and there is no dependence between the steps (see `func Run(c *cli.Context, o *options) error`). Instead of installing the toolkit and then configuring and restarting the runtime, we need to define a new flow that:

  1. Extracts the current config from the selected runtime.
  2. Installs the toolkit with the relevant config modifications (e.g. the low-level runtime path)
  3. Updates the config for the selected runtime.
  4. Restarts the selected runtime.
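The four steps above can be sketched as a single Go flow in which each step feeds the next. The function bodies below are stubs standing in for the real extraction, installation, and restart logic; all names are assumptions for illustration:

```go
package main

import "fmt"

// runtimeConfig carries the settings extracted in step 1 that later
// steps depend on, such as an explicit low-level runtime path.
type runtimeConfig struct {
	LowLevelRuntimePath string // e.g. an explicit path to runc or crun
}

// Step 1: read the effective config from the selected runtime.
func extractConfig(runtime string) (*runtimeConfig, error) {
	return &runtimeConfig{LowLevelRuntimePath: "/usr/bin/runc"}, nil
}

// Step 2: install the toolkit, honouring config-derived settings such as
// the low-level runtime path extracted above.
func installToolkit(cfg *runtimeConfig) error { return nil }

// Step 3: write the NVIDIA runtime entries into the runtime's config.
func updateConfig(runtime string, cfg *runtimeConfig) error { return nil }

// Step 4: restart the runtime so the new config takes effect.
func restartRuntime(runtime string) error { return nil }

func configure(runtime string) error {
	cfg, err := extractConfig(runtime)
	if err != nil {
		return fmt.Errorf("extract: %w", err)
	}
	if err := installToolkit(cfg); err != nil {
		return fmt.Errorf("install: %w", err)
	}
	if err := updateConfig(runtime, cfg); err != nil {
		return fmt.Errorf("update: %w", err)
	}
	return restartRuntime(runtime)
}

func main() {
	if err := configure("containerd"); err != nil {
		fmt.Println("configuration failed:", err)
		return
	}
	fmt.Println("containerd configured")
}
```

The key difference from the current linear flow is that the config extracted in step 1 is passed into step 2, making the toolkit installation dependent on the runtime's actual configuration.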

Note that moving to the Go API means that when un-configuring the NVIDIA runtime, the extracted config can be used as a reference to revert any toolkit-specific changes made at a global level.

One problem to solve is that config changes could, in theory, be made between the config extraction, the toolkit installation, and the runtime configuration. As a starting point we could assume that the risk of this is low, but it may be good to add checks confirming that the config extracted in step 1 is equivalent to the config being modified in step 3.
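One simple way to implement such a check is to fingerprint the raw config at extraction time and compare it against a re-read just before modification. This is a sketch; the real check could equally compare parsed TOML structures rather than raw bytes:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fingerprint produces a stable digest of a raw config, so the config
// extracted in step 1 can be compared with the one about to be modified
// in step 3.
func fingerprint(config []byte) [32]byte {
	return sha256.Sum256(config)
}

func main() {
	extracted := []byte("version = 2\n") // captured during step 1
	// ... toolkit installation (step 2) happens in between ...
	current := []byte("version = 2\n") // re-read just before step 3

	if fingerprint(extracted) != fingerprint(current) {
		fmt.Println("config changed since extraction; aborting modification")
		return
	}
	fmt.Println("config unchanged; safe to modify")
}
```

If the fingerprints differ, the flow could either abort or re-run the extraction step before applying the modifications.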
