The goal of this enhancement is to provide better support for configuring container runtimes such as Containerd or CRI-O. The existing approach in the toolkit container has the following shortcomings:
We rely on a single config file to determine the current configuration. This may not reflect the active config, since it ignores drop-in files or command-line options, for example.
We have no mechanism to update the NVIDIA Container Toolkit configuration based on the extracted config. This is an issue for cases where the low-level runtime has an explicit path set, or if it is not runc or crun.
The work required includes the following:
Refactor the toolkit container tooling to prepare for the changes
In order to address the shortcomings above, it is recommended that we first refactor the toolkit container tooling. The primary goal is to stop invoking toolkit, containerd, crio, or docker as shell commands to apply the required changes and instead expose this functionality over Go APIs. #704 shows this for the toolkit executable, with the toolkit.Install function being used to perform the toolkit installation. In each case, the command-line arguments should be added to the top-level command, and common options could be moved to the top-level option structure.
Note that some changes may be required in the GPU Operator to ensure that the correct options are being passed. None are expected, since we should be relying on environment variables for most options, but this should be audited.
The following could be used as a rough checklist:
Move the nvidia-toolkit command (possibly renaming it) to the cmd folder and include it in the operator-extensions package.
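To make the intended shape of this refactoring concrete, the following is a minimal sketch of a Go-API-based entry point. The Options struct, the Install signature, and the flag names here are illustrative assumptions only; the actual API is the one being introduced in #704.

```go
// Hypothetical sketch: exposing the toolkit installation as a Go API
// instead of shelling out to the toolkit executable.
package main

import (
	"flag"
	"log"
)

// Options collects settings shared by the top-level command and the
// individual install/configure steps.
type Options struct {
	ToolkitRoot     string // where the toolkit files are installed
	LowLevelRuntime string // e.g. an explicit path to runc or crun
}

// Install performs the toolkit installation directly, replacing the
// previous exec-style invocation of the toolkit executable.
func Install(opts *Options) error {
	log.Printf("installing toolkit to %q (low-level runtime: %q)", opts.ToolkitRoot, opts.LowLevelRuntime)
	// ... copy files, write the NVIDIA Container Toolkit config, etc.
	return nil
}

func main() {
	opts := &Options{}
	flag.StringVar(&opts.ToolkitRoot, "toolkit-root", "/usr/local/nvidia/toolkit", "installation root for the toolkit")
	flag.StringVar(&opts.LowLevelRuntime, "runtime-path", "", "explicit path to the low-level runtime")
	flag.Parse()

	if err := Install(opts); err != nil {
		log.Fatalf("toolkit installation failed: %v", err)
	}
}
```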
Refactor runtime configuration logic
As mentioned, the current implementation assumes a single config file as the reference for the modifications required to enable the NVIDIA Container Runtime for a container runtime such as Docker, Containerd, or CRI-O. In order to extend this to other sources, it would make sense to refactor the logic for extracting the current config into something more generic. This could then be extended with other mechanisms for determining the current config, such as querying a gRPC API or invoking a shell command. #643 already demonstrates this by adding the concept of a TOML source that can be used for runtimes such as Containerd or CRI-O. (This PR still needs to be iterated on.)
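A minimal sketch of what such a generic config source could look like is shown below. The ConfigSource interface, the two implementations, and the choice of the go-toml library are illustrative assumptions rather than the design adopted in #643; containerd config dump is used here as an example of asking the runtime for its effective, merged configuration.

```go
// Hypothetical sketch of a generic "config source" abstraction.
package main

import (
	"fmt"
	"os"
	"os/exec"

	"github.com/pelletier/go-toml/v2"
)

// ConfigSource yields the current runtime configuration as a generic map,
// regardless of whether it comes from a file, a command, or an API.
type ConfigSource interface {
	Load() (map[string]interface{}, error)
}

// tomlFile loads the config from a single TOML file on disk.
type tomlFile struct{ path string }

func (s tomlFile) Load() (map[string]interface{}, error) {
	raw, err := os.ReadFile(s.path)
	if err != nil {
		return nil, fmt.Errorf("reading %q: %w", s.path, err)
	}
	cfg := map[string]interface{}{}
	if err := toml.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parsing %q: %w", s.path, err)
	}
	return cfg, nil
}

// command loads the effective config by asking the runtime itself
// (e.g. containerd config dump), so drop-in files and CLI flags are honoured.
type command struct{ args []string }

func (s command) Load() (map[string]interface{}, error) {
	out, err := exec.Command(s.args[0], s.args[1:]...).Output()
	if err != nil {
		return nil, fmt.Errorf("running %v: %w", s.args, err)
	}
	cfg := map[string]interface{}{}
	if err := toml.Unmarshal(out, &cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	// Prefer the effective config reported by the runtime itself; fall back
	// to the single on-disk file if that is not available.
	sources := []ConfigSource{
		command{args: []string{"containerd", "config", "dump"}},
		tomlFile{path: "/etc/containerd/config.toml"},
	}
	for _, src := range sources {
		if cfg, err := src.Load(); err == nil {
			fmt.Printf("loaded config with %d top-level keys\n", len(cfg))
			return
		}
	}
	fmt.Println("no config source available")
}
```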
Add new functionality
With the refactoring in place we can extend the existing functionality to add support for new use cases and configurations. The primary changes are:
Add support for new config (i.e. TOML) sources that allow the current config to be extracted. This is already included in #686 but should be focused to only address the additional config source.
Update the toolkit configuration flow to allow for a config that is runtime dependent. The current flow when installing the toolkit and configuring the runtimes is linear, and there is no dependence between the steps (see nvidia-container-toolkit/tools/container/nvidia-toolkit/run.go, line 127 at a5a5833). Instead of installing the toolkit and then configuring and restarting the runtime, we need to define a new flow that (see the sketch after this list):
1. Extracts the current config from the selected runtime.
2. Installs the toolkit with the relevant config modifications (e.g. the low-level runtime path).
3. Updates the config for the selected runtime.
4. Restarts the selected runtime.
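Below is a rough sketch of how these four steps could be wired together. Every function name and type is a placeholder; the point is the ordering and the data flow, with the config extracted in step 1 feeding the toolkit installation in step 2.

```go
// Hypothetical sketch of the proposed configuration flow.
package main

import "log"

type runtimeConfig map[string]interface{}

// extractConfig reads the current config for the given runtime,
// e.g. via a generic config source as sketched earlier.
func extractConfig(runtime string) (runtimeConfig, error) { return runtimeConfig{}, nil }

// installToolkit installs the toolkit, adjusting its own config based on
// what was extracted (for example an explicit low-level runtime path).
func installToolkit(cfg runtimeConfig) error { return nil }

// updateRuntimeConfig adds the NVIDIA runtime to the runtime's config.
func updateRuntimeConfig(runtime string, cfg runtimeConfig) error { return nil }

// restartRuntime restarts the runtime so that the new config takes effect.
func restartRuntime(runtime string) error { return nil }

func configure(runtime string) error {
	// 1. Extract the current config from the selected runtime.
	cfg, err := extractConfig(runtime)
	if err != nil {
		return err
	}
	// 2. Install the toolkit using the extracted config.
	if err := installToolkit(cfg); err != nil {
		return err
	}
	// 3. Update the config for the selected runtime.
	if err := updateRuntimeConfig(runtime, cfg); err != nil {
		return err
	}
	// 4. Restart the selected runtime.
	return restartRuntime(runtime)
}

func main() {
	if err := configure("containerd"); err != nil {
		log.Fatal(err)
	}
}
```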
Note that moving to the Go API means that, when un-configuring the NVIDIA runtime, the extracted config can be used as a reference to revert any toolkit-specific changes made at a global level.
One problem to solve is that config changes could, in theory, be made between the config extraction, the toolkit installation, and the runtime configuration. As a starting point we can assume that the risk of this is low, but it may be worth adding additional checks to confirm that the config read in step 1 is equivalent to the config that is being modified in step 3.
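One way such a check could look is sketched below: re-read the config immediately before modifying it and compare it against the snapshot taken during extraction. The function name and the use of reflect.DeepEqual are illustrative assumptions.

```go
// Hypothetical sketch of a staleness check between step 1 and step 3.
package main

import (
	"errors"
	"fmt"
	"reflect"
)

type runtimeConfig map[string]interface{}

var errConfigChanged = errors.New("runtime config changed since it was extracted; re-run the configuration")

// ensureUnchanged compares the snapshot taken in step 1 with a fresh read of
// the config taken just before it is modified in step 3.
func ensureUnchanged(snapshot runtimeConfig, reload func() (runtimeConfig, error)) error {
	current, err := reload()
	if err != nil {
		return err
	}
	if !reflect.DeepEqual(snapshot, current) {
		return errConfigChanged
	}
	return nil
}

func main() {
	snapshot := runtimeConfig{"version": int64(2)}
	reload := func() (runtimeConfig, error) { return runtimeConfig{"version": int64(2)}, nil }
	fmt.Println(ensureUnchanged(snapshot, reload)) // <nil>
}
```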