Skip to content

Conversation

cdesiniotis
Copy link
Contributor

@cdesiniotis cdesiniotis commented Sep 22, 2025

This change updates the drop-in file contents that the nvidia-ctk-installer creates for containerd. In addition to adding nvidia runtimes, the nvidia-ctk-installer also includes all existing configuration present in the source containerd configuration for the CRI plugin. That is, all configuration already present in the 'plugins."io.containerd.grpc.v1.cri"' section will be added to our drop-in file.

This is needed due to how containerd merges configuration from multiple files. See containerd/containerd#5837

@cdesiniotis cdesiniotis force-pushed the add-existing-runtimes-to-containerd-drop-in branch 2 times, most recently from 53fd2be to af457bc Compare September 23, 2025 02:25
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the nvidia-ctk-installer to include existing runtime configurations from the source containerd config when creating drop-in files. Instead of starting with an empty config, the drop-in file now preserves all runtime configurations already present in the source's 'plugins."io.containerd.grpc.v1.cri".containerd.runtimes' section.

  • Modified the destination config to start with existing runtime configurations from source
  • Added a new function getBaseDropInConfigTree to extract and copy runtime configurations
  • Updated all test expectations to include existing runtime configurations in drop-in files

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
pkg/config/engine/containerd/containerd.go Implements the core logic to copy existing runtime configs to drop-in files
pkg/config/engine/containerd/config_test.go Updates test expectations to include existing runtime configurations
cmd/nvidia-ctk-installer/container/runtime/containerd/config_test.go Updates integration test expectations to verify runtime config preservation

Comment on lines 136 to 142
// The destinationConfig is a minimal config with the same options
// as the source config. The starting content of the destinationConfig
// is the entire plugins."io.containerd.grpc.v1.cri".containerd.runtimes
// section of the source config. This is needed due to how containerd
// merges configuration from multiple files.
// Reference: https://github.com/containerd/containerd/issues/5837
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that newer containerd versions (v2.1) do seem to do what we want, did we want to make this optional somehow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a local change that I can create a PR from as a follow-up. Let's not overcomplicate this now though.

@cdesiniotis cdesiniotis force-pushed the add-existing-runtimes-to-containerd-drop-in branch from af457bc to 2c06db4 Compare September 23, 2025 16:34
@cdesiniotis cdesiniotis changed the title Add runtime config from source containerd config to drop-in file Add CRI plugin config from source containerd config to drop-in file Sep 23, 2025
@cdesiniotis cdesiniotis force-pushed the add-existing-runtimes-to-containerd-drop-in branch from 2c06db4 to e1c213e Compare September 23, 2025 16:36
@coveralls
Copy link

coveralls commented Sep 23, 2025

Pull Request Test Coverage Report for Build 17971376989

Details

  • 15 of 26 (57.69%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 36.304%

Changes Missing Coverage Covered Lines Changed/Added Lines %
cmd/nvidia-ctk-installer/container/container.go 0 11 0.0%
Totals Coverage Status
Change from base Build 17920495274: 0.05%
Covered Lines: 4827
Relevant Lines: 13296

💛 - Coveralls

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

@cdesiniotis
Copy link
Contributor Author

Initial testing of these changes looks good. Using the GPU Operator (built from NVIDIA/gpu-operator#1710) I have verified several of the tests captured in https://github.com/NVIDIA/nvidia-container-toolkit/blob/b3ce3a4f3855764208b4b14004620d27a947110a/cmd/nvidia-ctk-installer/container/runtime/containerd/config_test.go.

Comment on lines 127 to 129
if cfg.UseDropInConfig() {
return cfg.CleanupDropInConfig()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elezar note, I remove the nvidia runtimes first before a call to CleanupDropInConfig()since the ConfigWithDropIn implementation of RemoveRuntime removes runtimes from both the top-level config and the drop-in file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The other option would be to remove the runtimes in the top-level config when ADDING them to the drop in, but that may be more complex than it needs to be.

@cdesiniotis cdesiniotis force-pushed the add-existing-runtimes-to-containerd-drop-in branch from e1c213e to dafe333 Compare September 23, 2025 17:43
}

func (c *Config) UseDropInConfig() bool {
return false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why is this always false? for containerd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is false for the Config implementation, and true for the ConfigWithDropIn implemenation.

Thinking about it now, this method is not very useful. I believe we can just call CleanupDropInConfig directly and this can be a no-op for all cases except for ConfigWithDropIn.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think the check was only needed if we called this BEFORE we removed the runtimes. Since we're doing that first now, we can just call it unconditionally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and yes. I'm sorry, it's late).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left this as-is for now. Feel free to clean this up further as you see fit.

I spent some time trying to remove the UseDropInConfig method entirely (and unconditionally calling CleanupDropInConfig), but this did not end up being as straightforward as I would have liked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could something like #1311 (comment) work? If I recall correctly this is what you proposed initially. This is definitely simpler than trying to "clear" the config after the fact.

}

func (c *ConfigWithDropIn) CleanupDropInConfig() {
c.Interface = &Config{Tree: toml.NewEmpty()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not just be:

Suggested change
c.Interface = &Config{Tree: toml.NewEmpty()}
c.Tree = toml.NewEmpty()

What about the configOptions which are used to update the config later? (I understand that we only ever call this AFTER we're done updating the config, but this may not always be the case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't able to do this without some refactoring. Currently, an engine.Interface is a member of ConfigWithDropIn and not a Config.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I think I'm missing some important details here after a very long day. I will have a look again after some rest.

}

func (c *Config) CleanupDropInConfig() {
c.Tree = toml.NewEmpty()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this works for crio then i think we should be able to use it for containerd too.

EnableCDI()
GetRuntimeConfig(string) (RuntimeConfig, error)
RemoveRuntime(string) error
UseDropInConfig() bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a note to self. We should probably clean up this interface / compose the other two interfaces that we have introduced.

@elezar elezar force-pushed the add-existing-runtimes-to-containerd-drop-in branch from dafe333 to 6e7220f Compare September 24, 2025 07:47
Comment on lines +78 to +87
if o.DropInConfig == "" {
return nil
}
// When a drop-in config is used, we remove the drop-in file explicitly.
// This is require for cases where we may have to include other contents
// in the drop-in file and as such it may not be empty when we flush it.
err = os.Remove(o.DropInConfig)
if err != nil && !errors.Is(err, os.ErrNotExist) {
return fmt.Errorf("failed to remove drop-in config file: %w", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdesiniotis given the discussions around the additional functions to the interface, I thought it would actually be cleaner to explicitly remove the file here instead of a more complex implementation at a lower level.

I'm going to push out an rc including these changes be we can always revisit if there's something that I missed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the file is definitely cleaner and simplifies implementation. I am onboard as long as we are sure the conditional immediately preceding this is sufficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For use cases where we are modifying the config directly -- or drop ins are not supported -- a user SHOULD be setting DropInConfig to "" explicitly. This would be the case for docker, or for v1 containerd configs.

The check for DropInConfig == "" also mirrors what we do when deciding whether to flush to the drop-in config or the "regular" config.

Are the situations that you're concerned about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see that this is aligned with the code that flushes the config, so I believe this addresses my question.

For use cases where we are modifying the config directly -- or drop ins are not supported -- a user SHOULD be setting DropInConfig to "" explicitly. This would be the case for docker, or for v1 containerd configs.

Ah, I wasn't aware that the intention was for users to set --drop-in-config="" explicitly when using v1 containerd configs... As long as we cover this in our documentation, this should be fine. Most should be using v2 or v3 configs at this point anyways.

@elezar elezar force-pushed the add-existing-runtimes-to-containerd-drop-in branch from 6e7220f to 887b7f6 Compare September 24, 2025 08:42
This change updates the drop-in file contents that the nvidia-ctk-installer creates
for containerd. In addition to adding nvidia runtimes, the nvidia-ctk-installer
also includes all existing configuration present in the source containerd configuration
for the CRI plugin. That is, all configuration already present in the
'plugins."io.containerd.grpc.v1.cri"' section will be added to our drop-in file.

This is needed due to how containerd merges configuration from multiple files.
See containerd/containerd#5837

Signed-off-by: Christopher Desiniotis <[email protected]>
@elezar elezar force-pushed the add-existing-runtimes-to-containerd-drop-in branch from 887b7f6 to 598c632 Compare September 24, 2025 08:44
@elezar elezar added this to the v1.18.0 milestone Sep 24, 2025
@ArangoGutierrez ArangoGutierrez self-requested a review September 24, 2025 12:05
Comment on lines +78 to +80
if o.DropInConfig == "" {
return nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- if o.DropInConfig != "" is it always the case that we are using a drop-in file? Even if this option is specified, my understanding is that a drop-in file would not be used for 1) v1 containerd configs, or 2) cri-o when the hook config mode is configured.

@cdesiniotis cdesiniotis merged commit 8d6e5cf into NVIDIA:main Sep 24, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants