-
Notifications
You must be signed in to change notification settings - Fork 379
Generate CDI specification including additional GIDs #630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Note: This updates the default spec version to v0.7.0 since the DRM devices generally require additional GIDs. As such, this should probably be an opt-in (or at least an opt-out) feature. |
74441f4
to
285ea44
Compare
@@ -169,6 +174,11 @@ func (m command) build() *cli.Command { | |||
Usage: "Specify a pattern the CSV mount specifications.", | |||
Destination: &opts.csv.ignorePatterns, | |||
}, | |||
&cli.BoolFlag{ | |||
Name: "--allow-additional-gids", | |||
Usage: "Allow the use of the additionalGIDs field for generated CDI specifications. Note this will generate a v0.7.0 CDI specification.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which versions of containerd / cri-o (and other runtimes for that matter) support this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have to check. Note that this is not enabled by default when generating the CDI spec and only for internal representations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- containerd
- (>= 1.7.16) - containerd/containerd@7a2f49f
- (>= 1.8.0, >= 2.0.0) - containerd/containerd@1b62224
- docker (>= 26.1.0) - moby/moby@745e235
- podman (>= 5.1.0) - containers/podman@a40cf31
- crio (>= 1.30.0) - cri-o/cri-o@fd9aa76
internal/config/features.go
Outdated
MOFED *feature `toml:"mofed,omitempty"` | ||
NVSWITCH *feature `toml:"nvswitch,omitempty"` | ||
GDRCopy *feature `toml:"gdrcopy,omitempty"` | ||
NoAdditionalGIDs *feature `toml:"no-additional-gids,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this reversed (i.e. no
rather than allow
) from the flag passed in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this has different default behaviour. For the NVIDIA Contianer Runtime I want to generated an internal representation including the additionalGIDs
container edits for device nodes. This allows users to opt out if there is an issue.
I suppose we could make the argument that if we backport this (which we may want to due to the Jetpack feature request) we should keep the existing behaviour and not make these modifications either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, after having thought this through again, I have disabled this by default and switched the feature flag to --allow-additional-gids
.
internal/edits/device.go
Outdated
deviceNode, err := d.toSpec() | ||
if err != nil { | ||
return nil, err | ||
} | ||
|
||
var additionalGIDs []uint32 | ||
if allowAdditionalGIDs { | ||
if requiredGID, _ := d.getRequiredGID(); requiredGID != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we log the error instead of swallowing it ? Or we could just have the function not return an error struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to not return an error.
This change refactors the creation of contianer edits to make provision for injecting options that affect the generated CDI specifications. Signed-off-by: Evan Lezar <[email protected]>
285ea44
to
141772c
Compare
7ecd467
to
fab0c6d
Compare
This change adds support for injecting additional GIDs using the internal CDI representations. (Applicable for Tegra-based systems and /dev/dri devices) This is disabled by default, but can be opted in to by setting the features.allow-additional-gids feature flag. This can also be done by running sudo nvidia-ctk config --in-place --set features.allow-additional-gids Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
fab0c6d
to
5d55813
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces support for automatically including required group IDs (AdditionalGIDs
) for device nodes in generated CDI specs when enabled via config or CLI. It updates the internal edits pipeline, CLI, and config to propagate an allowAdditionalGIDs
flag, adds deduplication logic for the new field, and refactors resource construction.
- Added
allowAdditionalGIDs
option and plumbing throughnvcdilib
and modifiers - Extended
device.toEdits
to conditionally gather required GIDs and deduplicate them - Updated CLI to expose
--allow-additional-gids
flag and config handling
Reviewed Changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
pkg/nvcdi/transform/merged-device.go | Switched to edits.NewResource for merged device construction |
pkg/nvcdi/transform/deduplicate.go | Added deduplicateAdditionalGIDs to clean up GID lists |
pkg/nvcdi/options.go | Introduced WithAllowAdditionalGIDs option |
pkg/nvcdi/lib.go | Added allowAdditionalGIDs field and editsFromDiscoverer |
pkg/nvcdi/lib-wsl.go | Updated WSL library to use new edits API |
pkg/nvcdi/lib-nvml.go | Removed unused imports and updated GetCommonEdits |
pkg/nvcdi/lib-csv.go | Refactored CSV logic to use new edits API |
pkg/nvcdi/gds.go | Refactored GDS logic to use new edits API |
pkg/nvcdi/mofed.go | Updated MOFED logic to use editsFromDiscoverer |
pkg/nvcdi/mig-device-nvml.go | Refactored MIG device specs to use NewResource |
pkg/nvcdi/management.go | Updated management logic to use editsFromDiscoverer |
pkg/nvcdi/full-gpu-nvml.go | Refactored full GPU spec creation to use new edits API |
internal/edits/resource.go | Added NewResource helper |
internal/edits/options.go | Added allowAdditionalGIDs option for edits library |
internal/edits/device.go | Extended toEdits to collect AdditionalGIDs |
internal/edits/lib.go | Implemented new Interface for edits creation |
internal/edits/api.go | Defined edits Interface |
internal/edits/edits.go | Removed old global FromDiscoverer and NewSpecEdits |
internal/edits/edits_test.go | Updated tests to use new edits API |
pkg/nvcdi/transform/deduplicate.go | (duplicate entry) Deduplication logic for AdditionalGIDs |
internal/modifier/graphics.go | Passed allowAdditionalGIDs flag into graphics modifier |
internal/modifier/gated.go | Added allowAdditionalGIDs argument to gated modifier |
internal/modifier/discover.go | Updated NewModifierFromDiscoverer signature |
internal/modifier/discover_test.go | Updated tests for new modifier signature |
internal/config/features.go | Added FeatureAllowAdditionalGIDs handling |
internal/config/features_test.go | Added tests for AllowAdditionalGIDs feature flag |
cmd/nvidia-ctk/cdi/generate/generate.go | Introduced CLI flag for allow-additional-gids |
Comments suppressed due to low confidence (2)
pkg/nvcdi/transform/deduplicate.go:162
- [nitpick] Add unit tests for deduplicateAdditionalGIDs to verify that duplicate GIDs are removed and the slice is sorted correctly.
func (d dedupe) deduplicateAdditionalGIDs(entities []uint32) ([]uint32, error) {
cmd/nvidia-ctk/cdi/generate/generate.go:178
- CLI flag Name should not include leading dashes; use "allow-additional-gids" so the flag is recognized properly by the parser.
Name: "--allow-additional-gids",
@@ -155,3 +155,11 @@ func WithLibrarySearchPaths(paths []string) Option { | |||
o.librarySearchPaths = paths | |||
} | |||
} | |||
|
|||
// WithAllowAdditionalGIDs specifies whether a generated CDI spec can contain | |||
// the additionaGIDs field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo in comment: change 'additionaGIDs' to 'additionalGIDs'.
// the additionaGIDs field. | |
// the additionalGIDs field. |
Copilot uses AI. Check for mistakes.
@@ -78,5 +78,7 @@ func NewFeatureGatedModifier(logger logger.Interface, cfg *config.Config, image | |||
discoverers = append(discoverers, d) | |||
} | |||
|
|||
return NewModifierFromDiscoverer(logger, discover.Merge(discoverers...)) | |||
allowAdditionalGIDs := !cfg.Features.IsEnabled(config.FeatureAllowAdditionalGIDs, image) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The negation operator '!' inverts the intended flag; remove '!' so the feature flag passes through correctly.
allowAdditionalGIDs := !cfg.Features.IsEnabled(config.FeatureAllowAdditionalGIDs, image) | |
allowAdditionalGIDs := cfg.Features.IsEnabled(config.FeatureAllowAdditionalGIDs, image) |
Copilot uses AI. Check for mistakes.
This change adds logic to include the
AdditionalGIDs
required for a device node in a generated CDI specification.When a device node requires specific group membership for access, a non-root user running in a container (e.g. using the
-u
Docker command line argument) may not have access to the device.in the v0.7.0 CDI specification, support was added for specifying additional GIDs in the CDI specification and these changes extract the required information from the device nodes on the host if available.
This is disabled by default.
It can be enabled for internal CDI representations such as those used for Tegra-based systems or
/dev/dri
devices by setting:in the config.toml file.
This can also be opted in to on a per-container basis by setting
NVIDIA_ALLOW_ADDITIONAL_GIDS=enabled
.The
nvidia-ctk
command can be used to toggle this feature by running:It can be enabled when running
nvidia-ctk cdi generate
by specifying the--allow-additional-gids
command line argument: