Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDI support #4056

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from
Draft

CDI support #4056

wants to merge 14 commits into from

Conversation

crazy-max
Copy link
Member

@crazy-max crazy-max commented Jul 24, 2023

related to #1436

Adds initial support for specifying fully-qualified CDI device names.

@crazy-max crazy-max force-pushed the cdi branch 2 times, most recently from 4a22da3 to a5c3952 Compare July 24, 2023 10:16
@crazy-max
Copy link
Member Author

I was also looking at moby/moby#45134 to register CDI device drivers. I think we need to add a new attribute in our TOML config to set the CDI specification directories otherwise CDI request will fail.

@elezar
Copy link

elezar commented Jul 25, 2023

@crazy-max I have not yet had a look at these changes in detail.

From my understanding buildkit is used to build containers. Am I to understand correctly that the expectation is that with these changes devices are made available at build time?

Note that from the NVIDIA side, this may have unintended consequences of producing container images that require a specific driver version to work correctly. The intent of the CDI device specifications that are used when running a container is to ensure that drivers that match the kernel mode driver on the host are injected dynamically, allowing applications that are built against the driver API to function as expected.

@crazy-max
Copy link
Member Author

You are correct that BuildKit is used to build containers, and with these changes, the goal is to make devices available at build time. This would allow users to mount specific devices when building container images.

Regarding the unintended consequences of producing container images that require a specific driver version, it sound expected and we might put in place some kind of capabilities. Intention is to provide more flexibility and convenience to users during the container building process. However, I understand the importance of keeping portability across environments.

If you have any further suggestions or concerns, please don't hesitate to let us know. We aim to create a seamless experience for users while ensuring compatibility and reliability across different setups.

@georgettica
Copy link

any update on this? @crazy-max @elezar

so far any place I see I need to disable buildkit to make use of building NVIDIA related components.

I would rather allow a very limited experience with tons of caveats than removing buildkit to build.

Copy link

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left some comments.

It might be good to call out in the PR description how this option is used and protected by a feature flag / capabilities. Documentation changes for the option that call out the caveats would also be appreciated.

Comment on lines 180 to 191
registry := cdi.GetRegistry(cdi.WithAutoRefresh(false))
if err := registry.Refresh(); err != nil {
// We don't consider registry refresh failure a fatal error.
// For instance, a dynamically generated invalid CDI Spec file for
// any particular vendor shouldn't prevent injection of devices of
// different vendors. CDI itself knows better, and it will fail the
// injection if necessary.
bklog.G(ctx).Warnf("CDI registry refresh failed: %v", err)
}
return nil
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the v0.7.1 release you could use:

Suggested change
registry := cdi.GetRegistry(cdi.WithAutoRefresh(false))
if err := registry.Refresh(); err != nil {
// We don't consider registry refresh failure a fatal error.
// For instance, a dynamically generated invalid CDI Spec file for
// any particular vendor shouldn't prevent injection of devices of
// different vendors. CDI itself knows better, and it will fail the
// injection if necessary.
bklog.G(ctx).Warnf("CDI registry refresh failed: %v", err)
}
return nil
}
}
_ := cdi.Configure(cdi.WithAutoRefresh(false))
if err := cdi.Refresh(); err != nil {
// We don't consider registry refresh failure a fatal error.
// For instance, a dynamically generated invalid CDI Spec file for
// any particular vendor shouldn't prevent injection of devices of
// different vendors. CDI itself knows better, and it will fail the
// injection if necessary.
bklog.G(ctx).Warnf("CDI registry refresh failed: %v", err)
}
return nil
}
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/containerd/containerd/pull/10041/files

If this is require as external functionality, we could also consider exporting WithStaticCDIRegistry from the oci package like we do with WithCDIDevices.

@colinhemmings
Copy link
Collaborator

Hi @georgettica, thanks for your response. I'm a product manager at Docker, and this is something we are actively exploring at the moment, with the aim of getting a timeframe for the roadmap. Do you have details on what would be an acceptable limited experience for you? I would be interested to chat about your use case if you prefer? https://calendly.com/colin-hemmings-dock/buildkit-chat

Signed-off-by: CrazyMax <[email protected]>
@crazy-max crazy-max force-pushed the cdi branch 2 times, most recently from f7f20ed to a492e7f Compare February 4, 2025 02:17
disabled = true
# List of directories to scan for CDI spec files. For more details about CDI
# specification, please refer to https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md#cdi-json-specification
specDirs = ["/etc/cdi", "/var/run/cdi", "/etc/buildkit/cdi"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think auto-allow can be in two places. We can have a list here for auto-allowed devices and additionally there can be annotation in the CDI yaml for specific device.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there can be annotation in the CDI yaml for specific device.

I guess this would be under devices.annotations like:

cdiVersion: 0.3.0
kind: nvidia.com/gpu
devices:
- name: gpu0
  annotations:
    org.mobyproject.buildkit.device.allow: true

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/allow/autoallow as all devices are allowed.

@@ -624,6 +640,12 @@ func AddUlimit(name UlimitName, soft int64, hard int64) RunOption {
})
}

func AddCDIDevice(name string) RunOption {
return runOptionFunc(func(ei *ExecInfo) {
ei.State = ei.State.AddCDIDevice(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to take at least the required bool. The string value should be kind | kind=device | classifier. I'm not sure if it makes sense to break these apart in the LLB level or keep as a single string variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string value should be kind | kind=device | classifier. I'm not sure if it makes sense to break these apart in the LLB level or keep as a single string variable.

I think better to keep name as a string as we are already validating it when generating the spec: https://github.com/moby/buildkit/pull/4056/files#diff-478a239cda41c7aff441eef90d0eae4fc531f3fcc30db204e9dfbe64b9d863aaR182-R184

@crazy-max crazy-max changed the title add support for CDI devices CDI support Feb 4, 2025
@crazy-max crazy-max force-pushed the cdi branch 2 times, most recently from b183106 to 98ae798 Compare February 4, 2025 10:06
crazy-max and others added 7 commits February 4, 2025 12:14
Signed-off-by: CrazyMax <[email protected]>
Signed-off-by: Tonis Tiigi <[email protected]>
Access should be managed by entitlements checks

Signed-off-by: Tonis Tiigi <[email protected]>
var out []llb.RunOption
devices := instructions.GetDevices(c)
for _, device := range devices {
out = append(out, llb.AddCDIDevice(llb.CDIDeviceName(device), llb.CDIDeviceOptional))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should a CDI device set in a dockerfile always marked as optional?

Copy link
Member Author

@crazy-max crazy-max Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think instead we should have a required field like we do for secrets and is optional by default:

RUN --device=name=vendor1.com/device=foo,required

Comment on lines 62 to 65
FrontendAttrs: map[string]string{
"device:vendor1.com/device=foo": "required",
"device:vendor2.com/device=bar": "required=false",
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems there is an issue with frontend attributes: #4715 (comment)

sandbox.go:205: panic: failed to parse devices: invalid CDI device name "vendor2.com/device": unqualified device "vendor2.com/device", missing vendor

I changed the device attribute to a map with device: prefix so we can pass multiple of them but seems the key with device name is already parsed somewhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the format to use an index instead:

FrontendAttrs: map[string]string{
	"device:0": "vendor1.com/device=foo,required",
	"device:1": "vendor2.com/device=bar",
},

Don't really like it but not sure what's best 🙈

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants