DRA: Device Binding Conditions #5007

Open · 1 of 4 tasks
pohly opened this issue Dec 19, 2024 · 10 comments
Labels
lead-opted-in Denotes that an issue has been opted in to a release sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

pohly (Contributor) commented Dec 19, 2024

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 19, 2024
pohly (Contributor, Author) commented Dec 19, 2024

/assign @KobayashiD27

As discussed in kubernetes/kubernetes#124042 (comment).

/sig scheduling
/wg device-management

k8s-ci-robot (Contributor) commented:

@pohly: GitHub didn't allow me to assign the following users: KobayashiD27.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

> /assign @KobayashiD27
>
> As discussed in kubernetes/kubernetes#124042 (comment).
>
> /sig scheduling
> /wg device-management

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 19, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Dec 19, 2024
KobayashiD27 (Contributor) commented:

Thank you for creating the issue. I will post a draft KEP as soon as possible.

KobayashiD27 (Contributor) commented:

@pohly

To facilitate the discussion on the KEP, we would like to share the design of the composable controller that we are considering as a component built on the fabric-oriented scheduler function. We believe sharing it can deepen the discussion on the optimal implementation of the scheduler function, and we would also like to verify whether the controller design matches the DRA design.

Background

Our controller's philosophy is to utilize fabric devices efficiently. We therefore prefer to allocate devices directly connected to the node over attached fabric devices (priority order: node-local devices > attached fabric devices > pre-attached fabric devices).

Design Overview

This design aims to efficiently utilize fabric devices, prioritizing node-local devices to improve performance. The composable controller manages fabric devices that can be attached and detached. Therefore, it publishes a list of fabric devices as ResourceSlices.

The structure we are considering is as follows:

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device1
  ...
  - name: device2
  ...

The vendor's DRA kubelet plugin will also publish the devices managed by the vendor as ResourceSlices.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...

Here, when the scheduler selects the fabric device device1, it waits during PreBind for the fabric device to be attached. The composable controller detects a flag on the ResourceClaim, performs the attachment operation, and, once the attachment succeeds, updates the flag on the ResourceClaim.
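
The sketch below illustrates that handshake. The condition type FabricDeviceAttached and the exact status layout are hypothetical placeholders invented for this example, not fields defined by DRA or this KEP:

# Hypothetical: a ResourceClaim that was allocated the fabric device
# "device1" and is waiting in PreBind until the attachment completes.
kind: ResourceClaim
metadata:
  name: example-claim
status:
  allocation:
    devices:
      results:
        - driver: gpu.nvidia.com
          pool: composable-device
          device: device1
  conditions:
    # Hypothetical condition: the composable controller sets this to
    # "True" after a successful attach, which unblocks PreBind.
    - type: FabricDeviceAttached
      status: "False"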

We are considering the following two methods for handling ResourceSlices once the attachment completes. We would like to hear your opinions on these two composable controller proposals and on their feasibility.

Proposal 1: The composable controller publishes ResourceSlices with NodeName set within the pool

Multiple ResourceSlices are published with the same pool name. One indicates the devices included in the fabric, and the other indicates the devices attached to the node.

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device2
  ...
---
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device1
  ...

If the vendor's plugin supports hotplug, device1 will also appear in the ResourceSlice published by the vendor.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...
  - name: device1
  ...

This may cause device duplication issues between ResourceSlices. To prevent multiple ResourceSlices from publishing the same device, we plan to define a deny list and standardize it as part of DRA.
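
As an illustration of what such a deny list might look like (nothing like this exists in DRA today; the ConfigMap name and layout are invented for this sketch):

# Hypothetical deny list that the vendor's kubelet plugin could consult
# before publishing devices, skipping entries that the composable
# controller already advertises in its fabric pool.
kind: ConfigMap
metadata:
  name: fabric-device-denylist
data:
  gpu.nvidia.com: |
    - device1   # already published in pool "composable-device"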

Advantages

  • No need for the scheduler or the composable controller to change the allocationResult.
  • Can distinguish attached fabric devices and maintain prioritization.

Disadvantages

Proposal 2: Attached devices are published by the vendor's plugin

In this case, attached devices are removed from the composable-device pool.

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device2
  ...

If the vendor's plugin supports hotplug, device1 will also appear in the ResourceSlice published by the vendor.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...
  - name: device1
  ...

This breaks the linkage between the ResourceClaim and the ResourceSlice. Therefore, the AllocationResult of the ResourceClaim must be modified.
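
A rough before/after view of that rewrite (the nesting below follows the general shape of a ResourceClaim allocation result; the rewrite step itself is the new behavior under discussion and is performed by no existing component):

# Hypothetical: the allocation result recorded for device1.
status:
  allocation:
    devices:
      results:
        - driver: gpu.nvidia.com
          pool: composable-device   # before the attach: the fabric pool
          device: device1
          # After device1 is re-published by the vendor plugin in pool
          # "Node1", this entry must be rewritten to pool: Node1 so the
          # claim again references a device that exists in a ResourceSlice.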

Advantages

  • Simplifies device management.
  • Centralizes management as the vendor's plugin directly publishes devices.
  • No need for mechanisms to prevent device duplication (e.g., deny list).

Disadvantages

  • Cannot distinguish attached fabric devices, making prioritization difficult.
  • Requires modifying the linkage between ResourceClaim and ResourceSlice (expected to be done by the scheduler or a DRA controller; which is more appropriate?).
  • Until the linkage is fixed, the device in use may still be published in a ResourceSlice and reserved by other Pods.

We would appreciate your feedback and insights on these proposals to ensure the optimal implementation of the scheduler function and alignment with the DRA design.

pohly (Contributor, Author) commented Dec 19, 2024

Let's keep the discussion in this issue shorter. You can now put all of this, including the alternatives, into the KEP document.

@pohly pohly moved this from 🆕 New to 🏗 In progress in SIG Node: Dynamic Resource Allocation Dec 20, 2024
@pohly pohly moved this from 🏗 In progress to 🆕 New in SIG Node: Dynamic Resource Allocation Dec 20, 2024
@pohly pohly moved this from 🆕 New to 🏗 In progress in SIG Node: Dynamic Resource Allocation Jan 21, 2025
KobayashiD27 (Contributor) commented:

@pohly
Could you please link PR #5012 as the "KEP update PR"?

johnbelamaric (Member) commented:

/retitle DRA: Device Binding Conditions

@k8s-ci-robot k8s-ci-robot changed the title DRA: attach devices to nodes DRA: Device Binding Conditions Feb 12, 2025
johnbelamaric (Member) commented:

Looks like this was never opted into the release. Given Aldo's approval, I am assuming it's OK to do so. Please correct me if not...

/label lead-opted-in
/milestone v1.33

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Feb 12, 2025
dipesh-rawat (Member) commented:

Hello @pohly, @KobayashiD27 👋, 1.33 Enhancements team here,

With PR #5012 merged, all the KEP requirements are in place and merged into k/enhancements.

Before the enhancement freeze, it would be appreciated if the following nits could be addressed:

Aside from the minor nits mentioned above, this enhancement is all good for the upcoming enhancements freeze. 🚀

The status of this enhancement is now marked as tracked for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!
(cc: @fykaa)

/stage alpha
/label tracked/yes

@k8s-ci-robot k8s-ci-robot added stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Feb 12, 2025
@dipesh-rawat dipesh-rawat moved this to Tracked for enhancements freeze in 1.33 Enhancements Tracking Feb 12, 2025
johnbelamaric (Member) commented:

> Before the enhancement freeze, it would be appreciated if the following nits could be addressed:

done! thank you
