KEP-4815 DRA Partitionable devices support for multi-host #5069

Open

mortent wants to merge 1 commit into master from the DRAPartitionableForMultiHost branch

Conversation

mortent (Member) commented Jan 22, 2025

  • One-line PR description: Updates the design to also be able to support multi-host use-cases
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 22, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 22, 2025
@mortent mortent force-pushed the DRAPartitionableForMultiHost branch from ee1ffbf to 576dcfe Compare January 22, 2025 01:49
mortent (Member, Author) commented Jan 25, 2025

/assign @johnbelamaric

mortent (Member, Author) commented Feb 4, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 4, 2025
@@ -647,7 +766,129 @@ When such a device is allocated, the scheduler will need to track the full
capacity required to satisfy each of the sink devices along the chain. In this
way, all intermediate sink devices will essentially be rendered
"unschedulable", with the last-level sink device pulling its capacity from the
-devices it references directly.
+device it references directly.
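
The excerpt above describes accounting for the counters consumed along a chain of sink devices. A minimal sketch of that bookkeeping, using hypothetical types rather than the KEP's actual API:

```go
// Illustration only: hypothetical types, not the KEP-4815 API.
package sketch

// Device is a sink device; References points to the next sink device in the
// chain, or nil for the last level, which pulls its capacity from the device
// it references directly.
type Device struct {
	Name       string
	Consumes   map[string]int64 // counters consumed from the shared counter set
	References *Device
}

// totalConsumption walks the reference chain and sums the counters the
// scheduler must account for when the top-level device is allocated. This is
// why all intermediate sink devices effectively become unschedulable.
func totalConsumption(d *Device) map[string]int64 {
	total := map[string]int64{}
	for cur := d; cur != nil; cur = cur.References {
		for counter, amount := range cur.Consumes {
			total[counter] += amount
		}
	}
	return total
}
```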
A reviewer (Member) commented on the diff:
Are the offered ResourceSlices going to be overlapping? In other words, are they going to be a Cartesian product of all possible allocations?

Have you also considered a different approach with just one offering: a ResourceSlice that provides only the biggest possible allocation? If a pod needs (claims) a portion of it, the original ResourceSlice would be split into the remaining offerings, ready for the next scheduling cycle.

The advantage is that we would not lose the information that binding to the bigger offering will in fact remove the big one and leave only smaller offerings. If the scheduler had such information, it could decide to bind to a smaller one when it does not need the big one, but without it, the scheduler would blindly pick whichever small one matches.

We could say that we could achieve the same using scoring, which is probably true, but another advantage is reducing the number of offerings (ResourceSlices) that the scheduler needs to process.

I'm aware that creating new ResourceSlices may be quite a heavy process that requires api-server round trips, but I suspect that the DRA plugin could generate such ResourceSlices in memory just for scheduling purposes and perform the split once the scheduler decides to bind.

This approach might be especially useful if we want to offer resources of a continuous type, like memory. It's not possible to offer all memory allocation possibilities, but it should be possible to offer the maximum available memory on a given node. Once claimed, a new ResourceSlice would appear with the remaining memory.

mortent (Member, Author) replied:

The approach taken in DRA for now is that the ResourceSlice should list all available devices. There have been early discussions about relaxing this requirement, either to dynamically create smaller devices like you propose here, or just to represent a large number of identical devices in a more compact way. I suspect it is something we will look into at some point, since there are certain devices that support a very large number of partitions (AWS Inferentia devices are an example of this: https://docs.google.com/document/d/1lXGfnrBixRIMW9ESa-mv09Kisb2myVFV_A3nqPJ4FCQ/edit?tab=t.0#bookmark=id.mmv8k6mzxscm). But it is out of scope for the current change.
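
To make the trade-off concrete, a rough illustration (not the ResourceSlice API) of how statically enumerating every partition size produces many overlapping offerings, compared with publishing only the largest remaining block and splitting it on allocation:

```go
// Illustrative only: a toy model of partition offerings, not the ResourceSlice API.
package main

import "fmt"

// staticOfferings enumerates every partition size from the smallest granularity
// up to the full capacity. A device that supports many partition sizes ends up
// publishing many overlapping offerings, which the scheduler must all process.
func staticOfferings(totalGiB, granularityGiB int64) []int64 {
	var offerings []int64
	for size := granularityGiB; size <= totalGiB; size += granularityGiB {
		offerings = append(offerings, size)
	}
	return offerings
}

// dynamicOffering models the alternative raised in the review: publish only the
// largest remaining block, and replace it with a smaller remaining offering
// once part of it is claimed.
func dynamicOffering(remainingGiB, claimedGiB int64) (allocated, newOffering int64) {
	if claimedGiB > remainingGiB {
		return 0, remainingGiB // claim cannot be satisfied
	}
	return claimedGiB, remainingGiB - claimedGiB
}

func main() {
	fmt.Println(len(staticOfferings(80, 10))) // 8 overlapping offerings published up front
	fmt.Println(dynamicOffering(80, 30))      // 30 allocated, a single 50 GiB offering remains
}
```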

dom4ha (Member) commented Feb 10, 2025

I don't have other concerns from the scheduler perspective, except for this question regarding the approach to producing the available offerings, but I don't think it's blocking.

@mortent mortent force-pushed the DRAPartitionableForMultiHost branch from 7f25e5f to 3db8e71 Compare February 12, 2025 18:08
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 12, 2025
@mortent mortent force-pushed the DRAPartitionableForMultiHost branch from 3db8e71 to e9bedb8 Compare February 12, 2025 18:14
@mortent mortent requested a review from klueska February 12, 2025 18:59
johnbelamaric (Member) commented:
Thank you @mortent, this looks great.

/lgtm

cc @dom4ha @alculquicondor

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
alculquicondor (Member) commented:

@mimowo PTAL in case there are any major implications for Kueue and group scheduling in general.

thockin (Member) left a comment

We knew it wasn't going to stay "simple".

/approve on API

k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent, thockin
Once this PR has been reviewed and has the lgtm label, please ask for approval from johnbelamaric. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mimowo (Contributor) commented Feb 13, 2025

> @mimowo PTAL in case there are any major implications for Kueue and group scheduling in general.

Thanks for the ping. cc @mwielgus @dom4ha, who are also following the DRA-Kueue integration efforts.

dom4ha (Member) commented Feb 13, 2025

> Thanks for the ping. cc Marcin Wielgus, Dominik Marciński, who are also following the DRA-Kueue integration efforts.

One aspect that is not discussed in this KEP is the high chance that the pods become unschedulable. IIUC, scheduling of the first pod determines the subset of nodes on which the remaining pods can schedule. Obviously, when there are no classic resources available on the selected nodes, the remaining pods won't be scheduled anywhere else.

Regarding DRA-Kueue integration, it's a broad topic and we need to discuss it separately. Multi-node in fact introduces an implicit topology, and currently the scheduler is not capable of guaranteeing scheduling of a group of pods (hence the issue above). On the other hand, Kueue supports TAS, so if Kueue were aware of DRA resource availability and this inter-chip topology, it could make a better placement decision, but it does not have the capability to reserve resources (that is, to guarantee that the scheduler doesn't schedule other pods in the meantime).
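
To illustrate the concern, a small hypothetical sketch (not actual scheduler code): once the first pod's claim is allocated to a multi-host device, the remaining pods of the group are pinned to that device's node set, so they stay pending if those nodes lack ordinary resources:

```go
// Hypothetical illustration of the group-scheduling concern; not scheduler code.
package main

import "fmt"

// MultiHostDevice spans a fixed set of nodes; allocating it for the first pod
// implicitly selects that node set for every other pod in the group.
type MultiHostDevice struct {
	Name  string
	Nodes []string
}

// schedulableNodes returns the nodes from the allocated device's node set that
// still have enough free CPU for a pod. If none do, the remaining pods of the
// group are unschedulable, even if other nodes in the cluster have capacity.
func schedulableNodes(dev MultiHostDevice, freeCPU map[string]int64, cpuPerPod int64) []string {
	var fits []string
	for _, node := range dev.Nodes {
		if freeCPU[node] >= cpuPerPod {
			fits = append(fits, node)
		}
	}
	return fits
}

func main() {
	dev := MultiHostDevice{Name: "tpu-slice-0", Nodes: []string{"node-a", "node-b"}}
	freeCPU := map[string]int64{"node-a": 0, "node-b": 1, "node-c": 16}
	// node-c has plenty of CPU but is outside the device's node set, so it cannot help.
	fmt.Println(schedulableNodes(dev, freeCPU, 2)) // [] -> remaining pods stay pending
}
```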

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory
lgtm "Looks good to me", indicates that a PR is ready to be merged.
sig/node Categorizes an issue or PR as relevant to SIG Node.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: Needs Triage
10 participants