@wojtek-t
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 28, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 28, 2025
@wojtek-t
Member Author

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 672aa68 to ce04eca Compare December 1, 2025 08:21
@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from ce04eca to 0ff3958 Compare December 1, 2025 08:52
Comment on lines +385 to +387
1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
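For illustration, the victim-identification step quoted above can be sketched roughly as follows. This is a toy model, not scheduler code: the `Workload` and `Pod` types, their field names, and `potential_victims` are hypothetical, and the sketch assumes the new workload W is compared by a single integer priority against each running workload's preemption priority.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Workload:
    # Hypothetical model of a running workload.
    name: str
    preemption_priority: int


@dataclass(frozen=True)
class Pod:
    # Hypothetical model of a pod; workload is None for standalone pods.
    name: str
    priority: int
    workload: Optional[str] = None


def potential_victims(
    new_workload_priority: int,
    running_workloads: List[Workload],
    pods: List[Pod],
) -> Tuple[List[Workload], List[Pod]]:
    """Return the workloads and standalone pods that W could preempt."""
    # Running workloads whose preemption priority is lower than W's priority.
    victim_workloads = [
        w for w in running_workloads
        if w.preemption_priority < new_workload_priority
    ]
    # Individual pods (not part of any workload) with lower priority than W.
    victim_pods = [
        p for p in pods
        if p.workload is None and p.priority < new_workload_priority
    ]
    return victim_workloads, victim_pods
```

Pods that belong to a workload are deliberately skipped in the second list, since in this model they are only considered through their workload.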
Contributor

@44past4 44past4 Dec 1, 2025

Having two independent priorities for a workload (one for scheduling and one for preemption), or a single preemption priority that can be dynamically updated, can potentially lead to a preemption cycle.

Let's assume that we have an existing workload A with high scheduling priority and low preemption priority running in a cluster.

Now let's assume that we want to schedule a workload B which has medium scheduling priority and medium preemption priority.

Workload B will preempt workload A and start running, because B's scheduling priority > A's preemption priority.

However, when workload A restarts and is rescheduled, it will preempt workload B and start running, because A's scheduling priority > B's preemption priority.

The same issue can happen if we have only one priority, but that priority is reduced while the workload is running. After preemption, when the workload reappears with its original higher priority, it can preempt the workload that preempted it.
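A minimal sketch of the two-priority scenario, to make the cycle concrete. The `Workload` type, the `can_preempt` rule, and the numeric values (100 for high, 50 for medium, 10 for low) are assumptions made for illustration only; they are not taken from the KEP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Hypothetical model: two independent priorities per workload.
    name: str
    scheduling_priority: int
    preemption_priority: int


def can_preempt(incoming: Workload, running: Workload) -> bool:
    """Assumed rule: the incoming workload's scheduling priority must
    exceed the running workload's preemption priority."""
    return incoming.scheduling_priority > running.preemption_priority


# Workload A: high scheduling priority, low preemption priority.
a = Workload("A", scheduling_priority=100, preemption_priority=10)
# Workload B: medium scheduling priority, medium preemption priority.
b = Workload("B", scheduling_priority=50, preemption_priority=50)

print(can_preempt(b, a))  # True: 50 > 10, so B evicts the running A
print(can_preempt(a, b))  # True: 100 > 50, so the restarted A evicts B
```

Both checks succeed, so under this rule A and B can keep evicting each other indefinitely.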

Contributor

One potential solution / mitigation for the described problem could be requiring that preemption priority >= scheduling priority. This way, after restarting, the preempted workload will not be able to preempt the workload that preempted it.
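A rough sketch of how that constraint could be checked, assuming the same preemption rule as in the scenario (scheduling priority of the incoming workload > preemption priority of the running one). All names here are hypothetical and the priorities are treated as static; dynamically changing priorities are not modeled.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Iterable


@dataclass(frozen=True)
class Workload:
    # Hypothetical model: two static priorities per workload.
    name: str
    scheduling_priority: int
    preemption_priority: int

    def __post_init__(self) -> None:
        # Proposed constraint: preemption priority >= scheduling priority.
        if self.preemption_priority < self.scheduling_priority:
            raise ValueError(
                f"{self.name}: preemption priority must be >= scheduling priority"
            )


def can_preempt(incoming: Workload, running: Workload) -> bool:
    """Assumed rule: incoming scheduling priority > running preemption priority."""
    return incoming.scheduling_priority > running.preemption_priority


def has_mutual_preemption(workloads: Iterable[Workload]) -> bool:
    """True if some pair of workloads can preempt each other."""
    return any(
        can_preempt(x, y) and can_preempt(y, x)
        for x, y in permutations(list(workloads), 2)
    )


a = Workload("A", scheduling_priority=10, preemption_priority=100)
b = Workload("B", scheduling_priority=50, preemption_priority=50)
print(can_preempt(b, a))              # False: 50 is not > 100
print(has_mutual_preemption([a, b]))  # False
```

The reasoning behind the check: if B preempts A, then B's scheduling priority > A's preemption priority >= A's scheduling priority, so A's scheduling priority can never exceed B's preemption priority (which is >= B's scheduling priority).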

Member Author

Thanks for pointing that out!

Yeah - "preemption priority >= scheduling priority" is definitely desired. I don't think we have any use cases that would benefit from the reverse.

That said, I need to think a bit more about whether that is enough. I think it prevents cycles if we assume static priorities, but it can still potentially trigger cycles if the priorities change. OTOH, if the priorities are changing, this is probably desired.

Let me think about it a bit more and I will update the KEP to reflect the thoughts later this week.

@sanposhiho
Member

/assign


4 participants