OCPNODE-3201: Default Enablement of system-reserved-compressible in OpenShift 4.21 #5408
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ngopalak-redhat The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ca28d80 to
00bb8e1
Compare
|
@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@ngopalak-redhat: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| } | ||
| // Validate that systemReservedCgroup matches systemCgroups if both are set | ||
| if kcDecoded.SystemReservedCgroup != "" && kcDecoded.SystemCgroups != "" { | ||
| if kcDecoded.SystemReservedCgroup != kcDecoded.SystemCgroups { |
Why should the values of SystemReservedCgroup and SystemCgroups match?
I don't find such a condition in the kubelet configuration doc.
As per https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
It is recommended that the OS system daemons are placed under a top level control group (system.slice on systemd machines for example).
If they are not the same, enforcement would happen on one cgroup while the values would be calculated using SystemCgroups.
Apologies, I'm still unclear on this.
| klog.Infof("reservedSystemCPUs is set to %s, disabling systemReservedCgroup enforcement", originalKubeConfig.ReservedSystemCPUs) | ||
| } | ||
|
|
||
| if shouldDisableSystemReservedCgroup { |
You can use the above condition directly; there is no need for a new variable, since it isn't used anywhere else.
We know that the --reserved-cpus flag supersedes the other flags, so why should we clear the following settings explicitly? Is it because the kubelet otherwise fails to start? If so, could you add a comment explaining that?
| if shouldDisableSystemReservedCgroup { | |
| if originalKubeConfig.ReservedSystemCPUs != "" { |
Agreed. I previously had another condition before this to handle upgrades, which is why the variable was added. I'll change it.
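A minimal sketch of the simplified form agreed on in this thread, assuming the behavior stated in the PR description (clear systemReservedCgroup and enforce only pods when reservedSystemCPUs is set). The kubeletSettings struct and applyReservedCPUsOverride are hypothetical names used for illustration; the actual helper in pkg/controller/kubelet-config/helpers.go operates on the real kubelet configuration type and may differ.

```go
package main

import "fmt"

// kubeletSettings is a hypothetical, trimmed-down stand-in for the kubelet
// configuration fields touched by this logic.
type kubeletSettings struct {
	ReservedSystemCPUs     string
	SystemReservedCgroup   string
	EnforceNodeAllocatable []string
}

// applyReservedCPUsOverride disables systemReservedCgroup enforcement when
// reservedSystemCPUs is set, because the PR description states that the
// kubelet cannot enforce both at once. It reports whether anything changed.
func applyReservedCPUsOverride(kc *kubeletSettings) bool {
	// The condition is checked directly, without an intermediate variable.
	if kc.ReservedSystemCPUs == "" {
		return false
	}
	kc.SystemReservedCgroup = ""
	kc.EnforceNodeAllocatable = []string{"pods"}
	return true
}

func main() {
	kc := kubeletSettings{
		ReservedSystemCPUs:     "0-3",
		SystemReservedCgroup:   "/system.slice",
		EnforceNodeAllocatable: []string{"pods", "system-reserved-compressible"},
	}
	changed := applyReservedCPUsOverride(&kc)
	fmt.Println(changed, kc.SystemReservedCgroup == "", kc.EnforceNodeAllocatable)
}
```

When reservedSystemCPUs is empty the function leaves the configuration untouched, so the new defaults stay in effect on nodes without a Performance Profile.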
|
@haircommander Please review |
What I did
This PR enables system-reserved-compressible enforcement by default for all OpenShift 4.21+ clusters to allow better CPU allocation for system reserved processes through cgroup-based enforcement.
Template Changes:
- Added systemReservedCgroup: /system.slice to the default kubelet configuration for all node types (master, worker, arbiter)
- Added system-reserved-compressible to enforceNodeAllocatable alongside pods in the kubelet template files
Performance Profile Compatibility:
The kubelet cannot simultaneously enforce both systemReservedCgroup and --reserved-cpus (used by Performance Profiles in the Node Tuning Operator). To resolve this conflict, I added logic in the Kubelet Config Controller (pkg/controller/kubelet-config/helpers.go) to:
- Detect when reservedSystemCPUs (--reserved-cpus) is set
- Clear systemReservedCgroup in that case
- Set enforceNodeAllocatable to ["pods"] only in this scenario
This approach leverages the fact that --reserved-cpus already supersedes system-reserved, making systemReservedCgroup enforcement redundant in PerformanceProfile scenarios.
Validation:
- Added a check that systemReservedCgroup matches systemCgroups when both are user-specified
How to verify it
For New OCP 4.21+ Clusters:
cat /etc/kubernetes/kubelet.conf | grep -A2 systemReservedCgroup
cat /etc/kubernetes/kubelet.conf | grep -A3 enforceNodeAllocatable
systemReservedCgroup: /system.slice
enforceNodeAllocatable:
For Clusters with Performance Profiles:
cat /etc/kubernetes/kubelet.conf | grep systemReservedCgroup
cat /etc/kubernetes/kubelet.conf | grep enforceNodeAllocatable
- systemReservedCgroup is NOT present (empty/cleared)
- enforceNodeAllocatable only contains ["pods"]
- Kubelet starts successfully without errors
journalctl -u kubelet | grep -iE "system-reserved|reserved-cpus"
For OCP 4.20 to 4.21 Upgrades:
Description for the changelog
Enable system-reserved-compressible enforcement by default in OCP 4.21+ clusters. The kubelet now enforces CPU limits on system daemons via systemReservedCgroup (/system.slice), improving CPU allocation for system reserved processes on nodes with high CPU counts. Automatically disables systemReservedCgroup enforcement when Performance Profiles with reserved-cpus are used to prevent conflicts. Existing OCP 4.20 clusters upgrading to 4.21+ will preserve their current behavior via migration MachineConfig.
Related:
Decision Update
As per the latest discussion, we plan to make this the default in OCP 4.21. Clusters upgraded from 4.20 will also have this enabled.