
Conversation

@ngopalak-redhat (Contributor) commented Nov 12, 2025

What I did

This PR enables system-reserved-compressible enforcement by default for all OpenShift 4.21+ clusters, giving system-reserved processes better CPU allocation through cgroup-based enforcement.

Template Changes:

  • Added systemReservedCgroup: /system.slice to default kubelet configuration for all node types (master, worker, arbiter)
  • Added system-reserved-compressible to enforceNodeAllocatable alongside pods in kubelet template files

Performance Profile Compatibility:
The kubelet cannot simultaneously enforce both systemReservedCgroup and --reserved-cpus (used by Performance Profiles in the Node Tuning Operator). To resolve this conflict, I added logic in the Kubelet Config Controller (pkg/controller/kubelet-config/helpers.go) to:

  • Detect when reservedSystemCPUs (--reserved-cpus) is set
  • Automatically clear systemReservedCgroup when reservedSystemCPUs is detected
  • Set enforceNodeAllocatable to ["pods"] only in this scenario
  • Preserve existing Performance Profile behavior without requiring any operator changes

This approach leverages the fact that --reserved-cpus already supersedes system-reserved, making systemReservedCgroup enforcement redundant in PerformanceProfile scenarios.
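
A minimal sketch of this reconciliation, assuming the config is decoded into the upstream kubeletconfigv1beta1 types (the helper name is hypothetical, not necessarily the PR's actual function):

    package kubeletconfig

    import (
        "k8s.io/klog/v2"
        kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
    )

    // reconcileSystemReservedCgroup clears the cgroup-based settings whenever
    // --reserved-cpus (ReservedSystemCPUs) is in use, because the kubelet
    // cannot enforce both mechanisms at once.
    func reconcileSystemReservedCgroup(kc *kubeletconfigv1beta1.KubeletConfiguration) {
        if kc.ReservedSystemCPUs == "" {
            return // no reserved-cpus: keep systemReservedCgroup enforcement
        }
        klog.Infof("reservedSystemCPUs is set to %s, disabling systemReservedCgroup enforcement", kc.ReservedSystemCPUs)
        kc.SystemReservedCgroup = ""
        kc.EnforceNodeAllocatable = []string{"pods"}
    }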

Validation:

  • Added validation to ensure systemReservedCgroup matches systemCgroups when both are user-specified (a minimal sketch follows)
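
A sketch of that check, again assuming the upstream kubeletconfigv1beta1 types; the function name and error text are illustrative (the actual hunk is discussed in the review thread below):

    package kubeletconfig

    import (
        "fmt"

        kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
    )

    // validateSystemReservedCgroup rejects configs where enforcement would be
    // applied to one cgroup (systemReservedCgroup) while reservations are
    // accounted against another (systemCgroups).
    func validateSystemReservedCgroup(kc *kubeletconfigv1beta1.KubeletConfiguration) error {
        if kc.SystemReservedCgroup != "" && kc.SystemCgroups != "" &&
            kc.SystemReservedCgroup != kc.SystemCgroups {
            return fmt.Errorf("systemReservedCgroup %q must match systemCgroups %q",
                kc.SystemReservedCgroup, kc.SystemCgroups)
        }
        return nil
    }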

How to verify it

For New OCP 4.21+ Clusters:

  1. Deploy a new OCP 4.21+ cluster
  2. SSH into a node and verify the kubelet configuration (an oc debug alternative follows this list):
    grep -A2 systemReservedCgroup /etc/kubernetes/kubelet.conf
    grep -A3 enforceNodeAllocatable /etc/kubernetes/kubelet.conf
  3. Verify the output shows:
    systemReservedCgroup: /system.slice
    enforceNodeAllocatable:
    - pods
    - system-reserved-compressible
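
The same check can be run without SSH via oc debug (the node name is a placeholder):

    oc debug node/<node-name> -- chroot /host \
      grep -A3 enforceNodeAllocatable /etc/kubernetes/kubelet.conf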

For Clusters with Performance Profiles:

  1. Create a Performance Profile with reservedSystemCPUs set via the Node Tuning Operator (see the example manifest after this list)
  2. Wait for the MachineConfig to be applied and nodes to reboot
  3. SSH into the affected node and check kubelet configuration:
    grep systemReservedCgroup /etc/kubernetes/kubelet.conf
    grep enforceNodeAllocatable /etc/kubernetes/kubelet.conf
  4. Verify that:
    • systemReservedCgroup is NOT present (empty/cleared)
    • enforceNodeAllocatable only contains ["pods"]
    • Kubelet starts successfully without errors
  5. Check kubelet logs to confirm no conflicts:
    journalctl -u kubelet | grep -iE "system-reserved|reserved-cpus"
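
A minimal example manifest for step 1; the CPU ranges and node selector are placeholders for your environment:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: example-profile
    spec:
      cpu:
        reserved: "0-1"   # becomes the kubelet's reservedSystemCPUs (--reserved-cpus)
        isolated: "2-7"   # CPUs left for latency-sensitive workloads
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""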

For OCP 4.20 to 4.21 Upgrades:

  1. Verify that the migration MachineConfig from PR #5412 ("WIP: [release-4.20] kubelet-config compressible patch") is present and preserves the old behavior (a pattern-match check follows this list)
  2. Confirm no unexpected node reboots occur during upgrade
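
Because the migration MachineConfig's exact name comes from PR #5412, a pattern match is the safest way to look for it:

    oc get machineconfigs -o name | grep -i kubelet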

Description for the changelog

Enable system-reserved-compressible enforcement by default in OCP 4.21+ clusters. The kubelet now enforces CPU limits on system daemons via systemReservedCgroup (/system.slice), improving CPU allocation for system-reserved processes on nodes with high CPU counts. To prevent conflicts, this enforcement is automatically disabled when a Performance Profile sets reserved-cpus. Existing OCP 4.20 clusters upgrading to 4.21+ will preserve their current behavior via a migration MachineConfig.


Related:

Decision Update
As per the latest discussion, we plan to make this the default in OCP 4.21. Clusters upgraded from 4.20 will also have this enabled.

@ngopalak-redhat changed the title from "Implement system-reserved-compressible" to "WIP: Implement system-reserved-compressible" Nov 12, 2025
@openshift-ci bot added the do-not-merge/work-in-progress label Nov 12, 2025
openshift-ci bot (Contributor) commented Nov 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot (Contributor) commented Nov 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ngopalak-redhat
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ngopalak-redhat force-pushed the ngopalak/system-reserved-compressible-1 branch from ca28d80 to 00bb8e1 on November 17, 2025 03:53
@ngopalak-redhat changed the title from "WIP: Implement system-reserved-compressible" to "OCPNODE-3201: Default Enablement of system-reserved-compressible in OpenShift 4.21" Nov 19, 2025
openshift-ci-robot (Contributor) commented Nov 19, 2025

@ngopalak-redhat: This pull request references OCPNODE-3201 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

TODO: Before Review

  • Complete upgrade testing

@openshift-ci-robot added the jira/valid-reference label Nov 19, 2025
@ngopalak-redhat marked this pull request as ready for review November 20, 2025 00:48
@openshift-ci bot removed the do-not-merge/work-in-progress label Nov 20, 2025
@ngopalak-redhat (Contributor Author) commented:

cc: @MarSik @ffromani

openshift-ci bot (Contributor) commented Nov 20, 2025

@ngopalak-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                       Commit   Required  Rerun command
ci/prow/e2e-gcp-op-single-node  00bb8e1  true      /test e2e-gcp-op-single-node
ci/prow/e2e-hypershift          00bb8e1  true      /test e2e-hypershift
ci/prow/bootstrap-unit          00bb8e1  false     /test bootstrap-unit

    }
    // Validate that systemReservedCgroup matches systemCgroups if both are set
    if kcDecoded.SystemReservedCgroup != "" && kcDecoded.SystemCgroups != "" {
        if kcDecoded.SystemReservedCgroup != kcDecoded.SystemCgroups {
Member commented:

Why should both the values of SystemReservedCgroup and SystemCgroups match?
From the kubelet configuration doc I don't find such a condition.

Contributor Author commented:

As per https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

It is recommended that the OS system daemons are placed under a top level control group (system.slice on systemd machines for example).

If they are not the same, enforcement would be applied to one cgroup while the reserved values are calculated from SystemCgroups, so the limits would land on the wrong cgroup.

Member commented:

Apologies, I'm still unclear on this.

        klog.Infof("reservedSystemCPUs is set to %s, disabling systemReservedCgroup enforcement", originalKubeConfig.ReservedSystemCPUs)
    }

    if shouldDisableSystemReservedCgroup {
Member commented:

You can use the condition above directly, without introducing a new variable, since it isn't used anywhere else.
We know that the --reserved-cpus flag supersedes the other flags, so why do we need to clear these settings explicitly? Is it because the kubelet otherwise fails to start? If so, could you add a comment explaining that?

Suggested change:
-    if shouldDisableSystemReservedCgroup {
+    if originalKubeConfig.ReservedSystemCPUs != "" {

Contributor Author commented:

Agreed. I had another condition before this one to handle the upgrade path, which is why the variable was added. I'll change it.

@ngopalak-redhat (Contributor Author) commented:

@haircommander Please review

@ngopalak-redhat marked this pull request as draft November 20, 2025 15:11
@openshift-ci bot added the do-not-merge/work-in-progress label Nov 20, 2025