# AEP-7571: Support Pod-Level Resources

<!-- toc -->
- [Summary](#summary)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Design Details](#design-details)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
  - [Design Principles](#design-principles)
    - [Container-level resources](#container-level-resources)
    - [Pod-level resources](#pod-level-resources)
    - [Pod and Container-Level Resources](#pod-and-container-level-resources)
  - [Proposal](#proposal)
  - [Validation](#validation)
  - [Test Plan](#test-plan)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Examples](#examples)
  - [Kubernetes version compatibility](#kubernetes-version-compatibility)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

Starting with Kubernetes version 1.34, it is possible to specify CPU and memory `resources` for Pods at the pod level, in addition to the existing container-level `resources` specifications. For example:
> **Reviewer:** It may be worth linking the KEP here.
>
> **Author:** I'm linking the KEP and the official blog post a little further down: here.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  resources:
    requests:
      memory: "100Mi"
    limits:
      memory: "200Mi"
  containers:
  - name: container1
    image: nginx
```

It is also possible to combine pod-level and container-level specifications. In this case, one container (`ide`) defines its own resource constraints, while the other containers (`tool1` and `tool2`) can dynamically use any remaining resources within the Pod's overall limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload
  namespace: default
spec:
  resources:
    limits:
      memory: "1024Mi"
      cpu: "4"
  initContainers:
  - image: tool1:latest
    name: tool1
    restartPolicy: Always
  - image: tool2:latest
    name: tool2
    restartPolicy: Always
  containers:
  - name: ide
    image: theia:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "0.5"
      limits:
        memory: "256Mi"
        cpu: "1"
```

The benefits and implementation details of pod-level `resources` are described in [KEP-2837](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md). A related article is also available in the [Kubernetes documentation](https://kubernetes.io/docs/tasks/configure-pod-container/assign-pod-level-resources/).

Before this AEP, VPA computes recommendations only at the container level, and those recommendations are applied exclusively at the container level. With the new pod-level resources specification, VPA should be able to read the pod-level `resources` stanza, calculate pod-level recommendations, and scale at the pod level when users define pod-level `resources`.

To address this, this AEP proposes extending the VPA object's `spec` and `status` fields and introducing two new pod-level flags to set constraints directly at the pod level. For more details, see the [Proposal](#proposal) section.

### Goals

* Add support for the pod-level resources stanza in VPA:
  * Read pod-level values
  * Calculate pod-level recommendations
  * Apply recommendations at the pod level
* Thoroughly document the new feature, focusing on areas that change default behaviors, in the [VPA documentation](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/docs).

### Non-Goals

* Since the latest VPA does not support `initContainers` ([the native way to run sidecar containers](https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/)), this AEP does not aim to implement support for them; that may be explored in a future proposal. At the same time, other non-native sidecar containers defined in the Pod's `spec.containers` should be included by default in the calculation of pod-level recommendations when a pod-level resources stanza is present.
> **Reviewer:** "At the same time, other non-native sidecar containers defined in the Pod `spec.containers` should be included by default in the calculation of pod-level recommendations when a pod-level resources stanza is present." Is this a non-goal?


## Design Details

### Notes/Constraints/Caveats

- Pod-level resources support in VPA is opt-in and does not change the behavior of existing workload APIs (e.g. Deployments) unless explicitly enabled and the workload is recreated with a pod-level resources stanza.
- At the time of writing this AEP, [In-Place Pod-Level Resources Resizing](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5419-pod-level-resources-in-place-resize) is not available for pod-level fields, so applying pod-level recommendations requires evicting Pods. When it becomes available, VPA should attempt to apply pod-level recommendations in place first and fall back to eviction if the in-place update fails, mirroring the current `InPlaceOrRecreate` mode behavior used for container-level updates. That mechanism should be addressed in a separate proposal.

### Design Principles

This section describes how VPA reacts based on where resources are defined (pod level, container level or both).

Before this AEP, the recommender computes recommendations only at the container level, and VPA applies changes only to container-level fields. With this proposal, the recommender also computes pod-level recommendations in addition to container-level ones. Pod-level recommendations are derived from per-container usage and recommendations, typically by aggregating container recommendations. Container-level policy still influences pod-level output: setting `mode: Off` in `spec.resourcePolicy.containerPolicies` excludes a container from recommendations, and `minAllowed`/`maxAllowed` bounds continue to apply.
> **Reviewer:** I have some questions on this part:
>
> 1. Do we include or exclude sidecar containers in this? Currently VPA doesn't handle sidecar containers.
> 2. What happens if a new container is added to a Pod? What values will the recommender set for the Pod?
> 3. What happens if a container is removed from a Pod? Does its recommendation still get included in the pod-level recommendation?
>
> There is some ongoing work for points 2 and 3 here: #6745. cc @jkyros
>
> **Author (@iamzili, Oct 12, 2025):**
>
> 1. Since the latest VPA doesn't support initContainers, I think we shouldn't implement support for them in this AEP (a new AEP should be created for that, IMHO). At the same time, other non-native sidecar containers defined in the Pod spec should be included by default in the calculation of pod-level recommendations when a pod-level resources stanza is present.
> 2. If a new container is added to the Pod, then in the next recommender loop, recommendations will be calculated for the new container, which will trigger a recalculation at the pod level (default behavior). The updater may then evict the Pod if the new pod-level recommendation deviates significantly from the current one.
> 3. If a container is removed from a Pod, the next recommender loop should calculate a new pod-level recommendation. However, based on #6745, it seems this requires some additional work, as in the latest VPA version the recommender continues to use stale container aggregates for a period of time even after a container is removed. Is that correct?
>
> **Reviewer:**
>
> 1. Yup, makes sense. I think the AEP needs to be clear about what is or isn't included in the Pod calculation.
> 2. There seems to be a chicken-and-egg situation, though. When the new container is added to a Deployment, new Pods will be created prior to the new recommendation. What value will the Pod resource be getting here?
> 3. Yes, possibly. This makes me wonder if the pod-level recommendation shouldn't be done in the VPA resource, but rather created on the fly at the moment any recommendation needs to be applied, using the recommendations available in the VPA resource along with the current containers in said Pod.
>
> For what it's worth, I believe all these points need to be documented in the AEP too.
>
> **Author:** I agree that these points need to be added to the AEP. I will add them after we find the best approach, of course.
>
> I kind of like your approach in the third point, where the Pod recommendation might be created on the fly instead of by the recommender (that is, saved to the VPA object). That would make things easier when the user adds or removes a container from a workload API object such as a Deployment.
>
> Here is the proposed workflow when the user adds a container to the Deployment with a container-level resources stanza (both requests and limits):
>
> 1. A Pod re-creation is triggered by the Deployment controller.
> 2. The admission-controller intercepts the request and calculates the pod-level recommendation by adding up the container-level recommendations, which are still read from the VPA object. It also includes the container-level requests and limits set by the user for the new container in the calculation, to avoid violating the PodSpec validation rules. It then calculates the patches and applies them, and the Pod starts successfully with the new container.
> 3. In the following recommender loop, the recommender re-calculates the container-level recommendations (including for the new container) and saves the results to the VPA object.
> 4. The updater fetches the container recommendations from the VPA object and calculates the pod-level recommendation on the fly by summing the container-level ones. If it differs significantly from the actual resource allocation, it may evict the Pod.
> 5. Same as step 2, but "just" using the container recommendations from the VPA object, since no new container was added by the user.
>
> When the user adds a new container WITHOUT a container-level resources stanza:
>
> 1. A Pod re-creation is triggered by the Deployment controller.
> 2. The admission-controller calculates the pod-level recommendation based on the container recommendations in the VPA object, and applies patches accordingly. If the Pod already contains the most up-to-date recommendations from the VPA object, there will be no change, as at this point we do not yet know the recommendation for the newly added container.
>
> Steps 3, 4, and 5 are the same as above.
>
> I'm not entirely sure what we can do when a container is removed from a Pod, as this results in stale recommendations in the VPA object for some amount of time. That in turn causes the admission-controller to recreate the Pod with a recommendation for a non-existing container. Maybe in this AEP we could modify the updater and the admission-controller so they verify that containers still exist in the workload API? Or we could wait until #6745 is completed.
>
> **Reviewer:** That all sounds good. Can you explain this part to me? I'm not sure I understand what it means: "If the Pod already contains the most up-to-date recommendations from the VPA object, there will be no change, as at this point we do not yet know the recommendation for the newly added container."
>
> Regarding "I'm not entirely sure what we can do when a container is removed from a Pod, as this results in stale recommendations in the VPA object for some amount of time": my assumption is that for the pod-level resources, we would get a list of the Pod's containers and do a lookup for each container's recommendation for the sum, ignoring any stale recommendations.

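To make the "on the fly" aggregation discussed above concrete, here is a minimal sketch. It assumes container recommendations are available as a map keyed by container name; `sumPodRecommendation` and its surroundings are illustrative, not existing VPA code:

```go
package updater

import (
	corev1 "k8s.io/api/core/v1"
)

// sumPodRecommendation derives a pod-level target from the recommendations
// of the containers that currently exist in the Pod. Stale entries for
// removed containers are never looked up, and containers without a
// recommendation yet (e.g. newly added ones) are skipped.
func sumPodRecommendation(pod *corev1.Pod, recommendations map[string]corev1.ResourceList) corev1.ResourceList {
	total := corev1.ResourceList{}
	for _, container := range pod.Spec.Containers {
		rec, ok := recommendations[container.Name]
		if !ok {
			continue // no recommendation yet for this container
		}
		for name, quantity := range rec {
			sum := total[name]
			sum.Add(quantity)
			total[name] = sum
		}
	}
	return total
}
```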

This AEP extends the VPA CRD's `spec.resourcePolicy` with a new `podPolicies` stanza that influences pod-level recommendations. It also introduces two global pod-level flags, `pod-recommendation-max-allowed-cpu` and `pod-recommendation-max-allowed-memory`. Details are covered in the [Proposal section](#proposal).

Today, the updater and admission controller update resources only at the container level. This proposal enables VPA components to update resources at the pod level as well.

**This AEP suggests that when a workload defines pod-level resources, VPA should manage those by default because pod-level resources offer benefits over container-only settings** - see the "Better resource utilization" section in [KEP-2837](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#better-resource-utilization) for details.


Scenarios with no resources defined, or with both pod-level and container-level values present, require clear defaulting rules and are discussed in the options below. Note: community feedback should determine the default behavior.

#### Container-level resources

For workloads that define only container-level resources, VPA should continue controlling resources at the container level, consistent with current behavior prior to this AEP. In other words, for a multi-container Pod without pod-level resources but with at least one container specifying resources, VPA should by default autoscale all containers.

#### Pod-level resources

For workloads that define only pod-level resources, VPA will control resources only at the pod level.

#### Pod and Container-Level Resources

This part of the AEP covers workloads that define resources both at the pod level and for at least one container. To demonstrate multiple implementation options for how VPA should handle such workloads by default, consider the following manifest. It defines three containers:
* `ide` - the main workload container
* `tool1` and `tool2` - non-critical sidecar containers


```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload
  namespace: default
spec:
  resources:
    limits:
      memory: "1024Mi"
      cpu: "4"
  containers:
  - name: ide
    image: theia:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "0.5"
      limits:
        memory: "256Mi"
        cpu: "1"
  - image: tool1:latest
    name: tool1
  - image: tool2:latest
    name: tool2
```

##### Option 1: VPA Controls Only Pod-Level Resources

With this option, VPA manages only the pod-level resources stanza. To follow this approach, the initially defined container-level resources for `ide` must be removed so that changes in usage are reflected only in pod-level recommendations.

**Pros**:
* VPA does not need to track which container-level resources were initially set.
* Straightforward for users: only the pod-level resources stanza is updated, while container-level stanzas are dropped.
* Enables shared headroom across containers in the same Pod. With container-only limits, a sidecar (`tool1` or `tool2`) or the main workload (`ide` container) hitting its own CPU limit could get throttled even if other containers in the Pod have idle CPU. Pod-level resources allow a container experiencing a spike to access idle resources from others, optimizing overall utilization.

**Cons**:
* The most important container (`ide`) may be recreated without container-level resources, leading to an `oom_score_adj` that matches the other containers in the Pod; as a result, the OOM killer may target all containers more evenly under node memory pressure. For details on how `oom_score_adj` is computed when pod-level resources are present, see the [KEP section on OOM score adjustment](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#oom-score-adjustment).

##### [Selected] Option 2: VPA controls pod-level resources and the initially set container-level resources

With this option, VPA controls pod-level resources and the container-level resources that were initially set. The resources for containers `tool1` and `tool2` are not updated by VPA; however, their usage is still observed and contributes to the overall pod-level recommendation.

**Pros**:
- The primary container (`ide`) is less likely to be killed under memory pressure because the sidecars (`tool1`, `tool2`) have higher `oom_score_adj` values, so the OOM killer targets them first during node-pressure evictions. See the [updated OOM killer behavior and formula](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#oom-killer-behavior) when pod-level resources are present.
- Sidecars, such as logging agents or mesh proxies (like `tool1` or `tool2`), that don't use container-level limits can borrow idle CPU from other containers in the pod when they experience a spike in usage. Pod-level resources allow a container experiencing a spike to access idle resources from others, optimizing overall utilization.

**Cons**:
- VPA must track which container-level resources are under its control by default and avoid mutating others.
- Existing VPA users may find the behavior surprising because VPA does not control all container-level resources stanzas - only those initially set - unless configured otherwise.

#### No resources stanza exists at either the pod or container level

When a workload is created without any resources defined at either the pod or container level, there are two options:

##### Option 1: VPA controls only the container-level resources

This option mirrors current VPA behavior by managing only container-level resources, preserving benefits like in-place container-level resource resize. In this mode, pod-level recommendations are not computed and therefore not applied.

**Pros**:
- Familiar to existing users because it does not change current VPA behavior.

**Cons**:
- No cross-container resource sharing: in multi-container Pods, a container can hit its own limit and be throttled even if sibling containers are idle.

##### [Selected] Option 2: VPA controls only the pod-level resources

With this option, VPA computes and applies only pod-level recommendations.

**Pros**:
- Enables shared headroom across containers in the same Pod. Previously, with container-only limits, a sidecar (e.g. the `tool1` or `tool2` container) hitting its own CPU limit could be throttled even if other containers had spare CPU. Pod-level resources allow a container experiencing a spike to access idle resources from others, optimizing overall utilization.
- Simple to adopt and remains straightforward for users if documented clearly in official documentation.

### Proposal

- Add a new feature flag named `PodLevelResources`. Because this proposal introduces new code paths across all three VPA components, this flag will be added to each component (a possible gate declaration is sketched after the review thread below).
> **Reviewer:** Is this a feature flag to assist with GAing the feature, or is it a flag to enable/disable the feature?
>
> **Author:** My intention is to use the flag to enable or disable the feature. In other words, the feature should be disabled by default at first, and once it matures, it can be enabled by default starting from a specific VPA version. Could you please clarify what you mean by using the flag for GAing the feature?
>
> **Reviewer:** The normal pattern for Kubernetes is to use a feature gate to introduce a new feature. Normally it works like this across many releases:
>
> 1. First release: add a feature gate as alpha, defaulted to off.
> 2. Second release: promote to beta, defaulted to on.
> 3. Third release: promote to GA, locked to on.
> 4. A few releases later (3, I think): remove the feature gate logic completely.
>
> This is mostly for the Kubernetes components to handle roll forward/back gracefully. I think the main thing it protects against is a user starting to use the feature in beta mode: if they roll back one release, the feature would continue to work (i.e. the APIs would be valid) since the logic exists in the alpha mode.
>
> **Author:** Thanks for the explanation! Based on your comment, the feature flags (there will be a new one for each component) will serve both purposes, i.e. GAing and enabling/disabling the feature.
>
> **Reviewer:** Right, the point of feature gates in Kubernetes is to eventually remove them. Enabling/disabling the feature should be driven by the API.

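As an illustration of the agreed direction, here is a minimal sketch of how each component could declare the gate, assuming the components use `k8s.io/component-base/featuregate`; the names and the alpha maturity level are assumptions, not final decisions:

```go
package features

import (
	"k8s.io/component-base/featuregate"
)

const (
	// PodLevelResources enables reading, recommending, and applying
	// pod-level resources across the VPA components. Off by default
	// while the feature is alpha.
	PodLevelResources featuregate.Feature = "PodLevelResources"
)

// defaultFeatureGates would be registered by each component at startup.
var defaultFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	PodLevelResources: {Default: false, PreRelease: featuregate.Alpha},
}
```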

- Extend the VPA object:
  1. Add a new `spec.resourcePolicy.podPolicies` stanza. This stanza is user-modifiable and allows setting constraints for pod-level recommendations:
     - `controlledResources`: Specifies which resource types are recommended (and possibly applied). Valid values are `cpu`, `memory`, or both. If not specified, both resource types are controlled by VPA.
     - `controlledValues`: Specifies which resource values are controlled. Valid values are `RequestsAndLimits` and `RequestsOnly`. The default is `RequestsAndLimits`.
     - `minAllowed`: Specifies the minimum resources that will be recommended for the Pod. The default is no minimum.
     - `maxAllowed`: Specifies the maximum resources that will be recommended for the Pod. The default is no maximum. To ensure per-container recommendations do not exceed the Pod's defined maximum, apply the formula proposed by @omerap12 to adjust the container recommendations (see [discussion](https://github.com/kubernetes/autoscaler/issues/7147#issuecomment-2515296024)). This field takes precedence over the global Pod maximum set by the new flags (see "Global Pod maximums" below).
> **Reviewer:** Thanks for catching that! (I forgot I wrote that, TBH.) :)
>
> **Reviewer:** My formula should be correct, but what happens if, after the normalization of the container[i] resources, we get a value which is smaller/bigger than the minAllowed/maxAllowed? I thought we could do something like this:
>
> - If adjusted[i] < container.minAllowed[i]: set to minAllowed[i]
> - If adjusted[i] > container.maxAllowed[i]: set to maxAllowed[i]
>
> And then we need to re-check the pod limits after the container policy adjustments (since the sum might be bigger). If we are still exceeding the pod limits, what do we want to do here? cc @adrianmoisey. Sorry if I wasn't clear enough. :)
>
> **Author (@iamzili, Oct 8, 2025):** An individual container limit can't be larger than the pod-level limit, but the aggregated container-level limits can exceed the pod-level limit - Ref.
>
> So, when a new pod-level recommendation is calculated and the limit is set proportionally at the pod level, we also need to check the container-level limits. If a container-level limit is greater than the pod-level limit, it should be set to the same value as the pod-level limit, and the calculated container-level recommendation should be reduced proportionally as well, to maintain the original request-to-limit ratio (similar to how it works when a LimitRange API object is in place).
>
> **Reviewer:** Yup, precisely!

  2. Add a new `status.recommendation.podRecommendation` stanza. This field is not user-modifiable; it is populated by the VPA recommender and stores the pod-level recommendation. The updater and admission controller read pod-level recommendations from this stanza: the updater may evict Pods to apply the recommendation, and the admission controller applies it when the Pod is recreated.
> **Reviewer:** Would it be possible to have an example Go type here?

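Responding to the request above, here is a possible shape for the new API fields. This is a sketch only; the type and field names are illustrative and would be finalized during API review:

```go
package v1

import (
	corev1 "k8s.io/api/core/v1"
)

// PodPolicies is a sketch of the proposed spec.resourcePolicy.podPolicies
// stanza; names and optionality are illustrative, not final.
type PodPolicies struct {
	// ControlledResources lists the resource types (cpu, memory) that are
	// recommended at the pod level. Both are controlled when unset.
	ControlledResources *[]corev1.ResourceName `json:"controlledResources,omitempty"`

	// ControlledValues selects whether requests only, or requests and
	// limits, are controlled. Defaults to RequestsAndLimits.
	ControlledValues *string `json:"controlledValues,omitempty"`

	// MinAllowed is the lower bound for pod-level recommendations.
	MinAllowed corev1.ResourceList `json:"minAllowed,omitempty"`

	// MaxAllowed is the upper bound for pod-level recommendations. It takes
	// precedence over the global pod-recommendation-max-allowed-* flags.
	MaxAllowed corev1.ResourceList `json:"maxAllowed,omitempty"`
}

// PodRecommendation is a sketch of status.recommendation.podRecommendation,
// populated by the recommender and read by the updater and the
// admission controller.
type PodRecommendation struct {
	Target     corev1.ResourceList `json:"target,omitempty"`
	LowerBound corev1.ResourceList `json:"lowerBound,omitempty"`
	UpperBound corev1.ResourceList `json:"upperBound,omitempty"`
}
```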

- Global Pod maximums: add two new recommender flags to constrain the maximum CPU and memory recommended at the pod level. These flags are the pod-level equivalents of `container-recommendation-max-allowed-cpu` and `container-recommendation-max-allowed-memory`, and will be named `pod-recommendation-max-allowed-cpu` and `pod-recommendation-max-allowed-memory`. They use the same enforcement formula referenced in the `maxAllowed` section (a sketch follows below). The VerticalPodAutoscaler-level maximum (that is, `maxAllowed`) takes precedence over the global maximum.
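A minimal sketch of the proportional enforcement mentioned for `maxAllowed` and the global flags, assuming recommendations are tracked as plain milli-unit integers for a single resource; the per-container `minAllowed`/`maxAllowed` clamping and re-check raised in the review thread above are intentionally left out:

```go
// capToPodMax scales container recommendations down proportionally when
// their sum exceeds the pod-level maximum, preserving the ratio between
// containers. Values are milli-units of a single resource (e.g. milli-CPU).
func capToPodMax(recommendations []int64, podMax int64) []int64 {
	var sum int64
	for _, r := range recommendations {
		sum += r
	}
	if sum <= podMax || sum == 0 {
		return recommendations // already within the pod-level maximum
	}
	adjusted := make([]int64, len(recommendations))
	for i, r := range recommendations {
		// adjusted[i] = recommendations[i] * podMax / sum
		adjusted[i] = r * podMax / sum
	}
	return adjusted
}
```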

### Validation

#### Static Validation

- The new fields under `spec.resourcePolicy.podPolicies` will be validated consistently, following the same approach we use for the existing `containerPolicies` stanza.
- This AEP proposes validating the new global flags, `--pod-recommendation-max-allowed-cpu` and `--pod-recommendation-max-allowed-memory`, as Kubernetes `Quantity` values using `resource.ParseQuantity`.
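A sketch of how that validation could look; `validatePodMaxAllowedFlag` is a hypothetical helper, while `resource.ParseQuantity` is the real apimachinery function:

```go
package recommender

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// validatePodMaxAllowedFlag parses a --pod-recommendation-max-allowed-*
// flag value as a Kubernetes quantity and rejects malformed input early.
func validatePodMaxAllowedFlag(flagName, value string) (resource.Quantity, error) {
	quantity, err := resource.ParseQuantity(value)
	if err != nil {
		return resource.Quantity{}, fmt.Errorf("invalid value %q for --%s: %w", value, flagName, err)
	}
	return quantity, nil
}
```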

#### Dynamic Validation via Admission Controller

- When a pod-level resources stanza exists in the workload API, the `InPlaceOrRecreate` mode must be avoided because it implies that in-place updates are possible, and [in-place updates are not currently supported for pod-level resources](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5419-pod-level-resources-in-place-resize). The admission controller should therefore reject this mode when such a stanza is present. A minimal sketch of this check follows.
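The sketch assumes `PodSpec.Resources` is populated (the field exists since Kubernetes 1.32) and inlines an illustrative update-mode constant rather than importing the VPA API types:

```go
package admission

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// updateModeInPlaceOrRecreate mirrors the VPA UpdateMode value of the same
// name; inlined here to keep the sketch self-contained.
const updateModeInPlaceOrRecreate = "InPlaceOrRecreate"

// validateUpdateMode rejects InPlaceOrRecreate for Pods that define a
// pod-level resources stanza, since in-place resize is not yet supported
// for pod-level fields.
func validateUpdateMode(pod *corev1.Pod, updateMode string) error {
	if pod.Spec.Resources != nil && updateMode == updateModeInPlaceOrRecreate {
		return fmt.Errorf("update mode %s is not supported for Pods with pod-level resources", updateMode)
	}
	return nil
}
```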

### Test Plan

This AEP proposes thorough unit test coverage for new code paths, and additionally intends to add the following e2e tests:
* Enable VPA for a workload that doesn't define any pod- or container-level resources stanzas. The expected outcome is that recommendations are calculated and applied only at the pod level.
* Enable VPA for a workload that contains only container-level resources stanzas. There is no need to implement this case, as an existing e2e test already covers it. The outcome should remain the same as in a VPA version prior to this AEP: recommendations are calculated and applied only at the container level.
* Enable VPA for a workload that defines a pod-level and at least one container-level resources stanza. The expected outcome is that both the pod-level stanza and the initially defined container-level stanzas are updated.
* Test use cases where the user adds or removes a container from a workload managed by VPA.

### Upgrade / Downgrade Strategy

#### Upgrade

Use a VPA release that includes this feature across all three components and pass `--feature-gates=PodLevelResources=true` to each component. Begin deploying new workloads that specify a pod-level resources stanza.

#### Downgrade

Downgrading VPA from a version that includes this feature should not disrupt existing workloads. Existing pod-level resource specifications remain in Pod specs, and VPA reverts to controlling container-level resources only.

### Examples

TODO

### Kubernetes version compatibility

This feature targets Kubernetes v1.34 or newer, with the beta version of [KEP-2837](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md) (Pod-level resource specifications) enabled.

Kubernetes v1.32 is not recommended because the alpha implementation of Pod-level resources rejects container-level resource updates when Pod-level resources are set, see the [validation and defaulting rules for details](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec/README.md#proposed-validation--defaulting-rules).

## Implementation History

- 2025-09-29: initial version