AEP-8818: InPlace Update Mode #8818
Signed-off-by: Omer Aplatony <[email protected]>
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: omerap12.
/kind api-review
@omerap12: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/label kind/api-review
/label api-review
/lgtm @omerap12 If you have a draft PR that makes sense to review, then let me know, Omer.
@iamzili: changing LGTM is restricted to collaborators
I'll take a look at this next week @omerap12, sorry for the delay 🥲
klog.V(4).InfoS("Can't in-place update pod, waiting for next loop", "pod", klog.KObj(pod))
return utils.InPlaceDeferred
Minor nit, are these supposed to be indented the same level?
Not sure, I'll try to run fmt soon.
Retry is handled entirely by the Kubelet based on pod conditions:
- `PodResizePending` (reason: `Deferred`) - Kubelet will retry automatically, , VPA continues to defer.
Suggested change, from:
- `PodResizePending` (reason: `Deferred`) - Kubelet will retry automatically, , VPA continues to defer.
to:
- `PodResizePending` (reason: `Deferred`) - Kubelet will retry automatically, VPA continues to defer.
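To make the quoted rule concrete, here is a minimal Go sketch of how a pending resize with reason `Deferred` could map to "keep deferring". The `Condition` struct and `nextStep` function are hypothetical stand-ins written for this illustration, not the updater's actual types, and only the one condition quoted above is modeled.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down stand-in for a pod condition.
// It is not the Kubernetes API type.
type Condition struct {
	Type   string
	Reason string
}

// nextStep is a hypothetical helper: it returns "defer" when the pod reports
// PodResizePending with reason Deferred, because the Kubelet retries on its
// own and VPA only has to keep waiting. Any other state is reported as
// "no-deferred-resize"; other reasons are deliberately not modeled here.
func nextStep(conds []Condition) string {
	for _, c := range conds {
		if c.Type == "PodResizePending" && c.Reason == "Deferred" {
			return "defer"
		}
	}
	return "no-deferred-resize"
}

func main() {
	pod := []Condition{{Type: "PodResizePending", Reason: "Deferred"}}
	fmt.Println(nextStep(pod))
}
```

The point of the sketch is that VPA holds no retry state of its own in this case: it reads the condition on each loop and waits.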
### Behavior when Feature Gate is Disabled

- When `InPlace` feature gate is disabled and a VPA is configured with `UpdateMode: InPlace`, the updater will skip processing that VPA entirely (not fall back to eviction).
Just want to check: it won't evict and it won't in-place update?
Also, what does the admission-controller do when the feature gate is disabled but a pod is set to InPlace?
The admission controller will deny the request (ref).
> Just want to check: it won't evict and it won't in-place update?
That’s what I assumed, because if someone wants to use in-place mode only, it likely means the workload can’t be evicted. In that case, I think the correct action is to do nothing.
Well, what if someone does this:
- Upgrades to this version of VPA and enables the feature gate
- Uses the InPlace mode on a VPA
- Disables the feature gate
- Deletes a Pod from the VPA pointing at InPlace
Does the admission-controller:
- Set the resources as per the recommendation (as if the VPA was in "Initial" mode)
- Ignore the pod (as if the VPA was in "Off" mode)
- Something else..
TBH I didn't test it, but it should be the first option (set the resources as per the recommendation).
Just checked, we set the resources as per the recommendation.
Cool, that's worth clarifying here
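As a summary of what this thread settled on, the following Go sketch lays out the three decisions side by side. All function names are hypothetical and the mode and gate are modeled as plain values. It also assumes that "deny the request" refers to requests that create or update a VPA object with the `InPlace` mode, which the thread does not spell out.

```go
package main

import "fmt"

const inPlaceMode = "InPlace"

// updaterAction (hypothetical) models only the case discussed above: with the
// gate disabled, a VPA in InPlace mode is skipped entirely, so its pods are
// neither evicted nor resized in place. Everything else is left unchanged.
func updaterAction(gateEnabled bool, updateMode string) string {
	if updateMode == inPlaceMode && !gateEnabled {
		return "skip"
	}
	return "existing-behavior"
}

// admitVPAObject (hypothetical) reflects the assumption that the admission
// controller rejects VPA objects using InPlace while the gate is disabled.
func admitVPAObject(gateEnabled bool, updateMode string) bool {
	return updateMode != inPlaceMode || gateEnabled
}

// podAdmissionAction (hypothetical) reflects the result reported in the
// thread: for a VPA that already exists in InPlace mode, a newly created pod
// still gets the recommended resources, even with the gate disabled.
func podAdmissionAction(updateMode string) string {
	if updateMode == "Off" {
		return "ignore"
	}
	return "apply-recommendation"
}

func main() {
	fmt.Println(updaterAction(false, inPlaceMode))
	fmt.Println(admitVPAObject(false, inPlaceMode))
	fmt.Println(podAdmissionAction(inPlaceMode))
}
```

Writing it out this way shows the gap the reviewers noticed: the updater and the admission controller react differently to the same disabled gate, which is why the AEP text should state both.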
- Guarantee that all updates will eventually succeed (node capacity constraints may prevent this)
- Provide mechanisms to automatically increase node capacity to accommodate updates
- Change the behavior of existing update modes (Auto, Recreate, InPlaceOrRecreate)
Suggested change, from:
- Change the behavior of existing update modes (Auto, Recreate, InPlaceOrRecreate)
to:
- Change the behavior of existing update modes (Off, Recreate, InPlaceOrRecreate)
Not sure if adding Auto makes sense, since it's deprecated? I'll let you decide.
- Apply recommendations during pod admission (like all other modes)
- Attempt in-place updates for running pods under the same conditions as `InPlaceOrRecreate`
- Never add pods to `podsForEviction` if in-place updates fail
- Continuously retry failed in-place update
Should we have a backoff policy for retrying, or do we think linear retry is sufficient if we keep failing?
- Allow VPA to eventually apply updates when cluster conditions improve
- Respect the existing in-place update infrastructure from AEP-4016
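The difference between the new mode and `InPlaceOrRecreate` described in the bullets above can be sketched in Go as follows. The function name and the returned strings are hypothetical, and the exact conditions under which `InPlaceOrRecreate` falls back to eviction are defined by AEP-4016 and are not modeled here.

```go
package main

import "fmt"

// onInPlaceFailure (hypothetical) returns what happens to a pod whose
// in-place update did not succeed, depending on the VPA update mode.
func onInPlaceFailure(updateMode string) string {
	switch updateMode {
	case "InPlaceOrRecreate":
		// May be added to podsForEviction as a fallback.
		return "may-evict"
	case "InPlace":
		// Never added to podsForEviction; the update is simply retried.
		return "retry-in-place"
	default:
		// Other modes do not attempt in-place updates in this sketch.
		return "not-applicable"
	}
}

func main() {
	for _, mode := range []string{"InPlaceOrRecreate", "InPlace", "Recreate"} {
		fmt.Println(mode, "->", onInPlaceFailure(mode))
	}
}
```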

## Non-Goals
Do we think we should have a small note that this update mode is subject to the behavior of the `InPlacePodVerticalScaling` feature gate, such that it's possible (but improbable) that a resize can cause an OOM kill during a memory limit downsize?
Though I don't actually know the probability of this happening if a limit gets resized close to the usage, I think it may be useful to call out, since we emphasize that brief disruptions are unacceptable.
I think to mitigate risk here we may want to recommend that if you absolutely cannot tolerate disruption (i.e. an unintended OOM kill), then you can either:
- disallow memory limits for your no-disruption container
- if you must allow VPA to set memory limits, configure the VPA to generate more generous/conservative memory limit recommendations as a safety buffer.
^ Though this may or may not be better for our docs, rather than getting into it in the AEP here.
Thoughts? cc @adrianmoisey
I think you're right.
I had a similar thought about the "Provide a truly non-disruptive VPA update mode that never evicts pods" goal.
I think it may be worth softening the language in the AEP (since we can't guarantee that resizes are non-disruptive).
I also agree that most of what you suggested may be good for the docs.
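The "safety buffer" idea from this thread can be illustrated with a short Go sketch that checks whether a proposed memory limit leaves a margin above current usage. This is not an existing VPA feature or API: the `hasHeadroom` helper, the 20% margin, and the byte values are assumptions made up for the example.

```go
package main

import "fmt"

const mib = int64(1) << 20

// hasHeadroom (hypothetical) reports whether proposedLimitBytes is at least
// marginPercent above usageBytes. A limit resized too close to current usage
// is the situation the thread flags as a possible OOM kill risk.
func hasHeadroom(usageBytes, proposedLimitBytes, marginPercent int64) bool {
	required := usageBytes + usageBytes*marginPercent/100
	return proposedLimitBytes >= required
}

func main() {
	usage := 800 * mib
	// With a 20% margin the limit must be at least 960 MiB.
	fmt.Println(hasHeadroom(usage, 850*mib, 20))
	fmt.Println(hasHeadroom(usage, 1024*mib, 20))
}
```

A check like this only reduces the risk. As noted above, an in-place resize cannot be guaranteed to be non-disruptive.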
Related: #8805
What type of PR is this?
/kind documentation
What this PR does / why we need it:
AEP for #8720
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: