🐛 Fix race conditions ScaleDownOldMS #12812
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Code-wise everything looks fine to me. I'm being a little nitpicky on some docstring comments, just thinking about someone reading through this code at some point in the future and needing to understand what is happening once the current context is lost.
// NOTE: we are scaling up unavailable machines first in order to increase chances for the rollout to progress;
// however, the MS controller might have different opinion on which machines to scale down.
// As a consequence, the scale down operation must continuously assess if reducing the number of replicas
// for an older MS could further impact availability under the assumption than any scale down could further impact availability (same as above).
I'm having a hard time parsing this sentence.
Rephrased + added an example, PTAL
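For future readers of this thread, here is a minimal, self-contained sketch of the rephrased idea, using made-up names (scaleDownBudget, totAvailableReplicas) rather than the PR's actual code: every scale-down step is clamped by how many available machines can still be lost before breaching minAvailable, because there is no control over which machines the MS controller will actually pick.

```go
package main

import "fmt"

// scaleDownBudget is a hypothetical helper: it limits a desired scale down to the
// number of available machines that can still be lost, pessimistically assuming
// every deleted machine might be an available one.
func scaleDownBudget(totAvailableReplicas, minAvailable, desiredScaleDown int32) int32 {
	budget := max(totAvailableReplicas-minAvailable, 0)
	return min(desiredScaleDown, budget)
}

func main() {
	// 5 machines available, 4 must remain available: at most 1 may be removed,
	// even if the rollout would like to remove 3.
	fmt.Println(scaleDownBudget(5, 4, 3)) // prints 1
}
```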
// Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
// NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
I'm also having trouble parsing this sentence. I think you're saying we will scale down the MS, decrementing by totalScaleDownCount
without going below 0?
Suggested change:
- // Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
- // NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
+ // Then scale down old MS by totalScaleDownCount to get to zero.
+ // NOTE: also in this case, continuously assess if reducing the number of replicas could further impact availability,
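To make the suggested wording concrete, a tiny illustrative helper (assumed names, not the PR's code) for "scale down by the residual totalScaleDownCount without going below 0":

```go
package main

import "fmt"

// newReplicaCount returns the replica count after consuming as much of the
// residual totalScaleDownCount as possible without dropping below zero.
func newReplicaCount(currentReplicas, residualTotalScaleDownCount int32) int32 {
	return currentReplicas - min(currentReplicas, residualTotalScaleDownCount)
}

func main() {
	fmt.Println(newReplicaCount(2, 5)) // prints 0: clamped, never negative
	fmt.Println(newReplicaCount(4, 3)) // prints 1
}
```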
// This funcs tries to detect and address the case when a rollout is not making progress because both scaling down and scaling up are blocked.
// Note. this func must be called after computing scale up/down intent for all the MachineSets.
// Note. this func only address deadlock due to unavailable machines not getting deleted on oldMSs, e.g. due to a wrong configuration.
I don't fully understand what that means, i.e. when that happens
Clarified, PTAL
// This funcs tries to detect and address the case when a rollout is not making progress because both scaling down and scaling up are blocked.
// Note. this func must be called after computing scale up/down intent for all the MachineSets.
// Note. this func only address deadlock due to unavailable machines not getting deleted on oldMSs, e.g. due to a wrong configuration.
// unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old machines set are not remediated by MHC.
Suggested change:
- // unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old machines set are not remediated by MHC.
+ // unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old MachineSets are not remediated by MHC.
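As a rough, hedged sketch of the deadlock being described (hypothetical types and names, not the PR's implementation): if, after computing intents, no MachineSet is going to change size while unavailable replicas exist only on old MachineSets, the rollout cannot make progress on its own, and since MHC does not remediate machines on old MachineSets, those unavailable machines have to be cleaned up explicitly.

```go
package main

import "fmt"

// msState is a simplified, hypothetical view of a MachineSet after the scale
// up/down intents have been computed for the whole MachineDeployment.
type msState struct {
	isNewMS             bool
	replicas            int32 // current replicas
	scaleIntent         int32 // desired replicas after this reconcile
	unavailableReplicas int32
}

// isRolloutDeadlocked reports whether no MachineSet will change size while
// unavailable machines exist only on old MachineSets.
func isRolloutDeadlocked(machineSets []msState) bool {
	anyScaleChange := false
	unavailableOnOldMS := int32(0)
	unavailableOnNewMS := int32(0)
	for _, ms := range machineSets {
		if ms.scaleIntent != ms.replicas {
			anyScaleChange = true
		}
		if ms.isNewMS {
			unavailableOnNewMS += ms.unavailableReplicas
		} else {
			unavailableOnOldMS += ms.unavailableReplicas
		}
	}
	return !anyScaleChange && unavailableOnOldMS > 0 && unavailableOnNewMS == 0
}

func main() {
	deadlocked := isRolloutDeadlocked([]msState{
		{isNewMS: true, replicas: 2, scaleIntent: 2, unavailableReplicas: 0},
		{isNewMS: false, replicas: 1, scaleIntent: 1, unavailableReplicas: 1},
	})
	fmt.Println(deadlocked) // prints true: nothing will move, and the old MS machine stays unavailable
}
```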
// Find the number of pending scale down from previous reconcile/from current reconcile;
// This is required because whenever we are reducing the number of replicas, this operation could further impact availability e.g.
// - in case of regular rollout, there is no certainty about which machine is going to be deleted (and if this machine is currently available or not):
// - e.g. MS controller is going to delete first machines with deletion annotation; also MS controller has a slight different notion of unavailable as of now.
" slight different notion of unavailable as of now" Would it make sense to have a follow-up to fix this?
Added to the tracking issue
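To illustrate why in-flight deletions matter here, a hedged sketch with assumed names (the real code may compute this differently): a scale down requested in a previous reconcile that the MS controller has not completed yet shows up as status replicas still above spec replicas, and it has to be counted against the availability budget as if it were still removing available machines.

```go
package main

import "fmt"

// pendingScaleDown is a hypothetical helper: replicas that were already asked to
// go away (spec lowered) but still exist in status are an in-flight scale down
// that might still remove available machines.
func pendingScaleDown(specReplicas, statusReplicas int32) int32 {
	return max(statusReplicas-specReplicas, 0)
}

func main() {
	// spec already lowered to 2, but 4 machines still reported: 2 deletions are
	// in flight and must be treated as potentially removing available machines.
	fmt.Println(pendingScaleDown(2, 4)) // prints 2
}
```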
return err
// Compute the total number of replicas that can be scaled down.
// Exit immediately if there is no room for scaling down.
totalScaleDownCount := max(totReplicas-totPendingScaleDown-minAvailable, 0)
This seems to assume that every replica of the new MS (according to spec.replicas) is available.
I guess this might be fine if we consider it further down? (Not sure if we do.) In any case, a godoc comment would be good if that is the case.
Added a note
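For readers skimming this thread, a worked example of the quoted formula with made-up numbers: with totReplicas=6, totPendingScaleDown=1 and minAvailable=4, totalScaleDownCount = max(6-1-4, 0) = 1, so only one more replica may be removed in this reconcile. As noted above, the formula treats the new MS replicas counted in totReplicas as if they were available; whether the per-MachineSet checks further down fully account for that is exactly the reviewer's question, and the added note is meant to call the assumption out.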
log.V(4).Info("Cleaned up unhealthy replicas from old MachineSets", "count", cleanupCount)
sort.Sort(mdutil.MachineSetsByCreationTimestamp(p.oldMSs))
Let's please add a comment saying what ordering this gives us (oldest or newest first).
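If it helps when writing that comment, a simplified, self-contained sketch of an ascending-by-creation-timestamp sort (this assumes mdutil.MachineSetsByCreationTimestamp orders oldest first, which should be verified against the real comparator before documenting it):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// byCreationTimestamp is a stand-in for the real comparator: ascending order,
// i.e. oldest MachineSet first (assumption to verify).
type byCreationTimestamp []time.Time

func (o byCreationTimestamp) Len() int           { return len(o) }
func (o byCreationTimestamp) Swap(i, j int)      { o[i], o[j] = o[j], o[i] }
func (o byCreationTimestamp) Less(i, j int) bool { return o[i].Before(o[j]) }

func main() {
	now := time.Now()
	timestamps := byCreationTimestamp{now, now.Add(-time.Hour)}
	sort.Sort(timestamps)
	// After sorting, the oldest timestamp comes first.
	fmt.Println(timestamps[0].Before(timestamps[1])) // prints true
}
```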
newReplicasCount := oldMSReplicas - scaledDownCount
// Compute the scale down extent by considering either unavailable replicas or, if scaleToZero is set, all replicas.
// In both cases, scale down is limited to totalScaleDownCount.
maxScaleDown := max(scaleIntent-ptr.Deref(oldMS.Status.AvailableReplicas, 0), 0)
What if some of the availableReplicas are pending deletion from a previous reconcile?
Are we scaling down too far then?
Discussed: the logic below will ensure this is fine; maybe a bit more godoc can help.
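For a concrete reading of the quoted line (made-up numbers, and only one plausible interpretation): with scaleIntent=3 and Status.AvailableReplicas=2, maxScaleDown = max(3-2, 0) = 1, i.e. without scaleToZero only the one replica not counted as available is eligible for removal in this step. Whether replicas that are already pending deletion from a previous reconcile get double-counted here is exactly the concern raised above.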
availableMachineScaleDown = max(totAvailableReplicas-minAvailable, 0)
scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown

totAvailableReplicas = max(totAvailableReplicas-availableMachineScaleDown, 0)
Why is this only updated if we would breach minAvailable? (I think we need an else branch.)
Should we consider availableReplicas on the MS when computing availableMachineScaleDown?
And is it correct to update this if scaleDown becomes negative (and the MS is not scaled at all)?
I have revised the logic in scaleDownOldMSs to address these comments, PTAL
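For context on the concern about the missing else branch, a hedged, simplified sketch of the bookkeeping being discussed (made-up names; the reworked PR code may differ), which updates the running count of available replicas in both branches:

```go
package main

import "fmt"

// clampScaleDown limits a per-MachineSet scale-down intent so the running total
// of available replicas never drops below minAvailable, and returns both the
// allowed scale down and the updated running total.
func clampScaleDown(scaleDownIntent, totAvailableReplicas, minAvailable int32) (allowed, newTotAvailable int32) {
	if totAvailableReplicas-scaleDownIntent < minAvailable {
		// Scaling down by the full intent could breach minAvailable:
		// only give up as many available replicas as the budget allows.
		allowed = max(totAvailableReplicas-minAvailable, 0)
	} else {
		// The full intent fits within the budget.
		allowed = scaleDownIntent
	}
	// Update the running total in both branches, pessimistically assuming every
	// removed machine was an available one.
	newTotAvailable = max(totAvailableReplicas-allowed, 0)
	return allowed, newTotAvailable
}

func main() {
	allowed, remaining := clampScaleDown(3, 5, 4)
	fmt.Println(allowed, remaining) // prints 1 4: only 1 of the 3 requested removals is allowed
}
```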
machineScaleDownIntent := max(ptr.Deref(oldMS.Status.Replicas, 0)-newScaleIntent, 0)
if totAvailableReplicas-machineScaleDownIntent < minAvailable {
	availableMachineScaleDown = max(totAvailableReplicas-minAvailable, 0)
	scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown
Suggested change:
- scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown
+ scaleDown = scaleDown - (machineScaleDownIntent - availableMachineScaleDown)

Maybe this would be slightly easier to parse. Probably even better to introduce a new var for (machineScaleDownIntent - availableMachineScaleDown).
I have revised this logic in scaleDownOldMSs to improve readability, PTAL
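A quick sketch of the readability suggestion from the thread, using a hypothetical variable name for the disallowed part of the intent (illustrative only, since the logic was reworked anyway):

```go
package main

import "fmt"

func main() {
	var scaleDown, machineScaleDownIntent, availableMachineScaleDown int32 = 5, 3, 1

	// Naming the part of the intent that cannot be honored without breaching
	// minAvailable keeps the adjustment on a single, readable subtraction.
	disallowedScaleDown := machineScaleDownIntent - availableMachineScaleDown
	scaleDown -= disallowedScaleDown

	fmt.Println(scaleDown) // prints 3, same result as scaleDown - intent + available
}
```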
// Before scaling down validate if the operation will lead to a breach to minAvailability
// In order to do so, consider how many machines will be actually deleted, and consider this operation as impacting availability;
// if the projected state breaches minAvailability, reduce the scale down extend accordingly.
availableMachineScaleDown := int32(0)
Let's move this into the if.
totalScaledDown += scaledDownCount
if scaleDown > 0 {
	newScaleIntent := max(scaleIntent-scaleDown, 0)
	log.V(5).Info(fmt.Sprintf("Setting scale down intent for %s to %d replicas (-%d)", oldMS.Name, newScaleIntent, scaleDown), "machineset", client.ObjectKeyFromObject(oldMS).String())
Please fix the MS k/v pair (same for other logs)
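For anyone fixing the other log lines, a hedged sketch of what the requested change likely looks like; the "MachineSet" key and klog.KObj usage are assumptions based on common controller conventions, not a confirmed requirement from this thread:

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	oldMS := &clusterv1.MachineSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "md-1-abcde"}}
	// Instead of stringifying the object key into a lowercase "machineset" key,
	// log a structured object reference under a conventional key.
	klog.InfoS("Setting scale down intent", "MachineSet", klog.KObj(oldMS))
}
```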
@sbueringer @stmcginnis thanks for the first round of feedback,
/test pull-cluster-api-e2e-main
What this PR does / why we need it:
This PR fixes two race conditions in the logic that is responsible for scaling down OldMSs when performing rollingRollout.
The two race conditions were surfaced and documented in the context of the work for #12804. More specifically:
Notably after the fix:
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Part of #12291

/area machinedeployment