
Conversation

fabriziopandini (Member)

What this PR does / why we need it:
This PR fixes two race conditions in the logic responsible for scaling down old MachineSets (oldMSs) when performing a rolling rollout.
The two race conditions were surfaced and documented in the context of the work for #12804. More specifically:

  • The logic fails when the MD controller is called twice in a row without the MS controller being triggered in between, e.g.
    • the first reconcile scales down ms1, 6-->5 (-1)
    • the second reconcile does not take the scale down already in progress into account; the unhealthy count is wrongly computed as -1 instead of 0, which leads to increasing the replica count instead of keeping it as is (or scaling down), and then the safeguard below errors out (see the sketch of the pending scale down accounting after this list).
  • The logic fails when the MD controller is called twice in a row, e.g. reconcile of an MD with 6 replicas, MaxSurge=3, MaxUnavailable=1
    • when the current state is: ms1, 6/5 replicas << one replica is scaling down, but the scale down has not yet been processed by the MS controller; ms2, 3/3 replicas
    • the reconcile leads to: ms1, 6/1 replicas << ms1 is further scaled down by 4, which leaves the total of available machines below minAvailable, which should not happen
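To make the first race more concrete, here is a minimal sketch (illustrative only; the type and function names are hypothetical and not the PR's actual code) of the pending scale down accounting that keeps a second MD reconcile from counting an in-flight scale down twice:

```go
package main

import "fmt"

// machineSet is a hypothetical, simplified stand-in for a MachineSet.
type machineSet struct {
	name           string
	specReplicas   int32 // desired replicas (spec.replicas)
	statusReplicas int32 // replicas last observed by the MS controller (status.replicas)
}

// pendingScaleDown returns how many replicas are already being scaled down but
// have not yet been processed by the MS controller. The MD controller has to
// subtract this amount before computing any further scale down, otherwise a
// second reconcile accounts for the same replicas twice.
func pendingScaleDown(ms machineSet) int32 {
	if d := ms.statusReplicas - ms.specReplicas; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Scenario from the first race: ms1 was scaled 6-->5 in the previous
	// reconcile, but the MS controller has not caught up yet.
	ms1 := machineSet{name: "ms1", specReplicas: 5, statusReplicas: 6}
	fmt.Println(pendingScaleDown(ms1)) // 1, i.e. one scale down already in flight
}
```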

Notably, after the fix:

  • All rollout sequence tests with the default rollout order complete without any change from the old logic
  • It is now possible to run rollout sequence tests with a random rollout order, increasing the number of tested scenarios from 9 to 918

Which issue(s) this PR fixes:
Part of #12291

/area machinedeployment

@k8s-ci-robot k8s-ci-robot added area/machinedeployment Issues or PRs related to machinedeployments cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 1, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 1, 2025
@stmcginnis (Contributor) left a comment


Code-wise everything looks fine to me. I'm being a little nitpicky on some docstring comments, just thinking about reading through this code at some point in the future and needing to understand what is happening once the current context is lost.

// NOTE: we are scaling up unavailable machines first in order to increase chances for the rollout to progress;
// however, the MS controller might have different opinion on which machines to scale down.
// As a consequence, the scale down operation must continuously assess if reducing the number of replicas
// for an older MS could further impact availability under the assumption than any scale down could further impact availability (same as above).
Contributor:

I'm having a hard time parsing this sentence.

Member Author:

Rephrased + added an example, PTAL

Comment on lines 222 to 223
// Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
// NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
Contributor:

Having trouble parsing this sentence too. I think you're saying we will scale down the MS, decrementing by totalScaleDownCount without going below 0?

Suggested change
// Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
// NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
// Then scale down old MS by totalScaleDownCount to get to zero.
// NOTE: also in this case, continuously assess if reducing the number of replicas could further impact availability,


// This funcs tries to detect and address the case when a rollout is not making progress because both scaling down and scaling up are blocked.
// Note. this func must be called after computing scale up/down intent for all the MachineSets.
// Note. this func only address deadlock due to unavailable machines not getting deleted on oldMSs, e.g. due to a wrong configuration.
Member:

I don't fully understand what that means, i.e. when that happens.

Member Author:

Clarified, PTAL

// This funcs tries to detect and address the case when a rollout is not making progress because both scaling down and scaling up are blocked.
// Note. this func must be called after computing scale up/down intent for all the MachineSets.
// Note. this func only address deadlock due to unavailable machines not getting deleted on oldMSs, e.g. due to a wrong configuration.
// unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old machines set are not remediated by MHC.
Member:

Suggested change
// unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old machines set are not remediated by MHC.
// unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old MachineSets are not remediated by MHC.
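To make the deadlock scenario more concrete, here is an illustrative sketch (assumed names and signature, not the PR's implementation) of the condition the func above is meant to detect: scale up and scale down are both blocked, yet unavailable machines exist only on old MachineSets, which MHC does not remediate, so nothing would ever progress without intervention:

```go
// detectScaleDeadlock reports whether a rollout is stuck: no MachineSet has a
// pending scale up or scale down intent, yet old MachineSets still hold
// unavailable machines (which MHC does not remediate on old MachineSets).
func detectScaleDeadlock(scaleUpIntent, scaleDownIntent, unavailableOnOldMSs, unavailableOnNewMS int32) bool {
	scalingBlocked := scaleUpIntent == 0 && scaleDownIntent == 0
	return scalingBlocked && unavailableOnOldMSs > 0 && unavailableOnNewMS == 0
}
```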

// Find the number of pending scale down from previous reconcile/from current reconcile;
// This is required because whenever we are reducing the number of replicas, this operation could further impact availability e.g.
// - in case of regular rollout, there is no certainty about which machine is going to be deleted (and if this machine is currently available or not):
// - e.g. MS controller is going to delete first machines with deletion annotation; also MS controller has a slight different notion of unavailable as of now.
Member:

" slight different notion of unavailable as of now" Would it make sense to have a follow-up to fix this?

Member Author:

Added to the tracking issue

return err
// Compute the total number of replicas that can be scaled down.
// Exit immediately if there is no room for scaling down.
totalScaleDownCount := max(totReplicas-totPendingScaleDown-minAvailable, 0)
Member:

This seems to assume that every replica of the new MS (according to spec.replicas) is available.

I guess this might be fine if we consider it further down? (not sure if we do). In any case, a godoc comment would be good if that is the case.

Member Author:

Added a note
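For context, here is a rough sketch (assumed names, following the quoted snippet) of how the scale down budget can be derived, under the assumption discussed above that minAvailable = desired replicas - MaxUnavailable and that replicas of the new MS are counted as available:

```go
// totalScaleDownBudget computes how many replicas can still be removed across
// all MachineSets without breaching availability, taking into account scale
// downs already pending from previous reconciles.
func totalScaleDownBudget(mdReplicas, maxUnavailable, totReplicas, totPendingScaleDown int32) int32 {
	minAvailable := mdReplicas - maxUnavailable
	return max(totReplicas-totPendingScaleDown-minAvailable, 0)
}
```

For example, with an MD of 6 replicas and MaxUnavailable=1 (minAvailable=5), totReplicas=9 (6 on the old MS plus 3 on the new MS) and one pending scale down, the budget is 9-1-5 = 3.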

}

log.V(4).Info("Cleaned up unhealthy replicas from old MachineSets", "count", cleanupCount)
sort.Sort(mdutil.MachineSetsByCreationTimestamp(p.oldMSs))
Member:

Let's please add a comment on what ordering this gives us (oldest or newest first).

newReplicasCount := oldMSReplicas - scaledDownCount
// Compute the scale down extent by considering either unavailable replicas or, if scaleToZero is set, all replicas.
// In both cases, scale down is limited to totalScaleDownCount.
maxScaleDown := max(scaleIntent-ptr.Deref(oldMS.Status.AvailableReplicas, 0), 0)
@sbueringer (Member), Oct 6, 2025:

What if some of the availableReplicas are pending deletion from a previous reconcile?
Are we scaling down too far then?

Discussed: the logic below will ensure this is fine; maybe a bit more godoc can help.
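A simplified sketch of the extent computation the quoted comment describes (assumed shape and names, not the PR's exact code), which also shows where the cap protects against over-scaling when availableReplicas still counts machines that are pending deletion:

```go
// scaleDownExtent returns how many replicas of one old MachineSet may be
// removed in this pass: only the unavailable ones by default, or all of them
// when scaleToZero is set, always capped by the remaining totalScaleDownCount budget.
func scaleDownExtent(scaleIntent, availableReplicas, totalScaleDownCount int32, scaleToZero bool) int32 {
	extent := max(scaleIntent-availableReplicas, 0) // unavailable replicas only
	if scaleToZero {
		extent = scaleIntent // all replicas, e.g. when draining an old MachineSet completely
	}
	return min(extent, totalScaleDownCount)
}
```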

availableMachineScaleDown = max(totAvailableReplicas-minAvailable, 0)
scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown

totAvailableReplicas = max(totAvailableReplicas-availableMachineScaleDown, 0)
@sbueringer (Member), Oct 6, 2025:

Why is this only updated if we would breach minAvailable? (I think we need an else branch)

Should we consider availableReplicas on the MS when computing availableMachineScaleDown?

And is it correct to update this if scaleDown becomes negative (and the MS is not scaled at all)?

@fabriziopandini (Member Author), Oct 6, 2025:

I have reviewed the logic in scaleDownOldMSs to address these comments, PTAL

machineScaleDownIntent := max(ptr.Deref(oldMS.Status.Replicas, 0)-newScaleIntent, 0)
if totAvailableReplicas-machineScaleDownIntent < minAvailable {
availableMachineScaleDown = max(totAvailableReplicas-minAvailable, 0)
scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown
@sbueringer (Member), Oct 6, 2025:

Suggested change
scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown
scaleDown = scaleDown - (machineScaleDownIntent - availableMachineScaleDown)

Maybe this would be slightly easier to parse. Probably even better to introduce a new var for (machineScaleDownIntent - availableMachineScaleDown)

@fabriziopandini (Member Author), Oct 6, 2025:

I have reviewed this logic in scaleDownOldMSs to improve readability, PTAL
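For illustration, one possible shape of the refactor being discussed, with a named variable for the part of the intent that cannot be tolerated and an explicit else branch (assumed names and structure; the PR's final code may differ):

```go
// clampScaleDown adjusts the scale down intent of one old MachineSet so that
// the projected number of available replicas never drops below minAvailable.
// It returns the adjusted scaleDown and the updated running total of available replicas.
func clampScaleDown(scaleDown, machineScaleDownIntent, totAvailableReplicas, minAvailable int32) (int32, int32) {
	if totAvailableReplicas-machineScaleDownIntent < minAvailable {
		allowedScaleDown := max(totAvailableReplicas-minAvailable, 0)
		excessScaleDown := machineScaleDownIntent - allowedScaleDown // the part that would breach availability
		return scaleDown - excessScaleDown, totAvailableReplicas - allowedScaleDown
	}
	// The intended scale down fits within the availability budget; still track
	// the projected available replicas (the "else branch" raised above).
	return scaleDown, totAvailableReplicas - machineScaleDownIntent
}
```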

// Before scaling down validate if the operation will lead to a breach to minAvailability
// In order to do so, consider how many machines will be actually deleted, and consider this operation as impacting availability;
// if the projected state breaches minAvailability, reduce the scale down extend accordingly.
availableMachineScaleDown := int32(0)
Member:

Let's move this into the if.

totalScaledDown += scaledDownCount
if scaleDown > 0 {
newScaleIntent := max(scaleIntent-scaleDown, 0)
log.V(5).Info(fmt.Sprintf("Setting scale down intent for %s to %d replicas (-%d)", oldMS.Name, newScaleIntent, scaleDown), "machineset", client.ObjectKeyFromObject(oldMS).String())
Member:

Please fix the MS k/v pair (same for other logs)
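For reference, a hedged sketch of what this likely means, following the common Cluster API / controller-runtime convention of logging the object as a structured key/value pair rather than a formatted string key (the exact key name used by the codebase is an assumption here):

```go
// Assumes k8s.io/klog/v2 is imported as klog.
log.V(5).Info(fmt.Sprintf("Setting scale down intent for %s to %d replicas (-%d)", oldMS.Name, newScaleIntent, scaleDown),
	"MachineSet", klog.KObj(oldMS))
```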

@fabriziopandini (Member Author)

@sbueringer @stmcginnis thanks for the first round of feedback.
I have reviewed the PR to add examples and improve godoc, and also reworked part of the logic in scaleDownOldMSs.
PTAL

@fabriziopandini (Member Author)

/test pull-cluster-api-e2e-main
