
Conversation

fabriziopandini (Member)

What this PR does / why we need it:
This PR fixes two race conditions in the logic responsible for scaling down old MachineSets (oldMSs) when performing a rolling rollout.
The two race conditions were surfaced and documented in the context of the work for #12804. More specifically:

  • The logic fails when the MD controller is called twice in a row without the MS controller being triggered in between, e.g.
    • the first reconcile scales down ms1, 6-->5 (-1)
    • the second reconcile does not take the scale down already in progress into account; the unhealthy count is wrongly computed as -1 instead of 0, which leads to increasing the replica count instead of keeping it as is (or scaling down), and then the safeguard below errors out (see the sketch of the pending scale down accounting after this list).
  • The logic fails when the MD controller is called twice in a row, e.g. reconcile of an MD with 6 replicas, MaxSurge=3, MaxUnavailable=1
    • when the current state is: ms1, 6/5 replicas << one replica is scaling down, but the scale down has not yet been processed by the MS controller; ms2, 3/3 replicas
    • the reconcile leads to: ms1, 6/1 replicas << ms1 is further scaled down by 4, which leaves the total of available machines below minAvailable, which should not happen
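To make the first race more concrete, here is a minimal sketch (illustrative only; the type and function names are hypothetical and not the PR's actual code) of the pending scale down accounting that keeps a second MD reconcile from counting an in-flight scale down twice:

```go
package main

import "fmt"

// machineSet is a hypothetical, simplified stand-in for a MachineSet.
type machineSet struct {
	name           string
	specReplicas   int32 // desired replicas (spec.replicas)
	statusReplicas int32 // replicas last observed by the MS controller (status.replicas)
}

// pendingScaleDown returns how many replicas are already being scaled down but
// have not yet been processed by the MS controller. The MD controller has to
// subtract this amount before computing any further scale down, otherwise a
// second reconcile accounts for the same replicas twice.
func pendingScaleDown(ms machineSet) int32 {
	if d := ms.statusReplicas - ms.specReplicas; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Scenario from the first race: ms1 was scaled 6-->5 in the previous
	// reconcile, but the MS controller has not caught up yet.
	ms1 := machineSet{name: "ms1", specReplicas: 5, statusReplicas: 6}
	fmt.Println(pendingScaleDown(ms1)) // 1, i.e. one scale down already in flight
}
```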

Notably, after the fix:

  • All rollout sequence tests with the default rollout order complete without any change from the old logic
  • It is now possible to run rollout sequence tests with a random rollout order, increasing the number of tested scenarios from 9 to 918

Which issue(s) this PR fixes:
Part of #12291

/area machinedeployment

@k8s-ci-robot k8s-ci-robot added area/machinedeployment Issues or PRs related to machinedeployments cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 1, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 1, 2025
@stmcginnis (Contributor) left a comment


Code-wise everything looks fine to me. I'm being a little nitpicky on some docstring comments, just thinking about reading through this code at some point in the future and needing to understand what is happening once the current context is lost.

// NOTE: we are scaling up unavailable machines first in order to increase chances for the rollout to progress;
// however, the MS controller might have different opinion on which machines to scale down.
// As a consequence, the scale down operation must continuously assess if reducing the number of replicas
// for an older MS could further impact availability under the assumption than any scale down could further impact availability (same as above).
Contributor:

I'm having a hard time parsing this sentence.

Member Author:

Rephrased + added an example, PTAL

Comment on lines 222 to 223
// Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
// NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
Contributor:

Having trouble parsing this sentence too. I think you're saying we will scale down the MS, decrementing by totalScaleDownCount without going below 0?

Suggested change
// Then scale down old MS up to zero replicas / up to residual totalScaleDownCount.
// NOTE: also in this case, continuously assess if reducing the number of replicase could further impact availability,
// Then scale down old MS by totalScaleDownCount to get to zero.
// NOTE: also in this case, continuously assess if reducing the number of replicas could further impact availability,


// This funcs tries to detect and address the case when a rollout is not making progress because both scaling down and scaling up are blocked.
// Note. this func must be called after computing scale up/down intent for all the MachineSets.
// Note. this func only address deadlock due to unavailable machines not getting deleted on oldMSs, e.g. due to a wrong configuration.
Member:

I don't fully understand what that means, i.e. when that happens.

Member Author:

Clarified, PTAL

// This funcs tries to detect and address the case when a rollout is not making progress because both scaling down and scaling up are blocked.
// Note. this func must be called after computing scale up/down intent for all the MachineSets.
// Note. this func only address deadlock due to unavailable machines not getting deleted on oldMSs, e.g. due to a wrong configuration.
// unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old machines set are not remediated by MHC.
Member:

Suggested change
// unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old machines set are not remediated by MHC.
// unblocking deadlock when unavailable machines exists only on oldMSs, is required also because failures on old MachineSets are not remediated by MHC.
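To make the deadlock scenario more concrete, here is an illustrative sketch (assumed names and signature, not the PR's implementation) of the condition the func above is meant to detect: scale up and scale down are both blocked, yet unavailable machines exist only on old MachineSets, which MHC does not remediate, so nothing would ever progress without intervention:

```go
// detectScaleDeadlock reports whether a rollout is stuck: no MachineSet has a
// pending scale up or scale down intent, yet old MachineSets still hold
// unavailable machines (which MHC does not remediate on old MachineSets).
func detectScaleDeadlock(scaleUpIntent, scaleDownIntent, unavailableOnOldMSs, unavailableOnNewMS int32) bool {
	scalingBlocked := scaleUpIntent == 0 && scaleDownIntent == 0
	return scalingBlocked && unavailableOnOldMSs > 0 && unavailableOnNewMS == 0
}
```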

// Find the number of pending scale down from previous reconcile/from current reconcile;
// This is required because whenever we are reducing the number of replicas, this operation could further impact availability e.g.
// - in case of regular rollout, there is no certainty about which machine is going to be deleted (and if this machine is currently available or not):
// - e.g. MS controller is going to delete first machines with deletion annotation; also MS controller has a slight different notion of unavailable as of now.
Member:

" slight different notion of unavailable as of now" Would it make sense to have a follow-up to fix this?

Member Author:

Added to the tracking issue

return err
// Compute the total number of replicas that can be scaled down.
// Exit immediately if there is no room for scaling down.
totalScaleDownCount := max(totReplicas-totPendingScaleDown-minAvailable, 0)
Member:

This seems to assume that every replica of the new MS (according to spec.replicas) is available.

I guess this might be fine if we consider it further down? (not sure if we do). In any case, a godoc comment would be good if that is the case.

Member Author:

Added a note
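For context, here is a rough sketch (assumed names, following the quoted snippet) of how the scale down budget can be derived, under the assumption discussed above that minAvailable = desired replicas - MaxUnavailable and that replicas of the new MS are counted as available:

```go
// totalScaleDownBudget computes how many replicas can still be removed across
// all MachineSets without breaching availability, taking into account scale
// downs already pending from previous reconciles.
func totalScaleDownBudget(mdReplicas, maxUnavailable, totReplicas, totPendingScaleDown int32) int32 {
	minAvailable := mdReplicas - maxUnavailable
	return max(totReplicas-totPendingScaleDown-minAvailable, 0)
}
```

For example, with an MD of 6 replicas and MaxUnavailable=1 (minAvailable=5), totReplicas=9 (6 on the old MS plus 3 on the new MS) and one pending scale down, the budget is 9-1-5 = 3.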

}

log.V(4).Info("Cleaned up unhealthy replicas from old MachineSets", "count", cleanupCount)
sort.Sort(mdutil.MachineSetsByCreationTimestamp(p.oldMSs))
Member:

Let's please add a comment on what ordering this gives us (oldest or newest first).

newReplicasCount := oldMSReplicas - scaledDownCount
// Compute the scale down extent by considering either unavailable replicas or, if scaleToZero is set, all replicas.
// In both cases, scale down is limited to totalScaleDownCount.
maxScaleDown := max(scaleIntent-ptr.Deref(oldMS.Status.AvailableReplicas, 0), 0)
@sbueringer (Member), Oct 6, 2025:

What if some of the availableReplicas are pending deletion from a previous reconcile?
Are we scaling down too far then?

Discussed: the logic below will ensure this is fine; maybe a bit more godoc can help.
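A simplified sketch of the extent computation the quoted comment describes (assumed shape and names, not the PR's exact code), which also shows where the cap protects against over-scaling when availableReplicas still counts machines that are pending deletion:

```go
// scaleDownExtent returns how many replicas of one old MachineSet may be
// removed in this pass: only the unavailable ones by default, or all of them
// when scaleToZero is set, always capped by the remaining totalScaleDownCount budget.
func scaleDownExtent(scaleIntent, availableReplicas, totalScaleDownCount int32, scaleToZero bool) int32 {
	extent := max(scaleIntent-availableReplicas, 0) // unavailable replicas only
	if scaleToZero {
		extent = scaleIntent // all replicas, e.g. when draining an old MachineSet completely
	}
	return min(extent, totalScaleDownCount)
}
```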

availableMachineScaleDown = max(totAvailableReplicas-minAvailable, 0)
scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown

totAvailableReplicas = max(totAvailableReplicas-availableMachineScaleDown, 0)
@sbueringer (Member), Oct 6, 2025:

Why is this only updated if we would breach minAvailable? (I think we need an else branch)

Should we consider availableReplicas on the MS when computing availableMachineScaleDown?

And is it correct to update this if scaleDown becomes negative (and the MS is not scaled at all)?

@fabriziopandini (Member Author), Oct 6, 2025:

I have reviewed the logic in scaleDownOldMSs to address these comments, PTAL

machineScaleDownIntent := max(ptr.Deref(oldMS.Status.Replicas, 0)-newScaleIntent, 0)
if totAvailableReplicas-machineScaleDownIntent < minAvailable {
availableMachineScaleDown = max(totAvailableReplicas-minAvailable, 0)
scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown
@sbueringer (Member), Oct 6, 2025:

Suggested change
scaleDown = scaleDown - machineScaleDownIntent + availableMachineScaleDown
scaleDown = scaleDown - (machineScaleDownIntent - availableMachineScaleDown)

Maybe this would be slightly easier to parse. Probably even better to introduce a new var for (machineScaleDownIntent - availableMachineScaleDown)

@fabriziopandini (Member Author), Oct 6, 2025:

I have reviewed this logic in scaleDownOldMSs to improve readability, PTAL
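For illustration, one possible shape of the refactor being discussed, with a named variable for the part of the intent that cannot be tolerated and an explicit else branch (assumed names and structure; the PR's final code may differ):

```go
// clampScaleDown adjusts the scale down intent of one old MachineSet so that
// the projected number of available replicas never drops below minAvailable.
// It returns the adjusted scaleDown and the updated running total of available replicas.
func clampScaleDown(scaleDown, machineScaleDownIntent, totAvailableReplicas, minAvailable int32) (int32, int32) {
	if totAvailableReplicas-machineScaleDownIntent < minAvailable {
		allowedScaleDown := max(totAvailableReplicas-minAvailable, 0)
		excessScaleDown := machineScaleDownIntent - allowedScaleDown // the part that would breach availability
		return scaleDown - excessScaleDown, totAvailableReplicas - allowedScaleDown
	}
	// The intended scale down fits within the availability budget; still track
	// the projected available replicas (the "else branch" raised above).
	return scaleDown, totAvailableReplicas - machineScaleDownIntent
}
```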

// Before scaling down validate if the operation will lead to a breach to minAvailability
// In order to do so, consider how many machines will be actually deleted, and consider this operation as impacting availability;
// if the projected state breaches minAvailability, reduce the scale down extend accordingly.
availableMachineScaleDown := int32(0)
Member:

Let's move this into the if.

totalScaledDown += scaledDownCount
if scaleDown > 0 {
newScaleIntent := max(scaleIntent-scaleDown, 0)
log.V(5).Info(fmt.Sprintf("Setting scale down intent for %s to %d replicas (-%d)", oldMS.Name, newScaleIntent, scaleDown), "machineset", client.ObjectKeyFromObject(oldMS).String())
Member:

Please fix the MS k/v pair (same for other logs)
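For reference, a hedged sketch of what this likely means, following the common Cluster API / controller-runtime convention of logging the object as a structured key/value pair rather than a formatted string key (the exact key name used by the codebase is an assumption here):

```go
// Assumes k8s.io/klog/v2 is imported as klog.
log.V(5).Info(fmt.Sprintf("Setting scale down intent for %s to %d replicas (-%d)", oldMS.Name, newScaleIntent, scaleDown),
	"MachineSet", klog.KObj(oldMS))
```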

@fabriziopandini (Member Author)

@sbueringer @stmcginnis thanks for the first round of feedback.
I have reviewed the PR to add examples and improve godoc, and also reworked part of the logic in scaleDownOldMSs.
PTAL

@fabriziopandini (Member Author)

/test pull-cluster-api-e2e-main
