Contributor

@thatmidwesterncoder thatmidwesterncoder commented Dec 29, 2025

Issue: rancher/rancher#52665

Summary

This PR implements a validating webhook with a side effect to handle scale-down operations for Rancher-managed machine deployments, eliminating a race condition that can occur when a CAPI MachineDeployment object is scaled.

Problem

During scale-down operations, a race condition can occur when the controller that synchronizes the CAPI MachineDeployment object's replicas field back to the provisioning cluster's matching rkeMachinePool runs too late. This causes the following problematic sequence:

  1. A scale-down operation is initiated on a machine deployment
  2. The CAPI MachineDeployment is updated in etcd
  3. Before the sync controller can propagate the replica count change to the Rancher Provisioning cluster's rkeMachinePool, Rancher provisioning controllers may fire
  4. This results in new nodes being spun up immediately after node deletion
  5. The two objects eventually converge, but only after multiple failed attempts
  6. This creates a poor user experience with delayed and redundant node provisioning

Solution

tl;dr: this moves the logic removed here into the webhook. It looks a little different because of validating-webhook-isms and because the CAPI controllers can't be used, since this codebase lacks a deferred start mechanism.


The PR moves the synchronization logic that updates the Rancher Provisioning cluster object from a controller-based approach to a validating webhook with a side effect. This approach:

  1. Intercepts scale requests before the MachineDeployment is committed to etcd
  2. Updates the Rancher Provisioning cluster's matching rkeMachinePool to match the CAPI MachineDeployment replica count
  3. Only admits the scale request after confirming that the MachineDeployment either doesn't match a Rancher provisioning cluster OR that the provisioning cluster has been updated to match

By processing the scale request through a validating webhook with a side effect, we guarantee that the Rancher object matches the CAPI object before the MachineDeployment even hits etcd. This completely prevents the race condition from occurring.
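
As a rough illustration of the flow described above, here is a minimal Go sketch of "sync first, admit second". `ScaleRequest`, `PoolSyncer`, `Admit`, and `fakeSyncer` are hypothetical names invented for this example; they are not the types in validator.go, and the real webhook works against admission requests and Kubernetes clients rather than an in-memory map.

```go
package main

import (
	"errors"
	"fmt"
)

// ScaleRequest is a hypothetical, simplified stand-in for an admission
// request against the machinedeployments/scale subresource.
type ScaleRequest struct {
	Namespace         string
	MachineDeployment string
	Replicas          int32
}

// PoolSyncer is a hypothetical abstraction over "find the matching
// rkeMachinePool and set its quantity". managed is false when the
// MachineDeployment does not belong to a Rancher provisioning cluster.
type PoolSyncer interface {
	SyncQuantity(req ScaleRequest) (managed bool, err error)
}

// Admit performs the side effect first and only allows the request
// afterwards, so the pool cannot lag behind the MachineDeployment.
func Admit(s PoolSyncer, req ScaleRequest) (allowed bool, reason string) {
	managed, err := s.SyncQuantity(req)
	if err != nil {
		// The scale request fails if the provisioning cluster update fails.
		return false, fmt.Sprintf("failed to sync machine pool: %v", err)
	}
	if !managed {
		return true, "not managed by a Rancher provisioning cluster"
	}
	return true, "machine pool quantity synced"
}

// fakeSyncer is an in-memory PoolSyncer used only for this illustration.
type fakeSyncer struct {
	pools map[string]int32 // keyed by namespace/name
	fail  bool             // simulates a rejected cluster update
}

func (f *fakeSyncer) SyncQuantity(req ScaleRequest) (bool, error) {
	key := req.Namespace + "/" + req.MachineDeployment
	if _, ok := f.pools[key]; !ok {
		return false, nil
	}
	if f.fail {
		return false, errors.New("update rejected")
	}
	f.pools[key] = req.Replicas
	return true, nil
}

func main() {
	s := &fakeSyncer{pools: map[string]int32{"fleet-default/pool-a": 3}}
	req := ScaleRequest{Namespace: "fleet-default", MachineDeployment: "pool-a", Replicas: 2}
	allowed, reason := Admit(s, req)
	fmt.Println(allowed, reason, s.pools["fleet-default/pool-a"])
}
```

The point of the ordering is that a denied request leaves both objects untouched, while an admitted one has already been mirrored onto the pool.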

Notes:

  • Scale requests will fail if the Rancher Provisioning cluster update cannot be completed
    • This may be the correct behavior in scenarios where nodes are provisioned by Rancher - as we are the infra provider in CAPI terms
  • There is a slight update to the way a URL is generated to handle subresources such as machinedeployments/scale in this case
  • This Rancher PR will need to be merged AFTER this one is merged and the webhook is tagged and bumped in Rancher
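
The URL note above concerns resource names such as machinedeployments/scale, which contain a `/`. The actual change to the webhook's SubPath function is not shown in this conversation, so the following is only a sketch of one way such a path could be built. The `subPath` helper and its output layout are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// subPath is a hypothetical helper, not the webhook's actual SubPath.
// It builds a URL path from a group, version, and resource, treating a
// resource such as "machinedeployments/scale" as two path segments.
func subPath(group, version, resource string) string {
	var parts []string
	if group != "" {
		parts = append(parts, url.PathEscape(group))
	}
	parts = append(parts, url.PathEscape(version))
	for _, seg := range strings.Split(resource, "/") {
		if seg != "" {
			// Escape each segment on its own so the separator survives.
			parts = append(parts, url.PathEscape(seg))
		}
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	fmt.Println(subPath("cluster.x-k8s.io", "v1beta1", "machinedeployments/scale"))
	fmt.Println(subPath("", "v1", "pods"))
}
```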

CheckList

  • Test
  • Docs

@thatmidwesterncoder thatmidwesterncoder changed the title from "[main] add validating webhook for machinedeployments/scale resource for autoscaline side-effect" to "[main] add validating webhook for machinedeployments/scale resource for autoscaling side-effect" Dec 29, 2025
@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch 7 times, most recently from ed414aa to aa81370 Compare December 29, 2025 23:05
@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch from aa81370 to df0a24d Compare December 30, 2025 17:37

Copilot AI left a comment

Pull request overview

This PR implements a validating webhook with side effects to handle scale operations on CAPI MachineDeployment resources, preventing race conditions during scale-down operations by synchronizing replica counts between CAPI MachineDeployments and Rancher Provisioning cluster machine pools before the scale request is committed to etcd.

Key Changes:

  • Added a validating webhook for the machinedeployments/scale subresource that synchronizes replica counts with the corresponding Rancher Provisioning cluster machine pool
  • Enhanced admission path handling to support subresources with / in the resource name
  • Added CAPI controllers and generated code to support MachineDeployment and Cluster resource handling

Reviewed changes

Copilot reviewed 8 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go | Core webhook implementation that intercepts scale requests and synchronizes replica counts |
| pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator_test.go | Comprehensive test suite covering happy path, error cases, and edge scenarios |
| pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/Scale.md | Documentation describing webhook behavior and synchronization flow |
| pkg/server/handlers.go | Registration of the new validator in the webhook handlers |
| pkg/clients/clients.go | Addition of CAPI controllers to the client factory |
| pkg/admission/admission.go | Enhanced SubPath function to handle subresources with / characters |
| pkg/codegen/main.go | Configuration to generate controllers and objects for CAPI and autoscaling resources |
| pkg/generated/controllers/cluster.x-k8s.io/* | Generated controller code for CAPI MachineDeployment and Cluster resources |
| pkg/generated/objects/cluster.x-k8s.io/v1beta1/objects.go | Generated helper functions for extracting CAPI objects from admission requests |
| pkg/generated/objects/autoscaling/v1/objects.go | Generated helper functions for extracting Scale objects from admission requests |
| docs.md | Documentation for the new webhook validation behavior |

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 15 changed files in this pull request and generated 7 comments.

@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch 10 times, most recently from 0dc475b to 1919dfa Compare January 6, 2026 21:27
try removing instantiation to get tests to run to figure out where failure starts
@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch from 1919dfa to 9e6e962 Compare January 7, 2026 21:23
@thatmidwesterncoder thatmidwesterncoder marked this pull request as ready for review January 7, 2026 21:39
@thatmidwesterncoder thatmidwesterncoder requested a review from a team as a code owner January 7, 2026 21:39
@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch 2 times, most recently from 770a2c2 to c9c4d2f Compare January 7, 2026 21:41
@thatmidwesterncoder thatmidwesterncoder requested a review from a team January 8, 2026 15:33
@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch from c9c4d2f to 10de80b Compare January 8, 2026 15:42
Collaborator

@crobby crobby left a comment

I think what's here is good. It might also be desirable to add an integration test. They're usually pretty straightforward to write. tests/integration has some existing ones that you could use for a pattern.


Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch 2 times, most recently from 417a0f4 to 0ba4e2e Compare January 9, 2026 19:57
@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch from 0ba4e2e to cbfc660 Compare January 9, 2026 20:21
Contributor

@HarrisonWAffel HarrisonWAffel left a comment

This makes sense. Though I'm not a big fan of having a validating webhook with a side effect like this, I can see how it prevents machine thrashing. Is there anywhere we can document this behavior outside of the webhook to help future debugging? Maybe in autoscaler.go in r/r?


if clusterName == "" {
	logrus.Debugf("MachineDeployment %s/%s has no CAPI cluster name label", md.Namespace, md.Name)
	return nil, apierrors.NewNotFound(schema.GroupResource{Group: "cluster.x-k8s.io", Resource: "clusters"}, "")
Contributor

Contributor Author

Hmm, is there something I'm missing? The GroupVersion doesn't really translate to the GroupResource there. Unless you were more so implying I could yoink the hardcoded strings, which I can definitely do.

Contributor

Ah my bad, probably more relevant for line 61 then.

}

logrus.Debugf("Getting CAPI cluster %s/%s", md.Namespace, clusterName)
capiClusterObj, err := v.dynamic.Get(capiClusterGVK, md.Namespace, clusterName)
Contributor

Forgive my ignorance, but is this a standard practice for the webhook? The whole "could be an object, could be unstructured" thing seems wrong to me, but I also get the impression it's being done elsewhere (and with purpose).

Contributor Author

From what I understand, this is because the CAPI CRDs are not registered in the webhook itself. When I initially started this I generated the controllers etc., which made the webhook fail to start until the CAPI CRDs were available, which is...not the best. And there is no deferred start for controllers here in the webhook.

For now, I added a test to validate that both ways work. In my manual testing it seems to return unstructured data most of the time.
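
To make the "could be an object, could be unstructured" handling concrete, here is a small sketch of a lookup that accepts either shape. The `Cluster` struct, the `ownerNames` helper, and the choice of owner references as the field being read are all assumptions made for this example. The real validator works with the CAPI types and the webhook's dynamic client, which are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical, trimmed-down typed object.
type Cluster struct {
	Name       string
	OwnerNames []string
}

// ownerNames accepts either the typed struct or unstructured content
// (a map[string]interface{} as decoded from JSON) and returns the
// names of the object's owners.
func ownerNames(obj interface{}) ([]string, error) {
	switch o := obj.(type) {
	case *Cluster:
		if o == nil {
			return nil, errors.New("nil cluster")
		}
		return o.OwnerNames, nil
	case map[string]interface{}:
		// Failed assertions yield nil values, which are safe to index and range over.
		meta, _ := o["metadata"].(map[string]interface{})
		refs, _ := meta["ownerReferences"].([]interface{})
		var names []string
		for _, r := range refs {
			ref, ok := r.(map[string]interface{})
			if !ok {
				continue
			}
			if name, ok := ref["name"].(string); ok {
				names = append(names, name)
			}
		}
		return names, nil
	default:
		return nil, fmt.Errorf("unexpected object type %T", obj)
	}
}

func main() {
	typed := &Cluster{Name: "capi-cluster", OwnerNames: []string{"prov-cluster"}}
	unstructured := map[string]interface{}{
		"metadata": map[string]interface{}{
			"ownerReferences": []interface{}{
				map[string]interface{}{"name": "prov-cluster"},
			},
		},
	}
	fmt.Println(ownerNames(typed))
	fmt.Println(ownerNames(unstructured))
}
```

A test that feeds both shapes through the same function, as described above, guards against the two branches drifting apart.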


// Label constants for MachineDeployment labels
const (
	machinePoolNameLabel = "rke.cattle.io/rke-machine-pool-name"
Contributor

Just a remark, but we really ought to move our labels/annotations to pkg/apis so they can be consumed across projects :P

@thatmidwesterncoder thatmidwesterncoder force-pushed the machinedeployment_scale_validating_webhook branch from 24ec5ea to f0ef05c Compare January 9, 2026 23:03
Contributor Author

thatmidwesterncoder commented Jan 9, 2026

> This makes sense. Though I'm not a big fan of having a validating webhook with a side effect like this, I can see how it prevents machine thrashing. Is there anywhere we can document this behavior outside of the webhook to help future debugging? Maybe in autoscaler.go in r/r?

@HarrisonWAffel Yep - I agree. I'll update my PR over on r/r with a link to where this will end up.

Contributor Author

> I think what's here is good. It might also be desirable to add an integration test. They're usually pretty straightforward to write. tests/integration has some existing ones that you could use for a pattern.

@crobby so I added a few integration tests (not too bad really!) and CI still doesn't work due to the aforementioned missing CAPI CRDs. I left them in but skipped for now. Maybe after we get that deferred start functionality in the webhook we could enable them.

Comment on lines +184 to +205
cluster = cluster.DeepCopy()
for i := range cluster.Spec.RKEConfig.MachinePools {
	pool := &cluster.Spec.RKEConfig.MachinePools[i]
	if pool.Name != machinePoolName {
		continue
	}

	// If quantity is nil and targetReplicas is zero, or quantity is non-nil and already
	// equals targetReplicas, no update is needed.
	if pool.Quantity == nil && targetReplicas == 0 {
		return cluster, false
	} else if pool.Quantity != nil && *pool.Quantity == targetReplicas {
		return cluster, false
	}

	if pool.Quantity == nil {
		pool.Quantity = new(int32)
	}
	logrus.Debugf("Updating cluster %s/%s machine pool %s quantity from %d to %d", cluster.Namespace, cluster.Name, machinePoolName, *pool.Quantity, targetReplicas)
	*pool.Quantity = targetReplicas
	return cluster, true
}
Contributor

Actually it didn't occur to me immediately, but we could've moved the deep copy inwards to line 198, so that it only copies when the cluster is definitely going to be mutated, since the code only modifies the desired pool. Where it's at is fine for now though, honestly.
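
A sketch of what moving the copy inwards could look like, using simplified stand-in types. `MachinePool`, `Cluster`, its hand-written `DeepCopy`, and `setPoolQuantity` are hypothetical and only mirror the shape of the snippet above. Returning the original pointer when the pool is not found is also an assumption, since that part of the function is not shown in the review.

```go
package main

import "fmt"

// MachinePool and Cluster are simplified stand-ins for the provisioning types.
type MachinePool struct {
	Name     string
	Quantity *int32
}

type Cluster struct {
	MachinePools []MachinePool
}

// DeepCopy copies the pool slice and every Quantity pointer.
func (c *Cluster) DeepCopy() *Cluster {
	out := &Cluster{MachinePools: make([]MachinePool, len(c.MachinePools))}
	for i, p := range c.MachinePools {
		out.MachinePools[i].Name = p.Name
		if p.Quantity != nil {
			q := *p.Quantity
			out.MachinePools[i].Quantity = &q
		}
	}
	return out
}

// setPoolQuantity only copies the cluster once it knows a mutation is needed.
// When nothing changes it returns the original pointer and false.
func setPoolQuantity(cluster *Cluster, poolName string, target int32) (*Cluster, bool) {
	for i := range cluster.MachinePools {
		pool := cluster.MachinePools[i] // read-only use of the original
		if pool.Name != poolName {
			continue
		}
		var current int32 // a nil quantity is treated as zero
		if pool.Quantity != nil {
			current = *pool.Quantity
		}
		if current == target {
			return cluster, false
		}
		updated := cluster.DeepCopy() // copy deferred until mutation is certain
		updated.MachinePools[i].Quantity = &target
		return updated, true
	}
	return cluster, false
}

func main() {
	three := int32(3)
	orig := &Cluster{MachinePools: []MachinePool{{Name: "pool-a", Quantity: &three}}}
	updated, changed := setPoolQuantity(orig, "pool-a", 2)
	fmt.Println(changed, *updated.MachinePools[0].Quantity, *orig.MachinePools[0].Quantity)
}
```

The trade-off is that callers can no longer assume the returned cluster is always a fresh copy, so they must not mutate it when the boolean is false.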

Contributor

@jakefhyde jakefhyde left a comment

Approving, though we probably still want someone from frameworks to approve.

4 participants