-
Notifications
You must be signed in to change notification settings - Fork 81
[main] add validating webhook for machinedeployments/scale resource for autoscaling side-effect
#1208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
machinedeployments/scale resource for autoscaline side-effect machinedeployments/scale resource for autoscaling side-effect
ed414aa to
aa81370
Compare
aa81370 to
df0a24d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a validating webhook with side effects to handle scale operations on CAPI MachineDeployment resources, preventing race conditions during scale-down operations by synchronizing replica counts between CAPI MachineDeployments and Rancher Provisioning cluster machine pools before the scale request is committed to etcd.
Key Changes:
- Added a validating webhook for the
machinedeployments/scalesubresource that synchronizes replica counts with the corresponding Rancher Provisioning cluster machine pool - Enhanced admission path handling to support subresources with
/in the resource name - Added CAPI controllers and generated code to support MachineDeployment and Cluster resource handling
Reviewed changes
Copilot reviewed 8 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go | Core webhook implementation that intercepts scale requests and synchronizes replica counts |
| pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator_test.go | Comprehensive test suite covering happy path, error cases, and edge scenarios |
| pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/Scale.md | Documentation describing webhook behavior and synchronization flow |
| pkg/server/handlers.go | Registration of the new validator in the webhook handlers |
| pkg/clients/clients.go | Addition of CAPI controllers to the client factory |
| pkg/admission/admission.go | Enhanced SubPath function to handle subresources with / characters |
| pkg/codegen/main.go | Configuration to generate controllers and objects for CAPI and autoscaling resources |
| pkg/generated/controllers/cluster.x-k8s.io/* | Generated controller code for CAPI MachineDeployment and Cluster resources |
| pkg/generated/objects/cluster.x-k8s.io/v1beta1/objects.go | Generated helper functions for extracting CAPI objects from admission requests |
| pkg/generated/objects/autoscaling/v1/objects.go | Generated helper functions for extracting Scale objects from admission requests |
| docs.md | Documentation for the new webhook validation behavior |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 15 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator_test.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator_test.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator_test.go
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
0dc475b to
1919dfa
Compare
try removing instantiation to get tests to run to figure out where failure starts
1919dfa to
9e6e962
Compare
770a2c2 to
c9c4d2f
Compare
c9c4d2f to
10de80b
Compare
crobby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what's here is good. It might also be desirable to add an integration test. They're usually pretty straightforward to write. tests/integration has some existing ones that you could use for a pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/Scale.md
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
417a0f4 to
0ba4e2e
Compare
0ba4e2e to
cbfc660
Compare
HarrisonWAffel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Though I'm not a big fan of having a validating webhook with a side effect like this, I can see how it prevents machine thrashing. Is there anywhere we can document this behavior outside of the webhook to help future debugging? Maybe in autoscaler.go in r/r?
|
|
||
| if clusterName == "" { | ||
| logrus.Debugf("MachineDeployment %s/%s has no CAPI cluster name label", md.Namespace, md.Name) | ||
| return nil, apierrors.NewNotFound(schema.GroupResource{Group: "cluster.x-k8s.io", Resource: "clusters"}, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat pedantic but you can just use this: https://github.com/kubernetes-sigs/cluster-api/blob/main/api/core/v1beta1/groupversion_info.go#L27
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm is there something I'm missing, the GroupVersion doesn't really translate to the GroupResource there. Unless you were moreso implying i could yoink the harcoded strings - which I can definitely do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah my bad, probably more relevant for line 61 then.
| } | ||
|
|
||
| logrus.Debugf("Getting CAPI cluster %s/%s", md.Namespace, clusterName) | ||
| capiClusterObj, err := v.dynamic.Get(capiClusterGVK, md.Namespace, clusterName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgive my ignorance, but is this a standard practice for the webhook? the whole (could be an object, could be unstructured) seems wrong to me, but I also get the impression it's being done elsewhere (and with purpose).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand this is due to the fact that the CAPI CRDs are not registered in the webhook itself - when I initially started this I generated the controllers etc which made it so the webhook failed to start until the CAPI CRDs were available which is...not the best. And there is no deferred start for controllers here in the webhook.
For now - I added a test to validate that both ways work - it seems right now most of the time it returns unstructured data in my manual testing.
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
|
|
||
| // Label constants for MachineDeployment labels | ||
| const ( | ||
| machinePoolNameLabel = "rke.cattle.io/rke-machine-pool-name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a remark, but we really ought to move our labels/annotations to pkg/apis so they can be consumed across projects :P
pkg/resources/cluster.x-k8s.io/v1beta1/machinedeployment/validator.go
Outdated
Show resolved
Hide resolved
24ec5ea to
f0ef05c
Compare
@HarrisonWAffel Yep - I agree. I'll update my PR over on r/r with a link to where this will end up. |
@crobby so I added a few integration tests (not too bad really!) and CI still doesn't work due to the aforementioned missing CAPI CRDs. I left them in just skipped for now - maybe after we get that deferred start functionality in the webhook we could enable them. |
| cluster = cluster.DeepCopy() | ||
| for i := range cluster.Spec.RKEConfig.MachinePools { | ||
| pool := &cluster.Spec.RKEConfig.MachinePools[i] | ||
| if pool.Name != machinePoolName { | ||
| continue | ||
| } | ||
|
|
||
| // If quantity is nil and targetReplicas is zero, or quantity is non-nil and already | ||
| // equals targetReplicas, no update is needed. | ||
| if pool.Quantity == nil && targetReplicas == 0 { | ||
| return cluster, false | ||
| } else if *pool.Quantity == targetReplicas { | ||
| return cluster, false | ||
| } | ||
|
|
||
| logrus.Debugf("Updating cluster %s/%s machine pool %s quantity from %d to %d", cluster.Namespace, cluster.Name, machinePoolName, *pool.Quantity, targetReplicas) | ||
| if pool.Quantity == nil { | ||
| pool.Quantity = new(int32) | ||
| } | ||
| *pool.Quantity = targetReplicas | ||
| return cluster, true | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it didn't occur to me immediately, but we could've moved the deep copy inwards to line 198 because it would only copy in the event it was definitely going to be mutated, and the cluster only modifies the desired pool. Where it's at is fine now though honestly.
jakefhyde
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, though we probably still want someone from frameworks to approve.
Issue: rancher/rancher#52665
Summary
This PR implements a validating webhook with a side effect to handle scale-down operations for Rancher-managed machine deployments, eliminating a race condition that can occur when a
scaleoperation occurs on a CAPI MachineDeployment object.Problem
During scale-down operations, a race condition can occur when the controller that synchronizes the CAPI MachineDeployment object's
replicasfield back to the provisioning cluster's matching rkeMachinePool happens too late. This causes the following problematic behavior:Solution
tl;dr: moving the logic removed here to the webhook, it just looks a little different due to validating webhook-isms + not being able to use the CAPI controllers due to the lack of a deferred start mechanism in this codebase.
The PR moves the synchronization logic that updates the Rancher Provisioning cluster object from a controller-based approach to a validating webhook with a side effect. This approach:
By processing the scale request through a validating webhook with a side effect, we guarantee that the Rancher object matches the CAPI object before the MachineDeployment even hits etcd. This completely prevents the race condition from occurring.
Notes:
machinedeployments/scalein this caseCheckList