@jferrazbr
Contributor

Issue: rancher/rancher#52574

Problem

When restoring from an ETCD snapshot, the webhook did not validate the snapshot metadata before accepting spec.rkeConfig.etcdSnapshotRestore.
It was possible to request "kubernetesVersion" or "all" for restoreRKEConfig even when the referenced snapshot had missing or invalid metadata.
This led to restore requests that passed admission but failed later in the restore flow with parse errors.

Solution

This PR adds a validator for spec.rkeConfig.etcdSnapshotRestore on provisioning.cattle.io/v1 Cluster objects and wires the RKE client into the webhook Clients struct.

The validator:

  • Only runs when etcdSnapshotRestore changes from empty to a new non-empty value, so it does not block unrelated cluster updates.
  • Verifies that the snapshot named in etcdSnapshotRestore.name exists in the same namespace.
  • Ensures etcdSnapshotRestore.restoreRKEConfig is one of "none", "kubernetesVersion", or "all".
  • Parses the snapshot metadata and, for "kubernetesVersion", requires a kubernetesVersion, and for "all", requires both kubernetesVersion and rkeConfig.
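
The rules in the list above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code from this PR: restoreRequest, snapshotMetadata, shouldValidate, and checkMetadata are hypothetical names, and the snapshot lookup is left out because it needs a cluster client.

```go
package main

import (
	"errors"
	"fmt"
)

// restoreRequest is a hypothetical stand-in for spec.rkeConfig.etcdSnapshotRestore.
type restoreRequest struct {
	Name             string
	RestoreRKEConfig string
}

// snapshotMetadata is a hypothetical view of the already-parsed snapshot metadata.
type snapshotMetadata struct {
	KubernetesVersion string
	HasRKEConfig      bool
}

// isEmpty treats both a nil pointer and a zero-value struct as "not set".
func isEmpty(r *restoreRequest) bool {
	return r == nil || *r == (restoreRequest{})
}

// shouldValidate reports whether the restore field went from empty to non-empty,
// which is the only transition the validator is described as checking.
func shouldValidate(oldR, newR *restoreRequest) bool {
	return isEmpty(oldR) && !isEmpty(newR)
}

// checkMetadata applies the per-mode requirements on the snapshot metadata.
func checkMetadata(mode string, md snapshotMetadata) error {
	switch mode {
	case "none":
		return nil
	case "kubernetesVersion":
		if md.KubernetesVersion == "" {
			return errors.New("snapshot metadata has no kubernetesVersion")
		}
		return nil
	case "all":
		if md.KubernetesVersion == "" || !md.HasRKEConfig {
			return errors.New("snapshot metadata must contain kubernetesVersion and rkeConfig")
		}
		return nil
	default:
		return fmt.Errorf("unsupported restoreRKEConfig %q", mode)
	}
}

func main() {
	// Metadata that carries a version but no rkeConfig (illustrative values).
	md := snapshotMetadata{KubernetesVersion: "v1.33.1+rke2r1"}
	fmt.Println(checkMetadata("kubernetesVersion", md))
	fmt.Println(checkMetadata("all", md))
}
```

In this sketch, a request for "all" against metadata without an rkeConfig is rejected up front, which is the class of failure the PR description says used to surface only later in the restore flow.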

In addition:

  • The Cluster validator handler registration in pkg/server/handlers.go was moved to a management-cluster-only list so that validation runs only on the local/management cluster, where the snapshot resources exist. This avoids failures on downstream clusters, which do not have those resources.

Docs are updated to describe the new validation behavior, and unit tests cover the main success and failure paths.
This partially addresses the linked issue by validating snapshot metadata before restore. The annotation-based mode filtering will be handled in a follow-up change.

Checklist

  • Test
  • Docs

@jferrazbr jferrazbr requested a review from a team as a code owner November 24, 2025 21:55
Member

@jiaqiluo jiaqiluo left a comment

LGTM with some nits.

@jferrazbr jferrazbr force-pushed the add-snap-restore-validator branch from ba25be2 to 88b2c50 Compare November 25, 2025 19:36
@jiaqiluo jiaqiluo requested a review from a team November 25, 2025 20:53
Contributor

@jakefhyde jakefhyde left a comment

1 nit


// parseSnapshotClusterSpec decodes snapshot.SnapshotFile.Metadata into a v1.ClusterSpec.
// The metadata is stored as a nested, gzipped, base64-encoded structure.
func parseSnapshotClusterSpec(snap *rkev1.ETCDSnapshot) (*v1.ClusterSpec, error) {
Contributor

We copy this from Rancher, correct? I wonder if we could move this to github.com/rancher/rancher/pkg/apis/provisioning.cattle.io/v1 somewhere.

Contributor Author

done 👍
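
For readers unfamiliar with the nested, gzipped, base64-encoded metadata mentioned in the parseSnapshotClusterSpec comment above, here is a rough round-trip sketch. The exact layout is an assumption made for illustration: an outer base64-encoded JSON map whose "provisioning-cluster-spec" entry holds a base64-encoded, gzipped JSON spec. clusterSpec, encodeSpec, and decodeSpec are hypothetical names, not the helpers from Rancher.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
)

// clusterSpec is a hypothetical, trimmed-down stand-in for v1.ClusterSpec.
type clusterSpec struct {
	KubernetesVersion string `json:"kubernetesVersion"`
}

// specKey is the metadata key named in the validator's doc comment.
const specKey = "provisioning-cluster-spec"

// encodeSpec builds metadata in the assumed layout; it exists only to make the
// example round-trippable.
func encodeSpec(spec clusterSpec) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	inner := base64.StdEncoding.EncodeToString(buf.Bytes())
	outer, err := json.Marshal(map[string]string{specKey: inner})
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(outer), nil
}

// decodeSpec reverses the assumed layout: base64 -> JSON map -> base64 -> gzip -> JSON.
func decodeSpec(metadata string) (*clusterSpec, error) {
	outer, err := base64.StdEncoding.DecodeString(metadata)
	if err != nil {
		return nil, fmt.Errorf("decode outer base64: %w", err)
	}
	var entries map[string]string
	if err := json.Unmarshal(outer, &entries); err != nil {
		return nil, fmt.Errorf("unmarshal metadata map: %w", err)
	}
	inner, ok := entries[specKey]
	if !ok {
		return nil, fmt.Errorf("metadata has no %q entry", specKey)
	}
	zipped, err := base64.StdEncoding.DecodeString(inner)
	if err != nil {
		return nil, fmt.Errorf("decode inner base64: %w", err)
	}
	zr, err := gzip.NewReader(bytes.NewReader(zipped))
	if err != nil {
		return nil, fmt.Errorf("open gzip stream: %w", err)
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, fmt.Errorf("read gzip stream: %w", err)
	}
	var spec clusterSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		return nil, fmt.Errorf("unmarshal cluster spec: %w", err)
	}
	return &spec, nil
}

func main() {
	md, err := encodeSpec(clusterSpec{KubernetesVersion: "v1.33.1+rke2r1"})
	if err != nil {
		panic(err)
	}
	spec, err := decodeSpec(md)
	fmt.Println(spec, err)
}
```

Each decoding step wraps its error, so a malformed snapshot can be reported at admission time with a hint about which layer failed.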

@jferrazbr jferrazbr force-pushed the add-snap-restore-validator branch 3 times, most recently from 045569e to 3863e06 Compare December 22, 2025 21:29
Contributor

@jakefhyde jakefhyde left a comment

nit

Comment on lines 814 to 820
    if err != nil {
        if apierrors.IsNotFound(err) {
            return admission.ResponseBadRequest(
                fmt.Sprintf("etcd restore references missing snapshot %s in namespace %s", newRestore.Name, newCluster.Namespace)), nil
        }
        return nil, fmt.Errorf("failed to get etcd snapshot %s/%s: %w", newCluster.Namespace, newRestore.Name, err)
    }
Contributor

Suggested change
-   if err != nil {
-       if apierrors.IsNotFound(err) {
-           return admission.ResponseBadRequest(
-               fmt.Sprintf("etcd restore references missing snapshot %s in namespace %s", newRestore.Name, newCluster.Namespace)), nil
-       }
-       return nil, fmt.Errorf("failed to get etcd snapshot %s/%s: %w", newCluster.Namespace, newRestore.Name, err)
-   }
+   if apierrors.IsNotFound(err) {
+       return admission.ResponseBadRequest(
+           fmt.Sprintf("etcd restore references missing snapshot %s in namespace %s", newRestore.Name, newCluster.Namespace)), nil
+   } else if err != nil {
+       return nil, fmt.Errorf("failed to get etcd snapshot %s/%s: %w", newCluster.Namespace, newRestore.Name, err)
+   }

apierrors.IsNotFound does a nil check, so the outer err != nil check is redundant.

Contributor Author

done 👍

@jferrazbr jferrazbr force-pushed the add-snap-restore-validator branch 18 times, most recently from 63f3b7a to 6782d52 Compare December 29, 2025 15:10
@jferrazbr jferrazbr force-pushed the add-snap-restore-validator branch 7 times, most recently from 9a9828d to d596ab9 Compare January 2, 2026 17:38
// validateETCDSnapshotRestore ensures that any requested ETCD restore
// (a) references an existing ETCDSnapshot, and
// (b) contains decodable metadata with a valid "provisioning-cluster-spec".
func (p *provisioningAdmitter) validateETCDSnapshotRestore(request *admission.Request, oldCluster, newCluster *v1.Cluster) (*admissionv1.AdmissionResponse, error) {
Contributor

If we really wanted to get pedantic, we could validate that it isn't set on create, but I say we save that for a later date.

@jferrazbr jferrazbr requested a review from jiaqiluo January 5, 2026 14:20
Collaborator

@crobby crobby left a comment

Validator changes seem good. Not sure about the golang/lint bumps as part of this PR though. Should they be separate?

# Build Stage
# ===============
- FROM --platform=$BUILDPLATFORM registry.suse.com/bci/golang:1.24 AS build
+ FROM --platform=$BUILDPLATFORM registry.suse.com/bci/golang:1.25 AS build
Collaborator

Not necessarily a bad change, but this change doesn't appear to be related to adding a new validator. Should it be a separate issue/PR?

# ===============
FROM build AS validate
- RUN curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b /usr/local/bin v1.64.8
+ RUN curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b /usr/local/bin v2.7.1
Collaborator

Not necessarily a bad change, but this change doesn't appear to be related to adding a new validator. Should it be a separate issue/PR?

module github.com/rancher/webhook

- go 1.24.0
+ go 1.25.0
Collaborator

Not necessarily a bad change, but this change doesn't appear to be related to adding a new validator. Should it be a separate issue/PR?

Contributor Author

Sorry, I forgot to mention this in the PR description.

The Go version bump is here because rancher/rancher main is already on this Go version, and I needed to validate this webhook change together with my rancher/rancher updates to the snapshotbackpopulate controller. With the older Go version in this repo, I was running into toolchain mismatches when testing the full flow.

After bumping Go, the current golangci-lint version started failing to run with the new Go toolchain, so I updated it as well to keep CI working.

To keep scope clean, I can open a dedicated PR like "Bump Go toolchain + golangci-lint", merge that first, and then rebase this validator PR on top of it (so this PR stays focused on the validator logic).

Let me know what you think 🙇

Collaborator

@jferrazbr I agree that it would be cleaner to do the Go toolchain + lint bump in a separate PR (similar to how the other repos have been trickling in Go 1.25).
Thanks

Contributor Author

Done 👍

@jferrazbr jferrazbr mentioned this pull request Jan 6, 2026
2 tasks
@jferrazbr jferrazbr force-pushed the add-snap-restore-validator branch 4 times, most recently from f8fe15f to 7b88f4b Compare January 7, 2026 17:09
@jferrazbr jferrazbr requested a review from crobby January 7, 2026 17:34
@jferrazbr jferrazbr force-pushed the add-snap-restore-validator branch from 7b88f4b to 7634303 Compare January 8, 2026 18:50
@jferrazbr jferrazbr merged commit 9638662 into rancher:main Jan 9, 2026
2 checks passed