Can't delete PVC, finalizer pvc-as-source-protection does not finish #2670

Open · kaitimmer opened this issue on Nov 25, 2024 · 6 comments
Labels: lifecycle/stale (denotes an issue or PR has remained open with no activity and has become stale)

@kaitimmer
What happened:
When deleting a PVC, the deletion process is "stuck".

The finalizer: snapshot.storage.kubernetes.io/pvc-as-source-protection does not finish.

If I patch the PVC while it is in the `Terminating` state and remove the finalizer, everything works as expected.
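
Roughly the workaround we use today (PVC name and namespace are placeholders):

```bash
# Workaround only: force-remove the stuck finalizer from the Terminating PVC
kubectl patch pvc pvc-something-0 -n default --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```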

I've seen this behavior randomly in multiple clusters, but in the current one it has been persisting for a couple of weeks already.

What you expected to happen:
The finalizer finishes and I can delete the PVC without having to patch it first.

How to reproduce it:

kubectl delete pvc pvc-something-0

It does not matter which StorageClass or SKU is behind the PVC. When the issue occurs in a cluster, it affects all PVCs in that cluster.

Anything else we need to know?:

When this error exists, I cannot get a VolumeSnapshot into the "ReadyToUse" state. It looks like everything that interacts with snapshots is broken in this cluster.
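
For example, a snapshot never becomes ready (the snapshot name is a placeholder):

```bash
# Should eventually print "true"; in the broken cluster it stays empty/false
kubectl get volumesnapshot my-snapshot -o jsonpath='{.status.readyToUse}'
# Events on the snapshot usually show where it is stuck
kubectl describe volumesnapshot my-snapshot
```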

Environment:

  • CSI Driver version: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.30.4
  • Kubernetes version (use kubectl version): Client v1.31.2, Kustomize v5.4.2, Server v1.30.3
@andyzhangx (Member)

That snapshot creation is stuck there; I could help with troubleshooting if you could provide the AKS cluster FQDN.

@kaitimmer (Author)

@andyzhangx thank you for offering help. We figured out that this was caused by having a lot of VolumeSnapshots and VolumeSnapshotContents in this cluster (some cleanup did not work as expected). Once we cleaned everything up, it started working again.

However, seeing this got me thinking:

How many VolumeSnapshots and VolumeSnapshotContents can the CSI driver safely handle before we reach this problem again? Do you have any numbers there?

@monotek (Member) commented Nov 28, 2024

The downside now is that, after removing the finalizer, we have to delete the actual Azure snapshots manually from the Azure portal, because the CSI driver did not do it.

"az delete snapshot" seems to be rather slow for this (even with using --now-wait=true), needing about 5 seconds for every snapshot delete command.

I'll check again whether we can delete all snapshots at once, but my first try failed because we have ~20000 snapshots to delete and bash complained about too many arguments :D
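
For anyone hitting the same shell limit, this is roughly what we're trying next (the resource group name is a placeholder; batching through xargs keeps the argument list small):

```bash
# Delete all snapshots in the resource group, 50 IDs per az invocation
az snapshot list -g my-node-resource-group --query "[].id" -o tsv \
  | xargs -n 50 az snapshot delete --no-wait --ids
```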

@andyzhangx (Member)

> @andyzhangx thank you for offering help. We figured out that this was caused by having a lot of VolumeSnapshots and VolumeSnapshotContents in this cluster (some cleanup did not work as expected). Once we cleaned everything up, it started working again.
>
> However, seeing this got me thinking:
>
> How many VolumeSnapshots and VolumeSnapshotContents can the CSI driver safely handle before we reach this problem again? Do you have any numbers there?

@kaitimmer as long as the snapshot container is working fine, that's ok. Recently we found that the memory limit of the snapshot container is too small when there are lots of snapshots, and the container eventually gets OOM-killed. So the question is really about the number of snapshots versus the memory limit of the snapshot container, and how fast the CSI driver can process snapshots to avoid VolumeSnapshotContents accumulating. Just let me know when your cluster is stuck on creating snapshots and I can increase the memory limit immediately. Later on, we will increase the default memory limit, since the Azure service is in CCOA now.
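
A quick way to check whether the snapshot container is hitting its memory limit (the namespace and label selector below are assumptions for a typical AKS/azuredisk-csi setup; adjust for your cluster):

```bash
# Look for restarts and OOMKilled terminations on the controller pods
kubectl get pods -n kube-system -l app=csi-azuredisk-controller
kubectl describe pods -n kube-system -l app=csi-azuredisk-controller \
  | grep -iE -A3 "last state|oomkilled"
```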

@kaitimmer (Author)

Hi @andyzhangx,

One of our clusters is again in the state where the finalizer does not finish. I will send you the ID and URI via email.

Since we cleaned up all the VolumeSnapshots, the number of snapshots is not the problem this time. I assume we are back in the state in which the problem originally started.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 27, 2025