Can't delete PVC, finalizer pvc-as-source-protection does not finish #2670

Open · kaitimmer opened this issue on Nov 25, 2024 · 6 comments
Labels: lifecycle/stale (denotes an issue or PR has remained open with no activity and has become stale)

@kaitimmer
What happened:
When deleting a PVC, the deletion process is "stuck".

The finalizer: snapshot.storage.kubernetes.io/pvc-as-source-protection does not finish.

If I patch the PVC while it is in the `Terminating` state and remove the finalizer, everything works as expected.
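
Roughly the workaround we use today (PVC name and namespace are placeholders):

```bash
# Workaround only: force-remove the stuck finalizer from the Terminating PVC
kubectl patch pvc pvc-something-0 -n default --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```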

I've seen this behavior randomly in multiple clusters, but in the current one it has been persisting for a couple of weeks already.

What you expected to happen:
The finalizer finishes and I can delete the PVC without having to patch it first.

How to reproduce it:

kubectl delete pvc pvc-something-0

It does not matter which StorageClass or SKU is behind the PVC. When the issue occurs in a cluster, it affects all PVCs in that cluster.

Anything else we need to know?:

When this error exists, I cannot get a VolumeSnapshot into the "ReadyToUse" state. It looks like everything that interacts with snapshots is broken in this cluster.
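
For example, a snapshot never becomes ready (the snapshot name is a placeholder):

```bash
# Should eventually print "true"; in the broken cluster it stays empty/false
kubectl get volumesnapshot my-snapshot -o jsonpath='{.status.readyToUse}'
# Events on the snapshot usually show where it is stuck
kubectl describe volumesnapshot my-snapshot
```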

Environment:

  • CSI Driver version: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.30.4
  • Kubernetes version (use kubectl version): Client v1.31.2, Kustomize v5.4.2, Server v1.30.3
@andyzhangx (Member)

That snapshot creation is stuck there; I could help with troubleshooting if you could provide the AKS cluster FQDN.

@kaitimmer (Author)

@andyzhangx thank you for offering help. We figured out that this was caused by having a lot of VolumeSnapshots and VolumeSnapshotContents in this cluster (some cleanup did not work as expected). Once we cleaned everything up, it started working again.

However, seeing this got me thinking:

How many VolumeSnapshots and VolumeSnapshotContents can the CSI driver safely handle before we reach this problem again? Do you have any numbers there?

@monotek (Member) commented Nov 28, 2024

The downside now is that, after removing the finalizer, we have to delete the actual Azure snapshots manually from the Azure portal, because the CSI driver did not do it.

"az delete snapshot" seems to be rather slow for this (even with using --now-wait=true), needing about 5 seconds for every snapshot delete command.

I'll check again whether we can delete all snapshots at once, but my first try failed because we have ~20000 snapshots to delete and bash complained about too many arguments :D
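
For anyone hitting the same shell limit, this is roughly what we're trying next (the resource group name is a placeholder; batching through xargs keeps the argument list small):

```bash
# Delete all snapshots in the resource group, 50 IDs per az invocation
az snapshot list -g my-node-resource-group --query "[].id" -o tsv \
  | xargs -n 50 az snapshot delete --no-wait --ids
```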

@andyzhangx (Member)

> @andyzhangx thank you for offering help. We figured out that this was caused by having a lot of VolumeSnapshots and VolumeSnapshotContents in this cluster (some cleanup did not work as expected). Once we cleaned everything up, it started working again.
>
> However, seeing this got me thinking:
>
> How many VolumeSnapshots and VolumeSnapshotContents can the CSI driver safely handle before we reach this problem again? Do you have any numbers there?

@kaitimmer as long as the snapshot container is working fine, that's ok. Recently we found that the memory limit of the snapshot container is too small when there are lots of snapshots, and the container eventually gets OOM-killed. So the question is really about the number of snapshots versus the memory limit of the snapshot container, and how fast the CSI driver can process snapshots to avoid VolumeSnapshotContents accumulating. Just let me know when your cluster is stuck on creating snapshots and I can increase the memory limit immediately. Later on, we will increase the default memory limit, since the Azure service is in CCOA now.
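
A quick way to check whether the snapshot container is hitting its memory limit (the namespace and label selector below are assumptions for a typical AKS/azuredisk-csi setup; adjust for your cluster):

```bash
# Look for restarts and OOMKilled terminations on the controller pods
kubectl get pods -n kube-system -l app=csi-azuredisk-controller
kubectl describe pods -n kube-system -l app=csi-azuredisk-controller \
  | grep -iE -A3 "last state|oomkilled"
```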

@kaitimmer (Author)

Hi @andyzhangx,

One of our clusters is again in the state where the finalizer does not finish. I will send you the ID and URI via email.

Since we cleaned up all the VolumeSnapshots, the number of snapshots is not the problem this time. I assume we are back in the state in which the problem originally started.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 27, 2025