Add cleanup-on-delete ArgoCD PreDelete hook for RHOAI 2.x and 3.x#181

Open
lautou wants to merge 8 commits into
redhat-ai-services:mainfrom
lautou:feat/cleanup-on-delete-hook

lautou commented May 10, 2026

Contributor

Summary

  • Adds a cleanup-on-delete Kustomize Component to both instance-2.x/components/ and instance-3.x/components/ that implements an ArgoCD PreDelete hook running the official Red Hat RHOAI uninstall procedure
  • Enables the component by default in all instance overlays (11 × 2.x, 1 × 3.x) — cleanup is not optional since stuck namespaces and orphaned webhooks affect every user who deletes the RHOAI Application
  • Zero runtime footprint: all resources carry the PreDelete hook annotation and are only created/executed at Application deletion time

Problem

Without this component, ArgoCD deletes RHOAI manifests in arbitrary order, which consistently results in:

  • Namespaces (redhat-ods-operator, redhat-ods-applications, redhat-ods-monitoring) stuck in Terminating
  • Orphaned ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources that block future reinstalls
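
One way to confirm the first symptom is to check the phase of each namespace listed above. The following is a minimal sketch, assuming `oc` is already logged in to the cluster; the function name `list_terminating_namespaces` is illustrative and not part of this PR.

```shell
#!/bin/sh
# Namespaces named in the PR description.
RHOAI_NAMESPACES="redhat-ods-operator redhat-ods-applications redhat-ods-monitoring"

# list_terminating_namespaces (hypothetical helper) prints every RHOAI
# namespace whose status.phase is Terminating. Missing namespaces print nothing.
list_terminating_namespaces() {
  for ns in $RHOAI_NAMESPACES; do
    phase=$(oc get namespace "$ns" --ignore-not-found \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Terminating" ]; then
      echo "$ns"
    fi
  done
}
```

Calling `list_terminating_namespaces` after deleting the Application would print one namespace per line if any are stuck, and nothing otherwise.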

How it works

A PreDelete Job runs in openshift-gitops (not in the RHOAI namespaces, so it survives their deletion) and executes a 10-step cleanup:

  1. Delete DataScienceCluster CR (triggers operator cleanup)
  2. Wait 60 s for operator cleanup to complete
  3. Delete operator Subscription
  4. Delete operator ClusterServiceVersion
  5. Wait for operator pods to terminate
  6. Delete OperatorGroup
  7. Delete all RHOAI ValidatingWebhookConfiguration / MutatingWebhookConfiguration resources
  8. Delete RHOAI namespaces (redhat-ods-operator, redhat-ods-applications, redhat-ods-monitoring)
  9. Delete RHOAI CRDs (opendatahub.io, kubeflow.org) with finalizer fallback
  10. Verify cleanup and report status

The Job exits 0 even when warnings occur, so ArgoCD always proceeds with Application deletion. Because backoffLimit is 0, the Job is never retried; any remaining resources are cleaned up by ArgoCD's normal deletion.
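
The warn-but-succeed behavior described above could look roughly like the following. This is a sketch under assumptions, not the script from this PR: `run_step`, `report`, and `WARNINGS` are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only: run each cleanup step, count failures as warnings,
# and never let a failed step abort the script.
WARNINGS=0

# run_step DESCRIPTION COMMAND [ARGS...]
# Runs the command; on failure, records a warning instead of exiting.
run_step() {
  desc=$1
  shift
  echo "==> ${desc}"
  if ! "$@"; then
    WARNINGS=$((WARNINGS + 1))
    echo "WARNING: step failed: ${desc}" >&2
  fi
}

# report prints a summary and always returns 0, so a caller that ends
# with "report; exit 0" finishes successfully even when warnings occurred.
report() {
  echo "cleanup finished with ${WARNINGS} warning(s)"
  return 0
}
```

A caller would wrap each step, for example `run_step "Delete DataScienceCluster" oc delete datasciencecluster --all --ignore-not-found`, and end with `report; exit 0`. The trade-off of this design is that a failed cleanup is visible only in the Job log, not in the Job status.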

Resources created

All are PreDelete hooks with hook-delete-policy: HookSucceeded,HookFailed:

| Resource | Namespace | sync-wave |
| --- | --- | --- |
| ServiceAccount/rhoai-cleanup | openshift-gitops | -1 |
| ClusterRole/rhoai-cleanup | cluster-scoped | -1 |
| ClusterRoleBinding/rhoai-cleanup | cluster-scoped | -1 |
| Job/delete-rhoai-resources | openshift-gitops | 0 |
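
For illustration, the ServiceAccount row of the table above might carry its hook metadata as sketched here. The annotation keys are the standard Argo CD hook annotations; the helper name `print_cleanup_serviceaccount` is hypothetical, and the manifest is a guess at the shape rather than a copy of the file in this PR.

```shell
#!/bin/sh
# print_cleanup_serviceaccount (hypothetical helper) writes a ServiceAccount
# manifest with the hook type, delete policy, and sync-wave from the table.
print_cleanup_serviceaccount() {
  cat <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rhoai-cleanup
  namespace: openshift-gitops
  annotations:
    argocd.argoproj.io/hook: PreDelete
    argocd.argoproj.io/hook-delete-policy: HookSucceeded,HookFailed
    argocd.argoproj.io/sync-wave: "-1"
EOF
}
```

Piping the output through `oc apply --dry-run=client -f -` is one way to check that the manifest parses without creating anything on the cluster.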

Test plan

  • Installed RHOAI 3.3.1 via fast-3.x channel on OCP 4.20 (fresh cluster)
  • Verified all 3 RHOAI namespaces active and DataScienceCluster Ready
  • Applied cleanup-on-delete resources simulating ArgoCD PreDelete hook
  • Job completed in ~3m11s with exit 0 — all namespaces, CRDs, and webhooks removed
  • dscinitializations CRD finalizer fallback triggered and handled correctly
  • All 260 kustomization.yaml files pass format validation (check-kustomization-format.sh)
  • All manifests build successfully (validate_manifests.sh)

🤖 Generated with Claude Code

lautou requested a review from a team as a code owner May 10, 2026 00:16

Implements an ArgoCD PreDelete hook that runs the official Red Hat
RHOAI uninstall procedure automatically before ArgoCD removes
Application resources. Without this, ArgoCD deletions leave namespaces
stuck in Terminating and orphaned webhooks that block future reinstalls.

The component is enabled by default in all instance overlays since the
cleanup problem affects every user who deletes the RHOAI Application.

Resources created (all PreDelete hooks, zero runtime footprint):
- ServiceAccount/rhoai-cleanup in openshift-gitops
- ClusterRole/rhoai-cleanup
- ClusterRoleBinding/rhoai-cleanup
- Job/delete-rhoai-resources in openshift-gitops

The Job runs in openshift-gitops to survive while RHOAI namespaces
are being deleted, and executes a 10-step cleanup procedure per Red Hat
documentation: DSC deletion, operator Subscription/CSV removal, pod
termination wait, OperatorGroup removal, webhook cleanup, namespace
deletion, and CRD removal with finalizer fallback.

Validated end-to-end on OCP 4.20 with RHOAI 3.3.1 (fast-3.x channel).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lautou force-pushed the feat/cleanup-on-delete-hook branch from 48eeda6 to 9fc4cc1 May 10, 2026 00:22
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lautou commented May 11, 2026

Contributor Author

@strangiato can you review and merge?

strangiato left a comment

Member

To be honest the approach here scares me.

There are a lot of very unsafe practices that I would advise against.

I would highly recommend simplifying the uninstall to the basic procedure covered by the official docs here:
https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.4/html-single/installing_and_uninstalling_openshift_ai_self-managed/index#uninstalling-openshift-ai-self-managed-using-cli_uninstalling-openshift-ai-self-managed

The script appears to apply a lot of brute force that should not be necessary, and I am concerned it was added to compensate for other unsafe practices.

In case you are curious, the following is a set of instructions I put together to do a complete manual uninstall of 2.25 for a customer:
https://github.com/redhat-ai-services/ai-accelerator/blob/main/documentation/rhoai-2.25-uninstall.md

lautou commented May 11, 2026

Contributor Author

Thanks for the detailed feedback and the uninstall doc — very helpful. You're right, the brute force (forced deletion, finalizer patching) was a symptom of deleting things in the wrong order. I'll rework the script to follow the official procedure: delete user workload CRs first, then DSC, then subscription/CSV, then explicit CRD list by name. No grep, no force. Will push an update shortly.

- Move bash script to separate delete-rhoai-resources.sh file mounted
  via ConfigMap, following the aws-gpu-machineset component pattern
- Delete user workload CRs first (InferenceServices, Notebooks, Ray,
  Kueue, TrustyAI, etc.) to prevent finalizer deadlocks — root cause
  of the brute-force approach in the previous version
- Replace dynamic grep-based CRD deletion with explicit CRD list per
  Red Hat documentation (rhoai-2.25-uninstall.md)
- Remove force deletion, finalizer patching, arbitrary sleep, and
  pod wait — all unnecessary when deletion order is correct
- Fix namespace deletion: use oc delete namespace instead of oc delete
  project (ClusterRole grants namespaces, not projects resource)
- Add missing CRDs: nemoguardrails.trustyai.opendatahub.io and
  trainers.components.platform.opendatahub.io
- Expand ClusterRole to cover all workload CR API groups

Validated end-to-end on OCP 4.20 with RHOAI 3.3.1 (fast-3.x):
all namespaces, CRDs, and webhooks removed cleanly in 55s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
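
The explicit-list approach from the commit above can be sketched as follows. This is an assumption-laden illustration, not the contents of delete-rhoai-resources.sh: only the two CRD names mentioned in the commit message are listed (the real list would be longer), and the function names are hypothetical.

```shell
#!/bin/sh
# Two CRD names taken from the commit message; a real list would be longer.
RHOAI_CRDS="nemoguardrails.trustyai.opendatahub.io trainers.components.platform.opendatahub.io"
# Namespaces named in the PR description.
RHOAI_NAMESPACES="redhat-ods-operator redhat-ods-applications redhat-ods-monitoring"

# delete_crds_by_name removes each CRD by its exact name: no pattern
# matching, no --force, no finalizer patching.
delete_crds_by_name() {
  for crd in $RHOAI_CRDS; do
    oc delete crd "$crd" --ignore-not-found
  done
}

# delete_rhoai_namespaces uses "namespace" rather than "project" because
# the ClusterRole grants the namespaces resource, not projects.
delete_rhoai_namespaces() {
  for ns in $RHOAI_NAMESPACES; do
    oc delete namespace "$ns" --ignore-not-found
  done
}
```

Keeping the names in plain variables makes a missing entry easy to spot in review, at the cost of updating the list by hand when a new operator version adds CRDs.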

lautou commented May 12, 2026

Contributor Author

Thanks again for the detailed review — every point was valid and I've reworked the entire script based on your feedback.

What changed:

  • Script structure: moved to a separate delete-rhoai-resources.sh file mounted via configMapGenerator, following the aws-gpu-machineset pattern you referenced
  • Deletion order fixed: added Step 0 that deletes all user workload CRs across all namespaces first (InferenceServices, Notebooks, RayJobs, Kueue workloads, TrustyAI, etc.) before touching the DSC — this is the root cause of why the original version needed force deletion and finalizer patching
  • Explicit CRD list: replaced the grep-based approach with an explicit list taken directly from your rhoai-2.25-uninstall.md — thank you for that reference, it was exactly what was needed
  • No more force deletion: removed --grace-period=0 --force
  • No more finalizer patching: removed completely — as you said, if you need it, you've done something else wrong
  • No more arbitrary sleep: removed the 60s grace period
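
The Step 0 idea in the list above, deleting user workload CRs in every namespace before touching the DSC, might be sketched like this. The kinds shown are a hypothetical subset based on the names mentioned in this comment, and `delete_user_workloads` is an illustrative name, not the function in the PR.

```shell
#!/bin/sh
# Hypothetical subset of workload kinds, given as CRD names.
WORKLOAD_KINDS="inferenceservices.serving.kserve.io notebooks.kubeflow.org rayjobs.ray.io"

# delete_user_workloads deletes every instance of each kind across all
# namespaces and waits for finalizers to complete (no --force).
delete_user_workloads() {
  for kind in $WORKLOAD_KINDS; do
    # Skip kinds whose CRD is not installed on this cluster.
    if ! oc get crd "$kind" >/dev/null 2>&1; then
      echo "skip ${kind}: CRD not installed"
      continue
    fi
    echo "deleting all ${kind} in all namespaces"
    oc delete "$kind" --all --all-namespaces --ignore-not-found --wait=true
  done
}
```

Running this while the operator and its controllers are still alive is the point of the ordering: the controllers can process their own finalizers, so nothing has to be forced later.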

Tested end-to-end on OCP 4.20:

  • RHOAI 3.3.1 (fast-3.x): clean uninstall in 55s ✓
  • RHOAI 3.3.2 (stable-3.x): clean uninstall in 51s ✓

All namespaces, CRDs, and webhooks removed cleanly with no forced operations.

lautou and others added 3 commits May 12, 2026 22:37
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Validated on OCP cluster running rhods-operator.3.4.0 — the CRD
sparkoperators.components.platform.opendatahub.io was not in the
explicit list and remained after cleanup. Added to both Step 0 (CR
deletion) and Step 5 (CRD deletion) in instance-2.x and instance-3.x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
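
A leftover CRD like the one described in the commit above could be caught with a small post-cleanup check. This is a sketch, assuming `oc` access to the cluster; `remaining_crds` and the two-entry list are illustrative, and a real check would cover the full explicit list.

```shell
#!/bin/sh
# CRDs expected to be gone after cleanup; names taken from the commit messages.
EXPECTED_GONE_CRDS="sparkoperators.components.platform.opendatahub.io trainers.components.platform.opendatahub.io"

# remaining_crds prints each listed CRD that still exists on the cluster.
remaining_crds() {
  for crd in $EXPECTED_GONE_CRDS; do
    if oc get crd "$crd" >/dev/null 2>&1; then
      echo "$crd"
    fi
  done
}
```

An empty result means every listed CRD was removed; any printed name is a candidate for adding to the deletion list.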
lautou requested a review from strangiato May 21, 2026 14:37

lautou commented May 21, 2026

Contributor Author

@strangiato can you review?

lautou commented May 28, 2026

Contributor Author

@strangiato can you kindly review?
