Add cleanup-on-delete ArgoCD PreDelete hook for RHOAI 2.x and 3.x #181
lautou wants to merge 8 commits
Implements an ArgoCD PreDelete hook that runs the official Red Hat RHOAI uninstall procedure automatically before ArgoCD removes Application resources. Without this, ArgoCD deletions leave namespaces stuck in Terminating and orphaned webhooks that block future reinstalls.

The component is enabled by default in all instance overlays since the cleanup problem affects every user who deletes the RHOAI Application.

Resources created (all PreDelete hooks, zero runtime footprint):
- ServiceAccount/rhoai-cleanup in openshift-gitops
- ClusterRole/rhoai-cleanup
- ClusterRoleBinding/rhoai-cleanup
- Job/delete-rhoai-resources in openshift-gitops

The Job runs in openshift-gitops to survive while RHOAI namespaces are being deleted, and executes a 10-step cleanup procedure per Red Hat documentation: DSC deletion, operator Subscription/CSV removal, pod termination wait, OperatorGroup removal, webhook cleanup, namespace deletion, and CRD removal with finalizer fallback.

Validated end-to-end on OCP 4.20 with RHOAI 3.3.1 (fast-3.x channel).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
48eeda6 to 9fc4cc1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@strangiato can you review and merge?
strangiato left a comment
To be honest, the approach here scares me.
There are a lot of very unsafe practices that I would advise against.
I would highly recommend simplifying the script to cover only a basic uninstall, as described in the official docs here:
https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.4/html-single/installing_and_uninstalling_openshift_ai_self-managed/index#uninstalling-openshift-ai-self-managed-using-cli_uninstalling-openshift-ai-self-managed
There appears to be a ton of brute force happening in the script that should not be necessary, and I am concerned that it was applied because of other unsafe practices.
In case you are curious, the following is a set of instructions I put together to do a complete manual uninstall of 2.25 for a customer:
https://github.com/redhat-ai-services/ai-accelerator/blob/main/documentation/rhoai-2.25-uninstall.md
Thanks for the detailed feedback and the uninstall doc, both very helpful. You're right: the brute force (forced deletion, finalizer patching) was a symptom of deleting things in the wrong order. I'll rework the script to follow the official procedure: delete user workload CRs first, then the DSC, then the Subscription/CSV, then an explicit list of CRDs by name. No grep, no force. Will push an update shortly.
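The order proposed above could be sketched roughly as follows. This is not the script from the PR: the `run` helper, the `DRY_RUN` switch, the function name, and the abbreviated resource lists are assumptions made for illustration, and using `--all` within `redhat-ods-operator` for the Subscription and CSV is a simplification. By default the sketch only prints the commands it would issue.

```shell
#!/usr/bin/env bash
# Sketch only: prints the planned commands unless DRY_RUN=0 is set.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Abbreviated, illustrative lists (assumption); a real script covers more kinds.
WORKLOAD_KINDS=(
  inferenceservices.serving.kserve.io
  notebooks.kubeflow.org
)
CRDS=(
  datascienceclusters.datasciencecluster.opendatahub.io
  dscinitializations.dscinitialization.opendatahub.io
)

cleanup_in_order() {
  local kind crd
  # 1. User workload CRs first, to avoid the finalizer deadlocks discussed above.
  for kind in "${WORKLOAD_KINDS[@]}"; do
    run oc delete "$kind" --all --all-namespaces --ignore-not-found
  done
  # 2. The DataScienceCluster.
  run oc delete datascienceclusters.datasciencecluster.opendatahub.io --all --ignore-not-found
  # 3. The operator Subscription, then its ClusterServiceVersion.
  run oc delete subscriptions.operators.coreos.com --all -n redhat-ods-operator --ignore-not-found
  run oc delete clusterserviceversions.operators.coreos.com --all -n redhat-ods-operator --ignore-not-found
  # 4. CRDs, each by explicit name (no grep, no force).
  for crd in "${CRDS[@]}"; do
    run oc delete crd "$crd" --ignore-not-found
  done
}

cleanup_in_order
```

Running it with `DRY_RUN=0` would execute the same commands against whatever cluster `oc` is logged in to, so the dry-run output is worth reviewing first.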
- Move bash script to separate delete-rhoai-resources.sh file mounted via ConfigMap, following the aws-gpu-machineset component pattern
- Delete user workload CRs first (InferenceServices, Notebooks, Ray, Kueue, TrustyAI, etc.) to prevent finalizer deadlocks, the root cause of the brute-force approach in the previous version
- Replace dynamic grep-based CRD deletion with explicit CRD list per Red Hat documentation (rhoai-2.25-uninstall.md)
- Remove force deletion, finalizer patching, arbitrary sleep, and pod wait, all unnecessary when deletion order is correct
- Fix namespace deletion: use oc delete namespace instead of oc delete project (ClusterRole grants namespaces, not projects resource)
- Add missing CRDs: nemoguardrails.trustyai.opendatahub.io and trainers.components.platform.opendatahub.io
- Expand ClusterRole to cover all workload CR API groups

Validated end-to-end on OCP 4.20 with RHOAI 3.3.1 (fast-3.x): all namespaces, CRDs, and webhooks removed cleanly in 55s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
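A minimal sketch of the explicit-list approach from this commit, assuming a `run` helper and `DRY_RUN` switch that are not part of the PR. The CRD list is shortened to the two names mentioned in the commit message, and doing namespaces before CRDs is an assumption about ordering. The point it illustrates is that only names on the list are touched, and that namespaces are deleted through the `namespace` resource the ClusterRole actually grants.

```shell
#!/usr/bin/env bash
# Sketch only: prints the planned commands unless DRY_RUN=0 is set.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

NAMESPACES=(
  redhat-ods-applications
  redhat-ods-monitoring
  redhat-ods-operator
)

# Shortened list for illustration; only names from the commit message appear here.
CRDS=(
  nemoguardrails.trustyai.opendatahub.io
  trainers.components.platform.opendatahub.io
)

delete_namespaces_and_crds() {
  local ns crd
  for ns in "${NAMESPACES[@]}"; do
    # "namespace", not "project": the ClusterRole grants the namespaces resource.
    run oc delete namespace "$ns" --ignore-not-found
  done
  for crd in "${CRDS[@]}"; do
    # No pattern matching against the cluster's CRDs; only listed names are deleted.
    run oc delete crd "$crd" --ignore-not-found
  done
}

delete_namespaces_and_crds
```

The trade-off of an explicit list is that a CRD introduced by a newer operator version is left behind until someone adds it by name.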
Thanks again for the detailed review. Every point was valid, and I've reworked the entire script based on your feedback. What changed:
Tested end-to-end on OCP 4.20:
All namespaces, CRDs, and webhooks removed cleanly with no forced operations.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Validated on an OCP cluster running rhods-operator.3.4.0: the CRD sparkoperators.components.platform.opendatahub.io was not in the explicit list and remained after cleanup. Added it to both Step 0 (CR deletion) and Step 5 (CRD deletion) in instance-2.x and instance-3.x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
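A gap like this one could be caught with a small check run after cleanup. The sketch below is hypothetical and not part of the PR: the function name `find_unlisted_crds` and the shortened `EXPLICIT_CRDS` list are assumptions, and it only looks at names ending in `opendatahub.io`. It reads CRD names on stdin, in either bare form or the `kind/name` form that `oc get crd -o name` prints, and reports the ones missing from the list.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shortened explicit list for illustration; names are taken from this thread.
EXPLICIT_CRDS=(
  nemoguardrails.trustyai.opendatahub.io
  trainers.components.platform.opendatahub.io
)

# Read CRD names on stdin (one per line) and print every opendatahub.io CRD
# that does not appear in EXPLICIT_CRDS.
find_unlisted_crds() {
  local line name known listed
  while IFS= read -r line || [ -n "$line" ]; do
    name="${line##*/}"            # drop a leading "kind/" prefix if present
    case "$name" in
      *.opendatahub.io) ;;        # only this API group family is checked here
      *) continue ;;
    esac
    listed=0
    for known in "${EXPLICIT_CRDS[@]}"; do
      if [ "$name" = "$known" ]; then
        listed=1
        break
      fi
    done
    if [ "$listed" -eq 0 ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Against a live cluster it would be fed with `oc get crd -o name | find_unlisted_crds`; any name it prints is a candidate for the explicit list.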
@strangiato can you review?
@strangiato can you kindly review?
Summary
- Adds a cleanup-on-delete Kustomize Component to both instance-2.x/components/ and instance-3.x/components/ that implements an ArgoCD PreDelete hook running the official Red Hat RHOAI uninstall procedure
- All resources carry the PreDelete hook annotation and are only created/executed at Application deletion time

Problem
Without this component, ArgoCD deletes RHOAI manifests in arbitrary order, which consistently results in:
- Namespaces (redhat-ods-operator, redhat-ods-applications, redhat-ods-monitoring) stuck in Terminating
- Orphaned ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources that block future reinstalls

How it works
A PreDelete Job runs in openshift-gitops (not in the RHOAI namespaces, so it survives their deletion) and executes a 10-step cleanup, including:
- Delete the DataScienceCluster CR (triggers operator cleanup)
- Remove the operator Subscription
- Remove the ClusterServiceVersion
- Remove the OperatorGroup
- Clean up ValidatingWebhookConfiguration/MutatingWebhookConfiguration resources
- Delete the namespaces (redhat-ods-operator, redhat-ods-applications, redhat-ods-monitoring)
- Remove CRDs (opendatahub.io, kubeflow.org) with finalizer fallback

The Job exits 0 even when warnings occur, so ArgoCD always proceeds with Application deletion. backoffLimit: 0 means no retry; any remaining resources are cleaned up by ArgoCD's normal deletion.

Resources created
All are PreDelete hooks with hook-delete-policy: HookSucceeded,HookFailed:
- ServiceAccount/rhoai-cleanup (openshift-gitops)
- ClusterRole/rhoai-cleanup
- ClusterRoleBinding/rhoai-cleanup
- Job/delete-rhoai-resources (openshift-gitops)

Test plan
- fast-3.x channel on OCP 4.20 (fresh cluster)
- DataScienceCluster Ready
- cleanup-on-delete resources simulating ArgoCD PreDelete hook
- dscinitializations CRD finalizer fallback triggered and handled correctly
- kustomization.yaml files pass format validation (check-kustomization-format.sh)
- validate_manifests.sh

🤖 Generated with Claude Code
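The "exits 0 even when warnings occur" behaviour described under How it works can be illustrated with a warn-and-continue wrapper. This is a minimal sketch, not the Job's script: `try_step`, `WARNINGS`, and `main` are hypothetical names, and `true` and `false` stand in for real cleanup commands.

```shell
#!/usr/bin/env bash
# Deliberately no "set -e": a failed step is logged as a warning, not treated as fatal.
set -uo pipefail

WARNINGS=0

# Run one step; on failure, print a warning to stderr and keep going.
try_step() {
  local description="$1"
  shift
  if "$@"; then
    echo "OK: $description"
  else
    echo "WARNING: $description failed (exit code $?)" >&2
    WARNINGS=$((WARNINGS + 1))
  fi
}

main() {
  # Placeholder commands; a real script would call oc here.
  try_step "step that succeeds" true
  try_step "step that fails" false
  echo "Cleanup finished with $WARNINGS warning(s)"
  # Always report success so the caller proceeds regardless of warnings.
  return 0
}

main
```

A script used as a Job entrypoint would typically end with `exit 0` after `main`. The cost of this design is that a failed cleanup is visible only in the Job's log, since the exit status no longer carries it.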