ACS AI Overwatch is a GitOps repository for an OpenShift Proof of Concept that combines:
- Red Hat OpenShift AI (RHOAI) for GPU-backed workbenches and model serving
- Red Hat Advanced Cluster Security (ACS / RHACS) for runtime policy enforcement
- NVIDIA OpenShell agent sandboxes for contrasting “good” vs “rogue” AI agent behavior
- Kagenti for agent deployment and orchestration
- Mattermost as a Slack-compatible notification sink for ACS violations
- Quay as the on-cluster container registry for agent images
- Tekton (OpenShift Pipelines) for building and pushing agent images
The repository is designed to be deployed through OpenShift GitOps (Argo CD) using:
- `acs-ai-overwatch-gitops-bootstrap`: namespaces with `argocd.argoproj.io/managed-by` so Argo CD can create ServiceAccounts
- `acs-ai-overwatch-cluster-discovery`: in-cluster Job writes cluster settings to a ConfigMap (including Mattermost external URL)
- `acs-ai-overwatch`: umbrella Helm chart at `gitops/helm/acs-ai-overwatch`
Optional (opt-in, disabled by default):
- `acs-ai-overwatch-kagenti-platform`: installs the Kagenti control plane (Keycloak, SPIRE, operator, API). Not registered in `gitops/argocd/kustomization.yaml` until you enable Phase 4.
- `acs-ai-overwatch-observability`: Phase 5 shared tracing (OTEL → Tempo + MLflow, Grafana dashboards). Not registered until you enable Phase 5.
- Full RHACS Central + SecuredCluster: templates and bootstrap Job in the main chart, gated by `acs.central.enabled` / `acs.bootstrap.enabled` (both `false` by default).
See PoC deployment phases for phase definitions and Step-by-step deployment for the full walkthrough in deployment order (Mattermost login and alerts are near the end, before the demo).
When you need Pipelines: GitOps-only deploy (operators, Quay, Mattermost) does not require Pipelines. Install Pipelines before step 6 below (building helpful-hank / rosey-regrets images). See OpenShift Pipelines (Tekton) prerequisite.
oc login # cluster-admin
# 1. Cluster-admin bootstrap (RBAC, namespaces, cluster ConfigMap, discovery SA) — see below
make cluster-admin-pre-gitops
# or: ./scripts/cluster-admin/install-pre-gitops.sh
# 1b. Install Red Hat Kueue Operator manually (before default-dsc) — see Prerequisites
# 2. Confirm StorageClass matches values.yaml (default gp3-csi)
# oc get storageclass
# 3. Register Argo CD Applications (set repoURL in YAML to your fork if needed)
oc apply -k gitops/argocd/
# 4. Sync Applications (waves 0→1→2) or wait for automated sync
# acs-ai-overwatch-gitops-bootstrap → cluster-discovery → acs-ai-overwatch
# 5. Confirm cluster ConfigMap (from step 1 or discovery Job)
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
# 6. Install OpenShift Pipelines, then build agent images (Phase 2)
oc apply -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipeline.yaml
# Or: oc start-build rosey-regrets-slm --from-dir=. --follow -n test-range
# 7. Enable Phase 3 (RHACS Central), Phase 4 (Kagenti), agents in values-poc.yaml — sync Argo CD
# See "PoC deployment phases" and "Step-by-step deployment"
# 8. Verify Mattermost + RHACS notifier — see "Mattermost & RHACS notifications"
# 9. Run the end-to-end demo; see "PoC Demo Walkthrough (After Setup)"

Optional (local Helm / override file): `make cluster-values` writes `values-cluster.yaml` from your `oc login` (see Cluster-Aware Configuration).
OpenShift AI version: This PoC targets OpenShift AI 3.4 (stable-3.4 channel by default; confirm with packagemanifest, DataScienceCluster v2). Use a fresh cluster — do not install on a cluster that already ran RHOAI 2.25 (no supported in-place upgrade to 3.4).
- Solution Overview
- Architecture
- Repository Layout
- Prerequisites
- PoC deployment phases
- Cluster admin: pre-GitOps setup
- Fresh cluster deployment (OpenShift AI 3.4)
- Helm Values File Layering
- Cluster-Aware Configuration
- Storage
- Configuration Checklist
- Deployment Methods
- Helm Chart Reference
- Platform Components
- AI Agents
- Kagenti Integration
- ACS / RHACS Security
- Tekton Image Build Pipeline
- Step-by-step deployment
- Operational Scripts
- Namespaces and Resource Map
- Helm Template Inventory
- Troubleshooting
- Security and Legal Notes
- Development and Validation
- Mattermost & RHACS notifications
- PoC Demo Walkthrough (After Setup)
This PoC demonstrates how a platform team can:
- Provision OpenShift AI with GPU time-slicing on NVIDIA L4 accelerators
- Deploy contrasting agent personalities built on NVIDIA OpenShell:
  - Helpful Hank: a standard technical assistant
  - Rosey Regrets: a deliberately misaligned agent used only in isolated lab environments
  - Sneaky Sam (opt-in): telemetry guardrail demo agent (non-compliant deploy)
- Enforce runtime guardrails with RHACS in the `test-range` namespace
- Enforce telemetry compliance on Kagenti agents (DEPLOY policy + NetworkPolicy)
- Route policy violations to Mattermost (Slack-compatible webhook integration)
- Persist Rosey’s reconnaissance output to a named PVC: `agent-reference-information`
- Build agent images with Tekton and push them to local Quay
Demo B — ACS violation loop (Rosey / runtime policy):
Operator chats with rosey-regrets-slm in Kagenti ("Network Audit" or recon prompt)
│
▼
Qwen3 SLM calls run_network_recon → nmap runs in background (10.0.0.0/24 PoC default)
│
▼
RHACS runtime policy test-range-runtime-guardrails detects nmap (alert-only — no kill)
│
▼
RHACS generic notifier → acs-mattermost-bridge → Slack-format POST → Mattermost Town Square
│
▼
Operator reviews scan transcripts on PVC agent-reference-information
Demo A — Telemetry guardrail (Sneaky Sam): non-compliant agent Deployment triggers DEPLOY policy → Mattermost alert (and admission block when Phase 3 is enabled). See Step-by-step deployment and PoC Demo Walkthrough (After Setup).
flowchart TB
subgraph GitOps["GitOps Control Plane"]
ArgoCD["Argo CD Application<br/>acs-ai-overwatch"]
Helm["Helm Chart<br/>gitops/helm/acs-ai-overwatch"]
ArgoCD --> Helm
end
subgraph Infra["Infrastructure Layer"]
NFD["Node Feature Discovery"]
GPUOp["NVIDIA GPU Operator<br/>time-slicing"]
Quay["Quay Registry<br/>gp3-csi"]
NFD --> GPUOp
end
subgraph AI["OpenShift AI"]
RHOAI["RHOAI Operator"]
DSC["DataScienceCluster"]
WB["Workbenches"]
HP["HardwareProfile<br/>l4-timeslice-half-gpu"]
RHOAI --> DSC
DSC --> WB
DSC --> HP
end
subgraph Agents["Agent Layer (test-range)"]
Hank["helpful-hank"]
Rosey["rosey-regrets"]
Sam["sneaky-sam<br/>no telemetry label"]
PVC["PVC agent-reference-information"]
OpenShell["NVIDIA OpenShell base"]
Hank --> OpenShell
Rosey --> OpenShell
Sam --> OpenShell
Rosey --> PVC
end
subgraph Security["Security Layer"]
ACSOp["RHACS Operator"]
Policy["Runtime Policy<br/>test-range-runtime-guardrails"]
TelPol["Telemetry Policy<br/>test-range-agent-telemetry-required"]
NetPol["NetworkPolicy<br/>agent-telemetry-block-noncompliant"]
SCC["SCC openshell-gpu-runtime"]
ACSOp --> Policy
ACSOp --> TelPol
TelPol --> MM
Policy --> MM
NetPol --> Sam
end
subgraph Notify["Notifications"]
MM["Mattermost<br/>monitoring namespace"]
end
subgraph CI["Build / CI"]
Tekton["Tekton Pipeline<br/>hank + rosey + sneaky-sam"]
Tekton --> Quay
Quay --> Hank
Quay --> Rosey
Quay --> Sam
end
Helm --> Infra
Helm --> AI
Helm --> Agents
Helm --> Security
Helm --> MM
Rosey Regrets uses a single, explicitly named persistent volume contract:
| Concept | Value |
|---|---|
| PVC name | agent-reference-information |
| PVC namespace | test-range |
| Container mount path | /agent-reference-information |
| Environment variable | AGENT_OUTPUT_DIR=/agent-reference-information |
| Helm value | agentsRoseyRegrets.pvc.name |
| Mount path value | kagenti.rosey.outputMountPath |
All four layers (values, PVC template, Kagenti deployment, container image) must stay aligned.
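
The alignment rule above can be spot-checked with a small shell helper. This is a minimal sketch, not a script from the repository: `check_pvc_contract` is a hypothetical name, and the only facts it encodes are the PVC name and the requirement that `AGENT_OUTPUT_DIR` equals the container mount path, both taken from the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): compares the values that
# the PVC contract table says must agree.
check_pvc_contract() {
  local pvc_name="$1" mount_path="$2" output_dir="$3"
  # The PVC name is fixed by the contract table.
  if [ "$pvc_name" != "agent-reference-information" ]; then
    echo "unexpected PVC name: $pvc_name" >&2
    return 1
  fi
  # AGENT_OUTPUT_DIR must point at the same path the PVC is mounted on.
  if [ "$output_dir" != "$mount_path" ]; then
    echo "AGENT_OUTPUT_DIR ($output_dir) differs from mount path ($mount_path)" >&2
    return 1
  fi
  echo "PVC contract aligned: $pvc_name mounted at $mount_path"
}
```

The arguments would come from your rendered values and from the cluster, for example the name reported by `oc get pvc agent-reference-information -n test-range`.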
acs-ai-overwatch/
├── README.md
├── Makefile # make cluster-values, make helm-template
├── agents/
│ ├── common/acs_agent/ # Shared Kagenti A2A server + OTEL bootstrap
│ ├── helpful-hank/ # OpenShell + standard assistant
│ ├── rosey-regrets/ # OpenShell + nmap + rogue prompt + /agent-reference-information
│ ├── sneaky-sam/ # Telemetry guardrail demo (non-compliant deploy)
│ ├── rosey-rogue/ # Legacy placeholder
│ └── scripts/ # pull-model, install-agent-runtime, agent-entrypoint
├── gitops/
│ ├── argocd/
│ │ ├── kustomization.yaml # Baseline Applications; Phase 4/5 commented out
│ │ ├── application-gitops-bootstrap.yaml
│ │ ├── application-cluster-discovery.yaml
│ │ ├── application.yaml # Main umbrella chart (sync-wave 2)
│ │ ├── application-kagenti-platform.yaml # OPT-IN Phase 4 (sync-wave 3)
│ │ ├── application-observability.yaml # OPT-IN Phase 5 (sync-wave 4)
│ │ └── cmp/ # Optional CMP if Helm lookup fails
│ └── helm/
│ ├── acs-ai-overwatch-cluster-discovery/ # Job → cluster ConfigMap
│ ├── acs-ai-overwatch-kagenti-platform/ # OPT-IN Phase 4
│ ├── acs-ai-overwatch-observability/ # OPT-IN Phase 5
│ └── acs-ai-overwatch/
│ ├── Chart.yaml # v0.4.0
│ ├── values.yaml # Base defaults + clusterDiscovery + toggles
│ ├── values-poc.yaml # PoC overlay (component toggles)
│ ├── values-cluster.yaml.example
│ └── templates/ # 35+ OpenShift / K8s manifests
├── pipelines/tekton/ # Build helpful-hank, rosey-regrets, sneaky-sam → Quay
├── scripts/
│ ├── cluster-admin/ # Run as cluster-admin BEFORE Argo CD (see README there)
│ │ ├── install-pre-gitops.sh # All steps
│ │ ├── 01-grant-openshift-gitops-rbac.sh
│ │ ├── 02-bootstrap-namespaces.sh
│ │ ├── 03-apply-cluster-configmap.sh
│ │ └── 04-apply-discovery-prerequisites.sh
│ ├── lib/openshift-cluster-discovery.sh # Shared discovery logic
│ ├── discover-cluster-values.sh # oc login → optional values-cluster.yaml
│ └── trigger-network-audit.sh # Kagenti Network Audit → ACS loop
├── bootstrap/operators/ # Reserved
├── infrastructure/gpu-config/ # Reserved
├── monitoring/ # Reserved (Mattermost is in Helm chart)
└── scratch/ # Not deployed by chart
| Requirement | Notes |
|---|---|
| OpenShift 4.14+ (recommended) | Verify channel compatibility for operators on your cluster version |
| Fresh cluster for OpenShift AI 3.4 | No prior RHOAI 2.25 install; see Fresh cluster deployment |
| OpenShift GitOps Operator | Argo CD control plane in openshift-gitops |
| OpenShift Pipelines | Required before applying pipelines/tekton/ (not installed by this repo’s GitOps chart). See OpenShift Pipelines prerequisite |
| Red Hat build of Kueue Operator | Required before default-dsc on OpenShift AI 3.4 — manual install only (OperatorHub; not in GitOps). See Kueue Operator prerequisite |
| Worker nodes with NVIDIA L4 GPUs | Default values assume 3× L4 with time-slicing |
| Dynamic block storage (`gp3-csi`) | All PVCs including Quay, Mattermost, RHACS, Rosey; override `storage.defaultStorageClass` if needed |
| Operator catalogs | redhat-operators, certified-operators |
| Dependency | Purpose |
|---|---|
| Git remote | Source of truth for Argo CD and Tekton clone |
| Hugging Face Hub | Model HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive |
| ghcr.io/nvidia/openshell-community | OpenShell sandbox base image pull |
| Kagenti | Agent orchestration platform — optional Phase 4 GitOps Application, or manual setup-kagenti.sh |
- Cluster admin or sufficient privileges to install operators, SCCs, and cluster-scoped resources
- Ability to create Secrets for Quay credentials, Mattermost bootstrap, and Kagenti API tokens
- Network access from build pods to Quay and from agents to Hugging Face (if pulling models at runtime/build)
The agent image build manifests (pipelines/tekton/agents-build-pipeline.yaml) define Task and Pipeline resources with apiVersion: tekton.dev/v1. They are not deployed by the Argo CD Applications in this repo. If the Red Hat OpenShift Pipelines operator is not installed, oc apply fails with:
no matches for kind "Task" in version "tekton.dev/v1"
ensure CRDs are installed first
When you need it: GitOps-only deploy (operators, Quay, Mattermost) does not require Pipelines. Install Pipelines before step 6 in Quick Start (building helpful-hank / rosey-regrets images).
Verify:
oc get crd tasks.tekton.dev pipelineruns.tekton.dev
oc get csv -A | grep -i 'pipelines-operator'

Install (cluster-admin), using a channel that matches your OpenShift version:
# List channels (pick one, e.g. pipelines-1.14 or latest)
oc get packagemanifest openshift-pipelines-operator-rh \
-n openshift-marketplace \
-o jsonpath='{range .status.channels[*]}{.name}{"\n"}{end}'
export PIPELINES_CHANNEL="<channel-from-above>"
# openshift-operators already contains the default global-operators OperatorGroup.
# Do not create a second OperatorGroup there: OLM stops resolving Subscriptions
# in a namespace that has more than one. Only the Subscription is needed.
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator
  namespace: openshift-operators
spec:
  channel: ${PIPELINES_CHANNEL}
  name: openshift-pipelines-operator-rh
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
EOF
# Wait for CSV Succeeded, then:
oc get crd | grep tekton

Console: OperatorHub → Red Hat OpenShift Pipelines → Install.
After the operator is healthy, apply the pipeline:
oc apply -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipeline.yaml

See Tekton Image Build Pipeline for Quay secrets and PipelineRun.
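
To avoid the `no matches for kind "Task"` failure described in this section, the CRD check and the apply step can be combined. The following is a sketch under stated assumptions: `apply_agent_pipeline` is a hypothetical wrapper name, and it only calls `oc get crd` and `oc apply` with the namespace and manifest path given above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: apply the Tekton manifests only when the tekton.dev
# CRDs are already registered on the cluster.
apply_agent_pipeline() {
  if ! oc get crd tasks.tekton.dev pipelines.tekton.dev >/dev/null 2>&1; then
    echo "OpenShift Pipelines CRDs not found; install the operator first" >&2
    return 1
  fi
  oc apply -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipeline.yaml
}
```

Run `apply_agent_pipeline` from the repository root after `oc login`; a non-zero exit means the operator still needs to be installed.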
OpenShift AI 3.4 rejects spec.components.kueue.managementState: Managed on the DataScienceCluster. This chart sets Unmanaged, which requires the Red Hat build of Kueue Operator installed manually (not via GitOps).
When you need it: Before the main Argo CD Application syncs default-dsc (sync wave 30).
Verify:
oc get csv -n openshift-kueue-operator
oc get crd kueues.kueue.openshift.io
oc get kueue cluster -n openshift-kueue-operator

Install (cluster-admin), OperatorHub (recommended):
- Console → Operators → OperatorHub → Red Hat build of Kueue Operator → Install
- Enable cluster monitoring on namespace `openshift-kueue-operator`
- After CSV Succeeded, create the cluster `Kueue` CR (console Kueue tab → Create Kueue, or YAML):

apiVersion: kueue.openshift.io/v1
kind: Kueue
metadata:
  name: cluster
  namespace: openshift-kueue-operator
spec:
  managementState: Managed

- Label the workbench namespace when it exists:

oc label namespace rhods-notebooks kueue.openshift.io/managed=true --overwrite

Install (cluster-admin), CLI (pick channel from your catalog):
oc get packagemanifest kueue-operator -n openshift-marketplace \
-o jsonpath='{range .status.channels[*]}{.name}{"\n"}{end}'
export KUEUE_CHANNEL="<channel-from-above>"
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-kueue-operator
  labels:
    openshift.io/cluster-monitoring: "true"
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-kueue-operator
  namespace: openshift-kueue-operator
spec: {}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kueue-operator
  namespace: openshift-kueue-operator
spec:
  channel: ${KUEUE_CHANNEL}
  installPlanApproval: Automatic
  name: kueue-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
oc get csv -n openshift-kueue-operator -w
# Then apply the Kueue CR YAML above and label rhods-notebooks.

See OpenShift AI 3.4 - Kueue and Red Hat build of Kueue on OCP.
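
Instead of watching `oc get csv -w` by hand, the wait for CSV Succeeded can be scripted. This is a minimal polling sketch: `wait_for_csv_succeeded` is a hypothetical helper, and the name pattern you pass (for example `kueue`) is an assumption about how the CSV is named on your cluster, so confirm it with `oc get csv -n openshift-kueue-operator` first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until a CSV whose name matches a pattern reports
# phase Succeeded. Arguments: namespace, name pattern, attempts, delay (s).
wait_for_csv_succeeded() {
  local namespace="$1" name_pattern="$2" attempts="${3:-60}" delay="${4:-10}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # Print "<name> <phase>" per CSV, then keep only the matching CSV.
    if oc get csv -n "$namespace" \
        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' 2>/dev/null \
        | grep "$name_pattern" | grep -q ' Succeeded$'; then
      echo "CSV matching '$name_pattern' in $namespace is Succeeded"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Timed out waiting for CSV matching '$name_pattern' in $namespace" >&2
  return 1
}
```

For example, `wait_for_csv_succeeded openshift-kueue-operator kueue` polls for up to ten minutes with the defaults. Filtering by name matters because a namespace can also list CSVs copied from operators installed for all namespaces.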
This repo is intentionally layered. The baseline (Phases 0–1) deploys GitOps, operators, and the Mattermost workload (server + bootstrap Job). Phases 2–5 add agents, full RHACS, Kagenti, and observability. Using Mattermost (login, webhook, alerts) is documented near the end in Mattermost & RHACS notifications — after RHACS and agents are in place.
| Check | Command |
|---|---|
| Argo apps Synced | `oc get application -n openshift-gitops \| grep acs-ai-overwatch` |
| Cluster ConfigMap | oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config |
| Mattermost pod (deployed, not yet used) | oc get pods -n monitoring -l app.kubernetes.io/name=mattermost |
| RHACS operator only (no Central yet) | oc get csv -n rhacs-operator |
Argo Applications (from gitops/argocd/kustomization.yaml):
| Wave | Application | Delivers |
|---|---|---|
| 0 | `acs-ai-overwatch-gitops-bootstrap` | Namespaces + managed-by labels |
| 1 | `acs-ai-overwatch-cluster-discovery` | ConfigMap `acs-ai-overwatch-cluster-config` |
| 2 | `acs-ai-overwatch` | Operators, Mattermost, RHACS operator subscription, SecurityPolicy CRs, etc. |
| (opt-in 3) | `acs-ai-overwatch-kagenti-platform` | Kagenti control plane (see Phase 4) |
| (opt-in 4) | `acs-ai-overwatch-observability` | OTEL → Tempo + MLflow (see Phase 5) |
oc apply -k gitops/argocd/

Not applied by default: `application-kagenti-platform.yaml` and `application-observability.yaml` (commented out in kustomization).
| When | Action |
|---|---|
| Before first Argo sync | Run cluster-admin pre-GitOps scripts (`make cluster-admin-pre-gitops`); details in `scripts/cluster-admin/README.md` |
| Before `default-dsc` syncs | Install Red Hat Kueue Operator from OperatorHub (not in GitOps) |
| Using a fork | Set `spec.source.repoURL` in each `gitops/argocd/application*.yaml` |
| Storage class differs from `gp3-csi` | Set `storage.defaultStorageClass` in `values.yaml` (see Storage) |
| Enabling Quay | Set `quayStorage.registryCredentials.password` and review MinIO credentials before production |
| Mattermost bootstrap | Set `mattermost.bootstrap.*` passwords in `values.yaml` (not auto-generated) |
| Quay operator `ResolutionFailed` | Delete orphaned CSV in `quay` namespace, approve InstallPlan, re-sync |
| Helm `lookup` empty on repo-server | Optional CMP in `gitops/argocd/cmp/` (see Cluster-Aware Configuration) |
Everything else in Phase 0 (namespaces, discovery Job, operator Subscriptions, Mattermost deploy) is GitOps-driven once the above prerequisites are met.
Mattermost deploys during the main chart sync (Postgres, server, bootstrap Job, Route). Discovery writes mattermostSiteUrl into the cluster ConfigMap for the browser URL.
Do not commit sandbox hostnames in values-cluster.yaml — they go stale when the cluster is recreated.
After a new sandbox, re-sync discovery, then refresh the main app:
oc get job -n acs-ai-overwatch-system cluster-discovery
oc annotate application acs-ai-overwatch -n openshift-gitops argocd.argoproj.io/refresh=hard --overwrite

Login, webhook verification, and RHACS alert delivery are covered in Mattermost & RHACS notifications. Do that after Phase 3 (RHACS) is healthy, immediately before the demo.
Requires Quay (or another registry), Tekton pipeline, and enabling component flags in values-poc.yaml:
components:
kagenti:
enabled: true # agent Deployments only — not the Kagenti platform
agentsHelpfulHank:
enabled: true
agentsRoseyRegrets:
enabled: true
agentsSneakySam:
enabled: true # demo: deliberately non-telemetry-compliant agent

Agent telemetry guardrails (`agentTelemetryPolicy.enabled`, default `true`):
| Layer | Mechanism | When it applies |
|---|---|---|
| Kubernetes | NetworkPolicy selects `kagenti.io/type=agent` pods without `acs-ai-overwatch.io/telemetry=enabled` and allows DNS egress only | Immediate on sync (no RHACS Central required) |
| RHACS (ACS) | SecurityPolicy CRs: DEPLOY-stage telemetry label policy; Mattermost notifier on violation; optional admission block (not scale-to-zero) | Phase 3 bootstrap configures SecuredCluster + notifier; policies sync as CRs |
Compliant agents (Hank, Rosey) carry acs-ai-overwatch.io/telemetry: enabled. Sneaky Sam and Sneaky Sam SLM omit the telemetry label entirely and are isolated by the NetworkPolicy — demonstrating the guardrail.
RHACS telemetry policy (Phase 3) — deploy alert, not scale-down: The policy uses lifecycle stage DEPLOY only (no RUNTIME / SCALE_TO_ZERO). When Sneaky Sam is synced, RHACS evaluates the Deployment, fires test-range-agent-telemetry-required, and the Mattermost Notifier posts to Town Square (human-in-the-loop user is on that channel). With admission enforcement enabled, the Deployment is blocked at create/update — the notification describes a non-compliant deploy attempt, not a pod being scaled down later.
To notify without blocking admission, set agentTelemetryPolicy.rhacs.enforcementActions: [] in values.
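
To see which agent pods the NetworkPolicy layer would isolate, a label-selector query can mirror the rule described above. This is a sketch, assuming the agents run in `test-range` and that the selector below matches the deployed NetworkPolicy; `list_noncompliant_agents` is a hypothetical helper name. The `key!=value` selector form also matches pods that lack the label entirely, which is the Sneaky Sam case.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Kagenti agent pods that do not carry
# acs-ai-overwatch.io/telemetry=enabled (label missing or set to another value).
list_noncompliant_agents() {
  local namespace="${1:-test-range}"
  oc get pods -n "$namespace" \
    -l 'kagenti.io/type=agent,acs-ai-overwatch.io/telemetry!=enabled' \
    -o name
}
```

With Hank and Rosey compliant, the output should contain only the Sneaky Sam pods once `agentsSneakySam.enabled` is set.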
See Tekton Image Build Pipeline and AI Agents.
| When | Action |
|---|---|
| Before Tekton builds | Install OpenShift Pipelines (OperatorHub — not deployed by this repo) |
| Using in-cluster Quay | Enable quayStorage.enabled: true, set registry password, wait for QuayRegistry Ready |
| Building images | Apply pipeline manifests and create a PipelineRun (not in default GitOps): oc apply -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipeline.yaml |
| Before agent pods start | Build and push images to Quay (Tekton or docker build); enable components.kagenti and per-agent flags only after images exist |
| Rosey “Network Audit” demo | Requires Phase 4 — chat with rosey-regrets-slm in Kagenti UI (see PoC Demo Walkthrough) |
| Sneaky Sam telemetry demo | Enable agentsSneakySam.enabled: true; Mattermost alert needs Phase 3 SecuredCluster + notifier |
Agent Deployments and NetworkPolicy sync via GitOps; image builds and operator prerequisites do not.
Baseline: components.acsPolicies.enabled: true installs the RHACS operator, test-range namespace, SecurityPolicy CRs, and OpenShell SCC — not Central or sensors.
Full stack (opt-in): set in values.yaml or values-poc.yaml:
acs:
  central:
    enabled: true
    persistence:
      storageClassName: gp3-csi
  bootstrap:
    enabled: true

Policies (`test-range-runtime-guardrails`, `test-range-agent-telemetry-required`) sync as SecurityPolicy CRs via GitOps, not via the bootstrap Job (RHACS 4.10 removed `roxctl declarative-config create --file` for policies).
This adds (when RHACS CRDs exist):
| Resource | Purpose |
|---|---|
| Central CR (`stackrox`) | RHACS UI/API |
| Job `acs-platform-bootstrap` | Init bundle, SecuredCluster, Mattermost notifier upsert (best-effort) |
| SecurityPolicy CRs | Runtime + telemetry policies (sync-wave with main chart) |
Rollback to baseline (disable full RHACS without removing operator):
acs:
  central:
    enabled: false
  bootstrap:
    enabled: false

Commit, push, sync. Existing Central resources may need manual cleanup in `stackrox` if you previously enabled Phase 3.
| When | Action |
|---|---|
| Sandbox without full registry.redhat.io entitlement | Copy cluster pull secret to `stackrox` and attach to bootstrap ServiceAccount: `oc get secret pull-secret -n openshift-config -o yaml \| sed 's/namespace: openshift-config/namespace: stackrox/' \| oc apply -f -` then patch `acs-bootstrap` SA `imagePullSecrets` |
| Bootstrap Job warns on notifier upsert | Notifier is declarative ConfigMap rhacs-mattermost-notifier in stackrox; endpoint must be acs-mattermost-bridge (not the Mattermost URL directly) |
| Alerts not reaching Mattermost | RHACS generic JSON ≠ Slack {"text":...} — confirm acs-mattermost-bridge is Running; test webhook with curl -d '{"text":"test"}' to URL in mattermost-acs-integration |
| Kagenti 504 on Rosey message | Was caused by scanning 10.0.0.0/8 synchronously — use rosey-regrets-slm, NETWORK_AUDIT_CIDR=10.0.0.0/24, and LLM-driven background recon |
| Stale init bundle (secrets missing) | Bootstrap Job revokes and retries automatically; if stuck, revoke bundle in Central UI and re-run Job |
| Rollback from full RHACS | Delete Central / SecuredCluster and related secrets in stackrox if GitOps prune does not remove them |
Central install, SecuredCluster registration, and policy CRs are GitOps-driven once pull secrets and Central CRDs are healthy.
The main chart’s components.kagenti flag deploys agent Deployments labeled for Kagenti. It does not install Kagenti itself.
To install the Kagenti platform (Keycloak, SPIRE, operator, API):
- Uncomment in `gitops/argocd/kustomization.yaml`:

      - application-kagenti-platform.yaml

- Enable the install Job in `gitops/helm/acs-ai-overwatch-kagenti-platform/values.yaml`:

      job:
        enabled: true

- Commit, push, apply:

      oc apply -k gitops/argocd/
      # PostSync install Job (15–30+ min). Argo waits up to 1h (Timeout=3600) before SyncFailed.
      oc logs -n kagenti-system -l job-name=kagenti-platform-install -c install -f

Install can take 15–30 minutes. Requires cluster-admin (Job uses cluster-admin RBAC — PoC only). Argo Application sync-wave is 3 (after main chart at wave 2). While the install Job runs, Argo shows Running / waiting for hook — that is normal, not a failed install. Avoid re-syncing until the Job completes or you will recreate the hook.
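
Rather than re-syncing or polling the Argo CD UI while the hook runs, you can block on the Job itself. This is a sketch, assuming the Job is named `kagenti-platform-install` in `kagenti-system` (inferred from the `job-name` label in the log command above); `wait_for_kagenti_install` is a hypothetical wrapper around `oc wait`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: block until the Kagenti install Job reports the
# Complete condition, or until the timeout expires.
wait_for_kagenti_install() {
  local timeout="${1:-3600s}"
  oc wait --for=condition=complete job/kagenti-platform-install \
    -n kagenti-system --timeout="$timeout"
}
```

Note that `oc wait` on the `complete` condition keeps waiting until the timeout if the Job fails, so keep the log stream open in a second terminal.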
| When | Action |
|---|---|
| Argo sync times out during install | Raise controller sync timeout (see optional patch below) — install Job can run 15–30+ min |
| After install completes | Run ./scripts/kagenti-auth-info.sh for Kagenti UI URL and Keycloak demo credentials |
| First UI login | Open Kagenti route → sign in at Keycloak as user admin (password from script) |
| Custom realms / users / OIDC clients | See KEYCLOAK.md — default PoC needs no manual Keycloak setup |
| Rollback | Set job.enabled: false, remove Application from kustomization, delete Argo app (commands below) |
Optional (cluster-admin): if sync still times out, raise the global Argo CD controller limit (OpenShift GitOps default is unlimited, but some clusters override it):
oc patch configmap argocd-cmd-params-cm -n openshift-gitops --type merge \
-p '{"data":{"controller.sync.timeout.seconds":"3600"}}'
# The Argo CD application controller in OpenShift GitOps runs as a StatefulSet.
oc rollout restart statefulset openshift-gitops-application-controller -n openshift-gitops

Rollback: set `job.enabled: false`, remove the Application from kustomization, delete the Argo app:
oc delete application acs-ai-overwatch-kagenti-platform -n openshift-gitops --ignore-not-found

The install Job provisions Keycloak (RHBK), imports realm `kagenti`, and configures OIDC for the Kagenti UI. Manual Keycloak setup is not required for the default PoC.
| Step | Action |
|---|---|
| 1 | Wait for Phase 4 install to finish (helm list -n kagenti-system shows kagenti + kagenti-deps) |
| 2 | Verify Keycloak: oc get pods -n keycloak ( keycloak-0 Running, realm import Job Complete ) |
| 3 | Print URLs and demo credentials: ./scripts/kagenti-auth-info.sh |
| 4 | Open the Kagenti UI route → sign in at Keycloak with user admin (password from script) |
Docs: gitops/helm/acs-ai-overwatch-kagenti-platform/KEYCLOAK.md — verification checklist, demo users, manual steps, Keycloak admin console, troubleshooting.
GitOps values (only if you need non-default names): kagenti.keycloakNamespace, kagenti.keycloakRealm in gitops/helm/acs-ai-overwatch-kagenti-platform/values.yaml.
Architecture (Option C):
| Layer | Role |
|---|---|
| Agents / Kagenti | Emit OTLP traces to a shared collector |
| Shared OTEL Collector | Dual-export: Tempo (Grafana trace search) + MLflow (LLM trace detail) |
| Tempo | Distributed tracing backend for Grafana user-workload dashboards |
| MLflow | LLM/agent trace store (RHOAI mlflowoperator component) |
| Grafana (user workload) | Shared dashboard for agent trace overview |
Prerequisites: RHOAI operator + default-dsc from the main chart (Phase 0). For Kagenti AuthBridge traces, enable Phase 4 before Phase 5.
Internal sync waves (within the observability chart):
| Wave | Step |
|---|---|
| 0 | Namespaces (acs-ai-overwatch-observability, tempo, openshift-tempo-operator) |
| 5 | User workload monitoring prep (enable + namespace labels) |
| 10 | Tempo Operator subscription |
| 30 | TempoMonolithic CR; MLflow DSC patch Job (mlflowoperator: Managed) |
| 50 | OTEL collector ConfigMap; Grafana dashboard ConfigMaps |
| 60 | OTEL collector Deployment/Service |
| 70 | Bootstrap Job → writes acs-ai-overwatch-observability-config |
To enable Phase 5:
- Uncomment in `gitops/argocd/kustomization.yaml`:

      - application-observability.yaml

- Enable the chart in `gitops/helm/acs-ai-overwatch-observability/values.yaml` (or use `values-phase5.yaml`):

      enabled: true

- Commit, push, apply:

      oc apply -k gitops/argocd/
      oc logs -n acs-ai-overwatch-observability job/observability-bootstrap -f

- Optional: agent OTLP env (after bootstrap ConfigMap exists), in main chart values:

      observability:
        agentInstrumentation:
          enabled: true

  Refresh the main Argo Application so Helm `lookup` reads the integration ConfigMap. Agent images already include the OTEL SDK; this step only injects the collector endpoint env vars.

- Rebuild agent images (Tekton pipeline or `docker build`) so pods run the `acs_agent.server` runtime with instrumentation baked in.

- Optional: Kagenti + Phase 5 (when Phase 4 is also enabled):

      # gitops/helm/acs-ai-overwatch-kagenti-platform/values.yaml
      phase5:
        integration:
          enabled: true

Verify:
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-observability-config
oc get deploy -n acs-ai-overwatch-observability acs-otel-collector
oc get tempomonolithic -n tempo
oc get dsc default-dsc -o jsonpath='{.spec.components.mlflowoperator.managementState}{"\n"}'

Grafana user workload URL (after bootstrap):
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-observability-config \
-o jsonpath='{.data.grafanaUserWorkloadUrl}{"\n"}'

Rollback: set `enabled: false`, remove the Application from kustomization, delete the Argo app:

oc delete application acs-ai-overwatch-observability -n openshift-gitops --ignore-not-found

Set `observability.agentInstrumentation.enabled: false` in the main chart to stop injecting OTEL env on agents.
| When | Action |
|---|---|
| Kagenti AuthBridge traces | Enable Phase 4 before Phase 5 |
| Agent OTLP env injection | After observability bootstrap ConfigMap exists, set observability.agentInstrumentation.enabled: true and hard-refresh main Argo app (Helm lookup) |
| Traces from running agents | Rebuild agent images (Tekton) — instrumentation env is injected at deploy time; images must include OTEL SDK |
| Kagenti + shared collector | Set phase5.integration.enabled: true in kagenti-platform values; Kagenti collector config merge may still need a manual step (bootstrap Job logs a hint) |
| Viewing traces | Open Grafana user-workload URL from ConfigMap acs-ai-overwatch-observability-config (written by bootstrap Job) |
| Rollback | Disable chart, remove Application from kustomization, set agentInstrumentation.enabled: false on main chart |
Tempo, MLflow, OTEL collector, and dashboard ConfigMaps deploy via GitOps; image rebuilds and cross-app refresh are the usual manual follow-ups.
Phase 0–1 (baseline) → bootstrap → discovery → main chart (operators + Mattermost deploy)
Phase 2 (agents) → Tekton/binary build + components.kagenti + per-agent flags
Phase 3 (full RHACS) → acs.central.enabled + acs.bootstrap.enabled (+ SecurityPolicy CRs)
Phase 4 (Kagenti plat) → application-kagenti-platform + job.enabled
Phase 5 (observability)→ application-observability + enabled: true (+ agentInstrumentation optional)
Mattermost + alerts → login, webhook bridge, notifier verify (before demo)
Demo → PoC Demo Walkthrough (After Setup)
After the PoC: run scripts/cleanup-poc-repo.sh to reset GitOps overlays and remove local cluster-specific files before the next sandbox or fork handoff.
Run these steps locally as cluster-admin after oc login and before oc apply -k gitops/argocd/. They create the objects Argo CD needs so the first sync does not fail on RBAC or missing cluster settings.
Scripts live under scripts/cluster-admin/ (includes manual steps by phase).
oc login
chmod +x scripts/cluster-admin/*.sh
make cluster-admin-pre-gitops
# equivalent: ./scripts/cluster-admin/install-pre-gitops.sh

| Step | Script | Kubernetes objects |
|---|---|---|
| 0 | `00-apply-appproject.sh` | AppProject `acs-ai-overwatch` (allows DSC, GPU ClusterPolicy, Namespace, SCC) |
| 1 | `01-grant-openshift-gitops-rbac.sh` | ClusterRoleBinding → `openshift-gitops-argocd-application-controller` (cluster-admin for PoC) |
| 2 | `02-bootstrap-namespaces.sh` | PoC namespaces with `argocd.argoproj.io/managed-by=openshift-gitops` |
| 3 | `03-apply-cluster-configmap.sh` | ConfigMap `acs-ai-overwatch-system/acs-ai-overwatch-cluster-config` (appsDomain, quayRegistryServer, kagentiApiBaseUrl, gitRepoUrl, …) |
| 4 | `04-apply-discovery-prerequisites.sh` | ServiceAccount `cluster-discovery`, discovery RBAC, ConfigMap `cluster-discovery-script` |
Manual prerequisites (OperatorHub, not GitOps): Kueue Operator (before default-dsc), OpenShift Pipelines (before agent builds).
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
oc get sa -n acs-ai-overwatch-system cluster-discovery
oc get cm -n acs-ai-overwatch-system cluster-discovery-script
oc auth can-i create serviceaccounts -n acs-ai-overwatch-system \
--as=system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller

# Skip cluster-admin binding if managed-by namespaces are enough on your cluster:
./scripts/cluster-admin/install-pre-gitops.sh --skip-rbac
# Let Argo CD create the discovery SA/script ConfigMap (run only steps 0–3):
./scripts/cluster-admin/install-pre-gitops.sh --skip-discovery-prereqs
# Also write values-cluster.yaml for local helm template:
./scripts/cluster-admin/install-pre-gitops.sh --with-values-file

./scripts/cluster-admin/01-grant-openshift-gitops-rbac.sh
./scripts/cluster-admin/02-bootstrap-namespaces.sh
./scripts/cluster-admin/03-apply-cluster-configmap.sh
./scripts/cluster-admin/04-apply-discovery-prerequisites.sh

Only the cluster ConfigMap (step 3):
./scripts/discover-cluster-values.sh --apply-configmap

# Edit values-poc.yaml, set repoURL in gitops/argocd/application*.yaml, then:
oc apply -k gitops/argocd/

Sync order: acs-ai-overwatch-gitops-bootstrap → acs-ai-overwatch-cluster-discovery → acs-ai-overwatch. If you ran the cluster-admin scripts, bootstrap and discovery may already match desired state; Argo will reconcile.
Use this checklist on a new OpenShift cluster with OpenShift GitOps already installed. Do not reuse a cluster where RHOAI 2.25 was previously installed.
oc login
chmod +x scripts/cluster-admin/*.sh
make cluster-admin-pre-gitops

Confirm storage.defaultStorageClass matches your cluster before enabling Quay (see Storage).
Optional: make cluster-values or ./scripts/cluster-admin/install-pre-gitops.sh --with-values-file for values-cluster.yaml.
Required before default-dsc applies. Manual install only — see Kueue Operator prerequisite.
If using a fork, update spec.source.repoURL in:
- gitops/argocd/application-gitops-bootstrap.yaml
- gitops/argocd/application-cluster-discovery.yaml
- gitops/argocd/application.yaml

oc apply -k gitops/argocd/

Sync in order (or wait for sync-waves): bootstrap → cluster-discovery → acs-ai-overwatch.
The main Application includes SkipDryRunOnMissingResource=true and gates platform CRs until operator CRDs exist — expect multiple syncs over 15–30+ minutes while OLM installs operators.
# OperatorGroup must be empty spec (not targetNamespaces)
oc get operatorgroup redhat-ods-operator -n redhat-ods-operator -o yaml | grep -A2 '^spec:'
# CSV must be 3.4.x, not 2.25.x
oc get csv -n redhat-ods-operator | grep rhods
# CRD must serve v2
oc get crd datascienceclusters.datasciencecluster.opendatahub.io \
-o jsonpath='{range .spec.versions[*]}{.name}{"\n"}{end}'
# After main app syncs platform CRs
oc get dsc default-dsc
oc get dsc default-dsc -o jsonpath='{.apiVersion}{" "}{.status.phase}{"\n"}'

Expected: rhods-operator.3.4.* Succeeded, datasciencecluster.opendatahub.io/v2, DSC phase Ready (may take several minutes).
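
The expected values above can also be checked mechanically. The following is a minimal sketch, assuming the three pieces of oc output have already been captured into strings; the function name check_rhoai_state and the sample inputs are hypothetical and not part of this repository.

```shell
# Hypothetical helper: compare captured oc output with the expected state.
#   $1 = CSV line from: oc get csv -n redhat-ods-operator | grep rhods
#   $2 = CRD served versions, one per line
#   $3 = "<apiVersion> <phase>" from the DSC jsonpath query
check_rhoai_state() {
  csv_line=$1
  crd_versions=$2
  dsc_state=$3

  # The CSV must be a 3.4.x build in phase Succeeded.
  case "$csv_line" in
    *rhods-operator.3.4.*Succeeded*) ;;
    *) echo "CSV is not rhods-operator.3.4.x Succeeded"; return 1 ;;
  esac

  # The CRD must list v2 among its versions.
  if ! printf '%s\n' "$crd_versions" | grep -qxF 'v2'; then
    echo "CRD does not serve v2"
    return 1
  fi

  # The DSC must use the v2 API and report phase Ready.
  if [ "$dsc_state" != "datasciencecluster.opendatahub.io/v2 Ready" ]; then
    echo "DSC is not v2/Ready: $dsc_state"
    return 1
  fi

  echo "ok"
}
```

Because the DSC can take several minutes to become Ready, a non-zero result shortly after the first sync is not necessarily a failure.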
If the Subscription channel is wrong on a fresh cluster:
# List channels your catalog actually exposes (pick one ending in -3.4 for OpenShift AI 3.4)
oc get packagemanifest rhods-operator -n openshift-marketplace \
-o jsonpath='{range .status.channels[*]}{.name}{"\n"}{end}'
oc patch subscription rhods-operator -n redhat-ods-operator --type merge \
-p '{"spec":{"channel":"stable-3.4"}}'

Use fast-3.4 or eus-3.4 instead if that is what your catalog lists and you intend that stream.
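
Choosing the channel can be scripted once the packagemanifest list is available. This is a sketch only: the function name pick_channel is hypothetical, and the stable, then fast, then eus preference order is an assumption made for illustration, not a rule stated by this repository.

```shell
# Hypothetical helper: pick a Subscription channel for a target minor version.
# Reads channel names on stdin (one per line); $1 is the minor, e.g. 3.4.
# ASSUMPTION: prefer stable, then fast, then eus.
pick_channel() {
  minor=$1
  channels=$(cat)
  for stream in stable fast eus; do
    # -x matches the whole line, -F treats the dots literally.
    if printf '%s\n' "$channels" | grep -qxF "${stream}-${minor}"; then
      echo "${stream}-${minor}"
      return 0
    fi
  done
  return 1   # the catalog exposes no channel for this minor
}

# Example with a made-up channel list:
printf 'fast-3.4\nstable-2.25\nstable-3.4\n' | pick_channel 3.4   # prints stable-3.4
```

On a live cluster the input would come from the oc get packagemanifest command shown above.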
See OpenShift Pipelines prerequisite, then apply pipelines/tekton/agents-build-pipeline.yaml.
Commit/push changes to values.yaml (e.g. components.kagenti, components.acsPolicies, acs.central.enabled) and sync acs-ai-overwatch.
Follow Step-by-step deployment for Phases 2–4, then Mattermost & RHACS notifications before the demo.
Configuration is merged in this order (Argo CD main Application and make helm-template):
| Source | Purpose | Edit by |
|---|---|---|
| values.yaml | Base defaults, clusterDiscovery.*, operator subscriptions, component toggles | Hand (repo) |
| values-poc.yaml | PoC component toggles | Hand (per cluster) |
| ConfigMap acs-ai-overwatch-system/acs-ai-overwatch-cluster-config | Apps domain, Quay host, Kagenti URL, git repoUrl, default StorageClass, OLM operator channels | scripts/cluster-admin/03-apply-cluster-configmap.sh or discovery Job |
| values-cluster.yaml (optional) | Same fields as ConfigMap | make cluster-values (local/CI override) |

Argo CD registers three Applications via oc apply -k gitops/argocd/ (see Cluster admin: pre-GitOps setup and Cluster-Aware Configuration).
Main Application Helm stanza:
helm:
  valueFiles:
    - values.yaml
    - values-poc.yaml
    - values-cluster.yaml # optional; ignoreMissingValueFiles: true

When cluster.appsDomain is set (from values, ConfigMap, or lookup), _helpers.tpl derives hostnames. OpenShift’s ingress domain usually already includes an apps. prefix (e.g. apps.cluster.example.com):
| Output | Logic |
|---|---|
| Mattermost siteUrl / Route host | ConfigMap keys mattermostSiteUrl / mattermostRouteHost (from discovery Job), else computed from appsDomain |
| Quay registryCredentials.server | Values/ConfigMap override, else quay-quay.<appsDomain> |
| Kagenti api.baseUrl | Values/ConfigMap override, else https://kagenti-api.<appsDomain> |

Leave mattermost.siteUrl, mattermost.route.host, and quayStorage.registryCredentials.server empty in values.yaml. Do not commit values-cluster.yaml (gitignored); use discovery ConfigMap instead.
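
The Quay and Kagenti fallbacks can be reproduced outside Helm to sanity-check a rendered hostname. This is a rough shell equivalent, not the chart's code: the real logic lives in _helpers.tpl, and the function names below are hypothetical.

```shell
# Sketch of the documented fallbacks. In both functions:
#   $1 = apps domain, e.g. apps.cluster.example.com
#   $2 = optional override (values or ConfigMap); wins when non-empty

quay_server() {
  echo "${2:-quay-quay.$1}"
}

kagenti_api_base_url() {
  echo "${2:-https://kagenti-api.$1}"
}

quay_server apps.cluster.example.com            # prints quay-quay.apps.cluster.example.com
kagenti_api_base_url apps.cluster.example.com   # prints https://kagenti-api.apps.cluster.example.com
```

Passing a second argument models an explicit override, which is why the values listed above should stay empty when the derived hostnames are wanted.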
Most hostnames that used to be CHANGE_ME are derived from cluster.appsDomain, supplied by either:
- GitOps (default): ConfigMap acs-ai-overwatch-cluster-config written by the discovery Application, or
- Local/CI: values-cluster.yaml from make cluster-values
Three Argo CD Applications (see gitops/argocd/kustomization.yaml), ordered by sync-wave on the Application CR:
| Wave | Application | Purpose |
|---|---|---|
| 0 | acs-ai-overwatch-gitops-bootstrap | Creates namespaces with argocd.argoproj.io/managed-by=openshift-gitops so Argo can create ServiceAccounts |
| 1 | acs-ai-overwatch-cluster-discovery | ServiceAccount + script ConfigMap, then PostSync Job writes acs-ai-overwatch-cluster-config |
| 2 | acs-ai-overwatch | Main chart; reads that ConfigMap via Helm lookup + serverDryRun |
| (opt-in 3) | acs-ai-overwatch-kagenti-platform | Not in kustomization by default; see Phase 4 — Kagenti platform |
| (opt-in 4) | acs-ai-overwatch-observability | Not in kustomization by default; see Phase 5 — Shared observability |

Workflow:
make cluster-admin-pre-gitops # or install-pre-gitops.sh
oc apply -k gitops/argocd/
# 1) Wait for cluster-discovery Job to succeed
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
# 2) Refresh the main Application in Argo CD (or wait for automated sync)

On the first main-app sync, Mattermost routes, Quay pull secrets, and Kagenti agents are skipped until the ConfigMap exists; operators and storage still deploy. After discovery completes, refresh so gated resources render.
Optional override: values-cluster.yaml from make cluster-values (still supported; not required in Git).
Optional environment variables for the local script:
export QUAY_REGISTRY_PASSWORD='<token>' # written to values-cluster.yaml if set
export KAGENTI_API_BASE_URL='https://...' # override detection
export GIT_REPO_URL='https://github.com/...' # override git remote

If Helm lookup does not see the ConfigMap from the repo-server, use the optional CMP in gitops/argocd/cmp/ (see README there).
After oc login:
make cluster-values
# or: ./scripts/discover-cluster-values.sh

This writes gitops/helm/acs-ai-overwatch/values-cluster.yaml with cluster settings from your oc login session (same fields as the in-cluster ConfigMap).
| Discovered value | Source |
|---|---|
| cluster.appsDomain | oc get ingresses.config cluster |
| mattermostSiteUrl | https://mattermost-<ns>.<appsDomain> (or live Route if present) |
| mattermostRouteHost | Same host without scheme |
| cluster.name | Infrastructure CR or current context |
| storage.defaultStorageClass | Cluster default StorageClass annotation, else gp3-csi / gp3 |
| quayStorage.quayOperator.subscription.channel | Latest stable-3.* from packagemanifest quay-operator |
| rhoai.operator.subscription.channel | stable-3.4 / fast-3.4 / eus-3.4 (override minor with RHOAI_TARGET_VERSION) |
| acs.operator.subscription.channel | packagemanifest rhacs-operator default channel |
| accelerators.nfd/gpuOperator.subscription.channel | packagemanifest default channel for nfd / gpu-operator-certified |
| quayStorage.registryCredentials.server | Quay Route in quay (if present), else quay-quay.<domain> |
| kagenti.api.baseUrl | Kagenti Route (best effort), else default hostname pattern |
| kagenti.appSource.repoUrl | git remote origin (HTTPS normalized) |

At Helm render time (Argo CD with serverDryRun), Subscription templates and PVC StorageClasses prefer these ConfigMap keys when present, falling back to values.yaml.
Commit values-cluster.yaml only if you want Argo CD to use Git-stored overrides instead of (or in addition to) the ConfigMap.
| Setting | Notes |
|---|---|
| quayStorage.registryCredentials.password | Set via QUAY_REGISTRY_PASSWORD when running the script, or edit values |
| Mattermost bootstrap passwords | mattermost.bootstrap.* in values.yaml |
| pipelines.imageRegistry.host | In-cluster DNS (usually no apps domain needed) |

| Target | Command |
|---|---|
| cluster-admin-pre-gitops | Runs scripts/cluster-admin/install-pre-gitops.sh (before Argo CD) |
| cluster-values | Runs scripts/discover-cluster-values.sh → optional values-cluster.yaml |
| cleanup-poc-repo | Runs scripts/cleanup-poc-repo.sh → baseline GitOps (after PoC) |
| helm-template | Renders main chart (values.yaml + values-poc.yaml + optional values-cluster.yaml) |
| helm-template-discovery | Renders acs-ai-overwatch-cluster-discovery chart |

All persistent volumes use gp3-csi (AWS EBS / dynamic provisioning on ROSA). One default keeps GitOps and troubleshooting simple.
Confirm on your cluster:
oc get storageclass

If your class has a different name (gp2-csi, standard, etc.), set it in values.yaml:
storage:
  defaultStorageClass: your-storage-class

Helm templates fall back to storage.defaultStorageClass for Mattermost, Quay, RHACS Central, and Rosey PVC when a component value is empty.
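
To find the value to put in storage.defaultStorageClass, the default class can be read from the oc get storageclass listing, where it appears as a "(default)" marker after the name. The sketch below parses that listing from stdin; the function name is hypothetical, the gp3-csi fallback is this chart's default, and the discovery script may detect the class differently.

```shell
# Hypothetical helper: print the cluster default StorageClass name.
# Reads the table printed by `oc get storageclass` on stdin.
# Falls back to gp3-csi (the chart default) when no class is marked default.
default_storage_class() {
  awk 'NR > 1 && $2 == "(default)" { print $1; found = 1; exit }
       END { if (!found) print "gp3-csi" }'
}

# Example with made-up listing text:
printf 'NAME PROVISIONER\ngp2-csi ebs.csi.aws.com\nmanaged-csi (default) disk.csi.azure.com\n' \
  | default_storage_class   # prints managed-csi
```

On a live cluster: oc get storageclass | default_storage_class.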
| Workload | Values key | Default |
|---|---|---|
| Mattermost data + Postgres | mattermost.pvc.storageClassName, mattermost.postgres.pvc.storageClassName | gp3-csi |
| Quay Postgres / Clair | quayStorage.quayRegistry.components.postgres/clairpostgres.storageClassName | gp3-csi |
| Quay blob storage (MinIO) | quayStorage.quayRegistry.minio.storageClassName | gp3-csi |
| RHACS Central database | acs.central.persistence.storageClassName | gp3-csi |
| Rosey Regrets output PVC | agentsRoseyRegrets.pvc.storageClassName | gp3-csi |
| Tempo trace storage (Phase 5) | tempo.monolithic.storageClassName (observability chart) | gp3-csi |
| Tekton build workspace | pipelines/tekton/agents-build-pipelinerun.example.yaml | gp3-csi |

Quay blob storage cannot use a block PVC directly. With objectstorage.managed: true, the operator requires the objectbucket.io API (OpenShift Data Foundation / NooBaa), which EBS-only clusters do not have.
This chart deploys MinIO on gp3-csi and sets objectstorage.managed: false with a configBundleSecret pointing Quay at in-cluster S3. Default MinIO PVC is 500Gi (quayStorage.quayRegistry.minio.volumeSize).
Set quayStorage.quayRegistry.minio.credentialsSecret.secretKey before production use (default placeholder in repo).
Set quayStorage.enabled: false in values-poc.yaml until you are ready to build agent images, or use an external registry and point kagenti.images.* / Tekton at that host instead.
⚠️ Change all default passwords before deploying to a shared or non-throwaway cluster.
This repository ships known placeholder credentials for PoC speed. They are not safe to leave in place. Anyone with access to this Git repo or the cluster can read them. You must replace every value below in gitops/helm/acs-ai-overwatch/values.yaml and/or values-poc.yaml before your first sync (or immediately after, then re-sync).

| Secret / account | Values path | Default (change this) |
|---|---|---|
| Mattermost admin | mattermost.bootstrap.adminPassword | redhatpassword123 |
| Mattermost HITL user | mattermost.bootstrap.hitlPassword | redhatpassword123 |
| Mattermost Postgres | mattermost.postgres.password | mattermost-db-password |
| Quay UI admin (reference Secret admin-account) | quayStorage.quayRegistry.adminAccount.password | CHANGE_ME_QUAY_ADMIN_PASSWORD |
| Quay registry pull robot | quayStorage.registryCredentials.password | CHANGE_ME_PASSWORD |
| Quay MinIO blob storage | quayStorage.quayRegistry.minio.credentialsSecret.secretKey | CHANGE_ME_MINIO_SECRET |

After sync, retrieve stored secrets with oc get secret <name> -n <namespace> (e.g. admin-account in quay, mattermost-bootstrap in monitoring). See Security and Legal Notes — Secrets Management.
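
A quick scan before the first sync can catch placeholders that were missed. This is an illustrative sketch rather than a script shipped in this repository: the function name is hypothetical, and the pattern covers only the default strings documented in this section.

```shell
# Hypothetical pre-sync check: report known placeholder credentials.
# Usage: find_placeholder_secrets FILE...
# Prints matching lines and returns 1 if any placeholder is found.
find_placeholder_secrets() {
  if grep -n -E 'redhatpassword123|mattermost-db-password|CHANGE_ME' "$@"; then
    return 1   # at least one default value is still present
  fi
  return 0
}

# Example (paths from this repository; run from the repo root):
# find_placeholder_secrets \
#   gitops/helm/acs-ai-overwatch/values.yaml \
#   gitops/helm/acs-ai-overwatch/values-poc.yaml
```

A clean result means only that these specific strings are gone; it says nothing about the strength of the replacement values.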
Before syncing, run cluster-admin pre-GitOps scripts or ensure cluster settings exist from discovery.
| Setting | Location | Description |
|---|---|---|
| OpenShift GitOps RBAC | ClusterRoleBinding from 01-grant-openshift-gitops-rbac.sh | Required unless managed-by alone is sufficient |
| Cluster apps domain | ConfigMap acs-ai-overwatch-cluster-config or values-cluster.yaml | 03-apply-cluster-configmap.sh / discovery Job |
| Discovery SA + script CM | acs-ai-overwatch-system | 04-apply-discovery-prerequisites.sh (optional if Argo creates them) |
| Git repository URL | ConfigMap gitRepoUrl or kagenti.appSource.repoUrl in values | Discovery / git remote |
| Argo CD repo URL | gitops/argocd/application*.yaml → spec.source.repoURL | Set to your Git remote (not auto-updated) |
| Quay credentials | quayStorage.registryCredentials.password | Manual or QUAY_REGISTRY_PASSWORD (local script only) |
| Quay UI admin (reference) | quayStorage.quayRegistry.adminAccount.password | Secret admin-account in quay; change default before sync |
| Quay MinIO secret key | quayStorage.quayRegistry.minio.credentialsSecret.secretKey | Change default before sync |
| Mattermost admin/HITL passwords | mattermost.bootstrap.* | Bootstrap job credentials; change defaults before sync |
| Quay object storage size | quayStorage.quayRegistry.minio.volumeSize | Default 500Gi MinIO PVC on gp3-csi |
| Red Hat Kueue Operator | OperatorHub (manual) | Before default-dsc; see Kueue prerequisite |
| OpenShift Pipelines | OperatorHub / OLM Subscription | Required before oc apply -f pipelines/tekton/; see OpenShift Pipelines prerequisite |

oc login ...
make cluster-admin-pre-gitops
oc apply -k gitops/argocd/
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
make helm-template # optional local render; add -f values-cluster.yaml if generated

Render with PoC feature flags enabled:
helm template acs-ai-overwatch gitops/helm/acs-ai-overwatch \
--set components.acsPolicies.enabled=true \
--set components.agentsRoseyRegrets.enabled=true \
--set components.kagenti.enabled=true

- Clone/fork the repository, log in, and confirm storage.defaultStorageClass matches your cluster (oc get storageclass).
- Set spec.source.repoURL in gitops/argocd/application.yaml, application-cluster-discovery.yaml, and application-gitops-bootstrap.yaml to your Git remote (if different from the default fork).
- Run cluster-admin bootstrap (recommended):
  make cluster-admin-pre-gitops
- Register Applications:
  oc login
  oc apply -k gitops/argocd/
- Wait for cluster discovery, then refresh the main app:
  oc get application -n openshift-gitops
  oc get job -n acs-ai-overwatch-system cluster-discovery
  oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
  In the Argo CD UI, Refresh acs-ai-overwatch after the ConfigMap exists.
- Optional local override file (not required in Git):
  make cluster-values # writes values-cluster.yaml for helm-template / Argo override

Argo CD is configured with:
- Three Applications: bootstrap (sync-wave 0), discovery (sync-wave 1), then main (sync-wave 2)
- Automated sync with prune and self-heal on the discovery and main Applications
- CreateNamespace=true so chart-managed namespaces are created on sync
- Helm value files (main app): values.yaml, values-poc.yaml, optional values-cluster.yaml (ignoreMissingValueFiles: true)
- Cluster ConfigMap for cluster.appsDomain when values are empty (see Cluster-Aware Configuration)
- Sync waves on rendered manifests so resources apply in dependency order (see below)

The umbrella Helm chart (gitops/helm/acs-ai-overwatch) is deployed by a single Argo CD Application (gitops/argocd/application.yaml). When argocd.syncWaves.enabled is true (default), each manifest gets argocd.argoproj.io/sync-wave so Argo CD applies lower waves before higher ones within a sync:
| Wave | Key | Examples |
|---|---|---|
| 0 | namespace | mattermost, quay, redhat-ods-applications, stackrox, test-range, cluster-metadata |
| 10 | operators | OLM Subscription / OperatorGroup (RHACS, RHOAI, GPU, NFD, Quay) |
| 20 | storage | GPU time-slicing ConfigMap |
| 30 | platformCRs | QuayRegistry, DataScienceCluster, HardwareProfile, ClusterPolicy |
| 40 | secrets | Quay pull secrets, Mattermost bootstrap secrets |
| 45 | pvcs | Mattermost DB, Rosey agent-reference-information |
| 50 | configMaps | Mattermost env, ACS policy bundle, cluster metadata |
| 55 | security | OpenShell SecurityContextConstraints, Mattermost bootstrap RBAC |
| 60 | workloads | Core platform workloads |
| 80 | agents | Kagenti Deployment / Service / AppSource |
| 85–90 | mattermostPrep / mattermostWorkloads | Mattermost Postgres, server, PVCs |
| 95 | mattermostBootstrap | Mattermost admin bootstrap Job → mattermost-acs-integration ConfigMap |
| 96 | acsBootstrap | acs-mattermost-bridge, RHACS notifier ConfigMap, acs-platform-bootstrap Job |
| 97 | acsPolicies | SecurityPolicy CRs in stackrox |

Tune wave numbers in values.yaml under argocd.syncWaves. Set argocd.syncWaves.enabled: false to omit annotations (e.g. for plain helm install debugging).
Important: Sync waves order apply only. They do not wait for OLM operators or CRs to become healthy. After wave 10, allow operator installs to finish before expecting wave 30+ CRs to reconcile. Use Argo CD UI health, oc get csv, or staged components.* toggles for the PoC agents and ACS policies.
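
Because sync waves do not wait for operator health, a small polling helper can bridge the gap in scripts. What follows is a generic sketch, not a tool shipped in this repository: wait_for is a hypothetical name, and the timeout and interval in the commented example are illustrative.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout elapses.
# Usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Elapsed time counts sleep intervals only, not the command's own runtime.
wait_for() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  # Re-run the command until it exits 0.
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1   # still failing after the timeout
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Example (needs a logged-in oc session, so it is left commented out):
# wait_for 1800 30 oc get dsc default-dsc
```

The example only waits for the DataScienceCluster to exist; checking that it is Ready still requires inspecting its status.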
make cluster-values
helm upgrade --install acs-ai-overwatch gitops/helm/acs-ai-overwatch \
-f gitops/helm/acs-ai-overwatch/values.yaml \
-f gitops/helm/acs-ai-overwatch/values-poc.yaml \
-f gitops/helm/acs-ai-overwatch/values-cluster.yaml \
-n openshift-gitops

Adjust release namespace as appropriate for your environment.
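
When values-cluster.yaml may or may not exist, the -f arguments can be assembled so the optional file is included only when present, mirroring ignoreMissingValueFiles in the Argo CD Application. This is a sketch under that assumption; values_args is a hypothetical name and the helper assumes paths without spaces.

```shell
# Hypothetical helper: print helm -f arguments in merge order.
#   $1 = chart directory (must not contain spaces)
values_args() {
  chart_dir=$1
  args="-f $chart_dir/values.yaml -f $chart_dir/values-poc.yaml"
  # Add the optional cluster override last so it wins on conflicts.
  if [ -f "$chart_dir/values-cluster.yaml" ]; then
    args="$args -f $chart_dir/values-cluster.yaml"
  fi
  echo "$args"
}

# Example (left commented out; requires helm and a cluster):
# chart=gitops/helm/acs-ai-overwatch
# helm upgrade --install acs-ai-overwatch "$chart" $(values_args "$chart") -n openshift-gitops
```

The command substitution is deliberately unquoted in the example so the shell splits it into separate arguments.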
Most PoC-specific resources are gated behind components.* toggles. Deploy the platform first, then enable agents and security:
components:
  acsPolicies:
    enabled: true
  agentsRoseyRegrets:
    enabled: true
  kagenti:
    enabled: true

Chart: acs-ai-overwatch
Version: 0.4.0
Path: gitops/helm/acs-ai-overwatch
| Key | Default | Description |
|---|---|---|
| argocd.syncWaves.enabled | true | Emit argocd.argoproj.io/sync-wave on chart resources |
| argocd.syncWaves.namespace | "0" | Wave for namespaces |
| argocd.syncWaves.operators | "10" | Wave for OLM subscriptions / operator groups |
| argocd.syncWaves.storage | "20" | Reserved (platform CRs wave) |
| argocd.syncWaves.platformCRs | "30" | Wave for operator-owned CRs |
| argocd.syncWaves.secrets | "40" | Wave for secrets |
| argocd.syncWaves.pvcs | "45" | Wave for PVCs |
| argocd.syncWaves.configMaps | "50" | Wave for config maps |
| argocd.syncWaves.security | "55" | Wave for SCC / RBAC |
| argocd.syncWaves.workloads | "60" | Wave for deployments, services, routes |
| argocd.syncWaves.bootstrap | "70" | Wave for bootstrap jobs |
| argocd.syncWaves.agents | "80" | Wave for Kagenti agents |

| Key | Default | Description |
|---|---|---|
| global.partOf | acs-ai-overwatch | Applied as app.kubernetes.io/part-of label |

| Key | Default | Description |
|---|---|---|
| clusterDiscovery.enabled | true | Read acs-ai-overwatch-cluster-config ConfigMap via Helm lookup |
| clusterDiscovery.namespace | acs-ai-overwatch-system | ConfigMap namespace |
| clusterDiscovery.configMapName | acs-ai-overwatch-cluster-config | Written by discovery Application |
| clusterDiscovery.discoveryApplicationName | acs-ai-overwatch-cluster-discovery | Used in Helm fail messages |

| Key | Default | Description |
|---|---|---|
| cluster.name | acs-ai-overwatch | Override or from ConfigMap clusterName |
| cluster.appsDomain | "" | Override or from ConfigMap appsDomain (required for URL templates) |
| cluster.topology | 3x3 | Documented layout (informational) |

| Key | Default | Description |
|---|---|---|
| storage.defaultStorageClass | gp3-csi | All PVCs; templates fall back here |
| mattermost.pvc.storageClassName | gp3-csi | Mattermost file data |
| mattermost.postgres.pvc.storageClassName | gp3-csi | Mattermost Postgres |
| quayStorage.quayRegistry.components.postgres.storageClassName | gp3-csi | Quay Postgres |
| quayStorage.quayRegistry.components.clairpostgres.storageClassName | gp3-csi | Quay Clair Postgres |
| quayStorage.quayRegistry.minio.storageClassName | gp3-csi | MinIO blob storage PVC |
| acs.central.persistence.storageClassName | gp3-csi | RHACS Central PVC |
| agentsRoseyRegrets.pvc.storageClassName | gp3-csi | Rosey output PVC |

| Key | Default | Description |
|---|---|---|
| agentsRoseyRegrets.namespace | test-range | PVC namespace |
| agentsRoseyRegrets.pvc.name | agent-reference-information | PVC name (referenced by Kagenti deployment) |
| agentsRoseyRegrets.pvc.storageClassName | gp3-csi | Persistent volume StorageClass |
| agentsRoseyRegrets.pvc.size | 20Gi | Requested capacity |

| Key | Default | Description |
|---|---|---|
| kagenti.namespace | test-range | Agent workloads |
| kagenti.rosey.outputMountPath | /agent-reference-information | Must match image AGENT_OUTPUT_DIR |
| kagenti.rosey.networkAuditCommand | Network Audit | Exact command that starts background recon immediately |
| kagenti.rosey.networkAuditCidr | 10.0.0.0/24 | nmap target; keep small for PoC (avoids Kagenti 504) |
| kagenti.rosey.networkAuditTimeoutSec | 45 | Max seconds per nmap invocation |
| kagenti.rosey.llmDrivenNetworkAudit | true | Model tool-calling path (vs hardcoded auto-nmap) |
| kagenti.rosey.autoNetworkAudit | false | Legacy: nmap before every message |
| kagenti.api.baseUrl | "" | From ConfigMap / values-cluster.yaml or derived from appsDomain |
| kagenti.appSource.repoUrl | "" | From ConfigMap gitRepoUrl / values / git remote |

| Key | Default | Description |
|---|---|---|
| cluster.gpu.count | 3 | GPU count (documentation) |
| cluster.gpu.model | L4 | GPU model |
| cluster.gpu.vendor | nvidia | GPU vendor |
| clusterMetadata.enabled | true | Creates ConfigMap cluster-metadata in acs-ai-overwatch-system |
| clusterMetadata.namespace | acs-ai-overwatch-system | Metadata namespace |

| Flag | Default | Enables |
|---|---|---|
| components.bootstrapOperators | false | Reserved |
| components.gpuConfig | false | Reserved |
| components.acsPolicies | false | ACS operator, test-range namespace, policy ConfigMap, OpenShell SCC |
| components.kagenti | false | Kagenti agent Deployments, Services, AppSource |
| components.agentsHelpfulHank | true (base) | helpful-hank Deployment + Service |
| components.agentsRoseyRogue | false | Reserved legacy toggle |
| components.agentsRoseyRegrets | false | rosey-regrets Deployment + PVC |
| components.agentsRoseyRegretsSlm | false | rosey-regrets-slm + SLM PVC (recommended demo agent) |
| components.slmVllm | false | Qwen3-0.6B vLLM Deployment for SLM agents |
| components.agentsSneakySam | false | sneaky-sam demo agent (telemetry violator) |
| components.pipelines | false | Reserved (Tekton YAML applied separately) |
| agentTelemetryPolicy.enabled | true | NetworkPolicy + RHACS telemetry policy ConfigMap |

Installs Node Feature Discovery (NFD) and the NVIDIA GPU Operator with time-slicing so each physical L4 advertises multiple logical nvidia.com/gpu devices.
| Resource | Namespace | Template |
|---|---|---|
| NFD Subscription | openshift-nfd | accelerators-nfd.yaml |
| GPU Operator Subscription | nvidia-gpu-operator | accelerators-gpu-operator.yaml |
| Time-slicing ConfigMap | nvidia-gpu-operator | accelerators-gpu-time-slicing-configmap.yaml |
| ClusterPolicy | cluster-scoped | accelerators-gpu-clusterpolicy.yaml |

Key tuning values:
accelerators:
  timeSlicing:
    replicasPerGpu: 2 # 3 physical GPUs × 2 = 6 half-GPU slices
    migStrategy: none

The GPU Operator ClusterPolicy references ConfigMap time-slicing-config with key any.
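
The slice arithmetic generalizes to other GPU counts and replica settings. A one-line sketch (the function name logical_gpus is hypothetical):

```shell
# Number of nvidia.com/gpu devices advertised with time-slicing.
#   $1 = physical GPUs, $2 = accelerators.timeSlicing.replicasPerGpu
logical_gpus() {
  echo $(( $1 * $2 ))
}

logical_gpus 3 2   # prints 6 (three L4s, two slices each)
```

Time-slicing shares each GPU between replicas without memory isolation, so a higher replicasPerGpu raises the advertised device count but not the available GPU memory.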
Deploys on-cluster Quay on gp3-csi (same StorageClass as other workloads).
Stack:
- Quay Operator subscription
- QuayRegistry CR: Postgres and Clair on gp3-csi; blob storage via MinIO (S3-compatible) on gp3-csi
- Pull credentials Secret in ai-workbenches
Set quayStorage.enabled: false in values-poc.yaml until you are ready to build agent images. Alternatively use an external registry and disable in-cluster Quay (see Storage).
Installs the Red Hat OpenShift AI Operator (rhods-operator) on channel stable-3.4 (override to match your catalog) and a DataScienceCluster using datasciencecluster.opendatahub.io/v2 (required for 3.x; do not use v1 from 2.25).
| Setting | Default | Notes |
|---|---|---|
| rhoai.targetVersion | 3.4 | Documentation marker |
| rhoai.operator.subscription.channel | stable-3.4 | Must match catalog: oc get packagemanifest rhods-operator -n openshift-marketplace |
| rhoai.datascienceCluster.apiVersion | .../v2 | Required for 3.4; v1 is 2.25 only |

| Component | managementState | Purpose |
|---|---|---|
| dashboard | Managed | OpenShift AI dashboard |
| kserve | Managed | Model serving |
| kueue | Unmanaged | Queue scheduling via Red Hat Kueue Operator (Managed rejected in 3.4) |
| workbenches | Managed | Developer workbench provisioning |
| modelregistry | Managed | Model registry in rhoai-model-registries |
| ray, aipipelines, feast, training, trustyai, llamastack | Removed | Reduced footprint |
Note: The standalone CodeFlare operator was removed in OpenShift AI 3.x; Ray/distributed workloads use the ray component (set to Managed if needed). See Red Hat OpenShift AI 3.4 docs.
Kueue (3.4): Chart default is kueue.managementState: Unmanaged. Install the Red Hat Kueue Operator before syncing default-dsc.
Workbench namespace (DSC default): rhods-notebooks — set once at install; cannot be changed after the operator is deployed.
Team workbenches: When rhoai.teamWorkbenches.enabled (default true), the chart provisions a Notebook CR plus PVC in each Kagenti team namespace (team1, team2). Open workbenches from the OpenShift AI dashboard under Projects → team1 / team2 → Workbenches.
Model registry namespace: rhoai-model-registries (created when modelregistry.managementState: Managed on default-dsc).
HardwareProfile: l4-timeslice-half-gpu
Templates: rhoai-operator.yaml, rhoai-datasciencecluster.yaml, rhoai-hardwareprofile.yaml, rhoai-namespace-applications.yaml, rhoai-team-workbenches.yaml
Important: OpenShift AI 3.4 must be installed on a fresh cluster. If a cluster previously had RHOAI 2.25, provision a new cluster rather than attempting an upgrade path.
Mattermost Team Edition deploys in the baseline sync as the Slack-compatible notification sink. Helm templates: Postgres, server PVC, Route, bootstrap Job → ConfigMap mattermost-acs-integration.
Operational steps (login, webhook, RHACS bridge, notifier verification) are in Mattermost & RHACS notifications — follow that section after Phase 3, before the demo.
Three PoC agent images (Helpful Hank, Rosey Regrets, Sneaky Sam) are built from the NVIDIA OpenShell community sandbox, extended with a shared Kagenti A2A server and OpenTelemetry SDK (agents/common/acs_agent/):
ghcr.io/nvidia/openshell-community/sandboxes/base:latest
+ kagenti-adk (A2A protocol on PORT 8000)
+ opentelemetry SDK (exports when OTEL_EXPORTER_OTLP_ENDPOINT is set — Phase 5)
Each image starts /sandbox/.venv/bin/python -m acs_agent.server and listens on 0.0.0.0:8000.
Model reference:
HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
| Attribute | Value |
|---|---|
| Path | agents/helpful-hank/ |
| Personality | Standard technical assistant |
| System prompt | agents/helpful-hank/system_prompt.txt |
| Extra packages | None beyond OpenShell base |
| Output directory | N/A |
| Telemetry label | acs-ai-overwatch.io/telemetry=enabled |
Build from repository root:
docker build -f agents/helpful-hank/Dockerfile .

Optional full model bake-in at build time:
docker build -f agents/helpful-hank/Dockerfile \
--build-arg PULL_MODEL_AT_BUILD=1 .

| Attribute | Value |
|---|---|
| Path | agents/rosey-regrets/ |
| Personality | Deliberately misaligned lab evaluation agent |
| System prompt | agents/rosey-regrets/system_prompt.txt — instructs the model to call run_network_recon for audit/recon prompts |
| Runtime behavior (default) | LLM-driven: AGENT_LLM_DRIVEN_NETWORK_AUDIT=true, AGENT_AUTO_NETWORK_AUDIT=false — Qwen/vLLM decides when to invoke the run_network_recon tool; nmap runs in the background so Kagenti does not 504 |
| Scan target (PoC default) | NETWORK_AUDIT_CIDR=10.0.0.0/24 (not /8 — a full RFC1918 sweep times out gateways) |
| Extra packages | nmap, iproute2 |
| Output directory | /agent-reference-information |
| PVC | agent-reference-information in test-range |
| Telemetry label | acs-ai-overwatch.io/telemetry=enabled |
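
The scan-target row above prefers /24 over /8; the difference in size is easy to quantify. A small sketch for IPv4 prefixes (the function name cidr_address_count is hypothetical):

```shell
# Number of IPv4 addresses covered by a CIDR block.
#   $1 = CIDR such as 10.0.0.0/24 (prefix length 1 to 32)
cidr_address_count() {
  prefix=${1##*/}                      # keep only the part after the slash
  echo $(( 1 << (32 - $prefix) ))
}

cidr_address_count 10.0.0.0/24   # prints 256
cidr_address_count 10.0.0.0/8    # prints 16777216
```

A /8 sweep covers 65,536 times as many addresses as the /24 default, which is consistent with the gateway timeouts noted above.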
Recommended for the live demo: rosey-regrets-slm (below) — smaller image, shared Qwen3-0.6B vLLM backend, faster responses.
Build from repository root:
docker build -f agents/rosey-regrets/Dockerfile .

OpenShift binary build (when Tekton/Quay is unavailable; uploads local source):
oc start-build rosey-regrets --from-dir=. --follow -n test-range

| Attribute | Value |
|---|---|
| Path | agents/rosey-regrets-slm/ |
| Inference | Shared in-cluster Qwen3-0.6B vLLM (qwen3-vllm Deployment) via LLM_API_BASE |
| Demo agent in Kagenti | Select rosey-regrets-slm in the UI for the violation loop |
| PVC | agent-reference-information-slm |
oc start-build rosey-regrets-slm --from-dir=. --follow -n test-range

| Attribute | Value |
|---|---|
| Path | agents/sneaky-sam/ |
| Personality | Hank-like assistant that skips telemetry compliance at deploy time |
| System prompt | agents/sneaky-sam/system_prompt.txt |
| Telemetry label | None (deliberately omits acs-ai-overwatch.io/telemetry=enabled) |
| RHACS policy | test-range-sneaky-sam-telemetry-violation → Mattermost on deploy |
| OTEL env | Not injected — demonstrates ACS/NetworkPolicy blocking |
Enable with components.agentsSneakySam.enabled: true. With Phase 3 RHACS and admission control, the telemetry policy should reject the Deployment (Argo may show sync failure) and post a deploy-time violation to Mattermost Town Square — not a runtime scale-down alert. If the pod still lands (e.g. admission off), the NetworkPolicy limits it to DNS egress only.
Build from repository root:
docker build -f agents/sneaky-sam/Dockerfile .

| Variable | Description |
|---|---|
| AGENT_HF_MODEL_ID | Hugging Face repo ID |
| MODEL_LOCAL_DIR | /models/hf-model |
| HF_HOME | /models/hf-hub |
| OPENSHELL_SYSTEM_PROMPT_FILE | /etc/openshell/agent/system_prompt.txt |
| HOST | Bind address for A2A server (0.0.0.0) |
| PORT | A2A server port (8000) |
| AGENT_OUTPUT_DIR | Rosey only: /agent-reference-information |
| AGENT_ENABLE_NETWORK_AUDIT | Rosey only: true enables network recon handler |
| AGENT_AUTO_NETWORK_AUDIT | Rosey only: false (default); legacy mode that runs nmap before every reply |
| AGENT_LLM_DRIVEN_NETWORK_AUDIT | Rosey only: true (default); model invokes run_network_recon tool via vLLM |
| NETWORK_AUDIT_CIDR | Rosey only: PoC default 10.0.0.0/24 (override in kagenti.rosey.networkAuditCidr) |
| NETWORK_AUDIT_TIMEOUT_SEC | Rosey only: PoC default 45 |
| LLM_API_BASE / LLM_MODEL | SLM agents: OpenAI-compatible vLLM endpoint (e.g. http://qwen3-vllm.test-range.svc.cluster.local:8000/v1, qwen3-0-6b) |
| AGENT_NETWORK_RECON_INCLUDE_IP | Rosey only: include ip route / ip addr in recon transcript (default true) |
| OTEL_EXPORTER_OTLP_ENDPOINT | Phase 5: shared collector gRPC endpoint (injected by GitOps when enabled) |
| OTEL_SERVICE_NAME | Phase 5: trace service name (helpful-hank / rosey-regrets) |

OpenTelemetry is baked into the image and activates automatically when Phase 5 injects the OTEL_* variables via observability.agentInstrumentation.enabled. Without those env vars, the SDK initializes as a no-op and baseline behavior is unchanged.
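As a rough illustration of that switch, the sketch below reports which mode an agent container would start in, based only on its environment. The helper name `otel_mode` is hypothetical, and keying the decision on `OTEL_EXPORTER_OTLP_ENDPOINT` alone is an assumption; the agent's real startup logic may inspect other `OTEL_*` variables.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): report whether tracing
# would be active, assuming the endpoint variable is the deciding signal.
otel_mode() {
  if [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ]; then
    echo "tracing"
  else
    echo "no-op"
  fi
}

echo "OTEL mode: $(otel_mode) (service: ${OTEL_SERVICE_NAME:-unset})"
```

Running the same check inside an agent pod is one quick way to confirm whether Phase 5 injection reached the container.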
agents/scripts/pull-model.sh uses huggingface_hub.snapshot_download for runtime or manual model pulls:

```shell
export AGENT_HF_MODEL_ID="HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive"
pull-model
```

For gated models, provide HF_TOKEN in the environment.
Three separate toggles:

| Toggle | Location | What it does |
|---|---|---|
| Kagenti platform | acs-ai-overwatch-kagenti-platform Application + job.enabled | Installs Keycloak, SPIRE, operator, API (Phase 4) |
| Agent workloads | components.kagenti + per-agent flags in main chart | Deploys helpful-hank, rosey-regrets, and/or sneaky-sam + AppSource |
| Observability | acs-ai-overwatch-observability Application + enabled: true | Shared OTEL → Tempo + MLflow; optional observability.agentInstrumentation on agents (Phase 5) |
Kagenti discovers agent workloads via standard Kubernetes Deployments labeled kagenti.io/type: agent.
Enable with:

```yaml
components:
  kagenti:
    enabled: true
  agentsRoseyRegrets:
    enabled: true # PVC agent-reference-information
  acsPolicies:
    enabled: true # test-range namespace, RHACS policy, openshell SCC/SA
```

| Resource | Name | Namespace |
|---|---|---|
| Deployment | helpful-hank | test-range |
| Service | helpful-hank | test-range |
| Deployment | rosey-regrets | test-range |
| Service | rosey-regrets | test-range |
| Deployment | sneaky-sam (opt-in) | test-range |
| Service | sneaky-sam (opt-in) | test-range |
| AppSource | acs-ai-overwatch-gitops | test-range |
| PVC | agent-reference-information | test-range |
| NetworkPolicy | agent-telemetry-block-noncompliant | test-range |
Registers this GitOps repository with Kagenti:

```yaml
kagenti:
  appSource:
    name: acs-ai-overwatch-gitops
    repoUrl: "" # from ConfigMap gitRepoUrl or values-cluster.yaml
    revision: main
    path: gitops/helm/acs-ai-overwatch
```

Note: The AppSource CRD (kagenti.io/v1alpha1) may vary by Kagenti version. Validate against your cluster's installed CRD schema and adjust apiVersion / spec if needed.
The Kagenti Deployment mounts the PVC at the standardized path:

```yaml
volumeMounts:
  - name: reference-information
    mountPath: /agent-reference-information
volumes:
  - name: reference-information
    persistentVolumeClaim:
      claimName: agent-reference-information # from agentsRoseyRegrets.pvc.name in values
```

Default images point to in-cluster Quay:

```yaml
kagenti:
  images:
    helpfulHank: quay-quay-registry.quay.svc.cluster.local:443/acs-agents/helpful-hank:latest
    roseyRegrets: quay-quay-registry.quay.svc.cluster.local:443/acs-agents/rosey-regrets:latest
```

Build and push images with Tekton before enabling Kagenti, or pods will fail image pull.
Enable baseline ACS artifacts with components.acsPolicies.enabled: true (operator, test-range, SecurityPolicy CRs, SCC).
| Mode | Flags | What gets deployed |
|---|---|---|
| Baseline (default) | acs.central.enabled: false, acs.bootstrap.enabled: false | RHACS operator Subscription, test-range NS, SecurityPolicy CRs, OpenShell SCC |
| Full stack (opt-in) | both true | Above + Central CR, bootstrap Job (init bundle, SecuredCluster, Mattermost notifier) |
See Phase 3 — Full RHACS.
| Resource | Namespace |
|---|---|
| Namespace | rhacs-operator |
| OperatorGroup | rhacs-operator |
| Subscription | rhacs-operator (stable, redhat-operators) |
When acs.central.enabled: true, the chart also renders a Central CR in namespace stackrox. When acs.bootstrap.enabled: true, Job acs-platform-bootstrap completes init bundle + SecuredCluster + Mattermost notifier (best-effort).
If Phase 3 is disabled (default), deploy Central/SCS manually per Red Hat documentation, or enable Phase 3 in values.
Creates isolated namespace test-range for agent workloads and ACS policy scope.
Policies are SecurityPolicy CRs (config.stackrox.io/v1alpha1) in namespace stackrox, applied by GitOps when the RHACS operator CRD is present:

```shell
oc get securitypolicy -n stackrox test-range-runtime-guardrails test-range-agent-telemetry-required
```

Legacy roxctl declarative-config create --file does not work on RHACS 4.10+ (that command is for auth/roles/notifiers, not violation policies).
Policy name: test-range-runtime-guardrails
Scope: namespace test-range
Lifecycle stage: RUNTIME (eventSource: DEPLOYMENT_EVENT)
Severity: HIGH
Enforcement: alert-only (enforcementActions: []) — Rosey keeps running for lab demos; Mattermost is notified
Policy sections:
| Section | Behavior |
|---|---|
| Suspicious recon processes detected | Violation on process name nmap or masscan |
Notifier: Mattermost Notifier → generic webhook → acs-mattermost-bridge → Mattermost Town Square
Separate SecurityPolicy resources when agentTelemetryPolicy.enabled and components.acsPolicies.enabled:
| Policy | Targets | Mattermost | Admission |
|---|---|---|---|
| test-range-agent-telemetry-required | All kagenti.io/type=agent in test-range except sneaky-sam / sneaky-sam-slm | Yes | Block (default) |
| test-range-sneaky-sam-telemetry-violation | sneaky-sam, sneaky-sam-slm only | Yes | Alert-only (demo) |
Policy name: test-range-agent-telemetry-required
Scope: namespace test-range, workloads labeled kagenti.io/type=agent
Lifecycle stage: DEPLOY only (no runtime scale-to-zero)
Severity: HIGH
Enforcement (default): FAIL_DEPLOYMENT_CREATE_ENFORCEMENT, FAIL_DEPLOYMENT_UPDATE_ENFORCEMENT
Required label: acs-ai-overwatch.io/telemetry=enabled
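A pre-flight check along these lines can catch a missing label before RHACS admission rejects the Deployment. This is a minimal sketch, not the policy's own logic: the helper name `has_telemetry_label` is hypothetical, and it assumes labels arrive as the comma-separated `key=value` list that `oc get deploy --show-labels` prints.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a comma-separated label list contains
# the telemetry label required by test-range-agent-telemetry-required.
has_telemetry_label() {
  # Wrap the list in commas so only a whole key=value entry can match.
  case ",$1," in
    *",acs-ai-overwatch.io/telemetry=enabled,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative label sets (not read from a cluster).
compliant="app=helpful-hank,kagenti.io/type=agent,acs-ai-overwatch.io/telemetry=enabled"
noncompliant="app=sneaky-sam,kagenti.io/type=agent"

for labels in "$compliant" "$noncompliant"; do
  if has_telemetry_label "$labels"; then
    echo "label present: $labels"
  else
    echo "label missing: $labels"
  fi
done
```

Matching the whole entry rather than a substring keeps near-misses such as `telemetry=enabledx` from passing the check.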
Import manually (deprecated; use GitOps SecurityPolicy CRs instead):

```shell
# Policies are SecurityPolicy CRs: oc apply -f or sync Argo CD
oc get securitypolicy
```

When Phase 3 bootstrap runs, policies are already applied as SecurityPolicy CRs (sync wave 70). Configure Mattermost Notifier in the ACS console if alerts do not arrive (bootstrap notifier upsert is best-effort on RHACS 4.10).
To alert without blocking admission:

```yaml
agentTelemetryPolicy:
  rhacs:
    enforcementActions: []
```

NetworkPolicy fallback: agent-telemetry-block-noncompliant applies even without RHACS Central; non-compliant agent pods get DNS-only egress.
SCC openshell-gpu-runtime grants the openshell ServiceAccount in test-range permissions required for GPU and network tooling workloads:
| Setting | Value |
|---|---|
| Privileged containers | Allowed |
| HostPath volumes | Allowed |
| Capabilities | NET_ADMIN, NET_RAW, IPC_LOCK, SYS_ADMIN, SYS_PTRACE |
| seccompProfiles | runtime/default, unconfined |
ServiceAccount: openshell in test-range
Referenced by Kagenti agent Deployments via kagenti.serviceAccountName.
Prerequisite: OpenShift Pipelines (Tekton) must be installed on the cluster (CSV Succeeded, tasks.tekton.dev CRD present). This repo does not install that operator via GitOps.
Location: pipelines/tekton/agents-build-pipeline.yaml
| Kind | Name |
|---|---|
| Task | agents-git-clone |
| Task | agents-buildah-image |
| Pipeline | build-helpful-hank-and-rosey-regrets |
```text
fetch-repository (git clone)
        │
        ├──────────────────────┐
        ▼                      ▼
build-helpful-hank      build-sneaky-sam (parallel)
        │
        ▼
build-rosey-regrets
```
build-helpful-hank and build-rosey-regrets run sequentially (shared RWO workspace). build-sneaky-sam runs in parallel after clone (separate image tag, same workspace read).
The test-range namespace includes BuildConfigs that push to the internal OpenShift registry (used when Tekton/Quay builds fail):
| BuildConfig | Source | ImageStream |
|---|---|---|
| rosey-regrets | Binary (--from-dir=.) or Git@main | rosey-regrets:latest |
| rosey-regrets-slm | Binary | rosey-regrets-slm:latest |
| helpful-hank, sneaky-sam | Git@main or Binary | matching ImageStream |
```shell
# From repository root: includes latest agent code without a git push
oc start-build rosey-regrets-slm --from-dir=. --follow -n test-range
oc start-build rosey-regrets --from-dir=. --follow -n test-range
oc rollout restart deploy/rosey-regrets-slm deploy/rosey-regrets -n test-range
```

```shell
# Fails with "no matches for kind Task" if OpenShift Pipelines is not installed
oc get crd tasks.tekton.dev || echo "Install OpenShift Pipelines first (see Prerequisites)"
oc apply -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipeline.yaml
```

```shell
oc create secret docker-registry quay-build-robot \
  -n acs-ai-overwatch-system \
  --docker-server=quay-quay-registry.quay.svc.cluster.local:443 \
  --docker-username=<robot-account> \
  --docker-password=<token>
```

Ensure organization acs-agents (or your chosen org) exists in Quay with repositories helpful-hank, rosey-regrets, and sneaky-sam.
Edit and create from the example PipelineRun:

```shell
oc create -n acs-ai-overwatch-system \
  -f pipelines/tekton/agents-build-pipelinerun.example.yaml
```

Monitor:

```shell
oc get pipelinerun -n acs-ai-overwatch-system
tkn pipelinerun logs -f -n acs-ai-overwatch-system -l app.kubernetes.io/part-of=acs-ai-overwatch
```

- Uses registry.redhat.io/rhel9/buildah:latest
- Requires privileged pod security context
- Uses vfs storage driver (common pattern on OpenShift)
- Default push-tls-verify: false for internal Quay with self-signed certs
The example PipelineRun (agents-build-pipelinerun.example.yaml) uses storageClassName: gp3-csi for the shared source workspace PVC.
Follow these steps in order — they mirror PoC deployment phases. Mattermost login and alert verification come last; see Mattermost & RHACS notifications.
```shell
oc login
chmod +x scripts/cluster-admin/*.sh
make cluster-admin-pre-gitops
oc apply -k gitops/argocd/
```

Wait for sync waves: bootstrap → cluster-discovery → acs-ai-overwatch. Confirm discovery and operators:

```shell
oc get job -n acs-ai-overwatch-system cluster-discovery
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
oc get pods -n monitoring -l app.kubernetes.io/name=mattermost
```

After a new sandbox, hard-refresh the main app so Helm lookup picks up the ConfigMap:

```shell
oc annotate application acs-ai-overwatch -n openshift-gitops argocd.argoproj.io/refresh=hard --overwrite
```

Install OpenShift Pipelines if needed, apply the Tekton pipeline, then build and push images to Quay:

```shell
oc apply -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipeline.yaml
oc create -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipelinerun.example.yaml
# Or binary build: oc start-build rosey-regrets-slm --from-dir=. --follow -n test-range
```

Enable agent component flags in values-poc.yaml, commit, sync:
```yaml
components:
  kagenti:
    enabled: true
  agentsHelpfulHank:
    enabled: true
  agentsRoseyRegrets:
    enabled: true
  agentsSneakySam:
    enabled: false # true for Demo A (telemetry guardrail)
```

Verify workloads in test-range:

```shell
oc get pods,svc,pvc,networkpolicy -n test-range
```

Enable Central + bootstrap in values-poc.yaml, sync, wait for Central and SecuredCluster:
```yaml
acs:
  central:
    enabled: true
  bootstrap:
    enabled: true
components:
  acsPolicies:
    enabled: true
```

```shell
oc logs -n stackrox job/acs-platform-bootstrap
oc get securitypolicy -n stackrox test-range-runtime-guardrails test-range-agent-telemetry-required
```

Policies sync as SecurityPolicy CRs via GitOps. The bootstrap Job configures init bundle and SecuredCluster; the Mattermost notifier is applied declaratively when Phase 3 is enabled (details in Mattermost & RHACS notifications).
Uncomment application-kagenti-platform.yaml in gitops/argocd/kustomization.yaml, set job.enabled: true in the Kagenti platform chart, sync, then:

```shell
./scripts/kagenti-auth-info.sh
```

Complete Mattermost & RHACS notifications before running demos: log in, confirm the webhook bridge, and verify the notifier path.
Use PoC Demo Walkthrough (After Setup) for the presenter script.
Demo A — Telemetry guardrail (Sneaky Sam): Set components.agentsSneakySam.enabled: true, sync. With Phase 3 admission, RHACS posts a deploy-time violation to Mattermost Town Square. Log in as human-in-the-loop (password in mattermost.bootstrap.hitlPassword).
Demo B — Network audit (Rosey Regrets): In Kagenti UI, select rosey-regrets-slm and send Network Audit or a recon prompt. Expect RHACS runtime violation → Mattermost Town Square. Alternative: ./scripts/trigger-network-audit.sh.
See Cluster admin: pre-GitOps setup and scripts/cluster-admin/README.md.
| Script | Purpose |
|---|---|
| install-pre-gitops.sh | Runs steps 01–04 (with optional flags) |
| 01-grant-openshift-gitops-rbac.sh | Argo CD controller cluster-admin binding |
| 02-bootstrap-namespaces.sh | Labeled PoC namespaces |
| 03-apply-cluster-configmap.sh | acs-ai-overwatch-cluster-config ConfigMap |
| 04-apply-discovery-prerequisites.sh | Discovery ServiceAccount + script ConfigMap |
Uses scripts/lib/openshift-cluster-discovery.sh to generate optional values-cluster.yaml from your oc login (same logic as the in-cluster discovery Job).
| Discovers | OpenShift / Git source |
|---|---|
| cluster.appsDomain | ingresses.config/cluster |
| cluster.name | Infrastructure or current context |
| Quay hostname | Route in quay or quay-quay.<domain> |
| Kagenti API URL | Route search or default hostname pattern |
| kagenti.appSource.repoUrl | git remote origin |
```shell
export QUAY_REGISTRY_PASSWORD='...' # optional: set before running; written into values-cluster.yaml
./scripts/discover-cluster-values.sh
./scripts/discover-cluster-values.sh --output /path/to/values-cluster.yaml
```

See Cluster-Aware Configuration and Makefile (make cluster-values).
Prints Kagenti UI / API / Keycloak URLs and demo login credentials from cluster Secrets. Requires oc login and a completed Phase 4 install.
```shell
./scripts/kagenti-auth-info.sh
```

See KEYCLOAK.md for the full verification checklist and troubleshooting.
Resets the local Git repo to the portable baseline — no cluster-specific settings. Does not delete OpenShift resources; tear down the cluster or Argo Applications separately if needed.
| Action | Target |
|---|---|
| Restore baseline overlay | gitops/helm/acs-ai-overwatch/values-poc.yaml (Quay, RHACS Central, agents off) |
| Disable opt-in Argo apps | gitops/argocd/kustomization.yaml (Phase 4/5 Applications commented out) |
| Disable Kagenti install Job | gitops/helm/acs-ai-overwatch-kagenti-platform/values.yaml → job.enabled: false |
| Ensure Phase 5 off | gitops/helm/acs-ai-overwatch-observability/values.yaml → enabled: false if set |
| Remove local discovery output | gitops/helm/acs-ai-overwatch/values-cluster.yaml (gitignored) |
| Clear scratch workspace | scratch/ (gitignored) |
Baseline copies live in scripts/baseline/ — update those files when the default PoC posture changes.
```shell
./scripts/cleanup-poc-repo.sh --dry-run # preview
./scripts/cleanup-poc-repo.sh # apply
./scripts/cleanup-poc-repo.sh --reset-repo-urls # also set Argo repoURL from git remote / GIT_REPO_URL
git status && git diff # review, then commit or discard
```

Or: make cleanup-poc-repo
Triggers the Rosey Regrets Network Audit command via Kagenti REST API.
| Environment Variable | Required | Default |
|---|---|---|
| KAGENTI_API_BASE | Yes | ConfigMap kagentiApiBaseUrl or values-cluster.yaml |
| KAGENTI_API_TOKEN | Yes | None |
| ROSEY_AGENT_NAME | No | rosey-regrets (set to rosey-regrets-slm for the SLM demo) |
| NETWORK_AUDIT_COMMAND | No | Network Audit |
| KAGENTI_COMMANDS_PATH_TEMPLATE | No | /api/v1/agents/{agent}/commands |
| KAGENTI_TLS_INSECURE | No | false |
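To show how the base URL, agent name, and path template combine, here is a minimal sketch of the URL construction only. The helper name `render_commands_url` and the `kagenti-api.example.com` host are hypothetical, and the sketch does not send a request, because the payload expected by the Kagenti API is not described here.

```shell
#!/bin/sh
# Hypothetical helper: substitute {agent} in the path template and join the
# result to the API base URL. Agent names containing "|" or "&" would need
# escaping before being passed to sed.
render_commands_url() {
  base="${1%/}"   # tolerate one trailing slash on the base URL
  path=$(printf '%s' "$2" | sed "s|{agent}|$3|g")
  printf '%s%s\n' "$base" "$path"
}

# Defaults mirror the table above; the base URL is a placeholder.
# The template default lives in its own variable because a literal "}"
# inside ${VAR:-...} would end the expansion early.
default_template='/api/v1/agents/{agent}/commands'
api_base="${KAGENTI_API_BASE:-https://kagenti-api.example.com}"
agent="${ROSEY_AGENT_NAME:-rosey-regrets}"
template="${KAGENTI_COMMANDS_PATH_TEMPLATE:-$default_template}"

render_commands_url "$api_base" "$template" "$agent"
```

Printing the rendered URL first makes it easier to tell a wrong `KAGENTI_COMMANDS_PATH_TEMPLATE` apart from an authentication failure.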
Downloads Hugging Face model weights into MODEL_LOCAL_DIR (default /models/hf-model).
| Namespace | Primary Contents |
|---|---|
| openshift-gitops | Argo CD Application |
| acs-ai-overwatch-system | Cluster metadata ConfigMap; Tekton pipeline namespace |
| monitoring | Mattermost server, bootstrap Job, ACS webhook ConfigMap |
| quay | QuayRegistry instance |
| ai-workbenches | Quay pull Secrets for workbenches |
| openshift-nfd | Node Feature Discovery |
| nvidia-gpu-operator | GPU Operator, time-slicing ConfigMap, ClusterPolicy |
| redhat-ods-operator | OpenShift AI operator |
| redhat-ods-applications | HardwareProfile CR, workbench image streams |
| rhods-notebooks | Default DSC workbench namespace |
| rhoai-model-registries | Model registry (when modelregistry: Managed) |
| team1, team2 | Kagenti team namespaces; team workbench Notebook CRs |
| rhacs-operator | RHACS operator subscription |
| test-range | Agents, PVC, ACS policy ConfigMap, OpenShell SCC/SA |
| Template | Condition | Creates |
|---|---|---|
| cluster-metadata.yaml | clusterMetadata.enabled | Namespace + ConfigMap |
| mattermost-*.yaml | mattermost.enabled | Mattermost stack |
| quay-*.yaml | quayStorage.enabled | Quay operator + registry |
| accelerators-*.yaml | accelerators.enabled | NFD + GPU Operator |
| rhoai-*.yaml | rhoai.enabled | OpenShift AI operator, DSC, team workbenches |
| acs-test-range-namespace.yaml | components.acsPolicies.enabled | test-range namespace |
| acs-operator-install.yaml | components.acsPolicies.enabled | RHACS operator |
| acs-securitypolicy-*.yaml | components.acsPolicies.enabled | Runtime + telemetry SecurityPolicy CRs (namespace stackrox) |
| acs-notifier-declarative-configmap.yaml | Phase 3 + Mattermost webhook CM | RHACS generic notifier → bridge |
| mattermost-webhook-bridge.yaml | mattermost.webhookBridge.enabled | Translates RHACS alerts to Slack-format for Mattermost |
| acs-policy-agent-telemetry.yaml | acsPolicies + agentTelemetryPolicy | Agent telemetry required-label policy |
| agent-telemetry-networkpolicy.yaml | kagenti + agentTelemetryPolicy | DNS-only egress for non-compliant agents |
| acs-openshell-scc.yaml | components.acsPolicies.enabled | SCC + ServiceAccount |
| agents-rosey-regrets-pvc.yaml | components.agentsRoseyRegrets.enabled | PVC |
| kagenti-agent-deployments.yaml | components.kagenti.enabled | Agent Deployments/Services |
| kagenti-appsource.yaml | components.kagenti.enabled | AppSource CR |
GitOps: Sync the discovery Application first, confirm the ConfigMap, then refresh the main Application:

```shell
oc get application acs-ai-overwatch-cluster-discovery -n openshift-gitops
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
```

Local render: Run make cluster-values and pass -f values-cluster.yaml to helm template, or set cluster.appsDomain in values.
Mattermost, Quay pull secrets, and Kagenti are gated until cluster config is ready. If the ConfigMap exists but Argo still fails, Helm lookup may be unavailable on the repo-server — use gitops/argocd/cmp/.
This message is generic; the per-resource reason is in the Application status or controller logs. A common cause for this repo:
The main chart uses cluster-scoped resources (Namespace, DataScienceCluster, nvidia.com/ClusterPolicy, optional SecurityContextConstraints) that the openshift-gitops default AppProject does not allow. Discovery and bootstrap can still sync because they only create namespaced objects.
Fix:
```shell
# 1) AppProject (also in gitops/argocd/kustomization.yaml)
oc apply -f gitops/argocd/appproject-acs-ai-overwatch.yaml
# 2) Point Applications at that project (after you pull the repo change)
oc patch application acs-ai-overwatch -n openshift-gitops --type merge \
  -p '{"spec":{"project":"acs-ai-overwatch"}}'
oc patch application acs-ai-overwatch-cluster-discovery -n openshift-gitops --type merge \
  -p '{"spec":{"project":"acs-ai-overwatch"}}'
oc patch application acs-ai-overwatch-gitops-bootstrap -n openshift-gitops --type merge \
  -p '{"spec":{"project":"acs-ai-overwatch"}}'
# 3) See the real error (replace with a failed resource from the list)
oc get application acs-ai-overwatch -n openshift-gitops \
  -o jsonpath='{range .status.operationState.syncResult.resources[?(@.status=="SyncFailed")]}{.kind}/{.name} in {.namespace}: {.message}{"\n"}{end}'
```

Or re-apply everything: oc apply -k gitops/argocd/ (includes the AppProject and updated spec.project).
Also ensure cluster-admin RBAC for the controller if the next error is forbidden on Subscriptions or SCCs:

```shell
oc apply -f gitops/argocd/bootstrap/openshift-gitops-controller-rbac.yaml
```

Example error:
no matches for kind "DataScienceCluster" in version "datasciencecluster.opendatahub.io/v2"
ensure CRDs are installed first
Common causes:

- API version / operator mismatch: This repo targets OpenShift AI 3.4 only (datasciencecluster.opendatahub.io/v2, channel such as stable-3.4). On a fresh cluster you should see rhods-operator.3.4.* Succeeded. If you see 2.25.x, the cluster had a prior 2.25 install or the wrong channel; use a new cluster or fix the Subscription channel before syncing default-dsc.

  ```shell
  oc get csv -n redhat-ods-operator | grep rhods
  oc get crd datascienceclusters.datasciencecluster.opendatahub.io \
    -o jsonpath='{range .spec.versions[*]}{.name}{" "}{end}{"\n"}'
  oc get subscription rhods-operator -n redhat-ods-operator -o jsonpath='{.spec.channel}{"\n"}'
  ```

- CRD not ready yet: Subscription applied before CSV Succeeded; see operator checks below.
Example:
admission webhook "datasciencecluster-v2-validator.opendatahub.io" denied the request:
Managed is no longer supported as a managementState
Cause: OpenShift AI 3.4 no longer allows spec.components.kueue.managementState: Managed (embedded Kueue was removed).
Fix: Install the Red Hat Kueue Operator, then ensure the chart uses Unmanaged (default in values.yaml):
```yaml
rhoai:
  datascienceCluster:
    components:
      kueue:
        managementState: Unmanaged
```

Push/sync, or patch on cluster:

```shell
oc patch datasciencecluster default-dsc --type merge \
  -p '{"spec":{"components":{"kueue":{"managementState":"Unmanaged","defaultClusterQueueName":"default","defaultLocalQueueName":"default"}}}}'
```

To skip Kueue entirely (no operator): set managementState: Removed and rhoai.hardwareProfile.enabled: false in values-poc.yaml.
SkipDryRunOnMissingResource only skips dry-run validation; it does not stop apply failures.
Example on the Subscription in redhat-ods-operator:
ConstraintsNotSatisfiable: constraints not satisfiable: no operators found in channel fast-3.4 of package rhods-operator in the catalog referenced by subscription rhods-operator
ResolutionFailed: True
Meaning: The redhat-operators catalog on this cluster does not publish the channel named in the Subscription (fast-3.4, stable-3.4, etc.). OLM cannot resolve any CSV for that channel.
Fix:

- List channels the catalog actually exposes:

  ```shell
  oc get packagemanifest rhods-operator -n openshift-marketplace \
    -o jsonpath='{range .status.channels[*]}{.name}{" "}{.currentCSV}{"\n"}{end}'
  ```

- Pick a 3.4 channel from that list (e.g. stable-3.4, fast-3.4, eus-3.4) and set it on the Subscription:

  ```shell
  oc patch subscription rhods-operator -n redhat-ods-operator --type merge \
    -p '{"spec":{"channel":"stable-3.4"}}'
  ```

  Or override in Helm/Argo: rhoai.operator.subscription.channel in values.yaml (or your cluster values file), then sync acs-ai-overwatch.

- If no channel ending in -3.4 appears, this cluster's operator index likely does not include OpenShift AI 3.4 yet. OpenShift AI 3.4 requires OpenShift Container Platform 4.19.9+ (see supported configurations). Check:

  ```shell
  oc version
  oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
  ```

  Upgrade OCP or refresh the redhat-operators catalog before retrying. The unversioned stable / fast channels may still point at 2.25 and will not satisfy this chart's DataScienceCluster v2 API.

- After changing the channel, confirm resolution:

  ```shell
  oc get subscription rhods-operator -n redhat-ods-operator -o yaml | grep -A2 conditions
  oc get csv -n redhat-ods-operator | grep rhods
  ```

  Expect rhods-operator.3.4.* with phase Succeeded before syncing default-dsc.
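The channel choice can be scripted against the `packagemanifest` output. The sketch below is one possible approach rather than anything the repository ships: `pick_34_channel` is a hypothetical helper, preferring `stable-3.4` over other `-3.4` channels is an assumption, and the CSV names in the sample input are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: read "<channel> <currentCSV>" lines on stdin (the
# format produced by the packagemanifest jsonpath query) and print one
# channel ending in -3.4. Exits non-zero when no such channel is listed.
pick_34_channel() {
  awk '
    $1 == "stable-3.4"           { print $1; found = 1; exit }
    $1 ~ /-3\.4$/ && first == "" { first = $1 }
    END {
      if (found) exit 0
      if (first != "") { print first; exit 0 }
      exit 1
    }
  '
}

# Placeholder input; on a cluster, pipe the jsonpath query output in instead.
pick_34_channel <<'EOF'
stable rhods-operator.2.25.x
fast-3.4 rhods-operator.3.4.x
stable-3.4 rhods-operator.3.4.x
EOF
```

A non-zero exit status corresponds to the case where the catalog does not offer OpenShift AI 3.4 at all, so a wrapper script can stop before patching the Subscription.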
Chart behavior (current): platformResources.waitForCrds: true (default) omits DataScienceCluster, HardwareProfile, ClusterPolicy, and QuayRegistry from the manifest until Helm lookup sees each CRD on the cluster. After operators install, Refresh → Sync and those resources appear automatically.
On the cluster:

```shell
# RHOAI must succeed: OperatorGroup must be spec: {} (not targetNamespaces)
oc get operatorgroup redhat-ods-operator -n redhat-ods-operator -o yaml | grep -A2 '^spec:'
oc get csv -n redhat-ods-operator
oc get pods -n redhat-ods-operator
oc get crd datascienceclusters.datasciencecluster.opendatahub.io
oc get subscription rhods-operator -n redhat-ods-operator
```

When rhods-operator.* is Succeeded and the CRD exists, Refresh → Sync acs-ai-overwatch.

If sync stays green but default-dsc never appears (Helm lookup unavailable on repo-server), set in values.yaml after CRDs exist:

```yaml
platformResources:
  waitForCrds: false
```

Commit/push and sync again.
Sync wave order: namespaces → Subscriptions (10) → platform CRs (30) → workloads later.
```shell
oc get application acs-ai-overwatch -n openshift-gitops -o yaml
oc describe application acs-ai-overwatch -n openshift-gitops
```

Common causes: invalid Helm values, missing CRDs, operator subscriptions pending install plans, or wrong StorageClass name for your cluster.
openshift-gitops-argocd-application-controller cannot create resource "serviceaccounts"
in namespace "acs-ai-overwatch-system"
OpenShift GitOps does not grant the application controller cluster-wide deploy rights by default. Apply the bootstrap binding once (cluster-admin required):

```shell
oc apply -f gitops/argocd/bootstrap/openshift-gitops-controller-rbac.yaml
```

Then Sync the discovery and main Applications again. See gitops/argocd/bootstrap/README.md.
The main chart also deploys to monitoring, quay, test-range, operator namespaces, etc.; without this binding you will see similar errors on those namespaces next.
Health Missing + “Resource not found in cluster” usually means Argo CD has not applied those objects yet (desired state from Git, live cluster empty). Click Sync on acs-ai-overwatch-cluster-discovery.
If Sync fails or resources stay missing:

- Confirm the discovery chart is on the branch Argo syncs (includes gitops/helm/acs-ai-overwatch-cluster-discovery/files/).
- Check the Application sync error:

  ```shell
  oc describe application acs-ai-overwatch-cluster-discovery -n openshift-gitops
  ```

- Verify namespace exists:

  ```shell
  oc get ns acs-ai-overwatch-system
  ```

- After a successful sync:

  ```shell
  oc get sa,cm -n acs-ai-overwatch-system | grep cluster-discovery
  oc get job -n acs-ai-overwatch-system cluster-discovery
  oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
  ```

- Refresh Application acs-ai-overwatch once acs-ai-overwatch-cluster-config exists.
The discovery Job needs cluster-scoped RBAC (ClusterRole / ClusterRoleBinding named acs-ai-overwatch-cluster-discovery-cluster-discovery). If you pre-created them with scripts/cluster-admin/04-apply-discovery-prerequisites.sh, they can exist on the cluster while Argo still shows Missing or refuses to sync because:
- AppProject acs-ai-overwatch did not whitelist ClusterRole / ClusterRoleBinding (fixed in gitops/argocd/appproject-acs-ai-overwatch.yaml).
- The GitOps application controller cannot get cluster-scoped RBAC (needs the PoC ClusterRoleBinding or equivalent).
Fix on the cluster:
```shell
# Update AppProject (pull latest or apply from repo)
oc apply -f gitops/argocd/appproject-acs-ai-overwatch.yaml
# Controller must read/create cluster RBAC
oc apply -f gitops/argocd/bootstrap/openshift-gitops-controller-rbac.yaml
# Verify
oc auth can-i get clusterroles \
  --as=system:serviceaccount:openshift-gitops:openshift-gitops-argocd-application-controller
oc get clusterrole,clusterrolebinding | grep cluster-discovery
```

Then Refresh and Sync acs-ai-overwatch-cluster-discovery. Argo should adopt the existing objects (same name/rules) and add its tracking metadata.
Expected names (Helm release = Application name):
| Kind | Name |
|---|---|
| ClusterRole | acs-ai-overwatch-cluster-discovery-cluster-discovery |
| ClusterRoleBinding | acs-ai-overwatch-cluster-discovery-cluster-discovery |
If you see warnings like Namespace/monitoring is part of applications acs-ai-overwatch and acs-ai-overwatch-gitops-bootstrap, both Applications define the same Namespace objects. Bootstrap should own namespaces (with argocd.argoproj.io/managed-by); the main chart sets gitops.bootstrapNamespaces: true (default) to skip duplicate Namespace manifests.
Ensure acs-ai-overwatch-gitops-bootstrap is Synced first, then refresh the main app. After pulling the chart fix, SharedResourceWarning entries should clear.
Expected until OLM installs operators (GPU, RHOAI, Quay, Local Storage). Subscriptions sync in wave 10; platform CRs are wave 20/30.
- Enable SkipDryRunOnMissingResource=true on the main Application (see CRD not installed above).
- Wait for CSVs: oc get csv -A | grep Succeeded
- Refresh → Sync again (often 2–3 times over 15–30 minutes).
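Grepping for Succeeded shows what is ready but not what is still blocking the sync. A small filter can invert that view; `pending_csvs` below is a hypothetical helper, and the sample rows use made-up operator names. It assumes the default `oc get csv -A` table layout, where PHASE is the last column.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get csv -A` output on stdin and print the
# name of every ClusterServiceVersion whose PHASE column is not Succeeded.
# $NF is used for PHASE because the DISPLAY column can contain spaces.
pending_csvs() {
  awk 'NR > 1 && $NF != "Succeeded" { print $2 }'
}

# Illustrative sample; on a cluster, run: oc get csv -A | pending_csvs
pending_csvs <<'EOF'
NAMESPACE   NAME             DISPLAY       VERSION   REPLACES   PHASE
ns-a        operator-a.v1    Operator A    1.0.0                Succeeded
ns-b        operator-b.v2    Operator B    2.0.0                Installing
EOF
```

For the sample input this prints `operator-b.v2`; empty output means every listed CSV has reached Succeeded and another Refresh → Sync is worth trying.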
If ClusterPolicy fails validation (daemonsets / dcgmExporter / nodeStatusExporter required), pull the latest chart — the GPU ClusterPolicy template includes those fields for current GPU Operator CRDs.
The DataScienceCluster CRD can exist even when the operator CSV is Failed. Diagnose:

```shell
# Replace the CSV name with the one reported by: oc get csv -n redhat-ods-operator
oc describe csv rhods-operator.2.25.6 -n redhat-ods-operator
oc get installplan -n redhat-ods-operator
oc get pods -n redhat-ods-operator
```

Common cause in this repo: RHOAI OperatorGroup must use spec: {} (all namespaces), not targetNamespaces: [redhat-ods-operator]. A restricted OperatorGroup blocks deployment to redhat-ods-applications and can leave the CSV in Failed.

```shell
# A merge patch of {"spec":{}} leaves existing fields in place; null removes targetNamespaces
oc patch operatorgroup redhat-ods-operator -n redhat-ods-operator --type merge -p '{"spec":{"targetNamespaces":null}}'
```

If the CSV stays Failed, delete it so OLM retries (or re-sync the Argo app to recreate the Subscription):

```shell
oc delete csv rhods-operator.2.25.6 -n redhat-ods-operator
```

When oc get csv -n redhat-ods-operator shows rhods-operator.* Succeeded, Refresh → Sync acs-ai-overwatch so default-dsc applies.
```shell
oc get clusterpolicy -n nvidia-gpu-operator
oc get configmap time-slicing-config -n nvidia-gpu-operator -o yaml
oc describe node <gpu-worker> | grep nvidia.com/gpu
```

- Verify storage.defaultStorageClass matches oc get storageclass (default gp3-csi)
- Check Quay component PVCs: oc get pvc -n quay
- Confirm Quay operator and QuayRegistry CR are reconciling: oc get quayregistry -n quay
```shell
oc logs job/mattermost-bootstrap -n monitoring
oc get route mattermost -n monitoring
```

Ensure Mattermost pod is Ready before bootstrap runs.
Cause: Red Hat OpenShift Pipelines is not installed; Tekton CRDs are missing.
Fix: Install the operator and wait for CSV Succeeded, then re-apply the pipeline. Full steps: OpenShift Pipelines prerequisite.
```shell
oc get crd tasks.tekton.dev
oc get csv -A | grep pipelines-operator
```

- Confirm Tekton pipeline pushed images successfully
- Verify kagenti.images.* references match Quay org/repo/tag
- Ensure pull Secret exists in test-range if Quay requires auth:

  ```shell
  oc get secret -n test-range
  oc get sa openshell -n test-range -o yaml
  ```
- Confirm components.agentsRoseyRegrets.enabled: true
- Confirm PVC name alignment:

  ```shell
  oc get pvc agent-reference-information -n test-range
  oc describe deploy rosey-regrets -n test-range
  ```
- Confirm Secured Cluster Services are connected to Central
- Verify policy was imported and is not disabled
- Confirm violation scope matches namespace test-range
- Validate notifier name exactly matches Mattermost Notifier
- Verify KAGENTI_API_BASE and token
- Adjust KAGENTI_COMMANDS_PATH_TEMPLATE for your Kagenti version
- Set KAGENTI_TLS_INSECURE=true only for lab clusters with self-signed certs
After Phase 4, the Kagenti operator injects authbridge-proxy on port 8000 and moves the agent container to 8001. Agent Services must reach a listening port or AgentCard reconciliation fails with connection refused.
oc get agentcard -n test-range
oc get pods -n test-range -l app.kubernetes.io/name=sneaky-sam-slm
oc logs -n test-range deploy/sneaky-sam-slm -c authbridge-proxy --tail=20
oc get pods -n zero-trust-workload-identity-manager -l app.kubernetes.io/name=spire-agent

| Symptom | Likely cause | Fix |
|---|---|---|
| authbridge-proxy logs only Starting authbridge-proxy..., nothing on :8000 | SPIRE agents crash-looping (expired trust bundle / server cert) | Check spire-agent pods in zero-trust-workload-identity-manager; restart SPIRE server and agents or re-run Phase 4 install |
| Agent process healthy on :8001, Service targets :8000 | Port mismatch after Kagenti mutation | Chart default kagenti.agentServiceTargetPort: 8001 routes Services to the agent container |
| Chat returns LLM request failed: All connection attempts failed | Kagenti sets HTTP_PROXY to authbridge :8081; forward proxy is down when SPIRE is unhealthy | Chart sets NO_PROXY for .svc.cluster.local; agent code bypasses proxy for vLLM calls (trust_env=false) |
| All AgentCards Synced=False | Combination of the above | Patch Services to targetPort: 8001 or sync Argo after updating values |
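A quick way to spot the port mismatch from the table is to list each Service's first targetPort and flag anything that is not 8001. The sketch below is illustrative only: the helper names are hypothetical, it assumes the first port entry is the agent port, and it will also flag non-agent Services (for example a model server) that legitimately use other ports.

```shell
# Print "<service-name> <targetPort>" for every Service in a namespace.
# Assumption: the agent port is the first entry in .spec.ports.
service_target_ports() {
  oc get svc -n "${1:-test-range}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.ports[0].targetPort}{"\n"}{end}'
}

# Read "<name> <port>" lines on stdin and print the names whose port
# differs from the expected value (default 8001).
mismatched_ports() {
  awk -v want="${1:-8001}" '$2 != want { print $1 }'
}
```

Running service_target_ports test-range | mismatched_ports 8001 prints one Service name per line; empty output means every listed Service already targets 8001.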
Verify agent card fetch:
curl -sS -o /dev/null -w '%{http_code}\n' \
  http://sneaky-sam-slm.test-range.svc.cluster.local/.well-known/agent-card.json

Expect 200 and oc get agentcard -n test-range showing Synced=True.
Kagenti Phase 4 installs SPIRE in zero-trust-workload-identity-manager. The SpireServer CA rotates on a TTL (caValidity, default 24h from upstream kagenti-deps). Rotation itself is normal; outages happen when spire-agent pods do not reload the updated spire-bundle ConfigMap and crash with:
x509: certificate signed by unknown authority
That breaks the SPIRE workload API socket, so authbridge-proxy never binds :8000/:8081.
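The failure described above can be detected by scanning recent spire-agent logs for that x509 message. This is a minimal sketch, not part of the repository: the function names are hypothetical, and the 50-line tail is an arbitrary assumption.

```shell
# Succeeds if the log text on stdin contains the x509 error quoted above.
has_bundle_drift() {
  grep -q 'x509: certificate signed by unknown authority'
}

# Hypothetical wrapper: inspect the last 50 log lines of each spire-agent pod.
# Returns 1 when drift is detected so it can gate a restart step.
check_spire_agents() {
  if oc logs -n zero-trust-workload-identity-manager \
       -l app.kubernetes.io/name=spire-agent --tail=50 2>&1 | has_bundle_drift; then
    echo "spire-agent: trust bundle drift detected"
    return 1
  fi
  echo "spire-agent: no bundle drift in recent logs"
}
```

A non-zero return from check_spire_agents suggests the agents need the restart described under "Fix after an outage".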
| Layer | What | Why |
|---|---|---|
| 1. Longer CA TTL | Patch SpireServer caValidity to 168h (7 days) or longer | Fewer rotations → fewer chances for agent/bundle drift. PoC install Job does this automatically (spire.caValidity in the Kagenti platform chart). |
| 2. Create-only annotation | ztwim.openshift.io/create-only=true on SpireServer | Stops the ZTWIM operator from reverting your TTL patch (OpenShift 4.19 ZTWIM docs). |
| 3. Agent Service bypass | kagenti.agentServiceTargetPort: 8001 + NO_PROXY for .svc.cluster.local | Keeps Kagenti chat working even when authbridge is down (already in this chart). |
| 4. Reactive restart | Roll spire-agent + agent Deployments after rotation | Manual safety net if agents fall behind again. |
Apply on an existing cluster:
oc patch spireservers cluster --type=merge -p '{"spec":{"caValidity":"168h"}}'
oc annotate spireservers cluster ztwim.openshift.io/create-only=true --overwrite

Or re-run / sync the Phase 4 install Job (kagenti-platform-install); it idempotently applies the patch even when Kagenti is already installed.
Check:
oc get spireservers cluster -o jsonpath='caValidity={.spec.caValidity}{"\n"}'
oc get pods -n zero-trust-workload-identity-manager -l app.kubernetes.io/name=spire-agent
oc get spireagents cluster -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

Fix after an outage:
oc delete pods -n zero-trust-workload-identity-manager -l app.kubernetes.io/name=spire-agent
oc rollout restart deployment -n test-range -l kagenti.io/type=agent

Verify authbridge is listening:
oc exec -n test-range deploy/helpful-hank-slm -c authbridge-proxy -- \
  wget -qO- --timeout=3 http://127.0.0.1:8000/.well-known/agent-card.json | head -c 80

For long-lived production clusters, consider a scheduled spire-agent rolling restart before CA expiry, or upstream SPIRE bundle hot-reload fixes; neither is required for a short PoC sandbox once the TTL is extended.
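To turn the caValidity check into a pass/fail gate, the value can be parsed and compared against the 168h target. The helpers below are a sketch with hypothetical names. They assume the field is stored in the plain "<N>h" form used in the patch above, and they reject any other duration format rather than guess.

```shell
# Convert a duration such as "168h" to whole hours.
# Only the plain "<digits>h" form is accepted; anything else returns 1.
ca_validity_hours() {
  cvh_num="${1%h}"
  case "$cvh_num" in
    ""|*[!0-9]*) return 1 ;;
  esac
  [ "$cvh_num" != "$1" ] || return 1   # require the trailing "h"
  printf '%s\n' "$cvh_num"
}

# Succeeds when the duration is at least MIN_HOURS (default 168).
# Returns 2 when the duration cannot be parsed.
ca_validity_ok() {
  cvo_hours="$(ca_validity_hours "$1")" || return 2
  [ "$cvo_hours" -ge "${2:-168}" ]
}
```

For example, ca_validity_ok "$(oc get spireservers cluster -o jsonpath='{.spec.caValidity}')" succeeds once the patch has been applied.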
Rosey Regrets is intentionally configured to perform network discovery using nmap against RFC1918 address space. This agent must only be deployed in:
- Isolated lab clusters
- Environments where you have explicit authorization to scan
- Networks designed for security evaluation and demonstration
Unauthorized scanning of production or third-party networks may violate policy and law.
Do not run this PoC with repository default passwords on any cluster that is shared, long-lived, or reachable outside a personal sandbox. See the Configuration Checklist — change default passwords table for every placeholder and its values.yaml path.
- Mattermost: mattermost.bootstrap.adminPassword, hitlPassword, and mattermost.postgres.password in values.yaml.
- Quay UI admin (reference Secret only): quayStorage.quayRegistry.adminAccount.* → Secret admin-account in namespace quay. Use these credentials when completing the Quay setup wizard; Quay does not read this Secret automatically.
- Quay docker pull robot: quayStorage.registryCredentials.password, or set QUAY_REGISTRY_PASSWORD when running discover-cluster-values.sh (local override file only).
- Quay MinIO: quayStorage.quayRegistry.minio.credentialsSecret.secretKey (must match MinIO and quay-config-bundle).
- Prefer External Secrets Operator, Sealed Secrets, or OpenShift GitOps vault integration for production.
- Optionally gitignore values-cluster.yaml if it contains secrets (see .gitignore comment).
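One way to replace the Mattermost placeholders listed above is to generate random values into a local override file. The sketch below is an assumption-laden example, not a repository script: the function names are hypothetical, the 24-character alphanumeric format is an arbitrary choice, and only the three Mattermost keys named in the list are covered.

```shell
# Generate a random alphanumeric password (default 24 characters).
gen_password() {
  gp_len="${1:-24}"
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$gp_len"
  echo
}

# Print a values fragment for the Mattermost password keys named above.
print_mattermost_overrides() {
  cat <<EOF
mattermost:
  bootstrap:
    adminPassword: "$(gen_password)"
    hitlPassword: "$(gen_password)"
  postgres:
    password: "$(gen_password)"
EOF
}
```

Redirect the output of print_mattermost_overrides to a local, untracked values file and pass it to Helm with an extra -f flag; do not commit the result.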
The OpenShell SCC and buildah Tekton tasks require elevated privileges. Restrict namespace access and audit SCC usage.
make helm-template-discovery
make cluster-values   # optional override file
make helm-template

helm template acs-ai-overwatch gitops/helm/acs-ai-overwatch \
  -f gitops/helm/acs-ai-overwatch/values.yaml \
  -f gitops/helm/acs-ai-overwatch/values-poc.yaml \
  -f gitops/helm/acs-ai-overwatch/values-cluster.yaml \
  --set components.acsPolicies.enabled=true \
  --set components.agentsRoseyRegrets.enabled=true \
  --set components.kagenti.enabled=true \
  > /tmp/acs-ai-overwatch-render.yaml

helm upgrade acs-ai-overwatch gitops/helm/acs-ai-overwatch \
  -f gitops/helm/acs-ai-overwatch/values.yaml \
  -f gitops/helm/acs-ai-overwatch/values-poc.yaml \
  -f gitops/helm/acs-ai-overwatch/values-cluster.yaml \
  --dry-run --debug

Files under scratch/ are not deployed by the Helm chart. They contain reference manifests, dashboards, and exploratory OpenShift AI setup YAML for local experimentation.
oc login ...
make cluster-admin-pre-gitops
# Register GitOps Applications
oc apply -k gitops/argocd/
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
# Optional: local values file
make cluster-values
make helm-template
make helm-template-discovery
# Apply Tekton pipeline
oc apply -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipeline.yaml
# Build agent images (example PipelineRun)
oc create -n acs-ai-overwatch-system -f pipelines/tekton/agents-build-pipelinerun.example.yaml
# End-to-end demo (after setup) — see "PoC Demo Walkthrough (After Setup)"
# Mattermost login + alerts — see "Mattermost & RHACS notifications"
./scripts/kagenti-auth-info.sh
# In Kagenti UI: rosey-regrets-slm → "Network Audit"
# Check test-range workloads
oc get all,pvc,cm -n test-range

Complete this section after Phases 0–4: Mattermost is deployed, RHACS Central + SecuredCluster are healthy, and agents are built. Do this immediately before PoC Demo Walkthrough (After Setup).
Discovery writes the browser URL to the cluster ConfigMap — do not hard-code sandbox hostnames in Git:
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config \
  -o jsonpath='{.data.mattermostSiteUrl}{"\n"}{.data.mattermostRouteHost}{"\n"}'

After a new sandbox, re-sync acs-ai-overwatch-cluster-discovery, then hard-refresh the main app:
oc annotate application acs-ai-overwatch -n openshift-gitops argocd.argoproj.io/refresh=hard --overwrite

Webhook vs browser URL: RHACS uses the internal webhook URL in monitoring/mattermost-acs-integration. Humans use mattermostSiteUrl from the ConfigMap.
Confirm the Mattermost bootstrap Job completed and the webhook ConfigMap exists:
oc get job mattermost-bootstrap -n monitoring
oc get cm mattermost-acs-integration -n monitoring -o yaml

| Resource | Description |
|---|---|
| Deployment | Mattermost server in monitoring |
| PVC | 10Gi on gp3-csi |
| Route | Edge TLS; host from cluster.appsDomain / discovery |
| Bootstrap Job | Creates admin, HITL user, incoming webhook |
| ConfigMap | ACS_INCOMING_WEBHOOK_URL after bootstrap |
Bootstrap flow (Job mattermost-bootstrap):
- Waits for Mattermost API readiness
- Creates bootstrap admin (idempotent on HTTP 400 if exists)
- Creates human-in-the-loop user
- Creates incoming webhook on Town Square channel
- Writes webhook URL to ConfigMap mattermost-acs-integration
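The first step of the flow above, waiting for API readiness, is typically a bounded retry loop. The sketch below is not the repository's Job script. It polls a URL with curl until it returns HTTP 200. The function name, retry counts, and the Service host and port in the usage note are assumptions; /api/v4/system/ping is Mattermost's health endpoint.

```shell
# wait_for_ready URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it answers HTTP 200 or the attempts run out.
wait_for_ready() {
  wfr_url="$1"
  wfr_attempts="${2:-30}"
  wfr_delay="${3:-5}"
  wfr_i=1
  while [ "$wfr_i" -le "$wfr_attempts" ]; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    wfr_code="$(curl -sS -o /dev/null -w '%{http_code}' "$wfr_url" 2>/dev/null || true)"
    if [ "$wfr_code" = "200" ]; then
      echo "ready after $wfr_i attempt(s)"
      return 0
    fi
    # Sleep only when another attempt remains.
    [ "$wfr_i" -lt "$wfr_attempts" ] && sleep "$wfr_delay"
    wfr_i=$((wfr_i + 1))
  done
  echo "not ready after $wfr_attempts attempt(s)" >&2
  return 1
}
```

A call such as wait_for_ready http://mattermost.monitoring.svc:8065/api/v4/system/ping 30 5 would wait up to roughly two and a half minutes; the Service name and port here are illustrative and should be replaced with the values from your cluster.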
| User | Password source |
|---|---|
| mattermost-admin | mattermost.bootstrap.adminPassword in values.yaml |
| human-in-the-loop | mattermost.bootstrap.hitlPassword in values.yaml |
Open mattermostSiteUrl from the cluster ConfigMap. Join or open the Town Square channel (ACS team) — that is where RHACS alerts land.
When Phase 3 is enabled, the chart deploys:
- Declarative notifier ConfigMap rhacs-mattermost-notifier in namespace stackrox
- acs-mattermost-bridge Deployment in monitoring: RHACS generic notifier POSTs {"alert":...} JSON; Mattermost Slack-compatible hooks require {"text":...}; the bridge translates and forwards
- Job acs-platform-bootstrap configures init bundle, SecuredCluster, and notifier mount on Central
Configure RHACS notifier name to match:
acs:
  policy:
    notifierName: Mattermost Notifier

Alert path:
RHACS Central → generic notifier → http://acs-mattermost-bridge.monitoring.svc:8080/
→ Mattermost incoming webhook (Town Square, ACS team)
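The translation step in this path can be illustrated with jq. This is a sketch of the idea only, not the bridge's actual implementation. It requires jq, the function name is hypothetical, and the field paths (.alert.policy.name, .alert.deployment.namespace, .alert.deployment.name) are assumptions about the RHACS payload that should be checked against a real alert.

```shell
# Read an RHACS-style {"alert": ...} document on stdin and emit the
# {"text": ...} shape that Slack-compatible incoming webhooks expect.
# Missing fields fall back to placeholder strings instead of failing.
alert_to_text() {
  jq -c '{
    text: ("RHACS alert: " + (.alert.policy.name // "unknown policy")
           + " in " + (.alert.deployment.namespace // "unknown namespace")
           + "/" + (.alert.deployment.name // "unknown deployment"))
  }'
}
```

Piping the result into curl -sS -X POST "$WEBHOOK" -H 'Content-Type: application/json' -d @- would forward it, where WEBHOOK holds the URL from mattermost-acs-integration.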
Verify:
oc get deploy acs-mattermost-bridge -n monitoring
oc logs -n stackrox job/acs-platform-bootstrap
# RHACS UI → Integration → Notifiers → Mattermost Notifier

Test the Mattermost webhook directly (URL from mattermost-acs-integration):
WEBHOOK="$(oc get cm -n monitoring mattermost-acs-integration -o jsonpath='{.data.ACS_INCOMING_WEBHOOK_URL}')"
curl -sS -o /dev/null -w '%{http_code}\n' -X POST "$WEBHOOK" \
  -H 'Content-Type: application/json' -d '{"text":"acs-ai-overwatch webhook test"}'

Expect 200 or 204. If alerts fail, confirm the bridge is Running and the notifier endpoint is acs-mattermost-bridge (not the Mattermost URL directly). See the Mattermost / notifier rows under Troubleshooting.
Use this section after Phases 0–4 are complete: Mattermost is up, RHACS Central + SecuredCluster are healthy, agents are built, and Kagenti UI is reachable. It is the presenter script for the two main demos.
Run these once; every item should pass before opening the UI for an audience.
# Cluster config + Mattermost URL
oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config \
-o jsonpath='{.data.mattermostSiteUrl}{"\n"}{.data.kagentiApiBaseUrl}{"\n"}'
# Mattermost + webhook bridge
oc get pods -n monitoring -l 'app.kubernetes.io/name in (mattermost,acs-mattermost-bridge)'
# RHACS stack
oc get pods -n stackrox -l app.kubernetes.io/name=central
oc get pods -n stackrox -l app=collector
oc get securitypolicy -n stackrox test-range-runtime-guardrails
# Agents + SLM inference
oc get pods -n test-range -l 'app.kubernetes.io/name in (rosey-regrets-slm,helpful-hank,qwen3-vllm)'
# Kagenti platform
./scripts/kagenti-auth-info.sh

| Console | How to open |
|---|---|
| Mattermost | URL from mattermostSiteUrl above — team ACS, channel Town Square |
| RHACS Central | Route central in stackrox, or oc get route central -n stackrox |
| Kagenti UI | From kagenti-auth-info.sh output |
Log into Mattermost as human-in-the-loop (password: mattermost.bootstrap.hitlPassword in values) so you see notifier posts in Town Square.
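The preflight commands can be wrapped in a small pass/fail runner so a presenter sees one summary instead of scanning raw output. This is a minimal sketch with hypothetical function names. It covers only a few of the checks above and treats an oc get that exits zero as a pass, which confirms the resource exists but not that its pods are healthy.

```shell
PREFLIGHT_FAILED=0

# run_check LABEL COMMAND [ARGS...]
# Runs the command quietly and prints PASS or FAIL with the label.
run_check() {
  rc_label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $rc_label"
  else
    echo "FAIL  $rc_label"
    PREFLIGHT_FAILED=$((PREFLIGHT_FAILED + 1))
  fi
}

# Existence checks for resources named in the preflight list above.
preflight() {
  PREFLIGHT_FAILED=0
  run_check "cluster ConfigMap" oc get cm -n acs-ai-overwatch-system acs-ai-overwatch-cluster-config
  run_check "bridge Deployment" oc get deploy acs-mattermost-bridge -n monitoring
  run_check "runtime policy" oc get securitypolicy -n stackrox test-range-runtime-guardrails
  run_check "Central route" oc get route central -n stackrox
  echo "$PREFLIGHT_FAILED check(s) failed"
  [ "$PREFLIGHT_FAILED" -eq 0 ]
}
```

Calling preflight returns non-zero if any check fails, so it can guard the start of a demo script.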
Goal: Show an AI agent triggering suspicious recon, RHACS alerting (not killing the pod), and Mattermost receiving a human-readable message.
Step 1: Open Kagenti
- Run ./scripts/kagenti-auth-info.sh and open the Kagenti UI URL.
- Log in with a demo user from the script output (e.g. dev-user / password from Keycloak secrets).
- Select agent rosey-regrets-slm (Qwen3-backed; recommended for live demos).
Step 2: Trigger recon from the model
In the chat, send exactly:
Network Audit
Or a natural-language variant:
Scan the cluster network and report what you find.
What you should see immediately (not a 504):
- A reply that network reconnaissance started against 10.0.0.0/24
- nmap runs in the background inside the Rosey pod
Step 3: Confirm nmap in the cluster
oc exec -n test-range deploy/rosey-regrets-slm -c rosey-regrets-slm -- \
  sh -c 'pgrep -a nmap || ls -lt /agent-reference-information'

Step 4: RHACS violation
- Open RHACS Central → Violations
- Filter namespace test-range
- Look for policy test-range-runtime-guardrails (process nmap)
- Note enforcement is none (alert-only); the Rosey pod stays Running
oc get pods -n test-range -l app.kubernetes.io/name=rosey-regrets-slm

Step 5: Mattermost alert
- Open Mattermost → team ACS → Town Square
- Within 1–2 minutes, look for a message from acs with RHACS alert text
- If empty, verify the bridge:
oc logs -n monitoring deploy/acs-mattermost-bridge --tail=20
oc get cm mattermost-acs-integration -n monitoring -o yaml

Step 6: Artifacts on the PVC
oc exec -n test-range deploy/rosey-regrets-slm -c rosey-regrets-slm -- \
  cat /agent-reference-information/network-audit-latest.log

Talking points
- Rosey’s model chose to run recon through tool calling; no operator ran oc exec to start it
- RHACS detected runtime behavior and notified operators without stopping the demo
- The webhook bridge exists because Mattermost Slack-compatible hooks need {"text":...} while RHACS generic notifier sends {"alert":...}
Goal: Show deploy-time compliance — agents without the telemetry label trigger a DEPLOY policy and Mattermost alert (and admission block when enforcement is on).
Prerequisites: components.agentsSneakySam.enabled: true and Phase 3 RHACS with admission enforcement (default in chart).
- Sync Argo CD with Sneaky Sam enabled.
- Watch Argo CD for sync failure on sneaky-sam Deployment (expected if admission blocks).
- Check Mattermost Town Square for policy test-range-agent-telemetry-required.
- Compare with Helpful Hank / Rosey: they carry acs-ai-overwatch.io/telemetry=enabled.
oc get deploy -n test-range -o custom-columns='NAME:.metadata.name,TELEMETRY:.metadata.labels.acs-ai-overwatch\.io/telemetry'

Disable when done: components.agentsSneakySam.enabled: false.
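To list only the offenders, a label selector can match Deployments where the telemetry label key is absent. The sketch below uses the Kubernetes "!key" selector form. The function name is hypothetical, and it assumes the policy evaluates the label on Deployment metadata, as the custom-columns command above does.

```shell
# Print the names of Deployments that lack the telemetry label key.
# NAMESPACE defaults to test-range.
missing_telemetry() {
  oc get deploy -n "${1:-test-range}" \
    -l '!acs-ai-overwatch.io/telemetry' \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}
```

With Sneaky Sam enabled and admitted, missing_telemetry would be expected to print sneaky-sam; if admission enforcement blocked the Deployment, the output is empty.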
| Agent | Kagenti name | Behavior |
|---|---|---|
| Helpful Hank | helpful-hank or helpful-hank-slm | Benign assistant; no nmap |
| Rosey Regrets SLM | rosey-regrets-slm | Rogue recon; triggers Demo 1 |
| Sneaky Sam | sneaky-sam | Missing telemetry label (Demo 2) |
Chat with Helpful Hank on the same prompt (“scan the network”) — it should not invoke run_network_recon.
| Symptom | Likely cause | Fix |
|---|---|---|
| Kagenti HTTP 504 | nmap scanning too long (old /8 behavior or sync path) | Use rosey-regrets-slm; confirm NETWORK_AUDIT_CIDR=10.0.0.0/24 on the Deployment |
| No Mattermost message | Bridge down or wrong notifier endpoint | oc get deploy acs-mattermost-bridge -n monitoring; notifier must point at bridge, not Mattermost URL directly |
| No RHACS violation | SecuredCluster/sensor not ready or policy not in stackrox | oc get securitypolicy -n stackrox; oc get pods -n stackrox -l app=collector |
| Policy in UI missing | CR in wrong namespace | Policy CRs must live in stackrox; check with oc get securitypolicy -n stackrox |
| LLM request failed | vLLM unreachable | oc get pods -n test-range -l app.kubernetes.io/name=qwen3-vllm |
# From repo root — no git push required
oc start-build rosey-regrets-slm --from-dir=. --follow -n test-range
oc rollout restart deploy/rosey-regrets-slm -n test-range

Or use Tekton → Quay when the pipeline and quay-build-robot secret are healthy.
When extending this repository:
- Add new optional features behind components.* toggles in values.yaml
- Keep namespace and naming conventions consistent with the test-range isolation model
- Document new cluster-specific settings in this README, values-cluster.yaml.example, and the discovery ConfigMap keys
- Validate with helm template before opening a PR
- Do not commit secrets, kubeconfigs, or environment-specific credentials
Refer to your organization's licensing terms for Red Hat OpenShift, OpenShift AI, Advanced Cluster Security, and third-party components (NVIDIA OpenShell, Mattermost, Kagenti, Hugging Face models).