Minimum deployment testing for Azure by jiaenren · Pull Request #1070 · NVIDIA/OSMO

jiaenren · 2026-06-05T01:01:00Z

Issue - None

What this ships

Adds a nightly Azure deployment-test gate — provisions AKS/Postgres/Redis, builds OSMO images from source, deploys the chart and runs verify-hello + 8 OETF smoke/scenario tests end-to-end via the new run-deployment-test.sh wrapper, and posts to Slack on failure.

Three additions:

File	Purpose
`deployments/scripts/run-deployment-test.sh`	The wrapper (per `projects/osmo-deployment-tactical-hardening-plan.md`). Drives `deploy-osmo-minimal.sh` end-to-end, runs `verify.sh`, optionally `bazel run //test/oetf:run`, structured JSON + JUnit + per-stage logs, categorized exit codes (0/1/2/4/5).
`.github/workflows/deployment-test.yaml`	The CI gate around the wrapper.
`ci/deployment-test/azure-overrides.yaml`	Helm values overlay reducing the chart's `default_ctrl` cpu request so OSMO's strict-LE resource validator passes on small Azure nodes.
`deployments/terraform/azure/example/example.tf`	Pins AVM `vnet` to `~> 0.17.0` (sidesteps the v0.18.x IPAM-validation null-bug).

Workflow shape

name: Deployment Test (Azure)
on:
  workflow_dispatch (init-only | auth-check | full-deployment)
  pull_request (paths-filtered, label-gated for the heavy gate)
  schedule:    '0 0 * * *'           # daily 00:00 UTC = 5pm PDT

jobs:
  init-only        →  terraform init/validate/fmt (every PR push)
  auth-check       →  terraform plan (manual; smokes OIDC chain)
  build-images     →  bazel-builds 12 OSMO images from source, pushes
                      to ghcr.io/<owner>/osmo-ci, tag pr-<num>-<attempt>-amd64
  full-deployment  →  TEMP terraform apply (AKS + Postgres + Redis) →
                      wrapper (provider=azure, --skip-terraform, --no-gpu) →
                      diagnostic dump → TEMP terraform destroy
  notify-slack-on-azure-deployment-test-failure
                   →  Block Kit post to vars.CI_SLACK_CHANNEL
                      (fallback osmo-slack-test) via chat.postMessage +
                      OSMO_SLACK_BOT_TOKEN. Includes commit author/
                      subject, "N commits since last green run" compare,
                      real per-artifact deep-link, run-logs/workflow-file
                      action buttons.
                      Fires schedule-only — PR/manual runs surface their
                      own status.

The gate verifies the PR's source (image-from-source build path), not whatever's published at nvcr.io/nvidia/osmo:latest.

Verification

End-to-end verified pre-merge:

Path	Status	Run
PR-label `ci:azure-deployment` happy path: verify-hello + 7/7 OETF tests	✅	run `27585506897` (`53a6e9e`)
Same after rebase + #1114 OETF migration fixes	✅	run `28071234567` (`2950770`)
Real OETF failure → Slack with real artifact deep-link	✅	run `28136214403` → osmo-slack-test ts `1782346110.673389`
`chat.postMessage` returns `ok=true` to the configured channel	✅	confirmed on multiple dispatch runs

OETF tag set is narrowed to api,websocket,logger,negative — skips three known-broken positive scenarios (router-connectivity DNS, task-runtime-environment + serial.py schema drift). Filed as follow-ups; do not block this gate.

Known TEMP scaffolding (drops once `internal-ci` provisions a long-running AKS)

The workflow self-provisions AKS + Postgres + Redis on every run. The wrapper itself is designed to run against externally-managed infra (--skip-terraform is hard-coded). When a shared internal-ci AKS lands, drop these four steps and the wrapper invocation in the middle stays unchanged:

TEMP — pre-apply cleanup (RG sweep)
TEMP — terraform apply
TEMP — terraform destroy
build TF var file (single source for apply/destroy)

Required repo configuration

Secret / var	Purpose
`BAZEL_REMOTE_CACHE_URL` (repo secret)	bazel disk cache for build-images
`OSMO_SLACK_BOT_TOKEN` (repo secret, xoxb-)	Slack chat.postMessage
`internal-ci` env vars `AZURE_CLIENT_ID` / `AZURE_TENANT_ID` / `AZURE_SUBSCRIPTION_ID` / `AZURE_RESOURCE_GROUP` (+ optional `AZURE_REGION` / `AZURE_CLUSTER_NAME`)	OIDC federation + terraform inputs
`CI_SLACK_CHANNEL` (optional repo/org var)	Overrides default Slack channel (osmo-slack-test)

All currently provisioned.

Follow-ups (intentionally out of scope)

Drop ~> 0.17.0 vnet pin once Terraform ≥ 1.10 (or upstream guards the validation with try()).
Re-broaden OETF_TAGS to kind once router-connectivity passes (currently fails on Azure CoreDNS — cluster networking issue, not OETF). Task-env was unblocked by Convert OETF dataset fixtures to task I/O #1128 and is now in.
Land a long-running shared internal-ci AKS and strip the TEMP scaffolding.
Promote default_ctrl cpu request to chart default (currently a per-Azure overlay).
Move duplicated bits to composite actions (setup-bazel-cached, slack-post, free-disk-space) when a second provider workflow lands.

🤖 Generated with Claude Code

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

coderabbitai · 2026-06-05T01:01:12Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

New shell script run-deployment-test.sh orchestrates end-to-end deployment testing: bootstraps provider-specific environments (KIND or Azure), runs the repo deploy script, executes an OETF Bazel smoke test, records per-stage JSON/JUnit results, enforces a 45-minute hard timeout, and performs provider-aware teardown. A GitHub Actions workflow invokes these modes.

Changes

Deployment Test Gate Pipeline

Layer / File(s)	Summary
Configuration, CLI parsing, and environment setup `deployments/scripts/run-deployment-test.sh` (lines 59–151)	Strict bash execution mode with CI guardrails; CLI parsing for provider, chart/image overrides, Azure params, and optional `SKIP_OETF`/`SKIP_TEARDOWN`. Run-dir and global state initialization.
Execution harness, result tracking, and teardown infrastructure `deployments/scripts/run-deployment-test.sh` (lines 152–298)	Stage recording, exit-code categorization, JSON and minimal JUnit emission, and EXIT-trap cleanup preserving first failure and performing best-effort destroy/cleanup.
Watchdog and staged-run executor `deployments/scripts/run-deployment-test.sh` (lines 301–345)	Hard-timeout watchdog (background sleep + kill), `stop_watchdog`, and `run_stage` that measures duration, maps failures to categorized codes, records state, and exits on failure.
Provider-specific cluster bootstrap `deployments/scripts/run-deployment-test.sh` (lines 349–443)	`byo-kind`: create KIND cluster + Postgres/Redis sidecars on KIND network and export credentials; `microk8s`: explicit unsupported error; `azure`: validate identifiers, refresh AKS kubeconfig via `az`, and verify cluster reachability.
OSMO deployment with provider-specific configuration `deployments/scripts/run-deployment-test.sh` (lines 445–521)	Translate provider taxonomy (`byo-kind`→`byo`), forward `OSMO_CHART_VERSION`/`OSMO_IMAGE_TAG`, assemble provider CLI args (NodePort/storage bypass, `--skip-terraform` for Azure), invoke deploy script and tee output to `deploy.log`.
OETF smoke test execution `deployments/scripts/run-deployment-test.sh` (lines 523–609)	Optional skip via `SKIP_OETF`. Resolve OSMO URL (localhost for KIND, poll LoadBalancer IP for Azure). Locate Bazel OETF target across repo layouts, ensure `bazel` exists, run `bazel run` with captured `oetf.log`.
Stage orchestration and execution flow `deployments/scripts/run-deployment-test.sh` (lines 611–622), `.github/workflows/d4-deployment-test.yaml` (lines 1–181)	Run stages in sequence with categorized exit codes, stop watchdog after completion, emit JSON/JUnit artifacts. Workflow adds `d4-deployment-test` dispatch (`init-only`, `auth-check`, `full-deployment`); `full-deployment` pins tools, runs the script for Azure, and uploads run artifacts.

Sequence Diagram

sequenceDiagram
  participant GH_Workflow as GitHub_Workflow
  participant Test_Script as run-deployment-test.sh
  participant Bootstrap as stage_bootstrap
  participant DeployScript as deployments/scripts/deploy
  participant OETF as Bazel_OETF
  participant Provider as Cluster
  GH_Workflow->>Test_Script: workflow starts / bash run-deployment-test.sh --provider azure
  Test_Script->>Bootstrap: stage_bootstrap(provider)
  Bootstrap->>Provider: provision or fetch kubeconfig (KIND or AKS)
  Provider-->>Bootstrap: kubeconfig / credentials
  Test_Script->>DeployScript: stage_deploy (env OSMO_CHART_VERSION, OSMO_IMAGE_TAG, provider args)
  DeployScript->>Provider: install OSMO components
  Provider-->>DeployScript: service endpoints / LoadBalancer IP
  Test_Script->>OETF: stage_oetf_smoke (OSMO_URL, tags)
  OETF->>Provider: perform smoke API calls
  OETF-->>Test_Script: test result
  Test_Script->>Test_Script: record_stage, emit_result_json, emit_junit_xml
  Test_Script->>Provider: cleanup/destroy (on EXIT)
  Provider-->>Test_Script: teardown complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I spun a cluster in moonlit code,
Launched charts and watched the logs explode,
Ran a smoke, then wrote the run,
Timed the minutes—forty-five to sun—
Teardown whispers: all systems stowed.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title matches the main theme of the PR: adding Azure-focused deployment testing.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch jiaenr/d4-wrapper-azure

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

🧹 Nitpick comments (2)

deployments/scripts/run-deployment-test.sh (2)

175-198: 💤 Low value

Minor: JSON string values are not escaped.

If PROVIDER, CHART_VERSION, or IMAGE_TAG contain characters like ", \, or newlines, the emitted JSON would be malformed. In practice these inputs are controlled, but consider using jq for safer generation if it's available on CI runners.

🔧 Optional: Use jq for JSON generation

 emit_result_json() {
     local overall="pass"
     [[ "$OVERALL_EXIT_CODE" -ne 0 ]] && overall="fail"

+    if command -v jq >/dev/null 2>&1; then
+        local stages_json="[]"
+        for i in "${!STAGE_NAMES[@]}"; do
+            stages_json=$(jq --arg name "${STAGE_NAMES[$i]}" \
+                             --argjson code "${STAGE_EXIT_CODES[$i]}" \
+                             --argjson dur "${STAGE_DURATIONS[$i]}" \
+                             '. + [{"name": $name, "exit_code": $code, "duration_seconds": $dur}]' \
+                             <<< "$stages_json")
+        done
+        jq -n --arg provider "$PROVIDER" \
+              --arg chart "$CHART_VERSION" \
+              --arg tag "$IMAGE_TAG" \
+              --argjson stages "$stages_json" \
+              --arg overall "$overall" \
+              --argjson code "$OVERALL_EXIT_CODE" \
+              --arg failed "$FAILED_STAGE" \
+              '{provider: $provider, chart_version: $chart, image_tag: $tag, stages: $stages, overall: $overall, exit_code: $code, failed_stage: $failed}' \
+              > "$RESULT_JSON"
+        return
+    fi
+    # Fallback to printf if jq unavailable
     {
         printf '{\n'
         # ... existing printf logic ...

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deployments/scripts/run-deployment-test.sh` around lines 175 - 198, In
emit_result_json, JSON string fields (PROVIDER, CHART_VERSION, IMAGE_TAG,
FAILED_STAGE and stage names) are not escaped; change emit_result_json to build
the JSON with a safe generator: if jq is available use a jq -n invocation with
--arg for PROVIDER/CHART_VERSION/IMAGE_TAG/FAILED_STAGE and construct the
"stages" array from the STAGE_NAMES/STAGE_EXIT_CODES/STAGE_DURATIONS arrays so
jq handles escaping; if jq is not present fall back to invoking a small JSON
encoder (e.g. python -c "import json,sys;print(json.dumps(...))") or an escape
helper that runs values through json.dumps to escape quotes/backslashes/newlines
before writing to RESULT_JSON; keep the same output keys and preserve
OVERALL_EXIT_CODE, overall, and the loop over STAGE_* but ensure all string
values are passed through the safe encoder.

349-389: 💤 Low value

Consider adding a brief health check for sidecar containers.

The postgres and redis containers are started with docker run -d but there's no wait for them to be ready before proceeding. If stage_deploy starts before postgres accepts connections, the deployment might fail intermittently.

🔧 Optional: Add container readiness wait

     docker run -d --name osmo-test-redis --network kind \
         redis:7 redis-server --requirepass test-redis-password

+    # Brief wait for containers to accept connections
+    log_info "Waiting for postgres to accept connections"
+    local pg_ready=0
+    for _ in {1..30}; do
+        if docker exec osmo-test-postgres pg_isready -U postgres >/dev/null 2>&1; then
+            pg_ready=1
+            break
+        fi
+        sleep 1
+    done
+    [[ "$pg_ready" -eq 0 ]] && { log_error "postgres not ready after 30s"; return 1; }
+
     # Export creds for deploy-osmo-minimal.sh's --non-interactive path.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deployments/scripts/run-deployment-test.sh` around lines 349 - 389,
stage_bootstrap_byo_kind starts the postgres and redis sidecars but doesn't wait
for them to accept connections; add container readiness loops after the docker
run calls that poll until Postgres and Redis are ready (use the container names
osmo-test-postgres and osmo-test-redis). For Postgres, run a retry loop invoking
pg_isready (or docker exec osmo-test-postgres pg_isready -U
"$POSTGRES_USERNAME") with a timeout/backoff and fail if not ready within e.g.
60s; for Redis, run a retry loop using redis-cli PING (docker exec
osmo-test-redis redis-cli -a "$REDIS_PASSWORD" PING) with similar
timeout/backoff. Ensure the loops log progress and exit non-zero on timeout so
stage_deploy doesn't proceed on intermittent failures.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@deployments/scripts/run-deployment-test.sh`:
- Around line 175-198: In emit_result_json, JSON string fields (PROVIDER,
CHART_VERSION, IMAGE_TAG, FAILED_STAGE and stage names) are not escaped; change
emit_result_json to build the JSON with a safe generator: if jq is available use
a jq -n invocation with --arg for PROVIDER/CHART_VERSION/IMAGE_TAG/FAILED_STAGE
and construct the "stages" array from the
STAGE_NAMES/STAGE_EXIT_CODES/STAGE_DURATIONS arrays so jq handles escaping; if
jq is not present fall back to invoking a small JSON encoder (e.g. python -c
"import json,sys;print(json.dumps(...))") or an escape helper that runs values
through json.dumps to escape quotes/backslashes/newlines before writing to
RESULT_JSON; keep the same output keys and preserve OVERALL_EXIT_CODE, overall,
and the loop over STAGE_* but ensure all string values are passed through the
safe encoder.
- Around line 349-389: stage_bootstrap_byo_kind starts the postgres and redis
sidecars but doesn't wait for them to accept connections; add container
readiness loops after the docker run calls that poll until Postgres and Redis
are ready (use the container names osmo-test-postgres and osmo-test-redis). For
Postgres, run a retry loop invoking pg_isready (or docker exec
osmo-test-postgres pg_isready -U "$POSTGRES_USERNAME") with a timeout/backoff
and fail if not ready within e.g. 60s; for Redis, run a retry loop using
redis-cli PING (docker exec osmo-test-redis redis-cli -a "$REDIS_PASSWORD" PING)
with similar timeout/backoff. Ensure the loops log progress and exit non-zero on
timeout so stage_deploy doesn't proceed on intermittent failures.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed7f5e12-a3da-40ad-9dd9-881a215d8342

📥 Commits

Reviewing files that changed from the base of the PR and between df36dcf and 6df5fc7.

📒 Files selected for processing (1)

deployments/scripts/run-deployment-test.sh

codecov · 2026-06-12T21:51:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.82%. Comparing base (b8b8d07) to head (540ee4f).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1070      +/-   ##
==========================================
- Coverage   58.88%   58.82%   -0.06%     
==========================================
  Files         201      201              
  Lines       25574    25496      -78     
  Branches     3840     3827      -13     
==========================================
- Hits        15060    14999      -61     
+ Misses       9838     9821      -17     
  Partials      676      676

Flag	Coverage Δ
backend	`60.98% <ø> (-0.10%)`	⬇️
ui	`35.19% <ø> (+0.23%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 9 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

.github/workflows/d4-deployment-test.yaml (1)
62-62: 🔒 Security & Privacy | ⚡ Quick win

Disable checkout credential persistence before running repo-controlled shell.

Both jobs execute repository code after checkout, and full-deployment also exposes OIDC and database credentials. Keeping GITHUB_TOKEN in the local git config widens what those later steps can read for no benefit.
Suggested hardening
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v4
+        with:
+          persist-credentials: false
...
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v4
+        with:
+          persist-credentials: false
Also applies to: 90-90
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/d4-deployment-test.yaml at line 62, The checkout step is
leaving GITHUB_TOKEN in the local git config which lets later repo-controlled
shell steps access secrets; update each actions/checkout@v4 invocation
(including the one used by the full-deployment job) to include the input "with:
persist-credentials: false" so the checkout does not write credentials into the
repo git config (apply this change to both occurrences mentioned in the
comment).
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/d4-deployment-test.yaml:
- Around line 125-132: The upload-artifact step using actions/upload-artifact@v4
(artifact name deployment-test-run-${{ github.run_id }}) is missing JUnit XML
files; update its with.path glob to include the junit.xml (e.g., add
runs/**/*.xml or runs/**/junit.xml) so that the JUnit output produced by
run-deployment-test.sh is uploaded alongside the .log and .json files.
- Around line 100-105: The workflow downloads kubectl directly to /usr/local/bin
without sudo which fails on GitHub runners; change the steps to download to a
writable temp path (use the existing KUBECTL_VERSION and a temp filename like
/tmp/kubectl), fetch the corresponding .sha256, run the checksum check against
that temp path (/tmp/k.sha referencing /tmp/kubectl), and only after
verification run sudo mv to /usr/local/bin/kubectl and sudo chmod +x; update the
curl and checksum commands that currently reference "/usr/local/bin/kubectl",
the checksum creation that writes to /tmp/k.sha, and the final chmod operation
accordingly.
- Around line 96-123: The CI job runs deployments/scripts/run-deployment-test.sh
which always triggers the oetf-smoke stage (stage_oetf_smoke) that calls bazel,
but the workflow step "install kubectl + helm" does not install Bazel; either
explicitly install Bazel in this job before invoking run-deployment-test.sh
(ensure bazel is on PATH and install a pinned version) or export SKIP_OETF=1 (or
pass --skip-oetf) in the "run-deployment-test.sh --provider azure" step to skip
stage_oetf_smoke; update the step that runs run-deployment-test.sh and reference
the oetf-smoke / stage_oetf_smoke logic and SKIP_OETF/--skip-oetf to implement
the chosen fix.
- Around line 115-123: The workflow is missing Azure CLI authentication before
calling deployments/scripts/run-deployment-test.sh which calls
stage_bootstrap_azure and runs az aks get-credentials; add an authentication
step using the azure/login GitHub Action (or an az login equivalent) prior to
the step that executes run-deployment-test.sh so the runner has an active Azure
session and az aks get-credentials can succeed; ensure the login step uses the
same variables/secrets (AZURE_SUBSCRIPTION_ID and any required credentials)
already referenced by the job.

---

Nitpick comments:
In @.github/workflows/d4-deployment-test.yaml:
- Line 62: The checkout step is leaving GITHUB_TOKEN in the local git config
which lets later repo-controlled shell steps access secrets; update each
actions/checkout@v4 invocation (including the one used by the full-deployment
job) to include the input "with: persist-credentials: false" so the checkout
does not write credentials into the repo git config (apply this change to both
occurrences mentioned in the comment).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b963b883-2ee1-400e-aaa9-82254ea494e2

📥 Commits

Reviewing files that changed from the base of the PR and between 6df5fc7 and 8e0ddda.

📒 Files selected for processing (1)

.github/workflows/d4-deployment-test.yaml

coderabbitai

♻️ Duplicate comments (4)

.github/workflows/d4-deployment-test.yaml (4)

136-140: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Install kubectl from a writable temp path.

curl -fsSLo /usr/local/bin/kubectl runs without sudo, so this step will fail on GitHub-hosted Ubuntu before checksum verification or deployment even starts.

Suggested fix

          KUBECTL_VERSION=v1.31.0
-          curl -fsSLo /usr/local/bin/kubectl \
+          curl -fsSLo /tmp/kubectl \
            "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
          curl -fsSL "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl.sha256" \
-            | awk '{print $1"  /usr/local/bin/kubectl"}' | sudo tee /tmp/k.sha | sha256sum -c -
-          sudo chmod +x /usr/local/bin/kubectl
+            | awk '{print $1"  /tmp/kubectl"}' | sudo tee /tmp/k.sha | sha256sum -c -
+          sudo install -m 0755 /tmp/kubectl /usr/local/bin/kubectl

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/d4-deployment-test.yaml around lines 136 - 140, The
current steps download kubectl directly to /usr/local/bin/kubectl without sudo,
which fails on GitHub runners; change the download/checksum sequence to write to
a writable temp path (e.g., /tmp/kubectl) using the same KUBECTL_VERSION,
perform the sha256 verification against that temp file, then use sudo to move
the verified file into /usr/local/bin/kubectl and set executable permissions
with sudo chmod +x; update the three affected commands (the curl download
target, the checksum input path, and the final move/chmod) to operate on the
temp file and then atomically install to /usr/local/bin/kubectl with sudo.

150-159: 🩺 Stability & Availability | 🟠 Major | ⚖️ Poor tradeoff

Make the OETF/Bazel dependency explicit in CI.

full-deployment in .github/workflows/d4-deployment-test.yaml installs only Terraform + kubectl + Helm, then runs deployments/scripts/run-deployment-test.sh --provider azure. That script always runs the oetf-smoke stage (unless SKIP_OETF=1 / --skip-oetf), and stage_oetf_smoke requires bazel on PATH and invokes bazel run for the OETF target. Since this workflow never installs Bazel, it can fail depending on what ubuntu-latest happens to include—install Bazel explicitly or set SKIP_OETF=1 if this job is intended to be deploy-only.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/d4-deployment-test.yaml around lines 150 - 159, The CI
step that runs "bash deployments/scripts/run-deployment-test.sh --provider
azure" invokes the oetf-smoke stage (stage_oetf_smoke) which requires bazel on
PATH; either ensure Bazel is installed before that step (add a Bazel installer
step or use the official setup action) or explicitly skip OETF by exporting
SKIP_OETF=1 in the env for the run step; update the workflow step (the one named
"run-deployment-test.sh --provider azure") to include the Bazel installation or
add SKIP_OETF=1 to the env so bazel run is not invoked.

150-159: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Authenticate the Azure CLI before invoking the wrapper.

deployments/scripts/run-deployment-test.sh's stage_bootstrap_azure runs az aks get-credentials, but .github/workflows/d4-deployment-test.yaml has no azure/login step (and no az login), only Terraform's ARM_USE_OIDC/ARM_* env. On runners without a preexisting az session, the bootstrap will fail.

One way to wire the missing auth step

+      - uses: azure/login@v2
+        with:
+          client-id: ${{ vars.AZURE_CLIENT_ID }}
+          tenant-id: ${{ vars.AZURE_TENANT_ID }}
+          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
+
       - name: run-deployment-test.sh --provider azure
         run: |
           bash deployments/scripts/run-deployment-test.sh --provider azure

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/d4-deployment-test.yaml around lines 150 - 159, The
workflow is missing Azure CLI authentication before calling
deployments/scripts/run-deployment-test.sh (its stage_bootstrap_azure calls az
aks get-credentials), so add an authentication step using the azure/login action
(or an explicit az login) prior to the step that runs run-deployment-test.sh;
ensure the azure/login step uses the same subscription/credentials
(AZURE_SUBSCRIPTION_ID or OIDC/ARM envs) available to the job and place it
before the run-deployment-test.sh invocation so stage_bootstrap_azure can
successfully call az aks get-credentials.

160-167: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Upload the JUnit XML too.

run-deployment-test.sh writes runs/.../junit.xml, but the artifact glob only collects *.log and *.json. That drops one of the structured outputs this workflow is supposed to preserve.

Suggested fix

        with:
          name: deployment-test-run-${{ github.run_id }}
          path: |
            runs/**/*.log
            runs/**/*.json
+            runs/**/*.xml
          retention-days: 14

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/d4-deployment-test.yaml around lines 160 - 167, The
artifact upload step using actions/upload-artifact@v4 is only collecting
runs/**/*.log and runs/**/*.json but run-deployment-test.sh also writes
runs/.../junit.xml; update the upload-artifact step (the block invoking
actions/upload-artifact@v4) to include runs/**/*.xml or specifically
runs/**/junit.xml in the path glob so the JUnit XML files are preserved
alongside logs and JSON outputs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In @.github/workflows/d4-deployment-test.yaml:
- Around line 136-140: The current steps download kubectl directly to
/usr/local/bin/kubectl without sudo, which fails on GitHub runners; change the
download/checksum sequence to write to a writable temp path (e.g., /tmp/kubectl)
using the same KUBECTL_VERSION, perform the sha256 verification against that
temp file, then use sudo to move the verified file into /usr/local/bin/kubectl
and set executable permissions with sudo chmod +x; update the three affected
commands (the curl download target, the checksum input path, and the final
move/chmod) to operate on the temp file and then atomically install to
/usr/local/bin/kubectl with sudo.
- Around line 150-159: The CI step that runs "bash
deployments/scripts/run-deployment-test.sh --provider azure" invokes the
oetf-smoke stage (stage_oetf_smoke) which requires bazel on PATH; either ensure
Bazel is installed before that step (add a Bazel installer step or use the
official setup action) or explicitly skip OETF by exporting SKIP_OETF=1 in the
env for the run step; update the workflow step (the one named
"run-deployment-test.sh --provider azure") to include the Bazel installation or
add SKIP_OETF=1 to the env so bazel run is not invoked.
- Around line 150-159: The workflow is missing Azure CLI authentication before
calling deployments/scripts/run-deployment-test.sh (its stage_bootstrap_azure
calls az aks get-credentials), so add an authentication step using the
azure/login action (or an explicit az login) prior to the step that runs
run-deployment-test.sh; ensure the azure/login step uses the same
subscription/credentials (AZURE_SUBSCRIPTION_ID or OIDC/ARM envs) available to
the job and place it before the run-deployment-test.sh invocation so
stage_bootstrap_azure can successfully call az aks get-credentials.
- Around line 160-167: The artifact upload step using actions/upload-artifact@v4
is only collecting runs/**/*.log and runs/**/*.json but run-deployment-test.sh
also writes runs/.../junit.xml; update the upload-artifact step (the block
invoking actions/upload-artifact@v4) to include runs/**/*.xml or specifically
runs/**/junit.xml in the path glob so the JUnit XML files are preserved
alongside logs and JSON outputs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a7df99ef-d202-4834-9d88-4c794588a4a6

📥 Commits

Reviewing files that changed from the base of the PR and between 8e0ddda and 30d0f9b.

📒 Files selected for processing (1)

.github/workflows/d4-deployment-test.yaml

Root cause for verify-hello FAILED_SUBMISSION (Value 1.0 too high for CPU even with K8s allocatable=3 per node): postgres.py construct_updated_allocatables computes the K8_CPU placeholder as node.allocatable.cpu − default_ctrl.requests.cpu − non_workflow_usage The chart's default_ctrl pod template (charts/service/values.yaml:494) has `requests.cpu: "1"` for the osmo-ctrl sidecar that runs alongside every workflow task pod. So even on a 3-node Standard_D4s_v3 cluster (3 CPU allocatable per node), the math is roughly: K8_CPU = 3 − 1.0 (ctrl tax) − ~1.5 (Azure daemons + OSMO services) ≈ 0.5 and `1.0 LE 0.5` is false. Last run's diagnostic confirmed: nodes allocatable=3860m, but every cpu=1 task got rejected. Override the ctrl sidecar's scheduling request to 100m via helm-set. The chart's CPU *limit* on ctrl/user containers still tracks USER_CPU, so the user's task gets the requested CPU budget at runtime — only the scheduling reservation shrinks. Pair with the existing 5-service →100m overrides; together they keep ~2.5 CPU schedulable per node which is enough for verify-hello (cpu=1) and any OETF smoke / scenario tests with reasonable resource asks.

…orkflow, not nested Previous commit landed `services.configs.workflow.podTemplates...` which made the ConfigMap loader reject the resolved configmap with: ERROR configmap_loader: ConfigMap validation failed, keeping previous config: workflow: podTemplates: Extra inputs are not permitted (visible in podlog-osmo-minimal-osmo-worker-*.log from the previous diagnostic dump). The chart's pydantic schema for `workflow` doesn't declare a `podTemplates` field — `podTemplates` is a SIBLING of `workflow` under `services.configs`, not a child: services: configs: service: {} workflow: max_num_tasks: ... dataset: {} podTemplates: ← here, not under workflow default_ctrl: ... Adjust the helm-set to the correct path.

…(replace broken --set) Previous --helm-set approach blew up: helm REPLACES list elements wholesale instead of merging, so --set 'services.configs.podTemplates.default_ctrl.spec.containers[0] .resources.requests.cpu=100m' wiped the chart's container `name: osmo-ctrl`, all the `limits:` entries, and the `memory`/`ephemeral-storage` `requests:` siblings. The rendered ConfigMap became invalid; verify-hello submission hung and envoy returned 504 (visible in deploy.log + diagnostics/helm-values-osmo-minimal.yaml from artifact 7664049168 — `containers: [{resources: {requests: {cpu: 100m}}}]` with the container name + limits + other resource fields all gone). Fix: layer a small `ci/deployment-test/azure-overrides.yaml` via deploy-osmo-minimal's `--helm-values` flag instead of --set. Helm merges values files deeply, so providing the full default_ctrl spec (name + limits {{USER_CPU}}/{{USER_MEMORY}}/{{USER_STORAGE}} + requests cpu=100m, memory=1Gi, storage=1Gi) keeps the rest intact. Path uses SCRIPT_DIR-relative resolution because the wrapper's REPO_ROOT assumes external/ submodule wrapping (see line 127-128 comment) and goes one level too high when the public repo is checked out standalone (as it is in our GHA gate).

…rict-LE bar After three iterations chasing the right fix for FAILED_SUBMISSION / "Value 1.0 too high for CPU", the diagnostic-dump artifact made the math explicit: K8_CPU = max(0, int(float(node.allocatable.cpu) − default_ctrl.requests.cpu − math.ceil(non_workflow_usage))) On D4s_v3 (allocatable 3860m, truncated to 3 by the agent's int cast): K8_CPU = max(0, int(3 − 0.1 − math.ceil(1.3))) = max(0, int(0.9)) = 0 The 1.3 vCPU non-workflow usage is structural — Azure system daemons alone request ~1 vCPU per node (ama-logs 170m, coredns 100×2, metrics- server 157×2, kube-proxy 100m, azure-npm 50m, azure-ip-masq-agent 50m, cloud-node-manager 50m, azure CSI 60m, konnectivity 40m, autoscaler 20m). Even at 0 OSMO overhead the per-node total is ~1.0–1.1, which math.ceil rounds up to 2. D8s_v3 (8 vCPU, 7860m allocatable, int-cast to 7): K8_CPU = max(0, int(7 − 0.1 − 2)) = max(0, int(4.9)) = 4 Plenty of room. 1.0 LE 4 holds, scenario tests with cpu=2 or cpu=4 would also fit. Cost is ~2× per-minute but the workflow runs faster (less scheduling delay), so the wall-clock delta is small. The wrapper's existing helm-set reductions and the ctrl-100m helm-values overlay stay in place — they're still right, the cluster just didn't have the absolute CPU headroom to make them sufficient.

…U-mode shim) Chart-generated workflow task pods set `runtimeClassName: nvidia`. On GPU deploys gpu-operator provides that RuntimeClass; on our --no-gpu Azure cluster nothing does, so k8s admission rejects every workflow pod with `RuntimeClass "nvidia" not found` (HTTP 403). The result: verify-hello-1's `hello` task ended in FAILED_SERVER_ERROR with the backend-worker logging: Fatal exception of type ForbiddenError: 403 pod rejected: RuntimeClass "nvidia" not found …when running job (type=CreateGroup, id=11459a80a9a34e49aea0d68b7bb87f00-hello-group-submit) This is the same pattern OETF's KindAdapter handles via `_apply_nvidia_runtimeclass_stub` (see test/oetf/deploy_adapters/ kind_adapter.py:347-371). The fix is mechanical: install a stub RuntimeClass named `nvidia` whose handler is the default `runc`. Apply it in the same "wire kubectl + pre-create GHCR pull secret" step that already runs after AKS is up. printf-into-kubectl avoids the heredoc / yaml-indentation gotcha that bit the diagnostic step.

…deployment verify-hello is now reliably COMPLETED on the Azure gate (see commit 19bddbf), so the wrapper's OETF smoke stage is ready to exercise. Three changes: 1. Drop SKIP_OETF=1 (the comment said test/oetf wasn't on this branch — that became false after the rebase onto current main). 2. Set OETF_TAGS=kind. The chart-tag-default `smoke` matches no tests; `kind` is the BUILD-file tag used to mark "validated against cpu-mode chart deploy", which describes our --no-gpu Azure deploy too. With kind we run api-checks + websocket-checks. 3. Set OETF_REPO_ROOT=${{ github.workspace }} explicitly. The wrapper's REPO_ROOT (= SCRIPT_DIR/../../..) is computed for external/ submodule wrapping; on a standalone public-repo checkout it points to the workspace's PARENT, where test/oetf doesn't exist. Explicit OETF_REPO_ROOT sidesteps that. 4. Install Bazel in the full-deployment job (same setup-bazel@v4 pin as build-images / pr-checks.yaml). The wrapper's stage_oetf_smoke runs `bazel run //test/oetf:run` inline, so bazel has to be on PATH in this job too. Sharing the disk-cache key with build-images means OETF target builds reuse anything already-cached.

…chable from GHA runner) Last OETF run had verify-hello pass cleanly, then every single OETF bazel test failed with: ConnectTimeoutError(host='20.15.32.147', port=80, timeout=60) The chart's osmo-gateway Service is LoadBalancer type and kubectl shows the external IP within ~30s of deploy, but the IP isn't actually reachable from the GitHub runner during the OETF window — LB propagation delay, NSG default, or both. Total OETF wall time was 37 min (oetf-smoke stage exited with code 4) before timeout. The verify-hello check (verify.sh) hit no such issue because it goes through `kubectl port-forward osmo-gateway:80 → localhost:9000` and submits against localhost. Mirror that for OETF: start a fresh PF on a separate port (9100 by default), curl-smoke it, then pass http://localhost:$port to bazel run //test/oetf:run. RETURN-trap the PF process so it dies whether OETF passes, fails, or the function early-returns. This eliminates the LB as a dependency for OETF connectivity — same as how the rest of the wrapper already validates the deploy.

…wn-green tests Last run had verify-hello PASS + 7 of 10 OETF tests PASS. The 3 failures were each a separate, real issue documented in the artifact: 1. smoke:api-checks — `GET api/workflow failed: No pool selected!` The dev-auth admin user has no default pool stored. Fix: wrapper now passes --pool default (OETF_POOL env overrides). Chart's default pool is named `default`. 2. scenarios:task-runtime-environment — pydantic validation: `outputs.0.url Field required` + `outputs.0.dataset Extra inputs are not permitted`. The OETF test fixture uses the old workflow spec schema; the chart renamed `outputs.dataset` to `outputs.url`. Real OETF test bug. Drop the test from this gate's scope (task-env tag) — will re-enable once OETF fixture is updated. 3. scenarios:router-connectivity — `cli_exec exit=2 (Temporary failure in name resolution)`. Workflow task pod can't resolve a hostname; cluster-networking issue (likely the kind-specific test references a kind-local hostname that doesn't exist in our Azure AKS env). Real chart/test issue. Drop from this gate's scope (router tag) until fixed. Tag set switches from `kind` to `api,websocket,logger,negative,serial`, which includes: - api-checks (api, kind) ← fixed by --pool - websocket-checks (websocket, kind) ← already passes - logger-connectivity (logger, positive, kind) - resource-validation (resource, negative, kind) - command-validation (command, negative, kind) - mount-validation (mount, negative, kind) - templates (template, negative, kind) - serial-workflow-mounting (serial, kind) Eight tests total: smoke (2), positive scenario (1: logger-connectivity), submission-time validation (4), real serial workflow (1). Covers user's "verify-hello + oetf smoke + min simple scenario test" requirement with margin.

Last OETF run had 7 of 11 tests pass; 4 failed on two themes: 1. smoke:api-checks — `GET /api/workflow` rejected with "No pool selected!". The --pool flag we pass to `bazel run //test/oetf:run` ends up in env-config and propagates to a few places, but api-checks calls the endpoint without any query params at all. The server then looks for a profile-level default pool, which the dev-auth admin user doesn't have. Fix: explicitly `osmo login` + `osmo profile set pool default` once the port-forward is up, BEFORE invoking OETF. The setting persists in the profile_settings table so all subsequent admin calls inherit it. 2. scenarios:{serial-workflow, serial-workflow-update-dataset, regex-workflow} — all hit the same pydantic schema error as task-runtime-environment did earlier ("outputs.0.dataset Extra inputs are not permitted"). Same scenarios/serial.py fixture uses the pre-rename schema. Drop the `serial` tag from OETF_TAGS to skip these three; the well-behaved serial-workflow-mounting is collateral — re-enable when OETF's serial fixture is updated. Final tag set: `api,websocket,logger,negative` → 7 tests: - smoke:api-checks (after pool fix) - smoke:websocket-checks - scenarios:logger-connectivity (real workflow execution) - scenarios:resource-validation - scenarios:command-validation - scenarios:mount-validation - scenarios:templates Covers user's requirement: verify-hello + OETF smoke + min simple scenario (logger-connectivity submits, runs, streams logs back via osmo-logger; exercises the gateway, service, worker, agent, logger, and backend-operator end-to-end).

…tion fixes PR #1114 shipped on main and obsoletes two of our gate workarounds: 1. `osmo profile set pool default` step in the wrapper is no longer needed. #1114's test/smoke/api_checks.py:32 changed test_list_workflows to: self.http("GET", "/api/workflow") \ .params(limit=5, pool=self.config.pool) \ .expect_ok() So the `--pool default` flag we pass to `bazel run //test/oetf:run` now feeds the request directly. The wrapper's pre-OETF profile-set block was correct compensation for the bug, but it's now dead code. 2. OETF_TAGS narrowed from `kind` to `api,websocket,logger,negative` in commit 53a6e9e to skip five OETF-side schema bugs and one cluster-DNS issue. #1114 dropped hardcoded `platform: cpu-x86` from public scenario YAMLs (pool default_platform fallback) AND restored the test/workflow/{scripts,input}/ bundle, which were the root causes of those failures. Reverting OETF_TAGS to `kind` to re-include logger-connectivity (now the canonical "real workflow" smoke), task-runtime-environment, router-connectivity, and serial-workflow-mounting. If any of the newly-included tests still trips a real bug, the always-on diagnostic dump will pinpoint it.

…capacity exhausted) Two terraform apply attempts on c31bd00 (the post-#1114 rebase) failed with the same Azure error: AllocationFailed: Request failed due to insufficient capacity. Retry using a different Azure Managed Redis size, region, or contact Azure support for assistance. The default redis_sku_name (`ComputeOptimized_X3`, per deployments/terraform/azure/example/variables.tf:228) is genuinely exhausted in eastus2 today — back-to-back attempts both hit the same allocation failure on `creating Redis Enterprise … polling failed`. Drop to `Balanced_B0`, the smallest Managed Redis tier. It provisions out of a less-contended capacity pool and is more than enough for our verify-hello + OETF smoke workload. Applied to both the apply and destroy TF_VARS so the destroy doesn't try to recreate state with the old default.

…us2 capacity) Four back-to-back terraform-apply attempts on PR 1070 hit Azure `AllocationFailed` for Managed Redis in eastus2 across two SKUs (ComputeOptimized_X3, Balanced_B0). Capacity exhaustion is region- wide, not SKU-specific: "Request failed due to insufficient capacity. Retry using a different Azure Managed Redis size, region, or contact Azure support." Add an optional `redis_location` variable (default: same as RG, so existing consumers see no behavior change). When set, the Managed Redis resource lives in the specified region instead of the RG's. Azure Managed Redis exposes a public endpoint with TLS + key auth on by default in this module, so cross-region traffic from the AKS pool is fine — no private-link assumption to break. Workflow then passes `-var redis_location=westus2` to terraform apply (and the matching destroy block) so the gate can keep running while eastus2 capacity recovers.

…e thought The c31bd00 rebase dropped two workarounds on the optimistic read that #1114 had fixed them. The a65e5d2 verification run (verify-hello passed, OETF stage failed) showed both were still required: 1. `osmo profile set pool default` workaround (api-checks): #1114's test/smoke/api_checks.py:32 added `.params(pool=self.config.pool)` to test_list_workflows. But the server-side route at workflow_service.py:579-587 reads `pools` (PLURAL) as a fastapi.Query list. The singular `pool=` is silently ignored; the handler then falls through to UserProfile.pool which is empty for dev-auth admin and raises "No pool selected!". The pre-rebase workaround populated the profile-level pool so the fallback path succeeds — still the right fix. 2. OETF_TAGS narrowed back to `api,websocket,logger,negative`: The two remaining `kind`-tagged scenarios that #1114 was meant to unblock are still broken on this branch's tree: - task-runtime-environment/spec.yaml STILL contains `outputs: - dataset:` (pre-rename schema). #1114's auto-injection of platform/bucket variables doesn't fix the outputs schema. Pydantic still rejects with "Extra inputs are not permitted". - router-connectivity still hits DNS resolution failure inside the workflow task pod — cluster networking issue, unrelated to #1114. 7-test set stays: smoke (api-checks + websocket-checks) + logger- connectivity (real workflow) + 4 submission-validation scenarios. Matches the previously-verified green run on 53a6e9e.

Add a daily cron trigger (06:00 UTC) so the gate runs against main once per day, independent of any PR push or label state. Schedule events fire only from the default branch — feature branches see no schedule-driven activity. On scheduled-run failure, post a brief Slack notification to the channel wired into the SLACK_WEBHOOK_URL repo secret (intended target: osmo-slack-test). The notification includes everything a dev needs to start investigating without leaving Slack: - Workflow run link (red-styled button) — full logs + step output - Artifacts deep-link (#artifacts anchor) — diagnostics + result.json - Commit diff link — what shipped today vs yesterday - Workflow file link — for editing the gate itself - Per-job status (build-images, full-deployment): pass/fail/skipped - Commit author + first-line subject — quick blame surface - Trigger context + branch - Context line pointing to deployment-test-result.json + diagnostics/ as the first-look investigation surface Gating: - notify-slack-on-failure runs `if: always() && github.event_name == 'schedule' && (build-images.result == 'failure' || full-deployment .result == 'failure')`. PR-label runs don't notify Slack (the PR author already sees the red check); workflow_dispatch runs are interactive and don't need a Slack ping either. - Missing SLACK_WEBHOOK_URL secret → warning + exit 0, never blocks the workflow status. - Non-200 from Slack webhook → warning, no job failure (upstream failure is already surfaced via run status). Setup required ONCE before this becomes useful: - Slack: provision an Incoming Webhook for #osmo-slack-test (Slack app or legacy webhook). - GitHub: Settings → Secrets and variables → Actions → New repository secret → Name `SLACK_WEBHOOK_URL`, Value = webhook URL.

…mmit compare + e2e test path Four review asks from the previous Slack-notification change: 1. Cron 5pm PT, not 6am UTC. Switch to '0 0 * * *' = 00:00 UTC = 17:00 PDT / 16:00 PST. GitHub cron is UTC and doesn't track DST; the 1-hour drift across the year is documented in the header comment. 2. Use TESTBOT_SLACK_BOT_TOKEN, mirror testbot's chat.postMessage pattern. Replace SLACK_WEBHOOK_URL + curl with the same `chat.postMessage` plumbing used by .github/workflows/update-distroless-images.yaml (lines 171-224) and testbot.yaml — Bearer auth, JSON payload with `channel` + `text` + `blocks`. Channel routed via `vars.TESTBOT_SLACK_CHANNEL` (fallback `osmo-slack-test`). No new secret to provision; the token is already configured in the repo. 3. Daily cron spans many commits — show all of them, not just HEAD. New step "Gather context" queries the GH API for the most recent successful scheduled run BEFORE this one, then builds a github.com/<repo>/compare/<prev_sha>...<this_sha> URL — clickable in Slack, shows every commit that landed since the last green run, plus total_commits count for the button label ("N commits since last green run"). Fallback when no prior green exists (first run after merge): plain "Recent commits on main" link. 4. End-to-end test without burning Azure resources. New workflow_dispatch input `test_slack: bool`. When true: - build-images + full-deployment skipped (their `if:` now excludes this case). - New `simulate-failure` stub job runs, exits 1. - notify-slack-on-failure's gate widened to fire on (event=schedule AND real failure) OR (test_slack=true AND simulate-failure failed). - The Slack message header + status fields self-label as `[TEST]` so it can't be confused with a real production failure. Cheap (~30s, no Azure spend, no images built), exercises the entire payload-build + chat.postMessage path against the real Slack workspace.

…O_SLACK_BOT_TOKEN Per review: use the OSMO_SLACK_BOT_TOKEN repo secret rather than TESTBOT_SLACK_BOT_TOKEN. testbot's bot token is environment-scoped to `nim-env` (testbot.yaml:121), so it's not available to this workflow's notify-slack-on-failure job anyway. OSMO_SLACK_BOT_TOKEN is the right separation of concerns: deployment-gate notifications shouldn't borrow the testbot's auth surface. If the secret isn't configured at the repo level, the existing "warn + exit 0" fallback skips the post — the gate's status is never tied to Slack delivery.

OSMO_SLACK_BOT_TOKEN turned out to be the wrong token type for chat.postMessage (`not_allowed_token_type` from Slack on the previous e2e run). Switch to the testbot's plumbing instead — the TESTBOT_SLACK_BOT_TOKEN secret is already a valid Slack bot token with chat:write scope (proven by testbot.yaml + update-distroless-images.yaml running it successfully today). Add `environment: nim-env` on notify-slack-on-failure so the job can see the secret. Branch policy on nim-env allows main + jiaenr/* + elookpotts/* (verified via /environments/nim-env/deployment-branch-policies), so both post-merge scheduled runs and pre-merge `test_slack=true` dispatches from this feature branch resolve the secret.

…elete one-shot Two fixes for the channel-misrouting issue surfaced on ea0330c's e2e run: 1. Hardcode TESTBOT_SLACK_CHANNEL = 'osmo-slack-test'. The previous `vars.TESTBOT_SLACK_CHANNEL || 'osmo-slack-test'` fell back through to the org-level var, which resolved to channel ID C096VCXRK8U (= osmo-code-reviews, the testbot's review-request channel). That's the wrong audience for deployment-gate failures. Drop the var fallback and pin to the intended channel literal. 2. Add a workflow_dispatch path `delete_slack_ts` + `delete_slack_channel` that calls Slack `chat.delete` with the nim-env-scoped bot token. This is a one-shot cleanup tool — when test-mode messages land in the wrong channel (as just happened), the maintainer can dispatch the workflow with those two inputs and the message gets removed without leaving artifacts. Cheap, self-contained, no Azure spend.

Add a one-shot diagnostic that calls Slack auth.test on both OSMO_SLACK_BOT_TOKEN and TESTBOT_SLACK_BOT_TOKEN. Surfaces token-prefix (xoxb-/xapp-/xoxp-/etc.), length, app/team/user IDs, and granted scopes via x-oauth-scopes header — all the info needed to explain why OSMO_SLACK_BOT_TOKEN hit not_allowed_token_type.

…ns scaffolding Both were one-shot tools — delete-slack-message cleaned up the test message that landed in osmo-code-reviews (caused by the org-var fallback we already removed), inspect-slack-tokens answered "why does TESTBOT_SLACK_BOT_TOKEN work but OSMO_SLACK_BOT_TOKEN doesn't?". Question definitively answered: TESTBOT_SLACK_BOT_TOKEN: xoxb- (bot token, 55 chars) scopes: chat:write, channels:read, channels:history, users:read, app_mentions:read, reactions:write, commands, im:read, im:write team: NVIDIA Internal | user: osmo_test_bot | bot_id: B0B3Z96T9AL OSMO_SLACK_BOT_TOKEN: xapp- (Socket Mode app-level token, 98 chars) scopes: connections:write, authorizations:read app_id: A0B2PA6KSSK (OSMO Test Bot) Same Slack app, different token types. xapp- is for opening Socket Mode WebSockets back from Slack, not for calling chat.postMessage — hence the `not_allowed_token_type` from Slack on the first attempt. The bot needs xoxb- (which is what nim-env's TESTBOT_SLACK_BOT_TOKEN already is). Workflow is now back to its minimal post-merge shape: schedule trigger, notify-slack-on-failure (osmo-slack-test, via nim-env), and test_slack dispatch input for e2e exercising. No diagnostic scaffolding left behind.

…act deep-link Two fixes: 1. Drop `environment: nim-env` and switch back to OSMO_SLACK_BOT_TOKEN. The repo-level secret was previously an xapp- Socket-Mode token (`not_allowed_token_type` from chat.postMessage); the user has refreshed its value to the xoxb- bot token. Same effective auth as before but without borrowing testbot's environment. 2. Replace the broken `<run-url>#artifacts` link with a properly-deep- linked artifact-download URL. GitHub has no `#artifacts` anchor — the fragment is silently dropped and the link lands at the top of the run page with no scroll. The working shape is: /<owner>/<repo>/actions/runs/<run_id>/artifacts/<artifact_id> which the GH UI itself uses for per-artifact download flows. "Gather context" step now calls GET /repos/{owner}/{repo}/actions/runs/{run_id}/artifacts resolves the first non-expired artifact's id + name, and emits both `artifact_url` + `artifact_label` (e.g. "Download deployment-test- run-28130819342"). Slack button uses the dynamic label. Fallback when no artifact exists (test_slack=true mode never reaches the upload step) → the run page itself + label "(no artifact yet — open run page)".

…l-failure Slack test Two opt-in dispatch inputs so we can exercise notify-slack-on-failure end-to-end against an actual deployment failure (not a stub): force_notify — widens the notify gate to fire on workflow_dispatch failures (otherwise schedule-only). oetf_tags_override — overrides the OETF_TAGS env, e.g. set to `kind` to include router-connectivity + task-runtime-environment which are known broken (DNS resolution + outputs.dataset schema drift) and reliably fail the oetf-smoke stage. Combined: `gh workflow run "Deployment Test" --field mode=full-deployment --field force_notify=true --field oetf_tags_override=kind` triggers a real ~30 min full-deployment run that fails at OETF stage, uploads the artifact, and notify-slack-on-failure posts to osmo-slack-test with the resolved per-artifact download URL. Both inputs are TEMP — to be removed once the genuine-failure verification is complete.

… verify real-failure path Background: tried workflow_dispatch with force_notify=true to drive a real full-deployment failure on this PR branch and observe the slack notification with a real per-artifact deep-link. But full-deployment declares `environment: internal-ci`, whose branch policy only allows `main / release/** / hotfix/** / refs/pull/*/merge`. workflow_dispatch from the PR branch (jiaenr/d4-wrapper-azure) doesn't match, so the job aborts in 2 seconds before any step runs — no wrapper invocation, no artifact upload. The `refs/pull/*/merge` path IS allowed though, so a regular pull_request event from this PR's existing ci:azure-deployment label will actually run full-deployment. Two TEMP edits to drive that: - OETF_TAGS hardcoded to `kind` (includes known-broken router-connectivity + task-runtime-environment tests, guaranteeing OETF stage failure → full-deployment exits non-zero → artifact uploaded via the always() upload step). - notify-slack-on-failure gate widened to also fire on pull_request failures (next to schedule + test_slack paths). Dropped the force_notify + oetf_tags_override scaffolding from previous commit — they couldn't help because of the env policy. This commit is TEMP. After observing the Slack message + real artifact deep-link in osmo-slack-test, revert both changes (restore the narrow tag set + drop the pull_request gate path).

…e-only Slack gate Test of the slack-notification path against a real failure is done (run 28136214403 → osmo-slack-test ts 1782346110.673389, button deep-links to artifact 7865554042). Restoring: OETF_TAGS: kind → api,websocket,logger,negative notify gate: drop the (pull_request && failure) clause — back to schedule-only + test_slack=true paths. Workflow is now in its final shipping shape: - PR runs (label-gated): trigger init-only + build-images + full-deployment, no Slack noise on failures. - workflow_dispatch test_slack=true: cheap stub failure that posts a [TEST] message to osmo-slack-test (sanity check only). - Daily 00:00 UTC schedule on main: full gate runs end-to-end, failures post to osmo-slack-test with real artifact deep-link.

…olding, name Azure clearly Three small cleanups before merge: 1. Drop the test_slack workflow_dispatch input + simulate-failure stub job + the corresponding branch of the notify-slack gate. Test path was useful while iterating on the Slack format but adds no recurring value: schedule-only delivery has been verified end-to-end with both simulated (test_slack=true on earlier commits) and genuine (forced OETF failure, ts 1782346110.673389 in osmo-slack-test) flows. Keep the file focused on its production responsibility. 2. Make the Azure scope explicit in user-facing surfaces: workflow `name:` "Deployment Test" → "Deployment Test (Azure)" slack job ID notify-slack-on-failure → notify-slack-on-azure-deployment-test-failure slack header "OSMO daily deployment-test FAILED" → "OSMO Azure deployment-test FAILED" Subsequent providers (AWS, GCP) will get their own workflows; the "(Azure)" qualifier prevents the user-visible run / Slack post from reading as cloud-agnostic. 3. Documentation: header block re-flowed to drop the now-stale test-path section. Keeping init-only and auth-check: - init-only (~30s, no Azure cost): caught the AVM-vnet 0.18.0 IPAM regression in <1 minute earlier in this PR. Worth keeping for every PR touching `deployments/terraform/azure/**` or the wrapper. - auth-check (workflow_dispatch only, ~2 min, $0 when not triggered): pure opt-in OIDC-chain smoke. Zero ongoing cost. Useful when diagnosing "did Azure App Reg / federated credential drift?" without running the 30-min full-deployment. Keep.

…allelise az delete Findings the 4-agent simplify pass converged on, applied: - Dropped `full-deployment` job's `if:` — same condition as `build-images`, and `needs: build-images` already propagates skip. Single source of truth for the trigger gate. - Dropped the second `if [[ -z TESTBOT_SLACK_CHANNEL ]]` guard in the Slack-post step. The channel is a hardcoded literal `osmo-slack-test`, so the branch was unreachable. - De-duped the apply/destroy `TF_VARS=( … )` arrays — they had drifted in the past (eg. `redis_location` had to be added in both blocks). Single "build TF var file" step writes `$RUNNER_TEMP/azure.tfvars`; apply and destroy both use `-var-file`. The capacity-exhaustion + node-sizing rationale lives in one comment block alongside that step. - Pre-apply cleanup: one `az resource list` call per iteration (was two — one for count, one for ids). Bounded refire loops now background `az resource delete` so ~20 sequential ARM calls become ~1 wall-clock second. Net: −14 lines, −89/+75 diff. No behavior change on the green path — already verified with `gh workflow run … init-only` after each chunk. Findings skipped (filed for follow-up where applicable): - Composite-action extractions (setup-bazel, free-disk, Slack post, chat.postMessage, GH API fetcher): multi-workflow refactor, out of scope. - Chart default `cpu: "1"` (defaults are too high for small clusters) and the various OETF broken-test root causes (api_checks `pool=` vs `pools=`, outputs.dataset schema drift, router-connectivity DNS): fixes belong upstream of the gate. - Wrapper-side stage-array/JUnit-builder refactor, port-forward harness, byo-kind/byo helper, REPO_ROOT detection: touch non-Azure paths in the wrapper, leave for a wrapper-focused pass. - Diagnostic-pod-loop parallelism, single `azure/login`, `bazel build` batching: risk vs. reward not favourable for cleanup; current shape is proven.

…CHANNEL Default stays `osmo-slack-test` while the daily gate proves itself, but the channel can now be redirected at the repo/org level via the `CI_SLACK_CHANNEL` variable — no workflow edit needed when (for example) ops decides to route this to #osmo-oncall. Also renamed the internal env var TESTBOT_SLACK_CHANNEL → SLACK_CHANNEL. The TESTBOT_ prefix was a hold-over from when this workflow borrowed testbot.yaml's nim-env-scoped bot token; since switching to the repo-level OSMO_SLACK_BOT_TOKEN it's been misleading. The org-level `vars.TESTBOT_SLACK_CHANNEL` is intentionally NOT used here — it points at #osmo-code-reviews (testbot's PR-review channel).

Per CodeRabbit. Previous form curl'd straight to /usr/local/bin, which only works because GHA runners happen to make that path writable to the runner user — fragile across runner-image changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

#1128 (Convert OETF dataset fixtures to task I/O) removed the `outputs: - dataset:` block from task-runtime-environment/spec.yaml, which was the schema reject we were skipping. Re-include the `task-env` tag — 8 tests now, was 7. router-connectivity remains excluded (Azure CoreDNS / cluster networking issue, not an OETF bug). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… — probe eastus2 capacity Both were temporary workarounds for an eastus2 Managed-Redis AllocationFailed window. The cross-region split adds ~60ms Redis RTT and doesn't reflect prod topology, so it's worth probing whether capacity has recovered before keeping it long-term. If this run still fails to allocate, we'll move the whole stack to a region with capacity (AZURE_REGION repo var) rather than reinstate the cross-region kludge — `redis_location` therefore goes away entirely. Falls back to the chart defaults: redis_sku_name=ComputeOptimized_X3 in the RG's region. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jiaenren requested a review from a team as a code owner June 5, 2026 01:01

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

jiaenren temporarily deployed to internal-ci June 12, 2026 21:49 — with GitHub Actions Inactive

jiaenren had a problem deploying to internal-ci June 12, 2026 21:54 — with GitHub Actions Error

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread .github/workflows/deployment-test.yaml Outdated

Comment thread .github/workflows/deployment-test.yaml Outdated

Comment thread .github/workflows/deployment-test.yaml Outdated

Comment thread .github/workflows/deployment-test.yaml

jiaenren had a problem deploying to internal-ci June 12, 2026 21:58 — with GitHub Actions Error

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

jiaenren temporarily deployed to internal-ci June 12, 2026 22:00 — with GitHub Actions Inactive

jiaenren temporarily deployed to internal-ci June 12, 2026 22:29 — with GitHub Actions Inactive

jiaenren temporarily deployed to internal-ci June 12, 2026 22:32 — with GitHub Actions Inactive

jiaenren added the ci:run-auth-check label Jun 12, 2026

jiaenren had a problem deploying to internal-ci June 12, 2026 22:35 — with GitHub Actions Failure

jiaenren temporarily deployed to internal-ci June 12, 2026 22:36 — with GitHub Actions Inactive

jiaenren temporarily deployed to internal-ci June 12, 2026 22:37 — with GitHub Actions Inactive

jiaenren added ci:azure-deployment Trigger deployment-test full-deployment on next PR push (~45 min ephemeral AKS) and removed ci:run-auth-check labels Jun 12, 2026

jiaenren had a problem deploying to internal-ci June 12, 2026 22:38 — with GitHub Actions Failure

jiaenren had a problem deploying to internal-ci June 12, 2026 22:39 — with GitHub Actions Failure

jiaenren temporarily deployed to internal-ci June 12, 2026 22:39 — with GitHub Actions Inactive

jiaenren had a problem deploying to internal-ci June 12, 2026 22:42 — with GitHub Actions Error

jiaenren temporarily deployed to internal-ci June 12, 2026 22:42 — with GitHub Actions Inactive

jiaenren had a problem deploying to internal-ci June 12, 2026 23:46 — with GitHub Actions Error

jiaenren temporarily deployed to internal-ci June 12, 2026 23:46 — with GitHub Actions Inactive

jiaenren had a problem deploying to internal-ci June 13, 2026 00:49 — with GitHub Actions Failure

jiaenren temporarily deployed to internal-ci June 13, 2026 00:49 — with GitHub Actions Inactive

jiaenren had a problem deploying to internal-ci June 13, 2026 00:53 — with GitHub Actions Failure

jiaenren temporarily deployed to internal-ci June 13, 2026 00:53 — with GitHub Actions Inactive

jiaenren had a problem deploying to internal-ci June 13, 2026 01:26 — with GitHub Actions Failure

jiaenren and others added 30 commits June 25, 2026 15:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Minimum deployment testing for Azure#1070

Minimum deployment testing for Azure#1070
jiaenren wants to merge 58 commits into
mainfrom
jiaenr/d4-wrapper-azure

jiaenren commented Jun 5, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jiaenren commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this ships

Workflow shape

Verification

Known TEMP scaffolding (drops once internal-ci provisions a long-running AKS)

Required repo configuration

Follow-ups (intentionally out of scope)

Checklist

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jiaenren commented Jun 5, 2026 •

edited

Loading

Known TEMP scaffolding (drops once `internal-ci` provisions a long-running AKS)

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

codecov Bot commented Jun 12, 2026 •

edited

Loading