After every successful production deployment, run the auditor automatically and post the results to your team. Catch the things traditional CI cannot:
- OAuth chain breakage
- 12-language mixin regressions
- Visual polish regressions vs. baseline
- Stripe webhook misconfigurations
- Admin tab empty-state bugs
- Email deliverability
A ready-to-copy template workflow is at docs/integration/post-deploy-audit.example.yml. Copy it into your app's .github/workflows/post-deploy-audit.yml and adjust the trigger / secrets for your environment. The template triggers:
- After a successful deploy workflow on
main(replaceYOUR-DEPLOY-WORKFLOWwith your actual workflow name) - Manually via "Run workflow"
Set these in your GitHub repo Settings → Secrets and variables → Actions:
| Secret | Purpose |
|---|---|
ANTHROPIC_API_KEY |
LLM access for Stagehand, critic, Computer Use |
TEST_GOOGLE_US, TEST_GOOGLE_US_PASSWORD |
Dedicated Google OAuth test account for 01-google-oauth-signup (US persona) |
TEST_GOOGLE_JP, TEST_GOOGLE_JP_PASSWORD |
Same for JP persona |
AUDIT_SLACK_WEBHOOK |
Optional, for completion notifications |
TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID |
Optional Telegram alerts |
If your scenarios depend on app-specific secrets (admin session cookies, Stripe test keys, etc.), add them to the secret list and reference them in your scenario YAMLs.
- Stripe Dashboard → toggle "Test mode" (top right)
- Developers → API keys → copy the Publishable key (starts with
pk_test_) - Add as repo secret
STRIPE_TEST_PUBLISHABLE_KEY - Never copy the Secret key (
sk_test_) — the auditor doesn't need it; it only fills the public checkout form
The auditor's startup check explicitly refuses any key starting with pk_live_.
- Create a brand-new Google account dedicated to audits (e.g.
audit-us-myapp@gmail.com) - Enable 2FA-free recovery email
- Pre-authorize your OAuth app once manually (otherwise consent screen blocks automation)
- Add to repo secrets
Risk: Google may eventually challenge automation. If you see CAPTCHA or "verify it's you" pages in audit reports, rotate to a new account or run the OAuth scenario less frequently (e.g. weekly instead of per-deploy).
The CLI emits CI-friendly exit codes:
| Code | Meaning |
|---|---|
| 0 | All scenarios pass |
| 1 | One or more scenarios failed (critical issues or critical step failures) |
| 2 | All scenarios passed but with warn-level issues |
| 3 | Quality-gate regression: overall score below --min-score (no hard failure) |
A hard scenario failure (1) takes precedence over a gate regression (3), which takes precedence over warnings (2). Code 3 lets CI distinguish a score regression from a functional failure.
GitHub Actions will mark the job as failed for non-zero codes. You may want to allow exit code 2 to pass (warning-only) in your workflow:
continue-on-error: true # only if you want warn-only runs to not block the pipelineAfter each run, the workflow uploads:
audit-report-<run_id>— fullreports/<run_id>/directory including JSON, HTML, screenshots, video, HAR, console logsvisual-baselines— only onmain, the snapshotted baseline directory for next-run diffing
To inspect a failure: download the artifact, open audit.html in a browser, click any screenshot to enlarge, watch the video for the failed unit.
For preview deployments, you can run a smaller subset on every PR. Add a job that runs only your smoke scenario with one persona, with a tighter budget:
- name: Quick audit (PR)
if: github.event_name == 'pull_request'
run: |
npx pixelcheck run \
--scenario 00-infra-smoke \
--persona us-english-free-mobile \
--budget 0.50 \
--tag pr-${{ github.event.pull_request.number }}Default budget_usd: 3.0 covers a full 27-unit run with several Computer Use escalations. Override with --budget per run.
To predict cost without running: each unit averages ~$0.08 (Sonnet 4.6 only) to ~$0.40 (with Opus 4.6 critical_review escalations). Multiply by matrix size.
Set SLACK_WEBHOOK or TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID env. The runner posts a summary at the end of every run:
[FAIL] AI Audit 2026-04-11_143022_post-deploy
Project: YourApp
Pass: 24 | Warn: 2 | Fail: 1
Critical issues: 1
Total cost: $1.342
Duration: 487.2s
Failure summaries include the minimum repro command for each failing unit so on-call can rerun locally.
| Symptom | Likely cause |
|---|---|
| All units fail with "Stagehand not installed" | npm install didn't run, or --with-deps flag missing on playwright install |
| OAuth scenarios fail with "captcha" | Google challenged the test account; rotate or rerun later |
| Localization scenario reports false English mixins | Brand names like your product name should be in the critic's exempt list — verify the prompt template |
| Budget exhausted before all units run | Increase --budget or use --scenario to filter |
| Same scenario flaky across runs | Increase retry on the unstable step, or add wait_for before it |
| Console errors flood the report | The auditor records ALL console errors; if your app is noisy, set per-step expected_console_noise (TODO) |