209 changes: 142 additions & 67 deletions src/scripts/testbot/README.md
@@ -1,78 +1,27 @@
# Testbot: AI-Powered Test Generation
# Testbot: AI-Powered Test Generation & Review Response

Testbot analyzes coverage gaps, generates tests using Claude Code, validates them, and opens PRs for human review. It also responds to inline review comments via `/testbot`.
Testbot is a GitHub Actions bot backed by Claude Code that:

## Architecture
1. **Generates tests** for low-coverage files on a weekly schedule (opens PRs for review)

⚠️ Potential issue | 🟡 Minor

Resolve schedule wording mismatch (“weekly” vs weekday cron).

Line 5 says generation runs on a weekly schedule, but Line 47 defines a weekday cron (0 6 * * 1-5), which is daily on weekdays. Please align wording to avoid confusion.

Also applies to: 47-47

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scripts/testbot/README.md` at line 5, The README currently says
"Generates tests for low-coverage files on a weekly schedule" but the cron shown
("0 6 * * 1-5") runs weekdays; update the wording to match the cron (e.g.,
change the phrase to "runs on weekdays at 06:00" or "weekday schedule
(Mon–Fri)") or, if you intend a true weekly run, replace the cron with a weekly
expression (e.g., a single weekday like "0 6 * * 1"). Edit the sentence
containing "Generates tests ... weekly" and/or the cron string "0 6 * * 1-5" so
both describe the same schedule.

2. **Responds to `/testbot` comments** on any PR labeled `ai-generated` (applies fixes, writes tests, addresses CodeRabbit feedback)

### Test Generation (`testbot.yaml`)
## Using testbot

```text
Codecov API → coverage_targets.py → Claude Code CLI → guardrails → create_pr.py
| ↑
└─────────┘ (agent retries on test failures)
```

| Stage | Component | Description |
|-------|-----------|-------------|
| **Coverage analysis** | `coverage_targets.py` | Fetches Codecov report, selects lowest-coverage files |
| **Test generation** | Claude Code CLI | Reads source, writes test files and BUILD entries, runs tests, iterates on failures |
| **Guardrails** | `guardrails.py` | Filters out any non-test file changes made by Claude |
| **PR creation** | `create_pr.py` | Creates branch, commits test files, pushes, opens PR with `ai-generated` label |

Claude Code is sandboxed: it can only read files, edit test files, and run test commands (`bazel test`, `pnpm test`). It cannot run `git`, `gh`, or modify source code. All git and GitHub operations are in deterministic harness scripts.

### Review Response (`testbot-respond.yaml`)

```text
/testbot comment → respond.py
├─ fetch all thread comments (GraphQL)
├─ filter: trigger phrase, author, dedup
├─ Claude Code CLI: read files, apply fix, run tests
├─ respond.py: git commit + push
├─ structured reply via --json-schema
└─ post inline reply to each thread
```

| Feature | Description |
|---------|-------------|
| **Trigger** | Comment starting with `/testbot` on any PR with the `ai-generated` label |
| **Thread context** | Full conversation history (all nested comments) passed to Claude |
| **Structured output** | `--json-schema` returns per-thread replies and commit message |
| **Safety** | Repo-member-only access, crash recovery, push retry |
| **Dedup** | Skips threads where the bot already replied and is awaiting human follow-up |

### Security Boundary

| | Claude Code | Harness scripts |
|---|---|---|
| Read source files | Yes | — |
| Write/edit test files | Yes | — |
| Run `bazel test` / `pnpm test` | Yes | — |
| Run `git` commands | **No** | `create_pr.py`, `respond.py` |
| Run `gh` commands | **No** | `create_pr.py`, `respond.py` |
| Filter non-test changes | — | `guardrails.py` |

## Triggering on GitHub

### Manual dispatch
### Generate workflow — manual dispatch

**Actions → Testbot → Run workflow**, or via CLI:

```bash
gh workflow run testbot.yaml --ref <branch> \
gh workflow run testbot.yaml --ref main \
-f max_targets=1 \
-f max_uncovered=300 \
-f max_turns=50 \
-f model=aws/anthropic/claude-opus-4-5
```

### Schedule

Runs automatically on weekdays at 6 AM UTC.

### Review response
### Respond workflow — /testbot comments

Add the `ai-generated` label to your PR, then start an inline review comment with `/testbot <instruction>`. The command must be the first text in the comment. Examples:
Add the `ai-generated` label to your PR, then post an **inline review comment** (on the "Files changed" tab) starting with `/testbot`. Examples:

```text
/testbot add unit tests for this file
@@ -81,21 +30,147 @@ Add the `ai-generated` label to your PR, then start an inline review comment wit
/testbot refactor this function to reduce duplication
```

The bot responds only to repo members (OWNER, MEMBER, COLLABORATOR). It will not respond to its own replies or comments from bots.
The command must be the **first text** in the comment. Only repo members (OWNER, MEMBER, COLLABORATOR) can trigger the bot. It won't respond to its own replies or to other bots.
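
The trigger rules above can be sketched as a small predicate. This is an illustration only, not the actual `respond.py` code: the `ReviewComment` type, the `should_trigger` name, and the choice to ignore leading whitespace and require a word boundary after `/testbot` are all assumptions.

```python
from dataclasses import dataclass

TRIGGER = "/testbot"
ALLOWED_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}


@dataclass
class ReviewComment:
    """Hypothetical minimal view of a review comment (not the real harness type)."""
    body: str
    author_login: str
    author_association: str
    author_is_bot: bool = False


def should_trigger(comment: ReviewComment, bot_login: str = "testbot[bot]") -> bool:
    """Return True when the comment looks like a valid /testbot trigger."""
    # Never react to the bot's own replies or to other bots.
    if comment.author_is_bot or comment.author_login == bot_login:
        return False
    # Only repo members may trigger the bot.
    if comment.author_association not in ALLOWED_ASSOCIATIONS:
        return False
    # The command must be the first text in the comment
    # (assumption: leading whitespace is ignored).
    text = comment.body.lstrip()
    if not text.startswith(TRIGGER):
        return False
    # Assumption: require a word boundary so "/testbotx" is not a command.
    rest = text[len(TRIGGER):]
    return rest == "" or rest[0].isspace()
```

Under these assumptions, `should_trigger(ReviewComment("/testbot add unit tests for this file", "alice", "MEMBER"))` is accepted, while a comment that mentions `/testbot` mid-sentence is ignored.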

**Example threads** showing the bot in action on PR #890:
- [Thread r3126197776](https://github.com/NVIDIA/OSMO/pull/890/changes/40b026ff5eb4cb99d697476a49dead9811a9131b#r3126197776)
- [Thread r3126743347](https://github.com/NVIDIA/OSMO/pull/890/changes/40b026ff5eb4cb99d697476a49dead9811a9131b#r3126743347)

### Reverting a testbot commit

If the bot's commit isn't what you wanted, revert it and retry:
If the bot's commit isn't what you wanted:

```bash
git pull && git revert HEAD --no-edit && git push
```

Then post a new `/testbot` comment with clearer instructions.

## System Architecture

```text
┌──────────────────────────────────┐
│ Claude Code CLI │
│ (sandboxed — Read/Edit/Write, │
│ bazel test, pnpm test, gh pr) │
└───────────────┬──────────────────┘
│ --allowedTools, --json-schema
┌──────────────────────┐ ┌─────────┴──────────┐ ┌──────────────────┐
│ GENERATE WORKFLOW │ │ HARNESS │ │ RESPOND WORKFLOW │
│ (testbot.yaml) │────▶│ Python scripts │◀────│ (testbot-respond │
│ Weekly cron │ │ (git, gh, auth, │ │ .yaml) │
│ or dispatch │ │ guardrails) │ │ /testbot comment │
└──────────┬───────────┘ └─────────┬──────────┘ └──────────┬───────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌───────────────┐ ┌──────────┐
│ Codecov │ │ git push │ │ GitHub │
│ API │ │ gh pr ... │ │ API │
└──────────┘ └───────────────┘ └──────────┘
```

The architecture separates **what the LLM can do** (read code, run tests) from **what the harness does** (git, GitHub API, auth, guardrails). The LLM is never trusted with write access to branches or the GitHub API.

## Workflow 1: Generate Tests (`testbot.yaml`)

```text
┌─────────────┐ ┌────────────────────────┐ ┌─────────────────────┐ ┌──────────────┐ ┌────────────┐
│ Trigger │ │ 1. coverage_targets.py │ │ 2. Claude Code CLI │ │ 3. guardrails│ │ 4. create_ │
│ weekday │─▶│ Fetch Codecov │─▶│ Read source │─▶│ .py │─▶│ pr.py │
│ 6 AM UTC │ │ Pick low-cov file │ │ Write tests+BUILD │ │ Keep tests │ │ Branch, │
│ or manual │ │ Emit target list │ │ Run bazel test │ │ only │ │ commit, │
└─────────────┘ └────────────────────────┘ │ Retry on fail │ │ Revert src │ │ open PR │
└─────────────────────┘ └──────────────┘ └────────────┘
```

**Trigger:** Cron `0 6 * * 1-5` (weekdays 6 AM UTC) or `workflow_dispatch`

**Output:** A new branch `testbot/YYYYMMDD-HHMM` and a PR titled `[testbot] Add tests for <source_file>` with the `ai-generated` label
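
As a rough illustration of the trigger and output conventions above, the sketch below derives the branch name and PR title and checks whether a timestamp falls on the `0 6 * * 1-5` schedule. The function names are hypothetical, and using UTC for the branch timestamp is an assumption; the real logic lives in `create_pr.py` and the workflow file.

```python
from datetime import datetime, timezone


def branch_name(now: datetime) -> str:
    """Format the branch as testbot/YYYYMMDD-HHMM (timestamp zone is an assumption)."""
    return now.strftime("testbot/%Y%m%d-%H%M")


def pr_title(source_file: str) -> str:
    """Build the PR title used for generated tests."""
    return f"[testbot] Add tests for {source_file}"


def is_scheduled_run(now: datetime) -> bool:
    """True at 06:00 UTC, Monday to Friday, mirroring cron `0 6 * * 1-5`."""
    utc = now.astimezone(timezone.utc)
    # datetime.weekday() returns 0 for Monday and 4 for Friday.
    return utc.weekday() < 5 and utc.hour == 6 and utc.minute == 0
```

Note that the cron expression describes a weekday schedule (Monday to Friday), not a once-a-week run.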

## Workflow 2: Respond to /testbot (`testbot-respond.yaml`)

```text
┌──────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ ┌────────────────┐
│ Trigger │ │ 1. auto-approve │ │ 2. respond.py │ │ 3. Claude Code │
│ pull_request_ │─▶│ (workflow_run) │─▶│ GraphQL: fetch │─▶│ CLI │
│ review_comment │ │ Check NVIDIA │ │ threads + filter │ │ Read, Edit, │
│ + ai-generated │ │ org membership │ │ for /testbot │ │ Write, bazel,│
│ label │ │ Approve env │ │ trigger + author │ │ gh pr view │
└──────────────────┘ └─────────────────────┘ │ assoc (MEMBER+) │ └────────┬───────┘
│ Build prompt │ │
└─────────────────────┘ │
▲ │
│ structured JSON │
│ {commit_message, │
│ replies[...]} │
│ │
┌─────────────────────┐ ┌──────────┴──────────┐ │
│ 5. Post inline │ │ 4. respond.py │ │
│ replies via │◀─│ get_changed_files │◀───────────┘
│ GitHub API │ │ commit_and_push │
│ (one per thread) │ │ (retry, detect │
│ │ │ GH013) │
└─────────────────────┘ └─────────────────────┘
```

**Trigger:** Inline review comment (on "Files changed" tab) that starts with `/testbot` on a PR labeled `ai-generated`

**Output:** A new commit by `testbot[bot]` pushed to the PR branch, plus an inline reply to each addressed comment

## Auto-approver (`testbot-respond-approve.yaml`)

The respond workflow runs under `environment: testbot-respond` (gated by a required reviewer) so it can access the `NVIDIA_NIM_KEY` secret. The auto-approver runs on `main` via `workflow_run` and:

1. Checks if the triggering actor is in the `NVIDIA/osmo-dev` team, OR a trusted bot (`svc-osmo-ci`, `github-actions[bot]`, `coderabbitai[bot]`)

⚠️ Potential issue | 🟡 Minor

Auto-approver authorization description is internally inconsistent.

Line 85 says authorization is team member or trusted bot, while Line 100 describes only team membership verification. Please make these statements consistent.

Also applies to: 100-100

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scripts/testbot/README.md` at line 85, The README contains inconsistent
descriptions of the Auto-approver authorization; update both occurrences so they
state the same rule: that the actor must be either a member of the
NVIDIA/osmo-dev team OR one of the trusted bots (svc-osmo-ci,
github-actions[bot], coderabbitai[bot]). Edit the Auto-approver authorization
paragraph(s) to use identical wording and include the explicit OR logic and the
trusted bot list so both descriptions match exactly.

2. If authorized, calls GitHub's `reviewPendingDeploymentsForRun` to approve the deployment

This lets `/testbot` comments from NVIDIA team members run automatically while blocking external contributors and bots that don't need to run the full pipeline.
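
The authorization rule in step 1 amounts to a simple OR check. The sketch below is a hedged illustration, not the workflow's actual script: `is_authorized` is a hypothetical name, and `team_members` stands in for whatever membership lookup the auto-approver performs against `NVIDIA/osmo-dev`.

```python
from typing import AbstractSet

# Trusted bot accounts listed in the README.
TRUSTED_BOTS = frozenset({"svc-osmo-ci", "github-actions[bot]", "coderabbitai[bot]"})


def is_authorized(actor: str, team_members: AbstractSet[str]) -> bool:
    """Authorize the actor if they are a team member OR a trusted bot."""
    return actor in team_members or actor in TRUSTED_BOTS
```

Only when this check passes would the approver go on to call `reviewPendingDeploymentsForRun`.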

## Guardrails

| Guardrail | Scope | Implementation |
|-----------|-------|----------------|
| **Tool allowlist** | Both workflows | `--allowedTools` flag restricts Claude Code to `Read,Edit,Write,Glob,Grep,bazel test,pnpm test/validate/format,gh pr view/diff/checks` — no `git`, no `gh api`, no arbitrary bash |
| **Test-file-only filter** | Generate only | `guardrails.py:get_changed_test_files()` reverts non-test file changes before commit |
| **Label gate** | Respond only | Workflow `if:` requires `ai-generated` label on the PR |
| **Fork rejection** | Respond only | Workflow `if:` requires `head.repo == base.repo` (no forks) |
| **Author association** | Respond only | `respond.py` requires the triggering comment author to be `OWNER`, `MEMBER`, or `COLLABORATOR` |
| **Required reviewer** | Respond only | `environment: testbot-respond` blocks unauthorized runs from accessing the API key secret |
| **Org membership check** | Respond only | Auto-approver verifies the actor is in `NVIDIA/osmo-dev` before approving |
| **Commit message sanitization** | Both | `sanitize_commit_message()` enforces `testbot:` prefix, strips git trailers, caps at 500 chars |
| **Push retry with GH013 detection** | Both | `commit_and_push()` retries up to 3x; fails fast on repository ruleset violations |
| **Partial work discard** | Respond only | On timeout or max-turns hit, `respond.py` discards file changes and posts an informative reply (doesn't push half-finished work) |
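
To make the commit message guardrail concrete, here is a minimal sketch of what a sanitizer with those three properties could look like. It is not the real `sanitize_commit_message()`: the trailer regex, the fallback subject, and the decision to strip trailers only from the end of the body are assumptions.

```python
import re

PREFIX = "testbot:"
MAX_LENGTH = 500
# Assumption: a trailer is a "Key: value" line such as "Co-authored-by: ...".
TRAILER_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*: \S.*$")


def sanitize_commit_message(raw: str) -> str:
    """Enforce the testbot: prefix, strip trailing git trailers, cap the length."""
    lines = [line.rstrip() for line in raw.strip().splitlines()]
    # Assumption: fall back to a generic subject when the model returns nothing.
    subject, body = (lines[0], lines[1:]) if lines else ("update tests", [])
    # Drop blank lines and trailer lines from the end of the body only,
    # so a subject like "testbot: fix X" is never mistaken for a trailer.
    while body and (not body[-1] or TRAILER_RE.match(body[-1])):
        body.pop()
    if not subject.startswith(PREFIX):
        subject = f"{PREFIX} {subject}"
    message = "\n".join([subject, *body])
    return message[:MAX_LENGTH]
```

Stripping trailers matters because lines such as `Co-authored-by:` or `Signed-off-by:` in model output could otherwise attribute the commit to someone who never reviewed it.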

## Harness responsibilities

Claude Code is intentionally given a **narrow capability surface**. Everything else lives in the Python harness:

| Responsibility | Component |
|----------------|-----------|
| Coverage analysis & target selection | `coverage_targets.py` |
| Prompt construction (shared rules + workflow-specific) | `respond.py`, `TESTBOT_PROMPT.md`, `TESTBOT_RESPOND_PROMPT.md`, `TESTBOT_RULES.md` |
| Git operations (branch, commit, push, retry, revert) | `create_pr.py`, `respond.py` |
| GitHub API (fetch threads, post replies, create PRs) | `create_pr.py`, `respond.py` |
| Guardrail enforcement (file-type filter) | `guardrails.py` |
| Structured output parsing (3-tier fallback) | `respond.py:_extract_replies()` |
| Timeout / max-turns handling | `respond.py:run_claude()` and `main()` |
| Auto-approval of environment deployments | `testbot-respond-approve.yaml` |
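
The push retry behavior listed under git operations can be sketched as follows. This is an assumption-laden illustration rather than the real `commit_and_push()`: the function and exception names, the linear backoff, and the injectable `run` and `sleep` parameters are hypothetical. The sketch relies only on `GH013` appearing in the push error output when a repository rule is violated.

```python
import subprocess
import time
from typing import Callable


class RulesetViolation(RuntimeError):
    """Raised when the remote rejects the push with a GH013 rule violation."""


def push_with_retry(
    run: Callable = subprocess.run,
    attempts: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Push the current branch, retrying transient failures up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        result = run(["git", "push"], capture_output=True, text=True)
        if result.returncode == 0:
            return
        # Ruleset violations will not succeed on retry, so fail fast.
        if "GH013" in result.stderr:
            raise RulesetViolation(result.stderr.strip())
        if attempt < attempts:
            sleep(delay_seconds * attempt)  # assumption: simple linear backoff
    raise RuntimeError(f"git push failed after {attempts} attempts")
```

Injecting `run` and `sleep` keeps the retry logic testable without touching a real remote.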

## Prompt files

Prompts are **file-based** (not inlined in Python) so they can be edited, diffed, and reviewed independently:

| File | Purpose |
|------|---------|
| `TESTBOT_RULES.md` | **Shared** test quality rules, bug-detection process, verification steps, language conventions. Referenced by both workflows. |
| `TESTBOT_PROMPT.md` | Generate-specific: coverage targets process, BUILD file handling, guardrails. References `TESTBOT_RULES.md`. |
| `TESTBOT_RESPOND_PROMPT.md` | Respond-specific: role framing, PR context guidance, output JSON schema example. References `TESTBOT_RULES.md`. |

## Configuration

### Test generation (dispatch inputs)
### Generate workflow (dispatch inputs)

| Input | Default | Description |
|-------|---------|-------------|
@@ -106,13 +181,13 @@ Then post a new `/testbot` comment with clearer instructions.
| `model` | `aws/anthropic/claude-opus-4-5` | LLM model on API gateway |
| `dry_run` | `false` | Generate without creating PR |

### Review response (CLI args in `testbot-respond.yaml`)
### Respond workflow (CLI args in `testbot-respond.yaml`)

| Arg | Default | Description |
|-----|---------|-------------|
| `--max-turns` | `50` | Claude Code agent turns |
| `--max-turns` | `75` | Claude Code agent turns |
| `--max-responses` | `10` | Max threads to address per trigger |
| `--timeout` | `720` | Claude Code CLI timeout in seconds |
| `--timeout` | `900` | Claude Code CLI timeout in seconds |
Comment on lines +188 to +190

⚠️ Potential issue | 🟠 Major

Fix documented respond defaults to match respond.py.

These defaults are currently incorrect in docs. respond.py uses --max-turns=50 and --timeout=720 (not 75 / 900), so operators will be misled.

Suggested doc fix
-| `--max-turns` | `75` | Claude Code agent turns |
+| `--max-turns` | `50` | Claude Code agent turns |
...
-| `--timeout` | `900` | Claude Code CLI timeout in seconds |
+| `--timeout` | `720` | Claude Code CLI timeout in seconds |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| `--max-turns` | `75` | Claude Code agent turns |
| `--max-responses` | `10` | Max threads to address per trigger |
| `--timeout` | `720` | Claude Code CLI timeout in seconds |
| `--timeout` | `900` | Claude Code CLI timeout in seconds |
| `--max-turns` | `50` | Claude Code agent turns |
| `--max-responses` | `10` | Max threads to address per trigger |
| `--timeout` | `720` | Claude Code CLI timeout in seconds |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scripts/testbot/README.md` around lines 184 - 186, Update the README
table values to match the actual defaults used by respond.py: change
`--max-turns` from `75` to `50` and `--timeout` from `900` to `720`; reference
the CLI flags in the docs and verify they match the defaults defined in the
respond.py entrypoint (flags `--max-turns` and `--timeout`) so operators see the
correct values.

| `--model` | `aws/anthropic/claude-opus-4-5` | LLM model |

### Coverage target selection (constants in `coverage_targets.py`)
@@ -143,5 +218,5 @@ src/scripts/testbot/
.github/workflows/
├── testbot.yaml # Scheduled test generation
├── testbot-respond.yaml # /testbot review response
└── testbot-respond-approve.yaml # Auto-approve for org members
└── testbot-respond-approve.yaml # Auto-approve for NVIDIA org members
```