99 changes: 77 additions & 22 deletions .github/workflows/ci-failure-diagnosis.yml
@@ -7,15 +7,19 @@ on:
- completed

permissions:
contents: write
pull-requests: write
contents: read
pull-requests: read
actions: read

jobs:
gather-diff-and-generate-patch:
collect-artifacts:
if: ${{ github.event.workflow_run.conclusion == 'failure' }}
permissions:
contents: read
pull-requests: read
actions: read
runs-on: ubuntu-latest
name: Gather failure context and generate PR
name: Gather failure context and generate artifacts
steps:
- name: Checkout TensorZero config file
uses: actions/checkout@v5
@@ -54,33 +58,84 @@ jobs:
fetch-depth: 0
fetch-tags: false

- name: Debug
run: |
ls -lR /tmp/tensorzero-for-gateway
cat /tmp/tensorzero-for-gateway/tensorzero.toml
cat /tmp/tensorzero-for-gateway/prompt.minijinja

- name: Call LLM to generate PR revision
- name: Generate patch artifacts
uses: tensorzero/experimental-ci-bot/generate-pr-patch@main
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
token: ${{ secrets.GITHUB_TOKEN }}
tensorzero-base-url: http://localhost:3000
# TODO: Remove when agent creates PRs
tensorzero-diff-patched-successfully-metric-name: tensorzero_github_ci_bot_diff_patched_successfully
output-artifacts-dir: debug-logs
clickhouse-url: ${{ secrets.CI_BOT_CLICKHOUSE_URL }}
clickhouse-table: GitHubBotPullRequestToInferenceMap
output-artifacts-dir: ci-bot-artifacts
Comment on lines +61 to +66

P0: Pass GITHUB_TOKEN to generate-pr-patch action

The new implementation of generate-pr-patch now reads the GitHub token exclusively from process.env.GITHUB_TOKEN (src/generate-pr-patch/main.ts, getRequiredGitHubToken). In the workflow step that invokes the action, the previous token input and GH_TOKEN env were removed, but no GITHUB_TOKEN environment variable is supplied. Actions do not automatically expose secrets.GITHUB_TOKEN to the action process, so this step will crash with “GITHUB_TOKEN environment variable is required” before any artifacts are generated. Re‑introduce an env export, e.g. env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}, so the action can authenticate to GitHub APIs and the gh CLI.
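
For illustration, the token lookup described in this comment might look roughly like the sketch below. Only the function name `getRequiredGitHubToken` and the error message come from the comment. The signature, the explicit `env` parameter, and the `describeLookup` helper are assumptions, not the action's actual source.

```typescript
// Hypothetical sketch: the real getRequiredGitHubToken in
// src/generate-pr-patch/main.ts may differ. The env map is passed in
// explicitly (an assumption) so the function is easy to exercise;
// a caller inside the action would pass process.env.
type Env = Record<string, string | undefined>;

function getRequiredGitHubToken(env: Env): string {
  const token = env["GITHUB_TOKEN"]?.trim();
  if (!token) {
    // Same message as the failure quoted in the review comment.
    throw new Error("GITHUB_TOKEN environment variable is required");
  }
  return token;
}

// Hypothetical helper that reports the outcome of the lookup as a string.
// Without an `env:` mapping on the workflow step the variable is absent,
// so the lookup throws before any artifact is generated.
function describeLookup(env: Env): string {
  try {
    getRequiredGitHubToken(env);
    return "ok";
  } catch (err) {
    return err instanceof Error ? err.message : String(err);
  }
}
```

Under these assumptions, an empty environment yields the error message, and an environment containing `GITHUB_TOKEN` yields `"ok"`.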

- name: Upload diagnostics bundle
- name: Upload CI bot artifacts
if: always()
continue-on-error: true
uses: actions/upload-artifact@v4
with:
name: ci-failure-diagnostics
name: tensorzero-ci-bot-artifacts
path: |
debug-logs/
ci-bot-artifacts/

- name: Stop TensorZero gateway
if: always()
continue-on-error: true
run: docker stop tensorzero-gateway

apply-artifacts:
needs: collect-artifacts
if: >-
${{ github.event.workflow_run.conclusion == 'failure' &&
needs.collect-artifacts.result == 'success' }}
permissions:
contents: write
pull-requests: write
actions: read
runs-on: ubuntu-latest
name: Apply collected artifacts and update PR
steps:
- name: Download CI bot artifacts
uses: actions/download-artifact@v4
with:
name: tensorzero-ci-bot-artifacts
path: pr-artifacts

- name: Checkout TensorZero config file
uses: actions/checkout@v5
with:
repository: tensorzero/experimental-ci-bot
sparse-checkout: |
tensorzero

- name: Move tensorzero to tensorzero-for-gateway
run: |
mv ./tensorzero /tmp/tensorzero-for-gateway
ls -lR .

- name: Start TensorZero gateway
run: |
docker pull tensorzero/gateway:latest
docker run -d --rm \
--name tensorzero-gateway \
-e TENSORZERO_CLICKHOUSE_URL=${{ secrets.CI_BOT_CLICKHOUSE_URL }} \
-e OPENAI_API_KEY=${{ secrets.CI_BOT_OPENAI_API_KEY }} \
-p 3000:3000 \
--volume /tmp/tensorzero-for-gateway:/action-config \
tensorzero/gateway:latest --config-file /action-config/tensorzero.toml

for _i in {1..100}; do
curl -fsS http://localhost:3000/health && exit 0
sleep 3
done
echo "Gateway never became ready" >&2
exit 1

- name: Apply diagnostic artifacts
uses: tensorzero/experimental-ci-bot/apply-pr-artifacts@main
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
artifact-directory: pr-artifacts/ci-bot-artifacts
tensorzero-base-url: http://localhost:3000
tensorzero-diff-patched-successfully-metric-name: tensorzero_github_ci_bot_diff_patched_successfully
clickhouse-url: ${{ secrets.CI_BOT_CLICKHOUSE_URL }}
clickhouse-table: GitHubBotPullRequestToInferenceMap
Comment on lines +134 to +138
Member

Do we need all this here?

Member

I believe so, because we need to write to ClickHouse.

Member Author

The "proper" way to do this is to spin up a hosted TensorZero server with auth, keep an API key here, and allowlist it so we only accept the GitHub Actions runners' IP range or something similar. That's a lot of work, though.

Member Author

@shuyangli, Nov 7, 2025

still worried about repo secrets; let's only run this on trusted code (coming from main repo, not forks)
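
One way to express the "main repo, not forks" check is sketched below. It assumes the `workflow_run` event payload exposes `workflow_run.head_repository.full_name` and a top-level `repository.full_name`. The interface is trimmed to those fields, and `isTrustedRun` is a hypothetical helper name, not something in this PR.

```typescript
// Trimmed view of the `workflow_run` event payload: only the fields this
// sketch reads. Treating head_repository as nullable is a defensive assumption.
interface WorkflowRunEvent {
  repository: { full_name: string };
  workflow_run: {
    head_repository: { full_name: string } | null;
  };
}

// Hypothetical helper: returns true only when the failing run's head commit
// lives in the base repository itself, i.e. the run did not come from a fork.
function isTrustedRun(event: WorkflowRunEvent): boolean {
  const head = event.workflow_run.head_repository;
  return head !== null && head.full_name === event.repository.full_name;
}
```

The same comparison could probably live in the job-level `if:` condition instead, comparing `github.event.workflow_run.head_repository.full_name` with `github.repository`, which would keep the privileged job from starting at all for fork runs.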


- name: Stop TensorZero gateway
if: always()
23 changes: 23 additions & 0 deletions README.md
@@ -29,6 +29,29 @@
- Under "Settings > Actions > General", check the box for "Allow GitHub
Actions to create and approve pull requests".

## CI remediation workflow architecture

TensorZero CI remediation operates as a two-stage workflow to avoid running
untrusted pull request code with privileged credentials:

1. **collect-artifacts job (`.github/workflows/ci-failure-diagnosis.yml`)**:
triggered by the failing `Continuous Integration` run. Running with read-only
permissions, it launches the TensorZero gateway, gathers logs and diffs, calls
the LLM, and uploads a manifest and patch bundle as an artifact. No GitHub
write operations happen here.
1. **apply-artifacts job (`.github/workflows/ci-failure-diagnosis.yml`)**: runs
in the same workflow after the first job succeeds. It downloads the untrusted
bundle, validates the manifest, applies the diff using a privileged token,
posts comments, and records telemetry.

Treat artifacts produced by the collect job as untrusted input. The collect
job relies solely on the job-scoped `GITHUB_TOKEN`, and we never pass repository
secrets into that phase. The apply job declares its own `permissions` block, so
its token has write access while the collect job only receives read scopes.
Before mutating the repository, the apply job performs additional safety checks:
manifest schema validation, SHA matching, diff length limits, and head/base SHA
consistency.
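
The safety checks listed above could be sketched along the following lines. This is a minimal illustration, not the action's implementation: the manifest field names (`headSha`, `baseSha`, `diff`), the `ExpectedRun` shape, and the `MAX_DIFF_CHARS` limit are all hypothetical.

```typescript
// Hypothetical limit and SHA format, chosen only for illustration.
const MAX_DIFF_CHARS = 100_000;
const SHA_PATTERN = /^[0-9a-f]{40}$/;

// SHAs the privileged job already knows from the triggering run (assumed shape).
interface ExpectedRun {
  headSha: string;
  baseSha: string;
}

// Returns a list of problems; an empty list means the manifest passed.
function validateManifest(raw: unknown, expected: ExpectedRun): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["manifest is not an object"];
  }
  const manifest = raw as Record<string, unknown>;
  const problems: string[] = [];

  // Schema validation plus SHA matching for both ends of the PR.
  for (const key of ["headSha", "baseSha"] as const) {
    const value = manifest[key];
    if (typeof value !== "string" || !SHA_PATTERN.test(value)) {
      problems.push(`${key} is not a 40-character hex SHA`);
    } else if (value !== expected[key]) {
      problems.push(`${key} does not match the triggering run`);
    }
  }

  // Diff length limit: refuse oversized patches from the untrusted bundle.
  const diff = manifest["diff"];
  if (typeof diff !== "string") {
    problems.push("diff is not a string");
  } else if (diff.length > MAX_DIFF_CHARS) {
    problems.push("diff exceeds the length limit");
  }

  return problems;
}
```

Collecting every problem rather than throwing on the first one is a design choice made here so that a rejected bundle can be reported in full; the real action may behave differently.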

### Prepare ClickHouse database

We need to create a new table that maps GitHub pull requests to inferences:
33 changes: 33 additions & 0 deletions apply-pr-artifacts/action.yml
@@ -0,0 +1,33 @@
name: apply-pr-artifacts
description:
Consume TensorZero CI Bot artifacts, verify them, and apply patches and
comments with privileged GitHub access.
author: shuyangli

inputs:
artifact-directory:
description: Directory containing the downloaded artifact bundle.
required: true
manifest-path:
description:
Relative path to the manifest inside the artifact directory (defaults to
manifest.json).
required: false
tensorzero-base-url:
description: Base URL for the TensorZero instance.
required: true
tensorzero-diff-patched-successfully-metric-name:
description: Metric name for tracking diff patching success.
required: true
clickhouse-url:
description:
URL for ClickHouse HTTP interface, in the format of
http[s]://[username:password@]hostname:port[/database].
required: true
clickhouse-table:
description: Table where inference to PR associations should be recorded.
required: true

runs:
using: node24
main: ../dist/apply-pr-artifacts/index.js
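
As a rough illustration of the `clickhouse-url` format documented above (`http[s]://[username:password@]hostname:port[/database]`), the sketch below splits such a value into its parts with the standard `URL` class. The `ClickHouseTarget` shape and the `parseClickHouseUrl` name are hypothetical; the action may parse the input differently.

```typescript
// Hypothetical result shape for a parsed clickhouse-url input.
interface ClickHouseTarget {
  secure: boolean;
  hostname: string;
  port: string;
  username?: string;
  password?: string;
  database?: string;
}

function parseClickHouseUrl(raw: string): ClickHouseTarget {
  const url = new URL(raw);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error("clickhouse-url must use http or https");
  }
  // The optional database is the path without its leading slash.
  const database = url.pathname.replace(/^\//, "");
  return {
    secure: url.protocol === "https:",
    hostname: url.hostname,
    port: url.port,
    // URL keeps credentials percent-encoded, so decode them before use.
    username: url.username === "" ? undefined : decodeURIComponent(url.username),
    password: url.password === "" ? undefined : decodeURIComponent(url.password),
    database: database === "" ? undefined : database,
  };
}
```

Credentials embedded in the URL should stay out of logs, so a caller would likely log only `hostname`, `port`, and `database`.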