Context
The eval framework (Waza) currently uses a mock executor, which cannot validate response content. The copilot-sdk executor does not work in CI because the built-in GITHUB_TOKEN in GitHub Actions is a server-to-server token, which the Copilot models API rejects.
Options
- Azure OpenAI endpoint with a service principal (store API key as repo secret)
- Self-hosted runner with user auth (copilot login)
- Wait for GitHub to add S2S support to the models API
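Whichever option is chosen, CI needs a way to decide which executor to run based on the credentials it actually has. The following is a minimal sketch of that decision, not Waza's implementation: the function name `select_executor`, the returned labels, and the `COPILOT_USER_AUTH` flag are hypothetical, and the `AZURE_OPENAI_*` variable names are assumed to be how the repo secret would be exposed to the job.

```python
import os
from typing import Mapping


def select_executor(env: Mapping[str, str]) -> str:
    """Pick an executor label from the credentials present in `env`.

    All names here are illustrative assumptions, not Waza configuration.
    """
    # Option 1: Azure OpenAI credentials injected from repo secrets.
    if env.get("AZURE_OPENAI_ENDPOINT") and env.get("AZURE_OPENAI_API_KEY"):
        return "azure-openai"
    # Option 2: a self-hosted runner where a user has already authenticated.
    # COPILOT_USER_AUTH is a hypothetical flag set by that runner's setup.
    if env.get("COPILOT_USER_AUTH") == "1":
        return "copilot-sdk"
    # No usable credentials: fall back to the mock executor.
    return "mock"


if __name__ == "__main__":
    print(select_executor(os.environ))
```

Falling back to the mock executor keeps forks and untrusted pull requests, which do not receive repo secrets, from failing outright.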
What this enables
- Real content assertions (output_contains, trigger_tests grader)
- A/B baseline testing (with vs without skill)
- Blocking CI on actual eval failures
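To make these three points concrete, here is a rough sketch of how a content assertion, an A/B comparison, and a CI gate could fit together once real model output is available. It does not reflect Waza's actual graders: `output_contains` here only borrows the name, case-insensitive matching is an assumption, and `pass_rate`, `compare`, `should_block_ci`, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


def output_contains(output: str, expected: Sequence[str]) -> bool:
    """True if every expected substring appears in the output.

    Case-insensitive matching is an assumption made for this sketch.
    """
    lowered = output.lower()
    return all(s.lower() in lowered for s in expected)


def pass_rate(outputs: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of outputs that satisfy the content assertion."""
    if not outputs:
        return 0.0
    return sum(output_contains(o, expected) for o in outputs) / len(outputs)


@dataclass
class ABResult:
    with_skill: float
    without_skill: float

    @property
    def delta(self) -> float:
        # Positive delta means the skill improved the pass rate.
        return self.with_skill - self.without_skill


def compare(with_outputs: Sequence[str],
            without_outputs: Sequence[str],
            expected: Sequence[str]) -> ABResult:
    """Score the same assertion against runs with and without the skill."""
    return ABResult(pass_rate(with_outputs, expected),
                    pass_rate(without_outputs, expected))


def should_block_ci(result: ABResult, min_pass_rate: float = 0.8) -> bool:
    """Hypothetical gate: fail if the skill run is weak or worse than baseline."""
    return result.with_skill < min_pass_rate or result.delta < 0
```

None of these checks are meaningful against a mock executor, since the mock does not produce content to assert on.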
Acceptance criteria