Enable real model executor for CI evals (Azure OpenAI)

## Context
The eval framework (Waza) currently uses a mock executor that cannot validate response content. The copilot-sdk executor won't work in CI because GitHub Actions' built-in GITHUB_TOKEN is a server-to-server token which the Copilot models API rejects.

## Options
1. **Azure OpenAI endpoint** with a service principal (store API key as repo secret)
2. **Self-hosted runner** with user auth (copilot login)
3. **Wait for GitHub** to add S2S support to the models API

## What this enables
- Real content assertions (output_contains, trigger_tests grader)
- A/B baseline testing (with vs without skill)
- Blocking CI on actual eval failures

## Acceptance criteria
- [ ] Eval CI job calls a real model and validates response content
- [ ] At least one grader checks domain-specific terms in responses
- [ ] CI fails when a rule produces incorrect guidance
- [ ] No false positives from non-deterministic responses (use trials_per_task > 1)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enable real model executor for CI evals (Azure OpenAI) #186

Context

Options

What this enables

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Enable real model executor for CI evals (Azure OpenAI) #186

Description

Context

Options

What this enables

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions