Skip to content

Enable real model executor for CI evals (Azure OpenAI) #186

Description

@sajeetharan

Context

The eval framework (Waza) currently uses a mock executor that cannot validate response content. The copilot-sdk executor won't work in CI because GitHub Actions' built-in GITHUB_TOKEN is a server-to-server token which the Copilot models API rejects.

Options

  1. Azure OpenAI endpoint with a service principal (store API key as repo secret)
  2. Self-hosted runner with user auth (copilot login)
  3. Wait for GitHub to add S2S support to the models API

What this enables

  • Real content assertions (output_contains, trigger_tests grader)
  • A/B baseline testing (with vs without skill)
  • Blocking CI on actual eval failures

Acceptance criteria

  • Eval CI job calls a real model and validates response content
  • At least one grader checks domain-specific terms in responses
  • CI fails when a rule produces incorrect guidance
  • No false positives from non-deterministic responses (use trials_per_task > 1)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions