Skip to content

feat(evals): enable Azure OpenAI executor for CI content assertions#211

Open
Kunall7890 wants to merge 2 commits into
AzureCosmosDB:mainfrom
Kunall7890:feat/azure-openai-ci-executor
Open

feat(evals): enable Azure OpenAI executor for CI content assertions#211
Kunall7890 wants to merge 2 commits into
AzureCosmosDB:mainfrom
Kunall7890:feat/azure-openai-ci-executor

Conversation

@Kunall7890

Copy link
Copy Markdown

Closes #186

What this does

  • Adds .github/workflows/eval-ci.yaml — new CI job that runs Waza evals
    using copilot-sdk executor routed through Azure OpenAI (bypasses the
    GITHUB_TOKEN S2S rejection issue entirely)
  • Updates eval.yaml — adds text grader for domain-term assertions +
    sets trials_per_task: 3 to handle non-determinism; mock executor
    preserved as default for local dev
  • Adds 3 content-assertion tasks covering partition key guidance, SDK
    singleton pattern, and a CI-enforcement baseline that proves CI actually
    blocks incorrect guidance

Acceptance criteria met

  • Real model called in CI via Azure OpenAI endpoint
  • Domain-specific term grader checks response content
  • CI fails when incorrect guidance is produced (baseline task)
  • trials_per_task: 3 handles non-deterministic responses
  • No secrets hardcoded — uses repo secrets

Maintainer action required

Add 3 repo secrets before merging:

  • AZURE_OPENAI_ENDPOINT
  • AZURE_OPENAI_API_KEY
  • AZURE_OPENAI_MODEL

Adds a new eval-ci.yaml workflow that runs Waza evals against a real
Azure OpenAI endpoint, bypassing the GITHUB_TOKEN S2S rejection that
blocks the GitHub Copilot models API. Updates eval.yaml with
trials_per_task: 3 for non-determinism handling and a text grader for
domain-term assertions. Adds three content-assertion tasks covering
partition key guidance, SDK singleton pattern, and a CI-enforcement
baseline.

Closes AzureCosmosDB#186

Signed-off-by: Kunal Jaiswal <140198382+Kunall7890@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 19, 2026 05:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

Multiple added “text” graders use unsupported Waza config keys (match_any/match_all), which will likely break or nullify the intended CI assertions.

Pull request overview

This PR adds a GitHub Actions workflow to run the cosmosdb-best-practices Waza eval suite against a real model via Azure OpenAI, and extends the eval suite with content-assertion tasks and a domain-terms text grader to enforce response quality in CI.

Changes:

  • Add .github/workflows/eval-ci.yaml to run Waza evals in CI using copilot-sdk routed through Azure OpenAI.
  • Update evals/cosmosdb-best-practices/eval.yaml to increase trials_per_task to 3 and add a domain-term text grader.
  • Add three new eval tasks that assert specific Cosmos DB guidance content.
File summaries
File Description
.github/workflows/eval-ci.yaml New CI workflow to run Waza evals with Azure OpenAI + upload results artifact.
evals/cosmosdb-best-practices/eval.yaml Increase trials to reduce flakiness; add a global domain-term text grader.
evals/cosmosdb-best-practices/tasks/sdk-singleton-content.yaml New task to assert singleton/reuse guidance for CosmosClient.
evals/cosmosdb-best-practices/tasks/partition-key-content.yaml New task to assert actionable partition key guidance (incl. cardinality).
evals/cosmosdb-best-practices/tasks/incorrect-guidance-baseline.yaml New baseline task intended to prove CI blocks incorrect guidance.

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 8

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.

Comment thread evals/cosmosdb-best-practices/eval.yaml Outdated
Comment on lines +26 to +33
config:
match_any:
- "singleton"
- "single instance"
- "reuse"
- "module-level"
- "global"
- type: text
Comment on lines +35 to +41
config:
match_any:
- "overhead"
- "connection pool"
- "performance"
- "per request"
- "each request"
Comment on lines +23 to +27
config:
match_all:
- "partition key"
- "tenantId"
- type: text
Comment on lines +29 to +35
config:
match_any:
- "cardinality"
- "high cardinality"
- "evenly"
- "distribution"
- "hot partition"
Comment on lines +23 to +31
config:
match_any:
- "avoid"
- "not recommended"
- "performance"
- "cost"
- "RU"
- "point read"
- "when possible"
Comment on lines +26 to +28
- name: Install Waza
run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash

Comment on lines +18 to +21
eval:
name: Run evals with real model
runs-on: ubuntu-latest

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Kunal jaiswal <140198382+Kunall7890@users.noreply.github.com>
@Kunall7890

Copy link
Copy Markdown
Author

@sajeetharan addressed all Copilot review comments — replaced unsupported match_any/match_all with regex_match/contains per Waza's text grader schema, pinned Waza installer to v0.37.0 for reproducibility, and added fork PR guard to prevent secret-missing failures. Ready for your review!

@jaydestro

Copy link
Copy Markdown
Contributor

@Kunall7890 there are some ongoing changes being evaluated to the structure that could require this to be modified. you'll definetely get notice when it's time to make any changes to avoid merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable real model executor for CI evals (Azure OpenAI)

3 participants