feat(evals): enable Azure OpenAI executor for CI content assertions#211
Kunall7890 wants to merge 2 commits into
Adds a new eval-ci.yaml workflow that runs Waza evals against a real Azure OpenAI endpoint, bypassing the GITHUB_TOKEN S2S rejection that blocks the GitHub Copilot models API. Updates eval.yaml with trials_per_task: 3 for non-determinism handling and a text grader for domain-term assertions. Adds three content-assertion tasks covering partition key guidance, SDK singleton pattern, and a CI-enforcement baseline.

Closes AzureCosmosDB#186

Signed-off-by: Kunal Jaiswal <140198382+Kunall7890@users.noreply.github.com>
⚠️ Not ready to approve
Multiple added “text” graders use unsupported Waza config keys (match_any/match_all), which will likely break or nullify the intended CI assertions.
Pull request overview
This PR adds a GitHub Actions workflow to run the cosmosdb-best-practices Waza eval suite against a real model via Azure OpenAI, and extends the eval suite with content-assertion tasks and a domain-terms text grader to enforce response quality in CI.
Changes:
- Add `.github/workflows/eval-ci.yaml` to run Waza evals in CI using `copilot-sdk` routed through Azure OpenAI.
- Update `evals/cosmosdb-best-practices/eval.yaml` to increase `trials_per_task` to 3 and add a domain-term text grader.
- Add three new eval tasks that assert specific Cosmos DB guidance content.
File summaries
| File | Description |
|---|---|
| `.github/workflows/eval-ci.yaml` | New CI workflow to run Waza evals with Azure OpenAI + upload results artifact. |
| `evals/cosmosdb-best-practices/eval.yaml` | Increase trials to reduce flakiness; add a global domain-term text grader. |
| `evals/cosmosdb-best-practices/tasks/sdk-singleton-content.yaml` | New task to assert singleton/reuse guidance for CosmosClient. |
| `evals/cosmosdb-best-practices/tasks/partition-key-content.yaml` | New task to assert actionable partition key guidance (incl. cardinality). |
| `evals/cosmosdb-best-practices/tasks/incorrect-guidance-baseline.yaml` | New baseline task intended to prove CI blocks incorrect guidance. |
Copilot's findings
- Files reviewed: 5/5 changed files
- Comments generated: 8
      config:
        match_any:
          - "singleton"
          - "single instance"
          - "reuse"
          - "module-level"
          - "global"
    - type: text

      config:
        match_any:
          - "overhead"
          - "connection pool"
          - "performance"
          - "per request"
          - "each request"

      config:
        match_all:
          - "partition key"
          - "tenantId"
    - type: text

      config:
        match_any:
          - "cardinality"
          - "high cardinality"
          - "evenly"
          - "distribution"
          - "hot partition"

      config:
        match_any:
          - "avoid"
          - "not recommended"
          - "performance"
          - "cost"
          - "RU"
          - "point read"
          - "when possible"
    - name: Install Waza
      run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash

    eval:
      name: Run evals with real model
      runs-on: ubuntu-latest
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Kunal jaiswal <140198382+Kunall7890@users.noreply.github.com>
@sajeetharan addressed all Copilot review comments: replaced unsupported match_any/match_all with regex_match/contains per Waza's text grader schema, pinned Waza installer to v0.37.0 for reproducibility, and added fork PR guard to prevent secret-missing failures. Ready for your review!
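
A minimal sketch of what the rewritten graders might look like. Only the key names `regex_match` and `contains` come from the comment above; the grader names, the nesting under `graders:`, the `(?i)` flag, and the assumption that `contains` requires every listed substring are illustrative guesses, not confirmed Waza schema.

```yaml
# Sketch only: not the actual diff from this PR.
graders:
  - type: text
    name: mentions_singleton          # hypothetical grader name
    config:
      # "any of these terms" expressed as one case-insensitive alternation
      regex_match:
        - "(?i)(singleton|single instance|reuse|module-level|global)"
  - type: text
    name: mentions_partition_key      # hypothetical grader name
    config:
      # "all of these terms", assuming every listed substring must appear
      contains:
        - "partition key"
        - "tenantId"
```

Folding an any-of list into a single alternation keeps the original intent (one matching term is enough), whereas listing the terms separately would likely require all of them.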
@Kunall7890 there are some ongoing changes to the structure being evaluated that could require this to be modified. You'll definitely get notice when it's time to make any changes, to avoid merge conflicts.
Closes #186
What this does
- `.github/workflows/eval-ci.yaml`: new CI job that runs Waza evals using the `copilot-sdk` executor routed through Azure OpenAI (bypasses the GITHUB_TOKEN S2S rejection issue entirely)
- `eval.yaml`: adds a `text` grader for domain-term assertions and sets `trials_per_task: 3` to handle non-determinism; mock executor preserved as default for local dev
- Three content-assertion tasks covering partition key guidance, the SDK singleton pattern, and a CI-enforcement baseline that proves CI actually blocks incorrect guidance
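
As a rough illustration of the `eval.yaml` change described above: `trials_per_task: 3` and the mock default come from this description, while the surrounding layout (`name`, `config`, `executor`, `graders`), the grader name, and the sample term are assumptions about the file, not a copy of it.

```yaml
# Hypothetical excerpt of evals/cosmosdb-best-practices/eval.yaml.
name: cosmosdb-best-practices
config:
  trials_per_task: 3     # from the PR: repeat each task to absorb model non-determinism
  executor: mock         # from the PR: mock stays the default locally; key name assumed
graders:
  - type: text
    name: domain_terms   # hypothetical grader name
    config:
      contains:          # key name assumed; sample term is illustrative
        - "partition key"
```

Keeping the mock executor as the file default means local runs need no credentials; the CI job would then be the only place that switches to the real model.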
Acceptance criteria met
Maintainer action required
Add 3 repo secrets before merging:
- `AZURE_OPENAI_ENDPOINT`
- `AZURE_OPENAI_API_KEY`
- `AZURE_OPENAI_MODEL`
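
A sketch of how the three secrets could be wired into the workflow job, with a guard that skips pull requests from forks (which do not receive repository secrets). The job name and runner come from the workflow in this PR; the trigger, the exact guard expression, the env-variable wiring, and the `waza run` invocation are assumptions and may differ from the actual file.

```yaml
# Hypothetical excerpt of .github/workflows/eval-ci.yaml.
name: eval-ci
on:
  pull_request:

jobs:
  eval:
    name: Run evals with real model
    runs-on: ubuntu-latest
    # Skip fork PRs: secrets are unavailable there, so the job would fail
    if: github.event.pull_request.head.repo.full_name == github.repository
    env:
      AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
      AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
      AZURE_OPENAI_MODEL: ${{ secrets.AZURE_OPENAI_MODEL }}
    steps:
      - uses: actions/checkout@v4
      # Installer step omitted here; see the "Install Waza" step in the PR
      - name: Run evals
        # Command shape is an assumption, not taken from the PR
        run: waza run evals/cosmosdb-best-practices/eval.yaml
```

Declaring the secrets at job level keeps them out of individual step definitions, and the `if:` guard makes a missing secret show up as a skipped job rather than a red check on external contributions.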