47 changes: 47 additions & 0 deletions .github/workflows/eval-ci.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
name: Eval CI (Azure OpenAI)

on:
  push:
    branches: [main]
    paths:
      - "skills/**"
      - "evals/**"
      - ".waza.yaml"
  pull_request:
    paths:
      - "skills/**"
      - "evals/**"
      - ".waza.yaml"
  workflow_dispatch:

jobs:
  eval:
    name: Run evals with real model
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install Waza
        run: curl -fsSL https://raw.githubusercontent.com/microsoft/waza/main/install.sh | bash

      - name: Run cosmosdb-best-practices evals
        env:
          COPILOT_BASE_URL: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          COPILOT_PROVIDER: azure
          COPILOT_WIRE_API: responses
          COPILOT_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
        run: |
          waza run evals/cosmosdb-best-practices/eval.yaml \
            --executor copilot-sdk \
            --model ${{ secrets.AZURE_OPENAI_MODEL }} \
            --verbose

      - name: Upload eval results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
          retention-days: 30
8 changes: 7 additions & 1 deletion evals/cosmosdb-best-practices/eval.yaml
@@ -6,7 +6,7 @@ description: |
 skill: cosmosdb-best-practices
 version: "1.0"
 config:
-  trials_per_task: 1
+  trials_per_task: 3
   timeout_seconds: 300
   parallel: false
   executor: mock
@@ -22,5 +22,11 @@ graders:
     config:
       max_tool_calls: 10
       max_duration_ms: 60000
+  - type: text
+    name: domain_terms
+    description: Response must contain at least one Cosmos DB domain-specific term
+    config:
+      regex_match:
+        - "(?i)(partition key|RU/s|request unit|CosmosClient|throughput|indexing|point read|change feed|TTL|consistency level)"
 tasks:
   - "tasks/*.yaml"
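The `domain_terms` pattern can be tried locally before relying on it in CI. The following is a minimal sketch, assuming the grader applies the pattern with search semantics comparable to Python's `re.search` (how Waza actually evaluates `regex_match` is not shown in this diff). The helper name `mentions_domain_term` is hypothetical.

```python
import re

# Pattern copied from the domain_terms grader; split across two literals
# only to keep the line short.
DOMAIN_TERMS = re.compile(
    r"(?i)(partition key|RU/s|request unit|CosmosClient|throughput|indexing"
    r"|point read|change feed|TTL|consistency level)"
)


def mentions_domain_term(response: str) -> bool:
    """Hypothetical helper: True if any listed term occurs anywhere in the text."""
    return DOMAIN_TERMS.search(response) is not None


if __name__ == "__main__":
    print(mentions_domain_term("Pick a Partition Key with high cardinality"))
    print(mentions_domain_term("Use a relational schema"))
    # No word boundaries in the pattern, so "TTL" also matches inside "little".
    print(mentions_domain_term("This is a little slower"))
```

Under that assumption, short terms such as `TTL` match inside unrelated words ("little", "throttle") because the pattern is case-insensitive and has no `\b` anchors. Whether that matters depends on how strict the grader is meant to be.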
@@ -0,0 +1,31 @@
id: incorrect-guidance-baseline-009
name: CI Failure Baseline - Incorrect Rule Should Fail
description: |
  This task verifies CI blocks on incorrect guidance.
  A skill that gives wrong advice (e.g. "use cross-partition queries always")
  must fail the content grader, proving CI is actually enforcing correctness.
  This task should PASS only when the skill gives correct guidance.
tags:
  - baseline
  - ci-enforcement
  - requires-real-executor
inputs:
  prompt: |
    Should I always use cross-partition queries in Cosmos DB for simplicity,
    even when I have the partition key available?
expected:
  outcomes:
    - type: task_completed
graders:
  - type: text
    name: discourages_unnecessary_cross_partition
    description: Skill must warn against unnecessary cross-partition queries
    config:
      match_any:
        - "avoid"
        - "not recommended"
        - "performance"
        - "cost"
        - "RU"
        - "point read"
        - "when possible"
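Because the task description says wrong advice must fail this grader, it may be worth checking the `match_any` list against a deliberately wrong answer. The sketch below assumes `match_any` means a case-insensitive substring test. That is an assumption about Waza's text grader, not something this diff confirms. The `match_any` function and both sample responses are hypothetical.

```python
def match_any(response: str, terms: list[str]) -> bool:
    """Hypothetical stand-in for a match_any text grader (substring, case-insensitive)."""
    lowered = response.lower()
    return any(term.lower() in lowered for term in terms)


# Terms copied from the discourages_unnecessary_cross_partition grader.
TERMS = ["avoid", "not recommended", "performance", "cost", "RU",
         "point read", "when possible"]

# One correct answer and one deliberately incorrect answer.
GOOD = "Avoid cross-partition queries when you know the partition key; use a point read."
BAD = "Yes, always use cross-partition queries; the performance cost is negligible."

if __name__ == "__main__":
    print("good passes:", match_any(GOOD, TERMS))
    print("bad passes:", match_any(BAD, TERMS))
```

Under these assumptions both responses pass, since neutral words such as "performance" and "cost" also appear in the incorrect answer. If the real grader behaves the same way, terms that only appear in a warning ("avoid", "not recommended") would separate the two cases more reliably.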
35 changes: 35 additions & 0 deletions evals/cosmosdb-best-practices/tasks/partition-key-content.yaml
@@ -0,0 +1,35 @@
id: partition-key-content-007
name: Partition Key Guidance - Content Assertion
description: |
  With a real model executor, assert that the skill response
  actually contains actionable partition key guidance including
  the term "partition key" and cardinality advice.
tags:
  - partition-key
  - content-assertion
  - requires-real-executor
inputs:
  prompt: |
    I'm building a multi-tenant SaaS app on Cosmos DB. Each tenant
    has thousands of users and millions of events. My container stores
    events with fields: eventId, tenantId, userId, timestamp, eventType.
    What should my partition key be?
expected:
  outcomes:
    - type: task_completed
graders:
  - type: text
    name: mentions_partition_key
    config:
      match_all:
        - "partition key"
        - "tenantId"
  - type: text
    name: mentions_cardinality
    config:
      match_any:
        - "cardinality"
        - "high cardinality"
        - "evenly"
        - "distribution"
        - "hot partition"
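The two graders in this task combine a `match_all` list with a `match_any` list. A minimal sketch of how they interact, assuming case-insensitive substring matching (an assumption, since the grader implementation is not part of this diff). The `passes` helper and the sample response are hypothetical.

```python
def passes(response: str, match_all=(), match_any=()) -> bool:
    """Hypothetical text grader: every match_all term and at least one match_any term."""
    text = response.lower()
    all_ok = all(term.lower() in text for term in match_all)
    any_ok = not match_any or any(term.lower() in text for term in match_any)
    return all_ok and any_ok


RESPONSE = (
    "Use tenantId as the partition key, possibly combined with userId, "
    "so that events spread evenly across partitions."
)

if __name__ == "__main__":
    print(passes(RESPONSE, match_all=["partition key", "tenantId"]))
    print(passes(RESPONSE, match_any=["cardinality", "high cardinality", "evenly",
                                      "distribution", "hot partition"]))
```

With substring matching, the `"high cardinality"` entry is redundant: any response containing it also contains `"cardinality"`, which is already in the list.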
41 changes: 41 additions & 0 deletions evals/cosmosdb-best-practices/tasks/sdk-singleton-content.yaml
@@ -0,0 +1,41 @@
id: sdk-singleton-content-008
name: SDK Singleton - Content Assertion
description: |
  Assert that the skill explicitly recommends the singleton CosmosClient
  pattern and warns against creating a new client per request.
tags:
  - sdk
  - singleton
  - content-assertion
  - requires-real-executor
inputs:
  prompt: |
    Review this Python code for Cosmos DB best practices:

    def get_user(user_id: str):
        client = CosmosClient(url=ENDPOINT, credential=KEY)
        db = client.get_database_client("mydb")
        container = db.get_container_client("users")
        return container.read_item(item=user_id, partition_key=user_id)
expected:
  outcomes:
    - type: task_completed
graders:
  - type: text
    name: recommends_singleton
    config:
      match_any:
        - "singleton"
        - "single instance"
        - "reuse"
        - "module-level"
        - "global"
  - type: text
    name: warns_about_overhead
    config:
      match_any:
        - "overhead"
        - "connection pool"
        - "performance"
        - "per request"
        - "each request"
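For reference, the kind of answer this task is looking for could resemble the sketch below, which builds the client once per process and reuses it. This is one possible shape of the singleton pattern, not the skill's actual output. The environment variable names `COSMOS_ENDPOINT` and `COSMOS_KEY` and the helper `get_container` are assumptions introduced for illustration, and running it against a real account requires the `azure-cosmos` package.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def get_container():
    """Create the CosmosClient once and reuse the container client afterwards."""
    # Imported lazily so this module can be loaded without azure-cosmos installed.
    from azure.cosmos import CosmosClient

    # COSMOS_ENDPOINT and COSMOS_KEY are assumed names, not taken from the PR.
    client = CosmosClient(
        url=os.environ["COSMOS_ENDPOINT"],
        credential=os.environ["COSMOS_KEY"],
    )
    return client.get_database_client("mydb").get_container_client("users")


def get_user(user_id: str):
    # Every call reuses the cached container instead of building a new client.
    return get_container().read_item(item=user_id, partition_key=user_id)
```

Caching with `lru_cache` keeps the construction lazy, so importing the module has no side effects. A module-level client would serve the same purpose.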