feat: add promptfoo eval harness for agent quality scoring #371
Merged
msitarzewski merged 10 commits into msitarzewski:main on Apr 11, 2026
Conversation
Strip # prefix and emoji from headings before matching section names, preventing false positives from unrelated headings. Switch from deprecated glob.sync to named globSync export.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
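For reference, a minimal sketch of the normalization that commit describes (the function name, emoji regex, and glob pattern are illustrative assumptions, not the PR's actual code):

```ts
import { globSync } from "glob"; // named export; glob.sync is deprecated

// Strip the markdown "#" prefix and any emoji so a heading like
// "## 🎯 Success Metrics" matches the plain section name "Success Metrics".
function normalizeHeading(line: string): string {
  return line
    .replace(/^#+\s*/, "")                      // drop heading markers
    .replace(/\p{Extended_Pictographic}/gu, "") // drop emoji
    .trim();
}

// The agents/**/*.md pattern is an assumption about the repo layout.
const agentFiles = globSync("agents/**/*.md");
```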
siphomaribo approved these changes on Mar 31, 2026
msitarzewski added a commit that referenced this pull request on Apr 11, 2026
)" This reverts commit b456845.
4 tasks
msitarzewski (Owner)

Hey @jonesrussell — heads up that this PR has been reverted in #433. This is NOT a quality issue — your eval harness is excellent work.

The reason: per CONTRIBUTING.md, new tooling and CI infrastructure should go through a Discussion first to get community alignment before merging. We missed this during our triage session and are correcting it. A Discussion has been created for your proposal: #434

Once the community aligns on the approach, you're welcome to re-submit. The code is ready — we just need the process step. Sorry for the back-and-forth, and thank you for the contribution!
Summary
Adds a promptfoo-based evaluation harness in evals/ that measures specialist agent quality across 5 criteria using LLM-as-judge scoring. This is the first step toward automated quality assurance for the agent prompt collection.

- extract-metrics.ts script to parse agent success metrics from markdown files (sketched below)
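As a rough sketch of what such a parser might look like (the "Success Metrics" section name and the return shape are assumptions, not the PR's actual implementation):

```ts
import { readFileSync } from "node:fs";

// Collect the bulleted lines under a "Success Metrics" heading
// in an agent's markdown file.
export function extractMetrics(file: string): string[] {
  const lines = readFileSync(file, "utf8").split("\n");
  const metrics: string[] = [];
  let inSection = false;
  for (const line of lines) {
    if (/^#+\s/.test(line)) {
      // The real script also strips emoji here (see the commit above).
      const heading = line.replace(/^#+\s*/, "").trim();
      inSection = heading.startsWith("Success Metrics");
      continue;
    }
    if (inSection && line.trim().startsWith("- ")) {
      metrics.push(line.trim().slice(2));
    }
  }
  return metrics;
}
```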
Scoring Criteria

Each scored 1-5 by LLM-as-judge. Pass threshold: average >= 3.5.
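In code terms, the pass rule reduces to something like this (a sketch; only the 1-5 scale and 3.5 threshold come from the PR):

```ts
// Average the five 1-5 judge scores; the agent passes at a mean of 3.5 or higher.
function passes(scores: number[]): boolean {
  const avg = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return avg >= 3.5;
}

passes([4, 3, 4, 3, 4]); // true: average is 3.6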
How to run
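Presumably, based on the commands exercised in the test plan below (the install step is an assumption):

- `cd evals && npm install`
- `npx promptfoo eval` to run the LLM-as-judge suite
- `npx promptfoo view` to browse the results interactively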
Cost
~$0.05 per run at Haiku pricing (166K tokens). The full 184-agent suite would cost an estimated ~$1.50/run.
What's next
This is M1 of a 3-milestone plan:
Design
The eval harness is fully isolated in evals/ with its own package.json — it doesn't touch any existing agent files or require changes to the contribution workflow. It's opt-in tooling for measuring and improving prompt quality.

Test plan
- npx vitest run — 5/5 extract-metrics tests pass
- npx promptfoo eval — 5/6 tests pass, 1 meaningful failure
- npx promptfoo view — interactive results browser works
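As a sketch of what one of those vitest cases might look like (the fixture path and export name are hypothetical, not from the PR):

```ts
import { describe, expect, it } from "vitest";
import { extractMetrics } from "./extract-metrics"; // hypothetical export name

describe("extractMetrics", () => {
  it("collects bullets under the Success Metrics heading", () => {
    // fixtures/sample-agent.md is an invented fixture path.
    const metrics = extractMetrics("fixtures/sample-agent.md");
    expect(metrics.length).toBeGreaterThan(0);
  });
});
```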