feat: add promptfoo eval harness for agent quality scoring #371
Merged
msitarzewski merged 10 commits into msitarzewski:main on Apr 11, 2026
Conversation
Strip # prefix and emoji from headings before matching section names, preventing false positives from unrelated headings. Switch from deprecated glob.sync to named globSync export.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
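For reference, a minimal sketch of the normalization that commit describes (the function name, emoji regex, and glob pattern are illustrative assumptions, not the PR's actual code):

```ts
import { globSync } from "glob"; // named export; glob.sync is deprecated

// Strip the markdown "#" prefix and any emoji so a heading like
// "## 🎯 Success Metrics" matches the plain section name "Success Metrics".
function normalizeHeading(line: string): string {
  return line
    .replace(/^#+\s*/, "")                      // drop heading markers
    .replace(/\p{Extended_Pictographic}/gu, "") // drop emoji
    .trim();
}

// The agents/**/*.md pattern is an assumption about the repo layout.
const agentFiles = globSync("agents/**/*.md");
```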
siphomaribo approved these changes on Mar 31, 2026
msitarzewski added a commit that referenced this pull request on Apr 11, 2026
)" This reverts commit b456845.
4 tasks
msitarzewski (Owner)

Hey @jonesrussell — heads up that this PR has been reverted in #433. This is NOT a quality issue — your eval harness is excellent work.

The reason: per CONTRIBUTING.md, new tooling and CI infrastructure should go through a Discussion first to get community alignment before merging. We missed this during our triage session and are correcting it. A Discussion has been created for your proposal: #434

Once the community aligns on the approach, you're welcome to re-submit. The code is ready — we just need the process step. Sorry for the back-and-forth, and thank you for the contribution!
Summary
Adds a promptfoo-based evaluation harness in evals/ that measures specialist agent quality across 5 criteria using LLM-as-judge scoring. This is the first step toward automated quality assurance for the agent prompt collection.

- extract-metrics.ts script to parse agent success metrics from markdown files (sketched below)
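As a rough sketch of what such a parser might look like (the "Success Metrics" section name and the return shape are assumptions, not the PR's actual implementation):

```ts
import { readFileSync } from "node:fs";

// Collect the bulleted lines under a "Success Metrics" heading
// in an agent's markdown file.
export function extractMetrics(file: string): string[] {
  const lines = readFileSync(file, "utf8").split("\n");
  const metrics: string[] = [];
  let inSection = false;
  for (const line of lines) {
    if (/^#+\s/.test(line)) {
      // The real script also strips emoji here (see the commit above).
      const heading = line.replace(/^#+\s*/, "").trim();
      inSection = heading.startsWith("Success Metrics");
      continue;
    }
    if (inSection && line.trim().startsWith("- ")) {
      metrics.push(line.trim().slice(2));
    }
  }
  return metrics;
}
```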
Scoring Criteria

Each scored 1-5 by LLM-as-judge. Pass threshold: average >= 3.5.
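In code terms, the pass rule reduces to something like this (a sketch; only the 1-5 scale and 3.5 threshold come from the PR):

```ts
// Average the five 1-5 judge scores; the agent passes at a mean of 3.5 or higher.
function passes(scores: number[]): boolean {
  const avg = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return avg >= 3.5;
}

passes([4, 3, 4, 3, 4]); // true: average is 3.6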
How to run
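Presumably, based on the commands exercised in the test plan below (the install step is an assumption):

- `cd evals && npm install`
- `npx promptfoo eval` to run the LLM-as-judge suite
- `npx promptfoo view` to browse the results interactively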
Cost
~$0.05 per run at Haiku pricing (166K tokens). The full 184-agent suite would cost an estimated ~$1.50/run.
What's next
This is M1 of a 3-milestone plan:
Design
The eval harness is fully isolated in evals/ with its own package.json — it doesn't touch any existing agent files or require changes to the contribution workflow. It's opt-in tooling for measuring and improving prompt quality.

Test plan
- npx vitest run — 5/5 extract-metrics tests pass
- npx promptfoo eval — 5/6 tests pass, 1 meaningful failure
- npx promptfoo view — interactive results browser works
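As a sketch of what one of those vitest cases might look like (the fixture path and export name are hypothetical, not from the PR):

```ts
import { describe, expect, it } from "vitest";
import { extractMetrics } from "./extract-metrics"; // hypothetical export name

describe("extractMetrics", () => {
  it("collects bullets under the Success Metrics heading", () => {
    // fixtures/sample-agent.md is an invented fixture path.
    const metrics = extractMetrics("fixtures/sample-agent.md");
    expect(metrics.length).toBeGreaterThan(0);
  });
});
```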