Skip to content

Add evaluating-llm-output skill#286

Open
vikast908 wants to merge 1 commit into
addyosmani:mainfrom
vikast908:add-evaluating-llm-output-skill
Open

Add evaluating-llm-output skill#286
vikast908 wants to merge 1 commit into
addyosmani:mainfrom
vikast908:add-evaluating-llm-output-skill

Conversation

@vikast908

Copy link
Copy Markdown

What

A new evaluating-llm-output skill on building evals so LLM and agent changes don't silently regress.

Why

The repo has test-driven-development, but TDD targets deterministic code. LLM output is free-form text that ordinary unit tests can't assert on, so a prompt tweak or a model bump can quietly degrade quality with nothing in CI to catch it. There's no skill covering the regression discipline for model output.

What's in it

  • A versioned golden set of representative inputs and expected outcomes
  • Grading with the cheapest valid method (assertions for structured output, LLM-as-judge with a rubric for open-ended text)
  • Asserting on agent behavior and tool use, not just final text
  • Negative and safety cases (must-refuse, prompt injection, malformed input)
  • CI gating on a pass threshold
  • Tracking pass rate, cost, and latency over time, plus handling non-determinism

Each step has code, and the skill closes with a Verification checklist.

Conventions followed

  • Standard anatomy: Overview, When to Use, Process, Common Rationalizations, Red Flags, Verification.
  • Frontmatter is name + description only; the description leads with what it does, then Use when triggers.
  • Single SKILL.md, no supporting files.
  • References test-driven-development (deterministic code) and ci-cd-and-automation (pipeline setup) rather than duplicating them.

Related: this pairs with my other PR adding reliable-agent-loops. Happy to adjust naming, scope, or split anything.

🤖 Generated with Claude Code

Adds the regression-testing discipline for LLM output that ordinary unit tests miss: a versioned golden set, cheapest-valid grading (assertions plus LLM-as-judge), behavior/tool-use assertions, negative and safety cases, CI gating on a pass threshold, and metric tracking over time, with a Verification checklist.

Follows the standard skill anatomy and references test-driven-development and ci-cd-and-automation instead of duplicating them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant