Add evaluating-llm-output skill by vikast908 · Pull Request #286 · addyosmani/agent-skills

vikast908 · 2026-06-14T08:38:00Z

What

A new evaluating-llm-output skill on building evals so LLM and agent changes don't silently regress.

Why

The repo has test-driven-development, but TDD targets deterministic code. LLM output is free-form text that ordinary unit tests can't assert on, so a prompt tweak or a model bump can quietly degrade quality with nothing in CI to catch it. There's no skill covering the regression discipline for model output.

What's in it

A versioned golden set of representative inputs and expected outcomes
Grading with the cheapest valid method (assertions for structured output, LLM-as-judge with a rubric for open-ended text)
Asserting on agent behavior and tool use, not just final text
Negative and safety cases (must-refuse, prompt injection, malformed input)
CI gating on a pass threshold
Tracking pass rate, cost, and latency over time, plus handling non-determinism
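
As a rough illustration of the golden set, cheapest-valid grading, and CI gating items above, here is a minimal Python sketch. It is not taken from the skill itself: `GoldenCase`, `grade_structured`, `pass_rate`, `gate`, the version string, and the 0.9 threshold are hypothetical, and the model is stubbed with a plain function so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    """One entry in a hypothetical versioned golden set."""
    case_id: str
    prompt: str
    expected: dict  # key/value pairs the model's JSON output must contain


GOLDEN_SET_VERSION = "v1"  # assumption: bumped whenever cases are added or changed
GOLDEN_SET = [
    GoldenCase("sentiment-pos", "I love this product", {"label": "positive"}),
    GoldenCase("sentiment-neg", "This broke after a day", {"label": "negative"}),
]


def grade_structured(raw_output: str, expected: dict) -> bool:
    """Cheapest valid grader for structured output: parse JSON, assert on fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed output counts as a failure, not a crash
    if not isinstance(parsed, dict):
        return False
    return all(parsed.get(key) == value for key, value in expected.items())


def pass_rate(model: Callable[[str], str], cases: list) -> float:
    """Fraction of golden cases whose output passes the grader."""
    passed = sum(grade_structured(model(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def gate(rate: float, threshold: float = 0.9) -> int:
    """Map a pass rate to a process exit code: 0 passes CI, 1 fails it."""
    return 0 if rate >= threshold else 1


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption for this sketch)."""
    label = "positive" if "love" in prompt else "negative"
    return json.dumps({"label": label})
```

In a pipeline, something like `sys.exit(gate(pass_rate(real_model, GOLDEN_SET)))` would fail the job when quality drops below the threshold. Open-ended text would need an LLM-as-judge grader with a rubric in place of `grade_structured`.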

Each step has code, and the skill closes with a Verification checklist.
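
To make "asserting on agent behavior and tool use" and "handling non-determinism" more concrete, a small Python sketch follows. It is an illustration under stated assumptions, not the skill's code: the trace format (a list of `ToolCall` records), the helper names `check_tool_use` and `pass_fraction`, and the tool names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded tool invocation in a hypothetical agent trace."""
    name: str
    args: dict = field(default_factory=dict)


def check_tool_use(trace, required, forbidden=()):
    """True if required tools were called in order and no forbidden tool was used."""
    names = [call.name for call in trace]
    if any(name in forbidden for name in names):
        return False
    remaining = iter(names)
    # Subsequence check: each required tool must appear after the previous one.
    return all(tool in remaining for tool in required)


def pass_fraction(run_agent, trials, required, forbidden=()):
    """Run the agent several times and report the fraction of passing traces.

    Repeating the case is one simple way to handle non-deterministic output:
    the eval asserts on a rate instead of on a single sample.
    """
    passed = sum(
        check_tool_use(run_agent(), required, forbidden) for _ in range(trials)
    )
    return passed / trials
```

For example, a refund scenario might require `search_orders` before `issue_refund` and forbid `delete_account`; a negative or prompt-injection case would then pass only if the forbidden tool never appears in the trace.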

Conventions followed

Standard anatomy: Overview, When to Use, Process, Common Rationalizations, Red Flags, Verification.
Frontmatter is name + description only; the description leads with what the skill does, followed by "Use when" triggers.
Single SKILL.md, no supporting files.
References test-driven-development (deterministic code) and ci-cd-and-automation (pipeline setup) rather than duplicating them.

Related: this pairs with my other PR adding reliable-agent-loops. Happy to adjust naming, scope, or split anything.

🤖 Generated with Claude Code

Adds the regression-testing discipline for LLM output that ordinary unit tests miss: a versioned golden set, cheapest-valid grading (assertions plus LLM-as-judge), behavior/tool-use assertions, negative and safety cases, CI gating on a pass threshold, and metric tracking over time, with a Verification checklist. Follows the standard skill anatomy and references test-driven-development and ci-cd-and-automation instead of duplicating them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vikast908 mentioned this pull request Jun 14, 2026

Add llm-cost-optimization skill #287

Open

nucliweb mentioned this pull request Jun 20, 2026

feat: add new skills, agent personas, and LLM integration checklist #254

Open

HMAKT99 mentioned this pull request Jun 21, 2026

docs(skills): add the eval→learn→improve feedback loop (TDD evals + observability step 8) #308

Open

vikast908 wants to merge 1 commit into addyosmani:main from vikast908:add-evaluating-llm-output-skill