docs(skills): add the eval→learn→improve feedback loop (TDD evals + observability step 8) by HMAKT99 · Pull Request #308 · addyosmani/agent-skills

HMAKT99 · 2026-06-21T14:01:25Z

TL;DR

Two small, purely additive edits to existing skills that together add the one thing the pack is missing for LLM/agent work: the eval → learn → improve feedback loop. No new skill directory, no README/manifest/count changes, fully revert-safe. `node scripts/validate-skills.js` → 0 errors, 0 warnings.

| File | What's added |
| --- | --- |
| `skills/test-driven-development/SKILL.md` | Evals: Testing Non-Deterministic (LLM-Backed) Behavior. Golden set + rubric-judge grading, trajectory assertions, and growing the suite from production incidents |
| `skills/observability-and-instrumentation/SKILL.md` | Step 8: Feedback loops - learn from production runs. Route failures to triage → distill recurring modes into eval cases + prompt/tool fixes → re-eval before shipping |
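
To make the first row concrete, here is a minimal sketch of golden-set grading with a rubric plus a trajectory assertion. It is not taken from the PR's SKILL.md text: `GoldenCase`, `AgentRun`, `grade_with_rubric`, `assert_trajectory`, and the stubbed `fake_agent` are hypothetical names, and the rubric is a deterministic keyword check standing in for an LLM judge.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One golden-set entry: an input plus the rubric the output must satisfy."""
    prompt: str
    must_mention: list[str]  # rubric criteria (stand-in for an LLM judge)
    expected_tools: list[str] = field(default_factory=list)  # required call order


@dataclass
class AgentRun:
    """What the system under test returns: final text plus the tool calls it made."""
    output: str
    tool_calls: list[str]


def grade_with_rubric(case: GoldenCase, run: AgentRun) -> float:
    """Score in [0, 1]: fraction of rubric criteria found in the output."""
    if not case.must_mention:
        return 1.0
    text = run.output.lower()
    hits = sum(1 for term in case.must_mention if term.lower() in text)
    return hits / len(case.must_mention)


def assert_trajectory(case: GoldenCase, run: AgentRun) -> bool:
    """True if the expected tools appear in the run's calls, in order (gaps allowed)."""
    remaining = iter(run.tool_calls)
    return all(tool in remaining for tool in case.expected_tools)


def fake_agent(prompt: str) -> AgentRun:
    """Hypothetical stub standing in for a real LLM-backed agent."""
    return AgentRun(
        output="Refund issued to the original card within 5 days.",
        tool_calls=["lookup_order", "issue_refund"],
    )


def run_suite(cases, agent, threshold=0.8):
    """Pass rate: a case passes only if rubric score and trajectory both hold."""
    passed = 0
    for case in cases:
        run = agent(case.prompt)
        if grade_with_rubric(case, run) >= threshold and assert_trajectory(case, run):
            passed += 1
    return passed / len(cases)


golden_set = [
    GoldenCase("Refund order 123", ["refund", "original card"], ["lookup_order", "issue_refund"]),
    GoldenCase("Refund order 456", ["refund", "store credit"], ["lookup_order", "issue_refund"]),
]
pass_rate = run_suite(golden_set, fake_agent)
```

Scoring against a threshold rather than exact string equality is what lets the suite tolerate non-deterministic wording while still failing on missing substance or a wrong tool sequence.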

The gap this fills

The pack's lifecycle runs forward: spec → plan → build → verify → review → ship. Once an LLM/agent feature is live, there's no documented operate-and-learn arc — no workflow for turning what production reveals back into durable improvements. That's the across-runs loop this PR adds, and it slots into the two skills that already own its two ends (verification and observability) rather than inventing a new scope.

Why this is not a duplicate (per CONTRIBUTING)

CONTRIBUTING asks contributors to justify the gap and prefer referencing over duplicating. This PR is built to that rule:

| Adjacent work | What it owns | What this PR does instead |
| --- | --- | --- |
| #286 `evaluating-llm-output` (proposed) | How to build an eval suite | Keeps eval mechanics lean and homes them in `test-driven-development`; references, doesn't re-teach |
| #285 `reliable-agent-loops` (proposed) | Single-run loop control (bounding, idempotent retries, resume) | Describes per-run reliability generically and leaves it to that domain |
| #253 `harness-engineering` (proposed) | Repo-local guardrails/governance for coding agents | This is about operating an LLM/agent product in production, framed in the observability skill |

The new content is only the connective tissue none of them own: capture → triage → distill → re-eval, across runs.
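
As a rough illustration of that capture → triage → distill → re-eval loop, assuming failures are captured as records with a labeled failure mode: the record shape, the `min_occurrences` cutoff, and every function name below are hypothetical, not something the PR specifies.

```python
from collections import Counter


def triage(failures):
    """Count captured production failures per labeled failure mode."""
    return Counter(f["mode"] for f in failures)


def distill(failures, min_occurrences=2):
    """Turn each recurring mode into one eval case, using its first captured input."""
    counts = triage(failures)
    cases = {}
    for f in failures:
        mode = f["mode"]
        if counts[mode] >= min_occurrences and mode not in cases:
            cases[mode] = {"name": f"regression:{mode}", "input": f["input"]}
    return list(cases.values())


def re_eval(eval_suite, candidate):
    """Ship gate: returns (ok, failing_case_names) for a candidate fix."""
    failing = [case["name"] for case in eval_suite if not candidate(case["input"])]
    return not failing, failing


# Capture: hypothetical failure records pulled from production traces.
captured = [
    {"mode": "empty_tool_result", "input": "search: zzzz"},
    {"mode": "timeout", "input": "summarize 400-page pdf"},
    {"mode": "empty_tool_result", "input": "search: qqqq"},
]

eval_suite = distill(captured)  # only the recurring mode becomes a case


def before_fix(text):
    """Stand-in for the current system: mishandles searches that return nothing."""
    return not text.startswith("search:")


def after_fix(text):
    """Stand-in for the system after a prompt/tool fix."""
    return True


ok_before, failing_before = re_eval(eval_suite, before_fix)
ok_after, failing_after = re_eval(eval_suite, after_fix)
```

The one-off `timeout` failure stays in triage rather than becoming a permanent eval case, which keeps the suite growing from recurring modes only.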

Why this form (extensions, not a new skill)

CONTRIBUTING: "If your idea is a refinement of an existing skill, prefer a focused edit to that skill over a new directory." These edits are exactly that: additive sections that strengthen two existing skills, matching their structure, tone, and table format. Each addition follows the standard anatomy: a process section, rationalization rows, red flags, and verification checkboxes.

Validation

- `node scripts/validate-skills.js` → PASSED (0 errors, 0 warnings) across all 24 skills, required sections intact
- Frontmatter valid; TDD description 332 chars (< 1024 limit)
- Cross-references resolve to existing skills only (no dead refs)
- Diff touches only the two SKILL.md files; no README, manifest, or count edits
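
For readers unfamiliar with these checks, the sketch below shows what a description-length and cross-reference check could look like. It is not the logic of `scripts/validate-skills.js`, which is not shown in this PR: `check_skill` is a hypothetical helper, and treating the 1024 limit as strict is an assumption.

```python
MAX_DESCRIPTION_CHARS = 1024  # limit quoted in the PR; strictness is assumed


def check_skill(name, description, cross_refs, known_skills):
    """Return a list of error strings for one skill (empty list means it passes)."""
    errors = []
    if len(description) >= MAX_DESCRIPTION_CHARS:
        errors.append(
            f"{name}: description is {len(description)} chars (limit {MAX_DESCRIPTION_CHARS})"
        )
    for ref in cross_refs:
        if ref not in known_skills:
            errors.append(f"{name}: dead cross-reference to '{ref}'")
    return errors


known = {"test-driven-development", "observability-and-instrumentation"}

ok_errors = check_skill(
    "test-driven-development",
    "x" * 332,  # same length as the TDD description reported above
    ["observability-and-instrumentation"],
    known,
)
bad_errors = check_skill(
    "observability-and-instrumentation",
    "y" * 2000,
    ["evaluating-llm-output"],  # proposed in #286, so not an existing skill
    known,
)
```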

Adds a focused Evals section to test-driven-development covering golden-set plus rubric-judge grading, trajectory assertions, and growing the suite from production incidents. Cross-links the feedback loop in observability. Complements (does not duplicate) the proposed evaluating-llm-output skill.

… runs Adds step 8 to observability-and-instrumentation: route failures to triage, distill recurring modes into eval cases plus prompt/tool fixes, re-eval before shipping. The operate-and-learn arc on top of per-run telemetry. References test-driven-development for eval mechanics; does not duplicate loop-control or eval-build skills.

HMAKT99 · 2026-06-21T14:04:41Z

Flagging this proactively since evals and agent-loops are a crowded area right now: this PR is intentionally not another skill in that cluster. It adds zero new directories and instead strengthens two skills that already exist.

The distinction in one line: the open proposals cover building an eval suite (#286) and making one agent run reliable (#285). Neither covers what happens across runs over time — feeding production failures back into eval cases and confirmed fixes. That across-runs loop is the only thing this PR adds; everything else is referenced, not re-explained.

I kept it deliberately easy to merge:

- Fully additive: no edits to existing prose, no README/manifest/count churn, trivially revert-safe.
- Passes CI: `node scripts/validate-skills.js` → 0 errors, 0 warnings; all required sections preserved; no dead cross-refs.
- Matches the house style: each section ships the standard process + rationalizations + red-flags + verification, in the same table format as the surrounding skill.

If you'd rather these land as a single section in one skill, route the eval mechanics elsewhere, or word the #286/#285 references differently, I'm happy to adjust — the content is structured so it can move without rework. Thanks for maintaining this pack.

HMAKT99 added 2 commits June 21, 2026 19:28

HMAKT99 wants to merge 2 commits into addyosmani:main from HMAKT99:feat/eval-feedback-loop
