docs(skills): add the eval→learn→improve feedback loop (TDD evals + observability step 8) #308
HMAKT99 wants to merge 2 commits into
Adds a focused Evals section to test-driven-development covering golden-set plus rubric-judge grading, trajectory assertions, and growing the suite from production incidents. Cross-links the feedback loop in observability. Complements (does not duplicate) the proposed evaluating-llm-output skill.
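As a rough illustration of the grading described above, the sketch below combines a rubric score with a trajectory assertion for one golden-set case. It is not taken from the skill files: `EvalCase`, `RunResult`, `rubric_score`, `trajectory_ok`, and the 0.8 threshold are all hypothetical, and the keyword check only stands in for a real rubric judge (which would typically be a model call).

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One golden-set entry (hypothetical shape)."""
    name: str
    prompt: str
    must_mention: list[str]                                   # rubric criteria
    required_tools: list[str] = field(default_factory=list)   # expected tool order


@dataclass
class RunResult:
    output: str
    tool_calls: list[str]


def rubric_score(output: str, must_mention: list[str]) -> float:
    """Stand-in for a rubric judge: fraction of criteria found in the output."""
    if not must_mention:
        return 1.0
    hits = sum(1 for term in must_mention if term.lower() in output.lower())
    return hits / len(must_mention)


def trajectory_ok(tool_calls: list[str], required_tools: list[str]) -> bool:
    """True if required_tools occur in tool_calls in order (extra calls allowed)."""
    remaining = iter(tool_calls)
    return all(tool in remaining for tool in required_tools)


def grade(case: EvalCase, result: RunResult, threshold: float = 0.8) -> bool:
    """A case passes only if both the rubric and the trajectory check pass."""
    return (
        rubric_score(result.output, case.must_mention) >= threshold
        and trajectory_ok(result.tool_calls, case.required_tools)
    )


case = EvalCase(
    name="refund-policy",
    prompt="Can I get a refund?",
    must_mention=["30 days", "receipt"],
    required_tools=["search_policy", "draft_reply"],
)
good = RunResult(
    output="Refunds are accepted within 30 days with a receipt.",
    tool_calls=["search_policy", "lookup_order", "draft_reply"],
)
bad = RunResult(
    output="Refunds are accepted within 30 days.",
    tool_calls=["draft_reply", "search_policy"],
)
```

Here `grade(case, good)` passes, while `bad` fails on both counts: it meets only half the rubric and calls the tools out of order. Grading the trajectory separately from the final text is what lets a suite catch a right answer reached the wrong way.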
… runs Adds step 8 to observability-and-instrumentation: route failures to triage, distill recurring modes into eval cases plus prompt/tool fixes, re-eval before shipping. The operate-and-learn arc on top of per-run telemetry. References test-driven-development for eval mechanics; does not duplicate loop-control or eval-build skills.
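The distill step in this commit message could look something like the sketch below, which turns triaged production failures into new eval cases once a failure mode recurs. The `Failure` record, the `mode` labels, the `min_count` threshold, and the output dictionary shape are assumptions made for illustration; the skill text itself may structure this differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """A triaged production failure (hypothetical record shape)."""
    run_id: str
    mode: str     # triage label, e.g. "wrong_tool"
    prompt: str


def distill(failures: list[Failure], min_count: int = 2) -> list[dict]:
    """Emit one eval case per failure mode seen at least min_count times.

    The first failure observed for a recurring mode supplies the prompt,
    so the new case is traceable to a real production run.
    """
    counts = Counter(f.mode for f in failures)
    seen: set[str] = set()
    cases: list[dict] = []
    for f in failures:
        if counts[f.mode] >= min_count and f.mode not in seen:
            seen.add(f.mode)
            cases.append(
                {"name": f"incident-{f.mode}", "prompt": f.prompt, "source_run": f.run_id}
            )
    return cases


failures = [
    Failure("r1", "wrong_tool", "Cancel my order"),
    Failure("r2", "hallucinated_policy", "What is the warranty?"),
    Failure("r3", "wrong_tool", "Cancel order 1234"),
]
new_cases = distill(failures)
```

With this input only `wrong_tool` recurs, so `new_cases` holds a single case named `incident-wrong_tool` sourced from run `r1`. One-off failures stay in triage rather than inflating the suite.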
Flagging this proactively since evals and agent-loops are a crowded area right now: this PR is intentionally not another skill in that cluster. It adds zero new directories and instead strengthens two skills that already exist.

The distinction in one line: the open proposals cover building an eval suite (#286) and making one agent run reliable (#285). Neither covers what happens across runs over time: feeding production failures back into eval cases and confirmed fixes. That across-runs loop is the only thing this PR adds; everything else is referenced, not re-explained.

I kept it deliberately easy to merge:
If you'd rather these land as a single section in one skill, route the eval mechanics elsewhere, or word the #286/#285 references differently, I'm happy to adjust; the content is structured so it can move without rework. Thanks for maintaining this pack.
TL;DR
Two small, purely additive edits to existing skills that together add the one thing the pack is missing for LLM/agent work: the eval → learn → improve feedback loop. No new skill directory, no README/manifest/count changes, fully revert-safe.
node scripts/validate-skills.js → 0 errors, 0 warnings.
skills/test-driven-development/SKILL.md
skills/observability-and-instrumentation/SKILL.md
The gap this fills
The pack's lifecycle runs forward: spec → plan → build → verify → review → ship. Once an LLM/agent feature is live, there's no documented operate-and-learn arc — no workflow for turning what production reveals back into durable improvements. That's the across-runs loop this PR adds, and it slots into the two skills that already own its two ends (verification and observability) rather than inventing a new scope.
Why this is not a duplicate (per CONTRIBUTING)
CONTRIBUTING asks contributors to justify the gap and prefer referencing over duplicating. This PR is built to that rule:
evaluating-llm-output (proposed): test-driven-development; references, doesn't re-teach
reliable-agent-loops (proposed)
harness-engineering (proposed)

The new content is only the connective tissue none of them own: capture → triage → distill → re-eval, across runs.
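The final re-eval step of that loop can be read as a ship gate. The sketch below is one possible reading, assuming eval results are kept as a simple name-to-pass mapping; `ready_to_ship` and its blocker messages are hypothetical and do not come from either skill.

```python
def ready_to_ship(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> tuple[bool, list[str]]:
    """Compare a candidate eval run against the last shipped baseline.

    Blocks on three conditions: a case that passed at baseline now fails,
    a newly added case fails, or a baseline case was not re-run at all.
    A case that failed at baseline and still fails is treated as a known
    issue and does not block (an assumption of this sketch).
    """
    blockers: list[str] = []
    for name, passed in candidate.items():
        if passed:
            continue
        if name not in baseline:
            blockers.append(f"new case failing: {name}")
        elif baseline[name]:
            blockers.append(f"regression: {name}")
    for name in baseline:
        if name not in candidate:
            blockers.append(f"not re-run: {name}")
    return (not blockers, blockers)


baseline = {"refund-policy": True, "tone": False}
fixed = {"refund-policy": True, "tone": False, "incident-wrong_tool": True}
broken = {"refund-policy": False, "tone": True, "incident-wrong_tool": False}

ok, _ = ready_to_ship(baseline, fixed)
blocked, reasons = ready_to_ship(baseline, broken)
```

In this example `fixed` clears the gate because the incident-derived case passes and nothing regressed. `broken` is blocked twice: once for the regression on `refund-policy` and once for the failing new case.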
Why this form (extensions, not a new skill)
CONTRIBUTING: "If your idea is a refinement of an existing skill, prefer a focused edit to that skill over a new directory." These are exactly that — additive sections that strengthen two existing skills, matching their structure, tone, and table format. Each follows the standard anatomy additions: a process section, rationalization rows, red flags, and verification checkboxes.
Validation
node scripts/validate-skills.js → PASSED (0 errors, 0 warnings); all 24 skills, required sections intact
SKILL.md files: no README, manifest, or count edits