docs(skills): add the eval→learn→improve feedback loop (TDD evals + observability step 8) #308

Open
HMAKT99 wants to merge 2 commits into addyosmani:main from HMAKT99:feat/eval-feedback-loop

HMAKT99 commented Jun 21, 2026


TL;DR

Two small, purely additive edits to existing skills that together add the one thing the pack is missing for LLM/agent work: the eval → learn → improve feedback loop. No new skill directory, no README/manifest/count changes, fully revert-safe. node scripts/validate-skills.js → 0 errors, 0 warnings.

| File | What's added |
| --- | --- |
| skills/test-driven-development/SKILL.md | Evals: Testing Non-Deterministic (LLM-Backed) Behavior. Golden set + rubric-judge grading, trajectory assertions, and growing the suite from production incidents |
| skills/observability-and-instrumentation/SKILL.md | Step 8: Feedback loops (learn from production runs). Route failures to triage → distill recurring modes into eval cases + prompt/tool fixes → re-eval before shipping |
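For reviewers who want a concrete picture of what the Evals section covers, here is a minimal sketch of a golden-set check that combines rubric grading with a trajectory assertion. Everything in it is hypothetical and not taken from the SKILL.md text: the names `GoldenCase`, `RunResult`, `rubric_score`, `trajectory_ok`, and `run_evals` are invented for illustration, and the keyword-matching "judge" is a deterministic stand-in for a real rubric judge (which would typically be an LLM call).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    """What one agent run produced (hypothetical shape)."""
    output: str
    tool_calls: list[str] = field(default_factory=list)


@dataclass
class GoldenCase:
    """One golden-set entry: a prompt plus what a good run looks like."""
    prompt: str
    rubric: list[str]          # items the answer must cover
    expected_tools: list[str]  # tools that must appear, in this order


def rubric_score(output: str, rubric: list[str]) -> float:
    """Fraction of rubric items found in the output.

    Stand-in for a rubric judge: a real judge would grade meaning,
    not substrings.
    """
    if not rubric:
        return 1.0
    hits = sum(1 for item in rubric if item.lower() in output.lower())
    return hits / len(rubric)


def trajectory_ok(tool_calls: list[str], expected: list[str]) -> bool:
    """True if the expected tools occur in order (as a subsequence)."""
    remaining = iter(tool_calls)
    return all(tool in remaining for tool in expected)


def run_evals(
    agent: Callable[[str], RunResult],
    cases: list[GoldenCase],
    threshold: float = 0.8,  # assumed pass bar, tune per suite
) -> list[str]:
    """Return the prompts of all failing cases (empty list means pass)."""
    failures = []
    for case in cases:
        result = agent(case.prompt)
        graded_ok = rubric_score(result.output, case.rubric) >= threshold
        if not (graded_ok and trajectory_ok(result.tool_calls, case.expected_tools)):
            failures.append(case.prompt)
    return failures
```

Grading against a threshold rather than exact string equality is what lets the same case stay stable across non-deterministic runs; the trajectory check catches runs that reach a plausible answer through the wrong tool sequence.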

The gap this fills

The pack's lifecycle runs forward: spec → plan → build → verify → review → ship. Once an LLM/agent feature is live, there's no documented operate-and-learn arc — no workflow for turning what production reveals back into durable improvements. That's the across-runs loop this PR adds, and it slots into the two skills that already own its two ends (verification and observability) rather than inventing a new scope.

Why this is not a duplicate (per CONTRIBUTING)

CONTRIBUTING asks contributors to justify the gap and prefer referencing over duplicating. This PR is built to that rule:

| Adjacent work | What it owns | What this PR does instead |
| --- | --- | --- |
| #286 evaluating-llm-output (proposed) | How to build an eval suite | Keeps eval mechanics lean and homes them in test-driven-development; references, doesn't re-teach |
| #285 reliable-agent-loops (proposed) | Single-run loop control (bounding, idempotent retries, resume) | Describes per-run reliability generically and leaves it to that domain |
| #253 harness-engineering (proposed) | Repo-local guardrails/governance for coding agents | This is about operating an LLM/agent product in production, framed in the observability skill |

The new content is only the connective tissue none of them own: capture → triage → distill → re-eval, across runs.
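To make the capture → triage → distill → re-eval loop concrete, here is a rough sketch under stated assumptions. The types and functions (`FailedRun`, `distill`, `safe_to_ship`) are hypothetical names invented for this example, triage is assumed to have already attached a `failure_mode` label to each captured run, and "recurring" is assumed to mean "seen at least `min_count` times".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FailedRun:
    """A captured production failure after triage (hypothetical shape)."""
    run_id: str
    prompt: str
    failure_mode: str  # label assigned during triage


def distill(failures: list[FailedRun], min_count: int = 2) -> dict[str, str]:
    """Turn recurring failure modes into eval cases.

    Returns a mapping of failure mode -> one representative prompt,
    keeping only modes seen at least `min_count` times.
    """
    counts = Counter(f.failure_mode for f in failures)
    cases: dict[str, str] = {}
    for f in failures:
        if counts[f.failure_mode] >= min_count and f.failure_mode not in cases:
            cases[f.failure_mode] = f.prompt
    return cases


def safe_to_ship(eval_cases: dict[str, str], passes: Callable[[str], bool]) -> bool:
    """Re-eval gate: every distilled case must pass before shipping.

    Note: an empty suite passes trivially, so callers may want to
    require at least one case.
    """
    return all(passes(prompt) for prompt in eval_cases.values())
```

The point of the sketch is the direction of data flow: production failures become permanent eval cases, and a prompt or tool fix only ships once those cases pass again.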

Why this form (extensions, not a new skill)

CONTRIBUTING: "If your idea is a refinement of an existing skill, prefer a focused edit to that skill over a new directory." These changes are exactly that: additive sections that strengthen two existing skills, matching their structure, tone, and table format. Each addition follows the standard skill anatomy: a process section, rationalization rows, red flags, and verification checkboxes.

Validation

  • node scripts/validate-skills.js → PASSED (0 errors, 0 warnings): all 24 skills, required sections intact
  • Frontmatter valid; TDD description 332 chars (< 1024 limit)
  • Cross-references resolve to existing skills only (no dead refs)
  • Diff touches only the two SKILL.md files — no README, manifest, or count edits
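As an illustration of the kind of checks listed above, here is a small sketch. It is not the logic of scripts/validate-skills.js, which is not shown in this PR. The function `check_skill`, the `skills/<name>` cross-reference pattern, and the reading of the limit as "at most 1024 characters" are all assumptions made for the example.

```python
import re

MAX_DESCRIPTION_CHARS = 1024  # limit quoted in this PR; exact semantics assumed


def check_skill(description: str, body: str, known_skills: set[str]) -> list[str]:
    """Return a list of validation errors for one skill (empty means OK).

    Hypothetical checks: description length, and cross-references of the
    assumed form `skills/<name>` resolving to a known skill.
    """
    errors = []
    if len(description) > MAX_DESCRIPTION_CHARS:
        errors.append(
            f"description is {len(description)} chars (limit {MAX_DESCRIPTION_CHARS})"
        )
    for ref in re.findall(r"skills/([a-z0-9-]+)", body):
        if ref not in known_skills:
            errors.append(f"dead cross-reference: {ref}")
    return errors
```

For example, a 332-character description whose body only references `skills/test-driven-development` would return an empty list when that skill is in `known_skills`.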

HMAKT99 added 2 commits June 21, 2026 19:28
Adds a focused Evals section to test-driven-development covering golden-set
plus rubric-judge grading, trajectory assertions, and growing the suite from
production incidents. Cross-links the feedback loop in observability.
Complements (does not duplicate) the proposed evaluating-llm-output skill.
… runs

Adds step 8 to observability-and-instrumentation: route failures to triage,
distill recurring modes into eval cases plus prompt/tool fixes, re-eval before
shipping. The operate-and-learn arc on top of per-run telemetry. References
test-driven-development for eval mechanics; does not duplicate loop-control
or eval-build skills.

HMAKT99 commented Jun 21, 2026


Flagging this proactively since evals and agent-loops are a crowded area right now: this PR is intentionally not another skill in that cluster. It adds zero new directories and instead strengthens two skills that already exist.

The distinction in one line: the open proposals cover building an eval suite (#286) and making one agent run reliable (#285). Neither covers what happens across runs over time — feeding production failures back into eval cases and confirmed fixes. That across-runs loop is the only thing this PR adds; everything else is referenced, not re-explained.

I kept it deliberately easy to merge:

  • Fully additive — no edits to existing prose, no README/manifest/count churn, trivially revert-safe.
  • Passes CI: node scripts/validate-skills.js → 0 errors, 0 warnings; all required sections preserved; no dead cross-refs.
  • Matches the house style — each section ships the standard process + rationalizations + red-flags + verification, in the same table format as the surrounding skill.

If you'd rather these land as a single section in one skill, route the eval mechanics elsewhere, or word the #286/#285 references differently, I'm happy to adjust — the content is structured so it can move without rework. Thanks for maintaining this pack.
