docs(debugging): add production incident response & blameless postmortems by HMAKT99 · Pull Request #312 · addyosmani/agent-skills

HMAKT99 · 2026-06-22T16:18:08Z

TL;DR

A single purely additive section added to skills/debugging-and-error-recovery/SKILL.md — Production Incident Response & Blameless Postmortems. No new directory, no README/manifest/count changes, revert-safe. node scripts/validate-skills.js → 0 errors, 0 warnings.

The gap this fills

debugging-and-error-recovery is a dev-time triage workflow (reproduce → localize → reduce → fix → guard). It explicitly lists "production incidents" in scope but has no workflow for one — and a live incident inverts the order: you stabilize first and diagnose second. Nothing in the pack covers declare/severity, stabilize-before-diagnose, incident roles, or the blameless postmortem that turns an incident into prevention.

What's added

Severity-first response (SEV1/2/3 → response level)
Stabilize before you diagnose — the production sibling of the skill's existing Stop-the-Line Rule (rollback / flip the flag / failover, then root-cause)
Coordination — incident driver + scribe + running log
Blameless postmortem — telemetry-based timeline (by correlation ID), contributing factors not a culprit, action items with owners
Close the loop — every incident yields a regression test (and an eval case for LLM-backed features) + a detection-gap fix
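
As an illustration of the first two bullets, here is a minimal sketch of a severity-to-response mapping in which stabilization always precedes diagnosis. The `Severity`, `ResponseLevel`, `RESPONSE_POLICY`, and `response_plan` names, and the policy values themselves, are assumptions made for this example; they are not taken from the SKILL.md text.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponseLevel:
    page_on_call: bool
    needs_driver_and_scribe: bool
    postmortem_required: bool


# Hypothetical policy table: real thresholds are a per-team decision.
RESPONSE_POLICY = {
    Severity.SEV1: ResponseLevel(True, True, True),
    Severity.SEV2: ResponseLevel(True, True, True),
    Severity.SEV3: ResponseLevel(False, False, False),
}

STABILIZE = "stabilize (rollback / flip the flag / failover)"
DIAGNOSE = "diagnose root cause"


def response_plan(severity: Severity) -> list[str]:
    """Return ordered response steps; stabilizing always comes before diagnosing."""
    level = RESPONSE_POLICY[severity]
    steps = [f"declare {severity.name}"]
    if level.page_on_call:
        steps.append("page on-call")
    if level.needs_driver_and_scribe:
        steps.append("assign incident driver and scribe")
    steps.append(STABILIZE)
    steps.append(DIAGNOSE)
    if level.postmortem_required:
        steps.append("write blameless postmortem")
    return steps


for step in response_plan(Severity.SEV1):
    print(step)
```

Keeping the policy in a table rather than in branching logic makes the severity definitions easy to review and change without touching the ordering rule.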

Why this isn't a duplicate (per CONTRIBUTING)

It references rather than re-teaches adjacent skills:

Rollback/feature-flag mechanics → shipping-and-launch
Regression/eval cases → test-driven-development
Timeline reconstruction relies on signals from observability-and-instrumentation
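
The timeline reconstruction mentioned above can be sketched roughly as follows: group structured log events by correlation ID, then order each group by timestamp. The event shape (`ts`, `correlation_id`, `msg`), the `build_timelines` helper, and the sample events are all invented for illustration and assume ISO 8601 timestamps; real telemetry will look different.

```python
from collections import defaultdict
from datetime import datetime


def build_timelines(events):
    """Group events by correlation ID and sort each group chronologically."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event["correlation_id"]].append(event)
    for group in timelines.values():
        # Parse timestamps so ordering is by time, not by string comparison.
        group.sort(key=lambda e: datetime.fromisoformat(e["ts"]))
    return dict(timelines)


# Made-up sample events, deliberately out of order.
events = [
    {"ts": "2026-01-01T10:02:00+00:00", "correlation_id": "req-1", "msg": "upstream timeout"},
    {"ts": "2026-01-01T10:00:00+00:00", "correlation_id": "req-1", "msg": "request received"},
    {"ts": "2026-01-01T10:01:00+00:00", "correlation_id": "req-2", "msg": "request received"},
]

for entry in build_timelines(events)["req-1"]:
    print(entry["ts"], entry["msg"])
```

Building the timeline from recorded events rather than from recollection is what keeps the postmortem focused on contributing factors instead of on individuals.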

It's a focused edit to an existing skill (the refinement form CONTRIBUTING prefers), not a new directory.

Validation

node scripts/validate-skills.js → PASSED (0 errors, 0 warnings); all required sections intact
Cross-references resolve to existing skills (no dead refs)
Diff touches only the one SKILL.md

…tems Adds an incident-response section to debugging-and-error-recovery covering declare/severity, stabilize-before-diagnose, coordination roles, blameless postmortems with telemetry-based timelines, and closing the loop into regression/eval cases. References shipping-and-launch, test-driven-development, and observability-and-instrumentation rather than duplicating them.

nucliweb · 2026-06-24T20:36:03Z

Reviewed this, and it's a solid, well-scoped addition. Sharing two notes: one confirming the sizing call, one optional.

Size / shape is right. The change is 67 lines added (~5K chars), purely additive, leaving the file at 367 lines, comfortably within the pack's range (security-and-hardening is 461, ci-cd-and-automation 390, test-driven-development 383). It's under the 100-line threshold that the contributing conventions use to decide when content should move into a supporting file, so keeping it inline is the correct call rather than spinning out a new file. node scripts/validate-skills.js passes clean here too (0 errors, 0 warnings), and all three cross-skill refs resolve.

Why inline beats a references/ file (in case it comes up). I considered whether this belongs in references/ instead, and I think inline is right:

The references/ files in this repo are quick-reference checklists and pattern catalogs. This is a workflow (DETECT -> DECLARE -> STABILIZE -> ... -> PREVENT), and workflows belong in skills.
The section leans on the host skill by name: the Stop-the-Line Rule, "the Triage Checklist above", and "the Guard-Against-Recurrence test from Step 5 above". That in-context coupling is a feature; extracting it would break those back-references.

One optional follow-up: close the discovery loop in the other direction. The section already links outward to shipping-and-launch, observability-and-instrumentation, and test-driven-development, but those skills don't point back. A couple of short pointers would make incident response reachable from where people actually are when they need it:

From observability-and-instrumentation: a note that the telemetry-based postmortem timeline is driven from this section.
From shipping-and-launch: from the rollback / feature-flag mechanics, point to "Stabilize before you diagnose".

Not blocking. Nice work: the stabilize-before-diagnose framing and the blameless-postmortem distinction are the genuinely valuable parts.

…ty & shipping Closes the discovery loop the other direction (per review on addyosmani#312): observability's correlation-ID guidance and shipping's rollback section now point to the Production Incident Response section in debugging-and-error-recovery. Two additive one-liners.

HMAKT99 · 2026-06-25T18:26:53Z

Thanks for the careful review — and for spelling out the inline-vs-references/ reasoning; that matches the intent exactly (it's a workflow with in-context back-references to the host skill, so extracting it would break them).

Good call on closing the discovery loop both ways. I've pushed the two back-pointers:

observability-and-instrumentation (correlation IDs → the postmortem timeline is reconstructed from this telemetry)
shipping-and-launch ("When to Roll Back" → "Stabilize before you diagnose")

Both are one-liners, fully additive; validate-skills.js still passes 0/0 and all cross-refs resolve. Appreciate the thoughtful feedback.

HMAKT99 wants to merge 2 commits into addyosmani:main from HMAKT99:feat/incident-response

2 participants
