
docs(debugging): add production incident response & blameless postmortems #312

Open
HMAKT99 wants to merge 2 commits into addyosmani:main from HMAKT99:feat/incident-response

HMAKT99 commented Jun 22, 2026

TL;DR

A single, purely additive section added to skills/debugging-and-error-recovery/SKILL.md: Production Incident Response & Blameless Postmortems. No new directory, no README/manifest/count changes, revert-safe. node scripts/validate-skills.js → 0 errors, 0 warnings.

The gap this fills

debugging-and-error-recovery is a dev-time triage workflow (reproduce → localize → reduce → fix → guard). It explicitly lists "production incidents" in scope but has no workflow for one — and a live incident inverts the order: you stabilize first and diagnose second. Nothing in the pack covers declare/severity, stabilize-before-diagnose, incident roles, or the blameless postmortem that turns an incident into prevention.

What's added

  • Severity-first response (SEV1/2/3 → response level)
  • Stabilize before you diagnose — the production sibling of the skill's existing Stop-the-Line Rule (rollback / flip the flag / failover, then root-cause)
  • Coordination — incident driver + scribe + running log
  • Blameless postmortem — telemetry-based timeline (by correlation ID), contributing factors not a culprit, action items with owners
  • Close the loop — every incident yields a regression test (and an eval case for LLM-backed features) + a detection-gap fix
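To make the first three bullets concrete, here is a minimal sketch of the ordering they describe: declare a severity, stabilize, and only then diagnose, with every step written to the scribe's running log. The `Severity`, `RESPONSE_LEVEL`, `Incident`, and `respond` names and the response-level wording are illustrative assumptions, not text from the SKILL.md section.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Hypothetical response levels for illustration only; the actual
# definitions are whatever the SKILL.md section specifies.
RESPONSE_LEVEL = {
    Severity.SEV1: "page on-call now, assign driver and scribe",
    Severity.SEV2: "on-call responds, assign driver, keep a running log",
    Severity.SEV3: "handle during working hours, track as a ticket",
}


@dataclass
class Incident:
    title: str
    severity: Severity
    log: List[str] = field(default_factory=list)  # the scribe's running log

    def note(self, entry: str) -> None:
        self.log.append(entry)


def respond(
    incident: Incident,
    stabilize: Callable[[], str],
    diagnose: Callable[[], str],
) -> List[str]:
    """Run the response in the order the section prescribes."""
    # 1. Declare: severity decides the response level.
    level = RESPONSE_LEVEL[incident.severity]
    incident.note(f"DECLARE {incident.severity.name}: {level}")
    # 2. Stabilize first (rollback / flip the flag / failover).
    incident.note(f"STABILIZE: {stabilize()}")
    # 3. Diagnose only after the bleeding has stopped.
    incident.note(f"DIAGNOSE: {diagnose()}")
    return incident.log
```

The point of passing `stabilize` and `diagnose` as callables is that the ordering is enforced by `respond` itself rather than left to whoever is driving the incident.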

Why this isn't a duplicate (per CONTRIBUTING)

It references rather than re-teaches adjacent skills:

  • Rollback/feature-flag mechanics → shipping-and-launch
  • Regression/eval cases → test-driven-development
  • Timeline reconstruction relies on signals from observability-and-instrumentation

It's a focused edit to an existing skill (the refinement form CONTRIBUTING prefers), not a new directory.
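For readers unfamiliar with the telemetry-based timeline mentioned above, this is roughly what reconstruction by correlation ID can look like. It is a sketch under assumptions: the `Event` shape, its field names, and `build_timelines` are hypothetical, and the real signal format is whatever observability-and-instrumentation emits.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, TypedDict


class Event(TypedDict):
    # Assumed shape of one telemetry record (hypothetical field names).
    timestamp: str       # ISO 8601 with offset, e.g. "2026-01-01T10:00:00+00:00"
    correlation_id: str
    message: str


def build_timelines(events: Iterable[Event]) -> Dict[str, List[Event]]:
    """Group events by correlation ID, then order each group by time."""
    timelines: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        timelines[event["correlation_id"]].append(event)
    for chain in timelines.values():
        # Parse rather than compare strings, so differing UTC offsets
        # still sort correctly. Assumes every timestamp carries an offset.
        chain.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
    return dict(timelines)
```

Each resulting list is one request's path through the system in time order, which is the raw material for a postmortem timeline that describes what happened rather than who did it.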

Validation

  • node scripts/validate-skills.js → PASSED (0 errors, 0 warnings); all required sections intact
  • Cross-references resolve to existing skills (no dead refs)
  • Diff touches only the one SKILL.md

…tems

Adds an incident-response section to debugging-and-error-recovery covering
declare/severity, stabilize-before-diagnose, coordination roles, blameless
postmortems with telemetry-based timelines, and closing the loop into
regression/eval cases. References shipping-and-launch, test-driven-development,
and observability-and-instrumentation rather than duplicating them.
@nucliweb

Contributor

Reviewed this, and it's a solid, well-scoped addition. Sharing two notes: one confirming the sizing call, one optional.

Size / shape is right. The change is 67 lines added (~5K chars), purely additive, leaving the file at 367 lines, comfortably within the pack's range (security-and-hardening is 461, ci-cd-and-automation 390, test-driven-development 383). It's under the 100-line threshold that the contributing conventions use to decide when content should move into a supporting file, so keeping it inline is the correct call rather than spinning out a new file. node scripts/validate-skills.js passes clean here too (0 errors, 0 warnings), and all three cross-skill refs resolve.

Why inline beats a references/ file (in case it comes up). I considered whether this belongs in references/ instead, and I think inline is right:

  • The references/ files in this repo are quick-reference checklists and pattern catalogs. This is a workflow (DETECT -> DECLARE -> STABILIZE -> ... -> PREVENT), and workflows belong in skills.
  • The section leans on the host skill by name: the Stop-the-Line Rule, "the Triage Checklist above", and "the Guard-Against-Recurrence test from Step 5 above". That in-context coupling is a feature; extracting it would break those back-references.

One optional follow-up: close the discovery loop in the other direction. The section already links outward to shipping-and-launch, observability-and-instrumentation, and test-driven-development, but those skills don't point back. A couple of short pointers would make incident response reachable from where people actually are when they need it:

  • From observability-and-instrumentation: a note that the telemetry-based postmortem timeline is driven from this section.
  • From shipping-and-launch: a pointer from the rollback / feature-flag mechanics to "Stabilize before you diagnose".

Not blocking. Nice work: the stabilize-before-diagnose framing and the blameless-postmortem distinction are the genuinely valuable parts.

…ty & shipping

Closes the discovery loop the other direction (per review on addyosmani#312): observability's
correlation-ID guidance and shipping's rollback section now point to the Production
Incident Response section in debugging-and-error-recovery. Two additive one-liners.

HMAKT99 commented Jun 25, 2026

Author

Thanks for the careful review — and for spelling out the inline-vs-references/ reasoning; that matches the intent exactly (it's a workflow with in-context back-references to the host skill, so extracting it would break them).

Good call on closing the discovery loop both ways. I've pushed the two back-pointers:

  • observability-and-instrumentation (correlation IDs → the postmortem timeline is reconstructed from this telemetry)
  • shipping-and-launch ("When to Roll Back" → "Stabilize before you diagnose")

Both are one-liners, fully additive; validate-skills.js still passes 0/0 and all cross-refs resolve. Appreciate the thoughtful feedback.

