docs(debugging): add production incident response & blameless postmortems #312
HMAKT99 wants to merge 2 commits into
…tems Adds an incident-response section to debugging-and-error-recovery covering declare/severity, stabilize-before-diagnose, coordination roles, blameless postmortems with telemetry-based timelines, and closing the loop into regression/eval cases. References shipping-and-launch, test-driven-development, and observability-and-instrumentation rather than duplicating them.
Reviewed this, and it's a solid, well-scoped addition. Sharing two notes: one confirming the sizing call, one optional. Size / shape is right. The change is 67 lines added (~5K chars), purely additive, leaving the file at 367 lines, comfortably within the pack's range ( Why inline beats a
One optional follow-up: close the discovery loop in the other direction. The section already links outward to
Not blocking. Nice work: the stabilize-before-diagnose framing and the blameless-postmortem distinction are the genuinely valuable parts.
…ty & shipping Closes the discovery loop in the other direction (per review on addyosmani#312): observability's correlation-ID guidance and shipping's rollback section now point to the Production Incident Response section in debugging-and-error-recovery. Two additive one-liners.
Thanks for the careful review — and for spelling out the inline-vs- Good call on closing the discovery loop both ways. I've pushed the two back-pointers:
Both are one-liners, fully additive.
TL;DR
A single, purely additive section added to skills/debugging-and-error-recovery/SKILL.md: Production Incident Response & Blameless Postmortems. No new directory, no README/manifest/count changes, revert-safe. node scripts/validate-skills.js → 0 errors, 0 warnings.
The gap this fills
debugging-and-error-recovery is a dev-time triage workflow (reproduce → localize → reduce → fix → guard). It explicitly lists "production incidents" in scope but has no workflow for one, and a live incident inverts the order: you stabilize first and diagnose second. Nothing in the pack covers declare/severity, stabilize-before-diagnose, incident roles, or the blameless postmortem that turns an incident into prevention.
What's added
Why this isn't a duplicate (per CONTRIBUTING)
It references rather than re-teaches adjacent skills:
- shipping-and-launch
- test-driven-development
- observability-and-instrumentation
It's a focused edit to an existing skill (the refinement form CONTRIBUTING prefers), not a new directory.
Validation
node scripts/validate-skills.js → PASSED (0 errors, 0 warnings); all required sections intact
SKILL.md
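For readers who want a concrete picture of the stabilize-first ordering described under "The gap this fills", here is a minimal Python sketch. It is not taken from the PR or from SKILL.md: the `Phase`, `Incident`, and `advance` names and the severity labels are hypothetical, chosen only to show how a workflow might refuse to start diagnosis before the incident is declared and stabilized.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names; the order mirrors the description above.
    DECLARE = 1
    STABILIZE = 2
    DIAGNOSE = 3
    POSTMORTEM = 4


@dataclass
class Incident:
    title: str
    severity: str  # hypothetical labels such as "SEV2"
    completed: list = field(default_factory=list)

    def advance(self, phase: Phase) -> None:
        """Record a phase, rejecting any phase taken out of order."""
        if len(self.completed) == len(Phase):
            raise ValueError("all phases already completed")
        expected = Phase(len(self.completed) + 1)
        if phase is not expected:
            raise ValueError(f"expected {expected.name}, got {phase.name}")
        self.completed.append(phase)


incident = Incident(title="checkout errors spiking", severity="SEV2")
incident.advance(Phase.DECLARE)
incident.advance(Phase.STABILIZE)  # e.g. roll back before root-causing
incident.advance(Phase.DIAGNOSE)   # allowed only after stabilizing
```

Calling `advance(Phase.DIAGNOSE)` immediately after `DECLARE` raises `ValueError` in this sketch, which is the inversion relative to dev-time triage that the description points out.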