The StatusCat blog
How to write a blameless postmortem
A good postmortem turns an outage into a lasting improvement. Here's how to write a blameless postmortem — what to include, a reusable template, and the mistakes to avoid.
An outage is expensive whether or not you learn from it — so you may as well learn. A postmortem (or incident retrospective) turns a stressful incident into a concrete list of improvements. And the best ones are blameless: they focus on the system, not the person who happened to be holding the pager.
Here's how to write one that actually leads to change.
Why blameless?
People almost always act reasonably given the information and tools they had at the time. When something breaks, it's usually because the system allowed it — a missing guardrail, an unclear runbook, an alert that didn't fire, a deploy with no rollback. Blaming an individual gets you a scapegoat and a quieter, more defensive team. Focusing on the system gets you honest accounts and real fixes.
A simple test: if a postmortem's action items are "be more careful," it isn't blameless — and it won't prevent the next incident.
What to include
- Summary — two or three sentences a busy person can read.
- Impact — who and what was affected, how badly, and for how long.
- Timeline — key events with timestamps: when it started, when it was detected, key actions, when it resolved.
- Root cause & contributing factors — the technical cause, plus the conditions that let it happen or made it worse.
- Detection & response — how you found out (did monitoring catch it, or a customer?) and how the response went.
- Action items — concrete, owned, dated changes to prevent recurrence or reduce impact.
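The timeline and detection sections are easier to discuss once the gaps between timestamps are written down as numbers. Here is a minimal Python sketch of that arithmetic, assuming you record three UTC timestamps per incident. The `IncidentTimeline` class and the sample times are made up for illustration and are not part of any StatusCat tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Key timestamps from a postmortem timeline (hypothetical helper, all UTC)."""
    started: datetime
    detected: datetime
    resolved: datetime

    def time_to_detect(self) -> timedelta:
        # The gap between "it broke" and "we knew".
        return self.detected - self.started

    def time_to_resolve(self) -> timedelta:
        # Total duration of the incident, from start to resolution.
        return self.resolved - self.started


# Invented sample values, not taken from a real incident.
timeline = IncidentTimeline(
    started=datetime.fromisoformat("2024-01-01T10:00:00+00:00"),
    detected=datetime.fromisoformat("2024-01-01T10:12:00+00:00"),
    resolved=datetime.fromisoformat("2024-01-01T10:47:00+00:00"),
)

print("Time to detect:", timeline.time_to_detect())
print("Time to resolve:", timeline.time_to_resolve())
```

Putting both durations in the Impact or Detection & response section gives the review a concrete number to improve on next time.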
A reusable template
# Postmortem: [short title] — [date]
## Summary
[2–3 sentences: what happened and the outcome.]
## Impact
- Affected: [services / customers]
- Duration: [start] – [end] ([total])
- Severity: [SEV1 / SEV2 / SEV3]
## Timeline (UTC)
- HH:MM — [event]
- HH:MM — detected via [monitoring / customer report]
- HH:MM — [mitigation]
- HH:MM — resolved
## Root cause & contributing factors
[What broke, and the conditions that allowed it.]
## What went well / what didn't
- Went well: …
- Didn't: …
## Action items
- [ ] [Action] — owner: [name], due: [date]
- [ ] …
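If your postmortems live in Markdown files that follow this template, a small script can flag open action items that are missing an owner or a due date. The sketch below is one possible approach in Python. The function name `find_incomplete_action_items` and the sample text are hypothetical, and the check only looks for `owner:` and `due:` fields on each open checkbox line, whatever separator comes before them.

```python
import re

# Capture whatever follows "owner:" or "due:" up to the next comma or line end.
OWNER_RE = re.compile(r"owner:\s*([^,\n]+)")
DUE_RE = re.compile(r"due:\s*([^,\n]+)")


def is_filled(match) -> bool:
    """True if the field exists and is not an unfilled [placeholder]."""
    if match is None:
        return False
    value = match.group(1).strip()
    return bool(value) and not (value.startswith("[") and value.endswith("]"))


def find_incomplete_action_items(markdown: str) -> list[str]:
    """Return open checkbox lines that lack a real owner or due date."""
    incomplete = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("- [ ]"):
            continue  # skip headings, prose, and completed "- [x]" items
        has_owner = is_filled(OWNER_RE.search(stripped))
        has_due = is_filled(DUE_RE.search(stripped))
        if not (has_owner and has_due):
            incomplete.append(stripped)
    return incomplete


# Invented sample text for illustration.
sample = """\
## Action items
- [ ] Add an SSL-expiry alert at 30 days - owner: Dana, due: 2024-03-01
- [ ] Improve monitoring
- [ ] Document the rollback procedure - owner: [name], due: [date]
"""

for item in find_incomplete_action_items(sample):
    print("Needs owner and due date:", item)
```

A check like this could run before a postmortem is marked as final, so unowned items are caught while the incident is still fresh.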
Common mistakes to avoid
- Naming and shaming. Kills honesty. Talk about roles and systems, not people.
- Vague action items. "Improve monitoring" isn't actionable. "Add an SSL-expiry alert at 30 days for all production domains — owner: X, due: Friday" is.
- No owners or dates. Action items without an owner and a due date rarely get done.
- Skipping detection. How long between "it broke" and "we knew"? If a customer told you first, that gap is itself an action item.
- Writing it and filing it. Track the action items to completion, or the next incident is a rerun.
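To show what a specific action item can look like once it is implemented, here is a rough Python sketch of the 30-day SSL-expiry check mentioned above. It assumes you already have the certificate's `notAfter` string, in the format Python's `ssl.SSLSocket.getpeercert()` returns. The function names and sample dates are illustrative, and fetching the certificate or sending the alert is left out.

```python
import ssl
from datetime import datetime, timezone

ALERT_THRESHOLD_DAYS = 30  # threshold taken from the example action item


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter time.

    `not_after` looks like "Jun  1 12:00:00 2025 GMT".
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).total_seconds() / 86400


def should_alert(not_after: str, now: datetime,
                 threshold_days: int = ALERT_THRESHOLD_DAYS) -> bool:
    """True when the certificate expires within the threshold."""
    return days_until_expiry(not_after, now) <= threshold_days


# Invented sample values; a real check would pass datetime.now(timezone.utc).
now = datetime(2025, 5, 15, 12, 0, 0, tzinfo=timezone.utc)
print(should_alert("Jun  1 12:00:00 2025 GMT", now))  # expires in 17 days
print(should_alert("Dec  1 12:00:00 2025 GMT", now))  # expires in 200 days
```

The threshold is a parameter so that each domain, or each team, can choose how much warning it needs.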
Close the loop with detection
Many postmortem action items come back to detection: an alert that should have fired, a check that didn't exist, a status page that wasn't updated. StatusCat gives you that layer — uptime monitoring, on-call and escalation, and status pages — free for 50 monitors, so "we should have known sooner" turns into a check you actually set up. Pair this with a solid incident response process.