Small confession: we currently use the term 'post-mortem' in incident.io despite preferring the term 'incident debrief'. Unless you have particularly serious incidents, the link to death here really isn’t helping anyone. However, we're optimising for familiarity, so we're sticking to the term 'post-mortem' here.
Ask any engineer and they’ll tell you that a post-mortem is a positive thing (despite the scary name). Being able to reflect on an incident helps us learn from our mistakes and do better next time. The return grows further when future engineers and decision makers are able to access the record of events.
However, one does not simply follow a post-mortem guide and reap the benefits; post-mortems are all too easily executed badly.
Most obviously, you might not need a post-mortem at all. It’s common to skip post-mortems for low-severity incidents, but for one-in-a-million events or those that are out of your control (e.g. a provider is down), you may want to apply the same rule. The energy invested in preventing the incident can outweigh the pain incurred by the incident itself.
Secondly, compiling information can take a lot of human time. I’ve seen engineers spend hours trying to find the exact timestamps, or pore over the best way to describe what happened. There is pressure to fill all the fields in a post-mortem template, otherwise it won’t be signed off as “done”.
With proper tooling, automation can be used to compile the key information, freeing up a responder’s time to think about the important questions below.
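As a rough illustration of the kind of compilation that can be automated, here is a minimal sketch that merges timestamped events from several sources into a single ordered timeline. The `Event` type, its fields, the `build_timeline` helper and the source names are all hypothetical, made up for this example; real tooling would pull events from your own chat, paging and deployment systems.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    timestamp: datetime  # when it happened (timezone-aware)
    source: str          # hypothetical origin label, e.g. "chat" or "pager"
    summary: str         # one-line description of what happened


def build_timeline(events):
    """Return one line per event, sorted by time, with minutes since the first event."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    if not ordered:
        return []
    start = ordered[0].timestamp
    lines = []
    for event in ordered:
        minutes = int((event.timestamp - start).total_seconds() // 60)
        lines.append(
            f"{event.timestamp:%H:%M} (+{minutes}m) [{event.source}] {event.summary}"
        )
    return lines


if __name__ == "__main__":
    # Illustrative data only; events arrive out of order on purpose.
    sample = [
        Event(datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc), "chat", "Rollback started"),
        Event(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc), "pager", "Alert fired"),
    ]
    for line in build_timeline(sample):
        print(line)
```

Nobody has to hunt for exact timestamps by hand when the events are already recorded somewhere; the sorting and formatting is the easy part to hand over to a script.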
To get the most out of a post-mortem, it’s important to establish the level at which you reflect on an incident. The classic gotcha is that contributors assume the post-mortem is a place to work out how to fix X so that it doesn’t happen again. But that is the lowest level of reflection. You could also look at how to prevent this class of incidents, or step back and ask whether your existing solution needs a whole re-think. There’s no point fixing a leaky sink in a burning building. Your post-mortem should help you discover what the problem really is.
Post-mortems are meant to be blameless, meaning they focus on how a mistake was made rather than who made it. But they can easily get too retrospective, focusing on what could have been if a decision had been made differently. If someone has a bee in their bonnet about how a particular service was built, and that service went down, you’ll likely end up with a rant. Post-mortem discussions need steering so that they look forward.
A post-mortem normally results in action points. This is great, but where are these tasks meant to sit alongside all the other tickets for this sprint? How do we prioritise them? I’ve seen entire lists of “Action points” lie dormant in post-mortem documents whilst everyone tries to recover from the incident and pick up everything else they were meant to get done. Just like in any planning meeting, action points should be drawn up into an issue tracker like Linear/Jira, and ongoing work should be re-prioritised if need be.
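To make that hand-off concrete, here is a small sketch of turning action points into tracker-ready drafts. It deliberately stops short of calling any tracker: `TicketDraft`, `draft_tickets`, the priority scale and the `INC-123` reference are all assumptions for illustration, and pushing the drafts into Linear or Jira would go through whatever integration your team already uses.

```python
from dataclasses import dataclass


@dataclass
class TicketDraft:
    title: str
    owner: str
    priority: int      # 1 = most urgent; the scale is an assumption
    incident_ref: str  # hypothetical link back to the incident


def draft_tickets(action_points, incident_ref):
    """Turn (title, owner, priority) tuples into drafts, most urgent first."""
    drafts = [
        TicketDraft(
            title=title.strip(),
            owner=owner,
            priority=priority,
            incident_ref=incident_ref,
        )
        for title, owner, priority in action_points
        if title.strip()  # skip empty placeholder rows left in the template
    ]
    # sorted() is stable, so equal priorities keep their original order.
    return sorted(drafts, key=lambda d: d.priority)


if __name__ == "__main__":
    # Illustrative data only.
    points = [
        ("Add alert on queue depth", "sam", 2),
        ("Document rollback steps", "alex", 1),
    ]
    for draft in draft_tickets(points, "INC-123"):
        print(draft)
```

Every action point leaves the document with an owner, a priority and a link back to the incident, so it can be weighed against the rest of the sprint rather than forgotten.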
Given the negative tone of most of this, I’ll have to round off by reiterating that I do find post-mortems incredibly useful. When the laborious bits are automated, the discussion is steered, and action points are seamlessly plugged into a team’s roadmap, what you have left is a brainstorming session on a tricky problem. And that’ll take any engineer’s fancy.