Post-mortems & debriefs
We currently use the term post-mortem here, despite preferring 'Incident Debrief' internally. Unless you have particularly serious incidents, the link to death here really isn’t helping anyone.
Post-mortems are used to dig into what happened in an incident, why, and how it can be prevented in future. They typically take the form of meetings that occur after an incident is resolved and will involve those who responded to the incident, plus other stakeholders who weren’t present.
During those meetings, a document, or post-mortem, will usually be created. It is likely to include a timeline of the incident, key learnings, and follow-ups that the team wants to do as a result.
Considerations for holding a debrief meeting
Deciding which incidents warrant a debrief is a difficult balance.
Blindly running debriefs for all incidents by default can become a big time sink, and can dilute the value of the ones that really matter.
On the other hand, neglecting them means losing out on some of the best sources for learning and growth. Incidents are always going to happen, so it’s important to use them to your advantage and harness the learnings there.
Often just being involved in the incident is enough
The best learnings from incidents often come from just being involved and following the process.
Holding your incidents in the open with clear updates provides a great way for people to go back and follow the key decisions, without having to coordinate a large meeting.
Seek signal from those involved
Usually those firefighting an incident have the best perspective on whether there are useful learnings or things that need unpacking.
People who find themselves in incidents typically develop good instincts around which issues feel more routine, and which warrant the time and effort to explore.
Local thinking on global problems
Another common indicator that a debrief might be helpful is when we identify local fixes to global problems. These come in a few different guises, but in general they look like plasters over a problem, rather than a deeper understanding and treatment of an underlying set of causes. Examples might include a quick-and-dirty fix applied in the heat of the moment that we want to revisit, or a misunderstanding shared by a few individuals that we might want to clarify more widely across a team or organization. In any case, if you’re uneasy about the recurrence of similarly shaped issues, you’re likely to benefit from further analysis.
One of the best indicators of local thinking can be seen when we refer to an incident being the fault of an individual performing an action. If you find yourself citing human error as a cause, schedule that debrief meeting immediately! As Sidney Dekker explains in The Field Guide to Understanding Human Error, “human error is the starting point for your investigation”. If it was possible for someone to cause a problem, your debrief can analyze how that was the case.
Running an effective debrief
Understand the difference between blame and accountability
It’s essential to start from an assumption of good intent.
Being blameless doesn't mean you need to tiptoe around the problem and avoid using names. It's more nuanced than that, and deliberately avoiding discussing the actions of key people can severely hamper the opportunities for learning.
Instead, being truly blameless is about the starting point for an investigation being based on the premise that everyone arrived at work on that day to do all the right things. When there’s mutual acceptance of this fact, discussions around specific actions move from “why did you do this?” to “that action clearly made sense to you – help us understand why”.
Being accountable in a debrief can be interpreted quite literally as being the person who’s responsible for giving the account of what happened.
If we understand the specific motivations of the folks who were there when this was happening, we stand to learn the most about the situation, and ultimately turn that into actionable follow-ups or knowledge that can be shared.
Good facilitation is vital
Debriefs for big incidents require experienced facilitators.
Emotions can run high, and there will be folks looking out for the person or team to blame. High-pressure meetings can inhibit a good environment for learning.
Steering people away from blame culture is not an easy task. This is amplified when the incident has gathered traction from the organization, and there are execs or other high-level stakeholders in attendance.
Prevention is not the only outcome
Incidents will always happen again. At best it’s unhelpful to ask “how are we going to prevent incidents like this from happening in future”, and at worst it can actively inhibit actually learning from a failure.
Take the scenario where a feature of a system you didn't know about, behaved in a way you didn't expect, and put you in a situation you couldn't foresee. How do you prevent that scenario from happening again?
By virtue of fixing the issue during the incident, you learned something you didn't know, and can put some controls in place to reduce the likelihood of that specific thing happening again. But what about the hundred other features of that system you don't know about? Do you prioritise a deep dive into the system to understand everything? And once you've done that, how many other systems do you need to do the same on?
The point here isn't that we should throw our hands in the air and give up. Everyone wants to drive towards a better service and happier customers, but you need to get comfortable with the fact you can't prevent everything. Trying to do so will likely tie you in knots on low-value work, with little to no guarantee that it'll actually pay off.
Let things sink in
Post-mortems shouldn’t be run immediately. Incidents are high in pressure and it’s important for everyone to take a step away and really think about what happened before launching into the debrief. Take a day, but make sure you’re scheduling it when the incident is still fairly fresh in everyone’s minds - ideally no longer than a week.
When running a debrief, it's easy to get carried away and generate 37 action items to combat the 5 minutes of downtime you experienced. Incidents shine a light on a particular problem, and combined with recency bias (i.e. this is the most important thing because it's fresh in my memory), it's easy to get lured into prioritising a bunch of work that really shouldn't be done.
The sad reality is that there's always more that can be done in pretty much every corner of everything we build. But it's important to approach things with perspective, and avoid letting the pressure and spotlight on this incident drive you to commit to arguably low-value work.
The best solution we've found is to introduce a mandatory time gap – "soak time" – to let those ideas percolate, and the more rational part of your brain figure out whether they really are the best use of your time. How long that is will depend on the incident, the people, and a number of other factors, but 1-3 days is a sensible default.
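If you want to bake these timings into a scheduling reminder, the rules of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example dates, and the two-day soak default (picked from within the 1-3 day range) are hypothetical, and you should tune them to the incident and the people involved.

```python
from datetime import datetime, timedelta

# Assumed values, taken from the rules of thumb in this section.
MIN_WAIT = timedelta(days=1)      # take a day before running the debrief
MAX_WAIT = timedelta(days=7)      # ideally no longer than a week
DEFAULT_SOAK = timedelta(days=2)  # assumption: a value inside the 1-3 day default


def debrief_window(resolved_at: datetime) -> tuple[datetime, datetime]:
    """Return the (earliest, latest) suggested times for holding the debrief."""
    return resolved_at + MIN_WAIT, resolved_at + MAX_WAIT


def actions_review_time(debrief_at: datetime,
                        soak: timedelta = DEFAULT_SOAK) -> datetime:
    """Return when proposed action items should be revisited and prioritised."""
    return debrief_at + soak


# Hypothetical incident resolved on a Monday afternoon.
resolved = datetime(2024, 3, 4, 15, 30)
earliest, latest = debrief_window(resolved)
print("Hold the debrief between", earliest.date(), "and", latest.date())
```

The point of keeping the soak time as a separate step is that action items proposed in the debrief are only committed to once the gap has passed, rather than in the meeting itself.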
Value the debrief process as much as the document
Valuing the written incident artifact over all else is a common pitfall.
Plans are nothing; planning is everything.
In our experience, these reports are typically not written to be read or to convey knowledge, but instead are there to tick a box.
It’s useful to remember, documents aside – the process of running debriefs is itself a perfectly effective way to get your money's worth out of incidents.