This is the first in a series of posts which look at what happens when the incident is over and we're thinking about what to do next. We'll look at some guidance for deciding whether a debrief is worthwhile, how to prepare for a debrief meeting, and finally how to approach the debrief meeting itself.
The dust has settled after your efforts to get things back on track during your last incident, and everything's once again working as it should. Time to get back to work? Possibly, but you might want to pause and take the time to look more deeply at what happened, and whether it's worth seeking out and socialising learnings more widely. We call this activity an incident debrief, but you might know it as a post-mortem or incident analysis.
We think about incidents as a cost of doing business – a byproduct of success – and since you can't avoid them, the best you can do is to make sure you get your money's worth. But what does that mean in practice? How do you get value from failure, and when is it worth investing time in actively seeking that value with post-incident activities?
Perhaps the most pertinent question is why wouldn’t you want to thoroughly analyse every incident? In an ideal world, we’d probably do just that, but for the vast majority of us there are time and cost trade-offs to be made. You could spend a day or two preparing for an incredible debrief, but what about the product feature you need to ship, or the improvements you already know you need to make to improve the reliability of your system?
A key consideration for deciding whether or not to debrief is assessing whether the time investment is justified. Absent a crystal ball we can’t know this for sure, but here are some points that might help steer you in the right direction.
Often just turning up is enough
When it comes to learning from incidents, the good news is that you're almost certain to be learning just by turning up. By being in incidents and dealing with unfamiliar issues under higher than normal pressure, you are unavoidably learning and developing expertise.
Expertise comes in all shapes and sizes too. Not only are we learning about our technical systems, but we're also learning about how the organization works as a system itself. When you learn that the risk and compliance person needs some specific data to decide whether to report externally, or you find out the folks in customer support have already prepared a set of status page updates for when the impact of a situation isn’t clear, these are all things that you pick up passively. Next time you have an incident, that’s knowledge that’ll lead to things running a little more smoothly.
For the avoidance of doubt, learning by turning up isn’t a case against a more thorough debriefing. But if you are struggling for time after incidents, don’t beat yourself up – by tracking incidents in the open, your organization will be improving.
Using policy and regulation to your advantage
Sometimes, there’ll be external forces that remove the need or ability for you to make that decision. Often regulation or internal policy will dictate that you have to complete a debrief, usually driven by a requirement to file a document rather than a specific desire to extract real and meaningful learnings. We've been there, but see this as a positive. You're being told you have to spend time learning. Use it to your advantage!
Driving the process from incident severity
Another common approach is thresholding on severity – only allocating the time to the biggest and most impactful incidents. Even if we set aside the fact that severities are easily negotiated (ever seen folks decide which of two severities applies?), thresholding on higher severities leaves value on the table.
High severity incidents bring with them a unique set of challenges. Firstly, they tend to receive a lot of interest from the organization, and the pressure that generates can inhibit a good environment for learning. You’ll see this in post-mortem meetings where people are keen to skip over the timeline and surrounding discussion in order to jump straight into what we’re doing to prevent it ever happening again. Debriefs for big incidents typically require experienced facilitators too. Understandably, emotions are likely to run high, and there’ll be folks who are out looking for the person or team that is to blame. Steering folks away from blame isn’t an easy task, especially when you’ve not done it before, or when you’re trying to do it to an exec several levels above you in the corporate food chain.
Our advice here? Continue to run your debrief for high severities if that’s what’s expected, but look for the smaller ones where you can develop the muscle. Some of the best debriefs we’ve experienced have been for lower severity incidents, where a group of individuals with a shared goal of learning have huddled around a whiteboard and collaboratively explored a failure. Do this and it’ll lead to better debriefs when those bigger, gnarlier ones come up.
Seeking signal from those who were involved
Ever found yourself in an incident with no clue how to proceed, or who to escalate to? Ever wondered how someone else ‘just knew’ the right dashboard to look at, or why one team has a set of critical alerts disabled for their system? We often find ourselves outside of our comfort zone in incidents, dealing with things that we’ve never faced before. It’s never a comfortable feeling in the heat of the moment, but uncertainty and unfamiliarity are strong signals that debrief time is justified. If you didn’t know how to deal with the issue, there’s a strong possibility other folks in your team or organization would be in the same position.
In the general case, speak to the folks in an incident. People who find themselves in incidents typically develop good instincts around which issues feel more routine, and which warrant time and effort to explore.
Local thinking on global problems
Another common indicator that a debrief might be helpful is when we identify local fixes to global problems. These come in a few different guises, but in general they look like plasters over a problem, rather than deeper understanding and treatment of an underlying set of causes. Examples might include a quick-and-dirty fix that was applied in the heat of the moment that we want to revisit, or a common misunderstanding among a few individuals that we might want to clarify more widely across a team or organization. In any case, if you’re uneasy about the recurrence of similarly shaped issues, you’re likely to benefit from further analysis.
Human error is the starting point for your investigation
One of the best indicators of local thinking can be seen when we refer to an incident being the fault of an individual performing an action. If you find yourself citing human error as a cause, schedule that debrief meeting immediately! As Sidney Dekker explains in The Field Guide to Understanding Human Error, “human error is the starting point for your investigation”. If it was possible for someone to cause a problem, your debrief can analyse how that was the case.
Good debriefs will breed good debriefs
Sometimes the time investment isn’t the problem, but instead it’s a question of motivation to actually go through the process. There are plenty of reasons why this might be the case, ranging from people not knowing how to run a debrief or not thinking they’re worthwhile, to scar tissue from bad experiences in the past. All understandable, and the kind of thing that’s best combated with a lead-by-example approach. Find a lower severity incident, get some folks together, and demonstrate the process and value to them. Good debriefs are likely to generate good debriefs.
There are myriad reasons why you should spend time debriefing after incidents, but as you develop an increasingly healthy incident culture across your organization you might need to be judicious about where you spend your time. We'd always bias towards learning wherever possible, but if trade-offs are necessary, the points here might help you focus your efforts.
Image credit: Volodymyr Hryshchenko
I'm one of the co-founders and the Chief Product Officer of incident.io.