This is the first in a series of posts which look at what happens when the incident is over and we're thinking about what to do next. We'll look at some guidance for deciding whether a debrief is worthwhile, how to prepare for a debrief meeting, and finally how to approach the debrief meeting itself.
The dust has settled after your efforts to get things back on track during your last incident, and everything's once again working as it should. Time to get back to work? Possibly, but you might want to pause and take the time to look more deeply at what happened, and whether it's worth seeking out and socialising learnings more widely. We call this activity an incident debrief, but you might know it as a post-mortem or incident analysis.
We think about incidents as a cost of doing business (a byproduct of success), and since you can't avoid them, the best you can do is to make sure you get your money's worth. But what does that mean in practice? How do you get value from failure, and when is it worth investing time in actively seeking that value with post-incident activities?
Perhaps the most pertinent question is why wouldn't you want to thoroughly analyse every incident? In an ideal world, we'd probably do just that, but for the vast majority of us there are time and cost trade-offs to be made. You could spend a day or two preparing for an incredible debrief, but what about the product feature you need to ship, or the improvements you already know you need to make to improve the reliability of your system?
A key consideration for deciding whether or not to debrief is assessing whether the time investment is justified. Absent a crystal ball we can't know this for sure, but here are some points that might help steer you in the right direction.
When it comes to learning from incidents, the good news is that you're almost certain to be learning just by turning up. By being in incidents and dealing with unfamiliar issues under higher than normal pressure, you are unavoidably learning and developing expertise.
Expertise comes in all shapes and sizes too. Not only are we learning about our technical systems, but we're also learning about how the organization works as a system itself. When you learn that the risk and compliance person needs some specific data to decide whether to report externally, or you find out the folks in customer support have already prepared a set of status page updates for when the impact of a situation isn't clear, these are all things that you pick up passively. Next time you have an incident, that's knowledge that'll lead to things running a little more smoothly.
For the avoidance of doubt, learning by turning up isn't a case against a more thorough debriefing. But if you are struggling for time after incidents, don't beat yourself up: by tracking incidents in the open, your organization will be improving.
Sometimes, there'll be external forces that remove the need or ability for you to make that decision. Often regulation or internal policy will dictate that you have to complete a debrief, usually driven by a requirement to file a document rather than a specific desire to extract real and meaningful learnings. We've been there, but see this as a positive. You're being told you have to spend time learning. Use it to your advantage!
Another common approach is thresholding on severity: only allocating the time to the biggest and most impactful incidents. Even if we set aside the fact that severities are easily negotiated (ever seen folks decide which of two severities applies?), thresholding on higher severities leaves value on the table.
High severity incidents bring with them a unique set of challenges. Firstly, they tend to receive a lot of interest from the organization, and the pressure that generates can inhibit a good environment for learning. You'll see this in post-mortem meetings where people are keen to skip over the timeline and surrounding discussion in order to jump straight into what we're doing to prevent it ever happening again. Debriefs for big incidents typically require experienced facilitators too. Understandably, emotions are likely to run high, and there'll be folks who are out looking for the person or team that is to blame. Steering folks away from blame isn't an easy task, especially when you've not done it before, or when you're trying to do it with an exec several levels above you in the corporate food chain.
Our advice here? Continue to run your debrief for high severities if that's what's expected, but look for the smaller ones where you can develop the muscle. Some of the best debriefs we've experienced have been for lower severity incidents, where a group of individuals with a shared goal of learning have huddled around a whiteboard and collaboratively explored a failure. Do this and it'll lead to better debriefs when those bigger, gnarlier ones come up.
Ever found yourself in an incident with no clue how to proceed, or who to escalate to? Ever wondered how someone else "just knew" the right dashboard to look at, or why one team has a set of critical alerts disabled for their system? We often find ourselves outside of our comfort zone in incidents, dealing with things that we've never faced before. It's never a comfortable feeling in the heat of the moment, but uncertainty and unfamiliarity are strong signals that debrief time is justified. If you didn't know how to deal with the issue, there's a strong possibility other folks in your team or organization would be in the same position.
In the general case, speak to the folks in an incident. People who find themselves in incidents typically develop good instincts around which issues feel more routine, and which warrant time and effort to explore.
Another common indicator that a debrief might be helpful is when we identify local fixes to global problems. These come in a few different guises, but in general they look like plasters over a problem, rather than deeper understanding and treatment of an underlying set of causes. Examples might include a quick-and-dirty fix that was applied in the heat of the moment that we want to revisit, or a misunderstanding shared by a few individuals that we might want to clarify more widely across a team or organization. In any case, if you're uneasy about the recurrence of similarly shaped issues, you're likely to benefit from further analysis.
One of the best indicators of local thinking can be seen when we refer to an incident being the fault of an individual performing an action. If you find yourself citing human error as a cause, schedule that debrief meeting immediately! As Sidney Dekker explains in The Field Guide to Understanding Human Error, "human error is the starting point for your investigation". If it was possible for someone to cause a problem, your debrief can analyse how that was the case.
Sometimes the time investment isn't the problem, but instead it's a question of motivation to actually go through the process. There are plenty of reasons why this might be the case, ranging from people not knowing how to run debriefs or not thinking they're worthwhile, to scar tissue from bad experiences in the past. All understandable, and the kind of thing that's best combated with a lead-by-example approach. Find a lower severity incident, get some folks together, and demonstrate the process and value to them. Good debriefs are likely to generate good debriefs.
There's a myriad of reasons why you should spend time debriefing after incidents, but as you develop an increasingly healthy incident culture across your organization you might need to be judicious about where you spend your time. We'd always bias towards learning wherever possible, but if trade-offs are necessary, the points here might help you focus your efforts.
I'm one of the co-founders, and the Chief Product Officer here at incident.io.