Incident reviews are meetings held after the issue has been mitigated, often within a few days, and ideally while the incident is still fresh enough in people's minds.
Attendance should be kept to people directly involved in the incident, along with subject matter experts and any (actually) relevant stakeholders.
These meetings are cross-functional, with representation from engineering, customer support, and sometimes leadership. They are not just for engineers to talk about the technical parts, but for everyone to analyse and discuss the overall response too.
Keeping the attendee list this tight matters: the larger the “audience,” the harder it is for participants to speak openly. Reviewing what went wrong is challenging enough without feeling like you’re presenting to half the engineering team.
Recording the session and sharing it afterwards is a great way to allow 'read-only' participation from outside the room.
The objective of an incident review meeting is straightforward: to align the group on what happened, to deepen everyone's understanding, and to identify areas for improvement.
We've collectively tried a number of methods for running a successful incident review, and landed on the following format as the best balance of in-depth learning and time efficiency, making it highly practical for any organization to follow.
Whilst the process is outlined as a linear set of stages here, in reality a great incident review can bounce around between sections. For example, whilst walking the timeline it's likely you'll surface contributors you can log, or actions you might want to note.
We encourage leaning into the flow of the conversation rather than dogmatically following this structure.
Incident reviews—especially for big incidents—require experienced facilitators. Deciding who runs it is a little bit of an art form, but in general the following people are good candidates:
In all cases, the facilitator should feel comfortable walking through the incident, guiding the conversation, and keeping the discussion from getting stuck or going too deep on any one point.
It's likely people in the review meeting will have varying levels of context on the incident, so opening with a summary to level-set the room is helpful. This should cover what happened, the impact that was felt, and how it was fixed. It doesn't need to go into great depth, but it's a good time to make sure everyone understands the key components of the incident so the timeline walkthrough makes sense.
The incident review meeting starts with the facilitator walking through the timeline of the incident. You should have a comprehensive post-mortem document to help do this, and the timeline itself should anchor the narrative and highlight the key moments of the incident.
The goal here is to bring the context of the incident back to the room, and gather any additional thoughts too. Participation should be actively encouraged, and directing questions at individuals can help. For example:
[Facilitator]: At 9:05 Sara restarted the Auth service by deleting the Kubernetes pods. This worked and fixed the login issues we were seeing.

[Facilitator]: Sara – how did you know that would work?

[Sara]: We had an incident the week before ...
The timeline should be updated with new information, comments and corrections as you go, and by the end of the walkthrough the full picture of the incident should be understood.
Having walked the timeline and refreshed everyone's memory on the incident, the next stage is a more focussed discussion on the contributors, mitigators and risks surrounding the incident. It's likely you'll already have some of these written as part of the post-mortem writing process, or noted down during the timeline walkthrough.
You'll notice there's no "root cause" section here, which is very intentional. Incidents can rarely be boiled down to a singular cause; instead they are the culmination of many factors that come together to allow them to happen. Framing the conversation around contributors, mitigators and risks helps to capture a much fuller picture than looking for one smoking gun.
Here's how to think about each of these sections:
Contributors: Listing the contributors to an incident helps paint a full picture of what factors played a role, not as ‘root causes’ but as a set of conditions that allowed the incident to happen, or made it as bad as it was. Contributors can include technical factors (e.g. the server’s disk filled up), human factors (e.g. the on-call engineer missed the initial alert), and external factors (e.g. a marketing event was happening at the same time). Thinking in this way helps to explore the full scope of the problem without getting overly focused on any single thing.
Mitigators: While contributors are the factors that enabled the incident, mitigators are what helped reduce its impact. These can include technical controls that worked as expected, helpful circumstances like the incident occurring during working hours, or specific actions, such as having an expert on call. Identifying mitigators in this way can highlight strengths we might want to reinforce or share across the team.
Risks and learnings: This section captures insights on what we learned, how we can improve our response, and any broader risks the incident revealed. For example, we might realize that only one teammate knew the details of the system that failed, indicating a “key person risk.” Noting recurring risks across incidents can provide valuable insights teams might choose to address.
Action items in incident reviews can be a bit of a hot-button topic, with some folks arguing they steal the focus away from learning as the primary objective. Our view is that learning and the discussion of action items are inextricably linked, and with good facilitation it's a markedly better outcome to discuss and note actions during an incident review.
When it comes to discussion of action items, we suggest the following guiding principles:
Let improvements surface naturally: Focus on contextually relevant actions that come up organically during the review, rather than setting aside time specifically for “action items.” As you walk through the incident, people will naturally suggest improvements. The facilitator should note these, assign an owner, and guide the group to avoid going too deep.
Leverage action items for deeper learning: Action items can lead to valuable insights and refinement of mental models. When someone suggests an action, it often sparks further conversation about why it might help and how it aligns with the bigger picture. Lean into this!
Be mindful of second-order effects: Use the review to consider potential second-order impacts of proposed actions. Sometimes actions intended to reduce a risk can unintentionally introduce new risks elsewhere. Lightly discussing these actions during the review helps identify and mitigate unintended consequences.
For a deeper dive on action items in reviews, check out this blog post!
During a debrief, it’s incredibly easy to generate long lists of action items to prevent the thing from happening again. Incidents are fantastic at highlighting specific problems, and with the influence of recency bias (thinking it’s the most important issue because it just happened), there’s a temptation to prioritize tasks that may not be worth the effort, especially when weighed up against other priorities.
The reality is that there’s always more work to be done across every aspect of our systems. It’s essential to maintain perspective and avoid letting the intensity of an incident push you toward low-value work.
We'd suggest allowing the responsible teams some "soak time": a pause that lets ideas settle and allows for a more rational assessment of whether they're truly worth pursuing.