Need your own incident post-mortem template? Here’s ours

Article
incident.io

Having a dedicated incident post-mortem is just as important as having a robust incident response plan. The post-mortem is key to understanding exactly what went wrong, why it happened in the first place, and what you can do to avoid it in the future.

It’s an essential document, but many organizations either haphazardly assemble post-incident notes that live in disparate places or don’t know where to start when creating their own post-mortems. To help, we’re sharing the incident post-mortem template that we use internally.

This template outlines our “sensible default” for documenting any incident, technical or otherwise. We believe it strikes a healthy balance between raw data, human interpretation, and concrete actions. And we say “sensible default” because it’s rare that this will perfectly cover the specific needs of your organization, and that’s fine. Think of this as a jumping-off point for your own incident post-mortem document.

Within each section, we’ve outlined the background on what it’s for, why it’s important, and how we advise you to complete it.


What is an incident post-mortem?

A post-mortem is a document that allows you to better understand the origins and causes of an incident and ultimately learn from your response to it.

Each incident post-mortem should include key information such as start and resolution times, the incident response team, and more. The end of the document should be focused on understanding what vulnerabilities allowed the incident to happen in the first place and how best to avoid similar incidents in the future.

💡 Pro tip: while you can use pretty much any document creator for your post-mortems, we highly recommend using something like Notion. Using Notion allows you to drop all of your incident post-mortem documentation into a single database, tag appropriate teammates, set filters, customize views, and generally tailor things as much as you deem necessary.

This makes workflows much more seamless compared to other kinds of standalone document editors like Google Docs.

Below, you’ll find the template that we use here at incident.io, which you can draw on to inspire your own post-mortem document.

1) Key information

The key information section summarizes the data behind the incident. It’s the most succinct summary of the incident, which is useful to orient anyone reading the remainder of the document.

Think of this as the metadata of your incident. Core information such as incident type, severity, leads and more should be outlined here.

Here’s what this section could look like:

ℹ️ Tagged Data!

  • Incident Type: Platform
  • Severity: Major
  • Affected services: Login, Cart, User Preferences

👪 Team

  • Incident Lead: Chris Evans
  • Reporter: Steven Millhouse
  • Active participants: Rebecca Song
  • Observers: Michael Scott
  • Jira issue
  • Slack channel

⏱️ Key Timestamps

  • Impact started at: November 10, 2022 2:12 PM
  • Reported at: November 10, 2022 2:12 PM
  • Identified at: November 10, 2022 2:18 PM
  • Fixed at: November 10, 2022 2:20 PM (UTC)
  • Closed at: November 10, 2022 2:30 PM

2) Incident summary

The summary should provide a concise and accessible overview of the incident. It’s important that this summary is exactly that: a summary. If you try to throw everything in here, it becomes very difficult for someone to come in later and easily parse what happened.

Your incident summary should clearly articulate what happened, who was involved, how you responded, and the overall impact of the event. It’s helpful for the summary to be written such that someone who wasn’t there can use it to develop a high level understanding of the situation.

We find it useful to pitch it so your boss’s boss would understand!

Here’s a real-life incident post-mortem summary we wrote up internally:

💡 On 2022-11-18 from 15:39 to approximately 16:23, an asynchronous event caused an error which took one of our Heroku "dynos" down (referred to as servers for simplicity below).

Our infrastructure logic is such that if an event has an issue being processed, it retries. In this case, it meant that another server then attempted to process the same malformed/dangerous event, before also crashing. This repeated until all our servers were down, at which point our app became completely unavailable. This type of error is commonly known as a "poison pill".

The poison pill was caused by a consumer that returned an error, which wrapped a nil error. This meant our consumer tried to unwrap the error and got a nil pointer dereference when trying to access the root cause. Normally this would be fine, however only the actual app-code section of our queue consumers were within the panic recovery boundary, so by panicking in the pubsub wrapper, we'd crash the entire app.

We quickly rebooted the servers, but unfortunately, the event was retried again, taking down each server in quick succession after a few minutes. This continued for around 25 minutes.

We deployed a fix to make sure all of the code in our consumers was covered by a "recover" statement (meaning we handle errors like this more gracefully), which quickly caught the error once deployed and prevented it from taking the app down again.

From there, we fixed the bad code and stayed online. We also made a number of other improvements to our app and infrastructure setup.
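To make the fix above concrete, here’s a minimal sketch in Go of widening the panic recovery boundary around a queue consumer. The names (`processEvent`, `safeConsume`) and the event format are hypothetical, not the real incident.io consumer code; the point is that the `recover` wraps the entire consumer body, so a "poison pill" event is dropped with an error instead of crashing the process.

```go
package main

import "fmt"

// processEvent simulates app code that panics on a malformed
// ("poison pill") event — here, by calling a method on a nil
// error, which triggers a nil pointer dereference at runtime.
func processEvent(ev string) error {
	if ev == "poison" {
		var cause error // nil: calling Error() on it panics
		panic(cause.Error())
	}
	return nil
}

// safeConsume wraps the ENTIRE consumer body in a recover
// boundary, so a panicking event is converted into an error
// instead of taking down the whole process.
func safeConsume(ev string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic processing %q: %v", ev, r)
		}
	}()
	return processEvent(ev)
}

func main() {
	// A poison event no longer crashes the loop; later events
	// are still processed.
	for _, ev := range []string{"ok", "poison", "ok"} {
		if err := safeConsume(ev); err != nil {
			fmt.Println("dropped:", err)
			continue
		}
		fmt.Println("processed:", ev)
	}
}
```

Without the deferred `recover`, the panic would propagate out of the consumer and crash the process, and a retrying queue would then feed the same event to the next server, which is exactly the cascading failure described above.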

3) Incident timeline

The incident timeline exists to provide a narrative of the incident; essentially retelling the story from start to finish. It should outline key events and developments that took place, investigations that were carried out, and any actions that were taken.

It’s tempting to go into huge amounts of detail here, but we’d typically advise the detail to remain in a communication thread (i.e. a Slack channel), and have the timeline contain just the significant events and developments.

Typically we’d include the timestamp (in the dominant timezone for the incident or UTC if spanning many), the time since the notional start of the incident, and any details about what was happening.

  • 12:00:00 (+0mins): Chris reported the incident
  • 12:07:00 (+7mins): Incident lead assigned
  • 12:10:00 (+10mins): Issue identified
  • 12:12:00 (+12mins): Incident closed

4) Contributors

We think it’s helpful to enumerate the contributors of an incident, where contributors are thought of not as ‘root causes’ but as a collection of things that had to be true for this incident to take place, or that contributed to its severity.

This can include technical contributors (e.g. the server’s disk filled up), human contributors (e.g. the on-call engineer did not respond the first time we paged them), and any other external contributors (e.g. We were simultaneously running a marketing event).

It’s useful to enumerate these items to fully explore the space of the problem, and avoid overly fixating on a singular cause.

5) Mitigators

Where contributors constitute the set of things that helped the incident occur, mitigators can be thought of as the opposite. More precisely, they’re anything that helped to reduce the overall impact of the event.

Like contributors, these can come in many guises: perhaps the incident happened during working hours, or the person who knows the most about the system at the heart of the incident happened to be on call.

It’s useful to explicitly call these out, as they often highlight positive capacities that are helpful to double down on, or socialize further.

6) Learnings and risks

This section is a little harder to capture, but it should answer two questions: what did we learn, and what broader risks did this incident point towards?

For example, you might have learned that one of your teammates was the only person who knew the details of the system that failed during this incident, and that might point at a more general “key person risk”. If other incidents point at similar risks, that’s useful learning for any organization.

7) Follow-up actions

Follow-up actions are here to convey what’s being done to reduce the likelihood and/or impact of similar events in future.

In the context of this document, we think it’s most useful to highlight any key actions that are being taken, rather than the full set. Ultimately, this will vary depending on the intended audience, but if the document is being written to be read and learnt from, this section should close the loop on what’s being done to mitigate the major themes that have been identified.

8) Optional: Post-mortem meeting notes

Many post-mortems can happen async but others may require a dedicated meeting to talk through what happened—especially incidents that were of higher severity. In cases like these, it’s a good idea to set up a meeting with the incident response team and use the post-mortem as your agenda.

While it may seem a bit redundant, having the space to talk through issues in real time ultimately encourages better collaboration and communication across teams in a way that’s nearly impossible to capture in an async document.

Export your post-mortems to Notion with incident.io


Did you know that, thanks to our Notion integration, you can export your post-mortems directly into the tool? While it seems trivial, eliminating the need to copy and paste your post-mortems into Notion can save you precious time.

Through this integration, all of your exported post-mortems will be tracked in a Notion database, giving you a high-level overview of your post-mortems. Important fields about your incident such as type, severity and incident lead are added as properties. And you can create specific database views for different teams or personas within your organization.

If you want to try it out, enable it in your integration settings, or check out our help page for more information.
