Small confession: we currently use the term 'post-mortem' in incident.io despite preferring the term 'incident debrief'. Unless you have particularly serious incidents, the link to death here really isn’t helping anyone. However, we're optimising for familiarity, so we're sticking to the term 'post-mortem' here.
Ask any engineer and they’ll tell you that a post-mortem is a positive thing (despite the scary name). Being able to reflect on an incident helps us learn from our mistakes and do better next time. The return only grows when future engineers and decision makers can access the record of events.
However, one does not simply follow a post-mortem guide and reap the benefits; post-mortems are all too easily executed badly.
Most obviously, you might not need a post-mortem at all. It’s common to skip post-mortems for low-severity incidents, and you may want to apply the same rule to one-in-a-million events or those that are out of your control (e.g. a provider is down). The energy invested in preventing a repeat can outweigh the pain incurred by the incident itself.
Secondly, compiling information can take a lot of human time. I’ve seen engineers spend hours trying to find the exact timestamps, or pore over the best way to describe what happened. There is pressure to fill in every field in a post-mortem template, otherwise it won’t be signed off as “done”.
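As a rough illustration of how much of that timestamp hunting could be scripted, here is a minimal sketch that merges events from several sources into one ordered timeline. The `Event` type, its fields, and the `build_timeline` helper are hypothetical names invented for this example; they are not part of any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """One timestamped thing that happened during an incident (hypothetical shape)."""
    at: datetime   # timezone-aware timestamp
    source: str    # where the event came from, e.g. "pager" or "deploy"
    text: str      # short human-readable description


def build_timeline(events):
    """Sort events chronologically and render each one as a single UTC line."""
    ordered = sorted(events, key=lambda e: e.at)
    return [
        f"{e.at.astimezone(timezone.utc):%H:%M} UTC [{e.source}] {e.text}"
        for e in ordered
    ]


# Example data, made up for illustration only.
sample_events = [
    Event(datetime(2023, 1, 5, 10, 12, tzinfo=timezone.utc), "pager", "Alert fired"),
    Event(datetime(2023, 1, 5, 10, 5, tzinfo=timezone.utc), "deploy", "Release shipped"),
]
timeline = build_timeline(sample_events)
```

The point of the sketch is only that ordering and formatting are mechanical work; deciding which events matter is still a job for the people involved.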
With proper tooling, automation can compile the important information, freeing up a responder’s time to think about the important questions.
To get the most out of a post-mortem, it’s important to establish the level at which you reflect on an incident. The classic gotcha: contributors assume that the post-mortem is a place to work out how to fix X so that it doesn’t happen again. But that is the lowest level of reflection. You could also look at how to prevent this whole class of incidents, or step back and ask whether your existing solution needs a complete rethink. There’s no point fixing a leaky sink in a burning building. Your post-mortem should help you discover what the problem really is.
💭 Since we're on the subject of post-mortems, we actually put together our very own post-mortem template for you to use as inspiration for your own!
Post-mortems are meant to be blameless, meaning they focus on how a mistake was made rather than who made it. But they can easily get too retrospective, dwelling on what could have been had a decision been made differently. If someone has a bee in their bonnet about how a particular service was built, and that service went down, you’ll likely end up with a rant. Post-mortem discussions need steering so that they look forward.
A post-mortem normally results in action points. This is great, but where are these tasks meant to sit alongside all the other tickets for this sprint? How do we prioritise them? I’ve seen entire lists of “Action points” lie dormant in post-mortem documents whilst everyone tries to recover from the incident and pick up everything else they were meant to get done. Just like the output of any planning meeting, action points should be written up in an issue tracker like Linear or Jira, and ongoing work should be re-prioritised if need be.
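To make that hand-off concrete, here is a minimal sketch that turns a list of action points into ticket-shaped dictionaries, most urgent first. Everything in it is an assumption for illustration: the `ActionPoint` fields, the priority scale, and the payload keys are hypothetical and do not match the actual Linear or Jira API schemas, which you would need to map onto separately.

```python
from dataclasses import dataclass


@dataclass
class ActionPoint:
    """A follow-up task agreed in a post-mortem (hypothetical shape)."""
    title: str
    owner: str
    priority: int  # assumed scale: 1 is most urgent


def to_ticket_payloads(incident_ref, action_points):
    """Build generic ticket payloads, ordered by priority.

    The keys below are illustrative only, not a real tracker's schema.
    """
    payloads = []
    for point in sorted(action_points, key=lambda p: p.priority):
        payloads.append({
            "title": f"[{incident_ref}] {point.title}",
            "assignee": point.owner,
            "priority": point.priority,
            "labels": ["post-mortem"],
        })
    return payloads


# Example data, made up for illustration only.
sample_points = [
    ActionPoint("Add alert on queue depth", "sam", 2),
    ActionPoint("Fix retry backoff", "alex", 1),
]
tickets = to_ticket_payloads("INC-EXAMPLE", sample_points)
```

Tagging each ticket with the incident reference keeps the link back to the post-mortem once the work is sitting in the sprint alongside everything else.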
Given the negative tone of most of this, I’ll have to round off by reiterating that I do find post-mortems incredibly useful. When the laborious bits are automated, the discussion is steered, and action points are seamlessly plugged into a team’s roadmap, what you have left is a brainstorming session on a tricky problem. And that’ll take any engineer’s fancy.