Last week, we spent some time talking to Gergely Orosz about our thoughts on what happens when an incident is over, and you're looking back on how things went.
If you haven't read it already, grab a coffee, get comfortable, and read Gergely's full post Postmortem Best Practices here.
But before you do that, here's some bonus material on some of our points.
I'm sure we can all recall a time when we sat in an incident debrief, walking through the timeline, and reached the critical point where 'someone' pushed the button that triggered the cascade of events that led to the incident. We all talk in hypotheticals about how 'someone' should have read the docs, or consulted someone else, and how if that had happened we wouldn't be here at all.
This is unequivocally not how good debriefs happen. Being blameless doesn't mean we need to tiptoe around the problem and avoid using names. It's more nuanced than that.
It's about the starting point for an investigation being based on the premise that everyone arrived at work on that day to do all the right things.
It's about starting from an assumption of good intent and sharing personal accounts of the events that unfolded. They might well have been the person who pushed the button, but why did that make sense to them in this instance? How many times has the button been pushed where everything worked exactly as intended?
If we understand the specific motivations of the folks who were there when this was happening, we stand to learn the most about the situation, and ultimately turn that into actionable follow-ups or knowledge that can be shared.
If you've spent time in incident debriefs, especially big ones with senior leaders, you'll likely be familiar with questions like "how are we going to prevent incidents like this from happening in future?". Cue a room full of engineers rolling their eyes.
"How do we prevent this from ever happening again?"
There is a class of incident where we can reasonably expect the likelihood of recurrence to be almost zero. If a disk on a server fills and brings down our service, we can add both technical controls to prevent this happening, and detective alerts that'll warn us if we're close to having a similar issue. It's going to be hard (though not impossible!) for that same incident to happen again, and everyone walks away from the debrief happy.
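To make the "detective alert" idea concrete, here's a minimal sketch in Go of the kind of check you might run on a schedule to warn before a disk fills. The `warnThreshold` value and the use of `syscall.Statfs` directly (rather than a metrics system like Prometheus, which is what you'd more likely use in production) are assumptions for illustration, not a description of any particular setup:

```go
package main

import (
	"fmt"
	"syscall"
)

// diskUsagePercent returns how full the filesystem containing path is,
// as a percentage, using the statfs system call (Linux/macOS).
func diskUsagePercent(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := st.Blocks * uint64(st.Bsize)
	avail := st.Bavail * uint64(st.Bsize)
	if total == 0 {
		return 0, fmt.Errorf("no blocks reported for %s", path)
	}
	return float64(total-avail) / float64(total) * 100, nil
}

func main() {
	// Hypothetical threshold: warn well before the disk is actually full,
	// so there's time to act before an incident starts.
	const warnThreshold = 80.0

	pct, err := diskUsagePercent("/")
	if err != nil {
		panic(err)
	}
	if pct > warnThreshold {
		fmt.Printf("warning: disk is %.1f%% full\n", pct)
	} else {
		fmt.Printf("disk ok: %.1f%% full\n", pct)
	}
}
```

In practice you'd export this as a metric and alert on it, but the point stands: the control is cheap, specific, and makes a recurrence of that exact incident unlikely.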
Now take the scenario where a feature of a system you didn't know about behaved in a way you didn't expect, and put you in a situation you couldn't foresee. How do you prevent that scenario from happening again? By virtue of fixing the issue during the incident, we learned something we didn't know, and we can put some controls in place to reduce the likelihood of that specific thing happening again. But what about the hundred other features of that system we don't know about? Do we prioritise a deep dive on the system to understand everything? And once we've done that, how many other systems do we need to do the same on?
The point here isn't that we should throw our hands in the air and give up. Everyone wants to drive towards better service, fewer incidents and happier customers, but you need to get comfortable with the fact that you can't prevent everything. Trying to do so will likely tie you in knots on low-value work, with little to no guarantee that it'll actually pay off.
Ultimately, by fixing the issue (ideally using incident.io, and out in the open) you've already done the best thing you can to stop this happening again; you've learned something.
It's easy to get carried away in a debrief and generate 37 action items to tackle that 5 minutes of downtime you experienced. Incidents shine a light on a particular problem, and combined with recency bias (i.e. this is most important because it's fresh in my memory), it's easy to get lured into prioritising a bunch of work that really shouldn't be done.
The sad reality is that there's always more that can be done in pretty much every corner of everything we build. But it's important to approach things with perspective, and avoid letting the pressure and spotlight on this incident drive you to commit to arguably low-value work.
The best solution we've found is to introduce a mandatory time gap – "soak time" – to let those ideas percolate, and the more rational part of your brain figure out whether they really are the best use of your time.
Perhaps one of my biggest gripes in the post incident flow is organisations that value the written incident artifact over all else. As the famous Eisenhower quote goes, "Plans are nothing; planning is everything", and the same is mostly true of incidents.
Plans are nothing; planning is everything.
The postmortem/debrief artifact isn't quite 'nothing', but in our experience these reports are typically not written to be read or to convey knowledge; instead, they're there to tick a box. The folks in Risk and Compliance need to know that we've asked the five whys and written down exact times for everything that happened, because that's how the risks are controlled.
Personal experiences aside (😅) this is actually pretty common, and if you find yourself here it's useful to remember that – documents aside – the process of running debriefs is itself a perfectly effective way to get your money's worth out of incidents.