Whose fault was it anyway? On blameless post-mortems

incident.io

No one wants to be on the receiving end of the blame game—especially in the wake of a major incident.

You know you were the one who made the final change that led to the incident, and fortunately it wasn't anything too serious. Still, the weight of knowing you caused something bad should be enough, right?

Unfortunately, sometimes fingers get pointed, your name gets called, and suddenly, everyone knows that you’re the person who created more work for everyone. While the reality of this isn’t so bleak, and it’s rare to find people staring you down during a post-mortem meeting, feeling like a scapegoat isn’t great for morale.

This is where the idea of a blameless post-mortem comes into play. It’s the idea that, when incidents happen, it’s best to focus less on who caused them and more on the many contributing factors that allowed them to happen.

Sounds like a good idea, right? In reality, the answer isn’t so black and white.

If you’re building a culture of continuous improvement, avoiding attribution to individuals can actually be a roadblock to learning from your incidents. Plus, there are plenty of things that you can do to make sure that, when folks are called out by name, it’s done in the service of better learning, not blame.

What is a blameless post-mortem?

The idea behind blameless post-mortems is that incidents should be free of finger-pointing and scapegoating. Instead of focusing on the "who" behind an incident, blameless culture calls for focusing on the many causes that contributed to it.

And, as John Allspaw of Etsy put it, "...without fear of punishment or retribution."

By doing this, the thinking is that you eliminate the anxiety around post-incident processes. When responders don't have to worry about being singled out for an incident, it creates a healthier environment where folks aren't afraid to make mistakes, and when they do make them, they’re not afraid to share what happened.

A quick note on root cause analysis

You'll often hear the term "root cause analysis" when discussing post-mortems. While it's a method that aims to zero in on the specific triggers of an incident, it's a bit of an old-school approach, primarily because looking for root causes tends to lead to shallower incident analysis.

In the context of blameless post-mortems, you might end up considering the person pushing the change to production as a “root” cause, but looking deeper will reveal many contributors that allowed it to happen.

In short, it’s not a “root” at all, and what could have been blame directed at an individual might be better considered a problem with a process that allowed the issue to take place.

In today's landscape, we think it makes sense to look at contributing factors, some of which will be things people did or didn’t do, and some of which will be environmental, process, or systems related.

So, there's really no need to separate the human factors from the systemic ones; they're better understood together.

The fine line between zero blame and zero accountability

The concept of blameless post-mortems was created by well-meaning engineers and SREs who have dealt with the anxieties of being the focal point of post-incident meetings. And while we aren’t here to say that this approach is wrong (far from it), we acknowledge that there’s a fine line to walk. Let’s look at two examples.

The first is an organization that’s developed a strong culture of blame. While this organization is unlikely to actually exist (we hope), we’ll use it for illustrative purposes. The other is an organization that takes blamelessness so far that it becomes detrimental.

Organization A uses post-incident meetings as a forum to figure out who caused an incident and interrogate them. Their top concern is weeding out “error-prone engineers,” not learning from incidents. As a result of this approach, this organization never gets any meaningful insights from these post-mortems other than a list of folks to blame for past incidents.

Now, engineers are afraid to take risks and move quickly since they're so worried about being singled out for any issues they cause. Not ideal.

Organization B takes blamelessness to the extreme. No names ever get mentioned, leaving the causes of incidents to be a bit of a black box. The incident debrief is vague and meandering, and it’s hard to learn who did what or why it made sense to do it.

We’re calling for neither of these. We like to go with a different approach: accountable learning post-mortems.

Forget blame—let’s learn through accountability instead

Here at incident.io, we take a different approach from a wholly blameless culture.

We never single anyone out for the sake of it. But sometimes, it’s necessary to reference someone specific in order to better contextualize and understand the contributing factors of an incident.

Here’s what we mean.

Let’s say a product engineer pressed a button that triggered an incident. In a blameless world, this person would never be identified, which might lead to a shallower, reactive approach of just fixing the button.

But our approach would be this: during the post-mortem, we’d figure out who pressed the button, reference that person in either a post-mortem document or meeting, and then move on to discuss contributing factors. By doing this, we feel that we can better understand what may have led this specific person to trigger this incident in the first place. We can then ask questions like:

  • Why was this button even available in the first place?
  • Should this person and folks with similar tenure and experience have access to triggers like these?
  • Was this person adequately trained on our infrastructure and processes?

With a completely blameless approach in place, you may be able to answer the first question and gather surface-level insights about the incident. But the other two would be out of the question if you never identify the “who.”

The goal isn’t to assign blame. It’s to understand who triggered an incident to better contextualize why they were able to in the first place.

This way, we’re able to both learn more from our incidents and introduce better processes and stopgaps to prevent similar ones in the future. In the end, our goal is to create better business outcomes by getting the most out of our post-incident processes.

By knowing who caused an incident, we can better diagnose contributing factors and set everyone up for success.

No blame, just accountability in the service of deeper learning from incidents.

