How to build a strong incident response process

When building an incident response process, it’s easy to get overwhelmed by all the moving parts. Less is more: focus first on building solid foundations that you can develop over time. Here are three things we think form a key part of a strong process.

We’d recommend taking these one at a time as you introduce incident response across your organization.

Just being honest: we’re a startup selling incident management software. We believe that using our software will help build a good incident process; otherwise, we wouldn’t be doing it. But beyond that, we’ve also got a lot of experience building and participating in incident processes, and we think this advice is generally useful, regardless of whether you choose our product.

1. Define what an incident is

This is, in the words of Julie Andrews, a very good place to start.

Getting a common understanding of what an incident is (and isn’t) is the first step in bringing people into your incident response process.

An incident is anything unexpected that has (or might have) a negative consequence. If you’re interested in reading more, we’ve dedicated a whole blog post to this question.

Particularly when you’re getting started, the best way to embed a process into your organization is to use it. A lot. This also helps everyone learn the process and get better at incident response overall, so that when something really bad happens, the response runs like a well-oiled machine.

To this end, you want a broad and inclusive definition of an incident:

Declaring an incident should be easy, and everyone should be able to do it without fear of repercussions. Avoid long questionnaires, which add friction to the process and disincentivise declaring incidents at all.
Don’t attach blame (or other negative consequences) to declaring an incident. This includes not using ‘number of incidents’ as a metric for team health or performance. Here’s a handy guide we wrote about blameless post-mortem meetings.
Your lowest severity should represent something that happens regularly and has a low impact, so people feel comfortable declaring incidents (and don’t feel like the boy who cried wolf).
Don’t limit the definition to engineering: getting your whole org to think about anything unexpected in the same way means you can collaborate better (see section 2).
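
To make the points above concrete, here is a minimal Python sketch of what a low-friction declaration could look like. Everything in it is an assumption made for illustration: the `Severity` levels, the `Incident` fields and the `declare_incident` helper are hypothetical names, not part of any particular tool. The idea is simply that a short title is the only required input, and the severity defaults to the lowest level.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; the lowest level covers everyday, low-impact events."""
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class Incident:
    """Illustrative incident record with as few required fields as possible."""
    title: str
    severity: Severity = Severity.MINOR
    declared_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def declare_incident(title: str, severity: Severity = Severity.MINOR) -> Incident:
    """Declare an incident from a short title alone.

    There is no questionnaire and no team restriction: the only check is
    that the title is not empty. Severity can be raised later if needed.
    """
    cleaned = title.strip()
    if not cleaned:
        raise ValueError("an incident needs a short title")
    return Incident(title=cleaned, severity=severity)


# Usage: anyone, in any team, can declare with one line.
incident = declare_incident("Invoices for March were sent twice")
print(incident.title, incident.severity.name)
```

Defaulting to the lowest severity is a design choice in this sketch: it keeps the cost of declaring low, and the severity can always be adjusted once more is known.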

2. Keep everyone updated

Transparency by default is a really important value to bake into your incident process.

First up: make sure it’s clear who is responsible for communicating. Whether that’s the incident lead or another chosen individual, making someone explicitly accountable is the best way to keep the updates coming.

Make it really easy to tell stakeholders what’s going on, and use the tool that makes the information easiest to consume (whether that’s email, Slack or something else entirely).

Use a predictable format for the updates, as this makes them easier to parse and scan for a busy stakeholder flying through their notifications.
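
As a rough illustration of a predictable format, here is a minimal Python sketch. The fields (severity, status, summary, next update) and the `IncidentUpdate` and `format_update` names are our own assumptions for this example rather than a prescribed standard, so adapt them to whatever your stakeholders actually need.

```python
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    """One stakeholder update. All field names are illustrative assumptions."""
    title: str
    severity: str                 # e.g. "minor", "major", "critical"
    status: str                   # e.g. "investigating", "fixing", "resolved"
    summary: str                  # what is currently known, in plain language
    next_update_in_minutes: int   # when readers can expect to hear more


def format_update(update: IncidentUpdate) -> str:
    """Render every update with the same fields in the same order."""
    lines = [
        f"Incident: {update.title}",
        f"Severity: {update.severity}",
        f"Status: {update.status}",
        f"Summary: {update.summary}",
        f"Next update: in {update.next_update_in_minutes} minutes",
    ]
    return "\n".join(lines)


# Usage: the same function produces the text for email, Slack or anything else.
example = IncidentUpdate(
    title="Checkout errors",
    severity="major",
    status="investigating",
    summary="Some card payments are failing. We are looking into it.",
    next_update_in_minutes=30,
)
print(format_update(example))
```

Because the fields always appear in the same order, a reader who only cares about the status or the time of the next update can find it at a glance.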

These updates also advertise your incident process and normalise the fact that incidents happen. Ideally, someone’s first interaction with the incident process at your org should be as a consumer of an update - not being parachuted into the middle of something.

Get comfortable admitting that things go wrong, both internally and (to some extent) externally with customers. This builds trust and enables people to adapt their behaviour to mitigate the impact on their side (e.g. if a customer knows you are having an outage, they’ll move to another task and come back tomorrow instead of furiously refreshing the page).

You should also check out five steps to better customer comms to learn more.

3. Remember: it’s not over until it’s over

Your incident process shouldn’t end once the problem is resolved. To get value from your incidents, you want to be using them to learn and improve your day-to-day operations. There are often follow-up actions that need to be put on someone’s backlog or wider problems that should be considered and prioritised.

Post-mortem documents and post-incident meetings are a great way to extract learnings from an incident. They help encourage reflection and often bring up related concerns about the way the team is working.

However, there’s a Goldilocks zone. With too little reflection you miss the learnings, but if the post-incident process feels painful, people will stop declaring incidents at all.

Clearly communicate the value of your post-incident process, making sure it doesn’t feel like a box-ticking exercise. Don’t make it harder than it needs to be: try to avoid repetitive manual work (e.g. by using a tool or template).

Give people autonomy to decide what’s appropriate for the incident they’ve experienced. Sometimes a simple update in the incident channel explaining what happened and why is completely sufficient. Sometimes you’ll want to run a full-blown cross-team workshop to understand what happened and why the response didn’t go as smoothly as you’d hoped.

Summary

When you’re first building an incident response process, focus on a few key things:

Get a shared understanding of what an incident is, and don’t make it too specific
Make it easy to provide updates to both internal and external stakeholders
Reflect on incidents to gather learnings, but make sure the process is adding value

Once you’ve got these nailed, you can start layering more stuff on top. But don’t try to run before you can walk.