The three pillars of great incident response

Pillars at the Temple of Edfu, Egypt

There’s no one-size-fits-all incident response process. Depending on your organization’s shape and size, you’ll have different requirements and priorities.

But the same three pillars form the core of any good process, whether it’s for the largest e-commerce giant or a scrappy SaaS startup.

1. Clarity

You can’t fix a problem you don’t understand. An incident response team needs to have a clear and shared understanding of what the problem is, and the steps they are taking to resolve it.

To achieve clarity during an incident, it’s important to:

  • Have clear roles so everyone knows who is responsible for what. Context switching between resolving the issue and communicating with stakeholders is tiring and challenging: let everyone do their one job, and do it well.
  • Stay focussed on the problem at hand: don’t get distracted by unrelated issues that you discover (ticket them up to look at later).
  • Use actions so everyone knows what is in-flight: this avoids duplication of effort and helps people provide useful context at the right time.

2. Transparency

Default to transparency

Incidents should be public by default, and only made private when absolutely necessary.

Being transparent means that everyone knows what’s happening, as it is happening. That unlocks access to all the context and skills in the whole of your organization, rather than relying on the incident reporter to pull in all the people that they need. If someone sees something that might be related to the issue, they can let the incident team know instead of wasting valuable time investigating themselves.

Transparency builds trust, both with internal stakeholders and customers.

Transparency shatters the illusion of perfection, replacing it with a much more useful ‘things go wrong here, and we work hard to fix and learn from them’ attitude.

This can make you feel vulnerable, but that’s the only way to build trust with stakeholders. They’ll start to feel confident that you’ll inform and involve them where needed, and they’ll have more faith in your team’s ability to handle difficult situations.

Blameless culture

Transparency goes hand-in-hand with a blameless culture where mistakes are learning opportunities, not fireable offences.

Without this, people will hide their mistakes and won’t pull in the more experienced folks needed to help resolve the issue safely: in the worst case, small hiccups can turn into full-blown outages.

A blameless culture is enabled by humility. This kind of culture is often driven by senior people: if senior people never admit to their mistakes, junior people will be afraid to admit theirs. Similarly, it’s important that people ask for help if they feel out of their depth and need support.

Intentionally share information

Transparency isn’t just about making it possible for everyone to see the info they need, it’s about making it easy for them to see it.

Think actively about who needs to know, both inside and outside the organization. Communicate clearly and frequently, providing relevant context to keep people informed.

Keeping the context in one place makes it easier for people to understand what’s happening during the incident, and learn from it afterwards. It’s likely that what you learn in an incident on one team will be useful to many people across the organization.

3. Calm

When responding to incidents, we’re only human. Unfortunately, flooding your body with adrenaline doesn’t help you make good decisions, or collaborate well with others. Take a breath, grab a glass of water, and situate yourself.

Calm comes with trust

Many problems within organizations are rooted in a lack of trust. If you don’t trust the people you’re working with, you’ll be stressed that they might do the wrong thing, or even be doing nothing at all.

The same applies in reverse: if other people trust you and your team, then they’re more likely to give you space to do what needs doing without interrupting or asking for reassurance.

Calm comes with good tools

If you can easily gather information that you can verify and share, it’s easier to collaborate and solve problems. A good observability setup, as well as easy access to data, is key.

Ideally, your whole team will be familiar with these tools, and use them day-to-day, so the learning curve while something is going wrong isn’t too steep.

Calm comes with energy

Overworked and tired people make bad incident responders. Incident response needs to be distributed across the team; share the load.

Everyone should have an extra 10% contingency energy they can use when unexpected things happen, rather than already being in their overdraft. If someone’s involved in a tough incident, try taking other things off their plate to give them time to recover.

Calm comes with experience

Having someone who’s taken down production before, who’s had the ‘oh shit’ moment, is incredibly valuable.

Of course, it helps that they’re more likely to be able to diagnose and fix the issue. But the real value is that they know that it isn’t world-ending.

That handling incidents well can make customers trust you more, not less, and turn a negative into a positive. That you can admit mistakes, and you won’t lose your job.

Finally, always remember that calm is contagious, just like panic. If a team’s leaders are calm, it’ll percolate down to everyone around them.

Image credit: Jeremy Bezanger

Lisa Karlin Curtis
Technical Lead
