There’s no one-size-fits-all incident response process. Depending on your organisation’s shape and size, you’ll have different requirements and priorities.
But the same three pillars form the core of any good process, whether it’s for the largest e-commerce giant or a scrappy SaaS startup.
You can’t fix a problem you don’t understand. An incident response team needs to have a clear and shared understanding of what the problem is, and the steps they are taking to resolve it.
To achieve clarity during an incident, it’s important to:
Incidents should be public by default, and only made private when absolutely necessary.
Being transparent means that everyone knows what’s happening, as it is happening. That unlocks access to all the context and skills in the whole of your organisation, rather than relying on the incident reporter to pull in all the people that they need. If someone sees something that might be related to the issue, they can let the incident team know instead of wasting valuable time investigating themselves.
Transparency builds trust, both with internal stakeholders and customers.
Transparency shatters the illusion of perfection, replacing it with a much more useful ‘things go wrong here, and we work hard to fix and learn from them’ attitude.
This can make you feel vulnerable, but that’s the only way to build trust with stakeholders. They’ll start to feel confident that you’ll inform and involve them where needed, and they’ll have more faith in your team’s ability to handle difficult situations.
Transparency goes hand-in-hand with a blameless culture where mistakes are learning opportunities, not fireable offences.
Without this, people will hide their mistakes and won’t pull in the more experienced folks needed to help resolve the issue safely: in the worst case, small hiccups can turn into full-blown outages.
A blameless culture is enabled by humility. This kind of culture is often driven by senior people: if senior people never admit to mistakes, junior people will be afraid to admit theirs. Similarly, it’s important that people ask for help if they feel out of their depth and need support.
Transparency isn’t just about making it possible for everyone to see the info they need; it’s about making it easy for them to see it.
Think actively about who needs to know, both inside and outside the organisation. Communicate clearly and frequently, providing relevant context to keep people informed.
Keeping the context in one place makes it easier for people to understand what’s happening during the incident, and learn from it afterwards. What one team learns in an incident is likely to be useful to many people across the organisation.
When responding to incidents, we’re only human. Unfortunately, flooding your body with adrenaline doesn’t help you make good decisions, or collaborate well with others. Take a breath, grab a glass of water, and situate yourself.
Many problems within organisations are rooted in a lack of trust. If you don’t trust the people you’re working with, you’ll be stressed that they might do the wrong thing, or even be doing nothing at all.
The same applies in reverse: if other people trust you and your team, then they’re more likely to give you space to do what needs doing without interrupting or asking for reassurance.
Ideally, your whole team will be familiar with these tools, and use them day-to-day, so the learning curve while something is going wrong isn’t too steep.
Overworked and tired people make bad incident responders. Incident response needs to be distributed across the team; share the load.
Everyone should have an extra 10% of contingency energy to draw on when unexpected things happen, rather than already being in their overdraft. If someone’s involved in a tough incident, try taking other things off their plate to give them time to recover.
Having someone who’s taken down production before, who’s had the ‘oh shit’ moment, is incredibly valuable.
Of course, it helps that they’re more likely to be able to diagnose and fix the issue. But the real value is that they know that it isn’t world-ending.
That handling incidents well can make customers trust you more, not less, and turn a negative into a positive. That you can admit mistakes, and you won’t lose your job.
Finally, always remember that calm is contagious, just like panic. If a team’s leaders are calm, it’ll percolate down to everyone around them.
Image credit: Jeremy Bezanger