When critical incidents happen — which they inevitably do 😅 — and you’re in the middle of trying to figure out what the best thing to do is, it can feel comforting to know that you’ve got a pre-prepared list of instructions to follow, commonly known as an “incident response plan”:
An incident response plan is a document that outlines an organization's procedures, steps, and responsibilities of its incident response program
In theory this sounds quite simple, and a typical flow you might envision is:
It might be tempting to think that the hardest part of running incidents is finding or writing a checklist that exhausts all of the things that could go wrong. Once written you’ll repeatedly refer to it in future incidents, following the detailed instructions whilst you’re trying to do everything else that’s required during an incident.
A quick Google search further validates this view. We spent some time looking for incident response plans online and found a wealth of detailed lists (focused on cyber security/data breaches) titled:
This is usually coupled with a variety of multi-page (some as long as 30 pages 😅) PDF templates, that all guarantee that theirs is “comprehensive” and that following it will “ensure you’re prepared”. These cover a variety of topics, including:
These long documents will usually have numerous placeholders where you can “insert your business name here”, so you can feel like you’ve personalised it to your company’s needs.
Whilst the detail and length of these documents might provide some readers with a feeling of rigour, there’s a few reasons why we believe this approach is flawed:
To put it simply (and borrowing from a commonly used Mike Tyson quote): “Everyone has a plan until they get punched in the face”. It’s because of this we believe that overly detailed plans and long “comprehensive” documents are not as useful as they might initially seem.
So what should you do instead? - here’s some practical things we recommend:
All incidents are unique, and by definition a break from expected or ‘normal’ work. You’ll often find yourself dealing with unfamiliar situations and needing to improvise under pressure.
But, underneath the complexity and unfamiliarity of every incident, is a process. The same process. A repeatable process. We recommend outlining the common steps between incidents which we’ve written in our Incident Management Guide chapter on response.
We don’t believe these steps should be written and documented away, in the hope that others will find (and refer to) it when an incident is declared. Whatever tools you use to deal with incidents, these steps should be encoded into them so they help (rather than hinder) you during an incident.
Your mental bandwidth should be occupied with the incident at hand rather than trying to remember whether you remembered step five of your incident response plan. Next time you’re dealing with an incident, your tool should prompt and (ideally) run the step for you.
Organisations generally set their threshold for incidents high, where only the most severe events are called incidents. We believe smaller incidents are extremely valuable, and there's a lot of upside from lowering your threshold for an incident.
Smaller incidents are a great way to learn about the failure cases of systems and provide an opportunity for teams to practice how they’d respond to larger issues.
Additionally, many organisations view incidents as a solely engineering concern. Our experience is the polar opposite. Incidents often start in product/engineering, but they usually require people from around the organisation to form a temporary team to collaborate, communicate and solve a problem.
Imagine a significant outage at a payments fintech like Stripe. The source of of the issue might have started in engineering, but it’s not long before others need to get involved. Customer support and public relations need to start communicating publicly as soon as possible. Engineers begin discussing potential resolutions, including a rollback. Legal need to get involved to understand any potential contractual implications. Compliance need to get involved to ensure that they’re following the regulator’s guidance. An executive is pulled in to make the final call. Responding is a whole-organisation effort.
Ensuring non-engineering teams are familiar with the process of running an incident means you’re well placed next time you have a time critical incident that requires input from non-technical teams; this also has an added benefit of allowing those teams to declare incidents themselves - potentially saving you critical time in future incidents.
Finally, one of the many reasons we don’t believe having overly detailed documentation is useful - is that it can provide you with a false sense of security. Writing a perfectly structured, neat document makes no difference to you, your team or your company when a critical incident happens. In fact, the more effort you’ve put into structuring your process, the more fractured you risk becoming if your teams aren’t trained to follow that process.
Thankfully, it’s easy to fix this by setting aside dedicated time for your teams to practice. Read our "practice or it doesn’t count" chapter in our practical guide to incident management for a detailed breakdown of how we’d recommend you to practice incidents with your team.
To conclude, if you want to ensure you’re well prepared for your next critical incident; make sure you have more than a neatly written document/list of steps written. Build this logic into your incident management tools, lower the severity of what you classify as an incident and practice with your team as much as possible.
Enter your details to receive our monthly newsletter, filled with incident related insights to help you in your day-to-day!