When critical incidents happen (which they inevitably do) and you're in the middle of trying to figure out the best thing to do, it can feel comforting to know that you've got a pre-prepared list of instructions to follow, commonly known as an "incident response plan":
An incident response plan is a document that outlines an organization's procedures, steps, and responsibilities of its incident response program.
In theory this sounds quite simple, and a typical flow you might envision is:
It might be tempting to think that the hardest part of running incidents is finding or writing a checklist that exhausts all of the things that could go wrong. Once written, you'll repeatedly refer to it in future incidents, following the detailed instructions whilst you're trying to do everything else that's required during an incident.
A quick Google search further validates this view. We spent some time looking for incident response plans online and found a wealth of detailed lists (focused on cyber security/data breaches) titled:
This is usually coupled with a variety of multi-page PDF templates (some as long as 30 pages) that all guarantee that theirs is "comprehensive" and that following it will "ensure you're prepared". These cover a variety of topics, including:
These long documents will usually have numerous placeholders where you can "insert your business name here", so you can feel like you've personalised it to your company's needs.
Whilst the detail and length of these documents might provide some readers with a feeling of rigour, there are a few reasons why we believe this approach is flawed:
To put it simply (and borrowing from a commonly used Mike Tyson quote): "Everyone has a plan until they get punched in the face". This is why we believe that overly detailed plans and long "comprehensive" documents are not as useful as they might initially seem.
So what should you do instead? Here are some practical things we recommend:
All incidents are unique, and by definition a break from expected or "normal" work. You'll often find yourself dealing with unfamiliar situations and needing to improvise under pressure.
But underneath the complexity and unfamiliarity of every incident is a process. The same process. A repeatable process. We recommend outlining the steps that are common to every incident, which we've written up in our Incident Management Guide chapter on response.
We don't believe these steps should be written down and filed away in the hope that others will find (and refer to) them when an incident is declared. Whatever tools you use to deal with incidents, these steps should be encoded into them so they help (rather than hinder) you during an incident.
Your mental bandwidth should be occupied with the incident at hand rather than trying to remember whether you remembered step five of your incident response plan. Next time you're dealing with an incident, your tool should prompt and (ideally) run the step for you.
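As a rough illustration of what "encoding the steps into your tools" can look like, here's a minimal Python sketch. Everything in it is an assumption made for illustration: the `Step` and `Incident` classes, the three step names, and the two automation functions are hypothetical and aren't taken from incident.io or any other tool's API. The idea is simply that each step is data, automated steps are run for you, and manual steps become prompts rather than something you have to remember.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One repeatable step in the response process (names are illustrative)."""
    name: str
    # Optional automation; when absent, the tool only prompts the responder.
    action: Optional[Callable[[dict], str]] = None


@dataclass
class Incident:
    title: str
    context: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def assign_lead(context: dict) -> str:
    # Hypothetical automation: make whoever declared the incident the lead.
    context["lead"] = context.get("declared_by", "unassigned")
    return f"lead set to {context['lead']}"


def open_channel(context: dict) -> str:
    # Hypothetical automation: derive a channel name; no chat API is called.
    context["channel"] = "#inc-" + context["slug"]
    return f"channel {context['channel']} reserved"


# The process itself, written once and reused for every incident.
STEPS = [
    Step("Assign an incident lead", assign_lead),
    Step("Open a dedicated channel", open_channel),
    Step("Post an update for stakeholders"),  # manual: prompt only
]


def run_steps(incident: Incident, steps: list) -> list:
    """Run automated steps and record a prompt for the manual ones."""
    for step in steps:
        if step.action is not None:
            result = step.action(incident.context)
            incident.log.append(f"DONE   {step.name}: {result}")
        else:
            incident.log.append(f"PROMPT {step.name}")
    return incident.log


if __name__ == "__main__":
    inc = Incident(
        "Checkout errors",
        {"declared_by": "sam", "slug": "checkout-errors"},
    )
    for line in run_steps(inc, STEPS):
        print(line)
```

Keeping the steps as a plain list means the process lives in one place and can be changed without touching the code that runs it. A real tool would replace the hypothetical functions with calls into your chat, paging, or status page systems.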
💡 If you use incident.io: use a combination of custom fields to encode the logic that matters to your company, and workflows to automate your incident response process.
Organizations generally set their threshold for incidents high, where only the most severe events are called incidents. We believe smaller incidents are extremely valuable, and there's a lot of upside from lowering your threshold for an incident.
Smaller incidents are a great way to learn about the failure cases of systems and provide an opportunity for teams to practice how they'd respond to larger issues.
ℹ️ Our definition of an incident is anything that takes you away from planned work with a degree of urgency.
Additionally, many organizations view incidents as solely an engineering concern. Our experience is the polar opposite. Incidents often start in product/engineering, but they usually require people from around the organization to form a temporary team to collaborate, communicate and solve a problem.
Imagine a significant outage at a payments fintech like Stripe. The issue might have started in engineering, but it's not long before others need to get involved. Customer support and public relations need to start communicating publicly as soon as possible. Engineers begin discussing potential resolutions, including a rollback. Legal need to get involved to understand any potential contractual implications. Compliance need to get involved to ensure that they're following the regulator's guidance. An executive is pulled in to make the final call. Responding is a whole-organization effort.
Ensuring non-engineering teams are familiar with the process of running an incident means you're well placed the next time you have a time-critical incident that requires input from non-technical teams. It also has the added benefit of allowing those teams to declare incidents themselves, potentially saving you critical time in future incidents.
💡 If you use incident.io: type /inc or /incident from any channel in Slack to pop up the incident creation form; alternatively, click the 3 dots on a Slack message to "create an incident".
Finally, one of the many reasons we don't believe overly detailed documentation is useful is that it can give you a false sense of security. Writing a perfectly structured, neat document makes no difference to you, your team or your company when a critical incident happens. In fact, the more effort you've put into structuring your process, the more fractured you risk becoming if your teams aren't trained to follow that process.
Thankfully, it's easy to fix this by setting aside dedicated time for your teams to practice. Read our "practice or it doesn't count" chapter in our practical guide to incident management for a detailed breakdown of how we'd recommend you practice incidents with your team.
To conclude, if you want to ensure you're well prepared for your next critical incident, make sure you have more than a neatly written document or list of steps. Build this logic into your incident management tools, lower the threshold for what you classify as an incident, and practice with your team as much as possible.