We’ve covered how to build an incident response process, from defining severities and assigning roles to establishing what constitutes an incident for your organization.
Here’s the secret, though: it’s only valuable if you practice it.
The more effort you put into structuring your process, the more disjointed things can become if your team isn’t trained to follow it. Fortunately, this is an easy fix, and one way to do it is with a tabletop exercise.
Whether you’re onboarding new responders or refreshing your seasoned team members, it’s essential to schedule time for team-based tabletop response drills.
To get started you'll want to choose an incident scenario that:
Here’s our recipe for running simple yet effective tabletop exercises to help your team get familiar with your incident response processes.
The best training has a nominated facilitator who prepares in advance, and is responsible for making the exercise successful. It’s their job to find a suitable scenario and to run the next part of the process. Someone should also be chosen as the scribe, who will take notes throughout the whole session.
Once everyone arrives for the training, whether in person or on a video call, the facilitator will explain the purpose of the meeting. It’ll go something like this:
Facilitator: In this session, we’re going to be running a tabletop exercise on responding to an incident. Our goal is to drill incident response, and each of you should engage as you normally would if this were really happening, vocalising your actions as you go.
We’re aiming to follow our incident process as closely as possible. Once we reach the end of the exercise, we’ll stop and review how we responded, in an attempt to improve our response.
When everyone’s ready, the facilitator will begin. You can use props such as screenshots to make the exercise more real, if you choose. But with or without, it’s time for your responders to engage:
Facilitator: This alert appears in #errors [Screenshot of alert: “Increased API error rate”]
Alice: I’m on-call, I’m going to raise an incident for this.
If your process is more manual, get started now. Create whatever docs or channels you’d normally use, and have the scribe note anything you forgot and which bits took a while. You may be surprised how long it takes someone to get ready, or what they have forgotten since their last incident!
Alice: I’ve opened an incident, and I’m taking the incident lead role. I’m inviting Bob to help in the response, as I think this is customer-facing.
Bob: I’ve joined the channel, and asked how I can help.
Alice: This looks like it’s going to have customer impact, and we’ll need to notify the business. I want to look into this problem but we’ll need someone on comms. Can you take this, Bob?
Bob: Sure! I’m going to take comms and write an incident update explaining customers might be impacted. Our automation just reminded me this type of incident will need us to update the status page, so I’m going to do that now.
You can see this is primarily focused on incident process, which is exactly what we wanted. It’s not about how we’re going to respond technically, and if you get too much into the detail, you’re probably not doing it right.
Incidents rarely go smoothly, and often you'll be faced with unexpected challenges along the way. The facilitator should have some plausible additions ready to step up the pressure.
Facilitator: We’ve got reports that customers are unable to access the dashboard
Alice: Ok, this is serious. I’m going to upgrade the incident severity to Major, and page the team responsible for the website: they might know what’s going on.
Growing the incident like this tests key gear changes, which is where incident process is designed to help. You can have participants stand in for whoever you need across the company, and even play a part in faking the response, perhaps jumping into the channel pretending to be customer support.
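If it helps the facilitator prepare, the scenario and its planned escalations can be written down ahead of time. The sketch below is one hypothetical way to do that in Python: the `Scenario` and `Inject` names, and the minute offsets, are assumptions made for illustration and are not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Inject:
    """A complication the facilitator introduces part-way through."""
    after_minutes: int  # illustrative timing, chosen by the facilitator
    message: str


@dataclass
class Scenario:
    """The opening alert plus the escalations prepared in advance."""
    opening_alert: str
    injects: list[Inject] = field(default_factory=list)

    def due(self, elapsed_minutes: int) -> list[str]:
        """Return the messages that should have been delivered by now, in order."""
        ordered = sorted(self.injects, key=lambda inject: inject.after_minutes)
        return [i.message for i in ordered if i.after_minutes <= elapsed_minutes]


# Example script based on the walkthrough above; offsets are made up.
scenario = Scenario(
    opening_alert="Increased API error rate",
    injects=[
        Inject(5, "We've got reports that customers are unable to access the dashboard"),
        Inject(10, "Customer support joins the channel asking for an update"),
    ],
)

for message in scenario.due(elapsed_minutes=12):
    print(f"Facilitator: {message}")
```

A plain list on paper works just as well. The point is only that the escalations are decided before the session, so the facilitator can concentrate on watching the response.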
It shouldn’t go on for too much longer, but try growing the incident until responders are forced to delegate and split efforts. The exercise ends when you feel you've covered all the key aspects of your response, and before you accidentally end up in technical rabbit holes!
This will normally take about 15-30 minutes, after which you can review the notes and examine your ‘incident timeline’, seeing where things went wrong, and where you might be able to improve.
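If the scribe records a rough time against each note, the slow steps in the timeline are easy to pick out. The following is a minimal sketch, assuming notes are kept as `(HH:MM, description)` pairs; the sample entries, the `slowest_gaps` helper, and the five-minute threshold are all hypothetical.

```python
from datetime import datetime

# Illustrative scribe notes; the times are invented for this example.
notes = [
    ("10:00", "Alert appears in #errors"),
    ("10:02", "Alice raises an incident and takes the lead role"),
    ("10:09", "Bob takes comms"),
    ("10:11", "Severity upgraded to Major"),
]


def slowest_gaps(notes, threshold_minutes=5):
    """Return (minutes, previous step, next step) for each gap at or above the threshold."""
    parsed = [(datetime.strptime(time, "%H:%M"), text) for time, text in notes]
    gaps = []
    # Compare each note with the one that follows it.
    for (prev_time, prev_text), (next_time, next_text) in zip(parsed, parsed[1:]):
        minutes = int((next_time - prev_time).total_seconds() // 60)
        if minutes >= threshold_minutes:
            gaps.append((minutes, prev_text, next_text))
    return gaps


for minutes, before, after in slowest_gaps(notes):
    print(f"{minutes} min between '{before}' and '{after}'")
```

Long gaps are not necessarily problems, but they are good places to start the review conversation.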
With the exercise wrapped up, here are a few example questions you might ask:
You’ll find issues at both the team level and the process level. They’re easy to spot when you focus strictly on response, and while they’re fresh in your mind you can tweak the process or create action points for them.
This is how you make sure people know their tools, can escalate appropriately, and are comfortable filling any of the incident roles they might be required to fill during a real incident. It’s low effort, can be really fun, and makes a huge difference when you respond for real.