We’ve spoken at length about how to build an incident response process, from picking your severities and deciding your roles to defining what an incident means for your organisation.
Here’s a secret, though: it’s not worth much unless you practice it.
In fact, the more effort you’ve put into structuring your process, the more fractured your response risks becoming if your teams aren’t trained to follow it.
Thankfully, it’s easy to fix this.
If you’re looking to onboard new responders, or refresh your grizzled veterans, it’s important to schedule time to drill your response as a team.
It’s crucial that you’re clear on the goals of training, which normally fall into two categories:
- Incident process and collaboration
- Technical response
If you can budget an entire day, this is what we’d suggest.
Tabletop exercise (10-11:30am) #
If you can only afford to address one of these areas, pick incident process and collaboration. Not only is it quicker to train, but it’s rare to get time focused entirely on your process, whereas every real incident forces you to practice your technical response anyway.
So how do you practice responding to an incident without a real issue, and in such a short period of time?
First, you decide on an incident scenario, which ideally should:
- Be realistic, staying close to the type of incident you normally encounter
- Start small, where the ideal response wouldn’t occupy all of your response team
- Grow bigger, eventually outgrowing the team drilling this exercise
The best training has a nominated facilitator who prepares in advance and is responsible for making the exercise a success: it’s their job to find a suitable scenario and to run the session that follows:
Once everyone arrives for the training, whether in person or on a video call, the facilitator will explain the purpose of the meeting. It’ll go something like this:
In this session, we’re going to be running a tabletop exercise on responding to an incident.
Our goal is to drill incident response, and each of you should engage just as you would if this were really happening, vocalising your actions as you go.
We’re aiming to follow our incident process as closely as possible. Once we reach the end of the exercise, we’ll stop and review how we responded, in an attempt to improve our response.
Someone is then chosen as scribe, who will take notes. When everyone’s ready, the facilitator will begin:
Facilitator: This alert appears in #errors
[Screenshot of alert: “Increased API error rate”]
You can use props such as screenshots to make the exercise feel more real. With or without them, it’s time for your responders to engage:
Alice: I’m on-call, I’m going to raise an incident for this.
This is the part of the response you can actually do:
If your process is more manual, get started now. Create whatever docs or channels you’d normally use, and have the scribe note anything that gets forgotten or which steps take a while. You may be surprised how long it takes someone to get ready, or what they’ve forgotten since their last incident!
If you’re using a product like incident.io, you can open a test incident (/inc test), which will run all your normal automation and create a Slack channel just as you’d get in a real incident.
Alice: I’ve opened an incident, and I’m taking the incident lead role. I’m inviting Bob to help in the response, as I think this is customer-facing.
Bob: I’ve joined the channel, and asked how I can help.
Alice (lead): This looks like it’s going to have customer impact, and we’ll need to notify the business. I want to look into this problem, but we’ll need someone on comms. Can you take this, Bob?
Bob: Sure! I’m going to take comms and write an incident update explaining customers might be impacted. Our automation just reminded me this type of incident will need us to update the status page, so I’m going to do that now.
You can see this is primarily focused on incident process, which is exactly what we wanted. It’s not about how we’re going to respond technically, and if you get too much into the detail, you’re probably not doing it right.
Now the facilitator steps it up:
Facilitator: We’ve got reports that customers are unable to access the dashboard
Alice: Ok, this is serious. I’m going to upgrade the incident severity to Major, and page the team responsible for the website: they might know what’s going on.
Growing the incident like this tests key gear changes, which is where incident process is designed to help. You can have participants stand in for whoever you need across the company, and even play a part in faking the response, perhaps jumping into the channel pretending to be customer support.
It shouldn’t go on for too much longer, but try growing the incident until responders are forced to delegate and split their efforts. Call it quits before the scenario becomes forced or unrealistic, while people are still properly engaged.
This will normally take about 15-30 minutes, after which you can review the notes and examine your ‘incident timeline’, seeing where things went wrong.
Questions you can ask:
- Did everyone know what they were responsible for?
- Were we comfortable with the incident roles?
- Was anything forgotten, like contacting key stakeholders?
- How do we feel about our incident updates? Were they clear? Could we benefit from templates here?
You’ll find issues at both the team level and the process level. They’re much easier to spot when you focus strictly on response, and while they’re fresh in everyone’s minds you can tweak the process or create action points.
At this point, you’re halfway into your 1 hour 30 minute slot, with ample time to switch scenarios and give it another shot.
This is how you make sure people know their tools, can escalate appropriately, and are comfortable filling any of the incident roles they might be required to during a real incident. It’s super fast and can be really fun, and makes a huge difference when you respond for real.
Game day (1-4pm) #
After your tabletop exercise focused on the incident response process, now you get to focus on technical response.
Where we only dry-ran our response before, with fake scenarios, this part of the training has you breaking real systems and seeing how people react.
We’ll nominate a Villain who will create problems in real systems, triggering failures that the response team will act on as if they were a real incident. Of course, you don’t want to deliberately break the things your customers are using, so work in non-production environments. Even when the infrastructure isn’t quite the same as production (it rarely is), you can still get a lot of value by treating the alerts as if they were production ones.
This is called a game day, and can be an extremely fun way of getting your incident response machine into great shape.
1. Assign roles #
As before, it’s worth nominating a lead who can prepare in advance of the game day, and deciding who is going to be on your response team.
Unlike before, we’re going to give the lead a fun name: for the duration of this exercise, we’ll refer to them as the Villain.
The Villain should have a lot of experience in the systems that you’ll be testing response for. Their responsibility is to:
- Define the scope of the exercise: what systems are our focus, what type of incident do we want to drill, who should be involved
- Create a plan for how they’ll break these systems, down to the specific commands they’ll run and a general timeline for coordinating the ‘incident’
- As game days induce real problems (e.g. in a staging environment), the Villain should send comms to the rest of the company before starting the exercise, so everyone is aware of the ongoing drill
As the Villain, it’s useful to have other tenured staff review your plans before executing the drill, to ensure your actions won’t accidentally cause more damage than you expect. We’ve been on game days that escalated into production incidents before, which – while leading to great learnings – wasn’t the original intention!
If you’re on the response team, you only need to turn up and be ready to respond.
2. Execute the plan #
When the game day starts, the Villain can begin executing their plan.
If your game day starts at 1pm, an example plan might be:
[1:15pm] Deploy a change that creates a big increase in log volume, aiming to stress the Elasticsearch cluster. We should see alerts ~15 minutes after this.
You don’t need to start breaking things right at the start: in fact, it can help if your response team are asked to work as normal until they see things break, rather than sit anxiously searching for alerts!
This first step aims to create a situation that will lead to breakage. We’re stressing one of the systems in scope for the exercise (Elasticsearch), and the Villain has prepared a PR in advance that creates the additional log volume.
Note that we don’t need to know, specifically, what will happen in response to this stress. That’s part of game days: you may have a hypothesis, but in the spirit of Chaos Engineering, you’ll often be surprised at the varying resilience of the systems you run.
If the Villain’s expectations are correct, this will lead to Elasticsearch falling over within 15 minutes of deploying the PR. That will trigger alerts, and the response team should jump on them: this is where they begin to test their incident response, and they should run the incident as closely as possible to their normal production behaviours.
While they’re doing that, the Villain should maintain notes about their observations, such as how long after impact the team noticed the problem or whether they missed anything when triaging the problem. We’ll have the response team compare notes after, and the Villain’s insider perspective can be extremely useful when finding opportunities for improvement.
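As a concrete sketch, the log-volume change can be as simple as a loop that emits oversized debug lines. Everything here is an assumption – the line count, the log shape, and where the output goes will depend on your own logging pipeline and what it actually takes to stress it:

```shell
#!/usr/bin/env bash
# Hypothetical log-flooding change for a game day: emit a burst of noisy
# JSON debug lines to stress the logging pipeline. The count and log shape
# are placeholders -- tune them to what stresses your own cluster.
flood_logs() {
  local count="${1:-1000}"
  for i in $(seq 1 "$count"); do
    printf '{"level":"debug","msg":"request payload dump","seq":%d}\n' "$i"
  done
}

# In the real drill this would write to stdout for your log shipper to pick
# up; here we capture it to a file so we can sanity-check the volume.
flood_logs 1000 > /tmp/game-day-flood.log
```

In practice the Villain would wrap something like this in the PR mentioned above, so the change ships through the normal deploy pipeline and the drill stays realistic.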
[1:45pm] Drop an index from the Postgres database that access token lookup depends on, and start issuing lots of requests (ab -c20 -n 50000) to the API.
As we did in the tabletop exercise, gradually increase the pressure so the response team is forced to change gear.
To do this effectively, the Villain should be paying close attention to the team's response. It’s at the point where people are assigned roles and begin settling into the debugging motion that it’s most productive to dial things up: in the nicest of ways, you want to keep the team as unsettled as possible, forcing them to revert to their training.
An ideal outcome for a second issue is that the team identifies the new problem, allocates resources to handle it, and maintains a separation between the two lines of investigation so each can progress independently.
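The second step pairs a schema change with load generation. A sketch of what the Villain might run – the connection string, index name, and endpoint are all hypothetical placeholders for your own staging setup:

```shell
# All names below are hypothetical -- substitute your own staging details.
# 1. Remove the index that access-token lookups depend on, so those
#    queries fall back to slow sequential scans.
psql "$STAGING_DATABASE_URL" \
  -c 'DROP INDEX IF EXISTS idx_access_tokens_token;'

# 2. Hammer the API with ApacheBench, as in the plan above: 20 concurrent
#    clients, 50,000 requests in total.
ab -c20 -n 50000 https://api.staging.example.com/ping
```

Keeping both commands in the written plan means the Villain (or a reviewer) can check in advance that they’re reversible – here, that the index can be recreated once the drill ends.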
[2:15pm] Scale a staging Kubernetes deployment to 200 replicas, filling the cluster.
It can be useful to finish on a situation that might have a broader reach and take longer to recover, and to practice how to follow up at later incident stages.
This example, where you fill the Kubernetes cluster with resource and kick-out many of the other workloads, is such a situation.
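Saturating the cluster can be a single command. The context, namespace, and deployment name below are stand-ins for whatever exists in your staging environment:

```shell
# Hypothetical: fill the staging cluster by scaling one deployment far
# beyond its normal size. Context/namespace/name are placeholders.
kubectl --context staging -n game-day scale deployment/load-dummy --replicas=200

# Watch the fallout: Pending pods across namespaces mean other workloads
# are being starved of room on the cluster.
kubectl --context staging get pods -A --field-selector=status.phase=Pending
```

This is also where the Villain’s pre-drill comms matter most: filling a shared staging cluster will affect everyone who uses it, not just the response team.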
The rest of the exercise is just more of the same, calling it before things get too tiring or extreme: you want to keep the situation within the bounds of a “realistically” terrible day, to ensure the learnings are relevant.
3. Learning (end of day, or the next) #
Once you officially end the exercise, you get everyone – Villain and responders – to share notes and review their performance.
As you would in a normal incident debrief, walk through the timeline and see how the response unfolded. But unlike a normal incident, you have a behind-the-scenes picture of exactly when the incident began from the Villain’s report, which can help catch improvements that aren’t obvious in real incident reviews.
While just doing the drill is extremely valuable in providing hands-on experience, a short debrief can help to solidify the experience and drive home the learnings. If you’re running low on time, sacrifice time in the exercise to make space for the learnings, as it’s generally a better return on investment.