All incidents are different, but there are general themes and events which appear consistently. This chapter outlines the typical flow of an incident, taking you step-by-step from declaration through to the close.
When something unexpected happens, someone (or perhaps some automation) declares an incident by starting the formal incident response process.
It’s important that everyone knows when they should declare an incident (see Defining an incident) and how to do it. That way, all the great process you’ve built gets invoked, and you get the benefits of a coordinated response.
When it comes to responding to an incident, the first thing to do is create a space to collaborate.
In years past, responders would be in the same office. An incident often started with a takeover of a meeting room or ‘war room’, where responders would gather along with whatever tools they needed to handle the situation.
As the world embraces remote work, or if you’re paged at an unsociable hour, it’s more likely you’ll be meeting across the internet instead. The first action in any incident should be creating a channel in a tool like Slack or Teams. If you’re not using these products, a Zoom, a Google Meet, or even a shared document can work just fine.
If you’re using incident.io, we’ll do this for you, as will open-source tools like monzo/response and Netflix’s Dispatch.
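If you’re rolling your own automation, the mechanics are simple. Here’s a minimal sketch using Slack’s Python SDK (slack_sdk); the token handling, channel naming scheme, and message wording are illustrative assumptions rather than a prescribed setup.

```python
# A minimal sketch: create a dedicated, public incident channel and post a
# summary to it when an incident is declared. Assumes a Slack bot token with
# the channels:manage and chat:write scopes in SLACK_BOT_TOKEN.
import os
from datetime import datetime, timezone

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def declare_incident(summary: str) -> str:
    """Create an incident channel and return its ID."""
    # Hypothetical naming scheme, e.g. inc-2024-06-01-1432
    name = "inc-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")
    channel = client.conversations_create(name=name)["channel"]
    client.chat_postMessage(
        channel=channel["id"],
        text=f":rotating_light: Incident declared: {summary}",
    )
    return channel["id"]
```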
A public, dedicated space can level up your response by making information easy to access. It also prevents frustrating coordination failures (two separate incident teams tackling the same incident, with no knowledge of each other’s existence, is not a world any of us want to experience twice).
We strongly advocate against private incident response channels unless they’re absolutely necessary. There are occasions where it’s unavoidable (e.g. for legal or security reasons) but these should be the exception and not the rule.
Incidents are often hard because they require a cross-team response. What’s more, it’s not uncommon to find yourself assembling a group of individuals who have never worked together before. This makes every incident unique, and reinforces the need for consistency so everyone knows what they should be doing.
Lean on the roles you’ve defined (see Roles) – starting with the Incident Lead. Once you’ve got an Incident Lead, their job is to assemble the rest of the team that you need to solve this particular problem.
It can be really tempting to just jump into debugging/fixing things, particularly for leads with a technical background. But if you’re doing that, no one will be focussing on driving the incident to completion and coordinating the response. That’s your job.
As soon as everyone knows who is doing what, they can get their heads down and start problem-solving.
First, identify what the issue is. If you can establish the scope of an incident early, your next steps will be much more likely to address the problem.
Identify what’s causing the issue, then work through the dependencies to understand whether there are any knock-on or second-order effects.
Be extremely wary of assumptions. For everything you hear from a third party, trust but verify, and record whatever you did to verify it (e.g. a link to the dashboard you used). Incorrect assumptions can derail your response, so do your best to avoid them!
Once you’ve found the source, try to understand the full impact: who has been affected by the issue, and how badly? Don’t delay your response with this work, but if you can afford the time, it’s worth looking into. An inaccurate understanding of impact can lead to poor decisions, and clarity on who is affected helps certain parts of your organization (Customer Success, Support, etc.) respond appropriately.
Once the team understands the nature of the incident, you can begin to stop the bleeding. Your goal is now to stop the immediate pain: defer clean-up to a less pressured time.
For this, we need to prioritise actions to achieve the best chance of a positive outcome. Note the phrase “best chance”: options that are quick to apply should be taken first, even if you suspect they may only partially fix the problem.
Roll back to a known good version, even if you think you can write a fix really quickly — you can always do that after you’ve rolled back, when there is less urgency.
Take action to preserve critical systems, even at the expense of other less critical flows. If a single part is causing the whole system to fail, don’t hesitate to disable that part if it restores service to the most important things. A system that is 80% working is a lot better than being totally down!
Make full use of your team and proactively apply whatever fixes you think are low-risk, even if you suspect they might not fix the whole problem. Scale down non-essential queues, put a freeze on deploys, and restart the misbehaving component.
If you can delegate effectively to your team, it doesn’t cost much to try these simple, low-risk fixes (as long as other responders continue working on root cause analysis on the assumption that these fixes will fail).
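To make the ‘disable the failing part’ idea above concrete, here’s a minimal sketch of a kill switch. The environment variable, the recommendations component, and the product page are all hypothetical, standing in for whichever non-critical dependency is hurting you.

```python
# A hypothetical kill switch: shed a failing, non-critical dependency (here, a
# recommendations service) so the core flow keeps serving during the incident.
import os


def recommendations_enabled() -> bool:
    # Flip RECOMMENDATIONS_ENABLED=false to disable the failing part quickly,
    # and back to true once the incident is resolved.
    return os.environ.get("RECOMMENDATIONS_ENABLED", "true").lower() == "true"


def fetch_recommendations(product_id: str) -> list[str]:
    # Stand-in for the non-critical call that is currently failing.
    return [f"also-bought-with-{product_id}"]


def product_page(product_id: str) -> dict:
    page = {"product": product_id, "recommendations": []}
    if recommendations_enabled():
        page["recommendations"] = fetch_recommendations(product_id)
    return page


if __name__ == "__main__":
    print(product_page("sku-123"))
```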
After the initial impact has been mitigated, you’ll want to start thinking about longer-term resolutions. But first, this is a good time to take a breather. Make sure everyone’s eaten enough, and take some time to reset before jumping into the next phase. You can read more about this in Caring for your team.
If it’s a complicated incident, take some time to re-read the incident log (e.g. Slack channel).
Using incident.io? The timeline on the Incident Homepage is a great place to re-familiarise yourself with the incident.
Once you’re reminded of all the evidence you have, you can start to investigate further. Working through a series of theories as described in Solving complex problems, you can identify the steps you need to take to resolve the incident.
There’s always an uncomfortable balance here between strategic solutions and tactical ones. Once a potential resolution step is above a certain size (say a day of work), it can be useful to take a step back. Should this be handled inside the incident, or as a follow-up that’s passed to a team to work on?
Having a clear process to feed these follow-ups into your day-to-day work is critical to resolving incidents fully and getting the full value from each one.
Once the incident reaches its conclusion, it’s time to close it. A closed incident means that the immediate impact has been mitigated and any further follow-ups will be scheduled as part of ordinary work amongst your teams.
Leaving incidents open for a long time can be disruptive to teams – it pulls people away from their day-to-day work into a different set of processes, which are intended for time-limited responses. Actions and follow-ups can also be left in purgatory: no one is working on the incident, but the activities haven’t been officially handed over to the responsible team.