At the crux of an incident, there’s normally a problem that needs to be solved. Why is the bad thing happening? What do we need to do to mitigate or fix it?
To understand why something is happening, you’ll need to gather evidence from your various systems. It can be difficult to know where to start: there’s undoubtedly more evidence than you could possibly read or process.
When you’re trying to diagnose a problem, you’ll always start with a question. It’s probably pretty vague:
Why is the system not working?
Over the course of an incident, you’ll come up with lots more questions which are more specific:
What is making this task take longer than usual?
Resolving the incident is all about answering these questions. When you’re clear about what question you’re trying to answer at any given moment, it’s easier for everyone to understand and contribute.
To make progress answering a question, you’ll want to come up with a theory.
I think this task is taking longer than usual because a third party is responding slowly.
Once you have a theory, you can work out what evidence you need to prove or disprove it:
What would I see if the third party is responding slowly? How long does the third party usually take?
It’s likely that your investigations will throw up more questions. As highlighted in Show your working, it’s good to share these so you can refer back to them, but don’t let them distract you from the task at hand. Continue until you’re comfortable that you’ve either:
(a) proven the theory - great, we know why the task is taking longer
(b) disproved the theory - great, we now know to look elsewhere
(c) decided to abandon this line of investigation - we think we’ve got something else to investigate that’s more important, let’s drop this.
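To make the theory-testing loop concrete, here is a minimal sketch in Python that checks the example theory against evidence. Everything in it is an assumption for illustration: the function name, the sample latencies, and the rule that "slow" means at least twice the usual median. Your own threshold and data source will differ.

```python
from statistics import median


def assess_slow_third_party(baseline_ms, current_ms, slow_factor=2.0):
    """Compare current third-party latencies against a usual baseline.

    Returns "supported", "not supported", or "inconclusive".
    slow_factor is an assumed threshold: how many times slower than
    usual the median must be before we call the theory supported.
    """
    if not baseline_ms or not current_ms:
        # Without both sets of evidence we can't prove or disprove anything.
        return "inconclusive"

    usual = median(baseline_ms)
    now = median(current_ms)

    if now >= slow_factor * usual:
        return "supported"
    return "not supported"


# Hypothetical samples: response times (ms) from last week and from the incident.
baseline_ms = [110, 120, 125, 130, 140]
current_ms = [480, 510, 530, 600, 650]

print(assess_slow_third_party(baseline_ms, current_ms))
```

A "supported" result corresponds to outcome (a), "not supported" to outcome (b), and "inconclusive" is a prompt to either gather more evidence or consciously drop the thread as in (c).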
As your incident response team grows, it’s possible to investigate multiple theories in parallel. This means it’s less important to guess the correct theory the first time around, and is very useful if you’ve got a complicated problem that you’ve not seen before (or a problem which needs to be fixed very quickly).
When you’re collaborating on more complex incidents, the incident lead should be managing the different threads and communicating frequently in the channel about who’s doing what, and why. Regular incident updates are a great mechanism for this: repeatedly stating the current status of the various threads, and who’s doing what, helps keep the team on track.
When you’ve been at a company for a little while, you develop a gut feeling about the likely cause of a particular problem. That gut feeling is incredibly valuable, and often enables a team to identify and fix issues more quickly. But remember, it’s an assumption.
If you jump to conclusions too quickly, you can lose a lot of time before realising that you don’t actually know the problem is in system X; you’re only there on a hunch. Equally, if you ignore your experience completely, you’re trying to solve a problem with one hand tied behind your back.
Be clear when you’re making assumptions versus when you have gathered evidence to strongly support a theory. Recovering from bad assumptions can be very difficult. Once the team responding to an incident has established a ‘fact’, they’ll be anchored by it, and every conclusion that follows can be affected.
The best incident policy is trust-but-verify: when someone states something as fact, ask for proof, or better yet go back through your incident channel to confirm that the evidence supports it. This is where showing your working comes in: if responders are explaining all their reasoning in the channel, and attaching evidence, it should make these claims easy to verify.
Without reliable information, it’s impossible to build and test a hypothesis. This sounds obvious, but it’s surprisingly difficult to achieve. It’s easy to pull lots of data that isn’t quite right, and build beautiful but ultimately misleading dashboards. That might be because you’re over-aggregating data (e.g. including test data in your averages) or simply misunderstanding what the data is telling you (maybe that number is seconds, not milliseconds).
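As a rough illustration of both pitfalls, the Python sketch below averages a handful of request durations twice: once naively, and once after excluding test traffic and converting everything to milliseconds. The records, field names (`duration`, `unit`, `is_test`), and values are all made up for the example.

```python
from statistics import mean

# Hypothetical request records. Note the mixed units and the test traffic.
requests = [
    {"duration": 0.5, "unit": "s", "is_test": False},
    {"duration": 1500, "unit": "ms", "is_test": False},
    {"duration": 1000, "unit": "ms", "is_test": False},
    {"duration": 5, "unit": "ms", "is_test": True},
    {"duration": 5, "unit": "ms", "is_test": True},
]


def to_ms(record):
    """Normalise a record's duration to milliseconds."""
    if record["unit"] == "s":
        return record["duration"] * 1000
    if record["unit"] == "ms":
        return record["duration"]
    raise ValueError(f"unknown unit: {record['unit']!r}")


def naive_mean(records):
    """Average the raw numbers, ignoring units and test traffic."""
    return mean(r["duration"] for r in records)


def careful_mean_ms(records):
    """Average real (non-test) traffic only, in consistent units."""
    return mean(to_ms(r) for r in records if not r["is_test"])


print("naive:", naive_mean(requests))
print("careful (ms):", careful_mean_ms(requests))
```

With these sample records the naive figure comes out at roughly half the careful one, which would make real traffic look about twice as fast as it actually is.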
To gather reliable information, you need to be using the data frequently; if you only ever use it once every few months, it’s unlikely you’ll be able to trust that it’s correct. Everyone needs to know where to get reliable data, and also what the provenance of that data is. Without understanding where it’s coming from, you won’t be able to truly understand what it’s telling you.