No matter how good your tooling is, how experienced you are, or how much you've prepared, incidents can still be hard. Despite the five (soon to be more!) of us at incident.io racking up some serious incident hours, we didn’t struggle to come up with a lengthy list of things we still find difficult in incidents. Whilst we’ll be using this list to drive improvements to our product, we thought we’d open it up for the world to see in the meantime.
Whenever I’m leading larger incidents, I find it really hard to resist the temptation to try and debug or fix the thing myself, as opposed to assembling a team and running the process that fixes the thing. Because of my technical background, it’s always tempting to dive into the logs, look at the code, and see how I might fix things.
You might be qualified to do it, but that’s not why you’re here — you’re here to lead the incident
You might be qualified to do it, but that’s not why you’re here — you’re here to lead the incident. I always have to remind myself that I’m playing an explicit role, and that there’s only one of me in the incident, so it’s really important that I stay focused on driving the incident to completion.
More generally, working out the highest-leverage role to play can be tricky - do you step in and offer to take over the lead role to free more people up to investigate, throw yourself into the fray and start investigating / providing any context you might have, or ask, but then wait to be told what to do? Sometimes it’s clear, and that’s fine. Other times it’s not. Get it wrong and you risk bad customer outcomes and/or disempowering people. Ideally you’d ask clarifying questions to make the call, but when everything is on fire, that’s hard too, and it’s very easy to give the impression you don’t trust the people already involved.
Before starting incident.io, my most recent involvement in incidents had been as an Incident Manager, where I’d typically be drafted into relatively serious issues. That usually meant a large impact on our customers, a complex systems failure, or some kind of organisational complexity to navigate.
More often than not, I’d be pulled in after a number of people had already been involved and were deep in the weeds of a problem. Getting up to speed and in the loop quickly was always something I found hard, for a few reasons:
The best thing about these problems? I recognised they were largely self-imposed, based on poor assumptions, and totally manageable. Finding myself in the same situation today, I’d offer this advice:
Take a breath and accept that you can’t hit the ground at 100mph
Take a breath and accept that you can’t hit the ground at 100mph. Let folks know why you’ve joined, how you can help, and that you're taking 5 mins to get up to speed. Don’t be afraid to put the brakes on the response and ask literally everyone to provide an update on what they’re seeing, what they’re doing and what they need. Remember that the folks in the incident have pulled you in because they want your support, so they have every motivation and incentive to help you out.
In some cases, particularly for common offenders (I know they shouldn’t exist, but they always do), I feel like I just know what we need to do. Maybe it’s an “it’s that thing again” or a “you just need to X” moment. In most cases, I have time to run those actions and assumptions past others, to make a more informed decision as a team and share context. In other cases, something is urgently wrong, and you find yourself forced into the less satisfying equivalent of “I can’t explain why right now, but trust me, this is the right thing to do”. This can understandably be both concerning and frustrating for others involved, and it’s quite a blunt instrument - it can be hard to know when it’s the right call.
Incidents with lots of parallel actions and investigations are really hard to monitor as a lead
Incidents with lots of parallel actions and investigations are really hard to monitor as a lead. You can designate owners for various avenues of investigation, but actually tracking them is quite hard. Separate channels rarely make sense outside the largest incidents with genuinely independent workstreams. Slack threads are OK, but hard to follow chronologically and from a visibility standpoint. Outstanding actions might relate to one of many investigations, and as a lead it can be tricky to track it all and repeatedly summarise the current status of everything. I find this is far worse at the start, when everyone is moving very fast; as things become less urgent (i.e., once the bleeding has stopped) you hit a stride and it starts to become a lot easier.
At the risk of sounding like I’ve primed you for a sale, incident.io does make a lot of this easier, but there’s plenty more we can do to make this better!
As you continue in your career, particularly when you’ve been at a company for a little while, you develop a gut feel about the likely root cause of a particular problem. That gut feel is incredibly valuable, and often enables a team to identify and fix issues more quickly. However, you’ve got to strike a balance between trusting your gut / experience versus gathering information in a less biased way.
If you jump to conclusions too quickly, it’s possible to spend a long time down a rabbit hole before stepping back and realising that you don’t really know that the problem is in system X – you’re here on a hunch. Equally, if you ignore your experience completely, you’re trying to solve a problem with one hand tied behind your back.
The key here is to be really explicit about when you’re making assumptions vs. when you have evidence to support a theory
I think the key here is to be really explicit about when you’re making assumptions vs. when you have evidence to support a theory. This is also made easier by having a bigger incident response team - if you can safely parallelise investigations then it’s less important to guess the correct root cause first time around.
Recovering from bad assumptions can be incredibly difficult. Once the team responding to an incident has established a ‘fact’, they’ll be anchored by it, and all the conclusions that follow can be impacted.
One example of this is when you join an incident team where you have a high level of trust in the people already there. If an influential responder says “this was caused by the database failing”, it’s all too easy to accept that into your picture of how the incident has played out, even if it comes with no evidence.
Try to avoid this. The best incident policy is to trust but verify: ask for proof, or better yet go back into your Slack logs (or whatever you use!) to confirm the evidence supports the statement. Good responders should be pushing as much evidence into the incident log as possible, which should make this easy to verify.
Another cause of bad data is an incident having ramped up or changed in nature since the responders first began tackling it.
In my experience, some of the worst incidents begin as small problems, often caught by an individual or single team
In my experience, some of the worst incidents begin as small problems, often caught by an individual or single team. When you’re the sole responder investigating a low-impact bug, the implications of assuming something are vastly different from what they are several hours later, when the incident has been upgraded and the impact might be really severe.
Take the opportunity to reassess your assumptions when an incident changes shape, such as when you upgrade the severity or expand the team. Have new incident responders cross-check your work, and make sure the team isn’t pinning their hopes on a scribbled note from back when the incident first started, before it snowballed into something much more serious.
Well, that was cathartic! I think it’s fair to say that incidents are still hard, and there’s plenty of scope for improvement. Something of a target-rich environment, and we’re on the case! 🎯