Getting a common understanding of what an incident is (and isn’t) is the first step in bringing people into your incident response process.
You need this so that:
- Incidents get declared consistently. If people aren’t sure, they’ll often err on the side of not declaring one, and response suffers.
- Everyone speaks the same language, so they can collaborate across functions.
Don’t limit the definition to engineering: getting your whole org to think about anything unexpected in the same way leads to better collaboration when things go wrong (see Communicating within your organisation).
Our definition
Clearly, there's no hard-and-fast definition that'll work for everyone, but we recommend the following as a sensible default:
An incident is anything that takes you away from planned work with a degree of urgency.
Declare more incidents
Organisations generally set their threshold for incidents high, so that only the most severe events are called incidents. We believe there's significant value in lowering that threshold. Smaller incidents are a great way to learn how systems fail, and they give teams a chance to practise their response before a larger issue arrives.
When the cost of declaring an incident is low, there's little reason to avoid reporting and plenty of value to be extracted. Give it a go!
Practice makes perfect
The best way to embed a process into your organisation is to use it. A lot. This helps everyone learn the process and get better at incident response overall, so when something really bad happens, the response runs like a well-oiled machine. If you’re on the fence, lower your threshold!
The golden questions
Still struggling to pinpoint exactly which things are and aren't incidents? If the answer to one or two of these is yes, you probably have an incident on your hands.
Does the problem you're facing have a risk of some negative impact on the product, business or customers?
It hopefully goes without saying, but incidents generally aren't a positive thing. Someone, or something, might be negatively impacted in a small or large way.
Do you need to respond to the problem urgently, outside of your planned work?
Incidents require some urgent action to be taken to investigate and remediate. They aren't deferrable in the same way as a normal task or a bug might be.
Do you need to coordinate between multiple people or departments?
Incidents are generally not something one person can reasonably address end to end without involving anyone else.
Do you need to communicate to the rest of the organisation, or to your customers?
Incidents often require other stakeholders to be kept in the loop with periodic updates on status. These might be your team, your board of directors, or your customers.
Is this something you want to discuss afterwards and review to extract learnings?
After an incident, you'll generally want to dig deeper to learn more and see how you might prevent this in the future.
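If it helps to make the questions concrete, they can be sketched as a simple checklist. This is only an illustration, not a tool we prescribe: the names `GoldenQuestions`, `yes_count` and `probably_incident` are made up for this example, and the default threshold of two "yes" answers is an assumption drawn from the "one or two" guidance above. Lower it to one if you want to declare more incidents.

```python
from dataclasses import dataclass, fields


@dataclass
class GoldenQuestions:
    """Answers to the five golden questions (True means "yes")."""

    negative_impact: bool = False      # risk to the product, business or customers?
    urgent_response: bool = False      # needs action outside of planned work?
    needs_coordination: bool = False   # involves multiple people or departments?
    needs_communication: bool = False  # updates needed for the org or customers?
    worth_reviewing: bool = False      # worth discussing afterwards to learn from?


def yes_count(answers: GoldenQuestions) -> int:
    """Count how many of the questions were answered "yes"."""
    return sum(getattr(answers, f.name) for f in fields(answers))


def probably_incident(answers: GoldenQuestions, threshold: int = 2) -> bool:
    """Return True when enough answers are "yes".

    The threshold is an assumption for this sketch, not a fixed rule.
    """
    return yes_count(answers) >= threshold


# A dropped glass needs an urgent clean-up, but nothing else applies.
dropped_glass = GoldenQuestions(urgent_response=True)

# Customer data sent to the wrong person: impact, urgency, comms and a review.
wrong_recipient = GoldenQuestions(
    negative_impact=True,
    urgent_response=True,
    needs_communication=True,
    worth_reviewing=True,
)

glass_is_incident = probably_incident(dropped_glass)
data_is_incident = probably_incident(wrong_recipient)
```

With the default threshold, the dropped glass doesn't count as an incident while the misdirected customer data does. Treat the result as a prompt for judgement rather than a verdict: context still matters more than the count.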
Incidents and non-incidents
Convinced, but looking for some examples? We've got you covered. As with most things, context is everything, so read these as guidelines rather than hard-and-fast rules.
Things we'd call incidents
- A website checkout error repeatedly preventing a single customer from paying for their basket is an engineering incident.
- You’re running a food delivery service and there aren't enough delivery riders on shift. Delivery times have increased dramatically, and customers are complaining. As a result, you have an operational and product incident.
- Your largest customer threatening to churn unless you re-negotiate their contract is a customer success incident.
- An ex-employee threatening to maliciously leak confidential information about the business is a security incident.
- A customer support agent sending data to the wrong customer is an operational and privacy incident.
Things we wouldn't call incidents
- A minor CSS formatting issue affecting users on a tiny percentage of browsers. It has a small negative impact on a very small number of users but doesn't require an urgent response and you can prioritise it against other work.
- An employee resigning from the business. Although it has negative impacts, it's an expected normal business flow and doesn't need to be responded to urgently.
- Someone dropped a glass in the office. This needs an urgent response (cleaning it up so others aren't hurt) but doesn't require coordination, communication or systemic improvements.
How we work at incident.io
At incident.io we treat any interruption to normal work as an incident. That might mean a single Sentry error that needs investigation, some behaviour we can’t explain, or a report from a customer. We do this so that we:
- Make ownership clear: “I can see Lawrence is looking at that bug, I’ll leave him to it”
- Aid knowledge sharing: “I see Sophie looked into an issue last night which was related to the thing I was working on. I’ll scan through what happened”
- Ruthlessly prioritise: we use incidents to systematically triage issues quickly to determine whether a fix needs to be picked up immediately or deferred until later
We’re not logging these incidents with the intention of running lengthy post-mortems or extracting rich insights. Those things will be valuable at scale, but as a small organisation, we're optimising for transparency and speed of recovery.
We suspect we’ll change our approach as we scale, but for our early days, this approach has served us really well.