Getting a common understanding of what an incident is (and isn’t) is the first step in bringing people into your incident response process.
You need this so that:
Don’t limit the definition to engineering: getting your whole org to think about anything unexpected in the same way leads to better collaboration when things go wrong (see Communicating within your organization)
There's no hard-and-fast definition that'll work for everyone, but we recommend the following as a sensible default:
An incident is anything that takes you away from planned work with a degree of urgency.
We love this definition for its simplicity and versatility across an entire organization. While some teams use ITIL-style definitions like "an unplanned interruption to a service, or reduction in service quality," these are often too narrow for broad, practical use.
Our definition focuses on two key questions: is the work planned, and is it urgent?
By plotting these dimensions on a 2x2 grid, we can see how incidents relate to other types of work:
| | High Urgency | Low Urgency |
|---|---|---|
| Planned | Hotfix deployment | 'Normal' work |
| Unplanned | Incident response | Bug filed on backlog |
High Urgency, Planned: Hotfix deployment. This is a fix for a known issue, planned in advance but requiring swift action, often to prevent user impact or reduce a risk.
High Urgency, Unplanned: Incident response. Unplanned, critical work triggered by an issue that requires immediate attention.
Low Urgency, Planned: 'Normal' work. Routine work that's planned over time and prioritised alongside everything else.
Low Urgency, Unplanned: Bug filed on backlog. A minor issue discovered in production that's added to the backlog for potential future resolution.
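If it helps to make the grid concrete, the four quadrants can be sketched as a small lookup. This is only an illustration of the 2x2 above, not part of any real tool: the names `WorkType`, `classify_work`, and `is_incident` are hypothetical, and treating urgency as a simple yes/no is a simplifying assumption.

```python
from enum import Enum


class WorkType(Enum):
    # Labels taken from the 2x2 grid; the enum itself is a hypothetical helper.
    HOTFIX_DEPLOYMENT = "Hotfix deployment"
    INCIDENT_RESPONSE = "Incident response"
    NORMAL_WORK = "'Normal' work"
    BACKLOG_BUG = "Bug filed on backlog"


def classify_work(planned: bool, urgent: bool) -> WorkType:
    """Map the two questions (planned? urgent?) onto one quadrant of the grid."""
    if urgent:
        return WorkType.HOTFIX_DEPLOYMENT if planned else WorkType.INCIDENT_RESPONSE
    return WorkType.NORMAL_WORK if planned else WorkType.BACKLOG_BUG


def is_incident(planned: bool, urgent: bool) -> bool:
    """Under the default definition, an incident is unplanned and urgent."""
    return classify_work(planned, urgent) is WorkType.INCIDENT_RESPONSE


if __name__ == "__main__":
    print(classify_work(planned=False, urgent=True).value)  # prints "Incident response"
```

In practice urgency is a matter of degree, so a real process would leave this call to human judgement rather than a boolean.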
Organizations generally set their threshold for incidents high, so that only the most severe events are called incidents.
We believe smaller incidents are extremely valuable, and that there's a lot to gain from lowering your threshold for declaring one. Smaller incidents are a great way to learn about the failure cases of your systems, and they give teams an opportunity to practice their response to larger issues.
When the cost of declaring an incident is low, there's little reason to avoid reporting and plenty of value to be extracted.
For a deeper dive on the value of tracking smaller incidents, check out this talk from SEV0, "Stop, Drop, and SEV4: Why small incidents are a big deal".
Practice makes perfect
The best way to embed a process into your organization is to use it. A lot. This helps everyone learn the process and get better at incident response overall, so that when something really bad happens, the response runs like a well-oiled machine. If you’re on the fence, lower your threshold!
Still struggling to pinpoint exactly which things are and aren't incidents? If the answer to one or two of these is yes, you probably have an incident on your hands.
Does the problem you're facing have a risk of some negative impact on the product, business or customers?
It should go without saying, but incidents generally aren't a positive thing. Someone, or something, might be negatively impacted in a small or large way.
Do you need to respond to the problem urgently, outside of your planned work?
Incidents require some urgent action to be taken to investigate and remediate. They aren't deferrable in the same way as a normal task or a bug might be.
Do you need to coordinate between multiple people or departments?
Incidents are generally not something one person can reasonably address end to end without involving anyone else.
Do you need to communicate to the rest of the organization, or to your customers?
Incidents often require other stakeholders to be kept in the loop with periodic updates on status. These might be your team, your board of directors, or your customers.
Is this something you want to discuss afterwards and review to extract learnings?
After an incident, you'll generally want to dig deeper to learn more and see how you might prevent this in the future.
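The five questions above can be sketched as a simple checklist. This is a minimal illustration, not a definitive rule: the `IncidentChecklist` class and `probably_an_incident` function are hypothetical names, and the default `threshold` of one "yes" is an assumption based on the "one or two" guidance above.

```python
from dataclasses import dataclass


@dataclass
class IncidentChecklist:
    # One flag per question in the list above; field names are hypothetical.
    risk_of_negative_impact: bool = False
    needs_urgent_response: bool = False
    needs_coordination: bool = False
    needs_communication: bool = False
    worth_reviewing_afterwards: bool = False

    def yes_count(self) -> int:
        """Count how many questions were answered 'yes'."""
        return sum(vars(self).values())


def probably_an_incident(answers: IncidentChecklist, threshold: int = 1) -> bool:
    """Return True when at least `threshold` questions were answered 'yes'.

    The default threshold of 1 is an assumption; tune it to your own context.
    """
    return answers.yes_count() >= threshold


if __name__ == "__main__":
    answers = IncidentChecklist(needs_urgent_response=True, needs_communication=True)
    print(probably_an_incident(answers))  # prints True
```

Raising `threshold` corresponds to setting a higher bar for declaring an incident, and keeping it low matches the advice to lower your threshold.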
Convinced, but looking for some examples? We've got you covered. As with most things, context is everything, so read these as guidelines rather than hard and fast rules.
At incident.io we treat any interruption to normal work as an incident. That might mean a single Sentry error that needs investigation, some behaviour we can’t explain, or a report from a customer. We do this so that we:
We’re not logging these incidents with the intention of running lengthy post-mortems or extracting rich insights. Those things will be valuable at scale, but as a small organization, we're optimising for transparency and speed of recovery.
We suspect we’ll change our approach as we scale, but for our early days, this approach has served us really well.