
Defining an incident

Getting a common understanding of what an incident is (and isn’t) is the first step in bringing people into your incident response process.

You need this so that:

  • Incidents get declared consistently. If people aren’t sure, they’ll often err on the side of not declaring, and your response and ability to learn and improve can suffer.
  • Everyone’s talking the same language, so they can collaborate across functions.

Don’t limit the definition to engineering: getting your whole org to think about anything unexpected in the same way leads to better collaboration when things go wrong (see Communicating within your organization).

What is an incident?

There's no hard-and-fast definition that'll work for everyone, but we recommend the following as a sensible default:

An incident is anything that takes you away from planned work with a degree of urgency.

We love this definition for its simplicity and versatility across an entire organization. While some teams use ITIL-style definitions like "an unplanned interruption to a service, or reduction in service quality," these are often too narrow for broad, practical use.

Our definition focuses on two key questions:

  1. Was it a surprise? If yes, it falls under reactive work.
  2. Does it need immediate action? If yes, it has a degree of urgency.

Urgency and reactivity in action

By plotting these dimensions on a 2x2 grid, we can see how incidents relate to other types of work:

             High Urgency         Low Urgency
Planned      Hotfix deployment    'Normal' work
Unplanned    Incident response    Bug filed on backlog

  1. High Urgency, Planned: Hotfix deployment. This is a fix for a known issue planned in advance but requiring swift action, often to prevent user impact or reduce a risk.

  2. High Urgency, Unplanned: Incident response. Unplanned, critical work triggered by an issue that requires immediate attention.

  3. Low Urgency, Planned: 'Normal' work. Routine work that's planned over time and prioritised alongside everything else.

  4. Low Urgency, Unplanned: Bug filed on backlog. A minor issue discovered in production that's added to the backlog for potential future resolution.
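
The grid above can be read as a lookup from two yes/no answers to a type of work. Here's a minimal Python sketch of that idea. The names `WorkType`, `GRID` and `classify_work` are hypothetical, introduced only for illustration, and aren't part of any incident.io API:

```python
from enum import Enum


class WorkType(Enum):
    """The four quadrants of the urgency/planning grid."""
    HOTFIX = "Hotfix deployment"
    INCIDENT = "Incident response"
    NORMAL = "'Normal' work"
    BACKLOG_BUG = "Bug filed on backlog"


# Keys are (urgent, planned) pairs, one per quadrant of the grid.
GRID = {
    (True, True): WorkType.HOTFIX,
    (True, False): WorkType.INCIDENT,
    (False, True): WorkType.NORMAL,
    (False, False): WorkType.BACKLOG_BUG,
}


def classify_work(urgent: bool, planned: bool) -> WorkType:
    """Return the quadrant for a piece of work.

    urgent:  does it need immediate action?
    planned: was it expected, rather than a surprise?
    """
    return GRID[(urgent, planned)]
```

For example, `classify_work(urgent=True, planned=False)` returns `WorkType.INCIDENT`. In practice the two answers are judgment calls rather than clean booleans, so treat this as a way of picturing the grid, not a tool for deciding automatically.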

Declaring more incidents

Organizations generally set their threshold for incidents high, so that only the most severe events are called incidents.

We believe there's significant value in lowering that threshold. Smaller incidents are a great way to learn about the failure cases of your systems, and they give teams an opportunity to practice their response to larger issues.

When the cost of declaring an incident is low, there's little reason to avoid reporting and plenty of value to be extracted.

For a deeper dive on the value of tracking smaller incidents, check out this talk from SEV0: "Stop, Drop, and SEV4: Why small incidents are a big deal".

Practice makes perfect

The best way to embed a process into your organization is to use it. A lot. This helps everyone learn the process and get better at incident response overall, so that when something really bad happens, the response runs like a well-oiled machine. If you’re on the fence, lower your threshold!

The golden questions

Still struggling to pinpoint exactly which things are and aren't incidents? If the answer to one or two of these is yes, you probably have an incident on your hands.

  • Does the problem you're facing have a risk of some negative impact on the product, business or customers?
    We hope it goes without saying, but incidents generally aren't a positive thing. Someone, or something, might be negatively impacted in a small or large way.

  • Do you need to respond to the problem urgently, outside of your planned work?
    Incidents require some urgent action to be taken to investigate and remediate. They aren't deferrable in the same way as a normal task or a bug might be.

  • Do you need to coordinate between multiple people or departments?
    Incidents are generally not something one person can reasonably address end to end without involving anyone else.

  • Do you need to communicate to the rest of the organization, or to your customers?
    Incidents often require other stakeholders to be kept in the loop with periodic updates on status. These might be your team, your board of directors, or your customers.

  • Is this something you want to discuss afterwards and review to extract learnings?
    After an incident, you'll generally want to dig deeper to learn more and see how you might prevent this in the future.
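
The checklist above can be sketched as a simple tally of "yes" answers. This is an illustration only: the names `GoldenAnswers`, `yes_count` and `probably_incident` are hypothetical. The default threshold of two is also an assumption, since the guidance says "one or two" and leaves the exact cut-off to your judgment.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GoldenAnswers:
    """Yes/no answers to the five golden questions."""
    negative_impact_risk: bool   # risk to product, business or customers?
    needs_urgent_response: bool  # urgent, outside of planned work?
    needs_coordination: bool     # multiple people or departments involved?
    needs_communication: bool    # updates for the org or customers?
    worth_reviewing: bool        # worth discussing afterwards?


def yes_count(answers: GoldenAnswers) -> int:
    """Count how many of the five questions were answered yes."""
    return sum(getattr(answers, f.name) for f in fields(answers))


def probably_incident(answers: GoldenAnswers, threshold: int = 2) -> bool:
    """Assumed rule of thumb: enough yes answers suggests an incident."""
    return yes_count(answers) >= threshold


# A dropped glass: urgent to clean up, but nothing else applies.
dropped_glass = GoldenAnswers(False, True, False, False, False)

# A checkout error blocking payment: every question is a yes.
checkout_error = GoldenAnswers(True, True, True, True, True)
```

With the assumed threshold, `probably_incident(dropped_glass)` is `False` and `probably_incident(checkout_error)` is `True`. Lowering `threshold` to 1 makes the check more eager, in line with the advice to lower your threshold if you're on the fence.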

Incidents and non-incidents

Convinced, but looking for some examples? We've got you covered. As with most things, context is everything, so read these as guidelines rather than hard and fast rules.

Things we'd call incidents

  • A website checkout error repeatedly preventing a single customer from paying for their basket is an engineering incident.
  • You’re running a food delivery service and there aren't enough delivery riders on shift. Delivery times to customers have dramatically increased, and customers are complaining. As a result, you have an operational and product incident.
  • Your largest customer threatening to churn unless you re-negotiate their contract is a customer success incident.
  • An ex-employee threatening to maliciously leak confidential information about the business is a security incident.
  • A customer support agent sending data to the wrong customer is an operational and privacy incident.

Things we wouldn't call incidents

  • A minor CSS formatting issue affecting users on a tiny percentage of browsers. It has a small negative impact on a very small number of users but doesn't require an urgent response and you can prioritise it against other work.
  • An employee resigning from the business. Although it has negative impacts, it's an expected normal business flow and doesn't need to be responded to urgently.
  • Someone dropping a glass in the office. This requires an urgent response to clean it up so others aren't hurt, but it doesn't require coordination, communication or systemic improvements.

How we work at incident.io

At incident.io we treat any interruption to normal work as an incident. That might mean a single Sentry error that needs investigation, some behaviour we can’t explain, or a report from a customer. We do this so that we:

  • Make ownership clear: “I can see Lawrence is looking at that bug, I’ll leave him to it”
  • Aid knowledge sharing: “I see Sophie looked into an issue last night which was related to the thing I was working on. I’ll scan through what happened”
  • Ruthlessly prioritise: we use incidents to systematically triage issues quickly to determine whether a fix needs to be picked up immediately or deferred until later

We’re not logging these incidents with the intention of running lengthy post-mortems or extracting rich insights. Those things will be valuable at scale, but as a small organization, we're optimising for transparency and speed of recovery.

We suspect we’ll change our approach as we scale, but for our early days, this approach has served us really well.