Severity levels are a high-level measure of the impact of an incident. They answer the question "how bad is this incident?"
If you've ever seen Critical associated with an incident, that's a severity level.
Severity levels are used for communicating impact to your coworkers, customers, and stakeholders.
Severities help us to respond
Well-designed severity levels create shared expectations between the people responding to an incident. This makes it easier to coordinate and prioritise effectively.
Different severity levels may trigger different processes or automation. When we launched Workflows, we found that most organizations wanted their automation driven by incident severities. For example, notifying the executive team when a Critical incident is declared.
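As a rough illustration of severity-driven automation, here is a minimal Python sketch. It is not how Workflows is implemented: the action functions, the `ACTIONS` mapping and `on_incident_declared` are all hypothetical names invented for this example, and the actions only return strings rather than contacting anyone.

```python
from enum import Enum
from typing import Callable


class Severity(Enum):
    MINOR = "Minor"
    MAJOR = "Major"
    CRITICAL = "Critical"


# Hypothetical actions: real ones would page or message people.
def post_to_incident_channel(incident_name: str) -> str:
    return f"Posted '{incident_name}' to the incident channel"


def notify_executive_team(incident_name: str) -> str:
    return f"Notified executive team about '{incident_name}'"


# Hypothetical mapping from severity to the actions run on declaration.
ACTIONS: dict[Severity, list[Callable[[str], str]]] = {
    Severity.MINOR: [post_to_incident_channel],
    Severity.MAJOR: [post_to_incident_channel],
    Severity.CRITICAL: [post_to_incident_channel, notify_executive_team],
}


def on_incident_declared(incident_name: str, severity: Severity) -> list[str]:
    """Run every action configured for this severity, in order."""
    return [action(incident_name) for action in ACTIONS[severity]]
```

Keeping the mapping in one place means a change in process (say, also notifying executives for Major incidents) is a one-line edit rather than a change to the response logic.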
Before we go any further, let’s be clear on a few points. Severities are subjective. It might not be clear whether something is Critical or Major, and in the vast majority of cases, it really doesn’t matter. Think of your severities as a model for the impact, and a way to signal that clearly and mobilise quickly.
Fixing the issue is the priority
A common failure case with severities is deliberating over what level makes sense, and forgetting to actually respond to the issue. If you’re spending any amount of time deliberating, pick the more severe level and focus your effort on fixing things.
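If your tooling models severities as ordered values, the "pick the more severe level" rule can be encoded directly. This is a small sketch under that assumption; `resolve_severity` is a hypothetical helper, and the level names are the ones recommended later in this guide.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Higher value means more severe.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def resolve_severity(candidates: list[Severity]) -> Severity:
    """When responders disagree, take the most severe level suggested."""
    # max() raises ValueError on an empty list, which is deliberate here:
    # there is nothing to resolve if nobody has suggested a level.
    return max(candidates)
```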
Designing severity levels
There are a number of things to consider when designing or updating the severity levels you use at your organization. If you're about to do this, here's our advice.
Have one set of severity levels for the whole organization
The primary benefit of severity levels is a set of common definitions shared across teams, so people can easily understand how urgent an incident is without digging into the detail.
Set your severity levels consistently across your organization. If your severity levels are different team-to-team, it'll be hard for newcomers to understand how bad any particular incident is, and vital time will be wasted.
Add clear guidance on how to set them
Add business-specific guidance on how to triangulate and set the severity of an incident, and make sure it’s front and centre in your organization. People may be stressed or tired, and will be looking for clear boundaries and instructions.
You should be able to summarise each level in one or two sentences. For example: 'Any financial loss above £100K is a Critical incident.'
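A rule that fits in one sentence is also easy to encode. The sketch below turns the example rule into a function. Only the £100K Critical threshold comes from the example above; the Major threshold and the function name are hypothetical values chosen for illustration, and a real organization would weigh more than financial loss.

```python
from decimal import Decimal
from enum import Enum


class Severity(Enum):
    MINOR = "Minor"
    MAJOR = "Major"
    CRITICAL = "Critical"


# From the example rule: any loss above £100K is Critical.
CRITICAL_LOSS_THRESHOLD_GBP = Decimal("100000")
# Hypothetical threshold, invented for this sketch.
MAJOR_LOSS_THRESHOLD_GBP = Decimal("10000")


def severity_for_financial_loss(loss_gbp: Decimal) -> Severity:
    """Classify an incident by estimated financial loss in pounds."""
    # "Above" is read as strictly greater than the threshold.
    if loss_gbp > CRITICAL_LOSS_THRESHOLD_GBP:
        return Severity.CRITICAL
    if loss_gbp > MAJOR_LOSS_THRESHOLD_GBP:
        return Severity.MAJOR
    return Severity.MINOR
```

Writing the boundary as a strict comparison makes the edge case explicit: a loss of exactly £100,000 is not Critical under this reading of the rule.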
For larger organizations, you may want to include examples that relate to a particular business area, to help folks calibrate across different incident types (e.g. an example of a critical security incident).
Choose the smallest number you can get away with
You need to apply the Goldilocks principle to severity levels.
You don't want too few: they won't capture the nuance between different incidents effectively. You don't want too many: they'll be confusing, the boundaries between them will be hard to discern, and they'll ultimately lose their power as a communication mechanism.
You want it just right. Three to five severity levels is the right number in our experience. Startups should start with three, and add severity levels as the maximum possible impact increases through company and product growth.
Choose human words over code words
Prefer human words like Low or Medium over code words like SEV-1 or P1. Some people will expect P1 to be more severe than P5 and others, well, won’t! 🙈
Human words communicate severity clearly, with little room for misinterpretation, which is important in a stressful situation.
Our recommended severities
Even when they approach severity levels from first principles, most organizations end up with very similar levels. We recommend adopting the following tried-and-tested severity levels:
- Minor: Issues with low impact, which can usually be handled within working hours. Most customers are unlikely to notice any problems (e.g. a slight drop in application performance).
- Major: Issues causing significant impact. Immediate response is usually required. Workarounds may be available, but they still have a negative impact on customers (e.g. an important sub-system failing).
- Critical: Issues causing very high impact to customers. Immediate response is required (e.g. a full outage or data breach).
There's a clear distinction between levels: enough for nuance, but not so many that they overload responders who are trying to decide which severity applies.
Use these levels as a starting place, and customize the description to your organization.
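If you keep these definitions in code or configuration, one possible shape is sketched below. The level names and examples come from the list above, with the summaries shortened; the `SeverityLevel` structure, the `rank` field and the `describe` helper are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityLevel:
    name: str
    rank: int  # higher means more severe
    summary: str
    example: str


SEVERITY_LEVELS = (
    SeverityLevel(
        "Minor", 1,
        "Low impact; can usually be handled within working hours.",
        "a slight drop in application performance",
    ),
    SeverityLevel(
        "Major", 2,
        "Significant impact; immediate response is usually required.",
        "an important sub-system failing",
    ),
    SeverityLevel(
        "Critical", 3,
        "Very high impact to customers; immediate response is required.",
        "a full outage or data breach",
    ),
)

# Case-insensitive lookup so "critical" and "Critical" both work.
_BY_NAME = {level.name.lower(): level for level in SEVERITY_LEVELS}


def describe(name: str) -> str:
    """Return the one-line guidance for a severity, given its name."""
    level = _BY_NAME[name.strip().lower()]
    return f"{level.name}: {level.summary} Example: {level.example}."
```

Keeping a single definition like this gives every team the same wording, and the summaries are where you would customize the descriptions for your own organization.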
As discussed in Defining an incident, we believe incidents are for the whole organization.
It’s important to share severities (e.g. Minor/Major/Critical) so that everyone can align on what the priority is, and so it’s possible to aggregate incidents meaningfully across different teams.
It can be useful to provide some contextual information to describe what each severity means in different contexts. This should include some explanation, and potentially also an example or two to help crystallise the severities.
Changing the severity of an incident
As we’ve mentioned, severities are a coarse mechanism for modelling the impact of an incident, and a mechanism for triggering certain actions and processes. In the interest of pragmatism, it’s worth highlighting that severities do not need to be fixed throughout the lifetime of the incident.
Many organizations have processes in place to deal with the aftermath of high severity incidents. That might mean reporting to the board, a regulator, or implementing a more heavyweight process around follow-up actions. If an incident was misclassified, you might end up triggering processes and work that needn’t happen and don’t serve the best interest of the organization.
When the dust has settled, it’s entirely reasonable to recategorise based on the information you have to hand. We’d suggest keeping track of the original and updated severities together though, as both are useful data points for learning.
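One way to keep the original and updated severities together is to record every change rather than overwrite a single field. The sketch below assumes a simple in-memory model; `Incident`, `SeverityChange` and their fields are hypothetical names, not an existing API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SeverityChange:
    severity: str
    reason: str
    changed_at: datetime


@dataclass
class Incident:
    name: str
    # Append-only history, oldest entry first.
    history: list[SeverityChange] = field(default_factory=list)

    def set_severity(self, severity: str, reason: str) -> None:
        """Record a new severity without discarding earlier ones."""
        self.history.append(
            SeverityChange(severity, reason, datetime.now(timezone.utc))
        )

    @property
    def original_severity(self) -> Optional[str]:
        return self.history[0].severity if self.history else None

    @property
    def current_severity(self) -> Optional[str]:
        return self.history[-1].severity if self.history else None
```

For example, an incident declared as Critical and later recategorised as Major would report `original_severity` as "Critical" and `current_severity` as "Major", with the reason for each change preserved for later review.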