Designing your incident severity levels

We know a thing or two about incident response. As such, we're often asked to advise when companies are mapping out their incident response lifecycle.

A common question is "How do you design your incident severity levels?" It's a great question given how central they are to incident management and response!

In this article, we walk through:

What incident severity levels are
What to consider when designing your severity levels
Our recommended incident severity levels

We wrote this article in response to a question asked in our Slack Community. Click here to join hundreds of technology leaders discussing best practices for incident response! ✨

What are incident severity levels?

Severity levels measure the impact of an incident. They answer the question "how bad is this incident?" If you've ever seen SEV-1, SEV-2, P1, and so on, those are severity levels.

Severity levels are used for communicating impact to your coworkers, customers, and stakeholders. In short, severity levels allow you to appropriately categorize incidents such as security breaches, data loss, and system outages.

Well-designed severity levels create shared expectations between the people responding to an incident. This makes it easier to coordinate and prioritize effectively.

Different severity levels may trigger different processes or automation. When we launched Workflows, we found that most organizations wanted their incident response automation driven by incident severities. For example, notifying the executive team when a Critical incident, such as an outage, is declared.
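To make the idea concrete, here is a minimal Python sketch of severity-driven automation. It is not the Workflows product or its API: the `Severity`, `Incident`, `RULES`, and `actions_for` names are hypothetical, and every action other than the executive notification is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a higher value means greater impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Incident:
    title: str
    severity: Severity


# Hypothetical routing table: (minimum severity, action to run).
# Only the executive notification comes from the article; the rest is assumed.
RULES = [
    (Severity.CRITICAL, "notify the executive team"),
    (Severity.HIGH, "page the on-call engineer"),
    (Severity.LOW, "post in the incidents channel"),
]


def actions_for(incident: Incident) -> list[str]:
    """Return every action whose minimum severity the incident meets."""
    return [action for minimum, action in RULES if incident.severity >= minimum]


outage = Incident(title="Checkout is down", severity=Severity.CRITICAL)
print(actions_for(outage))
```

A Critical incident matches all three rules here, while a Medium one would only be posted in the channel. Keying the rules on a minimum severity means a new level can be added without rewriting each rule.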

Common naming conventions for incident severity levels: P1 to P4 explained

P1, P2, P3, and P4 are commonly used incident severity levels in various industries. While we don't use any of these ourselves (see our guide on severity vs priority), we're including them here since you're likely to come across them at some point. More on this later.

P1 incidents: Understanding critical incidents and immediate response needs

P1 incidents, also known as critical incidents, are the most severe and require immediate attention. They typically involve a complete system failure or a critical business process being disrupted. P1 incidents have a high impact on the organization and require immediate resolution to minimize downtime and financial losses.

Example: A total system outage where a major production server goes down. This in turn affects the entire organization's ability to operate, resulting in significant financial losses or a breach of service level agreements (SLAs).

P2 incidents: Managing high-priority issues with significant business impact

P2 incidents are considered high-priority incidents. They have a significant impact on the organization but are not as critical as P1 incidents. P2 incidents may involve partial system failures or disruptions to important business processes. They also require prompt attention and resolution to prevent further impact on the organization.

Example: A key business application is inaccessible, causing inconvenience and impacting employee productivity.

P3 incidents: Addressing medium-priority incidents with moderate impact

P3 incidents are medium-priority incidents. They have a moderate impact on the organization and may involve minor system failures or disruptions to non-critical business processes. P3 incidents should be resolved in a timely manner to prevent any further impact on the organization's operations.

Example: A small software bug that affects usability but has low-effort workarounds.

P4 incidents: Handling low-priority incidents with minimal business disruption

P4 incidents are low-priority incidents. They have a minimal impact on the organization and may involve minor issues or requests that do not significantly affect business operations. P4 incidents can be resolved within a reasonable timeframe without causing any major disruptions.

Example: Suggestions for new features or improvements with low business impact.

| Severity Level | Impact                     | Example                                              |
|----------------|----------------------------|------------------------------------------------------|
| P1 / SEV-1     | Critical incident           | Total system failure, all customers affected          |
| P2 / SEV-2     | Major incident              | Subset of users affected, critical functionality down |
| P3 / SEV-3     | Moderate incident           | Minor bug with moderate business impact               |
| P4 / SEV-4     | Low-priority incident       | Suggestions or non-urgent issues                      |

How to design effective incident severity levels: Best practices and tips

We have the privilege of talking to people working on their incident response processes all the time. Here are the top four tips we’ve learned from those conversations.

Have one set of severity levels for the whole organization

The primary benefit of severity levels is sharing common definitions across teams, so people can easily understand the urgency associated with an incident without going into too much detail.

Set your severity levels consistently across your organization

If your severity levels are different team-to-team, it'll be hard for newcomers to understand how bad any particular incident is, and vital time will be wasted.

Add clear guidance on how to set them

Add business-specific guidance on how to triangulate and set the severity of an incident, and make sure it’s front and centre in your organization. Responders may be stressed or tired, and will be looking for clear boundaries and instructions.

You should be able to summarize each level in 1-2 sentences. For example: "Any financial loss above £100K will be considered a Critical incident."
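Guidance this crisp can also be written down as a rule. The Python sketch below is only an illustration: the £100K Critical threshold comes from the example above, while the lower thresholds, the level they map to, and the `suggest_severity` name are assumptions you would replace with your own business rules.

```python
# (loss threshold in GBP, level), checked from most to least severe.
LOSS_GUIDANCE_GBP = [
    (100_000, "Critical"),  # from the example guidance above
    (10_000, "High"),       # assumed for illustration
    (1_000, "Medium"),      # assumed for illustration
]


def suggest_severity(estimated_loss_gbp: float) -> str:
    """Suggest a severity from estimated financial loss alone."""
    for threshold, level in LOSS_GUIDANCE_GBP:
        # "Above" is read strictly: exactly £100K does not count as Critical.
        if estimated_loss_gbp > threshold:
            return level
    return "Low"


print(suggest_severity(250_000))
print(suggest_severity(500))
```

A rule like this is a starting suggestion for a tired responder, not a replacement for judgment. Real guidance usually weighs more than one signal, such as customer impact or data exposure.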

Why fewer incident severity levels work best: the Goldilocks Principle

You need to apply the Goldilocks principle to severity levels.

You don't want too few: they won't capture the nuance between different incidents effectively. You don't want too many: the boundaries between them become hard to discern, and the levels ultimately lose their power as a communication mechanism.

You want it just right. In our experience, three to five severity levels is the right number. Startups should start with three, and add severity levels as the maximum possible impact grows with the company and product.

Why human terms are better than codes for incident severity levels

Remember earlier when we explained what P1, P2, P3 and P4 mean?

Feel free to ignore that!

Instead, choose human words like Low, Medium, and High over codewords like SEV-1 or P1. Some people will expect P1 to be more severe than P5, and others, well, won’t!

Human words communicate severity clearly, with little room for misinterpretation, which matters in a stressful situation. When an incident happens and people are frantically trying to deduce what a Sev-5 is, you’ll wish that you had categorized major issues as a "Critical incident" instead.

Recommended incident severity levels for effective incident management

Even when they approach severity levels from first principles, most organizations end up with very similar ones. We recommend adopting the following tried-and-tested severity levels:

Low: minimal impact and can be handled in work hours
Medium: business impact (either internal or external) but doesn’t significantly impact normal operations
High: impact warrants immediate response and may disrupt normal operations. As the name suggests, these are high priority
Critical: a major incident where executives are involved and there could be reputational damage or significant impact to the business. This level generally isn’t necessary unless your organization has 100+ people.

There's a clear distinction between levels: enough for nuance, but not so many that they overload responders who are trying to decide which severity applies.

Use these levels as a starting point, and customize the descriptions to your organization.
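One way to keep the defaults and your own wording side by side is to treat the levels as data. The Python sketch below is a hedged illustration: the descriptions paraphrase the recommended levels, while the `SeverityLevel` type, the `immediate_response` flag, the `customize` helper, and the customized wording are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SeverityLevel:
    name: str
    description: str
    immediate_response: bool  # assumed flag: True when responders act right away


# Descriptions paraphrase the recommended levels; the flag values for Medium
# and Critical are illustrative assumptions.
DEFAULT_LEVELS = [
    SeverityLevel("Low", "Minimal impact; can be handled in work hours.", False),
    SeverityLevel("Medium", "Business impact, but normal operations continue.", False),
    SeverityLevel("High", "Warrants immediate response; may disrupt operations.", True),
    SeverityLevel("Critical", "Major incident; executives involved.", True),
]


def customize(levels, name, description):
    """Return a new list with one level's description replaced."""
    if name not in {level.name for level in levels}:
        raise ValueError(f"unknown severity level: {name}")
    return [
        replace(level, description=description) if level.name == name else level
        for level in levels
    ]


# Hypothetical organization-specific wording for the top level.
levels = customize(DEFAULT_LEVELS, "Critical", "Any financial loss above £100K.")
for level in levels:
    print(f"{level.name}: {level.description}")
```

Because the dataclass is frozen and `customize` returns a new list, the shared defaults stay untouched while each organization layers its own descriptions on top.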

We hope that was helpful. If you've got any questions on how to design your incident severity levels, come ask us in our Slack Community and we'd be happy to help.

TL;DR:

Incident severity levels measure the impact of an incident and are used for communication and categorization.
When designing severity levels, it is important to have one set for the whole organization, provide clear guidance on how to set them, and choose the right number of levels.
The recommended severity levels are Low, Medium, High, and Critical, with clear distinctions between each level.
Customization of severity level descriptions is encouraged.
