Mastering the SEV0

Think back to the worst incident your organization has ever faced. If you were lucky, it might have been just a service outage. If you were less fortunate, it could have involved the loss, theft, or compromise of sensitive customer data, or even something as severe as the recent global CrowdStrike outage.

Whilst uncommon, incidents of this severity happen to every organization at some point. A situation this critical is what many refer to as a SEV0, the most severe of incidents.

Fundamentally, SEV0s are not normal. Your usual incident response lifecycle will inform, but not dictate, how you respond.

Inside the SEV0—scale, pressure and uncertainty

In case you’ve not experienced a SEV0-like situation (firstly, be grateful!), let’s paint a picture of how one normally plays out.

The first thing you’ll notice about a SEV0 is the sheer number of people involved. Whether it’s executives looking for updates, legal folks collecting evidence for reporting, or customer support seeking up-to-date information to relay to customers, there is a lot of activity over-and-above just(!) fixing the issue.

Uncertainty also plays a significant role in a SEV0. It’s likely that what’s happening has never happened before, or at least never this badly, making it harder to understand and fix. Timelines are nearly impossible to share with any certainty, and most updates are along the lines of “we’re still trying to understand what’s happening” as you gradually uncover information.

SEV0s commonly exert pressure from outside your organization too. Whether it’s customers tweeting about their inability to access your service, your biggest customer calling the CEO directly, or a regulator breathing down your neck, this added pressure compounds the already electric atmosphere.

On top of all of this, the situation is worsened by unfamiliarity with the incident process. Most people who get involved don’t know how things normally work, where to go for information, or how hard it is for those fixing the issue to also be on the hook for communicating progress. Your well-rehearsed plans rarely survive contact with reality unscathed.

To summarise, a SEV0 is going to mean high pressure, high uncertainty and a healthy side of sweaty palms.

ℹ️ A helpful but subjective model for incidents

The concept of severity levels such as SEV0, SEV1, and SEV2 originated from ITIL (Information Technology Infrastructure Library) practices. These levels help organizations manage incidents based on their impact and urgency. For a practical look at the level just below SEV0, see “What is a SEV-1 incident”.

SEV0 typically signifies a catastrophic event that demands immediate attention – the worst-case scenario. But one of the biggest challenges with severity levels is their inherent subjectivity, especially as you near the SEV0 end of the spectrum. When the event you’re facing is something you’ve never faced before, it’s unlikely you’ll have a complete set of criteria on hand to capture it.

That said, when you’re in a SEV0, you probably know. And if you don’t know for sure, you should probably assume the worst and get on with fixing.

Our advice for effectively handling the most severe of incidents

Between our own experience, and the collective experiences of our customers, we’ve been through our fair share of SEV0s.

And because of their infrequent nature, and disproportionate levels of human and system-level complexity, they rarely fit neatly into the normal way you run incidents.

Fundamentally, SEV0s are not normal. Your usual incident response lifecycle steps will inform, but not dictate, how you respond. Adapting your response on-the-fly is going to be necessary.

Here are a few of our pro tips for limiting the overall impact and keeping stress levels under control.

Embrace what works

Unless you’re regularly rehearsing SEV0s with all the realism of the actual event, your existing incident process will almost certainly break down in some fashion. After all, many of the people joining these incidents won’t have been in an incident recently, if at all.

Many organizations find the number of participants, the sheer volume of information, and the complexity of the situation hard to manage inside even the best tooling – incident.io included. Sometimes a Google Doc or other free-form way to track information is the best way to collaborate at the scale of these incidents.

Rather than fighting it, lean into what works, and prioritize overall efficiency over rigidity. SEV0s are the most ‘war-time’ of incidents you’re going to encounter, and embracing what works for most people is more important than dogmatically following the process.

Factor out smaller incidents

Whether you explicitly define them or not, sub-groups will form within a SEV0, and thinking in advance about how to structure incident response teams makes this far less chaotic. Whether it’s an engineering team that spins up its own channel to investigate a system it owns, or the customer support team running its process face-to-face, the formation of “sub-incidents” is a natural tendency and a necessary process.

Whilst groups will likely form organically, it can be helpful to be explicit about the process. This looks like explicitly defining different streams of work, and being clear about who is leading each. It’s helpful to keep a record of each stream and lead so folks know what’s happening, where it’s happening, and who they should talk to for more information.
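As a rough illustration, the record of streams and leads described above could be as simple as the following sketch. Every name here (`Workstream`, `StreamRegistry`, the channel names and people) is hypothetical and invented for the example; it isn’t how any particular tool models this, and a shared doc with the same three columns would do the job just as well.

```python
from dataclasses import dataclass, field


@dataclass
class Workstream:
    """One explicitly defined stream of work within a SEV0 (illustrative fields only)."""
    name: str    # what is happening
    lead: str    # who to talk to for more information
    where: str   # where it is happening, e.g. a chat channel or a room
    status: str = "in progress"


@dataclass
class StreamRegistry:
    """A single shared record of every stream, keyed by stream name."""
    streams: dict[str, Workstream] = field(default_factory=dict)

    def add(self, stream: Workstream) -> None:
        # Re-adding a stream with the same name replaces the old entry,
        # which is how a change of lead or location gets recorded.
        self.streams[stream.name] = stream

    def summary(self) -> str:
        # One line per stream, sorted by name so the output is stable.
        ordered = sorted(self.streams.values(), key=lambda s: s.name)
        return "\n".join(
            f"{s.name}: led by {s.lead}, in {s.where} ({s.status})" for s in ordered
        )


if __name__ == "__main__":
    registry = StreamRegistry()
    registry.add(Workstream("Database recovery", "Sam", "#sev0-db"))
    registry.add(Workstream("Customer comms", "Priya", "#sev0-comms"))
    print(registry.summary())
```

The point is less the code than the shape of the data: one line per stream that answers what, where, and who, kept somewhere everyone joining the incident can find it.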

Factoring out key subgroups can reduce the number of overlapping conversations and keep the lines of communication under control.

Communicate, communicate, communicate

Given the severity of the incident, the pressure to fix the issue will be at its highest. And with executives and other incident participants all within earshot, your bias to action and saving the day is likely to be high.

The combination of pressure and impact often pushes incident communication down the priority list, whether that's purely internal comms to keep everyone in the loop, or sharing updates with your customers. And when the frequency of updates decreases, anxiety and stress levels increase. And all of this ends up piling even more pressure on the response.

Whilst every second spent on communicating is a second less spent on fixing, cutting back on communication is a false economy. Teams using incident response tools can broadcast status updates without pulling engineers off the problem. With more information flow comes greater distribution of load, lower pressure, and an overall smoother response.

Break the tension

When you’re dealing with a critical incident, emotions run high. You’ll be acutely aware of the magnitude and severity of the impact, whether that’s customers being unable to do something, the company losing money, or something else. High-stress environments can lead to emotional responses like panic, which in turn impair decision making; all of which is counter to a well-run incident.

Breaking the tension that inevitably builds up over time is critical in managing your overall response.

Whether it’s a senior leader reminding everyone that they’re trusted and appreciated, a manager stepping in to do a coffee run, or the judicious use of a joke, breaking the tension can vastly improve the mood and effectiveness of the team responding.

Humor especially is known for its physiological and psychological benefits in reducing stress. But be careful – nobody wants a clown in the team when you’ve just leaked all your data.

Debrief on the process as much as the problem

In the aftermath of a SEV0, the pressure to prevent recurrence puts a huge focus on technical actions, even though the incident debrief that follows is where organizational learning actually happens. System changes, technical controls and improved monitoring are frequently at the top of the list. Teams tend to follow a more consistent process when using incident response tools, and typically rely on a central source of truth for incidents.

But the SEV0 presents a useful learning opportunity for the organization too. Applying the same level of inspection to the response process itself, the actions of the participants, and the emotional challenges during an incident debrief can lead to valuable learning and the identification of improvement areas.

In summary

Understanding what a SEV0 is, where these severity levels come from, and how to effectively manage them can make all the difference. It’s about being prepared, staying calm under pressure, and continuously improving your processes. The next time you find yourself in the middle of a SEV0 storm, remember: clarity, communication, and coordination are your best allies.

But also, this post will probably be a distant memory in the midst of the chaos. Good luck!
