If severities answer "how bad is this incident?", statuses answer "how close
are we to getting back to normal?"
Statuses keep stakeholders up-to-date
The last thing a busy incident response team needs is a stream of
stakeholders asking "do we know what's wrong here?" or "is this fixed yet?" The
right set of statuses preempts those questions.
Like severities, statuses are the common language across your organization for
making sure people know where in the incident lifecycle you are right now.
Choosing statuses that make sense for your organization
Choosing the right number of statuses is key. Too few, and you'll still be
fielding those disruptive questions; too many, and you'll waste time
deliberating whether an incident is "mostly mitigated" or "fully mitigated".
Like a child on a long car journey, start with the question "are we there
yet?" The most basic set of statuses is just "Ongoing" and "Resolved". That
lets the rest of your organization know whether there's a known issue right
now.
That's probably enough for a small team with short-lived incidents, but as your
organization and product grow in complexity, so will your incidents.
What's useful during an ongoing incident?
Let's start during the incident. The key questions the status needs to answer
are "do we know what's wrong?" and "is the impact still happening?" You
might solve that with:
- "Investigating": we think something is wrong, but we're not sure what it is
- "Fixing": we've figured out what's wrong and we're trying to fix it
- "Monitoring": we think it's fixed, but want to double-check!
These statuses should be simple and clear enough to make sense to responders
and to both internal and external stakeholders.
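If you track incidents in a tool or script, one way to keep these statuses
unambiguous is to write down what each one says about the two key questions.
The sketch below is only an illustration: the `OngoingStatus` and `describe`
names are made up for this example, and the true/false answers are one reading
of the descriptions above, not a fixed rule.

```python
from enum import Enum


class OngoingStatus(Enum):
    """Hypothetical statuses for an incident that is still in progress."""
    INVESTIGATING = "Investigating"
    FIXING = "Fixing"
    MONITORING = "Monitoring"


# For each status: (do we know what's wrong?, do we think impact is ongoing?)
# These answers are an assumption based on the status descriptions.
ANSWERS = {
    OngoingStatus.INVESTIGATING: (False, True),
    OngoingStatus.FIXING: (True, True),
    OngoingStatus.MONITORING: (True, False),
}


def describe(status):
    """Turn a status into the one-line answer a stakeholder is looking for."""
    cause_known, impact_ongoing = ANSWERS[status]
    cause = "cause identified" if cause_known else "cause not yet known"
    impact = "impact ongoing" if impact_ongoing else "impact believed to have stopped"
    return f"{status.value}: {cause}, {impact}"


for status in OngoingStatus:
    print(describe(status))
```

If two statuses end up with the same answers to both questions, that can be a
hint that you have more statuses than you need.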
What's useful once an incident is resolved?
As your organization scales, you'll want to put in place processes to help you
learn from incidents (we'll go into more detail about this later
in this guide).
If you're trying to hold incident leads accountable for following the
post-incident process you define, statuses can help you do that. You might split
"Resolved" into different stages:
- "Impact mitigated": things are back to normal, and it's time to start learning
- "Debrief completed": you've met to discuss what can be learned from this
incident, and any follow-ups have been assigned to the relevant team
- "Closed": the post-incident process is over 🎊
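To see how these stages could support accountability, here is a minimal
sketch that lists incidents whose post-incident process is unfinished, along
with the next step. Everything in it is hypothetical: the `PostIncidentStatus`
enum, the wording of the next steps, and the example incident names are
assumptions made for illustration.

```python
from enum import Enum


class PostIncidentStatus(Enum):
    """Hypothetical stages that split up "Resolved"."""
    IMPACT_MITIGATED = "Impact mitigated"
    DEBRIEF_COMPLETED = "Debrief completed"
    CLOSED = "Closed"


# The next step an incident lead owes at each stage (assumed wording).
NEXT_STEP = {
    PostIncidentStatus.IMPACT_MITIGATED: "Hold a debrief and assign follow-ups",
    PostIncidentStatus.DEBRIEF_COMPLETED: "Finish the post-incident process and close",
    PostIncidentStatus.CLOSED: None,  # nothing left to do
}


def outstanding(incidents):
    """Return (name, next step) for every incident that is not yet closed."""
    return [
        (name, NEXT_STEP[status])
        for name, status in incidents.items()
        if NEXT_STEP[status] is not None
    ]


# Example incident names are invented for this sketch.
incidents = {
    "Checkout errors": PostIncidentStatus.IMPACT_MITIGATED,
    "Slow search": PostIncidentStatus.CLOSED,
}
for name, step in outstanding(incidents):
    print(f"{name}: {step}")
```

A list like this makes it easy to see at a glance which incidents still need
a debrief and which only need closing out.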
In summary
When you're designing your organization's incident response process, it's
helpful to think about what incident statuses will be used for. They will:
- Frame incident updates: do your statuses help Incident Leads send updates at
the right time?
- Communicate with stakeholders: if all you could see about an incident was the
name, a short summary, the severity, and the status, would you know whether
you could help? Would executives understand the impact? Would your customer
support folks be able to keep your customers updated?