Metrics beyond MTTx: Measuring the quality of your incident management processes
Metrics like Mean Time to Detect, Acknowledge, and Recover (often grouped as “MTTx” metrics), along with uptime, are widely used to assess incident management performance. These metrics are popular because they’re well known, easy to calculate, and intuitive: if my MTTx goes up, that’s bad; if it goes down, that’s good.
But MTTx metrics can be misleading, especially when used in isolation to evaluate overall incident management performance. Definitions of “fixed” or “resolved” vary both within and across organizations, making MTTx metrics difficult to compare. Additionally, their detachment from the broader incident response process makes it hard for them to reflect how effectively an organization manages and learns from incidents.
If the answer isn’t MTTx, that raises the question: what does “good” incident management look like? What metrics can help us track the quality of our incident management processes? And how do these metrics vary across different company sizes?
We need metrics that are actionable — ones that define what "good" looks like and highlight areas for improvement. However, many organizations fall short here. In our recent survey, nearly half of respondents reported that their incident management metrics are calculated and reviewed, but no actions are taken based on them.
We’ve analyzed over 100,000 incidents — from Fortune 500 enterprises with thousands of employees to 10-person startups — and identified a set of industry benchmark metrics.
While it’s impossible to capture the perfect numbers for every organization and context, these benchmarks are designed to be directionally accurate and to help guide you in the right direction.
We’ve intentionally used the term “good” to describe these benchmarks — this is based on our hands-on experience with building an incident management product (and dealing with incidents ourselves!). We recognize the inherent subjectivity but believe these benchmarks offer practical value regardless.
We’ve grouped our benchmarks by each stage of the incident lifecycle, with recommendations on how to improve for each:
We’ve grouped (most of) the benchmarks in this report into one of three buckets based on customer size: <250, 250-999, and 1,000+ employees. We’ve done this to show the trends by company size, and to give you a relevant peer group to benchmark yourself against.
Within each bucket, we take the median of the per-customer metrics across all customers in that bucket, as it’s the easiest way for you to benchmark yourself against your peer group.
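For illustration, here’s a minimal sketch of that bucketing approach in Python. The customer names, employee counts, and metric values are entirely made up; the bucket boundaries simply mirror the ones used in this report.

```python
from statistics import median

# Hypothetical per-customer metric values (e.g. time to mobilize, in minutes),
# alongside each customer's employee count.
per_customer_metric = {"acme": 4.2, "globex": 6.1, "initech": 3.0, "umbrella": 7.5}
employee_counts = {"acme": 120, "globex": 450, "initech": 2300, "umbrella": 80}

def size_bucket(employees: int) -> str:
    """Map a company to one of the three size buckets used in this report."""
    if employees < 250:
        return "<250"
    if employees < 1000:
        return "250-999"
    return "1,000+"

# Group the per-customer values by bucket, then take the median within each
# bucket so one unusually large or small customer can't skew the benchmark.
buckets: dict[str, list[float]] = {}
for customer, value in per_customer_metric.items():
    buckets.setdefault(size_bucket(employee_counts[customer]), []).append(value)

benchmarks = {bucket: median(values) for bucket, values in buckets.items()}
print(benchmarks)
```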
But what about calculating per-customer metrics? While some — like the percentage of escalations occurring outside working hours — don’t rely on averages, most do.
The mean is useful for capturing skew, but it’s highly susceptible to outliers — making it less useful for understanding typical performance. This is especially relevant for behavioral metrics (like most in this report), as human behavior naturally includes outliers.
Take the following benchmark metric: “How often are updates being shared in incidents?”, measured for major & critical incidents. You could calculate it using either the mean or the median, and the choice matters.
We recommend aiming for internal updates every 15–20 minutes, but the mean won’t tell you whether you’re consistently achieving this—it only highlights how extreme your outliers are.
That doesn’t mean you should ignore outliers—far from it. But for benchmarking, the median is more useful. A helpful exercise is tracking the 90th/95th percentile alongside the median to see if the gap between them is growing or shrinking over time.
As a result, for any per-customer benchmarks in this report where we take an average (e.g. average time to mobilize) we take the median.
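To show why this matters, the sketch below compares the mean, median, and high percentiles for a made-up set of gaps between incident updates (in minutes); the numbers are invented purely to illustrate the effect of a long tail.

```python
from statistics import mean, median, quantiles

# Hypothetical gaps (in minutes) between consecutive updates across major and
# critical incidents. The long tail is typical of behavioural data.
update_gaps = [12, 15, 14, 18, 16, 13, 17, 15, 19, 65, 120]

# The mean is dragged up by the two outliers; the median reflects typical behaviour.
print(f"mean:   {mean(update_gaps):.1f} min")
print(f"median: {median(update_gaps):.1f} min")

# Tracking a high percentile alongside the median shows whether the tail of
# slow updates is growing or shrinking over time.
p90 = quantiles(update_gaps, n=10)[-1]   # 90th percentile
p95 = quantiles(update_gaps, n=20)[-1]   # 95th percentile
print(f"p90: {p90:.1f} min, p95: {p95:.1f} min")
```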
Historically, organizations have tracked measures like mean time to acknowledge (MTTA). While useful, it doesn't do a good job of highlighting how long it takes responders to actually get to their laptop, log in to Slack/Teams, and be in a position to start dealing with the issue.
Median time to mobilize, or MTTM, is a measure that more accurately captures the ‘lag time’ of starting your incident processes. It measures the time between an alert firing and the first human message being sent in a Slack/Teams incident channel, rather than just the time between the alert and someone acknowledging a page.
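If you wanted to approximate this yourself, a minimal sketch might look like the following. The incident records and timestamps are hypothetical; in practice you’d pull them from your alerting and chat tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical per-incident timestamps: when the first alert fired, and when a
# human posted the first message in the incident's Slack/Teams channel.
incidents = [
    {"alert_fired": datetime(2024, 5, 1, 9, 0), "first_human_message": datetime(2024, 5, 1, 9, 4)},
    {"alert_fired": datetime(2024, 5, 2, 14, 30), "first_human_message": datetime(2024, 5, 2, 14, 33)},
    {"alert_fired": datetime(2024, 5, 3, 2, 10), "first_human_message": datetime(2024, 5, 3, 2, 19)},
]

# Time to mobilize per incident, in minutes: alert firing -> first human message.
mobilize_minutes = [
    (i["first_human_message"] - i["alert_fired"]).total_seconds() / 60
    for i in incidents
]

mttm = median(mobilize_minutes)
print(f"Median time to mobilize: {mttm:.1f} minutes")
```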
On average, there is a 4-5 minute delay between the first alert of an incident firing and the first message being sent by a human in Slack or Teams.
How you perform against this metric will vary based on SLAs you have defined, the timezones you operate in, and a number of other factors. We’d suggest using it to understand your organizational dynamics and as an input to changes, rather than setting any specific targets.
Interestingly, this varies between companies of different sizes, with larger organizations tending to take longer to mobilize on average.
By this definition, incident.io responders mobilize in ~1.5 minutes, meaning there is a ~90s delay from an alert indicating something is broken to a human engaging in the incident response process. Not too shabby!
As you might expect, there’s a noticeable difference in the median time to mobilize at different times of the day, with overnight incidents taking nearly twice as long to get the incident response process going in earnest.
Unfortunately, you can’t control when things break and when you’re going to get paged.
Having a pulse on how many pages happen outside working hours is helpful for spotting problems that may contribute to poor on-call experiences and burnout.
There are many factors which influence this metric, like how often issues are correlated with change events, or where in the world your customers are compared to your engineering team.
If you notice more than 20% of pages occurring outside of working hours, you may want to take a closer look at what’s driving the incidents, and whether your teams are handling the load well.
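As a rough illustration, the out-of-hours percentage can be approximated like this. The page timestamps and the 09:00-18:00, Monday-to-Friday definition of working hours are assumptions; substitute your own schedule and timezones.

```python
from datetime import datetime

# Hypothetical page timestamps, already in the responding team's local timezone.
pages = [
    datetime(2024, 5, 1, 10, 15),
    datetime(2024, 5, 1, 23, 40),
    datetime(2024, 5, 2, 3, 5),
    datetime(2024, 5, 2, 16, 30),
    datetime(2024, 5, 4, 11, 0),   # a Saturday
]

def in_working_hours(ts: datetime) -> bool:
    """Assume working hours of 09:00-18:00, Monday to Friday."""
    return ts.weekday() < 5 and 9 <= ts.hour < 18

out_of_hours = [p for p in pages if not in_working_hours(p)]
pct = 100 * len(out_of_hours) / len(pages)
print(f"{pct:.0f}% of pages occurred outside working hours")  # flag if > ~20%
```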
At incident.io, we see approximately 36% of pages happening overnight. Clearly this is higher than the numbers above, but with an engineering team in London, the majority of our users in the US, and many of our alerts tied to user actions and errors, this makes sense.
We define “noisy alerts” as alerts that don’t signify something meaningful happening. Alert noise has more than one source, and calculating a single metric for it can be challenging, so for simplicity we roll everything up into one measure: how many alerts fire for each accepted incident?
False positives can drain your team’s energy and time. Ideally, every alert should signal a real issue worth investigating.
In a perfect world, every alert would signal a real incident, and every incident would be covered by an alert — a 1:1 ratio. But in reality, things are messier.
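The calculation itself is simple; the sketch below uses made-up monthly totals purely to show it.

```python
# Hypothetical monthly totals from your alerting and incident tooling.
alerts_fired = 340
accepted_incidents = 85

# The closer this ratio is to 1:1, the less noise responders are wading through.
alerts_per_incident = alerts_fired / accepted_incidents
print(f"{alerts_per_incident:.1f} alerts fired per accepted incident")
```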
As a rough benchmark:
If escalations are frequently missed by the first responder and continue up the escalation chain, it may indicate inefficiencies such as:
Optimizing escalation paths can ensure that:
Ideally, all escalations are acknowledged by the person who’s paged and escalations up the chain are an active choice to bring in additional support.
In reality, people will be unavailable some of the time, cell service isn’t perfect, and drift around on-call configuration is a fact of life.
Ideally, fewer than 20% of your escalations should be missed by the first line on-caller.
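One way to approximate this, assuming your paging tool records which escalation level ultimately acknowledged each page (the records and field name below are hypothetical):

```python
# Hypothetical escalation records: the level that ultimately acknowledged the
# page (1 = first line on-caller, 2+ = it escalated up the chain).
escalations = [
    {"id": "esc_1", "acknowledged_at_level": 1},
    {"id": "esc_2", "acknowledged_at_level": 1},
    {"id": "esc_3", "acknowledged_at_level": 2},
    {"id": "esc_4", "acknowledged_at_level": 1},
    {"id": "esc_5", "acknowledged_at_level": 3},
]

missed_by_first_line = [e for e in escalations if e["acknowledged_at_level"] > 1]
pct_missed = 100 * len(missed_by_first_line) / len(escalations)
print(f"{pct_missed:.0f}% of escalations were missed by the first line")  # aim for < 20%
```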
This metric is a measure of how long it takes to determine if an incident is “real”. It reflects how efficiently your team can respond to, triage and recognize issues, and tracking it can help you identify delays in your incident response.
We count the start time here as the point at which an alert fired, and the end time as the moment an incident moves out of a “triage” status, showing that the issue has been confirmed as a real incident.
Quickly determining whether something is truly an incident is key to good incident management.
It reflects the team’s ability to make fast (and ideally accurate) judgments — a skill influenced by factors like expertise, observability, and experience.
In an ideal world, if an incident is a high severity, we want to know about it as soon as possible so we can respond appropriately. In reality, many incidents start as lower severity and get “upgraded” as the full extent of the issue becomes clear or as the duration of the impact increases.
This metric looks at how long it took for critical incidents to reach their final severity status.
When incidents have significant impact, we’d like to know about them as soon as possible so we can respond proportionately.
Delays here can mean slower escalations, greater impact on customers, and increased brand reputation risk.
By default, the incident lead is responsible for coordinating the response and handling communication. Assigning a lead at the start of an incident helps bring the situation under control faster, and reassures the rest of the organization that the situation is being handled.
Regardless of the size of organization, you should be assigning an incident lead within the first 5 minutes of the incident being declared.
Providing regular updates during an incident can be challenging, but it keeps everyone aligned and helps improve the speed and quality of the response.
We looked at all incidents, broken down by severity, and established the median time between updates.
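If you want to reproduce this on your own data, a minimal sketch might look like the following; the incidents, severities, and update times are invented for illustration.

```python
from statistics import median

# Hypothetical update timestamps (minutes since the incident was declared),
# grouped by incident and labelled with the incident's severity.
incidents = [
    {"severity": "critical", "update_times": [0, 12, 25, 41, 55]},
    {"severity": "critical", "update_times": [0, 18, 40]},
    {"severity": "minor",    "update_times": [0, 90, 200]},
]

gaps_by_severity: dict[str, list[float]] = {}
for incident in incidents:
    times = incident["update_times"]
    # Gaps between consecutive updates within this incident.
    gaps = [b - a for a, b in zip(times, times[1:])]
    gaps_by_severity.setdefault(incident["severity"], []).extend(gaps)

for severity, gaps in gaps_by_severity.items():
    print(f"{severity}: median {median(gaps):.0f} minutes between updates")
```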
Frequent updates ensure everyone knows the status of the incident, especially when it’s resolved. Clear communication reduces confusion and improves coordination.
Naturally, incident severity plays a role as well—the more severe the incident, the more crucial it is to provide frequent updates.
Given the choice, most would prefer their teams to spend time on planned, value-adding work for their organization rather than responding to incidents.
Here, we break down the aggregate people-hours spent per incident by organization size and incident severity. There's too much nuance to set benchmarks or suggested targets, so this data is shared for information and insight only. And since this metric isn't being used for benchmarking, we show the mean hours spent per incident rather than the median.
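A rough sketch of that aggregation, using made-up incidents and people-hours figures:

```python
from statistics import mean

# Hypothetical incidents with the total people-hours logged against each,
# plus the organization's size bucket and the incident's severity.
incidents = [
    {"org_size": "<250",   "severity": "minor",    "people_hours": 3.5},
    {"org_size": "<250",   "severity": "major",    "people_hours": 14.0},
    {"org_size": "1,000+", "severity": "critical", "people_hours": 62.0},
    {"org_size": "1,000+", "severity": "minor",    "people_hours": 6.0},
]

# Group by (size bucket, severity) and take the mean, since this figure is
# shared for insight rather than benchmarking.
grouped: dict[tuple[str, str], list[float]] = {}
for i in incidents:
    grouped.setdefault((i["org_size"], i["severity"]), []).append(i["people_hours"])

for (org_size, severity), hours in grouped.items():
    print(f"{org_size} / {severity}: {mean(hours):.1f} mean people-hours per incident")
```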
Understanding the time (and cost) spent on incidents helps you gauge the impact on productivity and resources.
This can vary based on company size and incident severity.
When looking at the distribution of mean hours spent on incidents by severity, there's a clear trend: most minor incidents are worked on for fewer than 8 people-hours (a full workday).
For major and critical incidents, however, there is more variability, with these often spanning multiple workdays' worth of work.
While the ideal target is zero, incidents don’t always respect working hours. Keeping an eye on after-hours work helps you manage team burnout and workload balance.
Again, there’s too much nuance here to set benchmarks or suggested targets, so the data here is shared for information only.
We recommend monitoring this closely and investigating if the average exceeds 15-20%.
Relying on a small group of people for incident response can lead to burnout and key-person risk.
While dedicated incident teams (e.g. SREs) can work in some setups, distributing the workload helps build resilience and prevents individuals from becoming overburdened.
You should aim for more than half of your responders to share the majority of the workload.
One way to evaluate this is to look at the percentage of responders that account for 80% of the overall incident workload.
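Here’s a minimal sketch of that calculation over a made-up set of responders and workload hours:

```python
# Hypothetical hours of incident workload attributed to each responder.
workload_by_responder = {
    "alice": 40, "bob": 35, "carol": 20, "dan": 12, "erin": 8,
    "frank": 5, "grace": 3, "heidi": 2,
}

total = sum(workload_by_responder.values())
cumulative = 0.0
responders_needed = 0

# Walk through responders from busiest to least busy until we've covered 80%
# of the total workload, then see what share of the responder pool that took.
for hours in sorted(workload_by_responder.values(), reverse=True):
    cumulative += hours
    responders_needed += 1
    if cumulative >= 0.8 * total:
        break

share = 100 * responders_needed / len(workload_by_responder)
print(f"{share:.0f}% of responders account for 80% of the incident workload")
```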
Smaller companies tend to distribute workload more evenly, but larger organizations may have dedicated teams or more outliers, adding nuance to this metric.
High-severity incidents often impact customers, making timely communication essential. Proactively updating your status page (or another communications channel) can reduce customer stress, prevent support overload, and build external trust.
This metric (filtered for major/critical incidents) measures how long it takes organizations to publish their first status page update—specifically for incidents where one was posted.
When things go wrong, letting customers know you’re on it, quickly, helps maintain confidence.
Interestingly, mid-sized organizations show the slowest response times, likely due to growing complexity without the streamlined processes seen in large enterprises.
Regular status page updates can build trust and manage customer expectations by showing you’re actively addressing the issue.
Debriefing, analyzing, and documenting incidents offers valuable insights, but these activities can be costly. Most organizations reserve them for the most critical incidents, using severity as a guide for when they’re worth the effort.
This metric examines how often post-incident processes — like running debriefs and writing post-mortems — are conducted for high-severity incidents.
After an incident is resolved, there are often steps you can take to learn and improve for the future.
By doing so, you may be able to prevent the incident from happening again or improve your response to similar incidents.
Smaller companies often write fewer post-mortems since learning and analysis happen more informally.
In larger organizations, aiming to complete post-mortems for 80% of major and critical incidents is a sensible target.
As expected, larger organizations proportionally write the most post-mortems for high severity incidents. This is often the result of policy, regulatory requirements and other organizational processes.
After an incident, fixing and improving our processes and systems is important. This metric tracks the median time (in days) to complete follow-up items, from creation to completion.
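Computing this from your own follow-up data is straightforward; the sketch below uses hypothetical creation and completion dates.

```python
from datetime import date
from statistics import median

# Hypothetical follow-up items with creation and completion dates.
follow_ups = [
    {"created": date(2024, 5, 1), "completed": date(2024, 5, 6)},
    {"created": date(2024, 5, 3), "completed": date(2024, 5, 20)},
    {"created": date(2024, 5, 10), "completed": date(2024, 5, 17)},
]

days_to_complete = [(f["completed"] - f["created"]).days for f in follow_ups]
print(f"Median time to complete follow-ups: {median(days_to_complete)} days")
```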
Follow-up items should be completed promptly to avoid getting lost in the backlog, and to reduce the likelihood of repeat incidents. Of course, there’s plenty of nuance here to accommodate, like weighing up actions against other priorities.
Follow-up items are completed within 2 weeks, regardless of organization size or priority.