How organizations are measuring their incident management
We surveyed world-class organizations about which metrics they track to measure the effectiveness of their incident response. In this report, we've analyzed the responses from companies like Ramp, Etsy, Skyscanner and others and share the insights their responses provided.
Chris Evans
Co-founder & CPO, incident.io
What are teams looking at to better understand how well they're doing at incident management?
This is a question we get asked all the time and one we empathize with deeply. While there are several well-established incident metrics that organizations commonly use, like MTTR and raw counts of incidents, a vast number of them are ineffective, or worse, entirely misleading.
The problem? These metrics tend to surface high-level insights that are difficult to action, and in some cases lead you to the wrong conclusion altogether. But it's a tough situation to resolve, with few people offering viable alternatives that can be easily obtained or digested at different levels across the organization, and that provide high signal on the thing they're modeling.
So we wondered how teams sorted through all of this noise to figure out what would deliver the most insight and impact. To get some answers, we put together a simple survey:
Admittedly, some of the responses weren't too surprising. But others got us thinking and challenged some of the assumptions we'd made. Here, we'll share the findings that we collected from world-class organizations such as Etsy, SumUp, Ramp, and more.
With these insights in hand, we hope you can make some improvements to your existing metrics, or at the very least be confident you're on the right track with what you're already looking at! As with most things in the realm of incidents, context is paramount, so what works at one company won't necessarily translate well at another.
To start, we asked what metrics people track within their teams. We wanted to uncover what individual teams really care about measuring and looking at, absent the pressures of management or other people across the organization. Metrics have a tendency to become "politically charged" inside many organizations, often with leaders wanting them to convey a particular narrative. It's understandable why this happens and what the consequences likely are.
So with this in mind, let's dive into what teams care about tracking for their own benefit...
Time to detect, time to resolve, time to acknowledge, mean time between failures...
Of the responses we received, the overwhelming majority mentioned tracking MTTx at their organizations. This is a broad family of metrics that average out a handful of incident response data points. In other words, they capture how long, on average, it takes you to complete a specific action from start to finish. For example, MTTD (mean time to detect) measures how long a problem exists before the appropriate parties are alerted to it. MTTA (mean time to acknowledge) measures the time between an issue being detected and someone starting to work towards resolving it. In our survey, half of all respondents specifically cited MTTR, or some variation of resolve, resolution and restore.
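To make the definitions concrete, here is a minimal sketch of how these averages could be computed from incident timestamps. The records and field names (`started`, `detected`, `acknowledged`, `resolved`) are hypothetical, and measuring MTTR from the start of the problem is an assumption: some teams measure it from detection instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are assumptions for illustration.
incidents = [
    {
        "started": datetime(2024, 1, 3, 9, 0),
        "detected": datetime(2024, 1, 3, 9, 4),
        "acknowledged": datetime(2024, 1, 3, 9, 6),
        "resolved": datetime(2024, 1, 3, 9, 30),
    },
    {
        "started": datetime(2024, 1, 10, 14, 0),
        "detected": datetime(2024, 1, 10, 14, 10),
        "acknowledged": datetime(2024, 1, 10, 14, 14),
        "resolved": datetime(2024, 1, 10, 14, 50),
    },
]


def mean_minutes(records, start_field, end_field):
    """Average gap in minutes between two timestamps across all records."""
    return mean(
        (r[end_field] - r[start_field]).total_seconds() / 60 for r in records
    )


mttd = mean_minutes(incidents, "started", "detected")       # problem exists -> alert
mtta = mean_minutes(incidents, "detected", "acknowledged")  # alert -> someone picks it up
mttr = mean_minutes(incidents, "started", "resolved")       # assumed: start -> resolved

print(f"MTTD={mttd} min, MTTA={mtta} min, MTTR={mttr} min")
```

Whatever definitions you settle on, the important part is that the start and end points of each metric are agreed upon and applied consistently.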
In many ways, this is not wholly surprising. These metrics tend to be the most straightforward to track and aren't resource intensive to gather. And if we're being honest, looking at how long it takes you, on average, to resolve an incident just makes sense. Organizations want their response to be as fast as possible and will understandably look towards data points that give them the best measure of this, with the lowest friction. If you're trying to understand whether or not you're doing something as efficiently as possible, homing in on how long it takes you to do so feels right. It's simple: a high or upward-trending average suggests that you need to improve things, while a low or decreasing average means that things are working out well.
But this measure hides a lot of nuance that's impossible to distill if you just look at the raw numbers. Simply put, there are too many uncontrollable variables in how long it takes you to respond to incidents to make MTTR a useful metric on its own.
Takeaway
Based on the responses, it appears that Mean Time to... is still the workhorse for organizations when it comes to incident metrics, but MTTR in particular stands out.
The survey responses show that businesses still heavily rely on MTTR to provide incident insights. And while this may work out just fine for some organizations, the problem is these metrics tend to skew very easily, leaving you with insights that aren't representative of the actual state of things. We'll use a hypothetical situation to illustrate the point.
Let's imagine, in January, you had seven incidents, and across those incidents it took you an average of 20 minutes to resolve them. You decide you'd like to improve your MTTR of 20 minutes and reduce it down to 15, an objectively well-intentioned and sensible goal, since your customers haven't been enjoying the downtime.
Now in February, your follow-up actions from the incidents in January, along with investments in reliability and knowledge sharing, mean you have a much better month. You have one small incident that's swiftly resolved within three minutes, and in the final week of the month you have a particularly gnarly one. The person on-call takes longer than anticipated to fix the issue as a runbook is out of date, and it takes them 45 minutes to get things restored.
Your February MTTR ends up being 24 minutes, the average of 3 and 45. You've gotten four minutes worse.
Did things get worse? MTTR would suggest so, but you've had fewer incidents in February, less aggregate downtime, and when it comes to the numbers the average was largely inflated due to an unpredictable and unique set of circumstances. You could argue February was a much better month than January.
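The arithmetic behind this scenario can be checked in a few lines. This is only a sketch of the hypothetical months described above: January's individual durations aren't given, so only its incident count and average are used, and the variable names are ours.

```python
from statistics import mean

# January: only the incident count and the average are given in the scenario.
january_count = 7
january_mttr = 20  # minutes
january_downtime = january_count * january_mttr  # 140 minutes in total

# February: one quick incident and one gnarly one.
february_durations = [3, 45]  # minutes
february_mttr = mean(february_durations)     # 24 minutes
february_downtime = sum(february_durations)  # 48 minutes in total

print(f"January:  {january_count} incidents, MTTR {january_mttr} min, "
      f"downtime {january_downtime} min")
print(f"February: {len(february_durations)} incidents, MTTR {february_mttr} min, "
      f"downtime {february_downtime} min")
```

The average went up while both the number of incidents and the total downtime went down, which is exactly why a single mean can point in the wrong direction when the sample is this small.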
To be clear, we're not suggesting you shouldn't use MTTR. Like all metrics, it is a model of a complex world, so you should be aware that it is fallible. MTTR going up might mean things are getting worse, or it might happen when things are actually getting better. If you're using it, it's worth being aware of this fact, and ensuring it's the starting point for a deeper investigation.
Respondents are also looking past metrics that measure their incident response, and are tracking what happens after the incident as well. Several mentioned post-mortem and follow-up action completion times as primary measures they cared about.
But why are teams looking at these numbers in the first place? It ultimately boils down to two things: learning from incidents and reducing the likelihood of repeat incidents. When we dug a little deeper, we found post-mortem completion rates are being used as a numerical proxy for how much teams are learning from incidents. A higher number of post-mortem documents means more learning and vice versa.
Done well, post-mortem documents can unearth several areas of improvement teams otherwise may not have found. Issues like inadequate training, system vulnerabilities, misconfigurations, and process improvements can be discovered through a post-mortem.
It's important to acknowledge this kind of metric masks some complexity. Writing a document doesn't mean you're learning, and learning doesn't necessitate a document being written. So, like all metrics, it requires cultural alignment and a degree of quality control.
When it comes to follow-up actions, respondents are most commonly tracking completion time and/or overall % of actions completed.
By making note of what these items are and whether or not they're being completed, teams can nip any process gaps in the bud before they result in repeated incidents that could've been avoided entirely.
While there's understandably a focus on how well teams are responding to live incidents, the insights you gather around what happens after the incident can be just as valuable.
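As an illustration, the two follow-up measures respondents mentioned (completion time and overall % of actions completed) could be derived like this. The action records and field names are hypothetical, and using the median for completion time is our choice rather than something the survey prescribed.

```python
from datetime import date
from statistics import median

# Hypothetical follow-up actions; "completed" is None while an action is still open.
actions = [
    {"created": date(2024, 1, 5), "completed": date(2024, 1, 12)},
    {"created": date(2024, 1, 5), "completed": date(2024, 1, 26)},
    {"created": date(2024, 1, 20), "completed": None},
    {"created": date(2024, 2, 1), "completed": date(2024, 2, 4)},
]

done = [a for a in actions if a["completed"] is not None]

# Overall % of actions completed.
completion_rate = len(done) / len(actions) * 100

# Days from creation to completion, for completed actions only.
days_to_complete = [(a["completed"] - a["created"]).days for a in done]
median_days = median(days_to_complete)

print(f"{completion_rate:.0f}% of actions completed, "
      f"median {median_days} days to complete")
```

Note that open actions are excluded from the completion-time figure, so it is worth reading the two numbers together.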
Takeaway
Organizations are starting to approach incident metrics from a more holistic point of view and not just focusing on response data points.
A few respondents also mentioned tracking their team's on-call metrics, specifically alert fatigue. Given how important on-call is for organizations of all sizes, tracking whether or not certain teams and individuals are getting the brunt of the work is sensible.
Alert fatigue is a phenomenon that first made its rounds among medical professionals. It occurred when folks on-call got paged repeatedly during the course of a shift. This ended up in them growing desensitized to the alerts and, eventually, processing them as noise. Down the line this led to missed, ignored, or delayed responses to pager alerts.
Alert fatigue is a very real issue and sweeping it under the rug only delays the inevitable outcome of burnout. But think about the proactive measures you can take when you know exactly who's getting on-call alerts, when they're getting them, and whether or not they're in need of some reprieve.
Ultimately, by staying on top of this metric, teams can set themselves up for success and demonstrate that employee well-being is top of mind.
Takeaway
There's a correlation between the growing complexity of an organization and how much it asks of its on-call teams. With more teams tracking this load, it suggests that balancing the needs of the business with employee well-being is a priority.
Interestingly enough, in many instances the metrics that responders track on their teams and what they report upwards aren't one and the same. Here's how it generally broke down:
One metric that came up repeatedly? Total number of incidents.
From the feedback, it's clear that upper management tends to prioritize quantitative metrics, like minutes of downtime, incident counts, and other time-related measures. There's less focus on depth and more focus on simple-to-understand numbers.
On the other hand, team members on the ground often lean towards qualitative metrics, focusing on patterns, the business impact of incidents, and the underlying factors contributing to them.
As a whole, this suggests that there's a disconnect between what teams think is the best reflection of their efficiency and what upper management thinks is most useful to make business decisions.
In general, this does at least suggest that there's still a bit of contention around what incident metrics provide the most impact and best represent the "state of the world" when it comes to incident management.
Takeaway
When it comes to the usefulness of certain incident metrics over others, teams are at odds with leadership and there's a slight misalignment between team-level objectives and business-level objectives.
Surprisingly, this question prompted the most diverse set of responses. Yes, most teams use their incident metrics to take direct action. For one respondent,
...these are triggers for my team to engage to get certain areas back on track.
But nearly half of all respondents mentioned that the metrics they track don't trigger any actions whatsoever. This raises the question: if teams are making the time to track metrics at the team level and report them up, why aren't they prompting any corrective actions?
One theory for this might be time constraints. Teams are already spread thin, and the thought of adding corrective actions to an already packed backlog can seem daunting.
Another, perhaps more plausible theory? The metrics teams are tracking and reporting don't actually serve their needs. There's some support for this idea: several respondents admitted they monitor metrics like incident count and MTTR/MTTD but then do nothing with this information.
Although our survey didn't directly address this, follow-up discussions revealed that most people find more value and learn more effectively from examining "interesting" incidents in depth, rather than relying on superficial numerical data.
In an ideal scenario, the insights you gather from metrics should be valuable enough to course correct anything that's lagging behind. But if you aren't doing anything with these data points, is it time to consider a different approach?
Takeaway
With several teams spending time collecting metrics but not doing anything with them, it raises the question: are these metrics useful to organizations in the first place?
While this question understandably generated many context-based responses, one theme came up throughout a few responses: cost. And given the discussions around tightening balance sheets for companies across the board, this feels par for the course.
The reality is that every incident can represent a dollar lost for organizations. But oftentimes it can be difficult to assign an exact amount to incidents. And when you dig deeper, it's easy to understand why.
X amount of downtime doesn't necessarily equate to X loss.
Some benchmarks you'll come across are a $427 per-minute cost of downtime for small businesses and $9,000 for medium and large. But the problem with this is that it removes so much nuance. The industry you're in, your business model, your revenue streams: all of these and more make calculating your downtime costs based on averages very challenging.
If you're an e-commerce website and your website goes down for ten minutes, will the customers who visited your site during that period never come back? Unlikely. So what is the best representation of loss during this period? Hard to say.
Other organizations use the following to get a benchmark for their downtime costs:
Outage cost = potential revenue + lost productivity costs + recovery costs
But again, there's so much variability here that makes relying on these calculations a bit of a gamble.
To make matters more complicated, cost cannot be reduced down to revenue lost. There's productivity loss. Lost trust. Regulatory scrutiny. The list goes on. So as teams mature and the stakes get higher, being able to assign a holistic value to incidents is becoming more and more critical, particularly for leadership.
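For reference, the outage cost formula above can be written as a small function. Everything fed into it here is a made-up input: how you estimate potential revenue, lost productivity and recovery costs is an assumption each organization has to make for itself, which is exactly where the variability comes from.

```python
def outage_cost(potential_revenue, lost_productivity, recovery_costs):
    """Outage cost = potential revenue + lost productivity costs + recovery costs."""
    return potential_revenue + lost_productivity + recovery_costs


# All figures below are hypothetical and only illustrate the shape of the calculation.
downtime_minutes = 10
revenue_per_minute = 100.0  # assumed average revenue during the outage window
potential_revenue = downtime_minutes * revenue_per_minute

people_affected = 5
hours_lost_each = 0.5
hourly_cost = 80.0          # assumed loaded cost per person-hour
lost_productivity = people_affected * hours_lost_each * hourly_cost

recovery_costs = 300.0      # assumed, e.g. extra infrastructure or vendor support

total = outage_cost(potential_revenue, lost_productivity, recovery_costs)
print(f"Estimated outage cost: ${total:,.2f}")
```

Changing any one of the assumed inputs moves the total considerably, so the result is best treated as a rough order of magnitude rather than a precise figure.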
Takeaway
As companies revisit the efficiency of their spend, the desire for more cost-quantifying incident metrics is top of mind for organizations of all sizes.
The results are in. Now what? Many of the results of this survey affirmed what we assumed: organizations track a diverse set of metrics to understand the efficiency of their incident response.
And while there are some common themes such as MTTR, what metrics organizations track is highly context-based. For this conversation, however, it's worth taking a step back.
It's undeniable that keeping up-to-date with incidents can give valuable insight into organizational health. And since incidents are often handled by people closest to the day-to-day, for anyone further from the action, incident data can be one of the most direct and honest signals you can get for how things are going and can facilitate better business decisions.
If you're finding yourself looking for better signals, or just trying to up-level what you look at to track incident response efficiency, here are a few ways we think you can do that. Crucially, we think it's important to combine multiple metrics to get an accurate picture, instead of relying on singular data points.
First, let's look at an alternative way to measure incident impact: workload. It's fair to say that measuring incident impact can be highly nuanced and can leave you running around trying to answer loads of questions.
But sometimes the simplest question is the best one. In this case it's how much time people actually spent trying to resolve the incident. Time spent responding to incidents is time not building product or serving customers. It represents a typically overlooked cost of supporting the service you already provide, and can be spread across many people even in a single incident.
If you could directly quantify the amount of time spent on incidents, think of how many questions that might help you answer.
By breaking down incidents into data points such as "How many people responded to this incident?" and "How much time did each of them spend on it?", you can much better represent how bad an incident was.
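A rough sketch of that breakdown is shown below, assuming you can get hold of per-person time entries for each incident. The incident IDs, names and minutes are invented, and how the time is actually captured (timesheets, chat activity, estimates) is left open.

```python
from collections import defaultdict

# Hypothetical time entries: (incident_id, responder, minutes_spent).
time_entries = [
    ("INC-1", "alice", 90),
    ("INC-1", "bob", 30),
    ("INC-2", "alice", 15),
    ("INC-2", "carol", 45),
    ("INC-2", "bob", 60),
]

# Collect the distinct responders and total minutes for each incident.
workload = defaultdict(lambda: {"responders": set(), "minutes": 0})
for incident_id, responder, minutes in time_entries:
    workload[incident_id]["responders"].add(responder)
    workload[incident_id]["minutes"] += minutes

# Answer the two questions per incident: how many people, and how much time?
summary = {
    incident_id: {
        "responders": len(w["responders"]),
        "person_hours": w["minutes"] / 60,
    }
    for incident_id, w in workload.items()
}

for incident_id, s in summary.items():
    print(f"{incident_id}: {s['responders']} responders, "
          f"{s['person_hours']:.1f} person-hours")
```

Summing the same entries by responder instead of by incident would show how that workload is distributed across the team.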
We've written extensively about on-call and the burden it represents for teams.
Because of this reality, it's important to be proactive about minimizing the disruption that being on-call creates. But outside of asking how folks are feeling, it can be hard to know when on-call has become painful.
By tracking frequency of pages and contextualizing those pages for the type of disruption they caused the person, you can address an increasing operational burden before it makes a turn for the worse.
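One way to sketch this is to bucket each page by when it landed and count pages per person and bucket. The bucket names and hour thresholds below are assumptions chosen for illustration (and they ignore time zones), not a standard classification.

```python
from collections import Counter
from datetime import datetime


def disruption_type(paged_at):
    """Classify a page by when it arrived. Buckets and thresholds are assumptions."""
    if paged_at.weekday() >= 5:  # Saturday or Sunday
        return "weekend"
    if 9 <= paged_at.hour < 18:
        return "working hours"
    if 18 <= paged_at.hour < 23:
        return "evening"
    return "overnight"


# Hypothetical pages: (person paged, local time of the page).
pages = [
    ("alice", datetime(2024, 3, 4, 10, 15)),  # Monday morning
    ("alice", datetime(2024, 3, 5, 2, 40)),   # Tuesday, middle of the night
    ("alice", datetime(2024, 3, 9, 14, 0)),   # Saturday afternoon
    ("bob", datetime(2024, 3, 6, 19, 30)),    # Wednesday evening
]

# Count pages per (person, disruption type).
load = Counter((who, disruption_type(at)) for who, at in pages)

for (who, kind), count in sorted(load.items()):
    print(f"{who}: {count} page(s) during {kind}")
```

A breakdown like this makes it easier to spot when one person is absorbing most of the overnight or weekend pages, even if total page counts look balanced.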
Are you ready for your next incident?
All organizations deal with incidents day in and day out. But because things change frequently (tenured employees leave, the business branches out into new product spaces), your incident preparedness is likely to change, too.
And the last thing you want is to be stuck in the middle of a bad incident, looking around and saying, "we weren't ready for this" or "I don't know how to do that." The best proxy to measure how ready you are to tackle your next incident? How many of your team have responded to different types of incident over time.
In the chart above, you can see how many responders:
This can help point out any gaps in experience. Say you've had a new joiner who, over the course of six months, has led very few incidents relative to other new joiners over the same time period. This can suggest to you that this person may be in need of some time in incidents where they have the opportunity to lead. Perhaps adding a shadow rotation could help?
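A simple version of this experience check might look like the following. The participation log, role names and incident types are hypothetical, and "has never led" or "has no major incident experience" are just two example gaps you could look for.

```python
from collections import Counter

# Hypothetical participation log: (person, role, incident_type).
participation = [
    ("alice", "lead", "major"),
    ("alice", "lead", "minor"),
    ("bob", "lead", "minor"),
    ("bob", "responder", "major"),
    ("carol", "responder", "minor"),
]
team = ["alice", "bob", "carol", "dave"]

# How many incidents has each person led?
led = Counter(person for person, role, _ in participation if role == "lead")
never_led = [person for person in team if led[person] == 0]

# Who has taken part in a major incident in any role?
major_experience = {person for person, _, kind in participation if kind == "major"}
no_major_experience = [person for person in team if person not in major_experience]

print("Never led an incident:", never_led)
print("No major incident experience:", no_major_experience)
```

People who show up in these lists are natural candidates for a shadow rotation or a practice drill.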
At incident.io, we run Game Days to give everyone an opportunity to go through the motions of a typical incident, while playing different roles. We'll always make sure that anyone without recent experience of major incidents participates in these drills, which increases the number of people trained to respond when the next incident happens.
To reiterate: metrics that work at one organization won't necessarily make sense at another. So if tracking MTTA, follow-up completion, and pager load makes the most sense in your context, keep doing so!
But regardless of what you end up tracking, remember that metrics are a starting point for any investigation. Numbers alone cannot convey, with 100% certainty, the many factors that surround responding to incidents.
How do folks feel? Did we do everything we possibly could have to prepare for this? How likely is this to happen again?
And the reality is that certain metrics, like MTTR or incident count, just don't do enough to paint a picture beyond the raw numbers. They leave out too much context and nuance and can leave you with a narrow worldview. Ultimately, metrics like these can do more harm than good down the line.
And yet, tracking metrics for the sake of doing so isn't going to move the needle either. Whether you're tracking workload, pager load, operational readiness or something else, it's just as important to actually implement any learnings you gather.
Ultimately, the metrics you track should enable you to make changes and improvements that deliver the most value.
When you operate with this framework everyone wins: you, your team, your business and the folks who make it all possible, your customers.