You know that feeling when you’re in the middle of doing work against a deadline, and your internet suddenly goes out? You know how much you hate when that happens?
Yeah, same here.
And to really pile on, you know how frustrated you get when hours go by and your internet still isn’t working? And there’s radio silence from your provider?
Yeah, me too.
That scenario captures DORA’s Time to Restore Service metric to a tee. But there’s a lot more to this metric behind the scenes, along with plenty of actionable advice for reducing the time it takes to restore service after a disruption.
In this comprehensive guide, I’ll dive into the details of this crucial metric, explore its benchmarks, and provide practical tips to optimize your incident response time.
💭 This article is part of our series on DORA metrics. Here are some links to the rest:
In the context of DORA, the Time to Restore Service (TTRS) metric measures how long it takes a team to restore service after an incident or service disruption. It represents the time between the detection of an incident and the complete restoration of normal service.
By measuring TTRS, organizations can evaluate and analyze their incident management processes and identify any areas for improvement. In general, a lower TTRS indicates better incident response processes, which is typically associated with higher-performing teams and increased product reliability.
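To make the definition concrete, here's a minimal sketch of how you might compute TTRS from incident timestamps. The function names and the sample incidents are made up for illustration, and it assumes you already record a detection time and a restoration time for each incident.

```python
from datetime import datetime, timedelta


def time_to_restore(detected_at: datetime, restored_at: datetime) -> timedelta:
    """Time between detecting an incident and fully restoring service."""
    if restored_at < detected_at:
        raise ValueError("restored_at must not be earlier than detected_at")
    return restored_at - detected_at


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average restore time over (detected_at, restored_at) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((time_to_restore(d, r) for d, r in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: one 45-minute incident and one 135-minute incident.
incidents = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 45)),
    (datetime(2023, 5, 3, 14, 0), datetime(2023, 5, 3, 16, 15)),
]

print(mean_time_to_restore(incidents))  # 1:30:00
```

A mean is only one way to summarize this. If a handful of long incidents dominate, a median may give you a more honest picture.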
To assess their incident response performance, teams can look to DORA's benchmarks for Time to Restore Service. These benchmarks serve as a yardstick for measuring success:
It’s essential to note that the time it takes to restore service will depend on a myriad of contextual factors, so blindly targeting these benchmarks tends not to be especially helpful. Things like the complexity of the systems involved, the people who are responding, and the degree of understanding across the domain all introduce a number of variables that can affect restoration times.
Nonetheless, as an enduring goal, we can all agree a lower time to restore service is better. And if we see trends or outliers, or if we have hard constraints like deploy times that push restoration times high, TTRS can be a helpful starting point for an investigation.
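As a rough sketch of what spotting outliers could look like, the snippet below flags incidents whose restore time is well above the median. The 3x threshold, the function name, and the incident IDs are assumptions chosen for illustration, not a DORA recommendation.

```python
from statistics import median


def flag_outliers(restore_minutes: dict[str, float], factor: float = 3.0) -> list[str]:
    """Return incident IDs whose restore time exceeds `factor` times the median."""
    if not restore_minutes:
        return []
    typical = median(restore_minutes.values())
    return [inc for inc, mins in restore_minutes.items() if mins > factor * typical]


# Hypothetical restore times, in minutes, keyed by incident ID.
restore_minutes = {"INC-1": 30, "INC-2": 45, "INC-3": 40, "INC-4": 400}

print(flag_outliers(restore_minutes))  # ['INC-4']
```

Anything this flags is a prompt to go and read the incident, not a verdict on the team that handled it.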
It’s all in the details!
To minimize your Time to Restore Service, you have to really prioritize your incident response process. With a half-baked or ad hoc approach for responding to incidents, not only will your downtime extend more than necessary, your customers will suffer as a result.
With that in mind, let's explore some practical tips to optimize (and reduce) your incident response times.
I’ve noted this a few times already, but it’s of utmost importance that you develop a dedicated, coherent, and efficient incident response plan. This means creating a plan that outlines clear roles, responsibilities, and escalation procedures.
With it, you should be able to confidently answer questions such as:
But remember, incident response doesn’t end once the incident is closed out. So make sure you have a well-thought-out post-incident process as well. This includes creating post-mortem document templates, holding blameless post-incident meetings, and prioritizing learning from your incidents. We’ll touch on this again later.
Needless to say, every minute counts when it comes to incident response and cutting back on downtime. So if you aren’t currently using a monitoring tool, such as Datadog, or an alerting tool, such as PagerDuty, you’ll want to start ASAP!
Tools like these can alert you the minute an incident is detected and kick off your incident response process. This way, even the smallest incidents won’t slip through the cracks.
Your incident response processes will be made or broken by your communication and collaboration, point blank. If either one isn’t where it needs to be, you can expect your downtime to grow as well.
That’s why it’s important to prioritize how you communicate during incidents and how you collaborate to resolve them as well.
To start, each incident should have a single Slack channel, preferably named after some description of the incident—for example, #inc-our-server-is-down. With a single channel, all communication about that incident is in a central hub, eliminating the need to chase context across various DMs and channels.
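If you want the naming to be consistent, a small helper can turn a short description into a channel name. This is a sketch only: the `inc` prefix follows the example above, the helper name is hypothetical, and the 80-character cap is an assumption about Slack's channel name limit that you should check against Slack's own documentation.

```python
import re


def incident_channel_name(description: str, prefix: str = "inc", max_length: int = 80) -> str:
    """Build a lowercase, hyphen-separated channel name from a description."""
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    # Truncate to the assumed length limit and avoid ending on a hyphen.
    return f"{prefix}-{slug}"[:max_length].rstrip("-")


print(incident_channel_name("Our server is down!"))  # inc-our-server-is-down
```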
Second, you should use an incident response solution that makes it easy to create responsibilities. For example, responders should be able to designate someone or volunteer to be an incident lead easily.
And as a bonus, workflows that automate several steps of the incident response process are a major plus here. For example, you might set up a workflow that tells responders what the “next step” in the response process is, as laid out by you.
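To illustrate the idea, here's a minimal sketch of a "next step" lookup. The step names and the function are invented for this example; a real workflow tool would let you define your own steps and would handle the notification itself.

```python
from typing import Optional

# Hypothetical response process, in the order you'd want it followed.
RESPONSE_STEPS = [
    "assign an incident lead",
    "post a status update",
    "confirm the fix",
    "schedule a debrief",
]


def next_step(completed: set[str], steps: list[str] = RESPONSE_STEPS) -> Optional[str]:
    """Return the first step that hasn't been completed, or None when all are done."""
    for step in steps:
        if step not in completed:
            return step
    return None


print(next_step({"assign an incident lead"}))  # post a status update
```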
In the end, this all may seem like table stakes, but like I said earlier, every minute counts when responding to incidents. And it’s always the things that feel the most trivial that add up to the most downtime.
Don’t overlook your post-incident analysis!
Remember, incidents don’t end when they’re resolved. By prioritizing your post-incident workflow, including meaningful learnings from incidents, you can improve your response processes and build more resilient products in the long run.
I mentioned a few of these earlier, but this could look like holding blameless incident debrief meetings and using a dedicated post-mortem document.
Additionally, any insights you can gather into the efficiency of your incident response can go a long way here. For example, try to determine which teams respond to the biggest share of incidents or who’s been paged the most over the last 3 months.
Both of these data points, and others, can give you meaningful insights into what’s working and what isn’t, what should be improved, and what’s a building block for future success.
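As a sketch of how you might pull those two data points out of your own records, the snippet below counts pages per person over a recent window and works out each team's share of incidents. The data shapes, names, and the 90-day stand-in for "3 months" are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def most_paged(pages: list[tuple[str, datetime]], now: datetime,
               window_days: int = 90, top: int = 3) -> list[tuple[str, int]]:
    """Count pages per person within the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(person for person, paged_at in pages if paged_at >= cutoff)
    return counts.most_common(top)


def team_share(incident_teams: list[str]) -> dict[str, float]:
    """Fraction of incidents each team responded to."""
    counts = Counter(incident_teams)
    total = sum(counts.values())
    return {team: n / total for team, n in counts.items()} if total else {}


# Hypothetical records: (person paged, time of page) and one team per incident.
pages = [
    ("ana", datetime(2023, 6, 1)),
    ("ben", datetime(2023, 6, 10)),
    ("ana", datetime(2023, 6, 20)),
    ("ana", datetime(2023, 1, 5)),  # outside the 90-day window
]
teams = ["payments", "payments", "search", "platform"]

print(most_paged(pages, now=datetime(2023, 7, 1)))  # [('ana', 2), ('ben', 1)]
print(team_share(teams))  # {'payments': 0.5, 'search': 0.25, 'platform': 0.25}
```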
Unfortunately, incidents will happen despite your best efforts. And with those incidents will come a bit of downtime. It's best to accept that as reality and do what you can to limit the impact.
That said, everyone wants insight into how to cut back on downtime from incidents. But compiling the right data can be pretty complicated, especially for the critical incident response metrics that would actually help you reduce that downtime.
This is where incident.io's Insights dashboard comes in.
With it, team members and engineering leaders can glean dozens of insightful response metrics, allowing them to make meaningful changes to how they operate before, and during, incidents.
The best part? Many of these dashboards are pre-built, so you can jump right in and analyze key metrics without the overhead. But don't worry; you can set up your own dashboards as well. Here are just a few of the metrics you can track right out of the box:
...and more.
If you're interested in seeing how Insights works and how its metrics can fit alongside Time to Restore Service, be sure to contact us to schedule a custom demo.
What are DORA metrics and why should you care about them?
Google's DORA metrics can help organizations create better products, build stronger teams, and improve resilience long-term.
Luis Gonzalez
Development efficiency: Understanding DORA's Mean Lead Time for Changes
By using DORA's Mean Lead Time for Changes metric, organizations can increase their speed of iteration
Luis Gonzalez
Shipping at speed: Using DORA's Deployment Frequency to measure your ability to deliver customer value
By using DORA's deployment frequency metric, organizations can improve customer impact and product reliability.
Luis Gonzalez
Driving successful change: Understanding DORA's Change Failure Rate metric
By using DORA's change failure rate metric, organizations can highlight inefficiencies in deployment processes and prevent pesky incidents from repeating.
Luis Gonzalez