Register now: Why you’re (probably) doing service catalogs wrong
As consumers, we expect the products and software we buy to work 100% of the time. Unfortunately, that’s impossible. Even the most reliable products and services experience some disruption in service. Crashes, bugs, timeouts.
There are a ton of contributing factors, so it's impossible to distill disruptions down to a single cause. That said, technology is becoming more and more sophisticated, and so is the infrastructure that supports it. That added complexity creates more issues, like downtime, that affect the usability of products and services.
But today, the stakes are high. The cost of downtime is constantly increasing as companies move quickly to vie for the attention of prospects. And just one prolonged period of downtime can cost you hundreds of thousands of dollars in revenue and destroy hard-won customer trust.
It’s a harsh reality, but there's a path forward. To prevent losses like these, you have to consider the balancing act of reliability and availability.
Today, organizations typically outline reliability and availability in their SLA documents.
As a reminder, SLAs are contractual agreements between a provider and a client outlining details of the service. This includes the standards the provider must adhere to, the metrics to measure performance, and more.
Including availability and reliability in these agreements is a smart decision. Not only do you hold yourself accountable to high standards, but you also ensure that customers have confidence in you and your products, using metrics you both agree on.
If you are including these metrics in your SLAs, it’s best to actually try to honor them: infrastructure, processes, and teams should all be set up for success. No one wants to fall short of delivering on the expectations they laid out with customers. This can wreak havoc on your reputation.
But it’s important not to over-index on either of these. We’ll get into this soon.
Thankfully, there are a few tactics that teams can implement to ensure that downtime is kept to a minimum and reliability and availability aren’t areas of concern for engineers or customers.
Before we dive into some metrics and tactics, let’s get acquainted with what availability actually is.
Availability is a measure of the percentage of time that a service is in an operable state and not experiencing downtime. In layman's terms, it’s how often your product is working.
Here’s a formula to calculate your percentage of availability:
Percentage of availability = ((total elapsed time – sum of downtime) / total elapsed time) × 100
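If it helps to see that as code, here is a minimal Python sketch of the formula. The function name and the choice of seconds as the unit are our own assumptions for illustration; any consistent unit works.

```python
def availability_percent(total_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a percentage of the elapsed time.

    Hypothetical helper for illustration: both arguments must use the
    same unit (seconds here, by assumption).
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    # (total elapsed time - sum of downtime) / total elapsed time, as a percentage
    return (total_seconds - downtime_seconds) / total_seconds * 100


# Example: 43.2 minutes of downtime across a 30-day month.
month_seconds = 30 * 24 * 60 * 60
print(round(availability_percent(month_seconds, 43.2 * 60), 3))
```

With 43.2 minutes of downtime in a 30-day month, this works out to roughly 99.9% availability.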
And here’s a table put together by Google outlining standard levels of availability that organizations can aim for:
Availability | Downtime per year | Downtime per quarter | Downtime per month | Downtime per week | Downtime per day | Downtime per hour |
---|---|---|---|---|---|---|
90% | 36.5 days | 9 days | 3 days | 16.8 hours | 2.4 hours | 6 minutes |
95% | 18.25 days | 4.5 days | 1.5 days | 8.4 hours | 1.2 hours | 3 minutes |
99% | 3.65 days | 21.6 hours | 7.2 hours | 1.68 hours | 14.4 minutes | 36 seconds |
99.5% | 1.83 days | 10.8 hours | 3.6 hours | 50.4 minutes | 7.20 minutes | 18 seconds |
99.9% | 8.76 hours | 2.16 hours | 43.2 minutes | 10.1 minutes | 1.44 minutes | 3.6 seconds |
99.95% | 4.38 hours | 1.08 hours | 21.6 minutes | 5.04 minutes | 43.2 seconds | 1.8 seconds |
99.99% | 52.6 minutes | 12.96 minutes | 4.32 minutes | 60.5 seconds | 8.64 seconds | 0.36 seconds |
99.999% | 5.26 minutes | 1.30 minutes | 25.9 seconds | 6.05 seconds | 0.87 seconds | 0.04 seconds |
You’ll usually hear availability expressed as the “9’s.” For example, “5 nines uptime” means that a system is fully operational 99.999% of the time.
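To get a feel for what each extra nine buys you, you can turn an availability target back into a downtime allowance. The sketch below is ours, not Google's: the function name and the period lengths (a 365-day year and a 30-day month) are assumptions chosen to line up with the table above.

```python
# Period lengths in seconds. The 365-day year and 30-day month are
# assumptions for illustration.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
}


def allowed_downtime_seconds(target_percent: float, period: str) -> float:
    """Return how many seconds of downtime a target allows in a period."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return PERIOD_SECONDS[period] * (1 - target_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_seconds(target, "year") / 60
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the allowance by a factor of ten. At five nines, you have a little over five minutes for the entire year.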
It's worth bearing in mind that availability is rarely black and white. With increasingly modular applications, full downtime is rare, and it's more common to consider availability in relation to individual subsets of functionality. You'll see this on public status pages where companies split their availability across 'components.'
A great example of this can be seen here.
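As a rough illustration of per-component availability, here is a small Python sketch. The component names and downtime figures are made up, and a real status page would pull these from monitoring data rather than hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A hypothetical slice of functionality tracked on a status page."""
    name: str
    downtime_minutes: float


def availability_by_component(
    components: list[Component], period_minutes: float
) -> dict[str, float]:
    """Return the availability percentage of each component over the period."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return {
        c.name: (period_minutes - c.downtime_minutes) / period_minutes * 100
        for c in components
    }


# Illustrative numbers only: a 30-day month with three invented components.
month_minutes = 30 * 24 * 60
components = [
    Component("Dashboard", 0.0),
    Component("API", 43.2),
    Component("Notifications", 4.32),
]
for name, pct in availability_by_component(components, month_minutes).items():
    print(f"{name}: {pct:.3f}%")
```

Reporting availability this way makes it clear that an outage in one component doesn't mean the whole product was down.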
It’s important to remember that just being available isn’t nearly enough, and having your product “online” doesn't mean that it’s working effectively.
If it’s available, it should be working as intended…which is a great segue to talking about reliability!
Reliability refers to how often customers can expect to use your product without running into issues. If customers are constantly running into hiccups like app timeouts or crashes, these impact reliability.
For modern organizations, reliability is crucial because it directly impacts user satisfaction and trust. It’s easy to draw a straight line from bad reliability to decreased customer trust and higher churn rates. No one wants to use a product that’s constantly experiencing bugs that result in crashes. It’s frustrating, annoying, and a poor user experience.
Unlike availability, there’s no one calculation to measure how reliable your product is. But there are a few ways to get a proxy for it:
In the real world, how organizations get a signal for reliability varies quite a bit. For example, some larger organizations will have an alerting rule that says, "If successful requests drop below 99.9%, then page an engineer," but at incident.io, our approach is more along the lines of "If we see any error that we don't know about, then page an engineer."
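Here is a loose Python sketch of those two alerting styles side by side. Everything in it is hypothetical: the function names, the 99.9% default, and the idea of a set of known error fingerprints are assumptions for illustration, not a description of how any particular alerting tool (or incident.io itself) is implemented.

```python
def should_page_on_success_rate(
    successful: int, total: int, threshold_percent: float = 99.9
) -> bool:
    """Threshold style: page when the success rate drops below the threshold."""
    if total == 0:
        # Assumption: with no traffic there is nothing to judge, so don't page.
        return False
    return successful / total * 100 < threshold_percent


def should_page_on_unknown_error(fingerprint: str, known_errors: set[str]) -> bool:
    """Unknown-error style: page on any error not already recognised."""
    return fingerprint not in known_errors


# Illustrative inputs only.
print(should_page_on_success_rate(successful=9_980, total=10_000))
print(should_page_on_unknown_error("timeout:billing", {"timeout:search"}))
```

The first style tolerates a small amount of known failure. The second treats every new kind of error as worth a look, which tends to suit teams with lower error volumes.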
The crossover between these two concepts can be a little confusing, so let’s clear things up with some examples:
 | Low availability | High availability |
---|---|---|
Low reliability | A beta version of an online game that's frequently down for maintenance and has many bugs when it is up. | A free online file converter. It's always up, but sometimes the conversions are incorrect or take too long. |
High reliability | A specialized analytics service that's only available during business hours but provides extremely accurate and detailed reports. | A well-maintained SaaS platform like incident.io. It's almost always accessible and performs its functions consistently well 😉 |
Between these two, there’s an important balance to be struck. When teams are trying to decide how reliable and available their systems should be, they need to balance costs and service quality.
Aiming for perfection can be the enemy here.
Remember the “9’s” of availability? Every effort you make to perfect this metric is going to be a strain on resources.
It’s often said that getting each additional "9" is an order of magnitude harder than the previous one. You also don’t want to put reliability over everything else if it’s going to cause your service or product to suffer as a result.
This means finding a balance between investing in good infrastructure and performance to offer great service while also setting limits on how often things can go wrong without causing too many problems for the business and users.
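One common way to set that kind of limit is an error budget: the amount of downtime your availability target allows in a period, minus what you've already used. The sketch below is a minimal Python version under our own assumptions (a 30-day month, minutes as the unit, and illustrative numbers).

```python
def error_budget_remaining(
    target_percent: float, period_minutes: float, downtime_minutes: float
) -> float:
    """Return the minutes of downtime still allowed in the period.

    A negative result means the target has already been missed.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    budget_minutes = period_minutes * (1 - target_percent / 100)
    return budget_minutes - downtime_minutes


# Illustrative numbers: a 99.9% target over a 30-day month,
# with 13 minutes of downtime so far.
month_minutes = 30 * 24 * 60
remaining = error_budget_remaining(99.9, month_minutes, 13)
print(f"{remaining:.1f} minutes of budget left this month")
```

When the remaining budget is healthy, there is room to ship faster and take more risk. When it's nearly spent, that's a signal to slow down and invest in stability.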
Thankfully, there are quite a few tactics that teams can implement to both sustain and improve their levels of reliability and availability. Many of these apply to both, but here are a few that we’ve categorized for you:
Despite your best efforts, downtime is still going to be a matter of when, not if.
So, when incidents strike, it helps to have a platform in place that can help you manage them seamlessly to cut back on downtime.
incident.io reduces the stress and pain felt during incident response. And with Status Pages, you can count on robust internal and external comms that keep everyone in the loop even through the thorniest incidents.
With an intuitive UI that makes incident response simple, and powerful automation features such as Workflows to remove some of the overhead, incident.io levels up your incident response so you can focus on building more resilient products instead.
Ready for modern incident management? Book a call with one of our experts today.