The balancing act of reliability and availability

As consumers, we expect the products and software we buy to work 100% of the time. Unfortunately, that’s impossible. Even the most reliable products and services experience some disruption in service: crashes, bugs, timeouts.

There are a ton of contributing factors, so it's impossible to distill disruptions down to a single cause. That said, technology is becoming more and more sophisticated, and so is the infrastructure that supports it. That added complexity creates more issues, like downtime, that affect the usability of products and services.

But today, the stakes are high. The cost of downtime is constantly increasing as companies move quickly to vie for the attention of prospects. And just one prolonged period of downtime can cost you hundreds of thousands of dollars in revenue and destroy hard-won customer trust.

It’s a harsh reality, but there's a path forward. To prevent losses like these, you have to consider the balancing act of reliability and availability.

Availability, reliability, and SLAs

Today, organizations typically outline reliability and availability in their SLA documents.

As a reminder, SLAs are contractual agreements between a provider and a client outlining details of the service. This includes the standards the provider must adhere to, the metrics to measure performance, and more.

Including availability and reliability in these agreements is a smart decision. Not only do you hold yourself accountable to high standards, but you also ensure that customers have confidence in you and your products, using metrics you both agree on.

If you are including these metrics in your SLAs, make sure you can actually honor them: infrastructure, processes, and teams should all be set up for success. No one wants to fall short of the expectations they laid out with customers. Doing so can wreak havoc on your reputation.

But it’s important not to over-index on either of these. We’ll get into this soon.

Thankfully, there are a few tactics that teams can implement to ensure that downtime is kept to a minimum and reliability and availability aren’t areas of concern for engineers or customers.

What’s availability?

Before we dive into some metrics and tactics, let’s get acquainted with what availability actually is.

Availability is a measure of the percentage of time that a service is in an operable state and not experiencing downtime. In layman's terms, it’s how often your product is working.

Here’s a formula to calculate your percentage of availability:

Percentage of availability = ((total elapsed time – sum of downtime) / total elapsed time) × 100
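To make the formula concrete, here is a minimal Python sketch. The function name and the choice of seconds as the unit are assumptions made for illustration; any unit works as long as both inputs use the same one.

```python
def availability_percent(total_elapsed_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a percentage of the total elapsed time."""
    if total_elapsed_seconds <= 0:
        raise ValueError("total_elapsed_seconds must be positive")
    if not 0 <= downtime_seconds <= total_elapsed_seconds:
        raise ValueError("downtime_seconds must be between 0 and the total elapsed time")
    uptime_seconds = total_elapsed_seconds - downtime_seconds
    return uptime_seconds / total_elapsed_seconds * 100


# Example: a 30-day month with 43.2 minutes of downtime.
month_seconds = 30 * 24 * 60 * 60
print(round(availability_percent(month_seconds, 43.2 * 60), 3))  # 99.9
```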

And here’s a table put together by Google outlining standard levels of availability that organizations can aim for:

Availability	Downtime per year	Downtime per quarter	Downtime per month	Downtime per week	Downtime per day	Downtime per hour
90%	36.5 days	9 days	3 days	16.8 hours	2.4 hours	6 minutes
95%	18.25 days	4.5 days	1.5 days	8.4 hours	1.2 hours	3 minutes
99%	3.65 days	21.6 hours	7.2 hours	1.68 hours	14.4 minutes	36 seconds
99.5%	1.83 days	10.8 hours	3.6 hours	50.4 minutes	7.20 minutes	18 seconds
99.9%	8.76 hours	2.16 hours	43.2 minutes	10.1 minutes	1.44 minutes	3.6 seconds
99.95%	4.38 hours	1.08 hours	21.6 minutes	5.04 minutes	43.2 seconds	1.8 seconds
99.99%	52.6 minutes	12.96 minutes	4.32 minutes	60.5 seconds	8.64 seconds	0.36 seconds
99.999%	5.26 minutes	1.30 minutes	25.9 seconds	6.05 seconds	0.87 seconds	0.04 seconds
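The figures in the table can be reproduced by turning the availability formula around. The sketch below assumes the period lengths the table appears to use (a 365-day year, a 90-day quarter, and a 30-day month); the names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths inferred from the table values (assumption).
PERIOD_SECONDS = {
    "year": 365 * SECONDS_PER_DAY,
    "quarter": 90 * SECONDS_PER_DAY,
    "month": 30 * SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
    "day": SECONDS_PER_DAY,
    "hour": 60 * 60,
}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Return how many seconds of downtime a target availability permits per period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return PERIOD_SECONDS[period] * (100 - availability_percent) / 100


# "Three nines" over a month, in minutes.
print(round(allowed_downtime_seconds(99.9, "month") / 60, 1))  # 43.2
# "Five nines" over a year, in minutes.
print(round(allowed_downtime_seconds(99.999, "year") / 60, 2))  # 5.26
```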

You’ll usually hear availability expressed as the “9’s.” For example, “5 nines uptime” means that a system is fully operational 99.999% of the time.

It's worth bearing in mind that availability is rarely black and white. With increasingly modular applications, full downtime is rare, and it's more common to consider availability in relation to individual subsets of functionality. You'll see this on public status pages where companies split their availability across 'components.’

A great example of this can be seen here.

It’s important to remember that just being available isn’t nearly enough, and having your product “online” doesn't mean that it’s working effectively.

If it’s available, it should be working as intended…which is a great segue to talking about reliability!

What’s reliability?

Reliability refers to how often customers can expect to use your product without running into issues. Hiccups like app timeouts or crashes all count against it.

For modern organizations, reliability is crucial because it directly impacts user satisfaction and trust. It’s easy to draw a straight line from bad reliability to decreased customer trust and higher churn rates. No one wants to use a product that’s constantly experiencing bugs that result in crashes. It’s frustrating, annoying, and a poor user experience.

Unlike availability, there’s no one calculation to measure how reliable your product is. But there are a few ways to get a proxy for it:

Error rate: the percentage of requests you serve that are not successful.
Response time (or latency): how long it takes to serve requests.
Crash-free sessions: the percentage of sessions on mobile apps that are crash-free.
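As a rough illustration, the three proxies above could be computed like this. The Request record, the function names, and the nearest-rank percentile method are assumptions made for the sketch, not a prescribed way to measure reliability.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the request took to serve
    ok: bool           # whether the request succeeded


def error_rate_percent(requests: list[Request]) -> float:
    """Percentage of served requests that were not successful."""
    if not requests:
        raise ValueError("requests must not be empty")
    failed = sum(1 for r in requests if not r.ok)
    return failed * 100 / len(requests)


def latency_percentile_ms(requests: list[Request], percentile: float) -> float:
    """Latency at the given percentile, using the nearest-rank method."""
    if not requests:
        raise ValueError("requests must not be empty")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(math.ceil(percentile / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def crash_free_percent(total_sessions: int, crashed_sessions: int) -> float:
    """Percentage of sessions that did not crash."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    return (total_sessions - crashed_sessions) * 100 / total_sessions


sample = [Request(100, True), Request(120, True), Request(90, True), Request(2000, False)]
print(error_rate_percent(sample))         # 25.0
print(latency_percentile_ms(sample, 50))  # 100
print(crash_free_percent(1000, 5))        # 99.5
```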

In the real world, how organizations get a signal for reliability varies quite a bit. For example, some larger organizations will have an alerting rule that says, "If successful requests drop below 99.9%, then page an engineer," but at incident.io, our approach is more along the lines of "If we see any error that we don't know about, then page an engineer."
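The two alerting styles described above can be contrasted in a few lines of Python. This is a hypothetical sketch: the function names, the error codes, and the idea of a "known errors" set are invented for illustration and do not describe how any particular organization's paging actually works.

```python
# Errors the team has already triaged (hypothetical examples).
KNOWN_ERRORS = {"rate_limited", "invalid_input"}


def should_page_on_threshold(successful: int, total: int, threshold_percent: float = 99.9) -> bool:
    """Threshold style: page when the success rate drops below the threshold."""
    if total == 0:
        return False  # no traffic, nothing to judge
    return successful * 100 / total < threshold_percent


def should_page_on_unknown_error(error_codes: list[str], known: set[str] = KNOWN_ERRORS) -> bool:
    """Unknown-error style: page as soon as any error appears that nobody has seen before."""
    return any(code not in known for code in error_codes)


print(should_page_on_threshold(9989, 10000))                            # True
print(should_page_on_unknown_error(["rate_limited"]))                   # False
print(should_page_on_unknown_error(["rate_limited", "db_conn_reset"]))  # True
```

The threshold rule tolerates a small, steady level of failure, while the unknown-error rule reacts to a single new failure mode even when the overall success rate still looks healthy.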

Still not quite clear? Here are some examples!

The crossover between these two concepts can be a little confusing, so let’s clear things up with some examples:

	Low availability	High availability
Low reliability	A beta version of an online game that's frequently down for maintenance and has many bugs when it is up.	A free online file converter. It's always up, but sometimes the conversions are incorrect or take too long.
High reliability	A specialized analytics service that's only available during business hours but provides extremely accurate and detailed reports.	A well-maintained SaaS platform like incident.io. It's almost always accessible and performs its functions consistently well 😉

The balancing act of availability, reliability, and resources

Between these two, there’s an important balance to be struck. When teams are trying to decide how reliable and available their systems should be, they need to balance costs and service quality.

Aiming for perfection can be the enemy here.

Remember the “9’s” of availability? Every effort you make to perfect this metric is going to be a strain on resources.

It’s often said that getting each additional "9" is an order of magnitude harder than the previous one. You also don’t want to put reliability above everything else if the rest of your service or product suffers as a result.

This means finding a balance between investing in good infrastructure and performance to offer great service while also setting limits on how often things can go wrong without causing too many problems for the business and users.
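One way to make "limits on how often things can go wrong" tangible is to treat the downtime a target allows as a budget that incidents spend, an idea often called an error budget. The sketch below is a simplified illustration with made-up names and numbers.

```python
def remaining_downtime_budget_minutes(
    target_percent: float, period_days: int, downtime_minutes_so_far: float
) -> float:
    """Return the downtime (in minutes) still allowed in the period for a given target.

    A negative result means the target has already been missed.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    budget_minutes = period_minutes * (100 - target_percent) / 100
    return budget_minutes - downtime_minutes_so_far


# A 99.9% target over 30 days allows 43.2 minutes; 30 minutes are already spent.
print(round(remaining_downtime_budget_minutes(99.9, 30, 30), 1))  # 13.2
```

When the remaining budget is comfortable, a team might spend more time on features; when it is nearly gone, that is a signal to prioritize stability work.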

What are some best practices around reliability and availability?

Thankfully, there are quite a few tactics that teams can implement to both sustain and improve their levels of reliability and availability. Many of these apply to both, but here are a few that we’ve categorized for you:

Reliability

Monitoring: Continuously monitor the system's performance and health to identify and address issues proactively.
Testing: Conduct thorough testing, including unit tests, integration tests, and load tests, to identify and fix reliability issues before they reach production.
Automation: Automate repetitive tasks like deployment, scaling, and recovery to reduce the risk of human error.
Fault tolerance: Design your systems with redundancy and failover mechanisms to minimize the impact of failures that do happen.
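As a small illustration of the fault tolerance item, the sketch below tries a list of backends in order and falls over to the next one when a call fails. The function and the fake backends are hypothetical; a production setup would typically add timeouts, backoff, and health checks.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Call each backend in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
    raise RuntimeError("all backends failed") from last_error


def primary() -> str:
    raise ConnectionError("primary is down")  # simulated failure


def secondary() -> str:
    return "served by secondary"


print(call_with_failover([primary, secondary]))  # served by secondary
```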

Availability

Redundancy: Use redundancy in hardware, software, and infrastructure components to minimize single points of failure.
Load balancing: Distribute traffic across multiple servers or instances to prevent overloading and ensure even resource utilization.
Failover and disaster recovery: Implement failover mechanisms and disaster recovery plans to quickly restore services in case of outages.
Capacity planning: Continuously monitor resource utilization and plan for scaling to accommodate increasing loads.
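To see why redundancy helps availability, consider a back-of-the-envelope calculation. It assumes replicas fail independently and that the service stays up as long as at least one replica is up; real systems often share failure causes, so treat the result as an upper bound. The function name is illustrative.

```python
def redundant_availability_percent(single_replica_percent: float, replicas: int) -> float:
    """Combined availability of N independent replicas where any one can serve traffic."""
    if not 0 <= single_replica_percent <= 100:
        raise ValueError("single_replica_percent must be between 0 and 100")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    unavailable_fraction = 1 - single_replica_percent / 100
    # The service is down only when every replica is down at the same time.
    return (1 - unavailable_fraction ** replicas) * 100


print(round(redundant_availability_percent(99, 1), 4))  # 99.0
print(round(redundant_availability_percent(99, 2), 4))  # 99.99
```

Under these assumptions, adding a second 99% replica moves the combined figure from two nines to four.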

Incidents happen. Let’s manage them better and more visibly

Despite your best efforts, downtime is still going to be a matter of when, not if.

So, when incidents strike, it helps to have a platform in place that can help you manage them seamlessly to cut back on downtime.

incident.io reduces the stress and pain felt during incident response. And with Status Pages, you can count on robust internal and external comms that keep everyone in the loop even through the thorniest incidents.

With an intuitive UI that makes incident response simple, and powerful automation features such as Workflows to remove some of the overhead, incident.io levels up your incident response so you can focus on building more resilient products instead.