There are a handful of providers that large parts of the internet rely on: Google, AWS, Fastly, Cloudflare. While these providers can boast five or even six nines of availability, they’re not perfect and - like everyone - they occasionally go down.
For customers to get value from your product or service, it has to be available. That means that all the systems required to deliver the service are working, including:
There's a great paper, The Calculus of Service Availability, that applies maths to interpreting availability. It points out that for a system to provide a certain availability, any third parties that it depends on need to have about one order of magnitude higher availability (e.g. for a system to provide 99.99%, its dependencies need to have ~99.999%).
In practice, this means that there are some services which need significantly higher availability than others.
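To make that arithmetic concrete, here is a minimal sketch in Python. It assumes that failures are independent and that every dependency is critical (the service is down whenever any one of them is down), which is a simplification. The function names and the example figures are illustrative assumptions and are not taken from the paper.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for an availability expressed as 0..1."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Availability of a service that needs itself AND every dependency up.

    Assumes independent failures, so the availabilities multiply.
    """
    return own * math.prod(dependencies)


# Hypothetical service: 99.995% on its own, with three critical dependencies.
with_equal_deps = serial_availability(0.99995, [0.9999] * 3)
with_better_deps = serial_availability(0.99995, [0.99999] * 3)

print(f"dependencies at 99.99%:  {with_equal_deps:.4%}")
print(f"dependencies at 99.999%: {with_better_deps:.4%}")
print(f"99.99% allows {downtime_minutes_per_year(0.9999):.1f} minutes of downtime per year")
```

Under these assumptions, three dependencies at 99.99% pull the service well below a 99.99% target, while the same three at 99.999% leave it just above.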
As a consumer-grade service provider (e.g. an e-commerce site), 99.99% availability is likely to be sufficient. Above this, the consumer's own dependencies (over which you have no control), such as their internet connection or device, are collectively less reliable than your service. This means that investment to significantly improve availability beyond this point isn't particularly valuable.
By contrast, if you're a cloud provider, your customers are relying on you having a significantly higher availability so they can meet their customers' expectations while building on top of your platform.
In general, most consumer systems can afford a small amount of unexpected downtime without world-ending consequences: in fact, most customers won't notice, as their connection and device are likely less reliable. Given that achieving more reliability is extremely expensive, it's important to know when to stop, as the time you save can be invested in delivering product features that your customers will genuinely value.
Multi-cloud is a great example. Multi-cloud is shorthand for building a platform that runs on multiple cloud providers (e.g. AWS, GCP, Azure). This is the only way to be resilient to a full cloud provider outage: you need a whole second cloud provider that you can lean on instead.
This is an incredibly expensive thing to do. It increases the complexity of your system, meaning that engineers have to understand multiple platforms whenever they’re thinking about infrastructure. You become limited to just the feature set that is shared by both cloud providers, meaning that you end up with the full benefits of neither.
You’ve also introduced a new component - whatever is doing the routing / load balancing between the two cloud providers. To improve your availability using multi-cloud, this new component has to have significantly higher availability than the underlying cloud providers: otherwise you’re simply replacing one problem with another.
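A rough back-of-the-envelope model shows why the routing layer matters so much. The sketch below is illustrative only: it assumes independent failures and instant, perfect failover between clouds (both optimistic), and the availability figures are made up for the example.

```python
def multi_cloud_availability(router: float, cloud_a: float, cloud_b: float) -> float:
    """Availability of two clouds sitting behind a router / load balancer.

    The platform is up when the router is up AND at least one cloud is up.
    Assumes independent failures and instant, perfect failover.
    """
    at_least_one_cloud = 1 - (1 - cloud_a) * (1 - cloud_b)
    return router * at_least_one_cloud


single_cloud = 0.9999  # hypothetical availability of one provider on its own

# A routing layer that is less available than the clouds makes things worse...
mediocre = multi_cloud_availability(router=0.9995, cloud_a=single_cloud, cloud_b=single_cloud)
# ...while one that is much more available than the clouds makes things better.
excellent = multi_cloud_availability(router=0.99999, cloud_a=single_cloud, cloud_b=single_cloud)

print(f"single cloud:              {single_cloud:.4%}")
print(f"multi-cloud, 99.95% router:  {mediocre:.4%}")
print(f"multi-cloud, 99.999% router: {excellent:.4%}")
```

In this model the combined availability can never exceed that of the router itself, so the router effectively becomes the new ceiling.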
Unless you have very specific needs, you’ll do better purchasing high availability products from the best-in-class providers than building your own.
I understand not wanting a single point of failure. But when you add a cloud you don't get more reliability; you almost certainly get less.
Instead of worrying about AWS being down a few min a year, now you have to worry about AWS, GCP, *and* the unholy plumbing between them
Charity Majors (@mipsytipsy), September 19, 2021
If you’re interested in reading more, there’s a great write-up from Corey Quinn on the trade-offs of multi-cloud.
Being on the receiving end of a big provider outage is stressful: you can be completely down with very limited recovery options apart from ‘wait until the provider fixes it’.
In addition, it’s likely that some of your tools are also down as they share dependencies on the third party. When Cloudflare goes down, it takes a large percentage of the internet with it. AWS is the same. That can increase panic and further complicate your response.
So how should we think about these kinds of incidents, and how do we manage them well?
Your site is down. Instead of desperately trying to fix things to bring your site back up, you are … waiting. What should you be doing?
As we discussed above, availability is something that cloud providers are really very good at. The easiest thing you can usually do to improve availability is to use the products that cloud providers build for exactly this reason.
Most cloud providers offer multi-zone or multi-region features which you can opt into (for a price) to vastly decrease the likelihood of these outages.
As with all incidents, it’s important to understand the impact of the outage on your customers. Take the time to figure out what is and isn’t working - perhaps it’s not a full outage, but a service degradation. Or there are some parts of your product which aren’t impacted.
If you can, find a way to tell your customers what’s going on. Ideally via your usual channels, but if those are down then find another way: social media or even old-fashioned emails.
Translate the impact into something your customers can easily understand. What can they do, and what can’t they do? Where can they follow along (maybe the third party’s status page) to find out more?
Can you change anything about your infrastructure to bypass the broken component? Provide a temporary gateway for someone to access a particular critical service? Ask someone to email you a CSV file which you can manually process?
This is your chance to think outside the box: it’s likely to be for a short time period so you can do things that won’t scale.
What’s going to happen when the third party outage ends: is it business as usual? Have you got a backlog of async work that you need to get through, which might need to be rate limited? Are you going to have data inconsistencies that need to be reconciled?
Ideally, you'd have some tried and tested methods for disaster recovery which the team is already familiar with and rehearses frequently (see Practise or it doesn't count for more details).
In the absence of that, try to forecast as many of these problems as you can, and take steps to mitigate their impact. Maybe scale up queues ready for the thundering herd, or apply some more aggressive rate limiting. Keep communicating, giving your customers all the information they need to make good decisions.
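As one illustration of rate limiting a backlog, here is a minimal Python sketch that drains queued work at a capped pace instead of all at once. The function name, the job names and the limit of 50 per second are all hypothetical; a real system would more likely lean on the rate limiting built into its queue or worker framework.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable


def drain_backlog(
    jobs: Iterable[str],
    handle: Callable[[str], None],
    max_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process queued jobs at no more than `max_per_second`.

    Uses simple fixed-interval pacing: handle one job, then pause.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second
    queue: Deque[str] = deque(jobs)
    processed = 0
    while queue:
        handle(queue.popleft())
        processed += 1
        if queue:  # no need to pause after the final job
            sleep(interval)
    return processed


# Hypothetical backlog of webhooks that piled up during the outage.
backlog = ["webhook-1", "webhook-2", "webhook-3"]
delivered: list[str] = []
count = drain_backlog(backlog, delivered.append, max_per_second=50)
print(f"processed {count} jobs")
```

Pacing by a fixed interval is deliberately crude. It keeps the recovering provider (and your own database) from being hit by the whole backlog at once, at the cost of taking longer to catch up.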
After the incident is over, what can we learn?
Writing a debrief document after a third party outage doesn’t feel good:
What happened? Cloudflare went down
What was the impact? No-one could visit our website
What did we learn? It’s bad when Cloudflare goes down 🤷‍♀️
Incidents that you can control often feel better than third party incidents where you can’t control the outcome. After the incident you can write a post-mortem, learn from it, and get a warm fuzzy feeling that you’ve improved your product along the way.
That can make it tempting to build and run these components yourself. However, in the cold light of day, the numbers are unlikely to support doing so. Unless you have the best SRE team in the world, you aren’t going to ship infrastructure products with better availability than a cloud provider.
Instead, we should again focus on the things that are within our control.
It's pretty stressful to be trying to figure out what is impacted by a third party outage in the middle of an incident. To avoid that, you need to understand the various dependency chains in advance.
This is tricky to do as a pen-and-paper exercise: often the most reliable way is to spin up a second environment (that customers aren't using) and start turning bits of the system off.
Once you've got an understanding of the dependencies, when an incident does happen, you'll be able to focus your attention on the relevant parts of your system.
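Once the dependencies are written down, even a very small script can answer "what is affected if this provider goes down?". The Python sketch below walks a dependency map to find everything that directly or transitively relies on a failed component. The component names and the shape of the map are invented for illustration.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it needs directly.
DEPENDS_ON = {
    "checkout": ["payments-api", "web-frontend"],
    "web-frontend": ["cdn"],
    "payments-api": ["postgres"],
    "status-page": [],
    "cdn": [],
    "postgres": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges: for each dependency, who relies on it?
    dependents: dict[str, set[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk outwards from the failed component.
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


print(sorted(impacted_by("cdn", DEPENDS_ON)))
```

A map like this is only as good as the data in it, which is why turning things off in a spare environment is a useful way to check it against reality.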
As part of this, you can also run game days to help train responders in disaster recovery. These are the exercises which can produce the disaster recovery plans (and familiarity) which can be so valuable when bringing your systems back online.
Sometimes, often for historic reasons, you’ll end up relying on multiple third parties where really, one would do the job. Every critical dependency you add reduces your availability, because your service is now down whenever that dependency is. If you can consolidate on fewer, appropriately reliable dependencies, it can significantly improve your overall availability.
We can also consider blast radius here: are there ways to make some of your product work while a certain provider is down? This doesn’t necessarily mean using another provider, but perhaps you could boot service [x] even if service [y] is unavailable.
Reducing the number of components is likely to reduce your exposure to these kinds of outages.
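One way to limit blast radius is to separate the dependencies a service cannot run without from those it can run without, and to refuse to start only when the former are down. The Python sketch below shows the idea. The dependency names and probe functions are hypothetical stand-ins for real health checks.

```python
from typing import Callable, Dict, List

Probe = Callable[[], bool]


def _is_up(probe: Probe) -> bool:
    """Treat a probe that raises as a failed health check."""
    try:
        return bool(probe())
    except Exception:
        return False


def check_dependencies(required: Dict[str, Probe], optional: Dict[str, Probe]) -> List[str]:
    """Fail startup only if a required dependency is down.

    Returns the names of optional dependencies that are down, so the
    service can start in a degraded mode with those features disabled.
    """
    down_required = [name for name, probe in required.items() if not _is_up(probe)]
    if down_required:
        raise RuntimeError(f"cannot start, required dependencies down: {down_required}")
    return [name for name, probe in optional.items() if not _is_up(probe)]


# Hypothetical probes: the database is up, the email provider is having an outage.
degraded = check_dependencies(
    required={"database": lambda: True},
    optional={"email-provider": lambda: False, "search": lambda: True},
)
print(f"starting with degraded features: {degraded}")
```

Deciding which dependencies are truly required is a product decision as much as a technical one, so the split in this example should be read as a placeholder.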
Your availability is always, at best, the availability of all your critical providers, combined. Be honest with yourselves and your customers about what a realistic availability target is within those constraints, and make sure your contracts reflect that.