Humans aren’t fast enough for 4 9’s

May 11, 2026 — 11 min read

When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers.

It’s easy to lose track of what’s meant by “99.95%” availability, and even easier to underestimate how much harder it is to achieve 99.99% than 99.95%.

On a monthly basis, and in concrete terms, 99.95% availability means you get 21 minutes and 55 seconds of downtime. This isn’t to say you’d be happy or proud to get that amount of downtime every month, just that if an incident were to happen, and you were hard down, you’d need to recover faster than that.

99.99% availability on the other hand, means you get only 4 minutes and 23 seconds. From the moment something goes wrong to fixing it, that’s what you get. And that’s a fixed amount, every month.

Availability  Monthly downtime  Quarterly downtime  Annual downtime
99.9%         43 min            2h 10min            8h 46min
99.95%        22 min            1h 5min             4h 23min
99.99%        4min 23sec        13min 9sec          52min 36sec
99.995%       2min 11sec        6min 34sec          26min 18sec
99.999%       26sec             1min 19sec          5min 16sec
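Put as code, the table above is just arithmetic. Here is a small sketch, assuming an average month of 365.25/12 days (which matches the figures quoted):

```python
# Downtime-budget calculator: converts an availability target into the
# wall-clock downtime it permits per month.
# Assumes an average month of 365.25 / 12 days, matching the table above.

AVG_MONTH_SECONDS = 365.25 * 24 * 3600 / 12  # ~30.44 days

def downtime_budget(availability: float, months: int = 1) -> float:
    """Seconds of allowed downtime over `months` at the given availability."""
    return (1 - availability) * AVG_MONTH_SECONDS * months

def fmt(seconds: float) -> str:
    m, s = divmod(round(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

for target in (0.999, 0.9995, 0.9999, 0.99995, 0.99999):
    print(f"{target:.3%}: {fmt(downtime_budget(target))} per month")
```

Running this reproduces the monthly column: 99.99% works out to 4m 23s.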

I’m writing about this because every conversation I have about resilience starts in the same place: infrastructure. We discuss multi-region setups, failover strategies, stress testing, and more. At incident.io, for what it’s worth, we do all of those. We run a hot standby in a secondary region, and we drill full regional promotion to get back up and running in under that critical 4 minute 23 second window. We smoke test automatically and page ourselves if it fails. This level of infrastructure investment is the cost of entry for an on-call product.

What is more interesting is how we think about reliability when keeping humans in the loop is no longer feasible. That, to me, is the biggest difference between 99.95% and 99.99%.

Why is this hard?

One of our engineers put it like this in a 1:1 a few weeks ago: “Humans are never fast enough for four nines. Even a rollback command that can run in less than 30 seconds, still requires being paged and finding the command to run, which can end up taking more than four minutes.”

Let’s look at a 3am scenario that goes roughly as planned:

  • 03:00:00: Database primary can’t write to disk
  • 03:00:15: 3 consecutive metric scrapes agree that there’s a problem
  • 03:00:16:
    • alert is sent to incident.io
    • alert is routed to the escalation path
    • person is paged
  • 03:00:16: person wakes up and acknowledges the page
  • 03:01:16: person opens the laptop, authenticates and is looking at the alert
  • 03:01:30: they roughly know what to do, but they need to find the runbook
  • 03:02:00: they’ve found the runbook, and are giving it a quick read
  • 03:03:15: they’ve given it a quick read and will run the command to promote database
  • 03:04:00: they’ve double checked the command, and that they have the right identifier for the database; they’re confident and press enter
  • 03:04:35: database has gone through the promotion process, verified the consistency of the data, and the app has now reconnected to the new primary and everything is back up
    • Note: unfortunately you’ve breached your 99.99% SLA
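Summing the timeline above makes the breach concrete. A quick sketch (the date is arbitrary; the timestamps come from the scenario):

```python
from datetime import datetime

# Timestamps from the 3am scenario above; the question is simply whether a
# human-driven recovery fits inside a four-nines monthly budget.
failure_start = datetime(2026, 5, 11, 3, 0, 0)   # primary can't write to disk
recovery_done = datetime(2026, 5, 11, 3, 4, 35)  # app reconnected to new primary

outage_s = (recovery_done - failure_start).total_seconds()
monthly_budget_s = (1 - 0.9999) * (365.25 * 24 * 3600 / 12)  # ~263 seconds

print(f"outage: {outage_s:.0f}s, budget: {monthly_budget_s:.0f}s")
```

One well-executed incident, 275 seconds long, already exceeds the roughly 263 seconds the month allows.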

That’s pretty impressive already, but still quite idealistic. Most people don’t go from sound asleep to laptop open and fully authenticated in one minute.

And this is exactly the point: for most cases, even idealistic ones, you’re going to breach any 99.99% SLA if people have to be involved. And no, I’m not ranting about engineers being slow. Far from it. Five minutes is really good. My point is that 99.99% isn’t an SLA you achieve by doing the things you’ve always done, but faster.

What you need to optimise for is the system’s ability to survive those crucial first minutes without the human, long enough for the human to play a role in further recovery or deal with the fallout.

We don’t need to automate everything to ensure 99.99%, even though we might like to. If we can buy even 15-30 minutes of autonomous survival, any human in the loop has enough time to wake up, get on a laptop, and look at what the system has already done.

The job of the human is to then verify and validate, not to make every-second-counts decisions under stress and unrealistic time constraints.

So, how do we approach this?

Don’t forget about the basics

If your service depends on a database, a queue, and compute, your maximum theoretical availability is the product of their availabilities. With three dependent components at 99.99% each, your maximum availability is roughly 99.97%, not 99.99%. Think of it this way: each one of them can go down, up to its maximum error allowance, at a different time.
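The multiplication is worth seeing in code. A minimal sketch, assuming component failures are independent (which is generous):

```python
import math

# Serial dependencies multiply: the service is only up when every component
# it depends on is up. Assumes independent failures.
def serial_availability(*components: float) -> float:
    return math.prod(components)

stack = serial_availability(0.9999, 0.9999, 0.9999)  # db, queue, compute
print(f"{stack:.4%}")  # roughly 99.97%, not 99.99%
```

Add a fourth or fifth four-nines dependency and the ceiling drops further still.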

When you consider an on-call product, you have to take into account a variety of factors:

  • load balancers
  • queues
  • compute
  • database
  • telecom providers

Even if every other piece of the stack holds perfectly, it only takes one of those falling short of 99.99% for you to miss the target.

We learned this the hard way in January 2025, when Google PubSub went down for about 7 minutes and we breached our monthly target.

This is why we have to think so deeply about our dependencies when building the system. For any dependency without a demonstrable history of achieving four nines, we build alternatives.

For example, we run multiple active/active telecom providers, verified to be hosted on separate cloud providers. We treat supplier resilience as part of our resilience and we demand the same from them as we do for ourselves.

Beyond the basics

Now that I’ve covered human engagement and dependencies/infrastructure, let’s talk about some other factors that play a role. When we say 99.99%, we’re not just talking about servers, regions, or failovers. Those are all important, but they’re not the whole point.

Resilience and reliability are as much about people, code, and operational practices. Therefore, we also need to consider how we review and deploy code.

This spans everything from how fast we can deploy a fix to how fast we can roll back a bad deploy. Our hotfixes can ship in under 3 minutes, which is awesome. But be warned: this is only useful if you also have the discipline to build the right automation (automatic rollbacks, gradual rollouts, etc.) for when error rates spike on your most critical paths.
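As a sketch of what that automation might look like, here is a minimal rollback watchdog. The `get_error_rate` and `rollback` hooks are hypothetical stand-ins, not our actual tooling:

```python
import time

# Minimal automated-rollback watchdog: after a deploy, watch the error rate
# on a critical path and roll back without waking a human if it spikes.
ERROR_RATE_THRESHOLD = 0.05   # 5% of requests failing on a critical path
CONSECUTIVE_BREACHES = 3      # require 3 bad samples in a row to avoid flapping

def watch_deploy(get_error_rate, rollback, samples=60, interval_s=5):
    breaches = 0
    for _ in range(samples):
        if get_error_rate() > ERROR_RATE_THRESHOLD:
            breaches += 1
            if breaches >= CONSECUTIVE_BREACHES:
                rollback()           # no human in the loop
                return "rolled_back"
        else:
            breaches = 0             # healthy sample resets the streak
        time.sleep(interval_s)
    return "healthy"
```

The consecutive-breach requirement is the important design choice: a single noisy sample shouldn’t revert a good deploy, but a sustained spike should, in seconds rather than minutes.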

On top of that, add high quality incident response: even if everything is auto-recoverable, how you verify the fix the automation applied, and how you communicate with customers, still matter a great deal. And then, what you learn from it, and how you improve.

Last but not least, measuring your critical paths, and having deep observability is key. You need, at a minimum, to be able to answer one question: did we breach? Then, if the answer is yes, how did we breach, and for how long?

In our journey to 4 nines, we made a bunch of calls on this:

  • We measure per-target, not per-notification: in its essence, we measure if we were able to page a person in some way, not if we sent an SMS.
  • We are about to make this a requirement in our app and show users that they are covered by the SLA if they have at least two independent notification methods configured.
  • We measure all parts, from the edge of our infrastructure, to every single component that might be rate limited, all the way to the specific notification you have configured to receive (either through a 3rd party provider, or directly).
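The per-target rule is simple enough to write down. A sketch with illustrative data shapes, not our real schema:

```python
# Per-target SLA accounting: a page counts as delivered if ANY configured
# method reached the person, and a target only counts as SLA-covered if
# they have at least two independent methods configured.

def page_delivered(attempts: dict[str, bool]) -> bool:
    """attempts maps method name -> whether that delivery succeeded."""
    return any(attempts.values())

def sla_covered(configured_methods: set[str]) -> bool:
    return len(configured_methods) >= 2

attempts = {"sms": False, "push": True, "phone_call": False}
print(page_delivered(attempts))   # the page landed, SMS failure aside
print(sla_covered(set(attempts)))
```

An SMS provider outage therefore doesn’t count against the SLA as long as push or a phone call got through, which is exactly why requiring two independent methods matters.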

The promise of AI

AI is making a lot of promises in this space. I am going to hedge a bit here, because any confident-sounding claim about AI’s role in reliability today will look ridiculous in eighteen months. In any case, let’s take a look at where we are.

For investigation, which is basically diagnosing what’s broken and where to look, I am very confident in AI, and specifically in our own product, AI SRE, which we use ourselves every single day. Investigating is effectively reasoning with evidence, which is the thing language models are really good at.

For remediation, I’m more guarded in the short term. Our own AI SRE will suggest code changes and even write a PR for you, but the human is still in the loop, and it still depends on your own code processes and CI/CD speeds. Even if you were to trust these systems to just do it, they still need time to reason, write up the PR, test the changes, and put them in production. That’s very different from recovery automation; it actually looks more like a human, just more thorough and slightly faster. This is why everything beyond infra (processes, operational practices, and speed) matters!

For the time being, I think AI compresses the time between “something is wrong” and “we know what’s wrong”. A four minute budget still requires the system to recover autonomously. AI can be useful, but automation and protections in the form of circuit breakers, automated failover, multi-provider redundancy, and traffic gates are all better placed to buy us time.
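Of those protections, the circuit breaker is the easiest to illustrate. A minimal sketch, not a production implementation:

```python
import time

# Minimal circuit breaker: after `max_failures` consecutive failures the
# breaker opens and calls fail fast (buying the system time), until a
# cooldown passes and it allows a single trial call again.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # any success closes the circuit
        return result
```

The point is the budget math from earlier: failing fast on a dead dependency keeps the rest of the system serving, and buys the minutes a human needs to show up.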

AI shines at assisting the human in the loop, and that is neither counter to the point I’m making in this post, nor made redundant by it.

In the near future, AI will shift even further left and decide to escalate before the problem is even revealed to automation. Could a message appearing in kern.log or dmesg on your database server lead AI to correlate it with some otherwise unexplained increase in latency, one a human couldn’t easily detect (at least not within a minute or two), and escalate before the filesystem on the primary becomes read-only and your system goes down? I think the answer is yes, in the very near future.

Four minutes, every month, forever

If you’ve read this far, thank you. At this point, you’ve spent more time reading this article than your allowed downtime, if you’re offering 99.99% as an SLA.

The reason I wrote this is that I keep chatting with folks that overly focus on the infrastructure piece, rather than the automated recovery, the CI/CD speed, and the daily operational practices.

As for us, we know we need to be up when you’re down. 99.99% availability is genuinely still a hard target, and whoever says otherwise is lying. Those 4 minutes you get every month to recover if you have a problem have nothing to do with how fast your engineers can wake up. They have much more to do with how much of yourself you’ve been willing to remove from the critical path, how much work you’re doing every day to ensure your system can cope and recover, and how much you’re learning after the fact.

Four minutes a month, forever. That’s the budget. Resilience never lets up.

Norberto Lopes
VP Engineering
