
When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers.
It’s easy to lose track of what “99.95%” availability actually means, and even easier to underestimate how much harder it is to achieve 99.99% than 99.95%.
On a monthly basis, and in concrete terms, 99.95% availability means you get 21 minutes and 55 seconds of downtime. This isn’t to say you’d be happy or proud to get that amount of downtime every month, just that if an incident were to happen, and you were hard down, you’d need to recover faster than that.
99.99% availability, on the other hand, means you get only 4 minutes and 23 seconds. From the moment something goes wrong to fixing it, that’s what you get. And that’s a fixed amount, every month.
| Availability | Monthly downtime | Quarterly downtime | Annual downtime |
|---|---|---|---|
| 99.9% | 43 min | 2h 10 min | 8h 46 min |
| 99.95% | 22 min | 1h 5 min | 4h 23 min |
| 99.99% | 4 min 23 sec | 13 min 9 sec | 52 min 36 sec |
| 99.995% | 2 min 11 sec | 6 min 34 sec | 26 min 18 sec |
| 99.999% | 26 sec | 1 min 19 sec | 5 min 16 sec |
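If you want to sanity-check these numbers yourself, the arithmetic is just the unavailable fraction multiplied by the length of the window. Here’s a minimal sketch in Go; the function and variable names are mine, purely for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// downtimeBudget returns how much downtime an availability target
// (e.g. 0.9999) allows over a given window.
func downtimeBudget(availability float64, window time.Duration) time.Duration {
	return time.Duration((1 - availability) * float64(window))
}

func main() {
	// Average month length (365.25 / 12 days), matching the table above.
	month := time.Duration(365.25 / 12 * 24 * float64(time.Hour))

	for _, target := range []float64{0.999, 0.9995, 0.9999, 0.99995, 0.99999} {
		fmt.Printf("%.3f%% -> %v per month\n",
			target*100, downtimeBudget(target, month).Round(time.Second))
	}
}
```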
I’m writing about this because every conversation I have about resilience starts in the same place: infrastructure. We discuss multi-region, failover strategies, stress testing, and more. At incident.io we do all of those, for what it’s worth. We run a hot standby in a secondary region, and we drill full regional promotion so we can be back up and running within that critical 4 minute 23 second window. We smoke test automatically, and we page ourselves if it fails. This level of infrastructure investment is the cost of entry for an on-call product.
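I won’t share our exact implementation, but as a sketch of what “smoke test automatically and page ourselves if it fails” can look like, here’s a minimal loop. The endpoint and the paging function are hypothetical stand-ins, not our actual tooling:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// smokeTest exercises a critical endpoint end-to-end; anything other
// than a fast 200 counts as a failure.
func smokeTest(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// pageOnCall stands in for whatever escalation API you actually use.
func pageOnCall(message string) {
	fmt.Println("PAGE:", message)
}

func main() {
	// Hypothetical endpoint, purely for illustration.
	const healthURL = "https://example.com/healthz"

	for range time.Tick(time.Minute) {
		if err := smokeTest(healthURL); err != nil {
			pageOnCall(fmt.Sprintf("smoke test failed: %v", err))
		}
	}
}
```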
What’s more interesting is how we think about reliability once keeping a human in the loop is no longer feasible. That, to me, is the biggest difference between 99.95% and 99.99%.
One of our engineers put it like this in a 1:1 a few weeks ago: “Humans are never fast enough for four nines. Even a rollback command that can run in less than 30 seconds, still requires being paged and finding the command to run, which can end up taking more than four minutes.”
Let’s look at a 3am scenario that goes roughly as planned:
- 03:00:00: database primary can’t write to disk
- 03:00:15: 3 consecutive metric scrapes agree that there’s a problem
- 03:00:16: the page goes out; the person wakes up and acknowledges it
- 03:01:16: they open the laptop, authenticate, and are looking at the alert
- 03:01:30: they roughly know what to do, but they need to find the runbook
- 03:02:00: they’ve found the runbook and are giving it a quick read
- 03:03:15: they’ve given it a quick read and are about to run the command to promote the database
- 03:04:00: they’ve double-checked the command, and that they have the right identifier for the database; they’re confident and press enter
- 03:04:35: the database has gone through the promotion process, verified the consistency of the data, the app has reconnected to the new primary, and everything is back up

That’s pretty amazing already, but still quite idealistic. Most people don’t go from sound asleep to an open, fully authenticated laptop in 1 minute.
And this is exactly the point: in most cases, even idealistic ones, you’re going to breach any 99.99% SLA if people have to be involved. And no, I’m not ranting about engineers being slow. Far from it. Five minutes is really good. My point is that 99.99% isn’t an SLA you achieve by doing the things you’ve always done, just faster.
What you need to optimise for is the system’s ability to survive those crucial first minutes without the human, long enough for the human to play a role in further recovery or deal with the fallout.
We don’t need to automate everything to hit 99.99%, even though we might like to. If we can buy even 15-30 minutes of autonomous survival, that gives any human in the loop enough time to wake up, get on a laptop, and look at what the system has already done.
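As an illustration of the kind of automation that buys that time, here’s a minimal circuit breaker sketch: after a few consecutive failures it routes calls to a fallback for a cooldown period, so the system degrades on its own instead of waiting on a pager. All names are mine, and a real implementation would need locking for concurrent use:

```go
package main

import (
	"fmt"
	"time"
)

// breaker trips after maxFailures consecutive errors and then fails over
// to a fallback for cooldown, rather than hammering a struggling primary.
type breaker struct {
	failures    int
	maxFailures int
	openedAt    time.Time
	cooldown    time.Duration
}

func (b *breaker) call(primary, fallback func() error) error {
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		return fallback() // open: don't even touch the primary
	}
	if err := primary(); err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the breaker
		}
		return fallback()
	}
	b.failures = 0 // a healthy call closes the breaker again
	return nil
}

func main() {
	b := &breaker{maxFailures: 3, cooldown: 15 * time.Minute}

	flakyPrimary := func() error { return fmt.Errorf("primary unavailable") }
	fallback := func() error { fmt.Println("served from replica/cache"); return nil }

	for i := 0; i < 5; i++ {
		_ = b.call(flakyPrimary, fallback)
	}
}
```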
The job of the human is to then verify and validate, not to make every-second-counts decisions under stress and unrealistic time constraints.
So, how do we approach this?
If your service depends on a database, a queue, and compute, your maximum theoretical availability is the product of those. With three components at 99.99% each, all of which must be up, your maximum availability is roughly 99.97%, not 99.99%. Think of it this way: each of them can burn its own error budget at a different time.
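A quick sketch of that multiplication, using the example above (names are mine):

```go
package main

import "fmt"

// composite returns the availability of a chain of components that all
// need to be up: the product of their individual availabilities.
func composite(availabilities ...float64) float64 {
	product := 1.0
	for _, a := range availabilities {
		product *= a
	}
	return product
}

func main() {
	// Database, queue, and compute, each at 99.99%.
	fmt.Printf("%.4f%%\n", composite(0.9999, 0.9999, 0.9999)*100) // ≈ 99.9700%
}
```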
When you consider an on-call product, you have to take into account a variety of factors: even if every piece of your stack holds perfectly, it only takes one dependency that isn’t at 99.99% for you to miss the target.
We learned this the hard way in January 2025, when Google PubSub went down for about 7 minutes and we breached our monthly target.
This is why we have to think so deeply about our dependencies when building the system. For any dependency without a demonstrable history of achieving four nines, we build alternatives.
For example, we run multiple telecom providers active/active, verified to be hosted on separate cloud providers. We treat supplier resilience as part of our own resilience, and we demand the same from them as we do from ourselves.
Now that I’ve covered the human element and the dependencies and infrastructure, let’s talk about some other factors that play a role. When I say 99.99%, I’m not just talking about servers, regions, or failovers. Those are all important, but they’re not the whole point.
Resilience and reliability are as much about people, code, and operational practices, so we also need to consider how we review and deploy code.
That means everything from how fast we can deploy a fix to how fast we can roll back a bad deploy. Our hotfixes can ship in under 3 minutes, which is awesome. But be warned: that speed is only useful if you also have the discipline to build the right automation (automatic rollbacks, gradual rollouts, and so on) for when error rates spike on your most critical paths.
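This isn’t our actual tooling, but as a sketch of the idea: watch the error rate on a critical path and trigger the rollback automatically, rather than waiting for a paged human to find the runbook. The `deployctl` CLI here is hypothetical; substitute whatever your deploy pipeline exposes:

```go
package main

import (
	"fmt"
	"os/exec"
)

// rollbackIfUnhealthy triggers an automatic rollback when the error rate
// on a critical path crosses a threshold, instead of waiting for a human.
func rollbackIfUnhealthy(errorRate, threshold float64, release string) error {
	if errorRate < threshold {
		return nil
	}
	fmt.Printf("error rate %.1f%% over %.1f%% threshold, rolling back %s\n",
		errorRate*100, threshold*100, release)
	// Hypothetical CLI: swap in your own deploy tooling here.
	return exec.Command("deployctl", "rollback", release).Run()
}

func main() {
	// In practice the error rate would come from your metrics pipeline.
	if err := rollbackIfUnhealthy(0.07, 0.02, "web-1234"); err != nil {
		fmt.Println("rollback failed, escalating to a human:", err)
	}
}
```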
On top of that, add high-quality incident response: even if everything is auto-recoverable, how you verify the fix made by the automation and how you communicate with customers still matter a huge deal. And then, what you learn from it, and how you improve.
Last but not least, measuring your critical paths and having deep observability are key. You need, at a minimum, to be able to answer one question: did we breach? Then, if the answer is yes, how did we breach, and for how long?
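At its simplest, that means turning probe results into downtime and comparing against the budget. A minimal sketch, assuming each failed probe counts its whole interval as downtime:

```go
package main

import (
	"fmt"
	"time"
)

// Given a probe interval and the number of failed probes in a month,
// answer the first question that matters: did we breach, and by how much?
func main() {
	const (
		target        = 0.9999
		probeInterval = 15 * time.Second
		failedProbes  = 20 // each failed probe counts its full interval as downtime
	)

	month := time.Duration(365.25 / 12 * 24 * float64(time.Hour))
	budget := time.Duration((1 - target) * float64(month))
	downtime := time.Duration(failedProbes) * probeInterval

	fmt.Printf("budget %v, downtime %v, breached: %v\n",
		budget.Round(time.Second), downtime, downtime > budget)
}
```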
In our journey to four nines, we made a bunch of deliberate calls on all of this.
AI is making a lot of promises in this space. I’m going to hedge a bit here, because any claim about AI’s role in reliability that sounds confident today will look ridiculous in eighteen months. In any case, let’s take a look at where we are.
For investigation, which is basically diagnosing what’s broken and where to look, I’m very confident in AI, and specifically in our own product, AI SRE, which we use ourselves every single day. Investigating is effectively reasoning with evidence, which is the thing language models are really good at.
For remediation, I’m more guarded in the short term. Our own AI SRE will suggest code changes and even write a PR for you, but the human is still in the loop, and it still depends on your own code processes and CI/CD speeds. Even if you were to trust these systems to just do it, they still need time to reason, write up the PR, test the changes, and get them into production. That’s very different from recovery automation; it looks more like a human, just more thorough and slightly faster. This is why everything beyond infrastructure, the processes, operational practices, and speed, matters.
For the time being, I think AI compresses the time between “something is wrong” and “we know what’s wrong”. But a four-minute budget still requires the system to recover autonomously, and while AI can be useful, automation and protections in the form of circuit breakers, automated failover, multi-provider redundancy, and traffic gates are all better placed to buy us that time.
AI shines at assisting the human in the loop, and that is neither counter to the point I’m making in this post, nor made redundant by it.
In the near future, AI will shift left even more, and decide to escalate before the problem is even revealed to automation. Could a message appearing in kern.log or dmesg on your database host, correlated with some increase in latency that a human wouldn’t easily detect (at least not within a minute or two), trigger an escalation before the filesystem on the primary becomes read-only and your system goes down? I think the answer is yes, in the very near future.
If you’ve read this far, thank you. At this point, you’ve spent more time reading this article than your allowed downtime, if you’re offering 99.99% as an SLA.
The reason I wrote this is that I keep chatting with folks who focus too heavily on the infrastructure piece, rather than on automated recovery, CI/CD speed, and daily operational practices.
As for us, we know we need to be up when you’re down. 99.99% availability is genuinely still a hard target, and whoever says otherwise is lying. Those 4 minutes you get every month to recover when something goes wrong have nothing to do with how fast your engineers can wake up. They have much more to do with how much of yourself you’ve been willing to remove from the critical path, how much work you do every day to ensure your system can cope and recover, and how much you learn after the fact.
Four minutes a month, forever. That’s the budget. Resilience never lets up.


