We're using AI to build an agentic product that works collaboratively with responders to improve incident investigations and resolve incidents faster. A bold claim, I know, and I think pretty impressive to land the word “agentic” so early on—I promise it’s the last time I use it.
After six months of digging into this, I’m convinced: AI in incident response won’t just be helpful—it’ll be essential. As more software is built with, and increasingly by, AI, responders will have less and less context about the systems they’re operating. That shrinking understanding—combined with the ever-growing volume of software—only increases the need for tools that can assist.
Done right, there's a huge upside in this approach too—faster incident resolution, reduced customer impact, and less cognitive burden on the folks putting out the fires.
But with more automation comes a new shape of risk—much of which is captured in Lisanne Bainbridge’s 1983 paper, Ironies of Automation. In the paper, Bainbridge explains that automation meant to help can paradoxically make things harder. As routine tasks get automated, human skills fade from lack of practice, so when systems fail (and they will!), responders are left underprepared and out of context.
Working in tech companies, I’ve yet to see these risks materialise seriously, but there are definite elements of truth here. Count the number of Kubernetes incidents where operators have no idea what’s happening and you’ll get the gist.
✈️ A fatal consequence of automation
In 2009, Air France Flight 447 crashed into the Atlantic Ocean after autopilot disengaged during a severe storm. The pilots, accustomed to automation handling routine flight conditions, struggled to correctly interpret the sudden barrage of unfamiliar warnings and data. Their manual flying skills and diagnostic abilities had degraded through reliance on automation, leaving them tragically underprepared for this critical scenario.
At the core, I see incidents as a fundamentally human activity. They’re what happens when all of your preventative controls have failed, leaving you entirely outside the world of normal operation. So when I think about automation in incidents, it makes no sense to me that we’d be looking to replace humans. The only sensible mode of operation is one where humans and machines cooperate to reach a resolution.
Before going any further, I want to touch on a few ideas from Bainbridge’s paper that feel especially relevant when we think about automation in incident response.
Let’s start with what I’d consider to be an over-automated vision of the future—where AI tries to take over critical work and decision-making from human responders. That’s a subjective line, sure, but an easy one to spot when you look at real examples.
Bainbridge actually calls this out directly. She writes that “using the computer to give [the human] instructions is inappropriate if the operator is simply acting as a transducer”—a passive channel for actions, rather than someone thinking and deciding.
If AI sets the direction and humans just follow along, we risk losing human judgment, context, and ultimately, effectiveness.
✅ Rubber-stamping our way into outages
You get paged at 3am, half-asleep, and see that the AI has diagnosed an issue and is asking you to “confirm” a reboot of all servers in Amsterdam. The rationale? A recent software change might be causing instability, and a reboot fixed it the last few times this kind of issue happened.
You hit confirm—because it’s 3am, you’re tired, and hey, the AI seems confident.
Except… the issue wasn’t caused by the new release. It was a noisy-but-harmless log storm triggered by a recent feature flag toggle. The reboot interrupts real customer traffic, kicks off alerts across regions, and suddenly you’ve turned a minor hiccup into a cross-regional mess.
The AI didn’t make the wrong call—it made a reasonable one, based on limited context. The problem is, you acted like a glorified button-clicker instead of a thinking responder. That’s when judgement, context, and that gut-level “wait a second…” instinct go out the window.
A slightly exaggerated example, granted, but it’s easy to see how a confident AI and a tired human can drift into a world of humans rubber-stamping decisions they didn’t actually make.
In a world where AI can write convincing human-like text, it’s tempting to delegate responsibility to an automated system to have it update a status page or send Slack messages to execs. But tone, timing, and nuance matter. Too many updates cause panic. Too few erode trust. Communication during incidents is part judgment, part empathy—still very much human territory.
📢 Good intentions, bad broadcast
An incident kicks off. Services are falling over, people are scrambling, and your trusty AI companion jumps into action: it posts to the status page, triggering email updates to a number of subscribers.
Problem is… it doesn’t quite have the full picture.
It mentions “data loss” based on an ambiguous log line. And then, just to keep the chaos rolling, it goes silent for three hours while humans are heads-down fixing things—leaving customers, execs, and half of LinkedIn assuming the company has fallen into the sea.
Technically, the AI did what it was told: communicate. But without human judgment around tone, timing, and what not to say, it turned a tricky situation into a reputational mess.
Automatically restarting services or flipping traffic sounds like an SRE’s dream. But when things go wrong, they really go wrong—and mitigation can end up being more destructive than the original issue. Until systems have the full context their human counterparts do, the ability to take consequential actions should be treated with caution.
🔥 From graceful degradation to total meltdown
Traffic to a dashboard page is spiking, and CPU usage on its backing services is climbing fast. But the team’s not worried—they know it’s the result of a marketing email. The load is expected, and the service is degrading gracefully.
The AI, lacking that context, decides to scale up the service. Sounds helpful—until the shared database gets overwhelmed, and the whole product goes offline. What started as a manageable situation just became a full-blown outage.
The AI made a reasonable decision based on what it knew. But without the surrounding context—why the traffic was spiking, how the service handles load, the impact on shared infrastructure—it turned a non-problem into a disaster.
A slightly reductive example, as autoscaling is a well-established practice. But it’s a good illustration of how automation, without guardrails or context, can easily go a step too far.
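To make that concrete, here’s a minimal sketch of the kind of guardrail I have in mind: the agent can propose consequential actions, but nothing state-changing runs without a named human approving it. The action types, field names, and flow below are illustrative assumptions, not a description of our actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical classification: anything that changes production state is
# "consequential" and must be approved by a human responder before it runs.
CONSEQUENTIAL = {"restart_service", "scale_deployment", "shift_traffic"}


@dataclass
class ProposedAction:
    kind: str                 # e.g. "scale_deployment"
    target: str               # e.g. "dashboard-api"
    rationale: str            # why the agent thinks this will help
    evidence: list[str] = field(default_factory=list)  # graphs, logs, changes it used


def handle(action: ProposedAction, approved_by: str | None = None) -> str:
    """Suggestions are surfaced immediately; consequential actions are held
    until a named human explicitly approves them."""
    if action.kind in CONSEQUENTIAL and approved_by is None:
        return (
            f"PROPOSED (needs approval): {action.kind} on {action.target}\n"
            f"Why: {action.rationale}\n"
            f"Evidence: {', '.join(action.evidence) or 'none provided'}"
        )
    return f"EXECUTING {action.kind} on {action.target} (approved by {approved_by})"


# In the scenario above, the agent stops at a proposal rather than a scale-up:
print(handle(ProposedAction(
    kind="scale_deployment",
    target="dashboard-api",
    rationale="CPU is climbing on the dashboard backend; scaling may relieve pressure",
    evidence=["CPU usage graph", "traffic spike starting 14:02"],
)))
```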
In the paper, Bainbridge poses hard questions for automation designers: how do operators keep their skills sharp when automation does the routine work, and how do they stay engaged enough to take over when it fails?
In incident response, effectiveness comes from repetition, exposure, and experience. People get better over time by seeing unusual and surprising failures, talking through trade-offs, and building up mental models of how systems behave. If automation quietly takes all that away, we lose opportunities for growth, learning, and curiosity.
And that’s the real long-term risk: a future where incidents are “handled” by machines, responders disengage, local knowledge fades, and no one wants to supervise a machine solving problems they don’t understand. And when automation breaks, you’ll be relying on humans whose skills have quietly eroded, right when you need them most.
Fortunately, it’s not all doom and gloom. I believe there’s a bright future for AI automation in incident response. As someone building it, I would say that, of course, but setting aside the obvious incentives and biases, it’s something I truly believe.
I think success lies in a few key principles that shape both the technical implementation and the overall interaction model (i.e. the UX) of the system. When we talk about what we're building internally, we're firmly anchored to the notion of a cooperative agent rather than a replacement for human work. The vibe we're aiming for is "your best engineer in every incident", where "best" covers both technical and social excellence.
Here’s how I think about the “good future”...
incident.io will not automate away critical human decision-making. Our goal is to strengthen it with humans remaining central, and AI tools providing supportive context.
Practically speaking, we repeatedly ask ourselves the question “what would we want an experienced human collaborator to do here?”. Rarely do you draft in extra people only for them to take over and leave you on the sidelines. Our agents should act similarly.
❌ I’ve reverted the PR and hotfixed it in production
✅ This PR (#2709) was merged just before this incident. It adds error handling that looks incorrect, and could lead to an alert like this
We’re not building black-box automation that hides complexity and reasoning—we’re doing the opposite. As we build out our cooperative automation, trust is one of the biggest hurdles we face, and we believe transparency is the way through it.
Just like with a human teammate, we’re far more willing to forgive a bad call if we can understand the reasoning behind it. When our agent reaches a conclusion, it will show its working: what it looked at, why it thinks it’s relevant, and where there’s uncertainty. No “just trust me” vibes.
❌ Service-X is the problem. Click here to page them.
✅ From this graph, it looks like error rates are spiking in service-X. The logs show an uptick in ‘timeout’ errors starting at 14:23, which coincides with this config change. If that looks correct, you can page the team by clicking below.
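To show what that could look like under the hood, here’s a rough sketch of a “finding” structure that carries the observation, where it came from, why the agent thinks it matters, and an explicit confidence level. The field names and rendering are hypothetical, purely to illustrate the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One piece of the agent's working. These fields are illustrative,
    not incident.io's actual schema."""
    observation: str      # the raw signal the agent looked at
    source: str           # where that signal came from
    interpretation: str   # why the agent thinks it's relevant
    confidence: str       # "low" | "medium" | "high", surfaced rather than hidden


def render(findings: list[Finding]) -> str:
    """Turn the agent's working into a message like the one above."""
    lines = []
    for f in findings:
        lines.append(f"- {f.observation} (source: {f.source})")
        lines.append(f"  Why it matters: {f.interpretation} [confidence: {f.confidence}]")
    return "\n".join(lines)


print(render([
    Finding(
        observation="Error rate in service-X spiking since 14:23",
        source="metrics dashboard",
        interpretation="Timing coincides with a config change to service-X",
        confidence="medium",
    ),
]))
```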
We want responders to stay actively involved, not fall into the trap of deferring to the AI just because it sounds confident. That means keeping humans in the loop—and responsible—when it comes to meaningful decisions. Language and UX matter a lot here.
We aim to be loud enough to offer useful input, but quiet enough not to bulldoze the conversation or imply there's a single "correct" path. Our goal is to present credible theories, relevant data, and paths worth exploring—not to steer people into autopilot mode.
❌ This is a regression from PR #456.
✅ One possibility: PR #456 modified retry logic in this area. It could be worth checking if that’s related?
Humans aren’t great at trawling through piles of incident data to spot patterns—but machines are. Our goal is to surface the right context at the right time, helping responders build and refine their mental models in real time.
By handling the retrieval, synthesis, and pattern-matching, our tools free responders up to focus on the hard stuff: making judgement calls, thinking strategically, asking “what now?” Ironically, the more context we surface, the sharper responders become.
❌ This alert has fired 12 times in the past.
✅ This alert fired during incidents in Feb and Oct last year—both related to queue backlogs in service-X after deploys. Might be worth checking if that’s happening again. See INC-2412 and INC-2508 for more details.
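As a toy illustration of that kind of pattern-matching, the sketch below ranks past incidents by how many of the currently-firing alerts they share, so the responder sees history rather than a bare count. The data and matching logic are invented for the example; real retrieval and synthesis would be far richer.

```python
from dataclasses import dataclass


@dataclass
class PastIncident:
    ref: str               # e.g. "INC-2412"
    alert_names: set[str]  # alerts that fired during that incident
    summary: str           # one-line description of what happened


def related_history(current_alerts: set[str], history: list[PastIncident]) -> list[str]:
    """Rank past incidents by overlap with the alerts firing right now."""
    scored = [(len(current_alerts & inc.alert_names), inc) for inc in history]
    scored = [(score, inc) for score, inc in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f"{inc.ref}: {inc.summary} (shares {score} alert(s))" for score, inc in scored]


history = [
    PastIncident("INC-2412", {"queue-backlog-service-x", "high-latency-api"},
                 "Queue backlog in service-X after a deploy"),
    PastIncident("INC-2508", {"queue-backlog-service-x"},
                 "Same backlog pattern, traced to a slow consumer"),
]

print("\n".join(related_history({"queue-backlog-service-x"}, history)))
```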
Automation ironies show up when humans become passive, unskilled observers. And if your goal is to take meaningful work away from responders during incidents, you'd have a legitimate case for those issues showing up. But that’s not what we’re building.
We expect humans to stay deeply involved—working alongside the machine to gather context, explore problems, form hypotheses, and choose sensible paths to resolution. Lofty claims, sure—but ones we believe we’ll ship in the next few months.
Bainbridge’s paper calls out critical pitfalls in automation. We’ve read them, we’ve talked about them, and we’re actively building to avoid them.
If we get it right—and I’m confident we will—our product won’t just avoid these automation pitfalls; it’ll amplify the human skills we rely on when everything else has failed, providing timely advice, useful hypotheses, and critical context exactly when they’re needed most.