Introduction
I want to walk you through how incident management has evolved, drawing from real data and the experiences of some of the most sophisticated tech organizations out there.
I'll also introduce you to a framework we’ve developed at incident.io: the Incident Maturity Model. This framework is the result of thousands of conversations with companies and provides a clear roadmap to help your organization improve its incident management practices—no matter where you're starting from.
If you'd like to watch instead, I gave a talk on this at SEV0!
Why it’s useful
- It’s BS-free. I’m going to challenge current beliefs, and lay out what I think a better world looks like. I’ll call things how I see them, and you don’t have to agree with me!
- It’s backed by data. We’ve spoken to thousands of teams and gathered insights from our own product usage. This isn’t just theory—it’s grounded in what’s actually happening in organizations today.
- It’s actionable. By the end of this post, I want you to walk away with concrete ideas you can implement tomorrow. Whether it’s a shift in mindset, new tools to explore, or changes in how you approach incident response, you’ll leave with something practical.
The status quo
Incidents are inevitable. Or, to put it bluntly, shit happens.
No matter how well-engineered your systems are or how prepared you feel, incidents will occur. What sets resilient organizations apart isn’t whether they experience incidents, but how they respond.
Fragmented tooling makes this response much harder:
- Manual processes slow things down. Responders are forced to juggle tools, copy information between platforms, and rely on outdated data. Many of the companies we talk to are using seven or more tools to navigate an incident.
- Data is scattered. There’s no single, up-to-date view of what’s going on. Each tool shows only part of the picture, making it tough to understand the incident as a whole. If you’ve ever built a postmortem, you know the pain of piecing together a CSI-style reconstruction that takes hours.
- No intelligence, no insights. Fragmented data makes it nearly impossible to generate useful insights. Without visibility, leaders can’t see what’s working, what’s not, or where to improve. As a result, multi-million-dollar tech investments often miss the mark entirely.
The doom loop
This is where companies fall into the doom loop of incident management. Here’s how it plays out:
- Incidents are hard to declare and run, so teams avoid declaring them unless absolutely necessary.
- Fewer incidents get declared, leaving teams out of practice. Responders lose the muscle memory they need to handle incidents effectively.
- When incidents do happen, the response is slow, disorganized, and inefficient. Customers are impacted more severely, and a few key individuals shoulder the burden. When those people aren’t available or leave, things fall apart—fast.
- Because the response is poor, systemic issues go unnoticed. There’s no data to analyze, no insights to learn from, and no clear direction for improvement. It’s like trying to solve a puzzle with half the pieces missing.
This loop keeps repeating. Teams stay stuck, leadership remains blind, and the organization becomes more fragile. It’s a vicious cycle that traps many companies.
The Incident Maturity Model
The Incident Maturity Model maps out the journey companies take as they mature in their incident management practices.
It’s designed to help you understand where your organization stands today and what steps you can take to improve.
The model has three stages: Centralized, Distributed, and Democratized. Each stage reflects a different level of maturity in how incidents are managed, how teams collaborate, and how tooling and data are used.
Stage 1: Centralized incident management: "You build it, they run it"
The first stage is Centralized Incident Management. This is where most companies begin—and where many stay.
Characteristics
- Centralized management: A small, dedicated team defines, runs, and iterates on the incident management process. They act as a “center of excellence,” providing incident support to the rest of the organization.
- Less mature tooling: Because a single team manages every incident, that team tolerates manual processes and learns them through repetition.
- Lower operational maturity: Outside the central team, operational concerns are seen as “someone else’s problem.”
The good
- A trained team of experts: Small enough for individual training and easy knowledge sharing.
- Consistent practices: Easier to maintain consistency and share knowledge across incidents.
The bad
- Misaligned incentives: Service teams prioritize speed; the central team prioritizes stability. This creates friction and slows progress.
- Lack of ownership: Service teams miss out on developing operational skills and owning their software end-to-end.
- Scaling challenges: As the company grows, the central team gets overwhelmed, leading to burnout and turnover.
- Fear of incidents: Since everything goes through the central team, declaring an incident feels like admitting failure, making teams reluctant to do it.
Stage 2: Distributed incident management: "You build it, you run it"
The second stage is Distributed Incident Management. This marks a significant step up from the centralized model and will be familiar to anyone in DevOps or SRE roles.
Characteristics
- Teams take ownership: Incident management is no longer the central team’s job alone. Individual teams own their software and services, responding to the alerts their code generates.
- Better tooling and processes: Tooling improves out of necessity—manual processes don’t scale. Teams follow a more consistent process and typically rely on a central source of truth for incidents.
- Incidents are mostly ‘engineering things’: Technical teams handle incidents, often without the broader organization knowing unless there’s a major disruption. It’s still engineers solving engineering problems.
The good
- Engineers are more practiced: Faster responses and less panic when incidents happen.
- Improved tooling: Reduces manual work and captures better data and insights.
- More context: Teams know their software best, which speeds up fixes.
- Pain leads to resilience: Teams feel the impact of their software’s issues, which drives them to build more resilient systems.
The challenges
- Training at scale: With hundreds or thousands of potential responders, training everyone to handle incidents well is tough. This often requires tooling investments.
- Consistency is harder: Keeping incident practices uniform across dozens or hundreds of teams is a real challenge.
How to move from Stage 1 to Stage 2
As with any process change: make the right thing the easiest thing.
You’re asking non-experts to run incidents, so you need to give them a clear, well-paved road to follow.
The goal is simple: a brand-new engineer should be able to manage an incident reasonably well just by following instructions. This requires investing in tooling to codify and automate your process, making sure it’s easy to follow and hard to get wrong.
If you don’t make it easy, distributing incident responsibility will give you all the pain and none of the benefits.
You can build this in-house, or you can buy a solution like incident.io. In the past, companies tended to build it themselves. But with so many tools available now, buying is usually the smarter choice (though I would say that!).
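If you do go the in-house route, here is a minimal sketch of what "easy to follow and hard to get wrong" could look like in code. Everything in it is an assumption for illustration: the step names, the `RUNBOOK` list, and the `Incident` class are hypothetical and not taken from any real tool, incident.io included. The idea is simply that the process lives in data, and the tooling always tells the responder what to do next and refuses to let them skip ahead.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    key: str
    instruction: str


# Hypothetical runbook: the steps and their wording are illustrative only.
RUNBOOK = [
    Step("declare", "Declare the incident and give it a short name."),
    Step("assign_lead", "Assign an incident lead."),
    Step("set_severity", "Set a severity."),
    Step("post_update", "Post a first update for stakeholders."),
    Step("resolve", "Confirm the fix and mark the incident resolved."),
]


@dataclass
class Incident:
    name: str
    completed: set = field(default_factory=set)

    def next_step(self):
        """Return the first runbook step not yet completed, or None."""
        for step in RUNBOOK:
            if step.key not in self.completed:
                return step
        return None

    def complete(self, key):
        """Mark a step done, refusing to skip ahead of the runbook order."""
        expected = self.next_step()
        expected_key = expected.key if expected is not None else None
        if expected_key != key:
            raise ValueError(f"expected step {expected_key!r}, got {key!r}")
        self.completed.add(key)
```

A responder (or a chat bot acting on their behalf) would call `next_step()` to get the next instruction and `complete()` once it is done, so a brand-new engineer never has to remember the process, only follow it.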
Stage 3: Democratized incident management: "You see it, you report it, you help fix it"
The final and most advanced stage is Democratized Incident Management. This is where cutting-edge organizations are headed—and it represents the future of incident response.
Characteristics
- Incidents are for everyone: Incident management goes beyond technical teams. It’s a company-wide effort where anyone—customer support, legal, compliance, risk, even executives—can declare and participate in incidents.
- Cross-functional collaboration: Teams from across the business collaborate to solve incidents. If customer support spots something off, they can declare an incident. If legal or compliance sees a risk, they’re involved from the start. This catches issues earlier and addresses them more holistically. Incidents are no longer just technical problems—they’re organizational challenges.
- Advanced tooling with centralized data: Tooling is highly sophisticated, offering a central source of truth accessible to everyone. Data from past incidents is captured, analyzed, and used to drive continuous improvement.
The good
- Faster detection and resolution: When anyone can declare an incident, you’re not waiting for the “right” people to notice. More eyes mean quicker detection and resolution.
- Better response through diverse perspectives: Involving customer support, legal, or compliance early means addressing both technical and business impacts. Why wouldn’t you bring in people from around the business?
- A culture of resilience: Incidents become routine, not scary. The process is smooth, the tooling helps, and the whole company learns and improves. Incidents are seen as opportunities, not failures.
The challenges
- Cultural change takes time: You can accelerate it, but behavior change doesn’t happen overnight.
- Shared language is hard: Engineers understand each other. Engineers and lawyers? Not so much. Building shared norms takes time and repetition—but it breeds empathy.
How to move from Stage 2 to Stage 3
Bring people in.
Involve cross-functional leaders in your incident management program. Customer-facing teams are often the first affected by incidents—make them feel included.
Use tooling to loop them in automatically. Notifications via email, Slack, or calls based on severity can ensure high signal, low noise.
Most of the org wants to be involved in incidents. They’ve just been left out by previous tooling. You’ll likely see a strong uptake here!
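To make the severity-based notifications above a little more concrete, here is a rough sketch of a routing table. The severity names, audiences, and channels are all assumptions I have made up for illustration, and nothing here calls a real email, Slack, or paging API. It only shows the decision of who hears about what, which is the part that keeps signal high and noise low.

```python
# Hypothetical severities, audiences, and channels, for illustration only.
ROUTING = {
    "minor": {
        "engineering": ["slack"],
    },
    "major": {
        "engineering": ["slack"],
        "customer_support": ["slack"],
    },
    "critical": {
        "engineering": ["slack", "call"],
        "customer_support": ["slack", "email"],
        "legal": ["email"],
        "executives": ["email", "call"],
    },
}


def notifications_for(severity):
    """Return sorted (audience, channel) pairs to notify for a severity."""
    if severity not in ROUTING:
        raise ValueError(f"unknown severity: {severity!r}")
    return sorted(
        (audience, channel)
        for audience, channels in ROUTING[severity].items()
        for channel in channels
    )
```

With a table like this, a minor incident stays in engineering's Slack, while a critical one reaches legal and executives without anyone having to remember to loop them in.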
Why the Incident Maturity Model matters
The Incident Maturity Model matters because it gives you a clear roadmap for improvement.
No matter where your organization stands—whether you’re just starting out with a centralized approach, or you’re on your way to a distributed model—this framework shows you where you are, where you need to go, and how to get there.
Many organizations are stuck in the centralized stage, knowing it’s not ideal. The real benefits kick in when you progress to a distributed model and ultimately to democratization—where incident management becomes a company-wide effort driven by collaboration, data, and continuous improvement.
At each stage, your tooling, processes, and culture mature, making your organization more resilient to failure.
The goal isn’t just to respond to incidents faster but to learn from each response and get better over time.
Where are you?
So here’s my challenge: Where is your organization on this journey?
- Are you stuck in the doom loop, burdened by fragmented tools, manual processes, and a centralized team carrying the weight of incidents?
- Or are you moving toward democratization, empowering teams across the company to take action, and leveraging data and AI to predict and prevent incidents?
Wherever you are today, you can take steps to improve. The shift from reactivity to proactivity doesn’t happen overnight, but it starts with understanding your current state and your destination. The Incident Maturity Model can be your guide, helping you chart the future of your incident response process.