We’ve spent our lives building products customers love, but we've also been pulled into more incidents than we can count, ranging from small blips quickly solved, to massive incidents that required weeks in an incident room.
We’ve also been lucky enough to work in organizations where everyone engages in the process, leveraging incidents as a superpower to consistently improve their service, products and customer engagement.
Monzo, Cloudflare and Slack are great examples of companies doing this well, both in their response when things go wrong and in the quality of their follow-up and the learnings they share afterwards.
Handling incidents well isn't just about limiting the damage. Done right, incidents are a way to increase customer trust, and can turn the worst day into the best learning opportunity.
This is brilliant to see and one of the reasons I always recommend you guys to anyone. Super open, Super transparent, Super all round! 🤘
— Luke Roberts (@L337LUKE) August 12, 2019
However, historically this sort of response and benefit has been confined to incidents happening in engineering and product teams.
We think there's a better version of the world out there, where the same principles that help the world's best technology teams can be enjoyed by the entire organization. We also think we know why it's not happening, and what can be done about it.
Many think incidents are solely an 'engineering thing'. Our experience is the polar opposite. Incidents often start in product/engineering, but they usually require people from around the organization to form a temporary team to collaborate, communicate and solve a problem.
We've experienced this first-hand working for companies like Monzo and GoCardless. Incidents required many different teams from around the business to coordinate: regulatory communications, customer support, public relations, legal, finance, product, compliance, engineering. Each of these disciplines had a role to play in seeing the incident through to its conclusion. Excluding them wasn’t an option.
Having the right people in the room is essential for great incident management. Unfortunately, almost all existing tools on the market focus on solving problems for engineers, leaving the rest of the organization out in the cold. Solving incidents is inherently collaborative — why should engineers get all the goodies?
Incidents aren’t just limited to engineering — they happen around the organization, all of the time:
You may not call these incidents, but they are. They require urgency in response, multiple people to coordinate, effective communication to different stakeholders and follow-up to investigate and learn. All hallmarks of an incident! Not calling these incidents causes multiple issues:
Product-led companies commonly reserve the label "incident" for large, externally visible problems. Rhetoric in the industry doesn't help: we only hear about the "really bad" incidents, such as global S3 outages, or a major CDN falling over.
Other industries treat incidents, large or small, as part of their standard operating procedure. There's no reason why product-led companies shouldn't take a leaf out of those books.
The bar for 'what is an incident?' is far too high. In reality, software and processes fail all the time, in less severe ways. Think about bugs affecting a small subset of users or a customer support agent adding the incorrect amount of credit to an account. These are incidents, but commonly fall under the bar.
Smaller incidents are extremely valuable! They're a great way to learn about the failure cases of systems and provide an opportunity for teams to practice response to larger issues. However, organizations often don't apply the same process, rigour and follow-up to these smaller issues. Why?
Gone are the days when incidents only occurred in technology, with limited understanding for the rest of the organization. There's just too much value being left on the table.
Instead, organizations must adopt organization-wide incident management: a single system to help your entire company respond, review and learn when things go wrong, big or small. Tools that everyone loves using, and that help them solve any type of incident, at any scale.
Our experience at Monzo proved that embracing this approach has numerous advantages over what came before:
A unified approach to incident management allows you to engage the power of your whole organization. No matter the role, people approach problems with a shared mental model.
Executives, customer support, product and engineering all operate in sync. It's easy to keep the people who need to know in the know.
As a result, your response becomes easier, higher quality and more predictable — making your customers happier.
With organization-wide incident management, everyone feels comfortable raising incidents and running them, whether for small issues, or complete outages. For example:
Incidents are respected, but normalised throughout the organization. Your whole organization levels up at responding to problems, no matter how big or small, and learning from them.
A proper organization-wide incident management system is flexible enough to model your process for any type of incident: be it engineering, security, operational or privacy related. It should glue together the tools that exist in each discipline — from GitHub to OneTrust — and help them work better together, instead of siloing information.
All processes run through the same system, giving you a single view over the issues your organization faces, and the ways you can improve. You have the data you need to operate and invest your time appropriately, whether it be engineering time or operational capacity — the data you never had before.
We believe organization-wide incident management is the future, and this is the vision for what we're building here at incident.io.
We're currently working with customers who feel the same way and we're already seeing great results:
If you've felt any of the pain we've described, or our view of the world resonates with you, we'd love to chat. Follow @incident_io on Twitter, join our Community on Slack or drop us an email via hello@incident.io.