We’ve spent our lives building products customers love, but we've also been pulled into more incidents than we can count, ranging from small blips quickly solved, to massive incidents that required weeks in an incident room.
We’ve also been lucky enough to work in organizations where everyone engages in the process, leveraging incidents as a super power to consistently improve their service, products and customer engagement.
Monzo, Cloudflare and Slack are great examples of companies doing this well, both in their response when things go wrong and in the quality of their follow-up and the learnings they share afterwards.
Handling incidents well isn't just an opportunity to limit the damage. It is also a way to increase customer trust, and it can turn the worst day into the best learning opportunity.
This is brilliant to see and one of the reasons I always recommend you guys to anyone. Super open, Super transparent, Super all round! 🤘— Luke Roberts (@L337LUKE) August 12, 2019
However, historically this sort of response and benefit has been confined to incidents happening in engineering and product teams.
We think there's a better version of the world out there, where the same principles that help the world's best technology teams can be enjoyed by the entire organization. We also think we know why it's not happening, and what can be done about it.
It's time to change how we think about incidents
Incidents involve more people than we think. Tooling just makes it really hard for them to help.
Many think incidents are solely an 'engineering thing'. Our experience is the polar opposite. Incidents often start in product/engineering, but they usually require people from around the organization to form a temporary team to collaborate, communicate and solve a problem.
We've experienced this first-hand working for companies like Monzo and GoCardless. Incidents required many different teams from around the business to coordinate: regulatory communications, customer support, public relations, legal, finance, product, compliance, engineering. Each of these disciplines had a role to play in seeing the incident through to its conclusion. Excluding them wasn’t an option.
Having the right people in the room is essential for great incident management. Unfortunately, almost all existing tools on the market focus on solving problems for engineers, leaving the rest of the organization out in the cold. Solving incidents is inherently collaborative — why should engineers get all the goodies?
We have incidents in all parts of the business. We just don't call them incidents
Incidents aren’t just limited to engineering — they happen around the organization, all of the time:
- Too few food delivery riders on shift, with delivery ETAs spiking as a result, is an operational incident.
- Your largest customer threatening to churn unless you re-negotiate their contract is a customer success incident.
- An ex-employee threatening to maliciously leak confidential information about the business is a security incident.
- A customer support agent sending data to the wrong customer is a privacy incident.
You may not call these incidents, but they are. They require urgency in response, multiple people to coordinate, effective communication to different stakeholders and follow-up to investigate and learn. All hallmarks of an incident! Not calling these incidents causes multiple issues:
- You duplicate your incident management process around the business. You end up with N different processes for handling these situations, siloed in each area of the business. If different parts of the business want to collaborate on an incident (they will!), they need to learn N different processes to solve them. Without a shared mental model for how to solve problems, incidents take longer to resolve with less effective communication.
- It's impossible to get a single view of all your incidents. When your incident management process is distributed around the business, nobody can see every incident in one place. As a leader, you want your finger on the pulse of the whole organization, not just the engineering organization.
We have more incidents than we realise. We just don't hear about them.
In product-led companies, it’s common to reserve the label "incident" for large, externally visible problems. Rhetoric in the industry doesn't help: we only hear about the "really bad" incidents, such as global S3 outages, or a major CDN falling over.
Other industries treat incidents, large or small, as part of their standard operating procedure. There's no reason why product-led companies shouldn't take a leaf out of those books.
The bar for 'what is an incident?' is far too high. In reality, software and processes fail all the time, in less severe ways. Think about bugs affecting a small subset of users or a customer support agent adding the incorrect amount of credit to an account. These are incidents, but commonly fall under the bar.
Smaller incidents are extremely valuable! They're a great way to learn about the failure cases of systems and provide an opportunity for teams to practice response to larger issues. However, organizations often don't apply the same process, rigour and follow-up to these smaller issues. Why?
- Current tooling and processes are hard to use. As a result, you skip declaring an incident, and send a Slack DM to someone to fix the issue directly. Your problems are hidden away.
- Existing processes set off the wrong alarm bells. If your incident tooling pages the CEO, or wakes someone up, you don't want to be the person who pressed the wrong buttons and caused a panic for a minor issue.
- People are worried they will be blamed. People don’t want to sound the alarm as they’re worried they’ll be seen as culpable.
We need organization-wide incident management
Gone are the days when incidents only occurred in technology, with limited understanding from the rest of the organization. There's just too much value being left on the table.
Instead, organizations must adopt organization-wide incident management: a single system to help your entire company respond, review and learn when things go wrong, big or small. Tools that everyone loves using, and that help them solve any type of incident, at any scale.
Our experience at Monzo proved that embracing this approach has numerous advantages over what came before:
Your whole team, on the same team
A unified approach to incident management allows you to engage the power of your whole organization. No matter the role, people approach problems with a shared mental model.
Executives, customer support, product and engineering all operate in sync. It's easy to keep the people that need to know, in the know.
As a result, your response becomes easier, higher quality and more predictable — making your customers happier.
Practice makes perfect
With organization-wide incident management, everyone feels comfortable raising incidents and running them, whether for small issues, or complete outages. For example:
- Customer support agents feel comfortable raising incidents for process failures, in the knowledge that we'll use the data to improve controls for next time
- Operations teams use incident tooling to coordinate, communicate and respond when food delivery ETAs are high
- Engineering teams run simulated incidents to load test their systems, and practice how they'd respond
Incidents are respected, but normalised throughout the organization. Your whole organization levels up at responding to problems, no matter how big or small, and learning from them.
A single source of truth for all incident data
A proper organization-wide incident management system is flexible enough to model your process for any type of incident: be it engineering, security, operational or privacy related. It should glue together the tools that exist in each discipline — from GitHub to OneTrust — and help them work better together, instead of siloing information.
All processes run through the same system, giving you a single view over the issues your organization faces, and the ways you can improve. You have the data you need to operate and invest your time appropriately, whether it be engineering time or operational capacity — the data you never had before.
If you agree, let's talk!
We believe organization-wide incident management is the future, and this is the vision for what we're building here at incident.io.
We're currently working with customers who feel the same way and we're already seeing great results:
- After using incident.io, customers see a change from fewer, larger incidents, to many smaller incidents. Our customers are getting visibility on all the incidents that were flying under the radar before. As our product is so easy to use, and designed for everyone, the friction of creating an incident is a lot lower. Each of these smaller incidents is a chance to skill up more of the team, learn and improve.
- Many different disciplines use incident.io to collaborate. From engineering to operations, we're used as a tool to solve problems more effectively, together.
If you've felt any of the pain we've described, or our view of the world resonates with you, we'd love to chat. Follow @incident_io on Twitter, join our Community on Slack or drop us an email via email@example.com.
Image credit: Matt Chesin