We’ve spent our lives building products customers love, but we've also been pulled into more incidents than we can count, ranging from small blips quickly solved, to massive incidents that required weeks in an incident room.
We’ve also been lucky enough to work in organisations where everyone engages in the process, leveraging incidents as a superpower to consistently improve their service, products and customer engagement.
Monzo, Cloudflare and Slack are great examples of companies doing this well, both in their response when things go wrong and in the quality of their follow-up and the learnings they share afterwards.
Handling incidents well isn't just an opportunity to limit the damage. It is a way to increase customer trust, and it can turn the worst day into the best learning opportunity.
This is brilliant to see and one of the reasons I always recommend you guys to anyone. Super open, Super transparent, Super all round! 🤘— Luke Roberts (@L337LUKE) August 12, 2019
However, historically this sort of response and benefit has been confined to incidents happening in engineering and product teams.
We think there's a better version of the world out there, where the same principles that help the world's best technology teams can be enjoyed by the entire organisation. We also think we know why it's not happening, and what can be done about it.
Many think incidents are solely an 'engineering thing'. Our experience is the polar opposite. Incidents often start in product/engineering, but they usually require people from around the organisation to form a temporary team to collaborate, communicate and solve a problem.
We've experienced this first-hand working for companies like Monzo and GoCardless. Incidents required many different teams from around the business to coordinate: regulatory communications, customer support, public relations, legal, finance, product, compliance, engineering. Each of these disciplines had a role to play in seeing the incident through to its conclusion. Excluding them wasn’t an option.
Having the right people in the room is essential for great incident management. Unfortunately, almost all existing tools on the market focus on solving problems for engineers, leaving the rest of the organisation out in the cold. Solving incidents is inherently collaborative — why should engineers get all the goodies?
Incidents aren’t just limited to engineering — they happen around the organisation, all of the time:
You may not call these incidents, but they are. They require urgency in response, multiple people to coordinate, effective communication to different stakeholders and follow-up to investigate and learn. All hallmarks of an incident! Not calling these incidents causes multiple issues:
In product-led companies, it’s common to use the label "incident" only for large, externally visible problems. Rhetoric in the industry doesn't help: we only hear about the "really bad" incidents, such as global S3 outages, or a major CDN falling over.
Other industries treat incidents, large or small, as part of their standard operating procedure. There's no reason why product-led companies shouldn't take a leaf out of their book.
The bar for 'what is an incident?' is far too high. In reality, software and processes fail all the time, in less severe ways. Think about a bug affecting a small subset of users, or a customer support agent adding the incorrect amount of credit to an account. These are incidents, but they commonly fall below the bar.
Smaller incidents are extremely valuable! They're a great way to learn about the failure cases of systems and provide an opportunity for teams to practice response to larger issues. However, organisations often don't apply the same process, rigour and follow-up to these smaller issues. Why?
Gone are the days when incidents only occurred in technology, with limited understanding for the rest of the organisation. There's just too much value being left on the table.
Instead, organisations must adopt organisation-wide incident management: a single system to help your entire company respond, review and learn when things go wrong, big or small. Tools that everyone loves using, and that help them solve any type of incident, at any scale.
Our experience at Monzo proved that embracing this approach has numerous advantages over what came before:
A unified approach to incident management allows you to engage the power of your whole organisation. No matter the role, people approach problems with a shared mental model.
Executives, customer support, product and engineering all operate in sync. It's easy to keep the people who need to know in the know.
As a result, your response becomes easier, higher quality and more predictable — making your customers happier.
With organisation-wide incident management, everyone feels comfortable raising incidents and running them, whether for small issues, or complete outages. For example:
Incidents are respected, but normalised throughout the organisation. Your whole organisation levels up at responding to problems, no matter how big or small, and learning from them.
A proper organisation-wide incident management system is flexible enough to model your process for any type of incident: be it engineering, security, operational or privacy related. It should glue together the tools that exist in each discipline — from GitHub to OneTrust — and help them work better together, instead of siloing information.
All processes run through the same system, giving you a single view over the issues your organisation faces, and the ways you can improve. You have the data you need to operate and invest your time appropriately, whether it be engineering time or operational capacity — the data you never had before.
We believe organisation-wide incident management is the future, and this is the vision for what we're building here at incident.io.
We're currently working with customers who feel the same way and we're already seeing great results:
If you've felt any of the pain we've described, or our view of the world resonates with you, we'd love to chat. Follow @incident_io on Twitter, join our Community on Slack or drop us an email via firstname.lastname@example.org.