Learning from incidents has become something of a hot topic within the software industry, and for good reason.
Analyzing mistakes and mishaps can help organizations avoid similar issues in the future, leading to improved operations and increased safety. But too often we treat learning from incidents as the end goal, rather than a means to achieving greater business success.
The goal is not for our organisations to learn from incidents: it’s for them to be better, more successful businesses. I know, how corporate.
The more we learn, the more successful we are, and the cycle continues. We learn because we want to succeed.
You might conclude that I don’t care about learning from incidents; I do, deeply. But I care about learning from incidents because more informed, more experienced people are going to be more effective at their jobs, and likely happier, too. A culture of learning is good for the people that work here, it’s good for our customers, and ultimately that’s good for business.
The more we learn, the more successful we are, and the cycle continues.
We learn because we want to succeed.
I’ve seen a considerable amount of research and effort being applied to study of learning from incidents in software, and a lot of interesting and thought provoking material shared as a result.
Often though, what I see highlights a growing gap between the academia and the practical challenges that most face on a day-to-day basis. I’m not ashamed to say I’ve given up on reading papers or watching some talks because they felt so wildly disconnected as to be useless.
I spend every working day thinking about, and talking with people about incidents and it feels impenetrable to me – that feels wrong.
At incident.io, I’m fortunate to work with a diverse set of customers: from 10 person startups, to enterprises with tens of thousands of employees. For the majority of these customers, the problem isn’t anchored in academic concepts and complex theories. It’s a lot more fundamental.
Many struggle to define what an incident is, how they should react when something goes wrong, or how to make sure the right people are looped into the right incidents so things run as smoothly as possible.
When it comes to post-incident activities, they don’t know how to run an incident debrief, they can’t keep track of follow-up actions, and they’re stuck trying to convince their senior leaders that targeting an ever reducing mean time to recovery isn’t a great idea (pro tip: it’s not a good idea).
If you’re are trying to improve the incident culture at your organization, or convince your management that an investment of time to really learn from a major incident is a good idea, an academic approach just doesn’t work.
Telling someone who wants a report on the root cause that there’s “no root cause”, alienates the very people we need to convince. If we want buy-in from the top, more needs to be done to take people on the journey of zero-to-one, and that means connecting learning and change to tangible business outcomes.
None of this is meant to criticise the good work of the incident community. There are plenty of folks doing excellent work and extolling the value of more practically focused incident management. But I’ve equally seen what I consider to be semi-harmful advice given too. Advice around devoting days or weeks of effort to investigate even the smallest of incidents. I’m almost certain you’ll be able to learn something, but does the return on investment justify it?
And then there’s all the things people are told they shouldn’t be doing, like reducing incidents down to numbers for comparison. Yes, MTTR is fundamentally flawed metric, but when you have a conversation about replacing it with people who believe it’s useful, what are you suggesting? Most people are time constrained and if they’re told to draw the rest of the owl, they simply won’t.
This is one of the many reasons I started incident.io with my co-founders.
Between us, we’ve been at the business end of highly effective incident management programmes, semi-broken ones, and many in between. What’s common among the high performers is the fact that a healthy culture has started from a position of engaging the whole organization. Learning is connected to practical benefits that everyone understands, and there’s been a person (or group of people) at the heart of the culture, applying time and effort to meet people where they are and bring them on the journey. Learning has never been positioned as the primary motivator, it’s been a side-benefit of more business-oriented objectives.
So, to make this a little more action focused, here’s a few tidbits of advice for how to practically synthesize learning alongside your role.
Think carefully about the return on investment of your actions
Nothing will put roadblocks up faster than work being done without good justification for how it helps the business. Whether you think it’s meaningful or not, if you’re spending a week performing a thorough investigation of an incident that degraded a small part of your app for a few minutes, you’re unlikely to win over anyone who cares about delivering on the broader priorities of the organization. This might mean less time (or no time) spent on these incidents, in favour of using more significant ones.
Use transparency as a catalyst for serendipitous learning
Whether you like it or not, folks learn. Collisions of teams, individuals, and systems result in knowledge transfer and a front row seat to expertise in action. If you’re looking for the fastest way to learn from incidents, the best starting point is making them very visible to the whole organization, and actively celebrate great examples of incidents that have been done well. Don’t underestimate the power of implicit learning that happens alongside everyone just doing their job.
Sell the upside of changes, rather than telling people what they shouldn’t do
If your leaders believe a monthly report on shallow incident data, like MTTR and number of incidents, is the most useful for way for them to understand the risks facing the business, you’ll struggle to wrestle it out of their hands. And if you haven’t got a concrete answer for what they should be looking at instead, telling them what they shouldn’t do simply isn’t helpful. First, find a better way. Give them a qualitative assessment of the risks and a handful of key learnings alongside their numbers. If what you have is more valuable and useful, removing the numbers becomes an easy task.
Ultimately, if you’re struggling to make change to how your organization learns from incidents, start small, start practical, and connect the activity to something that advances the goals of your business.
It’s absolutely fine to cherry-pick more academic concepts and sequence them alongside less valuable practices that many organizations are anchored to today. Incremental improvements compound over time, and every small change can aggregate to something meaningful.
And if you’re looking for practical advice from peers in a similar position, we’d love to have you in the incident.io community.
Enter your details to receive our monthly newsletter, filled with incident related insights to help you in your day-to-day!
Using DORA metrics deployment frequency to measure your DevOps team's ability to deliver customer value
By using DORA's deployment frequency metric, organizations can improve customer impact and product reliablity.
Trust shouldn’t start at zero
Whenever someone new joins your team, folks tend to default to a trust level of zero. Here's why that's a big mistake.
What are DORA metrics and why should you care about them?
Google's DORA metrics can help organizations create better products, build stronger teams, and improve resilience long-term.