By Chris Evans
"We can't have more than two major incidents per quarter."
It happens all the time: senior folks at your company feel like things are out of control, and they attempt to improve the situation by counting how many incidents you're having.
And it's not an unreasonable approach — on the surface, the number of incidents seems like a great measure for how well things are going.
Whilst setting targets might work in some organisations, it's worth asking whether they provide the signal you expect, and whether the implications have been properly thought through. We've had this conversation more times than we can count, so here are a few tips on how to navigate the situation.
The absence of incidents doesn't mean your systems are reliable or that things are safe. I've worked in teams where we've had months of smooth sailing, followed by intense periods where seemingly everything was on fire. Nothing materially changed between the two periods; deeper analysis showed that the same contributing factors were present throughout. We just got lucky, and in the quieter period the perfect storm of latent errors and enabling conditions never arrived.
Incidents aren't an evil we need to stamp out. In many cases, they're the cost of doing business. We shouldn't encourage failure, but despite our best efforts to maintain high levels of service, surprises will catch us out. Done right, a healthy culture of declaring incidents can be a superpower: I want my teams to feel comfortable sharing when things may be going wrong, to be excellent at responding when they do, and to democratise knowledge and expertise after the fact.
I've seen people argue about whether something is or isn't an incident because they don't want to reset the "days since last incident" counter. Equally, I've seen engineers waste time mid-incident arguing for a minor severity rating rather than a major one, because they don't want to trigger the company target.
Goodhart's Law tells us that "when a measure becomes a target, it ceases to be a good measure". If you set a low target with severe consequences, you'll probably meet it, whether that's by suppressing reporting, arguing over labels, or some other counterproductive behaviour.
The vast majority of incidents are triggered by things outside of our control. At best, a "no incidents" goal is unactionable and ignored. At worst, it alters behaviour to the detriment of the organisation.
If you were set a target of not spilling a drink for a year, what would you do differently? Nobody sets out to spill a drink, and when it happens it's not because you're careless; it's just random chance sprinkled with misfortune. Pick a better target instead, like asking me not to run with drinks.
So you've convinced your leadership team it might be a bad idea, but to seal the deal they're after an alternative. What can you offer in return?
The best advice is to understand their motivations for the goal. For example, is there a lack of trust between leadership and engineering? Is that fuelled by them seeing incidents, but not seeing the analysis and follow-up that happens afterwards? Perhaps a target around the number of incidents which didn't have a debrief would help.
Whatever the motivation, here are a few options you might want to consider.
You don't really care about the number of incidents; you care about what that number means, whether it's lost revenue, customer satisfaction, or the quality of the service you provide. Incidents are just a useful proxy.
Instead, measure the thing you actually care about, like service uptime, the number of times PII was shared, or the number of failed payments. These are tangible measures that can be targeted and improved.
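As a rough sketch of what that might look like (the Window type and all the numbers here are made up for illustration, not any particular tool's API), you could compute these directly from your own event data:

```python
from dataclasses import dataclass

@dataclass
class Window:
    """A hypothetical observation window, measured in minutes."""
    total_minutes: int
    downtime_minutes: int

def uptime_percent(w: Window) -> float:
    """Uptime as a percentage of the observation window."""
    return 100.0 * (w.total_minutes - w.downtime_minutes) / w.total_minutes

def failed_payment_rate(failed: int, attempted: int) -> float:
    """Share of payment attempts that failed, as a percentage."""
    return 100.0 * failed / attempted if attempted else 0.0

# A quarter is roughly 13 weeks' worth of minutes.
quarter = Window(total_minutes=13 * 7 * 24 * 60, downtime_minutes=42)
print(f"uptime: {uptime_percent(quarter):.3f}%")                    # 99.968%
print(f"failed payments: {failed_payment_rate(17, 120_000):.3f}%")  # 0.014%
```

The point isn't the arithmetic, it's that a target like "99.9% uptime" or "under 0.1% failed payments" is directly connected to the outcome you care about, in a way "fewer than two incidents" never is.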
If you can accept that incidents are unavoidable surprises, why not measure how well your org is using them to improve?
We suggest writing debrief documents that are used to educate, holding sessions to discuss them, and ensuring follow-up actions are seen through to completion. If you do all of the above, you're likely getting your money's worth.
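If you want to put numbers on that, here's a minimal sketch; the Incident record and its fields are hypothetical, but the two ratios (debrief coverage and follow-up completion) are the kind of learning-focused measure we mean:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    """A hypothetical incident record."""
    id: str
    has_debrief: bool
    actions_total: int = 0
    actions_done: int = 0

def debrief_coverage(incidents: list[Incident]) -> float:
    """Share of incidents with a written debrief."""
    if not incidents:
        return 100.0
    return 100.0 * sum(i.has_debrief for i in incidents) / len(incidents)

def follow_up_completion(incidents: list[Incident]) -> float:
    """Share of follow-up actions seen through to completion."""
    total = sum(i.actions_total for i in incidents)
    done = sum(i.actions_done for i in incidents)
    return 100.0 * done / total if total else 100.0

incidents = [
    Incident("INC-1", has_debrief=True, actions_total=4, actions_done=4),
    Incident("INC-2", has_debrief=True, actions_total=3, actions_done=1),
    Incident("INC-3", has_debrief=False),
]
print(f"debrief coverage: {debrief_coverage(incidents):.0f}%")          # 67%
print(f"follow-up completion: {follow_up_completion(incidents):.0f}%")  # 71%
```

Unlike an incident count, both of these numbers reward declaring incidents and learning from them, rather than hiding them.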
If you can't convince people not to target the number of incidents, why not provide the metrics they want but with the context they need to understand the full picture?
Rather than "we had 5 major incidents", share the contributing factors and risks, the commonalities and differences, and what's being done to improve. It's relatively easy to take the heat out of a number by providing some qualitative context. As it happens, there's a great post on this from the Learning from Incidents blog.
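As a final sketch, here's one (entirely hypothetical) way to assemble that kind of report, pairing the raw count with the most common contributing factors and the improvements in flight:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class MajorIncident:
    """A hypothetical major incident, tagged with contributing factors."""
    id: str
    contributing_factors: list[str]
    improvement: str

def quarterly_summary(incidents: list[MajorIncident]) -> str:
    """Present the count alongside common factors and what's being done."""
    factors = Counter(f for i in incidents for f in i.contributing_factors)
    lines = [f"We had {len(incidents)} major incidents this quarter."]
    lines.append("Most common contributing factors: "
                 + ", ".join(f"{name} (x{n})" for name, n in factors.most_common(3)))
    lines += [f"- {i.id}: improving by {i.improvement}" for i in incidents]
    return "\n".join(lines)

print(quarterly_summary([
    MajorIncident("INC-7", ["config change", "missing alert"], "alerting on config rollouts"),
    MajorIncident("INC-9", ["third-party outage"], "adding a fallback provider"),
]))
```

A report shaped like this gives leadership the number they asked for, but makes it hard to read that number in isolation.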