"We can't have more than two major incidents per quarter."
It happens all the time: senior folks at your company feel like things are out of control, and they attempt to improve the situation by counting how many incidents you're having.
And it's not an unreasonable approach — on the surface, the number of incidents seems like a great measure for how well things are going.
Whilst setting targets might work in some organizations, it's worth considering whether they provide the signal you expect, and whether the implications of doing so have been properly considered. We've had this conversation more times than we can count, so here are a few tips on how to navigate the situation.
The absence of incidents doesn’t mean your systems are reliable or things are safe. I've worked in teams where we've had months of smooth sailing, followed by intense periods of seemingly everything being on fire. Nothing materially changed between the two periods. Deeper analysis showed that many of the contributing factors had been present throughout. We just got lucky: the perfect storm of latent errors and enabling conditions didn’t occur during the quiet months.
Incidents aren't an evil we need to stamp out. In many cases, they're the cost of doing business. We shouldn't encourage failure, but despite our best efforts to maintain high levels of service, surprises will catch us out. When done right, a healthy culture of declaring incidents can be a superpower. I want my teams to feel comfortable sharing when things may be going wrong, be excellent at responding when they do, and democratise knowledge and expertise after the fact.
I’ve seen people arguing about whether something is or isn’t an incident because they don’t want to reset the “days since incident” counter. Equally, I’ve seen engineers waste time in an incident trying to justify a minor severity rating, rather than major, because they don't want to trigger the company target.
As stated in Goodhart's Law, "when a measure becomes a target, it ceases to be a good measure". If you set a low target with severe consequences, you'll probably meet it, whether that means suppressing reporting, arguing over labels, or some other counterproductive measure.
The vast majority of incidents are outside of our control. At best, a “no incident” goal is un-actionable and ignored. At worst, it can alter behaviour to the detriment of the organization.
If you were set a target of not spilling a drink for a year, what would you do differently? Nobody sets out to spill a drink, and when it happens it’s not because you’re careless, it’s just random chance sprinkled with misfortune. Pick a better target, like not running with drinks.
So you've convinced your leadership team it might be a bad idea, but to seal the deal they're after an alternative. What can you offer in return?
The best advice is to understand their motivations for the goal. For example, is there a lack of trust between leadership and engineering? Is that fuelled by them seeing incidents, but not seeing the analysis and follow-up that happens afterwards? Perhaps a target around the number of incidents which didn't have a debrief would help.
Whatever the motivation, here are a few options you might want to consider.
You don't really care about the number of incidents. You care about what that number means: lost revenue, customer satisfaction, or the service you provide. Incidents are just a useful proxy.
Instead, measure the thing you actually care about, like service uptime, the number of times PII was shared, or the number of failed payments. These are tangible measures that can be targeted and improved.
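To make that concrete, here's a rough sketch of what targeting outcomes rather than incident counts could look like. Everything in it is an assumption for illustration: the `QuarterStats` fields, the function names, and the sample numbers are hypothetical, and your own definitions of downtime and a failed payment will differ.

```python
from dataclasses import dataclass


@dataclass
class QuarterStats:
    """Hypothetical per-quarter figures; the field names are illustrative only."""
    total_minutes: int        # minutes in the reporting period
    downtime_minutes: float   # minutes the service was unavailable
    payments_attempted: int
    payments_failed: int


def availability(stats: QuarterStats) -> float:
    """Fraction of the period in which the service was up."""
    return 1 - stats.downtime_minutes / stats.total_minutes


def payment_failure_rate(stats: QuarterStats) -> float:
    """Fraction of attempted payments that failed (0.0 if none were attempted)."""
    if stats.payments_attempted == 0:
        return 0.0
    return stats.payments_failed / stats.payments_attempted


def meets_availability_target(stats: QuarterStats, target: float) -> bool:
    """True when measured availability is at or above the agreed target."""
    return availability(stats) >= target
```

With numbers like these, the conversation shifts from "how many incidents did we have?" to "did customers get the service we promised?", whether the downtime came from one long incident or ten short ones.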
If you can accept that incidents are unavoidable surprises, why not measure how well your org is using them to improve?
We suggest writing debrief documents that are used to educate, holding sessions to discuss them, and ensuring you're seeing follow-up actions through to completion. If you do all of the above, you’re likely getting your money's worth.
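If you want to put numbers on that, one possible approach is to track how many incidents received a debrief and how many follow-up actions were completed. The sketch below assumes a hypothetical `Incident` record with made-up field names; it isn't tied to any particular tool or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; the fields are assumptions for this sketch."""
    name: str
    has_debrief: bool
    actions_total: int = 0   # follow-up actions agreed after the incident
    actions_done: int = 0    # follow-up actions completed so far


def debrief_coverage(incidents: List[Incident]) -> Optional[float]:
    """Share of incidents with a debrief, or None if there were no incidents."""
    if not incidents:
        return None
    return sum(1 for i in incidents if i.has_debrief) / len(incidents)


def action_completion(incidents: List[Incident]) -> Optional[float]:
    """Share of follow-up actions completed, or None if none were agreed."""
    total = sum(i.actions_total for i in incidents)
    if total == 0:
        return None
    return sum(i.actions_done for i in incidents) / total
```

Both functions return `None` rather than a perfect score when there's nothing to measure, so a quiet quarter doesn't get mistaken for a well-run one.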
If you can't convince people not to target the number of incidents, why not provide the metrics they want but with the context they need to understand the full picture?
💭 If you’re interested in benchmarks that can help improve the way your engineering team operates, how you build products, and improve your resiliency, then be sure to check out our blog post on DORA metrics.
Rather than "we had 5 major incidents", share the contributing factors and risks, the commonalities and differences, and what's being done to improve. It's relatively easy to take the heat out of a number by providing some qualitative context. As it happens, there’s a great post on the Learning from Incidents blog about this.
If you've got any pro tips of your own, we'd love to hear them! Send us an email at hello@incident.io, or find us on Twitter at @incident_io.
I'm one of the co-founders, and the Chief Product Officer here at incident.io.