
Shifting left on incident management

In the fast-paced world of software development and product delivery, incidents are often viewed as unwanted disruptions.

Traditionally, incident management is triggered only for critical issues, such as complete system outages, data loss, or security vulnerabilities. You don’t need to go back far to find serious examples of the latter: Heartbleed, xz utils, and more.

The prevailing idea has always been that we should minimise the number of declared incidents. That’s seen as a good thing: it avoids alarming stakeholders and maintains an appearance of stability.

I believe this perspective has to shift.

“Shifting left”

The concept of “shifting left” has already gained traction in various domains, like security and observability, where people promote the idea of addressing potential issues earlier in the development cycle.

An example of “shifting left” in security is having product managers and engineers involve security teams, and think about security themselves, during the scoping stage rather than when deploying to production.

A “shift to the left” is beneficial for an organization’s growth and resilience.

When I say “shift left” I mean lowering the threshold for declaring incidents so that you become a lot more proactive rather than reactive. Depending on your maturity, I would go so far as to say you should declare an incident for any unexpected behavior that deviates from normal operations.

You go from dreading incidents to embracing them as opportunities for improvement.

Fear of declaring incidents

I understand the reluctance within organizations to declare incidents. Concerns range from desensitizing team members to the seriousness of incidents, to facing the daunting meeting where you have to justify their frequency to upper management.

This is all compounded by the fear of damaging your company’s reputation. “What if people perceive our many incidents as a lack of control, or much worse, competence?”

A shift to the left has already brought many advantages to other domains, namely continuous delivery, security, and observability. Let’s see how doing the same in incident management will also bring many advantages.

More incidents, more understanding, more resilience

Avoidance rarely pays off as a strategy. By lowering the threshold for what counts as an incident, and letting anyone in the company declare one, I believe you’ll raise your company’s operational readiness and resilience. It’s similar to the Andon cord: anyone in a Toyota factory can pull it and halt production until a solution is found.

So, what changes when you shift left on incident management?

  • Increased readiness - a lower threshold for incidents creates a more prepared team. By capturing minor issues early, you prevent them from escalating into bigger problems. Your team will also be more familiar with the tooling and the processes.
  • Improved understanding - more incidents provide you with valuable data on system (and people) behavior. This then helps pinpoint weaknesses and areas for improvement.
  • Better overall prioritization - with a bigger and sharper view of issues, you can prioritize based on real-world impact, not just perceived severity.
  • Fearless communication - people will become more open to failure and to discussing it, which will also improve your communication with customers, in both timeliness and transparency.

How this works in practice

I had an "aha" moment when one of our engineers made a change to trigger incidents when we stopped being able to deploy to production.

Lisa describes in detail what happened in this lovely LinkedIn post. She broke the build on our main codebase, which meant that no one could deploy to production. We had hooked our CI pipeline up to our HTTP alert source, which triggers an incident, and configured a workflow to escalate that incident to whoever merged the failing PR, so she was paged immediately. This kind of fast action will save us many developer hours in the long run, not to mention what we’ll learn once we review the incident.
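
To make the flow above more concrete, here is a minimal sketch of what the CI side could look like. It is not our actual configuration: the payload field names, the endpoint, and the token are all hypothetical, and your alert source will have its own schema. The idea is simply that a failed build on the main branch becomes an HTTP alert that carries the name of the person who merged the change, so a workflow can escalate to them.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class BuildFailure:
    """A failed CI run on the main branch (hypothetical shape)."""
    repository: str
    branch: str
    commit_sha: str
    merged_by: str   # the person who merged the PR that broke the build
    build_url: str


def build_alert_payload(failure: BuildFailure) -> dict:
    """Turn a CI failure into an alert body. All field names are assumptions."""
    return {
        "title": f"Cannot deploy {failure.repository}: build failing on {failure.branch}",
        "status": "firing",
        # A stable key so retries of the same broken commit map to one alert.
        "deduplication_key": f"{failure.repository}:{failure.branch}:{failure.commit_sha}",
        "source_url": failure.build_url,
        # The escalation workflow would read this to decide who gets paged.
        "metadata": {"escalate_to": failure.merged_by, "commit": failure.commit_sha},
    }


def send_alert(endpoint: str, token: str, payload: dict) -> int:
    """POST the alert as JSON to a (hypothetical) HTTP alert source endpoint."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    failure = BuildFailure(
        repository="web",
        branch="main",
        commit_sha="abc123",
        merged_by="lisa",
        build_url="https://ci.example.com/builds/1",
    )
    # Print the payload instead of sending it; call send_alert() from the
    # CI job's failure step once a real endpoint and token are available.
    print(json.dumps(build_alert_payload(failure), indent=2))
```

Putting the merger's name in the alert itself keeps the routing decision in the incident workflow rather than in the CI script, which makes it easier to change who gets paged later without touching the pipeline.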

Conclusion

You need to declare more incidents.

Let's move beyond the fear of having too many incidents and instead learn from the valuable data they provide to build stronger, more resilient systems. It requires a cultural shift away from fear and avoidance, towards openness, curiosity, and learning.

By embracing a broader definition of incidents, organizations can unlock valuable insights into their operations, fostering a culture of resilience and adaptability. In doing so, they are not merely preparing for the future; they are actively shaping it.

I’m very proud that we're here and it feels like we’re living in the future.

Norberto Lopes
VP Engineering
