Learning and expertise

Incidents offer a unique learning opportunity. They pull people out of their usual roles, and bring together 'ephemeral teams' to tackle unexpected and often complex problems.

If you’re looking to quickly deepen your expertise, incidents are a goldmine. They offer the chance to practice incident response itself, while also building valuable skills in debugging, communication, and cross-functional collaboration.

Many of the examples here relate to technology and engineering, but the principles apply across organizations, irrespective of the type of incident you’re dealing with.

Developing expertise

Incidents can impact parts of your system which you don’t interact with day-to-day; the stuff that ‘just works’ (until it doesn’t). That makes them a great opportunity to expand your horizons and understand how your team and components fit together into the wider landscape.

For example, as an engineer working in a primarily application-focussed team, it’s likely that your first experience of debugging infrastructure-related issues will be in an incident. When everything’s working as expected, you don’t have to understand exactly how all your things deploy, just that they do. But as soon as something goes wrong, it’s a different ballgame.

Incidents are usually caused by (or manifest in) the most difficult parts of the systems we interact with. Seeing multiple incidents impact the same component is a great way to learn about that component. It is also a signal that investing in understanding it and making it more resilient is likely to be worthwhile.

Understanding the edges of the things that your team owns, and the ways your processes interact, is a sure-fire way to build better overall solutions.

Building better systems

We’re not perfect: our jobs are hard, and whatever we build will likely go wrong at some point. Rather than striving for perfect systems, incidents show us the value of designing both technical and social systems to fail in safe, recoverable ways. Key practices include:

  • Prioritizing debuggability: Incidents often reveal areas of our systems that are difficult to troubleshoot or lack visibility. Experiencing these challenges firsthand can drive us to build observability into systems from the start.
  • Minimizing the blast radius: Define what’s truly 'critical' for each process, and remove any non-essential elements. For example, failing to log a user event should never degrade the customer experience.
  • Making alerts clear and traceable: When a system detects something unexpected, it should signal that clearly, with alerts that are easy to trace back to the root cause for quick resolution.
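
To make the blast radius point concrete, here is a minimal Python sketch of the user event example. It treats event logging as best effort, so a failure there never reaches the customer. Every name in it (place_order, log_user_event, EventLogError) is hypothetical and invented for illustration, and the order logic is a stand-in for whatever your critical path actually does.

```python
import logging

logger = logging.getLogger("checkout")


class EventLogError(Exception):
    """Raised by the hypothetical event logger when it cannot record an event."""


def log_user_event(event: dict) -> None:
    # Hypothetical non-critical dependency.
    # It always fails here, to simulate an outage in the event pipeline.
    raise EventLogError("event pipeline unavailable")


def place_order(order_id: str, event_logger=log_user_event) -> dict:
    # Critical path: stand-in for the work the customer actually depends on.
    result = {"order_id": order_id, "status": "confirmed"}

    # Non-critical path: best effort only. A failure is recorded with
    # enough context (the order ID and the traceback) to trace it later,
    # but it is never allowed to fail the order.
    try:
        event_logger({"type": "order_placed", "order_id": order_id})
    except Exception:
        logger.warning(
            "failed to log user event for order %s", order_id, exc_info=True
        )

    return result
```

Calling place_order("o-1") still returns a confirmed order while the event logger is down. The warning carries the order ID, which is one small way of keeping the failure traceable rather than silent.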

While these principles can be learned from textbooks, experiencing their impact in real incidents makes them concrete. During incident reviews, focusing on these areas can reveal opportunities for meaningful improvements.

Creating cross-team relationships

Incidents provide a unique opportunity to connect with people outside your immediate team — a benefit both for personal growth and for the organization.

Building relationships with colleagues who have different skills expands your network and creates valuable learning opportunities. You might meet someone with deep expertise in a particular technology or someone who’s an exceptional coach. Having a network of talented people to turn to for advice can be a powerful accelerator.

As an organization, one of your primary challenges is keeping everyone pulling in the same direction. Relationships that cross team boundaries are very powerful; they help build empathy for the challenges that other teams are facing, and often highlight when that alignment isn’t there.