Learning from your peers
Incidents are a unique learning opportunity. They take us out of our day-to-day roles and drive us to form short-lived “ephemeral teams” to solve unexpected and challenging problems.
If you’re looking for a time-efficient way to level up your expertise and understanding of something, you’ll find a proverbial gold mine of information in your individual incidents.
The most obvious opportunity is learning how to respond to incidents better: how to debug problems, communicate clearly and collaborate effectively. This is particularly powerful when your teams are showing their working.
A lot of the examples here are related to technology and engineering, but the principles apply universally across organizations, and irrespective of the type of incidents you’re dealing with.
Broaden your horizons
Incidents can impact parts of your system which you don’t interact with day-to-day; the stuff that ‘just works’ (until it doesn’t). That makes them a great opportunity to expand your horizons and understand how your team and components fit together into the wider landscape.
For example, as an engineer working in a primarily application-focussed team, it’s likely that your first experience of debugging infrastructure-related issues will be in an incident. When everything’s working as expected, you don’t have to understand exactly how all your things deploy, just that they do. But as soon as something goes wrong, it’s a different ballgame.
Incidents are usually caused by (or manifest in) the most difficult parts of the systems we interact with. Seeing multiple incidents impact the same component is a great way to learn about that component, while simultaneously signalling that an investment in learning and boosting resilience is likely to be a worthwhile activity.
Understanding the edges of the things that your team owns, and the ways your processes interact, is a sure-fire way to build better overall solutions.
Learn to fail gracefully
We're not perfect: our job is hard and whatever we build is very likely to go wrong at some point. Instead of trying to create perfect systems, incidents demonstrate that it's more important to make systems that fail in safe ways. This includes:
- Making a system alert loudly and clearly if it sees something unexpected. That alert should be easy to trace to its source, so someone debugging it can quickly understand the cause and take steps to resolve it. Resolving an incident is tricky if you don’t even know it’s happening.
- Keeping the blast radius for failures as small as possible: think carefully about what should be considered 'critical' for a given process, and get everything else out of the way. Being unable to log a user tracking event should never degrade the customer experience.
While it's possible to read this stuff in textbooks, seeing the impact of these choices in real incidents is what taught me how to put this advice into practice.
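To make the blast-radius point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a prescribed implementation: `charge_customer`, `log_tracking_event`, `complete_checkout` and `TrackingError` are hypothetical names invented for this example, and the tracking call is hard-coded to fail so the behaviour is visible.

```python
import logging

logger = logging.getLogger("checkout")


class TrackingError(Exception):
    """Raised by the hypothetical analytics client in this sketch."""


def charge_customer(order_id: str) -> str:
    # Critical step (hypothetical): if this raised, the whole request should fail.
    return f"receipt-for-{order_id}"


def log_tracking_event(order_id: str) -> None:
    # Non-critical step (hypothetical): simulate an analytics outage.
    raise TrackingError("analytics service unavailable")


def complete_checkout(order_id: str) -> str:
    # Errors from the critical path are deliberately not caught here.
    receipt = charge_customer(order_id)

    try:
        log_tracking_event(order_id)
    except Exception:
        # Fail loudly but safely: record the full traceback and the order ID
        # so the failure is easy to trace, then carry on serving the customer.
        logger.exception("tracking event failed for order %s", order_id)

    return receipt
```

Calling `complete_checkout("42")` still returns the receipt even though tracking failed, and the log entry carries the traceback and order ID. In a real system that log line would typically feed an alert, so the failure is contained without being silent.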
Build your network
Incidents are a great opportunity to meet people outside your team. This is useful both for your personal growth and for the organization.
It’s useful to build relationships with colleagues who have different skillsets from your usual teammates. Maybe there's someone who knows lots about a particular technology, or someone who is a really great coach. Having a network of talented people to ask for advice is a great accelerator.
As an organization, one of your primary challenges is keeping everyone pulling in the same direction. Relationships that cross team boundaries are very powerful: they help build empathy for the challenges that other teams are facing, and often highlight when that alignment isn’t there.