I joined GoCardless as a junior engineer. It was one of my first coding jobs, and in my time there I progressed to senior much faster than I had expected. When I reflect on how this happened, one pattern stands out to me: the big step changes in my understanding, and in my ability to solve larger and more complex engineering problems, came as a result of incidents.
When I encountered an incident, I was introduced to new technologies, learned new skills and met people who became some of my closest friends. And every time, I’d come out as a better engineer; I accelerated my career by running towards the fire.
I learnt to love incidents early on at GoCardless. In the first few months of my time there, we had a major one: our API had slowed to a crawl, which pretty much broke our entire product. I was curious, so I jumped into the incident channel.
One endpoint in particular was consistently timing out, so we disabled it to get the system back up and running. Phew.
Now we had to understand why this had happened. There weren’t any recent changes that looked suspicious, so our attention shifted to the database. It turned out that the query plan for this particular query had changed, from something that was expensive but manageable, to something that was not at all manageable. We made a subtle change to the query which made the database revert to the ‘good’ query plan. Everything was back up and running: we’d fixed it.
Well, I say “we”… In fact, I watched quietly from the sidelines, furiously taking notes. After the incident was over, I turned to a senior engineer in my team:
“What’s a query plan?”
At the time, I’m not sure I appreciated how lucky I was to experience such incidents: to join Slack channels, take notes, and ask questions about subjects that I knew nothing about. But looking back now, I can see that there were five principal benefits to this early exposure.
As engineers, we live in a world full of black boxes - programming languages, frameworks, databases. Instead of forever opening these boxes up, we learn how to use the interface and move on. This is sensible; if we tried to understand how everything worked down to the metal we’d never get to ship anything.
But incidents force you to open the black boxes around you, peek inside, and learn just enough to solve the problem.
After that API problem, I read up on query plans. This proved useful: it was by no means our last query plan incident, and understanding query plans made me better at my day-to-day work. For the first time, I was able to write code that scaled well without lots of trial and error.
Incidents give you great signal about which of these black boxes are worth opening, and a real-world example that you can use as a starting point.
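To make the idea of a query plan concrete, here is a minimal sketch. It uses SQLite from Python's standard library rather than Postgres (which is what the incident involved), purely so that it runs without any setup. The `payments` table and index name are hypothetical. It asks the database how it intends to run the same query before and after an index exists, which is the kind of plan change that can turn a manageable query into an unmanageable one, or the reverse.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the human-readable steps SQLite would take to run `sql`."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    # Each row is (id, parent, notused, detail); `detail` describes the step.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical table, for illustration only.
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)"
)

sql = "SELECT * FROM payments WHERE customer_id = 42"

# Without an index, the only option is to scan every row in the table.
plan_without_index = query_plan(conn, sql)

# With an index on the filtered column, the planner can search instead.
conn.execute("CREATE INDEX idx_payments_customer ON payments (customer_id)")
plan_with_index = query_plan(conn, sql)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan differs between SQLite versions, and Postgres has its own `EXPLAIN` output, but the habit is the same: ask the database what it is going to do before assuming you know.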
One of the key follow-ups from the API incident was to add statement timeouts on all our database calls. That meant that, if we issued a bad query, Postgres would try for a few seconds and then give up.
This is an excellent example of resilient engineering: our system could now handle unexpected failures. We didn’t need to know what would issue a bad query. Just that it was likely that something would.
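The post does not say how the timeouts were configured, so the following is only one possible sketch. Postgres has a `statement_timeout` setting, and libpq-style connection strings accept an `options` entry that applies it to every statement in the session. The helper name, host and database name below are hypothetical, and the 5000 ms value is just a stand-in for "a few seconds".

```python
def with_statement_timeout(dsn: str, timeout_ms: int) -> str:
    """Return a libpq-style connection string that sets statement_timeout.

    Assumes `dsn` is in keyword/value form and has no `options` entry yet.
    """
    if timeout_ms <= 0:
        # In Postgres, a statement_timeout of 0 disables the timeout entirely,
        # which is the opposite of what this helper is for.
        raise ValueError("timeout_ms must be positive")
    # A bare number is interpreted by Postgres as milliseconds.
    return f"{dsn} options='-c statement_timeout={timeout_ms}'"


# Hypothetical connection details, for illustration only.
dsn = with_statement_timeout("host=localhost dbname=payments", 5000)
print(dsn)
```

Setting the timeout at connection time means every caller gets the protection by default, rather than each query having to remember it. The same setting can also be applied per session with `SET statement_timeout = '5s'`.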
It’s possible to read about a solution like that in a book, but nothing compares to seeing and applying it in action. During this incident, I learned a whole set of tools that I could employ to reduce the blast radius of potential failures - not just the statement timeouts which we implemented, but all the other options that the incident response team discussed and discarded.
Observability isn’t straightforward. I’ve shipped plenty of useless log lines and metrics. To build genuinely observable systems, you need to have empathy for your future self (or team mate) who’ll be debugging an issue. This empathy is hard to learn in the abstract.
The people I’ve met who do this well are leaning on their experience of debugging issues. They are pattern matching on things they’ve seen before, allowing them to identify useful places for logs and metrics.
Incidents are a great shortcut to get this kind of experience and build a repository of patterns that you can recognise going forwards.
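As an illustration of what a log line written with your future self in mind might look like, here is a small sketch. It assumes that the question you will be asking mid-incident is "which query, for whom, how slow, and did it fail?". The logger name, event name and field names are all hypothetical choices, not a description of any particular team's setup.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("payments.db")  # hypothetical logger name


@contextmanager
def timed_query(name: str, **context):
    """Emit one structured log line per query, with duration and outcome."""
    fields = {"event": "db_query", "query": name, **context}
    started = time.monotonic()
    outcome = "ok"
    try:
        # Yield the fields so the caller can attach more context if needed.
        yield fields
    except Exception as exc:
        # Record what went wrong, then let the error propagate unchanged.
        outcome = type(exc).__name__
        raise
    finally:
        fields["outcome"] = outcome
        fields["duration_ms"] = round((time.monotonic() - started) * 1000, 1)
        logger.info(json.dumps(fields))


# Usage: wrap the database call; the log line is written on success or failure.
with timed_query("list_payments", customer_id=42) as fields:
    pass  # the real query would run here
```

The point is less the mechanism than the fields: a name you can search for, the context needed to reproduce the problem, and a duration you can graph.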
Incidents provide a great opportunity to meet people outside your team, and forge strong relationships along the way. As psychologists have known for a while, going through a stressful situation with someone builds a connection more quickly than normal.
Most of the non-engineering folks I met at GoCardless I met during incidents. Those relationships were really valuable: they gave me a mental map of the rest of the company, and meant that I had a friendly face I could talk to when I needed advice. And as I became more senior, that network became more important as I was responsible for larger and larger projects.
When things go really wrong, people from all over the organisation get pulled in to help fix it. But they’re not just any people. They’re the people with the most context, the most experience, whom everyone trusts to fix the problem. The people with the technical know-how to enter unknown territory.
Getting to spend time with these folks is rare; they’re likely to be some of the busiest people in the company. Incidents provide a unique opportunity to learn from them, and to see first-hand how they approach a challenging problem.
These five benefits of seeing and handling incidents suggest something fundamental: incidents have unusually high information density compared with day-to-day work, and they enable you to piggy-back on the experience of others. The API incident, for example, gave me opportunities to learn much faster than I otherwise would have. Who knows how long it might have been before I’d realised that I needed to know what a query plan was? I probably wouldn’t have realised until my own code broke in the same way.
How can you make sure that your engineers have a healthy diet of incidents?
At GoCardless, I was lucky. Their culture and processes meant that I could see incident channels and follow along, giving me the opportunity to speed up my progression. But this isn’t always the case…
Some teams run incidents in private channels by default: operating an ‘invite only’ policy. That means that junior members, who want to observe rather than participate, might not even be aware that they’re happening.
Sometimes people are excluded for other reasons: it’s not culturally encouraged to get involved - there’s an ‘in-group’ who handle all the incidents and everyone else should just get out of the way. Joining that in-group, even as a new senior, can become almost impossible.
Let’s look at what we can do to build a culture where incidents are accessible, and where everyone can learn from them.
Declaring more incidents is the single most impactful change you can make to your incident process. If you only declare incidents when things get really bad, you won’t get a chance to practise your incident process.
If you lower the bar for what counts as an incident, then when the really bad ones do come around, the response is a well-oiled machine. It also helps with learning: when problems are handled as incidents, they become more accessible to everyone around you.
Incidents are great learning opportunities, and they should be accessible to everyone. Incident channels should be public by default, and engagement encouraged at all levels.
Of course there can be too much of a good thing: having 20 people descend into a minor incident channel may not be the best outcome. But most incidents can comfortably accommodate a few junior responders tagging along.
It doesn’t have to come at the cost of a good response. Junior responders can still learn without slowing things down:
You could even compile a list of the best incident debriefs to share with new joiners. These debriefs are a great way to get started in a new company.
In an incident, you should put as much information as you can into the incident channel. What command did you run? What theory have you disproved? If you’re debugging on your own, this can admittedly feel a bit strange. I’ve been sat at 10pm in an incident channel having a conversation with myself. But it’s worth it, I promise.
It’s useful for your response as it means you don’t have to rely on your memory to know exactly what you’ve already tried and when.
Keeping a thorough record is also good for your teams. By writing everything down, you’re enabling everyone to learn from your experience: just because it’s obvious to you, it doesn’t mean it’s obvious to everyone.
How do I go about showing my working? Well, use public Slack channels wherever possible, and have a central location where everyone can go to find incidents that they might be interested in (using an incident management platform really does help with this one).
You can also read more in our Incident Management Guide.
Often, a single engineer takes on a lot of the incident response burden, fixing everything before anyone knows it’s broken. Maybe this is you.
This doesn’t end well for the hero; they’ll stop getting as much credit as they expect for fixing things. It becomes normalised. And what’s more, they’re at risk of burning out.
But the hero also causes problems for the rest of the team. Without meaning to, they are taking away these learning opportunities from everyone else by fixing things quietly in the corner. That means that nobody else is ever going to be as effective as the hero, because no-one’s had enough practice. A hero culture is not going to result in a high performing team.
If you think you get a lot of recognition for resolving incidents, imagine how much you’ll get for levelling up your whole team, for teaching them all how to run towards the fire.
Lisa Karlin Curtis was a senior engineer at GoCardless, a payments company. In 2021, she left to join incident.io as employee #2. This blog is adapted from a talk that Lisa gave at the LeadDev conference in June 2022.
If you want to get in touch with Lisa, then feel free to reach out to her on Twitter @paprikati_eng. If you're interested in incidents in general, incident.io has a Slack Community where we talk all things incident management: incident.io/community.