How to react when things go wrong

Tom Blomfield

My name's Tom - I previously founded Monzo and GoCardless, and I'm an angel investor based in London. I'm proud to have backed the incident.io team.

If you run any kind of always-on software business, customers expect your service to be available 24/7. When it's not, you'll get customer resentment and negative headlines. We all remember the days of the Twitter fail whale.

The issue, though, is that as your business gets more and more complex, you'll inevitably encounter problems - whether that's a software problem, a business process not working as expected, or simply human error. For the sake of common language, we call these problems "incidents". You can (and should) try to minimise their frequency and severity, but it's naive to think that you'll completely eliminate incidents. Once you've accepted that you're never going to be perfect, you can focus on responding to incidents as effectively and efficiently as possible in order to minimise impact and recover quickly.

One of the most memorable incidents I was involved with as CEO of Monzo was a 10-hour outage on a Sunday - it was the 5th of March 2017. Before Monzo was a full bank, we offered customers a prepaid debit card, and this prepaid card relied on a small number of outsourced technology providers. At the time, I think we had about 250,000 customers.

The incident started early on Sunday afternoon, when automated alerts began sounding. Very quickly, our customer service queues were overwhelmed as customers around the world experienced their payments being declined. I think I was notified by a PagerDuty call, and quickly logged into Slack.

I joined a channel that had been created for the incident, and tried to follow along as our engineers posted updates from various systems as they investigated. I didn't want to ask too many questions, because I wanted our people to focus on identifying and fixing the problem.

It quickly became apparent that our outsourced card processor (which connected our systems to Mastercard) was no longer sending us any data. We had no way to process card transactions without them, and our customers couldn't spend their money. When we called our relationship manager for updates, there was a note of alarm in his voice. Things didn't sound good. Their service was completely unavailable and they had no timeline for a fix.

As each new person joined our Slack channel, they went through the same inefficient process as I had - scanning back through the channel for updates, perhaps asking questions for clarification. It wasn't initially clear who the incident technical lead was, who was directing proactive customer communications, or who was dealing with the growing support backlog. We fixed that, assigning specific roles and responsibilities.

The incident technical lead would post updates in the channel every 15 minutes and coordinate the engineers who were investigating the problem. People joining the channel could read these updates rather than interrupt individual engineers with questions.

A separate team ran proactive customer communications. We quickly updated our public statuspage, wrote a blogpost, posted in our community forum and tweeted updates. When it became obvious that the incident wasn't going to be over quickly, we proactively sent a push notification to all Monzo app users, letting them know that their cards wouldn't work.

Another team dealt with the inbound customer service queues (mostly in-app chat). They drafted in as many staff members as they could find on a Sunday afternoon, and heroically battled to keep the queues under control.

Once people's roles and responsibilities were clearer, and we'd established good communication protocols, things settled into more of a rhythm. It was still a horrible situation - our entire service was down - but it felt good to be part of a team that was pulling together to tackle a crisis.

Finally, after 10 hours, our supplier managed to resolve their technical issues, and the service came back online. And because we'd been organised and communicated proactively, customers were incredibly understanding. The amount of support was honestly unbelievable.

Monzo ultimately survived this incident, and perhaps even came out of it stronger than before. We published public post-mortems and made several changes to the way we operated as a company, including totally rebuilding the card processor with in-house software (which I am glad to say has been rock-solid to this day).

Actions aside, one of the biggest learnings we took away from this incident was that coordinating a response was just too hard in the heat of the moment. We thought we were well prepared with our incident management procedures defined on paper, but when it came to the pressure of an incident none of it actually worked. I don't think a single person consulted our documentation to see what they should do.

In fact, the biggest step-change at Monzo was the introduction of good incident response tooling, in the form of a Slack bot we called "Response". It took the processes we'd defined on paper, and made them run on rails.

Response made it easy for anyone in the company to declare a new incident, spin up a new Slack channel and then quickly assign key roles (incident lead, comms manager etc). You were prompted to define a severity ("critical", "major", "minor" etc), and then escalate to the relevant people as necessary via a slick integration with PagerDuty, without leaving Slack. It had enough structure to keep everyone informed and up-to-date without the need for constant question-and-answer.
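To make that structure concrete, here is a minimal sketch of the kind of record such a tool might keep for each incident. It is not the actual Response or incident.io code: the class names, the two roles and the rule that "major" and "critical" incidents escalate are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


class Role(Enum):
    # Hypothetical role set; a real tool would likely have more.
    INCIDENT_LEAD = "incident lead"
    COMMS_MANAGER = "comms manager"


@dataclass
class Incident:
    title: str
    severity: Severity
    declared_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    roles: dict[Role, str] = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        """Record who holds a given role for this incident."""
        self.roles[role] = person

    def unassigned_roles(self) -> list[Role]:
        """Roles nobody has picked up yet, so a bot could prompt for them."""
        return [role for role in Role if role not in self.roles]

    def needs_escalation(self) -> bool:
        # Assumption: only major and critical incidents page someone.
        return self.severity.value >= Severity.MAJOR.value


incident = Incident("Card payments declining", Severity.CRITICAL)
incident.assign(Role.INCIDENT_LEAD, "alice")
print(incident.unassigned_roles())
print(incident.needs_escalation())
```

Keeping the roles and severity as explicit fields is what lets a bot notice a missing comms manager and prompt for one, instead of relying on someone remembering the written procedure.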

If you needed to update the company's Statuspage, or assign new actions to team-members, you could do all of that within Slack, without having to log into a different, potentially unfamiliar tool. And if you forgot anything, the tool would prompt you to follow the processes we'd defined. It meant each team could do their job efficiently without confusion or duplication. It was this simple set of rails that gave our incident management structure and clarity.

Once things had returned to normal, the tool allowed us to run post-incident analysis, and view high-level trends in our incidents over time. Because the key information was already stored in a structured format, we could spend our time focussing on the learnings and improvements we wanted to make. And a simple integration with our issue-tracker meant that these improvements actually got done!
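As a rough illustration of why structured records make trend analysis cheap, the sketch below counts incidents by month and severity. The records are made-up sample data, and the function name and tuple layout are assumptions for the example rather than anything from the real tool.

```python
from collections import Counter
from datetime import date

# Made-up sample records for illustration: (date declared, severity).
SAMPLE_INCIDENTS = [
    (date(2020, 1, 7), "minor"),
    (date(2020, 1, 21), "major"),
    (date(2020, 2, 3), "minor"),
    (date(2020, 2, 17), "minor"),
    (date(2020, 2, 28), "critical"),
]


def incidents_per_month(records):
    """Count incidents keyed by (year-month, severity)."""
    counts = Counter()
    for declared, severity in records:
        counts[(declared.strftime("%Y-%m"), severity)] += 1
    return counts


# Print one line per (month, severity) pair, in sorted order.
for (month, severity), total in sorted(incidents_per_month(SAMPLE_INCIDENTS).items()):
    print(month, severity, total)
```

Because every incident already carries a date and a severity, a summary like this is a few lines of code rather than an afternoon of reading back through old Slack channels.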

If you want to bring this kind of structure and clarity to your incident response, you should try this tool. It's now available as a fully-supported service at incident.io. I worked with the three founders (Chris, Pete and Stephen) at Monzo and GoCardless, and I'm proud to be an investor in their new company.

Image credit: Monzo
