Build customer trust even during downtime
Deep into an incident, Slack firing, up to your ears in decisions, not sure where to turn next?
It’s easy for external communication with your customers to fall far down the list of priorities in these moments.
However, these are exactly the situations where comms are vital, and where underestimating their importance can have damaging and lasting effects on your organization.
A quick look at social media will show the stark difference in sentiment towards companies that communicate well when things go wrong, and those who don’t. Handling incidents well is one of the best opportunities you’ll have to build trust and strengthen the relationship between you and your customers.
Let’s take a look at Atlassian’s recent outage.
Following a migration that went badly wrong, around 400 companies with anywhere from 50,000 to 800,000 users were suddenly left with no access to any Atlassian services (think Jira, Confluence, Opsgenie, and Statuspage).
Whilst it’s good to analyse how to prevent incidents like this, in these situations it’s not what goes wrong, but how you deal with it that will come to define you as a company.
The communication breakdown over the following fortnight was where it all went wrong for Atlassian.
We’ll dive into a few core mistakes they made along the way, and how a customer-centric response process can help you avoid them.
During Atlassian’s outage, there were 8 days without clear and transparent communication on what went wrong, for whom, and how they were going to fix it.
It was only on day 4 that they acknowledged it on Twitter. After this update, they went silent for another 5 days.
While running a maintenance script, a small number of sites were disabled unintentionally. We’re sorry for the frustration this incident is causing and we are continuing to move through the various stages for restoration. [1/3]
— Atlassian (@Atlassian) April 7, 2022
How did they manage to go so long without a useful update? It feels ridiculous, but if communication isn’t integral to your process, it’s surprising how easily it slips, with everyone heads down, thinking someone else will handle it.
External updates should be so baked into your incident process that it’s impossible to avoid them.
Here are some ways we’ve made this second nature during incident response:
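One way to do this is with a simple automation that nudges the incident channel whenever an external update is overdue. Here’s a minimal sketch, assuming a hypothetical bot built with Slack’s Python slack_sdk; the channel name, threshold, and scheduling loop are illustrative placeholders rather than a description of any particular tool:

```python
import os
import time
from datetime import datetime, timedelta, timezone

from slack_sdk import WebClient

# Hypothetical sketch: nudge the incident channel if no external update
# has gone out recently. Channel name and threshold are placeholders.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
INCIDENT_CHANNEL = "#inc-jira-outage"
UPDATE_INTERVAL = timedelta(minutes=20)

last_external_update = datetime.now(timezone.utc)


def record_external_update() -> None:
    """Call this whenever a status page or customer update is published."""
    global last_external_update
    last_external_update = datetime.now(timezone.utc)


def nudge_if_overdue() -> None:
    """Post a reminder in the incident channel when comms have gone quiet."""
    overdue = datetime.now(timezone.utc) - last_external_update
    if overdue > UPDATE_INTERVAL:
        client.chat_postMessage(
            channel=INCIDENT_CHANNEL,
            text=(
                f":mega: It's been {int(overdue.total_seconds() // 60)} minutes "
                "since the last external update. Comms lead, please post one!"
            ),
        )


if __name__ == "__main__":
    # Run the check every few minutes, e.g. from a worker or cron-style loop.
    while True:
        nudge_if_overdue()
        time.sleep(300)
```

In practice you’d wire this into whatever tracks your incident timeline, but the principle is the same: the process, not someone’s memory, is what keeps comms flowing.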
Your core objective should always be to minimise customer impact. This doesn’t necessarily align with shortest time to a technical fix.
Let’s take Atlassian’s example.
They prioritised getting all their services back on track for all their customers as soon as possible. Sounds like a pretty sensible goal, right?
In reality, they missed a chance to minimise customer impact by not involving their users until far too late in the process. Not all product areas are equal in impact for all customers. For some businesses, having a fix for Opsgenie would rank far above their other services. For others, having an export of raw data would have allowed them to manage for much longer with no access to Jira.
Whilst tailoring your response to the customer can increase time to fix, it can significantly reduce the negative impact.
So, how do you work out what route to take when you have multiple available?
Ask your customer! Customer needs are complex and varied. If your incident has multiple options with various pros and cons, this is an excellent signal that it’s time to involve your customer. They have the best idea of what impacts them the most and why, and involving them at this stage demonstrates that you care about their experience.

It’s really important to know your audience here. Including specific details about what went wrong and why goes a long way towards fostering a sense of transparency. Just make sure you’re pitching the details at the right level of technicality.
Incidents are unexpected by nature, so a rigid approach to the type and timeline of your communication won’t work.
Striking the right level of transparency is difficult. When incidents occur it’s important to be honest about them, but you don’t want to worry other customers over something that isn’t affecting them.
For small incidents involving just one customer, there’s no need to tell everyone. Instead, create a shared comms channel and provide frequent, direct updates.
On the other extreme, for incidents affecting everyone, it’s important to post public updates. Ensure you’re keeping an up-to-date status page. A good rule of thumb is that if people are tweeting to ask whether you’re down, you should probably have shared a status update already.
*half the internet on fire*
*aws dns issues*
AWS status page: pic.twitter.com/lZ5rEciDqm
— I Am Devloper (@iamdevloper) May 27, 2022
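Keeping that status page current is much easier when publishing an update is a single call in your tooling rather than a separate chore. Here’s a minimal sketch, assuming a hypothetical status page API that accepts JSON over HTTP; the URL, authentication, and field names are placeholders, not any specific provider’s real endpoints:

```python
import os

import requests

# Hypothetical status page API; the URL, auth header, and payload shape
# are placeholders rather than any specific provider's real endpoints.
STATUS_PAGE_URL = "https://status.example.com/api/v1/incidents"
API_TOKEN = os.environ["STATUS_PAGE_TOKEN"]


def post_status_update(title: str, status: str, message: str) -> None:
    """Publish an incident update so customers aren't left guessing."""
    response = requests.post(
        STATUS_PAGE_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "name": title,
            "status": status,  # e.g. "investigating", "identified", "resolved"
            "message": message,
            "notify_subscribers": True,
        },
        timeout=10,
    )
    response.raise_for_status()


if __name__ == "__main__":
    post_status_update(
        title="Degraded performance on some sites",
        status="investigating",
        message=(
            "We're aware of elevated error rates and are investigating. "
            "Next update within 30 minutes."
        ),
    )
```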
The in-between state can be more complex: for incidents affecting a specific subset of customers, should you go public or keep things locked down? There isn’t a clear answer here, but as communication becomes a core part of your response, the right call gets easier to make. Having a dedicated communications lead on each incident is a great way to trigger these conversations and make those involved think about the best way to tell people what’s going on.
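One way to make that judgement call more repeatable is to encode the rough logic above as a default for whoever owns comms. Here’s a minimal sketch; the thresholds, categories, and handling of sensitive incidents are illustrative assumptions, not a fixed rule:

```python
from enum import Enum


class CommsRoute(Enum):
    DIRECT_CHANNEL = "shared Slack channel with the affected customer"
    TARGETED_UPDATE = "direct updates to the affected segment"
    PUBLIC_STATUS_PAGE = "public status page update"


def choose_comms_route(
    affected_customers: int,
    total_customers: int,
    is_sensitive: bool,
) -> CommsRoute:
    """Rough starting point for the comms lead; thresholds are illustrative."""
    if is_sensitive:
        # Security or privacy incidents: get a second opinion before going public.
        return CommsRoute.TARGETED_UPDATE
    if affected_customers <= 1:
        return CommsRoute.DIRECT_CHANNEL
    if affected_customers / total_customers >= 0.25:
        return CommsRoute.PUBLIC_STATUS_PAGE
    # The in-between case: default to targeted comms and escalate if in doubt.
    return CommsRoute.TARGETED_UPDATE
```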
💡 While transparency is always a good default, sensitive incidents involving security or data privacy require some nuance. Seek a second opinion before sharing comms in these scenarios.
As you introduce tactics like these, your process will naturally shift towards being customer-centric, with comms becoming second nature.
Ultimately, customers are who we’re writing our code for, so keeping them in the loop on what’s going on is a really sensible way to foster trust.
On day 9 of Atlassian’s outage, the CTO posted a long article detailing what had gone wrong and why it had taken so long to fix, with clear timelines for when to expect restored services. The change in tone from users was huge: anger on Twitter was quickly replaced with understanding and empathy (alongside some more questions). The 9-day wait for an update definitely did some very avoidable damage, but it goes to show what an impact being transparent and honest can make.