Deep into an incident, Slack firing, up to your ears in decisions, not sure where to turn next?
It’s easy for external communication with your customers to fall far down the list of priorities in these moments.
However, these are the exact situations where comms are vital, and where underestimating their importance can have damaging and lasting effects on your organisation.
A quick look at social media will show the stark difference in sentiment towards companies that communicate well when things go wrong, and those that don’t. Handling incidents well is one of the best opportunities you’ll have to build trust and strengthen the relationship between you and your customers.
Let’s take a look at Atlassian’s recent outage.
Following a migration that went badly wrong, around 400 companies with anywhere from 50,000 to 800,000 users were suddenly left with no access to any Atlassian services (think Jira, Confluence, Opsgenie, and Statuspage).
Whilst it’s good to analyse how to prevent incidents like this, in these situations it’s not what goes wrong, but how you deal with it that will come to define you as a company.
The communication breakdown that followed in the next fortnight was where it all went wrong for Atlassian.
We’ll dive into a few core mistakes they made along the way, and how a customer-centric response process will avoid these.
During Atlassian’s outage, there were 8 days without clear and transparent communication on what went wrong, for whom, and how they were going to fix it.
It was only on day 4 that they acknowledged it on Twitter. After this update, they went silent for another 5 days.
Atlassian (@Atlassian), April 7, 2022: “While running a maintenance script, a small number of sites were disabled unintentionally. We’re sorry for the frustration this incident is causing and we are continuing to move through the various stages for restoration. [1/3]”
How did they manage to go so long without a useful update? It feels ridiculous, but if communication isn’t integral to your process, it’s surprising how easily it slips, with everyone heads down, thinking someone else will handle it.
External updates should be so baked into your incident process that it’s impossible to avoid them.
Here are some ways we’ve made this second nature during incident response:
A Communications Lead role
For incidents of a certain scale or severity, we make it mandatory to have a communications lead. For smaller incidents this could be the incident lead, but in larger ones it’s useful to have a dedicated person tracking the status of external updates. Depending on the scale and type of the incident, this could be anyone from an engineer to a customer support rep, or even someone from the PR team. Having a single person responsible means no-one falls into the trap of thinking someone else is on it.
Involve someone close to customers
It’s vital to involve someone with direct insight into the customer(s) affected in your incident. Engineers here at incident.io communicate directly with our users. This isn’t the case everywhere, and in these situations it’s important to loop someone in who has empathy for the people affected. Incident channels should not be exclusive to engineers discussing the deep technicalities of a problem.
Don’t wait until you have all the information
Proactive incident communication is best. People always appreciate a “we’ve seen something go wrong, we don’t know why, but we’re on it” message. You don’t need to wait for all the information before sending an update, as this will often delay things further. Opening a channel of communication early can mean you learn more useful information from the customer about what went wrong. Once the conversation exists, it’s a nice reminder to send small updates as and when you have them.
Let’s take Atlassian’s example.
They prioritised getting all their services back on track for all their customers as soon as possible. Sounds like a pretty sensible goal, right?
In reality, they missed a chance to minimise customer impact by not involving their users until far too late in the process. Not all product areas are equal in impact for all customers. For some businesses, having a fix for Opsgenie would rank far above their other services. For others, having an export of raw data would have allowed them to manage for much longer with no access to Jira.
Whilst tailoring your response to the customer can increase time to fix, it can significantly reduce the negative impact.
So, how do you work out what route to take when you have multiple available?
Ask your customer! Customer needs are complex and varied. If your incident has multiple options with various pros and cons, this is an excellent signal that it’s time to involve your customer. They have the best idea of what impacts them the most and why, and involving them at this stage demonstrates that you care about their experience.
It’s really important to know your audience here. Including specific details about what went wrong and why goes a long way towards fostering a sense of transparency. Just make sure you’re pitching the details at the right level of technicality.
Incidents are unexpected by nature, so a rigid approach to the type and timeline of your communication won’t work.
Striking the right level of transparency is difficult. When incidents occur it’s important to be honest about them, but you don’t want to worry other customers over something that isn’t affecting them.
For small incidents involving just one customer, there’s no need to tell everyone. Instead, create a shared comms channel, and provide frequent direct updates.
At the other extreme, for incidents affecting everyone, it’s important to have public updates. Ensure you’re keeping an up-to-date status page. A good rule of thumb is that if people are tweeting asking you if you’re down, you should probably have shared a status update.
I Am Devloper (@iamdevloper), May 27, 2022:
*half the internet on fire*
*aws dns issues*
AWS status page: pic.twitter.com/lZ5rEciDqm
The in-between state can be more complex: for incidents affecting a specific subcategory of people, should you go public or lock things down? There isn’t a clear answer here, but as communication becomes a core part of your response, it can become clearer. Having the communications lead role we mentioned above is a great way to trigger these conversations and make those involved think about the best way to tell people about it.
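If you want to nudge responders towards these conversations in your own tooling, the three cases above can be written down as a tiny rule. This is only a minimal sketch under our own assumptions, not a description of how any particular product works: the `Incident` fields, the `Audience` options and the `choose_audience` function are all hypothetical names made up for illustration. Note that the in-between case deliberately returns "go and discuss it" rather than pretending there is a rule.

```python
from dataclasses import dataclass
from enum import Enum


class Audience(Enum):
    # Hypothetical options, mirroring the three cases described above.
    DIRECT = "shared comms channel with the affected customer"
    PUBLIC = "public status page update"
    DISCUSS = "no clear answer: comms lead raises it with the team"


@dataclass
class Incident:
    # Hypothetical fields; real incident data will look different.
    affected_customers: int
    total_customers: int


def choose_audience(incident: Incident) -> Audience:
    """Suggest a starting point for external comms based on incident scope."""
    if not 1 <= incident.affected_customers <= incident.total_customers:
        raise ValueError("affected_customers must be between 1 and total_customers")
    if incident.affected_customers == 1:
        # One customer affected: talk to them directly, don't worry everyone else.
        return Audience.DIRECT
    if incident.affected_customers == incident.total_customers:
        # Everyone affected: public updates and an up-to-date status page.
        return Audience.PUBLIC
    # A subset affected: this is a judgment call for people, not for code.
    return Audience.DISCUSS


if __name__ == "__main__":
    print(choose_audience(Incident(affected_customers=40, total_customers=500)).value)
```

The value of something like this is less in the answer it gives and more in forcing the question to be asked early in every incident.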
As you introduce tactics like these, your process will naturally shift towards being customer-centric, with comms becoming second nature.
Ultimately, customers are who we are writing our code for, so keeping them in the loop on what’s going on is a really sensible way to foster trust.
On day 9 of Atlassian’s outage, the CTO posted a long article detailing what had gone wrong and why it had taken so long to fix, providing clear timelines about when to expect restored services. The change in tone from users was huge: anger on Twitter was quickly replaced with understanding and empathy (alongside some more questions). The 9-day wait for an update definitely did some very avoidable damage, but it goes to show what an impact being transparent and honest can make.