Build customer trust even during downtime
Deep into an incident, Slack firing, up to your ears in decisions, not sure where to turn next?
It’s easy for external communication with your customers to fall far down the list of priorities in these moments.
However, these are exactly the situations where comms are vital, and where underestimating their importance can have damaging and lasting effects on your organization.
A quick look at social media will show the stark difference in sentiment towards companies that communicate well when things go wrong, and those who don’t. Handling incidents well is one of the best opportunities you’ll have to build trust and strengthen the relationship between you and your customers.
Let’s take a look at Atlassian’s recent outage.
Following a migration that went badly wrong, around 400 companies with anywhere from 50,000 to 800,000 users were suddenly left with no access to any Atlassian services (think Jira, Confluence, Opsgenie, and Statuspage).
Whilst it’s good to analyse how to prevent incidents like this, in these situations it’s not what goes wrong, but how you deal with it that will come to define you as a company.
The communication breakdown over the following fortnight was where it all went wrong for Atlassian.
We’ll dive into a few core mistakes they made along the way, and how a customer-centric response process can help you avoid them.
During Atlassian’s outage, there were 8 days without clear and transparent communication on what went wrong, for whom, and how they were going to fix it.
It was only on day 4 that they acknowledged it on Twitter. After this update, they went silent for another 5 days.
While running a maintenance script, a small number of sites were disabled unintentionally. We’re sorry for the frustration this incident is causing and we are continuing to move through the various stages for restoration. [1/3]
— Atlassian (@Atlassian) April 7, 2022
How did they manage to go so long without a useful update? It feels ridiculous, but if communication isn’t integral to your process, it’s surprising how easily it slips, with everyone heads down, thinking someone else will handle it.
External updates should be so baked into your incident process that it’s impossible to avoid them.
Here are some ways we’ve made this second nature during incident response:
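One way to do this is with a simple automation that nudges the incident channel whenever an external update is overdue. Here’s a minimal sketch, assuming a hypothetical bot built with Slack’s Python slack_sdk; the channel name, threshold, and scheduling loop are illustrative placeholders rather than a description of any particular tool:

```python
import os
import time
from datetime import datetime, timedelta, timezone

from slack_sdk import WebClient

# Hypothetical sketch: nudge the incident channel if no external update
# has gone out recently. Channel name and threshold are placeholders.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
INCIDENT_CHANNEL = "#inc-jira-outage"
UPDATE_INTERVAL = timedelta(minutes=20)

last_external_update = datetime.now(timezone.utc)


def record_external_update() -> None:
    """Call this whenever a status page or customer update is published."""
    global last_external_update
    last_external_update = datetime.now(timezone.utc)


def nudge_if_overdue() -> None:
    """Post a reminder in the incident channel when comms have gone quiet."""
    overdue = datetime.now(timezone.utc) - last_external_update
    if overdue > UPDATE_INTERVAL:
        client.chat_postMessage(
            channel=INCIDENT_CHANNEL,
            text=(
                f":mega: It's been {int(overdue.total_seconds() // 60)} minutes "
                "since the last external update. Comms lead, please post one!"
            ),
        )


if __name__ == "__main__":
    # Run the check every few minutes, e.g. from a worker or cron-style loop.
    while True:
        nudge_if_overdue()
        time.sleep(300)
```

In practice you’d wire this into whatever tracks your incident timeline, but the principle is the same: the process, not someone’s memory, is what keeps comms flowing.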
Your core objective should always be to minimise customer impact. This doesn’t necessarily align with shortest time to a technical fix.
Let’s take Atlassian’s example.
They prioritised getting all their services back on track for all their customers as soon as possible. Sounds like a pretty sensible goal, right?
In reality, they missed a chance to minimise customer impact by not involving their users until far too late in the process. Not all product areas are equal in impact for all customers. For some businesses, having a fix for Opsgenie would rank far above their other services. For others, having an export of raw data would have allowed them to manage for much longer with no access to Jira.
Whilst tailoring your response to the customer can increase time to fix, it can significantly reduce the negative impact.
So, how do you work out what route to take when you have multiple available?
Ask your customer! Customer needs are complex and varied. If your incident has multiple options with various pros and cons, this is an excellent signal that it’s time to involve your customer. They have the best idea of what impacts them the most and why, and involving them at this stage demonstrates that you care about their experience.

It’s really important to know your audience here. Including specific details about what went wrong and why goes a long way towards fostering a sense of transparency. Just make sure you’re pitching the details at the right level of technicality.
Incidents are unexpected by nature, so a rigid approach to the type and timeline of your communication won’t work.
Striking the right level of transparency is difficult. When incidents occur it’s important to be honest about them, but you don’t want to worry other customers over something that isn’t affecting them.
For small incidents involving just one customer, there’s no need to tell everyone. Instead, create a shared comms channel and provide frequent, direct updates.
On the other extreme, for incidents affecting everyone, it’s important to post public updates. Ensure you’re keeping an up-to-date status page. A good rule of thumb is that if people are tweeting to ask whether you’re down, you should probably have shared a status update already.
*half the internet on fire*
*aws dns issues*
AWS status page: pic.twitter.com/lZ5rEciDqm
— I Am Devloper (@iamdevloper) May 27, 2022
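Keeping that status page current is much easier when publishing an update is a single call in your tooling rather than a separate chore. Here’s a minimal sketch, assuming a hypothetical status page API that accepts JSON over HTTP; the URL, authentication, and field names are placeholders, not any specific provider’s real endpoints:

```python
import os

import requests

# Hypothetical status page API; the URL, auth header, and payload shape
# are placeholders rather than any specific provider's real endpoints.
STATUS_PAGE_URL = "https://status.example.com/api/v1/incidents"
API_TOKEN = os.environ["STATUS_PAGE_TOKEN"]


def post_status_update(title: str, status: str, message: str) -> None:
    """Publish an incident update so customers aren't left guessing."""
    response = requests.post(
        STATUS_PAGE_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "name": title,
            "status": status,  # e.g. "investigating", "identified", "resolved"
            "message": message,
            "notify_subscribers": True,
        },
        timeout=10,
    )
    response.raise_for_status()


if __name__ == "__main__":
    post_status_update(
        title="Degraded performance on some sites",
        status="investigating",
        message=(
            "We're aware of elevated error rates and are investigating. "
            "Next update within 30 minutes."
        ),
    )
```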
The in-between state can be more complex: for incidents affecting a specific subset of customers, should you go public or keep things locked down? There isn’t a clear answer here, but as communication becomes a core part of your response, the right call gets easier to make. Having a dedicated communications lead on each incident is a great way to trigger these conversations and make those involved think about the best way to tell people what’s going on.
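One way to make that judgement call more repeatable is to encode the rough logic above as a default for whoever owns comms. Here’s a minimal sketch; the thresholds, categories, and handling of sensitive incidents are illustrative assumptions, not a fixed rule:

```python
from enum import Enum


class CommsRoute(Enum):
    DIRECT_CHANNEL = "shared Slack channel with the affected customer"
    TARGETED_UPDATE = "direct updates to the affected segment"
    PUBLIC_STATUS_PAGE = "public status page update"


def choose_comms_route(
    affected_customers: int,
    total_customers: int,
    is_sensitive: bool,
) -> CommsRoute:
    """Rough starting point for the comms lead; thresholds are illustrative."""
    if is_sensitive:
        # Security or privacy incidents: get a second opinion before going public.
        return CommsRoute.TARGETED_UPDATE
    if affected_customers <= 1:
        return CommsRoute.DIRECT_CHANNEL
    if affected_customers / total_customers >= 0.25:
        return CommsRoute.PUBLIC_STATUS_PAGE
    # The in-between case: default to targeted comms and escalate if in doubt.
    return CommsRoute.TARGETED_UPDATE
```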
💡 While transparency is always a good default, sensitive incidents involving security or data privacy require some nuance. Seek a second opinion before sharing comms in these scenarios.
As you introduce tactics like these, your process will naturally shift towards being customer-centric, with comms becoming second nature.
Ultimately, customers are who we’re writing our code for, so keeping them in the loop on what’s going on is a really sensible way to foster trust.
On day 9 of Atlassian’s outage, the CTO posted a long article detailing what had gone wrong and why it had taken so long to fix, with clear timelines for when to expect restored services. The change in tone from users was huge: anger on Twitter was quickly replaced with understanding and empathy (alongside some more questions). The 9-day wait for an update definitely did some very avoidable damage, but it goes to show what an impact being transparent and honest can make.