Incidents are a continual buzz of activity, with multiple workstreams in flight and a lot going on at once. But if you want to understand – from the outside – what is happening, you’ll rely on the response team to distil updates and provide them at regular intervals.
If no one else knows what’s happening, they can’t make informed choices. Maybe that’s working out whether they should jump in and help, or tell you about something that could be related. Maybe that’s making sure they don’t make another change that will exacerbate the problem. Maybe it’s pretty intense, so they go and buy you some sandwiches.
When you’re writing an internal update, it’s easy to forget that the audience likely doesn’t have the same context that you do. Effective updates lead with the impact on customers and the business, avoid jargon and internal shorthand, and don’t assume the reader has been following along.
If you can, it’s also useful to share when to expect the next update - or give them an idea of whether this is a ‘few minutes’ fix or a ‘few hours’ fix.
Bad
Two of the cluster nodes have failed due to an AWS issue impacting m4.2xlarge instances. This means quorum writes are failing. We're working on it though.
Better
Our cloud provider, AWS, is having issues which means we can’t write data to our database. This means our customers can’t make purchases through the website.
We’re working with AWS to bring our database back online and we’ll send another update in the next hour.
Using a predictable format is useful too: it makes updates easier to parse if you’re a busy stakeholder flying through your notifications.
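For example, a template might always answer the same three questions (the exact fields and wording here are just illustrative):

Impact: Customers can’t make purchases through the website.
Current status: We’re working with AWS to bring our database back online.
Next update: Within the hour, or sooner if anything changes.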
As a side benefit, updates also advertise your incident process and normalise the fact that incidents happen. Ideally, someone’s first interaction with the incident process at your org should be as a consumer of an update - not being parachuted into the middle of something.
Make sure it’s clear who is responsible for communicating. Whether that’s the Incident Lead, or another chosen individual, making someone explicitly accountable is the best way to keep the updates coming.
Long pauses can be stressful for your stakeholders, and can create confusion. They might start trying to pull the information themselves, creating additional work for whichever responder they happen to DM. Updates are always best handled as one-to-many push comms.
Sending updates should be easy: too much friction means they’ll be sent less often, and stakeholders will fall back to chasing responders directly.
War story – The accidental chaos monkey
One afternoon we had an issue with our database. Lots of alerts were firing - API calls were getting slower and our queue latencies were out of control.
The platform team jumped on it and started investigating. They started scaling down workers, trying to remove unnecessary load, while simultaneously trying to understand why the database was suddenly struggling.
The metrics started improving, but then suddenly deteriorated again.
It turned out that someone was running an expensive backfill from a console. After they killed the console, the database started to recover.
During the chaos, no one had remembered to send updates to the rest of the engineering team.
The engineer running the backfill had no idea anything was wrong - they weren’t in the channels where the alerts fired, and they didn’t regularly check the database health metrics - why would they?