I caught the tail-end of a Twitter thread the other day which centred on the use of Slack channels for incidents, and whether creating a new channel for each new incident is helpful or harmful. It turns out this is a much more emotive subject than I thought, and since I have opinions I thought I’d share them!
Broadly speaking, the two approaches being discussed* were:

1. Create a brand new Slack channel for each incident
2. Run every incident from a single central channel, moving the response into an existing team channel once ownership is clear
I’ll disclose up front that I have a strong preference for the former, and in the interest of getting biases into the open it’s also what we do here at incident.io. I have, however, worked in organizations that used both approaches so you can consider this equal parts personal experience and corporate shilling.
*this is my understanding from the thread. If I’ve misinterpreted, or you think there’s another option I’ve missed, I’d love to hear from you! I'm @evnsio on Twitter 👋
In the channel-per-incident mode of response, the flow of an incident looks something like this:
This approach leads to each incident having a fresh space to coordinate, which means context loading is easy, and there’s complete clarity over who’s involved in the incident as they’ve explicitly joined. Combined with an incident index – something like a central #incidents channel where each new incident is broadcast – folks can clearly see what’s going on across the org.
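To make the index idea concrete, here’s a small sketch of how a per-incident channel name and the announcement posted to a central #incidents channel might be derived. All names and message formats here are my own illustrative assumptions, not incident.io’s actual API or conventions:

```python
# Sketch of the channel-per-incident flow with a central #incidents index.
# The naming scheme and message format are illustrative assumptions.
import re
from dataclasses import dataclass

@dataclass
class Incident:
    seq: int
    title: str
    reporter: str
    lead: str

def channel_name(incident: Incident) -> str:
    """Derive a dedicated channel name, e.g. 'inc-042-payments-outage'."""
    slug = re.sub(r"[^a-z0-9]+", "-", incident.title.lower()).strip("-")
    return f"inc-{incident.seq:03d}-{slug}"[:80]  # Slack caps channel names at 80 chars

def index_announcement(incident: Incident) -> str:
    """Concise summary broadcast to the central #incidents channel."""
    return (
        f":rotating_light: Incident #{incident.seq}: {incident.title}\n"
        f"Reported by {incident.reporter} · Lead: {incident.lead} · "
        f"Join #{channel_name(incident)} to help"
    )

inc = Incident(seq=42, title="Payments outage!", reporter="@amy", lead="@ben")
print(channel_name(inc))  # inc-042-payments-outage
print(index_announcement(inc))
```

The broadcast message is what gives the index its value: responders work in the fresh channel, while everyone else gets a one-glance summary in a single place.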
Additionally, new channels embrace the messy reality of incidents, where it’s not uncommon for them to be fuzzy-edged and misaligned with existing team structures, so a fresh space for a new ‘ephemeral team’ makes sense.
As with most things, there’s no such thing as a free lunch. With new channels created per incident, it’s harder to see the details of everything that’s going on at once. Without tooling to help, it can also mean unhappy Slack admins who have to ask folks to clean up old channels 😅
In the central-channel mode of response, the flow looks like this:
With this approach, your main channel for response contains all of the context of what’s going on with your organization. If you stay on top of this channel, you should be well plugged in to the current state of things.
When incidents are understood and/or the ownership is clear, the general approach is to move the response to an existing channel to be mitigated. This means fewer channels being created and more context being kept alongside the other communication for that team/service/product.
The drawback here is that the main tracking channel can get super noisy, so whilst information isn’t scattered across many places, it requires some effort to follow along.
One other downside is that stakeholders and folks in supporting teams who might need to be looped into the response effort now need to join the owning team’s channel. Over time this can converge on everyone being in everyone else’s channels, which in turn can lead to muting and missing information.
Rather than argue the relative merits of each approach, I thought I'd take a step back and come at this from a more principled standpoint. If we agree on the principles, and the resulting implementations follow them, there's really no better or worse, just personal preference.
To us, these principles look like:
Both approaches can achieve this, and which works best is likely to depend on the myriad of nuances that exist within your organization!
We’ve taken a pretty opinionated stance at incident.io, and I’d like to walk through why we think this makes sense, and how we’ve mitigated the drawbacks that were flagged in the original Twitter thread.
With a central channel where each incident is announced, everyone can quickly and easily see a concise summary of each incident: who reported it, who’s leading, and a number of other useful data points.
One objection raised in the thread was: “Creating new places to talk about [incidents] is adding even more entropy to the responses!”
We’re essentially providing a rich index of things going on: an index that makes these events easier to navigate, with the linked response channels in a single, easy-to-find place. No entropy increase here!
In a world where new channels are created for each new incident, there’s a definite risk that folks report something that’s already in flight. We’ve been there, so we introduced safeguards that flag recent ongoing incidents at the point of declaration. Infallible? No. But it drastically reduces the likelihood of multiple disparate responses to the same thing.
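The safeguard could work something like the sketch below: a naive recency-plus-keyword-overlap check. To be clear, this heuristic is my own invention for illustration, not incident.io’s actual implementation:

```python
# Illustrative sketch of flagging possible duplicate incidents at declaration
# time. The matching rule (word overlap within a time window) is an assumption.
from datetime import datetime, timedelta

def similar_ongoing(new_title, ongoing, now, window=timedelta(hours=2)):
    """Return recently-declared ongoing incidents that share words with the new title."""
    new_words = set(new_title.lower().split())
    matches = []
    for title, declared_at in ongoing:
        recent = now - declared_at <= window
        overlap = new_words & set(title.lower().split())
        if recent and overlap:
            matches.append(title)
    return matches

now = datetime(2022, 6, 1, 12, 0)
ongoing = [
    ("Payments API returning 500s", datetime(2022, 6, 1, 11, 30)),  # recent + overlaps
    ("Stale DNS cache", datetime(2022, 5, 31, 9, 0)),               # too old to flag
]
print(similar_ongoing("Payments checkout failing", ongoing, now))
# ['Payments API returning 500s']
```

Anything this returns gets surfaced to the person declaring, who can then join the existing response instead of spinning up a second one.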
Another concern from the thread:

“Multiple responses to the same incident that aren’t coordinating with each other in the same communications channel is a recipe for complete disaster. So now we have:
a) Stakeholders don’t know what’s going on
b) You don’t know who is reading what
c) Potentially multiple teams launching their own responses”
These points are 100% valid, but since we avoid multiple responses in separate places, they’re founded on a situation that doesn’t materialise.
With a fresh channel created at the start of your incident, it’s incredibly easy to scroll back in time and see how things materialised. No need to hunt for the start point, or unpick overlapping responses. Just you, the other responders, and all of the clean context in a single place.
What’s more, with this approach it’s easy to cherry pick interesting points into an incident timeline – a timeline most folks will want to pull together for any material incident.
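Cherry-picking into a timeline can be as simple as filtering the channel’s history for messages someone has marked, then sorting chronologically. The message shape here is an assumption of mine (I’m using pins as the marker), not a real Slack or incident.io payload:

```python
# Sketch: cherry-picking marked messages from a dedicated incident channel's
# history into a chronological timeline. The dict shape is illustrative.
def build_timeline(messages):
    """Keep only messages flagged as timeline entries, in chronological order."""
    picked = [m for m in messages if m.get("pinned")]
    picked.sort(key=lambda m: m["ts"])
    return [f'{m["ts"]}: {m["text"]}' for m in picked]

history = [
    {"ts": "10:02", "text": "Alert fired for checkout latency", "pinned": True},
    {"ts": "10:04", "text": "anyone around?", "pinned": False},
    {"ts": "10:07", "text": "Rolled back deploy 1a2b3c", "pinned": True},
]
print(build_timeline(history))
# ['10:02: Alert fired for checkout latency', '10:07: Rolled back deploy 1a2b3c']
```

Because the channel contains only one incident’s messages, there’s no filtering-out of unrelated chatter from other responses first.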
A new channel per incident makes it easy to inform anyone who wants to know about incidents, optionally filtering by arbitrary criteria. For those who need to know (or want to be included in a specific response), they can be invited automatically.
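One way to picture “filtering by arbitrary criteria” is a set of per-user subscriptions matched against each new incident. This is entirely my own sketch with invented field names, not incident.io’s feature surface:

```python
# Sketch: notify subscribers whose criteria match a new incident, and
# auto-invite those who asked to join. Field names are illustrative.
def route(incident, subscriptions):
    """Return (users_to_notify, users_to_invite) for a new incident."""
    notify, invite = [], []
    for sub in subscriptions:
        sev_ok = incident["severity"] in sub.get("severities", [incident["severity"]])
        team_ok = sub.get("team") in (None, incident["team"])
        if sev_ok and team_ok:
            notify.append(sub["user"])
            if sub.get("auto_join"):
                invite.append(sub["user"])
    return notify, invite

incident = {"severity": "major", "team": "payments"}
subs = [
    {"user": "@cto", "severities": ["major", "critical"]},       # notified only
    {"user": "@oncall", "team": "payments", "auto_join": True},  # notified + invited
    {"user": "@mobile", "team": "mobile"},                       # no match
]
print(route(incident, subs))
# (['@cto', '@oncall'], ['@oncall'])
```

The point is that the channel is the unit of inclusion: matching someone’s criteria means inviting them to exactly one place, rather than asking them to live in every team’s channel.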
If, like me, you’ve spent many hours painstakingly curating your Slack sidebar to keep things organised, you’ll be pleased to hear our incident channels aren’t going to break that. Nobody wants old channels lying around, so we’ve built a smart archiving feature which understands the state of incidents – and under conditions you control – closes and archives old channels.
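The archiving rule can be thought of as a simple predicate over incident state and channel activity. The specific conditions below (closed incident plus a week of inactivity) are placeholder assumptions standing in for whatever you’d configure:

```python
# Sketch: decide which incident channels to archive. The rule (closed
# incident + N days of inactivity) is an assumed, configurable policy.
from datetime import datetime, timedelta

def channels_to_archive(channels, now, inactive_for=timedelta(days=7)):
    """Archive channels whose incident is closed and that have gone quiet."""
    return [
        c["name"]
        for c in channels
        if c["status"] == "closed" and now - c["last_message_at"] >= inactive_for
    ]

now = datetime(2022, 6, 15)
channels = [
    {"name": "inc-041-dns", "status": "closed", "last_message_at": datetime(2022, 6, 1)},
    {"name": "inc-042-payments", "status": "open", "last_message_at": datetime(2022, 6, 1)},
    {"name": "inc-043-login", "status": "closed", "last_message_at": datetime(2022, 6, 14)},
]
print(channels_to_archive(channels, now))
# ['inc-041-dns']
```

Run on a schedule, a rule like this keeps sidebars tidy without anyone having to remember to clean up.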
Clearly, there is no silver bullet when it comes to incident tooling, and the intention here isn’t to convince you that incident.io will solve all of your problems. Getting good at incident response requires time, effort, trust, training, and a whole lot more. But good tooling built by people who care can certainly make life a lot easier 🙂
I'm one of the co-founders and the Chief Product Officer of incident.io. I've spent my whole career working in engineering.