One of my favourite features in incident.io is Decision Flows. With it, you can create a series of questions which eventually lead to a decision based on what you’ve answered. You can pull up this flow during an incident and it’ll guide you through the questions. It’s like having an experienced on-caller calmly guide you through what to do when a crisis hits.
This is complementary to incident.io’s Workflows feature. With Workflows, you can automate things like assigning roles and severities, and escalating to the right people. Sometimes though, you still need a human in the loop to make a call on what to do next, and that’s where Decision Flows are really powerful.
It’s tucked away a bit in the app settings and was originally designed to support teams managing a data breach incident, but it’s really flexible and can be used for a lot of different things. Once you’ve created one, you can invoke it by typing /incident decision in any incident channel.
Triggering a decision flow in Slack.
You can also start a decision flow automatically using the Workflows feature. This gives you the flexibility to trigger different decision flows for different flavours of incidents.
Decision flows in the wild
I’ve been digging into how decision flows are currently being used, and I wanted to share some of the examples I’ve found.
1. What severity is this?
The most common use-case for decision flows (beyond the data breach flow we offer as a default template) is to help teams decide which severity to give to an incident. The rules for what qualifies as a “Sev 0”, “P1” or “Critical” vary between companies, and are often based on the business sector they’re in.
A screenshot of the decision flow interface in the app, with an example of a basic incident severity flow.
Severities can also change throughout an incident’s lifecycle as new impacts come to light. A decision flow will help your teams determine how an incident should be classified, which will ensure the right processes are being followed during the incident, and make your reports more accurate afterwards.
Here’s what it looks like to users following a Decision Flow dialogue in Slack. This one is an example that helps determine an incident’s severity.
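If it helps to picture the structure, a severity flow is really just a small decision tree: each question branches on the answer until you reach an outcome. Here’s a rough sketch of that idea in Python. To be clear, this isn’t how incident.io builds or stores decision flows (you create those in the app); the questions, the severity names and the walk helper are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Decision:
    """A leaf of the flow: the outcome we land on."""
    outcome: str


@dataclass
class Question:
    """A yes/no question that branches to another question or a decision."""
    text: str
    yes: Union["Question", Decision]
    no: Union["Question", Decision]


# Hypothetical severity flow; the questions and severity names are invented.
severity_flow = Question(
    text="Are customers affected?",
    yes=Question(
        text="Is a core product feature unavailable?",
        yes=Decision("Critical"),
        no=Decision("Major"),
    ),
    no=Decision("Minor"),
)


def walk(node, answers):
    """Follow the flow using an iterable of yes/no answers (True = yes)."""
    answers = iter(answers)
    while isinstance(node, Question):
        node = node.yes if next(answers) else node.no
    return node.outcome
```

With this sketch, walk(severity_flow, [True, False]) follows “yes” then “no” and lands on “Major”. Writing your flow out like this (even on paper) can be a handy way to spot dead ends before you build it in the app.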
2. Are they down or us?
Often, incidents are due to a problem with a third-party service. A decision flow can walk the on-call lead through the typical steps, such as checking the status pages of related services and finding out where to contact support, so those incidents run more smoothly.
3. Should we update the status page?
Deciding whether to update an external status page is often a debate during an incident. A popular decision flow is one that guides team members on whether to send external comms and what to say in them. The very best examples include some guidance on wording so that pre-approved content can be copy-pasted into a status page or email.
4. Should we escalate this?
incident.io has some nice features to help automatically escalate to the right teams depending on the service impacted (and we’re working on some more!), but some companies prefer to do this as a decision flow rather than something that triggers automatically. These flows help the person on call to determine which team owns the impacted service, and also provide guidance on whether to pull in security, privacy and legal teams depending on exactly what’s been impacted.
5. How do I restart the database?
In the heat of an incident, even the most experienced engineers might need a reminder of how to restart a database, or what places to check when users are reporting issues logging in.
Decision flows that guide engineers on the steps they need to take will help resolve incidents faster. In my engineering days, I made myself detailed notes on things like how to log into the servers and restart the pods, and how to check for long-running queries. With a decision flow, I could have saved myself (and the next on-call engineer) a lot of time and stress!
6. Helping with non-engineering incidents
Some teams (including ours!) are using incident.io to manage incidents outside of engineering, so we’ve seen some innovative uses of decision flows, such as guiding team members on what to do if a company device is lost or stolen.
7. Can we close this already?
When is an incident done? Is it when the impact has been mitigated (done), or when the post-mortem has been written (done done)? This varies by company, so you may want to create a decision flow to help put that debate to rest. Or, now that we have multiple closed incident statuses, you could always add a second closed status so you can differentiate between done and done done.
What will you make?
Now that you’ve seen some ideas on how other teams use decision flows, I hope you’ll be inspired to try your own and share them on the incident.io community Slack!