Like most tech companies, we use an on-call rota and various alerting tools so that we can respond to incidents before they're reported. Proactively identifying issues and communicating with customers helps us provide great experiences and builds trust.
Internally, we’ve been using these alerting tools in tandem with our incident triggers feature. We’ve found that it’s made responding to the pager much smoother - it’s one less thing to do when you get paged at 2am. In this post, I’d like to share exactly how we have it set up.
As you might have read in previous posts [1, 2], we use Sentry for error reporting and PagerDuty for alerting. When an alert triggers, the integration works like this:
The core part of this flow is our incident triggers feature. Enabling this feature means that anytime a PagerDuty alert triggers, incident.io creates an incident in a triage state. Triage state is a precursor to a full incident.
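To make that flow concrete, here's a minimal sketch of the kind of glue involved, assuming a small HTTP receiver for PagerDuty webhooks that opens an incident in a triage state. The endpoint, payload fields, and type names are illustrative only - this isn't incident.io's actual implementation or API.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// pagerDutyEvent holds the few webhook fields this sketch cares about.
type pagerDutyEvent struct {
	AlertID string `json:"alert_id"`
	Service string `json:"service"`
	Summary string `json:"summary"`
}

// triageIncident is a precursor to a full incident: a responder can
// accept, merge, or dismiss it from the Slack message we post.
type triageIncident struct {
	Status string // always "triage" at creation time
	Alert  pagerDutyEvent
}

func handlePagerDutyWebhook(w http.ResponseWriter, r *http.Request) {
	var evt pagerDutyEvent
	if err := json.NewDecoder(r.Body).Decode(&evt); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}

	// Open the incident in a triage state and post the Slack message
	// offering accept / merge / dismiss (the Slack call is elided here).
	inc := triageIncident{Status: "triage", Alert: evt}
	log.Printf("opened triage incident for service %q: %s", inc.Alert.Service, inc.Alert.Summary)
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/webhooks/pagerduty", handlePagerDutyWebhook)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```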
When one is created, you'll get a Slack message with the following options:
If you choose to create an incident, you'll get an incident with the PagerDuty alert attached. Additionally, we'll fetch its details from PagerDuty. If we can identify attachments on it - such as the originating Sentry error - then we'll attach them to the incident as well.
If you choose to merge the incident, we'll let you choose an existing incident to merge it into. This means the original incident will receive the new PagerDuty alert as an attachment. When you later close the incident, we'll automatically resolve both alerts in PagerDuty.
If you dismiss the incident, we'll resolve the PagerDuty alert, archive the generated Slack channel, and keep it out of your metrics.
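Continuing the hypothetical sketch above, the three choices from that Slack message map to three different outcomes. The helpers here (promote, attachToExisting, resolveInPagerDuty, archiveChannel) are stand-ins for the behaviour described in this post, not real API calls.

```go
// outcome is what the responder picked from the Slack message.
type outcome int

const (
	acceptIncident  outcome = iota // promote the triage incident to a full one
	mergeIncident                  // fold the alert into an existing incident
	dismissIncident                // false alarm
)

func resolveTriage(t *triageIncident, choice outcome, existingID string) {
	switch choice {
	case acceptIncident:
		// Full incident, with the PagerDuty alert and any attachments we can
		// identify (such as the originating Sentry error) attached to it.
		promote(t)
	case mergeIncident:
		// The existing incident receives the new alert as an attachment;
		// closing it later resolves both alerts in PagerDuty.
		attachToExisting(existingID, t.Alert)
	case dismissIncident:
		// Resolve the alert, archive the generated Slack channel, and keep
		// the dismissed incident out of reporting metrics.
		resolveInPagerDuty(t.Alert.AlertID)
		archiveChannel(t)
	}
}

// Stubs standing in for the real side effects.
func promote(t *triageIncident)                    { log.Printf("accepted: %s", t.Alert.Summary) }
func attachToExisting(id string, a pagerDutyEvent) { log.Printf("merged alert %s into %s", a.AlertID, id) }
func resolveInPagerDuty(alertID string)            { log.Printf("resolved alert %s", alertID) }
func archiveChannel(t *triageIncident)             { log.Printf("archived channel for %s", t.Alert.Summary) }
```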
To read more about setting this up, see our help center article.
So we have this setup, but why have we chosen to do it this way?
When a PagerDuty alert triggers, you go straight to the incident channel and you have everything you need right in front of you. You’ve got a link to the Sentry error, the PagerDuty alert, and all your usual incident commands at your fingertips. There's no manual tracking of related alerts or time spent creating incidents. Instead, there's more time for you to focus on communicating and fixing the issue.
When an error occurs and Sentry triggers a PagerDuty alert, we notify an engineer. We also use PagerDuty’s Slack integration to receive updates about those alerts inside Slack. Those updates can get pretty noisy, especially if you have a large outage and everything is stuck in failure loops.
With the incident triggers feature, you get a configurable debounce window per service. So if the same service fails twice within 30 minutes, you’ll only get one incident (plus the option to attach the subsequent alerts to it). This means you can focus on your targeted incident channel and ignore the noise coming from PagerDuty.
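As a rough model of that behaviour, here's a minimal sketch of a per-service debounce window, assuming the 30-minute example above. In practice this is configuration inside incident.io rather than code you'd write yourself.

```go
package main

import (
	"fmt"
	"time"
)

// debouncer remembers when each service last opened an incident.
type debouncer struct {
	window   time.Duration
	lastSeen map[string]time.Time
}

func newDebouncer(window time.Duration) *debouncer {
	return &debouncer{window: window, lastSeen: map[string]time.Time{}}
}

// shouldCreateIncident reports whether an alert for the service should open
// a fresh incident (true) or be attached to the existing one (false).
func (d *debouncer) shouldCreateIncident(service string, now time.Time) bool {
	if last, ok := d.lastSeen[service]; ok && now.Sub(last) < d.window {
		return false // inside the window: attach to the existing incident
	}
	d.lastSeen[service] = now
	return true
}

func main() {
	d := newDebouncer(30 * time.Minute)
	start := time.Now()
	fmt.Println(d.shouldCreateIncident("payments", start))                      // true: new incident
	fmt.Println(d.shouldCreateIncident("payments", start.Add(10*time.Minute)))  // false: same incident
	fmt.Println(d.shouldCreateIncident("payments", start.Add(45*time.Minute)))  // true: window elapsed
}
```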
Receiving these updates in Slack is useful, as it lets people know an alert has fired and then been resolved. There are, however, some drawbacks to these PagerDuty messages, so instead of relying on them we encourage people to follow issues via our incidents channel.
We’re always looking at how we can improve our product. Part of doing that is using it ourselves and understanding it deeply. Having used the triggers feature extensively, we think it’s a great way to run your incident response process. And, by using it on a day-to-day basis, we’re able to keep on top of points of friction and build improvements into our product roadmap.
I hope you've found this post useful. If you're interested in trying out incident.io, you can sign up for a demo. If you're already a user and want to use incident triggers, check out our help center article or reach out via Intercom.