Everyone experiences stress at work—thankfully, it’s a topic folks aren’t shying away from anymore.
But for on-call engineers, alert fatigue is a phenomenon closer to home. Unfortunately, like stress, it can be just as insidious and drastically impact those it affects.
First discussed in the context of hospital settings, this phrase later entered engineering circles. Alert fatigue is when an excessive number of alerts overwhelms the individuals responsible for answering them, often over a prolonged period, resulting in missed or delayed responses, or alerts being ignored altogether.
The impact of this fatigue extends beyond the individual and can create significant risks for your organization.
But, if you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we'll dive into the tactics teams can implement to address alert fatigue and its underlying causes.
Alert fatigue is a common issue impacting various sectors, including healthcare, aviation, IT, and engineering teams.
These days, teams are responsible for a vast web of complex products, serving user bases that grow daily. Taken together, the potential for incidents has increased significantly, and each of those incidents brings its own disruption and pager noise.
The massive volume of alerts generated by the systems monitoring an organization's products can overwhelm folks, leading to many being ignored or dismissed. This reality is precisely why it’s imperative for organizations to proactively address the causes and symptoms of alert fatigue before they spiral into something untenable.
To address alert fatigue effectively, it helps to first understand its many causes.
Preventing alert fatigue is possible, but it requires a proactive approach. Here are some practical steps to tackle this widespread issue head-on:
This is likely to feel a little uncomfortable, especially in a world where you’re drowning in alerts, but it’s the single most important change you can make to your workflow.
By treating every alert (or group of alerts) as an incident, you can start to build up a picture of what you're actually doing when they fire. This is vital for completing the feedback loop, and will quickly help you spot the alerts that correlate with no action, and those that are commonly linked to high-severity incidents.
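To make that feedback loop concrete, here's a minimal Python sketch of the kind of tally you might run over an export of alert firings and their linked incidents. The data shape and field names are assumptions for illustration; substitute whatever your alerting and incident tooling can actually export.

```python
from collections import Counter

# Hypothetical export: one record per alert firing, with the linked incident's
# severity and whether responders actually did anything. Field names are illustrative.
alert_firings = [
    {"alert": "HighCPU", "incident_severity": "low", "action_taken": False},
    {"alert": "DBConnectionsExhausted", "incident_severity": "high", "action_taken": True},
    {"alert": "HighCPU", "incident_severity": "low", "action_taken": False},
]

fired = Counter(f["alert"] for f in alert_firings)
actioned = Counter(f["alert"] for f in alert_firings if f["action_taken"])
high_sev = Counter(f["alert"] for f in alert_firings if f["incident_severity"] == "high")

for alert, total in fired.most_common():
    print(
        f"{alert}: fired {total}x, "
        f"actioned {actioned[alert]}x, "
        f"linked to high-severity incidents {high_sev[alert]}x"
    )
```

Alerts that fire often but are never actioned are your loudest candidates for demotion; the ones that keep showing up alongside high-severity incidents are the ones worth protecting.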
Another important change you can make is to ensure that alerts waking engineers out-of-hours genuinely require immediate attention. You want to avoid disturbing someone for low-priority alerts that could wait until their working day or be ignored altogether.
In fact, there’s a debate to be had around whether low severity alerts should exist at all. If they’re purely informational and don’t require action, you could make a case that they should be replaced by dashboards.
If you’re linking alerts to incidents, an easy way to stay on top of whether the alerts require action is to include an alert review step as part of your post-incident processes.
If alerts fired but didn’t provide a useful signal, or they were false alarms, they should be considered immediate candidates for a lower priority or outright removal. Conversely, an alert review step is a good opportunity to add alerts that might be missing!
Often your alerts will be directionally good, but poorly tuned. Thresholds can feel like a moving target, but a review of metrics to understand your systems' normal operating ranges can quickly surface misconfigurations.
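As a sketch of what that review can look like in practice, the Python below pulls a window of historical samples for the metric behind an alert and compares the normal range to a candidate threshold. The sample values and the "p99 plus some headroom" starting point are assumptions, not a rule; the point is to anchor thresholds in observed behaviour rather than guesswork.

```python
import statistics

# Hypothetical: a week of p95 latency samples (in ms) exported from your metrics store.
samples = [120, 135, 128, 142, 150, 138, 131, 160, 129, 140, 137, 133]

# Understand the normal operating range before picking a threshold.
median = statistics.median(samples)
p99 = statistics.quantiles(samples, n=100)[98]
print(f"median={median:.0f}ms, p99={p99:.0f}ms")

# A threshold that sits inside the normal range will page constantly.
# Starting above the observed p99 with some headroom, then tuning from
# real incidents, keeps the alert tied to genuinely unusual behaviour.
suggested_threshold = p99 * 1.2
print(f"suggested starting threshold: {suggested_threshold:.0f}ms")
```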
Regular reviews of alert effectiveness, informed by team feedback and system changes, can help ensure alerts remain relevant and actionable.
We’ve seen success in the past by introducing a recurring meeting to review the alerts that have fired. If you sort your alerts by how often they’ve fired over the review period and work from top to bottom, you can take a huge step toward reducing noise by tackling the worst offenders.
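If your tooling can export a list of firings, building that top-to-bottom agenda is a few lines of Python. The export format below is an assumption; adapt it to whatever your alerting tool actually provides.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical export of alert firings over the review period.
firings = [
    {"alert": "HighCPU", "fired_at": datetime(2024, 3, 1, 2, 15)},
    {"alert": "DiskSpaceLow", "fired_at": datetime(2024, 3, 2, 14, 0)},
    {"alert": "HighCPU", "fired_at": datetime(2024, 3, 3, 3, 40)},
]

review_start = datetime(2024, 3, 1)
review_end = review_start + timedelta(days=14)

counts = Counter(
    f["alert"] for f in firings if review_start <= f["fired_at"] < review_end
)

# Meeting agenda: noisiest alerts first, work from the top down.
for alert, count in counts.most_common():
    print(f"{count:>4}  {alert}")
```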
Effective alert consolidation and de-duplication can significantly reduce notification noise. By grouping alerts into incidents, you can start to track which ones fire for the same or similar underlying causes.
Many monitoring and alerting tools let you configure alert suppression rules, which allow you to say: “If alert X is firing, don’t tell me about alert Y.” A good example: if a database failure causes multiple services to alert, consider reconfiguring your alerts to avoid a barrage of noise.
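Most tools express this declaratively (Prometheus Alertmanager, for example, calls them inhibition rules), but the underlying logic is simple enough to sketch in Python. The rule and alert names below are made up for illustration.

```python
# Each rule says: while the source alert is firing, drop the listed target alerts.
# Alert names here are made up for illustration.
suppression_rules = [
    {"source": "DatabaseDown", "targets": {"OrdersServiceErrors", "PaymentsLatencyHigh"}},
]

def filter_alerts(firing: set[str]) -> set[str]:
    """Return the alerts that should actually page, after applying suppression."""
    suppressed = set()
    for rule in suppression_rules:
        if rule["source"] in firing:
            suppressed |= rule["targets"]
    return firing - suppressed

print(filter_alerts({"DatabaseDown", "OrdersServiceErrors", "PaymentsLatencyHigh"}))
# -> {'DatabaseDown'}: the root cause still pages, the downstream noise doesn't.
```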
In our experience, people are far happier to add alerts than to remove them. It stands to reason since adding an alert provides a sense of security: “I can worry less if I know an alert will fire here.”
And it’s the reverse of this situation that makes removing alerts a less comfortable practice. When you ask about it, you’ll almost certainly hear: “But what if X happens and we don’t know about it?”
When it comes to removing alerts, here are a few things to consider:
The scheduling of on-call rotations greatly affects exposure to alerts. Strive for balance to prevent burnout, ensuring enough responders are in rotation without losing familiarity with the system.
It can be helpful to consider various scheduling types, like daily, weekly, or follow-the-sun rotations to accommodate geographical and workload diversity.
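To illustrate the simplest of these, here's a tiny Python sketch of a weekly rotation: hand over every Monday, cycling through the team in order. The names and dates are placeholders.

```python
from datetime import date

# Hypothetical weekly rotation: handover every Monday, cycling through the team.
responders = ["Aisha", "Ben", "Carmen", "Dmitri"]
rotation_start = date(2024, 1, 1)  # a Monday

def on_call_for(day: date) -> str:
    """Return whoever holds the pager for the week containing `day`."""
    weeks_elapsed = (day - rotation_start).days // 7
    return responders[weeks_elapsed % len(responders)]

print(on_call_for(date(2024, 3, 15)))  # -> "Carmen"
```

A follow-the-sun setup is the same idea with shorter handover periods pinned to each region's working hours, so nobody is paged in the middle of their night.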
Here's a quick breakdown of other rotation types you'll come across, but there are many more!
It’s true that alert fatigue affects a lot of people on-call. But the good news is that more people are talking about it openly and are treating it as the exception, not the rule.
No one should be expected to be the hero for their organization and work days on end being on-call. No one should deal with dozens of low-severity alerts at 1 AM. The reality is that working these shifts is hard, disruptive, and can get in the way of personal life and responsibilities.
The best thing you can do is set up processes to ensure that, when folks are on-call, they do so in a way that is fair for everyone involved.
By making it a less stressful experience, you can help everyone do their best work, reduce operational risks, and ultimately deal with incidents much more effectively.
Before you drop folks into a rota, it's worthwhile to lay the groundwork for on-call that's fair for everyone and protects your company as well.
By putting thought into how you structure your on-call rotation, you can set everyone up for success: your team and your business.
Deciding how, and if, you're going to compensate folks for being on-call can be a tough conversation. Here, we outline several of the most common compensation structures you'll come across.
Ready for modern incident management? Book a call with one of our experts today.