Whenever you provide a service that businesses or individuals rely on, it's important to keep it up and running with as few disruptions as possible.
But the reality is that, despite your best efforts, downtime does happen. Regardless of when incidents strike, whether it’s 2 PM in the middle of the working day or 2 AM, it's important to have people available to diagnose and resolve issues as soon as possible. And in a world of increasingly global service offerings, around-the-clock reliability is critical.
However, unless you have a globally distributed workforce, you’re going to need to think about that round-the-clock coverage, and that’ll likely mean having people on-call.
Whether you have an existing on-call function that you’re looking to improve or you’re in the process of setting up your first team of people on-call, we’ve got you covered with practical advice covering everything from scheduling to driving a healthy culture and looking out for heroes.
Making the effort to get on-call right
While a world without on-call is a total non-starter for most organizations, getting on-call right takes more than setting up some schedules and hoping for the best.
It’s important to be very deliberate about your choices when building out an on-call function. First, you'll need to think through what your actual schedule looks like and make sure you're providing coverage everywhere you need it. At the same time, you'll need to consider the impacts on your team and their well-being.
It's all a balancing act.
And consider this: what happens if someone doesn't know how to respond to an incident and they have to escalate it? What should they be doing? What do handoffs look like from one person to another?
All of these questions should have answers before you start throwing teammates’ names onto an on-call schedule.
💡 You build it, you run it
Before going any further, it’s worth mentioning that however you set up your on-call function, the most sensible approach is to have folks be responsible for what they build.
By using the “you build it, you run it” approach and having folks take ownership of their work, you can avoid anti-patterns where one team builds and another team maintains. You’ll also incentivize folks to build better and more resilient products since they'll be directly plugged into operational feedback from their systems and services.
Scheduling—protecting your product & your team
The heart of any on-call function is the actual schedule. Here's some guidance to help you create a schedule that protects your business needs.
Do you have enough people?
While it may seem obvious, don’t overlook having the right number of people on your on-call rotation. The last thing you want is to overload a handful of folks with overnight shifts, which is a tried-and-true recipe for alert fatigue, and have them be responsible for areas of the product they typically don’t work on. Remember: you build it, you run it.
When it comes to defining the “right” number of people, context plays a large part. But generally, a group of 6-8 people makes sense on a weekly rotation. Ultimately, you’re balancing two things: people shouldn’t be on-call so often that they burn out, but they should be on-call often enough that they remember how things work.
If you’re struggling to get enough people onto a particular rota, it’s worth considering merged rotations across teams. While you’ll obviously have less context supporting someone else's systems, having two small teams share a rotation can often be better than the alternative of them either not offering out-of-hours support or offering it with a small team that’s under operational overload.
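To make the trade-off concrete, here is a minimal Python sketch of a weekly round-robin rotation and the arithmetic behind team size. The names, start date, and helper functions are hypothetical and only illustrate the idea; they are not tied to any particular scheduling tool.

```python
from datetime import date, timedelta


def weekly_rotation(people, start, weeks):
    """Assign one person per week, round-robin, beginning on `start`."""
    shifts = []
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        person = people[i % len(people)]
        shifts.append((week_start, week_start + timedelta(weeks=1), person))
    return shifts


def weeks_on_call_per_year(team_size):
    """Approximate number of on-call weeks each person covers in a year."""
    return 52 / team_size


# Hypothetical six-person team on a weekly rotation.
team = ["ana", "ben", "chloe", "dev", "eli", "fay"]
rota = weekly_rotation(team, date(2024, 1, 1), weeks=12)
```

With six people, each person is on call for roughly 8.7 weeks a year; with eight, that drops to 6.5. Merging two small teams into one rotation moves each of them along the same curve.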
No experience? Shadow someone
Just because someone is on your team doesn’t mean that they’re ready to be on-call. This can come down to a few things: limited experience as an engineer, short tenure at your organization, lack of familiarity with your systems, or a host of other entirely justifiable reasons.
Whatever the reason, there’s one thing you can do to make sure people are set up on the right track to comfortably be on-call: shadow shifts.
This looks like teaming up a new joiner with someone more experienced until they feel ready to take things on alone or, if team size and other constraints allow, permanently having more than one person on-call at once.
This achieves a few important things. First, folks get to dive in head first in a low-stakes environment that’s set up for learning. Second, relationships get built, which can’t be overlooked, especially in this era of distributed work. And lastly, you help foster an environment where people feel protected and there isn’t an expectation for them to defy the odds and do work they aren’t ready for.
Incentivize folks to be on-call
Being on-call is not easy, and it’s rarely something people want to do, so it can be helpful to create incentives. These might be financially related, progression-related, or purely about developing operational experience.
Being on-call is a great opportunity to get into details that are tricky to uncover in “normal” operations and offers some of the most concentrated learning environments around. Ultimately, the responsibilities of on-call give folks the opportunity to become better engineers.
Whatever approach you take here, thinking about the right incentives is important in developing an on-call function that both serves the needs of the business and feels valuable to the individuals.
Acknowledging humanness—protecting your team
Being on-call can have a very real impact on your teammates' lives. Because of this, it's important to factor in the physical and mental well-being of your team when thinking your rota through.
Setting up (and encouraging) overrides
Life happens. And just because someone is on-call doesn’t mean that they should be required to ignore all of their personal responsibilities to cover a shift. This is where overrides come in handy.
You should encourage and support your team in handing off the pager to someone when things come up, no questions asked, whether it’s to go to the gym, hang out with their kids, go for a walk, or anything else. An hour or two of cover can really go a long way towards making everyone feel protected and supported. On-call doesn’t need to mean putting your life on hold for a week.
Similarly, if you’ve seen that someone has been paged overnight and dealt with a particularly tricky incident, encouraging team members to take the pager from them for a few hours while they get some rest can foster a great environment for everyone.
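One way to picture overrides is as a second layer that sits on top of the base rota and wins whenever it covers the current time. The sketch below assumes that simple model; the `Shift` type and `on_call_at` function are hypothetical names for illustration, not any product's API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    person: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_call_at(when, base_shifts, overrides):
    """Return who holds the pager at `when`; overrides win over the base rota."""
    for shift in overrides:
        if shift.start <= when < shift.end:
            return shift.person
    for shift in base_shifts:
        if shift.start <= when < shift.end:
            return shift.person
    return None  # nobody scheduled: a gap in coverage


# Hypothetical example: Ana has the week, Ben covers two hours on Wednesday evening.
base = [Shift("ana", datetime(2024, 1, 1), datetime(2024, 1, 8))]
cover = [Shift("ben", datetime(2024, 1, 3, 18), datetime(2024, 1, 3, 20))]
```

Because the override is just another entry with a start and an end, the pager returns to the scheduled person automatically when the cover ends, which keeps short handoffs cheap to arrange.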
Be conscious of alert fatigue
Alert fatigue sets in when someone on-call is dealing with an overwhelming number of alerts and becomes desensitized to them. This can lead to them missing, ignoring, or delaying their response to future alerts. And given the unpredictable nature of being on-call, alert fatigue can set in quickly.
But it's wholly avoidable if you're intentional about which incidents warrant a page in the middle of the night. Just because someone is on-call doesn't mean that they need to respond to incidents of very little consequence during unsociable hours.
By being more selective about what requires a page, you can avoid folks getting woken up in the middle of the night for issues that can be dealt with during normal business hours.
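Being selective about pages can be written down as an explicit routing rule. The sketch below is one possible policy, not a recommendation: the severity levels, the 09:00 to 18:00 working hours, and the choice to page only for critical alerts are all assumptions you would replace with your own.

```python
from datetime import time
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# Assumed values for illustration only.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)
PAGE_THRESHOLD = Severity.CRITICAL


def route_alert(severity, local_time):
    """Decide whether an alert pages someone, notifies a channel, or waits."""
    if severity >= PAGE_THRESHOLD:
        return "page"  # worth waking someone up at any hour
    if BUSINESS_START <= local_time < BUSINESS_END:
        return "notify"  # the team is around, so a message is enough
    return "defer"  # hold until the next working day
```

Under these assumptions, a minor alert at 2 AM is deferred, a major alert at 2 PM notifies the team, and only a critical alert pages the responder overnight.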
Be on the lookout for heroes
Hero engineers. These are well-meaning responders who are just trying to do the right thing and step up when someone is needed.
While it feels great, and there’s a collective sigh of relief when that one engineer drops into an incident, letting folks do this all the time can have long-term consequences. Once the expectation is set that this person is the backstop, people will always expect them to step in. You’re essentially papering over the cracks of operational maturity, and one day, when they’re not around, you’re in for a pretty tough time.
This gets back to the point about making sure you have enough people staffed for on-call. By making sure there aren’t any gaps in coverage, you can avoid situations where folks feel like they need to step up to protect the company. Shadow rotations can help here, too. Pairing folks up with high performers is a great way to socialize tacit knowledge.
Living on-call—giving your team the structure to thrive
And finally, when folks are in the middle of an on-call shift, there are a few things you can do to make sure that they're set up for success.
Preparing folks to be on-call
This technically happens before the first shift, but you should be crystal clear about what being on-call looks like. Documentation goes a long way here. You should proactively anticipate any questions that may come up during someone’s first week and answer them in a place they can quickly reference. Things like:
What happens during handoffs? What does a Critical severity incident look like? How do you triage something you aren’t sure is an incident? How do you escalate? What if you need an override at 2 AM?
The last thing you want is someone who’s working their first on-call shift not knowing how to handle specific situations because you haven’t set up the proper documentation.
Automating where it makes sense
Today, many organizations have systems in place to help take low-leverage activities off the plate of on-call responders, such as routinely notifying key stakeholders or proposing a set of actions to take under certain conditions.
Take incident.io's Workflows as an example. With them, teams can automate many actions during the response process that they'd be completing anyway. We’re not trying to automate the resolution, but encoding the boilerplate parts of your response process can free up mental cycles that can be better spent elsewhere and minimize the amount of up-front training required before people can go on-call.
If folks know that there's some form of automation to back them up while they're on call, they can operate with much more confidence than they would otherwise.
Make it OK for people to say, “I don’t know.”
And finally, just because someone is on-call doesn’t mean that they will always have the answers. Not knowing how to resolve every incident is a totally normal and expected part of being on-call.
Making it clear that this is OK is crucial to building an environment where folks feel comfortable and supported while on-call. Part of this is making it clear what the next steps are for escalations. Anyone holding the pager should not be afraid of, or dissuaded from, paging others when they don’t know how to address an incident.
Doing this creates safety nets for everyone and ensures that incidents get resolved by the people who are best equipped to respond to them, and no one is shamed for not knowing the answer.
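Clear next steps for escalation can be as simple as an ordered list of who gets paged and how long to wait before moving on. The sketch below models that under stated assumptions: the level names, the wait times, and the `who_to_page` helper are hypothetical, and it only covers the case where the alert has not yet been acknowledged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    target: str
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def who_to_page(levels, minutes_since_alert):
    """Return every target that should have been paged for an unacknowledged alert."""
    paged = []
    elapsed = 0
    for level in levels:
        if minutes_since_alert < elapsed:
            break  # this level's turn has not come yet
        paged.append(level.target)
        elapsed += level.wait_minutes
    return paged


# Hypothetical policy: primary first, then secondary, then the engineering manager.
policy = [
    EscalationLevel("primary", 5),
    EscalationLevel("secondary", 10),
    EscalationLevel("engineering-manager", 15),
]
```

Writing the chain down like this means the person holding the pager never has to decide whether escalating is acceptable; the next step is already agreed.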