Fundamentals of on-call

When you provide a service that people rely on, it’s essential to have the right people available when things go wrong and need human intervention. This process, known as on-call, involves a rotating schedule where designated individuals take turns being the first responders.

On-call typically goes hand-in-hand with a paging system: tooling that notifies whoever is on-call, usually with an audible alarm, when a person or an automated system raises an alert. While it’s often used for out-of-hours support, on-call can be just as valuable during the day, especially when teams are balancing meetings and other responsibilities.
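
As a rough illustration of the rotating schedule described above, here is a minimal sketch of a round-robin rota lookup. The function name `on_call_at`, the people in the rota, and the weekly shift length are all assumptions made for the example, and real paging tools also handle overrides, time zones, and holidays.

```python
from datetime import datetime, timedelta


def on_call_at(rota, start, shift_length, when):
    """Return who is on call at `when` for a simple round-robin rota.

    rota: ordered list of names (hypothetical people).
    start: datetime when the first person's first shift began.
    shift_length: timedelta covering one shift.
    """
    if when < start:
        raise ValueError("the rota has not started yet")
    # Whole shifts completed since the rota began.
    shifts_elapsed = (when - start) // shift_length
    # Wrap around so the rota repeats once everyone has had a turn.
    return rota[shifts_elapsed % len(rota)]


rota = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, 9, 0)
week = timedelta(weeks=1)

# A page at 2am on 10 January falls in the second weekly shift.
print(on_call_at(rota, start, week, datetime(2024, 1, 10, 2, 0)))
```

The same lookup works for a daytime-only rota if `shift_length` and `start` are chosen to match working hours.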

On-call isn't solely for engineers

If there’s a chance you might be needed in an incident, there’s a good chance you should be on-call.

Many organizations consider incidents—and by extension, on-call—to be solely an engineering matter. But our experience shows it’s far from that. While incidents may start with engineering, they often require a team from various functions to come together to communicate, collaborate, and resolve the issue.

Imagine a major outage at a payments company like Stripe. Although engineering might be the first to notice the issue, others quickly need to get involved: customer support needs to update users, public relations may need to issue statements, legal needs to evaluate any contractual impact, and compliance must ensure regulatory requirements are met, all while engineers work on potential fixes. Responding effectively becomes a whole-organization effort.

Whether an incident happens at 2pm or 2am, swift decisions are essential. Being able to quickly pull the right person into the room can make all the difference. That means having on-call rotas for legal teams, execs, marketing, compliance: any team with a unique set of skills or knowledge that might be needed in a time-sensitive situation.

You build it, you run it

Some organizations have a dedicated on-call team, who are responsible for resolving issues across the whole company. While this can be attractive, and protects the rest of the company from the ‘burden’ of being on-call, we believe that it’s better to distribute on-call across your organization.

Being on-call for the work you do has clear benefits:

When people feel the pain of things going wrong, they are motivated to fix them, because they have skin in the game. That aligns incentives and results in more resilient systems.
Healthy organizations are ones where the support load is evenly distributed. That means everyone is pulling together, rather than throwing problems over the fence.

Knowing when to escalate

For on-call to be successful, it’s important that each on-caller understands their responsibilities and the decisions that they are able to make. This empowers people to make decisions at speed, which is critical to reducing your resolution times.

The other half of this is just as important: on-callers must know who to escalate to (and how) when they need to pull someone else in.
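
One way to make "who to escalate to, and how" explicit is to write the escalation path down as data, so nobody has to work it out mid-incident. The sketch below is illustrative only: the `EscalationLevel` type, the contacts, and the wait times are assumptions, not a recommended policy or any particular tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    contact: str
    wait_minutes: int  # how long to wait for an acknowledgement before moving on


def contacts_paged(policy, minutes_unacknowledged):
    """Return everyone who should have been paged so far, in order."""
    paged = []
    elapsed = 0
    for level in policy:
        # Stop once the next level's turn has not yet arrived.
        if elapsed > minutes_unacknowledged:
            break
        paged.append(level.contact)
        elapsed += level.wait_minutes
    return paged


# Hypothetical escalation path for one team.
policy = [
    EscalationLevel("primary on-caller", 5),
    EscalationLevel("secondary on-caller", 10),
    EscalationLevel("engineering manager", 15),
]

print(contacts_paged(policy, 0))
print(contacts_paged(policy, 5))
print(contacts_paged(policy, 20))
```

This models only automatic escalation when a page goes unacknowledged. An on-caller deciding to pull someone in directly is a human judgement that the written path should support rather than replace.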

Deciding to escalate is always difficult, particularly out-of-hours. You’re interrupting someone (perhaps an exec who’s very busy, or someone who’s asleep because it’s 2am), and it’s embarrassing if it turns out that you didn’t really need them after all. The worst case, though, is not escalating and ending up with a bad outcome that someone else could have prevented.

Just because you’re supposed to know how to do something, doesn’t mean you actually do. It takes courage, but admitting that you need help is much better than winging it during a critical incident.

Escalate early: this is one of those times when it’s better to be safe than sorry. The point of an on-call process is to get the right people into the room to respond to an incident. If responders aren’t empowered to use the on-call process that you’ve so lovingly built, it’s not doing its job.

This is one reason it’s important to get your compensation process right: if teams are being compensated fairly for being on-call, it’s more comfortable to pull people in.

Onboarding new on-callers

No on-call rotation can be successful unless you are able to onboard new team members into it.

For someone to be an effective on-caller, they must:

Have enough context and skill to respond to many of the incidents they’re likely to encounter
Understand your incident response process (and any relevant tooling), and ideally have experienced a number of incidents (either real ones or practice runs: see Practice, or it doesn’t count)
Know who they can escalate to, and how, if they aren’t equipped to handle things themselves

It’s also useful to give someone a safe space to learn about being on-call: the equivalent of training wheels. This might mean pairing them with an experienced team member for their first few shifts, or giving them the pager during working hours, when there’s a team to back them up if they get stuck.

Psychological safety is critical for anyone who’s on-call, and that starts with a great onboarding experience.
