When you provide a service that people depend on, it’s important that you have the right people available when things go wrong and need a human to step in. The process of being available around the clock is commonly referred to as being on-call, and it usually boils down to a schedule of people who, one at a time, take on the role of being the first to be contacted.
On-call is commonly paired with a ‘paging system’ whose job is to make your phone make a noise and get your attention when a person or automated system raises the alarm.
On-call is most commonly used for out-of-hours support, but it can also be useful during the day when teams are juggling meetings and other commitments.
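The idea of a schedule where people take turns being first in line can be sketched in a few lines of Python. This is a minimal illustration, not how any particular paging tool works: the function name `on_caller_for`, the week-long shift length, the start date and the names in the rota (other than Isaac) are all assumptions made for the example.

```python
from __future__ import annotations

from datetime import date


def on_caller_for(day: date, rota: list[str], start: date, shift_days: int = 7) -> str:
    """Return who holds the pager on `day` in a simple round-robin rota.

    Assumes (for illustration only) that every shift lasts `shift_days`
    and that the first person in `rota` starts their shift on `start`.
    """
    if not rota:
        raise ValueError("rota must contain at least one person")
    if day < start:
        raise ValueError("day is before the rota's start date")
    # Count how many whole shifts have passed, then wrap around the rota.
    shifts_elapsed = (day - start).days // shift_days
    return rota[shifts_elapsed % len(rota)]


# Hypothetical rota: one person at a time, a week each.
rota = ["Isaac", "Priya", "Sam"]
start = date(2024, 1, 1)

print(on_caller_for(date(2024, 1, 1), rota, start))
print(on_caller_for(date(2024, 1, 10), rota, start))
```

Real schedules usually need more than this (overrides, handover times, time zones), which is why most teams lean on a paging system rather than a spreadsheet.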
A couple of terms worth knowing
- On-caller: Someone who’s the first point of contact when something goes wrong: “Isaac is the on-caller today”. Most commonly, on-call is linked to out-of-hours work.
- Pager: We used to carry actual pagers back in the day, but these days, the phrase ‘holding the pager’ means being the first person whose phone will be called when something goes wrong.
On-call isn't solely for engineers
If it's possible you'll be needed during an incident, there's a fair chance you ought to be on-call.
Many organisations view incidents — and by extension on-call — as a solely engineering concern. Our experience is the polar opposite. Incidents often start in product/engineering, but they usually require people from around the organisation to form a temporary team to collaborate, communicate and solve a problem.
Imagine a significant outage at a payments fintech like Stripe. The issue might have started in engineering, but it’s not long before others need to get involved. Customer support and public relations need to start communicating publicly as soon as possible. Engineers begin discussing potential resolutions, including a rollback. Legal need to get involved to understand any potential contractual implications. Compliance need to get involved to ensure that they’re following the regulator’s guidance. An executive is pulled in to make the final call. Responding is a whole-organisation effort.
Incidents can happen at 2pm or 2am, and either way decisions need to be made fast. Being able to quickly pull the right person into the room can make all the difference. That means having on-call rotas for legal teams, execs, marketing, compliance: any team with a unique set of skills or knowledge that might be needed in a time-sensitive situation.
You build it, you run it
Some organisations have a dedicated on-call team, who are responsible for resolving issues across the whole company. While this can be attractive, and protects the rest of the company from the ‘burden’ of being on-call, we believe that it’s better to distribute on-call across your organisation.
Being on-call for the work you do has clear benefits:
- When people feel the pain of things going wrong, they are motivated to fix it — they have skin in the game. That results in more resilient systems, and aligns incentives.
- Healthy organisations are ones where the support load is evenly distributed. That means everyone is pulling together, rather than throwing problems over the fence.
Consider when to escalate
For on-call to be successful, it’s important that everyone who’s on-call understands their responsibilities and the decisions that they are able to make. This empowers people to make decisions at speed, which is critical to reducing your resolution times.
The other half of this is just as important: on-callers must know who to escalate to (and how) when they need to pull someone else in.
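One way to make "who to escalate to, and how" explicit is to write the escalation path down as data that anyone on-call can look up. The sketch below is purely illustrative: the team names, contacts, contact methods and the `next_escalation` helper are hypothetical, and a real paging tool would model this in its own way.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contact:
    name: str
    how: str  # how to reach them, e.g. "page" or "phone"


# Hypothetical escalation paths, ordered from first responder upwards.
ESCALATION_PATHS: dict[str, list[Contact]] = {
    "payments": [
        Contact("payments on-caller", "page"),
        Contact("payments engineering lead", "page"),
        Contact("executive on-call", "phone"),
    ],
    "legal": [
        Contact("legal on-caller", "page"),
        Contact("general counsel", "phone"),
    ],
}


def next_escalation(team: str, current_level: int) -> Optional[Contact]:
    """Return the contact one level above `current_level`, or None at the top."""
    path = ESCALATION_PATHS[team]
    next_level = current_level + 1
    return path[next_level] if next_level < len(path) else None


# The first responder (level 0) for payments wants to pull someone else in.
contact = next_escalation("payments", 0)
if contact is not None:
    print(f"Escalate to {contact.name} via {contact.how}")
```

Keeping the path in one visible place means the on-caller does not have to work out who to wake up in the middle of an incident.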
It’s always a difficult decision to escalate, particularly out-of-hours. You’re interrupting someone (perhaps someone who’s very busy like an exec, or someone who’s asleep because it’s 2am). It’s embarrassing if it turns out that you didn’t really need them after all. The worst case, though, is not escalating and ending up with a bad outcome that someone else could have prevented.
Just because you’re supposed to know how to do something doesn’t mean you actually do. It takes courage, but admitting that you need help is much better than winging it during a critical incident.
Escalate early: this is one of those times when it’s better to be safe than sorry. The point of an on-call process is to get the right people into the room to respond to an incident. If responders aren’t empowered to use the on-call process that you’ve so lovingly built, it’s not doing its job.
This is one reason it’s important to get your compensation process right: if teams are being compensated fairly for being on-call, it’s more comfortable to pull people in.
Onboarding new on-callers
No on-call rotation can be successful unless you are able to onboard new team members into it.
For someone to be an effective on-caller, they must:
- Have enough context and skill to respond to many of the incidents they’re likely to encounter
- Understand your incident response process (and any relevant tooling), and ideally have experienced a number of incidents (either real ones or practice runs: see Practice, or it doesn’t count)
- Know who they can escalate to, and how, if they aren’t equipped to handle things themselves
It’s also useful to give someone a safe space to learn about being on-call; the equivalent of training wheels. This might be pairing someone with an experienced team member for their first few shifts, or giving someone the pager during working hours where there’s a team to back them up if they get stuck.
Psychological safety is critical for anyone who’s on-call, and that starts with a great onboarding experience.