It's fair to say that effectively managing an on-call rota is crucial for ensuring the 'round-the-clock availability of your services. But it's more than that. Spending the time getting your rotas right also empowers and protects the folks who make it all possible: your team.
Some best practices for doing this include using software to automate scheduling, setting up teams with clearly defined responsibilities, establishing escalation policies, and defining time limits for issue resolution.
But there's also the human, empathetic side of it: things like enabling (and recommending) easy overrides for cover, ensuring 24x7 coverage while keeping an eye on pager load, and maintaining high levels of transparency and communication among team members.
Put together, these tactics let you lead the charge on improved team accountability, better service reliability, and, most importantly, happier customers.
Here, we'll dive a bit deeper into some best practices around on-call rotas so you can get the most out of them.
Today, availability means everything. Think about how frustrating it can be when you're trying to use an application, only for it to time out or, worse yet, be completely unavailable.
So, just like you in the scenario above, when incidents like these happen, your customers can be left looking around for alternative solutions. While one or two incidents may not cause churn, if customers notice you have a history of incidents with extended downtime, you run the risk of losing them.
All of this necessitates folks on-call who can address these incidents as quickly as possible, whenever they may occur.
Having engineers on standby to monitor, fix, or escalate issues during their shifts allows organizations to minimize downtime and respond quickly to incidents. This proactiveness is usually the first step towards demonstrating reliability to customers and users.
Done well, an on-call rotation can ensure 24x7x365 coverage, meaning someone is always ready to take immediate action when something goes wrong. Down the line, this reduces the impact of downtime on customers and helps meet service level agreements (SLAs).
Yes, on-call rotations are incredibly important, but creating them can pose some challenges, from both a practical and a human perspective.
Alert fatigue, where engineers are overwhelmed by constant notifications and alerts, is a very common issue. To combat this, it's important to distribute on-call responsibilities among multiple individuals or teams. This helps share the workload, prevents burnout, and ensures a healthy work-life balance for engineers.
Another challenge is ensuring that the right information reaches the right person quickly.
Think about the potential impact of a high-severity incident going unacknowledged for even half an hour. That's a situation teams should be incredibly proactive about avoiding.
By setting up timeout thresholds for each tier of an escalation path, organizations can ensure incidents are acknowledged or resolved within a specific timeframe.
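To make the idea of per-tier timeouts a bit more concrete, here's a minimal sketch in Python. It assumes a three-tier escalation path; the tier names, the timeout values, and the `tier_to_page` helper are all hypothetical and only there for illustration, not a description of how any particular paging tool works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # who gets paged at this tier (illustrative label)
    timeout_minutes: int  # how long this tier has to acknowledge before escalating


# Hypothetical escalation path: names and timeouts are assumptions, not recommendations.
ESCALATION_PATH = [
    Tier("primary on-call", 5),
    Tier("secondary on-call", 10),
    Tier("engineering manager", 15),
]


def tier_to_page(minutes_unacknowledged: float, path=ESCALATION_PATH) -> Tier:
    """Return the tier that should currently hold the page."""
    elapsed = 0
    for tier in path:
        elapsed += tier.timeout_minutes
        # The page stays with this tier until its cumulative deadline passes.
        if minutes_unacknowledged < elapsed:
            return tier
    # Every tier has timed out: keep paging the last tier rather than dropping the page.
    return path[-1]
```

With these example values, a page that has sat unacknowledged for 7 minutes would already be with the secondary on-call, and nothing stays with a single person for anywhere near half an hour.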
The DevOps movement and the 'You built it, you maintain it' philosophy have significantly altered the structure of response teams.
Developers going on-call for their own code not only enhances collaboration and service performance but also speeds up troubleshooting and incident resolution. Ultimately, this encourages rigorous testing before deployment and fosters a deeper sense of ownership, prompting developers to double-check their code consistently.
Creating an effective on-call rotation schedule requires considering several factors. Here are some best practices:
Setting up a rotation isn't just a matter of putting X-number of responders on a weekly schedule.
Ideally, you can collect feedback from your team during the scheduling process. Consider their availability, skills, and seniority to ensure a fair distribution of on-call duties. The last thing you want is to put someone on-call who just isn't ready to be. Shadow shifts with a more experienced engineer can come in handy here.
When it comes to determining the "appropriate" number of folks in a rotation, the context plays a significant role.
Generally speaking, a group consisting of 6-8 people is reasonable as a baseline. Ultimately, the goal is to strike a balance where responders are not on call so frequently that they experience burnout, but also not so infrequently that they completely forget how things work.
If you are facing difficulties in recruiting enough people for a specific rotation, it may be worthwhile to consider merging rotations across teams.
While this approach may result in less contextual knowledge of someone else's systems, having two small teams share a rotation can often be better than the alternative of either not providing out-of-hours support or overloading a small team with operational responsibilities.
There's also room for a bit of experimentation here.
You can try daily, weekly, round-robin, or split-shift rotations. For geographically dispersed teams, you should also consider location-based and responsibility-based scheduling. These are just some of the rotation types you'll come across, and there are many more!
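If you want to experiment with rotation lengths, a simple round-robin is easy to sketch. The Python below is a rough illustration under a few assumptions: fixed-length shifts, a known rotation start date, and a made-up `on_call_for` helper and responder names.

```python
from datetime import date


def on_call_for(day: date, responders: list[str], rotation_start: date,
                shift_days: int = 7) -> str:
    """Return who holds the pager on `day` in a simple round-robin rotation."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Number of whole shifts that have passed since the rotation began.
    shifts_elapsed = (day - rotation_start).days // shift_days
    # Wrap around the list so the rotation repeats once everyone has had a turn.
    return responders[shifts_elapsed % len(responders)]


# Hypothetical team and start date (1 January 2024 was a Monday).
team = ["ana", "ben", "chi"]
start = date(2024, 1, 1)
second_week = on_call_for(date(2024, 1, 8), team, start)                 # weekly shifts
third_day = on_call_for(date(2024, 1, 3), team, start, shift_days=1)     # daily shifts
```

Changing `shift_days` is a quick way to compare daily and weekly rotations on paper. It also shows how team size affects load: with weekly shifts and six responders, each person is on call one week in six.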
Being on-call doesn't guarantee that the person will always have the solutions. It is completely normal and expected for on-call individuals to not know how to resolve every incident.
It's crucial to create an environment where people feel comfortable and supported while on-call by clearly communicating that it is okay to not have all the answers.
Part of this involves establishing clear steps for escalating issues. The person responsible for the pager should not hesitate or be discouraged from contacting others when they are unsure how to handle an incident.
Doing this creates safety nets for everyone and ensures that incidents get resolved by the people who are best equipped to respond to them, and no one is shamed for not knowing the answer.
Life happens. And just because someone is on-call doesn’t mean they should have to neglect their personal responsibilities to cover a shift. This is where overrides can be helpful.
It's important to encourage and support your team in handing off the pager to someone when unexpected situations arise, without asking any questions.
Whether it's to go to the gym, spend time with their kids, take a walk, or anything else, a few hours of coverage can make everyone feel protected and supported. Being on-call doesn't have to mean putting your life on hold for an entire week.
Likewise, if you notice that someone has been paged overnight and has dealt with a particularly challenging incident, it's beneficial to encourage team members to take over the pager for a few hours so that person can rest. This creates a positive environment for everyone.
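Conceptually, an override is just a short window of cover layered on top of the regular schedule. Here's a minimal Python sketch of that idea; the `Override` type, the `who_holds_pager` helper, and the names and times are all hypothetical, and real scheduling tools will model this in their own way.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Override:
    start: datetime  # when cover begins (inclusive)
    end: datetime    # when cover ends (exclusive)
    cover: str       # who takes the pager during this window


def who_holds_pager(at: datetime, scheduled: str, overrides: list[Override]) -> str:
    """Return the covering responder if an override is active, else the scheduled one."""
    for override in overrides:
        if override.start <= at < override.end:
            return override.cover
    return scheduled


# Hypothetical example: ben covers two hours of ana's shift.
cover = [Override(datetime(2024, 1, 3, 18), datetime(2024, 1, 3, 20), "ben")]
during = who_holds_pager(datetime(2024, 1, 3, 19), "ana", cover)
after = who_holds_pager(datetime(2024, 1, 3, 20), "ana", cover)
```

The point of keeping it this simple is that asking for cover should be cheap: one short entry, no changes to the underlying rotation, and the pager returns to the scheduled person as soon as the window ends.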
Just because someone is part of your team does not necessarily mean they are prepared to be on call. There may be various reasons for this, such as their level of experience as an engineer, their tenure at your organization, their lack of familiarity with your systems, or other valid reasons.
Regardless of the reason, there is one action you can take to ensure that individuals are adequately prepared to handle on-call responsibilities: shadow shifts.
This involves pairing up a new team member with someone who has more experience, until the new member feels confident enough to handle tasks independently. If the team size and other constraints allow, it may even be beneficial to have multiple people on-call simultaneously.
Implementing this approach accomplishes several important objectives. Firstly, it allows individuals to immerse themselves in a learning-oriented environment with minimal consequences. Secondly, it facilitates the development of relationships, which is particularly crucial in today's distributed work settings.
Lastly, it helps create an atmosphere where individuals feel supported and are not expected to undertake tasks they are not yet ready for.
While there's a lot of emphasis put on the actual processes supporting on-call, how you structure your rotations plays a big role here, too. From scheduling to training, escalations and more, your rotation and every facet of it is the backbone of your on-call function.
By being proactive about how you set it up, and addressing any gaps head-on, you can set everyone up for success. In the end, by coming to the table with some of the best practices we've laid out above, you can create a rotation that best protects you, your customers, and your team.
A win-win for everyone involved.