It's fair to say that effectively managing an on-call rota is crucial for ensuring the 'round-the-clock availability of your services. But it's more than that. Spending the time getting your rotas right also empowers and protects the folks who make it all possible: your team.
Some best practices for doing this include using software to automate scheduling, setting up teams with clearly defined responsibilities, establishing escalation policies, and defining time limits for issue resolution.
But there's also the human side of it, which calls for empathy.
That means things like enabling (and recommending) easy overrides for cover, ensuring 24x7 coverage while keeping an eye on pager load, and maintaining high levels of transparency and communication among team members.
Put together, these tactics help you lead the charge on improved team accountability, better service reliability, and, most importantly, happier customers.
Here, we'll dive a bit deeper into some best practices around on-call rotas so you can get the most out of them.
On-call rotations cannot be overlooked, especially today
Today, availability means everything. Think about how frustrating it can be when you're trying to use an application, only for it to time out or, worse yet, be completely unavailable.
So, just like you in the scenario above, when incidents like these happen, your customers can be left looking around for alternative solutions. While one or two incidents may not cause churn, if customers notice you have a history of incidents with extended downtime, you run the risk of losing them.
All of this calls for folks on-call who can address these incidents as quickly as possible, whenever they occur.
Having engineers on standby to monitor, fix, or escalate issues during their shifts allows organizations to minimize downtime and respond quickly to incidents. This proactiveness is usually the first step towards demonstrating reliability to customers and users.
Done well, an on-call rotation can ensure 24x7x365 coverage, meaning someone is always ready to take immediate action when something goes wrong. Down the line, this reduces the impact of downtime on customers and helps meet service level agreements (SLAs).
But creating on-call rotations is not that straightforward
Yes, on-call rotations are incredibly important, but creating them can pose some challenges, from both a practical and a human perspective.
Alert fatigue, where engineers are overwhelmed by constant notifications and alerts, is a very common issue. To combat this, it's important to distribute on-call responsibilities among multiple individuals or teams. This helps share the workload, prevents burnout, and ensures a healthy work-life balance for engineers.
Another challenge is ensuring that the right information reaches the right person quickly.
Think about the potential impact of a high-severity incident going unacknowledged for even half an hour. That's a situation teams should be incredibly proactive about avoiding.
By setting up timeout thresholds for each tier of an escalation path, organizations can ensure incidents are acknowledged or resolved within a specific timeframe.
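To make the idea of per-tier timeouts concrete, here's a minimal sketch in Python. The tier names, the timeout values, and the `tier_to_page` helper are all assumptions made up for illustration, not how any particular paging tool works. It simply works out which tier should be getting paged, based on how long an incident has gone unacknowledged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    responder: str        # hypothetical label for who gets paged at this tier
    timeout_minutes: int  # how long to wait for an acknowledgement before escalating


def tier_to_page(path: list[Tier], minutes_unacknowledged: int) -> Tier:
    """Return the tier that should currently be paged for an unacknowledged incident."""
    elapsed = 0
    for tier in path:
        elapsed += tier.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier
    # Every timeout has expired: keep paging the final tier.
    return path[-1]


# Illustrative values only; pick thresholds that match your own severity levels.
ESCALATION_PATH = [
    Tier("primary on-call", 5),
    Tier("secondary on-call", 10),
    Tier("engineering manager", 15),
]
```

With these example values, an incident that has sat unacknowledged for five minutes moves from the primary to the secondary, and after fifteen minutes it reaches the engineering manager.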
The influence of the "You build it, you maintain it" framework
The DevOps movement and the "You build it, you maintain it" philosophy have significantly altered the structure of response teams.
Developers going on-call for their own code not only enhances collaboration and service performance but also speeds up troubleshooting and incident resolution. Ultimately, this encourages rigorous testing before deployment and fosters a deeper sense of ownership, prompting developers to double-check their code consistently.
Five best practices for creating an effective on-call rotation
Creating an effective on-call rotation schedule requires considering several factors. Here are some best practices:
Equity and balance
Setting up a rotation isn't just a matter of putting X-number of responders on a weekly schedule.
Ideally, you can collect feedback from your team during the scheduling process. Consider their availability, skills, and seniority to ensure a fair distribution of on-call duties. The last thing you want is to put someone on-call who just isn't ready to be. Shadow shifts with a more experienced engineer can come in handy here.
When it comes to determining the "appropriate" number of folks in a rotation, the context plays a significant role.
Generally speaking, a group consisting of 6-8 people is reasonable as a baseline. Ultimately, the goal is to strike a balance where responders are not on call so frequently that they experience burnout, but also not so infrequently that they completely forget how things work.
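A quick bit of arithmetic can help when weighing up rotation size. The sketch below assumes a simple weekly rotation where everyone takes an equal share of shifts; the function name is made up for illustration.

```python
def weeks_on_call_per_year(team_size: int, weeks_per_year: int = 52) -> float:
    """Average number of on-call weeks per person, assuming equal weekly shifts."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return weeks_per_year / team_size
```

Under those assumptions, a rotation of 8 works out to 6.5 weeks on call per person per year, and a rotation of 6 to roughly 8.7. Your own numbers will shift with shift length, overrides, and holidays.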
If you are facing difficulties in recruiting enough people for a specific rotation, it may be worthwhile to consider merging rotations across teams.
While this approach may result in less contextual knowledge of someone else's systems, having two small teams share a rotation can often be better than the alternative of either not providing out-of-hours support or overloading a small team with operational responsibilities.
Variety in rotation types
There's also room for a bit of experimentation here.
You can try daily, weekly, round-robin, or split-shift rotations. For geographically dispersed teams, you should also consider location-based and responsibility-based scheduling. Here's a quick breakdown of some other rotation types you'll come across, but there are many more!
- Bi-weekly: The bi-weekly on-call schedule rotates team members every other week or twice a month
- Week and weekends: In a week/weekend schedule, one set of team members is on call during the week, while another set takes over during the weekend. This schedule is particularly useful when overnight hours are involved as it allows employees to have breaks from night shifts
- Follow-the-sun: A follow-the-sun schedule arranges on-call team members based on their work locations. This type of arrangement is ideal for remote teams with members spread across different geographic areas. It ensures that there is always an employee available during their regular work hours to handle incidents
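If it helps to see how these rotation types differ in practice, here's a rough Python sketch. Everything in it is hypothetical: the responder names, the region windows, and both helper functions are assumptions for illustration, not a real scheduling API. The first function builds a round-robin schedule for any shift length (7 days for weekly, 14 for every other week), and the second picks a region for a follow-the-sun setup based on the hour in UTC.

```python
from datetime import date, timedelta


def build_rotation(responders: list[str], start: date, shift_days: int, shifts: int):
    """Return (shift_start, shift_end, responder) tuples, cycling through responders in order."""
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days)  # end is exclusive
        schedule.append((shift_start, shift_end, responders[i % len(responders)]))
    return schedule


# Hypothetical follow-the-sun windows as (start_hour, end_hour, region) in UTC.
FOLLOW_THE_SUN = [(0, 8, "APAC"), (8, 16, "EMEA"), (16, 24, "AMER")]


def region_on_call(utc_hour: int) -> str:
    """Return the region whose working-hours window contains the given UTC hour."""
    for start_hour, end_hour, region in FOLLOW_THE_SUN:
        if start_hour <= utc_hour < end_hour:
            return region
    raise ValueError("utc_hour must be between 0 and 23")
```

For example, `build_rotation(["ana", "ben", "cho"], date(2024, 1, 1), 7, 4)` gives four weekly shifts, with "ana" picking up the pager again in the fourth week.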
Openness and communication
Being on-call doesn't guarantee that the person will always have the solutions. It is completely normal and expected for on-call individuals to not know how to resolve every incident.
It's crucial to create an environment where people feel comfortable and supported while on-call by clearly communicating that it is okay to not have all the answers.
Part of this involves establishing clear steps for escalating issues. The person responsible for the pager should not hesitate or be discouraged from contacting others when they are unsure how to handle an incident.
Doing this creates safety nets for everyone and ensures that incidents get resolved by the people who are best equipped to respond to them, and no one is shamed for not knowing the answer.
Backup and escalation
Life happens. And just because someone is on-call doesn't mean they should have to neglect their personal responsibilities to cover a shift. This is where overrides can be helpful.
It's important to encourage and support your team in handing off the pager to someone when unexpected situations arise, without asking any questions.
Whether it's to go to the gym, spend time with their kids, take a walk, or anything else. A few hours of coverage can make everyone feel protected and supported. Being on-call doesn't have to mean putting your life on hold for an entire week.
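As a rough illustration of how an override sits on top of a schedule, here's a small Python sketch. The `Override` type and `who_has_the_pager` function are hypothetical names used only for this example: an override covers a time window, and anyone paged inside that window reaches the person providing cover rather than the scheduled responder.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Override:
    start: datetime  # when cover begins (inclusive)
    end: datetime    # when cover ends (exclusive)
    cover: str       # who takes the pager for this window


def who_has_the_pager(scheduled: str, overrides: list[Override], at: datetime) -> str:
    """Return the covering person if an override is active at `at`, else the scheduled responder."""
    for override in overrides:
        if override.start <= at < override.end:
            return override.cover
    return scheduled
```

So if "ben" covers 18:00 to 20:00 for "ana", a page at 19:00 goes to "ben", and the pager returns to "ana" at 20:00 without anyone having to touch the underlying schedule.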
Likewise, if you notice that someone has been paged overnight and has dealt with a particularly challenging incident, it's beneficial to encourage team members to take over the pager for a few hours so that person can rest. This creates a positive environment for everyone.
Training!
Just because someone is part of your team does not necessarily mean they are prepared to be on call. There may be various reasons for this, such as their level of experience as an engineer, their tenure at your organization, or their lack of familiarity with your systems.
Regardless of the reason, there is one action you can take to ensure that individuals are adequately prepared to handle on-call responsibilities: shadow shifts.
This involves pairing up a new team member with someone who has more experience, until the new member feels confident enough to handle tasks independently. If the team size and other constraints allow, it may even be beneficial to have multiple people on-call simultaneously.
Implementing this approach accomplishes several important objectives. Firstly, it allows individuals to immerse themselves in a learning-oriented environment with minimal consequences. Secondly, it facilitates the development of relationships, which is particularly crucial in today's distributed work settings.
Lastly, it helps create an atmosphere where individuals feel supported and are not expected to undertake tasks they are not yet ready for.
Don't make your on-call rotation an afterthought
While there's a lot of emphasis put on the actual processes supporting on-call, how you structure your rotations plays a big role here, too. From scheduling to training, escalations and more, your rotation and every facet of it is the backbone of your on-call function.
By being proactive about how you set it up, and addressing any gaps head-on, you can set everyone up for success. In the end, by coming to the table with some of the best practices we've laid out above, you can create a rotation that best protects you, your customers, and your team.
A win-win for everyone involved.