There’s no two ways about it: on-call is stressful. But with humans at the center, it’s especially important to find ways to make it as manageable and empathetic as possible.
In this webinar with our friends at ELC, incident.io VP of Engineering, Norberto Lopes, and Intercom Staff Product Engineer, Andrej Blagojević, discuss their own experiences with on-call, and how the process can be better.
Read on for the high-level overview, and if you want the full context, check out the video in its entirety.
Simplifying and streamlining the on-call process
As an on-call engineer for the past 15 years, Andrej Blagojević has experienced his fair share of incidents with varying degrees of severity.
One change he’s seen leaders make over that time is embracing blamelessness, one of the most important factors in improving the on-call experience.
When an engineer on-call is unsure of how to proceed, Blagojević explained, this should not be viewed as an error on their part–rather, it should be viewed as an indicator that there was not enough context or knowledge provided to them.
“People need to feel empowered that they can improve the process themselves,” said Blagojević.
Taking a more holistic approach and avoiding pointing fingers gives engineers that power. In addition to blamelessness, a good process is one that is simple and prioritizes communication.
“You shouldn’t have a process that’s too strict, that ties people down, you should have a process that allows good collaboration and good communication,” said Blagojević.
Lopes agrees: “When people are uncomfortable, they want rigidity…and I think that works against them.”
For Lopes, another important factor is simply telling people things will be okay—yet many don’t think of this as part of the process. However, when leaders echo this sentiment, it can help alleviate some of the fear and anxiety of on-call engineers.
Finally, “If a team doesn’t have the ability to invest and improve things in response to a noisy pager, the people on-call will start to resent holding the pager, they won’t want to do it,” said Lopes.
To combat this challenge, Intercom engineers can volunteer to be on-call after hours, and are compensated for their time. This is not only an important learning opportunity: engineers actually begin to enjoy being on-call because of the incentives provided and the knowledge and confidence that they can fix problems going forward.
Communication when something goes wrong
For Blagojević, good communication during incidents means everyone is on the same page. When communication structures aren’t in place, an incident can get worse.
“There were so many times when I had to decide between fixing the problem or communicating about the problem,” agreed Lopes.
Being on the same page means those involved have the agency to answer their own questions, instead of pulling in and distracting the on-call engineer, who should be focused on solving the problem.
With incident.io, this is possible for Blagojević and his team: all of the relevant information about an incident lives in one Slack channel, so anyone can quickly understand the details. That includes summaries, which can be generated automatically with the help of AI at the click of a button.
Another aspect of good communication is sticking to well-defined roles.
“When incident tools allow those roles to be utilized the best, then you get force multiplication reducing the downtime,” said Blagojević.
As for what hasn’t worked well for communication, Blagojević again emphasized the importance of blamelessness. It’s often human nature, he noted, to not want to cause a commotion, or to be reluctant to ask for help.
But for Blagojević, “It’s about why you weren’t familiar with that…that’s a system problem, a process problem, why haven’t we given enough context or knowledge to fix this problem.”
Importance of product design for user experience
When dealing with an incident, the user experience plays a crucial role: the last thing you want is your incident management tool getting in your way. Yet at most companies, on-call design is often treated as second-rate.
“Really, good UX and good experience are there for any tool that any human person needs to use for serious work. And on-call is really serious work,” said Blagojević.
One of the major differences Blagojević noted in using incident.io for incident management is the ability to see all ongoing incidents in real time; Intercom engineers can jump into a channel if they want more information or see a question they know the answer to.
“For an on-caller, having clarity of information is absolutely critical,” said Lopes.
Another difference Blagojević noted with incident.io is how the platform automatically tracks actions, creating a richer retrospective.
When a platform brings everything together, Blagojević noted, the human is just there to make decisions.
On-call in personal life
For engineers new to on-call, “the first thing you don’t realize is the impact it has on your personal life,” said Blagojević.
The late-night phone calls, the inability to leave your laptop or even your house, and the accompanying stress are on-call pains Blagojević knows well, especially as a father of three young children.
However, some small changes can have an outsized impact on the lives of on-call engineers.
“You need to have flexible schedules, you need to have the ability for people to excuse themselves on legit grounds… if someone was paged in the middle of the night, don’t have them come into work the next morning,” said Blagojević.
Conclusion
“Incidents will always happen, so you really need to figure out how to get good at them,” said Lopes. “Having preparation ahead of time, even if it’s just the basics, is something you won’t regret.”
For Blagojević, the incident management process is a journey he is still on: “Don’t expect perfection, make it an iterative process, try something that might seem to work and then work on it.”