Podcast

How to build a successful on-call team

Summary:

In this podcast, our panellists discuss what it means to build a successful on-call team. Drawing on their experiences at fast growing start-ups and scale-ups, incident.io co-founders Pete and Chris cover everything from who should be on the rota and how to build a compassionate on-call culture, to compensation structures and tips for operationalising on-call.

Notes:

Chris Evans is co-founder and Chief Product Officer at incident.io. In practice, he covers everything from Customer Success to Sales to Product development. Chris has spent his entire career working in Technology. Starting out as a Software Engineer he later transitioned towards Platform work, most recently as Head of Platform and Reliability at Monzo, where he was also responsible for incident management and on-call.

Pete Hamilton is co-founder and Chief Technology Officer at incident.io. Most of Pete’s time is focussed on Engineering and Product (although he also covers lots of other areas, including Operations, Legal and Finance). He’s worked in Engineering for start-ups and scale-ups for the last decade, starting his career at GoCardless and later moving to Monzo.

Key topics/timestamps:

[04:07] What is on-call and why is it important?

[06:25] Who should be on-call?

[09:13] Should all teams be responsible for their own on-call, or should there be a dedicated team?

[12:59] How can you build a compassionate on-call culture?

[17:23] Spotting (and stopping) on-call heroes

[27:58] On-call compensation and other incentives

[39:30] Tips for operationalising on-call

Where to find Chris Evans

Twitter: https://twitter.com/evnsio

LinkedIn: https://www.linkedin.com/in/evnsio/

Where to find Pete Hamilton

Twitter: https://twitter.com/peterejhamilton

LinkedIn: https://www.linkedin.com/in/peterejhamilton/

Referenced

Chris references incident.io’s Practical Guide to Incident Management - full of tips on how to run your end-to-end incident management process.

Pete references a LeadDev talk by Lisa Karlin Curtis on using incidents to level up your teams.

Transcript:

Disclaimer: this has been transcribed by machines, so apologies in advance for any mistakes!

[00:00:00] Charlie: Welcome.

[00:00:01] Chris: Charlie, it is always a pleasure to chat to you. Nice to be here.

[00:00:05] Charlie: Thank you so much, and Pete also welcome to you.

[00:00:07] Pete: Thank you very much. I look forward to making a fool of myself on my first ever podcast.

[00:00:12] Charlie: Excellent. So for the benefit of the listeners, could you both just spend 30 seconds or so telling us who you are, what you've done, a trip through your career so far, and what you're currently focused on? Chris, maybe first.

[00:00:26] Chris: Yes. By title I am CPO here at incident.io. In reality, lots of things. So right now I'm spending most of my time talking to customers, both customer success and salesy type things, bits of marketing, and then working with Pete on product. Yeah, usual kind of founder mix. I think going back in time, I have worked my entire career in technology in one sort of guise or another.

So originally software engineering - I spent nearly 10 years doing that. All sorts of fun stuff, like embedded coding through to like web apps, Python, all that sort of fun stuff. And then drifted into the platform space. So immediately before starting here I was at Monzo with Stephen, who's the CEO here.

And I was running platform and reliability there, which was a whole barrel of laughs. I also got to run on-call there, and things like incidents, and one thing led to another into starting this company.

[00:01:23] Charlie: Amazing. Thanks so much. Pete.

[00:01:26] Pete: Cool. Yeah. So I guess I'm Pete. I'm one of the other co-founders of the three of us. And my role is CTO which means I focus a lot of my time on products and engineering. And then obviously also work really closely with the finance team, the operations team, the legal team. Lots and lots of things that have nothing to do with engineering, but as a founder, you end up picking up.

I love it. I wouldn't change it for the world. My background is pretty firmly in engineering and product. I spent the last 10 years or so working in startups and scale-ups. So I cut my teeth at a company called GoCardless when they were super small, maybe like 15 people. And they're now something crazy, like a thousand people.

I left them after six years. And I spent a lot of the time there focused on what we build in our product and building out teams, then moved into management for several years. And then I wanted to do a swing back to explore more of the like IC leadership track.

So I made a change. I moved to Monzo, where I met Chris and Stephen, and I worked as a senior staff engineer there in the operations team, which is super interesting. So Monzo has 5 million customers, and while Chris was beavering away, trying to figure out how our infrastructure can support all of that, the part of the business that I worked in was very much supporting the customers more directly.

So like lots of teams working on what was, or what is, a super interesting, super advanced system internally that our support agents use, and that's super, super interesting. We're gonna be talking a lot about on-call - a lot of the sharp end of the stick kind of pops up in ops, right?

Cuz customers go, “Oh, there's a problem”. And then you go from there. So it's a really interesting bit of the business to work in from that perspective as well.

[00:03:08] Chris: Fun fact as well: before Pete and I had ever met at all, Pete DMed me on Twitter, and we were chatting about all things incident management software. So it was like the writing was on the wall. I think he was messaging when he was at GoCardless, asking about what we had done at Monzo.

Cause I'd just given a conference talk or something about Monzo Response, which was the open source tool we built there. So yeah, that was fun. We were noodling on that over dinner the other day.

[00:03:37] Pete: Yeah, I think I'd just spent like a huge amount of my time, having been at GoCardless for a long time, being one of the longer tenured employees, just getting pulled into incidents, and was like, “There must be a better way”. Which sounds very “I've scripted this”, and I really haven't. And then I was like, “Oh my God, this thing looks amazing”.

And yeah, serendipitous. And then that was like years before I even moved to Monzo, so it's

[00:03:57] Chris: Yeah. And then Pete was lucky enough to end up working with me you know,

[00:04:00] Pete: Yeah.

[00:04:00] Charlie: And now look at you both. Now on the pod, all your dreams have come true.

[00:04:06] Pete: Very strange.

[00:04:07] Charlie: Perfect. Thanks both for the introductions there and as Pete kind of spoiled for us the theme of today is gonna be chatting through on-call, and I think we should just jump straight in.

Maybe Chris you could start by explaining super simply, what do we mean by on-call? And why is it important.

[00:04:28] Chris: Yeah, so I think if you look at companies that provide any kind of service that needs to be online all of the time: banking, clearly, right? People need access to their money. E-commerce: you need to sell stuff, and if you're down you aren't taking money. Marketplaces. There's like a myriad of organizations that exist and need to make sure that they are online and working. As I think anyone who's worked in technology will know, technology does not work all the time, and humans are often sort of the glue that holds these things together and keeps systems online, keeps services running, all of that kind of thing. And during the day that's like easy mode. During working hours, that's just the job.

But when everyone clocks off at five, 6:00 PM or whatever time, then you need to have a support structure in place that's able to keep track of what's going on, through like monitoring, and be available to respond to things that might not be working. And that is broadly what we mean by on-call.

So typically, if we look at the elements, it's a group of people who are assigned to be responsible for the like ongoing operations of a particular service or something. They will be on a schedule, typically, so they'll be on a rotation. You don't need the whole team online at once. You just need a person who's there to be paged, which is like an old term, cuz you used to be paged with a little pager. Now you just get a phone call or a push notification. And then it will also include things like escalation policies. Getting hold of an individual was never going to be a hundred percent watertight.

They could be on a train or their phone could have died, or they could have some other reason why they can't get to it. And so escalation policies are like the building blocks that typically say, if this person isn't available, here's the next person that I would go to. All of this whole on-call thing is in service of just keeping your systems and everything else running.
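Editor's sketch: the two building blocks Chris describes, a rotation and an escalation policy, can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not how any particular paging tool works; the names `Rota`, `escalate`, the weekly shift length, and the example people are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional


@dataclass
class Rota:
    """A simple rotation: each person holds the pager for `shift_days` in turn."""
    people: List[str]
    start: date          # first day of the first shift (assumed)
    shift_days: int = 7  # weekly shifts are an assumption, not from the podcast

    def on_call(self, day: date) -> str:
        # Count whole shifts since the start and wrap around the list of people.
        shifts_elapsed = (day - self.start).days // self.shift_days
        return self.people[shifts_elapsed % len(self.people)]


def escalate(policy: List[str], acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Walk the escalation policy in order; return the first person who responds."""
    for person in policy:
        if acknowledges(person):
            return person
    return None  # nobody picked up: the policy was exhausted


rota = Rota(people=["alice", "bob", "carol"], start=date(2024, 1, 1))
primary = rota.on_call(date(2024, 1, 10))  # second week of the rotation
# If the primary is on a train, the page falls through to the next person.
responder = escalate([primary, "carol"], acknowledges=lambda p: p == "carol")
```

Here `acknowledges` stands in for whatever a real system would do to reach someone (phone call, push notification) and wait for a response.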

[00:06:25] Charlie: Great, and I can imagine with a thing like on-call, getting people invested in that process and getting them to sign up could be quite challenging in some organizations. I wonder, Pete, if you've got any lessons learnt or thoughts on who should be included in an on-call rota?

[00:06:48] Pete: Yeah, I think there's sort of two broad schools of thought or ways that I've seen companies do this before, and I think it's worth maybe talking a little bit about each. But I think for me, getting as many people as possible onto some degree of on-call rota is really important for a whole number of reasons.

I think first and foremost it's about giving - particularly in the context of engineering teams, I actually think on-call is much broader than that - but let's focus on engineering for a second. Giving engineers that really tight connection with the impact of the work that they're shipping is really important.

And obviously there's the customer visible version of that. You ship something, customer says it's great. Woohoo. There's also the downside of that, where you ship something and it goes wrong. And I think unless you've got the positive iteration loop and feedback cycle and the negative one, where you understand maybe the downside of you shipping that thing a bit quickly or maybe you missed something and it's gone wrong,

I think you really do yourself a disservice as a team. So I think if you can get everyone to be responsible for the products that they're shipping, both in the positive sense and the negative sense, I think that's, for me, that's the winning move. I think it's hard to, I guess it's hard to generalize that, to say everyone should be holding a pager overnight at all times.

So I think personal circumstances can differ. And it's quite a personal thing to say I'm willing to be woken up overnight. Like I've got a new baby. So I essentially have a full-time pager that poops in my house at the moment. And I guarantee you if I added another pager to that, I would be even more sleep deprived than I am now.

But I think as a default, saying where possible everyone should be holding the pager for the products they're shipping is a pretty sensible starting point. Yeah, I dunno what.

[00:08:31] Chris: Pete, because you've got the little pooping pager there is a case to say that you should be on-call because you're already awake. You know you're not

[00:08:39] Pete: It's not, it's probably not actually that far wrong. It's just I'm gonna be up anyway. I'd question whether you want me in my more sleep deprived state being responsible for things that have gone wrong overnight, but, um, Yeah.

[00:08:51] Charlie: You may be sub-par there,

[00:08:52] Pete: Yeah. I think something we should talk about later, and maybe you're already gonna do this, Charlie, is just that I think there are all sorts of different models for on-call, around how you get people on board. And I think a lot of people say on-call is this big scary thing. And it's a case of pick up this thing and now you're responsible for the entirety of the company. And actually it really doesn't have to be that way, but let's, yeah.

[00:09:13] Charlie: I think it's a good thing for us to explore. And maybe we could start with you, Chris. You spoke a little bit about having all teams being responsible: should that be every team, or a dedicated team?

[00:09:36] Chris: Yeah, I think it's a killer question. I think it's actually something that a lot of organizations struggle with. And I can talk firsthand about how this worked at Monzo, because we went on quite a journey with on-call, and the starting point was, I think, similar to pretty much every company I've worked in, which was there was a group of people who were on-call and they shouldered the brunt of all operational workload, out of hours, for the whole company.

But that kind of thing never really works. I think Pete mentioned the feedback cycles earlier, and feedback cycles are like the killer thing here with on-call, right? What you want is people who are shipping their products and understand how they're running and are feeling a little bit of that pain coming back through when they aren't running.

And I think if you have a central team, what you lose is those feedback cycles. And there was a great example of this actually at Monzo, where there was an overnight system that used to break a lot, and it was like the responsibility of the on-call team to go and fix that. And they would come in the next day and be like, “Listen, your thing, your batch job broke again last night.

We had to execute your runbook. It was a good runbook. It was very like formulaic, but listen, could you just fix this?” And they're like, “Yeah, totally.” Then it got on a backlog, and this happened for weeks and weeks. We used to do retros at Monzo for the on-call team, and we'd sit and be like, “What's going well, what's not?”

And this came up time and time again, and I was like, “Action item: we are rerouting that alert for that thing. That team does not have to be on-call for everything, but they're gonna be on-call for that one specific alert.” And the first day after they were paged by it, it got fixed. And I think, when I talk about philosophies of on-call, that is the type of thing that I'm trying to shoot for.
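Editor's sketch: the rerouting Chris describes, where one specific alert goes to the owning team while everything else still pages the central on-call team, amounts to a routing table with a default. The following is a minimal illustration only; the alert names, team names, and the `route_alert` function are hypothetical and not taken from Monzo's setup.

```python
# Hypothetical names throughout: this only illustrates "one alert is rerouted,
# the rest keep going to the central on-call team".
DEFAULT_TEAM = "central-on-call"

# Per-alert overrides: the team that owns the flaky batch job gets its own page.
ALERT_ROUTES = {
    "nightly-batch-job-failed": "batch-job-owners",
}


def route_alert(alert_name: str) -> str:
    """Return the team to page for an alert, falling back to the central team."""
    return ALERT_ROUTES.get(alert_name, DEFAULT_TEAM)


owner = route_alert("nightly-batch-job-failed")  # goes to the owning team
fallback = route_alert("disk-almost-full")       # still goes to the central team
```

The design point is the feedback cycle: the owning team is not on-call for everything, only for the alert whose root cause they are best placed to fix.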

And so the holy grail here is that everyone is responsible for the services that they run and that they have that feedback cycle. But it is difficult, right? So at Monzo, we started in that place. We moved from a central on-call team that was a bunch of like volunteers from across the company to a model where we had that team and then a bunch of like teams who were available as escalations.

So they were still the first line, but you were like, “Great, now if I can't run your runbook, or you haven't handed over stuff so that the first-line on-call team could deal with the thing, it's going out to you.” And you'd have nice processes for that.

And then the final stage was then like inverting that and getting to the point where those teams that were closest to the work were the ones that actually got the alerts first, and they were then able to escalate up to a central incident management team if something got pretty thorny or spanned multiple teams.

But that took quite a long time. Like, it's hard, right? You need to think about how you're incentivizing people to go on-call. When people were hired into a world where “I don't have to worry about this”, going “Please could you start spending some of your spare time doing work for the company too?”

It's legit pretty hard.

[00:12:35] Charlie: And maybe Pete, you could translate some of this into what we're seeing at incident.io. So in Chris's story, there is Monzo with a massive engineering organization. We've got 15, 20 engineers at incident.io. We clearly don't have a centralized incident response team. How does this look in our organization?

[00:12:59] Pete: Yeah, for sure. So we have everyone on-call pretty much, and when new joiners join, usually the idea is that within their first three months they go onto the rota. Before I talk a bit more about incident.io, I'm just touching on one thing that Chris said, around what I think worked really well at Monzo. One of the reasons that Monzo was able to pull a lot of their teams onto that on-call model, or why I think that worked quite well, is this: I think there's all sorts of reasons why getting people to sign up for the pager, kinda like I mentioned earlier, could be a bit tricky.

I think one of them is that holding this pager is, yes, a personal inconvenience, and I think there's a ton of ways that you can mitigate that. We do that really well at incident.io; I'll cover that in a sec. But there's also the level of responsibility, and going, “If I am in buck-stops-here mode at 2:00 AM, do I feel comfortable with that?”

And actually, that doesn't have to be the definition of on-call. And I think one of the things that made me comfortable getting on the pager at Monzo really quickly was going, “Cool. I'm happy to be on-call for my stuff, but if it breaches my level of context or expertise, I can press a button and I'm gonna get someone who's been here for a long time, who can help me debug this, can help me figure this out.”

And having that escalation path is one of the things that I think helps with that. So maybe to contextualize it in your question of how it works at incident.io: there are things that we do to help people get more comfortable with signing up for the pager. So I think one is always having an escalation path.

So the three founders, for example, are always available, and we're all on the pager as well. Obviously we all come from an engineering background, so that's quite familiar, but I think it's quite abnormal that your kind of executives are on an on-call rota. But at any point the team can pull us in.

So if there's something where we need to make calls at 2:00 AM, like, of course I'd rather hear about it than not. I think there are other things that the team do themselves. So this is not something we drive; the team have organized this themselves, and I think it makes for a healthy rota. We're pretty flexible with schedules, so we have a default schedule, but the team will coordinate amongst themselves.

We compensate people for on-call. Maybe that's something we'll talk about a bit later, but that removes the IOU element of “Oh God, take Chris's rota”, where at that point it's all spending goodwill and it's all trading IOUs. And actually it's more like, no, everyone's happy to pick up the pagers and swap shifts with each other.

And the company bears some of the sort of inconvenience cost of that, as opposed to it just being personal relationship credits that you're trading. And then being really open to overrides is something that we do a lot. So it's not that once you hold the pager, it's like, “Tag, you're it, see you in a week”, and there's nothing you can do.

Like, we have lives, we have things that we wanna do. Actually, occasionally getting someone else to cover you for an hour is a thing you wanna do. And so we make extensive use of overrides. We've made that a very socially acceptable thing in the team. What else do we do? I think even over things like Christmas and New Year, people will be like, “Cool, I'm on-call, but I'd love to go for like a 15 minute walk with my family. I'd rather not take my laptop and phone with me. Is anyone available to cover?” And right now at our size (and I can't say that it works when you're a thousand person team, but I think in an ideal world it probably would) the team are very active about shifting stuff between them.
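Editor's sketch: an override, as Pete uses the term, is a short window where someone else covers the pager without changing the default schedule. A minimal way to model that, assuming a hypothetical `Override` record and `who_is_on_call` helper (neither is incident.io's actual API), looks like this:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Override:
    """Someone covering the pager for a window (end time is exclusive)."""
    person: str
    start: datetime
    end: datetime


def who_is_on_call(scheduled: str, overrides: List[Override], at: datetime) -> str:
    """An active override wins; otherwise the default schedule applies."""
    for override in overrides:
        if override.start <= at < override.end:
            return override.person
    return scheduled


# Hypothetical example: Sam covers a 15 minute family walk on Christmas Day.
cover = Override("sam", datetime(2024, 12, 25, 14, 0), datetime(2024, 12, 25, 14, 15))
during_walk = who_is_on_call("pete", [cover], datetime(2024, 12, 25, 14, 5))
after_walk = who_is_on_call("pete", [cover], datetime(2024, 12, 25, 14, 15))
```

Because the override sits on top of the default schedule, the pager returns to the scheduled person automatically once the window ends.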

So I think it's having a very group collaborative mentality to ensure we all want this to be covered. We're all gonna help each other out. One of the things the team's really good at here is when someone has a really rough night. Maybe someone in the team got paged five times recently.

It was like a really rough evening, and very abnormal for us as a team. And without being asked, someone's come in and by 9:00 AM they've gone, “Oh, I've taken your shift for the next two nights, rest up and get some sleep.” And I think building a culture like that around your on-call makes a huge difference.

I've gone on a bit of an “on-call shouldn't be a scary thing” rant, but I think we've embedded that mentality really well here at the moment. And even as we scale, that's something that I really wanna cling onto, I think.

[00:17:13] Charlie: Perfect

[00:17:13] Chris: Pete. You've used all the time.

[00:17:15] Charlie: Yeah. Thanks for joining

[00:17:20] Chris: Charlie, and I could just leave you to it if that's cool.

[00:17:23] Charlie: We welcome the on-call rants on the on-call podcast, so you're in the right spot. I think there's a few really interesting points that you made there around compassionate on-call culture. And one thing that I know we've spoken offline a bit about is this idea of heroes in the organization.

I think that's something in the on-call space that can be very prevalent. The person that's always getting pulled in because they've got the most knowledge. I'm wondering, Chris, maybe you could talk a little bit about heroes. How do you identify them? What's the approach to try and mitigate this situation where that person's always getting pulled in?

Because it feels great at the same time, right? You're the centre of these things, but something you know to look out for.

[00:18:12] Chris: Firstly, thank you for associating heroes and jumping straight this way rather than to Pete

[00:18:17] Pete: This, flip flop answer model we've got has really done me bad here

[00:18:25] Charlie: Sorry Pete

[00:18:26] Chris: No, it is a good question. I think this is especially true in organizations that are... No, I'm gonna wind this back. I was gonna say this is true in like startups, because you get people who have domain knowledge, and then like they are the first engineers and they have all the context.

But as I was saying that, I was like, this has been true at many companies. So it's completely false, what I was gonna say. But you find these kind of special people in all sorts of organizations, and they are both awesome and like terrible. And awesome for the obvious reasons that everyone breathes a sigh of relief when they drop into the incident channel, because, great, it's Pete, he knows everything about this stuff. It's that.

[00:19:19] Charlie: The poop pager has gone off

[00:19:20] Chris: The poop pager has gone off

[00:19:24] Chris: But it's great. But I think when you look at like organizational risk, they pose a massive one. They are people who, if they're not there, the counter is true. And that is, for me, like a signal of a pretty unhealthy organization.

And I think what is hard in a lot of organizations is being able to see these people in existence at the right level of the organization. It's really obvious to engineers who these people are. It's obvious, but maybe less so to engineering leaders and to people at the top of the company who are managing managers of engineers or ICs.

They're often oblivious to that. And those are the kinds of things that they care a lot about. So this is actually something we think about a fair bit with incident.io, in so much as we have this ability now: when you're starting to declare incidents with our product (and we use this internally for all of our incidents),

you can start to see where all those hot spots are. And yeah, I think this is something where, first of all, you need to see it, and then you need to have a system in place for figuring out how you avoid it. Because it's great to go, “Pete's the hero”, but it's, what's the systemic way that I can now, you know, de-hero Pete, or commoditize the access to that knowledge?

And I think there's a few things we actually did with this at Monzo. So like one of them: runbooking things. So early days at Monzo, it was like a bunch of founding engineers who were responsible for all the on-call system. They were excellent, but they were like pulled into everything and they had all of this knowledge.

And so what we did there was we said, “Listen, every time one of you is on-call, there's gonna be someone who doesn't have that knowledge shadowing you.” And their job is to scribe things out of incidents, to go, “I dunno how you even knew that dashboard existed over there.

I would never have gone and thought to look at it. What triggered you to go there?” And you start to pull out what they call like tacit knowledge, which is just that ethereal tribal knowledge, and get it out into something a bit more formulaic. So that looks like runbooks, it looks like knowledge bases, suggested like Q&A. There's a myriad of different things there.

But your job essentially, I think, when you spot these people, is to figure out what the approach is to get it out. And I think, like, I haven't found better than shadowing as a way to do that: shadowing and trying to just elicit that knowledge. I dunno. Pete, have you got any other…

What have you done in this space?

[00:21:41] Pete: Yeah, I think so. So I think I can definitely echo that: it's not always obvious when someone is repeatedly in incidents because it's naturally turned out that way, versus because there's an implicit org structure that maybe you dunno exists, where you're aware that person's quite senior and quite tenured, but what you're not aware of is they're falling on the sword every time the incident comes up.

And actually it can look like they're doing it from an “I really love this and I really enjoy it” perspective. And I think honestly that's often true. Like, incidents are really fun to be part of, in a sort of warped way. But I think when it turns from something that you are doing because you are like willing and able and have the tenure and the context, to something where actually you've become the default,

and now you're in all the incidents, actually maybe some of the shine's getting lost a little bit here. And it's, “Yeah, Pete's done another incident, but that's what Pete does”, or whoever. I've seen people suddenly feel a little bit more taken for granted.

It's, “Oh, I was doing this cuz I love it. But now I feel like I have to, and now I feel like it's just expected that I'll be the person that deals with the issues.” So I think that can often be quite a kind of lagging indicator when you're in a management position, where what can look like a really good thing

can kind of rapidly turn into, “Oh, actually this person's now like super burned out. What happened? How did I miss this?” And there's tons of signals, I think, along the way that, if you're paying attention, you could probably spot. And one of the things hopefully we can do a bit with incident.io is surface some of that in the future.

Chris mentions shadowing. I think one thing I've seen work really well is thinking of shadowing as having two versions. One is the really formal version, something that we do here at incident.io, which is to actively say, “Chris is on-call.

He's the primary, and while you ramp up, you are just gonna essentially be glued to Chris.” And you know, we start with you doing that during the day, and then you move gradually to overnight. That's really good. It pulls people into incidents really actively. I think there's another version that, if you can culturally curate it at your company, is really good, which is just making incidents something that you actively encourage people to get involved in if it makes sense to do so, or to follow

when they maybe can't actively contribute. So this is particularly true of like junior engineers, for example. Now I'm not proposing that someone early in their career jumps into a channel, starts throwing questions all over the place and disrupting everything. Obviously that's not the time. But if you see something where there's a huddle of very experienced, very tenured engineers trying to solve a problem, I think it's totally acceptable for you to go and look at that. Incidents are not something that happens 200 times a day.

And they're an amazing learning experience, and you're not gonna naturally be in the room because you're not yet on-call, and you're not gonna naturally be in the room because you don't yet have all the context that means that you would be pulled in. But for you, at this stage of your career and with the level of context you've got in the business, this would be amazing for you to go and be a fly on the wall. And making that really acceptable, I think, is really important.

And so one of our engineers, Lisa, did a really good talk on this at LeadDev, and she made exactly the same point about engineering teams, which was that some of her biggest learning jumps, or times when she really clicked with a particular part of the infrastructure or the product

when she was working at her last company, were when she went and sat in an incident, wrote down a massive list of questions of, “I can see what you've done, but I have no idea why you've done that, and that feels like it'll be a useful thing to know.” And then afterwards, when everything's not on fire anymore, she'd go and chat to the people that were in the incident and be like, “Help me understand, like, why did you do those things?

Should I do similar things in future? Like, how does this impact me, my work?” I think that informal making incidents an accessible thing is a really powerful move, because often they're not. Often, especially if you've got a hero culture, the people with the context go in a room, everything gets fixed,

they come out and go, “Don't worry everyone. I fixed it.” And I dunno, it's just another lens on it. But I think if you can build a company culture of sort of “incidents are for everyone, everyone's invited”, as long as you've got clear boundaries, or like rules of engagement, I think that can be a really good thing.

If that makes any sense at all.

[00:26:04] Chris: Not at all, Pete.

[00:26:05] Charlie: No Pete

[00:26:07] Pete: Great. That's a novel for me, so I'll take that.

[00:26:11] Charlie: I love the plug for Lisa's talk there. We'll get that in the show notes.

[00:26:18] Pete: Shamelessly plugging the achievements of my team. Part of my job.

[00:26:22] Pete: Um, I know it's really good.

[00:26:25] Charlie: Sweet.

[00:26:25] Pete: It reflects a lot of, you know, what I believe about incidents. I recommend people watch it.

[00:26:29] Chris: It is so good, actually. And again, I'm like putting my bias hat to the side, hopefully. But Pete and I went and watched Lisa give that talk at LeadDev over in London, and we were sat next to each other, middle of this big auditorium. We were sponsoring the conference, so like our logo was up on the thing.

[00:26:46] Pete: Yeah.

[00:26:46] Chris: Lisa gave this just like incredible talk. You know when you're at a conference sometimes, and you can see people on their phones, and everyone's back to work while someone's talking, and you're like, “I feel so sorry for this person”? This was like, people slightly edged forwards in their seats. I genuinely think it was excellent. I said to Pete at the time, I was like, “I actually feel a bit emotional, like, this little company that we started.”

[00:27:11] Pete: I think your words were like, I think we might have built a real company, or something like that, at that point. Yeah, no, it was really good. And I guess, for me personally, I think it's really cool seeing that.

We care about this a lot and we do a lot of it internally, and there's a lot of things that Chris, Stephen and I believe passionately about, whether it's incidents or on-call, and we're putting a lot of that into practice in the team. And then you hear your team go and talk about that publicly.

[00:27:58] Charlie: Absolutely. I'm gonna move us on a little bit and talk about the elephant in the room, which is obviously on-call compensation. It touches all of the things that we've discussed so far about getting people bought in and feeling they're operating in a well-functioning on-call team. Wondering if you could share some views on on-call compensation. What do we see across the industry? What's the thing that works best from your experience?

[00:28:26] Chris: Yeah, I think this is like a really evocative subject, it seems. Every time I talk about it on Twitter or LinkedIn, it always erupts into this: I'm not paid to be on-call, or, I think everyone must be paid. It's...

[00:28:41] Pete: It's a lot more inflammatory than I thought it...

[00:28:43] Chris: Yeah. It is.

[00:28:44] Pete: For me, it's just never been an elephant in the room. And then, yeah, basically Chris gets eviscerated on Twitter and I'm like, Oh, glad I didn't say that. But...

[00:28:52] Chris: I think the interesting thing is... So we actually ran a survey earlier this year. I lose track of time. I think it was earlier this year. We ran, like, an on-call survey. We got really good responses from a bunch of companies of different sizes and in different locations around the world.

And it seems there is no common approach to this. So if you take a look at Europe, typically it's quite common for folks to be paid for on-call as a sort of separate thing from their main salary. In the US that is very much not the case. It is just: here's your salary, you're expected to be on-call.

So there's those sort of two flip sides. And then when you look at breaking down compensation in Europe, for those who get paid, there are then different models for how you get paid. So there is: should I be paid per incident that I get called out to?

Should I be paid a flat rate, or an hourly slash weekly, whatever it is, rate that is: I'm on-call, here's how much I should get paid? Is there a hybrid model for that? And there's obviously, like, weird incentives with various different ones of those. But the one I would advocate for, for anyone who is looking to introduce compensation for on-call, is the kind of: I'm gonna pay you a flat rate for being on-call for a week, for example.

And you choose that number based on what it needs to be to, I think, incentivize folks onto on-call. And we should probably talk separately about how you incentivize people onto on-call, cause it's not just pay. But if we look at just the compensation thing: pay people. Monzo used to pay people a few hundred quid a week.

We do a similar thing here at incident.io. And I think the immediate backlash when you take that approach is people go, Yeah, but if I work that out as an hourly rate, you're paying me absolutely, like, nothing compared to my salary. And my response to that is that on-call pay should compensate for the inconvenience of having to think about bringing your laptop with you places, and the ideal case is that you do absolutely zero work in your week on-call, right?

That's what everyone should be striving for. If I have incredibly high pager load, that is something I would advocate for a team swarming on and fixing more systemically. So the idea is that you get compensated for the inconvenience.

If you happen to be paged during the week, I would also include time off in lieu as part of my, like, incentive structure. So if I lose four hours of my Saturday, I am totally cool with someone on my team saying, I would like to take a little bit of time back, because I lost the opportunity to go to the supermarket to do my weekly shop, or whatever else it might be.

And I think that works really well as a sort of compensation scheme. Another thing that works really well on that scheme is when you look at overrides, which Pete was speaking about. In a world where you don't pay people, there's a weird dynamic if I say to you, Pete, could you cover me for an hour?

There's this weird: you're not getting anything for covering me. So you are like, Yeah, but can you cover an hour for me at some point? And everything becomes very transactional, and everyone's like, I've done more hours than you, and this and that.

When you pay people per time unit, essentially what happens is you go, Pete, can you cover me for an hour? And you're like, Yeah, I'm gonna get paid for that. It's compensated. It just removes, I think, a little bit of that: I feel indebted to Pete because he's covered me for an hour.
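As a rough illustration of the per-time-unit model described here, the sketch below pro-rates a flat weekly rate by hours on-call, so that covering an hour of someone's shift is paid automatically, and tracks time off in lieu for hours actually lost to pages. The `Shift` type, the function names, and the 336.0 weekly rate are all hypothetical, chosen only to keep the arithmetic simple; this is not how incident.io or Monzo actually calculate pay.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 7 * 24  # 168 hours in a full on-call week


@dataclass(frozen=True)
class Shift:
    engineer: str
    hours_on_call: int    # hours this person held the pager
    hours_paged: int = 0  # hours actually spent responding to pages


def weekly_payout(shifts, weekly_rate):
    """Split a flat weekly rate by hours on-call, so cover is paid automatically."""
    hourly_rate = weekly_rate / HOURS_PER_WEEK
    totals = {}
    for shift in shifts:
        earned = shift.hours_on_call * hourly_rate
        totals[shift.engineer] = totals.get(shift.engineer, 0.0) + earned
    return {name: round(amount, 2) for name, amount in totals.items()}


def time_off_in_lieu(shifts):
    """Hours each engineer can take back for time actually lost to pages."""
    owed = {}
    for shift in shifts:
        if shift.hours_paged:
            owed[shift.engineer] = owed.get(shift.engineer, 0) + shift.hours_paged
    return owed


# Made-up week: Pete covers one hour of Chris's shift; Chris loses 4 hours to pages.
week = [Shift("chris", 167, hours_paged=4), Shift("pete", 1)]
print(weekly_payout(week, weekly_rate=336.0))
print(time_off_in_lieu(week))
```

Note that pay depends only on hours holding the pager, never on the number of incidents, which is the property that avoids both the gaming risk and the unpaid quiet week.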

I think paying per incident is a slippery one. Not because, in like a healthy org, I expect anyone is going to be gaming that, though that is the risk there, right? That someone is like, Oh, I'm not gonna...

[00:32:33] Pete: Breaking things and getting paid for it. Yeah.

[00:32:36] Chris: Yeah, exactly. For me, it's less the gaming aspect that concerns me about that. It's more: if I have to carry my laptop round for a full week and I don't get paged, and in the back of my mind I'm like, Oh, I've gotta be 30 minutes away from wifi...

[00:32:50] Pete: Yep.

[00:32:51] Chris: ...I also don't get any compensation. So I'm more concerned about that way round, which is it doesn't feel fair on the person on the receiving end of this.

[00:33:00] Pete: Yeah, it makes it a bit less zero-sum, right? Cause the angle that you could go with is: it's an inconvenience, but everyone has to do it. And so your sort of reassurance is, at least I'm not the only one. Or if we are trading hours, and I take over an hour of your shift, it's like you said: I don't get anything.

You could argue you get the warm fuzzy feeling of knowing that you've helped Chris, and that's something your team should want anyway. And it's, yeah, but that's the sort of very technical answer, and it conveniently avoids the emotional part of that, which is: yeah, but it would be nice if there was some recognition beyond just a thank you. Because what might be 30 minutes in hours is totally different to, for me, like, one hour from 7:00 to 8:00 PM when I miss bath time.

[00:33:53] Chris: Hold on. Just to be clear, is this your bath time or your son's bath time?

[00:33:56] Pete: No, not my bath time

[00:33:59] Chris: You’re very militant about this.

[00:34:00] Pete: Seven to 8:00 PM, every night. But yeah, there's an assumption there that all of Chris's hours and all of my hours are pretty equally exchangeable as well, which is also not true. So an hour of Chris's time and an hour of my time: is an hour of my weekend worth the same as an hour of Chris's morning?

You don't even really want to be thinking about everything like that. And so I think if the company can shoulder a bit of the burden, like Chris said, and go, Cool, we'll acknowledge the inconvenience and we'll compensate for that, I think that's a really promising thing.

That's how we do it here, for what it's worth. I think one maybe slight caveat is the difference between in hours and out of hours, and acknowledging that actually, while you are at work, everyone should share the responsibility a bit more collectively, and we're already paying you for the inconvenience.

And so I go, Oh, cool. So what I'll do is, Chris, if you could cover my shift all evening, I'll cover your shift tomorrow between nine and five. Is that cool? And it's, no, cuz that time Chris could be spending with his family is way more valuable to him than the time where I'm gonna be, frankly, sat at my desk working anyway.

And yeah, I dunno, there's some other little tweaks you can make to the way you structure how you comp that I think are worth considering.

[00:35:38] Charlie: Absolutely. The key thing I'm taking away here: do not interrupt Pete's bath time.

[00:35:42] Pete: Yeah

[00:35:44] Chris: Pete needs his bath time. Very valued time.

[00:35:47] Charlie: Anything you want to add there around incentivizing people in other ways?

[00:35:52] Chris: Yeah. I think it's interesting that pay’s always the one, like, that's the carrot that I dangle to engineers to say, This is why you should come and be on-call. And I think there's more to it than that. And so I'll confess, I just wrote this down a minute ago cuz it came to me.

I was thinking about what else I use to incentivize people and I realized there's a lovely like three Ps structure to this thing.

[00:36:17] Pete: Is this a pre-thought thing, or are you inventing a methodology on the fly and...

[00:36:22] Chris: So here's the deal. They're legit three things that I actually do look at when I'm talking about incentivizing on-call, and I was writing 'em down just so I didn't forget them. Three P's. So pay: important, right? People want to be compensated for it. I know there is some nuance around that. P, process: you can incentivize people by making the process, when they are actually called out by something going wrong, really good.

This was literally the reason that I wrote the first incident tooling at Monzo. People were like, It's really stressful. I get paged, I jump into a channel. There's millions of things going on. There's alerts all over the place. I don't know who I need to contact. It's really scary. I'm an engineer, I just wanna fix the thing.

And so I think if you can build a system that sort of encodes your process, and means that when you jump in, everything's been pre-thought, and you are just pushed through a thing and nudged along, and everything just works... that is it.

[00:37:43] Pete: Like the manual stuff, right? That you would do the same every single time, and it's just administrative work while you are super stressed. Yeah, a hundred percent.

[00:37:53] Chris: Exactly. MasterCard breaks, you need to go contact the MasterCard center. And it's, I don't wanna have to remember that or go and read it. If I tag my incident to say it's a MasterCard thing, I want someone to go, Go and call this number right now.
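A minimal sketch of the idea of encoding process as tag-driven nudges, assuming a simple lookup from incident tags to prompts. The `PLAYBOOK` contents, the `Incident` type, and `nudges_for` are hypothetical names for illustration; they are not the incident.io product's API or the original Monzo tooling.

```python
from dataclasses import dataclass, field

# Hypothetical playbook: incident tag -> prompts to surface to the responder.
PLAYBOOK = {
    "mastercard": ["Call the card network support line listed in the runbook."],
    "database": ["Check replica health.", "Post a status page update."],
}


@dataclass
class Incident:
    name: str
    tags: set[str] = field(default_factory=set)


def nudges_for(incident, playbook=PLAYBOOK):
    """Collect the prompts for every tag on the incident, in a stable order."""
    prompts = []
    for tag in sorted(incident.tags):
        # Unknown tags simply contribute no prompts.
        prompts.extend(playbook.get(tag.lower(), []))
    return prompts


incident = Incident("Card payments failing", tags={"mastercard"})
for prompt in nudges_for(incident):
    print(f"Next step: {prompt}")
```

The point of the design is that the responder only has to tag the incident; the "what do I do now" knowledge lives in the playbook rather than in anyone's memory.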

Good example. Then the final P is progression, which is, I think, legitimately on-call is a really solid progression lever to pull for engineers. You will not learn as much about systems in the cold light of day, doing things in a proactive-work type way, as you will when they go wrong.

And so I think that is a really good incentive. I'm an engineering manager, and my people are like, How do I progress to the next level? It's, Great, we want you to take on some operational maturity here and own some of this stuff. Go and learn what happens when Kubernetes falls on the floor.

Or etcd quorum, or, you know, all of these things. So there you go. Three P's: pay, progression, and process.

[00:38:54] Charlie: I love it. You heard it here first. I hope you remember that, people. And over to our fourth P, I guess that's you, Pete.

[00:39:26] Pete: Keep going. Keep going, Charlie. You can save this.

[00:39:30] Charlie: We've spoken a lot about on-call philosophy, compensation, the general compassionate elements, the human side, and so on. I'm wondering if we could shift gears a little bit and talk about some, like, practical tips that listeners could take away for operationalizing their on-call team to get them vibing together? Maybe some tips on new joiners, getting them on-call. Just really getting down into the tactical things that people could take away from the pod.

[00:40:02] Chris: Yeah. Absolutely.

[00:40:04] Pete: Yeah I'm gonna brain unpack you and then Chris can give a much more like clear, polished answer afterwards that fills in all the gaps.

[00:40:22] Pete: This is our jam. What I'll do is give you a great answer and then Chris will come up with some, like, bullshit. Yeah, no. So I guess, like, really practical stuff, and I'm just gonna lean really heavily on what we do at incident.io and what I've seen work really well before in other companies. Firstly, make sure your onboarding is really good. Get people on the pager sooner rather than later.

Make it not scary, I think. So, what we do, just if it's helpful: we have a kind of in hours on-call and out of hours on-call. Generally it's the same person all the way through, so we try and line them up. What we do is we start people on in hours on-call.

So the pager might go off during the day, and whoever's primary will pull that person in, or they'll get pulled in automatically. You're in the office anyway, or you're at work anyway, so it's totally okay to deviate from whatever you're doing, and this is how you get the kind of learning experience of, firstly, a lot of the mechanics.

So seeing how we at incident.io respond to incidents. What do you do when you get paged? What are the things that we take action on early? That sort of stuff is really helpful. Also, you have nice little things like, over lunch, you get used to taking your laptop out with you or being within X minutes of the office, so it's like a really nice on-ramp.

Then you can flip that, so you move from shadowing to reverse shadowing. So it's, Cool, you are now the primary in hours, and if you need support, you escalate to someone else who's done this a lot before. So it's a really nice, like, training cycle for on-call in hours. And then, when you are really comfortable with the process, the mechanics, responding to issues (cuz we have international customers, so it's not like the overnight stuff is gonna be objectively scarier than the in hours stuff), really it's about getting you comfortable with being on-call. Then we move to, Cool, are you comfortable with taking that inconvenience out of hours? And we'll put you on the full rota, and then you essentially take on responsibility as part of the wider team.
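The onboarding progression described here (shadow in hours, then primary in hours with an experienced escalation, then the full rota) can be sketched as a small routing rule. The stage names and the `page_plan` function are hypothetical, a sketch of the idea rather than how any real paging tool is configured.

```python
from enum import Enum


class Stage(Enum):
    SHADOW_IN_HOURS = 1   # newcomer is pulled in alongside the primary, in hours only
    PRIMARY_IN_HOURS = 2  # "reverse shadowing": newcomer leads in hours, with backup
    FULL_ROTA = 3         # newcomer takes the pager out of hours too


def page_plan(stage, newcomer, experienced, in_hours):
    """Return who is paged first, who is pulled in to watch, and who to escalate to."""
    if stage is Stage.SHADOW_IN_HOURS:
        # The experienced engineer always leads; the newcomer only watches in hours.
        shadows = [newcomer] if in_hours else []
        return {"primary": experienced, "shadows": shadows, "escalate_to": None}
    if stage is Stage.PRIMARY_IN_HOURS and not in_hours:
        # Out of hours is still covered by the experienced engineer at this stage.
        return {"primary": experienced, "shadows": [], "escalate_to": None}
    # Newcomer leads, and can escalate to someone who has done this a lot before.
    return {"primary": newcomer, "shadows": [], "escalate_to": experienced}


print(page_plan(Stage.SHADOW_IN_HOURS, "charlie", "pete", in_hours=True))
print(page_plan(Stage.PRIMARY_IN_HOURS, "charlie", "pete", in_hours=True))
print(page_plan(Stage.FULL_ROTA, "charlie", "pete", in_hours=False))
```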

So that's one thing. Another is being really willing to, firstly, declare incidents and, secondly, escalate. So I think on-call onboarding only really works if you actually get to respond to some on-call things during your shift. So what we do is we'll treat a lot of things as incidents that maybe other companies wouldn't.

So maybe just a customer's had a bug in production, right? At a lot of companies, that's the thing that goes on the backlog and the team will deal with it in planning next week. What we'll do is go, Cool, let's treat this as an opportunity to make a customer really happy. We'll treat it as a mini incident.

And Charlie, you've not run one of these before, so I'm gonna make you the lead, and you drive, and let me know if you need any help. And it might sound a little bit forced, but it's really useful. It's also dogfooding for us, cuz we get to use our own product, and we've spotted loads of stuff that way.

But I think just training people to escalate more rather than less is a really valuable thing. I'd say it's quite similar to the world of security, where it's much better to say, I think this might be a thing, and be wrong. And make it okay in your organization to be like, Hey, I've declared an incident, and then we decide it's not an incident.

Then we tag it as, This wasn't an incident. Much better to have that than have someone go, I'm not really sure, so I'll leave it. And then six hours later, the same problem happens again at 2:00 AM and someone's been woken up. So yeah, I think that's like a second thing.

And then just the last one that I'd add, which we actually haven't done much of at incident.io, which is interesting, and we probably should think about when we start doing some of these, is game days. There's a colleague who I used to work with at GoCardless called Noberto. He's a CTO at another startup here in London, but he wrote a really good article while he was at GoCardless on how we do this.

I'd encourage people to go find that and read it. But essentially it's dry running problems before they're problems. You're inventing scenarios that people can respond to, and what you're really doing there is giving people a chance to try out different roles in an incident.

You go, You've done lots of leading, but maybe you've never done much on the comms front. So what I'd like you to do in this scenario is play the role of the person who communicates with the executive team, for example. What we're gonna do is dry run a scenario amongst the team of: our database has disappeared, we can't talk to it.

It's, Cool, what are we gonna do? Firstly, you get a handle on how quick can we get it back up? What's the broad impact? What comms do we wanna send? You're going through the motions, and often you have a goodie team and a baddie team, and the baddie team is trying to take stuff down and the goodie team is trying to fix it, so you can make it feel a bit more real versus just tabletop card exercises.

But just dry running things, if you don't have those real incidents, big or small, can be a really effective way of trying out different scenarios with the team. It can be quite fun as well. I remember a few times... you can do things if you're feeling really nefarious.

I generally like being on the villain team. You get the team to respond to a fake incident and then you just turn the wifi off, things like that. It could sound really mean, but it's interesting to see how that works. And if the team really buy into it, it's cool.

Like, this is an opportunity to ask, what would happen if that happened? Or if Chris’ internet suddenly cut out and he was the lead, what do we do in those situations? They can be quite fun.

[00:45:39] Charlie: Chris, have you been able to think of any frameworks or acronyms that we can take away from Pete's essay there?

[00:45:45] Chris: I'm glad you asked. So I came up with the 14 Ds framework here. No, I have spared you. I'm not gonna go for it.

[00:45:56] Charlie: Absolutely.

[00:45:57] Chris: I think those points are really interesting. I think the thing I love most about our sort of learning how to, like, operationalize on-call is that one of pushing people to respond to small things like incidents.

And I think, honestly... So I'm not, like, frontline on-call now for incident.io; I've drifted a little far away from engineering. But I remember, when I was doing that, just the power of declaring an incident for a small thing, linking it to the alert that you had so everyone knew that you were on something.

Then using that channel as: I'm going to dump every single thought that I have about what I'm gonna go look at, what I'm gonna go and do. And often that process is just like the classic rubber ducking one, which is that by verbalizing that stuff, you're like, Oh my gosh, I can see that this idea is this idea.

[00:46:40] Pete: Yeah.

[00:46:41] Chris: I've seen... And then the benefit of all of that stuff is anyone else in the organization can use that as their basis for learning, right? So someone goes, How did you know to fix this thing? And you're like, Oh, I just saw it. And there was a great example of this, actually, like early days.

So we have part of the product, workflows, that will let you send an SMS to someone automatically with incident.io. And I remember there was an issue where I got an alert while I was on-call, and I'm like, This SMS is failing to send as part of this workflow. And I was like, Oh, I remember seeing someone who declared an incident that was debugging something about SMSs. And I went in and I literally just walked through their incident and was like, Cool, so I can go to Twilio to do this thing here.

And it turned out to be almost exactly the same incident. It was that we didn't explicitly allow a certain geography or country to be sending SMSs to. So, great learning tools as well, when it comes to operationalizing on-call.

[00:47:38] Charlie: Great. I think this is probably a good point for us to close out. We’ve covered all sorts of stuff from Chris's bath… I mean Pete's bath time, even poopy pagers.

[00:47:50] Chris: Charlie, it’s nowhere near Pete's bath time at seven o'clock.

[00:47:52] Pete: I thought this session would get anchored around that. I regret bringing it up at all.

[00:47:58] Charlie: More details about all the stuff we’ve spoken through will be in the notes. Thank you both for your time today. Look forward to continuing the conversation in future podcasts. Thanks a lot. See you later.

Charlie Kingston
Product Manager