Podcast

Building an incident management process

Summary:

In this podcast, our panellists discuss the foundations that any team needs to put in place when designing their incident management process. Starting from the basics of defining what we really mean by an incident, to how to set your severity levels, roles and statuses, Chris and Pete share their tips for building solid foundations to run your incidents.

Notes:

Chris Evans is co-founder and Chief Product Officer at incident.io. In practice, he covers everything from Customer Success to Sales to Product development. Chris has spent his entire career working in Technology. Starting out as a Software Engineer, he later transitioned towards Platform work, most recently as Head of Platform and Reliability at Monzo, where he was also responsible for incident management and on-call.

Pete Hamilton is co-founder and Chief Technology Officer at incident.io. Most of Pete’s time is focussed on Engineering and Product (although he also covers lots of other areas, including Operations, Legal and Finance). He’s worked in Engineering for start-ups and scale-ups for the last decade, starting his career at GoCardless and later moving to Monzo.

Key topics/timestamps:

[00:55] What is an incident?

[06:35] Questions to ask to figure out whether or not to declare an incident

[12:27] Can you declare too many incidents?

[17:59] Defining your severities

[23:34] Why you need incident statuses

[31:15] Incident roles and responsibilities

[36:29] Using structured data to learn from incidents

Where to find Chris Evans

Twitter: https://twitter.com/evnsio

Linkedin: https://www.linkedin.com/in/evnsio/

Where to find Pete Hamilton

Twitter: https://twitter.com/peterejhamilton

Linkedin: https://www.linkedin.com/in/peterejhamilton/

Transcript:

Disclaimer: this has been transcribed by machines, so apologies in advance for any mistakes!

[00:00:00] Charlie: Pete, welcome back to the podcast.

[00:00:04] Pete: Delighted that you're giving me a second chance

[00:00:06] Charlie: Absolutely. And Chris, welcome back to you as well.

[00:00:09] Chris: It's a pleasure once again to be here. I'm looking forward to chatting about whatever we're chatting about.

[00:00:14] Charlie: Wonderful, quick intros. I can skip introducing each of you, as you told us all about your backgrounds in the last episode. Chris is our Chief Product Officer here at incident.io and Pete, Chief Technology Officer at incident.io. If you want the full backstory, check out episode one, a little plug there for the previous pod. This week we're going to be talking about the foundations of incident management, a sort of Incident Management 101: starting out at a company, designing your incident management process, the best practices and so on. Pete, I'm gonna jump straight to you and give you this very open-ended question of: what is an incident?

[00:00:55] Pete: Nice. Yeah, open-ended existential questions, I like it. As a product with a core belief that sensible defaults are really important, I think this is probably a good one to start with, which is that I think a lot of people look at an incident as, by default, a really scary, huge event.

Maybe you have one a year. And I think that's quite typical; people go, "we rarely have incidents." I think starting to reframe a little bit how you think about incidents is a good starting point, cuz we embed that into our product. So the way we would look at incidents is: anything that takes you away from planned work, usually with some degree of urgency.

And this is subjective, right? So for some companies, urgent means it needs to be done this week or next week. For other companies, it's drop everything you're doing. And so what you find is different companies have a different bar for declaring an incident.

So for some companies, if it's less than "everyone drops everything", it's not an incident, it's just planned work. For other companies, incident.io being one of them, we would consider bugs that affect our customers in a right-now way, and we would generally declare those as incidents.

I think that's often a surprise to people, because they're like, doesn't that mean you have loads and loads of incidents? And it's: yes, but that's the point. And I guess we've done that since the start, and Chris has always had a firm belief here from, I guess, your time at Monzo, Chris.

I remember joining and finding this quite surprising, but making sure incidents are not something scary, and making sure that they're something that we practice a lot, as opposed to only getting together when everything's an inferno. Let's get together when things are maybe not so bad.

And then we can use that as an opportunity to practice. I dunno if there's anything you'd add.

[00:02:51] Chris: Yeah, I think it's interesting. It's incredibly context dependent how often you should be declaring incidents and what an incident is in your particular place. But my general heuristic here is that folks should just declare stuff as often as is reasonably useful.

And I think there is a negative here, which is if you declare millions of incidents all the time, people become so desensitized to them that you're like, what's the difference between an incident and just normal work? So no one wants to go that far, but people generally have this view that incidents are really bad things that happen once a quarter, maybe once a month.

Then they involve multiple teams all jumping in to fix a thing. Our view is a lot more fine-grained than that: I think it's useful for multiple levels of organizations to declare things more frequently at lower severities. So as you said, for us a bug is treated as an incident, and what that means is that engineers in the organization can see how other engineers are debugging bugs.

And that's a great way for folks to learn from these lower severity things: how to respond, how to practice incidents, and also just how things work. And then you've got the high severity things, which are very useful for everyone to be able to see what's going on when something really bad happens.

And generally speaking, incidents tend to be a really good way to understand how organizations really work. And so the more often you can declare things, the more signal you get. And without drifting too far into nerdy academia, there's this thing known as, I'm probably gonna ruin this,

I think it's called the fundamental regulator paradox, which is just that the job of a regulator, not like the banking regulator, but a thing that is regulating a system to perform in a certain way, the job of that regulator is to constrain and make the thing work in the right way.

But the more it constrains it and the more it makes it work in the right way, the less signal it has to know when it's deviating. Cuz the regulator works by going, well, that's outside the bounds of what should be working, therefore I need to constrain it back in.

But if you have a system that's seemingly working perfectly, the regulator has no idea how it's actually working. And in this case, the regulator could be senior management and the system the whole organization. If you are sitting there going, we have absolutely no incidents, you have no way to steer and nudge and know where the risks are.

Whereas the more low-level signal you can get, the greater observability you have across the organization. And sorry, that drifted way into academia in a place I didn't expect it to go.

[00:05:40] Charlie: Could you summarize that, Chris, in three key words? This was one of your strengths that you showed in the first podcast.

[00:05:48] Chris: Have more incidents.

[00:05:51] Pete: Nice.

[00:05:59] Charlie: The HMI principle: Have More Incidents. That's wonderful, take that away, podcast listeners. One of the things, Pete, I think you got out there was around incidents taking you away from planned work, with urgency.

[00:06:16] Charlie: I think that's a little bit abstract, not to call you out there, but to call you out there. I'm wondering if you could share any practical definitions, or maybe some questions we've asked ourselves at incident.io about this unplanned work that we tend to do.

[00:06:35] Pete: Yeah, for sure. And yeah, very legit. I think saying "we should have more incidents", and saying anything that takes you away from planned work is an incident, obviously there's an extreme there, where anything that isn't originally what we planned, at all sorts of levels, becomes an incident, and it all gets a bit silly.

So let me illustrate a bit what I mean. There are a few assessments you could do of it: is it worth kicking off an incident? I think the first thing is, is this a situation where there's a degree, ideally some meaningful degree, of risk and negative impact on either the product, our business, or our customers?

Cause if we are changing our plans but realistically there's no impact, say we're adding a little new feature to the product, that's clearly not an incident. But if we're maybe getting some feedback from customers that the thing we've just shipped is broken in prod, cool, that has negative impact on our customers.

So I'd probably declare an incident. Equally, maybe someone has accidentally emailed some confidential data to someone they didn't mean to. Cool, that clearly has a pressing and urgent risk, not so much on the technology side, but on the business, right?

And so I might care about that in my role. So I guess the first thing is usually some negative connotation; maybe that's one place to start. I think the second is, does this need to be done now? There might be a bug that we find in the product, going back to product for a second, and honestly no one's gonna notice.

It's fine to sit there for a few days, we will get to it. We can pop it on the backlog and deal with it as part of normal product work. If we didn't deal with it now, everything would be absolutely fine, and in that case I probably wouldn't be declaring an incident. But if it's causing problems right now,

maybe we've got an amazing customer we're trying to close a deal with, and they're trying to work out whether they wanna work with us, I would give that bug a little bit more urgency. Or perhaps we've shipped something which affects a large subset of customers; that now has a little bit more urgency.

So if it needs to be responded to, that would be another trigger to go: cool, maybe this is a good scenario to declare an incident, cause I'm gonna have to react very quickly and I might have to pull some more people in. Which leads me to another thing I would be checking, which is: is this something I'm just gonna deal with on my own,

and quite quickly, or is it something where I'm gonna need to coordinate with multiple people? So there's some overhead here, and now there's some complexity in communication or coordination, where having a little bit more structure and stability, a framework around me, is gonna be really helpful.

And so sometimes the answer is: no, I'm gonna do this on my own, but I'll kick off an incident anyway, because the chance that it evolves into something else is non-negligible. Being in an incident environment tees you up to think in a certain way, and you'd maybe make decisions a bit differently too. If you said, oh, it's nothing,

then I'm in a mindset of "it's just me and this bug", and you're not thinking about, okay, who else needs to know and who else do I need to pull in? An incident primes you to be ready for that evolution of the problem. And then I guess the last thing, related to the multiple people and coordination, is around comms.

Is this something where I need to keep lots of other people in the loop? If this is a little thing that I'm gonna fix in a corner and isn't really affecting anyone and no one needs to know, fine. But if it's something where maybe our support team are getting loads of flak from customers because everyone's reaching out going "what's going on?",

and they need something to go back with, suddenly again you've prompted this multiple-people, coordination, communication element. You don't need incident infrastructure to solve all of these, but I think they're close enough to what I would model as an incident that it's worth declaring one. And if you do, even if you don't need it, you put yourself in a mode where you've got all the tools you need at your disposal. Yeah, I dunno if you'd add anything to that, Chris, but that's the stuff we'd generally be looking for at incident.io, and we generally opt on the side of declaring an incident versus not.
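To make those questions concrete, here's a minimal sketch of that declare-or-not checklist in code. It's illustrative only: the type, the field names, and the "any one signal is enough" rule are assumptions for the sketch, not incident.io's actual tooling.

```typescript
// A minimal sketch of Pete's declare-or-not questions as code.
// All names and the decision rule are illustrative assumptions.

interface DeclareSignals {
  negativeImpact: boolean;    // risk or impact on product, business, or customers?
  needsActionNow: boolean;    // urgent, or can it sit on the backlog?
  needsCoordination: boolean; // will I need to pull multiple people in?
  needsComms: boolean;        // do support, customers, or stakeholders need updates?
}

// Opt on the side of declaring: any one signal is enough.
function shouldDeclare(signals: DeclareSignals): boolean {
  return Object.values(signals).some(Boolean);
}
```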

[00:10:43] Chris: Yeah. Yeah, I would love to add something at this point, but you have once again used up all the time of the podcast. That was episode two, folks. Thanks so much.

[00:10:54] Pete: I feel bullied and attacked.

[00:10:57] Chris: As you should. As you should. Yeah.

[00:11:03] Chris: No, I think you made some good points there, many good points there. The thing I think is interesting here, and I often talk to folks about, is: if you detach the label "incident" from a thing, when you're doing something that is a bit new or whatever else, why would you not want to have a new space to coordinate that?

[00:11:25] Pete: Mm-hmm.

[00:11:25] Chris: A way to collaborate with folks that is a little bit easier. And for us, that is what an incident feels like: it's just a convenient mechanism with which you can pull a bunch of people together, work on a thing, and have a bunch of levers you can pull to communicate what's happening.

And obviously, taking that to an extreme, we wouldn't use incidents at incident.io for "oh, I'm kicking off a project". But we often kick off a channel where we're like, this is a new channel for a new thing that's gonna be delivered over the next two weeks, and a bunch of people will collaborate there.

And so it's not that far from that world; you're just injecting into that a minor degree of urgency or some negative connotations. And that's, for us, what an incident is. The cost of declaring a new incident is so incredibly low that it's like, why would I not do this? It gives visibility: I can see over time where we're spending our time, who's involved, and what's going on with the product. There seem to be very few negatives to treating lower severity things as incidents.

[00:12:27] Charlie: Have either of you ever seen a situation where over-declaring resulted in negativity? I can imagine an incident every 10 minutes and how that could hurt the general vibe. Any experiences where this hasn't gone well?

[00:12:48] Chris: I haven't, because when I've been in an organization that ended up declaring incidents quite frequently, it's been a slow, gradual cultural change that we've seen. So incidents have gone from being the big scary thing that happens once a quarter to us building some tooling.

So the thing I'm referencing here is Monzo, right? We didn't have great visibility of when incidents were starting or finishing or who was involved in these things, cause it was just a channel with stuff happening. And then we moved to one team using this new mechanism of declaring with a slash command, getting a channel, and being able to coordinate like that.

And then it moved to another team. And so you'd go to having a few incidents a week, and then over time more and more people come in, and suddenly you're in a world where people culturally understand that having a channel where your new incidents are being announced, and having that as a semi-constant stream of 10, 15, 20 things a day, is just normal.

You move from a world where it's like, oh my gosh, a thing has happened, I must jump into this channel and see if I can contribute, to: I want to be told when there is something happening that I feel like I should go and jump into.

And it inverts the whole thing; incidents have become less scary. But the point you made is a good one, which is: are there ever any negative connotations to having loads of incidents? And I think my answer is yes, if the way that you are dealing with incidents adds a ton of friction for people, or your culture is misaligned with this ideal, so that every time an incident happens someone is getting stressed out. But yeah, I think if you're doing it in the right way, and you're getting to that point slowly, or you've already set that culture, that sort of tone, I think it's not something to be concerned about.

[00:14:44] Pete: Yeah. One thing I might add to that is that the way you set up and configure your incident process as a company, and the way you report on it, really matters here as well. So one way that I have seen this go wrong, or can imagine this going wrong, is one where incidents are all, to our point earlier, considered to be large scary things.

And if you take the attitude of "lots of incidents is better" and you put it into an environment where incidents have traditionally been something very scary, and then you report, for example, to your board, "oh, we had a hundred incidents last month", what you really mean is: we had zero major incidents, and a hundred bugs got fixed really quickly by the team.

[00:15:24] Pete: I've seen that, I'm trying to pinpoint an exact example, but I've definitely seen that vibe, let's say, particularly among leadership teams, right? Where you're maybe having a conversation about how many incidents you had in engineering last week or last month,

and they're like, are we having way too many incidents? This sounds really bad. Do we need to stop shipping stuff while we deal with this? And it's: oh no, these are normal, low-level incidents, we deal with them all the time, we haven't had any major stuff, don't worry. It's the framing

that's really important. So getting your severities right, and making sure that when you're reporting you make it super clear. Particularly if you've got a regulator. Often, from a regulator's perspective, or, I dunno, if you were in government, declaring a major incident or saying "we're having an incident" would generally be considered quite a severe thing.

And so if you're reporting to a government body or some regulator, like a financial regulator, they're gonna look at that very differently. So realize that the world maybe isn't as aligned with that viewpoint yet, and that how you report makes a big difference.

[00:16:29] Chris: I was just gonna say, to punctuate that point, I think something that is really harmful is when folks haven't thought about severities and how different incidents should be navigated through the org. A real strong negative there is when people are like, I don't wanna declare loads of incidents, because every incident has some baggage attached to it: I have to do a postmortem or an incident debrief, and I have to arrange a meeting, and then someone from risk and compliance has to come and check all of my homework.

[00:16:58] Pete: Yep.

[00:16:58] Chris: And that is just a horrible place to be, because everyone feels like they're checking boxes, you're not doing anything valuable, and you're not prioritizing learning. Yeah, that is one area where it's very harmful to have lots of incidents being declared, cuz you just multiplicatively add paperwork to people.

[00:17:13] Pete: Yeah, I was thinking, if your tooling's not set up right, you risk pulling in the wrong people. Particularly if you're moving from a world where, for example, incidents are the things where you wake an engineer up in the middle of the night. Make sure that if you are gonna try and declare more incidents, you are not still implicitly pulling the wrong people in; your team that deals with your major incidents may well look very different to the team that deals with your low severity incidents. The last thing you wanna do is pull everyone in for everything, so thinking a bit about what types of incident you have, who you need, how you respond to those, and how you follow up is, I think, super important to get right.

[00:17:47] Chris: Yeah, I was gonna say maybe severities is an interesting area to go to,

[00:17:52] Pete: Yeah.

[00:17:53] Chris: cuz it feels like a lot of the process hangs off of being able to have severities attached to a thing.

[00:17:59] Charlie: Absolutely, let's do it. Pete, what is the sort of default severity stack?

[00:18:16] Pete: Yeah, cool. I feel really conscious of it now. So, default severity stack. There are a few things I'd note here. One is that the default sets people use vary depending on which bit of the organization you're in.

In engineering teams, and in more mature incident environments, people quite often lean heavily towards having P1, P2, P3 and so on, or SEV1, et cetera. I think it can be quite hard to communicate the underlying meaning of those to the rest of the organization, so my personal preference is, where possible, use human-friendly severity names and have as few as you need. Rather than saying, oh, we're gonna have nine different categories of incident, all on this scale, all with different descriptions and different decision matrices required to decide them, I'm a fan of having slightly more intuitive, human-friendly names that you can explain really easily, and having as few as possible.

So for me, something like critical, major, minor would be much more useful than having nine severity levels, for example.

Keep it super, super simple. You want someone to be able to think about a problem and intuitively go: which bucket does it feel like? And then obviously there's a bit of calibration and fine-tuning required there: is it critical or is it major, is it minor or is it low, depending on how you're talking about your severities.

But yeah.
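A severity set like the one Pete describes is small enough to write down in a few lines. This is a hedged sketch with made-up descriptions, not a default shipped by incident.io:

```typescript
// Illustrative only: three human-friendly severity levels with
// one-line descriptions anyone in the org can apply intuitively.

type Severity = "critical" | "major" | "minor";

const severities: Record<Severity, string> = {
  critical: "Drop everything: serious customer, business, or regulatory impact.",
  major: "Urgent and impactful, but the whole business is not on fire.",
  minor: "Low impact; worth tracking and fixing with normal urgency.",
};
```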

[00:19:57] Chris: Yeah, I think that's a really good point. There are a few common things I see with people when it comes to severities, or a couple of common objections. The first one is quite practical, and comes to the fore quite a bit; I've had to smash heads together in incidents

when you have people who spend more time arguing over whether it is a critical or a major incident than they do actually focusing on the thing going wrong. And that's clearly a risk, right? You attach these labels, and you get told by whoever it is within your org who's monitoring this stuff

that it's really important to categorize these things correctly. And I think it is important to categorize these things correctly, but not live in the incident. I would prioritize making a best-effort guess as to what the severity is and going with it. If you need to change it,

I don't care. Often processes hang off of these severities, so something will kick in when there's a critical, so there are some weird social pressures there, but generally speaking I would just optimize for speed in those cases. And then that factors into advice

I often offer to folks who are trying to design these severity levels, and they're like, cool, we've got a spreadsheet which has the 40 criteria that make something critical. And I'm like, at 2:00 AM that is never getting consulted. So your priority is to distill that down into

a two or three bullet point list so that someone can make a snap decision. We had this at Monzo in fact: someone had come up with really rigorous, really strong criteria for how we model incident severities. And the work was really valuable, because out of the back of our process we wanted to be able to say: this is the way that we categorize incidents.

Cause a bunch of things fell off of that. There was regulatory reporting, there was internal board reporting, there were processes for follow-up actions, and so you wanted to be applying the right kind of lens to those things. But it just didn't matter in the incident. So I took it and I was like, great,

I'll distill that down to roughly: there is a lot of money on the line, there are a lot of customers impacted, or there is a big regulatory concern; that means critical. And that was plenty good enough to steer people, with pretty good accuracy, into the right severity category. The other thing that is often touted when you start talking about severities is that people are like, yeah, but I can't attach a label.

My incidents are too complex, and I can't distill a really complex thing with lots of teams and systems involved down into this one little label, so I'm not gonna do it. And honestly, yes, incidents are gnarly and complex and often hard to attach a label to. But I often think of this as like that aphorism in statistics: all models are wrong, but some are useful.

In a world where you go, I'm gonna throw my hands up in the air and say there's no way I can say whether this is a P1 or a P2 or a P3, you essentially lose the lever to respond in a consistent way, and you're saying: I put all of this on the individual who's responding. And that's a horrible place to be as the person who's paged at 2:00 AM, where you're like, cool,

I don't have any guidance on how I should think about this or whether I should escalate to people. So I would look at it as: yes, it is a model; yes, it is going to be fallible; it's not going to work in every situation. But as long as it is generally more helpful than it is harmful, I would say severities are just a thing that everyone should have, and I'm yet to find an org that has decided to ditch them and not found it incredibly painful. So yeah, just general viewpoints there.
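Chris's distilled 2:00 AM heuristic fits in a few lines of code, too. The thresholds below are invented purely for illustration; the point is that a responder needs two or three snap questions, not a 40-criteria spreadsheet:

```typescript
// Sketch of a snap severity decision. Thresholds are made up for
// illustration; calibrate them to your own business.

type Severity = "critical" | "major" | "minor";

interface ImpactGuess {
  moneyAtRisk: number;       // rough estimate, in your currency
  customersImpacted: number; // a best-effort guess is fine; change it later
  regulatoryConcern: boolean;
}

function snapSeverity(i: ImpactGuess): Severity {
  // Lots of money, lots of customers, or a regulatory concern => critical.
  if (i.moneyAtRisk > 100_000 || i.customersImpacted > 1_000 || i.regulatoryConcern) {
    return "critical";
  }
  return i.customersImpacted > 0 ? "major" : "minor";
}
```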

[00:23:34] Charlie: And when those people that are getting paged at 2:00 AM for SEV1s or critical incidents come in, what's the way in which they get up to speed with what's happening during that incident? Statuses are the thing I'm trying to get at here. How should people be thinking about incident statuses as a mechanism for understanding what's going on?

[00:24:05] Chris: Yeah, I think this is another one of those ones where people are like, you can't boil down my incident into a linear thing where there's a set of states that you walk through. And again, I completely agree: in the real world it's rare that I go, do you know what, I have moved from investigating to fixing. Incidents are incredibly messy.

But our viewpoint on this, and what we think is really important, certainly from our product standpoint, but even setting aside the product, just from a general principles standpoint, is that folks should all be aware of what is going on with incidents. You want to assume that you're going to have to onboard new people as you're going through, or that there are external stakeholders who might need that information,

sort of periodic snapshots. And so the thing I would always encourage folks to do is set up some regular cadence of providing updates, where you are providing something more qualitative: here's what we're doing, here's what we're seeing, and here's what we're gonna need

from people. As part of that, think roughly about: cool, if I'm going to approximately shoehorn this into one of N states, to help convey as a model where we are in this incident, that is the time to do it. So it's a secondary thing to providing updates: being able to fit it into a status.

Severity is how bad it is; states are roughly oriented around where we are in this thing. And that's just super useful. If I join an incident and I can see it's in a monitoring state, I know that someone in that incident has thought about what is going on, has assessed the situation, has gone: I think the worst of this thing is over, and we're just watching to make sure that we've either fixed the thing or it's recovered on its own. And that sets a tone.

That means I can join that incident and not be like, what do I need to do? I can join and be like, great, is there anything I can do now to help with the wash-up here? Can I relieve someone who's maybe been in this incident for two hours and is super, super stressed out, because probably everything is watertight at this point?

Can I nudge people to go back to bed? That's another thing. And again, I think this is just a nice way to model quite a complex situation.
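A sketch of what that status-plus-update model might look like in code. The state names and fields here are a common pattern, assumed for illustration rather than incident.io's exact model:

```typescript
// Illustrative: a handful of linear states that a messy incident gets
// approximately shoehorned into at each periodic update.

type Status = "investigating" | "fixing" | "monitoring" | "closed";

interface IncidentUpdate {
  status: Status;   // roughly where are we?
  severity: string; // roughly how bad is it?
  summary: string;  // qualitative: what we're seeing, doing, and need from people
  postedAt: Date;
}

// A responder joining late reads the most recent update first; a
// "monitoring" status signals the worst is probably over.
function latest(updates: IncidentUpdate[]): IncidentUpdate | undefined {
  return updates[updates.length - 1];
}
```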

[00:26:16] Charlie: Do you see those states and severities changing for different types of incidents? I'm thinking here of, say, engineering and customer success. Should organizations be driving towards a single set of severities and statuses, or vary those based on different teams and incident types?

[00:26:51] Chris: I think this is a huge rabbit hole, but a good one to go down for sure. So if we look at severities specifically... in fact, no, maybe let's take one step back, which is: we're talking about incidents in different parts of the org, and I think when you say the word "incident", most people immediately go, 'bing', that's a system is down, the database is broken, customers can't log in, that kind of thing.

And I think that makes sense. The engineering and technology domain is the place where 'incidents' as a vocabulary feels most natural. You don't hear many people in your legal team saying "we're having a legal incident", but you do have times where someone has maybe signed the wrong contract, or you've shared some details in an email thread with some counterparty that you shouldn't have.

And that is an incident; they just don't often talk about it like that. So the premise of this answer comes from a place of: yes, I think organizations have incidents in more places than just technology. And the question then becomes, does a team like legal have its own set of severities and statuses for those things,

or should everyone tailor it to their use cases? And my general thinking here is that it is just a big waste of time if everyone has to rethink and design their own incident process for their own area. It's one of those areas where you should be optimizing globally rather than locally.

In a world where the legal team go "we're gonna have SEV1 to SEV4" and the customer success team go from "really bad" to "not bad" as their thing, as someone looking across the org and trying to roughly say where the bad things are happening, I have to do all this mapping and try and figure it out.

So my guidance would be that each team takes the standard set of severities and statuses that the organization has decided is sensible, and figures out what each of those things means within that model. So, what a critical severity means to me in legal versus me in technology. Then what you can do, as a sort of roll-up, if you are looking at incidents across the board, or you want to drill down into specific areas, is go: I wanna see all the critical incidents we had in the last three months.

And there's a nice way to do that. So yeah, I think generally global over local optimization is my general gist here.

[00:29:14] Pete: Yeah,

[00:29:15] Charlie: That makes a bunch of sense.

[00:29:15] Pete: I was gonna say, I definitely agree with that. One thing I'd loop back to from what we were talking about earlier: imagine a world where your low severity incidents tend to be more siloed, so it's maybe something that engineering handle internal to engineering, or legal handle internal to legal, because it's probably not a super complex, requires-lots-of-coordination-and-communication style incident. In a world where

everyone fragments, what you're essentially doing is setting everyone up with totally different playbooks for that eventual occurrence where they all have to get in a room and coordinate really well. In an ideal world, what you do is build a great process that everyone can train up on, so that when you are in a room under immense pressure, you have a shared shorthand. Counterintuitively, rather than making everyone have their own world, which is really nice for them, designing a world where everyone practices with the same set

of defaults, the same set of terminology, means that when you do get together under immense pressure, even if you maybe haven't worked together before, and that's quite common, right, someone from the engineering team is now working with someone from the legal team, the more shared framework, shared playbook, shared terminology you can give them, the better.

And so although tailoring it all locally might sound attractive, as Chris said, I'd optimize globally rather than locally, and I think that practice element is a really key part of that for me. Cause otherwise what you're essentially doing is helping everyone build up muscles that are gonna be totally pointless in a major incident.

Cause you'll spend the first half going: it's critical. No it's not, it's minor. No it's not, it's major.

And where that will come from is me and Chris just having different definitions for everything. It would almost be better if we had no definitions, because then it would be like, cool, let's quickly

align on a set of definitions we can use. Whereas if we're immediately on different pages, you set everyone up for failure at the start. So yeah, very plus one on everyone having shared terminology and frameworks.

[00:31:09] Charlie: I can see one case where things might differ: the roles that people play.

[00:31:15] Pete: Mm-hmm.

[00:31:15] Charlie: If we are talking about a legal incident, I imagine we need a person of type X and a person of type Y in this particular thing. Is that something you see?

[00:31:31] Pete: Not so much. I've not really seen different parts of the organization having strictly different subsets of roles that they use. And actually I tend to lean towards applying the same principle, which is: ideally you have a common set of roles. I think where it can differ slightly is in which roles you pull on for an incident, if that makes sense.

So I wouldn't necessarily give them different names, but I might involve a different subset of them. To make it more tangible: I think every incident should have a lead, right? And in every incident that role should be called a lead. It's not that in an engineering incident it's called a technical lead, and in a legal incident it's called a legal point person; let's just call them leads and have shared terminology.

[00:32:23] Pete: Equally, maybe you've got a customer comms lead or point person, right? So you need someone dealing with communication with customers, and you probably also have

someone dealing with internal legal or privacy. So maybe you've got those two roles, but then in lots of technical incidents those roles aren't needed; they're very internal incidents and we don't have to pull them in. And so I'd be more flexible with the subsets of which roles we use, depending on the type of incident.

But I would make sure that the names were, as far as possible, consistent, so that if you need to pull people in, they know what role they're playing intuitively, because you have a shared definition across the org. If that makes sense.

[00:32:57] Chris: Yeah, I was just gonna say that the thing that's interesting here is that there are responsibilities in incidents, and roles fall out of responsibilities, and

[00:33:08] Pete: Hmm.

[00:33:09] Chris: in a small incident, you have one lead and they have all the responsibilities: to fix the thing, to communicate about the thing, to write up about the thing.

And as those incidents get bigger and more and more people come in, there is a high tax on that person having to do all of those different things. And so you might then go, I actually want to have a communications person, and a scribe to write things down, when I have a really bad incident that involves a certain number of people or a certain severity.

And so I think of roles as the hats, and you can delegate hats out to people.

[00:33:41] Pete: Mm-hmm.

[00:33:42] Chris: Yeah, generally I think it's useful to correlate that against how much responsibility that person's taking on and how much workload that is. And that in itself then correlates, typically, to some form of severity model.

So an incident happens, and I should have someone whose focus is to make sure that we are communicating with our customers correctly, and communicating with internal stakeholders like execs or senior leadership properly. Just generally, that's quite helpful.
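Putting Pete's and Chris's points together: shared role names across the org, with the subset of hats you actually fill varying by incident. A hedged sketch; the role names and rules below are examples, not a prescribed set:

```typescript
// Illustrative: one shared set of role names, with extra "hats"
// delegated out as the incident gets bigger or touches other domains.

type Severity = "critical" | "major" | "minor";
type Role = "lead" | "comms" | "scribe" | "legal";

function rolesToFill(severity: Severity, involvesCustomerData: boolean): Role[] {
  const roles: Role[] = ["lead"]; // every incident has a lead
  if (severity !== "minor") {
    roles.push("comms", "scribe"); // delegate hats once the lead is overloaded
  }
  if (involvesCustomerData) {
    roles.push("legal"); // pulled in only when the incident type needs it
  }
  return roles;
}
```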

[00:34:07] Charlie: Chris, I don't suppose you brought your collection of incident hats for our viewers?

[00:34:14] Chris: I will wear one of my incident hats next week.

[00:34:17] Charlie: Okay, perfect. That's wonderful.

[00:34:20] Pete: Nice. There was a team I used to work in that actually had a hat for incidents, and it was a cowboy hat. It got handed out when you needed to go and do something slightly sketchy in production. And the general rule was essentially, if you're wearing the cowboy hat, someone has to be sat next to you.

You can't do stuff on your own. But it was quite fun. It often tallied one to one with the lead at the early stage of the company I'm referring back to. So that was actually indication of role by attire at that point: whoever's wearing the hat is the lead.

Now, it's not implicit that the lead will be doing sketchy things in production, but making it super clear who's playing which roles, I think, is really important. And I've seen different teams use various things. When we were in the office, I saw one team had a flag that they would hand out.

Obviously in a remote or distributed environment that doesn't work, and so tools, and we're obviously biased, like incident.io, are really helpful for this kind of stuff, making sure everyone knows who's playing which roles. But yeah, I'd be very plus one on wearing comical hats

if you're on a video call, to illustrate the same. You've gotta inject some humor when you can.

[00:35:31] Charlie: Absolutely. For those listening to us only and not viewing this, Chris just went under his desk and pulled out an incident.io cap.

[00:35:40] Chris: There you go. I tell you what, actually, we should have a thing, which is: if you comment on the YouTube clip with a special word, you get this hat. I'll post it to you. It's not been worn, it's brand new. So what's the word? What are we gonna say? This is like a test.

[00:35:57] Charlie: Pete, what's the word please?

[00:36:00] Chris: Yeah. Pete,

[00:36:00] Pete: Why do you always give this shit to me? I got nothing.

[00:36:08] Chris: Charlie, as the host, you get the honor, Choose the word.

[00:36:11] Charlie: The word is cucumber. So if you comment cucumber, then Chris may or may not

[00:36:20] Chris: Chris promises to send you,

Yeah.

[00:36:22] Pete: If you comment on our YouTube video with the random word cucumber, we'll definitely send you an incident.io baseball cap.

[00:36:28] Charlie: There we go.

[00:36:29] Chris: Done.

[00:36:29] Charlie: The fun we have on this podcast is just... I'm gonna move us on. We've spoken quite a lot about the mechanics of running incidents: statuses, severities, all that good and fun stuff. I want to chat a little bit about after the incident, so the importance of capturing data in a structured way.

[00:36:54] Charlie: Very open question, and I just wanna pass it over to one of you. What are your thoughts on capturing this data? How are people leveraging it to track things? Chris, maybe starting with you.

[00:37:07] Chris: Yeah, I think a lot of what we've been talking about is collecting data and being able to attribute different dimensions to incidents already. So status is just a bit of data that's attached to an incident, and severity is too, and roles are structured data on there. And generally speaking, it's just useful to have as much structured data as possible.

The counter to that is obviously you don't want a world where people are filling out forms 300 fields long just to give you the most accurate picture of your incident. On the other end of the spectrum, having just a big free-text blob that you have to grep to try and understand what's going on with your organization is not helpful either.

I think it's helpful to have attributes on incidents that you are able to fill in. That might be number of customers impacted; it might be which systems you want to tag this to. And all of that is in service, I think, of two things. There is the live, during-the-incident lens of attaching that data, where I might have some processes which say that if a certain number of customers are impacted, I should escalate this, or set it to critical and invite the CTO to join and oversee what's going on.

And so collecting that data allows me to see that stuff there. And then there's the post-incident flow, which is: I want to be able to see and slice and dice incidents across the organization. In a world where you've been bought in and you're declaring lots of things, you've suddenly got this library of incidents, and it's just super useful to be able to go: show me all the ones that affected Kubernetes.

And you go, oh my gosh, 90% of them had Kubernetes as a source. That's probably something that I could dive a little bit deeper into, and I could go and step into some of those incidents. Then you go: actually, do you know what, not only were these all Kubernetes things, but Pete was the one who led every single one of those incidents.

[00:39:06] Pete: Terrifying.

[00:39:06] Chris: And you can start to go: not only have I identified maybe a bit of a system risk there, but I've identified a key person risk.

If you didn't have that structured data available, it's very hard to get that. Maybe Google Docs was the old way to do it, or you'd have a spreadsheet and try to fill it with all this data anyway. So yeah, general vibes: it is pretty helpful to have some dimensions of data structured on an incident, and it's useful in many different ways.
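Here's a small sketch of the kind of structured attributes and after-the-fact slicing Chris is describing. The record shape and field names are illustrative assumptions:

```typescript
// Illustrative: structured attributes on an incident record, and the
// slice-and-dice queries they make trivial after the fact.

interface IncidentRecord {
  id: string;
  severity: string;
  customersImpacted: number;
  affectedSystems: string[]; // e.g. services or components tagged on the incident
  lead: string;
}

// "Show me all the ones that affected Kubernetes" becomes a one-liner...
const touchedKubernetes = (incidents: IncidentRecord[]) =>
  incidents.filter((i) => i.affectedSystems.includes("kubernetes"));

// ...and a key-person risk shows up as a simple filter on the lead.
const ledBy = (incidents: IncidentRecord[], lead: string) =>
  incidents.filter((i) => i.lead === lead);
```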

[00:39:35] Pete: The one thing I was gonna add to what Chris is saying: yeah, differentiate between structured data that's useful in an incident in real time and structured data that's useful after the fact.

That's very important. I also think when you gather that data is really important. I've seen incident processes before where there's an attempt to gather as much of the structured information up front as possible. So it's: I want to declare an incident; here's a three-page form that you must fill in in order to declare an incident.

[00:40:04] Chris: They're both useful, they both serve different purposes, and you wouldn't wanna smoosh one into the other. You've gotta be careful about where you want to put them.

[00:40:21] Pete: Yeah. But I think it's a classic pattern that people fall into: a massive form up front, in the interest of structured data, when actually most of it you should just be filling in after the incident's over, and you still get the benefit.

But yeah, be very minimalist.
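Pete's "minimal up front, richer after the fact" split could look something like this; which fields go where is an assumption for illustration:

```typescript
// Illustrative: keep the declare-time form tiny, gather the rest once
// the incident is over, and you still get the reporting benefit.

interface DeclareForm {
  name: string;
  severity: string; // best-effort guess; change it later if needed
}

interface CloseoutForm extends DeclareForm {
  customersImpacted: number;
  affectedSystems: string[];
  summary: string;
}
```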

[00:40:54] Charlie: Let's end on your favorite non-engineering, maybe low-stakes incident that you've been a part of at some point in your careers.

[00:41:04] Chris: Oh, if you've got one, Pete, go for it.

[00:41:07] Pete: Yeah, gimme two seconds. I'm gonna go and just double check that I...

[00:41:13] Chris: Oh, we got plenty of time,

[00:41:14] Pete: ...get this right. No, I was just double checking. I'm gonna get the details wrong, but I'm gonna paraphrase. So there's an incident, and this was an incident.io one, from when the three founders, very early on,

had just hired our first couple of employees and left them in the office on their own for a week while we went off to figure out what we were gonna turn this company into. There was an incident very shortly after: one of our API requests was failing and the product was blowing up, because a third party service had started sending a photo of an alpaca in response to our API request. So we were making normal API requests, nothing had changed, and we were getting this weird malformed response. When the team unpicked it, it was just a photograph of an alpaca. So there was an incident being like, "we are just getting alpacas from their API", and everyone was like, what is going on? And then you dig in, and it turned out that the company had actually exposed some internal...

[00:42:20] Pete: A genuine incident where part of our product didn't work because we were getting photographs of cute, fluffy animals from a third party.

That was my favorite incident.

[00:42:31] Charlie: I was trying desperately there to think of an incident pun.

[00:42:36] Pete: I think it's an alpaca. That was the thing I paused to double check.

[00:42:45] Charlie: If anyone's listening to this and does have an incident pun, then feel free to comment or tweet us, and Chris will send you a baseball cap. Chris, any funny incident you wanna finish on there?

[00:42:56] Chris: Not a funny incident, but something that still makes me smile every week: we do a weekly team meeting, and it's during that where we do a product demo of whatever it is we're shipping. It's really nice, but what I always love is people's demo environments, or personal development environments, for our product.

When they're testing things like custom fields, a lot of people deliberately put in things that go against our culture. So there's a custom field on our staging environment, in fact, which is "Blame", with the idea of: who is the individual who should be blamed for this thing?

And the various roles that people have in these things, which is always fun to see.

Charlie Kingston
Product Manager
