Podcast

How communication can make or break your incidents

Summary:

In this episode, Pete and Lisa discuss why great communication (both internally and externally) is essential to the success of any incident management process. From keeping your wider team in the loop to minimise disruption, to using customer communication to strengthen your brand when things go wrong, the team share their experiences and top tips for having a transparent incident communication culture.

Notes:

Pete Hamilton is co-founder and Chief Technology Officer at incident.io. Most of Pete’s time is focussed on Engineering and Product (although he also covers lots of other areas, including Operations, Legal and Finance). He’s worked in Engineering for start-ups and scale-ups for the last decade, starting his career at GoCardless and later moving to Monzo.

Lisa Karlin Curtis is Technical Lead at incident.io. One of incident.io’s very first joiners, Lisa previously worked as an Engineer at GoCardless.

Key topics/timestamps:

[01:10] Why is communication so important?

[07:30] Practical advice on building a good internal communication culture

[15:20] The cultural challenges of incident transparency

[25:30] Building trust with customer communications

Where to find Pete Hamilton

Twitter: https://twitter.com/peterejhamilton

Linkedin: https://www.linkedin.com/in/peterejhamilton/

Where to find Lisa Karlin Curtis

Twitter: https://twitter.com/paprikati_eng

Linkedin: https://www.linkedin.com/in/lisa-karlin-curtis-a4563920/

Full script

[00:00:00] Charlie: Hello, and welcome back to the Incident FM podcast. We're joined today by Lisa and Pete. Lisa, welcome to the podcast. Can you introduce yourself?

[00:00:11] Lisa: Yeah, sure. So I am Lisa. I'm the technical lead here at incident.io. I joined last August, so a year and a bit here, building things and then trying to build a team that builds the thing.

So yeah, that's all my jam.

[00:00:24] Pete: Perfect.

[00:00:24] Charlie: And Pete, we've already met you in previous episodes. You are also joined, for those viewers on video, by a cat. Maybe you can introduce your cat.

[00:00:33] Pete: Sorry, we've started now, so I'm not gonna go and throw her out. And who knows? She might have more insights than I do. This is Lovelace and she enjoys sitting on shelves in the background of my podcast episodes, but yeah.

Great. You're back. Let's get into it. Perfect.

[00:00:45] Charlie: We'll plug the YouTube if you want to see Pete's cat, Lovelace. Cool. This week we're gonna be talking about the importance of communication during an incident. Very open topic, not sure where this is going to go, but I wanna jump straight in and kick off with, maybe for you, Lisa: what makes the communication aspect of an incident so important?

[00:01:06] Lisa: There's sort of two parts to this. One part is that good communication massively accelerates your incident response, and we can talk about that a bit more later. If you communicate well, everyone will know what's happening. The right people can provide input.

The right people can get into the room at the right time. And your customers, from an external point of view, will trust you more. They will probably give you more time, ask you fewer questions, and all of those good things. And then the second part is that it's also the bit that we tend to be really bad at. If you give a particular engineer a technical problem and say, fix this problem, it's like, right, full steam ahead, blinkers on, I will fix the problem.

If you say, hey, can you fix this problem, and also update all these people really clearly on exactly what's going on at the same time as fixing the problem, that's then a much more difficult thing to do. And so that's why we bring more than one person into an incident, and we have teams, and we build an ephemeral team in the incident where some people are more external facing and some people are more focused on solving the problem.

As with all work things, stuff gets harder when it's more than one person. And it's all about, you know, the way that those relationships work, the way that people communicate with each other inside the incident, to drive that good communication for all the other people. And I think that's where a good incident response can be made or broken.

I guess it's our ability to do that and pull together that team to work well.

[00:02:24] Pete: Yeah. I guess there's an interesting kind of situation that often occurs where people go, oh, the overhead isn't worth it, as in pulling lots of people in isn't worth it. I dunno, often it's good motivations, right?

It's like, I don't want to distract someone else. Or maybe it's like, oh, it'll be quicker if I just fix it. And I think those are really well-meaning, but what Lisa's describing is, for me, the golden path and what I'd always steer a team towards. It can rarely go wrong if you pull more people in, but it can definitely go wrong if you keep it really small and solo.

There are a few exceptions to that, where maybe you've got a security or privacy incident and you can't have lots of people involved. But I've seen a lot of false economy vibes here, where someone goes, oh, it's just not worth it. And in my experience, it's almost always worth it.

More often than not, anyway. And I guess, I dunno, Lisa, if I don't do that, in your experience, what happens? Cause I feel like both of us have a lot of burned fingers here, and that might be interesting to reflect on and get some resonance with people listening.

[00:03:22] Lisa: Yeah. So I think it's very easy, and I have done this painfully recently, to get into this kind of mode in an incident.

We had a big incident a couple of weeks ago which I was leading, not as effectively as I have led some incidents in the past, shall we say. And primarily the issue was I was very much in fix-the-thing mode, and maybe didn't do as much talking as would've been useful for anybody else. But more specifically, if you're not communicating with other people, you take away their agency and their ability to be useful.

And in all likelihood, in your organization there are other people who are also good at what you do. I mean, you know, in an ideal world. And so as soon as you stop communicating and create a sort of Chinese wall, other people who think they might have context start to go, oh, can I help? And they start to badger you for updates and ask you questions.

And that's honestly quite frustrating, because that's obviously not helping, except for the point where sometimes they know something that you don't, and it's really, really important that they badger you and ask you questions until they can say, I think it might be this thing that I did yesterday, do you want to check that?

If you don't push updates to them, they have to come and fight to get your attention and work out whether what they're saying is relevant at all. So it's a classic, it's a bit of a false economy, cuz then you have to have that conversation with six people about the six things that they've done over the last two days, only one of which is probably relevant.

And then the second side is, you are stopping people being able to make good choices about what they're currently doing. So as an example, if your database is really sad and then someone kicks off a backfill that's really database intensive, you are gonna be really, really frustrated. But you can't legitimately be annoyed at them unless you've told them that the database is sad and that everybody should not do anything. Same with, oh, could everyone stop deploying, or something like that, right? You should be pushing this stuff out to your team to help them make good choices about whatever they're doing. Otherwise, you could just end up in this really frustrating place where you've made your life significantly worse and you are just shouting at somebody through Slack going, please, could you stop the horrible thing now?

[00:05:21] Pete: Yeah. As well as streamlining stuff internally, there's a big external factor here too. What you've just described is a lot of the internal consequences, but I think there's also external consequences. We should maybe talk about this as a separate point a bit later on, but it's worth noting that there's a huge amount of upside to communicating well externally to your customers.

I mean, it's definitely true internally, but it also applies outside the company, and I think there's a lot of value in making sure that your customers know that you are on it, even if nothing else. Whereas often, similar to the internal incentive, it can be like, until I know exactly what the problem is, I'm kind of hesitant to say anything because I don't wanna look silly.

But actually just telling someone, there is an incident, we think it's not worth deploying. You don't have to be like, I know exactly what it is and why you shouldn't. You can just give a preemptive thing. Ditto with customers, you can say, you know, we don't quite know what's wrong, but something's wrong.

We are investigating. And even that can build up a bit of trust, and that to me is at the root of why you do all of this, right? Which is to build trust with the people around you. Otherwise you are gonna get hammered by people who don't have that trust, and they're gonna ask you those questions, and they're gonna be like, are you really on it?

Are you really fixing this incident? So yeah, that's maybe something we could talk about a bit later on, but I think that customer trust and that customer comms piece is really important, as well as the internal stuff that we've mentioned already.

[00:06:42] Lisa: I think there's also a blurry line between the internal and the external stuff.

If you're in a bigger org and you're part of a platform team and your customers are other teams in your organization, then all of the same kinds of things apply around trust and people feeling more comfortable. Once you build that trust, it gives you that space to actually fix what's gone wrong.

But they're also like chips that you can spend in other ways. So once you have that trust from either your customers or from other people in your organization, then you can say, look, I know that you are not a hundred percent sure about this, but trust me, I know what I'm doing, I think that this is the right thing for you to do. And trust becomes like a currency that you can use in that relationship.

[00:07:21] Charlie: Absolutely. I'd like to narrow in a little bit on the internal side of communication. We kind of mentioned there, customer facing versus internal. I don't think anyone would disagree that internal comms is gonna be really important during an incident, but I wonder if there's any practical advice, tips, or experiences that

you can share in terms of how do you achieve that culture of good internal comms during an incident?

[00:07:43] Pete: No, incident management company, never thought about it. Yeah, no, I guess so, there's a few things we could cover here. We also talk about some of this stuff in the incident management guide, so go read that.

I think it's genuinely a really good resource. Here you're gonna hear me and Lisa's opinions, but if you want the pooled collective knowledge of everyone at incident.io and friends, that's the place to go. I guess maybe we could just focus on stuff that we've found useful, and leave some of the hypothetical stuff that might be really important at a much larger company for folks to read afterwards.

But maybe just a few things from me. So I think one is that comms should be a proactive thing. You can do really good comms reactively: for example, when someone asks questions, have a really good response and respond quickly. That's great, that is objectively good comms. But I think the best kind of incident comms are the ones where you are very proactive, where you're kind of ahead of the game, you're getting answers out before people have asked the questions. Sometimes you do that really well and sometimes you miss the ball, and you are always gonna have an element of reactive stuff.

But that proactive element of comms can really head off a huge amount of stress for other people in the organization, many of whom are honestly sitting there being like, I really need to know something, but I'm really hesitant to dive in and ask, cuz everything seems quite on fire right now.

And if you make sure you've got someone who's going out and saying, I'm here, I'm updating you, if this isn't what you need, tell me what you do need, that's really important. I think doing this on a regular cadence is really valuable as well. If you give an update, tell people when the next one's coming, right?

If they know that you said something half an hour ago and you promised an update in 30 minutes, they're much more likely to be like, hey, it's been 40 minutes and you said you were gonna tell me what was happening, and I'm kind of sat here with customers yelling at me. They're much more likely to ask you.

So you make yourself more accessible, but you also help people orient themselves in the incident much more easily. They land and they go, oh cool, I know that in 15 minutes or so I'm gonna get another update, so maybe I'll hold my questions because they might get answered. So again, you're making your life easier.

You're also reducing stress and making everything simpler for everyone else. And then I guess the last thing, and maybe Lisa, you could provide some detail on this cuz I think you're good at it, is making sure that the structure's really solid. Because if every update is totally different...

Imagine you're met with a wall of text every half an hour, and that wall of text covers wildly different things each time. Then someone's ability to join an incident and parse that is quite difficult. You have to sort of assemble your own model of what kind of updates you're getting. It's like, oh, okay.

So a minute ago they were updating us on customer impact, and now we're talking about technical fixes and mitigation processes. I dunno if you've got any thoughts on this one, Lisa, these are my own opinions, but there's often some key points that you want to hit in your incident update structure, and often you want to repeatedly hit that same structure, even if there's nothing new, so that people know there is now nothing on that topic, as opposed to covering one point on each update and then maybe forgetting to talk about customers this time.

Right. Yeah, I dunno if you've got any thoughts on that.

[00:10:37] Lisa: One thing always to think about whenever you are communicating in any context is, who is the audience, and what context do they have and what context don't they have? A classic thing with bad, or not ideal, incident comms is when someone sends out some comms and it's got lots of jargon in it, or lots of stuff where, if you were on that one project, it all makes sense.

And if you weren't on that project, it's all just meaningless. And then I think it's about ordering an update. People start reading from the top if you're gonna do it in one paragraph, so make sure the first thing is the thing you really want them to know, and the second thing is the bit you'd quite like them to know.

And the third thing is then the, if you are interested, you might want to read on. And so I think useful prompts are things like: what is the current state? Which is often really easy to forget when you're writing an update, cuz you're doing something. But it's like, where are we now? Even if, as Pete said, it could be the same place we were half an hour ago, or an hour ago, or a day ago if you're really unlucky.

But it's: what is the current situation, what are we doing, and when do we think the situation might change? Those are fundamentally the things that people care about. And then maybe a fourth one is, here is some more collateral that you might need. So there might be, oh, we've written out some comms that we can send to customers.

So if customer questions come in, this is what you should send them. Or, this is a third-party outage we are being affected by, Heroku's got a DNS problem, so this is the link to the status page. That is pushing out extra resources for people, if you like, but it should probably be at the end, cause only the most interested people care about that.

And so it's that kind of hierarchy of information, and then it's the predictability. Making sure that all of your updates have the same information in the same place, kind of like Pete was saying, right? So people can start to parse them, and the person who only cares about the first bit can just look at the first bit and then move on with their day.

Yeah.

[00:12:22] Pete: Strong agree. One of the interesting things about this is that for tiny incidents, it's usually quite trivial, and you're often doing it for internal benefit. Maybe it's your team plus, I don't know, an EM who needs oversight so that they're comfortable, but it's still a technical stakeholder.

And as you get further and further away from the team, you're getting to less and less domain expertise. You're also maybe getting to folks that will interpret things wildly differently. For example, if you say, I think we're quite unstable, that sounds very different in a technical context versus to the CEO, who's like, hang on, does this mean my whole company's about to fall over?

Actually, as you get to much, much bigger incidents, what becomes a lot more common is that there are now so many stakeholders that a single update doesn't quite cut it, cuz you're trying to write one message for five different interpretations of the incident, or levels of interpretation. So one thing that I've seen before is maybe you kick off multiple comms loops.

So you have maybe an infrequent comms loop, like an hourly comms loop for the business as a whole, where you write a very superficial, high-level update, and then you have a more low-level technical update for all the teams that need to be aware that we've currently frozen deploys, and every 15 minutes you're telling them, nope, still frozen, but here's some details.

And it's kind of interesting, you can end up with multiple loops. The question often is, is that the same person who's now managing multiple comms loops, or do you have one person doing the small comms loop and then another sort of envelope person that takes that, aggregates it, and rolls it up?

[00:13:51] Lisa: I love it when I get to be the envelope person.

[00:13:53] Pete: Yeah, that's an interesting point. It's not actually something that our product does at the moment. Maybe it's something we should think about doing. But I've ended up in incidents, I think they're quite rare, that tend to be the horrendous ones, where you're four days in.

And now, at least once a day, you're updating the board, right? I've been in two or three incidents like that, and that is not the person on the incident writing a quick update every 30 minutes, right? That's like, we summarize top-level business impact and how likely we are to still be alive in three days as a company.

[00:14:22] Charlie: I didn't realize we'd be using the pod for future feature requests as well. It's coming full circle now.

[00:14:28] Pete: This is it, yeah. You know, coming up with ideas all the time, that's the product team at incident.io. So I guess one thing that I've implicitly done there is assume that everyone has the ability to see everything.

And actually maybe that's where this takes us next: there are different audiences, and maybe not everyone should see everything. Should we even be sharing details of incidents, and when do we share or not share? I'm very biased, I think you should generally be very open and transparent, but there are probably good reasons, or good motivations, even if I don't quite agree with them, for why people might not

do what we've just been talking about, which I think is very good, but clearly not everyone does, right? I dunno if you have any thoughts on that, Lisa. When people haven't done this, it's not because they're not aware of it or because they're bad at their job, but because they've deliberately made a decision not to.

Has that ever happened?

[00:15:14] Lisa: I've seen a couple of different versions of this. One version that I've definitely seen is that there is a perception, and I think correctly, that the more you communicate, sometimes the more inbound you get. You are sort of creating work for yourself. And so as soon as you start communicating with the entire company, suddenly you've got 20 questions coming back.

And then you're like, oh, well now I feel obliged to answer those questions as well, and that's taking time that maybe I don't have. And so you'll try to protect the time of the incident team, which I think is fair. If it's very serious, the answer is just, cool, we need more people on your incident team. So that time is probably valuable, but there are also other ways of mitigating that, right?

By being very clear with the audience that this is a really time-sensitive situation, there's not loads of people, we are communicating as best we can, but we don't necessarily have the bandwidth right now to answer your questions, kind of vibe, right? So there are other ways of dealing with that, but I think that's one thing that can make people quite reluctant to share.

And then the other, and this is maybe more for external problems, and I'm not quite sure whether we're in internal, external, or both mode right now, but thinking more about external comms or large companies, where there is lower trust and lower familiarity, there is this perception that

people might not take the transparency in the spirit that it was intended, and therefore they use it as a stick to beat you with. So you start to be really, really transparent about all your incidents. And I think Monzo found this a little bit: they had this culture, it was absolutely in their DNA.

And every now and then somebody would go, oh, well, you can't use Monzo cause they've got a really flaky platform. And you'd be like, they don't, they're just honest about it. They have the same, probably a less flaky, platform, frankly, than a lot of their competitors. But because they are being open and transparent, the perception is that they've got more problems.

That's a very, very legitimate reason to be quite careful about what you communicate. And that's really difficult, because you are stuck between a rock and a hard place where, for me anyway, every bone in my body is telling me we should be being really honest and open with our customers, and why wouldn't you do that?

What have you got to lose? And then someone's like, oh, that however-many-hundred-thousand-pound deal. And you're like, oh, maybe we do have something to lose. That's a bit more difficult now, you know?

[00:17:17] Pete: Yeah, a hundred percent. And I remember this at Monzo, and there were several instances, some from before I joined and some while I was there. In general, Monzo is incredibly transparent.

So for anyone listening, there's not loads of dark secrets that they've hidden, it's actually the opposite. But you're out there being like, I am 90% sure that every single other bank is having exactly the same problems as this, but they're not talking about it. And then Monzo gets raked over the coals and everyone's like, oh, you can't even prop your bank up.

And we're like, do you have any idea how good a job the team is doing? How amazing this is? We are giving you everything and there's barely anything, and at other places they're just keeping quiet, right? I guess at that point I know what I care about, but when you've got 10 million customers, or 5 million customers in Monzo's case, not everyone's gonna be able to interpret that.

Right. I think it's interesting, cuz that principle applies internally a little bit as well. It's not quite the same, but you could argue, let's assume I'm a senior engineer in a team, and a lot of my identity is that people look up to me and they trust my opinions and I'm a technical authority.

Right? And then you're breaking stuff constantly. Hopefully you're writing good tests and you're testing your work and making sure that it's all doing what it's meant to do. But if you're not breaking things, you're probably not moving quick enough, honestly. And you're gonna have problems.

The question is, when you have them, do you quietly fix them and hide them and go, God, I hope nobody noticed, cuz that would tarnish my social capital? Or do you tell everyone? And whether that's an incident, or whether that's just fessing up and being open and transparent, I think a lot of people really struggle with that.

Honestly, I think it's much easier when you're later in your career. I have very few qualms about doing that, but I'm in a massive position of privilege. I founded this company, I have a lot of trust with my co-founders. There isn't really anyone who's gonna look at me and go, oh, I think Pete's really dropping the ball cuz he's reported like 20 bugs recently.

Well, actually, maybe a lot of that happens in the shadow channels, but everyone picks apart my ancient code from the early days of incident.io on a daily basis.

[00:19:16] Lisa: we pick apart your Twitter typos.

[00:19:18] Pete: Oh man. Yeah. I mean, there's a reason I'm not allowed to tweet from the company account anymore.

And yeah, typos. Does that tweet still exist, Lisa?

[00:19:26] Lisa: All the tweets exist permanently, yeah, because they already had likes, so you can't get rid of them.

[00:19:30] Pete: Well, yeah.

[00:19:30] Charlie: I think there's actually a bit of a competition for listeners coming up here, which is: if you can comment, in whatever form, one of Pete's typos, we will send the first person to do that some incident.io swag.

[00:19:42] Pete: Yeah, a fun activity in the early incident.io days was essentially me trying to tweet something, having not been on Twitter for ages, typoing it like an absolute moron, and then Chris and Steven just rinsing me in our WhatsApp group and going, ha ha ha, immediately liking it and retweeting it so that everyone else likes and retweets it, and then I can't delete it without it being weird.

Anyway, my point was, as an earlier-stage career person, I think this is a really hard thing to do. What you're essentially trying to prove is that you deserve that promotion, you deserve that authority, that responsibility, and at the same time you're going, look at all my failures. It's kind of a similar vibe, I think.

I dunno how you feel, Lisa. I felt a little bit like this, and I had to really get used to it.

[00:20:21] Lisa: I think it's really acute either early in your career or when you join a new company, right? So basically it's about how much social capital you have in the organization. And if the answer is a lot, well, I'm in the same position here that you are.

So we run incidents for basically any exception that comes out of our app, and I write quite a lot of code, and it turns out quite a lot of those exceptions are in one way or another due to something that I have done. And that's a trade-off I'm very comfortable with. It means multiple times a week I will jump into a minor incident channel and be like, ah, yeah, this is me.

Thanks for sorting that out, I appreciate it, whatever it is, right? And that is something I'm very relaxed about, and I'm very confident that the reaction to that is not the team thinking that I'm really bad at my job. If you join a company, you don't have that.

[00:21:10] Charlie: How much of that do you think is down to the culture of declaring lots of incidents?

So ultimately the bar for declaring an incident in our company is pretty low, right? We want lots of incidents and good transparency within those incidents. But with that, does that mean that you are more comfortable with saying, oh, there were six incidents that were my fault this week?

Is this a way that companies that are struggling with this could move to a world where people are even more happy to declare, this was my fault, I screwed this up, this incident was because of me? Or is there something else?

[00:21:44] Lisa: I think, basically, I hope so. I think that's what's happening. There are a couple of different reasons we talk a lot about declaring lots of incidents, and a lot of that is about getting a chance to practice your incident response process, and getting all of that tooling really, really smooth, so that when the really bad thing does happen, you're only trying to do one new thing at a time, right? You're trying to solve whatever specific thing has gone wrong. You're not trying to generally learn how all of your tools fit together, and where the status page is, and does anyone have the login for this bit of monitoring; you're not having to do all of that at the same time.

[00:22:13] Pete: I'm done at this point.

[00:22:16] Lisa: We're just watching Pete's face as I describe incidents that we've both been in. But anyway, that's part of it. Another part of it is definitely this transparency point, which is: if you are really open, and lots of people are really open when they make mistakes, those mistakes become more commonplace.

And then you're not burning as much of your social capital, because everyone around you is doing the same thing. You know, everybody in the team has caused an incident in the last month. I say that, and I think it's probably true in the last three weeks. So it clearly doesn't matter, right? We've not fired any of them because of it.

We're not going to. And I think that if you are one of the people with the social capital, you then have a responsibility to do that, because that's what normalizes it for other people in your organization, and that's why people can then copy that behavior and be confident that they're not gonna get the same kind of blowback, because they've seen that there are other people who can admit that they're wrong.

And one of the worst things that you can ever do, I think, as a leader in one of these orgs is the flip side of that, which is to very publicly come down on someone really hard when they do make an error, even if privately you are going, what on earth happened here, I really didn't think we should have done that.

If you have to have that conversation, you can have it in a different way and do it privately, without it having that impact on the team. Whereas if you publicly mock somebody, shame somebody: humans live on their relationships with other people, right? They feel shame incredibly acutely.

And so if you use that as a tool to reprimand somebody, then you're gonna have a massive impact across all of the other people who've seen that.

[00:23:47] Charlie: Absolutely. And this is half the reason why we do the podcast, because we can publicly mock Pete, rather than for his incidents.

[00:23:54] Pete: That's why my role exists, right, to just be the public, sort of destroyable, face of engineering at incident.io.

It's a role that I relish and do well. One thing on what Lisa said: I think this gets a lot of lip service, so I don't think we need to talk about it much, but there's a lot of good writing out there on blameless incident culture. You'll hear blameless post-mortems being something that gets talked about a lot, or blameless debriefs.

That sort of stuff's also really important. There are several places where you, as a leader or a senior person in a team, can either take advantage of an incident for massive upside, massive positive gain for your team, or you can knock the culture negatively, and you'll set people on a negative spiral as opposed to a positive one.

So it's worth thinking hard about where those are for you. You do it all the time if you're a leader anyway: you're thinking about, where can I encourage more of the behavior I want and discourage the behavior I don't want? And an incident is about as hard as it gets, right?

Everything's on fire, everyone's stressed, no one has any extra emotional buffer, tensions are high. That's where how you react in that moment makes a huge difference. And if you crush your team, or you shit all over the work that they're doing or the approach that they're taking, you guarantee that next time this happens, they're gonna, I dunno, be too stressed.

Yeah, like not do anything, decision paralysis, right? Or you're not gonna hear about it, because they're gonna be too scared to tell you. And then your incident culture is the opposite of all the good stuff that me and Lisa have been trying to say you should be doing for the rest of this podcast.

So yeah, I just wanted to really reinforce that. On a similar note, maybe we could talk about external-facing comms, cause I feel like we've touched on it four or five times now, and maybe we should just dive straight in and spend a few minutes on it. I'm sure that's probably what you were about to do.

Sorry, Charlie.

[00:25:38] Charlie: You've asked your own question, feel free to answer that now.

[00:25:40] Pete: Yeah, asked and answered. I'm not gonna answer, I'm gonna throw it over, but I think we talked a little bit already about the pros and cons of externally managing your incidents well, or communicating well. Maybe a couple of examples of why this matters, briefly, and then we can do some practical tips, since at incident.io we live and breathe this stuff, and it feels like that would be a nice way to wrap up.

So yeah, I dunno if you wanna give your thoughts, and then we can get into a sort of bullet-point list of things that we think people can do, and they can take that away.

[00:26:11] Lisa: Sure. My go-to example on this is Atlassian. Atlassian had an outage a few months ago, and it was a weird-shaped outage.

It was actually really unusual, because it was a complete blackout for a very small number of customers. It was like a few hundred customers, which for Atlassian is obviously a very, very small percentage of their customers.

[00:26:30] Pete: People didn't know that at the beginning, right? It was a bit like, some proportion of their customers have lost everything; you know, it's crazy.

[00:26:35] Lisa: Yeah. I thought that was really interesting, because I spend maybe more time than I should reading engineering Twitter, and engineering Twitter for a week was absolutely fuming. I have almost never seen anything like it. The whole of engineering Twitter was just like, this company is appalling, I can't believe they've done this.

Their engineering practices are so bad, they're not talking to us, and so on. It was really quite vitriolic, the nasty side of the internet, quite frankly. I think during that time Atlassian had been communicating with the affected companies directly, or with some of them, but maybe not all of them.

And also, critically, what they had not done is put anything really meaningful out in public. Their public status page was sort of, something is happening, and then every few hours, yes, we're still working on it. What I thought was really interesting is that about a week into this, their CTO published a really, really long, really detailed article explaining exactly what had happened, what they were doing, what the expected timelines were, kind of all the stuff that we've been talking about.

And the tone of the conversation on Twitter completely changed, from absolute vitriol, these people are the worst people in the world, through to, oh my God, I have so much empathy for this poor person on the other end of this and what they were trying to do. Right, they basically accidentally deleted a bunch of data, and so they were having to restore all of the data from backups, which is like putting Humpty Dumpty back together again.

It's like your worst nightmare. I've had to do this once at my last company, and it really is the stuff of nightmares: sitting there with a database backup, desperately trying to work out which bits of data you need to pull back in to rebuild the whole thing.

[00:28:08] Pete: That's assuming your database backups are even available, and not on some tape in cold storage in a mountain somewhere, right?

It depends what data you've lost and when you lost it.

[00:28:17] Lisa: Oh yeah. As soon as you give people that amount of information and insight into what's going on, suddenly they're much more empathetic, they're much more understanding, and they give you a lot more time. And the fact that it then took Atlassian another week or two to restore service across everyone's accounts wasn't such a problem anymore.

There were questions about exactly what had happened. There was basically a really bad script; it shouldn't have been allowed to run, it was allowed to run, and that had a very, very serious impact for a number of customers. But that's what people were talking about, which is a much healthier conversation than the sort of, who the hell are these people

and they don't know what they're doing. Whereas Atlassian are a very large, successful company, and they do absolutely know what they're doing, but that is not always how the internet perceives it.

[00:29:00] Pete: Yeah, agreed. What can people do? How can they get these positive vibes?

[00:29:04] Lisa: Well, Pete, just as we practiced. It's providing lots and lots of context, or enough, the right amount of context, I guess, is actually a better answer to that, to help minimize speculation and explain to people enough of what's happened for them to have some empathy. Saying something has gone wrong with the big black box, people don't have empathy for that.

It's kind of the same as if your colleague says, I'm taking two weeks off, bye. Even if you have all the best intentions in the world, it's really easy to feel like, oh, they're kind of abandoning us in this time of need. But if they explain what's happening, even at a very light level of detail, all of a sudden you're like, oh my God, off you go.

Take all the time you need, we'll handle it. It's exactly the same when you're communicating. So it's giving people confidence that you know what you're doing, cuz you have actually understood what's happening, and giving them a timeline so that they can adapt their behavior. If I know that I'm not gonna have any of my Jira data for a week or two weeks, I can make a good choice. If it's, my Jira data might turn up tomorrow, or it might turn up never, or anything in between, and I dunno when, that's much more stressful for me.

I can't make good choices, and I'm now gonna start shouting at you until you give me that answer. And so: lots of context, make sure people know that you're on it, give them a clear expectation of when things are going to change, and also what is actually wrong for them right now, so that they don't continually try things that you know are broken, for example.

[00:30:26] Pete: Yeah, a hundred percent. There's not a huge amount I'd add to that, honestly, and a lot of it's the same as what you do internally, it's just contextualizing it to your customers. I think that on the context point, avoiding too much detail is really important. It's frustrating, because often if you provide lots of detail, what happens?

Someone slams you. You mention a particular technology and they're like, what do you mean you can't? You can't even use Kubernetes? And you end up in these conversations where someone picks up on it, like, oh, I would be able to fix this really fast. You've got to be able to put that aside.

The real reason you don't do that is that for most of your customers, it doesn't mean anything. That's the reason you should hold back. And then I think the other thing is being very regular. One of the things I've done before is to keep providing regular updates, or quick updates when you change state, even if the answer is, nothing has really changed, but I'm confirming that nothing has really changed.

Cuz customers don't have the option of just going, oh, well, I'll just join your incident channel to collect the fire hose. If they're not getting the regular updates they want, from a customer's perspective they're blind, and actually the way they'll deal with that is they will hammer your support team, who will then come to you asking for a holding message, so you might as well help 'em out and just put one up there.

Yeah, strong agree with all of that. I don't have much more to add. Great.

[00:31:36] Charlie: Well, thank you both, Lisa and Pete, for your time today. I'm gonna drop into the show notes a link to a talk that Lisa gave a few months back now, Lisa.

[00:31:47] Pete: Yeah, we haven't plugged any people's talks yet, have we? We have not. Go watch Lisa's talk, it's really good.

[00:31:53] Charlie: That's how she got the gig on this pod.

[00:31:57] Lisa: They didn't trust me. I needed to prove myself.

[00:32:01] Charlie: Absolutely not. And then Pete also mentioned the guide that we put together, which is the incident management guide. We'll drop a link to that in the show notes as well.

But thank you both again for joining me on the pod, and we will see you, the audience, again in hopefully a week. I think it's been two weeks since the last one. Apologies, we will get better at this, but thank you all.

[00:32:21] Pete: We got together as a founding team, I think, and we did a little one that we may or may not release.

We need to make a decision on that.

[00:32:30] Lisa: This is what happens when Charlie goes on holiday.

[00:32:32] Pete: I know, we just sort of messed around with the podcast recording stuff for an hour. But no, it's good. Also, I assume, or I'd hope, that at least one or two people are listening to this and...

[00:32:43] Lisa: I'm shouting into the void.

[00:32:45] Pete: Yeah, either that or we're just having a lovely chat. If you are listening, and you have stuff that you'd love to hear us talk about, either from the engineers on the team or the company as a whole, let us know, we'd be really interested. We've got a big list of topics and we'll just churn through them, but if anyone has stuff they'd love to hear more about, let us know and we can dive into that.

[00:33:05] Charlie: And on that bombshell, see you soon.

Charlie Kingston
Product Manager
