Managing incidents in a growing organisation

Podcast
Chris Evans

Summary:

In this week's episode, we're joined by Matt Huxtable, CTO at Ziglu (an e-money issuer, offering a variety of digital finance services, particularly well known for its cryptocurrency services).

Matt talks about how the engineering team at Ziglu has evolved over time, building an agile culture and why "keep it boring" is his mantra.

Chris, Pete and Matt cover how to context switch between solving and communicating during an incident, their most creative incident fixes and why AI isn't ready to solve incidents for us just yet.

Notes:

Matt Huxtable is Chief Technology Officer at Ziglu, a UK and EU regulated e-money issuer, offering a variety of digital finance services. It is particularly well known for its cryptocurrency services.

Chris Evans is co-founder and Chief Product Officer at incident.io. In practice, he covers everything from Customer Success to Sales to Product development. Chris has spent his entire career working in Technology. Starting out as a Software Engineer he later transitioned towards Platform work, most recently as Head of Platform and Reliability at Monzo, where he was also responsible for incident management and on-call.

Pete Hamilton is co-founder and Chief Technology Officer at incident.io. Most of Pete’s time is focussed on Engineering and Product (although he also covers lots of other areas, including Operations, Legal and Finance). He’s worked in Engineering for start-ups and scale-ups for the last decade, starting his career at GoCardless and later moving to Monzo.

Key topics/timestamps:

[01:20] Introducing Matt and Ziglu

[05:45] The evolution of the team at Ziglu

[09:45] Horizontal career growth in Engineering

[12:15] Finding an incident solution at Ziglu

[18:00] Balancing fixing with communicating during an incident

[20:26] Could AI help you solve incidents?

[27:00] Matt’s most creative incident fix

Full script

[00:00:00] Chris: Welcome to this week's incident fm. So this week is a special week. We are gonna be joined by Matt from Ziglu. Ziglu were actually one of our very, very first customers, so, you know, they hold a special place in our heart. And Matt has been a fantastic advocate of the product and company throughout.

So Matt, genuinely, like really, really delighted to have you here.

[00:00:27] Matt: Thank you for having me. Well, it's a pleasure to join you, but it's also been a pleasure to be on the journey of, is it nearly two years now that we've been using incident.io, I think since the private beta?

[00:00:40] Chris: Yeah, exactly.

A long old time. Lots of, lots of stuff's changed. Lots of stuff's happened. And in fact, maybe that's

[00:00:50] Chris: from, from three founders sort of hacking away on a product to being a proper company with proper teams, doing lots of things all over the place.

So yeah, definitely a fun journey for us. But I think also a pretty fun journey for you on your side. So when we first spoke to you, I think you were an engineer at Ziglu, right? And now you are not an engineer. So maybe we could kick off with a bit of a background on Ziglu, the company, and your journey through Ziglu, if that sounds.

[00:01:18] Matt: Yeah, absolutely. Yeah. So Ziglu is a regulated e-money issuer, a FinTech with a vision to be the cool home for your money. But we've had products largely in the crypto space, actually, to begin with. But I joined because I wanted to create opportunities for people to have new ways of seeing their cash.

And we've launched some innovative products over our time. Perhaps right now we're in the middle of a crypto winter, and that's obviously leading to a bit less appetite than there would've been before for those products. But yeah, I'm more on the tech side and see the tech integrations and capabilities as being my bread and butter.

So I joined Ziglu in the height of Covid in June 2020, when most companies were not hiring. But I was looking for a challenge and Ziglu came along. My background's really in kind of platform and infrastructure engineering, so I've held roles as an SRE, SRE team lead and everything that kind of goes along with that.

I think we're now seeing in the industry a split between the SRE hat and the platform and cloud style hat, but I guess I probably bridge both. And I joined Ziglu as a platform engineer, well, as a lead in that team. One of my first, and actually proudest, moments as an engineer was when I came in and on week two, day one, someone said: hi, we'd like you to integrate with MasterCard so we can build a card proposition.

Please go. You have two weeks. And about four months later... We do use a payment processor that sits in the middle, although this isn't a FinTech podcast, so I won't go into all the details. But suffice to say I wanted to come into work in the FinTech space so I could understand some of the fun and career-building opportunities...

I'm picking my words.

[00:03:03] Pete: Yeah. Being very selective about the words you're using,

[00:03:06] Matt: associated with, you know, working with, let's say, APIs that were designed in the eighties and haven't really changed, and basically all this legacy. But it keeps one of the most important things moving, i.e. the economy.

And we all take it for granted that we go into a shop every day, we flash our card on the payment terminal. We, you know, we flash it on the tube barrier. You can do this all around the world as well, you know, and it is this completely interchangeable medium. But how does it work? You know, I've always been that type of person, from the year dot.

In fact, my mum says, oh, you were bored at the age of one. You know, I've wanted to take things apart. I want to understand how they work. You know, I do believe in the engineer's motto that if it's not broken, take it apart and fix it and put it back together again. And I just wanted to see behind the scenes, and this gave me the opportunity to do that, with connections to the multitude of, you know, payment networks and financial networks that exist within the UK.

So yeah, long story short, I came to Ziglu as an engineer. We did a whole bunch of stuff building quite a novel card integration. And then over time, one thing led to another, I ended up heading up the cloud platforms team.

So setting our cloud platform strategy. And actually we did a whole ton of stuff to significantly improve our reliability and modularity, including in January 2021. Just before we adopted incident.io, I remember we did a cloud-to-cloud migration from an old tech to a new tech with about five seconds of downtime, which was a very proud moment in my career.

Nice. Beautiful. And that was basically just to do the cutover, essentially. You know, there's always gonna be a little gap there, but it was minor and nobody noticed 'cause it was 2:00 AM. And then the CTO of the business who was here when I joined, great guy called Hus who I have a ton of respect for.

He decided to move on and the business needed a CTO. And after a lot of searching, I stepped into the role. And stepped into the role at a really interesting time for the company, and it's been a journey to build a team. It's not always been easy. And obviously in this role, it's very much a leadership role.

I've had to take a step back from doing engineering perhaps not as much as I should have done because I still have that desire to like be hands on and write code. And I realized I was trying to do that rather than my actual job sometimes. But, you know, that's, that's how I keep myself fresh and, and current.

But yeah, building a team, setting the technical direction, making some really difficult decisions. Sometimes, you know, my natural tendency is to be everyone's friend. But actually as a leader, you can't necessarily please everyone all the time. You have to do the right thing. But I'm very privileged and humbled to have had the opportunity to do this before even the age of 30, to lead, you know, a business through this and through some of the change that we've gone through over this time.

So what is the,

[00:05:51] Chris: what does the evolution of the team look like alongside you sort of stepping into CTO, and the journey at Ziglu?

[00:05:58] Matt: Yeah, so really the change that we've made in the team, we've made probably three changes. And I always say three and I start enumerating two, and I can guarantee I'll get to the end of this and then I'm gonna say, I forgot what the third was.

That's,

[00:06:11] Chris: that's always the way. Yeah, yeah, yeah. It's all good. We'll help you with the third, we'll make it up on

[00:06:14] Matt: five speed. When I stepped up to CTO, we had a very kind of monolithic, very early culture. The team was split into front end, backend, QA and cloud platforms. Mm-hmm. And there were kind of hard barriers between them.

You know, not quite the old school dev and ops split, because cloud was doing development work and backend was doing operations work, so we didn't have that typical problem. But they all spoke different languages. Well, they all wrote different languages, I should say, but maybe that means that the humans speak different languages.

They don't always understand as well. And there was very much a handoff culture between the teams. And one of the big changes that I wanted to make was to move to the scale-up model where you end up with more integrated teams, more cross-functional teams. Mm-hmm.

I think that's the standard in our industry today. But it didn't make sense in the very early days of the company for it to be built like that, because you put everyone together who speaks the same language, they all have a single mission. They just get on and do it. And the coordination overhead and process overhead perhaps isn't as great as in subsequent times, when you then end up needing to think more deliberately about what it is you're building, you know, respond a lot more to the customer feedback that you're now receiving because you have customers, and have the opportunity to be nimble and respond to emerging business demands.

So really moving to that cross-functional squad structure, combined with rolling out more effective agile.

And I think in our industry, everyone says agile in air quotes, everyone has a different interpretation of agile, and I've certainly worked everywhere from, you know, scaled enterprise agile to, well, agile is just free-for-all animal farm, just-get-on-with-it style thing. But we've very deliberately wanted to empower engineers to have a say, so that their view on how complex something was and whether it was the right solution actually was valuable, because I firmly believe that engineers are not programmers.

There is a distinction. You know, they're not here just to pick tickets up off of a board and move them through some swim lanes and ship the code, but they're actually a critical part of the definition of the problem. And perhaps even if someone else defines the problem, they're a critical part of defining the solution.

And I think we've been very successful in doing that internally. It wasn't the easiest transition. And it certainly can be difficult at times to sell people a story of building an agile culture, when you start to talk about, well, it means that not everything's gonna have a deadline, for example.

You can't necessarily guarantee that you're going to ship on a certain cadence and schedule, but we will ship as early as we possibly can when things are at the quality that they are, and moving that responsibility and capability into the tech team has been, quite honestly, really brilliant. Combined with that, it's reduced my workload significantly, to the point where, when I first took over the team, basically everything, it was very early startup, everything went through me or my head of engineering, and we were very much day-to-day working with who was doing what and being operational.

And now I quite genuinely just sit back sometimes and go, I have kind of an idea of what's going on, but I don't know the ins and outs. I don't know the technical detail. I get involved where I can add the most value or I can contribute to some major architectural decision, but I don't need to be across a change in the UI to change the color of a button, right?

Like, that can just happen in the team, and I trust my team to get on with it as well. That's another word we don't hear very often, but trust is vital. And the third thing, I've actually remembered the three things that I was gonna talk about, we brought in... A result. Yeah. We brought in a separation of engineering management from team leads.

So a belief that I have, and one that I think is shared but not universally, is that as you grow, it's vitally important that you give people the opportunity to grow in their career horizontally, right? I call it horizontally. You need people to be able to remain on the individual contributor track, but to be recognized for their increasingly senior contributions in terms of technical leadership, technical influence, the responsibilities and the accountability that they carry, and ultimately their ability to transform business goals into technical deliverables and technical solutions.

And so many businesses... I'm not sure if this is a UK thing versus a US thing, because the US companies definitely get this horizontal versus vertical promotion track right, in my view, whereas I see it much less in the UK.

But it's so important that people have that opportunity. And I see so many businesses saying, oh, you need to become a manager. You need to manage people. You need to take on all this responsibility.

That is alien to actually what made someone a good engineer in the first place. And by making them a manager, you in many cases take away what they're very good at and the value they're contributing to the company. And I've certainly seen people leave businesses because their only opportunity for promotion was to move into people management, mm-hmm, or they've been given that and they've found it's a poisoned chalice.

So I really wanted to separate those tracks and make sure that there was an opportunity for everyone to move appropriately within the business and where it went. So we've got an engineering management function that I'm incredibly proud of.

And it has actually driven so much change in the business in the six months or so since we've had it in place. And that has really vindicated the decision to go down that track and create opportunities for everyone in the team.

[00:11:32] Pete: Yeah, that's super interesting, Matt, and I guess I couldn't echo more strongly some of the challenges that you've just outlined.

I think the way that we, I guess, built the early team here at incident.io is like, we basically went through a very similar journey to the one you've just described at my previous company, and sort of had a lot of burned fingers and scars and learnings from that, and kind of naturally brought all of that to incident.io, and it made a huge difference here.

So yeah, big plus one to the shape of team you've just outlined. And I think particularly sort of splitting out that, you know, EM and tech lead, not overloading people and kind of making sure that people can follow the journey that feels most natural to them, as opposed to the kind of, you know, I want progression and therefore I must. It's sort of a bit of an anti-pattern, I think, in a lot of teams, and sort of making sure people can do the job that feels amazing to them without having to kind of carry additional baggage is absolutely the, well, I don't know, opinion here, but to me it's absolutely the right one.

So yeah. That's awesome. I guess one thing I'd be interested in is kind of what prompted you to look at incident.io, and was that part of that journey?

Because, for context, for me, like, one of the challenges of starting to split teams is suddenly you introduce this, like, overhead of coordination and communication, and suddenly it's not like you have one team that knows everything. Like, was that a factor at all, or did you come looking to us for totally independent reasons to some of the team evolution stuff you just mentioned?

[00:12:49] Matt: I think when opportunity knocks and someone builds a door, sometimes you walk through that door.

[00:12:54] Pete: Nice. You should come work for our marketing team at some point.

[00:12:57] Matt: I love... I'm not a marketer, unless you want truth and honesty from marketing, which actually is probably the right way to market.

Yeah.

Probably

[00:13:04] Pete: what we should be aiming for. I think

[00:13:07] Matt: that all day. We were early, so we, I think I've done case studies in the past for the website where you did,

[00:13:15] Pete: you were one of our first case studies.

[00:13:17] Matt: We really appreciated it. Actually, I remember this, where we were doing the thing in Slack of saying, we have an incident, we need to get in Python.

Oh, what was the last number? 34. Oh no, that channel's gone. Oh, 35. Oh no, that one's... oh, 36. There you go. And you create the channel and then some wise alec pops up and says, yeah, but incident 36 had an underscore, not a hyphen in it. So you've actually just duplicated the incident number. And yeah, the whole pain of managing incidents was not good.

And I fundamentally believe that incidents are not a bad thing. I think I'm in good company by saying that. You know, incidents are, in my view, either emergent from just complex systems, and you can't predict them, or they're the price you pay for moving fast. And in an agile organization, sometimes you're going to hit problems, but as long as you remain within your overall quality bounds and goals, then that's absolutely okay.

And I think with that thesis in mind, there needs to be an effective way to manage incidents. And we'd looked at and played with the early tool that I know Chris wrote at Monzo, that I knew from when I met Chris at SREcon in Dublin all those years ago, before Covid was a thing. And we'd looked at self-hosting that.

But in an early stage startup, you just don't have the time, yep, to go and spin up special systems that are not part of your core purpose. So while we had experimented, it just hadn't gone anywhere, and nobody really wanted to own running the thing that absolutely had to be mission critical and bulletproof.

But you only called on it when an incident happened, which actually means the SLA on the thing needs to be far higher than your overall system, because you just don't touch it very often. Yeah. Then incident.io came along with the early private beta, and I think it just played very much to my, and my lead cloud engineer Dean's, desire for how we believe companies should own, respond to, and manage incidents.

And, you know, we really wanted to support the mission that you were on, to share that love with the world, I guess, and build a product that actually can achieve that effectively and sustainably, and build a culture within organizations that's the one that we all collectively on this call would be proud to work within.

Yeah, I think there are many aspects to the product that really help us. There are aspects that perhaps we're also a little bit too small to benefit from, just because our incident management practices are perhaps more nascent than you would have in, you know, a significant organization with hundreds of engineers.

Mm-hmm. When you have 20 engineers, you tend to get people wearing multiple hats necessarily, particularly when, say, the incident's at 1:00 AM and you're sort of making that judgment of, do I wake people up or do I respond with fewer people? So there's definitely differences to the incident approach that I would adopt if I was in a 200-person organization, for instance.

But then we also perhaps don't have as many. So, you know, the quantity versus quality of response also becomes a factor here. Mm-hmm. The other thing that I realized was that internally the only signal that people had as to kind of quality of production and quality of response was days since last incident.

I know that sounds like the sort of thing that you would see in, you know, the Simpsons or some other, like, animated comedy, but you know, it's like, oh, zero days, and the ticker ticks back round, and that seemed a bad thing. And largely that was because there was no mechanism for communicating severity of incident, impact of incident, communicating how the process was going through the incident, because it wasn't easy for people to do that.

One of the skills that I have come to value is the quality of written work, particularly over Covid. And I know that many engineers struggle to convey, perhaps, the detail and nuance when they're also in the middle of, like, I need to resolve this problem with code. Oh, but now I need to go and write three paragraphs about what's happening.

And there's a lot of uncertainty and difficulty in conveying that to non-technical stakeholders. Right. Just adopting the tool that encourages you to do the right thing and makes doing the right thing the lowest friction option, yeah, and reminds you that you need to post an update, that's actually really valuable when you're in that high stress situation.

So for all those factors we adopted it. I guess it's run by some great people as well, but don't tell them that. Oh wait, they're on this call.

[00:17:27] Chris: The very best, the very, very best people, I hear, Matt. The thing you were mentioning about, like, days since last incident reminds me, there was an engineering director at Monzo. In the middle of Covid, he used to dial into meetings, and sort of in the background of his video stream was this very scrappy, handwritten note, which was like, days since last pandemic: zero.

And just every time I dialed in I was like, oh, that's the little bit of humor that I need to get me through this situation.

That's fun. That's interesting, that point you make around, you know, folks who are in the middle of, like, fixing things and the tension that exists between that and the need to communicate, because it's something that, like, so we use incident.io at incident.io when we have our own incidents, which is a bit sort of inception, but I find still, with a tool like ours where the prompts are there and, you know, all the right things are kind of happening from a product point of view, it's really, really hard for humans to switch between those two modes.

It is just genuinely tricky. I mean, how do you folks think about that?

Like, how do you, do you find that a problem at all? Is that like a struggle or is it, you know, are you at the point where it's, you know, happening.

[00:18:38] Matt: I think we have a split of people within the team. We have people who are naturally content going and looking in the database and figuring out why this line of code's broken.

And we have people who are content with communicating up. And you sort of get a natural division of labor almost within an incident where, you know, people bias one way or the other. But I would say that one of the challenges that we have is we'll always spin up an incident call for a major incident, and the incident call can move so quickly that it can often be difficult for someone to scribe and to take the notes that they really want to within the call.

And to communicate that back to the incident channel, and finding that balance between understanding why the incident is happening... And as your system gets bigger, well, I guess there's two things that I've noticed have happened. As the system has got bigger, we've had incidents where we've gone, we're not really sure why this is happening.

We perhaps know how to mitigate the impact of it, but it's not sustainable to have someone sat there running a script every five minutes to purge a queue or something. And, you know, you get very much that emergent distributed systems behavior that it can be difficult to get to the bottom of.

How do I communicate something that I don't know? That's tough. And then a personal reflection that I've had, and it was in that same incident when this first occurred, was that I actually can't get involved in the technical details always. I need to actually be able to just back off and trust in the process and the system and the people that are responding.

And I guess that means that I put more emphasis then on receiving the communication and the updates, and, mm-hmm, perhaps at the time we weren't as mature in making those updates, so I was having to chase them a little bit. But fortunately I'm an engineer as well, so when engineers speak to me in engineering language, I can translate that to business stakeholders.

I guess there's a new opportunity here for you to integrate with ChatGPT. You know, /incident, write an update that states this, and yeah, yeah, yeah.

[00:20:32] Pete: Oh, even better. We could go one step further, right? Which is just like, tell us what's wrong and we just proxy through to ChatGPT and say, how do you solve a problem where...?

And then just insert, and then incident.io magically tells you how to, you know, bring your database back online.

[00:20:45] Matt: It's a really interesting one though, because I remember a big turning point in my career. You know, I think... Let's not talk too much about Safety One, Safety Two here, unless we really wanna geek out about resilience engineering.

[00:20:58] Pete: I dunno, that sounds right up Chris's street. Loves a bit of Safety One, Safety Two.

[00:21:00] Chris: We'll lose Pete. Pete will just be like, come on, come on guys.

[00:21:05] Matt: But there is always this tendency of going, what went wrong? Make it never happen again. Mm-hmm. Yeah. And when I discovered Safety Two as a concept, it sort of intuitively made sense.

But one of the phrases I remember from it was actually from an SREcon. I remember exactly the talk. So this isn't my phrase at all, but it really captured my understanding of really what is an incident and why are we having them and what do we do about them. It was when, I think it was someone from Netflix, said in a talk that you, i.e. you working in your company, have the best people to build the thing that you build.

You know, there is no one better on the planet to build incident.io, for instance, and to run it and to operate it and to understand how it works. No one else understands the data model. No one else understands the unique quirks of that customer that is reporting that problem. And that really just almost flips the logic in my head, of, actually, my team is coming to work to do the best job that they possibly can do.

Sometimes that goes wrong. Sometimes you have an incident. But I think, you know, playing to AI, AI can solve the general problem, but until it's very specialized and understands that in terms of a business, does it actually understand the nuances of the, ultimately, special snowflake system that we all have, that's all gonna be different?

Yeah. And will have its own design criteria and quirks and considerations that need to be taken into account.

[00:22:24] Pete: Yeah. I think in particular, it could probably do a pretty good job of, in general principle terms, telling you how to solve the problem.

Right. But it's in the same way that I could tell you, like, first, what do you already know? Second, like, you know, what do we not know that we need to find out? But it's the fact that every single system is so different, and that generally your incidents do not happen because of the really obvious and common.

Your incidents happen because of the edge cases or the unexpecteds. Right. And that's where AI is not specifically designed to necessarily handle that. Right. It's,

[00:22:54] Matt: Novel incidents are good in my view. Yes. Another mantra that I have with my team is keep it boring. I I oh hundred percent abbreviation.

But I don't want to be in a situation where we have to use our superior experience to kind of engineer our way out of the solution or out of a problem. Yeah. Let's just keep it boring and reserve 50% of our mental capacity for when it does go wrong, and we actually have some space to understand why it's broken.

[00:23:18] Pete: Yeah, a hundred percent. Yeah, I think that applies to, yeah, I guess this is what you're saying, but to teams generally, right? It is interesting when you're hiring as well. You get people who, I'll talk to them and they'll be like, so tell me about all the, like, really, really exciting underlying technology that you use at incident.io.

Honestly, I don't want exciting. I don't want my infrastructure to be exciting. I wanna use all of, you know, the team's smarts and capabilities to build amazing stuff for customers. I do not want, like, the fun of running on some incredibly new, novel edge database, and it's like, nah, it's mostly like Postgres, Go, like, you know, TypeScript on the front end.

Obviously there's a lot of thought and a lot of smart stuff that goes into that, but it's like, look, if you're coming here to use the latest and greatest, you know, our job is to reduce incidents and I can tell you exactly where that's gonna lead. So it's kind of, yeah, could not agree more with you.

[00:24:08] Chris: Then I cannot tell you how happy I was when we started this company at the prospect of there not being Kubernetes in the mix. I was just like, I can focus on the bits that I really, really care about and, like, no shade generally on Kubernetes, like, clearly a wildly, wildly successful piece of technology that does incredible things.

But like, I think it's exactly it. It's like you have so many, like, cognitive cycles, mm-hmm, and you want those going on the bit that's most high leverage, and that applies for normal work as much as incidents. You know, either one.

[00:24:38] Matt: But yeah, I absolutely agree in startup culture. It'd be an interesting conversation as to whether you think that you would need to adopt an architecture of that, maybe own your own platform, as you grow and get bigger, mm-hmm, and have more stability over the stack. But I certainly have come to the conclusion that if I want to do fun, novel, exciting things, I'll do it in my home lab. I'll run personal projects of my own. Mm-hmm. And largely, as I've stepped away from being a hands-on engineer, that's also what I've done to keep my skills up.

Yeah. Yeah. I'm doing Advent of Code 2022 at the moment and I'm learning Rust. Oh, nice. Oh, that's cool. I'm actually. Yeah,

[00:25:13] Pete: I I have, I have a lot of respect. I, I started learning rust and, and then sort of like many other evening side project things kind of got squashed by starting a business. But yeah, I've got another friend that carried on.

And occasionally I catch a chat with him and he just, he just raves about it and it's kinda, I, it says a lot of things that sound incredibly smart. And I go, Hmm, yeah, that does sound good. But also , I have no time ,

[00:25:33] Chris: so, no, sounds really, really interesting. I spent my day in a spreadsheet, so

[00:25:38] Pete: I mean, legit, it's like, you know, that's, that's, that's often the response.

Yeah.

[00:25:42] Matt: Yeah, no, that's also really interesting, about how much engineering work is not perhaps what we would've naively imagined before we came into this industry. If we go right back to the start of our careers, I remember being a very junior engineer in a company and thinking, what do all the senior people do?

They just seem to sit in meetings all day. I'm not sure if I thought exactly this way, and I'm being a little bit flippant, but: I can write code quicker than them, so what is this? And it's only over time that you realize, and you gain that humility.

Engineering is not about writing code. Engineering is about solving problems. And in many cases, solving those problems may not involve writing a line of code. Right? But let's understand the problems that humans have. Let's be net positive contributors to society, you know, within the scope of the organizations that we work within, and figure out how we sustainably build that culture moving forward.

And I do fundamentally believe that that's something I want from every engineer that I work with: an understanding of how they can add value every day, and how they can come to work and make the world, the company, the human condition a better place.

[00:26:54] Chris: Oh, wow. We've found the little, you know, soundbite, the nugget for this podcast. Matt preaches. But it's a genuinely really interesting point around solving problems. And it's one of the reasons that we've ended up calling engineers here not backend engineers or frontend engineers; they're all product engineers, because the thing to do is to build the product.

It's like everything is product backwards. You know, engineering is a hundred percent clearly, hugely important, but it's the implementation detail beneath the value that our customers care about and that we care about as a company.

But yeah. So I think a lot of podcasts of this nature often ask the question of, you know, tell me about your worst incident. And I love the stories, but I would like to go in a slightly different direction, and I hope you have a good answer for this, Matt. Otherwise it will fall very flat.

But tweaking that question a little bit. Like is there an incident where you feel like you had an incredibly creative fix to get something like back online, for example?

[00:27:56] Matt: Yeah. Actually, two incidents pop into my head, and they're very different circumstances. One of them is just an interesting story.

The other one I like because it's a three-byte change, the best kind. Three bytes, not lines, of code changed to take a system that was essentially hard down back to operational. So I was working at a company and we had a request-response system with a worker in the backend. And unfortunately, an engineer had refactored the dispatch queue that dispatches tasks off to workers.

And I forget the exact details, but essentially the engineer had gone through and refactored this code. It was quite coupled code, for reasons not of their making, and they were trying to improve that. You know, as we all know, we sometimes write solutions that we are not necessarily proud of, but they achieve the purpose, they get something done, and they're a proof of concept.

And you go back and iterate. Unfortunately, what manifested to us was a saturated system where the throughput just went through the floor. We couldn't get to the bottom of why this commit on main was broken and this one wasn't, and there was quite a big change in there.

And of course we could just roll the change back, but people had built on top of it. At that point we didn't really have a keep-it-green culture, so changes had accrued and it was gonna be very difficult to back it out. Anyway, the long and short of it was that the refactor had dropped, due to some nested function calls, the word go and a space before a function call.

And for anyone who's not familiar with Go, a goroutine is like a lightweight thread that gets scheduled by the Go runtime. Not typing go means that you are going to run that function call on the dispatch thread, rather than spawn a new thread that gets spun up for the purposes of handling that unit of work.

And it was one of those cases where it was so obvious when you found the problem. It was that six stages of debugging: this is impossible, that doesn't happen on my machine, and so on. And then you just see this change that had been dropped and you go, oh, that was it.
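For readers who don't write Go, the effect of dropping those three bytes can be sketched roughly as below. This is a minimal illustration, not the code from the incident: the names dispatch and handle, the task count, and the 20 ms of simulated work are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// handle simulates one unit of work that takes a fixed amount of time.
// (Hypothetical stand-in for whatever the real workers did.)
func handle(wg *sync.WaitGroup) {
	defer wg.Done()
	time.Sleep(20 * time.Millisecond)
}

// dispatch hands n tasks to handle and reports how long the batch took.
// With concurrent=false it mimics the bug: each call runs inline on the
// dispatching goroutine, so the tasks are processed one after another.
func dispatch(n int, concurrent bool) time.Duration {
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		if concurrent {
			go handle(&wg) // the three bytes in question: "go "
		} else {
			handle(&wg) // same call without "go": blocks the dispatcher
		}
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Println("inline on the dispatcher:", dispatch(10, false))
	fmt.Println("one goroutine per task:  ", dispatch(10, true))
}
```

Run as written, the inline version takes roughly ten times as long as the goroutine version, which is the kind of throughput collapse described above.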

It was funny, I was actually talking to my team yesterday, because we were joking about the value that you can add with the fewest lines of code changed in a PR. And I asked them, I said, give me the most impactful incident you've solved, measured as, you know, the fewer bytes that you change, the higher the score, and then multiply it by the impact, where zero is like nothing was wrong and one point nought is the system was hard down.

And I don't wanna, you know, blow my own trumpet, but I think that one scores quite highly, in that we essentially had a three-byte change. The other fun incident that I had, and it was actually in the same system, was where we had a whole bunch of really unfortunate scenarios happen that resulted in one user on the end of a dodgy internet connection who, due to some weird request-response acknowledgements and TCP getting all involved in what it does, and message replays and so on, managed to tear the system down for everyone.

Essentially it was quite similar to your recent blog post, where we had a poison pill. Yeah, with an unfortunate deployment oversight that meant only one replica of the service that processes that thing was running, coupled with a message broker that, if you have a protocol error on its control channel, will tear down the whole connection, but will give you three seconds multiplied by the number of open channels you have to this thing, where a channel is like a connection with a client in our case. So we had this thing just sat there spinning for about 15 minutes before the process died.

And then we could restart, and then of course the poison pill went again, and then it went down again. Yeah. And I remember my incident report; it's probably one of the proudest incident reports that I've written in my career. I spent 24 hours writing this thing. But I went right to the line of code where I could say, oh yeah, that's three seconds times the number of active connections.

Look, this is why it hung for 15 minutes. And it was just, again, I can explain in gory detail exactly what happened, what all of the contributing factors to this incident were, and why the system responded the way it did. And that is so satisfying. It should not be someone on the end of a dodgy 3G connection on a bus in rural Devon that causes your whole system to go unavailable.
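The arithmetic behind that 15 minutes can be sketched in a few lines of Go. This is a back-of-envelope illustration only: the function names are hypothetical, the broker is not named in the conversation, and the figure of 300 channels is simply what "about 15 minutes" implies if the delay scales linearly at three seconds per open channel.

```go
package main

import (
	"fmt"
	"time"
)

// perChannelGrace is the three-second allowance per open channel
// mentioned in the story.
const perChannelGrace = 3 * time.Second

// teardownDelay estimates how long the connection lingers before it is
// torn down, assuming the delay scales linearly with open channels.
func teardownDelay(openChannels int) time.Duration {
	return time.Duration(openChannels) * perChannelGrace
}

// channelsFor inverts the estimate: how many open channels would
// explain an observed hang of the given length.
func channelsFor(hang time.Duration) int {
	return int(hang / perChannelGrace)
}

func main() {
	fmt.Println(teardownDelay(300))            // 15m0s
	fmt.Println(channelsFor(15 * time.Minute)) // 300
}
```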

Those are the most fun incidents that you can talk about after the fact. You go, well, you know, I cut my engineering chops on resolving this.

[00:32:26] Pete: I think being able to track it down, finding that smoking gun, is just so satisfying.

I was talking to one of our engineers who had a problem yesterday, where they're just like, look, I mean, we fixed it and it's all fine, but I don't know why, and so I can't sleep. And it's when you can go, it's definitely this. I remember one of my personal, you know, I-will-never-get-over incidents was quite a significant one internally at Monzo. And I remember doing the debrief with you, Chris, where I'd gone so far off the deep end trying to figure this out that I'd built a traffic modeling simulation in Go Playground.

And we sat in the debrief and I was like, here's literally a visual animation in Go Playground of the traffic scenario I think has happened, but I cannot prove it. I'm like nine out of 10 sure this is what it is, but I will never be able to prove that that's what it was.

And it's just so frustrating. So when you do get that light, you know that light bulb moment. Oh my God. Oh, the numbers work. And that's why it was this, and that's why it was this. It's just like, yeah, I, I

[00:33:31] Matt: yeah, I'm very jealous. I think some people see it as as wasted effort. You've got the system back up and running, why are you doing this?

That's done, right? But of course you don't know when it's going to happen again. And quite often we get lucky in these things; they only impact us for a short time. But it's that innate curiosity in an engineer, and that knowledge and understanding of what normal looks like, to be able to go, that isn't right.

And quite often you get proto-incidents that don't manifest into incidents because someone looked at it and went, hmm, that's odd. Why are we doing this? Why is that traffic going over there? It shouldn't be able to. Or why is that logging that line? That's unusual. And again, it just comes back to people having familiarity with the status quo and what good looks like.

You know, normal is not the word. There is no normal, but I guess there's a steady state in our systems, and then there's the non-steady-state behavior, or elastic and plastic behavior maybe, to steal a physics term.

And it's really important to create the space and the culture for people to go and have those investigations, because it will probably save your bacon in the future. Plus using open source; I'm a big advocate for that. The only reason I found that three-second thing was because we were using a component where I could go and read the source code.

You know, quite often we just hide behind abstractions, but sometimes you actually need to get into the weeds: what pushing the clutch does and how the engine works internally. And even if you can't build one yourself, understanding what it does can help you be a better engineer in the process. Yeah, absolutely.

[00:35:01] Pete: That "why", I think, is the really important bit, which is kind of, you know, are you satisfied with "I know what it did", or do you have that compulsive need to go,

"but I don't know why it did it"? And often what you find out is, actually, the outcome's still the same, but maybe the reason it happened was totally opposite to what you thought. You assumed the system behavior. And it kind of wouldn't have mattered, 'cause the incident's now over.

The problem is when it happens again, you go, oh yeah, I know why that happens. And it's like, no, you don't know why that happens. You know, that queue is backed up, and it's like, oh, that's just because of this. And that's where what can feel like a useful shorthand can suddenly mean that, you know, the classic is someone goes, oh, it's probably this,

and then half the incident is spent looking in the wrong place. And when you've got someone that does that real, oh no, I want to understand the system, then actually spreading that knowledge of how the system works is the thing that mitigates the next incident, not knowing what happened and that it got better.

[00:35:53] Matt: Exactly. Yeah. I realize we're short on time, but I really like the concept of the gamma knife. So for anyone not familiar, essentially, I think we all, I mean present company won't see it like this, but we all have this tendency as humans in post-analysis to go, oh, this happened because of this, and then that led to that.

And you tell a very linear chain of events. But actually life doesn't work that way. You know, humans are all distributed systems, we're all engineers. We are all sort of only partially ordered with each other. And there's this concept of the gamma knife; I can't remember who came up with it.

Essentially, see every action that someone does as a zap of radiation just firing off. And most of the time they don't align, right? Someone deployed that: zap. Someone logged into the production database to run a query: zap. Most of the time they run SELECT * FROM users; and nothing goes wrong, but occasionally they do an UPDATE and they screw the WHERE clause up and they update everything.

And, you know, all these are little zaps of radiation that are going on. A gamma knife is a concept for treating, for example, brain tumors, where if you focus all that radiation on one point, then it can have a much greater impact. And essentially, I like to see incidents as when all of that radiation concentrates. It was all still there.

It was all still happening. Yeah. But most of the time your people are creating conditions for success and it's not aligning, and it's only when all of those stars align, which is a sort of very linear way of looking at it, but only when all of those zaps align, that suddenly you go, oh, that's an incident.

And we need to go and look at it. And I think that changes the logic, and certainly how perhaps non-technical actors often look at it, that it's: how can we prevent this ever happening again by stopping people, and assuming that it's human error that led to it. You still need people to be able to perform those actions in their day-to-day job, but it's about them having that understanding of how it could have a wider impact than just the objective that they're looking at.

Kind of, you know, think outside the box: what's the problem that could happen here? And only through having an understanding of the system and how it fits together, and having a well-designed system, can you start to comprehend that.

I love that.

[00:37:54] Chris: Yeah. It fits really naturally with my mental model of how these things emerge, which is just, you know, the best incidents are the ones where, sort of continuing that, you've got these zaps, and you've got people who have independently, in their day job, been curious about what underpins those little independent events that are emergent across your whole system.

And it's like the intuition leads to experience. 'Cause people are like, cool, I've seen how this works when things work, right? And then that experience can be extrapolated into intuition. So when something then goes wrong that's slightly outside the immediacy of those things, you're like, cool.

Well, I'm up to level nine with my understanding, and so now I've just gotta make this little extra leap, which is often quite easy. And it's those things where, if you weren't curious in the first place and you didn't really understand what was happening in these independent things, you are in an absolute world of pain.

Yes, and I have definitely been there in the past. But then incidents are this fantastic way of, even when you've had that pain and you're like, okay, no one was curious or no one even knew that thing existed, they're such a good spotlight to go and dig into those things and go, cool, that's a new vector that we can explore where we can get very good at this thing.

And then next time, something in the domain of whatever it might be is gonna be that little bit easier to deal with. Absolutely. But yes. Listen, we are rapidly running out of time, and I feel like I could chat for hours, so maybe we get you back at some point in the future, Matt, and we can continue the conversation.

Nice. Well listen, thanks Matt from Ziglu. Genuinely appreciate you taking the time. It's been a lot of fun and we'll chat soon.

[00:39:22] Matt: Yep. Thank you for having me, and it's been a pleasure. Yeah, likewise. Thanks so much, Matt.

So Matt, genuinely, like really, really delighted to have you here.

[00:00:27] Matt: Thank you for having me. It's a pleasure to join you, but it's also been a pleasure to be on the journey of, is it nearly two years now that we've been using incident.io, since the private beta? I think so.

[00:00:40] Chris: yeah, exactly.

A long old time. Lots of, lots of stuff's changed. Lots of stuff's happened.

[00:00:50] Chris: From three founders sort of hacking away on a product to being a proper company with proper teams, doing lots of things all over the place.

So yeah, definitely a fun journey for us. But I think also a pretty fun journey for you on your side. So when we first spoke to you, I think you were an engineer at Ziglu, right? And now you are not an engineer. So maybe we could kick off with a bit of background on Ziglu, the company, and your journey through Ziglu, if that sounds good.

[00:01:18] Matt: Yeah, absolutely. So Ziglu is a regulated e-money issuer, a FinTech with a vision to be the cool home for your money. We've had products largely in the crypto space, actually, to begin with. But I joined because I wanted to create opportunities for people to have new ways of seeing their cash.

And we've, we've launched some innovative products over our time. Perhaps right now we're in the middle of a crypto winter, and that's obviously leading to a bit less appetite than than there would've been before for those products. But yeah, just to, I'm more on the tech side and more see the, the, the tech integrations and capabilities as being my bread and butter.

So I joined Ziglu in the height of Covid, in June 2020, when most companies were not hiring. But I was looking for a challenge and Ziglu came along. My background's really in platform and infrastructure engineering, so I've held roles as SRE, SRE team lead, and everything that goes along with that.

I think we're now seeing in the industry a split between the SRE hat and the platform and cloud style hat, but I guess I probably bridge both. And I joined Ziglu as a platform engineer, as a lead in that team. One of my first and actually proudest moments as an engineer was when I came in and, on week two, day one, someone said: hi, we'd like you to integrate with MasterCard so we can build a card proposition.

Please go. You have two weeks. And about four months later... we do use a payment processor that sits in the middle, although this isn't a FinTech podcast, so I won't go into all the details. But suffice to say, I wanted to come and work in the FinTech space so I could understand some of the fun and career-building opportunities...

I'm picking my words.

[00:03:03] Pete: Yeah. Being very selective about the words you're using,

[00:03:06] Matt: ...associated with, you know, working with, let's say, APIs that were designed in the eighties and haven't really changed, and basically all this legacy. But it keeps one of the most important things moving, i.e. the economy.

And we all take it for granted that we go into a shop every day and flash our card on the payment terminal. We flash it on the tube barrier. You can do this all around the world as well, and it is this completely interchangeable medium. But how does it work? You know, I've always been that type of person, from the year dot.

In fact, my mum says, oh, you were bored at the age of one. You know, I've wanted to take things apart. I want to understand how they work. I do believe in the engineer's motto that if it's not broken, take it apart and fix it and put it back together again. And I just wanted to see behind the scenes, and this gave me the opportunity to do that with connections to the full gamut of, you know, payment networks and financial networks that exist within the UK.

So yeah, long story short, I came to Ziglu as an engineer. We did a whole bunch of stuff building quite a novel card integration. And then over time, one thing led to another, and I ended up heading up the cloud platforms team.

So, setting our cloud platform strategy. And actually we did a whole ton of stuff to significantly improve our reliability and modularity, including, in January 2021, just before we adopted incident.io, a cloud-to-cloud migration from an old tech to a new tech with about five seconds of downtime, which was a very proud moment in my career.

Nice. Beautiful. And that was basically just to do the cutover, essentially. You know, there's always gonna be a little gap there, but it was minor and nobody noticed 'cause it was 2:00 AM. And then the CTO of the business who was here when I joined, a great guy called hus who I have a ton of respect for,

decided to move on, and the business needed a CTO. And after a lot of searching, I stepped into the role, at a really interesting time for the company, and it's been a journey to build a team. It's not always been easy. And obviously this is very much a leadership role.

I've had to take a step back from doing engineering, perhaps not as much as I should have done, because I still have that desire to be hands-on and write code. And I realized I was trying to do that rather than my actual job sometimes. But, you know, that's how I keep myself fresh and current.

But yeah, building a team, setting the technical direction, making some really difficult decisions. Sometimes, you know, my natural tendency is to be everyone's friend. But actually as a leader, you can't necessarily please everyone all the time. You have to do the right thing. But I'm very privileged and humbled to have had the opportunity to do this before even the age of 30, to lead a business through this and through some of the change that we've gone through over this time.

[00:05:51] Chris: So what does the evolution of the team look like, alongside you stepping into CTO and the journey at Ziglu?

[00:05:58] Matt: Yeah, so the change that we've made in the team, we've made probably three changes. And I always say three and I start enumerating two, and I can guarantee I'll get to the end of this and then I'm gonna say, I forgot what the third is.

[00:06:11] Chris: That's always the way. Yeah, it's all good. We'll help you with the third, we'll make it up on the fly.

[00:06:14] Matt: When I stepped up to CTO, we had a very monolithic, very early-stage culture. The team was split into front end, backend, QA and cloud platforms, and there were kind of hard barriers between them.

You know, not quite the old-school dev and ops split, because cloud was doing development work and backend was doing operations work, so we didn't have that typical problem. But they all spoke different languages. Well, they all wrote different languages, I should say, but maybe that means that the humans speak different languages.

They don't always understand each other as well. And there was very much a handoff culture between the teams. And one of the big changes that I wanted to make was to move to the scale-up model where you end up with more integrated teams, more cross-functional teams.

I think that's the standard in our industry today. But it didn't make sense in the very early days of the company for it to be built like that, because you put everyone together who speaks the same language, they all have a single mission, and they just get on and do it. And the coordination overhead and process overhead perhaps isn't as great as in subsequent times, when you then end up needing to think more deliberately about what it is you're building, respond a lot more to the customer feedback that you're now receiving because you have customers, and have the opportunity to be nimble and respond to emerging business demands.

So really, moving to that cross-functional squad structure, combined with rolling out more effective agile.

And I think in our industry, everyone says agile in air quotes. Everyone has a different interpretation of agile, and I've certainly worked everywhere from, you know, scaled enterprise agile to, well, agile is just a free-for-all, Animal Farm, just-get-on-with-it style thing. But we very deliberately wanted to empower engineers to have a say, so that their view on how complex something was and whether it was the right solution actually was valuable, because I firmly believe that engineers are not programmers.

There is a distinction. You know, they're not here just to pick tickets up off of a board and move them through some swim lanes and ship the code; they're actually a critical part of the definition of the problem. And perhaps even if someone else defines the problem, they're a critical part of defining the solution.

And I think we've been very successful in doing that internally. It wasn't the easiest transition. And it certainly can be difficult at times to sell people the story of building an agile culture when you start to talk about, well, it means that not everything's gonna have a deadline, for example.

You can't necessarily guarantee that you're going to ship on a certain cadence and schedule, but we will ship as early as we possibly can when things are at the quality that they need to be. And moving that responsibility and capability into the tech team has been, quite honestly, really brilliant. Combined with that, it's reduced my workload significantly. When I first took over the team, it was very early startup; basically everything went through me or my head of engineering, and we were very much day-to-day working with who was doing what, and being operational.

And now I quite genuinely just sit back sometimes and go, I have kind of an idea of what's going on, but I don't know the ins and outs. I don't know the technical detail. I get involved where I can add the most value or where I can contribute to some major architectural decision, but I don't need to be across a change in the UI to change the color of a button, right?

Like, that can just happen in the team, and I trust my team to get on with it as well. That's another word we don't hear very often, but trust is vital. And the third thing, I've actually remembered the three things that I was gonna talk about: we brought in a separation of engineering management from team leads.

So, a belief that I have, and one that I think is shared but not universally, is that as you grow, it's vitally important that you give people the opportunity to grow in their career horizontally, right? I call it horizontally. You need people to be able to remain on the individual contributor track, but to be recognized for their increasingly senior contributions in terms of technical leadership, technical influence, the responsibilities and the accountability that they carry, and ultimately their ability to transform business goals into technical deliverables and technical solutions.

And I'm not sure if this is a UK thing, because the US companies definitely get this horizontal versus vertical promotion track right, in my view, whereas I see it much less in the UK. But it's so important that people have that opportunity. And I see so many businesses saying, oh, you need to become a manager. You need to manage people. You need to take on all this responsibility.

That is orthogonal to what actually made someone a good engineer in the first place. And by making them a manager, you in many cases take away what they're very good at and the value they're contributing to the company. And I've certainly seen people leave businesses because their only opportunity for promotion was to move into people management, or they've been given that and they've found it's a poisoned chalice.

So I really wanted to separate those tracks and make sure that there was an opportunity for everyone to move appropriately within the business. So we've got an engineering management function that I'm incredibly proud of.

And it has actually driven so much change in the business in the six months or so since we've had it in place. And that has really vindicated the decision to go down that track and create opportunities for everyone in the team.

[00:11:32] Pete: Yeah, that's super interesting, Matt, and I couldn't echo more strongly some of the challenges that you've just outlined.

I think the way that we built the early team here at incident.io is that I basically went through a very similar journey to the one you've just described at my previous company, had a lot of burned fingers and scars and learnings from that, and kind of naturally brought all of that to incident.io. It made a huge difference here.

So yeah, big plus one to the shape of team you've just outlined. And I think particularly splitting out that EM and tech lead role, not overloading people, and making sure that people can follow the journey that feels most natural to them, as opposed to the kind of, you know, "I want progression and therefore I must". That's a bit of an anti-pattern, I think, in a lot of teams. Making sure people can do the job that feels amazing to them without having to carry additional baggage is, well, I don't know about everyone's opinion here, but to me absolutely the right one.

So yeah, that's awesome. I guess one thing I'd be interested in is what prompted you to look at incident.io, and was that part of that journey?

Because, for context, for me, one of the challenges of starting to split teams is that suddenly you introduce this overhead of coordination and communication, and suddenly it's not like you have one team that knows everything. Was that a factor at all, or did you come looking to us for reasons totally independent of the team evolution stuff you just mentioned?

[00:12:49] Matt: I think when opportunity knocks and someone builds a door, sometimes you walk through that door.

[00:12:54] Pete: Nice. You should come work for our marketing team at some point.

[00:12:57] Matt: I'm not a marketer, unless you want truth and honesty from marketing, which actually is probably the right way to market.

Yeah.

Probably

[00:13:04] Pete: what we should be aiming for. I think

[00:13:07] Matt: that all day. We were early, so we, I think I've done case studies in the past for the. Website where you did,

[00:13:15] Pete: you were one of our first case studies.

[00:13:17] Matt: We really appreciated it. Iactually, I remember this where we were doing the thing in Slack of saying, we have an incident, we need to get in Python.

Oh, what was the last number? 34. Oh no, that channel's gone. Oh, 35. Oh no, that one's, oh, 36. There you go. And you create the channel and then some smart alec pops up and says, yeah, but incident 36 had an underscore, not a hyphen in it. So you've actually just duplicated the incident number. And yeah, the whole pain of managing incidents was not good.

And I fundamentally believe that incidents are not a bad thing. I think I'm in good company by saying that, you know, incidents are, in my view, either emergent from just complex systems and you can't predict them, or they're the price you pay for moving fast. And in an agile organization, sometimes you're going to hit problems, but as long as you remain within your overall quality bounds and goals, then that's absolutely okay.

And I think with that thesis in mind, there needs to be an effective way to manage incidents. And we'd looked at and played with the early tool that I know Chris wrote at Monzo, that I knew from when I met Chris at SREcon in Dublin all those years ago, before Covid was a thing. And we'd looked at self-hosting that.

But in an early stage startup, you just don't have the time to go and spin up special systems that are not part of your core purpose. So while we had experimented, it just hadn't gone anywhere, and nobody really wanted to own running the thing that absolutely had to be mission critical and bulletproof, but that you only called on when an incident happened, which actually means the SLA on the thing needs to be far higher than your overall system, because you just don't touch it very often.

Then incident.io came along with the early private beta, and I think it just played very much to my, and my lead cloud engineer Dean's, desire for how we believe companies should own, respond to, and manage incidents.

And, you know, we really wanted to support the mission that you were on to share that love with the world, I guess, and build a product that actually can achieve that effectively and sustainably, and build a culture within organizations that's the one that we all collectively on this call would be proud to work within.

Yeah, I think there are many aspects to the product that really help us. There are aspects that perhaps we're also a little bit too small to benefit from, just because our incident management practices are perhaps more nascent than you would have in a, you know, a significant organization with hundreds of engineers.

When you have 20 engineers, you tend to get people wearing multiple hats necessarily, particularly when, say, the incident's at 1:00 AM and you're sort of making that judgment of, do I wake people up or do I respond with fewer people? So there are definitely differences to the incident approach that I would adopt if I was in a 200-person organization, for instance.

But then we also perhaps don't have as many. So, you know, the quantity versus quality of response also becomes a factor here. The other thing that I realized was that internally the only signal that people had as to kind of quality of production and quality of response was days since last incident.

I know that sounds like the sort of thing that you would see in, you know, the Simpsons or some other animated comedy, but you know, it's like, oh, zero days, and the ticker ticks back round, and that seemed a bad thing. And largely that was because there was no mechanism for communicating severity of incident, impact of incident, communicating how the process was going through the incident, because it wasn't easy for people to do that.

One of the skills that I have come to value is the quality of written work, particularly over Covid. And I know that many engineers struggle to convey, perhaps, the detail and nuance when they're also in the middle of, like, I need to resolve this problem with code. Oh, but now I need to go and write three paragraphs about what's happening.

And there's a lot of uncertainty and difficulty to convey that to non-technical stakeholders. Just adopting the tool that encourages you to do the right thing, and makes doing the right thing the lowest friction option, and reminds you that you need to post an update, that's actually really valuable when you're in that high stress situation.

So for all those factors we adopted it. I guess it's run by some great people as well, but don't tell them that. Oh wait, they're on this call.

[00:17:27] Chris: The very best, the very, very best people, I hear, Matt. The thing you were mentioning about, like, days since last incident reminds me, there was an engineering director at Monzo, middle of Covid, he used to dial into meetings and sort of in the background of his video stream was this very scrappy, handwritten note, which was like, days since last pandemic, zero. And just every time I dialed in I was like, oh, that's the little bit of humor that I need to get me through this situation.

That's fun. That's interesting, that point you make around, you know, folks who are in the middle of, like, fixing things, and the tension that exists between that and the need to communicate, because it's something that, like, so we use incident.io at incident.io when we have our own incidents, which is a bit sort of Inception, but

I, I find still with, with a tool like ours where the prompts are there and stuff you know, all the right things are kind of happening from a product point of view. It's really, really hard for humans to switch between those, those two modes. It, it is just genuinely tricky. Is that something, how do you, I mean, how do you folks think about that?

Like, how do you, do you find that a problem at all? Is that like a struggle or is it, you know, are you at the point where it's, you know, happening.

[00:18:38] Matt: I think we have a split of people within the team. We have people who are naturally content going and looking in the database and figuring out why this line of code's broken.

And we have people who are content with communicating up. And you sort of get a natural division of labor almost within an incident, where people bias one way or the other. But I would say that one of the challenges that we have is we'll always spin up an incident call for a major incident, and the incident call can move so quickly that it can often be difficult for someone to scribe and to take the notes that they really want to within the call, and to communicate that back to the incident channel.

And it's finding that balance between understanding why the incident is happening and, as your system gets bigger, well, I guess there's two things that have happened. I've noticed as the system has got bigger, we've had incidents where we've gone, we're not really sure why this is happening.

We perhaps know how to mitigate the impact of it, but we, it's not sustainable to have someone sat there running a script every five minutes to purge a queue or something. And, you know, you, you get very much that emergent distributed systems behavior that it can be difficult to get to the bottom of.

How do I communicate something that I don't know? That's tough. And then a personal reflection that I've had, and it was in that same incident when this first occurred, was that I actually can't get involved in the technical details always. I need to actually be able to just back off and trust in the process and the system and the people that are responding.

And I guess that means that I put more emphasis then on receiving the communication and the updates, and perhaps at the time we weren't as mature in making those updates, so I was having to chase them a little bit. But fortunately I'm an engineer as well, so when engineers speak to me in engineering language, I can translate that to business stakeholders.

I guess there's a new opportunity here for you to integrate with ChatGPT. You know, slash incident, write an update that states this and, yeah.

[00:20:32] Pete: Oh, even better. We could go one step further, right? Which is just like, tell us what's wrong and we just proxy through to ChatGPT and say, how do you solve a problem where, and then just insert, and then incident.io magically tells you how to, you know, bring your database back online.

[00:20:45] Matt: It's a really interesting one though, because I remember a big turning point in my career. You know, let's not talk too much about Safety-I, Safety-II here, unless we really wanna geek out about resilience engineering.

[00:20:58] Pete: I dunno, that sounds right up Chris's street. Loves a bit of Safety-I, Safety-II.

[00:21:00] Chris: We'll lose Pete. Pete will just be like, come on, come on guys.

[00:21:05] Matt: But there is always this tendency of going, what went wrong? Make it never happen again. And when I discovered Safety-II as a concept, it sort of intuitively made sense.

But one of the phrases I remember from it was actually from an SREcon. I remember exactly the talk. So this isn't my phrase at all, but it really captured my understanding of really what is an incident, and why are we having them, and what do we do about them. It was when, I think it was someone from Netflix, said in a talk that you, i.e. you working in your company, have the best people to build the thing that you build.

You know, there is no one better on the planet to build an incident.io, for instance, and to run it and to operate it and to understand how it works. No one else understands the data model. No one else understands the unique quirks of that customer that is reporting that problem. And that really just almost flips the logic in my head, of actually my team is coming to work to do the best job that they possibly can do.

Sometimes that goes wrong. Sometimes you have an incident. But I think, you know, playing to AI, AI can solve the general problem, but until it's very specialized and understands that in terms of a business, does it actually understand the nuances of the, ultimately, special snowflake system that we all have, that's all gonna be different?

And will have its own design criteria and quirks and considerations that need to be taken into account.

[00:22:24] Pete: Yeah. I think in particular, it could probably do a pretty good job of, in general principle terms, telling you how to solve the problem. Right. But it's in the same way that I could tell you, like, first, what do you already know? Second, like, you know, what do we not know that we need to find out? But it's the fact that every single system is so different, and that generally your incidents do not happen because of the really obvious and common. Your incidents happen because of the edge cases or the unexpecteds. Right. And that's where AI is not specifically designed to necessarily handle that. Right. It's,

[00:22:54] Matt: Novel incidents are good in my view. Yes. Another mantra that I have with my team is keep it boring. I I oh hundred percent abbreviation.

But I don't want to be in a situation where we have to use our superior experience to kind of engineer our way out of a problem. Let's just keep it boring and reserve 50% of our mental capacity for when it does go wrong, and we actually have some space to understand why it's broken.

[00:23:18] Pete: Yeah, a hundred percent. I think that applies to, yeah, I guess this is what you're saying, but to teams generally, right? It is interesting when you're hiring as well. You get people who, I'll talk to them and they'll be like, so tell me about all the, like, really, really exciting underlying technology that you use at incident.io.

Honestly, I don't want exciting. I don't want my infrastructure to be exciting. I wanna use all of the team's smarts and capabilities to build amazing stuff for customers. I do not want, like, the fun of running on some incredibly new, novel edge database. And it's like, nah, it's mostly, like, Postgres, Go, like, you know, TypeScript on the front end.

Obviously there's a lot of thought and a lot of smart stuff that goes into that, but it's like, look, if you're coming here to use the latest and greatest, you know, our job is to reduce incidents and I can tell you exactly where that's gonna lead. So yeah, could not agree more with you.

[00:24:08] Chris: I cannot tell you how happy I was when we started this company at the prospect of there not being Kubernetes in the mix. I was just like, I can focus on the bits that I really, really care about. And like, no shade generally on Kubernetes, it's clearly a wildly, wildly successful piece of technology and, like, does incredible things.

But I think it's exactly it. It's like you have so many, like, cognitive cycles, and you want those going on the bit that's most high leverage, and that applies for normal work as much as incidents. You know, either one.

[00:24:38] Matt: But yeah, I absolutely agree in startup culture. It'd be an interesting conversation as to whether you think that you would need to adopt an architecture of that, maybe own your own platform as you grow and get bigger and have more stability over the stack. But I certainly have come to the conclusion that if I want to do fun, novel, exciting things, I'll do it in my home lab. I'll run personal projects of my own. And largely, as I've stepped away from being a hands-on engineer, that's also what I've done to keep my skills up.

I'm doing Advent of Code 2022 at the moment and I'm learning Rust.

[00:25:13] Pete: Oh, nice. Oh, that's cool. Yeah, I have a lot of respect. I started learning Rust, and then, sort of like many other evening side project things, it kind of got squashed by starting a business. But yeah, I've got another friend that carried on.

And occasionally I catch a chat with him and he just raves about it, and he says a lot of things that sound incredibly smart. And I go, hmm, yeah, that does sound good. But also, I have no time.

[00:25:33] Chris: so, no, sounds really, really interesting. I spent my day in a spreadsheet, so

[00:25:38] Pete: I mean, legit, it's like, you know, that's, that's, that's often the response.

Yeah.

[00:25:42] Matt: Yeah, no, that's also really interesting, about how much engineering work is not perhaps what we would've naively imagined before we came into this industry. If we go right back to the start of our careers, I remember being a very junior engineer in a company and thinking, what do all the senior people do?

They just seem to sit in meetings all day. I'm not sure if I thought exactly this way, you know, I'm being a little bit flippant, but I can write code quicker than them. What is this? And it's only over time that you suddenly realize that, and you gain that humility.

Engineering is not about writing code. Engineering is about solving problems, and in many cases, solving those problems may not involve writing a line of code. But let's understand the problems that humans have. Let's be net positive contributors to society, you know, within the scope of the organizations that we work within, and figure out how we sustainably build that culture moving forward. And I do fundamentally believe that that's something I want from every engineer that I work with. You know, an understanding of how they can add value every day, and how they can come to work and make the world, the company, the human condition better.

[00:26:54] Chris: Yeah. Oh wow. We've found the little, like, you know, sound bite, the nugget for this podcast. Matt preaches. But it's a genuinely really interesting point, that around, like, solving problems. And it's one of the reasons that we've ended up calling engineers here

not, like, backend engineers, frontend engineers. They're all product engineers, because the thing to do is to build the product. It's like everything is product backwards. Like, you know, engineering is a hundred percent clearly, hugely important, but it's like the implementation detail beneath the value that our customers care about and that we care about as a company.

But yeah. So I think a lot of podcasts of this nature, they often ask the question of, like, you know, tell me about your worst incident. And I love the stories, but I would like to go in a slightly different direction, and I hope you have a good answer for this, Matt. Otherwise it will fall very flat.

But tweaking that question a little bit. Like is there an incident where you feel like you had an incredibly creative fix to get something like back online, for example?

[00:27:56] Matt: Yeah. Actually two incidents pop into my head, and they're very different circumstances. One of 'em is just an interesting story.

The other one I like because it's a three-byte change, the best kind. Three bytes, not lines, of code changed to take a system that was essentially hard down back to operational. So I was working at a company and we had a request-response system with a worker in the backend. And unfortunately, an engineer had refactored the dispatch queue that dispatches tasks off to workers.

And I forget the exact details. But essentially the engineer had gone through, they'd refactored this code. It was quite coupled code for reasons not of their making, and they were trying to improve that. You know, as we all know, we sometimes write solutions that we are not necessarily proud of, but they achieve the purpose and they get something done and they're a proof of concept.

And you go back and iterate. And unfortunately what manifested to us was a saturated system where the throughput just went through the floor. And we couldn't get to the bottom of, like, why is this commit on main broken and this one isn't? And there was quite a big change in there.

And of course we could just roll the change back, but people had built on top of it. We at that point didn't have a really some green culture, so changes had accrued and it was gonna be very difficult to back it out. Anyway, the long and short of it was they had dropped, due to some nested function calls, the word go and a space before a function call.

And for anyone who's not familiar with Go, a goroutine is like a lightweight thread that gets scheduled by the Go runtime. And not typing go means that you are going to run that function call on the dispatch thread, rather than spawn a new thread that will get spun up for the purposes of handling that unit of work.

And it was one of those cases where it was so obvious when you found the problem, and it was that six stages of debugging of, this is impossible, and that doesn't happen on my machine, and so on. And then you just see this change that had been dropped and you go, oh, that was it.

It was funny, I was actually talking to my team, because we were joking yesterday about the value that you can add with the fewest lines of code changed in a PR. And I asked them, I said, give me the most impactful incident you've solved, measured as, you know, the fewer bytes that you change, the higher the score, and then multiply it by the impact, where zero is, like, nothing was wrong and 1.0 is the system was hard down.
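Matt's scoring game is only loosely specified, so the following is one possible reading rather than his team's actual formula: impact on a 0 to 1 scale divided by the number of bytes changed. The function name `fixScore` and the choice of a simple inverse are assumptions made for illustration.

```go
package main

import "fmt"

// fixScore is a hypothetical scoring function: impact runs from 0.0
// (nothing was wrong) to 1.0 (system hard down), and fewer changed bytes
// give a higher score. Dividing by the byte count is an assumption.
func fixScore(impact float64, bytesChanged int) float64 {
	if bytesChanged <= 0 {
		return 0 // no change means no fix to score
	}
	return impact / float64(bytesChanged)
}

func main() {
	// A three-byte fix for a hard-down system.
	fmt.Printf("three-byte fix: %.3f\n", fixScore(1.0, 3))
	// The same impact fixed with a 300-byte change scores far lower.
	fmt.Printf("300-byte fix:   %.3f\n", fixScore(1.0, 300))
}
```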

And I don't wanna, you know, blow my own trumpet, but I think that one scores quite highly, that we essentially had a, yeah, like, three-byte change.

The other fun incident that I had, and it was actually in the same system, was where we had a whole bunch of really unfortunate scenarios happen that resulted in one user on the end of a dodgy internet connection, due to some weird request-response acknowledgements and TCP getting all involved in what it does and message replays and so on, where they managed to tear the system down for everyone. Essentially it was, God, similar to your recent blog post, where we had a poison pill, with an unfortunate deployment oversight that meant only one replica of a service to process that thing was running, coupled with a message broker that, if you have a protocol error on its control channel, will tear down the whole connection, but it will give you three seconds multiplied by the number of open channels you have to this thing, which channels, like, with a client in our case. So we just had this thing just sat there spinning for about 15 minutes before the process died.

And then we could rew and then of course the poison pill went again, and then, denied itself. Yeah. And I remember my incident report, it's probably one of the proudest incident reports that I've written in my career. It was like, I spent 24 hours writing this thing. But I went right to the line of code where I could say, oh yeah, that's three seconds times the number of active connections.

Look, this is why it hung for 15 minutes. And it was just, again, I can explain in gory detail exactly what happened, and what all of the contributing factors to this incident were, and why the system responded the way it did. And it's so satisfying. It should not be someone on the end of a dodgy 3G connection on a bus in rural Devon that causes your whole system to go unavailable.

Those are the most fun incidents that you can talk about after the fact. You go, well, you know, I cut my engineering chops on resolving this.

[00:32:26] Pete: I think that being able to track it down to the, like, it's like finding that smoking gun is just so satisfying.

I was talking to one of our engineers who had a problem yesterday, where they're just like, look, I mean, we fixed it and it's all fine, but I don't know why, and so I can't sleep. And it's when you can go, it's definitely this. I remember one of my personal, like, you know, I-will-never-get-over incidents was a quite significant one internally at Monzo. And I remember doing the debrief with you, Chris, where I'd gone so, so deep off the deep end on trying to figure this out, and I'd built, like, a traffic modeling simulation in, like, Go Playground.

And we sat in the debrief and I was like, here's literally a visual KY animation in Go Playground of the traffic scenario I think has happened, but I cannot prove it. I'm, like, nine out of 10 sure this is what it is, but I will never be able to prove that that's what it was.

And it's just so frustrating. So when you do get that light, you know that light bulb moment. Oh my God. Oh, the numbers work. And that's why it was this, and that's why it was this. It's just like, yeah, I, I

[00:33:31] Matt: yeah, I'm very jealous. I think some people see it as as wasted effort. You've got the system back up and running, why are you doing this?

That's done, right? Yeah. But of course you dunno when it's going to happen again. And quite often we get lucky in these things. They only impact us for a short time. But it's that innate curiosity in an engineer, and that knowledge and understanding of what normal looks like, to be able to go, that isn't right.

And quite often you get proto-incidents that don't manifest into incidents because someone looked at it and went, hmm, that's odd. Why are we doing this? Why is that traffic going over there? It shouldn't be able to. Or why is that logging that line? That's unusual. And again, it just comes back to people having familiarity with the status quo and what good looks like. You know, normal is not the word. There is no normal, but I guess there's a steady state in our systems, and then there's the non-steady-state behavior, or elastic and plastic behavior maybe, to steal a physics term.

And it's really important to create the space and the culture for people to go and have those investigations, because it'll probably save your bacon in the future. Plus using open source, I'm a big advocate for that. The only reason I found that three second thing was because we were using a component where I could go and read the source code.

You know, quite often we just hide behind abstractions, but sometimes you actually find in the weeds what pushing the clutch does and how the engine works internally. And even if you can't build one yourself, understanding what it does can help you be a better engineer in the process.

[00:35:01] Pete: Yeah, absolutely. It's that, that why, I think, is the really important bit, which is kind of, you know, are you satisfied with, I know what it did, or do you have that compulsive need to go,

But I don't know why it did it. And it's often what you find out is like, actually, you know, the outcome's still the same, but maybe the reason it happened was totally opposite to what, to, to what you thought. It's like, you know, you assumed the system behavior. And it's like, it kind of wouldn't have mattered cuz the incident's now over.

The problem is when it happens again, you go, oh yeah, I know why that happens. And it's like, no, you don't know why that happens. You know? It's like, yeah, that queue is backed up. It's like, oh, that's just because of this. And that's where, you know, what can feel like a useful shorthand can suddenly mean that, you know, the classic is someone goes, oh, it's probably this, and then half the incident's spent looking in the wrong place. And when you've got someone that does that real, like, oh no, I want to understand the system, actually then spreading that knowledge of how the system works is the thing that mitigates the next incident, not knowing what happened and that it got better.

[00:35:53] Matt: Exactly. Yeah. I realize we're short on time, but I really like the concept of the gamma knife. So for anyone not familiar, essentially, I think we all, I mean, present company won't see it like this, but we all have this tendency as humans in post analysis to go, oh, this happened because of this, and then that led to that.

And you tell a very linear chain of events. But actually life doesn't work that way. You know, humans are all distributed systems, we're all engineers. We are all sort of only partially ordered with each other. And there's this concept of the gamma knife, I can't remember who came up with it.

Essentially, see every action that someone does as a zap of radiation just firing off. And most of the time they don't align, right? You know, someone deployed that, zap. Someone logged into the production database to run a query, zap. Most of the time they run select star from users, semicolon, and nothing goes wrong, but occasionally they do an update on a thing and they screw the WHERE clause up and they update everything.

And, you know, all these are little zaps of radiation that are going on. A gamma knife is a concept for treating, for example, brain tumors, where if you focus all that radiation on one point, then it can have a much greater impact. And essentially, I like to see incidents as when all of that radiation concentrates. It was all still there.

It was all still happening. Yeah. But most of the time your people are creating conditions for success and it's not aligning and it's only when all of those stars align. That's sort of very linear way of looking at it, but only when all of those zaps align that suddenly you go, oh, that's an incident.

And we need to go and look at it. And I think that changes the logic, and certainly how perhaps non-technical actors often look at it, that it's, how can we prevent this ever happening again by stopping people, and assuming that it's human error that led to it. You still need people to be able to perform those actions in a day-to-day job, but it's about them having that understanding of how it could have wider impact than just the objective that they're looking at. Kind of, you know, think outside the box. What's the problem that could happen here? And only through having understanding of the system and how it fits together, and having a well designed system, can you start to comprehend that.

I love that.

[00:37:54] Chris: Yeah. It fits really naturally with, like, my mental model of how these things emerge, which is just like, you know, the best incidents are the ones where, sort of continuing that, you've got these zaps and you've got people who have independently, in their sort of day job, been curious about what underpins those little independent events that are sort of emergent across your whole system.

And it's like the intuition leads to experience, 'cause people are like, cool, I've seen how this works when things work. Right. And then that experience can sort of be extrapolated into intuition. So when something then goes wrong that's sort of slightly outside the immediacy of those things, you're like, cool, well, I'm up to this level, like level nine, with my sort of understanding, and so now I've just gotta make this little extra leap, which is often quite easy. And it's those things there which, if you weren't curious in the first place and you didn't really understand what was happening in these independent things, you are in an absolute world of pain.

Yes, and I have definitely been there in the past. But then incidents are this fantastic way of, like, even if you've had that pain and you're like, okay, no one was curious, or no one even knew that thing existed, there's such a good spotlight to go and, like, dig into those things and go, cool, well, that's a new vector that we can explore where we can get very good at this thing.

And then next time, something in the domain of whatever it might be is gonna be that little bit easier to deal with. But yes. Listen, we are rapidly running out of time, and I feel like I could chat for hours, so maybe we get you back at some point in the future, Matt, and we can continue the conversation.

Nice. Well listen, thanks Matt, Matt from Ziglu. Genuinely appreciate you taking the time. It's been a lot of fun and we'll chat soon.

[00:39:22] Matt: Yep. Thank you for having me and it's been a pleasure.

Yeah, likewise. Thanks so much, Matt.
