Recently, we sat down with Kerim Satirl, Senior Developer Advocate at HashiCorp, to get some practical advice about how teams can make incidents less painful.
If you want to listen to the full conversation, we've embedded it here.
But if you're looking for a few quick highlights from that episode, we've shared some below.
How many incidents have you managed?
"I think a ratio of once every two weeks checks out. New security incidents, vulnerabilities that need to be taken care of. And of course, stuff that was committed that wasn't ready to be rolled out and had to be reworded."
Can you talk through one incident that highlighted the value of having a good incident management process?
"So I'll go back to...this is probably like 2012 or 2013. I worked at an ad tech company. We had an issue with a database migration. We decided to migrate some data from one MongoDB cluster to another. And the process should have worked out, but of course it didn't. So we paused it.
I was scheduled to go on vacation the week after—I was rock climbing back then. And we decided to pause everything because, you know, we were almost there, almost certain it was going to work out. But we didn't want to rush it. It's kind of like, don't deploy on Fridays. Don't make database changes when half of your team is going to be gone for the next week. Monday morning, my colleague comes in and—I want to be very clear, this is not about blaming anyone.
This is absolutely about lacking good tools and a good process.
My colleague decides that, you know what, I'm just going to finish this. It's going to be a nice surprise when he comes back and finds out everything is finished. And the intention there was good, right?
We love people that pick up our slack. As it happens, I'm climbing and I feel my backpack starting to vibrate, which is usually not a good thing because, when I was climbing, everyone knew I was climbing. So as I'm doing the second pitch of my route, I take a quick break.
I grab my phone and I see my colleague is calling me. 20 missed calls, something like that. First thought is, something happened that is so critical that, one, he's calling me while I'm climbing. Two, he's calling me on vacation. It was a startup, but we had a strong rule: if you're on vacation, people don't call you and you don't call them.
Here I am hanging off the mountain, probably 50 meters up in the air, and I grab my phone, call him and ask, 'What is the problem?' He ran a chmod command and basically locked himself out of one of the servers and just wasn't able to SSH into it."
If teams know how important a good incident response process is, why aren't they prioritizing it?
"I've mainly worked with startups and startup-like organizations inside enterprises, where the larger company wants their team to respond to incidents as quickly as possible. And then it's always high speed, low drag.
Not a lot of processes. Not a lot of documentation. Hardly any testing.
These are all the things that you can 'save money on,' but you can save money on them only until stuff goes wrong. At which point, all the money you saved goes into an all-hands-on-deck situation where you're trying to solve that incident because, usually, there's not just a single thing that went wrong.
There's usually a litany of things. And of course, because you haven't properly documented everything, nobody has an idea how to solve it all."
Given how stressful incident response can be, why would someone want to be involved in them? What's the value?
"For many engineers, I think there's this hero complex of wanting to be the one that saves the team. But you could have prevented having that work in the first place by doing it properly. But ultimately, if you're the one building it, you're uniquely qualified to fix it. When I have an incident, I don't want to have to first fix my incident tooling.
I just want to be taking notes. I want my comms team to be able to go in there, see what's happening. I want them to be able to understand the severity of it. I want the PMs that are involved to be able to just dig in and find the information they need. I think that, to me, is the most important thing because if you can see what's going on, what's broken, you don't have to bother me, which means I can probably focus on fixing the incident.
We can always chat in more detail later, but while the incident is ongoing, engineering focus should be on solving the issue. So take part in that process and learn from it."
About the interviewee: Kerim Satirl is a Senior Developer Advocate at HashiCorp, an infrastructure and orchestration platform, best known for Terraform and Nomad. Before HashiCorp, Kerim worked on various operational teams, helping organizations embrace DevOps mentalities and automation. Ultimately, he's focused on making sure his teams build stuff that doesn't break.