How incident.io’s pace of development helped Etsy turn incident response into a superpower

Since adopting incident.io, e-commerce titan Etsy has made tangible improvements to the way it manages site outages and similar incidents—ultimately transforming its culture of learning.

Key Benefits

  • Insights that enable meaningful and impactful learnings from incidents
  • Organization-wide adoption spanning Engineering, Security, Customer Support, and C-suite
  • Automated Workflows to reduce manual overhead on incident responders
  • Partnership with a company delivering value at pace

"When we did our evaluation process, we gave incident.io a list of features we needed. In the time that it had taken us to get one vendor to respond to our product feedback, incident.io had shipped four features we requested. Internally, we had a meeting, and I think we said something like, 'Wow, they are super hungry.'"
Jeremy Tinley
Principal Systems Architect

Etsy is a global marketplace for unique and creative goods. It's home to a universe of special, extraordinary items, from unique handcrafted pieces to vintage treasures.

As a company, they strive to lead with their guiding principles and to help spread ideas of sustainability and responsibility whose impact can reach far beyond their business.

They have a hybrid business model, with offices spread across the globe. Currently, they have over 2000 employees.

Life before incident.io

As a 20-year veteran of on-call, Jeremy Tinley has had his fair share of incident response.

"I've carried a pager for almost my entire tech career, and it's exhausting. I remember getting paged in a movie theater once, and the people two rows back started throwing popcorn at me. I once got paged five times while trying to pick up a bottle of wine, and I debated whether I should get more bottles. I've been on multi-hour conference call bridges where I was required to stay until everyone finally agreed it wasn't my team's problem. Most of my cynicism comes from not having a good experience with on-call."

Nearly a decade ago, Jeremy started at Etsy, where he continued his on-call stretch, starting off as an operations engineer. He likened that role to being an SRE today. "It was more of a generalist role: you could get paged for web or email servers, but I was brought in specifically to respond to database issues."

As Jeremy's experience and tenure grew, he wanted to take on bigger challenges. This led him to move into the role he holds today: Principal Systems Architect. Now, he spends much of his time analyzing different problems and determining the most efficient way to resolve them. Some of his focus areas as a Systems Architect revolve around cloud resilience and helping Etsy with its annual holiday season preparedness. One particular area of interest for Jeremy is incidents, such as a site outage or an issue with an internal service.

Etsy has long been known for its blameless culture; in fact, its Debriefing Facilitation Guide is an inspiration for incident.io co-founder Chris Evans.

"Coming to Etsy was a major shift in how I thought about incident response and postmortems. For the first time, I had people telling me it's okay to remove alerts that aren't useful, and even more radically, that it was okay to make mistakes," says Jeremy.

"To be candid, Etsy was transformative in how I thought about psychological safety. I've always been a perfectionist, and being here really taught me that it's okay to make mistakes—and even more so, it's okay to share those mistakes with others."

Inspired by how Etsy had impacted him, Jeremy decided he wanted to leave his own mark on Etsy's culture by further leveling up how they handled incident debriefs. As he began digging deeper, he noticed a few things that made it really hard to continue.

Lack of visibility and insights into incidents

The first thing Jeremy dug into was the true time spent resolving an incident or similar service issue.

"In the past, we were measuring the time between when someone got paged, and when the page was resolved. I realized you really haven't accounted for any of the follow-up conversation that happens after. Even though the incident is resolved, there's still time spent discussing it because that incident happened."

Jeremy also identified a much larger surface area to cover: smaller, internal-facing incidents.

"We were really great at tracking high severity incidents, because when we had them, we had a postmortem about it. The gap we found was that if an incident occurred that was internally facing, we had very little data from it. Sometimes they were completely managed in a private channel. This meant we were grazing the surface of what incidents we could use as part of a revised learning-from-incidents program. That's when I realized that we needed more data."

Taken together, these gaps meant Etsy needed a solution that made it easier for teams to track smaller incidents and made those incidents more transparent.

Communication overlaps

In addition to needing a way to collect more data about lower severity incidents, Etsy also found that they needed something to streamline communication. The big hint was that they started to find themselves having simultaneous incidents.

“One day, we actually had two overlapping internally facing incidents in the same channel, and it became a little murky. Was this person responding to the first incident? To the second incident? It didn't take very long before we realized we really needed to separate these two. But the idea that we were starting to overlap time periods of incidents and having them run for longer meant that we were getting to a size where a single incident channel was no longer tenable.”

With what turned out to be impeccable timing, Jeremy was at an SRE meetup in Seattle when the attendees started to talk about managing incidents. During that meetup, someone mentioned using incident.io as their solution of choice for managing incidents.

"That meetup was the first time I heard of incident.io. I went ahead and jotted down the name and just stuck it on a note for later. Eventually, I got around to checking out their website, and I thought I had struck gold. It was at this point that I pitched the idea that we should look at bringing a vendor in to streamline our incident response process. We got approval and pulled together a few people from across the org to evaluate whether a vendor could meet our needs."

Why Etsy chose incident.io

Shortly after beginning their search for an incident management solution, Jeremy and his team narrowed down their list of choices to a handful of options and started their extensive evaluation process.

They were looking to trial-run each option on their list and evaluate them against one another using a matrix they created. This matrix included must-have features like a direct Slack integration, a way to aggregate incident response data for deeper learning, a simple UI so anyone could come in and start using the platform with little overhead, and solid customer support.

"When we did our evaluation process, we gave incident.io a list of features we needed. In the time that it had taken us to get one vendor to respond to our product feedback, incident.io had shipped four features we requested. Internally, we had a meeting, and I think we said something like, 'Wow, they are super hungry.'" Jeremy said.

"Obviously, delivering many feature enhancements before we even signed as a customer was a sign that we would be well treated, but also, the enthusiasm and direct developer access were a huge plus. For me, this really set incident.io apart from their competitors."

For Etsy, this breakneck pace and the incident.io engineering team's follow-through on feature requests were ultimately the deciding factors.

“We knew what we wanted and how we could best use a product. We just needed to fill in the gaps. incident.io was the one that said, ‘We're gonna help you fill in those gaps’ and then they did.”

The impact of incident.io on Etsy

Fast-forward to today, and Jeremy and the team have transformed how they manage incidents, thanks to a feature set and user experience that make incident response more efficient and easy to adopt across the entire organization.

The actionable incident data they get through Insights has also been a game-changer. They went from not being able to answer how many lower severity incidents they had in a given month to being able to say how many, who was involved, how long people spent, and even track the remediation work that came out of those incidents.

On top of all of that, incident.io's engineering and customer success teams have also played a pivotal role in helping Etsy set itself up for success when responding to incidents.

Actionable insights that empower change

With Insights, Jeremy was able to unlock the incident response data he needed to pursue his ultimate goal of a better post-incident process.

“The data that I pull out of insights is shown to VPs, directors, and the CTO on a monthly basis, and they see the value we're getting out of this. A great example of this would be the Workload report. With this report, I can see the time spent on an incident. But not just the time fixing it but also the time we spent talking about it afterward,” says Jeremy.

“We might talk more deeply about why an incident happened, share code snippets or more graphs, and even talk about steps we want to take now that it's mitigated. This gets us much closer to seeing the true time spent on an incident versus just measuring the impacted time.”
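The difference Jeremy describes between impacted time and true time spent can be illustrated with a small sketch. This is not how the Workload report is calculated; the `Incident` fields, the `idle_gap` threshold, and the rule that only gaps shorter than that threshold count as active discussion are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    paged_at: datetime
    resolved_at: datetime
    # Timestamps of messages posted after resolution (hypothetical field).
    followup_messages: list[datetime]


def impacted_time(incident: Incident) -> timedelta:
    """Page-to-resolution time: the older measurement."""
    return incident.resolved_at - incident.paged_at


def followup_time(incident: Incident,
                  idle_gap: timedelta = timedelta(minutes=10)) -> timedelta:
    """Estimate time spent discussing the incident after it was resolved.

    Assumption: a gap between consecutive messages counts as active
    discussion only if it is no longer than idle_gap.
    """
    total = timedelta()
    previous = incident.resolved_at
    for message_at in sorted(incident.followup_messages):
        gap = message_at - previous
        if gap <= idle_gap:
            total += gap
        previous = message_at
    return total


def true_time_spent(incident: Incident) -> timedelta:
    """Impacted time plus the estimated follow-up discussion."""
    return impacted_time(incident) + followup_time(incident)


example = Incident(
    paged_at=datetime(2023, 1, 5, 10, 0),
    resolved_at=datetime(2023, 1, 5, 10, 45),
    followup_messages=[
        datetime(2023, 1, 5, 10, 50),
        datetime(2023, 1, 5, 10, 55),
        datetime(2023, 1, 5, 14, 0),
        datetime(2023, 1, 5, 14, 5),
    ],
)
```

In this made-up example the page-to-resolution figure is 45 minutes, while the estimate that includes follow-up conversation comes to an hour, which is the kind of gap the older measurement would miss.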

But to collect the most comprehensive data, the adoption of incident.io needs to be widespread, a challenge Jeremy has been able to navigate thanks to how intuitive incident.io is.

Organization-wide adoption

“I was successful in getting our VP of Infrastructure to have all the teams underneath him start tracking incidents using incident.io. I've even gotten teams like payments, member support, trust and safety, and security interested, too,” says Jeremy.

But since data is best in aggregate, Jeremy has spent a lot of time evangelizing incident.io across Etsy. And so far, it’s worked.

"As we continue using incident.io and more people start to adopt it, we see more and more data. I wouldn't specifically say ‘we have more incidents,’ it's that we have more visibility into incidents which were already happening."

All of this is meeting Jeremy's biggest need: getting more data.

"I know there's more data we can get, and I know incident.io will continue to get better at surfacing information. This information will, ultimately, help make somebody's life easier, whether that's someone who is on-call or someone who needs to make decisions about on-call."

Workflows that power automated response processes

Workflows are at the heart of incident.io. They're a seamless and highly customizable way for teams to automate parts of their incident response. For example, a Workflow can automatically loop certain people into an incident channel based on severity level.

Etsy has used Workflows for everything from orchestrating its highest-severity incidents to providing contextually relevant runbooks at the right time.

“We have a set of incident leads. To help them acclimate to being in that role, we have a document that explains all of their responsibilities. These were things like creating a ticket, posting in an engineering channel, creating a document to track the incident, sending out a summary, and asking everyone if they have what they need,” says Jeremy.

“If you don't respond to a lot of incidents, you're not going to have any muscle memory for this. This led to incident leads reading a document while trying to manage an incident. To improve that, we turned 95% of the document into automation with Workflows. This means that the incident lead can actually focus on the conversation instead of the procedure. An absolute game-changer for us.”
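To make the idea of turning a checklist into automation concrete, here is a minimal sketch of a rule that runs a fixed list of steps when a condition matches. It does not use the incident.io API or reflect how Etsy configured its Workflows; the `Workflow` and `Incident` classes, the step functions, and the "sev1" severity label are hypothetical, and each step only records a log line in place of a real side effect.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Incident:
    name: str
    severity: str  # hypothetical labels such as "sev1" or "sev3"
    log: list[str] = field(default_factory=list)  # stands in for real side effects


def create_ticket(incident: Incident) -> None:
    incident.log.append(f"ticket created for {incident.name}")


def post_in_engineering_channel(incident: Incident) -> None:
    incident.log.append(f"announced {incident.name} in engineering channel")


def create_tracking_document(incident: Incident) -> None:
    incident.log.append(f"tracking document created for {incident.name}")


def send_summary(incident: Incident) -> None:
    incident.log.append(f"summary sent for {incident.name}")


@dataclass
class Workflow:
    name: str
    condition: Callable[[Incident], bool]
    steps: list[Callable[[Incident], None]]

    def run(self, incident: Incident) -> bool:
        """Run every step in order if the condition matches; report whether it ran."""
        if not self.condition(incident):
            return False
        for step in self.steps:
            step(incident)
        return True


# The checklist items named in the quote above, expressed as ordered steps.
incident_lead_checklist = Workflow(
    name="incident lead checklist",
    condition=lambda incident: incident.severity == "sev1",
    steps=[
        create_ticket,
        post_in_engineering_channel,
        create_tracking_document,
        send_summary,
    ],
)
```

Calling `incident_lead_checklist.run(Incident("checkout errors", "sev1"))` would record all four steps without the incident lead consulting a document, while a lower severity incident would be left untouched by this particular rule.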

Support that goes the extra mile

The level of support Jeremy and his team received during the evaluation period has continued today and has played a critical role in their confidence in incident.io.

“We’ve seen the continued effort into building the things we ask for. I get a message, often from the incident.io developers themselves, saying, ‘Hey, that feature you asked for is done,’ and that level of personal support is incredible. We've also had an opportunity to be involved in the development process. We've had design document previews shared with us from different engineers and product managers to try to make sure that what incident.io is building is aligned with our needs. I can't think of a single door that's been closed to us.”

But it's not all about responding to queries and delivering features: the extra bit of thoughtfulness stands out above the rest.

"We also found it really heartening that when we crossed a significant adoption milestone, our Customer Success Manager, Lucy Jennings, told us that incident.io would donate to a firefighter organization based in Brooklyn on our behalf. We thought it was very appropriate from the incident response angle and something that says that incident.io is always looking for ways to give back."

About the interviewee

With over 20 years of on-call experience, Jeremy Tinley understands the challenges of incident response all too well. In his role as a Principal Systems Architect, he spearheaded a team in Etsy's adoption of incident.io. He is a co-author of "High Performance MySQL 4e", has spoken at several conferences, and thinks that computers are the worst.

Industry: e-commerce
Customer since: 2022
Company size: 2000+
Office model: Hybrid
