How Clari built a global incident management program, powered by incident.io

With incident.io, Clari has rolled out a consistent and unified way to manage incidents across the globe.

Key Benefits

  • A single, unified way to seamlessly respond to incidents across teams and regions
  • A partnership with a development team that prioritizes pace and impact
  • An intuitive platform that's flexible enough to use for product releases, maintenance events, and more

“The biggest advantage we have right now is that there’s only one way to declare an incident. There’s only one way to track an incident and one way to track an incident evaluation. That in itself is a big win for us.”
Balaji Narayanan, Senior Director of Engineering

As the Senior Director of Engineering at Clari, Balaji Narayanan’s responsibilities break down into two buckets.

First, he’s focused on ensuring that Clari users have the best experience possible. This means staying on top of system availability and reliability and addressing any performance degradations, an absolute must given how essential their revenue platform is. And second, he works to enable his engineering teams to focus on building and shipping products quickly. With a tried-and-true incident response process, this latter responsibility becomes easier.

Unfortunately, Clari’s incident response workflow, which the company had outgrown as it increased in size and criticality, was highly manual and inconsistent. These problems created significant barriers to efficient incident response and made it hard to learn from incidents.

“When you’re in an incident, your focus should be on recovery. After you have recovered, you should focus on learning. We were struggling to run the process correctly. So we couldn’t focus on the right things,” says Balaji.

So when incident.io came into the fold, they were able to meaningfully improve the way they ran their incidents from start to finish. Not only that, but with the platform in place, they could achieve one of their biggest goals: launching a global incident management program to ensure consistency across regions and teams.

Life before incident.io

Before adopting incident.io, Clari dealt with one specific issue that affected its incident response—inconsistency.

“We had a very manual and bespoke process. We did have certain guidelines on how to start incidents, how to run them, and how to do follow-ups, but we were basically following instructions from Confluence or Google Docs,” says Balaji.

“But the reason I say it was bespoke is that, for example, the channel naming was not consistent. You couldn't easily figure out where the incident happened. Somebody would set the Slack channel as private, so you needed to be allowed to join.”

Beyond declaration and channel naming, this inconsistency also affected other parts of their response flow.

Ad hoc response processes

Software Architect Kurt Andersen has worked alongside Balaji Narayanan on the Infrastructure team for over a year. During his tenure, he noticed firsthand how challenging it was to respond to incidents at Clari.

One thing he noticed, particularly, was how the incident response process was ultimately tied to who was leading it.

“We had an incident command rotation with a primary, secondary, and third-level fallback. But it was, I'd say, very tribal knowledge and isolated as to what people knew and how they did things,” says Kurt.

“It was not just bespoke, but it was very much at the whim of the individual incident commander. They weren't necessarily being pulled in effectively or quickly.”

Challenges with context sharing

One of the most critical aspects of incident response is context sharing. When someone joins an incident channel midway through, they should be able to get up to speed quickly without disrupting the flow of response. Unfortunately, this was exactly one of the issues both Kurt and Balaji were dealing with.

“There would be lots of ad hoc Zoom calls and very little documentation in the Slack channel,” says Kurt.

Some responders shared updates regularly, while others didn’t. So, if a newcomer joined the channel halfway through, another responder would need to take responsibility for sharing as much information as possible.

In the end, this meant that context sharing wasn’t a reliable process or baked into their workflows—it was highly unpredictable.

It’s time to make the switch

Taken together, these problems made it clear to Balaji that it was only a matter of when, not if, they’d be on the search for a dedicated incident response platform. “When I joined in March of last year, I was happy that Balaji was already convinced of the importance of this,” says Kurt.

So began their evaluation process for a tool to address all the pain points they were dealing with.

“We had specific requirements. The UX needed to be relatively intuitive. The more intuitive it was, the better it scored for the product. And we wanted support getting communications out to all the relevant parties that needed to be kept in the loop on what was going on with incidents,” says Kurt.

Another requirement was the ability to automate as much of the repetitive "turn of the crank" work as possible: all of the things you need to do for every incident.

Here were some of their other requirements, based on an evaluation document Balaji and Kurt shared with us:

  • Intuitiveness of incident declaration and coordination: It is easy to "do the right thing" and let the tool run automations so we can focus on the problem
  • Incident visibility: As a manager or executive, it's easy to self-serve information, search live incidents, and request updates when necessary
  • Data visualization and insights: We can capture relevant data points to learn from our incidents and improve/adapt over time

Better visibility, communication, an intuitive platform, and a partnership: life with incident.io

Now that a dedicated incident response platform is in the fold, the Clari team has been able to run better incidents from start to finish. From helping roll out their global incident management program to building a partnership with our product team, here’s how adopting incident.io has improved incident response at Clari.

Running a global incident management program

One of the most significant benefits of incident.io is that it gave Clari the tools and confidence it needed to launch a global incident management program, something that would have been difficult with the previous ad hoc workflows.

Thanks to the intuitiveness of the incident.io platform, they were able to effectively scale incident management responsibilities across several time zones.

“In the middle of last year, when we adopted incident.io, we went live across the entire engineering organization. We were spread across North America and India. We have subsequently opened an office in Poland.

“We wanted to roll out the same process to every team. Ultimately, you want to ensure a unified view if you have different products under your portfolio like we do. So I believe having a standardized tool and practice and removing some of the manual cognitive load helped.”

A single, unified way to declare incidents

Before adopting incident.io, incident response workflows varied from team to team, leading to process gaps. But since switching to the platform, they’ve been able to roll out a single process to incident responders across the globe.

“The biggest advantage we have right now is that there’s only one way to declare an incident. There’s only one way to track an incident and one way to track an incident evaluation. That in itself is a big win for us,” says Balaji.

It’s easy to declare an incident and get the right people involved quickly.

Another benefit of this simplicity? Newcomers can seamlessly run incidents without added context.

“We recently had a product manager fire up an incident for the first time. He had never done one before but knew enough to kick it off. And then I could come in and guide him and say, ‘Okay, great. You’re off to a good start,’” says Kurt.

Not just customers, but partners

Many companies have experienced vendors who practically disappear after onboarding. At incident.io, we focus on building partnerships with our customers well beyond the rollout phase, something the team at Clari has witnessed firsthand.

“The support team has been excellent. Herbert and George—when you were rolling out the AI assistant—were accommodating. I'm still impressed at the speed with which fixes are turned around,” says Balaji.

“It's been a wonderful experience working with you all.”

When Clari raises a support request, the incident.io team jumps to resolve it quickly: in days, if not hours, rather than weeks.

“I think the biggest thing I see is when I comment in our Slack channel, it gets added to your product board. And then someone comes back and says, ‘Hey, by the way, this is fixed.’ And it’s not weeks or months for turnaround and fixes. It's days, which is super, super impressive.”

incident.io is for more than just incidents

Finally, beyond rolling the platform out globally for core incident response, Clari has also been able to use incident.io in less traditional ways that require coordination across teams, namely product releases and maintenance events.

“We started as a global incident management program for customer-facing incidents. But we've also incorporated security incidents into it. I think the critical thing is that we have a blueprint to replicate for different problems,” says Balaji.

“There's one way you do your process, and then you can do the same for any other. So that's what I see as the critical value that incident.io brings.”

About the interviewee

Kurt Andersen is a Software Architect working in the Infrastructure organization at Clari. Previously he worked as the head of strategy for Blameless.com and was one of the leads for the Product-SRE organization at LinkedIn. Across the full spectrum of IT-influence, he is strongly committed to developing the best engineers and teams, and enabling them with the right ideas, tools, and connections at the right time to facilitate personal and organizational learning and resilience.

Kurt Andersen

Software Architect

Industry: FinTech
Customer since: June 2023
Company size: 1000+
Office model: Hybrid
