How Thrive Learning went from text files to enterprise-ready incident response

When Leigh Darlow, Head of Live Service at Thrive, reached out to incident.io, he wasn't shopping for a replacement. He was curious about a company making noise in the market. Two weeks into the proof of concept, he'd already messaged the SRE lead saying it was a no-brainer. Within a month, PagerDuty was gone.

Manual timelines, missed signals, no visibility

One person, one document

Before incident.io, Thrive's incident process ran on two things: PagerDuty and a text document template.

When something went wrong, someone would open the template and start typing: the date, the summary, a timeline updated manually in real time, all while managing the incident itself. The post-mortem came later, written up by hand. Follow-up actions lived in the same document and were hard to track.

"It was all down to one person," says Leigh. "They'd be typing in every action at whatever time, all while trying to help manage the incident itself. There was quite a lot of room for human error: records missed, and so on."

Follow-ups were equally challenging. Someone had to go back into the document after the fact, check what had been committed, and manually track tickets across separate systems.

As Thrive grew, the limits of running incidents across disconnected tools became harder to ignore. PagerDuty handled alerting, but it lived outside of where the team actually worked.

"We were using it in the most basic sense and didn't really see the value for money. It did the basics, but it didn't integrate into how our team actually operated. For us it was really about that consolidation piece."

The process held together well enough. But as the company grew, attracting larger enterprises, expanding into MENA, and adding headcount rapidly, the gaps got harder to ignore.

Half a day to migrate, years of friction gone

A one-month decision

Leigh didn't come looking for a change. It was Thrive's CPTO who suggested taking a look at incident.io. That one conversation turned into a proof of concept almost immediately.

Within two weeks, Leigh had already messaged the SRE lead, ready to make the call.

"Everything just worked out of the box. It was intuitive, the licensing was straightforward, and when we compared the costs, it was a no-brainer. We got a whole incident management platform with the paging piece already built in."

Replicating their PagerDuty configuration in incident.io took about half a day. Everyone internally was impressed with the turnaround.

What changed

The most visible shift was what happened to the text document. Incidents now run entirely inside Slack, where the team is already working. Timelines build themselves from channel conversations, Jira tickets raised mid-incident get automatically pulled into the incident record, and reminders go out when action items are overdue. Post-mortems are drafted from Scribe transcripts rather than reconstructed from memory.

"It's very easy to lose track of follow-up actions if you've got a manual document. Now incident.io can send out reminders and update the channel when a ticket status changes. We've got a lot more visibility of those action items."

Engineering shared it. Support used it. Nobody asked them to.

Adoption spread without anyone forcing it. The SRE team started first, then support joined, then InfoSec started using it for security events. Account directors started getting notified when something was affecting a customer. No mandates, no formal rollout. People just asked if they could use it too.

Middle of a family holiday, data center down, zero chaos

There's one incident Leigh comes back to when he thinks about what incident.io is actually for.

In early 2026, on a Sunday evening, one of Thrive's data centers was impacted by recent events in the Middle East. Leigh was traveling, twelve hours in transit from a family vacation. His team assembled and worked through the night to handle a full failover.

"Incident was at the heart of that. We managed to fail over super quickly into a new region, and the customer was genuinely happy with how we responded."

After the incident, the team fed everything from the channel into a trusted AI model and used it to generate recommendations for updating their disaster recovery plan. The data gathered throughout the day was invaluable. Dozens of DR improvements came directly from it.

What sticks with Leigh isn't just that it went well. It's who ran it. Someone who had been with the company for six months was asked to lead, and did, without issues.

From incident response to strategic analysis

The culture shift has been as meaningful as any feature. Thrive now sees more proactive incidents than before: smaller things caught early, dealt with quickly, before customers ever notice.

"People used to see incidents as a bad thing. Now they use them to mobilize teams and improve. That whole attitude and culture around it has changed as a result of this partnership."

Executives can jump directly into an incident channel, ask @incident for a summary, and get an instant update (including a transcript of the previous hour of calls) without interrupting anyone doing the work.

Using the MCP server turned incident data into strategic insights

Leigh has recently started using incident.io's MCP server alongside Claude to do higher-level analysis across incident data, looking for trends, identifying weaknesses, and feeding the output into a platform health forum.

"I can use the MCP server and say, where do you see our weaknesses? Historically that would have been really hard with text documents. Now that it's all in incident.io, I can do strategic analysis in a way that just wasn't possible before."

The data has also become a tool for enterprise procurement conversations. As Thrive lands larger and larger customers, the scrutiny on how they manage incidents has only increased. Large customers ask detailed questions about incident response processes, response times, and how learnings get captured. Thrive can now answer all of it with real data.

"There's more scrutiny now with the calibre of customer we're attracting, the scale we're operating at, and the speed we're moving. These non-functional aspects to service get more and more important as we look to service more and more enterprises. Knowing that we're in safe hands with incident.io really helps give confidence in that area."

With eight SREs on the team, AI SRE is next on Leigh's radar: the ability to go from incident to cause as quickly as possible and drive down mean time to recovery.