How Trainline got their incident management back on track with incident.io

With incident.io, Trainline unified their fragmented incident management tooling, improved communication clarity, and simplified their overall process. With a more streamlined process, they've also been able to lower the threshold for incidents, so smaller issues are handled as effectively as large-scale ones, bringing more visibility and insight across the organization.

Key Benefits

  • Consolidated tooling and processes for simpler incidents and reduced complexity
  • Automated external communications, saving responders valuable time
  • Simplified process for consistent management of both large-scale and smaller incidents

For Reliability and Operations Manager Dan Cook, there’s one misstep he often sees organizations make with incident management: they make things far too complicated.

“The role of an incident manager is relatively simple: it's to coordinate people to resolve an incident and ensure that you have stakeholder management,” says Dan.

“If you're asking your incident managers to do too much or you're throwing loads of tools at them, they're not actually going to spend time dealing with incidents, which is really where you want them to be.”

So, when it came time for Dan’s team at Trainline to take a fresh look at how they manage their incidents, simplicity was a top priority. Clear communication also scored highly.

Until then, they had dealt with a highly manual and complex response and communication process that risked significantly hindering the team. During particularly severe incidents, the last thing responders and stakeholders needed was complexity and uncertainty.

“We wanted clear and consistent internal and external stakeholder communication,” says Dan.

In incident.io, they found the answer to these pain points, and did so without making any sacrifices or concessions.

“Within running a couple of incidents through the platform, we could see the benefits and how much easier it would make our lives.”

Aiming for a streamlined incident management process

For Dan, an overly complicated or inefficient incident response process was a hurdle to his ultimate goal as a Reliability and Operations Manager: ensuring best-in-class reliability.

“I look after our Platform Incident Management team and I also look after our Site Reliability Engineering team. Our remit is the full incident cycle: starting out from incident detection to resolution, down to continuous improvement to avoid repeat incidents,” says Dan.

“We also look after observability best practices and provide tooling to the rest of our engineering organization.”

However, as they continued to scale with their existing incident response process, it became clearer that it wasn’t meeting the needs of their growing team.

Three incident Slack channels—all incidents at once

The first issue was that all incidents were relegated to one of three channels: a triage channel, a P1 channel for high-severity incidents, or a P2 channel for less critical issues.

“We used Slack-native options and had three incident channels that we used. One was our main platform operations channel, where we did triage. And then, if we had a serious enough incident, it would go to a P1 channel. If it was a lesser but still major incident, it’d go to a P2 channel,” says Dan.

“This all came with its own set of problems.”

While the benefits of having a handful of dedicated response channels were felt in the short term—engineers were always ready to jump in at a moment's notice—as Trainline matured, the issues with this approach became more acute.

“If you had two incidents, then we often had to branch out into another channel and then start adding people. This slowed down the response to the second incident and caused significant confusion,” says Dan.

“If you were an engineer and, let's say, something happened while you were on lunch, and you came back and saw 500 messages in the P1 Slack channel. You’d have to read through all of them to get the updates. Eventually, you’d just say, ‘I don't even know what this incident's about.’”

Complex comms during the most critical times

Internal and external communications are critical during incidents to keep everyone in the loop—particularly during severe incidents where the stakes are high and executives need to be kept privy to what’s happening.

Unfortunately for Dan and his team at Trainline, stakeholder communications were complicated by two realities. First, their status page tool lived outside of Slack. Second, non-technical stakeholders found it nearly impossible to parse the context they needed within existing channels.

“We had an internal status page that we used, which was a completely separate tool that we'd have to access. And then we still relied on manually sending comms to engineers,” says Dan.

“If you have your senior executives looking at that Slack channel, and some of them aren't technical, you then start getting questions from them. They're not able to follow what the incident is or what's broken. All they know is that we're losing revenue, but have no idea what the customer impact is. That then causes extra trouble for the incident manager.”

Working towards a better way to manage incidents

After the Trainline team identified their problem areas, they started looking for platforms that could address these pain points head-on—and then some.

Two requirements stood above the rest: clarity and simplicity.

“We wanted it to be clear what an incident was, who's looking after it, and what services or customers were impacted—but we wanted a consistent way of doing that,” says Dan.

“Our goal was to allow the incident commander to sit down and deal with the incident instead of having to do all these extra communication tasks.”

Streamlined incidents, clearer communications and life-changing automations: life with incident.io

With incident.io, Dan and his team at Trainline have addressed all of their pain points.

They've done this while also gaining access to additional functions that have made life a lot easier for responders and executive stakeholders during critical moments.

A brand new type of incident

The first change? The intuitiveness of the incident.io platform has allowed Trainline to create a whole new incident type: P3s.

“Since adopting incident.io, we’ve gone on to create P3 incidents as well. We realised there was a gap where we had incidents within our production platform that didn't have significant revenue or customer impact,” says Dan.

“By having a central place where we could run these and announce them to the right audiences, we now have a much better response. And it was only through having incident.io that we felt we were comfortable enough taking that step.”

Better visibility across the organization

When it comes to incidents, particularly of higher severity, the more visibility you have, the better. With incident.io, Dan has noted how much simpler it is to keep everyone in the loop, without the manual overhead.

“Any time we raise an incident, we'll have the announcement go into our dedicated Incident channel. So people will know what the incident is, who's running the incident from my team, and what services are impacted,” says Dan.

“We encourage everyone in the company to join the incident channel: from the business to the tech side. So everyone can keep up-to-date with what's going on.”

Time-saving Workflows

For a travel company like Trainline, certain incidents affect third parties that need to be kept updated every step of the way. With their previous process, sending these updates was highly manual and got in the way of responders' main goal: resolving the incident at hand.

But with the incident.io Workflows feature, Dan and his team have been able to automate much of the overhead that came with these external communications—and inspire confidence across the board.

“As part of the contracts that we have with some of our third parties, we have to send instant notifications when we have a severe service disruption. Previously, we'd raise an incident in Status Page. Then, an email would go to the incident managers. They would then have to forward that email, amend the formatting, go to an Excel sheet, copy and paste the right details such as email addresses, and send that off,” says Dan.

With Workflows, the toil of this process has been completely eliminated, saving responders time and removing that cognitive overload.

“What Workflows allowed us to do is automate all of those external comms. Anytime an incident is updated, or you move through the incident lifecycle, new comms are sent out to our third parties. And it's all been customizable via the Custom Fields we choose. As an incident manager, you don't need to think about it. We probably save around 20 minutes for each incident.”

Meeting the needs of today and tomorrow—across teams

While Trainline isn’t done growing, Dan is confident that incident.io will continue to meet the needs of today and tomorrow. This confidence has driven him to push for wider use cases beyond platform incidents.

Thankfully, teams have been receptive.

“One thing incident.io has done is shown how well incident management can be run. We’re actually looking to expand the remit of incident.io within Trainline. Initially, we used it for platform incidents. We've since expanded it to look after our security incidents, especially once the feature around private channels was released,” says Dan.

“But there are so many different use cases for incident.io. I don't envision us moving away anytime soon, if ever. I think it's here for the long haul.”

About the interviewee

With over 7 years' experience of being on call, Dan Cook understands the challenges of incident response all too well, having dealt with a range of incidents across various technology stacks. In his role as a Reliability and Operations Manager, he spearheaded Trainline's adoption of incident.io and drives observability best practices across the engineering organization.

Dan Cook

Reliability and Operations Manager

Industry: Travel
Customer since: June 2023
Company size: 1000+
Office model: Hybrid
