Trainline

How Trainline grew with incident.io for five years, and why they never looked back

With incident.io, Trainline unified their fragmented incident management tooling, improved communication clarity, and simplified their overall process. With a more streamlined process, they've also been able to lower the threshold for incidents, so smaller issues are handled as effectively as large-scale ones, bringing more visibility and insight across the organization.


  • Full PagerDuty migration
  • 20+ minutes saved per incident on external communications
  • 3-4 hours saved per P1/P2 post-mortem
  • 1,200+ services in the incident.io catalog, synced daily from Backstage

Three channels, two tools, no single source of truth

When Trainline first came to incident.io, their incident process ran across three separate Slack channels: triage, P1, and P2. When multiple incidents ran in parallel, the channels became impossible to follow, and engineers coming back from a break faced hundreds of messages with no quick way to understand where things stood. Stakeholders had no visibility. External notifications required manual coordination across tools and spreadsheets, a process that ate 20+ minutes per incident. Their status page was a separate tool, updated manually in the middle of a live incident.

"The role of an incident manager is relatively simple. It's to coordinate people and ensure stakeholder management. If you throw loads of tools at them, they're not actually going to spend time resolving incidents."

The fix was consolidation. Trainline adopted incident.io for incident response, added a P3 tier to capture lower-severity issues, and used Workflows to automate the external communications that had been consuming their team. Status pages came with it. No more manual updates mid-incident.

The results were immediate. "Within running a couple of incidents through the platform, we could see the benefits," Dan says.

That was five years ago. A lot has changed since.


The case for one platform

PagerDuty wasn't broken. But it was over-engineered, in the feature list, in the pricing, and in the complexity of keeping it running. Trainline was using it for one thing: paging. The rest sat untouched, accruing cost and configuration debt with every passing quarter.

Meanwhile, everything that actually mattered was already happening in incident.io: the response, the coordination, the post-mortems. The gap wasn't in the tooling. It was in the logic of paying for a legacy platform to do a single job that a modern one was already doing better.

"We weren't using most of the stuff in PagerDuty, we didn't need it. We were already doing everything we needed in incident.io. Consolidating just made sense."

The move brought cost savings. But the bigger draw was alignment.

"We just saw closer alignment with the direction of travel and what we want to do at Trainline with how incident.io was working."

How the migration set Trainline up to scale

When the time came to move, Trainline had a choice: lift-and-shift, or start clean.

Over the years, PagerDuty had accumulated configuration, services, teams, and applications that no longer reflected how Trainline actually worked. A straight migration would have just moved the mess somewhere new.

"We actually said to ourselves: no. Let's not do that because all we're going to be doing is copying and pasting a whole load of tech debt across."

So instead of a quick move, they spent time planning: tidying up their service catalog, building a custom sync job between Backstage and incident.io so ownership changes would propagate automatically, and aligning everything to their current org structure.

The planning paid off. Now when Trainline reorganizes (which they do regularly), the alert routing updates itself within 24 hours.

"Going forwards, a lot of it should just look after itself. So ultimately it should save us a lot of time in the long run."


Five years later: what incident management looks like now

From P1/P2 to P4, and beyond

When Trainline started, they tracked two incident severities. Today they're adding a fourth tier, P4, and the goal is to capture every incident that touches a customer, down to a single user affected in a single region.

"We always knew there were things happening under the radar. And without the data capture there, it's just hard to really understand what's going on."

The expansion isn't just about more categories. It's about more teams. The security team now runs their own incidents self-serve, escalating to the central team only when severity warrants it. The data team has their own incident type with custom fields. New engineering teams are being onboarded to run incidents independently. If it affects Trainline, it's an incident.

"We're really democratizing the platform. The goal is for incidents to just be part of how every team at Trainline operates, not something that sits with one central function."

The scale of that ambition would have broken the old setup. It needed infrastructure behind it. "Running all of those lower-level incidents through our old setup would have been impossible. It would have been a mess." Now incident.io handles the routing, the context, and the automation, while the central team acts as an escalation layer rather than a first responder to everything.

AI post-mortems

Trainline produces a full Major Incident Report for every P1 and P2, a document that goes to the whole company in multiple formats. Historically, that took three to four hours per incident.

"The improvement in that post-mortem space has really been great. That could save, per incident, anywhere between three to four hours just writing reports. The time to initial review will probably reduce by four hours."

For a small central team, those hours matter. Trainline is reinvesting them into automating data capture and reporting, and onboarding the next wave of engineering managers to run incidents independently.

The Catalog integration with Backstage

One of the features that clinched the on-call migration was the Catalog, specifically its integration with Backstage, where Trainline keeps its service registry.

When teams reorganize and service ownership changes, it now propagates automatically. Alert routing updates itself, without the need for manual cleanup or spreadsheets.

"At Trainline we realign regularly with how the market is performing, which means we often have changes internally. By having the Catalog and incident.io aligned with Backstage, service ownership updates automatically and always stays current."
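
The ownership propagation described in this section can be pictured as a reconciliation step between two catalogs. The following is a minimal sketch under stated assumptions, not Trainline's actual sync job: it assumes Backstage entities in their usual shape (`kind`, `metadata.name`, `spec.owner`) and models the incident.io catalog as a plain name-to-owner mapping instead of calling any real API. The function names, service names, and team names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    create: dict[str, str]  # service -> owner, missing from the target catalog
    update: dict[str, str]  # service -> new owner, ownership has changed
    delete: list[str]       # services that no longer exist in Backstage


def owners_from_backstage(entities: list[dict]) -> dict[str, str]:
    """Extract {service name: owner} from Backstage-style Component entities."""
    owners: dict[str, str] = {}
    for entity in entities:
        if entity.get("kind") != "Component":
            continue  # skip groups, users, APIs, and other entity kinds
        owner = entity.get("spec", {}).get("owner")
        if owner:
            owners[entity["metadata"]["name"]] = owner
    return owners


def plan_sync(source: dict[str, str], target: dict[str, str]) -> SyncPlan:
    """Compare the source of truth with the target and list the changes needed."""
    create = {n: o for n, o in source.items() if n not in target}
    update = {n: o for n, o in source.items() if n in target and target[n] != o}
    delete = sorted(n for n in target if n not in source)
    return SyncPlan(create=create, update=update, delete=delete)


# Hypothetical data: one reassigned service, one new service, one retired service.
backstage_entities = [
    {"kind": "Component", "metadata": {"name": "checkout"}, "spec": {"owner": "team-payments"}},
    {"kind": "Component", "metadata": {"name": "search"}, "spec": {"owner": "team-discovery"}},
    {"kind": "Group", "metadata": {"name": "team-payments"}},
]
current_catalog = {"checkout": "team-web", "legacy-api": "team-web"}

plan = plan_sync(owners_from_backstage(backstage_entities), current_catalog)
print(plan)
```

Run on a daily schedule, a job of this shape would treat Backstage as the single source of truth and push only the differences, which is one way a reorganization could reach alert routing within a day without manual cleanup.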


The partnership

Trainline became a customer when incident.io was still a startup. Dan used to sit in sales meetings with the CTO.

"At first it was quite cool to work with a company with that kind of real start-up mentality. You could speak to the CTO and he was just there. And actually when I go to the conferences now, you can still speak to the CTO, he just comes over."

Trainline has grown significantly over five years: more teams, more incident types, more complexity. What's kept the relationship strong isn't just that incident.io has kept pace; it's that the ethos hasn't changed.

"Even as they've matured, they still have that same ethic around shipping fast and putting customers first."

That confidence has driven Dan to keep expanding the remit, from platform incidents to security, data, and beyond. When it came to renewing, they didn't just renew. They asked to extend the term.

"There are so many different use cases for incident.io. I don't envision us moving away anytime soon, if ever. I think it's here for the long haul."


