On August 28th, 2023—right in the middle of a UK public holiday—an issue with the UK’s air traffic control systems caused chaos across the country. The culprit? An entirely valid flight plan that hit an edge case in the processing software, partly because it contained a pair of duplicate airport codes.
As a safety measure, the system and its backup shut themselves down, forcing controllers to process flight plans manually while technical teams tried to identify and fix the problem.
The result was six hours of downtime, major delays, stranded passengers, and a whole lot of stress. While safety was never at risk, this incident provides a glimpse into the intricate and occasionally fragile world of air traffic systems.
For anyone curious about how these systems work (and why they sometimes don’t), this story is a fascinating case study. Beyond the tech, it’s also a great reminder of how messy incidents can be and why the hunt for a neat “root cause” breaks down entirely in complex environments. Let’s dig in!
As you might expect, air traffic control is a pretty complex business. Before we look at how the incident unfolded, here’s a crash course in the key players and systems behind UK and European airspace management.
Together, NATS and EUROCONTROL ensure safe and efficient air traffic management across overlapping jurisdictions.
Every flight needs a plan, which is basically a blueprint containing essential information like the aircraft type, route, speeds, and callsign.
Airlines operating in Europe submit these plans to EUROCONTROL, which in turn distributes them to the relevant air traffic control services. For flights involving the UK, NATS is the recipient of these plans, so it has visibility of flights in its airspace.
Dropping a level deeper, because it matters hugely to this incident: EUROCONTROL and NATS use different flight plan formats. I’m deliberately sidestepping the reason for this, but knowing it’s a thing is important!
✉️ What does a flight plan message look like?
In case you’re curious, here’s the (nearly 100 page) specification of the ADEXP message format. There's a lot of ground to cover here.
Between the IFPS and NAS systems there's a sub-system responsible for converting ADEXP messages into NAS messages. It’s a very catchily titled system called the “Flight Plan Reception Suite Automated – Replacement” or FPRSA-R.
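To make that conversion step a bit more concrete, here’s a minimal sketch of what a format translator conceptually does. The message layout, field handling, and output format below are simplified for illustration; the real ADEXP and NAS formats (see the spec above) are vastly richer, and this is not FPRSA-R’s actual logic.

```python
# Purely illustrative: a toy "European format" to "UK format" flight plan converter.
# Field names and output layout are simplified; real ADEXP/NAS messages are far richer.

def parse_adexp_like(message: str) -> dict:
    """Parse a simplified message made of '-KEY value' fields."""
    fields = {}
    for token in message.split("-")[1:]:
        key, _, value = token.strip().partition(" ")
        fields[key] = value
    return fields

def to_nas_like(fields: dict) -> str:
    """Render the parsed plan in a made-up, fixed-order 'NAS-like' line."""
    required = ["ARCID", "ADEP", "ADES", "ROUTE"]
    missing = [k for k in required if k not in fields]
    if missing:
        # A real converter has to decide what to do here: reject, flag, or stop.
        raise ValueError(f"cannot convert plan, missing fields: {missing}")
    return f"{fields['ARCID']}|{fields['ADEP']}|{fields['ADES']}|{fields['ROUTE']}"

message = "-ARCID BAW123 -ADEP EGLL -ADES LFPG -ROUTE UPALO DVL"
print(to_nas_like(parse_adexp_like(message)))  # BAW123|EGLL|LFPG|UPALO DVL
```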
ℹ️A brief history of FPRSA-R
FPRSA-R is an automated system for processing European format flight plans into UK format flight plans. As you might guess, the acronym has grown over the years. Here’s the brief history lesson:
Since a picture says a thousand words, here’s roughly how these systems all interact.
The 28th of August, 2023 was a public holiday in England and Wales, and one of the busiest days of the year for the UK aviation system. Everything was running smoothly with air traffic control (ATC).
As is normal practice for them on such important days, NATS had implemented a change freeze to ensure no changes were introduced that might impact operations.
At 08:32, an ADEXP flight plan was received by the FPRSA-R system from EUROCONTROL's IFPS system.
Whilst the flight plan was valid and within specification, the FPRSA-R system was unable to process it and hit an exception in the code. One of the reasons for this was the inclusion of two waypoints sharing the same code: DVL for Deauville, France, and DVL for Devil's Lake, North Dakota, USA. But whilst the duplicate codes might seem like the obvious culprit, the system was actually designed to handle such cases.
The actual issue arose from a combination of six specific factors, all of which had to align perfectly to trigger the bug. Perfect storm, anyone? 😅
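To get a feel for why identical codes are awkward in the first place, here’s a contrived sketch. It is not the actual FPRSA-R logic (the real failure needed all six factors to line up, as we’ll see below), but it shows the underlying hazard: a lookup keyed only on a three-letter code can silently resolve to the wrong waypoint, so a converter has to lean on context, such as geographic plausibility, to disambiguate.

```python
# Contrived illustration of why duplicate waypoint codes need careful handling.
# This is not the real FPRSA-R logic, just the shape of the problem.

WAYPOINTS = {
    # In reality, DVL identifies both Deauville (France) and Devil's Lake (North Dakota).
    "DVL": [
        {"name": "Deauville", "lat": 49.4, "lon": 0.1},
        {"name": "Devils Lake", "lat": 48.1, "lon": -98.9},
    ],
}

def naive_lookup(code: str) -> dict:
    """Take the first match: silently wrong for half the flights using this code."""
    return WAYPOINTS[code][0]

def contextual_lookup(code: str, prev_lat: float, prev_lon: float) -> dict:
    """Pick the candidate closest to the previous point on the route
    (crude flat-earth distance, fine for illustration)."""
    return min(
        WAYPOINTS[code],
        key=lambda w: (w["lat"] - prev_lat) ** 2 + (w["lon"] - prev_lon) ** 2,
    )

print(naive_lookup("DVL")["name"])                    # Deauville (right only by luck)
print(contextual_lookup("DVL", 51.5, -0.1)["name"])   # previous point over London -> Deauville
print(contextual_lookup("DVL", 47.0, -101.0)["name"]) # previous point over North Dakota -> Devils Lake
```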
When the FPRSA-R system failed to convert the flight plan, it shut itself down. The NATS report describes this as the system going into “maintenance mode”, where it ceases processing any further inbound messages.
It might seem like an odd choice for such a system, but FPRSA-R was designed this way as a safety mechanism, intended to prevent sending potentially inaccurate data to air traffic controllers.
"the system cannot include logic that would delete or ignore complex problematic data because, for safety reasons, all flight plan data must be processed and understood to ensure that controllers have accurate real time information on aircraft they are tasked to control at their workstations."
As you might expect, the FPRSA-R system was designed to be fault tolerant, with an isolated backup system running as a hot standby, ready in the event of a chronic failure of the primary.
But because the backup system runs “hot”, it too had processed the bad “DVL” message, and as a result also shut itself down into maintenance mode.
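Here’s a rough sketch of that failure mode, heavily simplified and purely illustrative: a processor that deliberately fail-stops on anything it can’t safely interpret, paired with a hot standby that trips over the exact same message because it consumes the same feed. The duplicate-code check below is a stand-in for the real six-factor edge case.

```python
# Simplified illustration of fail-stop processing and a common-mode standby failure.
# The real FPRSA-R behaviour is far more involved; this just shows the shape of it.

class PlanProcessor:
    def __init__(self, name: str):
        self.name = name
        self.in_maintenance_mode = False

    def process(self, plan: str) -> None:
        if self.in_maintenance_mode:
            return  # no longer processing inbound messages
        try:
            if plan.count("DVL") > 1:
                # Stand-in for the real six-factor edge case.
                raise ValueError("cannot safely interpret plan")
            print(f"{self.name}: converted {plan!r}")
        except ValueError:
            # Fail stop: better to halt than forward data we aren't sure about.
            self.in_maintenance_mode = True
            print(f"{self.name}: entering maintenance mode")

primary = PlanProcessor("primary")
standby = PlanProcessor("hot-standby")

for plan in ["BAW123 EGLL..LFPG", "XYZ987 ..DVL..DVL.."]:
    # Both instances see the same feed, so one poisonous message takes out both.
    primary.process(plan)
    standby.process(plan)
```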
ℹ️Could the bug have been reasonably predicted and avoided?
For those of you asking “why didn’t they think about duplicate waypoints when writing the software?”, it turns out there was a lot more to the bug.
Specifically, the following six factors all had to be present in the flight plan segment of the ADEXP message in order to trigger the issue in the flight plan processing software:
I think this statement from the report is pretty accurate: “[the testing] was reasonable based on the premise that it is impractical to test every single scenario within complex systems”.
Less than 20 seconds after receipt of the bad message, both FPRSA-R systems were offline, preventing any automated processing of flight plans.
Fortunately, there was an additional resilience measure in place: the NAS system stores 4 hours' worth of previously filed flight plans (thousands in total), allowing operations to continue temporarily during such disruptions.
But since flight plans often change (sometimes while planes are mid-flight) the stored plans become outdated quickly. And when it comes to safety, having up-to-date plans is pretty important.
To address this, NATS has a mechanism for manual entry of flight plan data. While effective in emergencies, manually inputting plans is labor-intensive and requires skilled staff, which drastically reduces the number of plans that can be processed.
The Level 1 engineers of the Service Management Command Centre (SMCC) received alerts immediately after the automated flight plan processing stopped.
ℹ️Support Levels
Levels range from 1 to 4, with Level 1 typically being less specialized engineers responsible for executing well-trodden runbooks to tackle common issues across many systems.
Level 4 engineers are the most technical and specialist subject matter experts (SMEs), responsible for individual systems.
The team began their investigation into the failure, and 27 minutes later they decided to follow protocol and reboot the FPRSA-R systems.
At 09:06, after failing to bring the system online, they escalated the issue to a Flight Data Processing (FDP) Level 2 engineer: someone with more detailed expertise on the specifics of the FPRSA-R system. As it was a public holiday, this engineer wasn’t on site, and consequently had to dial in and work via “remote hands”. For the uninitiated, this essentially means they’re working entirely through the on-site L1, who’s pushing all the buttons.
At 09:23, 51 minutes into the incident, the L1 team notified the on-watch air traffic controller in Swanwick. Shortly after, a duty service manager sent an SMS to the major incident managers group, warning them of the potential escalation to a major incident.
Interestingly, from the perspective of the L1 team, this was still not yet deemed to be a major incident.
The on-site Level 1 engineering team continued to work on the issue, assisted by the remote Level 2 engineer.
By 09:35, members of the NATS executive team—including the Technical Services Director, Operations Director, and CEO—were briefed on the situation. They weren’t yet running the incident, but they would be required to decide if and when to escalate to the highest level of command.
At 10:00, a meeting in the NATS operations room led to a key decision: the introduction of “regulations” to limit the number of planes allowed in UK airspace, or network capacity as it's known. After discussing the situation, they settled on reducing capacity to 75% of normal demand, which is usually around 800 flight plans per hour (so roughly 600 per hour).
By 10:12, all initial attempts to resolve the technical issue had been exhausted, leaving a full hardware reboot of the FPRSA-R system as the only remaining fix. However, this posed a logistical challenge: the on-site Level 1 engineers weren't authorized to perform the restart, and the Level 2 engineer couldn't do it through the remote hands procedure. The only way forward was for the Level 2 engineer to travel to the site immediately.
ℹ️Incident response command and control structure at NATS
As part of their incident response, NATS runs a Bronze, Silver and Gold command and control structure.
At 10:38, the "Bronze" response team was convened to discuss the incident, and make decisions on whether they should escalate to Level 3 specialists for the FPRSA-R system.
At 11:06, the "Silver" team also met to step up coordination efforts, and by 11:30, they called in the support of ATICCC, the Air Traffic Incident Communication and Coordination Cell. ATICCC is essentially a specialist group who manage communications across email, video calls, text messages, and the NATS website. They're responsible for keeping airlines, airports, government agencies, regulators like the Civil Aviation Authority, and EUROCONTROL informed of ongoing developments.
Several key events unfolded in parallel during this period. The Level 2 engineer was en route to the site to perform a full system restart, but with it being a public holiday, they were heavily delayed by traffic on the roads. Meanwhile, at 11:47, ATICCC held its first customer call, keeping stakeholders updated.
At 11:53—three hours and 21 minutes into the incident—the on-site engineers had escalated to the Level 3 SME. By this time, the Level 2 engineer had arrived on-site and spent 35 minutes attempting to reboot the FPRSA-R system. Unfortunately, none of these attempts was successful.
At 12:20, NATS activated their "Gold" team to oversee the response. Simultaneously, UK-wide air traffic regulations were tightened further, reducing airspace capacity to just 60 flights per hour—around 7% of normal levels—causing major delays and cancellations for passengers.
The technical investigation continued, with the "Bronze" team and the Level 3 engineer diving into detailed system log analysis.
Remember that 4-hour buffer of stored flight plan data from when the FPRSA-R system first went down? It was fully exhausted at 12:32, meaning the safe processing of UK airspace flight plans was entirely reliant on manual input.
The Level 3 engineer found a log message linked to the failed message processing. But they'd never seen this error before, and since they hadn't been involved in the software's development (reminder that the system was built by an outsourced vendor), they decided it was necessary to escalate further. At this point, five hours into the incident, the issue was only just being escalated to an engineer from Comsoft who had worked on the system.
Within this hour, Comsoft’s Level 4 engineer identified that the problem stemmed from a "bad" flight plan message, though the exact cause of the failure remained unclear.
What they did know, however, was how to mitigate it. So between 12:51 and 12:58, they worked together to isolate the offending plan and unblock automated processing.
At 13:00, with manual data entry still the only option, air traffic regulations were tightened further, reducing capacity to just 30 flights per hour—about 4% of normal operations. By 13:26, after resolving a separate issue involving system databases during the startup process (it never rains but it pours!), the FPRSA-R system successfully processed a batch of test flight plans, marking a small but crucial step forward.
With a few test flight plans successfully processed, the support teams started assurance and data integrity tests to ensure the issue had been genuinely resolved and the system was indeed stable.
By 14:27, the FPRSA-R system was back to full operation, automatically processing flight plans again. Whilst this marked the end of the technical component of the incident, the focus shifted to the operational recovery and unwinding the backlog of disrupted flights.
Seven hours after the incident first started, the Silver and Bronze response teams held their final calls and were stood down. With these folks predominantly aligned to technical recovery, this made sense.
The Gold team continued to manage comms, working with ATICCC, whilst air traffic regulations were eased and then removed. By 18:03, all regulations were removed and UK air traffic resumed normal operation.
For airlines and their passengers, the disruption obviously continued for some time longer, though the report doesn’t highlight this in detail.
It’s always nice to see positive outcomes discussed in incident reports too. Here are a few of the things they concluded in the wake of this incident:
Incidents like this remind us how elusive the idea of a singular “root cause” can be. Was it the bad flight plan that set things in motion? The bug in the software? A lack of comprehensive testing? Or perhaps the way the system handled—or didn’t handle—exceptions like this?
Pinning down a single cause is actively unhelpful when it comes to exploring and learning across the breadth of an incident. Instead, it’s far more useful to recognize that incidents typically result from multiple contributing factors, and to take the time to explore them thoroughly.
When we focus too narrowly on one “root cause,” we risk losing valuable opportunities to learn from the broader picture.
Clear, objective criteria for escalation can make all the difference during an incident. Whilst it might seem heavyweight and unnecessary, this can take the burden of decision-making off responders, speeding up your overall response.
Equally important is the culture around escalation. Encouraging teams to escalate early—and de-escalate if needed—creates a safety-first mindset, ensuring that hesitation or caution doesn’t make a bad situation worse.
In the context of this incident, it took more than 5 hours to escalate to an engineer close to the system, who was then able to diagnose the issue and restore the system within 2 hours.
If in doubt, escalate.
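To make “clear, objective criteria” a little more tangible, here’s an illustrative sketch. The thresholds and levels are invented, not NATS policy, but they show how simple rules can take the judgement call away from the person in the hot seat.

```python
# Illustrative escalation rules: thresholds and levels are invented, not NATS policy.
from dataclasses import dataclass

@dataclass
class IncidentState:
    minutes_without_diagnosis: int
    automated_processing_down: bool
    customer_impact: bool

def next_escalation_level(state: IncidentState, current_level: int) -> int:
    """Return the support level that should be engaged, based on objective rules."""
    level = current_level
    if state.minutes_without_diagnosis >= 30:
        level = max(level, 2)  # specialist for the affected system
    if state.automated_processing_down:
        level = max(level, 3)  # deep subject-matter expert
    if state.minutes_without_diagnosis >= 60 and state.customer_impact:
        level = max(level, 4)  # vendor / original developers
    return level

state = IncidentState(
    minutes_without_diagnosis=45,
    automated_processing_down=True,
    customer_impact=True,
)
print(next_escalation_level(state, current_level=1))  # 3
```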
Incidents with many moving parts often benefit from having a single-threaded lead—someone responsible for the overarching response and key decisions.
While teams, hierarchies and individuals with specialist knowledge will exist beneath them, having one person who maintains an overarching bird’s-eye view ensures consistency, alignment across domains, and faster decision making.
Without this, decisions risk being fragmented or delayed, especially in complex systems.
Nobody wants to scramble for the right words—or the right people to communicate with—during an incident. Pre-approved communication plans, key contact lists, and templated messages can save everyone time and stress.
Not only do they make life easier for responders, but they also improve the experience for customers, keeping them informed clearly and quickly.
This incident is a textbook example of why resilience is about more than just systems. Technically, there was resilience in FPRSA-R failing safe into maintenance mode and in the retained manual input capabilities.
Socially, this was a story of people stepping up—working outside normal hours, flexing established procedures, and finding creative ways to keep things moving.
Resilience isn’t just about anticipating failures; it’s about being prepared to adapt when they happen.
Game days aren’t just for operational teams. Anyone who might find themselves involved in a real incident—be it engineers, communications specialists, or execs—needs to be part of the practice.
Nobody should walk into their first incident without prior experience of the process, and game days or table-top exercises are a great way to exercise those muscles.
Curious about how to do this well? Check out this talk from Slack at SEV0 or explore our table-top exercise guide.
Having a single, up-to-date source of truth for your systems and suppliers is invaluable during an incident. The original NATS report cites the myriad of documentation sources, missing contact details, and overall lack of system understanding as major contributors to the severity of this incident.
Digging through outdated documents, or hunting for answers in Slack or Teams, wastes precious time. Under pressure, quick access to accurate information can be the difference between a smooth response and a prolonged crisis.
One of the most powerful tools for learning and improvement is sharing what went wrong and how you fixed it.
Kudos to NATS for publishing such a thorough report. It’s a great asset for the broader industry (and nerds like me), offering lessons we can all apply.
Normalizing the sharing of incident reports is a net positive for everyone. A rising tide lifts all ships, as they say.
If you've made it this far, congratulations. Hope you enjoyed the read, and feel free to share it around!
I'm one of the co-founders and the Chief Product Officer here at incident.io.
Ready for modern incident management? Book a call with one of our experts today.