Data teams are adopting more of the processes and tools used in software engineering, and talks at dbt's Coalesce conference in 2023 made it clear that there's a big push towards these practices at enterprise-scale companies.
At the moment there are plenty of tools in the data space for identifying errors in data pipelines, but very little for responding to those errors, such as coordinating fixes.
This is exactly where an incident management platform fits in. With one in place, data teams can put better processes behind how they respond to data errors and become more efficient as a result.
If you're on a data team, it's likely that you've fielded questions like this many times: a stakeholder asking why their dashboard looks broken. Bonus points if it lands on a Friday afternoon.
Breakages in data pipelines are inevitable: with data models, business logic, dashboards, and upstream data all continually changing, it's a case of "when" rather than "if".
But just because breakage is inevitable doesn't mean you shouldn't take steps to make identifying and fixing it as smooth as possible.
Let’s consider how this flow typically looks:
This, unsurprisingly, is flawed for a few reasons:
Automated tests implemented with tools like dbt help catch problems with your pipeline earlier. Observability tools like synq go one step further: identifying repeatedly failing tests, routing failures to the right owners, spotting anomalies (like a table's row count suddenly halving), and even sending messages to tools like Slack to help with debugging.
This flow is a lot more effective. It’s shorter, automated, and the more common consequence is dashboards becoming stale, not broken.
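To make the anomaly-spotting part of this concrete, here's a minimal sketch in Python of the kind of row-count check an observability tool runs for you automatically. The function name, the sample data, and the 50% drop threshold are illustrative assumptions, not how any particular tool works.

```python
from datetime import date

def detect_row_count_drops(daily_row_counts: dict[date, int], drop_threshold: float = 0.5) -> list[date]:
    """Return dates where a table's row count fell below the threshold versus the previous day."""
    ordered = sorted(daily_row_counts.items())  # order the daily counts chronologically
    return [
        day
        for (_, previous), (day, current) in zip(ordered, ordered[1:])
        if previous and current < previous * drop_threshold  # flag a sudden drop, e.g. a halving
    ]

# Hypothetical daily counts for a table: the 3rd of January would be flagged.
counts = {date(2024, 1, 1): 10_000, date(2024, 1, 2): 10_250, date(2024, 1, 3): 4_900}
print(detect_row_count_drops(counts))  # [datetime.date(2024, 1, 3)]
```

In practice you'd let your observability tooling own checks like this and route the alert to the right owner, which is exactly the point of not reinventing the wheel.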
But the concepts of automated testing, observability, and failure routing aren't new: they're part of a larger shift amongst data teams towards adopting the tools and processes used by software engineers.
After all, why try to reinvent the wheel?
So we now have a better way of identifying when something is broken, but what happens next?
Let's imagine this: you’ve got a stakeholder who has spotted a problem with your dashboard, or you’ve been notified of a problem with a critical data model that will cause stale or incorrect data downstream.
What do you do next?
There are observability tools that let you resolve a data incident outside of Slack or Teams, but ultimately communication with stakeholders and collaboration with the people helping you fix the issue will always trickle back to these messaging tools.
The bulk of the messaging will happen in a channel or a thread, like below.
This causes a number of problems:
The last point is particularly painful: being a data team that's consistently downstream of errors can be quite demotivating, especially when the only evidence is scattered across random messages and threads.
Data teams have adopted a lot of the tools and processes used by software engineers to help identify issues, so why not extend this to resolving these issues?
We use our own product at incident.io when things go wrong in our data pipeline, and having fixed data pipelines both with & without an incident management tool in place, I thought I’d share some top tips for responding to data incidents.
Each data incident should have its own dedicated channel. This is what I'd consider "table stakes": it's tempting to debug everything by replying to the thread where the issue was raised, but that results in a noisy mess that gets worse the more complex the issue is to fix.
Once you're in a dedicated channel, it's a lot easier to keep things tidy: for example, having a single thread for debugging a specific issue, or posting standalone updates that you pin to the channel.
Channels are a lot better for people responding to incidents, but they can still be a bit overwhelming for a non-technical stakeholder who’s just looking for updates.
You should provide a way for people to stay in the loop outside of the channel you’re using to respond to the issue.
Updates created in the channel are automatically threaded against the original Slack message where the channel was created, and sync with our web app
Every incident should always have a lead. This doesn’t mean that person is responsible for fixing everything, but they are responsible for coordinating the response and keeping people updated on progress.
A similar approach applies to actions: it should be clear who owns which fixes. This is especially important in incidents where multiple people are collaborating on the response.
If a data team sits downstream of a lot of data quality issues, much of its time will be spent identifying and resolving issues it didn't create rather than on value-add work. In short bursts that can be manageable, but over time it saps a team's capacity and can be quite demotivating.
A great tool to help combat this is, of course, data. Even tracking the number of data incidents, along with some useful tagging of the root cause (e.g. the upstream system responsible), can go a long way towards getting budget allocated to fixing issues at the source.
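As a sketch of what that tracking might look like, here's a small Python example that counts incidents by a root-cause tag. The incident records and tag values are hypothetical, not data pulled from any particular tool.

```python
from collections import Counter

# Hypothetical incident records, each with a root-cause tag added at close-out.
incidents = [
    {"name": "Orders model failing", "root_cause": "upstream schema change"},
    {"name": "Stale revenue dashboard", "root_cause": "upstream schema change"},
    {"name": "Duplicate customer rows", "root_cause": "bad backfill"},
]

# Count incidents per root cause, most common first: useful evidence when
# asking for budget to fix problems at the source.
by_root_cause = Counter(incident["root_cause"] for incident in incidents)
for cause, count in by_root_cause.most_common():
    print(f"{cause}: {count}")
```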
We calculate time spent on incidents from the time between items in an incident channel (messages, updates, actions), inferring whether that time was spent continuously.
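As a rough sketch of that gap-based idea (not incident.io's exact implementation), you could sum the gaps between consecutive timeline items and only count a gap as active time when it's short enough to plausibly be continuous work. The 30-minute threshold below is an illustrative assumption.

```python
from datetime import datetime, timedelta

ACTIVITY_GAP = timedelta(minutes=30)  # gaps longer than this are treated as time away

def time_spent(item_timestamps: list[datetime]) -> timedelta:
    """Estimate hands-on time from the timestamps of messages, updates, and actions."""
    ordered = sorted(item_timestamps)
    total = timedelta()
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap <= ACTIVITY_GAP:  # treat short gaps as continuous work on the incident
            total += gap
    return total

# Hypothetical timeline: two short bursts of activity, a few hours apart.
timeline = [datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 14, 10),
            datetime(2024, 5, 3, 16, 45), datetime(2024, 5, 3, 16, 50)]
print(time_spent(timeline))  # 0:15:00
```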
As the tools that data teams use to identify pipeline errors earlier grow in maturity, it's important not to forget how you respond to those errors, and to get the right processes in place to do so.
We only need to glance at how software engineers do incident management to see that data teams don't need to reinvent the wheel: a small amount of structure goes a long way!
If you want to learn about more ways that data teams, and organizations as a whole, can integrate incident management into their processes, then check out our guide: Incidents are for everyone.