# How Groq built incident management that moves as fast as its models 

* **20 mins → seconds** per status page update with pre-filled templates
* **3-week phased migration** from PagerDuty with zero customer impact
* **150+ engineers** and data center technicians supported across software and physical infrastructure

---

## Three tools that couldn't keep pace with AI

**PagerDuty notified people. It didn't help them respond.**

PagerDuty could wake someone up. But once they were up, they were on their own. The on-call engineer would get paged, sign into their laptop, connect to a VPN, open Slack, and then try to find the right channel and the right people. There was no automation to kick off an incident response process from that initial alert. Everything between "I got paged" and "we're all in the same room" was manual.

> PagerDuty was kind of on its own and it didn't connect well to Slack. It didn't help us assemble the right audience. The one responder who was on call would get notified, but it was a much more manual process for that responder to then go back to Slack and notify the rest of the team.

The time to assemble responders varied wildly, anywhere from 15 to 30 minutes, depending on who was on call, what time it was, and whether they happened to know the right Slack channel. Every incident followed a slightly different path because every responder had a different impression of the proper process.

> There were almost infinite permutations as to how we would get to that same Slack channel. And I think that was part of what made it frustrating. There wasn't this consistency in response.

**PagerDuty wasn't built for the speed of AI.**

Beyond the response workflow, PagerDuty created a persistent administrative burden. Reconfiguring schedules, attaching escalation policies to new services, training engineers on shift management: all of it took longer than it should have. Something as simple as a cover request was particularly painful.

> PagerDuty always felt like a bit of a time sink. Anytime we had to reconfigure schedules or attach an escalation policy to a new service, it took more time than we'd expect. And if someone needed to swap shifts, it always felt like at least a 30-minute meeting just to explain how to navigate it. When you're sick and not trying to work, that's probably one of the last things you want: hopping on a Zoom call just to get your shift covered.

For a company releasing new AI models almost every week, a tool that required a meeting just to swap a shift wasn't keeping pace.

## New models every week meant the catalog had to keep up

Groq's model catalog changes constantly. A new model goes live, and suddenly there's a new service that could page someone, a new endpoint that could have an incident, a new entry that needs to appear in a dropdown when an engineer is triaging at 3am. Keeping the on-call service catalog in sync with what was actually in production was a manual task that nobody had time for, and it meant responders couldn't always attribute an incident to the right model.

**Status page updates were a liability, not a process.**

Groq used Atlassian Statuspage for customer communications. The team had a long document full of template language: copy it, paste it into the status page fields, replace the placeholder text with the right service values, and publish. In theory, it worked. In practice, during a high-severity incident, people would miss the placeholders.

> In the middle of a high-severity incident, sometimes we would miss those placeholders and we'd end up posting stuff to the status page that didn't look fully baked. When you're dealing with an intense incident, you don't want to be trying to wordsmith the perfect language.

The manual flow, from a shared doc to the status page website and back to Slack to check the actual impact, added 10 to 20 minutes of overhead per update. Time that could have been spent fixing the problem. And with leadership copied on every update, the pressure to accurately represent the company at a vulnerable moment made an already stressful situation harder.

> Knowing that you're putting together language that is going to be read by hundreds of customers, if not thousands, feeling responsible for the voice of the business at a stressful time when your customers are already frustrated. I won't lie, I was sweating profusely a few times.

**Post-mortems were ad hoc and inconsistent.**

Before [incident.io](https://incident.io), retrospectives lived in a shared document template. Someone had to manually pull context from PagerDuty, link the alert, capture timestamps, and try to reconstruct the timeline. The quality depended entirely on who was writing it and how thorough they felt like being after a long incident. Timestamps from PagerDuty didn't always align with what happened in Slack. There was no single source of truth for what had actually occurred.

---

## Migrating production on-call without the anxiety

Groq evaluated Rootly, FireHydrant, and incident.io. The decision came down to coverage. Groq needed one platform that replaced three tools without leaving gaps. incident.io was the only option that covered the full surface area.

> incident.io just had the most comprehensive offering as far as what we were trying to do across on-call notifications, incident response, and customer communications via the status page. It had the best surface area to cover all of those needs.

The business case came together fast: consolidating three tools into one meant spending less money while getting more out of every incident, and the choice was clear.

> Things crystallized pretty quickly once we started realizing what we could do with a consolidated platform. And then it accelerated once we started doing the cost calculations. You're talking about not only efficiency improvements from a time and personnel perspective, but also monetary savings. That made it a lot easier to get the business aligned around this decision.

But choosing a tool is one thing. Migrating production on-call is another. Dylan led the migration personally and took a deliberately conservative approach.

**Week one: dual-paging.** The incident commander team put themselves on call in both PagerDuty and incident.io simultaneously. Every PagerDuty page triggered a corresponding incident.io notification.

For one to two weeks, they lived in both systems, validating that everything was wired up correctly. It added five seconds per alert to acknowledge both pages, but it meant Dylan's team understood exactly what they were asking of engineering teams before rolling it out to anyone else.

**Weeks two and three: phased rollout by complexity.** They started with the simplest schedules: lowest alert frequency, smallest blast radius. Each batch gave them time to discover rough edges, update documentation, and draft FAQ sections before moving to the next tier. By the time they reached the most critical, highest-frequency schedules, the process was polished.

> We sort of proceeded every one to two weeks. I think we had three batches in total. We would move up the complexity stack. And by the end of it, we had everybody moved over and we all kind of collectively sighed a great sigh of relief.

Once all schedules were migrated, Groq disconnected from PagerDuty entirely. 

> I wouldn't be too intimidated by the migration. incident.io's team will support you throughout. You don't have to feel like you're taking it on single-handedly. Ground yourself in what the outcome could be: not just cost savings, but quality of life for your on-call engineers, better learnings from these stressful situations, and ultimately a more reliable product for your customers.

The reaction from engineering was overwhelmingly positive.

> There were a few folks who tried to skip the line. They were asking us if we could move them over sooner. We had to try and tell them, you'll be moved over soon, but hang tight.

##### Terraform keeps the catalog in sync with production

With models shipping weekly, Groq needed the on-call service catalog to reflect what was actually live in production, without anyone having to update it manually. incident.io's Terraform integration solved this cleanly.

An automated Terraform sync runs roughly every hour. When a new model goes into production, it appears in the incident.io service catalog, and when an engineer triages an alert at 3am, the dropdown in Slack shows every model currently serving customer traffic: no stale entries, no missing options, and no admin overhead.
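Groq drives this through incident.io's Terraform integration rather than hand-rolled scripts, but the underlying loop is easy to picture. The sketch below is only an illustration of that hourly reconciliation, with hypothetical names and data throughout: compare the models currently serving traffic against the catalog, then add what is new and retire what is stale.

```python
# Illustrative sketch only: Groq's real sync runs through incident.io's Terraform
# integration. Function, variable, and model names here are hypothetical.

def plan_catalog_sync(models_in_production: set[str], catalog_entries: set[str]):
    """Work out what an hourly sync would add and remove so the catalog mirrors production."""
    to_add = models_in_production - catalog_entries      # models shipped since the last run
    to_remove = catalog_entries - models_in_production   # retired models, i.e. stale entries
    return to_add, to_remove

# Example: one model shipped and one was retired since the previous sync.
production = {"model-a", "model-b", "model-new"}
catalog = {"model-a", "model-b", "model-old"}

to_add, to_remove = plan_catalog_sync(production, catalog)
print(f"add: {sorted(to_add)}, remove: {sorted(to_remove)}")
# add: ['model-new'], remove: ['model-old']
```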

> Now, with incident.io, we run a Terraform sync roughly every hour, keeping the models in production in sync with our incident configuration. Responders can always associate the right incident with the right model. It's hard to overstate how helpful that is from an admin perspective.

For a company where the product catalog changes faster than most teams update their runbooks, this is not a nice-to-have. It is the difference between accurate incident attribution and guesswork.

##### Status page templates removed the guesswork under pressure

The old workflow of finding the template doc, copy-pasting, replacing placeholders, and hoping not to miss one is gone. incident.io's status page templates come with impacted services pre-filled. The language is pre-drafted and consistent. Engineers run the same status page update command and get the same templates every time.

> Being able to draw from the same set of templates consistently made a really big difference. It helped us feel really confident that our customers were having a much more consistent experience when subscribed to our status page.

This also transformed onboarding. New incident commanders no longer needed to hunt down the right template in a shared doc, figure out which version was current, or improvise an update for an incident that didn't match any of them. The process was the same every time, for everyone.

## Mobile-first incident response

Modern incident tools meet engineers where they are, and for Groq that means Slack. Because the whole response runs in Slack, engineers can run an incident from their phone: acknowledge an alert, draft a channel update, escalate to another team, or pull in an incident commander, all without opening a laptop.

> If you have to walk the dog and you get an alert, maybe you don't have to sprint home quite as quickly. You can maybe just pick up your pace a little bit and briskly walk back to your desk, knowing that you're still able to do a lot without your laptop.

For the 150+ engineers on call around the clock, not needing a laptop within arm's reach made a big difference.

---

## Data the whole business could see

The most transformative change wasn't faster response times or smoother migrations. It was visibility. With everything flowing through a single platform (alerts, incidents, post-mortems, action items), Groq could finally see the full picture.

> Having a holistic platform where incident response is captured end to end gave us a wealth of data.

Dylan can now look at data to answer questions like: How are our alerts translating into incidents? How are our incidents translating into post-mortems? How are our post-mortems translating into action items? And how are those action items actually getting converted into codified changes in our codebase?

Before incident.io, assembling this picture would have meant connecting timestamps across PagerDuty, Slack, and Jira, and trusting that they aligned. Dylan described it as effectively impossible without a level of manual effort nobody had time for.

##### After-hours paging insights nobody could have built manually

One specific insight changed how Groq prioritizes reliability work. Because incident.io is aware of each responder's local time zone, it can show which on-call alerts are waking people up at night, not just when alerts fire, but when they fire relative to the person who receives them.

> It can help you understand on a schedule-by-schedule or service-by-service basis, how many of these on-call alerts are waking people up in their local times. I'm not sure how I would have achieved that manually without tremendous investment in time zone Excel formula wizardry. I'm honestly not sure it would have been feasible.
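The per-alert check itself is small; what made it infeasible to build manually was getting alert history, schedules, and each responder's time zone into one place. As a rough illustration, not incident.io's implementation, the conversion looks something like the sketch below, with a hypothetical definition of "night" and made-up data.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Illustrative only: treat a page landing between 10pm and 7am in the responder's
# local time zone as one that woke someone up. Thresholds and data are hypothetical.
NIGHT_START, NIGHT_END = 22, 7

def is_night_page(fired_at_utc: datetime, responder_tz: str) -> bool:
    local = fired_at_utc.astimezone(ZoneInfo(responder_tz))
    return local.hour >= NIGHT_START or local.hour < NIGHT_END

pages = [
    (datetime(2025, 11, 3, 9, 30, tzinfo=timezone.utc), "America/Los_Angeles"),  # 1:30am local
    (datetime(2025, 11, 3, 17, 0, tzinfo=timezone.utc), "America/Los_Angeles"),  # 9:00am local
]
night_pages = sum(is_night_page(ts, tz) for ts, tz in pages)
print(f"{night_pages} of {len(pages)} pages landed in local night hours")
```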

This data feeds directly into prioritization decisions: teams going through turbulent periods, the ones getting woken up most often, get targeted support and reliability investment. It is a human-first use of incident data that has a direct impact on engineer wellbeing and retention.

##### Executive confidence in reliability data

At the end of 2025, Groq used incident.io data to put together a year-end reliability review for the executive team: alerts, incidents, post-mortems, action items, and code changes, all traced through a single incident ID, mapped to the appropriate team and service, and kept current through Terraform.

> When they asked where this data was coming from, being able to give them a really solid answer: here are the numbers, here's where they're derived, and if you'd like, you can look at this dashboard yourself. It felt like a really nice way to avoid people doubting or losing confidence in the story we were trying to tell.

##### AI-assisted updates save time during live incidents

Dylan flagged incident.io's AI capabilities as a significant time saver. Automatically drafted status updates in Slack summarize complex issues into a few sentences and get everyone aligned on impact, root cause, and next steps, reducing the cognitive burden during active incidents.

> Being able to draft those status updates automatically in Slack has been really powerful. It allows you to do a lot more in a lot less time.

##### Post-mortems went from inconsistent documents to structured learning

With incident.io, the retrospective process pulls through context automatically: timestamps translated into local time zones, alert data linked directly, and a consistent structure for what went well, what went poorly, and what to do differently. The document is largely assembled before anyone sits down to write it.

> To have that accurate information flow into it automatically was a really nice two-fold win: consistency of the document structure, and a reduction of burden on the folks who had to do the document assembly.

---

## What incident response looks like now

When asked what losing incident.io would mean for Groq, Dylan did not hesitate.

> If you took away incident.io tomorrow, I'd be pretty distraught. Going back to the old way of working would feel like a regression, both for our team and for our customers.