How Favor reduced MTTR by 37%

Favor is a food delivery service, powering the majority of H-E-B's home deliveries across Texas. As a subsidiary of the state's most beloved grocery chain, Favor operates a complex logistics network that goes beyond traditional food delivery. Customers can order H-E-B groceries for delivery in as little as 45 minutes, as well as restaurant meals and convenience items from thousands of merchants.

This real-time orchestration of drivers, merchants, and customers means that when Favor experiences an incident, the impact cascades across the Texas delivery ecosystem.

When human coordination hit its breaking point

During a holiday weekend last year, Favor's assignment algorithm experienced scaling issues. Orders weren't getting matched to drivers, creating cascading failures across their delivery network. As Director of Engineering, Ross McKelvie found himself at the center of the incident.

"The COO was DMing me directly asking what was going on," McKelvie recalls.

I had two different Zoom calls running and two laptops open—I'd mute one and talk on the other, then switch to relay information.

This had become standard operating procedure for major incidents, with McKelvie serving as the human router for incident communications while simultaneously trying to resolve the underlying technical issues.

Manual setup burned 30 minutes before troubleshooting could begin

Coordinating incident response required a complex series of manual tasks: creating a Slack channel with the correct naming convention, pulling in the right engineers, setting up a Zoom bridge, updating the Atlassian Status Page, and providing regular updates to leadership.

Spinning up an incident was taking 20-30 minutes. By the time we got organized, we'd already lost orders.

Meanwhile, Favor's tooling had evolved into a frustrating patchwork. OpsGenie handled alerting, but Status Page required manual updates through a separate login. Engineers without licenses couldn't post updates at all.

"These are both Atlassian products, and they didn't play well together," notes Abbas Bandali, Principal SRE. "There was no way for engineers to post an update to Status Page without logging in and having a Status Page license."

"It was uncanny" how incident.io matched their needs

After the holiday weekend incident, McKelvie was tasked with finding a better process. What he discovered at incident.io caught him off guard.

It was uncanny how incident.io built a product that matched what we were trying to do by hand. It was automating things we didn't even think to automate.

The theoretical became urgent when an incident struck during their proof-of-concept evaluation of incident.io. Watching themselves struggle through the manual process while knowing a better solution existed created immediate buy-in: "the incident convinced the other engineering directors they needed incident.io right away."

Once the team saw incident.io handling their real-world scenarios, the decision became clear. "You guys were doing stuff that we didn't even think about," added McKelvie.

Implementation expanded across three departments

What began as an engineering initiative quickly revealed broader value. Engineering uses it for technical incidents. Operations adopted it for POS system outages affecting restaurant partners. Trust & Safety manages insurance claims and security issues.

"The customizable aspects let us build workflows and notifications for each team without overloading the engineering channel," McKelvie explains.

Meeting engineers where they already work

The adoption challenge for any operational tool is significant, particularly during high-stress incidents. Kevin Panahi, a recently joined Senior SRE, immediately noticed the difference. "At previous places, everyone's asking 'Where's the group chat? Can you add me? Is there a call going on?' With incident.io, everything's just there in Slack. No hunting around."

Abbas Bandali puts it simply: "There's something about that Slack workflow. The average engineer spends most of their time in Slack. incident.io understood that."

The team also integrated incident.io with the service catalog. "We dynamically look up the owner of the alert," Panahi explains. "Teams maintain service ownership themselves in the service catalog, and alerts automatically go to the right team." This eliminated the DevOps team's burden of manually updating configuration maps whenever teams reorganized.
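The service catalog lookup Panahi describes in this section can be pictured with a short sketch. This is not Favor's configuration or incident.io's API. The catalog contents, service names, team names, and the `resolve_owner` helper are all hypothetical, chosen only to show the idea of routing an alert by looking up its owner at alert time instead of maintaining a separate routing map.

```python
from dataclasses import dataclass

# Hypothetical service catalog. In the setup described above, each team
# maintains its own entry, so ownership lives in exactly one place.
SERVICE_CATALOG = {
    "service-a": {"owner": "team-one"},
    "service-b": {"owner": "team-two"},
}

# Assumed default for services with no registered owner (not from the source).
FALLBACK_TEAM = "sre-on-call"


@dataclass
class Alert:
    service: str   # name of the service that fired the alert
    summary: str   # human-readable description


def resolve_owner(alert: Alert, catalog: dict) -> str:
    """Look up the owning team for an alert at the moment it fires."""
    entry = catalog.get(alert.service)
    if entry is None:
        return FALLBACK_TEAM
    return entry.get("owner", FALLBACK_TEAM)


alert = Alert(service="service-a", summary="error rate above threshold")
print(resolve_owner(alert, SERVICE_CATALOG))
```

Because the owner is resolved when the alert fires, a reorganization only requires the affected team to update its own catalog entry. No central routing map has to change.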

Cultural shifts required evangelization

The technical implementation was just part of the transformation. As Bandali explains, "There were two radical shifts that we had to make in our mindset."

First was decoupling alerts from escalations. "Before incident.io, it was very much coupled. It took us a long time to complete the mind shift to understand the decoupling, and then we had to evangelize that to the rest of engineering."

Second was treating every alert as a potential incident, a shift still in progress. "We haven't really swallowed the blue pill on that yet," Bandali admits. "We have come a long way in terms of alert hygiene. Engineers are very grateful now that alerts are actionable and that if it's tied to an incident, then it gets the relevant people involved."

The results: a 37% reduction in MTTR transformed operations

Mean time to resolution dropped 37%. But the more telling metric is that incident detection increased by 214%.

This wasn't because Favor had more problems. They were catching small issues before they became major outages. While the number of trivial and minor incidents increased, the number of major and critical incidents dramatically fell.
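For readers unfamiliar with the metric, here is a minimal sketch of how an MTTR reduction is calculated. The durations below are made-up illustrative values, not Favor's data. They were picked only so that the arithmetic lands on the same percentage as the headline figure.

```python
from statistics import mean


def mttr(resolution_minutes):
    """Mean time to resolution: the average of per-incident durations."""
    return mean(resolution_minutes)


def percent_reduction(before, after):
    """Percentage drop from the 'before' MTTR to the 'after' MTTR."""
    return 100 * (before - after) / before


# Illustrative durations in minutes (hypothetical, not from the case study).
before_minutes = [120, 90, 90]
after_minutes = [70, 60, 59]

mttr_before = mttr(before_minutes)
mttr_after = mttr(after_minutes)
print(percent_reduction(mttr_before, mttr_after))
```

Note that MTTR is an average over all incidents, so catching more small incidents early can lower it even while the total incident count rises.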

Most significantly, they flipped the script on incident detection.

Prior to incident.io, the majority of our incidents were reported by users. Now we are 2-to-1 for automatically creating an incident versus a user reporting it.

Organizational visibility transformed stakeholder relationships

At Favor's semi-annual all-company meeting, leadership announced incident.io's rollout as a major transparency initiative. Customer-facing teams no longer anxiously ping engineers for updates—they monitor progress directly in Slack. Leadership sees real-time status without creating additional communication overhead for engineers solving problems.

This visibility has fundamentally changed how different teams interact during incidents, transforming what was once a source of organizational stress into a predictable, efficient process.

Looking forward

Favor's transformation happened remarkably quickly. Within months, the team had overhauled its operational capabilities.

The team is now working toward a future where every actionable alert creates an incident that self-resolves when appropriate. They're exploring how incident.io's AI SRE capabilities might replace another set of manual tasks like checking recent deployments, which cause about half their incidents.
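The manual step of checking recent deployments can be illustrated with a small sketch. This shows the kind of check an engineer performs by hand today. It is not how incident.io's AI SRE works, and the record shape, the `recent_deployments` helper, and the 30-minute window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def recent_deployments(alert_time, deployments, window=timedelta(minutes=30)):
    """Return deployments that finished within `window` before the alert,
    newest first. Each deployment is a dict with a 'deployed_at' datetime."""
    candidates = [
        d for d in deployments
        if alert_time - window <= d["deployed_at"] <= alert_time
    ]
    return sorted(candidates, key=lambda d: d["deployed_at"], reverse=True)


# Hypothetical deployment log.
deployments = [
    {"service": "service-a", "deployed_at": datetime(2024, 1, 1, 11, 50)},
    {"service": "service-b", "deployed_at": datetime(2024, 1, 1, 9, 0)},
    {"service": "service-c", "deployed_at": datetime(2024, 1, 1, 11, 40)},
]

alert_time = datetime(2024, 1, 1, 12, 0)
suspects = recent_deployments(alert_time, deployments)
print([d["service"] for d in suspects])
```

A time window is a crude filter. It surfaces candidates for a human or an automated step to examine, but it does not establish that a deployment caused the alert.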

For a company delivering time-sensitive services across Texas, where every minute of downtime translates directly into lost orders and damaged customer relationships, that transformation has proven invaluable. The platform that once depended on a single engineering director managing incidents via dual laptops and competing Zoom calls now operates with the operational maturity their business demands.

incident.io didn't just solve our incident management problem, it transformed how we think about reliability, coordination, and continuous improvement.

Ross McKelvie, Director of Engineering