# 5 on-call workarounds your PagerDuty team has normalized (and what they actually cost)

*May 29, 2026*

_Updated May 18, 2026_

> **TL;DR:** PagerDuty's alerting is reliable, but the coordination layer around it forces most SRE teams to build invisible infrastructure: custom Slack bots, pay tracking spreadsheets, DM-based shift swaps, and hand-rolled escalation scripts. Each workaround carries real engineering hours and operational risk that never appears on your invoice. We quantify the five most common workarounds, calculate what your custom on-call stack actually costs, and show how incident.io handles all five natively, eliminating the maintenance burden entirely.

Most SRE teams obsess over Mean Time To Resolution (MTTR) while ignoring the invisible infrastructure they've built just to make their alerting tool functional. The Slack bot that creates context-rich incident channels. The spreadsheet that calculates on-call pay. The DM thread where engineers negotiate shift swaps. The custom escalation script that handles edge-case conditional routing that Event Orchestration configuration alone doesn't fully cover.

Each of these started as a quick fix. None show up on your PagerDuty invoice. Someone has to maintain all of them, and the engineer who built them years ago may have since moved teams or left the company entirely.

You bought an alerting tool but ended up building an incident management platform around it. Here is what that custom stack actually costs, and which incident.io features eliminate each workaround natively.

## PagerDuty's missing features and hidden costs

PagerDuty's alerting is reliable. Page delivery, notification rules, and escalation layers have been battle-tested for a decade, and that track record is real. The problem is not the page itself. It is everything that happens around it: assembling the team, coordinating the response in Slack, and reconciling who was actually on call when the pay period ends. PagerDuty does not natively cover these workflows, so most teams solve them with a combination of DMs, spreadsheets, and custom scripts.

Process debt works like technical debt. A quick fix ships, solves the immediate problem, and gets forgotten. Six months later, it is part of the critical path. When the original builder leaves, that workaround becomes a liability with a bus factor of one, and those costs never appear on your PagerDuty invoice. The [on-call tool selection framework](https://incident.io/blog/on-call-tool-selection-framework-2026) for 2026 highlights this pattern across teams at a similar scale. The [incident management pricing comparison](https://incident.io/blog/incident-management-pricing-comparison-2026) makes the total cost of ownership visible in a way vendor invoices do not.

Here is how the native capabilities compare before diving into each workaround:

| Feature | PagerDuty | incident.io |
| --- | --- | --- |
| On-call pay tracking | Requires API export + third-party tools or custom scripts | Built-in pay report, configurable rules, CSV export |
| Cover requests | Override UI + manual notification, Shift Agent for assisted requests | /inc cover me in Slack, auto-notifies schedule, auto-creates override |
| Incident channel customization | Creates channels for P1/P2, but message format is mostly fixed; custom context requires a custom bot | Automatic on alert fire, with service context, timeline start, and role assignment |
| Escalation conditional logic | Linear escalation chains; conditional routing via Event Orchestration, with custom scripts for edge cases | Service Catalog-driven routing with team ownership built in |
| Schedule audit trail | Overrides visible but attribution limited natively | Full audit log with actor, target, and timestamp on Enterprise plan |

## Workaround #1: Untracked on-call handoffs via chat

The shift swap is a fundamental need for any on-call team. Engineers get sick, take vacations, and need coverage. What should be a formalized, logged, and acknowledged process has become an informal Slack DM exchange, followed by a manual override in the PagerDuty UI, followed by hoping the person who agreed to cover actually remembered to complete it.

### Why teams use DMs for shift swaps

PagerDuty does offer an override mechanism. My On-Call Shifts lets engineers view upcoming responsibilities and create overrides. PagerDuty's Shift Agent feature can automatically find a replacement by sending a message based on configured notification rules. But the friction of navigating to the override modal, confirming dates, times, and recipients, and formalizing the handover pushes most teams back to Slack DMs for the actual negotiation, with the formal override as an afterthought, if it happens at all.

### Coverage gaps and accountability risk

When a shift swap happens over DMs and no override is formally created, there is no audit trail. Nobody can prove who was supposed to be on call. If an incident fires during that window and escalation fails, the post-mortem must reconstruct coverage from memory and Slack scroll-back, which is exactly the archaeology that makes post-mortems inaccurate and blame-heavy.

Consider the accountability gap when neither party creates the formal override: PagerDuty's schedule history still shows the original on-call engineer, not who actually held the pager. The coverage agreement exists only in a DM thread. For compliance in regulated industries, that is not auditable evidence. For post-mortems trying to understand whether a P1 was escalated correctly, it is a dead end. The [incident.io cover request changelog](https://incident.io/changelog/flexible-cover-requests) describes exactly this friction: "Instead of coordinating through group chats and manually adding overrides in dashboards, you can simply create a cover request."

## Workaround #2: Spreadsheet-based on-call pay tracking

On-call compensation is a real and legally significant obligation for most engineering teams. Accurately tracking it requires knowing exactly when each engineer was on call, when they responded to alerts, and what multipliers apply to which time windows. PagerDuty does not provide this out of the box.

### Reconciling PagerDuty for on-call pay

Open-source tools exist to fill this gap. The [pager-hours tool](https://github.com/discordianfish/pager-hours) helps calculate time on call during weekday off-hours, weekends, and holidays. These are community-built, not supported by PagerDuty, and require someone to run them, validate the output, and translate results into payroll-ready numbers. The typical monthly reconciliation workflow looks like this:

1. Pull schedule data from PagerDuty's API or export schedule CSVs, the most common starting point for teams building their own reconciliation process.
2. Cross-reference alert history to identify actual response events.
3. Map timestamps to pay rule categories (business hours, off-hours, weekend, holiday).
4. Apply multipliers from the compensation policy (e.g., 1.5x for weekday off-hours, 2x for weekends).
5. Sum totals per engineer and prepare a finance-ready CSV.
6. Check for informal shift changes that may not have resulted in a formal override. DM-based swaps are a known payroll risk, as verbal or chat-based agreements rarely make it into the official schedule at the right time, leaving payroll to process pay based on the old schedule.
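
Steps 3 to 5 above can be sketched in a few lines of Python. This is a minimal illustration, not a payroll tool. The base rate, the 09:00-17:00 business-hours window, and the 1.0x business-hours multiplier are assumptions made for the example; the 1.5x and 2x multipliers come from step 4. It also assumes shifts start on hour boundaries and ignores holidays and time zones, which a real reconciliation would have to handle.

```python
from datetime import datetime, timedelta

# Assumed pay policy for illustration only; substitute your own compensation rules.
BASE_RATE = 10.0  # hypothetical on-call rate per hour
MULTIPLIERS = {"business": 1.0, "off_hours": 1.5, "weekend": 2.0}


def categorize(hour_start: datetime) -> str:
    """Map one on-call hour to a pay category (holidays omitted for brevity)."""
    if hour_start.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return "weekend"
    if 9 <= hour_start.hour < 17:  # assumed 09:00-17:00 business hours
        return "business"
    return "off_hours"


def shift_pay(start: datetime, end: datetime) -> float:
    """Sum pay for one shift by walking it an hour at a time."""
    total = 0.0
    current = start
    while current < end:
        total += BASE_RATE * MULTIPLIERS[categorize(current)]
        current += timedelta(hours=1)
    return total


def totals_per_engineer(shifts: list[tuple[str, datetime, datetime]]) -> dict[str, float]:
    """Aggregate shift pay per engineer, ready to write out as a CSV."""
    totals: dict[str, float] = {}
    for engineer, start, end in shifts:
        totals[engineer] = totals.get(engineer, 0.0) + shift_pay(start, end)
    return totals
```

Even this simplified version depends on the shift list matching who actually held the pager, which is why step 6 exists: an informal swap that never became an override produces a confident but wrong number.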

### Monthly toil and what it costs

The manual reconciliation process consumes significant time each month. [Glassdoor data](https://www.glassdoor.com/Salaries/san-francisco-senior-site-reliability-engineer-salary-SRCH_IL.0,13_IM759_KO14,46.htm) shows a $177,000 average base salary for a senior SRE in San Francisco. Divide that by 1,560 productive hours (2,080 working hours at 75% utilization) and you get a base hourly rate of approximately $113/hour. Apply a [1.25x to 1.4x fully loaded cost multiplier](https://www.glencoyne.com/guides/fully-loaded-cost-us-employee) for benefits, payroll taxes, and tools, and the fully loaded rate sits between $142 and $159 per hour, or approximately $153/hour at a 1.35x multiplier. This article uses $150/hour as a conservative estimate to account for teams outside major metros or at lower seniority bands. Even a few hours monthly on spreadsheet reconciliation at that rate represents substantial senior engineering time doing what should be finance work.

That represents thousands of dollars annually, spent by an engineer who should be building observability tooling or reducing toil, not cross-referencing API exports against a Google Sheet. The [incident.io on-call pay page](https://incident.io/on-call-pay) captures the real-world impact of automated pay tracking.

## Workaround #3: Custom Slack bots for incident channel context

PagerDuty's Slack integration does create channels for major incidents and supports bidirectional command access. The problem is customization: the message format is mostly fixed, and you cannot control which fields appear or inject additional context without building on top of PagerDuty webhooks. Teams that want dedicated incident channels with structured context (service owner, runbook links, alert details, current on-call responder, live timeline) end up building that themselves.

### The customization gap that drives bot development

PagerDuty's community forums document this friction directly: "The default PagerDuty Slack integration doesn't let you hide specific fields like Type, Service, or Urgency from incident notifications - the message format is mostly fixed, and there aren't settings to customize which fields show up. If you need total control, you could build a custom integration using PagerDuty webhooks and a Slack bot, but that does require some engineering work."

The [incident.io vs PagerDuty comparison](https://incident.io/blog/incident-io-vs-pagerduty-comparison) for 2026 shows why teams build on top of PagerDuty rather than replace it. As one engineering team described: "The team always has things that they want to change sometimes when they really want to change it, they just build something themselves. And we have so many bots running around our slack workspace."

### Who maintains the bot when the builder leaves

Your Slack bot authenticates against PagerDuty's API and Slack's API simultaneously. When Slack updates their API, the bot may break. When PagerDuty changes webhook payload formats, the bot may return malformed data silently. The [guide on updating Slack workflow bots](https://youtube.com/watch?v=OBHN4c3ZuWI) illustrates the ongoing engineering investment these integrations require, even for teams with modern tooling available.

The engineer who built your bot understands its dependencies, error handling, and deployment configuration. When they leave, that knowledge leaves with them. Whoever inherits the bot inherits undocumented infrastructure that breaks at the worst possible time, typically during a P1 at 3 AM. That is engineering time subtracted from proactive reliability work and observability improvements that would reduce incidents in the first place.

## Workaround #4: Custom escalation scripts

PagerDuty's escalation policies handle linear chains well. If Person A doesn't acknowledge within five minutes, page Person B. But production environments are rarely that simple. Teams need conditional routing: route to the database team if the alert involves connection pool errors, route to the platform team if it involves Kubernetes node pressure, but follow a different escalation chain if that same engineer is already primary on-call for another service.

### The conditional logic gap

PagerDuty does support dynamic routing through Event Orchestration, which lets you set rules based on incoming event fields to dynamically assign escalation policies. However, teams with complex conditional needs (like "if person X is paged in primary, then page this specific person from secondary; but if persons A, B, C are paged as primary, then page different people") often find Event Orchestration requires significant configuration or still resort to custom scripting for edge cases. Users have documented these limitations directly in community forums.
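
To make the edge case concrete, here is a minimal sketch of the kind of conditional logic teams end up hand-rolling. It does not use the PagerDuty or incident.io APIs. The team names, keyword rules, and engineer names are all hypothetical, and a real script would also need to call a paging API and handle failures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    summary: str
    service: str


# Hypothetical keyword-to-team rules, checked in order.
TEAM_ROUTES = [
    ("connection pool", "database-team"),
    ("node pressure", "platform-team"),
]
DEFAULT_TEAM = "sre-team"

# Hypothetical rule: who to page as secondary depends on who is primary.
SECONDARY_FOR_PRIMARY = {"alice": "dana", "bob": "erin"}
DEFAULT_SECONDARY = "frank"


def route_team(alert: Alert) -> str:
    """Pick the owning team from the alert text, falling back to a default."""
    text = alert.summary.lower()
    for keyword, team in TEAM_ROUTES:
        if keyword in text:
            return team
    return DEFAULT_TEAM


def pick_secondary(primary: str) -> str:
    """Choose the secondary responder based on who was paged as primary."""
    return SECONDARY_FOR_PRIMARY.get(primary, DEFAULT_SECONDARY)
```

The logic itself is short. The cost comes from everything around it: the mapping tables drift as people change teams, and nothing in the alerting tool knows this script exists.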

The [PagerDuty vs incident.io comparison](https://incident.io/blog/incident-io-vs-pagerduty-comparison) covers how the two platforms approach escalation architecture differently.

### Why custom escalation scripts fail mid-incident

Custom escalation scripts introduce failure modes that are well-recognized in software engineering, and they tend to surface at the worst possible time:

* **Authentication expiration:** API tokens and OAuth credentials have expiry windows. Scripts that don't handle credential rotation gracefully can fail silently: no escalation fires, no one gets paged, and no error surfaces until someone notices the page didn't arrive.
* **Unhandled edge cases:** Alert payload formats can change when monitoring tools are updated or reconfigured. A script built against one payload schema may throw an unhandled exception when the format shifts, dropping the escalation entirely.
* **Silent regression:** A script last touched months ago by an engineer who has since moved on can break when an upstream dependency changes. Without active test coverage, the regression goes undetected until it surfaces during a live incident.
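
The unhandled-payload failure mode is the easiest one to illustrate. The sketch below shows one defensive pattern, assuming a hypothetical payload with `service` and `severity` fields and a hypothetical catch-all target. When the payload shape changes, the function logs an error and pages a fallback instead of raising an exception and dropping the escalation.

```python
import logging

logger = logging.getLogger("escalation")

REQUIRED_FIELDS = ("service", "severity")  # hypothetical payload schema
FALLBACK_TARGET = "default-on-call"  # hypothetical catch-all responder


def escalation_target(payload: dict, routes: dict[str, str]) -> str:
    """Return who to page; fall back loudly rather than dropping the page."""
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        # A schema change upstream lands here instead of raising a KeyError.
        logger.error("alert payload missing %s; paging fallback", missing)
        return FALLBACK_TARGET
    return routes.get(payload["service"], FALLBACK_TARGET)
```

This covers only one of the three failure modes. Credential expiry and silent regressions need monitoring and test coverage of their own, which is more maintenance for the same script.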

When an SRE has to stop and debug custom code during a live incident, that time comes directly out of troubleshooting the underlying issue. Teams frequently report that the hardest part of a post-mortem is separating time lost to the actual failure from time lost to the coordination layer around it. [incident.io's Escalation Paths](https://docs.incident.io/api-reference/escalation-paths-v2) handle this natively without custom scripts or maintenance burden.

## Workaround #5: The cost of unrecorded schedule changes

On-call schedules are not static. Engineers join and leave teams, take parental leave, change time zones, and swap primary and secondary rotations. Managing these changes in PagerDuty requires navigating the schedule editor and creating layer overrides. [On-call scheduling strategies](https://incident.io/blog/on-call-scheduling-rotation-models) that work well require clear visibility into schedule changes and overrides. PagerDuty does offer scheduling visibility features, but attribution on informal changes and a consolidated audit trail of who changed what and when are limited natively.

### Why teams maintain a secondary source of truth

Teams with complex overlapping schedules (business-hours-only layers, secondary rotations, time zone overrides) sometimes maintain a supplementary reference outside PagerDuty. The pattern is common enough that PagerDuty recently launched Shift-Based Schedules explicitly to "eliminate the complexity of overlapping layers," acknowledging that managing multiple layers in a single view introduces confusion.

When the secondary source diverges from PagerDuty's actual schedule (which happens when informal DM-based swaps don't result in formal overrides), the team has two conflicting versions of who holds the pager with no systematic way to reconcile them.

### The impact on post-mortems and compliance

When escalation fails during an active incident, one of the first questions engineers ask in the post-mortem is who was actually supposed to hold the pager. If schedule changes happened informally, that answer may not exist in any auditable system: the post-mortem can identify the gap but cannot attribute it to a specific schedule change, because there is no record of when or why the schedule was modified.

This matters for compliance in regulated industries, where demonstrating that a properly qualified engineer was on call is part of audit evidence. It also matters for identifying patterns: if the same engineer was on call during three consecutive P1 incidents, the data needed to justify workload redistribution may not exist if schedule data was managed informally. [incident.io's audit log](https://help.incident.io/articles/9416833228-audit-logs) addresses this with a full record of configuration changes, including who made each change and when. The [on-call onboarding checklist](https://incident.io/blog/on-call-onboarding-30-day-checklist) for new engineers illustrates how many informal processes accumulate around schedule management when formal tooling does not cover the workflow.

## Your custom on-call stack: what it truly costs

Alerting tools are not incident management platforms. Teams using an alerting tool as their primary incident coordination system will build the missing layer themselves, and that custom infrastructure carries real costs that compound over years while concentrating risk in the people who maintain it.

### The fully loaded cost formula

The framework for calculating this uses the [fully burdened rate formula](https://www.fiscallion.io/blog/5-steps-to-stop-undercharging-for-your-teams-time): base salary plus benefits, payroll taxes, tools, and overhead, divided by billable hours at a 65-80% utilization rate.

For a senior SRE at a mid-market company in San Francisco, [Glassdoor data](https://www.glassdoor.com/Salaries/san-francisco-senior-site-reliability-engineer-salary-SRCH_IL.0,13_IM759_KO14,46.htm) shows an average base salary of $177,000 per year. Applying a 1.35x [fully loaded cost multiplier](https://www.glencoyne.com/guides/fully-loaded-cost-us-employee) for benefits, payroll taxes, and overhead yields approximately $239,000. At 2,080 working hours and 75% utilization (1,560 productive hours), that is approximately $153 per productive hour for a senior San Francisco SRE. This article rounds down to $150 per hour as a conservative estimate for teams outside major metros or at lower seniority bands.
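
The same calculation as a small Python function, so you can plug in your own numbers. The function name and defaults are illustrative. The 2,080-hour working year (52 weeks of 40 hours) is the assumption that produces 1,560 productive hours at 75% utilization.

```python
def fully_loaded_hourly_rate(
    base_salary: float,
    multiplier: float,
    working_hours: float = 2080,  # assumed 52 weeks x 40 hours
    utilization: float = 0.75,
) -> float:
    """Fully loaded annual cost divided by productive hours."""
    productive_hours = working_hours * utilization
    return base_salary * multiplier / productive_hours


# Senior SRE example from the text: $177,000 base salary at a 1.35x multiplier.
rate = fully_loaded_hourly_rate(177_000, 1.35)
print(round(rate))  # 153
```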

### A 50-person team's annual workaround spend

Using a conservative $150 per productive hour, here is an illustrative annual cost breakdown for a 50-person engineering organization with a typical 3-to-5-person on-call rotation, consistent with a common benchmark of 1:10 ([Google SRE Workbook](https://sre.google/workbook/table-of-contents/)).

Even modest monthly maintenance adds up quickly. Four hours per month spent reconciling on-call pay equals 48 hours annually, or roughly $7,200 per year in senior SRE time. A custom Slack bot requiring two hours of maintenance per month adds another $3,600 annually (2 hours × 12 months × $150/hour). Add escalation script upkeep, schedule audits, and time spent resolving informal shift swaps, and the coordination layer around PagerDuty can quietly consume tens of thousands of dollars per year in engineering time alone, before accounting for the operational risk of failed escalations or undocumented schedule changes.
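
A short sketch for running this estimate against your own stack. Only the two line items quantified above are filled in. The dictionary keys are labels chosen for the example, and you would add your own monthly hours for escalation scripts, schedule audits, and swap resolution.

```python
HOURLY_RATE = 150  # conservative fully loaded rate used in this article

# Monthly maintenance hours per workaround; extend with your own estimates.
monthly_hours = {
    "on-call pay reconciliation": 4,
    "custom Slack bot maintenance": 2,
}


def annual_cost(hours_per_month: float, hourly_rate: float = HOURLY_RATE) -> float:
    """Convert monthly maintenance hours into an annual engineering cost."""
    return hours_per_month * 12 * hourly_rate


annual = {name: annual_cost(hours) for name, hours in monthly_hours.items()}
total = sum(annual.values())
print(annual, total)
```

With just these two items the total is $10,800 per year, before any of the other workarounds are counted.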

### The human cost of on-call burnout

The spreadsheet does not capture everything. [On-call engineer onboarding research](https://incident.io/blog/on-call-engineer-onboarding-playbook) shows that chaotic, unstructured incident response drives attrition. Replacing a senior SRE costs 50-200% of annual salary in recruiting and lost productivity, [according to SHRM](https://www.shrm.org/executive-network/insights/myth-replaceability-preparing-loss-key-employees), which puts the replacement cost for a $177,000/year engineer between $88,500 and $354,000 per departure. If your on-call process drives burnout, the workaround stack's real cost is not the tens of thousands of dollars per year in maintenance time estimated above but potentially hundreds of thousands in turnover.

## Built-in incident automation and tracking

The alternative to maintaining a workaround stack is consolidating into a platform that handles these workflows natively. incident.io is Slack-native, building the entire incident lifecycle (on-call scheduling, pay tracking, cover requests, escalation paths, and audit logging) into the same interface where your team already coordinates. Watch [Intercom's migration from PagerDuty](https://youtube.com/watch?v=IirqpfXF2xE) to see what eliminating a workaround stack looks like in practice.

### Formalized on-call handover

Instead of negotiating shift swaps over DMs and hoping the formal override gets created, [incident.io's cover request system](https://incident.io/changelog/flexible-cover-requests) moves the entire workflow into Slack. Type `/inc cover me` in any channel where the incident.io Slackbot is present, add the shift details, and the system sends push notifications to everyone on the schedule. If someone accepts, the override is automatically added, with no separate step and no missed audit trail.

The [cover request help documentation](https://help.incident.io/articles/2815264840-cover-me,-overrides-and-schedules) shows how you can offer to cover part of a shift, add notes for the requester, and have the system handle the formal override creation automatically. This eliminates the DM-based swap that creates accountability gaps.

### On-call pay export without reconciliation

incident.io's [on-call pay report](https://docs.incident.io/on-call/pay-report) replaces the monthly spreadsheet process with a structured export. Configure your pay rules (no pay during business hours, 1.5x for weekday off-hours, 2x on weekends, specific rules for holidays), then export a summary CSV with total pay per person, ready to hand off to finance or import directly into payroll.

Once the pay rules are configured, the report handles the reconciliation that previously required a monthly API export, hours of timestamp mapping, and a hand-built spreadsheet for finance.

### Automate incident war room setup

When a Datadog alert fires, incident.io [automatically creates a dedicated Slack channel](https://slack.com/marketplace/A01DEGPUHHC-incidentio), pages the on-call engineer based on the service's escalation path, pulls in service owner context from the Service Catalog, and starts capturing a live timeline. No custom bot, no webhook configuration to maintain, and no maintenance cycle when Slack updates their API.

Your team starts troubleshooting immediately rather than spending the first several minutes assembling the war room. [Pleo's experience with incident.io workflows](https://youtube.com/watch?v=MMP3PBfELg4) shows how quickly this replaces custom tooling in practice.

> "Incident Workflows - The tool significantly reduces the time it takes to kick off an incident. The workflows enable our teams to focus on resolving issues while getting gentle nudges from the tool to provide updates and assign actions, roles, and responsibilities." - [Carmen G. on G2](https://g2.com/products/incident-io/reviews/incident-io-review-8240756)

### Error-proof escalation paths

incident.io's [Escalation Paths](https://docs.incident.io/api-reference/escalation-paths-v2) use your Service Catalog to route alerts based on service ownership rather than requiring you to build conditional logic yourself. The escalation configuration integrates with team and service metadata, so when an alert fires against a specific service, the routing already knows which team owns it and which escalation chain to follow.

The key distinction from PagerDuty's approach: incident.io's defaults are opinionated. You get escalation paths that work out of the box for most teams. If you need deep alerting customization with complex layer-based routing, PagerDuty remains more configurable. But for most teams, the opinionated defaults eliminate the need for custom scripts, and time-to-value is measured in days rather than weeks.

> "That's where incident.io really shines: it allows to seamlessly nudge or suggest actions. You can implement your incident management framework easily." - [Alexandre R. on G2](https://g2.com/products/incident-io/reviews/incident-io-review-8830447)

### On-call schedule history and audit

incident.io's [audit log](https://api-docs.incident.io/tag/Audit-logs/) captures every configuration change, including schedule modifications, with an actor (the person or system that made the change) and targets (the things that were modified). It is available on the Enterprise plan, where every schedule change, override creation, and escalation policy modification has a timestamped, attributable record.

For post-mortems that need to answer who was on call at a specific moment, the audit log provides a definitive answer. For compliance audits requiring demonstration of proper coverage, the log is the evidence. This replaces the informal pattern of trusting that overrides were created and that the informal DM swap is somewhere in Slack history.

## End the toil: automate away custom fixes

The [incident.io vs PagerDuty comparison](https://incident.io/blog/incident-io-vs-pagerduty-comparison) breaks down where these tools diverge by design. incident.io's [PagerDuty Rescue Program](https://incident.io/blog/incident-io-launches-pager-duty-rescue-program) offers AI-powered, dedicated migration support, including schedule imports, escalation policy mapping, and runbook setup. Mapping your existing schedules and escalation policies to their incident.io equivalents makes it clear which workarounds disappear immediately.

### Run a migration scan of your PagerDuty setup

incident.io provides [PagerDuty migration tools](https://docs.incident.io/getting-started/migrate-from-pagerduty) that map your existing schedules, escalation policies, and alert routing rules to their equivalents in incident.io. This gives you a concrete view of which workarounds transfer (escalation logic, schedule structures) and which disappear entirely (pay spreadsheets, channel context bots, escalation scripts).

The migration uses a parallel run: PagerDuty keeps running while incident.io handles coordination, then you cut over on-call scheduling when you are confident. [Easier migrations](https://incident.io/changelog/easier-migrations) are built into onboarding because replacing incident management tooling while incidents keep happening requires careful sequencing.

### On-call pay and cover requests

Both pay tracking and cover requests are included in the Pro plan at $45 per user per month. Weigh that against the cost of the workaround stack plus your PagerDuty seats. Watch [Bud Financial's experience](https://youtube.com/watch?v=oFHSRlospRo) to see how this shift lets engineers focus on resolving incidents rather than managing tooling.

[Schedule a demo](https://incident.io/demo) to see the on-call pay export and cover request features in a live walkthrough and learn how incident.io can eliminate your workaround stack.

## Key terms glossary

**Mean Time To Resolution (MTTR):** The average time from when an alert fires or an incident is declared to when it is fully resolved, including diagnosis, remediation, and verification. MTTR is the primary metric engineering teams use to measure incident response performance.

**Site Reliability Engineer (SRE):** An engineer who applies software engineering principles to infrastructure and operations problems. SREs own reliability targets, on-call rotations, incident response, and post-mortems at most mid-to-large engineering organizations. The role was pioneered at Google and is now standard across cloud-native engineering teams.

**Bus factor:** The number of people who need to leave or become unavailable before a system or codebase fails. A bus factor of one means one person leaving breaks everything.

**Override (PagerDuty):** A manual one-time adjustment to an on-call schedule that temporarily changes who is responsible for a specific time window, visible beneath schedule layers in the Final Schedule view.

**Escalation policy:** The configured sequence of notifications and time delays that determines who gets paged if the first responder does not acknowledge an alert within a defined window.

**On-call pay report:** A system-generated summary of each engineer's on-call hours with compensation calculated against configured pay rules, exported as a payroll-ready file.

**Slack-native:** A tool architecture where the primary interface and workflow happens inside Slack using slash commands and channel interactions, not a web UI that sends notifications to Slack as a secondary output.