Updated May 4, 2026
TL;DR: Engineering teams face tool sprawl during critical moments, with new SREs often navigating five different tools (PagerDuty, Slack, Datadog, Jira, Confluence) while learning to troubleshoot systems. By reducing that tool-switching overhead with a structured shadowing program and Slack-native automation, you can ramp new engineers more efficiently. This playbook covers the exact steps: runbook creation, live incident shadowing, first-incident drills, and how incident.io's AI SRE automation reduces cognitive load so new hires focus entirely on diagnosing problems.
Reading documentation doesn't prepare an engineer for a 3 AM outage. Shadowing real incidents and running simulated drills does. Yet most teams hand a new hire a Confluence link, a PagerDuty login, and a "you'll figure it out" before their first on-call shift, and the result is fumbled escalations, missed status page updates, and a senior SRE dragged back in to clean it up.
We put this blueprint together to fix that, using structured workflows that get engineers incident-ready fast without adding risk to production.
When you onboard a new developer, setting up their local environment and merging their first pull request takes a few days. When you onboard a new on-call engineer, teaching them to respond to a cascading P1 at 2 AM, communicate clearly to stakeholders, and navigate five different browser tabs simultaneously takes much longer, largely because teams never write down the process itself. On-call culture means owning the pager, owning the communication, and owning the post-incident learning, and none of that fits into a two-hour HR session.
When the answer to "what do I do during a database incident?" is "ask Sarah," you don't have a process, you have a person. When Sarah leaves, that knowledge leaves with her. The Google SRE book on accelerating on-call discusses this challenge: new SREs often need to rely on the development team for every question because they don't have enough context to react appropriately.
Unclear escalation paths stall incidents. When a P2 fires at 11 PM, a new engineer shouldn't spend 20 minutes figuring out who owns the database team's on-call rotation by scrolling through a Google Sheet. The Google SRE Workbook on on-call addresses this problem: the absence of documented ownership means incidents stall on logistics while customers are already impacted.
Here's the typical on-call workflow before any consolidation: PagerDuty fires an alert, you manually create a Slack channel, open Datadog, start a Google Doc for notes, create a Jira ticket, and remember to update Statuspage. This kind of tool sprawl creates a "coordination tax" that can cost 15 minutes per incident before actual troubleshooting starts. For a new engineer, that tax is even higher because every tool requires a separate mental model.
Under cognitive overload, human judgment slows instead of acting as a defense mechanism, compromising incident response at the worst possible moment. New engineers freeze not because they can't debug a system, but because they're context-switching between five tools while trying to debug a system.
The most common new-hire on-call failures are predictable: paging the wrong team because the escalation path wasn't documented, forgetting to update the status page because focus was on Datadog, or trying a fix the team had already ruled out 20 minutes earlier because no one captured that decision. These aren't failures of competence, they're failures of process design.
We built this blueprint around a simple principle: remove the tool-learning overhead entirely and an engineer can be incident-ready faster. Extended ramp times often exist because new hires are learning five tools and one set of systems simultaneously. Flip that ratio and the timeline compresses.
Day 1 removes friction before any incident happens. The goal is full system access and a working mental model of the services the new hire will own.
/inc command: Have them declare a test incident in Slack to confirm the workflow feels natural before any pressure is applied.

For a full walkthrough of how incident.io's on-call setup works end to end, Chris Evans from the incident.io team covers the full configuration in a product demo.
One effective way to build a new engineer's mental map is regular exposure to real outages, including the trigger conditions and the mitigation steps. Day 2 is structured shadowing: the new hire is in the incident channel, but the senior SRE is the primary responder. Their job is to observe, not to act:
If you don't have a live incident on Day 2, replay a past one. Pull up the incident channel history, the post-mortem, and the alert timeline. Walk through the incident chronologically, pausing at each decision point to ask the new hire what they would have done. incident.io's Insights dashboard helps you analyze past incidents by service and severity, so you can pull exactly the incident type most relevant to their service area.
The Google SRE approach to training emphasizes using real historical scenarios because familiar failure modes build pattern recognition. Run your Day 3 drill using a safe, non-customer-facing scenario from your own incident archive:
A runbook that hasn't been updated since the last incident is a liability, not an asset. Every runbook needs core sections to be useful during a crisis, adapted from on-call runbook best practices:
| section | what to include |
|---|---|
| Trigger conditions | Alert name, error message, observable symptoms |
| Service overview | Purpose, service owner, Slack channel, criticality tier |
| Architecture | Dependency map and failure impact (typical components) |
| Quick diagnostic commands | Health checks, log queries, expected outputs |
| Remediation steps | Numbered steps with expected outcomes per step |
| Escalation path | Named contacts, on-call schedule link, backup contacts |
| Verification criteria | How you confirm the issue is resolved |
Every runbook needs metadata at the top: title, version, last-tested date, owner, severity level, estimated resolution time, and risk rating. This runbook metadata lets a new engineer confirm immediately that they're reading the right document for their situation, rather than discovering the wrong runbook three steps in.
After every incident, add a lessons-learned block to the relevant runbook promptly. Capture the exact commands that worked, the ones that failed, and the escalation decision point. Set a regular review cadence: one senior SRE runs through each critical runbook as if it's their first time reading it, timing how long it takes and noting every question they have. Those questions become the gaps you fix before the next new hire joins the rotation. A well-maintained runbook should evolve directly from real incident data.
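The metadata header and review cadence above can be checked mechanically. Here is a minimal sketch of that idea in Python; the field names, sample values, and 90-day cadence are illustrative assumptions, not an incident.io feature:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class RunbookMeta:
    """Header metadata every runbook should carry (fields from the checklist above)."""
    title: str
    version: str
    last_tested: date
    owner: str
    severity: str            # e.g. "P1", "P2"
    est_resolution_min: int  # estimated resolution time, in minutes
    risk_rating: str         # e.g. "low", "medium", "high"

def is_stale(meta: RunbookMeta, today: date, max_age_days: int = 90) -> bool:
    """Flag runbooks whose last-tested date is older than the review cadence."""
    return (today - meta.last_tested) > timedelta(days=max_age_days)

# Hypothetical runbook, last tested 114 days before the check date:
meta = RunbookMeta(
    title="API latency: connection pool exhaustion",
    version="1.4",
    last_tested=date(2026, 1, 10),
    owner="platform-team",
    severity="P2",
    est_resolution_min=30,
    risk_rating="medium",
)
print(is_stale(meta, today=date(2026, 5, 4)))  # → True (overdue for review)
```

A check like this can run in CI against every runbook header, turning "untested runbook" from a vague worry into a failing build.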
Define the rules before the first incident happens, not during it.
What the shadow does:
What the shadow doesn't do:
This keeps the shadow in learning mode and keeps the primary responder's cognitive load low.
Before a new engineer transitions from shadow to primary responder, confirm they've completed:
The senior SRE running the drill should follow three rules:
An engineer is ready for their first solo shift when they demonstrate these capabilities in a supervised drill:
If any criterion fails, schedule additional live shadowing sessions before re-assessment. The goal isn't to rush engineers through a checklist, it's to ensure the process feels automatic before the pager goes live.
Major incidents (P1s and P2s) should produce four things:
Post-mortems buried in Confluence folders nobody reads aren't a knowledge base, they're an archive. A usable knowledge base is indexed by service name, incident type, and root cause category so a new engineer can quickly find relevant incidents. When a new hire joins the on-call rotation, being able to search "database connection pool" and immediately find the three most relevant historical incidents is a faster orientation than any documentation session.
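The "searchable knowledge base" idea can be approximated even without a dedicated platform. A rough keyword-overlap ranking sketch; the incident records here are invented for illustration:

```python
def rank_incidents(query: str, incidents: list[dict], top_n: int = 3) -> list[str]:
    """Rank past incidents by how many query words appear in their title and root cause."""
    words = set(query.lower().split())
    scored = []
    for inc in incidents:
        text = f"{inc['title']} {inc['root_cause']}".lower()
        score = sum(1 for w in words if w in text)
        if score:
            scored.append((score, inc["id"]))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best match first, stable by id
    return [inc_id for _, inc_id in scored[:top_n]]

past = [
    {"id": "INC-2847", "title": "API latency spike", "root_cause": "database connection pool exhausted"},
    {"id": "INC-2901", "title": "Checkout errors", "root_cause": "expired TLS certificate"},
    {"id": "INC-3014", "title": "Database connection timeouts", "root_cause": "pool size misconfigured"},
]
print(rank_incidents("database connection pool", past))  # → ['INC-2847', 'INC-3014']
```

Real platforms use far better retrieval than word overlap, but even this crude index beats scrolling a Confluence folder during a new hire's first week.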
Review every runbook tied to a service that has had a recent incident. Assign ownership clearly: the team lead for each service is responsible for the runbook's accuracy. Add the last-tested date to the runbook header and make it visible, because an untested runbook is an unreliable one.
We built incident.io to make the 3-day blueprint above work without any tool-learning overhead. New engineers don't have to learn a new interface alongside a new system because the entire incident lifecycle happens inside Slack, the tool your team already uses. There's no web UI to learn during a 3 AM incident, just /inc commands that feel exactly like Slack messages.
When a Datadog alert fires through your alerting platform (PagerDuty or Opsgenie), incident.io can automatically create a dedicated channel (for example, #inc-2847-api-latency), page the on-call engineer, and pull in context from the Service Catalog. The new engineer joins the channel and sees the triggering alert, the service dependencies, and a live timeline already recording.
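The auto-created channel name follows a predictable pattern. Here is a sketch of how an alert title could be normalized into a Slack-safe channel name; the convention is inferred from the #inc-2847-api-latency example above, not incident.io's documented behavior:

```python
import re

def incident_channel_name(incident_id: int, alert_title: str, max_len: int = 80) -> str:
    """Build a Slack-safe channel name like 'inc-2847-api-latency' from an alert title.

    Slack channel names must be lowercase, without spaces or punctuation,
    and at most 80 characters long.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", alert_title.lower()).strip("-")
    return f"inc-{incident_id}-{slug}"[:max_len]

print(incident_channel_name(2847, "API Latency: p99 > 2s"))
# → inc-2847-api-latency-p99-2s
```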
They type /inc assign @sarah to set the incident commander and /inc escalate @database-team to bring in the specialist. These Slack shortcuts require zero training because they're structured like normal Slack messages, not commands from a 47-step runbook.
Scribe, incident.io's AI note-taker, joins the incident call on Google Meet or Zoom and captures every decision, diagnostic step, and key moment in real time. When someone says "rolled back deploy abc123," Scribe logs it to the timeline automatically, with no dedicated note-taker and no engineer pulled away from troubleshooting to type updates into a Google Doc. That eliminates the manual update work a senior engineer would otherwise do during an incident, freeing them to focus entirely on the fix.
For new engineers, this is a genuine safety net. They can focus entirely on diagnosing the problem because they know the context is being captured.
When the engineer types /inc resolve, incident.io drafts the post-mortem automatically using the captured timeline, Scribe's transcription, and the key decisions logged during response. The draft is 80% complete before anyone opens a blank document, so the new engineer spends 10 minutes refining rather than 90 minutes reconstructing from memory.
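Drafting from a captured timeline is conceptually simple. A toy sketch of the idea, rendering logged events into a post-mortem skeleton; this illustrates the concept, not incident.io's actual generator, and the events are invented:

```python
def draft_postmortem(incident: dict, timeline: list[tuple[str, str]]) -> str:
    """Render captured timeline events into a markdown post-mortem skeleton."""
    lines = [
        f"# Post-mortem: {incident['title']} ({incident['severity']})",
        "",
        "## Timeline",
    ]
    lines += [f"- {ts}: {event}" for ts, event in timeline]
    lines += ["", "## Root cause", "TODO: confirm with the responding team"]
    return "\n".join(lines)

timeline = [
    ("03:14", "Alert fired: api-gateway p99 latency"),
    ("03:21", "Rolled back deploy abc123"),
    ("03:40", "Latency back to baseline; incident resolved"),
]
print(draft_postmortem({"title": "API latency spike", "severity": "P2"}, timeline))
```

Because the skeleton is built from events captured as they happened, the human edit pass is refinement rather than reconstruction from memory.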
Our AI SRE assistant automates up to 80% of incident response, handling triage, root cause investigation, and fix recommendations, so your team spends less time on coordination and more time reducing MTTR. For a junior engineer handling their first real P2, having an AI that surfaces similar past incidents and suggests probable root causes in Slack acts as a senior SRE looking over their shoulder, without the burnout risk.
The Service Catalog in incident.io surfaces the right runbook, service owner, and dependency map directly into the incident channel when an alert fires. A new engineer doesn't need to remember where the runbook lives or which team owns the affected service because the platform pulls that context into the channel where they're already working.
The team routing feature means alerts automatically escalate to the correct team based on the service that triggered them, with no spreadsheet lookups, no manual pings, and no "who owns the payment service?" in the middle of a P1.
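Service-based routing replaces the spreadsheet lookup with a catalog lookup. A minimal sketch of the pattern; the catalog entries and fallback target are illustrative assumptions:

```python
# Hypothetical service catalog mapping services to owning teams and escalation targets.
CATALOG = {
    "payment-service": {"team": "payments",  "escalation": "payments-oncall"},
    "api-gateway":     {"team": "platform",  "escalation": "platform-oncall"},
    "search-indexer":  {"team": "discovery", "escalation": "discovery-oncall"},
}

def route_alert(service: str) -> str:
    """Return the on-call escalation target for the service that triggered the alert."""
    entry = CATALOG.get(service)
    if entry is None:
        return "default-oncall"  # fall back instead of stalling the incident on logistics
    return entry["escalation"]

print(route_alert("payment-service"))  # → payments-oncall
print(route_alert("unknown-service"))  # → default-oncall
```

The fallback matters: an unrouted alert should land on a default rotation, never in a "who owns this?" thread.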
Intercom's engineering team resolved incidents faster and reduced MTTR after adopting incident.io. The key drivers: automated summaries, real-time highlights, and auto-created channels removing the tool-switching overhead that slows new engineers down most.
"Incident.io helps promote a blameless incident culture by promoting clearly defined roles and helping show that dealing with an incident is a collective responsibility. We have also started using it to conduct game days, so that we can better prepare for a catastrophic scenario." - Saurav C. on G2
Before handing over the pager, confirm every item below. (Severity levels for reference: P1 = critical outage, all customers affected; P2 = major impact, partial functionality remains; P3 = some customers affected or workaround available; P4 = minor issue, minimal customer impact.)
Access and credentials:
Practical readiness:
/inc commands executed successfully in a test incident

| Day | Focus | Deliverables |
|---|---|---|
| Day 1 | Tool access, architecture walkthrough, runbook review | System access confirmed, first /inc command executed |
| Day 2 | Live or replayed incident shadowing with debrief | Shadowing notes documented, debrief completed |
| Day 3 | Simulated P2/P3 incident drill under supervision | Readiness criteria demonstrated, drill completed |
Before the engineer takes their first solo shift, review these three questions directly:
If they answer all three without hesitation, they're ready. If they pause on any of them, schedule another live shadowing session before their solo shift starts.
If you want to see how the AI SRE handles up to 80% of incident response for a junior engineer on their first rotation, schedule a demo with us.
MTTR (Mean Time To Resolution): The average time from when an incident is detected to when it is fully resolved. MTTR is the primary metric for measuring incident response efficiency. Reducing MTTR by even 10 minutes per incident compounds across a full month of incidents.
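MTTR is simple to compute from incident records, which makes the compounding claim easy to check. A sketch with invented timestamps:

```python
from datetime import datetime

# (detected, resolved) pairs for one hypothetical month of incidents
incidents = [
    ("2026-04-02 03:14", "2026-04-02 04:02"),
    ("2026-04-11 14:30", "2026-04-11 15:05"),
    ("2026-04-27 22:50", "2026-04-27 23:59"),
]

def mttr_minutes(records: list[tuple[str, str]]) -> float:
    """Mean time from detection to resolution, in minutes."""
    fmt = "%Y-%m-%d %H:%M"
    durations = [
        (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60
        for start, end in records
    ]
    return sum(durations) / len(durations)

print(round(mttr_minutes(incidents), 1))  # → 50.7
```

At roughly 51 minutes, shaving 10 minutes of coordination tax per incident cuts MTTR by about 20% without touching a single root cause.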
On-call rotation: A scheduled cycle that assigns engineers primary pager responsibility for a defined window. A healthy rotation distributes burden evenly across the team and pairs new engineers with a senior SRE shadow during their first shifts.
Runbook: A documented, step-by-step guide for diagnosing and resolving a specific class of incident. A good runbook includes trigger conditions, diagnostic commands, remediation steps, escalation contacts, and a last-tested date.
Post-mortem: A structured document produced after a significant incident (typically P1 or P2) that captures the timeline, root cause, contributing factors, and follow-up action items. Post-mortems are blameless by design and serve as the primary learning artifact for new on-call engineers.
Incident commander: The engineer who owns communication, coordination, and decision-making during an active incident. The incident commander delegates diagnostic work to specialists and ensures stakeholders receive timely updates: they manage the response, not the fix.
Coordination tax: The time lost to logistics before actual troubleshooting begins: manually creating a Slack channel, paging the right team, finding the runbook, and opening monitoring tools. Coordination tax typically costs 10–15 minutes per incident and falls disproportionately on new engineers still learning the toolchain.