Site Reliability Engineering (SRE) is an engineering discipline that applies software engineering practices to infrastructure and operations problems. Originally developed at Google in 2003, SRE teams are responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of production systems.
SRE differs from traditional operations in several key ways: it treats operations as a software engineering problem, caps the amount of manual operational work engineers take on, and uses explicit reliability targets (SLOs and error budgets) to decide when reliability work takes priority over new features.
In practice: An SRE team might spend 50% of their time on operational work (incident response, on-call, manual tasks) and 50% on engineering projects that improve reliability, such as building better monitoring, automating toil, or improving deployment pipelines.
Why it matters: According to the 2025 State of DevOps report, organizations with mature SRE practices achieve 4x faster deployment frequency and 3x lower change failure rates compared to those without dedicated reliability engineering.
AI SRE refers to artificial intelligence systems designed to investigate incidents, identify root causes, and suggest fixes alongside human engineers. Unlike basic automation that follows predefined rules, AI SRE uses machine learning to analyze patterns across logs, metrics, code changes, and historical incidents to accelerate investigation and resolution.
Modern AI SRE systems typically provide a core set of capabilities: gathering context from logs, metrics, and recent changes; correlating those signals with an incident's symptoms and historical incidents; and suggesting likely root causes and fixes.
Real-world performance: Teams using AI-powered incident investigation report 60-80% reductions in context-gathering time. Instead of spending the first 15-30 minutes of an incident manually searching logs and asking "what changed recently?" in Slack, AI SRE surfaces relevant information within 30 seconds.
Key distinction: AI SRE acts as a knowledgeable teammate that investigates alongside you, not a replacement for human judgment. The goal is automating the first 80% of incident response - the repetitive context-gathering and correlation work - so engineers can focus on the complex decision-making that requires human expertise.
On-call is a scheduling system that ensures someone is always available to respond to incidents affecting production systems. Engineers on-call are responsible for acknowledging alerts, triaging issues, and either resolving incidents themselves or escalating to appropriate team members.
A well-designed on-call system includes a fair rotation schedule, a clear escalation path, well-tuned alerting, and documented handoffs between shifts.
On-call compensation matters too: many organizations offer stipends, additional pay, or time off in lieu for out-of-hours shifts, and factoring this in keeps rotations sustainable.
Industry benchmark: According to analysis of 10,000+ engineering teams, healthy on-call rotations see 2-5 pages per week per engineer. Teams experiencing more than 10 pages per week typically suffer from alert fatigue and burnout.
SLA, SLO, and SLI form a hierarchy for measuring and committing to system reliability. SLIs measure actual performance, SLOs set internal targets, and SLAs are external promises with consequences.
An SLI is a quantitative measurement of a specific aspect of service performance. It's the actual data point you measure.
Common SLIs include availability (the proportion of requests served successfully), latency (how long requests take, often measured at the 95th or 99th percentile), error rate, and throughput.
An SLO is an internal target for an SLI that your team commits to maintaining. It defines "good enough" reliability for your service.
Example: "Our API should have 99.9% availability measured over a 30-day rolling window."
Error budgets: The gap between your SLO and 100% becomes your error budget. If your SLO is 99.9% availability, you have a 0.1% error budget - approximately 43 minutes of downtime per month. Teams use error budgets to balance reliability work against feature development.
An SLA is a contract with customers that specifies minimum performance levels and consequences for missing them. SLAs should always be less aggressive than your SLOs to provide buffer.
Example: "We guarantee 99.5% API availability. If we fail to meet this, affected customers receive 10% service credits."
The relationship in practice:
- SLI: We measured 99.94% availability this month
- SLO: Our target is 99.9% availability (we're meeting it)
- SLA: We promise customers 99.5% availability (comfortable margin)
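To make the hierarchy concrete, here's a minimal sketch in Python using the example numbers above; the request counts and threshold constants are illustrative and not tied to any particular monitoring stack.

```python
# Sketch: computing an availability SLI and comparing it to SLO/SLA targets.
# The numbers mirror the example above; swap in your own measurements.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully, as a percentage."""
    return 100.0 * good_requests / total_requests

SLO_TARGET = 99.9   # internal target
SLA_TARGET = 99.5   # external, contractual promise (less aggressive than the SLO)

measured = availability_sli(good_requests=999_400, total_requests=1_000_000)  # 99.94%

print(f"SLI (measured): {measured:.2f}%")
print(f"Meeting SLO ({SLO_TARGET}%):", measured >= SLO_TARGET)
print(f"Meeting SLA ({SLA_TARGET}%):", measured >= SLA_TARGET)
```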
Incident response is the process of detecting, investigating, communicating about, and resolving service disruptions. Effective incident response minimizes customer impact, reduces recovery time, and captures learnings to prevent recurrence.
A mature incident response process includes these phases: detection, triage, investigation, mitigation, resolution, and post-incident review.
Most organizations use 4-5 severity levels:
| Severity | Description | Response Expectation |
|---|---|---|
| SEV1/Critical | Complete service outage, major data loss | All-hands response, executive notification |
| SEV2/High | Significant degradation, major feature unavailable | Immediate response, broad team engagement |
| SEV3/Medium | Minor impact, workarounds available | Response within 1-2 hours during business hours |
| SEV4/Low | Minimal impact, single user affected | Next business day response |
Best practices for incident response include assigning a clear incident commander, communicating status on a regular cadence, classifying severity consistently, following runbooks where they exist, and closing the loop with a blameless post-mortem.
MTTR (Mean Time to Resolution) is the average time from when an incident is detected to when it's fully resolved. It's the primary metric for measuring incident response effectiveness and directly correlates with customer impact.
MTTR is one of four key incident metrics, alongside MTTD (mean time to detect), MTTA (mean time to acknowledge), and MTBF (mean time between failures).
MTTR = Total resolution time for all incidents / Number of incidents
Example:
- Incident 1: 45 minutes
- Incident 2: 120 minutes
- Incident 3: 30 minutes
- Incident 4: 90 minutes
MTTR = (45 + 120 + 30 + 90) / 4 = 71.25 minutes
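The same calculation as a few lines of Python, using the four example incidents above:

```python
# Sketch: MTTR as the arithmetic mean of per-incident resolution times.
resolution_minutes = [45, 120, 30, 90]  # the four example incidents above

mttr = sum(resolution_minutes) / len(resolution_minutes)
print(f"MTTR: {mttr:.2f} minutes")  # 71.25 minutes
```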
Industry benchmarks for MTTR:
| Team Maturity | Typical MTTR (SEV1) | Typical MTTR (SEV2) |
|---|---|---|
| Early-stage | 2-4 hours | 4-8 hours |
| Maturing | 30-60 minutes | 1-2 hours |
| Advanced | Under 15 minutes | Under 30 minutes |
Strategies to reduce MTTR include improving detection and alerting, keeping runbooks current for common failure modes, investing in observability, automating rollbacks, and rehearsing response through game days.
Critical insight: Teams that invest in reducing MTTD and MTTA often see the biggest MTTR improvements. Many incidents could be resolved in minutes if detected and acknowledged faster.
A runbook is a documented set of step-by-step procedures for handling specific incidents, operational tasks, or maintenance activities. Good runbooks transform tribal knowledge into repeatable processes that any trained team member can execute.
# Database Connection Pool Exhaustion
## Symptoms
- Spike in "connection timeout" errors
- Application logs showing "unable to acquire connection"
- Database connections at max_pool_size limit
## Resolution Steps
1. Verify the issue
- Check Grafana dashboard: [link]
- Expected during the incident: connection_pool_active > 95%
2. Identify the cause
- Check for long-running queries: SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
- Check for connection leaks in recent deployments
3. Immediate mitigation
- Kill long-running queries if safe: SELECT pg_terminate_backend(<pid>); (substitute the pid from the previous query)
- Restart affected application pods: kubectl rollout restart deployment/api
4. Verify resolution
- Confirm connection pool utilization drops below 80%
- Confirm error rate returns to baseline
## Escalation
If issue persists after step 3, escalate to database team via #db-oncall
Runbook maintenance tips: link runbooks directly from the alerts that reference them, exercise them during game days, update them after any incident that exposes a gap, and record the last-reviewed date so stale steps are obvious.
A post-mortem is a structured analysis conducted after an incident to understand what happened, why it happened, and how to prevent similar incidents in the future. Also called incident reviews, retrospectives, or learning reviews, effective post-mortems focus on system improvements rather than individual blame.
# Incident Post-Mortem: [Incident Title]
## Summary
[2-3 sentence description of what happened and impact]
## Timeline
- 14:32 - Alert fired for elevated error rates
- 14:35 - On-call engineer acknowledged
- 14:42 - Root cause identified (bad config deployed)
- 14:48 - Config rolled back
- 14:52 - Service recovered
## Root Cause Analysis
[Detailed explanation of what went wrong and why]
## Impact
- Duration: 20 minutes
- Users affected: ~5,000
- Revenue impact: Estimated $2,000 in failed transactions
## What Went Well
- Alert fired within 3 minutes of issue start
- Clear runbook for config rollback
## What Could Be Improved
- Config change lacked automated testing
- No canary deployment for config changes
## Action Items
1. [Owner: Alice] Add automated config validation - Due: Jan 15
2. [Owner: Bob] Implement config canary deployments - Due: Jan 30
3. [Owner: Carol] Update runbook with new validation steps - Due: Jan 10
Post-mortem best practices: keep the discussion blameless, write the document within a few days while context is fresh, give every action item an owner and a due date, and share the findings beyond the immediate team.
An escalation policy defines who gets notified about an incident, in what order, and through what channels. Well-designed escalation policies ensure the right people are engaged quickly without overwhelming teams with unnecessary alerts.
Alert triggers →
Layer 1 (0 min): Primary on-call
- Push notification + SMS
- If no acknowledgment in 5 minutes...
Layer 2 (5 min): Secondary on-call
- Push notification + SMS + Phone call
- If no acknowledgment in 10 minutes...
Layer 3 (15 min): Engineering manager + Primary on-call (again)
- Phone call to manager
- Escalate to #engineering-urgent Slack channel
Layer 4 (30 min): VP Engineering + Incident commander on-call
- Phone call
- Page entire team
Common mistake: Making escalation too slow. If your primary on-call doesn't respond to a SEV1 incident, waiting 15 minutes to escalate can mean significant additional customer impact.
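If it helps to see the shape of a policy as data, here's an illustrative sketch; the layer targets and delays mirror the example above, and none of it is tied to a specific paging tool's API.

```python
# Sketch: an escalation policy as ordered layers, each with a delay (minutes
# after the alert fires) and a set of notification channels.
from dataclasses import dataclass

@dataclass
class EscalationLayer:
    delay_minutes: int
    targets: list[str]
    channels: list[str]

POLICY = [
    EscalationLayer(0,  ["primary on-call"],                        ["push", "sms"]),
    EscalationLayer(5,  ["secondary on-call"],                      ["push", "sms", "phone"]),
    EscalationLayer(15, ["engineering manager", "primary on-call"], ["phone", "#engineering-urgent"]),
    EscalationLayer(30, ["vp engineering", "incident commander"],   ["phone", "page team"]),
]

def layers_to_notify(minutes_unacknowledged: int) -> list[EscalationLayer]:
    """Return every layer whose delay has elapsed without an acknowledgment."""
    return [layer for layer in POLICY if minutes_unacknowledged >= layer.delay_minutes]

for layer in layers_to_notify(minutes_unacknowledged=16):
    print(layer.targets, "via", layer.channels)
```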
Alert fatigue is the desensitization that occurs when on-call engineers receive too many alerts, leading to slower response times, missed critical issues, and eventual burnout. It's one of the most common causes of incident response failures.
| Cause | Example | Solution |
|---|---|---|
| False positives | Alert fires but no actual issue | Tune thresholds, add correlation |
| Duplicate alerts | Same issue triggers 10 alerts | Implement alert deduplication |
| Low-urgency alerts | Non-actionable notifications during on-call | Route to async channels |
| Missing context | Alert provides no investigation path | Include runbook links, context |
| Alert sprawl | Alerts never cleaned up | Regular alert hygiene reviews |
Industry benchmark: Elite teams maintain less than 5 pages per on-call engineer per week, with over 80% of alerts being actionable.
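Alert deduplication, one of the fixes in the table above, usually comes down to grouping alerts by a fingerprint of their identifying fields rather than by firing time. A minimal sketch, assuming alerts arrive as simple dictionaries with hypothetical field names:

```python
# Sketch: deduplicating alerts by a fingerprint of their identifying labels,
# so ten firings of the same underlying issue page the on-call engineer once.
from collections import defaultdict

def fingerprint(alert: dict) -> tuple:
    """Identity of an alert: what it is and where it's from, not when it fired."""
    return (alert["name"], alert["service"], alert.get("environment", "prod"))

def deduplicate(alerts: list[dict]) -> dict[tuple, list[dict]]:
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        groups[fingerprint(alert)].append(alert)
    return groups

alerts = [
    {"name": "HighErrorRate", "service": "api", "firing_at": "14:32"},
    {"name": "HighErrorRate", "service": "api", "firing_at": "14:33"},
    {"name": "DiskFull", "service": "db", "firing_at": "14:40"},
]
for key, group in deduplicate(alerts).items():
    print(key, f"-> 1 page ({len(group)} raw alerts)")
```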
Monitoring tracks known failure modes through predefined metrics and thresholds, while observability provides the ability to understand any system state - including novel failures - by examining outputs like logs, metrics, and traces.
Monitoring answers the question: "Is the system working?"
Limitations: Monitoring only catches issues you've anticipated. If a new failure mode occurs that doesn't trip existing alerts, you won't know until customers report problems.
Observability answers the question: "Why is the system behaving this way?"
Built on three pillars: logs, metrics, and traces.
Key capability: With proper observability, you can investigate issues you've never seen before by asking arbitrary questions of your telemetry data.
| Aspect | Monitoring | Observability |
|---|---|---|
| Approach | Predefined checks | Exploratory investigation |
| Question | "Is it broken?" | "Why is it broken?" |
| Coverage | Known failure modes | Unknown failure modes |
| Data | Aggregated metrics | High-cardinality data |
| Tools | Dashboards, alerts | Log analysis, distributed tracing |
Practical insight: You need both. Monitoring tells you something is wrong; observability helps you figure out what. Teams often start with monitoring and add observability capabilities as their systems grow more complex.
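One practical difference shows up in what you emit. Monitoring tends to produce pre-aggregated counters; observability favors wide, high-cardinality events you can slice in new ways later. A rough sketch of the latter (the field names are illustrative):

```python
# Sketch: emitting a "wide event" per request - one structured record with
# high-cardinality fields (user, build, region) that can be queried later
# to answer questions you didn't anticipate when you wrote the code.
import json
import time

def emit_request_event(**fields) -> None:
    event = {"timestamp": time.time(), **fields}
    print(json.dumps(event))  # in practice: ship to your log/trace backend

emit_request_event(
    service="checkout-api",
    route="/v1/orders",
    status=503,
    duration_ms=2140,
    user_id="u_18422",      # high-cardinality: per-user
    build_sha="9f3c2e1",    # lets you correlate with deploys
    region="eu-west-1",
)
```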
An error budget is the maximum amount of unreliability your service can have while still meeting its Service Level Objective (SLO). It's calculated as 100% minus your SLO target, and it creates a shared framework for balancing reliability work against feature development.
If your SLO is 99.9% availability:
Error budget = 100% - 99.9% = 0.1%
In a 30-day month:
0.1% of 30 days = 43.2 minutes of allowed downtime
When the error budget is healthy (plenty remaining), teams can ship quickly and take calculated risks; when it's depleted, feature work slows or pauses and the focus shifts to reliability until the budget recovers.
| Error Budget Status | Development Velocity | Risk Tolerance |
|---|---|---|
| >50% remaining | Full speed | Normal deployment process |
| 25-50% remaining | Moderate caution | Additional review for risky changes |
| 10-25% remaining | Slow down | Only low-risk changes allowed |
| <10% remaining | Freeze | Reliability work only |
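The arithmetic and policy tiers above generalize to a small helper; this is a sketch with thresholds taken from the table, and your own policy may well differ.

```python
# Sketch: error budget in minutes for a given SLO and window, plus the
# policy tier from the table above based on how much budget remains.

def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in the window: (100% - SLO) of the window length."""
    return (100.0 - slo_percent) / 100.0 * window_days * 24 * 60

def policy_tier(budget_remaining_fraction: float) -> str:
    if budget_remaining_fraction > 0.50:
        return "Full speed - normal deployment process"
    if budget_remaining_fraction > 0.25:
        return "Moderate caution - extra review for risky changes"
    if budget_remaining_fraction > 0.10:
        return "Slow down - only low-risk changes"
    return "Freeze - reliability work only"

budget = error_budget_minutes(99.9)   # ~43.2 minutes per 30-day window
downtime_so_far = 30.0                # minutes consumed this window
remaining = (budget - downtime_so_far) / budget
print(f"Budget: {budget:.1f} min, remaining: {remaining:.0%} -> {policy_tier(remaining)}")
```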
Why error budgets matter: They eliminate the traditional tension between "ship faster" and "be more reliable" by creating objective criteria. Instead of arguing about whether to delay a feature for reliability work, the error budget provides a clear answer.
Toil is manual, repetitive, automatable work that scales linearly with system size and provides no enduring value. In SRE, eliminating toil is a primary objective because it frees engineers to focus on work that improves systems rather than just maintaining them.
Work qualifies as toil if it is manual, repetitive, automatable, tactical rather than strategic, provides no enduring value, and scales linearly as the service grows.
| Toil | Not Toil |
|---|---|
| Manually restarting crashed services | Building auto-restart capability |
| Provisioning accounts by hand | Creating self-service provisioning |
| Running manual deployments | Implementing CI/CD pipelines |
| Responding to capacity alerts by adding resources | Building auto-scaling |
| Manually rotating credentials | Automating credential rotation |
Google SRE recommends that SRE teams spend no more than 50% of their time on toil. The other 50% should go toward engineering projects that reduce future toil or improve reliability.
Tracking toil: Many teams categorize their work weekly to ensure toil doesn't exceed healthy levels. If toil consistently exceeds 50%, it signals a need for more automation investment.
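A lightweight way to do that tracking is to tag work items by category and compute the toil share each week. A sketch with made-up categories and hours:

```python
# Sketch: weekly toil tracking. Hours and categories are made up; the point
# is the ratio check against the 50% guideline mentioned above.

week = {
    "toil":        {"manual deploys": 6, "ticket-driven provisioning": 4, "pager noise": 3},
    "engineering": {"deploy automation": 10, "alert tuning": 5, "capacity planning": 4},
}

toil_hours = sum(week["toil"].values())
total_hours = toil_hours + sum(week["engineering"].values())
toil_share = toil_hours / total_hours

print(f"Toil: {toil_share:.0%} of tracked time")
if toil_share > 0.5:
    print("Over the 50% guideline - prioritise automation work")
```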
Incident severity is a classification system that indicates the urgency and impact of an incident, determining response speed, communication requirements, and resource allocation. Consistent severity classification ensures appropriate response effort and helps prioritize when multiple incidents occur simultaneously.
| Level | Name | Definition | Typical Response |
|---|---|---|---|
| SEV1 | Critical | Complete outage or data loss affecting all users | Immediate all-hands response, executive notification, status page update |
| SEV2 | High | Major feature unavailable or significant performance degradation | Immediate response, team-wide notification |
| SEV3 | Medium | Minor impact with workarounds available | Response within 1-2 hours, standard on-call handling |
| SEV4 | Low | Minimal impact, single user or cosmetic issue | Next business day, may be handled as bug ticket |
Escalate severity when the impact turns out to be broader than first assessed, the incident is running longer than expected, or additional systems and customers become affected.
De-escalate severity when mitigation is in place and customer impact is contained, even if follow-up work remains.
Best practice: When uncertain, err toward higher severity. It's easier to de-escalate a well-managed incident than to recover from an under-resourced response.
A service catalog is a structured inventory of all services in your organization, including their owners, dependencies, runbooks, and metadata. It serves as the single source of truth for understanding what you're running and who's responsible for it.
A comprehensive service catalog entry includes the owning team and escalation contacts, upstream and downstream dependencies, links to runbooks and dashboards, the service tier, and repository and deployment metadata.
| Benefit | Description |
|---|---|
| Faster incident response | Instantly find the right team to contact |
| Dependency awareness | Understand blast radius of failures |
| Onboarding acceleration | New team members quickly understand the system |
| Audit compliance | Track ownership and changes for compliance |
| Automation foundation | Power workflows with accurate service data |
Integration with incident response: When an incident occurs, the service catalog can automatically identify the owning team, surface relevant runbooks, and map potential impact to dependent services.
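A catalog entry is just structured data, which is what makes that automation possible. Here's an illustrative sketch; the service names, teams, and links are hypothetical:

```python
# Sketch: a service catalog entry as structured data, and the kind of lookup
# an incident tool might do. Service names, teams, and links are hypothetical.

CATALOG = {
    "payments-api": {
        "owner_team": "payments",
        "on_call_channel": "#payments-oncall",
        "tier": "tier-1",
        "dependencies": ["postgres-main", "auth-service"],
        "runbooks": ["https://internal.example.com/runbooks/payments-api"],
    },
    "auth-service": {
        "owner_team": "identity",
        "on_call_channel": "#identity-oncall",
        "tier": "tier-1",
        "dependencies": ["postgres-main"],
        "runbooks": [],
    },
}

def blast_radius(service: str) -> list[str]:
    """Services that depend directly on the given service."""
    return [name for name, entry in CATALOG.items() if service in entry["dependencies"]]

print(CATALOG["payments-api"]["owner_team"], CATALOG["payments-api"]["on_call_channel"])
print("Depends on auth-service:", blast_radius("auth-service"))
```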
Chaos engineering is the practice of intentionally introducing failures into production systems to test resilience and uncover weaknesses before they cause real incidents. Rather than waiting for systems to fail unexpectedly, chaos engineering proactively validates that systems can handle adverse conditions.
| Experiment | Tests For | Example |
|---|---|---|
| Server termination | Auto-scaling, failover | Randomly kill EC2 instances |
| Network latency | Timeout handling | Add 500ms delay to service calls |
| Dependency failure | Graceful degradation | Block traffic to database |
| Resource exhaustion | Capacity limits | Fill disk, exhaust memory |
| Zone outage | Multi-AZ resilience | Disable entire availability zone |
1. Define steady state (normal system behavior)
2. Hypothesize that steady state will continue during experiment
3. Introduce failure (server crash, network partition, etc.)
4. Observe system behavior
5. Either confirm hypothesis or discover weakness
6. Fix weaknesses and repeat
Key insight: Chaos engineering isn't about breaking things for fun - it's about building confidence in your system's ability to handle real-world failures. The goal is finding weaknesses in controlled conditions rather than during actual incidents.
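The six-step process above maps naturally onto a simple loop; in this sketch the steady-state check and fault injection are placeholders for your own metrics queries and chaos tooling.

```python
# Sketch: the chaos experiment loop from the steps above. The steady-state
# check and fault injection are placeholders, not a real fault-injection tool.

def steady_state_ok() -> bool:
    # Placeholder: e.g. error rate < 1% and p99 latency < 500 ms over 5 minutes
    return True

def inject_failure(kind: str) -> None:
    # Placeholder: e.g. terminate an instance, add latency, block a dependency
    print(f"Injecting failure: {kind}")

def run_experiment(kind: str) -> None:
    if not steady_state_ok():
        print("System not healthy - abort before injecting anything")
        return
    inject_failure(kind)
    if steady_state_ok():
        print("Hypothesis held: steady state maintained under failure")
    else:
        print("Weakness found: fix it, then repeat the experiment")

run_experiment("zone outage")
```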
An incident commander (IC) is the designated leader responsible for coordinating an incident response, making decisions, and ensuring effective communication. The IC doesn't need to be the most technical person - their role is coordination and decision-making, not debugging.
Do: maintain a single source of truth for incident status, delegate investigation to subject-matter experts, make timely decisions with imperfect information, and keep stakeholders updated on a regular cadence.
Don't: dive into debugging yourself, let discussions stall without a decision, or allow communication to fragment across channels.
Rotating the IC role: During long incidents, IC responsibilities should transfer between engineers to prevent fatigue. Clear handoffs with status summaries ensure continuity.
Mean Time to Detect (MTTD) is the average time between when a failure occurs and when it's detected by monitoring systems or users. MTTD is often the largest hidden contributor to overall incident duration - you can't fix what you don't know is broken.
MTTD = Time of detection - Time of failure
Example:
- Failure occurred: 14:00:00
- Alert fired: 14:03:30
- MTTD: 3 minutes 30 seconds
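The same subtraction with real timestamps, using the example values above (the date itself is arbitrary):

```python
# Sketch: MTTD as detection time minus failure time, using the example above.
from datetime import datetime

failure_at = datetime.fromisoformat("2026-01-10T14:00:00")
detected_at = datetime.fromisoformat("2026-01-10T14:03:30")

mttd = detected_at - failure_at
print(f"MTTD: {mttd}")  # 0:03:30
```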
| Factor | Impact on MTTD |
|---|---|
| Alert sensitivity | Tighter thresholds detect faster but increase false positives |
| Monitoring coverage | Blind spots create detection gaps |
| Alert routing | Misconfigured routing delays notification |
| Health check frequency | Longer intervals mean longer detection time |
| Synthetic monitoring | User-journey tests catch issues before real users |
Industry benchmark: Elite teams achieve MTTD under 5 minutes for critical issues. Most organizations operate in the 10-30 minute range.
A game day is a planned exercise where teams practice incident response by responding to simulated or real failures in a controlled environment. Game days build muscle memory, identify process gaps, and ensure teams can respond effectively under pressure.
| Type | Description | Best For |
|---|---|---|
| Tabletop exercise | Discussion-based scenario walkthrough | Process validation, new team training |
| Simulation | Realistic scenario with mocked failures | Testing runbooks and coordination |
| Live fire | Real failures in production | Validating actual system resilience |
1. Preparation (1-2 weeks before)
- Define scenario and objectives
- Identify participants
- Prepare monitoring and communication channels
2. Execution (2-4 hours)
- Brief participants on rules and objectives
- Inject failure scenario
- Observe response without intervention
- Call exercise when objectives are met
3. Debrief (1 hour)
- Review timeline and decisions
- Identify gaps and improvements
- Document findings
- Assign follow-up actions
Frequency: Most mature organizations run game days quarterly, with tabletop exercises monthly.
SRE teams can be organized in several models depending on company size, technical complexity, and organizational culture. There's no single correct structure - the best approach depends on your specific context.
1. Centralized SRE: a single SRE team owns reliability across all services and partners with product teams as needed
2. Embedded SRE: SREs sit inside individual product teams and share their day-to-day work
3. Platform SRE: SREs build and run shared infrastructure and tooling that product teams consume self-service
4. Hybrid Model: a central platform or tooling team combined with embedded SREs for the most critical services
| Company Stage | Typical SRE-to-Developer Ratio | Context |
|---|---|---|
| Early startup | 0 SREs | Developers handle operations |
| Growth stage | 1:10-15 | Building SRE foundation |
| Scale-up | 1:8-12 | Mature practices, automation focus |
| Enterprise | 1:6-10 | Complex systems, compliance requirements |
A blameless culture is an organizational approach where the focus after incidents is on understanding system failures and preventing recurrence, rather than punishing individuals who made mistakes. This approach recognizes that human error is inevitable and that blame-focused responses discourage transparency and learning.
Language matters: describe what the system allowed to happen ("the pipeline let an untested config reach production") rather than who did it ("Bob pushed a bad config").
Process changes: run blameless post-mortems, keep incident reviews separate from performance reviews, and make it safe to report near-misses.
| Misconception | Reality |
|---|---|
| "Blameless means no accountability" | Accountability for learning exists; punishment for honest mistakes doesn't |
| "Everyone is equally responsible" | Clear ownership still matters; blame assignment doesn't |
| "We can't address repeat issues" | Patterns can be addressed through systems, not blame |
Key insight: Blameless culture isn't about being soft on mistakes - it's about being smart about how you prevent future mistakes. Research consistently shows that blameless environments surface more issues and improve faster than punitive ones.
Starting with SRE doesn't require hiring a dedicated team or implementing every practice at once. Begin with the fundamentals that provide the most value for your current stage, and add sophistication as you grow.
Phase 1: Foundation (Week 1-4): set up basic monitoring and alerting, define an on-call rotation, and agree on severity levels
Phase 2: Process (Month 2-3): establish an incident response process, write runbooks for your most common failures, and start running blameless post-mortems
Phase 3: Improvement (Month 4-6): define SLOs and error budgets, track MTTD/MTTA/MTTR, and automate the worst sources of toil
Phase 4: Maturity (Month 6+): add a service catalog, run game days, experiment with chaos engineering, and layer in AI-assisted investigation
Site Reliability Engineering isn't a destination you arrive at - it's a continuous process of improving how your team builds, operates, and learns from production systems. The terminology in this guide represents decades of hard-won lessons from engineering teams at organizations of all sizes.
Key takeaways: measure reliability with SLIs and SLOs, use error budgets to balance speed against stability, invest in reducing detection and response times, keep on-call sustainable, and treat every incident as a learning opportunity.
Whether you're building your first on-call rotation or implementing chaos engineering at scale, the goal remains the same: reliable systems that let you move fast when things inevitably break.
Last updated: January 2026
For teams looking to implement these practices:
Start with a free trial at incident.io to see how these concepts work in practice.

Ready for modern incident management? Book a call with one of our experts today.
