Key statistics and quick answers
Quick definitions:
- SLI (Service Level Indicator): The actual measurement (e.g., "99.95% uptime last month")
- SLO (Service Level Objective): The internal target (e.g., "99.9% uptime goal")
- SLA (Service Level Agreement): The contractual promise with consequences (e.g., "99.9% uptime or 10% credit")
Industry benchmarks (2024 data):
| Service Type | Typical SLA | Industry Leader SLA | Cost of Downtime/Hour |
| --- | --- | --- | --- |
| E-commerce Platform | 99.9% | 99.99% (Amazon) | $500,000 – $1,100,000 |
| SaaS B2B | 99.5% | 99.95% (Salesforce) | $100,000 – $540,000 |
| Financial Services | 99.95% | 99.999% (Visa) | $1,000,000 – $5,000,000 |
| Social Media | 99.9% | 99.95% (LinkedIn) | $200,000 – $450,000 |
Key facts:
- 73% of organizations experienced an outage costing over $100,000 in the last year
- Average SLA breach penalty: 5-25% service credit
- Most common SLO target: 99.9% uptime (allows 43.8 minutes downtime/month)
- Error budget calculation: (100% - SLO%) × time period
- ROI of SRE practices: 3-5x reduction in incident costs within 12 months
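The error budget formula above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library: the function name `error_budget_minutes` and the constant `AVERAGE_DAYS_PER_MONTH` are made up for this example. It assumes an average month of 365.25 / 12 days, which is where the 43.8-minute figure for 99.9% comes from; a strict 30-day window gives 43.2 minutes instead.

```python
AVERAGE_DAYS_PER_MONTH = 365.25 / 12  # 30.4375 days; assumption behind the 43.8-minute figure


def error_budget_minutes(slo_percent: float, period_days: float = AVERAGE_DAYS_PER_MONTH) -> float:
    """Return the allowed downtime in minutes: (100% - SLO%) x time period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    period_minutes = period_days * 24 * 60
    return (100 - slo_percent) / 100 * period_minutes


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
        print(f"{slo}% SLO -> {error_budget_minutes(slo):.2f} minutes/month")
```

Passing `period_days=30` reproduces the rolling 30-day variant, so the same helper covers both conventions as long as the window is stated explicitly.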
What is an SLI (Service Level Indicator)?
Definition and purpose
An SLI answers the question: "How are we actually performing right now?"
Service Level Indicators are the raw, quantitative measurements that tell you exactly how your service is performing. Think of them as the speedometer in your car - they show the actual speed, not the speed limit or your desired speed.
The 5 golden SLI categories
- Availability SLIs
  - Measurement: Percentage of successful requests / total requests
  - Example: "99.95% of health check requests succeeded in the last 30 days"
  - Real-world case: Netflix measures availability as successful stream starts divided by total attempts
- Latency SLIs
  - Measurement: Response time at specific percentiles (p50, p95, p99)
  - Example: "95% of API requests completed in under 200ms"
  - Real-world case: Google Search aims for p99 latency under 1000ms globally
- Throughput SLIs
  - Measurement: Requests processed per second/minute
  - Example: "Payment system processed 10,000 transactions per minute"
  - Real-world case: Stripe processes 250+ million API requests daily (2,900/second average)
- Error rate SLIs
  - Measurement: Failed requests / total requests
  - Example: "0.01% error rate on database queries"
  - Real-world case: Amazon DynamoDB maintains <0.001% error rates on standard operations
- Durability SLIs
  - Measurement: Percentage of data retained without corruption
  - Example: "99.999999999% (11 nines) data durability"
  - Real-world case: Amazon S3 is designed for 11 nines of durability for objects
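As a rough illustration of the latency and error rate categories, the sketch below computes percentile latencies and a "share of requests under a threshold" figure from raw samples. The function names and the sample latencies are hypothetical, and the nearest-rank percentile method is one common convention among several; monitoring systems often use interpolation or histogram buckets instead.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def fraction_within(values: list[float], threshold_ms: float) -> float:
    """Percentage of requests that completed in under threshold_ms."""
    return 100 * sum(v < threshold_ms for v in values) / len(values)


def error_rate(failed: int, total: int) -> float:
    """Failed requests as a percentage of total requests."""
    return 100 * failed / total


if __name__ == "__main__":
    latencies_ms = [120, 95, 180, 210, 150, 99, 175, 130, 160, 450]  # made-up samples
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
    print("under 200ms:", fraction_within(latencies_ms, 200), "%")
    print("error rate:", error_rate(500, 1_000_000), "%")
```

With only ten samples, p95 and p99 both land on the slowest request, which is a reminder that high percentiles need a large sample to be meaningful.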
How to calculate SLIs: Step-by-step
Step 1: Choose your measurement window
- Real-time: Last 5 minutes
- Short-term: Last 24 hours
- Monthly: Rolling 30 days
- Quarterly: Rolling 90 days
Step 2: Collect raw data
{
  "total_requests": 1000000,
  "successful_requests": 999500,
  "failed_requests": 500,
  "measurement_period": "30 days"
}
Step 3: Apply the formula
- Availability SLI = (successful_requests / total_requests) × 100
- Example: (999,500 / 1,000,000) × 100 = 99.95%
Step 4: Track trends
- Daily averages
- Weekly patterns
- Monthly comparisons
- Quarterly reports
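The four steps can be sketched as follows: an availability function for a single window, plus a rolling calculation for tracking trends over daily counts. The `DailyCounts` type, the function names, and the choice to treat a zero-traffic window as 100% available are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DailyCounts:
    total_requests: int
    successful_requests: int


def availability_sli(successful: int, total: int) -> float:
    """Availability SLI = (successful_requests / total_requests) x 100."""
    if total == 0:
        return 100.0  # assumption: a window with no traffic counts as fully available
    return successful / total * 100


def rolling_availability(days: list[DailyCounts], window: int) -> list[float]:
    """Availability over a sliding window of `window` days, one value per day once the window is full."""
    recent = deque(maxlen=window)  # oldest day drops out automatically
    results = []
    for day in days:
        recent.append(day)
        if len(recent) == window:
            total = sum(d.total_requests for d in recent)
            ok = sum(d.successful_requests for d in recent)
            results.append(availability_sli(ok, total))
    return results


if __name__ == "__main__":
    # The raw data from Step 2
    print(f"{availability_sli(999_500, 1_000_000):.2f}%")
    # Made-up daily counts for a 2-day rolling window
    history = [DailyCounts(1000, 1000), DailyCounts(1000, 990),
               DailyCounts(1000, 1000), DailyCounts(1000, 980)]
    print(rolling_availability(history, window=2))
```

Summing requests across the window before dividing weights busy days more heavily than quiet ones, which usually matches user experience better than averaging daily percentages.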
What is an SLO (Service Level Objective)?
Definition and strategic importance
An SLO answers the question: "What level of service do we want to provide?"
Service Level Objectives are your internal targets that balance reliability with innovation velocity. They define the minimum acceptable performance: when the service falls below that level, your team needs to take action.
The 7-step SLO setting process
- Understand user expectations
  - Survey data: "Users expect <2 second page loads"
  - Industry benchmarks: "Competitors offer 99.9% uptime"
  - Historical performance: "We achieved 99.95% last quarter"
- Define critical user journeys (CUJs)
  - Login and authentication
  - Core feature usage
  - Payment processing
  - Data export/import
- Set realistic targets
| Availability Target | Monthly Downtime | Annual Downtime | Use Case |
| --- | --- | --- | --- |
| 99% | 7.3 hours | 3.65 days | Development environments |
| 99.9% | 43.8 minutes | 8.76 hours | Standard web applications |
| 99.95% | 21.9 minutes | 4.38 hours | Business-critical SaaS |
| 99.99% | 4.38 minutes | 52.56 minutes | High-availability services |
| 99.999% | 26.3 seconds | 5.26 minutes | Mission-critical systems |
- Calculate error budgets
  - Formula: Error Budget = (100% - SLO%) × Time Period
  - Example: 99.9% SLO = 0.1% error budget = 43.8 minutes/month
- Create alerting thresholds
  - 50% budget consumed: Engineering awareness
  - 75% budget consumed: Incident response team engaged
  - 90% budget consumed: Feature freeze consideration
  - 100% budget consumed: All-hands response
- Document decision framework
  - IF error_budget_remaining < 25% THEN:
    - Freeze non-critical deployments
    - Focus on reliability improvements
    - Increase monitoring coverage
- Review and adjust quarterly
  - Too easy to meet? Increase target
  - Consistently missed? Lower target or invest in reliability
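The alerting thresholds and the decision framework from the steps above could be encoded along these lines. This is only a sketch: the function names, the `ALERT_LEVELS` table, and the "No action" fallback are hypothetical, and the budget is expressed in downtime minutes to match the 43.8 minutes/month example.

```python
def budget_consumed_percent(downtime_minutes: float, budget_minutes: float) -> float:
    """Share of the error budget already used, as a percentage."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return downtime_minutes / budget_minutes * 100


# Thresholds from the alerting step, checked from most to least severe.
ALERT_LEVELS = [
    (100, "All-hands response"),
    (90, "Feature freeze consideration"),
    (75, "Incident response team engaged"),
    (50, "Engineering awareness"),
]


def alert_level(consumed_percent: float) -> str:
    """Return the action for the highest threshold that has been crossed."""
    for threshold, action in ALERT_LEVELS:
        if consumed_percent >= threshold:
            return action
    return "No action"  # assumption: below 50% nothing is triggered


def should_freeze_deployments(consumed_percent: float) -> bool:
    """Decision framework: freeze non-critical deployments when under 25% of the budget remains."""
    remaining_percent = 100 - consumed_percent
    return remaining_percent < 25


if __name__ == "__main__":
    consumed = budget_consumed_percent(downtime_minutes=35, budget_minutes=43.8)
    print(f"{consumed:.1f}% consumed -> {alert_level(consumed)}")
    print("freeze non-critical deployments:", should_freeze_deployments(consumed))
```

Keeping the thresholds in a single table makes the quarterly review step easier, since adjusting a target only means editing data rather than control flow.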
Real-world SLO examples from tech leaders
1. Google Cloud Platform
- Service: Cloud Storage
- SLO: 99.95% availability for multi-region
- Error budget: 21.9 minutes/month
- Consequence: If the corresponding customer-facing SLA is breached, 10-50% credit depending on severity
2. Spotify
- Service: Music streaming
- SLO: 99.9% successful stream starts
- Error budget: 43.8 minutes/month
- Innovation trade-off: Allows rapid feature deployment
3. Uber
- Service: Ride matching
- SLO: 99.99% availability for ride requests
- Error budget: 4.38 minutes/month
- Critical because: Direct revenue impact per minute of downtime
4. GitHub
- Service: Git operations
- SLO: 99.95% success rate for git push/pull
- Error budget: 21.9 minutes/month
- Measurement: Excluding planned maintenance windows
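Several of the SLOs above are defined on events (successful stream starts, git operations) rather than on uptime, so their error budget is more naturally counted in failed requests than in minutes. The sketch below shows that conversion under stated assumptions: the function names and the request volumes are hypothetical, and the SLO is passed as a string so that `Decimal` can avoid floating-point rounding at the boundary.

```python
from decimal import Decimal


def allowed_failures(slo_percent: str, expected_requests: int) -> int:
    """Number of failed events an event-based SLO tolerates over the window."""
    budget_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return int(budget_fraction * expected_requests)  # truncates any fractional failure


def budget_remaining(slo_percent: str, expected_requests: int, failed_so_far: int) -> int:
    """Failures still available before the SLO is missed (negative means already missed)."""
    return allowed_failures(slo_percent, expected_requests) - failed_so_far


if __name__ == "__main__":
    # Made-up volume: 1,000,000 stream starts in the window, 99.9% SLO
    print(allowed_failures("99.9", 1_000_000))
    # Made-up volume: 2,000,000 git operations, 99.95% SLO, 400 failures so far
    print(budget_remaining("99.95", 2_000_000, 400))
```

A minutes-based budget and a request-based budget only coincide when traffic is evenly spread, so it helps to state which one an SLO uses.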
What is an SLA (Service Level Agreement)?
Definition and legal framework
An SLA answers the question: "What did we promise our customers, and what happens if we fail?"
Service Level Agreements are legally binding contracts that define the minimum service levels a provider must deliver and the remedies or penalties if they fail to meet them.
The 10 essential components of an SLA
- Service description
  - Exact services covered
  - Service boundaries
  - Included vs. excluded features
- Availability commitments
  - Monthly Uptime Percentage: 99.9%
  - Calculation: (Total Minutes - Downtime Minutes) / Total Minutes × 100
  - Measurement Period: Calendar month
- Performance metrics
  - Response time: "95% of requests under 500ms"
  - Throughput: "Support 10,000 concurrent users"
  - Error rate: "Less than 0.1% failed requests"
- Support commitments
| Severity | Initial Response | Resolution Target | Escalation |
| --- | --- | --- | --- |
| Critical (P1) | 15 minutes | 4 hours | VP Engineering |
| High (P2) | 1 hour | 24 hours | Engineering Manager |
| Medium (P3) | 4 hours | 72 hours | Team Lead |
| Low (P4) | 24 hours | Best effort | Support Team |
- Credit structure
| Monthly Uptime | Service Credit |
| --- | --- |
| 99.9% or above | 0% |
| 99.0% to below 99.9% | 10% |
| 95.0% to below 99.0% | 25% |
| Below 95.0% | 50% |
- Exclusions
  - Planned maintenance windows
  - Force majeure events
  - Customer-caused issues
  - Third-party service failures
- Reporting requirements
  - Monthly availability reports
  - Incident post-mortems for breaches
  - Quarterly business reviews
- Measurement methodology
  - Monitoring locations
  - Sampling frequency
  - Calculation formulas
  - Dispute resolution process
- Notification procedures
  - Incident communication SLA
  - Status page updates
  - Email notifications
  - Executive escalation paths
- Termination clauses
  - Chronic breach conditions
  - Refund policies
  - Data portability guarantees
  - Transition assistance
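To make the availability commitment and the credit structure concrete, the sketch below computes a monthly uptime percentage and looks up the matching service credit. The names `monthly_uptime_percent`, `CREDIT_TIERS`, and `service_credit_percent` are made up for this example, and it assumes each tier's lower bound is inclusive (so exactly 99.9% earns no credit). A real contract should state how boundary values and rounding are handled.

```python
def monthly_uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """(Total Minutes - Downtime Minutes) / Total Minutes x 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


# (minimum uptime %, credit %) pairs from the credit structure, highest tier first.
CREDIT_TIERS = [
    (99.9, 0),
    (99.0, 10),
    (95.0, 25),
]
BELOW_ALL_TIERS_CREDIT = 50  # uptime below 95.0%


def service_credit_percent(uptime_percent: float) -> int:
    """Return the service credit owed for a month with the given uptime."""
    for minimum_uptime, credit in CREDIT_TIERS:
        if uptime_percent >= minimum_uptime:  # assumption: lower bounds are inclusive
            return credit
    return BELOW_ALL_TIERS_CREDIT


if __name__ == "__main__":
    minutes_in_month = 30 * 24 * 60  # a 30-day calendar month
    uptime = monthly_uptime_percent(minutes_in_month, downtime_minutes=90)
    print(f"uptime {uptime:.3f}% -> {service_credit_percent(uptime)}% credit")
```

Excluded downtime, such as planned maintenance windows, would be subtracted from `downtime_minutes` before calling these functions.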
Industry-specific SLA examples
1. AWS (Amazon Web Services)
- Service: EC2
- SLA: 99.99% for instances in multiple AZs
- Credits: 10% for <99.99%, 30% for <99.0%
- Unique clause: Excludes individual instance failures
2. Microsoft Azure
- Service: Virtual Machines
- SLA: 99.95% for single VM, 99.99% for availability sets
- Credits: 10% for <99.95%, 25% for <99%, 100% for <95%
- Measurement: 5-minute intervals
3. Salesforce
- Service: Sales Cloud
- SLA: 99.9% availability
- Credits: Days of service added to subscription
- Reporting: Monthly uptime reports published publicly
4. Slack
- Service: Business messaging
- SLA: 99.99% for Enterprise Grid
- Credits: 10× the duration of downtime
- Innovation: Separate SLAs for different service tiers
Quick reference: SLI vs SLO vs SLA comparison table
What's the difference between SLI, SLO, and SLA?
| Aspect | SLI (Service Level Indicator) |
| --- | --- |
| Definition | The actual, real-time measurement of your service performance. It's the raw data that tells you exactly how your system is performing right now. |
| Metrics examples | Availability: 99.97% uptime measured over last 30 days; Latency: P95 response time = 187ms; Error rate: 0.03% failed requests; Throughput: 52,341 requests/second; Durability: 99.999999999% data integrity |
| Use cases | Real-time monitoring: Dashboard showing current system health; Incident detection: Alert when availability drops below 99.95%; Performance tracking: Weekly performance reports; Capacity planning: Identify when to scale infrastructure; Root cause analysis: Correlate metrics during incidents |
| Audience | Engineering teams, SRE, DevOps |
| Frequency of change | Continuous (real-time data) |
| Consequences of missing | Alerts fired, incidents declared |