How much availability is "good enough"? What latency is "acceptable"? Without clear definitions, these questions become subjective debates. SLI, SLO, and SLA are frameworks for transforming vague expectations into measurable commitments.
This article explains each concept, how they relate, and how to implement them in practice.
Without clear goals, everything is urgent and nothing is a priority.
The Three Concepts
SLI (Service Level Indicator)
What it is: a metric that quantifies some aspect of the service.
Examples:
- Proportion of requests with latency < 200ms
- Proportion of successful requests (2xx status)
- Proportion of time the service is available
Typical formula:
SLI = (good events) / (total events) × 100%
Latency SLI = (requests < 200ms) / (total requests) × 100%
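To make the formula concrete, here is a minimal Python sketch of a latency SLI; the request durations and the 200 ms threshold are illustrative assumptions, not values from any real service.

```python
# Minimal sketch: computing a latency SLI from raw request durations.
# The sample data and the 200 ms threshold are illustrative assumptions.

def latency_sli(durations_ms, threshold_ms=200.0):
    """Return the percentage of requests faster than the threshold."""
    if not durations_ms:
        return 100.0  # no traffic: treat the SLI as trivially met
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return good / len(durations_ms) * 100.0

requests = [120, 90, 310, 180, 240, 95, 150]  # durations in milliseconds
print(f"Latency SLI: {latency_sli(requests):.1f}%")  # -> 71.4%
```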
SLO (Service Level Objective)
What it is: the target you set for an SLI.
Examples:
- 99% of requests should have latency < 200ms
- 99.9% of requests should succeed
- The service should be available 99.95% of the time
Format:
SLO = SLI [operator] [threshold] during [period]
"99% of requests < 200ms in a 28-day window"
SLA (Service Level Agreement)
What it is: a formal contract with consequences if the SLO is not met.
Examples:
- "We guarantee 99.9% availability. If we don't meet it, you receive 10% credit."
- "Maximum latency of 500ms. Violations result in contractual penalty."
Relationship:
SLA is a contract based on an SLO
SLO is a target based on an SLI
SLI is a measurement of the system
Why SLOs Matter
1. They define "good enough"
Without SLO:
- "The system is slow"
- "We need to improve"
- "It's not good enough"
With SLO:
- "We're at 98.5% vs target of 99%"
- "We're 0.5% short of the objective"
- "We need to reduce errors by 50%"
2. They enable conscious trade-offs
100% availability is impossible. 99.999% is extremely expensive. What's the right level?
| Availability | Downtime/month | Relative cost |
|---|---|---|
| 99% | 7.2 hours | $ |
| 99.9% | 43 minutes | $$ |
| 99.99% | 4.3 minutes | $$$$ |
| 99.999% | 26 seconds | $$$$$$ |
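The downtime column is straightforward arithmetic on the availability target. A quick sketch, assuming a 30-day month:

```python
# Sketch of the arithmetic behind the table above, assuming a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (99.0, 99.9, 99.99, 99.999):
    allowed_downtime = MINUTES_PER_MONTH * (1 - availability / 100)
    print(f"{availability}% -> {allowed_downtime:.1f} minutes/month")

# 99.0%   -> 432.0 minutes (~7.2 hours)
# 99.9%   -> 43.2 minutes
# 99.99%  -> 4.3 minutes
# 99.999% -> 0.4 minutes (~26 seconds)
```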
3. They balance speed and reliability
If you're always meeting your SLOs with room to spare, you're probably being too conservative and could ship faster. If you're always violating them, your targets are unrealistic or the system needs investment.
The sweet spot is spending your error budget deliberately: the unreliability the SLO allows is room you can use to take risks and release new features faster.
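A minimal sketch of how an error budget and its remaining share can be computed; the request counts are illustrative assumptions:

```python
# Sketch: the error budget is the amount of failure the SLO allows.
# Request counts are illustrative assumptions.
slo_target = 0.999          # 99.9% of requests must succeed
total_requests = 1_000_000  # traffic in the SLO window
failed_requests = 600       # failures observed so far

budget = total_requests * (1 - slo_target)  # 1,000 failures allowed
consumed = failed_requests / budget         # 0.6 -> 60% of budget spent
remaining = 1 - consumed                    # 40% left for risk-taking

print(f"Error budget: {budget:.0f} failures allowed, {remaining:.0%} remaining")
```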
Choosing Good SLIs
Characteristics of a good SLI
- Reflects user experience: measures what the user sees
- Is measurable: can be collected reliably
- Is actionable: you can do something when it's bad
- Is simple: easy to understand and explain
Recommended SLIs by service type
Request/response services:
- Availability: % of successful requests
- Latency: % of requests below threshold
- Quality: % of correct responses
Data processing services:
- Freshness: % of time data is up to date
- Correctness: % of data processed correctly
- Throughput: % of jobs completed on time
Storage services:
- Durability: % of data preserved
- Availability: % of time accessible
- Latency: % of operations below threshold
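The same good-events-over-total pattern works for the non-request/response types above. As an illustration, a freshness SLI for a data pipeline could be computed like this; the 15-minute staleness limit and the sample ages are assumptions:

```python
# Sketch: a freshness SLI, measured as the share of periodic checks in which
# the newest processed record was younger than the allowed staleness.
from datetime import timedelta

MAX_STALENESS = timedelta(minutes=15)  # illustrative threshold

# age of the newest processed record at each periodic check (assumed samples)
observed_ages = [timedelta(minutes=m) for m in (3, 7, 22, 5, 40, 9, 11, 14)]

fresh_checks = sum(1 for age in observed_ages if age <= MAX_STALENESS)
freshness_sli = fresh_checks / len(observed_ages) * 100
print(f"Freshness SLI: {freshness_sli:.1f}%")  # -> 75.0%
```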
Defining SLOs
The process
- Choose SLIs relevant to the service
- Analyze historical data: what's the current behavior?
- Consult stakeholders: what level is acceptable?
- Define targets based on data and expectations
- Iterate: SLOs are not static
Practical example
Service: Checkout API
Step 1 - SLIs:
- Latency: % requests < 500ms
- Availability: % successful requests
- Errors: % of requests without a 5xx error
Step 2 - Historical data:
- Latency: 95% of requests < 500ms currently
- Availability: 99.7% currently
- Errors: 0.3% of requests return a 5xx error
Step 3 - Stakeholders:
- Product: "Checkout needs to be fast, it's critical for conversion"
- Finance: "Every checkout failure is a lost sale"
- Engineering: "We can improve, but 100% is impossible"
Step 4 - Defined SLOs:
- Latency: 99% < 500ms (better than current, challenging but achievable)
- Availability: 99.9% (very critical, we need to invest)
- Errors: < 0.1% of requests return a 5xx error
Step 5 - Window:
- 28-day rolling window
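To make the gap between step 2 (where the service is today) and step 4 (the targets) explicit, here is a small sketch using only the example numbers above:

```python
# Sketch: comparing current performance (step 2) with the defined targets (step 4).
# All figures are the example values from this section.
checkout_slos = {
    # SLI: (current %, target %)
    "latency < 500ms":      (95.0, 99.0),
    "availability":         (99.7, 99.9),
    "requests without 5xx": (99.7, 99.9),
}

for name, (current, target) in checkout_slos.items():
    gap = target - current
    status = "met" if gap <= 0 else f"{gap:.1f} points to close"
    print(f"{name}: {current}% today vs {target}% target ({status})")
```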
Implementing SLOs
Monitoring
```yaml
# Prometheus alerting rule
- alert: CheckoutLatencySLOBreach
  expr: |
    (
      sum(rate(checkout_request_duration_seconds_bucket{le="0.5"}[5m]))
      /
      sum(rate(checkout_request_duration_seconds_count[5m]))
    ) < 0.99
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Checkout latency SLO at risk"
```
Dashboard
```
┌────────────────────────────────────────┐
│ SLO Status: Checkout API               │
├────────────────────────────────────────┤
│ Latency (<500ms)                       │
│ ████████████████████░░░  99.3% / 99%   │
│ Error budget: 30% remaining            │
├────────────────────────────────────────┤
│ Availability                           │
│ █████████████████████░░ 99.95% / 99.9% │
│ Error budget: 50% remaining            │
└────────────────────────────────────────┘
```
Layered alerts
- High burn rate: SLO will be violated soon → page on-call
- Medium burn rate: SLO at risk → Slack notification
- Negative trend: could become a problem → ticket for review
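Burn rate here means how fast the error budget is being consumed relative to the SLO: a rate of 1x spends the budget exactly over the window, anything higher exhausts it early. A minimal sketch of mapping burn rate to these tiers; the thresholds are illustrative assumptions, not prescriptive values:

```python
# Sketch: mapping error-budget burn rate to alert tiers.
# Burn rate = observed error rate / error rate allowed by the SLO.
# The tier thresholds below are illustrative assumptions.

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate

def alert_tier(rate: float) -> str:
    if rate >= 10:  # budget gone in a few days -> page on-call
        return "page"
    if rate >= 2:   # budget at risk within the window -> Slack notification
        return "notify"
    if rate > 1:    # slow overspend -> ticket for review
        return "ticket"
    return "ok"

rate = burn_rate(observed_error_rate=0.02, slo_target=0.999)  # ~20x burn
print(f"{rate:.0f}x burn -> {alert_tier(rate)}")  # 20x burn -> page
```

In practice these tiers are usually evaluated over multiple time windows, so that sharp spikes page quickly while slow leaks only generate tickets.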
Common Mistakes
1. SLOs too aggressive
Target: 99.999% availability
Reality: 99.9%
Result: Always violating, demotivated team, SLO ignored
2. SLOs too loose
Target: 95% availability
Reality: 99.5%
Result: no one ever worries about it, and it never drives improvement
3. Too many SLOs
15 different SLOs
Team doesn't know which to prioritize
None receives adequate attention
Recommendation: 3-5 SLOs per service.
4. SLOs disconnected from user
SLO: CPU < 80%
User: doesn't care about CPU, cares about latency
Result: Optimizes wrong metric
Conclusion
SLI, SLO, and SLA form a hierarchy of commitments:
| Level | Question it answers |
|---|---|
| SLI | How do we measure quality? |
| SLO | What level of quality do we seek? |
| SLA | What's the consequence of not meeting it? |
To implement successfully:
- Start simple: few well-defined SLOs
- Base on data: use history to define realistic targets
- Involve stakeholders: SLOs are agreements, not impositions
- Monitor and alert: SLOs without monitoring are just wishful thinking
- Iterate: review and adjust as you learn
SLOs aren't about perfection. They're about defining what's "good enough" and ensuring you deliver it.