How much availability is "good enough"? What latency is "acceptable"? Without clear definitions, these questions become subjective debates. SLI, SLO, and SLA are frameworks for transforming vague expectations into measurable commitments.
This article explains each concept, how they relate, and how to implement them in practice.
Without clear goals, everything is urgent and nothing is a priority.
The Three Concepts
SLI (Service Level Indicator)
What it is: a metric that quantifies some aspect of the service.
Examples:
- Proportion of requests with latency < 200ms
- Proportion of successful requests (2xx status)
- Proportion of time the service is available
Typical formula:
SLI = (good events) / (total events) × 100%
Latency SLI = (requests < 200ms) / (total requests) × 100%
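To make the formula concrete, here is a minimal Python sketch of a latency SLI; the request durations and the 200 ms threshold are illustrative assumptions, not values from any real service.

```python
# Minimal sketch: computing a latency SLI from raw request durations.
# The sample data and the 200 ms threshold are illustrative assumptions.

def latency_sli(durations_ms, threshold_ms=200.0):
    """Return the percentage of requests faster than the threshold."""
    if not durations_ms:
        return 100.0  # no traffic: treat the SLI as trivially met
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return good / len(durations_ms) * 100.0

requests = [120, 90, 310, 180, 240, 95, 150]  # durations in milliseconds
print(f"Latency SLI: {latency_sli(requests):.1f}%")  # -> 71.4%
```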
SLO (Service Level Objective)
What it is: the target you set for an SLI.
Examples:
- 99% of requests should have latency < 200ms
- 99.9% of requests should succeed
- The service should be available 99.95% of the time
Format:
SLO = SLI [operator] [threshold] during [period]
"99% of requests < 200ms in a 28-day window"
SLA (Service Level Agreement)
What it is: a formal contract with consequences if the SLO is not met.
Examples:
- "We guarantee 99.9% availability. If we don't meet it, you receive 10% credit."
- "Maximum latency of 500ms. Violations result in contractual penalty."
Relationship:
SLA is a contract based on an SLO
SLO is a target based on an SLI
SLI is a measurement of the system
Why SLOs Matter
1. They define "good enough"
Without SLO:
- "The system is slow"
- "We need to improve"
- "It's not good enough"
With SLO:
- "We're at 98.5% vs target of 99%"
- "We're 0.5% short of the objective"
- "We need to reduce errors by 50%"
2. They enable conscious trade-offs
100% availability is impossible. 99.999% is extremely expensive. What's the right level?
| Availability | Downtime/month | Relative cost |
|---|---|---|
| 99% | 7.2 hours | $ |
| 99.9% | 43 minutes | $$ |
| 99.99% | 4.3 minutes | $$$$ |
| 99.999% | 26 seconds | $$$$$$ |
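The downtime column is straightforward arithmetic on the availability target. A quick sketch, assuming a 30-day month:

```python
# Sketch of the arithmetic behind the table above, assuming a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (99.0, 99.9, 99.99, 99.999):
    allowed_downtime = MINUTES_PER_MONTH * (1 - availability / 100)
    print(f"{availability}% -> {allowed_downtime:.1f} minutes/month")

# 99.0%   -> 432.0 minutes (~7.2 hours)
# 99.9%   -> 43.2 minutes
# 99.99%  -> 4.3 minutes
# 99.999% -> 0.4 minutes (~26 seconds)
```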
3. They balance speed and reliability
If you're always meeting your SLOs with room to spare, you're probably being too conservative and could ship faster. If you're always violating them, your targets are unrealistic or the system needs investment.
The sweet spot is spending your error budget deliberately: the unreliability the SLO allows is room you can use to take risks and release new features faster.
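A minimal sketch of how an error budget and its remaining share can be computed; the request counts are illustrative assumptions:

```python
# Sketch: the error budget is the amount of failure the SLO allows.
# Request counts are illustrative assumptions.
slo_target = 0.999          # 99.9% of requests must succeed
total_requests = 1_000_000  # traffic in the SLO window
failed_requests = 600       # failures observed so far

budget = total_requests * (1 - slo_target)  # 1,000 failures allowed
consumed = failed_requests / budget         # 0.6 -> 60% of budget spent
remaining = 1 - consumed                    # 40% left for risk-taking

print(f"Error budget: {budget:.0f} failures allowed, {remaining:.0%} remaining")
```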
Choosing Good SLIs
Characteristics of a good SLI
- Reflects user experience: measures what the user sees
- Is measurable: can be collected reliably
- Is actionable: you can do something when it's bad
- Is simple: easy to understand and explain
Recommended SLIs by service type
Request/response services:
- Availability: % of successful requests
- Latency: % of requests below threshold
- Quality: % of correct responses
Data processing services:
- Freshness: % of time data is up to date
- Correctness: % of data processed correctly
- Throughput: % of jobs completed on time
Storage services:
- Durability: % of data preserved
- Availability: % of time accessible
- Latency: % of operations below threshold
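The same good-events-over-total pattern works for the non-request/response types above. As an illustration, a freshness SLI for a data pipeline could be computed like this; the 15-minute staleness limit and the sample ages are assumptions:

```python
# Sketch: a freshness SLI, measured as the share of periodic checks in which
# the newest processed record was younger than the allowed staleness.
from datetime import timedelta

MAX_STALENESS = timedelta(minutes=15)  # illustrative threshold

# age of the newest processed record at each periodic check (assumed samples)
observed_ages = [timedelta(minutes=m) for m in (3, 7, 22, 5, 40, 9, 11, 14)]

fresh_checks = sum(1 for age in observed_ages if age <= MAX_STALENESS)
freshness_sli = fresh_checks / len(observed_ages) * 100
print(f"Freshness SLI: {freshness_sli:.1f}%")  # -> 75.0%
```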
Defining SLOs
The process
- Choose SLIs relevant to the service
- Analyze historical data: what's the current behavior?
- Consult stakeholders: what level is acceptable?
- Define targets based on data and expectations
- Iterate: SLOs are not static
Practical example
Service: Checkout API
Step 1 - SLIs:
- Latency: % requests < 500ms
- Availability: % successful requests
- Errors: % of requests without a 5xx error
Step 2 - Historical data:
- Latency: 95% of requests < 500ms currently
- Availability: 99.7% currently
- Errors: 0.3% of requests return a 5xx error
Step 3 - Stakeholders:
- Product: "Checkout needs to be fast, it's critical for conversion"
- Finance: "Every checkout failure is a lost sale"
- Engineering: "We can improve, but 100% is impossible"
Step 4 - Defined SLOs:
- Latency: 99% < 500ms (better than current, challenging but achievable)
- Availability: 99.9% (very critical, we need to invest)
- Errors: < 0.1% of requests return a 5xx error
Step 5 - Window:
- 28-day rolling window
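To make the gap between step 2 (where the service is today) and step 4 (the targets) explicit, here is a small sketch using only the example numbers above:

```python
# Sketch: comparing current performance (step 2) with the defined targets (step 4).
# All figures are the example values from this section.
checkout_slos = {
    # SLI: (current %, target %)
    "latency < 500ms":      (95.0, 99.0),
    "availability":         (99.7, 99.9),
    "requests without 5xx": (99.7, 99.9),
}

for name, (current, target) in checkout_slos.items():
    gap = target - current
    status = "met" if gap <= 0 else f"{gap:.1f} points to close"
    print(f"{name}: {current}% today vs {target}% target ({status})")
```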
Implementing SLOs
Monitoring
```yaml
# Prometheus alerting rule
- alert: CheckoutLatencySLOBreach
  expr: |
    (
      sum(rate(checkout_request_duration_seconds_bucket{le="0.5"}[5m]))
      /
      sum(rate(checkout_request_duration_seconds_count[5m]))
    ) < 0.99
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Checkout latency SLO at risk"
```
Dashboard
```
┌────────────────────────────────────────┐
│ SLO Status: Checkout API               │
├────────────────────────────────────────┤
│ Latency (<500ms)                       │
│ ████████████████████░░░  99.3% / 99%   │
│ Error budget: 30% remaining            │
├────────────────────────────────────────┤
│ Availability                           │
│ █████████████████████░░ 99.95% / 99.9% │
│ Error budget: 50% remaining            │
└────────────────────────────────────────┘
```
Layered alerts
- High burn rate: SLO will be violated soon → page on-call
- Medium burn rate: SLO at risk → Slack notification
- Negative trend: could become a problem → ticket for review
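Burn rate here means how fast the error budget is being consumed relative to the SLO: a rate of 1x spends the budget exactly over the window, anything higher exhausts it early. A minimal sketch of mapping burn rate to these tiers; the thresholds are illustrative assumptions, not prescriptive values:

```python
# Sketch: mapping error-budget burn rate to alert tiers.
# Burn rate = observed error rate / error rate allowed by the SLO.
# The tier thresholds below are illustrative assumptions.

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate

def alert_tier(rate: float) -> str:
    if rate >= 10:  # budget gone in a few days -> page on-call
        return "page"
    if rate >= 2:   # budget at risk within the window -> Slack notification
        return "notify"
    if rate > 1:    # slow overspend -> ticket for review
        return "ticket"
    return "ok"

rate = burn_rate(observed_error_rate=0.02, slo_target=0.999)  # ~20x burn
print(f"{rate:.0f}x burn -> {alert_tier(rate)}")  # 20x burn -> page
```

In practice these tiers are usually evaluated over multiple time windows, so that sharp spikes page quickly while slow leaks only generate tickets.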
Common Mistakes
1. SLOs too aggressive
Target: 99.999% availability
Reality: 99.9%
Result: Always violating, demotivated team, SLO ignored
2. SLOs too loose
Target: 95% availability
Reality: 99.5%
Result: no one ever worries about it, and it never drives improvement
3. Too many SLOs
15 different SLOs
Team doesn't know which to prioritize
None receives adequate attention
Recommendation: 3-5 SLOs per service.
4. SLOs disconnected from user
SLO: CPU < 80%
User: doesn't care about CPU, cares about latency
Result: Optimizes wrong metric
Conclusion
SLI, SLO, and SLA form a hierarchy of commitments:
| Level | Question it answers |
|---|---|
| SLI | How do we measure quality? |
| SLO | What level of quality do we seek? |
| SLA | What's the consequence of not meeting it? |
To implement successfully:
- Start simple: few well-defined SLOs
- Base on data: use history to define realistic targets
- Involve stakeholders: SLOs are agreements, not impositions
- Monitor and alert: SLOs without monitoring are just wishful thinking
- Iterate: review and adjust as you learn
SLOs aren't about perfection. They're about defining what's "good enough" and ensuring you deliver it.