
SLO, SLA, and SLI: defining performance expectations

Understand the difference between SLI, SLO, and SLA, and how to use them to manage performance expectations.

How much availability is "good enough"? What latency is "acceptable"? Without clear definitions, these questions become subjective debates. Together, SLIs, SLOs, and SLAs give you a framework for turning vague expectations into measurable commitments.

This article explains each concept, how they relate, and how to implement them in practice.

Without clear goals, everything is urgent and nothing is a priority.

The Three Concepts

SLI (Service Level Indicator)

What it is: a metric that quantifies some aspect of the service.

Examples:

  • Proportion of requests with latency < 200ms
  • Proportion of successful requests (2xx status)
  • Proportion of time the service is available

Typical formula:

SLI = (good events) / (total events) × 100%

Latency SLI = (requests < 200ms) / (total requests) × 100%
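
As a minimal sketch of the formula above, the SLI can be computed directly from raw counts. The function name `sli_percent` and the sample latencies are illustrative assumptions, not part of any particular monitoring tool.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as the percentage of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return good_events * 100 / total_events


# Hypothetical request latencies in milliseconds (made-up sample data).
latencies_ms = [120, 95, 310, 180, 150, 199, 450, 80, 170, 60]

# A request counts as "good" when it finishes in under 200ms.
fast_requests = sum(1 for ms in latencies_ms if ms < 200)
latency_sli = sli_percent(fast_requests, len(latencies_ms))

print(f"Latency SLI: {latency_sli:.1f}%")  # 8 of 10 requests are under 200ms
```

In a real system the two counts would usually come from a metrics store rather than an in-memory list, but the ratio is the same.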

SLO (Service Level Objective)

What it is: the target you set for an SLI.

Examples:

  • 99% of requests should have latency < 200ms
  • 99.9% of requests should succeed
  • The service should be available 99.95% of the time

Format:

SLO = SLI [operator] [threshold] during [period]

"99% of requests < 200ms in a 28-day window"

SLA (Service Level Agreement)

What it is: a formal contract with consequences if the SLO is not met.

Examples:

  • "We guarantee 99.9% availability. If we don't meet it, you receive 10% credit."
  • "Maximum latency of 500ms. Violations result in contractual penalty."

Relationship:

SLA is a contract based on an SLO
SLO is a target based on an SLI
SLI is a measurement of the system
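
One way to picture this layering is as two small data types: an SLO wraps a target for a measured SLI, and an SLA wraps an SLO plus a consequence. The class names, fields, and the 28-day window below are assumptions made for illustration. The 10% credit mirrors the example SLA above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for one SLI, evaluated over a time window."""
    sli_name: str
    target_percent: float
    window_days: int

    def is_met(self, measured_sli_percent: float) -> bool:
        return measured_sli_percent >= self.target_percent


@dataclass(frozen=True)
class SLA:
    """A contract that attaches a consequence to missing an SLO."""
    slo: SLO
    credit_percent: float  # credit owed to the customer on a miss

    def credit_owed(self, measured_sli_percent: float) -> float:
        return 0.0 if self.slo.is_met(measured_sli_percent) else self.credit_percent


availability_slo = SLO("availability", target_percent=99.9, window_days=28)
availability_sla = SLA(availability_slo, credit_percent=10.0)

print(availability_sla.credit_owed(99.95))  # SLO met, no credit
print(availability_sla.credit_owed(99.80))  # SLO missed, credit applies
```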

Why SLOs Matter

1. They define "good enough"

Without SLO:
- "The system is slow"
- "We need to improve"
- "It's not good enough"

With SLO:
- "We're at 98.5% vs target of 99%"
- "We're 0.5 percentage points short of the objective"
- "We need to cut failing requests by a third (from 1.5% to 1%)"

2. They enable conscious trade-offs

100% availability is impossible. 99.999% is extremely expensive. What's the right level?

Availability   Downtime/month   Relative cost
99%            7.2 hours        $
99.9%          43 minutes       $$
99.99%         4.3 minutes      $$$$
99.999%        26 seconds       $$$$$$
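
The downtime figures follow from simple arithmetic. The sketch below assumes a 30-day month, which is what the numbers in the table correspond to. The function name is hypothetical.

```python
def allowed_downtime_seconds(availability_percent: float, days: int = 30) -> float:
    """Seconds of downtime permitted by an availability target over `days`."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    window_seconds = days * 24 * 60 * 60
    return window_seconds * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target}% -> {seconds / 60:.1f} minutes per 30 days")
```

Each extra nine divides the allowed downtime by ten, which is why the cost of the next level rises so steeply.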

3. They balance speed and reliability

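If you're always meeting your SLOs with room to spare, maybe you're being too conservative. If you're always violating them, you're being irresponsible.

The gap between 100% and your SLO target is your error budget: the amount of unreliability you are allowed to spend. The sweet spot is spending some of that budget on shipping new features without exhausting it.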
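
To make the error budget concrete: it is the share of events that the SLO allows to be bad, and the remaining budget is whatever part of that allowance has not been used yet. The sketch below is one simple way to compute it. The function name and the request counts are assumptions chosen for illustration.

```python
def error_budget_remaining(slo_percent: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = total_events * (100 - slo_percent) / 100
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 7,000 of them bad, against a 99% SLO.
remaining = error_budget_remaining(99.0, good_events=993_000, total_events=1_000_000)
print(f"Error budget remaining: {remaining:.0%}")
```

A negative result means the SLO has already been missed for the window, which is a common signal to slow down releases and prioritize reliability work.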

Choosing Good SLIs

Characteristics of a good SLI

  1. Reflects user experience: measures what the user sees
  2. Is measurable: can be collected reliably
  3. Is actionable: you can do something when it's bad
  4. Is simple: easy to understand and explain

Recommended SLIs by service type

Request/response services:

  • Availability: % of successful requests
  • Latency: % of requests below threshold
  • Quality: % of correct responses

Data processing services:

  • Freshness: % of time data is up to date
  • Correctness: % of data processed correctly
  • Throughput: % of jobs completed on time

Storage services:

  • Durability: % of data preserved
  • Availability: % of time accessible
  • Latency: % of operations below threshold

Defining SLOs

The process

  1. Choose SLIs relevant to the service
  2. Analyze historical data: what's the current behavior?
  3. Consult stakeholders: what level is acceptable?
  4. Define targets based on data and expectations
  5. Iterate: SLOs are not static

Practical example

Service: Checkout API

Step 1 - SLIs:

  • Latency: % requests < 500ms
  • Availability: % successful requests
  • Errors: % requests without 5xx error

Step 2 - Historical data:

  • Latency: 95% of requests < 500ms currently
  • Availability: 99.7% currently
  • Errors: 0.3% of requests return a 5xx error (99.7% without)

Step 3 - Stakeholders:

  • Product: "Checkout needs to be fast, it's critical for conversion"
  • Finance: "Every checkout failure is a lost sale"
  • Engineering: "We can improve, but 100% is impossible"

Step 4 - Defined SLOs:

  • Latency: 99% < 500ms (better than current, challenging but achievable)
  • Availability: 99.9% (very critical, we need to invest)
  • Errors: fewer than 0.1% of requests return a 5xx error (99.9% without)

Step 5 - Window:

  • 28-day rolling window
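
A quick way to judge how ambitious these targets are is to compare the share of bad requests today with the share each SLO allows. The sketch below uses the figures from steps 2 and 4. The function name is hypothetical.

```python
def required_bad_event_reduction(current_percent: float, target_percent: float) -> float:
    """Fraction by which bad events must shrink to move from current to target."""
    current_bad = 100 - current_percent
    target_bad = 100 - target_percent
    if current_bad <= target_bad:
        return 0.0  # already meeting the target
    return (current_bad - target_bad) / current_bad


# (current %, target %) pairs taken from steps 2 and 4 above.
checkout = {
    "latency < 500ms": (95.0, 99.0),
    "availability": (99.7, 99.9),
}

for name, (current, target) in checkout.items():
    cut = required_bad_event_reduction(current, target)
    print(f"{name}: reduce bad requests by {cut:.0%}")
```

Moving latency from 95% to 99% means cutting slow requests from 5% to 1%, so a target that looks only four points higher actually requires eliminating most of the current slow requests.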

Implementing SLOs

Monitoring

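# Prometheus alerting rule (the group name is illustrative)
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutLatencySLOBreach
        # Share of requests served within 500ms over the last 5 minutes
        expr: |
          (
            sum(rate(checkout_request_duration_seconds_bucket{le="0.5"}[5m]))
            /
            sum(rate(checkout_request_duration_seconds_count[5m]))
          ) < 0.99
        # Fire only if the ratio stays below target for 5 minutes
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Checkout latency SLO at risk"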

Dashboard

┌────────────────────────────────────────┐
│ SLO Status: Checkout API               │
├────────────────────────────────────────┤
│ Latency (<500ms)                       │
│ ████████████████████░░░ 99.3% / 99%    │
│ Error budget: 30% remaining            │
├────────────────────────────────────────┤
│ Availability                           │
│ █████████████████████░░ 99.95% / 99.9% │
│ Error budget: 50% remaining            │
└────────────────────────────────────────┘

Layered alerts

  1. High burn rate: SLO will be violated soon → page on-call
  2. Medium burn rate: SLO at risk → Slack notification
  3. Negative trend: could become a problem → ticket for review
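
Burn rate is the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the SLO window, and higher values exhaust it sooner. The sketch below maps a burn rate to the three layers above. The threshold values (14.4, 6, and 1) are assumptions for illustration and should be tuned to your own window and paging tolerance. Treating the third layer as "any burn rate above 1" is also a simplification.

```python
def burn_rate(observed_error_rate: float, slo_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    allowed_error_rate = (100 - slo_percent) / 100
    if allowed_error_rate <= 0:
        raise ValueError("slo_percent must be below 100")
    return observed_error_rate / allowed_error_rate


def alert_action(rate: float, page_at: float = 14.4,
                 notify_at: float = 6.0, ticket_at: float = 1.0) -> str:
    """Map a burn rate to an alert layer (default thresholds are assumptions)."""
    if rate >= page_at:
        return "page on-call"
    if rate >= notify_at:
        return "Slack notification"
    if rate >= ticket_at:
        return "ticket for review"
    return "no action"


# Hypothetical error rates measured against a 99.9% SLO.
for error_rate in (0.02, 0.008, 0.002, 0.0005):
    rate = burn_rate(error_rate, 99.9)
    print(f"burn rate {rate:.1f} -> {alert_action(rate)}")
```

In practice the burn rate is usually evaluated over more than one lookback window, so that a short spike does not page anyone and a slow leak still gets noticed.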

Common Mistakes

1. SLOs too aggressive

Target: 99.999% availability
Reality: 99.9%

Result: Always violating, demotivated team, SLO ignored

2. SLOs too loose

Target: 95% availability
Reality: 99.5%

Result: The SLO never causes concern and doesn't drive improvement

3. Too many SLOs

15 different SLOs
Team doesn't know which to prioritize
None receives adequate attention

Recommendation: 3-5 SLOs per service.

4. SLOs disconnected from the user

SLO: CPU < 80%
User: doesn't care about CPU, cares about latency

Result: The team optimizes the wrong metric

Conclusion

SLI, SLO, and SLA form a hierarchy of commitments:

Level   Question it answers
SLI     How do we measure quality?
SLO     What level of quality do we seek?
SLA     What's the consequence of not meeting it?

To implement successfully:

  1. Start simple: few well-defined SLOs
  2. Base on data: use history to define realistic targets
  3. Involve stakeholders: SLOs are agreements, not impositions
  4. Monitor and alert: SLOs without monitoring are just wishful thinking
  5. Iterate: review and adjust as you learn

SLOs aren't about perfection. They're about defining what's "good enough" and ensuring you deliver it.
