Error Budget: balancing speed and reliability

Engineering teams live with a constant tension: ship features fast or keep the system stable. An error budget turns that conflict into collaboration by giving a clear answer to the question: "How much risk can we take?"

This article explores what an error budget is, how to calculate it, and how to use it to make better decisions.

Error budget isn't permission to fail. It's clarity about how much you can experiment.

What is Error Budget

An error budget is the amount of "allowed failure" a service can accumulate over a given time window before it violates its SLO (service level objective).

Calculation:

Error Budget = 100% - SLO

If SLO = 99.9% availability
Error Budget = 0.1% of allowed unavailability

In time:

0.1% of 30 days = 43.2 minutes of allowed downtime per month

Why Error Budget Matters

1. Transforms abstract into concrete

Vague: "We need to be more reliable"
Concrete: "We have 20 minutes of error budget remaining this month"

2. Creates common language

Product, Engineering, and Operations can agree on how much risk to take:

Budget remaining → we can launch risky features
Budget running out → focus on stability

3. Eliminates subjective debates

Without error budget:
"Should we launch?" → "Maybe... seems risky... I don't know..."

With error budget:
"Should we launch?" → "We have 80% of the budget left and the launch is estimated to consume 5% of it. Yes."

Calculating Error Budget

Basic formula

Total Error Budget = (1 - SLO) × time window

For SLO of 99.9% in 30 days:
Budget = 0.001 × 30 × 24 × 60 = 43.2 minutes

Budget consumed

Consumed Budget = total time within the window during which the service failed to meet its objective (for availability, total downtime)

If there were 15 minutes of unavailability:
Consumed = 15 minutes
Remaining = 43.2 - 15 = 28.2 minutes
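
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the SLO is passed as a fraction (0.999 for 99.9%), and the window is assumed to be a fixed 30 days.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed 'bad' minutes in the window for an SLO given as a fraction."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def remaining_budget_minutes(slo: float, consumed_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after the bad minutes observed so far (negative if overspent)."""
    return error_budget_minutes(slo, window_days) - consumed_minutes


total = error_budget_minutes(0.999)
remaining = remaining_budget_minutes(0.999, consumed_minutes=15)
# Prints: total: 43.2 min, remaining: 28.2 min (65% of budget)
print(f"total: {total:.1f} min, remaining: {remaining:.1f} min "
      f"({remaining / total:.0%} of budget)")
```

Expressing the remainder as a percentage of the total is what makes the policy thresholds later in this article directly usable.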

Multiple SLOs

When you have multiple SLOs, each has its own budget:

Availability: 0.1% budget = 43 min/month
Latency: 1% budget = 7.2 hours/month of slow requests
Errors: 0.5% budget = 3.6 hours/month of requests with errors
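
For latency and error SLOs, the budget is often easier to reason about as a number of requests rather than as hours. The sketch below assumes a request-based SLO and a hypothetical traffic volume of one million requests per month; the function name is illustrative only.

```python
import math


def allowed_bad_requests(slo: float, total_requests: int) -> int:
    """Requests that may miss the objective in the window without violating the SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.99")
    # round() removes float noise (e.g. 10000.000000000009) before flooring.
    return math.floor(round((1 - slo) * total_requests, 6))


monthly_requests = 1_000_000  # hypothetical traffic volume
for name, slo in {"latency": 0.99, "errors": 0.995}.items():
    print(f"{name}: up to {allowed_bad_requests(slo, monthly_requests)} bad requests")
```

With these assumed numbers, the 1% latency budget allows 10,000 slow requests and the 0.5% error budget allows 5,000 failed requests per month.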

Using Error Budget in Practice

Error Budget Policy

Define clear rules based on budget state:

Budget > 50%: 🟢
- Normal releases
- Experiments allowed
- Focus on features

Budget 20-50%: 🟡
- Releases with more caution
- Quick rollback ready
- Extra monitoring attention

Budget < 20%: 🔴
- Feature freeze
- Only critical fixes
- Team focused on stability

Budget = 0%: ⛔
- No releases
- Mandatory postmortem
- Exclusive work on reliability
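
A policy like this is simple enough to encode, which keeps dashboards and release tooling consistent with each other. The following is a sketch under the thresholds listed above; the enum and function names are hypothetical, and the boundary handling (exactly 50% and exactly 20% both count as yellow) is an assumption.

```python
from enum import Enum


class BudgetState(Enum):
    GREEN = "normal releases, experiments allowed, focus on features"
    YELLOW = "cautious releases, quick rollback ready, extra monitoring"
    RED = "feature freeze, only critical fixes, focus on stability"
    EXHAUSTED = "no releases, mandatory postmortem, reliability work only"


def budget_state(remaining_pct: float) -> BudgetState:
    """Map the remaining error budget (percent of total) to a policy state."""
    if remaining_pct <= 0:
        return BudgetState.EXHAUSTED
    if remaining_pct < 20:
        return BudgetState.RED
    if remaining_pct <= 50:
        return BudgetState.YELLOW
    return BudgetState.GREEN


state = budget_state(65)
print(state.name, "->", state.value)
```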

Release decisions

New feature estimated to consume 5% of error budget.
Current budget: 60%

Decision:
60% - 5% = 55% remaining > 50% threshold
→ ✅ Can launch

Current budget: 25%
25% - 5% = 20% remaining = threshold
→ ⚠️ Can launch, but with extra caution

Current budget: 10%
10% - 5% = 5% remaining < 20% threshold
→ ❌ Don't launch, focus on stability
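
The three scenarios above follow one rule: subtract the estimated cost from the current budget and compare the result with the policy thresholds. A minimal sketch, assuming the 50% and 20% thresholds from the policy and a hypothetical function name:

```python
def release_decision(current_pct: float, estimated_cost_pct: float) -> str:
    """Decide on a launch from the budget that would remain afterwards."""
    remaining = current_pct - estimated_cost_pct
    if remaining > 50:
        return "launch"
    if remaining >= 20:
        return "launch with extra caution"
    return "hold and focus on stability"


for current in (60, 25, 10):
    print(f"budget {current}% ->", release_decision(current, estimated_cost_pct=5))
```

The estimate of how much budget a feature will consume is itself uncertain, so the output is best read as input to a conversation rather than as an automatic gate.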

Team time allocation

Budget > 75%:
- 80% features, 20% reliability

Budget 25-75%:
- 50% features, 50% reliability

Budget < 25%:
- 20% features, 80% reliability

Burn Rate

Burn rate measures how fast you're consuming error budget.

Calculation

Burn Rate = (Budget consumed / Time elapsed) / (Total budget / Total window)

Example:
- Total budget: 43 minutes/month
- Time elapsed: 10 days
- Budget consumed: 20 minutes

Expected consumption after 10 days: 43 × (10/30) = 14.3 min
Actual consumption: 20 min
Burn rate = 20 / 14.3 ≈ 1.4x

→ Consuming budget 40% faster than sustainable
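
The same calculation in Python, together with a simple linear projection of when the budget would run out if consumption continued at the current pace. Both function names are hypothetical, and the projection assumes a constant consumption rate, which real incidents rarely follow.

```python
def burn_rate(consumed: float, elapsed_days: float,
              total_budget: float, window_days: float = 30) -> float:
    """Actual consumption rate divided by the sustainable rate (1.0 = on pace)."""
    actual_rate = consumed / elapsed_days          # minutes per day so far
    sustainable_rate = total_budget / window_days  # minutes per day to last the window
    return actual_rate / sustainable_rate


def days_until_exhausted(consumed: float, elapsed_days: float,
                         total_budget: float) -> float:
    """Days left before the budget hits zero, assuming the current pace holds."""
    rate = consumed / elapsed_days
    if rate == 0:
        return float("inf")
    return max(0.0, (total_budget - consumed) / rate)


rate = burn_rate(consumed=20, elapsed_days=10, total_budget=43)
days_left = days_until_exhausted(consumed=20, elapsed_days=10, total_budget=43)
print(f"burn rate: {rate:.2f}x, budget exhausted in about {days_left:.1f} days")
```

For the example above, the projection gives 11.5 days, so the budget would run out around day 21 or 22 of the 30-day window.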

Burn rate based alerts

Burn Rate	Interpretation	Action
< 1x	Sustainable	Continue as normal
1-2x	Attention	Monitor closely
2-5x	Alert	Investigate cause
> 5x	Critical	Immediate action

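# Prometheus alerting rule (an entry in a rule group's `rules:` list)
- alert: HighErrorBudgetBurnRate
  # Burn rate = observed error ratio / error ratio allowed by the SLO (99.9%).
  # The error ratio here is 1 - (2xx / all), so every non-2xx response
  # (including 3xx and 4xx) counts against the budget.
  expr: |
    (
      1 - (sum(rate(http_requests_total{status=~"2.."}[1h]))
           / sum(rate(http_requests_total[1h])))
    ) / (1 - 0.999) > 2
  # Fire only if the condition holds continuously for 5 minutes.
  for: 5m
  annotations:
    summary: "Error budget burning more than 2x faster than sustainable"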
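
The alert expression divides the observed error ratio by the error ratio the SLO allows, which is the burn rate expressed in requests instead of minutes. The Python sketch below mirrors that arithmetic and maps the result onto the table above. The function names and request counts are hypothetical, and the handling of the table's boundary values (exactly 1x, 2x, 5x) is an assumption.

```python
def burn_rate_from_requests(good: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic means no budget was consumed
    error_ratio = 1 - good / total
    return error_ratio / (1 - slo)


def classify_burn_rate(rate: float) -> str:
    """Map a burn rate onto the interpretation column of the table."""
    if rate < 1:
        return "sustainable"
    if rate < 2:
        return "attention"
    if rate <= 5:
        return "alert"
    return "critical"


# Hypothetical hour of traffic: 100,000 requests, 300 of them failed.
rate = burn_rate_from_requests(good=99_700, total=100_000, slo=0.999)
print(f"burn rate: {rate:.1f}x -> {classify_burn_rate(rate)}")
```

At a constant burn rate of N, a 30-day budget lasts 30/N days, so the 3x rate in this example would exhaust the budget in about 10 days.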

Error Budget and Culture

Mindset shift

Before:

Reliability is ops' responsibility
Features are priority
Stability is cost

After:

Reliability is shared responsibility
Error budget balances features and stability
Stability is investment

Postmortems

When error budget is consumed, postmortems should answer:

What caused the budget consumption?
Was it predictable?
What can we do to reduce future consumption?
Is the trade-off worth it (feature vs stability)?

Healthy negotiation

Product: "We need to launch this feature"
Engineering: "We only have 15% error budget"

Options:
1. Launch to small % of users (less risk)
2. Delay until next budget cycle
3. Invest in hardening before launching
4. Accept risk and launch (conscious decision)

Common Pitfalls

1. Treating budget as a goal to spend

❌ "We have 40% budget, let's use it all on risky experiments"
✅ "We have 40% budget, we can take calculated risks"

2. Ignoring budget when convenient

❌ "This feature is too important, launch even without budget"
✅ "Without budget, we need to choose: delay feature or accept SLO degradation"

3. SLOs disconnected from reality

❌ SLO of 99.99% (about 4.3 min/month of budget) that the system cannot realistically meet
✅ SLO based on real capacity, usable budget

Conclusion

Error budget is a powerful tool for:

Quantifying risk objectively
Aligning teams around common objectives
Making decisions based on data
Balancing speed and reliability

To implement:

Define SLOs that are clear and realistic
Calculate error budget for each SLO
Establish policies based on budget state
Monitor burn rate continuously
Respect the system: the budget is not a suggestion

Error budget is democracy applied to engineering: everyone has a voice, but the rules are clear and respected.