Engineering teams live with a constant tension: ship features fast or keep the system stable. An error budget turns this conflict into collaboration by giving a clear answer to the question: "How much risk can we take?"
This article explores what an error budget is, how to calculate it, and how to use it to make better decisions.
Error budget isn't permission to fail. It's clarity about how much you can experiment.
What is Error Budget
An error budget is the amount of "allowed failure" a service can accumulate over a time window before it violates its SLO (service level objective).
Calculation:
Error Budget = 100% - SLO
If SLO = 99.9% availability
Error Budget = 0.1% of allowed unavailability
In time:
0.1% of 30 days = 43.2 minutes of allowed downtime per month
Why Error Budget Matters
1. Transforms abstract into concrete
Vague: "We need to be more reliable"
Concrete: "We have 20 minutes of error budget remaining this month"
2. Creates common language
Product, Engineering, and Operations can agree on how much risk to take:
- Budget remaining → we can launch risky features
- Budget running out → focus on stability
3. Eliminates subjective debates
Without error budget:
"Should we launch?" → "Maybe... seems risky... I don't know..."
With error budget:
"Should we launch?" → "We have 80% of the budget left and the launch is estimated to consume 5% of it. Yes."
Calculating Error Budget
Basic formula
Total Error Budget = (1 - SLO) × time window
For SLO of 99.9% in 30 days:
Budget = 0.001 × 30 × 24 × 60 = 43.2 minutes
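As a minimal sketch, the formula above can be wrapped in a small Python helper. The function name `error_budget_minutes` and the 30-day default window are illustrative assumptions, not part of any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the allowed failure time, in minutes, for an SLO over a window.

    `slo` is a fraction, for example 0.999 for 99.9%.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


# Compare a few common targets over the same 30-day window.
for target in (0.99, 0.999, 0.9999):
    minutes = error_budget_minutes(target)
    print(f"SLO {target:.2%}: {minutes:.1f} minutes per 30 days")
```

Each extra nine in the SLO cuts the budget by a factor of ten, which is why the choice of target matters so much.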
Budget consumed
Consumed Budget = total time the service failed to meet its target within the window (for availability, total downtime)
If there were 15 minutes of unavailability:
Consumed = 15 minutes
Remaining = 43.2 - 15 = 28.2 minutes
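The same bookkeeping can be sketched in Python. This assumes you already have a list of incident durations in minutes; the helper name `budget_status` and the dictionary keys are hypothetical.

```python
def budget_status(total_minutes: float, incident_minutes: list[float]) -> dict[str, float]:
    """Summarize consumed and remaining error budget for one window."""
    consumed = sum(incident_minutes)
    # A negative remainder means the budget was overspent and the SLO missed.
    remaining = total_minutes - consumed
    return {
        "consumed_minutes": consumed,
        "remaining_minutes": remaining,
        "remaining_pct": 100.0 * remaining / total_minutes,
    }


# The example from the text: 43.2 minutes of budget, one 15-minute outage.
status = budget_status(43.2, [15.0])
print(
    f"Remaining: {status['remaining_minutes']:.1f} min "
    f"({status['remaining_pct']:.0f}% of budget)"
)
```

Expressing the remainder as a percentage is what makes it comparable across SLOs with different budgets.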
Multiple SLOs
When you have multiple SLOs, each has its own budget:
Availability: 0.1% budget = 43 min/month
Latency: 1% budget = 7.2 hours/month of slow requests
Errors: 0.5% budget = 3.6 hours/month of requests with errors
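One possible way to track several budgets side by side is sketched below. The SLO targets match the list above; the consumed minutes are made-up values for illustration, and the `SloBudget` class is a hypothetical helper.

```python
from dataclasses import dataclass

WINDOW_MINUTES = 30 * 24 * 60  # 30-day window, as in the examples above


@dataclass
class SloBudget:
    name: str
    slo: float               # target as a fraction, e.g. 0.999
    consumed_minutes: float  # illustrative value, not real data

    @property
    def total_minutes(self) -> float:
        return (1.0 - self.slo) * WINDOW_MINUTES

    @property
    def remaining_fraction(self) -> float:
        return 1.0 - self.consumed_minutes / self.total_minutes


budgets = [
    SloBudget("availability", 0.999, consumed_minutes=15.0),
    SloBudget("latency", 0.99, consumed_minutes=120.0),
    SloBudget("errors", 0.995, consumed_minutes=200.0),
]

for b in budgets:
    print(f"{b.name}: {b.remaining_fraction:.0%} of budget left")

# The most depleted budget is the one that should drive decisions.
most_depleted = min(budgets, key=lambda b: b.remaining_fraction)
print(f"Most depleted: {most_depleted.name}")
```

Because each budget is tracked independently, a healthy availability budget does not hide an error budget that is nearly gone.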
Using Error Budget in Practice
Error Budget Policy
Define clear rules based on budget state:
Budget > 50%: 🟢
- Normal releases
- Experiments allowed
- Focus on features
Budget 20-50%: 🟡
- Releases with more caution
- Quick rollback ready
- Extra monitoring attention
Budget < 20%: 🔴
- Feature freeze
- Only critical fixes
- Team focused on stability
Budget = 0%: ⛔
- No releases
- Mandatory postmortem
- Exclusive work on reliability
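A policy like this is easy to encode so that dashboards and release tooling agree on the current state. The sketch below is one way to do it; the function name and the state labels are assumptions, and the boundary values of exactly 20% and 50% are treated as yellow.

```python
def policy_state(remaining_pct: float) -> str:
    """Map the remaining error budget (in percent) to a policy state."""
    if remaining_pct <= 0:
        return "exhausted"  # no releases, mandatory postmortem
    if remaining_pct < 20:
        return "red"        # feature freeze, only critical fixes
    if remaining_pct <= 50:
        return "yellow"     # cautious releases, rollback ready
    return "green"          # normal releases, experiments allowed


for pct in (80, 35, 10, 0):
    print(f"{pct}% remaining -> {policy_state(pct)}")
```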
Release decisions
New feature estimated to consume 5% of error budget.
Current budget: 60%
Decision:
60% - 5% = 55% remaining > 50% threshold
→ ✅ Can launch
Current budget: 25%
25% - 5% = 20% remaining = threshold
→ ⚠️ Can launch, but with extra caution
Current budget: 10%
10% - 5% = 5% remaining < 20% threshold
→ ❌ Don't launch, focus on stability
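The three decisions above follow the same rule: subtract the estimated cost and compare the result with the policy thresholds. A minimal sketch, assuming the 50% and 20% thresholds from the policy and a hypothetical function name:

```python
def release_decision(current_pct: float, estimated_cost_pct: float) -> str:
    """Decide on a release from the budget left after its estimated cost."""
    after_release = current_pct - estimated_cost_pct
    if after_release > 50:
        return "launch"
    if after_release >= 20:
        return "launch with extra caution"
    return "do not launch"


# The three scenarios from the text, each with a 5% estimated cost.
for current in (60, 25, 10):
    print(f"{current}% budget -> {release_decision(current, 5)}")
```

The estimate itself is the weak point: it is a guess, so it is worth comparing against the actual consumption after each launch.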
Team time allocation
Budget > 75%:
- 80% features, 20% reliability
Budget 25-75%:
- 50% features, 50% reliability
Budget < 25%:
- 20% features, 80% reliability
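The allocation rule can be written down in a few lines, which helps when it feeds sprint planning tools. This is only a sketch; the function name is hypothetical and the boundary values of 25% and 75% fall into the middle band.

```python
def time_allocation(remaining_pct: float) -> tuple[int, int]:
    """Return (features %, reliability %) for the remaining budget."""
    if remaining_pct > 75:
        return (80, 20)
    if remaining_pct >= 25:
        return (50, 50)
    return (20, 80)


features, reliability = time_allocation(60)
print(f"Features: {features}%, reliability: {reliability}%")
```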
Burn Rate
Burn rate measures how fast you're consuming error budget.
Calculation
Burn Rate = (Budget consumed / Time elapsed) / (Total budget / Total window)
Example:
- Total budget: 43 minutes/month
- Time elapsed: 10 days
- Budget consumed: 20 minutes
Expected consumption after 10 days: 43 × (10/30) = 14.3 min
Actual consumption: 20 min
Burn rate = 20 / 14.3 = 1.4x
→ Consuming budget 40% faster than sustainable
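The calculation above can be sketched as a pair of Python helpers. Both function names are illustrative, and the projection assumes the current pace simply continues, which real incidents rarely do.

```python
def burn_rate(consumed: float, elapsed_days: float,
              total_budget: float, window_days: float = 30.0) -> float:
    """Ratio of the actual consumption pace to the sustainable pace."""
    if elapsed_days <= 0 or total_budget <= 0:
        raise ValueError("elapsed_days and total_budget must be positive")
    actual_per_day = consumed / elapsed_days
    sustainable_per_day = total_budget / window_days
    return actual_per_day / sustainable_per_day


def days_until_exhausted(consumed: float, elapsed_days: float,
                         total_budget: float) -> float:
    """Days left before the budget hits zero if the current pace holds."""
    per_day = consumed / elapsed_days
    if per_day == 0:
        return float("inf")
    return max(total_budget - consumed, 0.0) / per_day


# The example from the text: 20 of 43 minutes used after 10 days.
rate = burn_rate(consumed=20, elapsed_days=10, total_budget=43)
left = days_until_exhausted(consumed=20, elapsed_days=10, total_budget=43)
print(f"Burn rate: {rate:.1f}x, budget gone in about {left:.1f} days")
```

With 20 days still left in the window and roughly 11.5 days of budget at this pace, the budget would run out before the window resets.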
Burn rate based alerts
| Burn Rate | Interpretation | Action |
|---|---|---|
| < 1x | Sustainable | Continue as normal |
| 1-2x | Attention | Monitor closely |
| 2-5x | Alert | Investigate cause |
| > 5x | Critical | Immediate action |
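The table translates directly into a lookup that alerting or reporting code could share. In this sketch the function name is hypothetical, and how the boundary values are assigned (exactly 1x, 2x, and 5x) is an assumption, since the table does not say.

```python
def classify_burn_rate(rate: float) -> tuple[str, str]:
    """Return (interpretation, action) for a burn rate multiplier."""
    if rate < 1:
        return ("Sustainable", "Continue as normal")
    if rate < 2:
        return ("Attention", "Monitor closely")
    if rate <= 5:
        return ("Alert", "Investigate cause")
    return ("Critical", "Immediate action")


for rate in (0.5, 1.4, 3.0, 8.0):
    level, action = classify_burn_rate(rate)
    print(f"{rate}x: {level} -> {action}")
```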
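# Prometheus alert
- alert: HighErrorBudgetBurnRate
  # Error ratio over the last hour divided by the budget fraction (1 - SLO)
  # gives the burn rate. Every non-2xx response counts against the budget
  # here, including 3xx and 4xx; narrow the matcher (for example to 5xx)
  # if your SLI defines failures differently.
  expr: |
    (
      1 - (sum(rate(http_requests_total{status=~"2.."}[1h]))
      / sum(rate(http_requests_total[1h])))
    ) / (1 - 0.999) > 2
  for: 5m
  annotations:
    summary: "Error budget burning more than 2x faster than sustainable"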
Error Budget and Culture
Mindset shift
Before:
- Reliability is ops' responsibility
- Features are the priority
- Stability is a cost
After:
- Reliability is a shared responsibility
- The error budget balances features and stability
- Stability is an investment
Postmortems
When error budget is consumed, postmortems should answer:
- What caused the budget consumption?
- Was it predictable?
- What can we do to reduce future consumption?
- Is the trade-off worth it (feature vs stability)?
Healthy negotiation
Product: "We need to launch this feature"
Engineering: "We only have 15% error budget"
Options:
1. Launch to a small percentage of users (less risk)
2. Delay until next budget cycle
3. Invest in hardening before launching
4. Accept risk and launch (conscious decision)
Common Pitfalls
1. Treating budget as a goal to spend
❌ "We have 40% budget, let's use it all on risky experiments"
✅ "We have 40% budget, we can take calculated risks"
2. Ignoring budget when convenient
❌ "This feature is too important, launch even without budget"
✅ "Without budget, we need to choose: delay feature or accept SLO degradation"
3. SLOs disconnected from reality
❌ SLO of 99.99%, budget of about 4.3 min/month, impossible to meet
✅ SLO based on real capacity, usable budget
Conclusion
Error budget is a powerful tool for:
- Quantifying risk objectively
- Aligning teams around common objectives
- Making decisions based on data
- Balancing speed and reliability
To implement:
- Define SLOs that are clear and realistic
- Calculate error budget for each SLO
- Establish policies based on budget state
- Monitor burn rate continuously
- Respect the system: the budget is not a suggestion
Error budget is democracy applied to engineering: everyone has a voice, but the rules are clear and respected.