Capacity planning is the process of determining how much capacity you need to meet future demand. Underestimating causes outages; overestimating wastes money. The art is getting it right enough.
Capacity planning isn't predicting the future. It's being prepared for it.
Why Capacity Planning
Without planning
Month 1: System OK
Month 2: System OK
Month 3: 20% growth, slowdown
Month 4: Emergency! Rush to buy infra
Month 5: New infra misconfigured, problems persist
With planning
Month 1: Measurement and baseline
Month 2: Growth projection
Month 3: Advance provisioning
Month 4: System absorbs growth
Month 5: Review and adjust
Capacity Planning Steps
1. Measure current state
Baseline metrics:
Current throughput: 5,000 req/s
p95 latency: 150ms
Average CPU: 45%
Average memory: 60%
DB connections: 80/100
Current resources:
Servers: 4x c5.xlarge
DB: db.r5.2xlarge
Cache: cache.r5.large
2. Understand demand
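from dataclasses import dataclass

@dataclass
class Event:
    """A known upcoming event and its expected traffic multiplier."""
    name: str
    multiplier: float

# Analyze historical patterns
# (analyze_time_series and last_90_days stand in for your own metrics tooling)
daily_pattern = analyze_time_series(last_90_days)
# Identify trends (calculate_monthly_growth is likewise a placeholder)
growth_rate = calculate_monthly_growth()
# Ex: 0.15, i.e. 15% month over month
# Known events
upcoming_events = [
    Event("Black Friday", multiplier=5),
    Event("TV Campaign", multiplier=2),
    Event("Feature Launch", multiplier=1.5),
]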
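One way the monthly growth rate could be derived is as a compound (geometric-mean) rate across a series of monthly peak throughput values. This is a minimal sketch: the function name `monthly_growth_rate` and the sample figures are illustrative assumptions, not measurements from a real system.

```python
def monthly_growth_rate(monthly_peaks: list[float]) -> float:
    """Compound monthly growth rate between the first and last sample."""
    if len(monthly_peaks) < 2:
        raise ValueError("need at least two monthly samples")
    if monthly_peaks[0] <= 0:
        raise ValueError("first sample must be positive")
    periods = len(monthly_peaks) - 1
    return (monthly_peaks[-1] / monthly_peaks[0]) ** (1 / periods) - 1

# Hypothetical peak req/s for four consecutive months
history = [3300.0, 3800.0, 4350.0, 5000.0]
growth_rate = monthly_growth_rate(history)
print(f"{growth_rate:.1%} month over month")
```

Using only the first and last samples keeps the sketch short, but it ignores seasonality and noise in between; a fitted trend over more data points would be more robust.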
3. Project growth
def project_demand(current: float, months: int, growth_rate: float) -> float:
    """Compound growth: current * (1 + growth_rate) ** months."""
    return current * ((1 + growth_rate) ** months)

# Projection for 12 months
current_rps = 5000
monthly_growth = 0.15
month_6 = project_demand(current_rps, 6, monthly_growth)    # ~11,565 req/s
month_12 = project_demand(current_rps, 12, monthly_growth)  # ~26,751 req/s
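A single growth figure is fragile, so the same compound formula can be evaluated for several rates. The sketch below assumes three monthly rates (5%, 15%, 25%), matching the pessimistic, expected, and optimistic scenarios this article suggests under Common Mistakes; the `scenarios` dictionary and its labels are illustrative.

```python
def project_demand(current: float, months: int, growth_rate: float) -> float:
    """Compound growth, same formula as in step 3."""
    return current * ((1 + growth_rate) ** months)

# Assumed monthly growth rates per scenario; adjust to your own forecasts
scenarios = {"pessimistic": 0.05, "expected": 0.15, "optimistic": 0.25}

projections = {
    name: round(project_demand(5000, 6, rate))
    for name, rate in scenarios.items()
}
for name, rps in projections.items():
    print(f"{name:>11}: {rps:,} req/s at month 6")
```

The spread between scenarios (roughly 6,700 to 19,000 req/s at month 6) gives a sense of how strongly the plan depends on the growth assumption.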
4. Calculate required resources
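from math import ceil

def calculate_resources(target_rps: float, baseline: dict) -> dict:
    """
    Calculates resources by scaling the current resource-to-throughput ratio.
    next_size is a helper (not shown) that picks an instance size for the scaled load.
    """
    current_rps = baseline['throughput']
    scale_factor = target_rps / current_rps
    return {
        'servers': ceil(baseline['servers'] * scale_factor),
        'db_size': next_size(baseline['db_size'], scale_factor),
        'cache_size': next_size(baseline['cache_size'], scale_factor),
    }

# baseline holds the step 1 measurements, e.g.
# {'throughput': 5000, 'servers': 4, 'db_size': 'db.r5.2xlarge', 'cache_size': 'cache.r5.large'}
# For 6 months
resources_6m = calculate_resources(month_6, baseline)
# {'servers': 10, 'db_size': 'db.r5.4xlarge', 'cache_size': 'cache.r5.xlarge'}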
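The `next_size` helper is not defined in this article. A hypothetical stand-in might walk a size ladder in which each step is treated as doubling capacity, and take current utilization into account so that a lightly loaded instance is not upsized unnecessarily. The ladder, the name `size_for_load`, and the utilization parameters are all assumptions for illustration (the utilization figures come from the Current State table in the plan template further down); check your provider's actual instance catalog before relying on anything like this.

```python
from math import ceil, log2

# Assumed size ladder; each step is treated as doubling capacity
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]

def size_for_load(instance: str, scale_factor: float,
                  utilization: float, target_utilization: float) -> str:
    """Pick the smallest size that keeps utilization at or below the target."""
    family, _, size = instance.rpartition(".")
    # Capacity multiple needed relative to the current instance
    needed = utilization * scale_factor / target_utilization
    steps = max(0, ceil(log2(needed))) if needed > 0 else 0
    index = SIZE_LADDER.index(size) + steps
    if index >= len(SIZE_LADDER):
        raise ValueError("load exceeds the largest size in the ladder")
    return f"{family}.{SIZE_LADDER[index]}"

# Database at 60% utilization, 80% limit, ~2.31x projected load
print(size_for_load("db.r5.2xlarge", 2.31, 0.60, 0.80))
# Cache at 70% utilization, 90% limit, same projected load
print(size_for_load("cache.r5.large", 2.31, 0.70, 0.90))
```

With these inputs the sketch lands on db.r5.4xlarge and cache.r5.xlarge, the same sizes as the example result above, but only because one doubling is enough in both cases.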
5. Add margins
from math import ceil

def add_margins(resources: dict) -> dict:
    """
    Adds margin for:
    - Spikes (20%)
    - Special events (50%)
    - Estimation error (20%)
    upsize is a helper (not shown) that moves to the next instance size up.
    """
    margin = 1.2 * 1.5 * 1.2  # compounded: 2.16x
    return {
        'servers': ceil(resources['servers'] * margin),
        'db_size': upsize(resources['db_size']),
        'cache_size': upsize(resources['cache_size']),
    }
Projection Models
Linear
Constant absolute growth
Month 0: 5,000 req/s
Month 1: 5,500 req/s (+500)
Month 2: 6,000 req/s (+500)
When to use: Mature markets, stable growth
Exponential
Constant percentage growth
Month 0: 5,000 req/s
Month 1: 5,750 req/s (+15%)
Month 2: 6,612 req/s (+15%)
When to use: Growing startups, new products
Event-based
Baseline + predicted spikes
Normal: 5,000 req/s
Black Friday: 25,000 req/s (5x)
Campaign: 10,000 req/s (2x)
When to use: Seasonal businesses, intensive marketing
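The three models can be written as small functions to see how quickly they diverge. This is a sketch using the article's example figures (5,000 req/s, +500 per month, 15% per month, 5x for Black Friday); the function names are illustrative.

```python
def linear(current: float, months: int, monthly_increase: float) -> float:
    """Constant absolute growth per month."""
    return current + monthly_increase * months

def exponential(current: float, months: int, growth_rate: float) -> float:
    """Constant percentage growth per month."""
    return current * (1 + growth_rate) ** months

def event_based(baseline: float, multiplier: float) -> float:
    """Baseline scaled by a predicted event multiplier."""
    return baseline * multiplier

# Compare linear and exponential projections over increasing horizons
comparison = {
    months: (linear(5000, months, 500), round(exponential(5000, months, 0.15)))
    for months in (3, 6, 12)
}
for months, (lin, exp) in comparison.items():
    print(f"month {months:>2}: linear {lin:,} req/s, exponential {exp:,} req/s")

print(f"Black Friday: {event_based(5000, 5):,} req/s")
```

At 12 months the linear model gives 11,000 req/s while the exponential one gives about 26,751 req/s, so choosing the wrong model matters more the further out the plan looks.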
Provisioning Strategies
1. Just-in-time
Provision when utilization > 70%
Deprovision when < 30%
Pros: Optimized cost
Cons: Risk of not scaling in time
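The just-in-time rule can be expressed as a simple threshold check. This is a minimal sketch using the 70% and 30% thresholds above; the one-server step and the `min_servers` floor are assumptions added so the example never scales to zero.

```python
def scaling_decision(utilization: float, servers: int,
                     scale_up_at: float = 0.70, scale_down_at: float = 0.30,
                     min_servers: int = 2) -> int:
    """Return the server count after one evaluation of the thresholds."""
    if utilization > scale_up_at:
        return servers + 1
    if utilization < scale_down_at and servers > min_servers:
        return servers - 1
    return servers

print(scaling_decision(0.75, servers=4))  # above 70%: add a server
print(scaling_decision(0.50, servers=4))  # between thresholds: no change
print(scaling_decision(0.20, servers=4))  # below 30%: remove a server
```

The wide gap between the two thresholds helps avoid flapping, where a system repeatedly scales up and down around a single cutoff.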
2. Fixed headroom
Always maintain 50% margin
If you need 10 servers, maintain 15
Pros: Safety
Cons: Higher cost
3. Stepped
Q1: Provision for H1
Q3: Provision for H2
Quarterly review
Pros: Predictability
Cons: May under or over provision
4. Hybrid (recommended)
Base: stepped provisioning
+ Autoscaling: for daily variations
+ Reserve: for known events
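One way to picture the hybrid strategy is to split capacity into three numbers: a stepped base, an autoscaling ceiling for daily peaks, and extra reserve for a known event. The sketch below is an assumption about how those pieces might be combined; `hybrid_plan`, `daily_peak_ratio`, and the input values are hypothetical.

```python
from math import ceil

def hybrid_plan(projected_rps: float, rps_per_server: float,
                daily_peak_ratio: float, event_multiplier: float) -> dict:
    """Split capacity into a stepped base, an autoscaling ceiling, and an event reserve."""
    base = ceil(projected_rps / rps_per_server)
    autoscale_max = ceil(projected_rps * daily_peak_ratio / rps_per_server)
    event_total = ceil(projected_rps * event_multiplier / rps_per_server)
    return {
        "base": base,                    # provisioned in steps (e.g. per half-year)
        "autoscale_max": autoscale_max,  # upper bound for daily autoscaling
        "event_reserve": max(0, event_total - autoscale_max),  # extra for known events
    }

# Assumed inputs: 10,000 req/s projected, 500 req/s per server,
# daily peaks 1.5x the projection, and a 2x campaign
plan = hybrid_plan(projected_rps=10000, rps_per_server=500,
                   daily_peak_ratio=1.5, event_multiplier=2)
print(plan)
```

Here the base is 20 servers, autoscaling may grow to 30, and a further 10 are reserved for the campaign.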
Capacity Planning by Component
Application Servers
Key metrics:
- CPU utilization
- Memory usage
- Request throughput
- Active connections
Sizing:
rps_per_server: 500
target_rps: 10000
servers_needed: 20
with_margin: 30
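The sizing figures above follow from dividing target throughput by per-server throughput and then applying a margin. A minimal sketch, assuming the 50% margin implied by going from 20 to 30 servers (the function name is illustrative):

```python
from math import ceil

def servers_needed(target_rps: float, rps_per_server: float,
                   margin: float = 0.5) -> tuple[int, int]:
    """Return (servers without margin, servers with margin)."""
    base = ceil(target_rps / rps_per_server)
    return base, ceil(base * (1 + margin))

base, with_margin = servers_needed(target_rps=10000, rps_per_server=500)
print(base, with_margin)  # 20 30
```

The `rps_per_server` figure should come from load testing or production measurements at the utilization you are willing to run at, not from a theoretical maximum.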
Database
Key metrics:
- Queries per second
- Connections
- Storage growth
- IOPS
Sizing:
# Usually vertical scaling
current_size: db.r5.2xlarge
projected_load: 2x
target_size: db.r5.4xlarge
# Or read replicas
write_primary: 1
read_replicas: 3
Cache
Key metrics:
- Hit rate
- Memory usage
- Evictions
- Connections
Sizing:
current_keys: 1M
key_growth: 10%/month
projected_6m: 1.77M
memory_per_key: 1KB
memory_needed: 1.77GB
with_margin: 4GB
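The cache projection is compound key growth multiplied by the memory per key. A small sketch of that arithmetic, assuming decimal units (1KB treated as 1,000 bytes) and ignoring any per-key overhead the cache engine adds:

```python
def cache_memory_gb(current_keys: int, monthly_growth: float, months: int,
                    bytes_per_key: int) -> float:
    """Projected cache memory in GB (decimal, 1 GB = 10**9 bytes)."""
    projected_keys = current_keys * (1 + monthly_growth) ** months
    return projected_keys * bytes_per_key / 1e9

needed = cache_memory_gb(1_000_000, 0.10, 6, 1_000)
print(f"{needed:.2f} GB")  # 1.77 GB
```

The gap between 1.77GB and the 4GB chosen above is the margin, which also has to absorb overhead this sketch does not model.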
Message Queues
Key metrics:
- Messages per second
- Queue depth
- Consumer lag
- Storage
Sizing:
current_mps: 10000
retention_hours: 24
message_size: 1KB
storage_needed: 10000 * 3600 * 24 * 1KB = 864GB
with_margin: 1.5TB
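The storage figure is message rate times retention time times message size. The sketch below reproduces it in decimal units; the `replication_factor` parameter is an assumption not present in the sizing above, included because replicated queues multiply the storage requirement.

```python
def queue_storage_gb(messages_per_second: float, retention_hours: float,
                     message_bytes: int, replication_factor: int = 1) -> float:
    """Storage needed to retain messages, in GB (1 GB = 10**9 bytes)."""
    retention_seconds = retention_hours * 3600
    total_bytes = (messages_per_second * retention_seconds
                   * message_bytes * replication_factor)
    return total_bytes / 1e9

print(queue_storage_gb(10_000, 24, 1_000))  # 864.0
```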
Capacity Plan Document
# Capacity Plan - Q2 2025
## Executive Summary
We project 15% monthly growth. We need to expand
infrastructure before end of Q1 to avoid degradation.
## Current State
| Resource | Current | Utilization | Limit |
|----------|---------|-------------|-------|
| App servers | 4 | 45% | 80% |
| Database | r5.2xl | 60% | 80% |
| Cache | r5.lg | 70% | 90% |
## Demand Projection
| Metric | Current | +3 months | +6 months |
|--------|---------|-----------|-----------|
| RPS | 5000 | 7600 | 11500 |
| Users | 50K | 76K | 115K |
| Storage | 500GB | 650GB | 850GB |
## Required Resources
| Resource | +3 months | +6 months | Cost Delta |
|----------|-----------|-----------|------------|
| App servers | 6 (+2) | 10 (+6) | +$2,400/mo |
| Database | r5.2xl | r5.4xl | +$1,200/mo |
| Cache | r5.xl | r5.xl | +$400/mo |
## Timeline
- Week 1-2: Provision app servers
- Week 3: Migrate database
- Week 4: Validate and adjust
## Risks
- Black Friday (not modeled): may need 3x
- Viral new feature: unpredictable demand
## Recommendations
1. Approve immediate expansion
2. Configure autoscaling for spikes
3. Monthly metrics review
Common Mistakes
1. Using average instead of peak
❌ Average CPU: 40%, we have margin
✅ Peak CPU: 85%, we need to scale
2. Ignoring seasonality
❌ Last month = next month
✅ December has 3x more traffic
3. Single projection
❌ We'll grow 15%
✅ Scenarios: optimistic (25%), expected (15%), pessimistic (5%)
4. Forgetting dependencies
❌ App servers OK, done
✅ App servers OK, but DB will saturate in 2 months
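To check dependencies rather than a single tier, it can help to estimate how many months each component has before it reaches its limit. The sketch below uses the utilization and limit figures from the Current State table in the plan template and 15% monthly growth, and assumes utilization scales linearly with load, which is a simplification; the function name is illustrative.

```python
from math import log

def months_until_limit(utilization: float, limit: float,
                       monthly_growth: float) -> float:
    """Months until utilization reaches the limit under compound growth."""
    if utilization >= limit:
        return 0.0
    return log(limit / utilization) / log(1 + monthly_growth)

# (current utilization, limit) per component, from the Current State table
components = {
    "app servers": (0.45, 0.80),
    "database": (0.60, 0.80),
    "cache": (0.70, 0.90),
}
runway = {
    name: months_until_limit(used, limit, 0.15)
    for name, (used, limit) in components.items()
}
# Print the component with the shortest runway first
for name, months in sorted(runway.items(), key=lambda item: item[1]):
    print(f"{name}: {months:.1f} months")
```

Under these assumptions the cache and database run out of headroom in about two months, well before the app servers at about four, which is the situation mistake 4 describes.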
Conclusion
Effective capacity planning requires:
- Solid data: accurate historical metrics
- Multiple scenarios: optimistic, expected, pessimistic
- Adequate margins: events and estimation errors
- Continuous review: plans change with reality
- Alignment: business, product, and engineering
Recommended process:
Quarterly: complete capacity plan review
Monthly: metrics vs projection verification
Weekly: trend monitoring
Continuous: threshold alerts
The best capacity plan is the one you don't need to use in an emergency.