Capacity Planning: sizing for the future

Capacity planning is the process of determining how much infrastructure you need to meet future demand. Underestimating causes outages; overestimating wastes money. The art is getting it right enough.

Capacity planning isn't predicting the future. It's being prepared for it.

Why Capacity Planning

Without planning

Month 1: System OK
Month 2: System OK
Month 3: 20% growth, slowdown
Month 4: Emergency! Rush to buy infra
Month 5: New infra misconfigured, problems persist

With planning

Month 1: Measurement and baseline
Month 2: Growth projection
Month 3: Advance provisioning
Month 4: System absorbs growth
Month 5: Review and adjust

Capacity Planning Steps

1. Measure current state

Baseline metrics:
  Current throughput: 5,000 req/s
  p95 latency: 150ms
  Average CPU: 45%
  Average memory: 60%
  DB connections: 80/100

Current resources:
  Servers: 4x c5.xlarge
  DB: db.r5.2xlarge
  Cache: cache.r5.large

2. Understand demand

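from dataclasses import dataclass

@dataclass
class Event:
    name: str
    multiplier: float  # expected traffic relative to baseline

# Analyze historical patterns
# (analyze_time_series and last_90_days are placeholders for your metrics tooling)
daily_pattern = analyze_time_series(last_90_days)

# Identify trends (calculate_monthly_growth is likewise a placeholder)
growth_rate = calculate_monthly_growth()
# e.g. 0.15 for 15% month over month

# Known events
upcoming_events = [
    Event("Black Friday", multiplier=5),
    Event("TV Campaign", multiplier=2),
    Event("Feature Launch", multiplier=1.5),
]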
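To make the trend step concrete, here is a minimal sketch of one way to estimate month-over-month growth from historical peaks. The function name `monthly_growth_rate` and the sample history are assumptions for illustration, not part of any particular monitoring tool.

```python
from statistics import geometric_mean

def monthly_growth_rate(monthly_peaks: list[float]) -> float:
    """Average compound month-over-month growth across the series."""
    if len(monthly_peaks) < 2:
        raise ValueError("need at least two months of data")
    # Ratio of each month to the previous one, e.g. 1.15 for +15%
    ratios = [later / earlier for earlier, later in zip(monthly_peaks, monthly_peaks[1:])]
    # The geometric mean is the right average for compounding ratios
    return geometric_mean(ratios) - 1

# Hypothetical peak req/s for the last four months
history = [3300, 3800, 4350, 5000]
rate = monthly_growth_rate(history)
print(f"Estimated monthly growth: {rate:.1%}")
```

A geometric mean is used rather than an arithmetic one because growth compounds; averaging the percentages directly tends to overstate the rate when months vary a lot.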

3. Project growth

def project_demand(current: float, months: int, growth_rate: float) -> float:
    """Project demand with compound monthly growth."""
    return current * ((1 + growth_rate) ** months)

# Projection for 12 months
current_rps = 5000
monthly_growth = 0.15

month_6 = project_demand(current_rps, 6, monthly_growth)    # ~11,565 req/s
month_12 = project_demand(current_rps, 12, monthly_growth)  # ~26,751 req/s

4. Calculate required resources

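from math import ceil

def calculate_resources(target_rps: float, baseline: dict) -> dict:
    """
    Scales each resource by the ratio of target to current throughput.
    Assumes load scales linearly with traffic. next_size is a placeholder
    helper that picks an instance size for the given scale factor.
    """
    current_rps = baseline['throughput']
    scale_factor = target_rps / current_rps

    return {
        'servers': ceil(baseline['servers'] * scale_factor),
        'db_size': next_size(baseline['db_size'], scale_factor),
        'cache_size': next_size(baseline['cache_size'], scale_factor)
    }

# Baseline from step 1
baseline = {'throughput': 5000, 'servers': 4,
            'db_size': 'db.r5.2xlarge', 'cache_size': 'cache.r5.large'}

# For 6 months (month_6 comes from the projection in step 3)
resources_6m = calculate_resources(month_6, baseline)
# {'servers': 10, 'db_size': 'db.r5.4xlarge', 'cache_size': 'cache.r5.xlarge'}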

5. Add margins

from math import ceil

def add_margins(resources: dict) -> dict:
    """
    Adds margin for:
    - Spikes (20%)
    - Special events (50%)
    - Estimation error (20%)
    """
    margin = 1.2 * 1.5 * 1.2  # compounded: 2.16x

    return {
        'servers': ceil(resources['servers'] * margin),
        # upsize is a placeholder helper that moves one instance size up
        'db_size': upsize(resources['db_size']),
        'cache_size': upsize(resources['cache_size'])
    }

Projection Models

Linear

Constant absolute growth
Month 0: 5,000 req/s
Month 1: 5,500 req/s (+500)
Month 2: 6,000 req/s (+500)

When to use: Mature markets, stable growth

Exponential

Constant percentage growth
Month 0: 5,000 req/s
Month 1: 5,750 req/s (+15%)
Month 2: 6,612 req/s (+15%)

When to use: Growing startups, new products

Event-based

Baseline + predicted spikes
Normal: 5,000 req/s
Black Friday: 25,000 req/s (5x)
Campaign: 10,000 req/s (2x)

When to use: Seasonal businesses, intensive marketing
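The three models can be sketched side by side. This is a minimal illustration using the example figures above; the function names are assumptions chosen for this sketch.

```python
def project_linear(current: float, months: int, monthly_increase: float) -> float:
    """Constant absolute growth: the same number of req/s added every month."""
    return current + monthly_increase * months

def project_exponential(current: float, months: int, growth_rate: float) -> float:
    """Constant percentage growth: compounds month over month."""
    return current * (1 + growth_rate) ** months

def project_event_peaks(baseline: float, multipliers: dict[str, float]) -> dict[str, float]:
    """Baseline plus predicted spikes: one peak figure per known event."""
    return {name: baseline * multiplier for name, multiplier in multipliers.items()}

print(project_linear(5000, 2, 500))          # month 2, linear
print(project_exponential(5000, 2, 0.15))    # month 2, exponential
print(project_event_peaks(5000, {"Black Friday": 5, "Campaign": 2}))
```

In practice the models combine: project the baseline with the linear or exponential model first, then apply event multipliers to the projected baseline for the month in which the event falls.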

Provisioning Strategies

1. Just-in-time

Provision when utilization > 70%
Deprovision when utilization < 30%

Pros: Optimized cost
Cons: Risk of not scaling in time
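The just-in-time rule can be sketched as a small decision function. The name `scaling_decision` and the "hold" band between the two thresholds are assumptions for illustration; real autoscalers usually add cooldown periods on top of this.

```python
def scaling_decision(
    utilization: float,
    scale_up_at: float = 0.70,
    scale_down_at: float = 0.30,
) -> str:
    """Return the action for the current utilization (0.0 to 1.0)."""
    if utilization > scale_up_at:
        return "provision"
    if utilization < scale_down_at:
        return "deprovision"
    # Between the thresholds: do nothing, which avoids flapping
    return "hold"

for sample in (0.82, 0.50, 0.12):
    print(sample, scaling_decision(sample))
```

The wide gap between 30% and 70% is what keeps the system from oscillating: removing capacity raises utilization, and if the thresholds were close together that alone could trigger a scale-up.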

2. Fixed headroom

Always maintain 50% margin
If you need 10 servers, maintain 15

Pros: Safety
Cons: Higher cost

3. Stepped

Q1: Provision for H1
Q3: Provision for H2
Quarterly review

Pros: Predictability
Cons: May under or over provision

4. Hybrid (recommended)

Base: stepped provisioning
+ Autoscaling: for daily variations
+ Reserve: for known events

Capacity Planning by Component

Application Servers

Key metrics:
  - CPU utilization
  - Memory usage
  - Request throughput
  - Active connections

Sizing:
  rps_per_server: 500
  target_rps: 10000
  servers_needed: 20
  with_margin: 30
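The sizing arithmetic above can be sketched as a function. The 50% default margin matches the example figures; the function name is an assumption for this sketch.

```python
from math import ceil

def servers_needed(target_rps: float, rps_per_server: float, margin: float = 0.5) -> tuple[int, int]:
    """Return (servers without margin, servers with margin)."""
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    base = ceil(target_rps / rps_per_server)
    return base, ceil(base * (1 + margin))

base, with_margin = servers_needed(target_rps=10000, rps_per_server=500)
print(base, with_margin)
```

The per-server figure should come from a load test or from observed peak throughput at an acceptable latency, not from a vendor spec sheet.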

Database

Key metrics:
  - Queries per second
  - Connections
  - Storage growth
  - IOPS

Sizing:
  # Usually vertical scaling
  current_size: db.r5.2xlarge
  projected_load: 2x
  target_size: db.r5.4xlarge

  # Or read replicas
  write_primary: 1
  read_replicas: 3

Cache

Key metrics:
  - Hit rate
  - Memory usage
  - Evictions
  - Connections

Sizing:
  current_keys: 1M
  key_growth: 10%/month
  projected_6m: 1.77M
  memory_per_key: 1KB
  memory_needed: 1.77GB
  with_margin: 4GB
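As a rough sketch, the cache projection above is compound growth on the key count multiplied by the per-key footprint. The function name is an assumption, and 1KB is treated as 1,000 bytes to match the figures in the example.

```python
def cache_memory_gb(current_keys: float, monthly_growth: float, months: int, bytes_per_key: float) -> float:
    """Projected cache memory in GB (decimal), before any safety margin."""
    projected_keys = current_keys * (1 + monthly_growth) ** months
    return projected_keys * bytes_per_key / 1e9

needed = cache_memory_gb(current_keys=1_000_000, monthly_growth=0.10, months=6, bytes_per_key=1000)
print(f"{needed:.2f} GB before margin")
```

Real per-key memory is usually higher than the payload size because of key names and data-structure overhead, which is one reason the example rounds up generously to 4GB.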

Message Queues

Key metrics:
  - Messages per second
  - Queue depth
  - Consumer lag
  - Storage

Sizing:
  current_mps: 10000
  retention_hours: 24
  message_size: 1KB
  storage_needed: 10000 * 3600 * 24 * 1KB = 864GB
  with_margin: 1.5TB
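The storage figure above follows from rate times retention times message size. A minimal sketch, with an assumed function name and 1KB taken as 1,000 bytes:

```python
def queue_storage_gb(messages_per_second: float, retention_hours: float, message_bytes: float) -> float:
    """Storage needed to retain messages for the full retention window, in GB (decimal)."""
    retained_messages = messages_per_second * retention_hours * 3600
    return retained_messages * message_bytes / 1e9

storage = queue_storage_gb(messages_per_second=10000, retention_hours=24, message_bytes=1000)
print(f"{storage:.0f} GB before margin")
```

If the queue replicates data across brokers, multiply the result by the replication factor; the example above counts a single copy only.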

Capacity Plan Document

# Capacity Plan - Q2 2025

## Executive Summary
We project 15% monthly growth. We need to expand
infrastructure before the end of Q1 to avoid degradation.

## Current State
| Resource | Current | Utilization | Limit |
|----------|---------|-------------|-------|
| App servers | 4 | 45% | 80% |
| Database | r5.2xl | 60% | 80% |
| Cache | r5.lg | 70% | 90% |

## Demand Projection
| Metric | Current | +3 months | +6 months |
|--------|---------|-----------|-----------|
| RPS | 5000 | 7600 | 11500 |
| Users | 50K | 76K | 115K |
| Storage | 500GB | 650GB | 850GB |

## Required Resources
| Resource | +3 months | +6 months | Cost Delta |
|----------|-----------|-----------|------------|
| App servers | 6 (+2) | 10 (+6) | +$2,400/mo |
| Database | r5.2xl | r5.4xl | +$1,200/mo |
| Cache | r5.xl | r5.xl | +$400/mo |

## Timeline
- Week 1-2: Provision app servers
- Week 3: Migrate database
- Week 4: Validate and adjust

## Risks
- Black Friday (not modeled): may need 3x
- Viral new feature: unpredictable demand

## Recommendations
1. Approve immediate expansion
2. Configure autoscaling for spikes
3. Monthly metrics review

Common Mistakes

1. Using average instead of peak

❌ Average CPU: 40%, we have margin
✅ Peak CPU: 85%, we need to scale
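A small sketch shows how far the average can drift from the peak. The samples are made up for illustration, and `percentile` is a simple nearest-rank helper written for this example, not a library function.

```python
from math import ceil
from statistics import mean

def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical CPU readings (%): mostly quiet, with two busy intervals
cpu_samples = [35] * 18 + [85, 85]

print("average:", mean(cpu_samples))
print("p95:", percentile(cpu_samples, 95))
print("max:", max(cpu_samples))
```

Here the average sits at 40% while the busy intervals reach 85%, so sizing from the average would leave the system short exactly when it matters.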

2. Ignoring seasonality

❌ Last month = next month
✅ December has 3x more traffic

3. Single projection

❌ We'll grow 15%
✅ Scenarios: optimistic (25%), expected (15%), pessimistic (5%)
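The scenario approach can be sketched by running the same compound-growth projection under each rate. The rates are the ones listed above; the function name is an assumption.

```python
def project_scenarios(current: float, months: int, rates: dict[str, float]) -> dict[str, float]:
    """Compound-growth projection for each named scenario."""
    return {name: current * (1 + rate) ** months for name, rate in rates.items()}

rates = {"pessimistic": 0.05, "expected": 0.15, "optimistic": 0.25}
projection = project_scenarios(current=5000, months=6, rates=rates)

for name, rps in projection.items():
    print(f"{name}: {rps:,.0f} req/s")
```

A common way to use the spread is to provision for the expected case and make sure autoscaling limits or a fast procurement path cover the optimistic one.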

4. Forgetting dependencies

❌ App servers OK, done
✅ App servers OK, but DB will saturate in 2 months
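One way to spot the dependency that saturates first is to estimate the runway of every component. The sketch below assumes utilization grows in proportion to traffic, which is a simplification, and borrows the utilization and limit figures from the example plan's Current State table; the function name is hypothetical.

```python
from math import log

def months_until_limit(utilization: float, limit: float, monthly_growth: float) -> float:
    """Months until utilization reaches its limit under compound growth (monthly_growth > 0)."""
    if utilization >= limit:
        return 0.0
    return log(limit / utilization) / log(1 + monthly_growth)

# (current utilization, limit) per component
components = {
    "app_servers": (0.45, 0.80),
    "database": (0.60, 0.80),
    "cache": (0.70, 0.90),
}

runway = {
    name: months_until_limit(used, limit, monthly_growth=0.15)
    for name, (used, limit) in components.items()
}
first_bottleneck = min(runway, key=runway.get)

for name, months in sorted(runway.items(), key=lambda item: item[1]):
    print(f"{name}: {months:.1f} months")
```

With these numbers the app servers have about four months of runway, while the database and cache have roughly two, so a plan that only looks at the app tier would miss the earlier constraints.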

Conclusion

Effective capacity planning requires:

Solid data: accurate historical metrics
Multiple scenarios: optimistic, expected, pessimistic
Adequate margins: events and estimation errors
Continuous review: plans change with reality
Alignment: business, product, and engineering

Recommended process:

Quarterly: complete capacity plan review
Monthly: metrics vs projection verification
Weekly: trend monitoring
Continuous: threshold alerts
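The monthly check of metrics against the projection can be sketched as a drift calculation. The 10% tolerance and the function names are assumptions for illustration; pick a tolerance that matches how much margin the plan carries.

```python
def projection_drift(actual: float, projected: float) -> float:
    """Relative difference between observed and projected demand."""
    if projected <= 0:
        raise ValueError("projected must be positive")
    return (actual - projected) / projected

def needs_replan(actual: float, projected: float, tolerance: float = 0.10) -> bool:
    """True when reality has moved outside the tolerance band in either direction."""
    return abs(projection_drift(actual, projected)) > tolerance

# Month 1 was projected at 5,750 req/s
print(needs_replan(actual=6500, projected=5750))  # demand ahead of plan
print(needs_replan(actual=5900, projected=5750))  # within tolerance
```

Drift below the projection matters too: sustained undershoot is a signal to delay purchases or release reserved capacity.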

The best capacity plan is the one you don't need to use in an emergency.