Load testing validates whether the system handles its expected load. Stress testing discovers where it breaks. Different objectives, different techniques, different insights. This article explains when and how to stress your system in a controlled way.
Load testing asks "can it handle the expected load?". Stress testing asks "how much load can it handle?".
The Difference Between Load and Stress
Load Testing
Objective: Validate behavior under expected load
Load: Normal to predicted peak
Duration: Hours (steady state)
Result: Pass/Fail on SLOs
Example:
- Load: 1000 req/s (expected peak)
- Duration: 2 hours
- Criterion: p95 < 500ms, error rate < 1%
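The pass/fail criterion above maps directly onto k6 thresholds, which fail the run automatically when an SLO is violated. A minimal sketch of the options object (durations and targets mirror the example, not a complete script):

```javascript
// k6 options encoding the example: hold expected peak for 2 hours,
// fail the run if p95 >= 500ms or the error rate reaches 1%.
export const options = {
  stages: [
    { duration: '10m', target: 1000 }, // ramp to expected peak
    { duration: '2h', target: 1000 },  // steady state at peak
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 latency SLO (ms)
    http_req_failed: ['rate<0.01'],   // error-rate SLO (< 1%)
  },
};
```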
Stress Testing
Objective: Find limits and failure points
Load: Above expected, increasing
Duration: Until degradation or failure
Result: Maximum capacity, failure mode
Example:
- Load: 1000 → 2000 → 3000 → ... req/s
- Duration: Until first bottleneck
- Result: "System handles 2500 req/s,
fails at connection pool at 2800"
Why Do Stress Testing
1. Know the real capacity
Without stress test:
"We think it handles 1000 users"
With stress test:
"We validated it handles 3200 users,
bottleneck is DB connections,
overload behavior: graceful degradation"
2. Understand the failure mode
Questions stress test answers:
- What fails first?
- Does it fail gradually or catastrophically?
- Does it recover automatically?
- How long does recovery take?
3. Validate protection mechanisms
A stress test verifies whether these mechanisms actually work:
- Rate limiting
- Circuit breakers
- Autoscaling
- Graceful degradation
- Queue backpressure
4. Prepare for the unexpected
Events that exceed prediction:
- Unexpected viral campaign
- TV/media mention
- Bot flood (accidental or intentional)
- Outage recovery (thundering herd)
Types of Stress Test
1. Step-Up Stress
Profile:
┌─────────────────────────────────┐
│ Load │
│ ▲ ┌───┐ │
│ │ ┌─────┘ │ │
│ │ ┌─────┘ │ │
│ │ ┌─────┘ │ │
│ └─┴─────────────────────┴──▶ │
│ Time │
└─────────────────────────────────┘
Implementation (k6):
stages: [
{ duration: '10m', target: 1000 }, // Step 1
{ duration: '10m', target: 2000 }, // Step 2
{ duration: '10m', target: 3000 }, // Step 3
{ duration: '10m', target: 4000 }, // Step 4
]
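When you rerun step-up tests with different step sizes, the stages array can be generated instead of hand-written. A small sketch (the helper name is ours, not part of k6):

```javascript
// Generate a k6-style step-up stages array.
// stepStages(1000, 1000, 4, '10m') reproduces the profile above.
function stepStages(firstTarget, stepSize, steps, stepDuration) {
  const stages = [];
  for (let i = 0; i < steps; i++) {
    stages.push({ duration: stepDuration, target: firstTarget + i * stepSize });
  }
  return stages;
}

console.log(JSON.stringify(stepStages(1000, 1000, 4, '10m')));
```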
Use: Find degradation point gradually
2. Spike Test
Profile:
┌─────────────────────────────────┐
│ Load │
│ ▲ ┌───┐ │
│ │ │ │ │
│ │ │ │ │
│ │ ───────┘ └─────────── │
│ └────────────────────────▶ │
│ Time │
└─────────────────────────────────┘
Implementation (k6):
stages: [
{ duration: '5m', target: 500 }, // Normal
{ duration: '1m', target: 5000 }, // Spike
{ duration: '5m', target: 5000 }, // Sustain
{ duration: '1m', target: 500 }, // Drop
{ duration: '10m', target: 500 }, // Recovery
]
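One thing worth automating in the recovery phase is measuring recovery time: how long after the drop it takes for latency to return to baseline. A rough sketch over per-minute p95 samples (the sample values are invented):

```javascript
// Given per-minute p95 samples taken after the spike ends, count the
// minutes until latency is back within 10% of its pre-spike baseline.
function recoveryMinutes(samples, baselineMs, tolerance = 0.1) {
  const limit = baselineMs * (1 + tolerance);
  for (let i = 0; i < samples.length; i++) {
    if (samples[i] <= limit) return i; // minutes until recovered
  }
  return -1; // did not recover within the observation window
}

// Illustrative post-spike p95 samples (ms): degraded, then recovered.
console.log(recoveryMinutes([900, 600, 300, 140, 130], 130)); // → 3
```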
Use: Validate response to sudden spike and recovery
3. Sustained Overload
Profile:
┌─────────────────────────────────┐
│ Load │
│ ▲ │
│ │ ┌─────────────────────┐ │
│ │ │ │ │
│ │ │ Above capacity │ │
│ └─┴─────────────────────┴──▶ │
│ Time │
└─────────────────────────────────┘
Implementation (k6):
stages: [
{ duration: '5m', target: 3000 }, // Ramp
{ duration: '60m', target: 3000 }, // Sustain overload
]
Use: Observe prolonged degradation, memory leaks
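During the sustained phase, a memory leak shows up as a persistently positive slope in the memory series after warm-up. A least-squares sketch for post-processing the samples (the values are invented):

```javascript
// Least-squares slope of a memory series (MB per sample interval).
// A clearly positive slope after warm-up suggests a leak.
function memorySlope(samplesMb) {
  const n = samplesMb.length;
  const xMean = (n - 1) / 2;
  const yMean = samplesMb.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (samplesMb[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den;
}

console.log(memorySlope([512, 514, 511, 513]).toFixed(2)); // → 0.00
console.log(memorySlope([512, 600, 690, 781]).toFixed(2)); // → 89.70
```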
4. Breaking Point (Destructive)
Profile:
┌─────────────────────────────────┐
│ Load │
│ ▲ ╱ │
│ │ ╱ │
│ │ ╱ │
│ │ ╱ │
│ └──────────────╱───────────▶ │
│ Time │
└─────────────────────────────────┘
Implementation (k6):
stages: [
{ duration: '60m', target: 10000 }, // Continuous ramp
]
Use: Find absolute limit (until OOM, timeout, crash)
Metrics During Stress Test
What to observe
Performance:
- Latency (p50, p95, p99)
- Effective throughput
- Error rate
- Timeout rate
Resources:
- CPU (all pods/instances)
- Memory (and GC if applicable)
- Connections (DB, cache, external)
- I/O (disk, network)
Application:
- Queue depths
- Thread pool usage
- Connection pool usage
- Active requests
System:
- Pod/instance health
- Autoscaling events
- Circuit breaker states
Identifying the bottleneck
Symptoms by bottleneck type:
CPU-bound:
- CPU at 100%
- Latency rises linearly
- Throughput plateaus
Memory-bound:
- Memory growing
- Frequent/long GC
- Eventual OOM
Connection-bound:
- Pool at 100%
- Timeouts increasing
- Requests queued
I/O-bound:
- Low CPU
- High disk IOPS
- Network saturated
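These symptom patterns can be turned into a first-pass triage heuristic for post-run analysis. A deliberately simplistic sketch (field names and thresholds are our own, not from any tool, and real diagnosis needs more context):

```javascript
// First-pass bottleneck triage from a metrics snapshot.
// All thresholds are illustrative, not universal.
function guessBottleneck(m) {
  if (m.cpuPct >= 95) return 'cpu-bound';
  if (m.memGrowthMbPerMin > 10) return 'memory-bound';
  if (m.poolUtilPct >= 100 || (m.poolUtilPct >= 95 && m.timeoutRate > 0.01))
    return 'connection-bound';
  if (m.cpuPct < 50 && (m.diskIopsPct >= 90 || m.netUtilPct >= 90))
    return 'io-bound';
  return 'unknown';
}

console.log(guessBottleneck({
  cpuPct: 40, memGrowthMbPerMin: 0, poolUtilPct: 100,
  timeoutRate: 0.05, diskIopsPct: 20, netUtilPct: 10,
})); // → connection-bound
```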
Executing Stress Test
Preparation
Checklist:
- [ ] Isolated environment (doesn't affect production)
- [ ] Complete monitoring active
- [ ] Production alerts silenced
- [ ] Team aware (SRE, infra)
- [ ] Rollback plan (if in shared staging)
- [ ] Stop criteria defined
During the test
Monitor in real-time:
- Metrics dashboard
- Error logs
- Pod status
- Autoscaling
Document:
- Event timestamps
- First sign of degradation
- Behavior under stress
- Observed errors
Stop criteria
Stop when:
- Error rate > 50%
- Latency > 30s
- OOM detected
- Critical component down
- Data corrupted
Don't stop just for:
- Gradual degradation
- Moderate error rate
- High latency but responsive
Interpreting Results
Saturation curve
┌─────────────────────────────────┐
│ │
Latency │ ╱╱╱╱╱ │
or │ ╱╱ │
Error % │ ╱ │
│ ╱ │
│ ────╱ "Knee" (inflection │
│ point) │
└─────────────────────────────────┘
Throughput →
Flat region (before the knee): stable performance
Knee: saturation begins
Steep region (after the knee): rapid degradation
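The knee can also be found programmatically: one simple approach flags the first step where latency grows disproportionally faster than load. A sketch (the 1.3 factor is an arbitrary sensitivity choice, and the data points come from the example report in this article):

```javascript
// Return the first load level where the step-to-step latency ratio
// exceeds the load ratio by more than `factor` — a rough knee detector.
function findKnee(points, factor = 1.3) {
  for (let i = 1; i < points.length; i++) {
    const loadRatio = points[i].load / points[i - 1].load;
    const latRatio = points[i].p95 / points[i - 1].p95;
    if (latRatio > loadRatio * factor) return points[i].load;
  }
  return null; // no knee within the measured range
}

// p95 in ms, from the example step-up run.
const run = [
  { load: 1000, p95: 120 },
  { load: 1500, p95: 150 },
  { load: 2000, p95: 200 },
  { load: 2500, p95: 350 },
  { load: 3000, p95: 800 },
];
console.log(findKnee(run)); // → 2500
```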
Documenting results
## Stress Test Report - 2024-01-20
### Configuration
- Environment: Staging (3x prod)
- Baseline: 1000 req/s
- Test: Step-up until failure
### Results
| Load | p95 | Error % | Observation |
|------|-----|---------|-------------|
| 1000 | 120ms | 0.1% | Normal |
| 1500 | 150ms | 0.2% | OK |
| 2000 | 200ms | 0.5% | OK |
| 2500 | 350ms | 1.2% | Degradation start |
| 3000 | 800ms | 5% | Visible degradation |
| 3500 | 2s | 15% | Severe |
| 4000 | Timeout | 40% | Failure |
### Analysis
- Maximum sustainable capacity: 2000 req/s
- Knee point: 2500 req/s
- Primary bottleneck: DB connection pool
- Failure mode: Graceful degradation up to 3500 req/s, then timeout cascade
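The "maximum sustainable capacity" figure can be derived mechanically from the results table: the highest load step that still met the SLOs (here p95 < 500ms and error rate < 1%, as in the load-test criterion earlier). A sketch:

```javascript
// Highest load step whose metrics still met both SLOs.
// Assumes results are ordered by ascending load.
function maxSustainable(results, p95LimitMs, errLimitPct) {
  let best = null;
  for (const r of results) {
    if (r.p95 < p95LimitMs && r.errPct < errLimitPct) best = r.load;
  }
  return best;
}

// Rows from the example report (the timeout row is omitted).
const results = [
  { load: 1000, p95: 120, errPct: 0.1 },
  { load: 1500, p95: 150, errPct: 0.2 },
  { load: 2000, p95: 200, errPct: 0.5 },
  { load: 2500, p95: 350, errPct: 1.2 },
  { load: 3000, p95: 800, errPct: 5 },
];
console.log(maxSustainable(results, 500, 1)); // → 2000
```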
### Recommendations
1. Increase connection pool from 50 to 100
2. Implement circuit breaker for DB
3. Consider read replica for read queries
4. Retest after adjustments
Stress Testing in CI/CD
When to include
Not on every PR:
- Too slow
- Expensive resources
Include on:
- Production releases
- Infra changes
- New critical endpoints
- Connection handling changes
Nightly:
- Complete stress test
- Report for morning review
Automation
# Pipeline example (GitLab CI; `only:` and `rules:` cannot be
# combined on one job, so branch and schedule triggers both go
# under `rules:`)
stress_test:
  stage: test
  script:
    - k6 run stress-test.js
    - python analyze_results.py
  artifacts:
    paths:
      - stress-report.html
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH =~ /^release-.*/
Conclusion
Stress testing is essential to:
- Know limits - don't guess, know
- Understand failures - how, where, when
- Validate protections - do circuit breakers work?
- Prepare for incidents - know what to expect
The difference between a resilient system and a fragile one is knowing where the limits are before finding them in production.
Every system breaks under enough pressure. The question is: do you know where and how?
This article is part of the series on the OCTOPUS Performance Engineering methodology.