"We improved performance by 30%". Compared to what? If you can't answer that question with data, you don't know if you really improved. The baseline is the reference point that makes any measurement meaningful.
Performance without a baseline is opinion. Performance with a baseline is engineering.
What Is a Baseline
Definition
Baseline = Known and documented state of the system
under normal operating conditions
What a baseline is NOT
❌ "The system is fast"
❌ "Works fine on my computer"
❌ "We never had complaints"
❌ "Last test showed 200ms"
✅ "In production, p95 is 180ms at 500 req/s
between 10am-12pm, measured over 7 days"
Components of a baseline
Metrics:
- Latency (p50, p95, p99)
- Throughput (req/s, tx/s)
- Error rate
- Resource utilization
Context:
- Measurement period
- Load conditions
- System version
- Environment configuration
Variability:
- Standard deviation
- Observed min/max
- Temporal patterns
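To make these three components concrete, here is a minimal sketch in Python; the `Baseline` class and its field names are hypothetical (not part of the methodology), and the variability values in the example are invented for illustration:
```python
# Minimal sketch of a baseline record (hypothetical class and field names):
# it bundles metrics, context and variability so a number is never stored
# without the conditions under which it was measured.
from dataclasses import dataclass

@dataclass
class Baseline:
    # Metrics
    latency_ms: dict        # e.g. {"p50": 450, "p95": 980, "p99": 2100}
    throughput_rps: float   # average requests per second
    error_rate: float       # fraction of failed requests, e.g. 0.003
    # Context
    period: str             # measurement window, e.g. "2024-01-01 to 2024-01-31"
    version: str            # system version, e.g. "v2.3.4"
    environment: str        # e.g. "production, 3x c5.xlarge"
    # Variability
    latency_stddev_ms: float
    latency_min_max_ms: tuple
    notes: str = ""         # temporal patterns, known anomalies

# Illustrative values only (the variability numbers are invented for the example).
checkout = Baseline(
    latency_ms={"p50": 450, "p95": 980, "p99": 2100},
    throughput_rps=850.0,
    error_rate=0.003,
    period="2024-01-01 to 2024-01-31",
    version="v2.3.4",
    environment="production, 3x c5.xlarge",
    latency_stddev_ms=120.0,
    latency_min_max_ms=(95.0, 4200.0),
    notes="peaks 10am-12pm and 7pm-10pm; valley 2am-6am",
)
```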
Why a Baseline Is Essential
1. Measure real progress
Without a baseline:
"We optimized the system" → By how much? We don't know.
With a baseline:
Before: p95 = 450ms @ 1000 req/s
After: p95 = 180ms @ 1000 req/s
Improvement: 60% latency reduction
2. Detect regressions
Deploying feature X:
Baseline: p95 = 100ms
After deploy: p95 = 250ms
Alert: 150% regression detected!
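A minimal sketch of what such a regression gate could look like, assuming the baseline p95 is already documented; the function name and the 20% tolerance are illustrative choices, not a prescribed tool:
```python
# Sketch of a post-deploy regression gate: compare the freshly measured p95
# against the documented baseline and flag anything beyond a tolerance.
# The 20% tolerance is an arbitrary illustration, not a recommended value.
def check_regression(baseline_p95_ms: float, current_p95_ms: float,
                     tolerance: float = 0.20) -> bool:
    change = (current_p95_ms - baseline_p95_ms) / baseline_p95_ms
    if change > tolerance:
        print(f"ALERT: {change:.0%} regression detected "
              f"({baseline_p95_ms:g}ms -> {current_p95_ms:g}ms)")
        return False
    print(f"OK: p95 within {tolerance:.0%} of baseline ({change:+.0%})")
    return True

check_regression(baseline_p95_ms=100, current_p95_ms=250)
# ALERT: 150% regression detected (100ms -> 250ms)
```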
3. Validate changes
Hypothesis: "Adding cache will improve performance"
Baseline (no cache):
p95 = 200ms
DB queries/s = 5000
After cache:
p95 = 50ms
DB queries/s = 500
Validation: Cache reduced latency by 75%
and queries by 90%
4. Communicate with stakeholders
❌ "The system is faster"
✅ "We reduced checkout time from 3.2s to 0.8s,
impacting 50K transactions/day"
How to Establish a Baseline
Step 1: Define relevant metrics
For an e-commerce platform:
Critical:
- Checkout latency (p50, p95, p99)
- Payment error rate
- Orders/hour throughput
Secondary:
- Product search time
- Login latency
- Cache hit rate
Resources:
- Order service CPU
- DB connections
- Redis memory
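One way to keep this inventory explicit is to declare it as data, as in the hypothetical Python sketch below; the metric names are illustrative and not tied to any particular exporter:
```python
# Hypothetical declaration of the metrics that make up the baseline for an
# e-commerce system. Names are illustrative; the point is that the inventory
# lives in version control instead of in someone's head.
BASELINE_METRICS = {
    "critical": [
        {"name": "checkout_latency", "percentiles": ["p50", "p95", "p99"]},
        {"name": "payment_error_rate", "unit": "%"},
        {"name": "orders_per_hour", "unit": "orders/h"},
    ],
    "secondary": [
        {"name": "product_search_latency", "percentiles": ["p50", "p95", "p99"]},
        {"name": "login_latency", "percentiles": ["p50", "p95", "p99"]},
        {"name": "cache_hit_rate", "unit": "%"},
    ],
    "resources": [
        {"name": "order_service_cpu", "unit": "%"},
        {"name": "db_connections", "unit": "connections"},
        {"name": "redis_memory", "unit": "GB"},
    ],
}
```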
Step 2: Collect data over a sufficient period
Minimum period: 1 week
- Captures daily variation
- Includes peaks and valleys
- Representative of real usage
Ideal period: 1 month
- Weekly variation
- Special events
- Greater statistical confidence
Step 3: Document conditions
## Baseline - Order System
**Period:** 2024-01-01 to 2024-01-31
**Version:** v2.3.4
**Environment:** Production (3x c5.xlarge)
### Metrics
| Metric | p50 | p95 | p99 |
|--------|-----|-----|-----|
| Checkout | 450ms | 980ms | 2.1s |
| Search | 80ms | 150ms | 300ms |
| Login | 120ms | 250ms | 500ms |
### Throughput
- Average: 850 req/s
- Peak: 2100 req/s (Black Friday)
- Valley: 150 req/s (overnight)
### Errors
- Average rate: 0.3%
- By type: 60% timeout, 30% 5xx, 10% 4xx
### Resources
- Average CPU: 45%
- Memory: 6.2GB / 8GB
- DB connections: 80 / 100
Step 4: Identify patterns
Daily pattern:
┌─────────────────────────────────────┐
│ req/s │
│ 2K ┤ ╭──╮ ╭───╮ │
│ 1.5K┤ ╭╯ ╰╮ ╭╯ ╰╮ │
│ 1K ┤ ╭───╯ ╰───╯ ╰── │
│ 500 ┤──╯ │
│ 0 ├──┬──┬──┬──┬──┬──┬──┬──┬──┬── │
│ 0 3 6 9 12 15 18 21 24 │
└─────────────────────────────────────┘
Peaks: 10am-12pm and 7pm-10pm
Valley: 2am-6am
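If hourly request rates are already available from monitoring, peaks and valleys can be derived instead of eyeballed. A small sketch, assuming a dict of hour-of-day to average req/s (the numbers below are illustrative):
```python
# Sketch: derive peak and valley hours from hourly average request rates.
# `hourly_rps` is assumed to be hour-of-day -> average req/s, aggregated
# over the baseline period from your monitoring system.
def traffic_pattern(hourly_rps: dict, band: float = 0.8):
    top = max(hourly_rps.values())
    low = min(hourly_rps.values())
    peaks = sorted(h for h, v in hourly_rps.items() if v >= band * top)
    valleys = sorted(h for h, v in hourly_rps.items() if v <= low / band)
    return peaks, valleys

hourly_rps = {h: 300 for h in range(24)}   # illustrative numbers only
hourly_rps.update({10: 1900, 11: 2000, 19: 1800, 20: 1900, 21: 1700})
hourly_rps.update({2: 160, 3: 150, 4: 150, 5: 170})

peaks, valleys = traffic_pattern(hourly_rps)
print("Peak hours:", peaks)      # [10, 11, 19, 20, 21]
print("Valley hours:", valleys)  # [2, 3, 4, 5]
```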
Types of Baseline
1. Production baseline
Source: Real production metrics
Advantage: Reflects real usage
Disadvantage: Variable, hard to isolate
When to use:
- Continuous monitoring
- Regression detection
- Capacity planning
2. Test baseline
Source: Controlled environment with simulated load
Advantage: Reproducible, isolated
Disadvantage: May not reflect production
When to use:
- Compare versions
- Validate optimizations
- CI/CD pipeline
3. Theoretical baseline
Source: SLOs, requirements, industry benchmarks
Advantage: Clear target
Disadvantage: May be unrealistic
When to use:
- Define objectives
- Gap analysis
- Stakeholder negotiation
Keeping the Baseline Updated
When to update
Mandatory:
- Significant architecture change
- Major system version
- Infrastructure change
- User growth > 50%
Optional:
- Quarterly
- After major optimizations
- Usage pattern change
Versioning baselines
## Baseline History
### v3 - 2024-Q1 (current)
- Checkout p95: 450ms
- Throughput: 850 req/s
- Context: New microservices architecture
### v2 - 2023-Q3
- Checkout p95: 1.2s
- Throughput: 400 req/s
- Context: Optimized monolith
### v1 - 2023-Q1
- Checkout p95: 2.5s
- Throughput: 200 req/s
- Context: Original monolith
Common Mistakes
1. Single-moment baseline
❌ "I ran a test just now and got 150ms"
Problem: Doesn't capture variability
2. Ignoring context
❌ "p95 is 200ms"
Missing: At what load? At what time?
What version? What environment?
3. Using averages instead of percentiles
❌ "Average latency: 100ms"
Problem: Hides outliers
Reality: p99 might be 5s
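A quick synthetic illustration with Python's statistics module shows how badly the average can hide the tail:
```python
# Synthetic illustration: 99 fast requests and one very slow one.
# The average looks healthy while the tail is terrible.
import statistics

latencies_ms = [100] * 99 + [5000]   # one 5s outlier among 100 requests

mean = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # 99th percentile

print(f"mean = {mean:.0f}ms")   # 149ms, looks fine
print(f"p99  = {p99:.0f}ms")    # ~4951ms, the real user pain
```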
4. Never updating
❌ "Our baseline is from 2 years ago"
Problem: System changed completely
Comparison makes no sense
Baseline in Practice
Documentation template
# Performance Baseline
## Identification
- **System:** [Name]
- **Version:** [x.y.z]
- **Environment:** [Production/Staging]
- **Period:** [Start date] to [End date]
- **Owner:** [Name]
## Latency Metrics
| Endpoint | p50 | p95 | p99 | Max |
|----------|-----|-----|-----|-----|
| /api/x | Xms | Xms | Xms | Xms |
## Throughput Metrics
- Average: X req/s
- Peak: X req/s (when)
- Valley: X req/s (when)
## Error Metrics
- Overall rate: X%
- By type: [breakdown]
## Resources
| Resource | Average | Peak | Limit |
|----------|---------|------|-------|
| CPU | X% | X% | X% |
| Memory | XGB | XGB | XGB |
## Observed Patterns
[Describe daily, weekly variations, etc.]
## Known Anomalies
[List events that affected metrics]
## Next Review
[Scheduled date]
Automating collection
```yaml
# Example job to collect baseline
baseline_collection:
  schedule: "0 0 * * 0"  # Weekly
  queries:
    latency_p95:
      promql: histogram_quantile(0.95, rate(http_duration_bucket[7d]))
    throughput_avg:
      promql: avg_over_time(rate(http_requests_total[5m])[7d:1h])
    error_rate:
      promql: >
        sum(rate(http_requests_total{status=~"5.."}[7d]))
        / sum(rate(http_requests_total[7d]))
  output:
    - grafana_annotation
    - confluence_page
    - slack_notification
```
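As a hedged sketch of what such a job could run under the hood, the script below issues instant queries against the Prometheus HTTP API (/api/v1/query); the server URL is a placeholder and the PromQL simply mirrors the queries above:
```python
# Sketch of a weekly collection script against the Prometheus HTTP API
# (instant queries via /api/v1/query). The URL is a placeholder and the
# metric names mirror the PromQL in the config above.
import json
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"   # placeholder

QUERIES = {
    "latency_p95": 'histogram_quantile(0.95, rate(http_duration_bucket[7d]))',
    "throughput_avg": 'avg_over_time(rate(http_requests_total[5m])[7d:1h])',
    "error_rate": ('sum(rate(http_requests_total{status=~"5.."}[7d])) '
                   '/ sum(rate(http_requests_total[7d]))'),
}

def collect_baseline() -> dict:
    results = {}
    for name, promql in QUERIES.items():
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                            params={"query": promql}, timeout=30)
        resp.raise_for_status()
        data = resp.json()["data"]["result"]
        # Instant queries return a vector; take the first sample's value.
        results[name] = float(data[0]["value"][1]) if data else None
    return results

if __name__ == "__main__":
    print(json.dumps(collect_baseline(), indent=2))
```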
Conclusion
The baseline is the foundation of any serious performance work:
- Without a baseline, there's no measurable improvement
- Document the context, not just the numbers
- Use percentiles, not averages
- Update regularly as the system evolves
- Automate collection for consistency
The difference between "we think it improved" and "we proved it improved" is a well-documented baseline.
This article is part of the series on the OCTOPUS Performance Engineering methodology.