"We improved performance by 30%". Compared to what? If you can't answer that question with data, you don't know if you really improved. The baseline is the reference point that makes any measurement meaningful.
Performance without a baseline is opinion. Performance with a baseline is engineering.
What Is a Baseline
Definition
Baseline = Known and documented state of the system
under normal operating conditions
What a baseline is NOT
❌ "The system is fast"
❌ "Works fine on my computer"
❌ "We never had complaints"
❌ "Last test showed 200ms"
✅ "In production, p95 is 180ms at 500 req/s
between 10am-12pm, measured over 7 days"
Components of a baseline
Metrics:
- Latency (p50, p95, p99)
- Throughput (req/s, tx/s)
- Error rate
- Resource utilization
Context:
- Measurement period
- Load conditions
- System version
- Environment configuration
Variability:
- Standard deviation
- Observed min/max
- Temporal patterns
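To make these three components concrete, here is a minimal sketch in Python; the `Baseline` class and its field names are hypothetical (not part of the methodology), and the variability values in the example are invented for illustration:
```python
# Minimal sketch of a baseline record (hypothetical class and field names):
# it bundles metrics, context and variability so a number is never stored
# without the conditions under which it was measured.
from dataclasses import dataclass

@dataclass
class Baseline:
    # Metrics
    latency_ms: dict        # e.g. {"p50": 450, "p95": 980, "p99": 2100}
    throughput_rps: float   # average requests per second
    error_rate: float       # fraction of failed requests, e.g. 0.003
    # Context
    period: str             # measurement window, e.g. "2024-01-01 to 2024-01-31"
    version: str            # system version, e.g. "v2.3.4"
    environment: str        # e.g. "production, 3x c5.xlarge"
    # Variability
    latency_stddev_ms: float
    latency_min_max_ms: tuple
    notes: str = ""         # temporal patterns, known anomalies

# Illustrative values only (the variability numbers are invented for the example).
checkout = Baseline(
    latency_ms={"p50": 450, "p95": 980, "p99": 2100},
    throughput_rps=850.0,
    error_rate=0.003,
    period="2024-01-01 to 2024-01-31",
    version="v2.3.4",
    environment="production, 3x c5.xlarge",
    latency_stddev_ms=120.0,
    latency_min_max_ms=(95.0, 4200.0),
    notes="peaks 10am-12pm and 7pm-10pm; valley 2am-6am",
)
```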
Why a Baseline Is Essential
1. Measure real progress
Without a baseline:
"We optimized the system" → By how much? We don't know.
With a baseline:
Before: p95 = 450ms @ 1000 req/s
After: p95 = 180ms @ 1000 req/s
Improvement: 60% latency reduction
2. Detect regressions
Deploying feature X:
Baseline: p95 = 100ms
After deploy: p95 = 250ms
Alert: 150% regression detected!
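A minimal sketch of what such a regression gate could look like, assuming the baseline p95 is already documented; the function name and the 20% tolerance are illustrative choices, not a prescribed tool:
```python
# Sketch of a post-deploy regression gate: compare the freshly measured p95
# against the documented baseline and flag anything beyond a tolerance.
# The 20% tolerance is an arbitrary illustration, not a recommended value.
def check_regression(baseline_p95_ms: float, current_p95_ms: float,
                     tolerance: float = 0.20) -> bool:
    change = (current_p95_ms - baseline_p95_ms) / baseline_p95_ms
    if change > tolerance:
        print(f"ALERT: {change:.0%} regression detected "
              f"({baseline_p95_ms:g}ms -> {current_p95_ms:g}ms)")
        return False
    print(f"OK: p95 within {tolerance:.0%} of baseline ({change:+.0%})")
    return True

check_regression(baseline_p95_ms=100, current_p95_ms=250)
# ALERT: 150% regression detected (100ms -> 250ms)
```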
3. Validate changes
Hypothesis: "Adding cache will improve performance"
Baseline (no cache):
p95 = 200ms
DB queries/s = 5000
After cache:
p95 = 50ms
DB queries/s = 500
Validation: Cache reduced latency by 75%
and queries by 90%
4. Communicate with stakeholders
❌ "The system is faster"
✅ "We reduced checkout time from 3.2s to 0.8s,
impacting 50K transactions/day"
How to Establish a Baseline
Step 1: Define relevant metrics
For an e-commerce platform:
Critical:
- Checkout latency (p50, p95, p99)
- Payment error rate
- Orders/hour throughput
Secondary:
- Product search time
- Login latency
- Cache hit rate
Resources:
- Order service CPU
- DB connections
- Redis memory
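One way to keep this inventory explicit is to declare it as data, as in the hypothetical Python sketch below; the metric names are illustrative and not tied to any particular exporter:
```python
# Hypothetical declaration of the metrics that make up the baseline for an
# e-commerce system. Names are illustrative; the point is that the inventory
# lives in version control instead of in someone's head.
BASELINE_METRICS = {
    "critical": [
        {"name": "checkout_latency", "percentiles": ["p50", "p95", "p99"]},
        {"name": "payment_error_rate", "unit": "%"},
        {"name": "orders_per_hour", "unit": "orders/h"},
    ],
    "secondary": [
        {"name": "product_search_latency", "percentiles": ["p50", "p95", "p99"]},
        {"name": "login_latency", "percentiles": ["p50", "p95", "p99"]},
        {"name": "cache_hit_rate", "unit": "%"},
    ],
    "resources": [
        {"name": "order_service_cpu", "unit": "%"},
        {"name": "db_connections", "unit": "connections"},
        {"name": "redis_memory", "unit": "GB"},
    ],
}
```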
Step 2: Collect data over a sufficient period
Minimum period: 1 week
- Captures daily variation
- Includes peaks and valleys
- Representative of real usage
Ideal period: 1 month
- Weekly variation
- Special events
- Greater statistical confidence
Step 3: Document conditions
## Baseline - Order System
**Period:** 2024-01-01 to 2024-01-31
**Version:** v2.3.4
**Environment:** Production (3x c5.xlarge)
### Metrics
| Metric | p50 | p95 | p99 |
|--------|-----|-----|-----|
| Checkout | 450ms | 980ms | 2.1s |
| Search | 80ms | 150ms | 300ms |
| Login | 120ms | 250ms | 500ms |
### Throughput
- Average: 850 req/s
- Peak: 2100 req/s (Black Friday)
- Valley: 150 req/s (overnight)
### Errors
- Average rate: 0.3%
- By type: 60% timeout, 30% 5xx, 10% 4xx
### Resources
- Average CPU: 45%
- Memory: 6.2GB / 8GB
- DB connections: 80 / 100
Step 4: Identify patterns
Daily pattern:
┌─────────────────────────────────────┐
│ req/s │
│ 2K ┤ ╭──╮ ╭───╮ │
│ 1.5K┤ ╭╯ ╰╮ ╭╯ ╰╮ │
│ 1K ┤ ╭───╯ ╰───╯ ╰── │
│ 500 ┤──╯ │
│ 0 ├──┬──┬──┬──┬──┬──┬──┬──┬──┬── │
│ 0 3 6 9 12 15 18 21 24 │
└─────────────────────────────────────┘
Peaks: 10am-12pm and 7pm-10pm
Valley: 2am-6am
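If hourly request rates are already available from monitoring, peaks and valleys can be derived instead of eyeballed. A small sketch, assuming a dict of hour-of-day to average req/s (the numbers below are illustrative):
```python
# Sketch: derive peak and valley hours from hourly average request rates.
# `hourly_rps` is assumed to be hour-of-day -> average req/s, aggregated
# over the baseline period from your monitoring system.
def traffic_pattern(hourly_rps: dict, band: float = 0.8):
    top = max(hourly_rps.values())
    low = min(hourly_rps.values())
    peaks = sorted(h for h, v in hourly_rps.items() if v >= band * top)
    valleys = sorted(h for h, v in hourly_rps.items() if v <= low / band)
    return peaks, valleys

hourly_rps = {h: 300 for h in range(24)}   # illustrative numbers only
hourly_rps.update({10: 1900, 11: 2000, 19: 1800, 20: 1900, 21: 1700})
hourly_rps.update({2: 160, 3: 150, 4: 150, 5: 170})

peaks, valleys = traffic_pattern(hourly_rps)
print("Peak hours:", peaks)      # [10, 11, 19, 20, 21]
print("Valley hours:", valleys)  # [2, 3, 4, 5]
```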
Types of Baseline
1. Production baseline
Source: Real production metrics
Advantage: Reflects real usage
Disadvantage: Variable, hard to isolate
When to use:
- Continuous monitoring
- Regression detection
- Capacity planning
2. Test baseline
Source: Controlled environment with simulated load
Advantage: Reproducible, isolated
Disadvantage: May not reflect production
When to use:
- Compare versions
- Validate optimizations
- CI/CD pipeline
3. Theoretical baseline
Source: SLOs, requirements, industry benchmarks
Advantage: Clear target
Disadvantage: May be unrealistic
When to use:
- Define objectives
- Gap analysis
- Stakeholder negotiation
Keeping the Baseline Updated
When to update
Mandatory:
- Significant architecture change
- Major system version
- Infrastructure change
- User growth > 50%
Optional:
- Quarterly
- After major optimizations
- Usage pattern change
Versioning baselines
## Baseline History
### v3 - 2024-Q1 (current)
- Checkout p95: 450ms
- Throughput: 850 req/s
- Context: New microservices architecture
### v2 - 2023-Q3
- Checkout p95: 1.2s
- Throughput: 400 req/s
- Context: Optimized monolith
### v1 - 2023-Q1
- Checkout p95: 2.5s
- Throughput: 200 req/s
- Context: Original monolith
Common Mistakes
1. Single-moment baseline
❌ "I ran a test just now and got 150ms"
Problem: Doesn't capture variability
2. Ignoring context
❌ "p95 is 200ms"
Missing: At what load? At what time?
What version? What environment?
3. Using averages instead of percentiles
❌ "Average latency: 100ms"
Problem: Hides outliers
Reality: p99 might be 5s
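A quick synthetic illustration with Python's statistics module shows how badly the average can hide the tail:
```python
# Synthetic illustration: 99 fast requests and one very slow one.
# The average looks healthy while the tail is terrible.
import statistics

latencies_ms = [100] * 99 + [5000]   # one 5s outlier among 100 requests

mean = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # 99th percentile

print(f"mean = {mean:.0f}ms")   # 149ms, looks fine
print(f"p99  = {p99:.0f}ms")    # ~4951ms, the real user pain
```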
4. Never updating
❌ "Our baseline is from 2 years ago"
Problem: System changed completely
Comparison makes no sense
Baseline in Practice
Documentation template
# Performance Baseline
## Identification
- **System:** [Name]
- **Version:** [x.y.z]
- **Environment:** [Production/Staging]
- **Period:** [Start date] to [End date]
- **Owner:** [Name]
## Latency Metrics
| Endpoint | p50 | p95 | p99 | Max |
|----------|-----|-----|-----|-----|
| /api/x | Xms | Xms | Xms | Xms |
## Throughput Metrics
- Average: X req/s
- Peak: X req/s (when)
- Valley: X req/s (when)
## Error Metrics
- Overall rate: X%
- By type: [breakdown]
## Resources
| Resource | Average | Peak | Limit |
|----------|---------|------|-------|
| CPU | X% | X% | X% |
| Memory | XGB | XGB | XGB |
## Observed Patterns
[Describe daily, weekly variations, etc.]
## Known Anomalies
[List events that affected metrics]
## Next Review
[Scheduled date]
Automating collection
```yaml
# Example job to collect baseline
baseline_collection:
  schedule: "0 0 * * 0"  # Weekly
  queries:
    latency_p95:
      promql: histogram_quantile(0.95, rate(http_duration_bucket[7d]))
    throughput_avg:
      promql: avg_over_time(rate(http_requests_total[5m])[7d:1h])
    error_rate:
      promql: >
        sum(rate(http_requests_total{status=~"5.."}[7d]))
        / sum(rate(http_requests_total[7d]))
  output:
    - grafana_annotation
    - confluence_page
    - slack_notification
```
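As a hedged sketch of what such a job could run under the hood, the script below issues instant queries against the Prometheus HTTP API (/api/v1/query); the server URL is a placeholder and the PromQL simply mirrors the queries above:
```python
# Sketch of a weekly collection script against the Prometheus HTTP API
# (instant queries via /api/v1/query). The URL is a placeholder and the
# metric names mirror the PromQL in the config above.
import json
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"   # placeholder

QUERIES = {
    "latency_p95": 'histogram_quantile(0.95, rate(http_duration_bucket[7d]))',
    "throughput_avg": 'avg_over_time(rate(http_requests_total[5m])[7d:1h])',
    "error_rate": ('sum(rate(http_requests_total{status=~"5.."}[7d])) '
                   '/ sum(rate(http_requests_total[7d]))'),
}

def collect_baseline() -> dict:
    results = {}
    for name, promql in QUERIES.items():
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                            params={"query": promql}, timeout=30)
        resp.raise_for_status()
        data = resp.json()["data"]["result"]
        # Instant queries return a vector; take the first sample's value.
        results[name] = float(data[0]["value"][1]) if data else None
    return results

if __name__ == "__main__":
    print(json.dumps(collect_baseline(), indent=2))
```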
Conclusion
The baseline is the foundation of any serious performance work:
- Without a baseline, there's no measurable improvement
- Document the context, not just the numbers
- Use percentiles, not averages
- Update regularly as the system evolves
- Automate collection for consistency
The difference between "we think it improved" and "we proved it improved" is a well-documented baseline.
This article is part of the series on the OCTOPUS Performance Engineering methodology.