
Interpreting Test Results: Beyond the Numbers

Raw performance test numbers can be misleading. Learn to interpret results critically and extract actionable insights.

"The test showed 150ms latency." Ok, but what does that mean? Is it good? Bad? Compared to what? Performance test numbers without context and interpretation are just digits. This article teaches how to extract real meaning from results.

Data isn't insight. Interpretation transforms data into knowledge.

The Problem with Raw Numbers

Numbers without context

Test result:
  - Latency p95: 250ms
  - Throughput: 1500 req/s
  - Error rate: 0.5%

Questions numbers don't answer:
  - Does this meet the requirements?
  - Compared to the baseline, did it improve or regress?
  - Which endpoints contributed most to the latency?
  - Was it stable, or did it vary during the test?
  - Was the test environment representative?

The risk of superficial interpretation

Scenario:
  "Average latency: 100ms. Test passed!"

Reality:
  - p50: 50ms
  - p95: 200ms
  - p99: 2000ms
  - Max: 30s

Real conclusion:
  The average hides that 1% of users have a
  terrible experience (>2s)

Interpretation Framework

1. Context first

Before looking at numbers, ask:
  - What was the test objective?
  - What load was applied vs. what was expected?
  - What environment was used?
  - How long did it run?
  - Was the test data realistic?

2. Comparison with baseline

Isolated result:
  p95 = 200ms

With baseline:
  Baseline: p95 = 180ms
  Test: p95 = 200ms
  Δ: +11%

Interpretation:
  11% regression - investigate or accept?
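
This kind of comparison is easy to script so every run is judged against a stored baseline; a minimal sketch in Python (the 5% tolerance and the function name are illustrative, not part of any specific tool):

# Sketch: judge a run against a stored baseline (values and tolerance are illustrative).
def compare_to_baseline(baseline_p95_ms: float, test_p95_ms: float,
                        tolerance_pct: float = 5.0) -> str:
    delta_pct = (test_p95_ms - baseline_p95_ms) / baseline_p95_ms * 100
    verdict = "within tolerance" if delta_pct <= tolerance_pct else "regression - investigate"
    return f"baseline={baseline_p95_ms}ms  test={test_p95_ms}ms  delta={delta_pct:+.1f}%  -> {verdict}"

print(compare_to_baseline(180, 200))  # delta=+11.1% -> regression - investigate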

3. Distribution, not average

Always use:
  - p50 (median) - typical experience
  - p95 - most users
  - p99 - common worst case
  - Max - outliers

Never trust only:
  - Average (hides variance)
  - Min (irrelevant)
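
For the same reason, compute the full set of percentiles straight from the raw samples rather than relying on a pre-aggregated average; a minimal sketch with NumPy, using made-up data that mimics a slow tail:

import numpy as np

# Synthetic latencies: mostly fast, plus a slow ~2% tail (illustrative only).
rng = np.random.default_rng(seed=42)
latencies_ms = np.concatenate([
    rng.normal(50, 10, 9_800),     # typical requests
    rng.normal(2_000, 500, 200),   # the slow tail
])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"mean={latencies_ms.mean():.0f}ms  p50={p50:.0f}ms  "
      f"p95={p95:.0f}ms  p99={p99:.0f}ms  max={latencies_ms.max():.0f}ms")
# The mean looks healthy; p99 and max expose the slow tail.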

4. Trend over time

Constant latency:
  ───────────────────
  Good: Stable system

Increasing latency:
  ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱
  Problem: Memory leak? Queue buildup?

Latency with spikes:
  ─╲─╱─╲─╱─╲─╱─╲─╱─
  Investigate: GC? Background jobs? External API?
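
A quick way to distinguish a flat line from a creeping one is to fit a slope to the latency samples over time; a sketch with NumPy (the 5 ms/min threshold is an illustrative choice):

import numpy as np

def latency_trend(timestamps_s, latencies_ms, max_slope_ms_per_min=5.0):
    """Fit a straight line; a clearly positive slope suggests a leak or queue buildup."""
    slope_ms_per_s, _ = np.polyfit(timestamps_s, latencies_ms, deg=1)
    slope_ms_per_min = slope_ms_per_s * 60
    if slope_ms_per_min > max_slope_ms_per_min:
        return f"increasing ({slope_ms_per_min:+.1f} ms/min) - check for leaks or queue buildup"
    return f"stable ({slope_ms_per_min:+.1f} ms/min)"

ts = np.arange(0, 600, 1.0)               # a 10-minute run, one sample per second
print(latency_trend(ts, 100 + 0.2 * ts))  # -> increasing (+12.0 ms/min) ...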

Analyzing by Layer

Endpoint analysis

Aggregated result:
  p95 = 300ms

By endpoint:
  GET /api/products:     p95 = 100ms  ✓
  GET /api/product/:id:  p95 = 150ms  ✓
  POST /api/checkout:    p95 = 2s     ✗ ← Problem!

Insight:
  Checkout needs specific attention
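
If the raw request log is available as a DataFrame, this breakdown is a one-liner with pandas; a sketch assuming hypothetical columns named endpoint and latency_ms:

import pandas as pd

# df holds one row per request, with hypothetical columns: endpoint, latency_ms
def p95_by_endpoint(df: pd.DataFrame) -> pd.Series:
    return (df.groupby("endpoint")["latency_ms"]
              .quantile(0.95)
              .sort_values(ascending=False))  # worst offenders first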

Component analysis

End-to-end result:
  p95 = 500ms

Breakdown:
  API Gateway:    50ms (10%)
  Auth Service:   80ms (16%)
  Order Service: 120ms (24%)
  DB Query:      200ms (40%) ← Largest contributor
  Serialization:  50ms (10%)

Insight:
  DB optimization will have greatest impact
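
The percentages above are just each component's share of the end-to-end time; a tiny sketch of the same arithmetic, using the illustrative numbers from the breakdown:

# Illustrative per-component timings for one traced request (ms).
components = {"API Gateway": 50, "Auth Service": 80, "Order Service": 120,
              "DB Query": 200, "Serialization": 50}

total_ms = sum(components.values())
for name, ms in sorted(components.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {ms:>4}ms  ({ms / total_ms:.0%})")
# DB Query tops the list, so optimizing it has the greatest impact.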

Resource analysis

Performance metrics OK, but:
  CPU: 95%
  Memory: 7.8GB / 8GB
  DB Connections: 95 / 100

Interpretation:
  System operating at its limit
  No headroom for growth
  Next spike may cause failure

Identifying Patterns

Important correlations

Latency rises when:
  - CPU > 80%? → CPU-bound
  - Memory > 90%? → GC stress
  - Connections > 80%? → Connection starvation
  - Throughput increases? → Saturation

Errors appear when:
  - Timeout? → Slow dependency
  - 5xx? → Application failure
  - Connection refused? → Pool exhausted
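
A quick way to test these hypotheses is to correlate the latency series against each resource series captured during the run; a sketch with NumPy (metric names are placeholders):

import numpy as np

def correlate_with_latency(latency_ms, resources: dict) -> dict:
    """Pearson correlation of each resource series against latency
    (all series sampled at the same interval)."""
    return {name: float(np.corrcoef(latency_ms, series)[0, 1])
            for name, series in resources.items()}

# e.g. correlate_with_latency(latency, {"cpu_pct": cpu, "mem_pct": mem, "db_conns": conns})
# The series with a coefficient near 1.0 is the one that moves with latency.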

Red flags

Watch for:
  - High variance (p99/p50 > 10x)
  - Errors increasing over time
  - Latency that doesn't stabilize
  - Throughput dropping during test
  - Resources at 100% (any of them)
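
Checks like these are worth automating at the end of every run; a sketch with illustrative thresholds (the 10x p99/p50 ratio comes from the list above):

def red_flags(p50_ms, p99_ms, cpu_pct, mem_pct, error_rates) -> list:
    """Return a list of red flags for a finished run (thresholds are illustrative)."""
    flags = []
    if p99_ms / p50_ms > 10:
        flags.append(f"high variance: p99/p50 = {p99_ms / p50_ms:.1f}x")
    if cpu_pct >= 100 or mem_pct >= 100:
        flags.append("a resource hit 100%")
    if error_rates and error_rates[-1] > error_rates[0]:
        flags.append("error rate grew over the course of the run")
    return flags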

False positives

Be careful with:
  - First minute (warm-up)
  - Single spikes (may be outliers)
  - Non-representative environment
  - Cache too hot or cold
  - Unrealistic test data
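
Discarding warm-up is a simple filter on the sample timestamps; a sketch in which the 2-minute cutoff is an illustrative default, not a universal rule:

def steady_state(samples, warmup_s: float = 120.0):
    """Drop the warm-up window. Each sample is a (t_seconds, latency_ms) tuple."""
    t0 = samples[0][0]
    return [(t, lat) for t, lat in samples if t - t0 >= warmup_s]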

Results Reporting

Recommended structure

# Test Report - [Name]

## 1. Executive Summary
- Status: PASSED / FAILED / WITH CAVEATS
- Validated capacity: X req/s
- Key findings: [bullets]
- Recommended actions: [bullets]

## 2. Test Context
- Date: [when]
- Environment: [where]
- Load: [how much]
- Duration: [time]
- Scenario: [description]

## 3. Results vs Criteria

| Metric | Criterion | Result | Status |
|--------|-----------|--------|--------|
| p95 latency | < 500ms | 380ms | ✓ |
| Error rate | < 1% | 0.3% | ✓ |
| Throughput | > 1000 req/s | 1250 req/s | ✓ |

## 4. Detailed Analysis

### By Endpoint
[Table with breakdown]

### By Component
[Table with breakdown]

### Temporal Trend
[Graph and observations]

## 5. Observations and Risks
- [Observation 1]
- [Identified risk]

## 6. Recommendations
- [Action 1 - high priority]
- [Action 2 - medium priority]

## 7. Next Steps
- [What to do with these results]

Essential visualizations

Graphs that help:

1. Latency over time:
   - Shows stability
   - Reveals trends

2. Latency distribution (histogram):
   - Shows dispersion
   - Identifies multimodality

3. Throughput vs Latency:
   - Shows saturation point
   - Identifies correlation

4. Resources vs Time:
   - CPU, Memory, Connections
   - Correlates with performance
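
A minimal Matplotlib sketch of the first two plots (latency over time and its histogram); the inputs are placeholders for whatever samples your load tool exports:

import matplotlib.pyplot as plt

def plot_latency(timestamps_s, latencies_ms):
    fig, (ax_time, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

    ax_time.plot(timestamps_s, latencies_ms, linewidth=0.5)
    ax_time.set_xlabel("time (s)")
    ax_time.set_ylabel("latency (ms)")
    ax_time.set_title("Latency over time")     # stability and trends

    ax_hist.hist(latencies_ms, bins=50)
    ax_hist.set_xlabel("latency (ms)")
    ax_hist.set_title("Latency distribution")  # dispersion and multimodality

    fig.tight_layout()
    plt.show()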

Common Interpretations

"The test passed"

Follow-up questions:
  - With what margin?
  - What's the headroom?
  - Close to the limit?

Example:
  Criterion: p95 < 500ms
  Result: p95 = 480ms

  Technically passed, but:
  - 4% margin
  - Any degradation = failure
  - Recommendation: optimize before production
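
The margin check itself is trivial to make explicit; a tiny sketch of the same arithmetic:

def pass_margin(criterion_ms: float, result_ms: float) -> str:
    margin_pct = (criterion_ms - result_ms) / criterion_ms * 100
    status = "passed" if margin_pct >= 0 else "failed"
    return f"{status} with {margin_pct:.0f}% margin"

print(pass_margin(500, 480))  # passed with 4% margin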

"The test failed"

Don't stop at the failure:
  - Where did it fail first?
  - By how much did it miss the criterion?
  - Was it consistent or intermittent?

Example:
  Criterion: p95 < 500ms
  Result: p95 = 650ms

  Analysis:
  - Failed by 30%
  - Bottleneck: DB (contributes 400ms)
  - Optimizing DB may solve it

"Inconsistent results"

High variance indicates:
  - Unstable environment
  - Non-deterministic GC
  - External dependencies
  - Non-uniform load

Action:
  - Run more times
  - Increase duration
  - Isolate variables
  - Investigate spikes

Interpretation Anti-Patterns

1. Cherry-picking

❌ "The best result was 100ms"
✅ "Median was 150ms, best was 100ms, worst was 2s"

2. Ignoring warm-up

❌ Including first 2 minutes in analysis
✅ Discard warm-up, analyze steady-state

3. Comparing incomparables

❌ "Prod has 200ms, test has 300ms, 50% regression"
   (when the test environment differs from prod)

✅ "Comparing same environment, before/after the change"

4. Average as truth

❌ "100ms average, excellent!"
   (When p99 is 5s)

✅ "p50=80ms, p95=200ms, p99=5s - outliers are a problem"

Conclusion

Interpreting results correctly requires:

  1. Context - compare with baseline and requirements
  2. Distribution - percentiles, not averages
  3. Trend - behavior over time
  4. Breakdown - by endpoint and component
  5. Correlation - performance vs resources

Numbers are the beginning, not the end. The value is in the insight you extract.

The test generates data. You generate knowledge.


This article is part of the series on the OCTOPUS Performance Engineering methodology.

OCTOPUS · analysis · results · interpretation

Want to understand your platform's limits?

Contact us for a performance assessment.
