Most performance initiatives start wrong: with tests. Teams configure tools, create scripts, generate load — and discover they don't know what they're measuring or why. The OCTOPUS methodology starts differently: with observation.
You can't improve what you don't understand. And you can't understand what you haven't observed.
## Why Observe First

### The common mistake

Typical approach:
1. "Let's do load testing"
2. Configure k6/Gatling/JMeter
3. Create generic scenario
4. Run test
5. "Average latency: 150ms"
6. "So what? Is that good or bad?"
7. "..."
### The problem
Without prior observation:
- Don't know what's "normal"
- Don't know which endpoints are critical
- Don't know what load to expect
- Don't know what users actually do
- Don't know which metrics matter
### The OCTOPUS approach
1. Observe system in production
2. Understand real patterns
3. Identify what to measure
4. Establish baseline
5. Then, and only then, test
## What Observing Means

### It's not just monitoring
Monitor: passively watch dashboards. Observe: actively seek understanding.

Monitor: "CPU is at 45%." Observe: "CPU is at 45% because the report job runs at 2pm and competes with the user peak."
### Dimensions of observation

Technical:

- Infrastructure metrics
- Application logs
- Request traces
- Code profiles

Business:

- User journeys
- Critical features
- Peak hours
- Important events

Context:

- Current architecture
- Dependencies
- Known limitations
- Incident history
## Phase O: Observe in Practice

### Step 1: Map the system
```markdown
## System Map

### Components
- API Gateway (Kong)
- 3 microservices (Node.js)
- PostgreSQL (RDS)
- Redis (ElastiCache)
- S3 for assets

### Critical flows
1. Login → Auth Service → DB
2. Listing → Catalog Service → DB + Cache
3. Checkout → Order Service → DB + Payment Gateway

### External dependencies
- Payment Gateway (Stripe)
- Email Service (SendGrid)
- Analytics (Segment)
```
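The prose map is the main artifact, but it can also help to capture the critical flows in machine-readable form so later OCTOPUS phases can prioritize test scenarios against them. A minimal TypeScript sketch using the names from the map above; the structure itself is an illustration, not something the methodology prescribes:

```typescript
// Hypothetical structure: the critical flows from the system map,
// kept alongside the docs so later test scenarios can reference them.
interface CriticalFlow {
  name: string;
  services: string[];      // internal components involved
  externalDeps: string[];  // third-party dependencies
  businessImpact: string;  // why this flow matters
}

const criticalFlows: CriticalFlow[] = [
  {
    name: "login",
    services: ["api-gateway", "auth-service", "postgres"],
    externalDeps: [],
    businessImpact: "access",
  },
  {
    name: "listing",
    services: ["api-gateway", "catalog-service", "postgres", "redis"],
    externalDeps: [],
    businessImpact: "experience",
  },
  {
    name: "checkout",
    services: ["api-gateway", "order-service", "postgres"],
    externalDeps: ["stripe"],
    businessImpact: "revenue",
  },
];

export default criticalFlows;
```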
### Step 2: Collect existing metrics

```sql
-- Understand traffic patterns: requests per hour over the last 7 days
SELECT
  date_trunc('hour', created_at) AS hour,
  count(*) AS requests
FROM access_logs
WHERE created_at > now() - interval '7 days'
GROUP BY 1
ORDER BY 1;
```
```promql
# Current latency distribution over the last 24h
# (aggregate histogram buckets across instances before computing the quantile)
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))
```
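When the latency data lives in Prometheus, the same queries can be run programmatically and the results saved as a baseline artifact for later phases. A minimal sketch against Prometheus's standard `/api/v1/query` endpoint, assuming Node.js 18+ for the global `fetch`; the server URL and output file name are illustrative:

```typescript
// Minimal sketch: pull baseline latency percentiles from Prometheus
// and save them for later comparison. Assumes Node.js 18+ (global fetch)
// and a Prometheus server reachable at PROM_URL (hypothetical).
import { writeFileSync } from "node:fs";

const PROM_URL = process.env.PROM_URL ?? "http://localhost:9090";

async function instantQuery(expr: string): Promise<number> {
  const url = `${PROM_URL}/api/v1/query?query=${encodeURIComponent(expr)}`;
  const res = await fetch(url);
  const body = (await res.json()) as any;
  // Prometheus responds with { data: { result: [{ value: [timestamp, "value"] }] } }
  return Number(body.data.result[0]?.value[1] ?? NaN);
}

async function main() {
  const baseline: Record<string, number> = {};
  for (const q of [0.5, 0.95, 0.99]) {
    const expr = `histogram_quantile(${q}, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))`;
    baseline[`p${Math.round(q * 100)}`] = await instantQuery(expr);
  }
  writeFileSync("baseline-latency.json", JSON.stringify(baseline, null, 2));
  console.log("Baseline written:", baseline);
}

main().catch(console.error);
```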
### Step 3: Understand the business

```markdown
## Questions for stakeholders

1. Which features are critical for the business?
   → "Checkout cannot fail, that's where we generate revenue"

2. What's the peak usage time?
   → "10am to 12pm and 7pm to 10pm"

3. Are there planned special events?
   → "Black Friday in 3 months, we expect 10x traffic"

4. What's the tolerance for slowness?
   → "Users complain if a page takes more than 3s"

5. Were there recent incidents?
   → "Last month we had a 2h outage on checkout"
```
### Step 4: Document findings

```markdown
## Observations - System XYZ

### Traffic patterns
- Peak: 10am-12pm (3x baseline)
- Valley: 2am-6am (0.1x baseline)
- Weekend: 40% of weekday traffic

### Current latency
- p50: 45ms
- p95: 180ms
- p99: 450ms

### Observed bottlenecks
1. DB connection pool saturates at 11am
2. Redis hit rate drops to 60% after deploys
3. Payment gateway has 2s latency spikes

### Critical endpoints
1. POST /api/checkout (revenue)
2. GET /api/products (experience)
3. POST /api/auth/login (access)

### Identified risks
- Single point of failure on the DB
- No circuit breaker for the payment gateway
- Unstructured logs make debugging hard
```
## Tools for Observation

### Metrics

Infrastructure:

- Prometheus + Grafana
- CloudWatch
- Datadog

Application:

- APM (New Relic, Datadog, Dynatrace)
- Custom metrics (sketch below)

Business:

- Google Analytics
- Mixpanel
- Amplitude
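Infrastructure and APM tools rarely expose business-level signals out of the box, which is where custom metrics come in. A minimal sketch using the `prom-client` library in a Node.js/Express service (matching the stack from the system map); the metric name and bucket boundaries are assumptions:

```typescript
// Minimal sketch: expose a custom latency histogram from a Node.js
// service with prom-client, so Prometheus can scrape it.
import express from "express";
import { Histogram, Registry, collectDefaultMetrics } from "prom-client";

const registry = new Registry();
collectDefaultMetrics({ register: registry });

// Hypothetical metric: checkout duration, bucketed around the
// latencies observed in production (p50 ~45ms, p99 ~450ms).
const checkoutDuration = new Histogram({
  name: "checkout_duration_seconds",
  help: "End-to-end checkout handling time",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
  registers: [registry],
});

const app = express();

app.post("/api/checkout", async (_req, res) => {
  const stopTimer = checkoutDuration.startTimer();
  try {
    // ... real checkout logic would go here ...
    res.status(200).json({ ok: true });
  } finally {
    stopTimer(); // records the elapsed seconds in the histogram
  }
});

app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", registry.contentType);
  res.end(await registry.metrics());
});

app.listen(3000);
```

Once Prometheus scrapes `/metrics`, the same PromQL patterns from Step 2 work against `checkout_duration_seconds_bucket`.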
### Logs

Aggregation:

- ELK Stack
- Loki + Grafana
- CloudWatch Logs

Analysis:

- Structured queries
- Correlation with traces (sketch below)
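The findings above flagged unstructured logs as a risk; structured, correlatable logs are what make this kind of analysis possible. A minimal sketch with `pino` in an Express service, attaching a request/trace id to every log line; the header name and route are assumptions:

```typescript
// Minimal sketch: structured JSON logs carrying a request/trace id,
// so log lines can be filtered and correlated with traces later.
import express from "express";
import pino from "pino";
import { randomUUID } from "node:crypto";

const logger = pino({ level: "info" });
const app = express();

app.use((req, res, next) => {
  // Reuse an upstream request id if the gateway sends one (header name
  // is an assumption), otherwise generate our own.
  const traceId = req.header("x-request-id") ?? randomUUID();
  res.locals.log = logger.child({ traceId, route: req.path });
  next();
});

app.get("/api/products", (_req, res) => {
  res.locals.log.info({ cache: "miss" }, "listing products");
  res.json([]);
});

app.listen(3000);
```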
### Traces

Distributed tracing (setup sketch after the list):

- Jaeger
- Zipkin
- Tempo
- X-Ray
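Getting traces out of a Node.js service is mostly configuration. A minimal sketch with the OpenTelemetry Node SDK and auto-instrumentation, exporting spans over OTLP (which Jaeger and Tempo both accept); the collector endpoint and service name are assumptions:

```typescript
// Minimal sketch: enable distributed tracing in a Node.js service with
// OpenTelemetry auto-instrumentation, exporting spans over OTLP.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "catalog-service", // hypothetical name for one of the services in the map
  traceExporter: new OTLPTraceExporter({
    // Hypothetical collector endpoint; 4318 is the default OTLP/HTTP port.
    url: "http://otel-collector:4318/v1/traces",
  }),
  // Automatically creates spans for http, express, pg, redis, etc.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown so the last traces are not lost.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```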
## Observation Checklist

```markdown
## Before testing, do you know:

### Traffic
- [ ] Current requests per second?
- [ ] Distribution by endpoint?
- [ ] Daily/weekly pattern?
- [ ] Historical peaks?

### Performance
- [ ] Current latency (p50, p95, p99)?
- [ ] Current error rate?
- [ ] Maximum throughput ever reached?

### Resources
- [ ] CPU/memory utilization?
- [ ] DB connections?
- [ ] Cache hit rate?

### Business
- [ ] Critical endpoints?
- [ ] Existing SLOs?
- [ ] Degradation tolerance?

### Context
- [ ] Architecture documented?
- [ ] Dependencies mapped?
- [ ] Incident history?
```
If you answered "no" to more than 3 items, you're not ready to test.
## How Long to Observe

### General rule

- Minimum: 1 week of data
- Ideal: 1 month of data
- For seasonal events: 1 year of data
### Why?

- 1 day: see the daily pattern, miss weekly variation
- 1 week: see the weekly pattern, miss monthly variation
- 1 month: see most normal patterns
- 1 year: capture seasonality (Black Friday, holidays, etc.)
## Common Mistakes

### 1. Skipping observation due to rush

- ❌ "We don't have time, let's go straight to testing" → you test the wrong thing and repeat the work
- ✅ "Let's invest 2 days observing" → you test the right thing and save weeks
### 2. Only observing infrastructure

- ❌ "CPU and memory are fine" → misses bottlenecks in the code
- ✅ "CPU is ok, but p99 latency is high" → the investigation goes beyond infra
### 3. Not involving the business

- ❌ "The system handles 10K req/s" → but the business needs 50K
- ✅ "The business expects 50K, we do 10K today" → the test has a clear objective
## Conclusion
The Observe phase of the OCTOPUS methodology establishes the foundation:
- Understand the system before stressing it
- Know the baseline before measuring changes
- Align with business before setting goals
- Document everything for future reference
Time invested in observation multiplies test quality.
Testing without observing is like driving without knowing where you're going. You'll get somewhere, but probably not where you need to be.
This article is part of the series on the OCTOPUS Performance Engineering methodology.