Most performance initiatives start wrong: with tests. Teams configure tools, create scripts, generate load — and discover they don't know what they're measuring or why. The OCTOPUS methodology starts differently: with observation.
You can't improve what you don't understand. And you can't understand what you haven't observed.
## Why Observe First

### The common mistake

Typical approach:
1. "Let's do load testing"
2. Configure k6/Gatling/JMeter
3. Create generic scenario
4. Run test
5. "Average latency: 150ms"
6. "So what? Is that good or bad?"
7. "..."
### The problem
Without prior observation:
- Don't know what's "normal"
- Don't know which endpoints are critical
- Don't know what load to expect
- Don't know what users actually do
- Don't know which metrics matter
### The OCTOPUS approach
1. Observe system in production
2. Understand real patterns
3. Identify what to measure
4. Establish baseline
5. Then, and only then, test
## What Observing Means

### It's not just monitoring
Monitor: passively watch dashboards. Observe: actively seek understanding.

Monitor: "CPU is at 45%." Observe: "CPU is at 45% because the report job runs at 2pm and competes with the user peak."
### Dimensions of observation

Technical:

- Infrastructure metrics
- Application logs
- Request traces
- Code profiles

Business:

- User journeys
- Critical features
- Peak hours
- Important events

Context:

- Current architecture
- Dependencies
- Known limitations
- Incident history
## Phase O: Observe in Practice

### Step 1: Map the system
```markdown
## System Map

### Components
- API Gateway (Kong)
- 3 microservices (Node.js)
- PostgreSQL (RDS)
- Redis (ElastiCache)
- S3 for assets

### Critical flows
1. Login → Auth Service → DB
2. Listing → Catalog Service → DB + Cache
3. Checkout → Order Service → DB + Payment Gateway

### External dependencies
- Payment Gateway (Stripe)
- Email Service (SendGrid)
- Analytics (Segment)
```
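The prose map is the main artifact, but it can also help to capture the critical flows in machine-readable form so later OCTOPUS phases can prioritize test scenarios against them. A minimal TypeScript sketch using the names from the map above; the structure itself is an illustration, not something the methodology prescribes:

```typescript
// Hypothetical structure: the critical flows from the system map,
// kept alongside the docs so later test scenarios can reference them.
interface CriticalFlow {
  name: string;
  services: string[];      // internal components involved
  externalDeps: string[];  // third-party dependencies
  businessImpact: string;  // why this flow matters
}

const criticalFlows: CriticalFlow[] = [
  {
    name: "login",
    services: ["api-gateway", "auth-service", "postgres"],
    externalDeps: [],
    businessImpact: "access",
  },
  {
    name: "listing",
    services: ["api-gateway", "catalog-service", "postgres", "redis"],
    externalDeps: [],
    businessImpact: "experience",
  },
  {
    name: "checkout",
    services: ["api-gateway", "order-service", "postgres"],
    externalDeps: ["stripe"],
    businessImpact: "revenue",
  },
];

export default criticalFlows;
```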
### Step 2: Collect existing metrics

```sql
-- Understand traffic patterns: requests per hour over the last 7 days
SELECT
  date_trunc('hour', created_at) AS hour,
  count(*) AS requests
FROM access_logs
WHERE created_at > now() - interval '7 days'
GROUP BY 1
ORDER BY 1;
```
```promql
# Current latency distribution over the last 24h
# (aggregate histogram buckets across instances before computing the quantile)
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))
```
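When the latency data lives in Prometheus, the same queries can be run programmatically and the results saved as a baseline artifact for later phases. A minimal sketch against Prometheus's standard `/api/v1/query` endpoint, assuming Node.js 18+ for the global `fetch`; the server URL and output file name are illustrative:

```typescript
// Minimal sketch: pull baseline latency percentiles from Prometheus
// and save them for later comparison. Assumes Node.js 18+ (global fetch)
// and a Prometheus server reachable at PROM_URL (hypothetical).
import { writeFileSync } from "node:fs";

const PROM_URL = process.env.PROM_URL ?? "http://localhost:9090";

async function instantQuery(expr: string): Promise<number> {
  const url = `${PROM_URL}/api/v1/query?query=${encodeURIComponent(expr)}`;
  const res = await fetch(url);
  const body = (await res.json()) as any;
  // Prometheus responds with { data: { result: [{ value: [timestamp, "value"] }] } }
  return Number(body.data.result[0]?.value[1] ?? NaN);
}

async function main() {
  const baseline: Record<string, number> = {};
  for (const q of [0.5, 0.95, 0.99]) {
    const expr = `histogram_quantile(${q}, sum by (le) (rate(http_request_duration_seconds_bucket[24h])))`;
    baseline[`p${Math.round(q * 100)}`] = await instantQuery(expr);
  }
  writeFileSync("baseline-latency.json", JSON.stringify(baseline, null, 2));
  console.log("Baseline written:", baseline);
}

main().catch(console.error);
```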
### Step 3: Understand the business

```markdown
## Questions for stakeholders

1. Which features are critical for the business?
   → "Checkout cannot fail, that's where we generate revenue"

2. What's the peak usage time?
   → "10am to 12pm and 7pm to 10pm"

3. Are there planned special events?
   → "Black Friday in 3 months, we expect 10x traffic"

4. What's the tolerance for slowness?
   → "Users complain if a page takes more than 3s"

5. Were there recent incidents?
   → "Last month we had a 2h outage on checkout"
```
### Step 4: Document findings

```markdown
## Observations - System XYZ

### Traffic patterns
- Peak: 10am-12pm (3x baseline)
- Valley: 2am-6am (0.1x baseline)
- Weekend: 40% of weekday traffic

### Current latency
- p50: 45ms
- p95: 180ms
- p99: 450ms

### Observed bottlenecks
1. DB connection pool saturates at 11am
2. Redis hit rate drops to 60% after deploys
3. Payment gateway has 2s latency spikes

### Critical endpoints
1. POST /api/checkout (revenue)
2. GET /api/products (experience)
3. POST /api/auth/login (access)

### Identified risks
- Single point of failure on the DB
- No circuit breaker for the payment gateway
- Unstructured logs make debugging hard
```
## Tools for Observation

### Metrics

Infrastructure:

- Prometheus + Grafana
- CloudWatch
- Datadog

Application:

- APM (New Relic, Datadog, Dynatrace)
- Custom metrics (sketch below)

Business:

- Google Analytics
- Mixpanel
- Amplitude
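Infrastructure and APM tools rarely expose business-level signals out of the box, which is where custom metrics come in. A minimal sketch using the `prom-client` library in a Node.js/Express service (matching the stack from the system map); the metric name and bucket boundaries are assumptions:

```typescript
// Minimal sketch: expose a custom latency histogram from a Node.js
// service with prom-client, so Prometheus can scrape it.
import express from "express";
import { Histogram, Registry, collectDefaultMetrics } from "prom-client";

const registry = new Registry();
collectDefaultMetrics({ register: registry });

// Hypothetical metric: checkout duration, bucketed around the
// latencies observed in production (p50 ~45ms, p99 ~450ms).
const checkoutDuration = new Histogram({
  name: "checkout_duration_seconds",
  help: "End-to-end checkout handling time",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
  registers: [registry],
});

const app = express();

app.post("/api/checkout", async (_req, res) => {
  const stopTimer = checkoutDuration.startTimer();
  try {
    // ... real checkout logic would go here ...
    res.status(200).json({ ok: true });
  } finally {
    stopTimer(); // records the elapsed seconds in the histogram
  }
});

app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", registry.contentType);
  res.end(await registry.metrics());
});

app.listen(3000);
```

Once Prometheus scrapes `/metrics`, the same PromQL patterns from Step 2 work against `checkout_duration_seconds_bucket`.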
### Logs

Aggregation:

- ELK Stack
- Loki + Grafana
- CloudWatch Logs

Analysis:

- Structured queries
- Correlation with traces (sketch below)
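The findings above flagged unstructured logs as a risk; structured, correlatable logs are what make this kind of analysis possible. A minimal sketch with `pino` in an Express service, attaching a request/trace id to every log line; the header name and route are assumptions:

```typescript
// Minimal sketch: structured JSON logs carrying a request/trace id,
// so log lines can be filtered and correlated with traces later.
import express from "express";
import pino from "pino";
import { randomUUID } from "node:crypto";

const logger = pino({ level: "info" });
const app = express();

app.use((req, res, next) => {
  // Reuse an upstream request id if the gateway sends one (header name
  // is an assumption), otherwise generate our own.
  const traceId = req.header("x-request-id") ?? randomUUID();
  res.locals.log = logger.child({ traceId, route: req.path });
  next();
});

app.get("/api/products", (_req, res) => {
  res.locals.log.info({ cache: "miss" }, "listing products");
  res.json([]);
});

app.listen(3000);
```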
### Traces

Distributed tracing (setup sketch after the list):

- Jaeger
- Zipkin
- Tempo
- X-Ray
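Getting traces out of a Node.js service is mostly configuration. A minimal sketch with the OpenTelemetry Node SDK and auto-instrumentation, exporting spans over OTLP (which Jaeger and Tempo both accept); the collector endpoint and service name are assumptions:

```typescript
// Minimal sketch: enable distributed tracing in a Node.js service with
// OpenTelemetry auto-instrumentation, exporting spans over OTLP.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "catalog-service", // hypothetical name for one of the services in the map
  traceExporter: new OTLPTraceExporter({
    // Hypothetical collector endpoint; 4318 is the default OTLP/HTTP port.
    url: "http://otel-collector:4318/v1/traces",
  }),
  // Automatically creates spans for http, express, pg, redis, etc.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown so the last traces are not lost.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```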
## Observation Checklist

```markdown
## Before testing, do you know:

### Traffic
- [ ] Current requests per second?
- [ ] Distribution by endpoint?
- [ ] Daily/weekly pattern?
- [ ] Historical peaks?

### Performance
- [ ] Current latency (p50, p95, p99)?
- [ ] Current error rate?
- [ ] Maximum throughput ever reached?

### Resources
- [ ] CPU/memory utilization?
- [ ] DB connections?
- [ ] Cache hit rate?

### Business
- [ ] Critical endpoints?
- [ ] Existing SLOs?
- [ ] Degradation tolerance?

### Context
- [ ] Architecture documented?
- [ ] Dependencies mapped?
- [ ] Incident history?
```
If you answered "no" to more than 3 items, you're not ready to test.
## How Long to Observe

### General rule

- Minimum: 1 week of data
- Ideal: 1 month of data
- For seasonal events: 1 year of data
### Why?

- 1 day: see the daily pattern, miss weekly variation
- 1 week: see the weekly pattern, miss monthly variation
- 1 month: see most normal patterns
- 1 year: capture seasonality (Black Friday, holidays, etc.)
## Common Mistakes

### 1. Skipping observation due to rush

- ❌ "We don't have time, let's go straight to testing" → you test the wrong thing and repeat the work
- ✅ "Let's invest 2 days observing" → you test the right thing and save weeks
### 2. Only observing infrastructure

- ❌ "CPU and memory are fine" → misses bottlenecks in the code
- ✅ "CPU is ok, but p99 latency is high" → the investigation goes beyond infra
### 3. Not involving the business

- ❌ "The system handles 10K req/s" → but the business needs 50K
- ✅ "The business expects 50K, we do 10K today" → the test has a clear objective
## Conclusion
The Observe phase of the OCTOPUS methodology establishes the foundation:
- Understand the system before stressing it
- Know the baseline before measuring changes
- Align with business before setting goals
- Document everything for future reference
Time invested in observation multiplies test quality.
Testing without observing is like driving without knowing where you're going. You'll get somewhere, but probably not where you need to be.
This article is part of the series on the OCTOPUS Performance Engineering methodology.