
What to Observe in a System: metrics that truly matter

With so many metrics available, it's easy to get lost. Learn to identify and focus on metrics that reveal the true health of the system.

Modern systems generate thousands of metrics. Dashboards crowded with graphs, alerts firing constantly. The irony: the more data you have, the harder it is to see what matters. This article teaches you to separate signal from noise.

The problem isn't lack of data. It's excess data without context.

The Three Levels of Observation

Level 1: Infrastructure

What to measure:
  CPU:
    - Average utilization
    - Usage peaks
    - Throttling (in containers)

  Memory:
    - Current vs available usage
    - Swap usage
    - OOM events

  Disk:
    - IOPS
    - I/O latency
    - Available space

  Network:
    - Bandwidth used
    - Lost packets
    - Active connections

Why observe: Identifies physical limits and resource problems.

Pitfall: Focusing only here. Healthy infrastructure doesn't mean a healthy application.
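
In PromQL, these signals typically come from node_exporter and cAdvisor. A minimal sketch, assuming their standard metric names; adjust to whatever exporters your stack actually runs:

# Memory: fraction of RAM in use (node_exporter)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk: available space per filesystem
node_filesystem_avail_bytes / node_filesystem_size_bytes

# Network: packets dropped on receive
rate(node_network_receive_drop_total[5m])

# CPU throttling in containers (cAdvisor)
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])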

Level 2: Application

What to measure:
  Latency:
    - p50, p95, p99
    - Per endpoint
    - Per operation type

  Throughput:
    - Requests per second
    - Transactions per second
    - Per functionality

  Errors:
    - Error rate
    - By type (4xx, 5xx)
    - By root cause

  Saturation:
    - Connection pools
    - Thread pools
    - Queue depths

Why observe: Reveals real user experience.

Pitfall: Aggregated metrics hide specific problems.
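
One way around that pitfall is to break the metrics down by the labels you care about. A sketch in PromQL, assuming your HTTP histograms carry an endpoint label (yours may be called handler or route):

# p95 latency per endpoint
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Error rate per endpoint
sum by(endpoint) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by(endpoint) (rate(http_requests_total[5m]))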

Level 3: Business

What to measure:
  Conversion:
    - Sales funnel
    - Abandonment rate
    - Time to conversion

  Engagement:
    - Sessions per user
    - Time on platform
    - Most used features

  Revenue:
    - Transactions/hour
    - Average value
    - Payment failures

Why observe: Connects technical performance to business impact.

Pitfall: Hard to correlate with technical metrics.
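
Business metrics usually come from counters you instrument in the application yourself, not from infrastructure exporters. A minimal sketch with hypothetical counters (checkout_started_total, checkout_completed_total, payment_attempts_total); your own names will differ:

# Conversion: completed vs started checkouts
sum(rate(checkout_completed_total[1h]))
/ sum(rate(checkout_started_total[1h]))

# Payment failure rate
sum(rate(payment_attempts_total{status="failed"}[5m]))
/ sum(rate(payment_attempts_total[5m]))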

The Golden Signals

Google's SRE book defines four essential signals:

1. Latency

# Response time by percentile
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

What it reveals:

  • p50: Typical experience
  • p95: What the slowest 5% of requests experience
  • p99: Common worst case

Why percentiles, not average:

Scenario: 99 requests of 10ms, 1 request of 10s
Average: ~110ms (seems ok)
p99: 10s (reveals the problem)
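
To see this gap on a dashboard, plot the mean next to the percentiles. Both can be derived from the same histogram used above:

# Mean latency (hides the outlier)
rate(http_request_duration_seconds_sum[5m])
/ rate(http_request_duration_seconds_count[5m])

# p99 latency (exposes the outlier)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))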

2. Traffic

# Requests per second
rate(http_requests_total[5m])

# Per endpoint
sum by(endpoint) (rate(http_requests_total[5m]))

# Per status
sum by(status_code) (rate(http_requests_total[5m]))

What it reveals:

  • Current vs expected volume
  • Load distribution
  • Usage trends

3. Errors

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

# By type
sum by(status_code) (rate(http_requests_total{status=~"[45].."}[5m]))

What it reveals:

  • Overall system health
  • Specific problems (4xx vs 5xx)
  • Degradation trends

4. Saturation

# Connection pool
pg_stat_activity_count / pg_settings_max_connections

# Threads
jvm_threads_current / jvm_threads_max

# CPU
sum(rate(container_cpu_usage_seconds_total[5m]))
/ sum(container_spec_cpu_quota / container_spec_cpu_period)

What it reveals:

  • How much of the resource is in use
  • Proximity to limit
  • Saturation risk

RED Method (for services)

Rate:
  - Requests per second
  - How many operations the service processes

Errors:
  - Failure rate
  - How many operations fail

Duration:
  - Latency per request
  - How long each operation takes

# Complete RED dashboard
# Rate
sum(rate(http_requests_total[5m]))

# Errors
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

USE Method (for resources)

Utilization:
  - % of resource in use
  - CPU, memory, disk, network

Saturation:
  - Queued work
  - Requests waiting for resources

Errors:
  - Resource failures
  - OOM, disk full, network errors

# USE for CPU
# Utilization
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation
avg(node_load15) / count(node_cpu_seconds_total{mode="idle"})

# Errors
increase(node_cpu_core_throttles_total[5m])

Component-Specific Metrics

Database

Connections:
  - Active vs maximum
  - Idle vs waiting
  - Connection errors

Queries:
  - Queries per second
  - Average time per query
  - Slow queries

Resources:
  - Buffer cache hit rate
  - Disk reads vs cache reads
  - Lock wait time
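
Assuming the commonly used postgres_exporter metric names (they vary with exporter version), these translate roughly to:

# Connections: active vs maximum
sum(pg_stat_activity_count) / sum(pg_settings_max_connections)

# Buffer cache hit rate per database
rate(pg_stat_database_blks_hit[5m])
/ (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))

# Transactions per second
rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])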

Cache (Redis)

Performance:
  - Hit rate
  - Miss rate
  - Operation latency

Memory:
  - Current usage
  - Evictions
  - Fragmentation

Connections:
  - Connected clients
  - Blocked clients
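
With the widely used redis_exporter, these map roughly to the queries below (names may differ if you use another exporter):

# Hit rate
rate(redis_keyspace_hits_total[5m])
/ (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))

# Memory usage vs configured limit (0 when no maxmemory is set)
redis_memory_used_bytes / redis_memory_max_bytes

# Evictions
rate(redis_evicted_keys_total[5m])

# Blocked clients
redis_blocked_clients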

Message Queue (Kafka)

Throughput:
  - Messages in/out per second
  - Bytes in/out

Consumer:
  - Consumer lag
  - Commit rate
  - Rebalances

Broker:
  - Under-replicated partitions
  - Request queue size
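
A rough translation, assuming kafka_exporter-style metric names; Kafka metric names vary a lot between exporters and JMX mappings, so treat these as placeholders:

# Consumer lag per group and topic
sum by(consumergroup, topic) (kafka_consumergroup_lag)

# Approximate produce rate: growth of the latest offset per topic
sum by(topic) (rate(kafka_topic_partition_current_offset[5m]))

# Under-replicated partitions (name depends on your exporter / JMX mapping)
sum(kafka_topic_partition_under_replicated_partition)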

Creating an Effective Dashboard

Recommended structure

┌─────────────────────────────────────────────────┐
│ HEALTH OVERVIEW                                 │
│ [Status] [Error Rate] [p95 Latency] [Traffic]   │
├─────────────────────────────────────────────────┤
│ GOLDEN SIGNALS                                  │
│ [Latency p50/p95/p99] [Traffic] [Errors] [Sat]  │
├─────────────────────────────────────────────────┤
│ TOP ENDPOINTS                                   │
│ [By latency] [By traffic] [By errors]           │
├─────────────────────────────────────────────────┤
│ RESOURCES                                       │
│ [CPU] [Memory] [DB] [Cache]                     │
├─────────────────────────────────────────────────┤
│ DEPENDENCIES                                    │
│ [External APIs] [Payment] [Email]               │
└─────────────────────────────────────────────────┘
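
The health overview row at the top can be built almost entirely from queries already shown in this article, for example:

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

# p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic
sum(rate(http_requests_total[5m]))

# Saturation (example: DB connection pool)
sum(pg_stat_activity_count) / sum(pg_settings_max_connections)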

Design rules

1. Less is more:
   - 5-10 graphs per dashboard
   - Focus on what requires action

2. Clear hierarchy:
   - Overview at top
   - Details below
   - Drill-down available

3. Temporal context:
   - Compare with baseline (see the sketch after these rules)
   - Show trends
   - Mark deploys/events

4. Actionable:
   - Each graph answers a question
   - If it doesn't require action, remove it
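
Rule 3, temporal context, is easy to implement directly in PromQL with the offset modifier. A minimal sketch comparing the current window against the same window one week earlier:

# Traffic now vs the same 5-minute window one week ago
sum(rate(http_requests_total[5m]))
/ sum(rate(http_requests_total[5m] offset 1w))

# Error rate one week ago, for side-by-side plotting
sum(rate(http_requests_total{status=~"5.."}[5m] offset 1w))
/ sum(rate(http_requests_total[5m] offset 1w))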

Correlating Metrics

Example: Investigating slowness

Symptom: p95 latency increased 2x

Check correlations:
1. Did traffic increase?
   → No, stable

2. Did errors increase?
   → Yes, 5xx up 3x

3. Resources saturated?
   → DB connections at 95%

4. Specific slow query?
   → Yes, report query without index

Root cause: A deploy introduced a new query without an index,
            which competes for the same connection pool
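
Each of those checks maps to a query already introduced in this article; running them side by side over the same time window is what makes the correlation visible:

# 1. Traffic
sum(rate(http_requests_total[5m]))

# 2. Errors (5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))

# 3. DB connection saturation
sum(pg_stat_activity_count) / sum(pg_settings_max_connections)

# 4. The slow query itself usually comes from pg_stat_statements
#    or the database's slow-query log, not from Prometheus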

Conclusion

Effective observation means:

  1. Focus on Golden Signals as starting point
  2. USE for resources, RED for services
  3. Business metrics for context
  4. Correlation to find root causes
  5. Actionable dashboards, not decorative

The number of metrics doesn't matter. What matters is knowing which question each metric answers.

If you don't know why you're measuring something, stop measuring it.


This article is part of the series on the OCTOPUS Performance Engineering methodology.

Tags: OCTOPUS · metrics · observability · monitoring

Want to understand your platform's limits?

Contact us for a performance assessment.
