
Logs, Metrics, and Traces: the three pillars of observability

Understand the difference between logs, metrics, and traces, and how to use each one to diagnose performance problems.

"The system is slow." Where do you look first? Logs? Metrics? Traces? The right answer depends on the question you're asking. The three pillars of observability are complementary, not competing. This article teaches you when to use each one.

Observability isn't about having data. It's about being able to answer any question about your system.

The Three Pillars

Overview

Metrics:
  What: Aggregated numbers over time
  Answers: "What is happening?"
  Example: "p95 latency = 200ms"

Logs:
  What: Discrete events with context
  Answers: "What happened specifically?"
  Example: "Request X failed with error Y at time Z"

Traces:
  What: Journey of a request through the system
  Answers: "Where did it go and how long did it take?"
  Example: "Request went through A (50ms) → B (200ms) → C (30ms)"

Analogy

Metrics = Thermometer
  "You have a fever of 102°F"
  Know something is wrong, but not what

Logs = Medical history
  "Patient reported sore throat 3 days ago"
  Specific events that help diagnosis

Traces = Imaging scan
  "Infection located in right tonsil"
  Complete visualization of the problem

Metrics

Characteristics

Format: Numeric values with timestamp
Granularity: Aggregated (averages, percentiles, counts)
Volume: Low (compressed data points)
Retention: Long (months to years)
Cost: Low per data point stored

Types of metrics

Counter:
  - Only increases (resets when the process restarts)
  - Ex: total requests, errors, bytes

Gauge:
  - Can go up or down
  - Ex: temperature, active connections, memory

Histogram:
  - Distribution of values
  - Ex: latency, payload size

Summary:
  - Pre-calculated percentiles
  - Ex: p50, p95, p99 of latency
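
To make the four types concrete, here is a minimal sketch using the Node.js prom-client library; the library choice and metric names are illustrative assumptions, but any Prometheus client exposes the same four primitives.

```javascript
// Minimal sketch of the four metric types with prom-client (Node.js).
// Library choice and metric names are illustrative assumptions.
const client = require('prom-client');

// Counter: only goes up (resets when the process restarts)
const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status'],
});
httpRequests.inc({ method: 'GET', status: '200' });

// Gauge: can go up or down
const activeConnections = new client.Gauge({
  name: 'db_connections_active',
  help: 'Active database connections',
});
activeConnections.set(42);

// Histogram: observations bucketed so percentiles can be queried later
const requestDuration = new client.Histogram({
  name: 'http_duration_seconds',
  help: 'HTTP request duration in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});
requestDuration.observe(0.2);

// Summary: percentiles pre-calculated on the client
const payloadSize = new client.Summary({
  name: 'payload_size_bytes',
  help: 'Request payload size in bytes',
  percentiles: [0.5, 0.95, 0.99],
});
payloadSize.observe(1024);
```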

Examples in Prometheus

# Counter: Requests per second rate
rate(http_requests_total[5m])

# Gauge: Active connections now
db_connections_active

# Histogram: 95th percentile latency
histogram_quantile(0.95, sum by (le) (rate(http_duration_seconds_bucket[5m])))

# Aggregation: Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
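
For these queries to return anything, the application has to expose its metrics for Prometheus to scrape. A minimal sketch with prom-client and Express (both library choices are assumptions):

```javascript
// Sketch: expose collected metrics on /metrics for Prometheus to scrape.
// Express and prom-client are illustrative choices, not the only option.
const express = require('express');
const client = require('prom-client');

client.collectDefaultMetrics(); // process-level metrics (CPU, memory, event loop)

const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);
```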

When to use metrics

✅ Use for:
  - Alerts (threshold violated)
  - Health dashboards
  - Trends over time
  - Period comparisons
  - Capacity planning

❌ Don't use for:
  - Investigating specific request
  - Understanding error context
  - Individual case debug

Logs

Characteristics

Format: Text or JSON with timestamp
Granularity: Individual event
Volume: High (each event is recorded)
Retention: Medium (days to weeks)
Cost: High (grows with volume)

Log levels

DEBUG:
  - Details for development
  - Usually disabled in production

INFO:
  - Important normal events
  - "User X logged in"

WARN:
  - Unusual but not critical situations
  - "Retry needed for service Y"

ERROR:
  - Failures that need attention
  - "Timeout connecting to database"

FATAL:
  - System cannot continue
  - "Invalid configuration, shutting down"

Structured logs

// ❌ Unstructured log
"2024-01-15 10:30:45 ERROR Failed to process order 12345 for user john@example.com"

// ✅ Structured log
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "error_type": "PaymentDeclined",
  "payment_provider": "stripe",
  "trace_id": "abc123",
  "duration_ms": 1523
}
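
Emitting the structured version is a one-liner with any JSON logger. A minimal sketch with pino (the logger choice is an assumption):

```javascript
// Sketch of structured logging with pino (library choice is an assumption).
const pino = require('pino');
const logger = pino({ level: 'info' }); // JSON output by default

// Fields first, human-readable message last; pino merges both into one JSON line
logger.error(
  {
    order_id: '12345',
    user_email: 'john@example.com',
    error_type: 'PaymentDeclined',
    payment_provider: 'stripe',
    trace_id: 'abc123',
    duration_ms: 1523,
  },
  'Failed to process order'
);
```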

When to use logs

✅ Use for:
  - Investigating specific errors
  - Audit and compliance
  - Business flow debugging
  - Failure context

❌ Don't use for:
  - Aggregated metrics (use metrics)
  - Visualizing request flow (use traces)
  - Threshold alerts

Logging best practices

1. Always structured:
   - JSON with standardized fields
   - Facilitates queries and analysis

2. Include context:
   - trace_id for correlation
   - user_id for investigation
   - request_id for tracking

3. Avoid unnecessary PII:
   - Don't log passwords, tokens
   - Mask sensitive data

4. Log at the right level:
   - Production: INFO and above
   - Debug only when necessary
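
Points 3 and 4 can be enforced centrally in the logger configuration rather than at every call site. A sketch with pino's redact option and an environment-driven level (the field paths and environment variable are assumptions):

```javascript
// Sketch: enforce PII masking and the log level centrally with pino.
// Redact paths and LOG_LEVEL variable are illustrative assumptions.
const pino = require('pino');

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',       // production default: INFO and above
  redact: {
    paths: ['password', 'token', 'user.credit_card', '*.authorization'],
    censor: '[REDACTED]',                        // masked instead of dropped
  },
});

logger.info({ user: { id: 'u1', credit_card: '4242424242424242' } }, 'Checkout started');
// => credit_card appears as "[REDACTED]" in the emitted JSON
```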

Traces

Characteristics

Format: Spans connected by trace_id
Granularity: Per request, across services
Volume: Medium (sampled in high traffic)
Retention: Short (hours to days)
Cost: Medium to high

Anatomy of a trace

Trace ID: abc-123-xyz

├─ Span: API Gateway (0-50ms)
│  └─ Span: Auth Service (10-30ms)
│     └─ Span: Redis Cache (15-20ms)
│
├─ Span: Order Service (50-250ms)
│  ├─ Span: PostgreSQL Query (60-150ms)
│  └─ Span: Inventory Check (160-200ms)
│
└─ Span: Payment Service (250-400ms)
   └─ Span: Stripe API (280-390ms)

Total: 400ms
Longest individual operations: Stripe API call (110ms), PostgreSQL Query (90ms)

Implementing tracing

// Example with OpenTelemetry (Node.js API)
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');

const span = tracer.startSpan('process_order', {
  attributes: {
    'order.id': orderId,
    'customer.id': customerId,
  },
});

// Run the work with `span` active so child spans are parented to it
// (inside an async request handler)
await context.with(trace.setSpan(context.active(), span), async () => {
  try {
    // Child span for DB operation
    const dbSpan = tracer.startSpan('db.query');
    const order = await db.getOrder(orderId);
    dbSpan.end();

    // Child span for payment
    const paymentSpan = tracer.startSpan('payment.process');
    await processPayment(order);
    paymentSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
});
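
The snippet above assumes a tracer has already been configured. A minimal sketch of the one-time setup with the OpenTelemetry Node SDK, an OTLP exporter, and ratio-based sampling; the endpoint, service name, and sampling rate are assumptions, and the package names are those of recent OpenTelemetry JS releases:

```javascript
// Sketch of one-time OpenTelemetry setup for a Node.js service.
// Endpoint, service name, and sampling ratio are illustrative assumptions.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  // Sample 10% of new traces, but always follow the parent's decision
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});

sdk.start();
```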

When to use traces

✅ Use for:
  - Identifying bottlenecks in slow requests
  - Understanding flow between microservices
  - Distributed latency debugging
  - Visualizing dependencies

❌ Don't use for:
  - Alerts (too granular)
  - Long-term trends (use metrics)
  - Detailed auditing (use logs)

Integrating the Three Pillars

Investigation flow

1. ALERT (metric)
   "Error rate rose to 5%"

2. CONTEXT (metric)
   "Increase correlated with DB latency"

3. INVESTIGATION (trace)
   "Slow requests go through query X"

4. DETAIL (log)
   "Query X failing with timeout due to lock"

5. ROOT CAUSE
   "Deploy Y introduced lock contention"

Correlation by trace_id

The trace_id connects everything:

Metric (via exemplar, since trace_id is too high-cardinality to be a regular label):
  http_request_duration = 2.5s, exemplar: trace_id="abc123"

Trace:
  trace_id: abc123
  spans: [gateway, auth, order, payment]
  duration: 2500ms
  bottleneck: payment (2000ms)

Log:
  {
    "trace_id": "abc123",
    "service": "payment",
    "message": "Stripe timeout after 2000ms",
    "error_code": "TIMEOUT"
  }
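
The simplest way to make that correlation automatic is to stamp every log entry with the currently active span's IDs. A sketch combining the OpenTelemetry API with a structured logger (the helper function is hypothetical):

```javascript
// Sketch: attach the active trace_id to every log entry so logs, traces,
// and metric exemplars can be joined later. The helper shape is an assumption.
const { trace } = require('@opentelemetry/api');
const pino = require('pino');

const logger = pino();

function logWithTrace(level, fields, message) {
  const activeSpan = trace.getActiveSpan();
  const ctx = activeSpan ? activeSpan.spanContext() : undefined;
  logger[level](
    { ...fields, trace_id: ctx?.traceId, span_id: ctx?.spanId },
    message
  );
}

// Inside a traced request handler:
logWithTrace('error', { service: 'payment', error_code: 'TIMEOUT' },
  'Stripe timeout after 2000ms');
```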

Practical investigation example

## Scenario: "System slow at 10am"

### 1. Metrics (Grafana)
- p95 latency: 2s (normal: 200ms)
- Throughput: normal
- Error rate: 3% (normal: 0.1%)

→ Problem confirmed, not perception

### 2. Drill-down in metrics
- Latency by endpoint: /api/checkout 10x slower
- Latency by service: Payment service degraded

→ Problem located in payment

### 3. Traces (Jaeger)
- Trace of slow request
- 90% of time in span "stripe_api_call"

→ Bottleneck is external call to Stripe

### 4. Logs (Elasticsearch)
```json
{
  "timestamp": "2024-01-15T10:05:00Z",
  "service": "payment",
  "message": "Stripe API retry attempt 3",
  "trace_id": "xyz789",
  "response_time_ms": 1800,
  "stripe_error": "rate_limited"
}

```

→ Root cause: Stripe rate limiting

### 5. Resolution

  • Implement a circuit breaker (see the sketch below)
  • Add a validation cache
  • Negotiate a higher rate limit with Stripe
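
As an illustration of the first item, a circuit breaker around the Stripe call could look like the following sketch using the opossum library; the library choice, thresholds, and callStripe function are assumptions:

```javascript
// Sketch: wrap the external Stripe call in a circuit breaker (opossum).
// Library choice, thresholds, and callStripe() are illustrative assumptions.
const CircuitBreaker = require('opossum');

async function callStripe(payment) {
  // ... actual HTTP call to Stripe goes here
}

const breaker = new CircuitBreaker(callStripe, {
  timeout: 3000,                 // fail fast instead of waiting on a slow Stripe
  errorThresholdPercentage: 50,  // open the circuit when half the calls fail
  resetTimeout: 30000,           // try a probe request again after 30s
});

breaker.fallback(() => ({ status: 'queued_for_retry' }));

// Usage inside the payment service:
// const result = await breaker.fire(payment);
```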

## Tools by Pillar

### Open source stack

```yaml
Metrics:
  Collection: Prometheus, Victoria Metrics
  Visualization: Grafana
  Alerts: Alertmanager

Logs:
  Collection: Fluentd, Fluent Bit, Vector
  Storage: Elasticsearch, Loki
  Visualization: Kibana, Grafana

Traces:
  Collection: OpenTelemetry, Jaeger Agent
  Storage: Jaeger, Tempo, Zipkin
  Visualization: Jaeger UI, Grafana

```

### Managed stack

All-in-one:
  - Datadog
  - New Relic
  - Dynatrace
  - Splunk

By pillar:
  Metrics: CloudWatch, Datadog
  Logs: CloudWatch Logs, Papertrail
  Traces: X-Ray, Honeycomb

Conclusion

The three pillars are complementary:

  1. Metrics: Detect that something is wrong (alerts)
  2. Traces: Show where the problem is (location)
  3. Logs: Explain why it happened (context)

Use all three together:

  • Correlate by trace_id
  • Start with metrics for overview
  • Use traces to locate
  • Use logs for detail

A truly observable system lets you answer questions you didn't think to ask before the problem appeared.


This article is part of the series on the OCTOPUS Performance Engineering methodology.


Want to understand your platform's limits?

Contact us for a performance assessment.
