Event-driven systems process work asynchronously through messages. This fundamentally changes how we measure and optimize performance. Traditional metrics like "response time" need to be rethought.
In synchronous systems, you measure latency. In asynchronous systems, you measure throughput and lag.
## Characteristics of Event-Driven Systems

### Synchronous vs. asynchronous flow

```
Synchronous:
Request → Processing → Response
         [100ms total]

Asynchronous:
Request → Queue → ACK (5ms)
            ↓
  Consumer processes (100ms)
            ↓
    Result available
```
### Different metrics

| Synchronous | Asynchronous |
|---|---|
| Latency | End-to-end latency |
| Response time | Consumer lag |
| Requests/second | Messages/second |
| Timeout | TTL (Time to Live) |
## Essential Metrics

### 1. Consumer Lag

The difference between the last message published and the last message consumed.

```
Producer published:  message #1000
Consumer processed:  message #800
Consumer lag:        200 messages
```

Expressed in time:

```
Lag in messages:     200
Consumer throughput: 50 msg/s
Lag in time:         200 / 50 = 4 seconds
```

Prometheus + Kafka:

```
# Lag per consumer group
kafka_consumergroup_lag{topic="orders"}
```
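If you need the number outside of Prometheus, the same lag can be computed directly from offsets. A minimal sketch with the confluent-kafka Python client, assuming a local broker and an `orders` topic with three partitions (all names are illustrative):

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'orders-consumer',
    'enable.auto.commit': False,
})

partitions = [TopicPartition('orders', p) for p in range(3)]

total_lag = 0
for tp in consumer.committed(partitions, timeout=10):
    # High watermark = the offset the next produced message will get
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet
    total_lag += high - committed

print(f"Total lag: {total_lag} messages")
consumer.close()
```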
### 2. Throughput

```
Producer throughput: 1000 msg/s
Consumer throughput:  800 msg/s
Gap:                  200 msg/s

→ Lag grows by 200 msg/s, continuously
→ The system is not sustainable
```

**Golden rule:**

```
Consumer throughput ≥ Producer throughput (with margin for peaks)
```
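To know whether the rule holds, you need the consumer side of the equation measured. A rough sketch of a rolling throughput counter, where `consumer` and `process` stand in for your client and handler:

```python
import time

count = 0
window_start = time.monotonic()

for message in consumer:
    process(message)
    count += 1
    elapsed = time.monotonic() - window_start
    if elapsed >= 10:  # report every ~10 seconds
        print(f"Consumer throughput: {count / elapsed:.0f} msg/s")
        count, window_start = 0, time.monotonic()
```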
### 3. Processing Time

Time to process an individual message.

```
Message received:    14:00:00.000
Processing started:  14:00:00.005
Processing finished: 14:00:00.105

Processing time: 100ms
Wait time:       5ms
```
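A common way to track this in production is a histogram. A sketch using the prometheus_client library, where the metric name and `process()` are placeholders:

```python
from prometheus_client import Histogram

PROCESSING_SECONDS = Histogram(
    'message_processing_seconds',
    'Time spent processing a single message',
)

def handle(message):
    with PROCESSING_SECONDS.time():  # observes elapsed seconds on exit
        process(message)
```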
### 4. End-to-End Latency

Total time from production to complete processing.

```
Event created:       14:00:00.000
Published to broker: 14:00:00.010
Consumed:            14:00:00.050
Processed:           14:00:00.150

End-to-end: 150ms
```
## Common Bottlenecks

### 1. Slow consumer

```
Producer: 1000 msg/s
Consumer:  100 msg/s
Gap:       900 msg/s

Lag after 1 hour: 900 × 3600 = 3.24 million messages
```

Solutions:

1. More consumers (parallel partitions)
2. Batching in processing
3. Optimize consumer logic
4. Async I/O within the consumer (see the sketch below)
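For solution 4, a hedged sketch using asyncio, assuming an async client in the style of aiokafka (which supports `async for`) and an async `process()` coroutine; the concurrency limit is illustrative:

```python
import asyncio

async def consume(consumer, concurrency=10):
    sem = asyncio.Semaphore(concurrency)  # bound in-flight messages

    async def handle(msg):
        async with sem:
            await process(msg)  # I/O waits overlap across messages

    tasks = set()
    async for msg in consumer:
        task = asyncio.create_task(handle(msg))
        tasks.add(task)
        task.add_done_callback(tasks.discard)  # avoid leaking references
```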
### 2. Unbalanced partitions

```
Partition 0: 500 msg/s → Consumer A (overloaded)
Partition 1: 100 msg/s → Consumer B (idle)
Partition 2: 100 msg/s → Consumer C (idle)
```

Solutions:

1. Improve the partition key (see the sketch below)
2. More partitions
3. Manual rebalancing
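For solution 1, the key's cardinality is what matters. An illustrative sketch with the confluent-kafka producer, where the topic and fields are examples:

```python
import json
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
order = {'order_id': 'o-123', 'country': 'BR', 'total': 99.0}
payload = json.dumps(order)

# ❌ Low-cardinality key: most orders share one value → one hot partition
producer.produce('orders', key=order['country'], value=payload)

# ✅ High-cardinality key: hashing spreads load across all partitions
producer.produce('orders', key=order['order_id'], value=payload)
producer.flush()
```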
### 3. Saturated broker

```
Broker CPU: 95%
Disk:       90% IOPS
Network:    saturated

→ Publishing latency increases
→ Consumers receive messages late
```

Solutions:

1. More brokers
2. Compression
3. Producer batching (see the config sketch below)
4. Shorter retention
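Solutions 2 and 3 are often just producer configuration. A sketch using confluent-kafka (librdkafka) setting names; the values are illustrative starting points, not tuned recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'compression.type': 'lz4',  # smaller payloads on network and disk
    'linger.ms': 10,            # wait up to 10ms to fill a batch
    'batch.size': 65536,        # up to 64KB per request batch
})
```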
### 4. Infinite reprocessing

```
Message fails → back to the queue
Fails again   → back to the queue
...forever

→ Resources consumed by messages that will never succeed
```

Solutions:

1. Dead Letter Queue (DLQ)
2. Retry with a limit
3. Exponential backoff

```python
# Retry with DLQ
max_retries = 3
retry_count = message.headers.get('retry_count', 0)

if retry_count >= max_retries:
    send_to_dlq(message)  # give up: park the message for inspection
else:
    try:
        process(message)
    except Exception:
        message.headers['retry_count'] = retry_count + 1
        republish_with_delay(message)  # back off before the next attempt
```
## Optimization Patterns

### 1. Batching

```python
# ❌ Commit per message
for message in messages:
    process(message)
    consumer.commit()  # one I/O round trip per message

# ✅ Batch commit
batch = []
for message in messages:
    batch.append(process(message))
    if len(batch) >= 100:
        save_batch(batch)
        consumer.commit()
        batch = []

if batch:  # flush the final partial batch
    save_batch(batch)
    consumer.commit()
```
### 2. Prefetch

```
# Kafka consumer config
fetch.min.bytes = 1024    # wait to accumulate data
fetch.max.wait.ms = 500   # ...or up to 500ms
max.poll.records = 500    # up to 500 records per poll
```
### 3. Parallelism

```python
# Consumer with a thread pool
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=10)

for message in consumer:
    executor.submit(process, message)
```

**Caution:** message order is not guaranteed with parallelism!
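If ordering matters per key, you can still parallelize across keys while keeping each key sequential. A hedged sketch that pins each key to one single-threaded worker, mirroring how partitions work (the `message.key` attribute is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 10
workers = [ThreadPoolExecutor(max_workers=1) for _ in range(NUM_WORKERS)]

def submit_ordered(message):
    # Same key → same worker → messages for a key run in order
    worker = workers[hash(message.key) % NUM_WORKERS]
    worker.submit(process, message)
```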
### 4. Compression

```
# Producer config
compression.type = 'lz4'   # fast
# or 'zstd'                # better compression ratio
```

```
Without compression: 100MB/s bandwidth
With LZ4:             25MB/s bandwidth (4x reduction)
```
## Testing Event-Driven Systems

### Challenges

Synchronous:

- Request → Response
- Time is measured directly

Asynchronous:

- Publish → ??? → Result somewhere
- How do you measure end-to-end?
### Techniques

#### 1. Correlation IDs

```python
import time
from uuid import uuid4

# Producer
event = {
    'correlation_id': str(uuid4()),
    'timestamp': time.time(),
    'data': {...}
}
publish(event)
store_correlation(event['correlation_id'], event['timestamp'])

# Consumer
def process(event):
    result = do_work(event)
    publish_result({
        'correlation_id': event['correlation_id'],
        'completed_at': time.time()
    })

# Collector
def collect_result(result):
    start = get_stored_timestamp(result['correlation_id'])
    end_to_end = result['completed_at'] - start
    record_metric('e2e_latency', end_to_end)
```
#### 2. Synthetic events

```python
# Inject test events periodically
@every(minute=1)
def inject_synthetic():
    publish({
        'type': 'synthetic_test',
        'timestamp': time.time()
    })

# Consumer detects and measures
def process(event):
    if event.get('type') == 'synthetic_test':
        latency = time.time() - event['timestamp']
        record_metric('synthetic_e2e', latency)
        return  # don't process further
    # Normal processing
    ...
```
#### 3. Async load testing

```javascript
// k6 for asynchronous systems
import { check } from 'k6';
import kafka from 'k6/x/kafka';

export default function () {
    // Publish
    kafka.produce({
        topic: 'orders',
        messages: [{ value: JSON.stringify({ order_id: __VU }) }]
    });

    // Verify the result (eventually):
    // poll another topic or a database
}
```
### Test metrics

1. **Maximum sustainable throughput**
   - Increase the load until lag starts growing
   - The point just before that is the maximum (see the sketch below)
2. **Latency under load**
   - Measure end-to-end latency at different load levels
   - Identify the inflection point
3. **Recovery time**
   - Inject a spike, then stop
   - Measure the time until lag returns to zero
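Test 1 can be automated as a step test. A hedged sketch where `publish_at_rate()` and `get_consumer_lag()` are hypothetical helpers wrapping your load tool and lag metric:

```python
def max_sustainable_throughput(rates=(100, 200, 500, 1000, 2000)):
    """Return the highest publish rate at which lag does not grow."""
    last_ok = 0
    for rate in rates:
        start_lag = get_consumer_lag()
        publish_at_rate(rate, duration_s=60)  # hold the load for a minute
        end_lag = get_consumer_lag()
        if end_lag > start_lag:  # lag grew → this rate is unsustainable
            return last_ok
        last_ok = rate
    return last_ok
```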
## Production Monitoring

### Essential dashboard

1. Consumer lag (absolute and in time)
2. Throughput: produced vs. consumed
3. Processing time (p50, p95, p99)
4. Error rate per consumer
5. Partition distribution
6. Broker health
### Alerts

```yaml
# Consumer lag increasing
- alert: ConsumerLagIncreasing
  expr: |
    delta(kafka_consumergroup_lag[5m]) > 1000
  for: 5m

# High absolute lag
- alert: ConsumerLagHigh
  expr: |
    kafka_consumergroup_lag > 10000
  for: 2m

# Stalled consumer
- alert: ConsumerStalled
  expr: |
    rate(kafka_consumer_records_consumed_total[5m]) == 0
  for: 5m
```
## Conclusion

Performance in event-driven systems requires a different mindset:

- Focus on throughput, not just latency
- Consumer lag is the single most important metric
- Partition balancing is critical
- Retries with limits and a DLQ prevent failure cascades
- End-to-end tests require correlation IDs

The fundamental rule:

```
Consumer throughput > Producer throughput × safety margin
```

If this inequality doesn't hold, lag will only grow.
A healthy event-driven system has lag close to zero. Growing lag is technical debt accumulating interest.