Event-driven systems process work asynchronously through messages. This fundamentally changes how we measure and optimize performance. Traditional metrics like "response time" need to be rethought.
In synchronous systems, you measure latency. In asynchronous systems, you measure throughput and lag.
## Characteristics of Event-Driven Systems

### Synchronous vs. asynchronous flow

```
Synchronous:
Request → Processing → Response
         [100ms total]

Asynchronous:
Request → Queue → ACK (5ms)
            ↓
  Consumer processes (100ms)
            ↓
    Result available
```
### Different metrics

| Synchronous | Asynchronous |
|---|---|
| Latency | End-to-end latency |
| Response time | Consumer lag |
| Requests/second | Messages/second |
| Timeout | TTL (Time to Live) |
## Essential Metrics

### 1. Consumer Lag

The difference between the last message published and the last message consumed.

```
Producer published:  message #1000
Consumer processed:  message #800
Consumer lag:        200 messages
```

Expressed in time:

```
Lag in messages:     200
Consumer throughput: 50 msg/s
Lag in time:         200 / 50 = 4 seconds
```

Prometheus + Kafka:

```
# Lag per consumer group
kafka_consumergroup_lag{topic="orders"}
```
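If you need the number outside of Prometheus, the same lag can be computed directly from offsets. A minimal sketch with the confluent-kafka Python client, assuming a local broker and an `orders` topic with three partitions (all names are illustrative):

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'orders-consumer',
    'enable.auto.commit': False,
})

partitions = [TopicPartition('orders', p) for p in range(3)]

total_lag = 0
for tp in consumer.committed(partitions, timeout=10):
    # High watermark = the offset the next produced message will get
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet
    total_lag += high - committed

print(f"Total lag: {total_lag} messages")
consumer.close()
```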
### 2. Throughput

```
Producer throughput: 1000 msg/s
Consumer throughput:  800 msg/s
Gap:                  200 msg/s

→ Lag grows by 200 msg/s, continuously
→ The system is not sustainable
```

**Golden rule:**

```
Consumer throughput ≥ Producer throughput (with margin for peaks)
```
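To know whether the rule holds, you need the consumer side of the equation measured. A rough sketch of a rolling throughput counter, where `consumer` and `process` stand in for your client and handler:

```python
import time

count = 0
window_start = time.monotonic()

for message in consumer:
    process(message)
    count += 1
    elapsed = time.monotonic() - window_start
    if elapsed >= 10:  # report every ~10 seconds
        print(f"Consumer throughput: {count / elapsed:.0f} msg/s")
        count, window_start = 0, time.monotonic()
```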
### 3. Processing Time

Time to process an individual message.

```
Message received:    14:00:00.000
Processing started:  14:00:00.005
Processing finished: 14:00:00.105

Processing time: 100ms
Wait time:       5ms
```
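A common way to track this in production is a histogram. A sketch using the prometheus_client library, where the metric name and `process()` are placeholders:

```python
from prometheus_client import Histogram

PROCESSING_SECONDS = Histogram(
    'message_processing_seconds',
    'Time spent processing a single message',
)

def handle(message):
    with PROCESSING_SECONDS.time():  # observes elapsed seconds on exit
        process(message)
```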
### 4. End-to-End Latency

Total time from production to complete processing.

```
Event created:       14:00:00.000
Published to broker: 14:00:00.010
Consumed:            14:00:00.050
Processed:           14:00:00.150

End-to-end: 150ms
```
## Common Bottlenecks

### 1. Slow consumer

```
Producer: 1000 msg/s
Consumer:  100 msg/s
Gap:       900 msg/s

Lag after 1 hour: 900 × 3600 = 3.24 million messages
```

Solutions:

1. More consumers (parallel partitions)
2. Batching in processing
3. Optimize consumer logic
4. Async I/O within the consumer (see the sketch below)
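For solution 4, a hedged sketch using asyncio, assuming an async client in the style of aiokafka (which supports `async for`) and an async `process()` coroutine; the concurrency limit is illustrative:

```python
import asyncio

async def consume(consumer, concurrency=10):
    sem = asyncio.Semaphore(concurrency)  # bound in-flight messages

    async def handle(msg):
        async with sem:
            await process(msg)  # I/O waits overlap across messages

    tasks = set()
    async for msg in consumer:
        task = asyncio.create_task(handle(msg))
        tasks.add(task)
        task.add_done_callback(tasks.discard)  # avoid leaking references
```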
### 2. Unbalanced partitions

```
Partition 0: 500 msg/s → Consumer A (overloaded)
Partition 1: 100 msg/s → Consumer B (idle)
Partition 2: 100 msg/s → Consumer C (idle)
```

Solutions:

1. Improve the partition key (see the sketch below)
2. More partitions
3. Manual rebalancing
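For solution 1, the key's cardinality is what matters. An illustrative sketch with the confluent-kafka producer, where the topic and fields are examples:

```python
import json
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
order = {'order_id': 'o-123', 'country': 'BR', 'total': 99.0}
payload = json.dumps(order)

# ❌ Low-cardinality key: most orders share one value → one hot partition
producer.produce('orders', key=order['country'], value=payload)

# ✅ High-cardinality key: hashing spreads load across all partitions
producer.produce('orders', key=order['order_id'], value=payload)
producer.flush()
```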
### 3. Saturated broker

```
Broker CPU: 95%
Disk:       90% IOPS
Network:    saturated

→ Publishing latency increases
→ Consumers receive messages late
```

Solutions:

1. More brokers
2. Compression
3. Producer batching (see the config sketch below)
4. Shorter retention
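Solutions 2 and 3 are often just producer configuration. A sketch using confluent-kafka (librdkafka) setting names; the values are illustrative starting points, not tuned recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'compression.type': 'lz4',  # smaller payloads on network and disk
    'linger.ms': 10,            # wait up to 10ms to fill a batch
    'batch.size': 65536,        # up to 64KB per request batch
})
```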
### 4. Infinite reprocessing

```
Message fails → back to the queue
Fails again   → back to the queue
...forever

→ Resources consumed by messages that will never succeed
```

Solutions:

1. Dead Letter Queue (DLQ)
2. Retry with a limit
3. Exponential backoff

```python
# Retry with DLQ
max_retries = 3
retry_count = message.headers.get('retry_count', 0)

if retry_count >= max_retries:
    send_to_dlq(message)  # give up: park the message for inspection
else:
    try:
        process(message)
    except Exception:
        message.headers['retry_count'] = retry_count + 1
        republish_with_delay(message)  # back off before the next attempt
```
## Optimization Patterns

### 1. Batching

```python
# ❌ Commit per message
for message in messages:
    process(message)
    consumer.commit()  # one I/O round trip per message

# ✅ Batch commit
batch = []
for message in messages:
    batch.append(process(message))
    if len(batch) >= 100:
        save_batch(batch)
        consumer.commit()
        batch = []

if batch:  # flush the final partial batch
    save_batch(batch)
    consumer.commit()
```
### 2. Prefetch

```
# Kafka consumer config
fetch.min.bytes = 1024    # wait to accumulate data
fetch.max.wait.ms = 500   # ...or up to 500ms
max.poll.records = 500    # up to 500 records per poll
```
### 3. Parallelism

```python
# Consumer with a thread pool
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=10)

for message in consumer:
    executor.submit(process, message)
```

**Caution:** message order is not guaranteed with parallelism!
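If ordering matters per key, you can still parallelize across keys while keeping each key sequential. A hedged sketch that pins each key to one single-threaded worker, mirroring how partitions work (the `message.key` attribute is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 10
workers = [ThreadPoolExecutor(max_workers=1) for _ in range(NUM_WORKERS)]

def submit_ordered(message):
    # Same key → same worker → messages for a key run in order
    worker = workers[hash(message.key) % NUM_WORKERS]
    worker.submit(process, message)
```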
### 4. Compression

```
# Producer config
compression.type = 'lz4'   # fast
# or 'zstd'                # better compression ratio
```

```
Without compression: 100MB/s bandwidth
With LZ4:             25MB/s bandwidth (4x reduction)
```
## Testing Event-Driven Systems

### Challenges

Synchronous:

- Request → Response
- Time is measured directly

Asynchronous:

- Publish → ??? → Result somewhere
- How do you measure end-to-end?
### Techniques

#### 1. Correlation IDs

```python
import time
from uuid import uuid4

# Producer
event = {
    'correlation_id': str(uuid4()),
    'timestamp': time.time(),
    'data': {...}
}
publish(event)
store_correlation(event['correlation_id'], event['timestamp'])

# Consumer
def process(event):
    result = do_work(event)
    publish_result({
        'correlation_id': event['correlation_id'],
        'completed_at': time.time()
    })

# Collector
def collect_result(result):
    start = get_stored_timestamp(result['correlation_id'])
    end_to_end = result['completed_at'] - start
    record_metric('e2e_latency', end_to_end)
```
#### 2. Synthetic events

```python
# Inject test events periodically
@every(minute=1)
def inject_synthetic():
    publish({
        'type': 'synthetic_test',
        'timestamp': time.time()
    })

# Consumer detects and measures
def process(event):
    if event.get('type') == 'synthetic_test':
        latency = time.time() - event['timestamp']
        record_metric('synthetic_e2e', latency)
        return  # don't process further
    # Normal processing
    ...
```
#### 3. Async load testing

```javascript
// k6 for asynchronous systems
import { check } from 'k6';
import kafka from 'k6/x/kafka';

export default function () {
    // Publish
    kafka.produce({
        topic: 'orders',
        messages: [{ value: JSON.stringify({ order_id: __VU }) }]
    });

    // Verify the result (eventually):
    // poll another topic or a database
}
```
### Test metrics

1. **Maximum sustainable throughput**
   - Increase the load until lag starts growing
   - The point just before that is the maximum (see the sketch below)
2. **Latency under load**
   - Measure end-to-end latency at different load levels
   - Identify the inflection point
3. **Recovery time**
   - Inject a spike, then stop
   - Measure the time until lag returns to zero
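Test 1 can be automated as a step test. A hedged sketch where `publish_at_rate()` and `get_consumer_lag()` are hypothetical helpers wrapping your load tool and lag metric:

```python
def max_sustainable_throughput(rates=(100, 200, 500, 1000, 2000)):
    """Return the highest publish rate at which lag does not grow."""
    last_ok = 0
    for rate in rates:
        start_lag = get_consumer_lag()
        publish_at_rate(rate, duration_s=60)  # hold the load for a minute
        end_lag = get_consumer_lag()
        if end_lag > start_lag:  # lag grew → this rate is unsustainable
            return last_ok
        last_ok = rate
    return last_ok
```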
## Production Monitoring

### Essential dashboard

1. Consumer lag (absolute and in time)
2. Throughput: produced vs. consumed
3. Processing time (p50, p95, p99)
4. Error rate per consumer
5. Partition distribution
6. Broker health
### Alerts

```yaml
# Consumer lag increasing
- alert: ConsumerLagIncreasing
  expr: |
    delta(kafka_consumergroup_lag[5m]) > 1000
  for: 5m

# High absolute lag
- alert: ConsumerLagHigh
  expr: |
    kafka_consumergroup_lag > 10000
  for: 2m

# Stalled consumer
- alert: ConsumerStalled
  expr: |
    rate(kafka_consumer_records_consumed_total[5m]) == 0
  for: 5m
```
## Conclusion

Performance in event-driven systems requires a different mindset:

- Focus on throughput, not just latency
- Consumer lag is the single most important metric
- Partition balancing is critical
- Retries with limits and a DLQ prevent failure cascades
- End-to-end tests require correlation IDs

The fundamental rule:

```
Consumer throughput > Producer throughput × safety margin
```

If this inequality doesn't hold, lag will only grow.
A healthy event-driven system has lag close to zero. Growing lag is technical debt accumulating interest.