Some systems simply cannot fail. Financial systems, healthcare, critical infrastructure — when performance degrades or the system goes down, the consequences go beyond unhappy users. This article explores performance practices for mission-critical systems.
In critical systems, "working" isn't enough. The system has to work well, always, under any conditions.
Characteristics of Critical Systems
Typical requirements
- Availability: 99.99% (≈52 minutes of downtime per year; worked out below)
- p99 latency: < 50ms
- Error rate: < 0.01%
- Recovery time: < 30 seconds
- Data loss: zero
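To make the availability target concrete, here is a quick sketch that converts an availability percentage into a yearly downtime budget (the 99.99% figure above works out to roughly 52 minutes):

```python
# Convert an availability target into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.5%} -> {downtime_budget_minutes(target):.1f} min/year")
# 99.90000% -> 525.6 min/year
# 99.99000% -> 52.6 min/year
# 99.99900% -> 5.3 min/year
```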
Examples
- Financial: trading, payments, core banking
- Healthcare: patient monitoring, prescriptions
- Infrastructure: energy, telecom, transportation
- Government: elections, emergency services, defense
Fundamental Principles
1. Defense in Depth
Multiple layers of protection:
- Layer 1: Input validation
- Layer 2: Timeouts and circuit breakers (sketched below)
- Layer 3: Service redundancy
- Layer 4: Data replication
- Layer 5: Disaster recovery
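As an illustration of layers 1 and 2, here is a minimal, hand-rolled circuit breaker sketch; the thresholds are assumptions, and in production you would more likely use an established library than this simplified version.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures instead of queueing up slow calls."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        else:
            self.failures = 0  # success closes the circuit again
            return result
```

A caller wraps risky calls, e.g. `breaker.call(gateway.charge, payment)`, so a struggling dependency fails fast instead of tying up threads until every request times out.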
2. Fail-Safe Defaults
```python
# ❌ Silent failure
def process_payment(payment):
    try:
        return gateway.charge(payment)
    except Exception:
        return None  # Ignores the error

# ✅ Safe and explicit failure
def process_payment(payment):
    try:
        result = gateway.charge(payment)
        if not result.is_verified():
            raise PaymentVerificationFailed()
        return result
    except Exception as e:
        log_critical(e)
        alert_oncall()
        raise PaymentProcessingError(original=e)
```
3. Graceful Degradation
```python
def get_user_recommendations(user_id):
    try:
        # Try the full service
        return recommendation_service.get_personalized(user_id)
    except TimeoutError:
        # Fallback 1: local cache
        cached = local_cache.get(user_id)
        if cached:
            return cached
        # Fallback 2: generic popular items
        return get_popular_items()
```
Patterns for High Performance
1. Active-Active Redundancy
```
Primary Region: us-east-1
  Load: 50%
  Status: Active

Secondary Region: us-west-2
  Load: 50%
  Status: Active

Failover: Automatic via DNS
RTO: 30 seconds
RPO: 0 (sync replication)
```
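The failover itself is handled by DNS at the infrastructure level; as a complementary, hypothetical sketch using the `requests` library, a client can also retry the other region when the first one is unreachable (the endpoints and the 500 ms timeout are illustrative):

```python
import requests

# Hypothetical per-region endpoints for the same active-active service
REGION_ENDPOINTS = [
    "https://api.us-east-1.example.com",
    "https://api.us-west-2.example.com",
]

def get_with_failover(path, timeout=0.5):
    last_error = None
    for base_url in REGION_ENDPOINTS:
        try:
            response = requests.get(f"{base_url}{path}", timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc  # try the next region
    raise last_error
```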
2. Read-Write Separation
```python
class Database:
    def __init__(self):
        self.primary = connect_primary()
        self.replicas = connect_replicas()

    def write(self, query):
        # All writes go to the primary
        return self.primary.execute(query)

    def read(self, query):
        # Reads are spread across the replicas
        replica = self.load_balance(self.replicas)
        return replica.execute(query)
```
3. Request Hedging
```python
import asyncio

async def critical_query(query):
    """Send the query to multiple replicas and use the first response."""
    tasks = [
        asyncio.create_task(replica.execute(query))
        for replica in replicas[:3]
    ]

    # Return as soon as the first replica answers
    done, pending = await asyncio.wait(
        tasks,
        return_when=asyncio.FIRST_COMPLETED,
    )

    # Cancel the queries that are still in flight
    for task in pending:
        task.cancel()

    return done.pop().result()
```
4. Pre-computation
```python
# Pre-compute critical results on a schedule
@scheduled(every='5 minutes')
def precompute_dashboards():
    for user in premium_users:
        dashboard_data = compute_dashboard(user)
        cache.set(f'dashboard:{user.id}', dashboard_data)

# The request path only fetches from the cache
def get_dashboard(user_id):
    return cache.get(f'dashboard:{user_id}')
```
Testing Critical Systems
Chaos Engineering
```python
import random

# Chaos Monkey: kills a random instance
def chaos_test():
    instance = random.choice(production_instances)
    instance.terminate()

    assert system_health() == 'healthy'
    assert latency_p99() < SLO

# Chaos Kong: simulates a region failure
def region_failover_test():
    disable_region('us-east-1')

    assert all_requests_succeed()
    assert latency_increase() < 0.20  # less than a 20% increase
```
Game Days
```markdown
## Game Day Plan: DB Failover

### Objective
Validate that DB failover completes in < 30s without data loss

### Prerequisites
- [ ] Backup verified
- [ ] Runbook updated
- [ ] On-call alerted

### Execution
1. 10:00 - Start monitoring
2. 10:05 - Simulate primary failure
3. 10:06 - Verify automatic failover
4. 10:10 - Validate data integrity
5. 10:15 - Restore primary
6. 10:20 - Verify failback

### Metrics
- Failover time: ___s
- Lost requests: ___
- Lost data: ___

### Result
- [ ] Passed
- [ ] Failed
```
Extreme Load Testing
```yaml
# Test beyond the expected load
Scenarios:
  Normal:  1x load
  Peak:    5x load
  Extreme: 10x load
  Absurd:  20x load

For each scenario:
  - Latency still within SLO?
  - Errors still within SLO?
  - System recovers when load decreases?
  - How long for recovery?
```
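One way to drive these scenarios (a minimal sketch; the host, endpoint, pacing, and user counts are illustrative) is a Locust test run repeatedly with increasing user counts:

```python
# locustfile.py
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # Each simulated user waits 100-500 ms between requests
    wait_time = between(0.1, 0.5)

    @task
    def get_dashboard(self):
        self.client.get("/dashboard")
```

Running the same file at 1x, 5x, 10x, and 20x of the normal user count, e.g. `locust -f locustfile.py --headless --host https://staging.example.com --users 1000 --spawn-rate 100`, answers the four questions above for each scenario.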
Monitoring Critical Systems
Golden Signals with Strict SLOs
```yaml
Latency:
  p50:   < 20ms
  p95:   < 50ms
  p99:   < 100ms
  p99.9: < 500ms

Traffic:
  Expected: 10,000 req/s
  Alert: < 8,000 or > 15,000

Errors:
  Total: < 0.01%
  5xx:   < 0.001%

Saturation:
  CPU:         < 60%
  Memory:      < 70%
  Connections: < 80%
```
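The error-rate SLO maps directly to an error budget (the war-room dashboard below tracks "Error Budget Remaining"); a quick sketch of the arithmetic, assuming a 30-day window and the expected 10,000 req/s:

```python
# Error budget over a 30-day window at a 0.01% error-rate SLO.
requests_per_second = 10_000
window_seconds = 30 * 24 * 3600
slo_error_rate = 0.0001  # 0.01%

total_requests = requests_per_second * window_seconds
error_budget = int(total_requests * slo_error_rate)
print(f"{error_budget:,} errors allowed out of {total_requests:,} requests")
# 2,592,000 errors allowed out of 25,920,000,000 requests
```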
Multi-level Alerts
```yaml
# Tier 1: Immediate action (page)
- alert: CriticalLatency
  expr: latency_p99 > 100ms
  for: 1m
  severity: page

# Tier 2: Investigate soon (urgent ticket)
- alert: HighLatency
  expr: latency_p99 > 50ms
  for: 5m
  severity: ticket_urgent

# Tier 3: Monitor (normal ticket)
- alert: ElevatedLatency
  expr: latency_p99 > 30ms
  for: 15m
  severity: ticket
```
War Room Dashboards
```
┌──────────────────────────────────────────────────────┐
│ SYSTEM STATUS: 🟢 HEALTHY                             │
├──────────────────────────────────────────────────────┤
│ Latency p99:  32ms          (SLO: 50ms)   [====    ]  │
│ Error Rate:   0.003%        (SLO: 0.01%)  [=       ]  │
│ Throughput:   12,342 req/s                [======  ]  │
│ CPU:          45%                         [====    ]  │
├──────────────────────────────────────────────────────┤
│ Active Incidents: 0                                   │
│ Error Budget Remaining: 87%                           │
│ Last Incident: 14 days ago                            │
└──────────────────────────────────────────────────────┘
```
Runbooks for Incidents
Structure
```markdown
## Runbook: High Latency Alert

### Symptoms
- p99 latency > 50ms for > 1 minute
- Alert: CriticalLatency

### Impact
- Users experience slowness
- Possible client timeouts
- SLO at risk

### Diagnosis
1. Check dashboard: grafana.internal/critical
2. Identify slow component:
   `kubectl top pods -n production`
3. Check dependencies:
   `curl -s http://health.internal/dependencies`

### Mitigation
**If high CPU:**
- Scale horizontally: `kubectl scale deployment/api --replicas=10`

**If slow DB:**
- Check queries: `SELECT * FROM pg_stat_activity`
- Consider failover to replica

**If external dependency:**
- Activate circuit breaker: `curl -X POST http://api/circuit/external/open`

### Escalation
- 5 min without resolution: Call on-call lead
- 15 min without resolution: Call engineering manager
- 30 min without resolution: Activate war room
```
Organizational Practices
On-call Structure
- Primary: responds in 5 min
- Secondary: backup if the primary doesn't respond in 10 min
- Tertiary: specialist for complex cases
- Rotation: weekly
- Compensation: day off per on-call week
Post-incident Review
```markdown
## Incident Review: 2025-01-15 Latency Spike

### Summary
p99 latency reached 2s for 15 minutes due to an N+1 query
introduced in the deploy at 14:30.

### Timeline
- 14:30 - Deploy of version 2.3.1
- 14:45 - First latency alert
- 14:47 - On-call started investigation
- 15:00 - N+1 query identified
- 15:05 - Rollback initiated
- 15:08 - System normalized

### Impact
- 15 minutes of degradation
- 0.5% of requests timed out
- No data loss

### Root Cause
N+1 query introduced in PR #1234, not detected in review
or in performance tests.

### Actions
1. [ ] Add a performance test for the affected endpoint
2. [ ] Implement automatic N+1 detection in CI
3. [ ] Review the code review checklist

### Lessons
- Changes to critical queries need load testing
- Alerts worked well
- Rollback was quick and effective
```
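As an illustration of action item 2, here is a minimal sketch of a query-count guard that can run in CI, assuming SQLAlchemy; the `client` and `engine` fixtures, the endpoint, and the limit of 5 queries are hypothetical:

```python
# test_query_budget.py -- fail CI when an endpoint issues too many queries.
from contextlib import contextmanager
from sqlalchemy import event

@contextmanager
def query_counter(engine):
    """Record every SQL statement executed on the given engine."""
    statements = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        statements.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    try:
        yield statements
    finally:
        event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_dashboard_endpoint_has_no_n_plus_one(client, engine):
    with query_counter(engine) as statements:
        client.get("/dashboard")
    # An N+1 regression shows up as a query count that scales with the data
    assert len(statements) <= 5, f"too many queries: {len(statements)}"
```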
Conclusion
Performance in critical systems requires:
- Redundancy at all layers
- Fail-safe by default
- Extreme tests (chaos, game days)
- Obsessive monitoring
- Detailed runbooks
- Blameless post-mortem culture
The difference between critical and non-critical systems isn't just technical:
- Non-critical: "If it fails, users complain."
- Critical: "If it fails, the consequences are irreversible."
Plan for the worst case. Test the unthinkable. Document everything.
In critical systems, paranoia is a feature, not a bug.