Some systems simply cannot fail. Financial systems, healthcare, critical infrastructure — when performance degrades or the system goes down, the consequences go beyond unhappy users. This article explores performance practices for mission-critical systems.
In critical systems, "working" isn't enough. The system has to work well, always, under any conditions.
Characteristics of Critical Systems
Typical requirements
- Availability: 99.99% (≈52 minutes of downtime per year; worked out below)
- p99 latency: < 50ms
- Error rate: < 0.01%
- Recovery time: < 30 seconds
- Data loss: zero
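To make the availability target concrete, here is a quick sketch that converts an availability percentage into a yearly downtime budget (the 99.99% figure above works out to roughly 52 minutes):

```python
# Convert an availability target into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.5%} -> {downtime_budget_minutes(target):.1f} min/year")
# 99.90000% -> 525.6 min/year
# 99.99000% -> 52.6 min/year
# 99.99900% -> 5.3 min/year
```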
Examples
- Financial: trading, payments, core banking
- Healthcare: patient monitoring, prescriptions
- Infrastructure: energy, telecom, transportation
- Government: elections, emergency services, defense
Fundamental Principles
1. Defense in Depth
Multiple layers of protection:
- Layer 1: Input validation
- Layer 2: Timeouts and circuit breakers (sketched below)
- Layer 3: Service redundancy
- Layer 4: Data replication
- Layer 5: Disaster recovery
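As an illustration of layers 1 and 2, here is a minimal, hand-rolled circuit breaker sketch; the thresholds are assumptions, and in production you would more likely use an established library than this simplified version.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures instead of queueing up slow calls."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        else:
            self.failures = 0  # success closes the circuit again
            return result
```

A caller wraps risky calls, e.g. `breaker.call(gateway.charge, payment)`, so a struggling dependency fails fast instead of tying up threads until every request times out.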
2. Fail-Safe Defaults
```python
# ❌ Silent failure
def process_payment(payment):
    try:
        return gateway.charge(payment)
    except Exception:
        return None  # Ignores the error

# ✅ Safe and explicit failure
def process_payment(payment):
    try:
        result = gateway.charge(payment)
        if not result.is_verified():
            raise PaymentVerificationFailed()
        return result
    except Exception as e:
        log_critical(e)
        alert_oncall()
        raise PaymentProcessingError(original=e)
```
3. Graceful Degradation
```python
def get_user_recommendations(user_id):
    try:
        # Try the full service
        return recommendation_service.get_personalized(user_id)
    except TimeoutError:
        # Fallback 1: local cache
        cached = local_cache.get(user_id)
        if cached:
            return cached
        # Fallback 2: generic popular items
        return get_popular_items()
```
Patterns for High Performance
1. Active-Active Redundancy
```
Primary Region: us-east-1
  Load: 50%
  Status: Active

Secondary Region: us-west-2
  Load: 50%
  Status: Active

Failover: Automatic via DNS
RTO: 30 seconds
RPO: 0 (sync replication)
```
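The failover itself is handled by DNS at the infrastructure level; as a complementary, hypothetical sketch using the `requests` library, a client can also retry the other region when the first one is unreachable (the endpoints and the 500 ms timeout are illustrative):

```python
import requests

# Hypothetical per-region endpoints for the same active-active service
REGION_ENDPOINTS = [
    "https://api.us-east-1.example.com",
    "https://api.us-west-2.example.com",
]

def get_with_failover(path, timeout=0.5):
    last_error = None
    for base_url in REGION_ENDPOINTS:
        try:
            response = requests.get(f"{base_url}{path}", timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc  # try the next region
    raise last_error
```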
2. Read-Write Separation
```python
class Database:
    def __init__(self):
        self.primary = connect_primary()
        self.replicas = connect_replicas()

    def write(self, query):
        # All writes go to the primary
        return self.primary.execute(query)

    def read(self, query):
        # Reads are spread across the replicas
        replica = self.load_balance(self.replicas)
        return replica.execute(query)
```
3. Request Hedging
```python
import asyncio

async def critical_query(query):
    """Send the query to multiple replicas and use the first response."""
    tasks = [
        asyncio.create_task(replica.execute(query))
        for replica in replicas[:3]
    ]

    # Return as soon as the first replica answers
    done, pending = await asyncio.wait(
        tasks,
        return_when=asyncio.FIRST_COMPLETED,
    )

    # Cancel the queries that are still in flight
    for task in pending:
        task.cancel()

    return done.pop().result()
```
4. Pre-computation
```python
# Pre-compute critical results on a schedule
@scheduled(every='5 minutes')
def precompute_dashboards():
    for user in premium_users:
        dashboard_data = compute_dashboard(user)
        cache.set(f'dashboard:{user.id}', dashboard_data)

# The request path only fetches from the cache
def get_dashboard(user_id):
    return cache.get(f'dashboard:{user_id}')
```
Testing Critical Systems
Chaos Engineering
```python
import random

# Chaos Monkey: kills a random instance
def chaos_test():
    instance = random.choice(production_instances)
    instance.terminate()

    assert system_health() == 'healthy'
    assert latency_p99() < SLO

# Chaos Kong: simulates a region failure
def region_failover_test():
    disable_region('us-east-1')

    assert all_requests_succeed()
    assert latency_increase() < 0.20  # less than a 20% increase
```
Game Days
```markdown
## Game Day Plan: DB Failover

### Objective
Validate that DB failover completes in < 30s without data loss

### Prerequisites
- [ ] Backup verified
- [ ] Runbook updated
- [ ] On-call alerted

### Execution
1. 10:00 - Start monitoring
2. 10:05 - Simulate primary failure
3. 10:06 - Verify automatic failover
4. 10:10 - Validate data integrity
5. 10:15 - Restore primary
6. 10:20 - Verify failback

### Metrics
- Failover time: ___s
- Lost requests: ___
- Lost data: ___

### Result
- [ ] Passed
- [ ] Failed
```
Extreme Load Testing
```yaml
# Test beyond the expected load
Scenarios:
  Normal:  1x load
  Peak:    5x load
  Extreme: 10x load
  Absurd:  20x load

For each scenario:
  - Latency still within SLO?
  - Errors still within SLO?
  - System recovers when load decreases?
  - How long for recovery?
```
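One way to drive these scenarios (a minimal sketch; the host, endpoint, pacing, and user counts are illustrative) is a Locust test run repeatedly with increasing user counts:

```python
# locustfile.py
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # Each simulated user waits 100-500 ms between requests
    wait_time = between(0.1, 0.5)

    @task
    def get_dashboard(self):
        self.client.get("/dashboard")
```

Running the same file at 1x, 5x, 10x, and 20x of the normal user count, e.g. `locust -f locustfile.py --headless --host https://staging.example.com --users 1000 --spawn-rate 100`, answers the four questions above for each scenario.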
Monitoring Critical Systems
Golden Signals with Strict SLOs
```yaml
Latency:
  p50:   < 20ms
  p95:   < 50ms
  p99:   < 100ms
  p99.9: < 500ms

Traffic:
  Expected: 10,000 req/s
  Alert: < 8,000 or > 15,000

Errors:
  Total: < 0.01%
  5xx:   < 0.001%

Saturation:
  CPU:         < 60%
  Memory:      < 70%
  Connections: < 80%
```
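The error-rate SLO maps directly to an error budget (the war-room dashboard below tracks "Error Budget Remaining"); a quick sketch of the arithmetic, assuming a 30-day window and the expected 10,000 req/s:

```python
# Error budget over a 30-day window at a 0.01% error-rate SLO.
requests_per_second = 10_000
window_seconds = 30 * 24 * 3600
slo_error_rate = 0.0001  # 0.01%

total_requests = requests_per_second * window_seconds
error_budget = int(total_requests * slo_error_rate)
print(f"{error_budget:,} errors allowed out of {total_requests:,} requests")
# 2,592,000 errors allowed out of 25,920,000,000 requests
```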
Multi-level Alerts
```yaml
# Tier 1: Immediate action (page)
- alert: CriticalLatency
  expr: latency_p99 > 100ms
  for: 1m
  severity: page

# Tier 2: Investigate soon (urgent ticket)
- alert: HighLatency
  expr: latency_p99 > 50ms
  for: 5m
  severity: ticket_urgent

# Tier 3: Monitor (normal ticket)
- alert: ElevatedLatency
  expr: latency_p99 > 30ms
  for: 15m
  severity: ticket
```
War Room Dashboards
```
┌──────────────────────────────────────────────────────┐
│ SYSTEM STATUS: 🟢 HEALTHY                             │
├──────────────────────────────────────────────────────┤
│ Latency p99:  32ms          (SLO: 50ms)   [====    ]  │
│ Error Rate:   0.003%        (SLO: 0.01%)  [=       ]  │
│ Throughput:   12,342 req/s                [======  ]  │
│ CPU:          45%                         [====    ]  │
├──────────────────────────────────────────────────────┤
│ Active Incidents: 0                                   │
│ Error Budget Remaining: 87%                           │
│ Last Incident: 14 days ago                            │
└──────────────────────────────────────────────────────┘
```
Runbooks for Incidents
Structure
```markdown
## Runbook: High Latency Alert

### Symptoms
- p99 latency > 50ms for > 1 minute
- Alert: CriticalLatency

### Impact
- Users experience slowness
- Possible client timeouts
- SLO at risk

### Diagnosis
1. Check dashboard: grafana.internal/critical
2. Identify slow component:
   `kubectl top pods -n production`
3. Check dependencies:
   `curl -s http://health.internal/dependencies`

### Mitigation
**If high CPU:**
- Scale horizontally: `kubectl scale deployment/api --replicas=10`

**If slow DB:**
- Check queries: `SELECT * FROM pg_stat_activity`
- Consider failover to replica

**If external dependency:**
- Activate circuit breaker: `curl -X POST http://api/circuit/external/open`

### Escalation
- 5 min without resolution: Call on-call lead
- 15 min without resolution: Call engineering manager
- 30 min without resolution: Activate war room
```
Organizational Practices
On-call Structure
- Primary: responds in 5 min
- Secondary: backup if the primary doesn't respond in 10 min
- Tertiary: specialist for complex cases
- Rotation: weekly
- Compensation: day off per on-call week
Post-incident Review
```markdown
## Incident Review: 2025-01-15 Latency Spike

### Summary
p99 latency reached 2s for 15 minutes due to an N+1 query
introduced in the deploy at 14:30.

### Timeline
- 14:30 - Deploy of version 2.3.1
- 14:45 - First latency alert
- 14:47 - On-call started investigation
- 15:00 - N+1 query identified
- 15:05 - Rollback initiated
- 15:08 - System normalized

### Impact
- 15 minutes of degradation
- 0.5% of requests timed out
- No data loss

### Root Cause
N+1 query introduced in PR #1234, not detected in review
or in performance tests.

### Actions
1. [ ] Add a performance test for the affected endpoint
2. [ ] Implement automatic N+1 detection in CI
3. [ ] Review the code review checklist

### Lessons
- Changes to critical queries need load testing
- Alerts worked well
- Rollback was quick and effective
```
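As an illustration of action item 2, here is a minimal sketch of a query-count guard that can run in CI, assuming SQLAlchemy; the `client` and `engine` fixtures, the endpoint, and the limit of 5 queries are hypothetical:

```python
# test_query_budget.py -- fail CI when an endpoint issues too many queries.
from contextlib import contextmanager
from sqlalchemy import event

@contextmanager
def query_counter(engine):
    """Record every SQL statement executed on the given engine."""
    statements = []

    def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        statements.append(statement)

    event.listen(engine, "before_cursor_execute", before_cursor_execute)
    try:
        yield statements
    finally:
        event.remove(engine, "before_cursor_execute", before_cursor_execute)

def test_dashboard_endpoint_has_no_n_plus_one(client, engine):
    with query_counter(engine) as statements:
        client.get("/dashboard")
    # An N+1 regression shows up as a query count that scales with the data
    assert len(statements) <= 5, f"too many queries: {len(statements)}"
```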
Conclusion
Performance in critical systems requires:
- Redundancy at all layers
- Fail-safe by default
- Extreme tests (chaos, game days)
- Obsessive monitoring
- Detailed runbooks
- Blameless post-mortem culture
The difference between critical and non-critical systems isn't just technical:
- Non-critical: "If it fails, users complain."
- Critical: "If it fails, the consequences are irreversible."
Plan for the worst case. Test the unthinkable. Document everything.
In critical systems, paranoia is a feature, not a bug.