"We solved the performance problem." Until the next deploy. Until the next feature. Until the next user growth. Performance isn't a project with a beginning and end — it's a continuous practice that needs to be integrated into engineering culture.
Performance isn't a destination. It's a journey that never ends.
The Performance Vicious Cycle
The common pattern
- Phase 1: Development - "Focus on features, performance later"
- Phase 2: Launch - "It's working, let's ship"
- Phase 3: Growth - "Why is it getting slow?"
- Phase 4: Crisis - "We need a performance project!"
- Phase 5: Fix - "Fixed!" (and back to Phase 1)
Why it happens
Misaligned incentives:
- Features are visible, performance isn't
- Deadline pressure cuts "extras"
- Performance seems to work until it breaks
Lack of process:
- No defined SLOs
- No performance tests in CI
- No proactive monitoring
Technical debt:
- "We'll optimize later"
- "Works for current volume"
- "When it's a problem, we'll solve it"
Integrating Performance into the Development Cycle
Shift-Left: Performance from design
In Design:
- Consider expected scale
- Choose an architecture that fits that scale
- Define preliminary SLOs
In Development:
- Local profiling
- Unit performance tests
- Code review with a performance lens
In the PR:
- Automated load tests
- Before/after benchmarks (see the sketch after this list)
- Regression verification
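As a concrete illustration of the before/after benchmark step, here is a minimal sketch of a PR check. It assumes each benchmark run exports a JSON file mapping benchmark names to mean latency in seconds; the file format and the threshold are illustrative, not a prescribed standard.

```python
# compare_benchmarks.py - a sketch of a before/after benchmark gate for PRs.
# Assumes each benchmark run exports a JSON file mapping benchmark name to
# mean latency in seconds; the format and the 10% threshold are illustrative.
import json
import sys

THRESHOLD = 0.10  # fail the check if any benchmark regresses by more than 10%

def load(path: str) -> dict[str, float]:
    with open(path) as f:
        return json.load(f)

def main(baseline_path: str, current_path: str) -> int:
    baseline, current = load(baseline_path), load(current_path)
    failures = []
    for name, base_mean in baseline.items():
        new_mean = current.get(name)
        if new_mean is None:
            continue  # benchmark removed or renamed; review separately
        regression = (new_mean - base_mean) / base_mean
        if regression > THRESHOLD:
            failures.append(
                f"{name}: {base_mean * 1000:.1f}ms -> {new_mean * 1000:.1f}ms (+{regression:.0%})"
            )
    if failures:
        print("Benchmark regressions above threshold:")
        print("\n".join(failures))
        return 1  # non-zero exit blocks the merge
    print("No benchmark regressions above threshold.")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Wired into the PR pipeline, a non-zero exit code from a script like this blocks the merge until someone either fixes the regression or consciously accepts it.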
SLOs as a contract
Define SLOs before development starts:
Checkout API:
- Latency p95: < 500ms
- Availability: 99.9%
- Error rate: < 0.1%
Tracking:
- SLO compliance dashboard
- Alerts when degrading
- Weekly error budget review
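To make this concrete, the SLO can live next to the code as data, and the weekly error budget review becomes a small calculation over it. A minimal sketch, using an illustrative structure rather than any particular SLO tooling:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    """Service level objective for one endpoint (illustrative structure)."""
    latency_p95_ms: float
    availability: float    # e.g. 0.999 for "three nines"
    max_error_rate: float  # e.g. 0.001 for 0.1%

CHECKOUT_SLO = Slo(latency_p95_ms=500, availability=0.999, max_error_rate=0.001)

def error_budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still available in the review window."""
    allowed_failures = total_requests * (1 - slo.availability)
    if allowed_failures == 0:
        return 1.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# Weekly review: 10M requests, 4,200 failures against a budget of 10,000
print(f"{error_budget_remaining(CHECKOUT_SLO, 10_000_000, 4_200):.0%} of the error budget left")
```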
Performance in CI/CD
Pipeline stages:
1. Unit tests (including performance):
- Benchmarks of critical functions
- Comparison with baseline
- Fail if regression > 10%
2. Integration tests:
- Basic load test (smoke)
- Verify the system still works under load
3. Pre-prod:
- Full load test
- Comparison with current production
- Approval gate
4. Deploy:
- Canary deployment with metrics comparison
- Automatic rollback on degradation
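The canary stage comes down to comparing the canary's metrics against the stable fleet and rolling back when it degrades. A sketch of that decision logic, assuming the metric values have already been pulled from your monitoring system (names and thresholds are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    p95_latency_ms: float
    error_rate: float

def canary_is_healthy(stable: Metrics, canary: Metrics,
                      max_latency_increase: float = 0.10,
                      max_error_rate_increase: float = 0.001) -> bool:
    """Return False if the canary degrades latency or errors beyond tolerance."""
    latency_regression = (canary.p95_latency_ms - stable.p95_latency_ms) / stable.p95_latency_ms
    error_increase = canary.error_rate - stable.error_rate
    return latency_regression <= max_latency_increase and error_increase <= max_error_rate_increase

# Example decision point during a rollout (values are made up)
stable = Metrics(p95_latency_ms=420, error_rate=0.0005)
canary = Metrics(p95_latency_ms=510, error_rate=0.0006)

if not canary_is_healthy(stable, canary):
    print("Canary degraded: trigger automatic rollback")  # here you'd call your deploy tooling
else:
    print("Canary healthy: continue rollout")
```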
Performance Culture
Distributed ownership
Not one team's responsibility:
❌ "The performance team solves it"
Everyone's responsibility:
✅ "Each team owns their service's performance"
Model:
- Team A: Service A SLO
- Team B: Service B SLO
- Platform: Tools and standards
- SRE: Alerts and incident response
Visible metrics
Public dashboards:
- SLO compliance per service
- Latency trending
- Remaining error budget
Regular reviews:
- Weekly: Team metrics review
- Monthly: Engineering-wide review
- Quarterly: Executive report
Aligned incentives
OKRs that include performance:
❌ "Deliver feature X"
✅ "Deliver feature X maintaining p95 < 500ms"
Celebrate improvements:
- Recognize optimizations
- Share learnings
- Post-mortems for improvements (not just incidents)
Continuous Practices
Monitoring as a habit
Daily:
- Review key dashboards
- Check pending alerts
- Note trends
Weekly:
- SLO compliance review
- Slow query analysis
- Error budget review
Monthly:
- Capacity planning update
- Trend analysis
- Identify the next bottlenecks
Performance reviews
Each major release:
- Complete stress test
- Comparison with previous release
- Identify regressions
Each quarter:
- Architecture review
- Validate that the system scales for next quarter's projected load
- Identify performance debt
Chaos engineering
Regular experiments:
- Take down instances
- Simulate dependency latency (see the sketch below)
- Test circuit breakers
Gamedays:
- Simulate Black Friday
- Incident response
- Validate runbooks
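A simple way to run the "simulate dependency latency" experiment is to wrap dependency calls with an injector that slows down a fraction of requests, then watch whether timeouts and circuit breakers react as expected. A sketch with illustrative names and numbers, intended for staging rather than production hot paths:

```python
import random
import time
from functools import wraps

def inject_latency(probability: float = 0.2, delay_seconds: float = 2.0):
    """Decorator that delays a fraction of calls to simulate a slow dependency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # simulated dependency slowness
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=0.2, delay_seconds=2.0)
def call_payment_provider(order_id: str) -> str:
    # Placeholder for the real dependency call (hypothetical name).
    return f"payment accepted for {order_id}"

if __name__ == "__main__":
    # Observe how callers, timeouts, and circuit breakers react to slow responses.
    for i in range(5):
        start = time.monotonic()
        call_payment_provider(f"order-{i}")
        print(f"call {i} took {time.monotonic() - start:.2f}s")
```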
Planning for Growth
Continuous capacity planning
Data needed:
- Historical load growth
- Current capacity (from stress test)
- Known future events
Monthly process:
1. Update load projection
2. Compare with capacity
3. Identify when the limit will be reached (see the worked sketch below)
4. Plan actions (scale, optimize)
Output:
- Capacity timeline
- Budget needed
- Decisions for next quarter
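The core of this process is simple arithmetic. A worked sketch with made-up numbers that estimates when peak load will reach the capacity measured in the last stress test, assuming compounding monthly growth:

```python
import math

# Illustrative inputs for the monthly capacity review
current_peak_rps = 1_200   # from production metrics
capacity_rps = 3_000       # sustainable peak measured in the last stress test
monthly_growth = 0.08      # 8% month-over-month growth from historical data
headroom_target = 0.70     # plan to act before exceeding 70% of capacity

# current * (1 + g)^m = capacity * headroom  =>  m = log(capacity * headroom / current) / log(1 + g)
months_to_limit = math.log((capacity_rps * headroom_target) / current_peak_rps) / math.log(1 + monthly_growth)

print(f"~{months_to_limit:.1f} months until peak load hits {headroom_target:.0%} of capacity")
```

A projection like this (roughly 7 months with these numbers) is exactly what feeds the capacity timeline and the budget conversation above.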
Performance roadmap
Just like the product roadmap:
Q1:
- Implement caching layer
- Optimize critical queries
- Set up automated stress testing
Q2:
- Read replicas for DB
- CDN for static assets
- Autoscaling configuration
Q3:
- Data sharding
- Async processing for jobs
- Edge caching
Q4:
- Multi-region setup
- Global load balancing
- Disaster recovery
Avoiding Regressions
Quality gates
A PR doesn't pass if:
- A benchmark regresses > 5%
- A new endpoint lacks a defined SLO (see the check sketched below)
- Query without documented EXPLAIN
- Missing cache for static data
A deploy doesn't happen if:
- Load test fails
- Canary shows degradation
- Error budget exhausted
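Some of these gates are straightforward to automate. A sketch of the "new endpoint without a defined SLO" check, assuming the pipeline can list the service's routes and that SLOs live in a checked-in registry; both are hard-coded here purely for illustration:

```python
import sys

# In a real pipeline these would come from the framework's route table and a
# checked-in SLO registry (YAML/JSON); hard-coded here for illustration.
deployed_endpoints = {"/checkout", "/cart", "/orders/{id}", "/recommendations"}
endpoints_with_slo = {"/checkout", "/cart", "/orders/{id}"}

missing = deployed_endpoints - endpoints_with_slo
if missing:
    print("Endpoints without a defined SLO:", ", ".join(sorted(missing)))
    sys.exit(1)  # fail the PR check
print("All endpoints have SLOs defined.")
```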
Proactive alerts
Alert before it becomes a problem:
Trend alerts:
- Latency growing 5%/day for 3 days (sketched below)
- Memory growing consistently
- Error rate increasing gradually
Capacity alerts:
- CPU approaching 70%
- Disk approaching 80%
- Connection pool approaching limit
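Trend rules like "latency growing 5%/day for 3 days" are easy to express over daily aggregates if your alerting system doesn't support them natively. A sketch with illustrative data:

```python
def latency_trend_alert(daily_p95_ms: list[float], growth_threshold: float = 0.05,
                        consecutive_days: int = 3) -> bool:
    """Return True if p95 latency grew more than the threshold for N consecutive days."""
    streak = 0
    for prev, curr in zip(daily_p95_ms, daily_p95_ms[1:]):
        growth = (curr - prev) / prev
        streak = streak + 1 if growth > growth_threshold else 0
        if streak >= consecutive_days:
            return True
    return False

# Illustrative week of daily p95 values (ms): a slow, steady climb
week = [410, 418, 442, 468, 495]
if latency_trend_alert(week):
    print("Warning: p95 latency has grown >5%/day for 3 days in a row")
```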
The Role of Leadership
Executives
Responsibilities:
- Include performance in goals
- Allocate budget for infrastructure
- Celebrate improvements publicly
Questions to ask:
- "What's our current capacity?"
- "When do we hit the limit?"
- "What's the cost of not investing?"
Engineering managers
Responsibilities:
- Protect time for performance
- Balance features with tech debt
- Ensure SLO ownership
Questions to ask:
- "What are the team's SLOs?"
- "Are we within error budget?"
- "What's the biggest performance risk?"
Engineers
Responsibilities:
- Consider performance in design
- Write code with performance in mind
- Monitor the services you develop
Questions to ask:
- "Does this scale to 10x?"
- "What's the complexity of this operation?"
- "Does it need cache?"
Conclusion
Sustainable performance requires:
- Integration into the cycle - not as an afterthought
- SLOs as a contract - defined and measured
- Distributed ownership - each team responsible for its services
- Continuous monitoring - not only when things break
- Proactive planning - capacity reviewed on a regular cadence
- Performance culture - aligned incentives
The OCTOPUS methodology isn't a project — it's a continuous practice.
The best time to think about performance is before having a problem. The second best time is now.
This article concludes the series on the OCTOPUS Performance Engineering methodology.