"We solved the performance problem." Until the next deploy. Until the next feature. Until the next user growth. Performance isn't a project with a beginning and end — it's a continuous practice that needs to be integrated into engineering culture.
Performance isn't a destination. It's a journey that never ends.
The Performance Vicious Cycle
The common pattern
- Phase 1: Development - "Focus on features, performance later"
- Phase 2: Launch - "It's working, let's ship"
- Phase 3: Growth - "Why is it getting slow?"
- Phase 4: Crisis - "We need a performance project!"
- Phase 5: Fix - "Fixed!" (and back to Phase 1)
Why it happens
Misaligned incentives:
- Features are visible, performance isn't
- Deadline pressure cuts "extras"
- Performance seems to work until it breaks
Lack of process:
- No defined SLOs
- No performance tests in CI
- No proactive monitoring
Technical debt:
- "We'll optimize later"
- "Works for current volume"
- "When it's a problem, we'll solve it"
Integrating Performance into the Development Cycle
Shift-Left: Performance from design
In Design:
- Consider expected scale
- Choose an architecture that fits that scale
- Define preliminary SLOs
In Development:
- Local profiling
- Unit performance tests
- Code review with a performance lens
In the PR:
- Automated load tests
- Before/after benchmarks (see the sketch after this list)
- Regression verification
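As a concrete illustration of the before/after benchmark step, here is a minimal sketch of a PR check. It assumes each benchmark run exports a JSON file mapping benchmark names to mean latency in seconds; the file format and the threshold are illustrative, not a prescribed standard.

```python
# compare_benchmarks.py - a sketch of a before/after benchmark gate for PRs.
# Assumes each benchmark run exports a JSON file mapping benchmark name to
# mean latency in seconds; the format and the 10% threshold are illustrative.
import json
import sys

THRESHOLD = 0.10  # fail the check if any benchmark regresses by more than 10%

def load(path: str) -> dict[str, float]:
    with open(path) as f:
        return json.load(f)

def main(baseline_path: str, current_path: str) -> int:
    baseline, current = load(baseline_path), load(current_path)
    failures = []
    for name, base_mean in baseline.items():
        new_mean = current.get(name)
        if new_mean is None:
            continue  # benchmark removed or renamed; review separately
        regression = (new_mean - base_mean) / base_mean
        if regression > THRESHOLD:
            failures.append(
                f"{name}: {base_mean * 1000:.1f}ms -> {new_mean * 1000:.1f}ms (+{regression:.0%})"
            )
    if failures:
        print("Benchmark regressions above threshold:")
        print("\n".join(failures))
        return 1  # non-zero exit blocks the merge
    print("No benchmark regressions above threshold.")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Wired into the PR pipeline, a non-zero exit code from a script like this blocks the merge until someone either fixes the regression or consciously accepts it.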
SLOs as a contract
Define SLOs before development starts:
Checkout API:
- Latency p95: < 500ms
- Availability: 99.9%
- Error rate: < 0.1%
Tracking:
- SLO compliance dashboard
- Alerts when degrading
- Weekly error budget review
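To make this concrete, the SLO can live next to the code as data, and the weekly error budget review becomes a small calculation over it. A minimal sketch, using an illustrative structure rather than any particular SLO tooling:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    """Service level objective for one endpoint (illustrative structure)."""
    latency_p95_ms: float
    availability: float    # e.g. 0.999 for "three nines"
    max_error_rate: float  # e.g. 0.001 for 0.1%

CHECKOUT_SLO = Slo(latency_p95_ms=500, availability=0.999, max_error_rate=0.001)

def error_budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still available in the review window."""
    allowed_failures = total_requests * (1 - slo.availability)
    if allowed_failures == 0:
        return 1.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# Weekly review: 10M requests, 4,200 failures against a budget of 10,000
print(f"{error_budget_remaining(CHECKOUT_SLO, 10_000_000, 4_200):.0%} of the error budget left")
```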
Performance in CI/CD
Pipeline stages:
1. Unit tests (including performance):
- Benchmarks of critical functions
- Comparison with baseline
- Fail if regression > 10%
2. Integration tests:
- Basic load test (smoke)
- Verify the system still works under load
3. Pre-prod:
- Full load test
- Comparison with current production
- Approval gate
4. Deploy:
- Canary deployment with metrics comparison
- Automatic rollback on degradation
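The canary stage comes down to comparing the canary's metrics against the stable fleet and rolling back when it degrades. A sketch of that decision logic, assuming the metric values have already been pulled from your monitoring system (names and thresholds are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    p95_latency_ms: float
    error_rate: float

def canary_is_healthy(stable: Metrics, canary: Metrics,
                      max_latency_increase: float = 0.10,
                      max_error_rate_increase: float = 0.001) -> bool:
    """Return False if the canary degrades latency or errors beyond tolerance."""
    latency_regression = (canary.p95_latency_ms - stable.p95_latency_ms) / stable.p95_latency_ms
    error_increase = canary.error_rate - stable.error_rate
    return latency_regression <= max_latency_increase and error_increase <= max_error_rate_increase

# Example decision point during a rollout (values are made up)
stable = Metrics(p95_latency_ms=420, error_rate=0.0005)
canary = Metrics(p95_latency_ms=510, error_rate=0.0006)

if not canary_is_healthy(stable, canary):
    print("Canary degraded: trigger automatic rollback")  # here you'd call your deploy tooling
else:
    print("Canary healthy: continue rollout")
```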
Performance Culture
Distributed ownership
Not one team's responsibility:
❌ "The performance team solves it"
Everyone's responsibility:
✅ "Each team owns their service's performance"
Model:
- Team A: Service A SLO
- Team B: Service B SLO
- Platform: Tools and standards
- SRE: Alerts and incident response
Visible metrics
Public dashboards:
- SLO compliance per service
- Latency trending
- Remaining error budget
Regular reviews:
- Weekly: Team metrics review
- Monthly: Engineering-wide review
- Quarterly: Executive report
Aligned incentives
OKRs that include performance:
❌ "Deliver feature X"
✅ "Deliver feature X maintaining p95 < 500ms"
Celebrate improvements:
- Recognize optimizations
- Share learnings
- Post-mortems for improvements (not just incidents)
Continuous Practices
Monitoring as a habit
Daily:
- Review key dashboards
- Check pending alerts
- Note trends
Weekly:
- SLO compliance review
- Slow query analysis
- Error budget review
Monthly:
- Capacity planning update
- Trend analysis
- Identify the next bottlenecks
Performance reviews
Each major release:
- Complete stress test
- Comparison with previous release
- Identify regressions
Each quarter:
- Architecture review
- Validate that the system scales for next quarter's projected load
- Identify performance debt
Chaos engineering
Regular experiments:
- Take down instances
- Simulate dependency latency (see the sketch below)
- Test circuit breakers
Gamedays:
- Simulate Black Friday
- Incident response
- Validate runbooks
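A simple way to run the "simulate dependency latency" experiment is to wrap dependency calls with an injector that slows down a fraction of requests, then watch whether timeouts and circuit breakers react as expected. A sketch with illustrative names and numbers, intended for staging rather than production hot paths:

```python
import random
import time
from functools import wraps

def inject_latency(probability: float = 0.2, delay_seconds: float = 2.0):
    """Decorator that delays a fraction of calls to simulate a slow dependency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # simulated dependency slowness
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=0.2, delay_seconds=2.0)
def call_payment_provider(order_id: str) -> str:
    # Placeholder for the real dependency call (hypothetical name).
    return f"payment accepted for {order_id}"

if __name__ == "__main__":
    # Observe how callers, timeouts, and circuit breakers react to slow responses.
    for i in range(5):
        start = time.monotonic()
        call_payment_provider(f"order-{i}")
        print(f"call {i} took {time.monotonic() - start:.2f}s")
```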
Planning for Growth
Continuous capacity planning
Data needed:
- Historical load growth
- Current capacity (from stress test)
- Known future events
Monthly process:
1. Update load projection
2. Compare with capacity
3. Identify when the limit will be reached (see the worked sketch below)
4. Plan actions (scale, optimize)
Output:
- Capacity timeline
- Budget needed
- Decisions for next quarter
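The core of this process is simple arithmetic. A worked sketch with made-up numbers that estimates when peak load will reach the capacity measured in the last stress test, assuming compounding monthly growth:

```python
import math

# Illustrative inputs for the monthly capacity review
current_peak_rps = 1_200   # from production metrics
capacity_rps = 3_000       # sustainable peak measured in the last stress test
monthly_growth = 0.08      # 8% month-over-month growth from historical data
headroom_target = 0.70     # plan to act before exceeding 70% of capacity

# current * (1 + g)^m = capacity * headroom  =>  m = log(capacity * headroom / current) / log(1 + g)
months_to_limit = math.log((capacity_rps * headroom_target) / current_peak_rps) / math.log(1 + monthly_growth)

print(f"~{months_to_limit:.1f} months until peak load hits {headroom_target:.0%} of capacity")
```

A projection like this (roughly 7 months with these numbers) is exactly what feeds the capacity timeline and the budget conversation above.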
Performance roadmap
Just like the product roadmap:
Q1:
- Implement caching layer
- Optimize critical queries
- Set up automated stress testing
Q2:
- Read replicas for DB
- CDN for static assets
- Autoscaling configuration
Q3:
- Data sharding
- Async processing for jobs
- Edge caching
Q4:
- Multi-region setup
- Global load balancing
- Disaster recovery
Avoiding Regressions
Quality gates
A PR doesn't pass if:
- A benchmark regresses > 5%
- A new endpoint lacks a defined SLO (see the check sketched below)
- Query without documented EXPLAIN
- Missing cache for static data
A deploy doesn't happen if:
- Load test fails
- Canary shows degradation
- Error budget exhausted
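Some of these gates are straightforward to automate. A sketch of the "new endpoint without a defined SLO" check, assuming the pipeline can list the service's routes and that SLOs live in a checked-in registry; both are hard-coded here purely for illustration:

```python
import sys

# In a real pipeline these would come from the framework's route table and a
# checked-in SLO registry (YAML/JSON); hard-coded here for illustration.
deployed_endpoints = {"/checkout", "/cart", "/orders/{id}", "/recommendations"}
endpoints_with_slo = {"/checkout", "/cart", "/orders/{id}"}

missing = deployed_endpoints - endpoints_with_slo
if missing:
    print("Endpoints without a defined SLO:", ", ".join(sorted(missing)))
    sys.exit(1)  # fail the PR check
print("All endpoints have SLOs defined.")
```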
Proactive alerts
Alert before it becomes a problem:
Trend alerts:
- Latency growing 5%/day for 3 days (sketched below)
- Memory growing consistently
- Error rate increasing gradually
Capacity alerts:
- CPU approaching 70%
- Disk approaching 80%
- Connection pool approaching limit
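Trend rules like "latency growing 5%/day for 3 days" are easy to express over daily aggregates if your alerting system doesn't support them natively. A sketch with illustrative data:

```python
def latency_trend_alert(daily_p95_ms: list[float], growth_threshold: float = 0.05,
                        consecutive_days: int = 3) -> bool:
    """Return True if p95 latency grew more than the threshold for N consecutive days."""
    streak = 0
    for prev, curr in zip(daily_p95_ms, daily_p95_ms[1:]):
        growth = (curr - prev) / prev
        streak = streak + 1 if growth > growth_threshold else 0
        if streak >= consecutive_days:
            return True
    return False

# Illustrative week of daily p95 values (ms): a slow, steady climb
week = [410, 418, 442, 468, 495]
if latency_trend_alert(week):
    print("Warning: p95 latency has grown >5%/day for 3 days in a row")
```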
The Role of Leadership
Executives
Responsibilities:
- Include performance in goals
- Allocate budget for infrastructure
- Celebrate improvements publicly
Questions to ask:
- "What's our current capacity?"
- "When do we hit the limit?"
- "What's the cost of not investing?"
Engineering managers
Responsibilities:
- Protect time for performance
- Balance features with tech debt
- Ensure SLO ownership
Questions to ask:
- "What are the team's SLOs?"
- "Are we within error budget?"
- "What's the biggest performance risk?"
Engineers
Responsibilities:
- Consider performance in design
- Write code with performance in mind
- Monitor the services you develop
Questions to ask:
- "Does this scale to 10x?"
- "What's the complexity of this operation?"
- "Does it need cache?"
Conclusion
Sustainable performance requires:
- Integration into the cycle - not as an afterthought
- SLOs as a contract - defined and measured
- Distributed ownership - each team responsible for its services
- Continuous monitoring - not only when things break
- Proactive planning - capacity reviewed on a regular cadence
- Performance culture - aligned incentives
The OCTOPUS methodology isn't a project — it's a continuous practice.
The best time to think about performance is before having a problem. The second best time is now.
This article concludes the series on the OCTOPUS Performance Engineering methodology.