Autoscaling promises capacity that follows demand automatically. In practice, inadequate configuration leads to late scaling, constant oscillation, or uncontrolled costs. This article looks at how to configure autoscaling so that it behaves predictably.
Autoscaling isn't "configure and forget". It's "configure, test, adjust, monitor".
How Autoscaling Works
The basic loop
1. Metrics collected
2. Compared with threshold
3. Decision: scale up, down, or nothing
4. Action executed
5. Cooldown
6. Repeat
Reaction time
Event (spike) → Detection → Decision → Provisioning → Ready
Typical timeline:
0s: Spike starts
15s: Metric reflects spike
30s: HPA decides to scale
60s: Pod scheduled
90s: Container pulled
120s: App ready
Total: ~2 minutes until new capacity
Problem: If your spike lasts 1 minute, autoscaling doesn't help.
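The only way to know your real end-to-end number is to measure it in your cluster. The events on the HPA and on the workload carry timestamps for each step; app-hpa is the HPA name used in the manifests below, and the grep pattern is just a rough filter:
# When the HPA decided to scale, and to how many replicas
kubectl describe hpa app-hpa
# When the new pods were scheduled, pulled, and started
kubectl get events --sort-by=.lastTimestamp | grep -E "Scaled|Scheduled|Pulled|Started"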
Horizontal Pod Autoscaler (HPA)
Basic configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
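Assuming the manifest above is saved as hpa.yaml, a quick sanity check after applying it is to confirm the HPA can actually read the CPU metric:
kubectl apply -f hpa.yaml
kubectl get hpa app-hpa
# If TARGETS shows <unknown>, metrics-server isn't running or the Deployment has no CPU requests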
Common problems
1. Threshold too high:
# ❌ 90% utilization
# Only scales when already saturated
averageUtilization: 90
2. Threshold too low:
# ❌ 30% utilization
# Scales with any normal variation
averageUtilization: 30
3. Inadequate cooldown:
# ❌ Without configuring behavior
# Scale up/down too fast = thrashing
Optimized configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65  # Margin to react
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale immediately
      policies:
      - type: Percent
        value: 100  # Can double
        periodSeconds: 15
      - type: Pods
        value: 4  # Or +4 pods
        periodSeconds: 15
      selectPolicy: Max  # Use the largest
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min stable
      policies:
      - type: Percent
        value: 10  # Reduce 10% at a time
        periodSeconds: 60
Custom metrics
CPU isn't always the best indicator:
metrics:
# Based on requests per second
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: 1000
# Based on queue
- type: External
  external:
    metric:
      name: queue_messages_ready
      selector:
        matchLabels:
          queue: orders
    target:
      type: Value
      value: 100
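These metrics only exist if a metrics adapter (prometheus-adapter, for example) is installed and serving the custom and external metrics APIs. A quick check before pointing an HPA at them:
# Both calls should return a list of available metrics, not a NotFound error
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"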
Vertical Pod Autoscaler (VPA)
When to use
HPA: adds more pods (horizontal). Good for stateless workloads.
VPA: increases the resources of existing pods (vertical). Good for databases, caches, and other stateful workloads.
Configuration
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"  # Recreate pods when needed
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
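Before trusting updateMode: "Auto" to recreate pods, it's worth looking at what the recommender would actually set (this assumes the VPA components are installed in the cluster):
# Shows lowerBound / target / upperBound recommendations per container
kubectl describe vpa app-vpa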
VPA + HPA
# ❌ VPA and HPA on the same resource (CPU)
# Conflict: both try to adjust it
# ✅ VPA for memory, HPA for CPU
# VPA
resourcePolicy:
  containerPolicies:
  - containerName: app
    controlledResources: ["memory"]
# HPA
metrics:
- type: Resource
  resource:
    name: cpu
Cluster Autoscaler
How it works
Pod pending (no available node)
→ Cluster Autoscaler detects
→ Provisions new node
→ Pod scheduled on new node
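To confirm this chain is really happening, check for pending pods and the autoscaler's events and logs (the label below assumes a typical cluster-autoscaler deployment in kube-system; adjust to your install):
# Pods waiting for capacity
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# A pending pod that triggered a scale-up gets a TriggeredScaleUp event
kubectl describe pod <pending-pod-name>
# Autoscaler decisions
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50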
Configuration
# Node pool configuration (GKE example)
gcloud container node-pools create scaling-pool \
--cluster=my-cluster \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10 \
--machine-type=e2-standard-4
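To verify the limits were applied (same names as the command above):
gcloud container node-pools describe scaling-pool --cluster=my-cluster
# Look for the autoscaling block: enabled, minNodeCount, maxNodeCount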
Provisioning time
AWS: 2-5 minutes
GCP: 1-3 minutes
Azure: 2-4 minutes
→ For quick spikes, Cluster Autoscaler is too slow
Predictability: The Key to Autoscaling
Problem 1: Thrashing
Load varies: 60% → 75% → 65% → 80% → 70%
With 70% threshold:
- Scale up (75%)
- Scale down (65%)
- Scale up (80%)
- Scale down (70%)
= Pods being created and destroyed constantly
Solution: Stabilization window
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
Problem 2: Late scaling
Spike at 9:00
HPA detects at 9:00:30
New pods ready at 9:02:00
= 2 minutes of degradation
Solution: Predictive scaling or pre-warming
# Scheduled scaling (AWS)
aws autoscaling put-scheduled-action \
--scheduled-action-name "morning-scale" \
--auto-scaling-group-name "app-asg" \
--recurrence "0 8 * * MON-FRI" \
--min-size 10
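Note that this only raises the floor at 08:00; without a matching action to lower it again, the group stays over-provisioned around the clock. A sketch of the counterpart (time and size are illustrative):
aws autoscaling put-scheduled-action \
--scheduled-action-name "evening-scale" \
--auto-scaling-group-name "app-asg" \
--recurrence "0 20 * * MON-FRI" \
--min-size 2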
Problem 3: Cold start
New pod:
- Container start: 5s
- App initialization: 30s
- Warm-up: 60s
Total: ~95s until ideal performance
Solution: Adequate readiness probe + warm-up
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 60  # Time for warm-up
  periodSeconds: 5
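To check whether 60 seconds is the right delay, measure how long pods really take from creation to Ready (the app=my-app label is an assumption; use your Deployment's labels):
# Creation time vs. the Ready condition's transition time, per pod
kubectl get pods -l app=my-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'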
Problem 4: Uncontrolled costs
Autoscaling without limit:
- Anomalous spike or attack
- Scales to 100 pods
- Cost: $10,000/day
→ "Autoscaling worked perfectly... and bankrupted the company"
Solution: Limits + alerts
maxReplicas: 20  # Hard limit
# Alert when near limit
- alert: HPA_NearMaxReplicas
  expr: |
    kube_hpa_status_current_replicas / kube_hpa_spec_max_replicas > 0.8
  for: 5m
Autoscaling Strategies
1. Pure reactive
Detects increase → Scales → Stabilizes
Pros: simple, responds to any pattern
Cons: always late
2. Predictive
Analyzes history → Predicts spike → Scales before
# AWS Predictive Scaling (the policy must be attached to an Auto Scaling group)
aws autoscaling put-scaling-policy \
--auto-scaling-group-name "app-asg" \
--policy-name "predictive" \
--policy-type "PredictiveScaling" \
--predictive-scaling-configuration '{
"MetricSpecifications": [...],
"Mode": "ForecastAndScale"
}'
Pros: scales before the spike
Cons: only works with predictable patterns
3. Scheduled
Scales at fixed times based on known pattern
# Kubernetes CronJob to adjust HPA
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * MON-FRI"
  jobTemplate:
    spec:
      template:
        spec:
          # Needs a ServiceAccount with RBAC permission to patch HPAs
          restartPolicy: OnFailure  # Required: Job pods can't use the default Always
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - patch
            - hpa/app-hpa
            - --patch
            - '{"spec":{"minReplicas":10}}'
Pros: predictable, controlled costs
Cons: doesn't respond to anomalies
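Also note that this job only ever raises minReplicas; pair it with an evening job (or a manual patch) that lowers the floor again, otherwise the cost advantage disappears. The equivalent patches from a terminal:
# Morning: raise the floor (same patch the CronJob runs)
kubectl patch hpa app-hpa --patch '{"spec":{"minReplicas":10}}'
# Evening: drop it back down
kubectl patch hpa app-hpa --patch '{"spec":{"minReplicas":2}}'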
4. Hybrid (recommended)
Base: scheduled scaling for known patterns
+
Reactive: HPA for unexpected variations
+
Limits: max replicas + alerts
Testing Autoscaling
Mandatory validation
# 1. Scale up test
# Generate gradual load, observe scaling
# 2. Scale down test
# Remove load, observe cooldown
# 3. Limit test
# Reach max replicas, observe behavior
# 4. Cold start test
# Measure time until pod is performant
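A minimal way to run the scale-up test is a throwaway load generator hitting the service (this assumes the app is exposed as a Service named my-app on port 8080; adjust the URL and request rate to your setup):
kubectl run load-generator --rm -i --tty --image=busybox --restart=Never -- \
/bin/sh -c "while sleep 0.01; do wget -q -O- http://my-app:8080; done"
# In another terminal, watch replicas and utilization react
kubectl get hpa app-hpa --watch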
Metrics to validate
1. Time to scale up (target < 2 min)
2. Thrashing frequency (target = 0)
3. % of requests during scaling (target > 99% success)
4. Cost during spikes (target within budget)
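The first two can be read straight from the HPA's SuccessfulRescale events, which carry timestamps; rapid alternation between scale-ups and scale-downs is thrashing. Capture them during the test, since events are only retained for about an hour by default:
kubectl get events --sort-by=.lastTimestamp \
--field-selector involvedObject.name=app-hpa,reason=SuccessfulRescale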
Conclusion
Predictable autoscaling requires:
- Adequate thresholds: not too high, not too low
- Behavior configured: stabilization windows to avoid thrashing
- Correct metrics: CPU isn't always the best indicator
- Defined limits: max replicas + cost alerts
- Validated tests: never trust untested autoscaling
Before trusting autoscaling:
- Measure scale up time
- Simulate realistic spikes
- Validate behavior at limits
- Actively monitor costs
Autoscaling is like an autopilot: great when it works, disastrous when it fails. Always have a plan B.