
Autoscaling and Predictability: scaling without surprises

Autoscaling promises hands-off scaling, but without proper configuration it can cause more problems than it solves. Learn how to configure predictable autoscaling.

Autoscaling is the promise of capacity that adjusts automatically to demand. In practice, inadequate configuration results in late scaling, constant oscillation, or uncontrolled costs. This article explores how to make autoscaling predictable.

Autoscaling isn't "configure and forget". It's "configure, test, adjust, monitor".

How Autoscaling Works

The basic loop

1. Metrics are collected
2. They are compared against a threshold
3. A decision is made: scale up, scale down, or do nothing
4. The action is executed
5. A cooldown period applies
6. Repeat
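
For resource metrics, the Kubernetes HPA makes the decision in step 3 with a documented proportional formula:

desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)

For example: 4 replicas running at 90% CPU against a 70% target gives ceil(4 × 90 / 70) = 6 replicas.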

Reaction time

Event (spike) → Detection → Decision → Provisioning → Ready

Typical timeline:
0s:    Spike starts
15s:   Metric reflects spike
30s:   HPA decides to scale
60s:   Pod scheduled
90s:   Container pulled
120s:  App ready

Total: ~2 minutes until new capacity

Problem: if your spike lasts only 1 minute, it is over before the new capacity arrives; autoscaling doesn't help.

Horizontal Pod Autoscaler (HPA)

Basic configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
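
After applying the manifest, watch the HPA in action before trusting it. These standard kubectl commands (assuming the YAML above is saved as app-hpa.yaml) show current vs. target utilization and every scaling decision:

kubectl apply -f app-hpa.yaml
kubectl get hpa app-hpa --watch     # TARGETS column: current/target utilization
kubectl describe hpa app-hpa        # Events section: each scale up/down decision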

Common problems

1. Threshold too high:

# ❌ 90% utilization
# Only scales when already saturated
averageUtilization: 90

2. Threshold too low:

# ❌ 30% utilization
# Scales with any normal variation
averageUtilization: 30

3. Inadequate cooldown:

# ❌ No behavior section configured
# Scale up/down reacts too fast = thrashing

Optimized configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65    # Margin to react
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale immediately
      policies:
      - type: Percent
        value: 100                      # Can double
        periodSeconds: 15
      - type: Pods
        value: 4                        # Or +4 pods
        periodSeconds: 15
      selectPolicy: Max                 # Use the largest
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min stable
      policies:
      - type: Percent
        value: 10                       # Reduce 10% at a time
        periodSeconds: 60

Custom metrics

CPU isn't always the best indicator; requests per second or queue depth often track user-facing load more directly:

metrics:
# Based on requests per second
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: 1000

# Based on queue
- type: External
  external:
    metric:
      name: queue_messages_ready
      selector:
        matchLabels:
          queue: orders
    target:
      type: Value
      value: 100
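
Neither of these metrics exists out of the box: the HPA reads custom and external metrics through an adapter such as prometheus-adapter. A minimal rule sketch, assuming the app exports a Prometheus counter named http_requests_total:

# prometheus-adapter configuration (rule syntax from its docs)
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'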

Vertical Pod Autoscaler (VPA)

When to use

HPA: Adds more pods (horizontal)
VPA: Increases resources of existing pods (vertical)

HPA: Good for stateless
VPA: Good for databases, caches, stateful

Configuration

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"    # Recreate pods when needed
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
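
Before letting VPA evict pods, it's safer to run it in recommendation-only mode and compare its suggestions with observed usage:

updatePolicy:
  updateMode: "Off"    # compute recommendations only, never evict

Then kubectl describe vpa app-vpa shows the recommended requests without touching any pod.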

VPA + HPA

# ❌ VPA and HPA on the same resource (CPU)
# Conflict: both try to adjust

# ✅ VPA for memory, HPA for CPU
# VPA
resourcePolicy:
  containerPolicies:
  - containerName: app
    controlledResources: ["memory"]

# HPA
metrics:
- type: Resource
  resource:
    name: cpu

Cluster Autoscaler

How it works

Pod pending (no available node)
    → Cluster Autoscaler detects
    → Provisions new node
    → Pod scheduled on new node

Configuration

# Node pool configuration (GKE example)
gcloud container node-pools create scaling-pool \
  --cluster=my-cluster \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10 \
  --machine-type=e2-standard-4

Provisioning time

AWS:   2-5 minutes
GCP:   1-3 minutes
Azure: 2-4 minutes

→ For quick spikes, Cluster Autoscaler is too slow

Predictability: The Key to Autoscaling

Problem 1: Thrashing

Load varies: 60% → 75% → 65% → 80% → 70%

With 70% threshold:
- Scale up (75%)
- Scale down (65%)
- Scale up (80%)
- Scale down (70%)

= Pods being created and destroyed constantly

Solution: Stabilization window

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300

Problem 2: Late scaling

Spike at 9:00
HPA detects at 9:00:30
New pods ready at 9:02:00

= 2 minutes of degradation

Solution: Predictive scaling or pre-warming

# Scheduled scaling (AWS)
aws autoscaling put-scheduled-action \
  --scheduled-action-name "morning-scale" \
  --auto-scaling-group-name "app-asg" \
  --recurrence "0 8 * * MON-FRI" \
  --min-size 10

Problem 3: Cold start

New pod:
- Container start: 5s
- App initialization: 30s
- Warm-up: 60s

Total: ~95s until ideal performance

Solution: Adequate readiness probe + warm-up

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 60    # Time for warm-up
  periodSeconds: 5
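
A fixed initialDelaySeconds penalizes pods that warm up faster than the worst case. On Kubernetes 1.18+, a startupProbe (a standard alternative to the fixed delay above) expresses the same budget as an upper bound, marking the pod ready as soon as it actually responds:

startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 24    # up to 120s of warm-up, ready as soon as it passes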

Problem 4: Uncontrolled costs

Autoscaling without limit:
- Anomalous spike or attack
- Scales to 100 pods
- Cost: $10,000/day

→ "Autoscaling worked perfectly... and bankrupted the company"

Solution: Limits + alerts

maxReplicas: 20    # Hard limit

# Alert when near the limit (kube-state-metrics v1 names;
# kube-state-metrics v2 renames these to kube_horizontalpodautoscaler_*)
- alert: HPA_NearMaxReplicas
  expr: |
    kube_hpa_status_current_replicas / kube_hpa_spec_max_replicas > 0.8
  for: 5m

Autoscaling Strategies

1. Pure reactive

Detects increase → Scales → Stabilizes

Pros: simple, responds to any pattern
Cons: always late

2. Predictive

Analyzes history → Predicts the spike → Scales in advance

# AWS Predictive Scaling
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name "app-asg" \
  --policy-name "predictive" \
  --policy-type "PredictiveScaling" \
  --predictive-scaling-configuration '{
    "MetricSpecifications": [...],
    "Mode": "ForecastAndScale"
  }'

Pros: scales before the spike
Cons: only works with predictable patterns

3. Scheduled

Scales at fixed times based on a known pattern

# Kubernetes CronJob to adjust HPA
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * MON-FRI"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher  # illustrative name; needs RBAC to patch HPAs
          restartPolicy: OnFailure         # required for Job pods
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - patch
            - hpa/app-hpa
            - --patch
            - '{"spec":{"minReplicas":10}}'

Pros: predictable, controlled costs
Cons: doesn't respond to anomalies

4. Hybrid (recommended)

Base: scheduled scaling for known patterns
+
Reactive: HPA for unexpected variations
+
Limits: max replicas + alerts

Testing Autoscaling

Mandatory validation

# 1. Scale up test
# Generate gradual load, observe scaling

# 2. Scale down test
# Remove load, observe cooldown

# 3. Limit test
# Reach max replicas, observe behavior

# 4. Cold start test
# Measure time until pod is performant
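
A minimal scale-up test sketch, assuming the Deployment from the earlier examples is exposed in-cluster through a Service named my-app:

# Generate sustained load from inside the cluster
kubectl run load-generator --image=busybox --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://my-app; done"

# In another terminal: watch replicas climb
kubectl get hpa app-hpa --watch

# Remove the load and observe the scale-down stabilization window
kubectl delete pod load-generator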

Metrics to validate

1. Time to scale up (target: < 2 min)
2. Thrashing frequency (target: 0)
3. Request success rate during scaling (target: > 99%)
4. Cost during spikes (target: within budget)
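
Two of these can be watched directly in Prometheus. A sketch assuming kube-state-metrics v2 metric names and an app-side http_requests_total counter with a code label:

# Thrashing: replica-count changes in the last hour (target: 0)
changes(kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="app-hpa"}[1h])

# Request success rate while scaling (target: > 0.99)
sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))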

Conclusion

Predictable autoscaling requires:

  1. Adequate thresholds: not too high, not too low
  2. Behavior configured: stabilization windows to avoid thrashing
  3. Correct metrics: CPU isn't always the best indicator
  4. Defined limits: max replicas + cost alerts
  5. Validated tests: never trust untested autoscaling

Before trusting autoscaling:

  1. Measure scale up time
  2. Simulate realistic spikes
  3. Validate behavior at limits
  4. Actively monitor costs

Autoscaling is like autopilot: great when it works, disastrous when it fails. Always have plan B.


Want to understand your platform's limits?

Contact us for a performance assessment.
