Auto-Scaling Patterns
Why It Exists
Statically provisioned capacity is always wrong. Over-provision and money burns on idle machines. Under-provision and users eat latency spikes during traffic surges. There is no correct static number.
Auto-scaling solves this by adjusting capacity to match actual demand. At large scale the savings are dramatic. Workloads that swing 3x between daily peak and trough can cut infrastructure costs by 40-60% compared to provisioning for peak.
How It Works
Scaling Dimensions
Horizontal scaling adds or removes instances (pods, VMs, containers). For stateless services, this is almost always the right choice. No single point of failure, no hard ceiling. Just add more of the same thing.
Vertical scaling resizes existing instances by giving them more CPU or memory. It has its place for stateful workloads that cannot be easily distributed, but it requires restarts and eventually hits hardware limits. If someone suggests vertical scaling as the primary strategy for a stateless API, push back.
Scaling Algorithms
Target tracking is the workhorse. Pick a target value (say, average CPU at 60%) and the autoscaler adjusts replica count to hold it there. The math is simple: desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue)). So if CPU sits at 90% with a 60% target and 4 replicas, the result is ceil(4 * 90/60) = 6 replicas. Easy to reason about, easy to explain.
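In Kubernetes, target tracking is what the HPA does with a utilization target. A minimal sketch, assuming a Deployment named api (the name, replica bounds, and 60% target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                        # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                      # hypothetical Deployment to scale
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # hold average CPU at ~60%
```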
Step scaling provides more control through discrete thresholds. At 60% CPU add 1 instance, at 80% add 3, at 95% add 5. The trade-off is that those breakpoints need manual tuning, and getting them wrong means either sluggish response or wasted capacity.
Predictive scaling looks at historical patterns and pre-provisions capacity before load arrives. AWS Predictive Scaling analyzes 14 days of CloudWatch data and schedules changes ahead of time. If traffic follows a clear daily pattern and reactive scaling is too slow to keep up, predictive scaling is the tool to reach for.
Feedback Loops and Oscillation
The single biggest operational headache is scaling oscillation. It goes like this: load rises, the system scales out, metrics drop, it scales in, metrics rise again, it scales out again. Around and around. I have watched teams debug this for days before realizing their cooldowns were too short.
Prevent it with a few techniques (a combined HPA behavior sketch follows the list):
- Stabilization windows. HPA v2 defaults to a 5-minute window for scale-down. During that window, the autoscaler picks the highest recommended replica count, which prevents premature scale-in.
- Cooldown periods. After a scaling action, the system ignores subsequent signals for a set duration. AWS defaults to 300 seconds for scale-in, and that is a reasonable starting point.
- Asymmetric scaling. Scale out fast (seconds) but scale in slowly (minutes). Losing capacity hurts more than keeping a few extra replicas running briefly. This asymmetry is a feature, not a bug.
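A minimal sketch of how these look in an HPA v2 behavior block, added under spec of an HPA like the one above (the exact windows and rates are assumptions to tune against your own traffic):

```yaml
# nests under spec: of the HorizontalPodAutoscaler
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react to load immediately
    policies:
      - type: Percent
        value: 100                     # allow doubling every 15 seconds
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300    # 5-minute window; the highest recommendation wins
    policies:
      - type: Pods
        value: 1                       # shed at most one pod per minute
        periodSeconds: 60
```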
Karpenter vs Cluster Autoscaler
The Cluster Autoscaler watches for unschedulable pods and adds nodes from predefined node groups (ASGs). It works, but it is slow (2-5 minutes) and limited to instance types configured ahead of time. It is basically guessing what shapes will be needed.
Karpenter flips this around. It looks at what pending pods actually require, then provisions the cheapest instance that satisfies those constraints (CPU, memory, GPU, architecture, availability zone). It picks from the full menu of instance types, provisions in 60-90 seconds, and automatically consolidates underutilized nodes. In practice, switching from Cluster Autoscaler to Karpenter cut our node provisioning time in half and reduced compute costs by 20-30% through better bin-packing.
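A NodePool that allows spot and on-demand capacity, both CPU architectures, and automatic consolidation looks like this: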
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default                              # name added for a complete manifest
spec:
  template:
    spec:
      # nodeClassRef (pointing at an EC2NodeClass) omitted for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  limits:
    cpu: "1000"                              # cap total provisioned CPU across the pool
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized   # repack and remove underutilized nodes
```
Production Considerations
- Pick better metrics. Use request latency (P99) or queue depth as scaling signals instead of CPU. A service can sit at 30% CPU while returning 2-second responses because it is blocked on I/O. CPU alone tells you almost nothing about user experience.
- KEDA for event-driven workloads. KEDA scales on external metrics (SQS queue length, Kafka consumer lag, Prometheus queries) and supports scale-to-zero, which HPA cannot do. For queue consumers, KEDA is the obvious choice (see the ScaledObject sketch after this list).
- Spot instances for cost savings. Combining autoscaling with spot instances can save 60-90%. Use Karpenter's interruption handling to drain spot nodes gracefully when AWS reclaims capacity.
- Load test the scaling, not just the app. Run synthetic load and watch the autoscaler respond. Does scale-out speed match real traffic ramp rates? Does scale-in destabilize the system? Almost certainly something will need tuning.
- Set cost guardrails. Always configure maximum replica counts and node pool CPU/memory limits. Without these, a traffic spike or a bad metric query can spin up hundreds of instances. I have seen a single misconfigured HPA target generate a five-figure bill overnight.
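For the KEDA point above, a minimal ScaledObject sketch for an SQS consumer (the queue URL, names, and threshold are placeholders; AWS credentials via a TriggerAuthentication or pod identity are assumed and not shown):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer            # hypothetical
spec:
  scaleTargetRef:
    name: orders-consumer          # Deployment running the queue consumer
  minReplicaCount: 0               # scale to zero when the queue is empty
  maxReplicaCount: 50              # cost guardrail
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders   # placeholder
        queueLength: "100"         # target messages per replica
        awsRegion: us-east-1
```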
Failure Scenarios
Scenario 1: Metrics Pipeline Goes Down
Prometheus OOMs on high cardinality, or the Kubernetes Metrics Server becomes unavailable. HPA cannot fetch metrics, its ScalingActive condition goes false, and the current replica count freezes. The system can neither scale out under increasing load nor scale in when load drops. If load climbs, latency degrades and requests fail; if load was already at peak, the result is paying for frozen, over-provisioned replicas.
- Detection: horizontal_pod_autoscaler_status_condition{condition="ScalingActive",status="false"} fires; Metrics Server apiservice_available drops.
- Recovery: restore the metrics pipeline. HPA holds the current replica count while metrics are missing, and its --horizontal-pod-autoscaler-tolerance flag (default 0.1) keeps it from overreacting to noisy values once they return.
- Prevention: run Metrics Server as HA (2+ replicas) and use Prometheus with a Thanos sidecar for redundancy. Multiple independent scaling-signal sources avoid a single point of failure on any one metrics system.
Scenario 2: Thundering Herd Exhausts Node Capacity
A flash sale or viral moment brings 10x traffic in 60 seconds. HPA requests 50 new pods, but the Cluster Autoscaler needs 3-5 minutes to provision nodes. Pods sit in Pending and users get 503s.
- Detection: kube_pod_status_phase{phase="Pending"} count rises, scheduler_pending_pods spikes, cluster_autoscaler_unschedulable_pods_count > 0.
- Prevention: pre-provision buffer nodes (keep 2-3 spare nodes warm, and use Karpenter's consolidationPolicy: WhenEmpty so they are only reclaimed once fully empty). For known surges (product launch, marketing campaign), pre-scale 30 minutes ahead, as sketched below. Maintain at least 20-40% headroom for sale events. Target: new pods schedulable within 30 seconds.
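One way to pre-scale for a known surge is a plain CronJob that raises the HPA floor ahead of the event. A sketch, assuming an HPA named api, a 09:00 surge, and a ServiceAccount with RBAC permission to patch HPAs (not shown):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-api               # hypothetical
spec:
  schedule: "30 8 * * *"           # 30 minutes before the expected 09:00 surge
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher   # assumed to have patch rights on HPAs
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl patch hpa api --type merge -p '{"spec":{"minReplicas":20}}'
```

A second CronJob scheduled after the event would lower minReplicas back to its normal value.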
Scenario 3: Scale-In Race During Deployment
A rolling deployment temporarily drops CPU usage as old pods drain and new pods warm up. HPA reads the dip as reduced demand and scales in. Then the new pods finish initializing, full load arrives, and there are now fewer replicas than needed. P99 latency spikes right after every deploy.
- Detection: deployment events correlate with scale-in events in audit logs; latency spikes follow deployments in a consistent pattern.
- Recovery: set HPA behavior.scaleDown.stabilizationWindowSeconds: 600 (10 minutes) to ride through deployments without reacting.
- Prevention: pause HPA during deployments via annotations, or use a request-rate metric that does not dip during rollouts. CPU is especially misleading during deployments.
Capacity Planning
| Metric | Planning Threshold | Real-World Reference |
|---|---|---|
| Target CPU utilization | 60-70% (leaves headroom for bursts) | Google SRE: 60% target recommended for latency-sensitive services |
| Scale-out response time | HPA: 15-30s; Node: 60-300s; Karpenter: 60-90s | Airbnb: 90s end-to-end from metric spike to pod serving traffic |
| Min replicas | ≥ 3 for HA (spread across AZs) | Stripe: minimum 6 replicas for payment-critical services |
| Max replicas | Set explicit ceiling (cost guardrail) | Cap at 10x normal traffic to bound blast radius and cost |
| Scale-in cooldown | 300-600s (5-10 min) | AWS default: 300s; Spotify uses 600s for core services |
| Headroom buffer | 20-40% above observed peak | Pinterest: 30% headroom for organic traffic growth + burst margin |
Key formulas: required_replicas = ceil(peak_rps / rps_per_pod), where rps_per_pod = (1000 / P99_latency_ms) * concurrency_per_pod. For a service with P99 = 50ms and 10 concurrent connections per pod: rps_per_pod = (1000/50) * 10 = 200. At 10K peak RPS: ceil(10000/200) = 50 replicas; add 30% headroom for a maximum of 65. Node planning: nodes = ceil(total_pod_cpu_requests / (allocatable_cpu_per_node * 0.8)). The 0.8 factor accounts for system pods (kube-proxy, DaemonSets, monitoring agents) eating roughly 20% of each node's capacity.
Architecture Decision Record
ADR: Scaling Strategy Selection
Context: Choosing between reactive, predictive, scheduled, and event-driven scaling for a workload.
| Criteria (Weight) | Reactive (HPA) | Predictive (AWS) | Scheduled (CronHPA) | Event-Driven (KEDA) |
|---|---|---|---|---|
| Traffic pattern fit (25%) | Gradual ramps, organic growth | Repeating daily/weekly cycles | Known events (sales, broadcasts) | Queue-based, async processing |
| Response speed (20%) | 15-30s (after metric delay) | Proactive (pre-scales) | Exact (pre-scheduled) | 15-30s (scales on queue depth) |
| Scale-to-zero (15%) | Not supported (min replicas ≥ 1) | Not supported | Possible with CronHPA | Native support |
| Operational complexity (15%) | Low (built-in to K8s) | Low (AWS managed ML) | Low (cron expressions) | Medium (KEDA operator + scalers) |
| Custom metric support (15%) | Yes (via custom metrics API) | Limited (CPU, network, ALB) | N/A (time-based only) | Excellent (50+ event sources) |
| Cost efficiency (10%) | Good for steady workloads | Best for predictable patterns | Best for known events | Best for bursty async workloads |
Decision guidance: Most production services should layer multiple strategies. Start with reactive HPA as the baseline for handling unexpected traffic. Add predictive or scheduled scaling for known patterns like daily peaks or marketing events. Use KEDA for async workloads like queue consumers and stream processors.
Specifically: use reactive HPA with custom metrics (RPS, latency) for synchronous APIs. Add predictive scaling if traffic shows strong daily or weekly periodicity with more than 2x variance. Use KEDA for anything consuming from SQS, Kafka, or RabbitMQ. Its scale-to-zero capability saves 80-90% for intermittent batch processors.
One thing people underestimate is team capacity. HPA alone is appropriate for teams without dedicated platform engineers. Adding KEDA and predictive scaling means someone needs to understand how multiple scaling controllers interact. If nobody on the team owns that, keep it simple.
Key Points
- Dynamically adjusts compute capacity based on demand. Scales out under load, scales in when idle.
- Three main approaches: reactive scaling (metric thresholds), predictive scaling (ML-based forecasting), and scheduled scaling.
- Horizontal Pod Autoscaler (HPA) scales pods. Cluster Autoscaler scales nodes. Karpenter is a faster, more flexible replacement for the Cluster Autoscaler.
- Scale-out should be fast. Scale-in needs to be conservative, or oscillation occurs.
- Custom metrics like queue depth and business KPIs usually make better scaling signals than CPU or memory.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Kubernetes HPA | Open Source | Pod-level scaling, custom metrics API | Medium-Enterprise |
| Karpenter | Open Source | Fast node provisioning, instance type selection | Medium-Enterprise |
| AWS Auto Scaling | Managed | EC2/ECS scaling, target tracking policies | Small-Enterprise |
| KEDA | Open Source | Event-driven scaling, scale-to-zero | Medium-Enterprise |
Common Mistakes
- Scaling on CPU only. High CPU doesn't always mean the service needs more instances.
- Setting scale-in cooldown too short. This causes thrashing: scale out, scale in, scale out, repeat.
- Not accounting for pod startup time. New pods aren't ready for 30-60 seconds after creation.
- Ignoring cluster autoscaler lag. Node provisioning takes 2-5 minutes on most cloud providers.
- Skipping PodDisruptionBudgets during scale-in. Terminating too many pods at once kills availability.
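For the last point, a minimal PodDisruptionBudget sketch (the label selector and threshold are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb                    # hypothetical
spec:
  minAvailable: "80%"              # never voluntarily evict below 80% of desired pods
  selector:
    matchLabels:
      app: api
```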