Auto-Scaling Patterns
Why It Exists
Statically provisioned capacity is always wrong. Over-provision and money burns on idle machines. Under-provision and users eat latency spikes during traffic surges. There is no correct static number.
Auto-scaling solves this by adjusting capacity to match actual demand. At large scale the savings are dramatic. Workloads that swing 3x between daily peak and trough can cut infrastructure costs by 40-60% compared to provisioning for peak.
How It Works
Scaling Dimensions
Horizontal scaling adds or removes instances (pods, VMs, containers). For stateless services, this is almost always the right choice. No single point of failure, no hard ceiling. Just add more of the same thing.
Vertical scaling resizes existing instances by giving them more CPU or memory. It has its place for stateful workloads that cannot be easily distributed, but it requires restarts and eventually hits hardware limits. If someone suggests vertical scaling as the primary strategy for a stateless API, push back.
Scaling Algorithms
Target tracking is the workhorse. Pick a target value (say, average CPU at 60%) and the autoscaler adjusts replica count to hold it there. The math is simple: desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue)). So if CPU sits at 90% with a 60% target and 4 replicas, the result is ceil(4 * 90/60) = 6 replicas. Easy to reason about, easy to explain.
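In Kubernetes, target tracking is what the HPA does with a utilization target. A minimal sketch, assuming a Deployment named api (the name, replica bounds, and 60% target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                        # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                      # hypothetical Deployment to scale
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # hold average CPU at ~60%
```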
Step scaling provides more control through discrete thresholds. At 60% CPU add 1 instance, at 80% add 3, at 95% add 5. The trade-off is that those breakpoints need manual tuning, and getting them wrong means either sluggish response or wasted capacity.
Predictive scaling looks at historical patterns and pre-provisions capacity before load arrives. AWS Predictive Scaling analyzes 14 days of CloudWatch data and schedules changes ahead of time. If traffic follows a clear daily pattern and reactive scaling is too slow to keep up, predictive scaling is the tool to reach for.
Feedback Loops and Oscillation
The single biggest operational headache is scaling oscillation. It goes like this: load rises, the system scales out, metrics drop, it scales in, metrics rise again, it scales out again. Around and around. I have watched teams debug this for days before realizing their cooldowns were too short.
Prevent it with a few techniques (a combined HPA behavior sketch follows the list):
- Stabilization windows. HPA v2 defaults to a 5-minute window for scale-down. During that window, the autoscaler picks the highest recommended replica count, which prevents premature scale-in.
- Cooldown periods. After a scaling action, the system ignores subsequent signals for a set duration. AWS defaults to 300 seconds for scale-in, and that is a reasonable starting point.
- Asymmetric scaling. Scale out fast (seconds) but scale in slowly (minutes). Losing capacity hurts more than keeping a few extra replicas running briefly. This asymmetry is a feature, not a bug.
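A minimal sketch of how these look in an HPA v2 behavior block, added under spec of an HPA like the one above (the exact windows and rates are assumptions to tune against your own traffic):

```yaml
# nests under spec: of the HorizontalPodAutoscaler
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react to load immediately
    policies:
      - type: Percent
        value: 100                     # allow doubling every 15 seconds
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300    # 5-minute window; the highest recommendation wins
    policies:
      - type: Pods
        value: 1                       # shed at most one pod per minute
        periodSeconds: 60
```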
Karpenter vs Cluster Autoscaler
The Cluster Autoscaler watches for unschedulable pods and adds nodes from predefined node groups (ASGs). It works, but it is slow (2-5 minutes) and limited to instance types configured ahead of time. It is basically guessing what shapes will be needed.
Karpenter flips this around. It looks at what pending pods actually require, then provisions the cheapest instance that satisfies those constraints (CPU, memory, GPU, architecture, availability zone). It picks from the full menu of instance types, provisions in 60-90 seconds, and automatically consolidates underutilized nodes. In practice, switching from Cluster Autoscaler to Karpenter cut our node provisioning time in half and reduced compute costs by 20-30% through better bin-packing.
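A NodePool that allows spot and on-demand capacity, both CPU architectures, and automatic consolidation looks like this: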
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default                              # name added for a complete manifest
spec:
  template:
    spec:
      # nodeClassRef (pointing at an EC2NodeClass) omitted for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  limits:
    cpu: "1000"                              # cap total provisioned CPU across the pool
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized   # repack and remove underutilized nodes
```
Production Considerations
- Pick better metrics. Use request latency (P99) or queue depth as scaling signals instead of CPU. A service can sit at 30% CPU while returning 2-second responses because it is blocked on I/O. CPU alone tells you almost nothing about user experience.
- KEDA for event-driven workloads. KEDA scales on external metrics (SQS queue length, Kafka consumer lag, Prometheus queries) and supports scale-to-zero, which HPA cannot do. For queue consumers, KEDA is the obvious choice (see the ScaledObject sketch after this list).
- Spot instances for cost savings. Combining autoscaling with spot instances can save 60-90%. Use Karpenter's interruption handling to drain spot nodes gracefully when AWS reclaims capacity.
- Load test the scaling, not just the app. Run synthetic load and watch the autoscaler respond. Does scale-out speed match real traffic ramp rates? Does scale-in destabilize the system? Almost certainly something will need tuning.
- Set cost guardrails. Always configure maximum replica counts and node pool CPU/memory limits. Without these, a traffic spike or a bad metric query can spin up hundreds of instances. I have seen a single misconfigured HPA target generate a five-figure bill overnight.
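For the KEDA point above, a minimal ScaledObject sketch for an SQS consumer (the queue URL, names, and threshold are placeholders; AWS credentials via a TriggerAuthentication or pod identity are assumed and not shown):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer            # hypothetical
spec:
  scaleTargetRef:
    name: orders-consumer          # Deployment running the queue consumer
  minReplicaCount: 0               # scale to zero when the queue is empty
  maxReplicaCount: 50              # cost guardrail
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders   # placeholder
        queueLength: "100"         # target messages per replica
        awsRegion: us-east-1
```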
Failure Scenarios
Scenario 1: Metrics Pipeline Goes Down
Prometheus OOMs on high cardinality, or the Kubernetes Metrics Server becomes unavailable. HPA cannot fetch metrics, its ScalingActive condition goes false, and the current replica count freezes. The system can neither scale out under increasing load nor scale in when load drops. If load climbs, latency degrades and requests fail; if load was already at peak, the result is paying for frozen, over-provisioned replicas.
- Detection: horizontal_pod_autoscaler_status_condition{condition="ScalingActive",status="false"} fires; Metrics Server apiservice_available drops.
- Recovery: restore the metrics pipeline. HPA holds the current replica count while metrics are missing, and its --horizontal-pod-autoscaler-tolerance flag (default 0.1) keeps it from overreacting to noisy values once they return.
- Prevention: run Metrics Server as HA (2+ replicas) and use Prometheus with a Thanos sidecar for redundancy. Multiple independent scaling-signal sources avoid a single point of failure on any one metrics system.
Scenario 2: Thundering Herd Exhausts Node Capacity
A flash sale or viral moment brings 10x traffic in 60 seconds. HPA requests 50 new pods, but the Cluster Autoscaler needs 3-5 minutes to provision nodes. Pods sit in Pending and users get 503s.
- Detection: kube_pod_status_phase{phase="Pending"} count rises, scheduler_pending_pods spikes, cluster_autoscaler_unschedulable_pods_count > 0.
- Prevention: pre-provision buffer nodes (keep 2-3 spare nodes warm, and use Karpenter's consolidationPolicy: WhenEmpty so they are only reclaimed once fully empty). For known surges (product launch, marketing campaign), pre-scale 30 minutes ahead, as sketched below. Maintain at least 20-40% headroom for sale events. Target: new pods schedulable within 30 seconds.
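One way to pre-scale for a known surge is a plain CronJob that raises the HPA floor ahead of the event. A sketch, assuming an HPA named api, a 09:00 surge, and a ServiceAccount with RBAC permission to patch HPAs (not shown):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-api               # hypothetical
spec:
  schedule: "30 8 * * *"           # 30 minutes before the expected 09:00 surge
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher   # assumed to have patch rights on HPAs
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl patch hpa api --type merge -p '{"spec":{"minReplicas":20}}'
```

A second CronJob scheduled after the event would lower minReplicas back to its normal value.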
Scenario 3: Scale-In Race During Deployment
A rolling deployment temporarily drops CPU usage as old pods drain and new pods warm up. HPA reads the dip as reduced demand and scales in. Then the new pods finish initializing, full load arrives, and there are now fewer replicas than needed. P99 latency spikes right after every deploy.
- Detection: deployment events correlate with scale-in events in audit logs; latency spikes follow deployments in a consistent pattern.
- Recovery: set HPA behavior.scaleDown.stabilizationWindowSeconds: 600 (10 minutes) to ride through deployments without reacting.
- Prevention: pause HPA during deployments via annotations, or use a request-rate metric that does not dip during rollouts. CPU is especially misleading during deployments.
Capacity Planning
| Metric | Planning Threshold | Real-World Reference |
|---|---|---|
| Target CPU utilization | 60-70% (leaves headroom for bursts) | Google SRE: 60% target recommended for latency-sensitive services |
| Scale-out response time | HPA: 15-30s; Node: 60-300s; Karpenter: 60-90s | Airbnb: 90s end-to-end from metric spike to pod serving traffic |
| Min replicas | ≥ 3 for HA (spread across AZs) | Stripe: minimum 6 replicas for payment-critical services |
| Max replicas | Set explicit ceiling (cost guardrail) | Cap at 10x normal traffic to bound blast radius and cost |
| Scale-in cooldown | 300-600s (5-10 min) | AWS default: 300s; Spotify uses 600s for core services |
| Headroom buffer | 20-40% above observed peak | Pinterest: 30% headroom for organic traffic growth + burst margin |
Key formulas: required_replicas = ceil(peak_rps / rps_per_pod), where rps_per_pod = (1000 / P99_latency_ms) * concurrency_per_pod. For a service with P99 = 50ms and 10 concurrent connections per pod: rps_per_pod = (1000/50) * 10 = 200. At 10K peak RPS: ceil(10000/200) = 50 replicas; add 30% headroom for a maximum of 65. Node planning: nodes = ceil(total_pod_cpu_requests / (allocatable_cpu_per_node * 0.8)). The 0.8 factor accounts for system pods (kube-proxy, DaemonSets, monitoring agents) eating roughly 20% of each node's capacity.
Architecture Decision Record
ADR: Scaling Strategy Selection
Context: Choosing between reactive, predictive, scheduled, and event-driven scaling for a workload.
| Criteria (Weight) | Reactive (HPA) | Predictive (AWS) | Scheduled (CronHPA) | Event-Driven (KEDA) |
|---|---|---|---|---|
| Traffic pattern fit (25%) | Gradual ramps, organic growth | Repeating daily/weekly cycles | Known events (sales, broadcasts) | Queue-based, async processing |
| Response speed (20%) | 15-30s (after metric delay) | Proactive (pre-scales) | Exact (pre-scheduled) | 15-30s (scales on queue depth) |
| Scale-to-zero (15%) | Not supported (min replicas ≥ 1) | Not supported | Possible with CronHPA | Native support |
| Operational complexity (15%) | Low (built-in to K8s) | Low (AWS managed ML) | Low (cron expressions) | Medium (KEDA operator + scalers) |
| Custom metric support (15%) | Yes (via custom metrics API) | Limited (CPU, network, ALB) | N/A (time-based only) | Excellent (50+ event sources) |
| Cost efficiency (10%) | Good for steady workloads | Best for predictable patterns | Best for known events | Best for bursty async workloads |
Decision guidance: Most production services should layer multiple strategies. Start with reactive HPA as the baseline for handling unexpected traffic. Add predictive or scheduled scaling for known patterns like daily peaks or marketing events. Use KEDA for async workloads like queue consumers and stream processors.
Specifically: use reactive HPA with custom metrics (RPS, latency) for synchronous APIs. Add predictive scaling if traffic shows strong daily or weekly periodicity with more than 2x variance. Use KEDA for anything consuming from SQS, Kafka, or RabbitMQ. Its scale-to-zero capability saves 80-90% for intermittent batch processors.
One thing people underestimate is team capacity. HPA alone is appropriate for teams without dedicated platform engineers. Adding KEDA and predictive scaling means someone needs to understand how multiple scaling controllers interact. If nobody on the team owns that, keep it simple.
Key Points
- Dynamically adjusts compute capacity based on demand. Scales out under load, scales in when idle.
- Three main approaches: reactive scaling (metric thresholds), predictive scaling (ML-based forecasting), and scheduled scaling.
- Horizontal Pod Autoscaler (HPA) scales pods. Cluster Autoscaler scales nodes. Karpenter is a faster, more flexible replacement for the Cluster Autoscaler.
- Scale-out should be fast. Scale-in needs to be conservative, or oscillation occurs.
- Custom metrics like queue depth and business KPIs usually make better scaling signals than CPU or memory.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Kubernetes HPA | Open Source | Pod-level scaling, custom metrics API | Medium-Enterprise |
| Karpenter | Open Source | Fast node provisioning, instance type selection | Medium-Enterprise |
| AWS Auto Scaling | Managed | EC2/ECS scaling, target tracking policies | Small-Enterprise |
| KEDA | Open Source | Event-driven scaling, scale-to-zero | Medium-Enterprise |
Common Mistakes
- Scaling on CPU only. High CPU doesn't always mean the service needs more instances.
- Setting scale-in cooldown too short. This causes thrashing: scale out, scale in, scale out, repeat.
- Not accounting for pod startup time. New pods aren't ready for 30-60 seconds after creation.
- Ignoring cluster autoscaler lag. Node provisioning takes 2-5 minutes on most cloud providers.
- Skipping PodDisruptionBudgets during scale-in. Terminating too many pods at once kills availability.
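For the last point, a minimal PodDisruptionBudget sketch (the label selector and threshold are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb                    # hypothetical
spec:
  minAvailable: "80%"              # never voluntarily evict below 80% of desired pods
  selector:
    matchLabels:
      app: api
```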