Capacity Planning Failures
Why Capacity Planning Goes Wrong
Capacity planning fails because it requires predicting the future, and engineers are bad at that. You base your estimates on current traffic, add a comfortable margin, and forget about the marketing team's plans to run a Super Bowl ad. Or a competitor goes down and their users flood to your service. Or a TikTok video featuring your product gets 10 million views overnight.
The failure mode is always the same: traffic exceeds capacity, the system can't scale fast enough, users get errors. The difference between companies that handle this well and those that don't is preparation, not prediction.
Autoscaling Lag
Autoscaling is often presented as the solution to capacity planning. It's not. Autoscaling is a tool for handling gradual traffic changes. For sudden spikes, autoscaling is too slow.
Here's the timeline of a Kubernetes scale-up event: the Horizontal Pod Autoscaler detects high CPU (30 seconds to detect, plus a 15-second evaluation window). A new pod is scheduled, but no node is available. The Cluster Autoscaler requests a new EC2 instance (API call: 2-5 seconds). The instance launches (60-90 seconds), then joins the cluster and becomes ready (30-45 seconds). The pod is scheduled on the new node. The container image is pulled (30-120 seconds depending on image size). The application starts and passes its health check (15-60 seconds).
Total: 3-7 minutes from traffic spike to additional capacity. If your traffic doubles in 30 seconds, you're serving errors for 3-7 minutes.
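As a rough tally, summing the stage estimates above reproduces the 3-7 minute range; the (min, max) values below come from this section's figures, not from measurements of any particular cluster:

```python
# Back-of-the-envelope scale-up latency, using the stage estimates above.
# The (min, max) ranges are illustrative, not measured from a real cluster.
STAGES_SECONDS = {
    "HPA detection + evaluation window": (30, 45),
    "Cluster Autoscaler EC2 API call": (2, 5),
    "EC2 instance launch": (60, 90),
    "Node joins cluster and becomes ready": (30, 45),
    "Container image pull": (30, 120),
    "App startup + health check": (15, 60),
}

best = sum(lo for lo, _ in STAGES_SECONDS.values())
worst = sum(hi for _, hi in STAGES_SECONDS.values())
print(f"Best case:  {best / 60:.1f} min")   # ~2.8 min
print(f"Worst case: {worst / 60:.1f} min")  # ~6.1 min
```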
The Database Bottleneck
You can scale your application pods to 100 replicas. You cannot scale your PostgreSQL primary to 100 replicas. Every new application pod needs database connections. If you have 50 pods with 10 connections each, that's 500 connections. PostgreSQL handles 500 connections reasonably well; 2,000 will bring it to its knees.
The fix is a connection pooler. PgBouncer or PgCat sits between your application and the database, multiplexing hundreds of application connections onto a smaller pool of database connections. In transaction pooling mode, a single database connection can serve many application requests, because most of the time a connection is idle between queries.
Set up connection pooling before you need it. Configuring PgBouncer during a traffic spike while your database is overloaded is not a pleasant experience.
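PgBouncer itself is configured through pgbouncer.ini, but the application side matters just as much: give each pod a small, bounded pool so total client connections stay predictable as the replica count grows. A minimal sketch with SQLAlchemy, assuming PgBouncer on its default port 6432; the hostname, credentials, and pool sizes are placeholders:

```python
# Sketch: bound per-pod connection usage so "pods x pool size" stays predictable.
# Endpoint, credentials, and sizes are placeholders; PgBouncer itself is
# configured separately (pgbouncer.ini). This is only the application side.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app:secret@pgbouncer.internal:6432/appdb",
    pool_size=5,        # steady-state connections per pod
    max_overflow=2,     # short-lived burst connections per pod
    pool_timeout=3,     # fail fast instead of piling up waiters
    pool_pre_ping=True, # drop dead connections before using them
)

# Worst case per pod = pool_size + max_overflow = 7 connections.
# 50 pods x 7 = 350 client connections into PgBouncer, which multiplexes
# them onto a much smaller pool of real PostgreSQL connections.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```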
Load Testing Done Right
A load test that doesn't break something is a waste of time. The point of load testing is to find the breaking point, understand the failure mode, and fix it before real users hit it.
Run load tests with k6 or Locust against a production-like environment. Start at current peak traffic and ramp to 3-5x. Watch for: the first error, the first timeout, database connection exhaustion, cache hit rate degradation, and autoscaler behavior. Each of these tells you something about your capacity envelope.
Test the full stack. A load test that hits a caching layer and shows 50,000 RPS is misleading if your cache miss path can only handle 500 RPS. Invalidate caches during the test to see real database performance under load.
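A minimal Locust sketch along these lines; the endpoints and the cache-busting query parameter are placeholders for whatever your service actually exposes:

```python
# locustfile.py -- minimal sketch; endpoints and parameters are placeholders.
import uuid
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    wait_time = between(1, 3)  # seconds between requests per simulated user

    @task(3)
    def browse(self):
        # Normal, cache-friendly traffic.
        self.client.get("/api/products")

    @task(1)
    def cache_miss(self):
        # Unique query string to force the cache-miss path, so the test
        # exercises the database and not just the caching layer.
        self.client.get(f"/api/products?nocache={uuid.uuid4()}")
```

Run it headless and ramp toward 3-5x current peak, for example `locust -f locustfile.py --headless --users 2000 --spawn-rate 50 --host https://staging.example.com`, and note where the first errors and timeouts appear.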
Planning for the Unknown
You can't predict viral traffic. You can prepare for it. Keep warm pools of instances ready. Use reserved capacity or Savings Plans for your baseline, with on-demand for bursting. Pre-scale before known events (product launches, marketing campaigns, seasonal peaks).
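When the node group is backed by an Auto Scaling group, pre-scaling can be as simple as raising the group's capacity a couple of hours ahead of the event. A sketch with boto3, where the group name, region, and capacity numbers are placeholders:

```python
# Sketch: pre-warm an ASG-backed node group ahead of a known traffic event.
# Group name, region, and capacity numbers are placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

# Raise the ceiling first so desired capacity (and later autoscaling)
# isn't capped during the event.
asg.update_auto_scaling_group(AutoScalingGroupName="eks-app-nodes", MaxSize=20)

# Then bump desired capacity well before the traffic arrives.
asg.set_desired_capacity(
    AutoScalingGroupName="eks-app-nodes",
    DesiredCapacity=12,
    HonorCooldown=False,
)
```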
Build a relationship between engineering and marketing. A shared Slack channel where marketing announces upcoming campaigns gives you time to pre-warm infrastructure. This organizational fix is more valuable than any technical solution.
Incident Timeline
- T+0m: A marketing campaign goes viral. Traffic jumps from 2,000 RPS to 18,000 RPS in 4 minutes. Nobody told engineering about the campaign.
- T+2m: The autoscaler detects the load increase and triggers a scale-up. New EKS pods are scheduled, but the cluster node pool needs 3 additional EC2 instances. Instance provisioning begins.
- T+5m: EC2 instances still launching. Container images pulling (a 2.1GB image takes 90 seconds). Application startup and health-check warmup add another 60 seconds. Current pods at 98% CPU.
- T+10m: Database connection pool exhausted. PostgreSQL max_connections hit at 200. New pods can't connect, and some existing pods lose connections to PgBouncer timeouts. 500 errors on 35% of requests.
- T+15m: New nodes online, pods running, but the database is the bottleneck. The DBA increases max_connections to 400 and adds a PgBouncer instance. Connection pressure eases.
- T+30m: Full recovery. 12 minutes of degraded service. The post-incident review reveals: no load test above 5,000 RPS, no marketing-to-engineering notification process, and a container image that was too large.
Detection Signals
- CPU or memory utilization exceeding 80% across a service's pod fleet
- Autoscaler hitting maximum replica limits or failing to provision new instances
- Database connection pool exhaustion errors in application logs
- Request queue depth increasing, indicating more requests arriving than the system can process
Prevention
- Load test to at least 3x expected peak traffic monthly. Use tools like k6, Locust, or Gatling against a production-like environment
- Pre-warm autoscaling before known traffic events. Scale up 2 hours before a campaign launch, not in response to the traffic
- Set up communication channels between marketing/sales and engineering for any event that might drive traffic spikes
- Keep container images small (under 200MB). Use multi-stage builds and distroless base images to reduce pull time during scale-up
- Implement database connection pooling (PgBouncer, ProxySQL) as a buffer between application instances and database connections
Key Points
- Autoscaling is not instant. EKS pod scheduling + node provisioning + image pull + app startup = 3-7 minutes. That's an eternity during a traffic spike
- The database is almost always the bottleneck during scale events. Application pods scale horizontally. Databases don't
- Load testing at 1x expected traffic proves nothing. You need to test at 3-5x to find the real breaking point and understand the failure mode
- Viral traffic doesn't follow normal patterns. Instead of gradual growth you get a step function: 2,000 to 20,000 RPS in under 5 minutes
- Storage is the forgotten capacity dimension. Disk fills up gradually, and nobody notices until the database can't write WAL files at 3am
Common Mistakes
- Setting autoscaler max replicas too low 'to control costs' and then being unable to scale during an actual traffic spike
- Load testing against a staging environment with 1/10th the database size and different network topology, then trusting the results
- Forgetting that scaling the application layer moves the bottleneck to the database or cache layer, which doesn't autoscale
- Not monitoring disk space on database volumes and EBS attachments until they're 100% full