Capacity Planning Failures
Why Capacity Planning Goes Wrong
Capacity planning fails because it requires predicting the future, and engineers are bad at that. You base your estimates on current traffic, add a comfortable margin, and forget about the marketing team's plans to run a Super Bowl ad. Or a competitor goes down and their users flood to your service. Or a TikTok video featuring your product gets 10 million views overnight.
The failure mode is always the same: traffic exceeds capacity, the system can't scale fast enough, users get errors. The difference between companies that handle this well and those that don't is preparation, not prediction.
Autoscaling Lag
Autoscaling is often presented as the solution to capacity planning. It's not. Autoscaling is a tool for handling gradual traffic changes. For sudden spikes, autoscaling is too slow.
Here's the timeline of a Kubernetes scale-up event: the Horizontal Pod Autoscaler detects high CPU (30 seconds to detect, plus a 15-second evaluation window). A new pod is scheduled, but no node is available. The Cluster Autoscaler requests a new EC2 instance (API call: 2-5 seconds). The instance launches (60-90 seconds), then joins the cluster and becomes ready (30-45 seconds). The pod is scheduled on the new node. The container image is pulled (30-120 seconds depending on image size). The application starts and passes its health check (15-60 seconds).
Total: 3-7 minutes from traffic spike to additional capacity. If your traffic doubles in 30 seconds, you're serving errors for 3-7 minutes.
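As a rough tally, summing the stage estimates above reproduces the 3-7 minute range; the (min, max) values below come from this section's figures, not from measurements of any particular cluster:

```python
# Back-of-the-envelope scale-up latency, using the stage estimates above.
# The (min, max) ranges are illustrative, not measured from a real cluster.
STAGES_SECONDS = {
    "HPA detection + evaluation window": (30, 45),
    "Cluster Autoscaler EC2 API call": (2, 5),
    "EC2 instance launch": (60, 90),
    "Node joins cluster and becomes ready": (30, 45),
    "Container image pull": (30, 120),
    "App startup + health check": (15, 60),
}

best = sum(lo for lo, _ in STAGES_SECONDS.values())
worst = sum(hi for _, hi in STAGES_SECONDS.values())
print(f"Best case:  {best / 60:.1f} min")   # ~2.8 min
print(f"Worst case: {worst / 60:.1f} min")  # ~6.1 min
```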
The Database Bottleneck
You can scale your application pods to 100 replicas. You cannot scale your PostgreSQL primary to 100 replicas. Every new application pod needs database connections. If you have 50 pods with 10 connections each, that's 500 connections. PostgreSQL handles 500 connections reasonably well; 2,000 will bring it to its knees.
The fix is a connection pooler. PgBouncer or PgCat sits between your application and the database, multiplexing hundreds of application connections onto a smaller pool of database connections. In transaction pooling mode, a single database connection can serve many application requests, because most of the time a connection is idle between queries.
Set up connection pooling before you need it. Configuring PgBouncer during a traffic spike while your database is overloaded is not a pleasant experience.
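PgBouncer itself is configured through pgbouncer.ini, but the application side matters just as much: give each pod a small, bounded pool so total client connections stay predictable as the replica count grows. A minimal sketch with SQLAlchemy, assuming PgBouncer on its default port 6432; the hostname, credentials, and pool sizes are placeholders:

```python
# Sketch: bound per-pod connection usage so "pods x pool size" stays predictable.
# Endpoint, credentials, and sizes are placeholders; PgBouncer itself is
# configured separately (pgbouncer.ini). This is only the application side.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app:secret@pgbouncer.internal:6432/appdb",
    pool_size=5,        # steady-state connections per pod
    max_overflow=2,     # short-lived burst connections per pod
    pool_timeout=3,     # fail fast instead of piling up waiters
    pool_pre_ping=True, # drop dead connections before using them
)

# Worst case per pod = pool_size + max_overflow = 7 connections.
# 50 pods x 7 = 350 client connections into PgBouncer, which multiplexes
# them onto a much smaller pool of real PostgreSQL connections.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```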
Load Testing Done Right
A load test that doesn't break something is a waste of time. The point of load testing is to find the breaking point, understand the failure mode, and fix it before real users hit it.
Run load tests with k6 or Locust against a production-like environment. Start at current peak traffic and ramp to 3-5x. Watch for: the first error, the first timeout, database connection exhaustion, cache hit rate degradation, and autoscaler behavior. Each of these tells you something about your capacity envelope.
Test the full stack. A load test that hits a caching layer and shows 50,000 RPS is misleading if your cache miss path can only handle 500 RPS. Invalidate caches during the test to see real database performance under load.
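A minimal Locust sketch along these lines; the endpoints and the cache-busting query parameter are placeholders for whatever your service actually exposes:

```python
# locustfile.py -- minimal sketch; endpoints and parameters are placeholders.
import uuid
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    wait_time = between(1, 3)  # seconds between requests per simulated user

    @task(3)
    def browse(self):
        # Normal, cache-friendly traffic.
        self.client.get("/api/products")

    @task(1)
    def cache_miss(self):
        # Unique query string to force the cache-miss path, so the test
        # exercises the database and not just the caching layer.
        self.client.get(f"/api/products?nocache={uuid.uuid4()}")
```

Run it headless and ramp toward 3-5x current peak, for example `locust -f locustfile.py --headless --users 2000 --spawn-rate 50 --host https://staging.example.com`, and note where the first errors and timeouts appear.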
Planning for the Unknown
You can't predict viral traffic. You can prepare for it. Keep warm pools of instances ready. Use reserved capacity or Savings Plans for your baseline, with on-demand for bursting. Pre-scale before known events (product launches, marketing campaigns, seasonal peaks).
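When the node group is backed by an Auto Scaling group, pre-scaling can be as simple as raising the group's capacity a couple of hours ahead of the event. A sketch with boto3, where the group name, region, and capacity numbers are placeholders:

```python
# Sketch: pre-warm an ASG-backed node group ahead of a known traffic event.
# Group name, region, and capacity numbers are placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

# Raise the ceiling first so desired capacity (and later autoscaling)
# isn't capped during the event.
asg.update_auto_scaling_group(AutoScalingGroupName="eks-app-nodes", MaxSize=20)

# Then bump desired capacity well before the traffic arrives.
asg.set_desired_capacity(
    AutoScalingGroupName="eks-app-nodes",
    DesiredCapacity=12,
    HonorCooldown=False,
)
```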
Build a relationship between engineering and marketing. A shared Slack channel where marketing announces upcoming campaigns gives you time to pre-warm infrastructure. This organizational fix is more valuable than any technical solution.
Incident Timeline
- T+0m: A marketing campaign goes viral. Traffic jumps from 2,000 RPS to 18,000 RPS in 4 minutes. Nobody told engineering about the campaign.
- T+2m: The autoscaler detects the load increase and triggers a scale-up. New EKS pods are scheduled, but the cluster node pool needs 3 additional EC2 instances. Instance provisioning begins.
- T+5m: EC2 instances still launching. Container images pulling (a 2.1GB image takes 90 seconds). Application startup and health-check warmup add another 60 seconds. Current pods at 98% CPU.
- T+10m: Database connection pool exhausted. PostgreSQL max_connections hit at 200. New pods can't connect, and some existing pods lose connections to PgBouncer timeouts. 500 errors on 35% of requests.
- T+15m: New nodes online, pods running, but the database is the bottleneck. The DBA increases max_connections to 400 and adds a PgBouncer instance. Connection pressure eases.
- T+30m: Full recovery. 12 minutes of degraded service. The post-incident review reveals: no load test above 5,000 RPS, no marketing-to-engineering notification process, and a container image that was too large.
Detection Signals
- CPU or memory utilization exceeding 80% across a service's pod fleet
- Autoscaler hitting maximum replica limits or failing to provision new instances
- Database connection pool exhaustion errors in application logs
- Request queue depth increasing, indicating more requests arriving than the system can process
Prevention
- Load test to at least 3x expected peak traffic monthly. Use tools like k6, Locust, or Gatling against a production-like environment
- Pre-warm autoscaling before known traffic events. Scale up 2 hours before a campaign launch, not in response to the traffic
- Set up communication channels between marketing/sales and engineering for any event that might drive traffic spikes
- Keep container images small (under 200MB). Use multi-stage builds and distroless base images to reduce pull time during scale-up
- Implement database connection pooling (PgBouncer, ProxySQL) as a buffer between application instances and database connections
Key Points
- Autoscaling is not instant. EKS pod scheduling + node provisioning + image pull + app startup = 3-7 minutes. That's an eternity during a traffic spike
- The database is almost always the bottleneck during scale events. Application pods scale horizontally. Databases don't
- Load testing at 1x expected traffic proves nothing. You need to test at 3-5x to find the real breaking point and understand the failure mode
- Viral traffic doesn't follow normal patterns. Instead of gradual growth you get a step function: 2,000 to 20,000 RPS in under 5 minutes
- Storage is the forgotten capacity dimension. Disk fills up gradually, and nobody notices until the database can't write WAL files at 3am
Common Mistakes
- Setting autoscaler max replicas too low 'to control costs' and then being unable to scale during an actual traffic spike
- Load testing against a staging environment with 1/10th the database size and different network topology, then trusting the results
- Forgetting that scaling the application layer moves the bottleneck to the database or cache layer, which doesn't autoscale
- Not monitoring disk space on database volumes and EBS attachments until they're 100% full