Thundering Herd & Retry Storms
When Retries Turn a Small Problem Into a Big One
Retries are the most deceptively dangerous pattern in distributed systems. On the surface, it seems like a no-brainer: if a request fails, just try again. But when you have thousands of clients all following that same logic, you get a feedback loop that can turn a minor blip into a full-blown outage.
The Thundering Herd
Picture a cache server that holds the product catalog for an e-commerce site. It restarts for a routine update, and for 30 seconds there are zero cached entries. Every single request from every user goes straight to the database. If you normally serve 10,000 requests per second from cache with a 1% miss rate, that is 100 database queries per second. During the cold cache window, that number jumps to 10,000 queries per second. That is a 100x spike, and a database provisioned for the normal 100 queries per second has no chance of absorbing it.
The standard fix is cache stampede protection. When the first request finds a cache miss, it grabs a lock and fetches from the database. Every subsequent request for that same key waits for the first one to fill the cache instead of all of them hammering the database independently. This approach (sometimes called "request coalescing" or "single-flight") cuts the database load from N concurrent requests down to exactly one.
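To make the idea concrete, here is a minimal in-process sketch of single-flight protection in Python. The loader function, the plain dictionary used as the cache, and the per-key locking scheme are illustrative assumptions; a production version would add TTLs, error handling, and a distributed lock if multiple servers share the cache.

```python
import threading

class SingleFlightCache:
    """Coalesces concurrent misses for the same key into one loader call."""

    def __init__(self, loader):
        self._loader = loader            # e.g. a hypothetical fetch_from_db(key)
        self._cache = {}                 # key -> cached value
        self._locks = {}                 # key -> lock held by the in-flight fetch
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, key):
        if key in self._cache:
            return self._cache[key]      # fast path: cache hit

        # Cache miss: find or create the lock for this key.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())

        with lock:
            # Re-check after acquiring the lock: another request may have
            # filled the cache while we were waiting.
            if key not in self._cache:
                self._cache[key] = self._loader(key)   # exactly one DB call
            return self._cache[key]
```

The key detail is the re-check inside the lock: the first waiter does the database fetch, and everyone queued behind it finds the value already cached.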
How Retry Storms Snowball
Retry storms are worse than thundering herds because they reinforce themselves. Here is how the cycle works: the backend gets slow under load. Clients time out and retry. The retries add more load. The backend gets even slower. More clients time out. More retries fire. Each round amplifies the previous one.
Left unchecked, this grows exponentially. If each client retries 3 times and you have 1,000 clients, one second of slowness produces 3,000 extra requests on top of the original 1,000. Those 4,000 requests cause more timeouts, generating another 12,000 retries. Within minutes, the backend is getting hit with 10-50x its normal load, and almost all of it is retries.
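A toy calculation makes the amplification concrete. It uses the numbers from the paragraph above plus the pessimistic assumption that every request in a round times out and gets retried:

```python
# 1,000 clients, each willing to retry a failed request 3 times.
clients, retries_per_failure = 1_000, 3

load = clients
for round_num in range(1, 4):
    load += load * retries_per_failure   # every request in flight spawns 3 retries
    print(f"after round {round_num}: {load:,} requests")

# after round 1: 4,000 requests
# after round 2: 16,000 requests
# after round 3: 64,000 requests
```

Real systems do not amplify quite this cleanly, since some requests succeed and retries are spread over time, but the direction is the same: the load curve bends upward until something breaks the loop.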
Exponential Backoff With Jitter
The standard answer is exponential backoff: wait 1 second, then 2, then 4 between retries. But there is a subtle catch. If 1,000 clients all start retrying at the same moment, they all compute the same backoff intervals and retry in synchronized waves. Random jitter breaks that synchronization. Instead of all clients retrying at T+1s, they spread out between T+0.5s and T+1.5s, distributing the load more evenly.
The formula looks like: delay = min(cap, base * 2^attempt) * random(0.5, 1.5). AWS recommends "full jitter" where delay = random(0, min(cap, base * 2^attempt)), which gives even better distribution across clients.
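A short sketch of both variants; the 1-second base and 30-second cap are illustrative values, not recommendations:

```python
import random

BASE = 1.0   # seconds before the first retry
CAP = 30.0   # never wait longer than this

def backoff_with_jitter(attempt: int) -> float:
    """Exponential backoff scaled by a random factor between 0.5 and 1.5."""
    return min(CAP, BASE * 2 ** attempt) * random.uniform(0.5, 1.5)

def full_jitter(attempt: int) -> float:
    """AWS-style full jitter: uniform between 0 and the exponential cap."""
    return random.uniform(0, min(CAP, BASE * 2 ** attempt))
```

A retry loop would sleep for `full_jitter(attempt)` between attempts; because each client draws its own random delay, retries land spread across the whole interval instead of in lockstep.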
Retry Budgets
Even with backoff and jitter, clients making retry decisions independently can still overwhelm a backend. Retry budgets address this at the system level by capping the ratio of retries to original requests. Google's SRE practices suggest a retry budget of 10%: if more than 10% of requests in a given time window are retries, stop retrying entirely. This prevents retry storms from forming no matter what individual clients are doing.
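Here is a minimal sliding-window sketch of that idea. The 10% ratio comes from the paragraph above; the 10-second window and the API shape are assumptions, and in practice the counters would be shared across all request paths in the process or tracked by an RPC library.

```python
import time

class RetryBudget:
    """Allows retries only while they stay under a fixed share of traffic."""

    def __init__(self, ratio=0.10, window_seconds=10.0):
        self.ratio = ratio
        self.window = window_seconds
        self.events = []   # (timestamp, is_retry) for recent requests

    def _prune(self, now):
        cutoff = now - self.window
        self.events = [(t, r) for t, r in self.events if t >= cutoff]

    def record(self, is_retry: bool):
        now = time.monotonic()
        self._prune(now)
        self.events.append((now, is_retry))

    def can_retry(self) -> bool:
        """True if sending one more retry keeps retries within the budget."""
        self._prune(time.monotonic())
        total = len(self.events)
        retries = sum(1 for _, is_retry in self.events if is_retry)
        return (retries + 1) <= self.ratio * (total + 1)
```

Callers record every request they send and check `can_retry()` before firing a retry; when the backend is struggling and most requests are failing, the budget runs out and retries stop, which is exactly what breaks the feedback loop.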
Incident Timeline
- T+0m: Cache server restarts, every cached entry is gone (cold cache)
- T+0m: Thousands of requests simultaneously miss cache and slam the database
- T+1m: Database CPU pegs at 100%, query latency jumps 50x
- T+2m: Application timeouts kick in and trigger automatic retries with no backoff
- T+3m: Retry storm multiplies the load by 3-5x, database stops responding
- T+5m: Every service that depends on this database starts failing
- T+10m: Manual intervention needed. Retries disabled, cache warmed back up gradually.
Detection Signals
- Sudden spike in database QPS right after a cache restart or widespread cache miss
- Retry rate hitting 3x or more above normal across multiple clients
- Backend latency climbing even though organic traffic has not grown
- Load balancer showing a growing connection queue depth
Prevention
- Add exponential backoff with jitter to all retry logic
- Use cache stampede protection (either lock-based or probabilistic early expiration)
- Set retry budgets that limit total retries per time window
- Put circuit breakers in place that open before a retry storm can form (see the sketch after this list)
- Have a cache warming procedure ready for cold-start scenarios
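Circuit breakers are the one item above not sketched elsewhere in this section, so here is a minimal version; the failure threshold, reset timeout, and reopen behavior are illustrative choices rather than any particular library's API.

```python
import time

class CircuitBreaker:
    """Stops sending requests to a backend after repeated failures."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True          # closed: traffic flows normally
        # Open: block requests until the reset timeout has elapsed,
        # then allow probes through again.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None    # close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()   # open (or re-open) the circuit
```

Because the breaker fails fast instead of letting requests pile up and time out, there is nothing left for a retry storm to feed on.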
Key Points
- A thundering herd happens when lots of clients request the same resource at the same time. The textbook example is a cache expiring or restarting.
- Retry storms feed on themselves: each retry adds load, which causes more timeouts, which triggers more retries.
- Exponential backoff without jitter still produces synchronized retry waves. Jitter is what actually spreads the load out.
- Retry budgets put a cap on total retry traffic at the system level, so no single client can overwhelm the backend.
- The safest default is actually to not retry at all. Only add retries once you have confirmed they help, and always pair them with backoff and jitter.
Common Mistakes
- Using fixed retry intervals instead of exponential backoff, which creates synchronized bursts that hit the backend all at once
- Skipping jitter on backoff timers, so all clients compute the same intervals and retry in lockstep
- Retrying operations that are not idempotent, leading to duplicate writes, double charges, or inconsistent state
- Setting the same TTL on all cache keys, which causes mass expiration events that trigger thundering herds