Cascading Failure Patterns
When One Failure Takes Down Everything
A cascading failure is the distributed systems version of a chain reaction. One component goes down, and instead of the failure staying contained, it ripples through the system until everything is broken. The frustrating part is that each individual component is behaving reasonably on its own. It retries, it queues, it waits for connections. But the collective result is catastrophic.
How Resource Exhaustion Spreads
Cascading failures almost always spread through resource exhaustion rather than through the errors themselves. When Service A calls Service B and B is slow, A does not immediately fail. A's threads just sit there waiting for B to respond. Each waiting thread holds onto a connection, some memory, and a slot in the thread pool. As more requests come in, more threads block, and eventually A has no threads left to handle anything, including requests that have nothing to do with B.
This is why timeout configuration matters more than almost any other reliability setting in a distributed system. If you have a 30-second timeout on an API that normally responds in 50ms, each thread gets held hostage 600x longer than it should during a downstream outage. Set your timeouts aggressively. If your P99 is 200ms, a 2-second timeout is plenty generous. Anything beyond that is an invitation for a cascade.
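As a concrete sketch of this advice, the snippet below sets a short, explicit timeout on an outbound call using Python's `requests` library. The endpoint, parameter names, and exact numbers are illustrative assumptions, not values from any real system:

```python
import requests

# Hypothetical downstream endpoint; the URL and numbers are illustrative.
RECOMMENDATIONS_URL = "http://recommendations.internal/api/v1/suggest"

def fetch_recommendations(user_id: str) -> list:
    try:
        # P99 for this call is ~200ms, so a 2-second read timeout is already
        # generous. requests accepts a (connect, read) timeout tuple in seconds.
        resp = requests.get(
            RECOMMENDATIONS_URL,
            params={"user": user_id},
            timeout=(0.5, 2.0),
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.RequestException:
        # Fail fast: give the thread back and degrade gracefully rather than
        # holding a connection while the downstream service struggles.
        return []
```

The point is that the slow path is bounded: during a downstream outage, each call returns its thread after at most a couple of seconds instead of thirty.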
Isolating Failure Domains
The bulkhead pattern, borrowed from ship design, isolates failure domains so a breach in one compartment does not sink the whole vessel. In practice, this means giving each downstream dependency its own thread pool (or connection pool). When the recommendations service slows down, it burns through its dedicated pool of 20 threads, but the checkout service still has its own 50 threads working just fine.
Netflix built this approach into Hystrix, and modern service meshes like Istio handle it at the infrastructure layer. The key realization is that you have to design for partial failure. Most of your system should keep running even when some dependencies are down.
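A minimal sketch of the same bulkhead idea in plain Python, using one bounded `ThreadPoolExecutor` per dependency; the pool sizes and dependency names are assumptions carried over from the paragraph above, not a recommendation:

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded worker pool per downstream dependency. A slow recommendations
# service can tie up at most its own 20 workers; checkout keeps its 50.
POOLS = {
    "recommendations": ThreadPoolExecutor(max_workers=20),
    "checkout": ThreadPoolExecutor(max_workers=50),
}

def call_dependency(name: str, fn, *args, timeout: float = 2.0):
    """Run a dependency call on that dependency's own pool, with a deadline."""
    future = POOLS[name].submit(fn, *args)
    # result() raises TimeoutError if the call is slow, so the caller stops
    # waiting even while the dependency's own workers remain busy.
    return future.result(timeout=timeout)
```

The partitioning is the whole trick: capacity reserved for one dependency can never be consumed by another, no matter how badly that other dependency is misbehaving.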
A Real-World Example
The 2017 Amazon S3 outage cascaded because a huge number of AWS services depended on S3 for health check dashboards and status pages. The system that was supposed to tell you what was broken was itself broken. This is a good reminder that your monitoring and alerting pipeline should depend as little as possible on the same systems it monitors.
Stopping the Domino Effect
Circuit breakers are the most effective tool against cascading failures. When a downstream service starts failing, the circuit breaker "opens" and returns an error right away (or a cached/default response) without even trying the call. This turns a slow failure into a fast one, freeing up threads and preventing resource exhaustion. The breaker periodically lets a test request through to see if the downstream service has recovered, and closes again once things look healthy.
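A stripped-down sketch of that open / half-open / closed cycle; the failure threshold and reset timeout are illustrative assumptions, and in practice you would reach for an existing library or a service-mesh policy rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast while a dependency looks unhealthy."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: return the fallback immediately, no downstream call at all.
                return fallback
            # Half-open: the reset timeout has passed, let one trial call through.
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            return fallback
        # Success: close the breaker and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapped around a dependency call that raises on failure, `breaker.call(...)` returns the fallback immediately while the breaker is open, so threads are never tied up re-testing a dependency that is known to be down.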
Incident Timeline
- T+0m: Database connection pool runs dry on the primary database
- T+1m: App threads pile up waiting for connections, request queue starts growing
- T+2m: Load balancer health checks begin failing as response times blow past the threshold
- T+3m: Traffic gets redistributed to the remaining healthy instances, which now get overwhelmed too
- T+5m: All instances flagged unhealthy, upstream services start queuing requests
- T+8m: Upstream services exhaust their own connection pools and the cascade spreads further
- T+15m: Full system outage across every dependent service
Detection Signals
- Connection pool utilization creeping past 75% (a minimal check for this signal is sketched after this list)
- Thread pool saturation showing up across multiple services at the same time
- Error rates growing exponentially across service boundaries
- Upstream services getting slower even though their own direct load has not changed
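For the first signal, a minimal sketch of a utilization check that could run on a schedule; the 75% threshold matches the list above, and the argument names are placeholders for whatever metrics your pool actually exposes:

```python
import logging

logger = logging.getLogger("capacity")

def check_pool_utilization(in_use: int, pool_size: int, threshold: float = 0.75) -> bool:
    """Warn when a connection pool creeps past the alerting threshold.

    `in_use` and `pool_size` would come from the pool's own metrics; both
    names here are placeholders, not a specific driver's API.
    """
    utilization = in_use / pool_size
    if utilization >= threshold:
        logger.warning(
            "connection pool at %.0f%% (%d/%d): possible cascade precursor",
            utilization * 100, in_use, pool_size,
        )
    return utilization >= threshold
```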
Prevention
- Put circuit breakers at every service boundary
- Set aggressive timeouts. Fail fast instead of letting requests queue up forever.
- Use bulkhead isolation so one failing dependency cannot starve everything else
- Design graceful degradation paths for each critical dependency
- Load test with dependency failures, not just with high traffic (see the fault-injection sketch after this list)
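One way to exercise that last item is to wrap a dependency client with a fault injector during load tests. Everything below is a hypothetical sketch (the wrapper, error rate, and added latency are assumptions), not any particular chaos-testing tool's API:

```python
import random
import time

class FaultInjectingClient:
    """Wraps a real dependency client and degrades it on purpose during load tests."""

    def __init__(self, client, error_rate: float = 0.1, added_latency_s: float = 1.5):
        self.client = client
        self.error_rate = error_rate            # fraction of calls that fail outright
        self.added_latency_s = added_latency_s  # extra delay injected into every call

    def get(self, *args, **kwargs):
        time.sleep(self.added_latency_s)  # simulate a slow downstream service
        if random.random() < self.error_rate:
            raise ConnectionError("injected fault: dependency unavailable")
        return self.client.get(*args, **kwargs)
```

Running the usual load test with this wrapper in place shows whether the timeouts, bulkheads, and breakers above actually hold, instead of only proving that the happy path scales.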
Key Points
- Cascading failures spread through resource exhaustion, not through errors. A service that queues forever is more dangerous than one that crashes outright.
- Circuit breakers are the primary defense. They turn slow failures into fast failures, which the system can actually cope with.
- Bulkhead isolation limits the blast radius so one broken dependency cannot eat up all your shared resources.
- Load balancer health checks can actually speed up a cascade by funneling traffic onto fewer and fewer healthy instances.
- The worst cascading failures cross team boundaries, because no single team has the full picture of what is happening.
Common Mistakes
- ✗ Setting timeouts too high (or not setting them at all) and letting blocked threads stack up until the pool is gone
- ✗ Only load testing the happy path and never simulating dependency failures
- ✗ Using a single shared thread pool for all dependencies, so one slow service blocks everything
- ✗ Treating the system as either fully up or fully down, with nothing in between for graceful degradation
- ✗ Retrying without backoff during a cascade, which just piles more load onto a service that is already drowning (a backoff sketch follows below)
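To avoid that last mistake, retries need to be capped and spread out. A minimal sketch of capped exponential backoff with full jitter; the attempt count, base delay, and cap are illustrative values:

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.1, max_delay: float = 2.0):
    """Retry `fn` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff, capped, with full jitter so clients do not
            # retry in synchronized waves against a recovering service.
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            time.sleep(delay)
```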