Circuit Breakers & Bulkheads
The Cascading Failure Problem
Service A calls Service B, which calls Service C. Service C starts responding slowly because its database is overloaded. Service B's threads are waiting on Service C, so B's thread pool fills up. Now Service A's calls to B start timing out. A's thread pool fills up. Soon, Services D, E, and F (which also depend on A) start failing.
One slow database turned into a system-wide outage. Every service in the chain is now unhealthy, even though only one component had the original problem.
This is a cascading failure, and it happens because each service holds resources (threads, connections, memory) while waiting for a downstream response. When the downstream is slow, those resources are held for longer, exhausting them for other requests.
Circuit breakers and bulkheads prevent cascading failures by limiting the damage a single faulty dependency can cause.
The Circuit Breaker Pattern
Named after electrical circuit breakers that trip to prevent house fires, software circuit breakers trip to prevent system-wide outages.
Closed State (Normal)
All requests flow through to the downstream service. The circuit breaker monitors outcomes: successes, failures, timeouts. It keeps a rolling window of recent results (e.g., the last 100 calls or the last 60 seconds).
Tripping: Closed to Open
When the failure rate in the rolling window exceeds a configured threshold (e.g., 50% of the last 100 calls failed), the breaker trips to the open state.
What counts as a failure? This is configurable and matters a lot (a configuration sketch follows the list):
- Network timeouts: yes, always
- 5xx responses: yes, the downstream is having problems
- 4xx responses: usually no, these are client errors (bad request, not found)
- Slow responses (above a latency threshold): sometimes, depends on whether slowness indicates downstream health
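Here is a minimal sketch of how this classification might look with Resilience4j. The specific exception types, the BadRequestException class, and the slow-call thresholds are illustrative assumptions about your HTTP client, not part of the library:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

public class FailureClassification {
    // Hypothetical exception your HTTP client might throw on 4xx responses.
    static class BadRequestException extends RuntimeException {}

    static CircuitBreakerConfig config() {
        return CircuitBreakerConfig.custom()
            // Network timeouts and connection errors always count as failures.
            .recordExceptions(IOException.class, TimeoutException.class)
            // 4xx-style errors are the caller's fault; do not let them trip the breaker.
            .ignoreExceptions(BadRequestException.class)
            // Optionally treat very slow calls as failures too.
            .slowCallDurationThreshold(Duration.ofSeconds(2))
            .slowCallRateThreshold(80)  // trip if 80% of recent calls are "slow"
            .build();
    }
}
```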
Open State (Fail Fast)
Requests do not reach the downstream service at all. The circuit breaker immediately returns an error (or a fallback response) to the caller. This has two benefits:
- The caller fails fast instead of waiting for a timeout. Response time drops from "timeout duration" to "near zero."
- The downstream service gets breathing room. No new requests arrive, giving it time to recover from whatever caused the failures.
The breaker stays open for a configurable duration (e.g., 30 seconds).
Half-Open State (Testing Recovery)
After the open-state timeout expires, the breaker transitions to half-open. It allows a small number of test requests (e.g., 1-3) to reach the downstream.
If the test requests succeed, the downstream has recovered. The breaker closes, and normal traffic resumes.
If the test requests fail, the downstream is still unhealthy. The breaker goes back to open for another timeout period.
This gradual testing prevents the "thundering herd" problem where a recovered service gets immediately overwhelmed by the full blast of queued traffic.
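To make the state transitions concrete, here is a minimal sketch in plain Java. It is illustrative only, not any library's implementation: it simplifies the rolling window to a tumbling one, lets every request through in half-open rather than a limited number of probes, and ignores thread safety.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit breaker sketch: closed -> open on failure rate, open -> half-open
// after a cooldown, half-open -> closed on a successful probe (or back to open on failure).
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0, calls = 0;
    private Instant openedAt;

    private final int windowSize = 100;              // count-based window
    private final double failureRateThreshold = 0.5;
    private final Duration openDuration = Duration.ofSeconds(30);

    public boolean allowRequest() {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;             // cooldown elapsed: let a probe through
                return true;
            }
            return false;                            // fail fast, do not call downstream
        }
        return true;                                 // CLOSED and HALF_OPEN allow the call
    }

    public void recordSuccess() {
        if (state == State.HALF_OPEN) { reset(); return; }  // probe succeeded: recovered
        record(false);
    }

    public void recordFailure() {
        if (state == State.HALF_OPEN) { trip(); return; }   // probe failed: back to open
        record(true);
    }

    private void record(boolean failed) {
        calls++;
        if (failed) failures++;
        if (calls >= windowSize) {
            if ((double) failures / calls >= failureRateThreshold) trip();
            else reset();                            // window full and healthy: start over
        }
    }

    private void trip()  { state = State.OPEN; openedAt = Instant.now(); calls = 0; failures = 0; }
    private void reset() { state = State.CLOSED; calls = 0; failures = 0; }
}
```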
The Bulkhead Pattern
A circuit breaker handles one dependency at a time. But what if the problem is not one dependency failing but one dependency consuming shared resources and starving everything else?
Example: your service has a thread pool of 200 threads shared across all downstream calls. The notifications service becomes slow and holds 190 threads waiting. Now your payment calls, inventory checks, and user lookups all compete for the remaining 10 threads. The notifications service, which is not even critical, has effectively taken down payment processing.
Bulkheads solve this by isolating resources per dependency.
Thread Pool Isolation
Give each downstream dependency its own thread pool. Payments get 50 threads. Notifications get 20 threads. Inventory gets 30 threads. If notifications goes slow and its 20 threads fill up, payments still has its own 50 threads and is unaffected.
Hystrix used this approach. Each "command group" ran in its own thread pool. When the pool was full, additional requests were rejected immediately (fail fast) without consuming resources from other pools.
The cost: thread pools have overhead (context switching, memory per thread). With many downstream dependencies, the total thread count can get large. Sizing each pool requires understanding the expected concurrency per dependency.
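A bare-bones sketch of thread pool isolation using plain java.util.concurrent. The pool sizes, the 2-second timeout, and the callPaymentService placeholder are illustrative assumptions:

```java
import java.util.concurrent.*;

public class ThreadPoolBulkheads {
    // One bounded pool per dependency. A full pool rejects new work immediately
    // instead of letting one slow dependency consume threads meant for the others.
    private final ExecutorService paymentsPool      = boundedPool(50);
    private final ExecutorService notificationsPool = boundedPool(20);

    private static ExecutorService boundedPool(int threads) {
        return new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(10),              // small queue: surface backpressure early
            new ThreadPoolExecutor.AbortPolicy());     // reject with RejectedExecutionException
    }

    public String charge(String orderId) throws Exception {
        Future<String> result = paymentsPool.submit(() -> callPaymentService(orderId));
        try {
            // The caller's timeout applies no matter how long the downstream takes.
            return result.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);                       // free the worker thread if interruptible
            throw e;
        }
    }

    private String callPaymentService(String orderId) {
        return "charged:" + orderId;                   // placeholder for the real HTTP call
    }
}
```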
Semaphore Isolation
A lighter alternative: instead of separate thread pools, use semaphores (concurrency limiters) per dependency. A semaphore with a permit count of 20 allows at most 20 concurrent calls. When all permits are taken, new calls are rejected.
Semaphore isolation runs on the caller's thread (no separate pool), so there is less overhead. But it does not provide timeout protection: a slow call holds the caller's thread until the call completes or the HTTP client's own timeout fires. With thread pool isolation, the caller can time out and abandon (or interrupt) the worker thread when the pool's timeout expires, regardless of the downstream's behavior.
Resilience4j defaults to semaphore isolation. Hystrix defaulted to thread pool isolation.
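The semaphore flavor is even simpler. A minimal sketch with java.util.concurrent.Semaphore, assuming a limit of 20 concurrent calls to one dependency:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class SemaphoreBulkhead {
    // At most 20 concurrent calls to this dependency; the 21st is rejected immediately.
    private final Semaphore permits = new Semaphore(20);

    public <T> T call(Supplier<T> downstreamCall) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full: too many concurrent calls");
        }
        try {
            // Runs on the caller's thread, so a slow call still holds that thread
            // until the call itself (or its own HTTP timeout) returns.
            return downstreamCall.get();
        } finally {
            permits.release();
        }
    }
}
```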
Configuring Circuit Breakers
Sliding Window
Most circuit breakers track failures in a sliding window. Two types:
Count-based: track the last N calls (e.g., 100). Simple and predictable. The threshold is a percentage: "if more than 50% of the last 100 calls failed, trip."
Time-based: track all calls in the last T seconds (e.g., 60). Better for low-traffic services where count-based windows take too long to fill. The threshold works the same way.
Resilience4j supports both. The choice depends on traffic volume. High-traffic services (1000+ RPS) work well with count-based windows. Low-traffic services (1 RPS) need time-based windows because a count-based window of 100 would take 100 seconds to fill.
Minimum Call Threshold
Do not trip the breaker on the first few calls. If the first 3 out of 5 calls fail, is that a 60% failure rate or bad luck? Set a minimum number of calls before the breaker evaluates the threshold. Resilience4j's minimumNumberOfCalls (default: 100) prevents premature tripping.
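In Resilience4j, the window type, window size, failure-rate threshold, and minimum call count all live on the same config builder. A sketch with illustrative values (the "inventory" breaker name is made up):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

public class BreakerConfigExample {
    static CircuitBreaker inventoryBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .slidingWindowType(SlidingWindowType.COUNT_BASED)  // or TIME_BASED for low-traffic services
            .slidingWindowSize(100)       // last 100 calls (or last 100 seconds if time-based)
            .failureRateThreshold(50)     // trip when 50% of the window has failed
            .minimumNumberOfCalls(100)    // do not evaluate the threshold before 100 recorded calls
            .build();
        return CircuitBreaker.of("inventory", config);
    }
}
```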
Timeout Duration
How long to stay open before testing recovery. Too short and you hammer the recovering service with probes. Too long and you stay in a degraded state longer than necessary. 30-60 seconds is a common starting point. Some systems use exponential backoff for the open-state duration: 30s, then 60s, then 120s if recovery keeps failing.
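The open-state duration is another knob on the same builder. The fixed variant below uses waitDurationInOpenState; the exponential-backoff variant assumes a Resilience4j release that supports interval functions for the open state, so verify it against the version you use:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;

import java.time.Duration;

public class OpenStateDuration {
    // Fixed 30-second open state before the first half-open probe.
    static CircuitBreakerConfig fixedWait() {
        return CircuitBreakerConfig.custom()
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build();
    }

    // Exponential backoff for repeated failures: 30s, then 60s, then 120s, ...
    // (assumes a Resilience4j version that exposes interval functions here)
    static CircuitBreakerConfig backoffWait() {
        return CircuitBreakerConfig.custom()
            .waitIntervalFunctionInOpenState(
                IntervalFunction.ofExponentialBackoff(Duration.ofSeconds(30), 2.0))
            .build();
    }
}
```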
Fallback Strategies
When the breaker is open, what does the caller do?
Return cached data. If you have a recent cached response, return it. Stale data is often better than no data. A product catalog that is 5 minutes old is perfectly usable. A stock price that is 5 minutes old might not be.
Return a default value. If there is a safe default, use it. Recommendations service is down? Show the most popular items. Personalization service is down? Show the generic homepage.
Degrade functionality. Remove the feature that depends on the failing service. Hide the recommendations section. Disable the chat widget. Show a "feature temporarily unavailable" message.
Queue for retry. Write the request to a queue and process it later when the downstream recovers. Works for non-interactive operations like sending emails or updating analytics.
Return an error. Sometimes the honest answer is the best one. If the payment service is down, tell the user "we cannot process your payment right now" rather than pretending everything is fine.
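Wiring a fallback to an open breaker can be as simple as catching the breaker's rejection and serving cached data. A sketch using Resilience4j's CallNotPermittedException; CatalogClient and CatalogCache are hypothetical stand-ins for your own client and cache:

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.util.List;
import java.util.function.Supplier;

public class CatalogWithFallback {
    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("catalog");
    private final CatalogClient catalogClient;   // hypothetical HTTP client
    private final CatalogCache cache;            // hypothetical local cache

    public CatalogWithFallback(CatalogClient client, CatalogCache cache) {
        this.catalogClient = client;
        this.cache = cache;
    }

    public List<String> products() {
        Supplier<List<String>> guarded =
            CircuitBreaker.decorateSupplier(breaker, catalogClient::fetchProducts);
        try {
            List<String> fresh = guarded.get();
            cache.store(fresh);                  // keep the cache warm for the next outage
            return fresh;
        } catch (CallNotPermittedException e) {
            // Breaker is open: serve stale data instead of an error.
            return cache.latest();
        }
    }

    interface CatalogClient { List<String> fetchProducts(); }
    interface CatalogCache  { void store(List<String> items); List<String> latest(); }
}
```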
Envoy and Service Mesh Circuit Breaking
Modern architectures often push circuit breaking into the infrastructure layer. Envoy proxy (used by Istio, AWS App Mesh, and others) implements circuit breaking as a sidecar, so application code does not need to include circuit breaker libraries.
Envoy's circuit breaking is configured per upstream cluster and includes:
- Max connections: limit on the total number of connections to the upstream cluster
- Max pending requests: limit on requests queued waiting for a connection
- Max requests: limit on concurrent active requests
- Max retries: limit on concurrent retry attempts
Envoy also has outlier detection, which is its version of the circuit breaker pattern. It tracks individual upstream hosts and ejects hosts that exceed a failure threshold. Ejected hosts receive no traffic for a configurable duration before being re-added.
The advantage of sidecar-based circuit breaking: it works for any protocol (HTTP, gRPC, TCP) without application changes. The disadvantage: the sidecar cannot implement application-specific fallback logic (return cached data, degrade gracefully). It can only fail fast.
When NOT to Use Circuit Breakers
Circuit breakers are for protecting your service from a failing dependency. They are not appropriate in every situation.
Critical dependencies with no fallback. If your service literally cannot function without the downstream (e.g., a database for a transactional service), tripping the circuit breaker just changes the error from "timeout" to "breaker open." The user experience is the same. In this case, focus on making the downstream more reliable (replicas, failover) rather than wrapping it in a breaker.
Idempotent batch operations. If you are processing a batch of records and some fail, retrying those individual records later is usually better than tripping a breaker that stops all processing.
Internal libraries and function calls. Circuit breakers make sense for network calls where failure is expected and latency is unpredictable. Wrapping a local function call in a circuit breaker adds overhead and complexity for no benefit.
The pattern works best at the boundary between services in a microservices architecture, where network partitions, deployments, and scaling events create exactly the kind of transient failures that circuit breakers are designed to handle.
Key Points
- A circuit breaker monitors calls to a downstream service and trips open when failures exceed a threshold. Once open, requests fail immediately without attempting the call. This prevents a failing service from dragging down its callers and cascading the failure across the entire system
- Three states: Closed (normal, requests flow through), Open (tripped, requests fail fast), Half-Open (testing, a limited number of requests probe the downstream to check if it has recovered). The transition from open to half-open happens after a configurable timeout
- Bulkheads isolate components so that a failure in one does not exhaust shared resources and take down others. Named after ship bulkheads that contain flooding to one compartment. In practice, this means separate thread pools, connection pools, or rate limits per downstream dependency
- Netflix Hystrix popularized both patterns but is now in maintenance mode. Modern alternatives include Resilience4j (Java), Polly (.NET), Envoy proxy (sidecar-based), and Istio (service mesh-level). The concepts are the same regardless of the implementation
- Circuit breakers and bulkheads are defense mechanisms, not solutions. They buy you time to recover. The actual fix is always upstream: retry with backoff, degrade gracefully, serve cached data, or alert an operator. A circuit breaker that trips and stays open forever is just a permanent outage with extra steps
Common Mistakes
- ✗ Setting the failure threshold too low. If your circuit breaker trips after 3 failures in 60 seconds, a single burst of network timeouts will open it even though the downstream service is fundamentally healthy. Use a percentage-based threshold (e.g., 50% failure rate over the last 100 calls) rather than an absolute count
- ✗ Not distinguishing between failure types. A 500 error from the downstream usually means the service is struggling. A 400 error means your request was bad. Counting 400s toward the circuit breaker threshold will trip it when your own requests are malformed, not when the downstream is failing
- ✗ Forgetting the half-open state. A circuit breaker that goes from open to closed without testing first floods the recovering service with full traffic. The half-open state lets through a small number of test requests. Only if they succeed does the breaker close
- ✗ Using circuit breakers for services that must not be skipped. If your payment processor is down and you cannot complete the checkout, a circuit breaker that fails fast is the right behavior. But if you silently skip the payment step and ship the order for free, that is worse than being slow