Circuit Breakers & Bulkheads
The Cascading Failure Problem
Service A calls Service B, which calls Service C. Service C starts responding slowly because its database is overloaded. Service B's threads are waiting on Service C, so B's thread pool fills up. Now Service A's calls to B start timing out. A's thread pool fills up. Soon, Services D, E, and F (which also depend on A) start failing.
One slow database turned into a system-wide outage. Every service in the chain is now unhealthy, even though only one component had the original problem.
This is a cascading failure, and it happens because each service holds resources (threads, connections, memory) while waiting for a downstream response. When the downstream is slow, those resources are held for longer, exhausting them for other requests.
Circuit breakers and bulkheads prevent cascading failures by limiting the damage a single faulty dependency can cause.
The Circuit Breaker Pattern
Named after electrical circuit breakers that trip to prevent house fires, software circuit breakers trip to prevent system-wide outages.
Closed State (Normal)
All requests flow through to the downstream service. The circuit breaker monitors outcomes: successes, failures, timeouts. It keeps a rolling window of recent results (e.g., the last 100 calls or the last 60 seconds).
Tripping: Closed to Open
When the failure rate in the rolling window exceeds a configured threshold (e.g., 50% of the last 100 calls failed), the breaker trips to the open state.
What counts as a failure? This is configurable and matters a lot (a configuration sketch follows the list):
- Network timeouts: yes, always
- 5xx responses: yes, the downstream is having problems
- 4xx responses: usually no, these are client errors (bad request, not found)
- Slow responses (above a latency threshold): sometimes, depends on whether slowness indicates downstream health
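Here is a minimal sketch of how this classification might look with Resilience4j. The specific exception types, the BadRequestException class, and the slow-call thresholds are illustrative assumptions about your HTTP client, not part of the library:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

public class FailureClassification {
    // Hypothetical exception your HTTP client might throw on 4xx responses.
    static class BadRequestException extends RuntimeException {}

    static CircuitBreakerConfig config() {
        return CircuitBreakerConfig.custom()
            // Network timeouts and connection errors always count as failures.
            .recordExceptions(IOException.class, TimeoutException.class)
            // 4xx-style errors are the caller's fault; do not let them trip the breaker.
            .ignoreExceptions(BadRequestException.class)
            // Optionally treat very slow calls as failures too.
            .slowCallDurationThreshold(Duration.ofSeconds(2))
            .slowCallRateThreshold(80)  // trip if 80% of recent calls are "slow"
            .build();
    }
}
```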
Open State (Fail Fast)
Requests do not reach the downstream service at all. The circuit breaker immediately returns an error (or a fallback response) to the caller. This has two benefits:
- The caller fails fast instead of waiting for a timeout. Response time drops from "timeout duration" to "near zero."
- The downstream service gets breathing room. No new requests arrive, giving it time to recover from whatever caused the failures.
The breaker stays open for a configurable duration (e.g., 30 seconds).
Half-Open State (Testing Recovery)
After the open-state timeout expires, the breaker transitions to half-open. It allows a small number of test requests (e.g., 1-3) to reach the downstream.
If the test requests succeed, the downstream has recovered. The breaker closes, and normal traffic resumes.
If the test requests fail, the downstream is still unhealthy. The breaker goes back to open for another timeout period.
This gradual testing prevents the "thundering herd" problem where a recovered service gets immediately overwhelmed by the full blast of queued traffic.
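To make the state transitions concrete, here is a minimal sketch in plain Java. It is illustrative only, not any library's implementation: it simplifies the rolling window to a tumbling one, lets every request through in half-open rather than a limited number of probes, and ignores thread safety.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit breaker sketch: closed -> open on failure rate, open -> half-open
// after a cooldown, half-open -> closed on a successful probe (or back to open on failure).
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0, calls = 0;
    private Instant openedAt;

    private final int windowSize = 100;              // count-based window
    private final double failureRateThreshold = 0.5;
    private final Duration openDuration = Duration.ofSeconds(30);

    public boolean allowRequest() {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;             // cooldown elapsed: let a probe through
                return true;
            }
            return false;                            // fail fast, do not call downstream
        }
        return true;                                 // CLOSED and HALF_OPEN allow the call
    }

    public void recordSuccess() {
        if (state == State.HALF_OPEN) { reset(); return; }  // probe succeeded: recovered
        record(false);
    }

    public void recordFailure() {
        if (state == State.HALF_OPEN) { trip(); return; }   // probe failed: back to open
        record(true);
    }

    private void record(boolean failed) {
        calls++;
        if (failed) failures++;
        if (calls >= windowSize) {
            if ((double) failures / calls >= failureRateThreshold) trip();
            else reset();                            // window full and healthy: start over
        }
    }

    private void trip()  { state = State.OPEN; openedAt = Instant.now(); calls = 0; failures = 0; }
    private void reset() { state = State.CLOSED; calls = 0; failures = 0; }
}
```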
The Bulkhead Pattern
A circuit breaker handles one dependency at a time. But what if the problem is not one dependency failing but one dependency consuming shared resources and starving everything else?
Example: your service has a thread pool of 200 threads shared across all downstream calls. The notifications service becomes slow and holds 190 threads waiting. Now your payment calls, inventory checks, and user lookups all compete for the remaining 10 threads. The notifications service, which is not even critical, has effectively taken down payment processing.
Bulkheads solve this by isolating resources per dependency.
Thread Pool Isolation
Give each downstream dependency its own thread pool. Payments get 50 threads. Notifications get 20 threads. Inventory gets 30 threads. If notifications goes slow and its 20 threads fill up, payments still has its own 50 threads and is unaffected.
Hystrix used this approach. Each "command group" ran in its own thread pool. When the pool was full, additional requests were rejected immediately (fail fast) without consuming resources from other pools.
The cost: thread pools have overhead (context switching, memory per thread). With many downstream dependencies, the total thread count can get large. Sizing each pool requires understanding the expected concurrency per dependency.
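A bare-bones sketch of thread pool isolation using plain java.util.concurrent. The pool sizes, the 2-second timeout, and the callPaymentService placeholder are illustrative assumptions:

```java
import java.util.concurrent.*;

public class ThreadPoolBulkheads {
    // One bounded pool per dependency. A full pool rejects new work immediately
    // instead of letting one slow dependency consume threads meant for the others.
    private final ExecutorService paymentsPool      = boundedPool(50);
    private final ExecutorService notificationsPool = boundedPool(20);

    private static ExecutorService boundedPool(int threads) {
        return new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(10),              // small queue: surface backpressure early
            new ThreadPoolExecutor.AbortPolicy());     // reject with RejectedExecutionException
    }

    public String charge(String orderId) throws Exception {
        Future<String> result = paymentsPool.submit(() -> callPaymentService(orderId));
        try {
            // The caller's timeout applies no matter how long the downstream takes.
            return result.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);                       // free the worker thread if interruptible
            throw e;
        }
    }

    private String callPaymentService(String orderId) {
        return "charged:" + orderId;                   // placeholder for the real HTTP call
    }
}
```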
Semaphore Isolation
A lighter alternative: instead of separate thread pools, use semaphores (concurrency limiters) per dependency. A semaphore with a permit count of 20 allows at most 20 concurrent calls. When all permits are taken, new calls are rejected.
Semaphore isolation runs on the caller's thread (no separate pool), so there is less overhead. But it does not provide timeout protection: a slow call holds the caller's thread until the call completes or the HTTP client's own timeout fires. With thread pool isolation, the caller can time out and abandon (or interrupt) the worker thread when the pool's timeout expires, regardless of the downstream's behavior.
Resilience4j defaults to semaphore isolation. Hystrix defaulted to thread pool isolation.
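The semaphore flavor is even simpler. A minimal sketch with java.util.concurrent.Semaphore, assuming a limit of 20 concurrent calls to one dependency:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class SemaphoreBulkhead {
    // At most 20 concurrent calls to this dependency; the 21st is rejected immediately.
    private final Semaphore permits = new Semaphore(20);

    public <T> T call(Supplier<T> downstreamCall) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full: too many concurrent calls");
        }
        try {
            // Runs on the caller's thread, so a slow call still holds that thread
            // until the call itself (or its own HTTP timeout) returns.
            return downstreamCall.get();
        } finally {
            permits.release();
        }
    }
}
```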
Configuring Circuit Breakers
Sliding Window
Most circuit breakers track failures in a sliding window. Two types:
Count-based: track the last N calls (e.g., 100). Simple and predictable. The threshold is a percentage: "if more than 50% of the last 100 calls failed, trip."
Time-based: track all calls in the last T seconds (e.g., 60). Better for low-traffic services where count-based windows take too long to fill. The threshold works the same way.
Resilience4j supports both. The choice depends on traffic volume. High-traffic services (1000+ RPS) work well with count-based windows. Low-traffic services (1 RPS) need time-based windows because a count-based window of 100 would take 100 seconds to fill.
Minimum Call Threshold
Do not trip the breaker on the first few calls. If the first 3 out of 5 calls fail, is that a 60% failure rate or bad luck? Set a minimum number of calls before the breaker evaluates the threshold. Resilience4j's minimumNumberOfCalls (default: 100) prevents premature tripping.
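In Resilience4j, the window type, window size, failure-rate threshold, and minimum call count all live on the same config builder. A sketch with illustrative values (the "inventory" breaker name is made up):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

public class BreakerConfigExample {
    static CircuitBreaker inventoryBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .slidingWindowType(SlidingWindowType.COUNT_BASED)  // or TIME_BASED for low-traffic services
            .slidingWindowSize(100)       // last 100 calls (or last 100 seconds if time-based)
            .failureRateThreshold(50)     // trip when 50% of the window has failed
            .minimumNumberOfCalls(100)    // do not evaluate the threshold before 100 recorded calls
            .build();
        return CircuitBreaker.of("inventory", config);
    }
}
```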
Timeout Duration
How long to stay open before testing recovery. Too short and you hammer the recovering service with probes. Too long and you stay in a degraded state longer than necessary. 30-60 seconds is a common starting point. Some systems use exponential backoff for the open-state duration: 30s, then 60s, then 120s if recovery keeps failing.
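The open-state duration is another knob on the same builder. The fixed variant below uses waitDurationInOpenState; the exponential-backoff variant assumes a Resilience4j release that supports interval functions for the open state, so verify it against the version you use:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;

import java.time.Duration;

public class OpenStateDuration {
    // Fixed 30-second open state before the first half-open probe.
    static CircuitBreakerConfig fixedWait() {
        return CircuitBreakerConfig.custom()
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build();
    }

    // Exponential backoff for repeated failures: 30s, then 60s, then 120s, ...
    // (assumes a Resilience4j version that exposes interval functions here)
    static CircuitBreakerConfig backoffWait() {
        return CircuitBreakerConfig.custom()
            .waitIntervalFunctionInOpenState(
                IntervalFunction.ofExponentialBackoff(Duration.ofSeconds(30), 2.0))
            .build();
    }
}
```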
Fallback Strategies
When the breaker is open, what does the caller do?
Return cached data. If you have a recent cached response, return it. Stale data is often better than no data. A product catalog that is 5 minutes old is perfectly usable. A stock price that is 5 minutes old might not be.
Return a default value. If there is a safe default, use it. Recommendations service is down? Show the most popular items. Personalization service is down? Show the generic homepage.
Degrade functionality. Remove the feature that depends on the failing service. Hide the recommendations section. Disable the chat widget. Show a "feature temporarily unavailable" message.
Queue for retry. Write the request to a queue and process it later when the downstream recovers. Works for non-interactive operations like sending emails or updating analytics.
Return an error. Sometimes the honest answer is the best one. If the payment service is down, tell the user "we cannot process your payment right now" rather than pretending everything is fine.
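Wiring a fallback to an open breaker can be as simple as catching the breaker's rejection and serving cached data. A sketch using Resilience4j's CallNotPermittedException; CatalogClient and CatalogCache are hypothetical stand-ins for your own client and cache:

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.util.List;
import java.util.function.Supplier;

public class CatalogWithFallback {
    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("catalog");
    private final CatalogClient catalogClient;   // hypothetical HTTP client
    private final CatalogCache cache;            // hypothetical local cache

    public CatalogWithFallback(CatalogClient client, CatalogCache cache) {
        this.catalogClient = client;
        this.cache = cache;
    }

    public List<String> products() {
        Supplier<List<String>> guarded =
            CircuitBreaker.decorateSupplier(breaker, catalogClient::fetchProducts);
        try {
            List<String> fresh = guarded.get();
            cache.store(fresh);                  // keep the cache warm for the next outage
            return fresh;
        } catch (CallNotPermittedException e) {
            // Breaker is open: serve stale data instead of an error.
            return cache.latest();
        }
    }

    interface CatalogClient { List<String> fetchProducts(); }
    interface CatalogCache  { void store(List<String> items); List<String> latest(); }
}
```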
Envoy and Service Mesh Circuit Breaking
Modern architectures often push circuit breaking into the infrastructure layer. Envoy proxy (used by Istio, AWS App Mesh, and others) implements circuit breaking as a sidecar, so application code does not need to include circuit breaker libraries.
Envoy's circuit breaking is configured per upstream cluster and includes:
- Max connections: limit on the total number of connections to the upstream cluster
- Max pending requests: limit on requests queued waiting for a connection
- Max requests: limit on concurrent active requests
- Max retries: limit on concurrent retry attempts
Envoy also has outlier detection, which is its version of the circuit breaker pattern. It tracks individual upstream hosts and ejects hosts that exceed a failure threshold. Ejected hosts receive no traffic for a configurable duration before being re-added.
The advantage of sidecar-based circuit breaking: it works for any protocol (HTTP, gRPC, TCP) without application changes. The disadvantage: the sidecar cannot implement application-specific fallback logic (return cached data, degrade gracefully). It can only fail fast.
When NOT to Use Circuit Breakers
Circuit breakers are for protecting your service from a failing dependency. They are not appropriate in every situation.
Critical dependencies with no fallback. If your service literally cannot function without the downstream (e.g., a database for a transactional service), tripping the circuit breaker just changes the error from "timeout" to "breaker open." The user experience is the same. In this case, focus on making the downstream more reliable (replicas, failover) rather than wrapping it in a breaker.
Idempotent batch operations. If you are processing a batch of records and some fail, retrying those individual records later is usually better than tripping a breaker that stops all processing.
Internal libraries and function calls. Circuit breakers make sense for network calls where failure is expected and latency is unpredictable. Wrapping a local function call in a circuit breaker adds overhead and complexity for no benefit.
The pattern works best at the boundary between services in a microservices architecture, where network partitions, deployments, and scaling events create exactly the kind of transient failures that circuit breakers are designed to handle.
Key Points
- A circuit breaker monitors calls to a downstream service and trips open when failures exceed a threshold. Once open, requests fail immediately without attempting the call. This prevents a failing service from dragging down its callers and cascading the failure across the entire system
- Three states: Closed (normal, requests flow through), Open (tripped, requests fail fast), Half-Open (testing, a limited number of requests probe the downstream to check if it has recovered). The transition from open to half-open happens after a configurable timeout
- Bulkheads isolate components so that a failure in one does not exhaust shared resources and take down others. Named after ship bulkheads that contain flooding to one compartment. In practice, this means separate thread pools, connection pools, or rate limits per downstream dependency
- Netflix Hystrix popularized both patterns but is now in maintenance mode. Modern alternatives include Resilience4j (Java), Polly (.NET), Envoy proxy (sidecar-based), and Istio (service mesh-level). The concepts are the same regardless of the implementation
- Circuit breakers and bulkheads are defense mechanisms, not solutions. They buy you time to recover. The actual fix is always upstream: retry with backoff, degrade gracefully, serve cached data, or alert an operator. A circuit breaker that trips and stays open forever is just a permanent outage with extra steps
Common Mistakes
- ✗ Setting the failure threshold too low. If your circuit breaker trips after 3 failures in 60 seconds, a single burst of network timeouts will open it even though the downstream service is fundamentally healthy. Use a percentage-based threshold (e.g., 50% failure rate over the last 100 calls) rather than an absolute count
- ✗ Not distinguishing between failure types. A 500 error from the downstream usually means the service is struggling. A 400 error means your request was bad. Counting 400s toward the circuit breaker threshold will trip it when your own requests are malformed, not when the downstream is failing
- ✗ Forgetting the half-open state. A circuit breaker that goes from open to closed without testing first floods the recovering service with full traffic. The half-open state lets through a small number of test requests. Only if they succeed does the breaker close
- ✗ Using circuit breakers for services that must not be skipped. If your payment processor is down and you cannot complete the checkout, a circuit breaker that fails fast is the right behavior. But if you silently skip the payment step and ship the order for free, that is worse than being slow