Circuit Breaker Pattern
Stop calling a downstream that is failing. After N failures in a window, the breaker opens and immediately fails subsequent calls without hitting the downstream. After a cooldown, it allows a probe; if the probe succeeds, the breaker closes again. This prevents one slow downstream from cascading into the calling service.
Diagram
What it is
A circuit breaker sits between calling code and a downstream service. It watches recent failures. If too many calls fail, it "opens" and rejects subsequent calls instantly without hitting the downstream. After a cooldown, it cautiously lets one probe through. If the probe succeeds, traffic resumes; if not, back to rejecting.
The name comes from electrical circuit breakers: when current spikes, the breaker trips to protect the rest of the circuit. Same idea: when downstream failures spike, the breaker protects the calling service from cascading failure.
The three states (see diagram above)
- CLOSED. Normal operation. Every call goes through. The breaker counts successes and failures over a sliding window.
- OPEN. The breaker has tripped because the failure rate crossed a threshold (e.g., 50% of the last 20 calls). Every call fails instantly with a "circuit open" error. The downstream is not contacted at all. Cooldown timer starts.
- HALF-OPEN. The cooldown elapsed. The breaker lets a small number of probe calls through to test whether the downstream is healthy again. If they succeed, transition back to CLOSED. If they fail, back to OPEN with a fresh cooldown.
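The three transitions can be sketched as a small state machine. This is a minimal illustration, not a production breaker: the class name `SimpleBreaker`, its methods, and the injected clock are all hypothetical, it trips on consecutive failures for brevity, and it does not cap how many probes run concurrently in the half-open state.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical minimal breaker; trips on consecutive failures for brevity.
final class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long cooldownNanos;
    private final LongSupplier clock;   // nanosecond clock, injected so time can be controlled
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleBreaker(int failureThreshold, Duration cooldown, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.cooldownNanos = cooldown.toNanos();
        this.clock = clock;
    }

    // Returns true if the caller may contact the downstream.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= cooldownNanos) {
            state = State.HALF_OPEN;    // cooldown elapsed: let probes through
        }
        return state != State.OPEN;
    }

    // A success (including a successful probe) closes the breaker.
    synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    // A failed probe, or too many failures in a row, opens the breaker.
    synchronized void recordFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();   // start a fresh cooldown
            failures = 0;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check `allowRequest()` before each call and report the outcome with `recordSuccess()` or `recordFailure()`; in real code `System::nanoTime` would be a reasonable clock to pass in.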
Why this matters
Without a circuit breaker, a slow downstream cascades through the system. Each call ties up a thread (or goroutine, or connection) waiting for the timeout. If the downstream takes 30 seconds to fail, 100 calls per minute is enough to exhaust 50 threads: each call holds a thread for half a minute, so about 100 × 0.5 = 50 calls are in flight at any moment. Now the calling service is also slow, and the failure has spread.
With a circuit breaker, the failure is contained. Calls fail fast (microseconds, not 30 seconds). Threads are freed. The service remains responsive. Callers can fall back to a cached value, a default response, or a queued retry.
How to trip
Two common policies:
- Failure rate over a sliding window. "If 50% of the last 20 calls failed, open." Most resilient because it adapts to traffic level.
- Consecutive failures. "If 5 calls in a row fail, open." Simpler, but flaky on bursty traffic (5 failures in a row could mean nothing if the previous 1000 succeeded).
Combine with a minimum-call floor: "open only if at least 10 calls have been observed in the window". Prevents tripping on a one-call sample.
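The failure-rate policy with a minimum-call floor might look like the sketch below. It assumes a count-based window (the last N calls rather than a time interval); the class `FailureRateWindow` and its parameter names are hypothetical, chosen to mirror the numbers in the text.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical count-based sliding window over the most recent call outcomes.
final class FailureRateWindow {
    private final int windowSize;        // e.g. the last 20 calls
    private final int minimumCalls;      // e.g. at least 10 calls before judging
    private final double thresholdPct;   // e.g. 50.0
    private final Deque<Boolean> outcomes = new ArrayDeque<>();  // true = failure
    private int failures = 0;

    FailureRateWindow(int windowSize, int minimumCalls, double thresholdPct) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.thresholdPct = thresholdPct;
    }

    // Record one call outcome, evicting the oldest once the window is full.
    synchronized void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        if (outcomes.size() > windowSize) {
            boolean evictedWasFailure = outcomes.removeFirst();
            if (evictedWasFailure) {
                failures--;
            }
        }
    }

    // True when enough calls were observed and the failure rate meets the threshold.
    synchronized boolean shouldTrip() {
        int calls = outcomes.size();
        if (calls < minimumCalls) {
            return false;   // too small a sample to judge
        }
        return 100.0 * failures / calls >= thresholdPct;
    }
}
```

With `new FailureRateWindow(20, 10, 50.0)`, a single failed call cannot trip the breaker, but 5 failures among the first 10 calls can.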
What counts as a failure
Be specific. Not every error means the downstream is unhealthy. A 4xx response (client error) means the caller sent something invalid; do not count it. A 5xx, timeout, or connection error means the downstream is unhealthy; do count it.
The standard practice: classify errors at the boundary, only count "downstream is unhealthy" errors, ignore "bad request" errors from the caller.
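Classifying at the boundary can be as small as a pair of predicates. The sketch below follows the rule stated above (5xx, timeouts, and connection errors count; 4xx does not). The class and method names are hypothetical, and treating every `IOException` as a connection-level failure is a simplifying assumption.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical classifier deciding which outcomes the breaker should count.
final class FailureClassifier {
    // 5xx means the downstream is unhealthy; 4xx is the caller's mistake.
    static boolean countsAsFailure(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }

    // Timeouts and I/O errors (connection refused, reset, socket timeout) count.
    // Anything else, such as a validation error raised locally, does not.
    static boolean countsAsFailure(Throwable error) {
        return error instanceof TimeoutException
            || error instanceof IOException;
    }
}
```

Only outcomes for which `countsAsFailure` returns true would be reported to the breaker as failures; everything else is recorded as a success or ignored.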
Per-downstream
Always per-downstream, never global. One breaker for the payment service, one for the user service, one for the recommendations service. A failing recommendations service should not stop charging customers.
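One way to keep breakers separate is a small registry keyed by downstream name. The sketch below is generic over the breaker type so it stands alone; `BreakerRegistry` and `forDownstream` are hypothetical names, not part of any library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical registry: exactly one breaker instance per downstream name.
final class BreakerRegistry<B> {
    private final Map<String, B> breakers = new ConcurrentHashMap<>();
    private final Function<String, B> factory;

    BreakerRegistry(Function<String, B> factory) {
        this.factory = factory;
    }

    // Returns the breaker for this downstream, creating it on first use.
    B forDownstream(String name) {
        return breakers.computeIfAbsent(name, factory);
    }
}
```

Calls to `forDownstream("payment")` and `forDownstream("recommendations")` return independent breakers, so failures recorded against one never affect the other. Resilience4j ships its own `CircuitBreakerRegistry`, which appears to serve the same purpose.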
What the breaker does not do
It does not wrap calls automatically; each call to the downstream still has to be routed through the breaker. It does not retry; that is a separate pattern. It does not provide a fallback; the caller has to supply one on catching the open-state error.
The breaker is the trigger. The fallback (cached value, default, degraded response) is the caller's responsibility.
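A caller-side fallback can be sketched as a thin wrapper around the guarded call. Both `CircuitOpenException` and `Fallbacks.callWithFallback` are hypothetical names used for illustration; a real library throws its own exception type when the breaker rejects a call.

```java
import java.util.function.Supplier;

// Hypothetical exception thrown by a breaker that is rejecting calls.
final class CircuitOpenException extends RuntimeException {
    CircuitOpenException(String message) {
        super(message);
    }
}

final class Fallbacks {
    // Runs the guarded call; if the breaker rejected it, returns the fallback value.
    // Any other exception (a real downstream failure) still propagates to the caller.
    static <T> T callWithFallback(Supplier<T> guardedCall, Supplier<T> fallback) {
        try {
            return guardedCall.get();
        } catch (CircuitOpenException e) {
            return fallback.get();
        }
    }
}
```

Catching only the open-state error is a deliberate choice in this sketch: it keeps "the breaker refused" separate from "the downstream failed", which the caller may want to handle differently.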
When to skip it
For internal services with reliable, low-latency calls, the breaker overhead might exceed the value. For one-off calls where both ends are controlled and failure can be detected end-to-end, retry is enough. For non-critical paths where failure is acceptable, the open-state behaviour and fallback complexity might not be worth it.
For any external dependency, any database call, any message queue interaction, any third-party API: yes, breaker.
Implementations
Resilience4j is a widely used library for this in Java. Configure the failure-rate threshold, sliding window size, and wait duration in the open state. When the breaker is open, the decorated function fails fast by throwing CallNotPermittedException without calling the downstream.
    import io.github.resilience4j.circuitbreaker.*;
    import java.time.Duration;
    import java.util.function.Supplier;

    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                          // open at 50% failure rate
        .slidingWindowSize(20)                             // last 20 calls
        .minimumNumberOfCalls(10)                          // at least 10 to consider
        .waitDurationInOpenState(Duration.ofSeconds(30))   // cooldown before probing
        .permittedNumberOfCallsInHalfOpenState(3)          // probe calls in half-open
        .build();

    CircuitBreaker breaker = CircuitBreaker.of("payment-api", config);

    // Wrap the downstream call; the breaker records each outcome.
    Supplier<Response> decorated = CircuitBreaker
        .decorateSupplier(breaker, () -> paymentApi.call(req));

    // Inside the method that serves the request:
    try {
        return decorated.get();
    } catch (CallNotPermittedException e) {
        // Breaker is open; fall back
        return fallbackResponse();
    }

Key points
- Three states: closed (normal), open (fail fast), half-open (probe with a limited number of calls).
- Trip on failure rate or absolute count over a sliding window. 'X failures per minute' is more useful than 'X consecutive failures'.
- Open state has a cooldown (typically 30s-5min). After the cooldown, the breaker is half-open: it allows one or a few probe calls.
- Without a circuit breaker, a slow downstream ties up threads/goroutines waiting on timeouts; the caller slows down trying to call something that won't respond.
- Per-downstream breakers, not global. One slow service shouldn't block calls to a healthy one.
Follow-up questions
- Circuit breaker vs retry: are they competing?
- How is the failure threshold picked?
- Why per-downstream and not global?
- What does 'fail fast' actually save?
Gotchas
- Single global breaker for all downstreams: one bad service blocks everything
- No probe in half-open state: breaker stays open forever
- Counting client errors (4xx) as breaker failures: trips on user mistakes, not service health
- Threshold too low: trips on benign blips. Too high: trips too late, after damage is done
- Forgetting the fallback when the breaker opens: callers see a raw circuit-open error (CallNotPermittedException in Resilience4j) instead of a degraded response