Circuit Breaker: Production Implementation
Build a circuit breaker that wraps downstream calls. State machine: closed → (failures) → open → (cooldown) → half-open → (probe) → closed/open. Per-downstream instance. Trip on failure rate over a sliding window with a minimum-call floor. Half-open allows N probes; success rate decides if we close or reopen.
What it is
A production circuit breaker is a state machine wrapped around downstream calls. It tracks recent failures, trips when failures exceed a threshold, and lets calls through again only after a cooldown and probe.
It is not magic. It does not fix the downstream. It does not retry. What it does: turn a slow failing downstream from "ties up our threads" into "fails fast, frees our threads".
The state machine
Closed: normal operation. Calls pass through. The breaker observes successes and failures.
Open: tripped. Calls fail immediately with CircuitOpenError or equivalent. The downstream is not contacted. After a cooldown, transition to half-open.
Half-open: probing. A small number of calls are allowed through. Their outcomes decide the next state: all (or most) successes → closed; any (or many) failures → open with a fresh cooldown.
State transitions must be atomic. Use AtomicReference / atomic.Value / a mutex around the state field.
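As a rough illustration of atomic transitions, the sketch below uses compare-and-set so that exactly one thread performs each state change. The class and method names (BreakerStateMachine, trip, beginProbing, close) are assumptions made for this example, and the cooldown timing check is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: BreakerStateMachine and its method names are illustrative.
final class BreakerStateMachine {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);

    State current() {
        return state.get();
    }

    // Closed or half-open -> open. True only for the caller that performed the transition.
    boolean trip() {
        return state.compareAndSet(State.CLOSED, State.OPEN)
                || state.compareAndSet(State.HALF_OPEN, State.OPEN);
    }

    // Open -> half-open. The cooldown check is left to the caller in this sketch.
    boolean beginProbing() {
        return state.compareAndSet(State.OPEN, State.HALF_OPEN);
    }

    // Half-open -> closed, once the probes have succeeded.
    boolean close() {
        return state.compareAndSet(State.HALF_OPEN, State.CLOSED);
    }
}
```

Because each method returns whether its compare-and-set won, side effects such as starting the cooldown timer or emitting a metric can be attached to the single winning thread.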
The trip policy
Two ingredients:
Sliding window of outcomes. Track successes and failures over the last N calls or last T seconds. The bigger the window, the more samples, the more stable the decision. The smaller, the more reactive.
Minimum-call floor. Don't trip on a tiny sample. "5 calls, 3 failures" is 60% failure rate but not enough data. Require at least 10-20 calls in the window before considering the failure rate.
The trip rule: window has at least minCalls AND failure rate ≥ threshold (typically 0.5).
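The trip rule can be sketched over a count-based window (the "last N calls" variant). CountWindow and its methods are hypothetical names for this example, and it uses synchronized methods for brevity rather than lock-free counters.

```java
// Sketch only: a ring buffer over the last N outcomes, plus the trip rule.
final class CountWindow {
    private final boolean[] failed; // failed[i] is true if that recorded call failed
    private int next = 0;           // slot that the next outcome will overwrite
    private int size = 0;           // number of outcomes currently held
    private int failures = 0;       // number of failures currently held

    CountWindow(int capacity) {
        this.failed = new boolean[capacity];
    }

    synchronized void record(boolean failure) {
        if (size == failed.length) {
            if (failed[next]) failures--; // the oldest outcome falls out of the window
        } else {
            size++;
        }
        failed[next] = failure;
        if (failure) failures++;
        next = (next + 1) % failed.length;
    }

    // Trip only when there is enough data AND the failure rate reaches the threshold.
    synchronized boolean shouldTrip(int minCalls, double threshold) {
        return size >= minCalls && (double) failures / size >= threshold;
    }
}
```

With minCalls set to 10, the "5 calls, 3 failures" case above does not trip even though its failure rate is 60%.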
Half-open: the probe phase
When the cooldown expires, the breaker doesn't immediately re-open the floodgates. It allows a small number of probes (3-10 typical) and watches their outcomes.
Two reasons to be careful here:
- The downstream might still be recovering. A flood of traffic at this moment can re-trip the breaker immediately. Cap concurrent probes; reject extra calls as if open.
- One success doesn't mean recovery. Wait for the probe batch to complete; close the breaker only if a clear majority succeeded.
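Both precautions can be sketched with a small per-half-open-period object: it hands out a fixed number of probe slots and gives a verdict only when the whole batch has finished. ProbeBatch, Verdict, and the minSuccesses parameter are assumptions made for this example; a fresh instance would be created each time the breaker enters half-open.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: one instance per half-open period.
final class ProbeBatch {
    enum Verdict { PENDING, CLOSE, REOPEN }

    private final int maxProbes;
    private final int minSuccesses;
    private final AtomicInteger admitted = new AtomicInteger();
    private final AtomicInteger successes = new AtomicInteger();
    private final AtomicInteger completed = new AtomicInteger();

    ProbeBatch(int maxProbes, int minSuccesses) {
        this.maxProbes = maxProbes;
        this.minSuccesses = minSuccesses;
    }

    // True for the first maxProbes callers; everyone else should be rejected as if open.
    boolean tryAdmit() {
        return admitted.getAndIncrement() < maxProbes;
    }

    // Called once by each admitted probe when it finishes.
    Verdict record(boolean success) {
        if (success) successes.incrementAndGet();
        if (completed.incrementAndGet() < maxProbes) return Verdict.PENDING;
        // Last probe of the batch: decide on the batch as a whole.
        return successes.get() >= minSuccesses ? Verdict.CLOSE : Verdict.REOPEN;
    }
}
```

For example, new ProbeBatch(5, 4) admits five probes and closes the breaker only if at least four of them succeed.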
Per-downstream
This is the rule that catches teams off guard. One global breaker for all downstreams means: when payment-service fails, calls to user-service also stop. The "isolate the failure" benefit is gone.
The fix: a registry of breakers, one per downstream. Look up by downstream identifier (URL, service name, queue name). Cheap to maintain (a few hundred bytes per breaker, microseconds to look up).
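A minimal sketch of such a registry, assuming breakers are keyed by a service-name string. BreakerRegistry and forDownstream are hypothetical names, and the breaker type is left generic so the example stays self-contained.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: B stands for whatever breaker class is in use.
final class BreakerRegistry<B> {
    private final ConcurrentHashMap<String, B> breakers = new ConcurrentHashMap<>();
    private final Function<String, B> factory;

    BreakerRegistry(Function<String, B> factory) {
        this.factory = factory;
    }

    // Returns the breaker for this downstream, creating it on first use.
    // computeIfAbsent runs the factory at most once per key, even under contention.
    B forDownstream(String name) {
        return breakers.computeIfAbsent(name, factory);
    }
}
```

Callers then ask for registry.forDownstream("payment-service") on each call, so a trip on payment-service leaves the breaker for user-service untouched.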
What success looks like
A well-tuned breaker:
- Trips within seconds of a real downstream failure.
- Stays open for the cooldown without flickering.
- Probes carefully when half-open.
- Re-closes within tens of seconds after the downstream recovers.
- Stays closed during normal operation, even with occasional one-off failures.
A badly-tuned breaker either trips on noise (too sensitive) or never trips when it should (too tolerant). Tuning takes observing real production traffic; the defaults from a library are a starting point.
Composing with the rest
The full client stack: a bulkhead caps concurrent calls, a circuit breaker fails fast when the downstream is unhealthy, retry handles single-shot transients, and a timeout caps individual attempts. The order matters: retry wraps the breaker (each attempt passes through the breaker, so retries against an open breaker fail fast), and the bulkhead wraps the breaker (so even open-breaker rejections pass through the concurrency cap).
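The rationale for the retry ordering is that retry attempts should run into an open breaker and fail fast. The sketch below illustrates that behaviour with a deliberately simplified breaker. ToyBreaker (which opens after a fixed number of consecutive failures, never recovers, and is not thread-safe), OpenException, and withRetry are all hypothetical names for this example.

```java
import java.util.function.Supplier;

// Sketch only: shows retry attempts passing through a breaker.
final class RetryThroughBreaker {
    static final class OpenException extends RuntimeException {}

    // Toy breaker: opens after `limit` consecutive failures; no cooldown, not thread-safe.
    static final class ToyBreaker {
        private final int limit;
        private int consecutiveFailures = 0;

        ToyBreaker(int limit) {
            this.limit = limit;
        }

        <T> T call(Supplier<T> fn) {
            if (consecutiveFailures >= limit) throw new OpenException();
            try {
                T result = fn.get();
                consecutiveFailures = 0;
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                throw e;
            }
        }
    }

    // Every attempt goes through the breaker; an open breaker ends the retry loop at once.
    static <T> T withRetry(int attempts, ToyBreaker breaker, Supplier<T> fn) {
        if (attempts < 1) throw new IllegalArgumentException("attempts must be >= 1");
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return breaker.call(fn);
            } catch (OpenException e) {
                throw e; // fail fast instead of retrying against an open breaker
            } catch (RuntimeException e) {
                last = e; // transient failure: try again
            }
        }
        throw last;
    }
}
```

With a limit of 2 and 5 allowed attempts against a downstream that always fails, the downstream is contacted twice and the third attempt is rejected by the breaker without touching it.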
Implementations
The state and counters are atomic. The sliding window is a bucketed approximation: a ring of buckets representing recent time slices, each with success and failure counts.
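The SlidingWindow type used by the breaker class in this section is not defined in the text. One way the ring of time-slice buckets could look is sketched here, under assumptions: BucketedWindow is a hypothetical name, the clock is injected as a millisecond supplier so it can be controlled in tests, and synchronized methods stand in for per-bucket atomics to keep the example short.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

// Sketch only: a ring of time-slice buckets with success and failure counts.
final class BucketedWindow {
    private final long bucketMillis;
    private final LongSupplier clockMillis;
    private final long[] slice;     // time-slice number currently stored in each slot
    private final int[] successes;
    private final int[] failures;

    BucketedWindow(int buckets, long bucketMillis, LongSupplier clockMillis) {
        this.bucketMillis = bucketMillis;
        this.clockMillis = clockMillis;
        this.slice = new long[buckets];
        this.successes = new int[buckets];
        this.failures = new int[buckets];
        Arrays.fill(slice, Long.MIN_VALUE); // no slot holds a real slice yet
    }

    private long currentSlice() {
        return Math.floorDiv(clockMillis.getAsLong(), bucketMillis);
    }

    // Slot for the current slice; a slot left over from an older slice is zeroed on reuse.
    private int currentSlot() {
        long now = currentSlice();
        int i = (int) Math.floorMod(now, (long) slice.length);
        if (slice[i] != now) {
            slice[i] = now;
            successes[i] = 0;
            failures[i] = 0;
        }
        return i;
    }

    synchronized void recordSuccess() { successes[currentSlot()]++; }

    synchronized void recordFailure() { failures[currentSlot()]++; }

    synchronized long totalCalls() { return sum(successes) + sum(failures); }

    synchronized double failureRate() {
        long total = sum(successes) + sum(failures);
        return total == 0 ? 0.0 : (double) sum(failures) / total;
    }

    // Adds up only the slots whose slice is still inside the window.
    private long sum(int[] counts) {
        long oldestLive = currentSlice() - slice.length + 1;
        long total = 0;
        for (int i = 0; i < slice.length; i++) {
            if (slice[i] >= oldestLive) total += counts[i];
        }
        return total;
    }
}
```

It is an approximation because outcomes expire a whole bucket at a time rather than one by one; more, narrower buckets tighten the approximation at the cost of a little memory.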
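import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Thrown instead of contacting the downstream while calls are being rejected. */
    public static class CircuitOpenException extends RuntimeException {}

    /** Outcome tracker, for example the ring of time-slice buckets described above. */
    public interface SlidingWindow {
        void recordSuccess();
        void recordFailure();
        long totalCalls();
        double failureRate();
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final AtomicLong openedAt = new AtomicLong(0);
    private final AtomicInteger probesAdmitted = new AtomicInteger(0);
    private final AtomicInteger probeSuccesses = new AtomicInteger(0);
    private final SlidingWindow window;
    private final long cooldownNanos;
    private final double failureThreshold;
    private final int minCalls;
    private final int maxProbes;

    public CircuitBreaker(SlidingWindow window, long cooldownNanos,
                          double failureThreshold, int minCalls, int maxProbes) {
        this.window = window;
        this.cooldownNanos = cooldownNanos;
        this.failureThreshold = failureThreshold;
        this.minCalls = minCalls;
        this.maxProbes = maxProbes;
    }

    public <T> T call(Callable<T> fn) throws Exception {
        if (state.get() == State.OPEN) {
            if (System.nanoTime() - openedAt.get() < cooldownNanos) {
                throw new CircuitOpenException();
            }
            // Cooldown elapsed. Losing this race is fine: another thread already moved us on.
            state.compareAndSet(State.OPEN, State.HALF_OPEN);
        }
        // Half-open: admit at most maxProbes calls and reject the rest as if open.
        if (state.get() == State.HALF_OPEN && probesAdmitted.incrementAndGet() > maxProbes) {
            throw new CircuitOpenException();
        }

        try {
            T result = fn.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    private void onSuccess() {
        window.recordSuccess();
        // Close only after every probe in the batch has succeeded, not on the first one.
        if (state.get() == State.HALF_OPEN && probeSuccesses.incrementAndGet() >= maxProbes) {
            state.compareAndSet(State.HALF_OPEN, State.CLOSED);
        }
    }

    private void onFailure() {
        window.recordFailure();
        State s = state.get();
        boolean probeFailed = s == State.HALF_OPEN;
        boolean rateExceeded = s == State.CLOSED
                && window.totalCalls() >= minCalls
                && window.failureRate() >= failureThreshold;
        if (probeFailed || rateExceeded) {
            trip(s);
        }
    }

    private void trip(State from) {
        // Publish the timestamp before the state so no thread sees OPEN with a stale openedAt.
        openedAt.set(System.nanoTime());
        if (state.compareAndSet(from, State.OPEN)) {
            // Safe to reset here: no probe is admitted again until the cooldown elapses.
            probesAdmitted.set(0);
            probeSuccesses.set(0);
        }
    }
}
Key points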
- Three states: closed (normal), open (fail-fast), half-open (probing). Atomic state transitions.
- Trip policy: failure rate over a sliding window (last N calls or last T seconds), with a minimum-call floor.
- Cooldown in open state: 30s-5min typical. After cooldown, transition to half-open.
- Half-open allows N probes (typically 3-10). Track their outcomes; success rate decides next state.
- Per-downstream breaker. Never one global breaker. Different downstreams have different reliability.
Follow-up questions
- How do I size the sliding window?
- What if I get a flood of requests right when half-open opens?
- Should I share circuit breaker state across instances?
- How does the breaker interact with metrics?
Gotchas
- One global breaker for all downstreams: one bad downstream stops everything
- Counting client errors (4xx) as breaker failures: trips on user mistakes, not service health
- Half-open with no probe cap: surge of probe traffic on a recovering downstream
- No fallback when breaker opens: caller sees CircuitOpenException with no graceful path
- Breaker without timeout: slow calls can pile up before the breaker even sees them as failures
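Two of these gotchas, failure classification and the missing fallback, can be sketched in a few lines. Everything here is an assumption made for illustration: HttpStatusException is a hypothetical exception carrying a status code, the local CircuitOpenException stands in for whatever the breaker throws, and treating only 5xx and non-HTTP errors as breaker failures is one possible policy.

```java
import java.util.function.Supplier;

// Sketch only: classification and fallback helpers with hypothetical names.
final class BreakerGotchas {
    static final class CircuitOpenException extends RuntimeException {}

    // Hypothetical exception type that carries the HTTP status of a failed call.
    static final class HttpStatusException extends RuntimeException {
        final int status;

        HttpStatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    // 4xx means the caller sent a bad request, so it says nothing about service health.
    // 5xx and non-HTTP errors (timeouts, refused connections) count against the breaker.
    static boolean countsAsFailure(Throwable t) {
        if (t instanceof HttpStatusException) {
            return ((HttpStatusException) t).status >= 500;
        }
        return true;
    }

    // Gives the caller a graceful path when the breaker rejects the call.
    static <T> T withFallback(Supplier<T> guardedCall, Supplier<T> fallback) {
        try {
            return guardedCall.get();
        } catch (CircuitOpenException e) {
            return fallback.get();
        }
    }
}
```

The fallback only handles breaker rejections; other exceptions still propagate, so real downstream errors are not silently masked.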