Circuit Breaker: Production Implementation

What it is

A production circuit breaker is a state machine wrapped around downstream calls. It tracks recent failures, trips when failures exceed a threshold, and lets calls through again only after a cooldown and probe.

It is not magic. It does not fix the downstream. It does not retry. What it does: turn a slow failing downstream from "ties up our threads" into "fails fast, frees our threads".

The state machine

Closed: normal operation. Calls pass through. The breaker observes successes and failures.

Open: tripped. Calls fail immediately with CircuitOpenError or equivalent. The downstream is not contacted. After a cooldown, transition to half-open.

Half-open: probing. A small number of calls are allowed through. Their outcomes decide the next state: all (or most) successes → closed; any (or many) failures → open with a fresh cooldown.

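State transitions must be atomic: two threads that observe the same failure must not both trip the breaker or both start a cooldown. Use an AtomicReference (Java), an atomic.Value (Go), or a mutex around the state field.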
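A minimal sketch of the three states with mutex-guarded transitions, in Python. The names BreakerState, trip, reset, and the injectable clock parameter are assumptions made for illustration, not part of any particular library.

```python
import enum
import threading
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerState:
    """Holds the breaker state; every read and transition happens under one lock."""

    def __init__(self, cooldown_seconds=30.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._state = State.CLOSED
        self._opened_at = None
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so tests can control time

    def current(self):
        """Return the state, moving OPEN -> HALF_OPEN once the cooldown has elapsed."""
        with self._lock:
            if (self._state is State.OPEN
                    and self._clock() - self._opened_at >= self._cooldown):
                self._state = State.HALF_OPEN
            return self._state

    def trip(self):
        """CLOSED or HALF_OPEN -> OPEN, starting a fresh cooldown."""
        with self._lock:
            self._state = State.OPEN
            self._opened_at = self._clock()

    def reset(self):
        """HALF_OPEN -> CLOSED after the probes succeed."""
        with self._lock:
            self._state = State.CLOSED
            self._opened_at = None
```

Checking the cooldown lazily inside current() avoids a background timer: the transition to half-open happens on the first call that arrives after the cooldown.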

The trip policy

Two ingredients:

Sliding window of outcomes. Track successes and failures over the last N calls or last T seconds. The bigger the window, the more samples, the more stable the decision. The smaller, the more reactive.

Minimum-call floor. Don't trip on a tiny sample. "5 calls, 3 failures" is 60% failure rate but not enough data. Require at least 10-20 calls in the window before considering the failure rate.

The trip rule: window has at least minCalls AND failure rate ≥ threshold (typically 0.5).
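The trip rule can be sketched with a count-based window. This is an illustrative sketch, not a library API: the class name CountWindow and the default values are assumptions, and the sketch is not thread-safe on its own.

```python
from collections import deque


class CountWindow:
    """Keeps the outcomes of the last `size` calls and applies the trip rule."""

    def __init__(self, size=100, min_calls=20, threshold=0.5):
        self._outcomes = deque(maxlen=size)  # True means the call failed
        self._min_calls = min_calls
        self._threshold = threshold

    def record(self, failed):
        # deque(maxlen=...) drops the oldest outcome automatically when full.
        self._outcomes.append(bool(failed))

    def failure_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_trip(self):
        # Both conditions must hold: enough samples AND a high failure rate.
        return (len(self._outcomes) >= self._min_calls
                and self.failure_rate() >= self._threshold)
```

With min_calls=10, the "5 calls, 3 failures" case reports a 60% failure rate but should_trip() stays False until the window holds at least 10 outcomes.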

Half-open: the probe phase

When the cooldown expires, the breaker doesn't immediately restore full traffic. It allows a small number of probes (3-10 typical) and watches their outcomes.

Two reasons to be careful here:

The downstream might still be recovering. A flood of traffic at this moment can re-trip the breaker immediately. Cap concurrent probes; reject extra calls as if open.
One success doesn't mean recovery. Wait for the probe batch to complete; close the breaker only if a clear majority succeeded.
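Both precautions can be sketched as a small probe gate. The name ProbeGate, the string return values, and the 0.8 success ratio are assumptions chosen for illustration. The sketch caps the total size of the probe batch, which also bounds how many probes can be in flight at once.

```python
import threading


class ProbeGate:
    """Admits at most max_probes calls while half-open; decides once all have reported."""

    def __init__(self, max_probes=5, min_success_ratio=0.8):
        self._lock = threading.Lock()
        self._max_probes = max_probes
        self._min_success_ratio = min_success_ratio
        self._admitted = 0
        self._completed = 0
        self._successes = 0

    def try_admit(self):
        """Return True if the caller may probe; False means fail fast as if open."""
        with self._lock:
            if self._admitted >= self._max_probes:
                return False
            self._admitted += 1
            return True

    def report(self, success):
        """Record one probe outcome. Returns "close", "open", or None while undecided."""
        with self._lock:
            self._completed += 1
            if success:
                self._successes += 1
            if self._completed < self._max_probes:
                return None  # batch not finished; one success is not enough
            ratio = self._successes / self._completed
            return "close" if ratio >= self._min_success_ratio else "open"
```

A stricter variant could return "open" on the first failed probe instead of waiting for the whole batch; which rule fits depends on how noisy the downstream is when healthy.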

Per-downstream

This is the rule that catches teams off guard. One global breaker for all downstreams means: when payment-service fails, calls to user-service also stop. The "isolate the failure" benefit is gone.

The fix: a registry of breakers, one per downstream. Look up by downstream identifier (URL, service name, queue name). Cheap to maintain (a few hundred bytes per breaker, microseconds to look up).
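A registry along these lines could look like the following sketch. BreakerRegistry and the factory callback are hypothetical names; the factory stands in for whatever constructs a breaker in your codebase.

```python
import threading


class BreakerRegistry:
    """Lazily creates and caches one breaker per downstream identifier."""

    def __init__(self, factory):
        self._lock = threading.Lock()
        self._factory = factory  # called once per new downstream identifier
        self._breakers = {}

    def get(self, downstream):
        # The lock ensures two threads never create two breakers for one downstream.
        with self._lock:
            breaker = self._breakers.get(downstream)
            if breaker is None:
                breaker = self._factory(downstream)
                self._breakers[downstream] = breaker
            return breaker
```

Callers then write registry.get("payment-service") at the call site, so a tripped payment-service breaker leaves the user-service breaker untouched.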

What success looks like

A well-tuned breaker:

Trips within seconds of a real downstream failure.
Stays open for the cooldown without flickering.
Probes carefully when half-open.
Re-closes within tens of seconds after the downstream recovers.
Stays closed during normal operation, even with occasional one-off failures.

A badly-tuned breaker either trips on noise (too sensitive) or never trips when it should (too tolerant). Tuning takes observing real production traffic; the defaults from a library are a starting point.

Composing with the rest

The full client stack: bulkhead caps concurrent calls, circuit breaker fails fast when downstream is unhealthy, retry handles single-shot transients, timeout caps individual attempts. The order matters. Every attempt, retries included, should pass through the breaker, so that retries against an open breaker fail fast instead of reaching the downstream. The bulkhead sits outside the breaker, so even calls that an open breaker rejects still count against the concurrency cap.
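The point about retries meeting an open breaker can be sketched as follows. This covers only the retry and breaker layers; the bulkhead and timeout are left out to keep it short. TinyBreaker is a deliberately simplified stand-in that trips on consecutive failures rather than on a windowed failure rate, and all names here are assumptions for illustration.

```python
class CircuitOpenError(Exception):
    pass


class TinyBreaker:
    """Simplified stand-in: opens after max_failures consecutive failures."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = 0
        self.is_open = False

    def call(self, fn):
        if self.is_open:
            # Fail fast: the downstream is not contacted.
            raise CircuitOpenError("breaker open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.is_open = True
            raise
        self.failures = 0
        return result


def with_retry(fn, attempts=3):
    """Retry fn on ordinary errors, but give up at once on CircuitOpenError."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying against an open breaker is pointless
        except Exception as exc:
            last_error = exc
    raise last_error


downstream_calls = []


def failing_downstream():
    downstream_calls.append(1)
    raise RuntimeError("downstream unavailable")


breaker = TinyBreaker(max_failures=2)
try:
    with_retry(lambda: breaker.call(failing_downstream), attempts=5)
except CircuitOpenError:
    # The third attempt met the open breaker; attempts four and five never ran.
    pass
```

Because each attempt goes through breaker.call, the retry loop stops contacting the downstream as soon as the breaker trips, even though it was allowed five attempts.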

Follow-up questions

▸How do I size the sliding window?

Tradeoff between reactivity and false positives. A 10-call window: trips fast but flaky on bursty traffic. A 100-call or 60-second window: more stable but slower to trip. For high-traffic services, 60-second time-based windows work well; for low-traffic, count-based with a minimum-call floor.
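For the time-based option, one possible shape is a deque of timestamped outcomes that evicts anything older than the window. TimeWindow and its clock parameter are illustrative assumptions, and the sketch is not thread-safe on its own.

```python
import time
from collections import deque


class TimeWindow:
    """Counts calls and failures seen during the last window_seconds."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._events = deque()  # (timestamp, failed) pairs, oldest first

    def _evict(self):
        cutoff = self._clock() - self._window
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def record(self, failed):
        self._evict()
        self._events.append((self._clock(), bool(failed)))

    def counts(self):
        """Return (total_calls, failures) inside the window."""
        self._evict()
        failures = sum(1 for _, failed in self._events if failed)
        return len(self._events), failures
```

Memory grows with traffic in this form, so high-throughput services often bucket outcomes per second instead of storing every event.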

▸What if I get a flood of requests right when the breaker goes half-open?

Don't admit them all. Half-open should cap concurrent probes (e.g., 3 at a time, others fail-fast as if open). Without this cap, the still-recovering downstream gets a surge of probe traffic at the moment it can least handle it.

▸Should I share circuit breaker state across instances?

Usually not. Each service instance maintains its own breaker. The downstream's actual health is the same; each instance discovers it independently. Sharing state across instances (via Redis, say) sounds smart but adds latency and a coordination point. Per-instance breakers converge to the same decision quickly enough.

▸How does the breaker interact with metrics?

Emit a gauge for state (0/1/2 for closed/open/half-open), counters for trips and resets, and histograms for time-in-state. Alert on 'breaker open for >5 minutes' or 'breaker tripping >10 times per hour'. The breaker is one of the most useful sources of downstream-health signal.
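As a sketch of where those signals come from, the recorder below keeps them in memory; in practice each field would be forwarded to whatever metrics client you use. BreakerMetrics, on_transition, and the field names are assumptions for illustration.

```python
import time
from collections import Counter

# Gauge encoding from the text: 0 = closed, 1 = open, 2 = half-open.
STATE_GAUGE = {"closed": 0, "open": 1, "half_open": 2}


class BreakerMetrics:
    """Records breaker state changes as a gauge, counters, and time-in-state samples."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._state = "closed"
        self._entered_at = clock()
        self.state_gauge = STATE_GAUGE["closed"]
        self.counters = Counter()
        self.time_in_state = []  # (state, seconds) samples to feed a histogram

    def on_transition(self, new_state):
        now = self._clock()
        # Close out the state being left before switching.
        self.time_in_state.append((self._state, now - self._entered_at))
        if new_state == "open":
            self.counters["trips"] += 1
        elif new_state == "closed":
            self.counters["resets"] += 1
        self._state = new_state
        self._entered_at = now
        self.state_gauge = STATE_GAUGE[new_state]
```

Calling on_transition from the same code path that changes the breaker state keeps the gauge and counters consistent with what the breaker actually did.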
