Circuit Breaker Pattern

What it is

A circuit breaker sits between calling code and a downstream service and tracks recent call outcomes. If too many calls fail, it "opens" and rejects subsequent calls immediately, without contacting the downstream. After a cooldown, it lets a small number of probe calls through. If the probes succeed, traffic resumes; if not, it goes back to rejecting.

The name comes from electrical circuit breakers: when current spikes, the breaker trips to protect the rest of the circuit. Same idea: when downstream failures spike, the breaker protects the calling service from cascading failure.

The three states (see diagram above)

CLOSED. Normal operation. Every call goes through. The breaker counts successes and failures over a sliding window.
OPEN. The breaker has tripped because the failure rate crossed a threshold (e.g., 50% of the last 20 calls). Every call fails instantly with a "circuit open" error. The downstream is not contacted at all. Cooldown timer starts.
HALF-OPEN. The cooldown elapsed. The breaker lets a small number of probe calls through to test whether the downstream is healthy again. If they succeed, transition back to CLOSED. If they fail, back to OPEN with a fresh cooldown.
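
The three states can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `window_size`, `cooldown_seconds` and `clock` are all assumptions made for the example. It is not thread-safe, it treats every exception as a failure, and it only evaluates the failure rate once the window is full.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_size=20,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_size = window_size
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock                        # injectable so tests can fake time
        self.state = State.CLOSED
        self.results = deque(maxlen=window_size)  # True means the call failed
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open")
            self.state = State.HALF_OPEN          # cooldown elapsed: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.state = State.CLOSED             # probe succeeded: resume traffic
            self.results.clear()
        else:
            self.results.append(False)

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()                          # probe failed: fresh cooldown
            return
        self.results.append(True)
        window_full = len(self.results) == self.window_size
        if window_full and sum(self.results) / self.window_size >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.results.clear()
```

Usage is `breaker.call(fetch_user, user_id)`, where `fetch_user` stands for whatever function contacts the downstream. A real implementation would also limit how many probes run concurrently in HALF-OPEN.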

Why this matters

Without a circuit breaker, a slow downstream cascades through the system. Each call ties up a thread (or goroutine, or connection) until the timeout fires. If the downstream takes 30 seconds to fail, then at 100 calls per minute about 50 calls are in flight at any moment (100 per minute × 0.5 minutes each), which is enough to exhaust a pool of 50 threads. Now the calling service is also slow, and the failure has spread.
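
The thread-exhaustion figure is arrival rate multiplied by time per call. A small sketch of that arithmetic follows; the function name `calls_in_flight` and the 50 ms healthy latency are assumptions chosen for illustration.

```python
def calls_in_flight(calls_per_minute: float, seconds_per_call: float) -> float:
    """Average number of calls (and so threads) occupied at steady state."""
    return calls_per_minute * seconds_per_call / 60.0


# Downstream hanging until a 30-second timeout, at 100 calls per minute.
print(calls_in_flight(100, 30))    # 50.0

# Same traffic against a downstream answering in an assumed 50 ms.
print(calls_in_flight(100, 0.05))
```

At the healthy latency the same traffic occupies well under one thread on average, which is why the problem only shows up once the downstream starts hanging.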

With a circuit breaker, the failure is contained. Calls fail fast (microseconds, not 30 seconds). Threads are freed. The service remains responsive. Callers can fall back to a cached value, a default response, or a queued retry.

How to trip

Two common policies:

Failure rate over a sliding window. "If 50% of the last 20 calls failed, open." Usually the more robust choice, because a rate means the same thing at low and high traffic.

Consecutive failures. "If 5 calls in a row fail, open." Simpler, but flaky on bursty traffic (5 failures in a row could mean nothing if the previous 1000 succeeded).

Combine with a minimum-call floor: "open only if at least 10 calls have been observed in the window". Prevents tripping on a one-call sample.
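
Both policies, plus the minimum-call floor, can be sketched as small objects that record one outcome at a time and report whether the breaker should open. The class names and the `record` method are hypothetical, chosen for this example only.

```python
from collections import deque


class FailureRatePolicy:
    """Open when the failure rate over the last `window_size` calls reaches
    `failure_rate`, but only once at least `min_calls` have been observed."""

    def __init__(self, window_size=20, failure_rate=0.5, min_calls=10):
        self.window = deque(maxlen=window_size)  # oldest outcomes fall off
        self.failure_rate = failure_rate
        self.min_calls = min_calls

    def record(self, failed: bool) -> bool:
        self.window.append(failed)
        if len(self.window) < self.min_calls:
            return False                         # sample too small to judge
        return sum(self.window) / len(self.window) >= self.failure_rate


class ConsecutiveFailuresPolicy:
    """Open after `max_consecutive` failures in a row; any success resets."""

    def __init__(self, max_consecutive=5):
        self.max_consecutive = max_consecutive
        self.streak = 0

    def record(self, failed: bool) -> bool:
        self.streak = self.streak + 1 if failed else 0
        return self.streak >= self.max_consecutive
```

Feeding both policies a long run of successes followed by 5 failures shows the difference: the consecutive-failures policy trips, while the rate policy sees 5 failures in its last 20 calls (25%) and stays closed.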

What counts as a failure

Be specific. Not every error means the downstream is unhealthy. A 4xx response (client error) means the caller sent something invalid; do not count it. A 5xx, timeout, or connection error means the downstream is unhealthy; do count it.

The standard practice: classify errors at the boundary, only count "downstream is unhealthy" errors, ignore "bad request" errors from the caller.
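
One way to classify at the boundary is a single predicate that the breaker consults before counting an outcome. This sketch assumes an HTTP downstream and uses Python's built-in `TimeoutError` and `ConnectionError`; the function name `counts_as_failure` is hypothetical, and a real client library may raise its own exception types that would need mapping.

```python
from typing import Optional


def counts_as_failure(status_code: Optional[int] = None,
                      error: Optional[BaseException] = None) -> bool:
    """Return True only for outcomes that suggest the downstream is unhealthy."""
    if error is not None:
        # Timeouts and connection errors: the downstream did not answer properly.
        return isinstance(error, (TimeoutError, ConnectionError))
    if status_code is not None:
        # 5xx: the downstream failed. 4xx: the caller sent a bad request.
        return 500 <= status_code <= 599
    return False


print(counts_as_failure(status_code=503))                    # True
print(counts_as_failure(status_code=404))                    # False
print(counts_as_failure(error=TimeoutError("timed out")))    # True
```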

Per-downstream

Always per-downstream, never global. One breaker for the payment service, one for the user service, one for the recommendations service. A failing recommendations service should not stop charging customers.
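
A per-downstream setup usually amounts to a registry keyed by downstream name. The sketch below uses a deliberately tiny stand-in breaker (consecutive failures only, no cooldown) to keep the focus on isolation; `SimpleBreaker`, `BreakerRegistry` and the downstream names are assumptions for illustration.

```python
class SimpleBreaker:
    """Hypothetical minimal breaker: open after `max_failures` failures in a row."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def record(self, failed):
        self.failures = self.failures + 1 if failed else 0


class BreakerRegistry:
    """Hands out one breaker per downstream name, created on first use."""

    def __init__(self, default_max_failures=5, overrides=None):
        self.default_max_failures = default_max_failures
        self.overrides = overrides or {}   # per-downstream thresholds
        self._breakers = {}

    def for_downstream(self, name):
        if name not in self._breakers:
            limit = self.overrides.get(name, self.default_max_failures)
            self._breakers[name] = SimpleBreaker(limit)
        return self._breakers[name]


registry = BreakerRegistry(overrides={"payments": 3})
for _ in range(5):
    registry.for_downstream("recommendations").record(failed=True)

print(registry.for_downstream("recommendations").is_open)  # True
print(registry.for_downstream("payments").is_open)         # False
```

The recommendations breaker is open, but the payments breaker never saw those failures and keeps passing calls.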

What the breaker does not do

It does not call user code automatically. Calls still have to be wrapped. It does not retry; that is a separate pattern. It does not provide a fallback; the caller has to supply one on catching the open-state error.

The breaker is the trigger. The fallback (cached value, default, degraded response) is the caller's responsibility.
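
On the caller's side this typically looks like catching the open-state error and substituting something else. In this sketch `CircuitOpenError`, `get_recommendations`, `fetch` and `cache` are all hypothetical: `fetch` stands for a breaker-wrapped call to the downstream, and `cache` is a plain dict of last known good values.

```python
class CircuitOpenError(Exception):
    """Raised by the (assumed) breaker when it rejects a call."""


def get_recommendations(user_id, fetch, cache):
    """Return fresh recommendations, or a degraded answer if the breaker is open."""
    try:
        items = fetch(user_id)            # breaker-wrapped downstream call
    except CircuitOpenError:
        # The breaker only signals; choosing the fallback is the caller's job.
        return cache.get(user_id, [])     # cached value, else an empty default
    cache[user_id] = items                # remember the last good response
    return items
```

Only the open-state error is caught here. Other exceptions still propagate, so genuine bugs are not silently masked by the fallback.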

When to skip it

For internal services with reliable, low-latency calls, the breaker overhead might exceed the value. For one-off calls where both ends are controlled and failure can be detected end-to-end, retry is enough. For non-critical paths where failure is acceptable, the open-state behaviour and fallback complexity might not be worth it.

For any external dependency, any database call, any message queue interaction, any third-party API: yes, breaker.

Follow-up questions

▸Circuit breaker vs retry: are they competing?

Complementary. Retry handles transient blips: 'this one call might work next time'. Circuit breaker handles sustained failure: 'this downstream is down, stop calling it'. Combine them: retry within a single logical call, circuit breaker across calls. Each retry attempt goes through the breaker, so if the breaker is open the call fails fast and the remaining attempts are skipped.
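
A sketch of that composition: the retry loop wraps a breaker-guarded call and treats the open-state error as non-retryable. `CircuitOpenError`, `call_with_retry` and `guarded_call` are hypothetical names, and the exponential backoff is one possible choice rather than a requirement. `attempts` is assumed to be at least 1.

```python
import time


class CircuitOpenError(Exception):
    """Raised by the (assumed) breaker when it rejects a call."""


def call_with_retry(guarded_call, attempts=3, backoff_seconds=0.1, sleep=time.sleep):
    """Retry transient failures, but stop immediately if the breaker is open."""
    last_error = None
    for attempt in range(attempts):
        try:
            return guarded_call()         # every attempt goes through the breaker
        except CircuitOpenError:
            raise                         # sustained failure: no further attempts
        except Exception as exc:
            last_error = exc              # transient blip: maybe try again
            if attempt < attempts - 1:
                sleep(backoff_seconds * (2 ** attempt))
    raise last_error
```

Because every attempt passes through the breaker, failed retries also count toward tripping it, so a burst of retries against a dead downstream opens the circuit sooner rather than hiding the outage.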

▸How is the failure threshold picked?

Three parameters: failure rate (e.g., 50%), minimum calls (e.g., 10), and sliding-window size. The minimum-calls floor prevents tripping on a tiny sample (one failure out of one is a 100% failure rate but means nothing). The window size controls how reactive the breaker is: a smaller window trips faster but produces more false positives; a larger window is slower to trip and slower to recover.

▸Why per-downstream and not global?

Different downstreams have different reliability. Service A being slow shouldn't stop calls to service B. Per-downstream breakers also allow tuning thresholds per downstream (a payment gateway gets a stricter breaker than a logging service).

▸What does 'fail fast' actually save?

Two things. First, the threads, goroutines, or connections that would have been tied up waiting for the downstream's timeout. Second, the downstream's load: the caller stops piling on requests while the downstream is trying to recover. The first is critical: with 50 threads and a downstream that takes 30s to time out, 100 calls per minute is enough to exhaust all threads.
