Circuit Breaker
State machine resilience pattern: CLOSED allows calls, OPEN rejects fast with Null Object fallbacks, HALF_OPEN probes cautiously. A proxy wraps remote services transparently.
Key Abstractions
CircuitState (State interface): handle_request, can_proceed, record_success, record_failure
ClosedState (concrete state): allows all calls, counts failures, transitions to OpenState when the threshold is reached
OpenState (concrete state): rejects all calls immediately, transitions to HalfOpenState after the cooldown period expires
HalfOpenState (concrete state): allows one probe call, transitions to ClosedState on success or OpenState on failure
CircuitBreaker (context): holds the current state and delegates request handling to it
ServiceProxy (Proxy): wraps the remote service with a circuit breaker, so the caller doesn't know the circuit breaker exists
NullResponse (Null Object): safe fallback when the circuit is open, such as cached data, a default value, or an empty result; no exception handling needed
CircuitBreakerListener (Observer): notified on state transitions for monitoring dashboards and alerting
How It Works
A software circuit breaker works much like the one in your home's electrical panel. When current flows normally, the breaker stays closed (the circuit is complete). When a dangerous fault occurs, such as too much current, the breaker trips open, breaking the circuit to prevent damage. The software version adds one step a household breaker lacks: after a cooldown, it tests whether the fault is resolved by moving to a half-open position that lets a little traffic through. If the test passes, the breaker closes again and normal flow resumes.
In distributed systems, the "dangerous current" is a failing downstream service. Without a circuit breaker, every request to a dead service burns a thread waiting for a timeout. Under load, those blocked threads pile up, exhausting your thread pool and cascading the failure to your own callers. The circuit breaker cuts the connection early: after N consecutive failures, it stops sending requests entirely and returns a safe fallback response immediately. This protects your system from wasting resources on a service that isn't going to respond.
The state machine has three states. CLOSED is normal operation: requests flow through and failures are counted. When the failure count hits the threshold, the breaker transitions to OPEN. In OPEN, every request is immediately rejected with a NullResponse fallback. No network call, no waiting, no exceptions. After a cooldown period, the breaker transitions to HALF_OPEN and allows a single probe request through. If the probe succeeds, the service has recovered and the breaker returns to CLOSED. If it fails, back to OPEN for another cooldown cycle.
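The transitions described above can be written down as a small lookup table. This is a sketch for illustration only: the `Event` names are made up here and do not appear in the implementation later in this article, which encodes the same rules inside state classes.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class Event(Enum):
    # Hypothetical event names, chosen for this sketch.
    THRESHOLD_REACHED = "threshold_reached"
    COOLDOWN_EXPIRED = "cooldown_expired"
    PROBE_SUCCEEDED = "probe_succeeded"
    PROBE_FAILED = "probe_failed"


# (current state, event) -> next state.
# Any pair not listed leaves the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.COOLDOWN_EXPIRED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.PROBE_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.PROBE_FAILED): State.OPEN,
}


def next_state(current: State, event: Event) -> State:
    """Look up the next state, staying put for undefined pairs."""
    return TRANSITIONS.get((current, event), current)
```

Only four arrows exist, which is what makes the pattern easy to reason about: there is no direct path from OPEN to CLOSED, so recovery always goes through a probe.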
Requirements
Functional
- Three states: CLOSED, OPEN, HALF_OPEN with well-defined transition rules
- Configurable failure threshold (number of failures before opening)
- Configurable cooldown period (how long to stay open before probing)
- Return a NullResponse fallback when the circuit is open (Null Object pattern)
- Proxy wrapper so callers don't need to know about the circuit breaker
- Observer notifications on state transitions for monitoring
Non-Functional
- Fast rejection in OPEN state: O(1) with no network call
- Thread-safe state transitions under concurrent load
- No resource leaks during state transitions
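The thread-safety requirement can be met in several ways. One minimal sketch, assuming a single `threading.Lock` around all bookkeeping, is shown here. `LockedBreaker` and `try_acquire` are hypothetical names used for illustration and are not part of the implementation later in this article, which takes no locks. The lock is held only while reading or updating state, never during the remote call itself.

```python
import threading
import time


class LockedBreaker:
    """Illustrative breaker whose bookkeeping is guarded by one lock."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 2.0):
        self._lock = threading.Lock()
        self._failure_threshold = failure_threshold
        self._cooldown_seconds = cooldown_seconds
        self._state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0
        self._probe_in_flight = False

    def try_acquire(self) -> bool:
        """Decide atomically whether the caller may make the remote call."""
        with self._lock:
            if self._state == "CLOSED":
                return True
            if self._state == "OPEN":
                if time.monotonic() - self._opened_at < self._cooldown_seconds:
                    return False
                self._state = "HALF_OPEN"
                self._probe_in_flight = False
            # HALF_OPEN: only the first caller gets to probe.
            if self._probe_in_flight:
                return False
            self._probe_in_flight = True
            return True

    def record_success(self) -> None:
        with self._lock:
            if self._state == "HALF_OPEN":
                self._state = "CLOSED"
                self._probe_in_flight = False
                self._failures = 0
            elif self._state == "CLOSED":
                self._failures = 0
            # OPEN: a late result from an earlier call; ignore it.

    def record_failure(self) -> None:
        with self._lock:
            if self._state == "OPEN":
                return  # late result; do not restart the cooldown
            self._failures += 1
            if self._state == "HALF_OPEN" or self._failures >= self._failure_threshold:
                self._state = "OPEN"
                self._opened_at = time.monotonic()
                self._probe_in_flight = False

    @property
    def state(self) -> str:
        with self._lock:
            return self._state
```

The important property is that the "cooldown expired, become HALF_OPEN, claim the probe" sequence happens under one lock acquisition, so two threads arriving at the same moment cannot both be admitted as the probe.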
Design Decisions
What's wrong with a state enum and switch statements?
A naive circuit breaker uses a state enum and switches on it everywhere: if state == OPEN: reject; elif state == HALF_OPEN: maybe_probe; else: allow. This scatters transition logic across multiple methods and makes adding new states (like a FORCED_OPEN for maintenance) painful. The State pattern encapsulates each state's behavior in its own class. Each state knows when and how to transition. Adding a new state means adding a new class. Nothing else changes. The CircuitBreaker context simply delegates to whatever state it currently holds.
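To make the "adding a new state means adding a new class" claim concrete, here is a stripped-down sketch. The `State`, `Closed`, `ForcedOpen`, and `Breaker` classes are simplified stand-ins written for this example, not the classes from the full implementation, and `ForcedOpen` is the hypothetical maintenance state mentioned above.

```python
from abc import ABC, abstractmethod


class State(ABC):
    """Minimal stand-in for a state interface."""
    name: str

    @abstractmethod
    def allows_call(self) -> bool: ...


class Closed(State):
    name = "CLOSED"

    def allows_call(self) -> bool:
        return True


class ForcedOpen(State):
    """Hypothetical maintenance state: rejects everything until an operator lifts it."""
    name = "FORCED_OPEN"

    def allows_call(self) -> bool:
        return False


class Breaker:
    def __init__(self) -> None:
        self.state: State = Closed()

    def call(self, func, fallback):
        # No branching on state names: the current state object decides.
        if not self.state.allows_call():
            return fallback
        return func()
```

`Breaker.call` was written before `ForcedOpen` existed and did not need to change to support it. With an enum and switch, every method that branches on the state would need a new case.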
Couldn't we just throw CircuitOpenException?
When the circuit is open, you have two choices: throw a CircuitOpenException or return a safe default. Throwing forces every caller to wrap calls in try/catch. If they forget, the exception propagates and crashes something. The Null Object pattern returns a NullResponse with a default value, a cached result, or an empty collection. The caller's code works without modification. It just gets degraded data. This is especially powerful in UI-facing services: show stale data from cache instead of an error page.
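The difference shows up at the call site. The sketch below contrasts the two options with made-up names (`Listing`, `CircuitOpenError`, `fetch_throwing`, `fetch_null_object`); the `circuit_open` flag stands in for a real breaker.

```python
from dataclasses import dataclass, field


class CircuitOpenError(Exception):
    """Hypothetical exception for the throwing variant."""


@dataclass
class Listing:
    items: list[str] = field(default_factory=list)
    is_fallback: bool = False


def fetch_throwing(circuit_open: bool) -> Listing:
    if circuit_open:
        raise CircuitOpenError("circuit is open")
    return Listing(items=["a", "b"])


def fetch_null_object(circuit_open: bool) -> Listing:
    if circuit_open:
        return Listing(is_fallback=True)  # empty but well-formed
    return Listing(items=["a", "b"])


def render(listing: Listing) -> str:
    """Caller code: identical for live and fallback data."""
    return ", ".join(listing.items) or "(nothing to show)"


def render_throwing(circuit_open: bool) -> str:
    # Every call site of the throwing variant needs a wrapper like this.
    try:
        return render(fetch_throwing(circuit_open))
    except CircuitOpenError:
        return "(nothing to show)"
```

With the Null Object variant, `render(fetch_null_object(...))` is the whole call site. The `is_fallback` flag stays available for callers that do want to tell degraded data apart from live data.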
Why hide the circuit breaker behind a proxy?
The ServiceProxy implements the same interface as the real service. Calling code doesn't import CircuitBreaker, doesn't check circuit states, doesn't handle special exceptions. You wire up the proxy at configuration time (or through dependency injection) and the rest of the codebase is blissfully unaware. This means you can add circuit breakers to existing services without modifying any caller code.
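A rough sketch of that wiring, assuming a structural `Service` protocol. All names here (`Service`, `RealService`, `GuardedService`, `handle`) are illustrative, and a plain boolean flag stands in for the breaker. As a simplification, both the real service and the proxy return `str` in this sketch.

```python
from typing import Protocol


class Service(Protocol):
    def call(self, request: str) -> str: ...


class RealService:
    def call(self, request: str) -> str:
        return f"OK: {request}"


class GuardedService:
    """Proxy with the same call() signature; a flag stands in for the breaker."""

    def __init__(self, inner: Service, fallback: str = "fallback") -> None:
        self._inner = inner
        self._fallback = fallback
        self.open = False

    def call(self, request: str) -> str:
        if self.open:
            return self._fallback
        return self._inner.call(request)


def handle(service: Service, request: str) -> str:
    """Caller depends only on the Service protocol."""
    return service.call(request).upper()
```

`handle` accepts either object, so swapping `RealService()` for `GuardedService(RealService())` at configuration time is the only change needed.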
How do we know when a circuit opens in production?
In production, you need to know when circuits open. An open circuit means a downstream dependency is failing, which usually means an incident. The Observer pattern decouples the circuit breaker from the monitoring infrastructure. You can attach a Prometheus metrics listener, a PagerDuty alerting listener, and a logging listener, all without the circuit breaker knowing about any of them.
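As a sketch of how independent listeners can hang off the same notification, the example below uses plain callables and in-memory stand-ins. `TransitionCounter`, `AlertListener`, and `Subject` are hypothetical names; a real metrics or paging listener would call its own client library where this code appends to a list or bumps a counter.

```python
from collections import Counter
from typing import Callable

Listener = Callable[[str, str], None]


class TransitionCounter:
    """Hypothetical metrics listener: counts transitions by (from, to)."""

    def __init__(self) -> None:
        self.counts = Counter()

    def __call__(self, from_state: str, to_state: str) -> None:
        self.counts[(from_state, to_state)] += 1


class AlertListener:
    """Hypothetical alerting listener: records a message whenever the circuit opens."""

    def __init__(self) -> None:
        self.alerts = []

    def __call__(self, from_state: str, to_state: str) -> None:
        if to_state == "OPEN":
            self.alerts.append(f"circuit opened (was {from_state})")


class Subject:
    """Stand-in for the notification side of a breaker."""

    def __init__(self) -> None:
        self._listeners = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def notify(self, from_state: str, to_state: str) -> None:
        for listener in self._listeners:
            listener(from_state, to_state)
```

The subject knows only that a listener takes two state names. Adding or removing a listener never touches the breaker's transition logic.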
Interview Follow-ups
- "How do distributed circuit breakers work?" In a microservice cluster, each instance has its own circuit breaker. They can trip independently or share state via Redis/ZooKeeper. Shared state is better for slow-starting failures (each instance sees only a fraction of the errors), but adds a dependency on the shared store. Netflix Hystrix and Resilience4j both keep breaker state local to each instance, so sharing it means building that layer yourself.
- "What are good half-open probe strategies?" Instead of allowing exactly one request, you can allow a percentage (e.g., 10% of traffic) and measure the success rate. If the success rate exceeds a threshold, close the circuit. This is less binary and recovers faster for services under partial failure. Resilience4j's permittedNumberOfCallsInHalfOpenState setting controls how many trial calls are let through in the half-open state.
- "Circuit breaker vs. retry: when do you use which?" Retries handle transient failures (a single dropped packet). Circuit breakers handle sustained failures (service is down for minutes). Use both together: retry within the circuit breaker's closed state, but once the circuit opens, stop retrying. Retrying against a dead service is the worst possible behavior: it amplifies the load on a struggling service.
- "How does Netflix Hystrix / Resilience4j implement this?" Hystrix (now in maintenance mode) used a sliding window of request outcomes and a thread pool per dependency. Resilience4j replaced it with a lighter approach: ring buffer of outcomes, no thread pool isolation by default (uses the caller's thread with decorators), and composable with other patterns (retry, rate limiter, bulkhead) via functional decoration.
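The multi-probe idea from the half-open question can be sketched as follows. This is a count-based variant (admit a fixed number of trial calls, then decide from their success rate), not the percentage-of-traffic variant. `HalfOpenWindow` and its parameters are hypothetical names, the code is not how any particular library implements it, and it is not thread-safe as written.

```python
from typing import Optional


class HalfOpenWindow:
    """Admit a fixed number of trial calls, then decide from their success rate."""

    def __init__(self, permitted_calls: int = 10, min_success_rate: float = 0.8):
        self._permitted = permitted_calls
        self._min_success_rate = min_success_rate
        self._admitted = 0
        self._completed = 0
        self._successes = 0

    def try_admit(self) -> bool:
        """Return True while trial slots remain; later callers get the fallback."""
        if self._admitted >= self._permitted:
            return False
        self._admitted += 1
        return True

    def record(self, success: bool) -> Optional[str]:
        """Return "CLOSED" or "OPEN" once all trial calls have finished, else None."""
        self._completed += 1
        self._successes += int(success)
        if self._completed < self._permitted:
            return None
        rate = self._successes / self._completed
        return "CLOSED" if rate >= self._min_success_rate else "OPEN"
```

Compared with a single probe, one unlucky request no longer sends the breaker back to OPEN for a full cooldown, at the cost of exposing a few more requests to a service that may still be down.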
Code Implementation
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
import time


# --------------- Response Types ---------------

@dataclass
class Response:
    """Standard response from a service call."""
    value: str
    is_fallback: bool = False

    def __str__(self) -> str:
        tag = " [FALLBACK]" if self.is_fallback else ""
        return f"{self.value}{tag}"


@dataclass
class NullResponse(Response):
    """
    Null Object pattern: returned when the circuit is open.
    Provides a safe default instead of throwing an exception.
    Callers don't need try/except; they get graceful degradation.
    """
    value: str = "Service unavailable - using cached default"
    is_fallback: bool = True


# --------------- Configuration ---------------

@dataclass(frozen=True)
class CircuitBreakerConfig:
    failure_threshold: int = 3
    cooldown_seconds: float = 2.0
    # Not read by the states below: HalfOpenState closes after one successful probe.
    success_threshold: int = 1


# --------------- Observer ---------------

class CircuitBreakerListener(ABC):
    @abstractmethod
    def on_state_change(self, from_state: str, to_state: str) -> None: ...


class LoggingListener(CircuitBreakerListener):
    """Concrete observer that logs state transitions."""

    def on_state_change(self, from_state: str, to_state: str) -> None:
        icon = "!!" if to_state == "OPEN" else ">>" if to_state == "HALF_OPEN" else "OK"
        print(f" [{icon}] Circuit transition: {from_state} -> {to_state}")


# --------------- State Interface ---------------

class CircuitState(ABC):
    def __init__(self, breaker: "CircuitBreaker"):
        self._breaker = breaker

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def can_proceed(self) -> bool: ...

    @abstractmethod
    def record_success(self) -> None: ...

    @abstractmethod
    def record_failure(self) -> None: ...

    @abstractmethod
    def handle_request(self) -> Response | None:
        """Return NullResponse if circuit should block, None to let call proceed."""
        ...


# --------------- Concrete States ---------------

class ClosedState(CircuitState):
    """Circuit is healthy. All calls pass through. Failures are counted."""

    def __init__(self, breaker: "CircuitBreaker"):
        super().__init__(breaker)
        self._failure_count = 0

    @property
    def name(self) -> str:
        return "CLOSED"

    def can_proceed(self) -> bool:
        return True

    def handle_request(self) -> Response | None:
        return None  # Allow the call to proceed

    def record_success(self) -> None:
        self._failure_count = 0  # Only consecutive failures count

    def record_failure(self) -> None:
        self._failure_count += 1
        if self._failure_count >= self._breaker.config.failure_threshold:
            self._breaker.transition_to(OpenState(self._breaker))


class OpenState(CircuitState):
    """Circuit is tripped. All calls are rejected with a NullResponse fallback."""

    def __init__(self, breaker: "CircuitBreaker"):
        super().__init__(breaker)
        self._opened_at = time.monotonic()

    @property
    def name(self) -> str:
        return "OPEN"

    def can_proceed(self) -> bool:
        # Side effect: moves the breaker to HALF_OPEN once the cooldown has expired.
        elapsed = time.monotonic() - self._opened_at
        if elapsed >= self._breaker.config.cooldown_seconds:
            self._breaker.transition_to(HalfOpenState(self._breaker))
            return True
        return False

    def handle_request(self) -> Response | None:
        if not self.can_proceed():
            return NullResponse()
        # Cooldown expired - state already transitioned to HALF_OPEN
        return self._breaker.state.handle_request()

    def record_success(self) -> None:
        pass  # Should not happen in OPEN state

    def record_failure(self) -> None:
        pass  # Already open


class HalfOpenState(CircuitState):
    """Probe state. Allows one trial call to test if the service recovered."""

    def __init__(self, breaker: "CircuitBreaker"):
        super().__init__(breaker)
        self._probe_sent = False

    @property
    def name(self) -> str:
        return "HALF_OPEN"

    def can_proceed(self) -> bool:
        return not self._probe_sent

    def handle_request(self) -> Response | None:
        if self._probe_sent:
            return NullResponse(value="Probe in progress - using fallback")
        self._probe_sent = True
        return None  # Allow the probe call

    def record_success(self) -> None:
        self._breaker.transition_to(ClosedState(self._breaker))

    def record_failure(self) -> None:
        self._breaker.transition_to(OpenState(self._breaker))


# --------------- Circuit Breaker (Context) ---------------

class CircuitBreaker:
    """
    Context in the State pattern. Holds the current state and delegates
    all request handling to it. Notifies observers on transitions.
    """

    def __init__(self, config: CircuitBreakerConfig | None = None):
        self.config = config or CircuitBreakerConfig()
        self._listeners: list[CircuitBreakerListener] = []
        self._state: CircuitState = ClosedState(self)

    @property
    def state(self) -> CircuitState:
        return self._state

    @property
    def state_name(self) -> str:
        return self._state.name

    def add_listener(self, listener: CircuitBreakerListener) -> None:
        self._listeners.append(listener)

    def transition_to(self, new_state: CircuitState) -> None:
        old_name = self._state.name
        self._state = new_state
        for listener in self._listeners:
            listener.on_state_change(old_name, new_state.name)

    def call(self, func, *args, **kwargs) -> Response:
        """Execute func through the circuit breaker."""
        blocked = self._state.handle_request()
        if blocked is not None:
            return blocked
        try:
            result = func(*args, **kwargs)
            self._state.record_success()
            return Response(value=result)
        except Exception:
            self._state.record_failure()
            return NullResponse(value="Call failed - using fallback response")


# --------------- Remote Service ---------------

class RemoteService:
    """Simulates a remote service that can be configured to fail."""

    def __init__(self, fail_after: int = 0):
        self._call_count = 0
        self._fail_after = fail_after
        self._should_recover = True

    def call(self, request: str) -> str:
        self._call_count += 1
        if self._fail_after > 0 and self._call_count > self._fail_after:
            if self._should_recover and self._call_count > self._fail_after + 5:
                return f"Recovered! Handled: {request}"
            raise ConnectionError(f"Service unavailable (call #{self._call_count})")
        return f"OK: {request}"

    def reset(self) -> None:
        """Make the service healthy again."""
        self._call_count = 0
        # Disable failure injection; otherwise calls would start failing
        # again as soon as the count passed fail_after.
        self._fail_after = 0


# --------------- Service Proxy ---------------

class ServiceProxy:
    """
    Proxy pattern - wraps a remote service with a circuit breaker.
    Callers use proxy.call() with the same arguments as the real service
    and get the result back wrapped in a Response.
    They have no idea a circuit breaker is protecting them.
    """

    def __init__(self, service: RemoteService, breaker: CircuitBreaker):
        self._service = service
        self._breaker = breaker

    def call(self, request: str) -> Response:
        return self._breaker.call(self._service.call, request)


# --------------- Demo ---------------

if __name__ == "__main__":
    print("=" * 60)
    print(" CIRCUIT BREAKER DEMO")
    print("=" * 60)

    config = CircuitBreakerConfig(failure_threshold=3, cooldown_seconds=2.0)
    breaker = CircuitBreaker(config)
    breaker.add_listener(LoggingListener())

    # Service fails after 2 successful calls
    service = RemoteService(fail_after=2)
    proxy = ServiceProxy(service, breaker)

    # Phase 1: Successful calls (CLOSED state)
    print("\n--- Phase 1: Normal operation (CLOSED) ---")
    for i in range(1, 3):
        resp = proxy.call(f"request-{i}")
        print(f" Call {i}: {resp} [state={breaker.state_name}]")

    # Phase 2: Failures accumulate, circuit opens
    print("\n--- Phase 2: Failures trigger OPEN ---")
    for i in range(3, 8):
        resp = proxy.call(f"request-{i}")
        print(f" Call {i}: {resp} [state={breaker.state_name}]")

    # Phase 3: Circuit is OPEN - calls return NullResponse immediately
    print("\n--- Phase 3: OPEN state - fast-fail with NullResponse ---")
    for i in range(8, 11):
        resp = proxy.call(f"request-{i}")
        print(f" Call {i}: {resp} [state={breaker.state_name}]")

    # Phase 4: Wait for cooldown, then HALF_OPEN probe
    print(f"\n--- Phase 4: Waiting {config.cooldown_seconds}s cooldown... ---")
    time.sleep(config.cooldown_seconds + 0.1)

    # Service is still failing - probe will fail, back to OPEN
    print("\n--- Phase 5: HALF_OPEN probe (service still down) ---")
    resp = proxy.call("probe-1")
    print(f" Probe 1: {resp} [state={breaker.state_name}]")

    # Wait again for cooldown
    print(f"\n--- Phase 6: Waiting {config.cooldown_seconds}s cooldown... ---")
    time.sleep(config.cooldown_seconds + 0.1)

    # Fix the service and probe again
    print("\n--- Phase 7: HALF_OPEN probe (service recovered!) ---")
    service.reset()  # Service is healthy again
    resp = proxy.call("probe-2")
    print(f" Probe 2: {resp} [state={breaker.state_name}]")

    # Phase 8: Back to normal
    print("\n--- Phase 8: Back to CLOSED - normal operation ---")
    for i in range(1, 4):
        resp = proxy.call(f"healthy-{i}")
        print(f" Call {i}: {resp} [state={breaker.state_name}]")

    print("\n" + "=" * 60)
    print(" DEMO COMPLETE")
    print("=" * 60)

Common Mistakes
- ✗ No cooldown period: circuit stays open forever or flaps between open/closed rapidly
- ✗ Throwing exceptions when open: forces every caller to handle CircuitOpenException. Null Object is cleaner.
- ✗ Counting every error as a failure: timeouts should trip the circuit, 404s should not
- ✗ Single failure threshold for all services: a flaky service needs different thresholds than a critical one
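The point about relevant failures can be made concrete with a small classifier that the breaker consults before counting an error. This is a sketch: `HttpError` and `counts_as_failure` are hypothetical names, and which errors count is a policy choice that varies by service.

```python
class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def counts_as_failure(exc: Exception) -> bool:
    """Return True if the exception says something about the dependency's health."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    if isinstance(exc, HttpError):
        # 5xx points at server trouble; 4xx is the caller's problem.
        return exc.status >= 500
    return False
```

A breaker would call this inside its `except` block and record a failure only when it returns True. Everything else is passed through as an ordinary error response, because the dependency did answer.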
Key Points
- ✓ State machine: CLOSED counts failures. When the threshold is hit, it transitions to OPEN. OPEN rejects all calls for a cooldown period. After the cooldown, it transitions to HALF_OPEN. HALF_OPEN allows one probe call: success returns to CLOSED, failure returns to OPEN.
- ✓ Proxy: ServiceProxy.call() takes the same arguments as the real service. The caller is unaware of the circuit breaker wrapping their calls.
- ✓ Null Object: when OPEN, return a NullResponse instead of throwing. The caller gets graceful degradation (cached data, empty result, default message) with no exception handling needed.
- ✓ Observer: monitoring dashboards subscribe to state changes. Alert when the circuit opens, celebrate when it closes.