Circuit Breaker
State machine resilience pattern: CLOSED allows calls, OPEN rejects fast with Null Object fallbacks, HALF_OPEN probes cautiously. A Proxy wraps remote services transparently.
Key Abstractions
CircuitState: State interface with handle_request, can_proceed, record_success, record_failure
ClosedState: Concrete state that allows all calls, counts failures, and transitions to OpenState when the threshold is reached
OpenState: Concrete state that rejects all calls immediately and transitions to HalfOpenState after the cooldown period expires
HalfOpenState: Concrete state that allows one probe call and transitions to ClosedState on success or OpenState on failure
CircuitBreaker: Context that holds the current state and delegates request handling to it
ServiceProxy: Proxy that wraps the remote service with a circuit breaker, so the caller doesn't know the circuit breaker exists
NullResponse: Null Object that serves as a safe fallback when the circuit is open (cached data, default value, empty result), so no exception handling is needed
CircuitBreakerListener: Observer that is notified on state transitions for monitoring dashboards and alerting
How It Works
A circuit breaker in software works much like the one in your house's electrical panel. When current flows normally, the breaker stays closed (the circuit is complete). When a dangerous fault occurs, such as too much current, the breaker trips open, breaking the circuit to prevent damage. The software version adds a third position that a household breaker does not have: after a cooldown it goes half-open and lets a single test call through to check whether the fault is resolved. If the test passes, the breaker closes again and normal flow resumes.
In distributed systems, the "dangerous current" is a failing downstream service. Without a circuit breaker, every request to a dead service burns a thread waiting for a timeout. Under load, those blocked threads pile up, exhausting your thread pool and cascading the failure to your own callers. The circuit breaker cuts the connection early: after N consecutive failures, it stops sending requests entirely and returns a safe fallback response immediately. This protects your system from wasting resources on a service that isn't going to respond.
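The thread pile-up can be estimated with Little's law: the number of calls in flight is roughly the arrival rate multiplied by how long each call is held. The sketch below is a back-of-the-envelope calculation only; the request rate, timeout, and pool size are made-up numbers chosen for illustration, and `blocked_threads` is a hypothetical helper.

```python
def blocked_threads(request_rate_per_s: float, timeout_s: float) -> float:
    """Little's law estimate: in-flight calls = arrival rate * time each call is held."""
    return request_rate_per_s * timeout_s


# Assumed numbers, for illustration only.
REQUEST_RATE = 50   # requests per second sent to the dead service
TIMEOUT = 10        # seconds each request waits before timing out
POOL_SIZE = 200     # worker threads available

in_flight = blocked_threads(REQUEST_RATE, TIMEOUT)
pool_exhausted = in_flight > POOL_SIZE
seconds_until_exhausted = POOL_SIZE / REQUEST_RATE

print(f"steady-state blocked threads: {in_flight:.0f}")
print(f"pool exhausted: {pool_exhausted} (after about {seconds_until_exhausted:.0f}s)")
```

Under these assumptions the pool is gone within a few seconds of the outage starting, which is why failing fast matters more than the timeout value itself.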
The state machine has three states. CLOSED is normal operation: requests flow through and failures are counted. When the failure count hits the threshold, the breaker transitions to OPEN. In OPEN, every request is immediately rejected with a NullResponse fallback. No network call, no waiting, no exceptions. After a cooldown period, the breaker transitions to HALF_OPEN and allows a single probe request through. If the probe succeeds, the service has recovered and the breaker returns to CLOSED. If it fails, back to OPEN for another cooldown cycle.
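The transition rules in the paragraph above can be summarized as a small lookup table. This is only a compact restatement of the rules, not the implementation used later in this article (which gives each state its own class); the event names are assumptions chosen for readability.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "cooldown_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


# Walk one full trip: fail, cool down, probe fails, cool down, probe succeeds.
state = State.CLOSED
for event in ["threshold_reached", "cooldown_elapsed", "probe_failed",
              "cooldown_elapsed", "probe_succeeded"]:
    state = next_state(state, event)
```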
Requirements
Functional
- Three states: CLOSED, OPEN, HALF_OPEN with well-defined transition rules
- Configurable failure threshold (number of failures before opening)
- Configurable cooldown period (how long to stay open before probing)
- Return a NullResponse fallback when the circuit is open (Null Object pattern)
- Proxy wrapper so callers don't need to know about the circuit breaker
- Observer notifications on state transitions for monitoring
Non-Functional
- Fast rejection in OPEN state: O(1) with no network call
- Thread-safe state transitions under concurrent load
- No resource leaks during state transitions
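One way to approach the thread-safety requirement is to guard the failure counter and the transition with a single lock, so that concurrent failures trip the circuit exactly once. The sketch below is a minimal illustration of that idea; `LockedFailureCounter` and its `trip_count` attribute are hypothetical names, and a real breaker would also need to protect the OPEN and HALF_OPEN transitions in the same way.

```python
import threading


class LockedFailureCounter:
    """Hypothetical helper: counts failures and trips once, even under concurrency."""

    def __init__(self, threshold: int) -> None:
        self._threshold = threshold
        self._failures = 0
        self._open = False
        self._lock = threading.Lock()
        self.trip_count = 0  # how many times CLOSED -> OPEN fired

    def record_failure(self) -> None:
        # Check-and-transition happens under one lock, so two threads
        # cannot both observe "threshold reached" and both trip.
        with self._lock:
            if self._open:
                return
            self._failures += 1
            if self._failures >= self._threshold:
                self._open = True
                self.trip_count += 1

    @property
    def is_open(self) -> bool:
        with self._lock:
            return self._open


counter = LockedFailureCounter(threshold=3)
threads = [threading.Thread(target=counter.record_failure) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```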
Design Decisions
What's wrong with a state enum and switch statements?
A naive circuit breaker uses a state enum and switches on it everywhere: if state == OPEN: reject; elif state == HALF_OPEN: maybe_probe; else: allow. This scatters transition logic across multiple methods and makes adding new states (like a FORCED_OPEN for maintenance) painful. The State pattern encapsulates each state's behavior in its own class. Each state knows when and how to transition. Adding a new state means adding a new class. Nothing else changes. The CircuitBreaker context simply delegates to whatever state it currently holds.
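To make the "adding a new state means adding a new class" claim concrete, here is a stripped-down sketch, separate from the full implementation later in this article. `ForcedOpen` is a hypothetical maintenance state, and the one-method `State` interface is a simplification assumed for brevity.

```python
from abc import ABC, abstractmethod


class State(ABC):
    name: str

    @abstractmethod
    def allows_call(self) -> bool: ...


class Closed(State):
    name = "CLOSED"

    def allows_call(self) -> bool:
        return True


class ForcedOpen(State):
    """Hypothetical maintenance state: rejects everything until an operator swaps it out."""
    name = "FORCED_OPEN"

    def allows_call(self) -> bool:
        return False


class Breaker:
    """Context: contains no state-specific branching, it only delegates."""

    def __init__(self) -> None:
        self.state: State = Closed()

    def call(self, func, fallback):
        return func() if self.state.allows_call() else fallback


breaker = Breaker()
normal = breaker.call(lambda: "live data", fallback="cached data")

breaker.state = ForcedOpen()  # operator puts the dependency into maintenance
maintenance = breaker.call(lambda: "live data", fallback="cached data")
```

`Breaker.call` did not change when `ForcedOpen` was introduced, which is the point of the pattern.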
Couldn't we just throw CircuitOpenException?
When the circuit is open, you have two choices: throw a CircuitOpenException or return a safe default. Throwing forces every caller to wrap calls in try/catch. If they forget, the exception propagates up the call stack and fails the whole request. The Null Object pattern returns a NullResponse with a default value, a cached result, or an empty collection. The caller's code works without modification. It just gets degraded data. This is especially powerful in UI-facing services: show stale data from cache instead of an error page.
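The difference shows up most clearly at the call site. The sketch below contrasts the two styles side by side; `Result`, `fetch_raising`, `fetch_null_object`, and `render` are hypothetical names invented for this comparison.

```python
from dataclasses import dataclass


class CircuitOpenException(Exception):
    """Raised by the exception-style breaker when the circuit is open."""


@dataclass
class Result:
    items: list
    is_fallback: bool = False


def fetch_raising(circuit_open: bool) -> Result:
    if circuit_open:
        raise CircuitOpenException("circuit is open")
    return Result(items=["a", "b"])


def fetch_null_object(circuit_open: bool) -> Result:
    if circuit_open:
        return Result(items=[], is_fallback=True)  # Null Object: empty but usable
    return Result(items=["a", "b"])


def render(result: Result) -> str:
    suffix = " (stale)" if result.is_fallback else ""
    return f"{len(result.items)} item(s){suffix}"


# Exception style: every caller needs this try/except.
try:
    page_a = render(fetch_raising(circuit_open=True))
except CircuitOpenException:
    page_a = "error page"

# Null Object style: the same code path handles both healthy and degraded results.
page_b = render(fetch_null_object(circuit_open=True))
```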
Why hide the circuit breaker behind a proxy?
The ServiceProxy implements the same interface as the real service. Calling code doesn't import CircuitBreaker, doesn't check circuit states, doesn't handle special exceptions. You wire up the proxy at configuration time (or through dependency injection) and the rest of the codebase is blissfully unaware. This means you can add circuit breakers to existing services without modifying any caller code.
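As a rough illustration of "same interface, wired at configuration time", the sketch below defines a hypothetical `PriceService` interface and a proxy that satisfies it. The breaker logic is reduced to a consecutive-failure counter with no cooldown, purely to keep the example short; the names and the fallback price are assumptions.

```python
from typing import Protocol


class PriceService(Protocol):
    """Interface shared by the real service and the proxy (hypothetical)."""
    def get_price(self, sku: str) -> float: ...


class FlakyPriceService:
    """Stand-in for a remote service that is currently down."""

    def __init__(self) -> None:
        self.calls = 0

    def get_price(self, sku: str) -> float:
        self.calls += 1
        raise ConnectionError("price service unreachable")


class GuardedPriceService:
    """Proxy with the same interface; stops calling after `threshold` failures."""

    def __init__(self, inner: PriceService, threshold: int = 2,
                 fallback: float = 0.0) -> None:
        self._inner = inner
        self._threshold = threshold
        self._fallback = fallback
        self._failures = 0

    def get_price(self, sku: str) -> float:
        if self._failures >= self._threshold:
            return self._fallback  # circuit open: skip the remote call
        try:
            price = self._inner.get_price(sku)
        except ConnectionError:
            self._failures += 1
            return self._fallback
        self._failures = 0
        return price


def quote(service: PriceService, sku: str) -> str:
    """Caller code: knows only the PriceService interface."""
    return f"{sku}: {service.get_price(sku):.2f}"


inner = FlakyPriceService()
service = GuardedPriceService(inner)  # wiring happens here, once
quotes = [quote(service, "sku-1") for _ in range(5)]
```

`quote` is identical whether it receives the real service or the proxy, and after two failures the remote service is no longer called at all.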
How do we know when a circuit opens in production?
In production, you need to know when circuits open. An open circuit means a downstream dependency is failing, which usually means an incident. The Observer pattern decouples the circuit breaker from the monitoring infrastructure. You can attach a Prometheus metrics listener, a PagerDuty alerting listener, and a logging listener, all without the circuit breaker knowing about any of them.
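A minimal sketch of that decoupling is shown below, with two independent listeners attached to the same notifier. `CountingListener` and `AlertingListener` are hypothetical stand-ins for a metrics exporter and a paging integration; they record to in-memory structures rather than calling any real monitoring API.

```python
from collections import Counter


class CountingListener:
    """Stand-in for a metrics exporter: counts transitions by target state."""

    def __init__(self) -> None:
        self.transitions = Counter()

    def on_state_change(self, from_state: str, to_state: str) -> None:
        self.transitions[to_state] += 1


class AlertingListener:
    """Stand-in for an alerting hook: records a message whenever the circuit opens."""

    def __init__(self) -> None:
        self.messages = []

    def on_state_change(self, from_state: str, to_state: str) -> None:
        if to_state == "OPEN":
            self.messages.append(f"circuit opened (was {from_state})")


class TransitionNotifier:
    """The only part the breaker needs: a list of listeners to call."""

    def __init__(self) -> None:
        self._listeners = []

    def add_listener(self, listener) -> None:
        self._listeners.append(listener)

    def notify(self, from_state: str, to_state: str) -> None:
        for listener in self._listeners:
            listener.on_state_change(from_state, to_state)


notifier = TransitionNotifier()
metrics = CountingListener()
alerts = AlertingListener()
notifier.add_listener(metrics)
notifier.add_listener(alerts)

notifier.notify("CLOSED", "OPEN")
notifier.notify("OPEN", "HALF_OPEN")
notifier.notify("HALF_OPEN", "OPEN")
```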
Interview Follow-ups
- "How do distributed circuit breakers work?" In a microservice cluster, each instance has its own circuit breaker. They can trip independently or share state via Redis/ZooKeeper. Shared state is better for slow-starting failures (each instance sees only a fraction of the errors), but adds a dependency on the shared store. Netflix Hystrix used instance-local breakers, and Resilience4j likewise keeps breaker state in memory per instance, so sharing state across instances is something you build on top.
- "What are good half-open probe strategies?" Instead of allowing exactly one request, you can allow a small number of requests or a percentage of traffic (e.g., 10%) and measure the success rate. If the success rate exceeds a threshold, close the circuit. This is less binary and recovers faster for services under partial failure. Resilience4j's permittedNumberOfCallsInHalfOpenState setting controls how many calls are permitted while the circuit is half-open.
- "Circuit breaker vs. retry: when do you use which?" Retries handle transient failures (a single dropped packet). Circuit breakers handle sustained failures (service is down for minutes). Use both together: retry within the circuit breaker's closed state, but once the circuit opens, stop retrying. Retrying against a dead service is the worst possible behavior: it amplifies the load on a struggling service.
- "How does Netflix Hystrix / Resilience4j implement this?" Hystrix (now in maintenance mode) used a sliding window of request outcomes and a thread pool per dependency. Resilience4j replaced it with a lighter approach: ring buffer of outcomes, no thread pool isolation by default (uses the caller's thread with decorators), and composable with other patterns (retry, rate limiter, bulkhead) via functional decoration.
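The multi-probe half-open strategy from the follow-ups above could look roughly like this. It is a sketch under stated assumptions, not how any particular library implements it: `HalfOpenWindow`, `permitted_calls`, and `min_success_rate` are hypothetical names, and the window is not thread-safe.

```python
from __future__ import annotations


class HalfOpenWindow:
    """Hypothetical: admit a fixed number of probes, then decide from the success rate."""

    def __init__(self, permitted_calls: int = 5, min_success_rate: float = 0.6) -> None:
        self._permitted = permitted_calls
        self._min_rate = min_success_rate
        self._admitted = 0
        self._completed = 0
        self._successes = 0

    def try_acquire(self) -> bool:
        """Admit a probe if the quota is not used up; otherwise the caller gets a fallback."""
        if self._admitted >= self._permitted:
            return False
        self._admitted += 1
        return True

    def record(self, success: bool) -> str | None:
        """Record a probe outcome. Returns the next state once all probes finished."""
        self._completed += 1
        if success:
            self._successes += 1
        if self._completed < self._permitted:
            return None  # still collecting outcomes
        rate = self._successes / self._completed
        return "CLOSED" if rate >= self._min_rate else "OPEN"


window = HalfOpenWindow(permitted_calls=5, min_success_rate=0.6)
decisions = []
for outcome in [True, True, False, True, True]:
    if window.try_acquire():
        decisions.append(window.record(outcome))

extra_probe_admitted = window.try_acquire()  # quota is used up by now
```

With four of five probes succeeding, the success rate is 0.8 and the window decides to close the circuit; a single failed probe no longer sends the breaker straight back to OPEN.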
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass
import time


# --------------- Response Types ---------------

@dataclass
class Response:
    """Standard response from a service call."""
    value: str
    is_fallback: bool = False

    def __str__(self) -> str:
        tag = " [FALLBACK]" if self.is_fallback else ""
        return f"{self.value}{tag}"


@dataclass
class NullResponse(Response):
    """
    Null Object pattern: returned when the circuit is open.
    Provides a safe default instead of throwing an exception.
    Callers don't need try/except; they get graceful degradation.
    """
    value: str = "Service unavailable - using cached default"
    is_fallback: bool = True


# --------------- Configuration ---------------

@dataclass(frozen=True)
class CircuitBreakerConfig:
    failure_threshold: int = 3
    cooldown_seconds: float = 2.0
    # Not used by the states below: HalfOpenState closes after one successful probe.
    success_threshold: int = 1


# --------------- Observer ---------------

class CircuitBreakerListener(ABC):
    @abstractmethod
    def on_state_change(self, from_state: str, to_state: str) -> None: ...


class LoggingListener(CircuitBreakerListener):
    """Concrete observer that logs state transitions."""
    def on_state_change(self, from_state: str, to_state: str) -> None:
        icon = "!!" if to_state == "OPEN" else ">>" if to_state == "HALF_OPEN" else "OK"
        print(f"  [{icon}] Circuit transition: {from_state} -> {to_state}")


# --------------- State Interface ---------------

class CircuitState(ABC):
    def __init__(self, breaker: "CircuitBreaker"):
        self._breaker = breaker

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def can_proceed(self) -> bool: ...

    @abstractmethod
    def record_success(self) -> None: ...

    @abstractmethod
    def record_failure(self) -> None: ...

    @abstractmethod
    def handle_request(self) -> Response | None:
        """Return NullResponse if circuit should block, None to let call proceed."""
        ...


# --------------- Concrete States ---------------

class ClosedState(CircuitState):
    """Circuit is healthy. All calls pass through. Failures are counted."""

    def __init__(self, breaker: "CircuitBreaker"):
        super().__init__(breaker)
        self._failure_count = 0

    @property
    def name(self) -> str:
        return "CLOSED"

    def can_proceed(self) -> bool:
        return True

    def handle_request(self) -> Response | None:
        return None  # Allow the call to proceed

    def record_success(self) -> None:
        self._failure_count = 0  # Only consecutive failures count

    def record_failure(self) -> None:
        self._failure_count += 1
        if self._failure_count >= self._breaker.config.failure_threshold:
            self._breaker.transition_to(OpenState(self._breaker))


class OpenState(CircuitState):
    """Circuit is tripped. All calls are rejected with a NullResponse fallback."""

    def __init__(self, breaker: "CircuitBreaker"):
        super().__init__(breaker)
        self._opened_at = time.monotonic()

    @property
    def name(self) -> str:
        return "OPEN"

    def can_proceed(self) -> bool:
        # Side effect: moves the breaker to HALF_OPEN once the cooldown has elapsed.
        elapsed = time.monotonic() - self._opened_at
        if elapsed >= self._breaker.config.cooldown_seconds:
            self._breaker.transition_to(HalfOpenState(self._breaker))
            return True
        return False

    def handle_request(self) -> Response | None:
        if not self.can_proceed():
            return NullResponse()
        # Cooldown expired - state already transitioned to HALF_OPEN
        return self._breaker.state.handle_request()

    def record_success(self) -> None:
        pass  # Should not happen in OPEN state

    def record_failure(self) -> None:
        pass  # Already open


class HalfOpenState(CircuitState):
    """Probe state. Allows one trial call to test if the service recovered."""

    def __init__(self, breaker: "CircuitBreaker"):
        super().__init__(breaker)
        self._probe_sent = False

    @property
    def name(self) -> str:
        return "HALF_OPEN"

    def can_proceed(self) -> bool:
        return not self._probe_sent

    def handle_request(self) -> Response | None:
        if self._probe_sent:
            return NullResponse(value="Probe in progress - using fallback")
        self._probe_sent = True
        return None  # Allow the probe call

    def record_success(self) -> None:
        self._breaker.transition_to(ClosedState(self._breaker))

    def record_failure(self) -> None:
        self._breaker.transition_to(OpenState(self._breaker))


# --------------- Circuit Breaker (Context) ---------------

class CircuitBreaker:
    """
    Context in the State pattern. Holds the current state and delegates
    all request handling to it. Notifies observers on transitions.
    Note: this demo version does not lock around state transitions.
    """

    def __init__(self, config: CircuitBreakerConfig | None = None):
        self.config = config or CircuitBreakerConfig()
        self._listeners: list[CircuitBreakerListener] = []
        self._state: CircuitState = ClosedState(self)

    @property
    def state(self) -> CircuitState:
        return self._state

    @property
    def state_name(self) -> str:
        return self._state.name

    def add_listener(self, listener: CircuitBreakerListener) -> None:
        self._listeners.append(listener)

    def transition_to(self, new_state: CircuitState) -> None:
        old_name = self._state.name
        self._state = new_state
        for listener in self._listeners:
            listener.on_state_change(old_name, new_state.name)

    def call(self, func, *args, **kwargs) -> Response:
        """Execute func through the circuit breaker."""
        blocked = self._state.handle_request()
        if blocked is not None:
            return blocked

        try:
            result = func(*args, **kwargs)
            self._state.record_success()
            return Response(value=result)
        except Exception:
            self._state.record_failure()
            return NullResponse(value="Call failed - using fallback response")


# --------------- Remote Service ---------------

class RemoteService:
    """Simulates a remote service that can be configured to fail."""

    def __init__(self, fail_after: int = 0):
        self._call_count = 0
        self._fail_after = fail_after
        self._should_recover = True

    def call(self, request: str) -> str:
        self._call_count += 1
        if self._fail_after > 0 and self._call_count > self._fail_after:
            if self._should_recover and self._call_count > self._fail_after + 5:
                return f"Recovered! Handled: {request}"
            raise ConnectionError(f"Service unavailable (call #{self._call_count})")
        return f"OK: {request}"

    def reset(self) -> None:
        """Mark the service healthy: clear the counter and disable failure injection."""
        self._call_count = 0
        # Without this, the service would start failing again after `fail_after` calls.
        self._fail_after = 0


# --------------- Service Proxy ---------------

class ServiceProxy:
    """
    Proxy pattern - wraps a remote service with a circuit breaker.
    Callers use proxy.call() with the same signature as the real service.
    They have no idea a circuit breaker is protecting them.
    """

    def __init__(self, service: RemoteService, breaker: CircuitBreaker):
        self._service = service
        self._breaker = breaker

    def call(self, request: str) -> Response:
        return self._breaker.call(self._service.call, request)


# --------------- Demo ---------------

if __name__ == "__main__":
    print("=" * 60)
    print("  CIRCUIT BREAKER DEMO")
    print("=" * 60)

    config = CircuitBreakerConfig(failure_threshold=3, cooldown_seconds=2.0)
    breaker = CircuitBreaker(config)
    breaker.add_listener(LoggingListener())

    # Service fails after 2 successful calls
    service = RemoteService(fail_after=2)
    proxy = ServiceProxy(service, breaker)

    # Phase 1: Successful calls (CLOSED state)
    print("\n--- Phase 1: Normal operation (CLOSED) ---")
    for i in range(1, 3):
        resp = proxy.call(f"request-{i}")
        print(f"  Call {i}: {resp}  [state={breaker.state_name}]")

    # Phase 2: Failures accumulate, circuit opens after the third failure
    print("\n--- Phase 2: Failures trigger OPEN ---")
    for i in range(3, 8):
        resp = proxy.call(f"request-{i}")
        print(f"  Call {i}: {resp}  [state={breaker.state_name}]")

    # Phase 3: Circuit is OPEN - calls return NullResponse immediately
    print("\n--- Phase 3: OPEN state - fast-fail with NullResponse ---")
    for i in range(8, 11):
        resp = proxy.call(f"request-{i}")
        print(f"  Call {i}: {resp}  [state={breaker.state_name}]")

    # Phase 4: Wait for cooldown, then HALF_OPEN probe
    print(f"\n--- Phase 4: Waiting {config.cooldown_seconds}s cooldown... ---")
    time.sleep(config.cooldown_seconds + 0.1)

    # Phase 5: Service is still failing - probe will fail, back to OPEN
    print("\n--- Phase 5: HALF_OPEN probe (service still down) ---")
    resp = proxy.call("probe-1")
    print(f"  Probe 1: {resp}  [state={breaker.state_name}]")

    # Phase 6: Wait again for cooldown
    print(f"\n--- Phase 6: Waiting {config.cooldown_seconds}s cooldown... ---")
    time.sleep(config.cooldown_seconds + 0.1)

    # Phase 7: Fix the service and probe again
    print("\n--- Phase 7: HALF_OPEN probe (service recovered!) ---")
    service.reset()  # Service is healthy again
    resp = proxy.call("probe-2")
    print(f"  Probe 2: {resp}  [state={breaker.state_name}]")

    # Phase 8: Back to normal
    print("\n--- Phase 8: Back to CLOSED - normal operation ---")
    for i in range(1, 4):
        resp = proxy.call(f"healthy-{i}")
        print(f"  Call {i}: {resp}  [state={breaker.state_name}]")

    print("\n" + "=" * 60)
    print("  DEMO COMPLETE")
    print("=" * 60)
Common Mistakes
- ✗ No cooldown period: circuit stays open forever or flaps between open/closed rapidly
- ✗ Throwing exceptions when open: forces every caller to handle CircuitOpenException. Null Object is cleaner.
- ✗ Counting every error as a failure: timeouts and connection errors should trip the circuit, 404s and other client errors should not
- ✗ Single failure threshold for all services: a flaky service needs different thresholds than a critical one
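The "only relevant failures" point can be handled with a small predicate that the breaker consults before counting an exception. The sketch below is one possible classification, assuming a hypothetical `HttpError` exception that carries a status code; which errors count is ultimately a per-service decision.

```python
class HttpError(Exception):
    """Hypothetical exception type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def counts_as_failure(exc: Exception) -> bool:
    """Decide whether an exception should count toward the failure threshold."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True  # the dependency is slow or unreachable
    if isinstance(exc, HttpError):
        return exc.status >= 500  # server-side errors only; 4xx is the caller's problem
    return False  # anything else is treated as unrelated to dependency health


samples = [TimeoutError(), ConnectionError(), HttpError(404), HttpError(503), ValueError()]
classified = [counts_as_failure(exc) for exc in samples]
```

A breaker would call `record_failure` only when the predicate returns True, and re-raise or pass through everything else.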
Key Points
- ✓ State machine: CLOSED counts failures. When threshold hit, transitions to OPEN. OPEN rejects all calls for a cooldown period. After cooldown, transitions to HALF_OPEN. HALF_OPEN allows one probe call: success returns to CLOSED, failure returns to OPEN.
- ✓ Proxy: ServiceProxy.call() has the same signature as the real service. The caller is unaware of the circuit breaker wrapping their calls.
- ✓ Null Object: when OPEN, return a NullResponse instead of throwing. The caller gets graceful degradation (cached data, empty result, default message) with no exception handling needed.
- ✓ Observer: monitoring dashboards subscribe to state changes. Alert when circuit opens, celebrate when it closes.