Circuit Breaker & Resilience Patterns
Why It Exists
Every network call in a microservices system can fail. That is not a possibility to plan for someday. It is the default state of distributed computing.
Without resilience patterns, a single slow dependency will take down the entire call chain. Threads pile up waiting for timeouts. Memory fills with queued requests. Services that are perfectly healthy grind to a halt because they are blocked on one that is not. The result is a full outage caused by a single component nobody thought was critical.
Netflix learned this the hard way. Its Christmas Eve 2012 outage, in which a failure in AWS's load balancing service cascaded through the entire streaming platform, became the canonical cautionary tale, and Netflix's Hystrix library brought the circuit breaker pattern into mainstream distributed systems engineering. The lesson is simple: without planning for partial failure, the result is total failure.
How It Works
Circuit Breaker State Machine
The circuit breaker wraps every call to an external dependency and tracks success/failure rates. Think of it like a fuse in an electrical panel. When too much current flows, the fuse blows to protect the rest of the house.
Closed (Normal). All requests flow through. The breaker monitors a sliding window of recent calls, either count-based or time-based. When the failure rate crosses the configured threshold (say 50% of the last 100 calls), it trips to Open.
Open (Fast-Fail). All requests fail immediately without touching the dependency. This does two things: it stops wasting resources on a service that is clearly down, and it gives that service breathing room to recover. After a configured wait (typically 30 seconds), it moves to Half-Open.
Half-Open (Probing). A small number of probe requests (say 5) go through to the dependency. If enough succeed, the breaker goes back to Closed. If the probes fail, it snaps back to Open for another wait period. This is the "stick a head out and check if the coast is clear" phase.
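The three states are compact enough to sketch directly. The plain-Java sketch below is illustrative only: it uses a tumbling count-based window instead of a true sliding window, decides Half-Open on all-probes-succeed, ignores thread safety, and every name and threshold is an assumption rather than any library's API.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit breaker sketch. Simplifications: a tumbling window stands in
// for the sliding window, and there is no synchronization.
class SketchCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int calls = 0, failures = 0, probes = 0, probeFailures = 0;
    private Instant openedAt;

    private static final int WINDOW = 100;            // slidingWindowSize
    private static final double THRESHOLD = 0.5;      // failureRateThreshold
    private static final Duration WAIT = Duration.ofSeconds(30);
    private static final int PROBE_LIMIT = 5;         // permittedCallsInHalfOpen

    boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(WAIT))) {
            state = State.HALF_OPEN;                  // wait elapsed: start probing
            probes = probeFailures = 0;
        }
        return state == State.CLOSED
                || (state == State.HALF_OPEN && probes < PROBE_LIMIT);
    }

    void record(boolean success) {
        if (state == State.HALF_OPEN) {
            probes++;
            if (!success) probeFailures++;
            if (probes == PROBE_LIMIT) {              // all probes done: decide
                state = (probeFailures == 0) ? State.CLOSED : State.OPEN;
                if (state == State.OPEN) openedAt = Instant.now();
                calls = failures = 0;
            }
            return;
        }
        calls++;
        if (!success) failures++;
        if (calls >= WINDOW) {                        // evaluate the window
            if ((double) failures / calls >= THRESHOLD) {
                state = State.OPEN;
                openedAt = Instant.now();
            }
            calls = failures = 0;                     // start a fresh window
        }
    }
}
```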
Configuration Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| failureRateThreshold | 50% | Percentage of failures that trips the breaker to Open |
| slidingWindowSize | 100 calls or 60s | Measurement window for the failure rate |
| minimumNumberOfCalls | 20 | Minimum calls before the failure rate is evaluated |
| waitDurationInOpenState | 30s | Time to wait before probing the dependency |
| permittedCallsInHalfOpen | 5 | Number of probe requests allowed in Half-Open |
| slowCallRateThreshold | 80% | Percentage of slow calls that trips the breaker to Open |
| slowCallDurationThreshold | 2s | Call duration that counts as "slow" |
A word of advice: start with generous thresholds and tighten them after observing real traffic patterns. Start too tight and the breaker trips on normal variance, leaving the team debugging phantom outages.
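As a concrete mapping, here is how these parameters look in Resilience4j's builder API. The breaker name and the wrapped call are hypothetical; note that Resilience4j spells the Half-Open limit permittedNumberOfCallsInHalfOpenState.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class PaymentClientConfig {
    // Mirrors the parameter table above using Resilience4j's configuration API.
    static final CircuitBreakerConfig CONFIG = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                            // 50% failures trip to Open
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(100)                              // last 100 calls
            .minimumNumberOfCalls(20)                            // no decision before 20 calls
            .waitDurationInOpenState(Duration.ofSeconds(30))     // Open -> Half-Open after 30s
            .permittedNumberOfCallsInHalfOpenState(5)            // 5 probe requests
            .slowCallRateThreshold(80)                           // 80% slow calls also trip
            .slowCallDurationThreshold(Duration.ofSeconds(2))    // "slow" means over 2s
            .build();

    public static void main(String[] args) {
        CircuitBreaker breaker = CircuitBreaker.of("paymentService", CONFIG);
        // Wrap the remote call; callPaymentService is a hypothetical stand-in.
        Supplier<String> guarded = CircuitBreaker.decorateSupplier(
                breaker, PaymentClientConfig::callPaymentService);
        System.out.println(guarded.get());
    }

    static String callPaymentService() { return "ok"; }
}
```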
Retry with Exponential Backoff
Retries complement circuit breakers for transient problems like network blips and brief overloads. The formula is straightforward: delay = min(base * 2^attempt + random(0, jitter), max_delay).
- Base delay: 100-500ms
- Max retries: 3-5. If it has not worked by then, the problem is not transient
- Jitter: Random 0-100% of calculated delay. This is not optional. Without jitter, all clients retry at exactly the same moment, creating a thundering herd
- Idempotency requirement: Only retry operations that are safe to repeat. GETs, idempotent PUTs with version checks, anything with an idempotency key. Retrying a POST that creates a resource without built-in deduplication is dangerous. Stop and fix that first
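A plain-Java sketch of that backoff schedule, with the base, cap, and retry count taken from the guidance above (names are illustrative):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of delay = min(base * 2^attempt + random(0, jitter), max_delay),
// with jitter drawn from 0-100% of the calculated exponential delay.
public class RetryWithBackoff {
    static final long BASE_MS = 200, MAX_DELAY_MS = 10_000;
    static final int MAX_RETRIES = 4;

    static <T> T callWithRetry(Callable<T> call) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();                  // only safe for idempotent operations
            } catch (Exception e) {
                if (attempt >= MAX_RETRIES) throw e; // not transient: give up
                long exp = BASE_MS * (1L << attempt);                        // base * 2^attempt
                long jitter = ThreadLocalRandom.current().nextLong(exp + 1); // 0-100% of delay
                Thread.sleep(Math.min(exp + jitter, MAX_DELAY_MS));
            }
        }
    }
}
```

The jitter term is what desynchronizes clients: without it, every caller that failed at the same instant would retry at the same instant.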
Bulkhead Pattern
The name comes from ship construction. Bulkheads are watertight compartments that keep a hull breach from sinking the whole vessel. Same idea here: isolate resource pools per dependency so one slow service cannot consume all the threads and connections.
- Thread pool bulkhead. Each dependency gets its own thread pool. Say 20 threads for the payment service and 10 for notifications. If payments slow down and exhaust their 20 threads, the notification service's threads are completely unaffected.
- Semaphore bulkhead. Lighter weight. Limits concurrent calls via a semaphore count without dedicating thread pools. Works better for reactive and async architectures where thread-per-request does not make sense.
In practice, I reach for semaphore bulkheads in most cases and only use thread pool bulkheads when I need hard isolation for a dependency with a history of causing problems.
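A semaphore bulkhead needs nothing beyond java.util.concurrent. A minimal sketch, assuming a short 50ms acquisition wait before shedding load (the limit and names are illustrative):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Semaphore bulkhead sketch: caps concurrent calls to one dependency.
public class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Supplier<T> dependencyCall) throws InterruptedException {
        // Wait briefly for a permit; reject fast instead of queueing forever.
        if (!permits.tryAcquire(50, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("bulkhead full: load shed, not an error");
        }
        try {
            return dependencyCall.get();
        } finally {
            permits.release();              // always free the permit
        }
    }
}
```

Rejecting after a short wait rather than blocking indefinitely is the point: a slow dependency hits its cap and fails fast instead of consuming the caller's threads.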
Timeout Propagation
This one is easy to overlook and painful when done wrong. In a chain A -> B -> C, if A has a 5-second deadline, B needs to subtract its own processing time and give C a shorter deadline. Otherwise C might spend the full 5 seconds on work that A already abandoned.
Implementation. Use gRPC's built-in deadline propagation, or pass an explicit X-Request-Deadline header. Each service calculates remaining = deadline - elapsed before calling downstream. If remaining is zero or negative, fail fast. Do not make the call. The work would be wasted anyway.
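A hedged sketch of the header-based variant. Treating X-Request-Deadline as an absolute epoch-milliseconds value is an assumption, not a standard; gRPC users get the equivalent behavior from its built-in deadline support.

```java
import java.time.Duration;
import java.time.Instant;

// Deadline propagation sketch: the caller's absolute deadline arrives in a
// header, and each hop computes remaining = deadline - now before calling on.
public class DeadlinePropagation {
    static Duration remainingBudget(long deadlineEpochMillis) {
        return Duration.between(Instant.now(), Instant.ofEpochMilli(deadlineEpochMillis));
    }

    static String handle(long deadlineEpochMillis) {
        Duration remaining = remainingBudget(deadlineEpochMillis);
        if (remaining.isZero() || remaining.isNegative()) {
            // The caller has already given up: fail fast, do not call downstream.
            throw new IllegalStateException("deadline exceeded before downstream call");
        }
        // Forward the same absolute deadline so downstream sees the shrunken budget.
        return callDownstream(deadlineEpochMillis, remaining);
    }

    static String callDownstream(long deadline, Duration clientTimeout) {
        // Hypothetical downstream call: clientTimeout bounds the HTTP/RPC client.
        return "ok";
    }
}
```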
Production Patterns
The Notification System Pattern
This is a pattern I have seen work well in production. The notification system uses circuit breakers at multiple layers.
- Push provider circuit breakers. FCM and APNs each have independent circuit breakers. When APNs starts returning 503s, its circuit opens while FCM delivery keeps running. Android notifications keep flowing even when Apple is having a bad day.
- Per-tenant circuit breakers. One tenant sending malformed payloads should not take down notifications for everyone else. Per-tenant breakers contain the blast radius to the misbehaving account.
- Retry with dead-letter queue. After 3 failed retries, notifications land in a Kafka dead-letter topic. A separate consumer retries with progressively longer delays (5 minutes, 30 minutes, 2 hours) for transient provider outages. This handles the "provider was down for an hour" case without losing messages.
- Fallback chain. Push notification fails (circuit open)? Fall back to SMS. SMS budget exhausted? Fall back to email. All channels down? Queue for batch retry. The user gets notified through some channel rather than getting nothing. A minimal sketch of this cascade follows the list.
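A minimal sketch of the fallback cascade, assuming each channel is a function that returns true on delivery (the Notification type and method names are hypothetical):

```java
import java.util.List;
import java.util.function.Function;

// Fallback chain sketch: try each channel in order until one succeeds.
public class FallbackChain {
    record Notification(String userId, String message) {}

    static boolean deliver(Notification n, List<Function<Notification, Boolean>> channels) {
        for (Function<Notification, Boolean> channel : channels) {
            try {
                if (channel.apply(n)) return true;    // delivered: stop here
            } catch (RuntimeException e) {
                // Circuit open, budget exhausted, provider error: fall through
                // to the next channel instead of surfacing the failure.
            }
        }
        queueForBatchRetry(n);                        // all channels down
        return false;
    }

    static void queueForBatchRetry(Notification n) { /* enqueue to a retry topic */ }
}
```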
Envoy-Level Circuit Breaking
In a service mesh, Envoy provides infrastructure-level circuit breaking that works across any language.
- max_connections: Maximum connections to an upstream cluster (acts as a bulkhead)
- max_pending_requests: Maximum queued requests waiting for a connection
- max_requests: Maximum concurrent requests (semaphore bulkhead)
- max_retries: Maximum concurrent retries across the cluster
Envoy's outlier detection is also worth knowing. It is per-host circuit breaking: if a single pod returns 5 consecutive 5xx errors, Envoy pulls it from the load balancing pool for 30 seconds. This handles the common case where one bad pod drags down an otherwise healthy service. This comes for free in most mesh setups and catches problems before application-level circuit breakers even notice.
Failure Scenarios
Scenario 1: Circuit Breaker Flapping. A dependency is running near capacity. It serves 95% of requests fine but fails 5% from resource contention. During load spikes, the failure rate bounces around the threshold, and the circuit breaker starts oscillating: Open, Half-Open, Closed, Open again. Each transition causes a burst of client errors followed by a burst of probe traffic, which actually adds to the dependency's problems. The result is a sawtooth pattern in the metrics.

Detection: monitor circuit_breaker_state_transitions_per_minute and alert when transitions exceed 10/minute.

Fix: increase slidingWindowSize to smooth out transient spikes. Raise minimumNumberOfCalls so the breaker needs more data before making a decision. Add slowCallRateThreshold as a secondary trigger to catch degradation before hard failures hit. Consider adaptive thresholds that tighten during low traffic and relax during peak hours.
Scenario 2: Retry Storm Amplifies a Partial Outage. The database hits a 2-second GC pause, and 30% of queries time out. Three services, each configured with 3 retries and no jitter, detect the timeouts at the same time. Each failed request spawns 3 retries, so 4x the original load hits the database exactly when it can least handle it. A GC pause that would have resolved itself in 2 seconds becomes a 30-second outage because the retry storm overwhelms the connection pool. I have personally seen this turn a minor hiccup into a P1 incident.

Detection: monitor total request volume to the dependency. A 3-4x spike during an error event signals a retry storm. Track retry_count distribution per service.

Fix: add jitter to all retry delays to desynchronize them. Implement retry budgets that cap retries at 10% of total traffic (1,000 requests/sec means only 100 retries/sec); a minimal budget sketch follows. Use circuit breakers to stop retrying when the dependency is clearly down. Set shorter timeouts to fail fast and reduce the window of concurrent waiting requests.
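A retry budget is a small counter-based guard. A minimal sketch, assuming a fixed 10% ratio and ignoring the sliding-window decay a production version would need (names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Retry budget sketch: permit retries only while they stay under a fixed
// fraction of total traffic (10% here, matching the fix above).
public class RetryBudget {
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();
    private static final double MAX_RETRY_RATIO = 0.10;

    void onRequest() { requests.incrementAndGet(); }

    boolean tryAcquireRetry() {
        long total = Math.max(1, requests.get());
        if ((double) retries.get() / total >= MAX_RETRY_RATIO) {
            return false;        // budget spent: drop the retry, surface the error
        }
        retries.incrementAndGet();
        return true;
    }
    // A production version would decay both counters over a sliding window.
}
```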
Scenario 3: Missing Fallback Turns Partial Degradation into Total Outage. The recommendation service circuit breaker opens because the ML serving cluster is overloaded. The product page calls recommendations synchronously in the critical rendering path. The circuit opens and fails fast (good), but the error propagates up as a 500 to the user (bad). The product page is completely broken even though catalog, pricing, and inventory are all healthy. 95% of the page works, but users see 0% of it. This is the most common circuit breaker mistake I see: teams add the breaker but forget the fallback.

Detection: check whether circuit_breaker_open events correlate 1:1 with user-facing 5xx errors. They should not.

Fix: build a fallback hierarchy. Return cached recommendations (stale but functional), then popular items (generic but useful), then an empty section (the page still renders). Never let a non-critical dependency failure crash the critical path. Design APIs to return partial responses with degradation flags rather than hard failures.
Capacity Planning
Resilience patterns are not free. They add overhead that needs to be budgeted for.
| Pattern | Memory Overhead | Latency Overhead | CPU Overhead |
|---|---|---|---|
| Circuit breaker | ~1 KB per breaker (sliding window) | <0.1ms (state check) | Negligible |
| Retry (3 attempts) | Request buffer per retry | 0-6x original call latency | 1-4x original CPU |
| Bulkhead (thread pool) | ~1 MB per pool (20 threads) | Queue wait time when pool is full | Thread context switching |
| Timeout | Timer per request | None (reduces latency via early termination) | Negligible |
Key thresholds. Circuit breaker sliding windows above 1,000 calls increase memory linearly. Retry policies with a max of 5 attempts and a 2-second base delay can add up to 62 seconds of latency (2 + 4 + 8 + 16 + 32 = 62s, before jitter), so verify the caller's timeout budget actually accommodates that (it probably does not, and it is worth checking). Bulkhead thread pools should be sized at steady_state_concurrency * 1.5: too small and the system gets unnecessary rejections during normal traffic spikes; too large and it defeats the whole point of isolation.

Monitor bulkhead_rejected_calls and circuit_breaker_not_permitted_calls. These represent intentional load shedding, not errors. Do not alert on them the same way as 500s.
Key Points
- Circuit breakers stop cascading failures by fast-failing requests to dependencies that are already down
- Three states (Closed, Open, Half-Open) with transitions driven by failure rate thresholds
- Retry with exponential backoff plus jitter prevents hammering a recovering service with synchronized retries
- Bulkheads isolate failures by capping concurrent requests per dependency so one bad actor can't eat all the threads
- Timeout propagation passes deadlines across service boundaries. Without it, downstream services waste work on requests the caller already gave up on
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Resilience4j | Open Source | Java/Kotlin, lightweight, functional API, Spring Boot integration | Small-Enterprise |
| Hystrix (Netflix) | Open Source | Pioneered the pattern, now deprecated. Migrate to Resilience4j | Legacy |
| Envoy Circuit Breaking | Open Source | Infrastructure-level, language-agnostic, mesh-native | Medium-Enterprise |
| Polly (.NET) | Open Source | .NET ecosystem, fluent API, wide range of policy types | Small-Enterprise |
Common Mistakes
- Thresholds too tight. A 1% failure-rate threshold over a 10-request window means a single failed call trips the circuit on normal noise
- No fallbacks. An open circuit that throws a raw error at users is worse than returning stale data or a degraded response
- Retrying non-idempotent operations. Retrying a payment charge without idempotency keys means double charges. Expect a 3am page
- One timeout for every dependency. A database query and an external API have completely different baseline latencies. Treat them differently
- Not propagating deadlines. Service A sets 5s, but service B calls C with a fresh 5s of its own. C can burn its full budget on work A abandoned long ago