Circuit Breaker & Resilience Patterns
Why It Exists
Every network call in a microservices system can fail. That is not a possibility to plan for someday. It is the default state of distributed computing.
Without resilience patterns, a single slow dependency will take down the entire call chain. Threads pile up waiting for timeouts. Memory fills with queued requests. Services that are perfectly healthy grind to a halt because they are blocked on one that is not. The result is a full outage caused by a single component nobody thought was critical.
Netflix learned this the hard way. Its Christmas Eve 2012 outage, in which a failure in AWS's load balancing service cascaded through the entire streaming platform, became the canonical cautionary tale, and Netflix's Hystrix library brought the circuit breaker pattern into mainstream distributed systems engineering. The lesson is simple: without planning for partial failure, the result is total failure.
How It Works
Circuit Breaker State Machine
The circuit breaker wraps every call to an external dependency and tracks success/failure rates. Think of it like a fuse in an electrical panel. When too much current flows, the fuse blows to protect the rest of the house.
Closed (Normal). All requests flow through. The breaker monitors a sliding window of recent calls, either count-based or time-based. When the failure rate crosses the configured threshold (say 50% of the last 100 calls), it trips to Open.
Open (Fast-Fail). All requests fail immediately without touching the dependency. This does two things: it stops wasting resources on a service that is clearly down, and it gives that service breathing room to recover. After a configured wait (typically 30 seconds), it moves to Half-Open.
Half-Open (Probing). A small number of probe requests (say 5) go through to the dependency. If enough succeed, the breaker goes back to Closed. If the probes fail, it snaps back to Open for another wait period. This is the "stick a head out and check if the coast is clear" phase.
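The three states are compact enough to sketch directly. The plain-Java sketch below is illustrative only: it uses a tumbling count-based window instead of a true sliding window, decides Half-Open on all-probes-succeed, ignores thread safety, and every name and threshold is an assumption rather than any library's API.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit breaker sketch. Simplifications: a tumbling window stands in
// for the sliding window, and there is no synchronization.
class SketchCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int calls = 0, failures = 0, probes = 0, probeFailures = 0;
    private Instant openedAt;

    private static final int WINDOW = 100;            // slidingWindowSize
    private static final double THRESHOLD = 0.5;      // failureRateThreshold
    private static final Duration WAIT = Duration.ofSeconds(30);
    private static final int PROBE_LIMIT = 5;         // permittedCallsInHalfOpen

    boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(WAIT))) {
            state = State.HALF_OPEN;                  // wait elapsed: start probing
            probes = probeFailures = 0;
        }
        return state == State.CLOSED
                || (state == State.HALF_OPEN && probes < PROBE_LIMIT);
    }

    void record(boolean success) {
        if (state == State.HALF_OPEN) {
            probes++;
            if (!success) probeFailures++;
            if (probes == PROBE_LIMIT) {              // all probes done: decide
                state = (probeFailures == 0) ? State.CLOSED : State.OPEN;
                if (state == State.OPEN) openedAt = Instant.now();
                calls = failures = 0;
            }
            return;
        }
        calls++;
        if (!success) failures++;
        if (calls >= WINDOW) {                        // evaluate the window
            if ((double) failures / calls >= THRESHOLD) {
                state = State.OPEN;
                openedAt = Instant.now();
            }
            calls = failures = 0;                     // start a fresh window
        }
    }
}
```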
Configuration Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| failureRateThreshold | 50% | Percentage of failures that trips the breaker to Open |
| slidingWindowSize | 100 calls or 60s | Measurement window for the failure rate |
| minimumNumberOfCalls | 20 | Minimum calls before the failure rate is evaluated |
| waitDurationInOpenState | 30s | Time to wait before probing the dependency |
| permittedCallsInHalfOpen | 5 | Number of probe requests allowed in Half-Open |
| slowCallRateThreshold | 80% | Percentage of slow calls that trips the breaker to Open |
| slowCallDurationThreshold | 2s | Call duration that counts as "slow" |
A word of advice: start with generous thresholds and tighten them after observing real traffic patterns. Start too tight and the breaker trips on normal variance, leaving the team debugging phantom outages.
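As a concrete mapping, here is how these parameters look in Resilience4j's builder API. The breaker name and the wrapped call are hypothetical; note that Resilience4j spells the Half-Open limit permittedNumberOfCallsInHalfOpenState.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class PaymentClientConfig {
    // Mirrors the parameter table above using Resilience4j's configuration API.
    static final CircuitBreakerConfig CONFIG = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                            // 50% failures trip to Open
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(100)                              // last 100 calls
            .minimumNumberOfCalls(20)                            // no decision before 20 calls
            .waitDurationInOpenState(Duration.ofSeconds(30))     // Open -> Half-Open after 30s
            .permittedNumberOfCallsInHalfOpenState(5)            // 5 probe requests
            .slowCallRateThreshold(80)                           // 80% slow calls also trip
            .slowCallDurationThreshold(Duration.ofSeconds(2))    // "slow" means over 2s
            .build();

    public static void main(String[] args) {
        CircuitBreaker breaker = CircuitBreaker.of("paymentService", CONFIG);
        // Wrap the remote call; callPaymentService is a hypothetical stand-in.
        Supplier<String> guarded = CircuitBreaker.decorateSupplier(
                breaker, PaymentClientConfig::callPaymentService);
        System.out.println(guarded.get());
    }

    static String callPaymentService() { return "ok"; }
}
```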
Retry with Exponential Backoff
Retries complement circuit breakers for transient problems like network blips and brief overloads. The formula is straightforward: delay = min(base * 2^attempt + random(0, jitter), max_delay).
- Base delay: 100-500ms
- Max retries: 3-5. If it has not worked by then, the problem is not transient
- Jitter: Random 0-100% of calculated delay. This is not optional. Without jitter, all clients retry at exactly the same moment, creating a thundering herd
- Idempotency requirement: Only retry operations that are safe to repeat. GETs, idempotent PUTs with version checks, anything with an idempotency key. Retrying a POST that creates a resource without built-in deduplication is dangerous. Stop and fix that first
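A plain-Java sketch of that backoff schedule, with the base, cap, and retry count taken from the guidance above (names are illustrative):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of delay = min(base * 2^attempt + random(0, jitter), max_delay),
// with jitter drawn from 0-100% of the calculated exponential delay.
public class RetryWithBackoff {
    static final long BASE_MS = 200, MAX_DELAY_MS = 10_000;
    static final int MAX_RETRIES = 4;

    static <T> T callWithRetry(Callable<T> call) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();                  // only safe for idempotent operations
            } catch (Exception e) {
                if (attempt >= MAX_RETRIES) throw e; // not transient: give up
                long exp = BASE_MS * (1L << attempt);                        // base * 2^attempt
                long jitter = ThreadLocalRandom.current().nextLong(exp + 1); // 0-100% of delay
                Thread.sleep(Math.min(exp + jitter, MAX_DELAY_MS));
            }
        }
    }
}
```

The jitter term is what desynchronizes clients: without it, every caller that failed at the same instant would retry at the same instant.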
Bulkhead Pattern
The name comes from ship construction. Bulkheads are watertight compartments that keep a hull breach from sinking the whole vessel. Same idea here: isolate resource pools per dependency so one slow service cannot consume all the threads and connections.
- Thread pool bulkhead. Each dependency gets its own thread pool. Say 20 threads for the payment service and 10 for notifications. If payments slow down and exhaust their 20 threads, the notification service's threads are completely unaffected.
- Semaphore bulkhead. Lighter weight. Limits concurrent calls via a semaphore count without dedicating thread pools. Works better for reactive and async architectures where thread-per-request does not make sense.
In practice, I reach for semaphore bulkheads in most cases and only use thread pool bulkheads when I need hard isolation for a dependency with a history of causing problems.
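A semaphore bulkhead needs nothing beyond java.util.concurrent. A minimal sketch, assuming a short 50ms acquisition wait before shedding load (the limit and names are illustrative):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Semaphore bulkhead sketch: caps concurrent calls to one dependency.
public class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Supplier<T> dependencyCall) throws InterruptedException {
        // Wait briefly for a permit; reject fast instead of queueing forever.
        if (!permits.tryAcquire(50, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("bulkhead full: load shed, not an error");
        }
        try {
            return dependencyCall.get();
        } finally {
            permits.release();              // always free the permit
        }
    }
}
```

Rejecting after a short wait rather than blocking indefinitely is the point: a slow dependency hits its cap and fails fast instead of consuming the caller's threads.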
Timeout Propagation
This one is easy to overlook and painful when done wrong. In a chain A -> B -> C, if A has a 5-second deadline, B needs to subtract its own processing time and give C a shorter deadline. Otherwise C might spend the full 5 seconds on work that A already abandoned.
Implementation. Use gRPC's built-in deadline propagation, or pass an explicit X-Request-Deadline header. Each service calculates remaining = deadline - elapsed before calling downstream. If remaining is zero or negative, fail fast. Do not make the call. The work would be wasted anyway.
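A hedged sketch of the header-based variant. Treating X-Request-Deadline as an absolute epoch-milliseconds value is an assumption, not a standard; gRPC users get the equivalent behavior from its built-in deadline support.

```java
import java.time.Duration;
import java.time.Instant;

// Deadline propagation sketch: the caller's absolute deadline arrives in a
// header, and each hop computes remaining = deadline - now before calling on.
public class DeadlinePropagation {
    static Duration remainingBudget(long deadlineEpochMillis) {
        return Duration.between(Instant.now(), Instant.ofEpochMilli(deadlineEpochMillis));
    }

    static String handle(long deadlineEpochMillis) {
        Duration remaining = remainingBudget(deadlineEpochMillis);
        if (remaining.isZero() || remaining.isNegative()) {
            // The caller has already given up: fail fast, do not call downstream.
            throw new IllegalStateException("deadline exceeded before downstream call");
        }
        // Forward the same absolute deadline so downstream sees the shrunken budget.
        return callDownstream(deadlineEpochMillis, remaining);
    }

    static String callDownstream(long deadline, Duration clientTimeout) {
        // Hypothetical downstream call: clientTimeout bounds the HTTP/RPC client.
        return "ok";
    }
}
```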
Production Patterns
The Notification System Pattern
This is a pattern I have seen work well in production. The notification system uses circuit breakers at multiple layers.
- Push provider circuit breakers. FCM and APNs each have independent circuit breakers. When APNs starts returning 503s, its circuit opens while FCM delivery keeps running. Android notifications keep flowing even when Apple is having a bad day.
- Per-tenant circuit breakers. One tenant sending malformed payloads should not take down notifications for everyone else. Per-tenant breakers contain the blast radius to the misbehaving account.
- Retry with dead-letter queue. After 3 failed retries, notifications land in a Kafka dead-letter topic. A separate consumer retries with progressively longer delays (5 minutes, 30 minutes, 2 hours) for transient provider outages. This handles the "provider was down for an hour" case without losing messages.
- Fallback chain. Push notification fails (circuit open)? Fall back to SMS. SMS budget exhausted? Fall back to email. All channels down? Queue for batch retry. The user gets notified through some channel rather than getting nothing. A minimal sketch of this cascade follows the list.
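A minimal sketch of the fallback cascade, assuming each channel is a function that returns true on delivery (the Notification type and method names are hypothetical):

```java
import java.util.List;
import java.util.function.Function;

// Fallback chain sketch: try each channel in order until one succeeds.
public class FallbackChain {
    record Notification(String userId, String message) {}

    static boolean deliver(Notification n, List<Function<Notification, Boolean>> channels) {
        for (Function<Notification, Boolean> channel : channels) {
            try {
                if (channel.apply(n)) return true;    // delivered: stop here
            } catch (RuntimeException e) {
                // Circuit open, budget exhausted, provider error: fall through
                // to the next channel instead of surfacing the failure.
            }
        }
        queueForBatchRetry(n);                        // all channels down
        return false;
    }

    static void queueForBatchRetry(Notification n) { /* enqueue to a retry topic */ }
}
```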
Envoy-Level Circuit Breaking
In a service mesh, Envoy provides infrastructure-level circuit breaking that works across any language.
- max_connections: Maximum connections to an upstream cluster (acts as a bulkhead)
- max_pending_requests: Maximum queued requests waiting for a connection
- max_requests: Maximum concurrent requests (semaphore bulkhead)
- max_retries: Maximum concurrent retries across the cluster
Envoy's outlier detection is also worth knowing. It is per-host circuit breaking: if a single pod returns 5 consecutive 5xx errors, Envoy pulls it from the load balancing pool for 30 seconds. This handles the common case where one bad pod drags down an otherwise healthy service. This comes for free in most mesh setups and catches problems before application-level circuit breakers even notice.
Failure Scenarios
Scenario 1: Circuit Breaker Flapping. A dependency is running near capacity. It serves 95% of requests fine but fails 5% from resource contention. During load spikes, the failure rate bounces around the threshold, and the circuit breaker starts oscillating: Open, Half-Open, Closed, Open again. Each transition causes a burst of client errors followed by a burst of probe traffic, which actually adds to the dependency's problems. The result is a sawtooth pattern in the metrics.

Detection: monitor circuit_breaker_state_transitions_per_minute and alert when transitions exceed 10/minute.

Fix: increase slidingWindowSize to smooth out transient spikes. Raise minimumNumberOfCalls so the breaker needs more data before making a decision. Add slowCallRateThreshold as a secondary trigger to catch degradation before hard failures hit. Consider adaptive thresholds that tighten during low traffic and relax during peak hours.
Scenario 2: Retry Storm Amplifies a Partial Outage. The database hits a 2-second GC pause, and 30% of queries time out. Three services, each configured with 3 retries and no jitter, detect the timeouts at the same time. Each failed request spawns 3 retries, so 4x the original load hits the database exactly when it can least handle it. A GC pause that would have resolved itself in 2 seconds becomes a 30-second outage because the retry storm overwhelms the connection pool. I have personally seen this turn a minor hiccup into a P1 incident.

Detection: monitor total request volume to the dependency. A 3-4x spike during an error event signals a retry storm. Track retry_count distribution per service.

Fix: add jitter to all retry delays to desynchronize them. Implement retry budgets that cap retries at 10% of total traffic (1,000 requests/sec means only 100 retries/sec); a minimal budget sketch follows. Use circuit breakers to stop retrying when the dependency is clearly down. Set shorter timeouts to fail fast and reduce the window of concurrent waiting requests.
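A retry budget is a small counter-based guard. A minimal sketch, assuming a fixed 10% ratio and ignoring the sliding-window decay a production version would need (names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Retry budget sketch: permit retries only while they stay under a fixed
// fraction of total traffic (10% here, matching the fix above).
public class RetryBudget {
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();
    private static final double MAX_RETRY_RATIO = 0.10;

    void onRequest() { requests.incrementAndGet(); }

    boolean tryAcquireRetry() {
        long total = Math.max(1, requests.get());
        if ((double) retries.get() / total >= MAX_RETRY_RATIO) {
            return false;        // budget spent: drop the retry, surface the error
        }
        retries.incrementAndGet();
        return true;
    }
    // A production version would decay both counters over a sliding window.
}
```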
Scenario 3: Missing Fallback Turns Partial Degradation into Total Outage. The recommendation service circuit breaker opens because the ML serving cluster is overloaded. The product page calls recommendations synchronously in the critical rendering path. The circuit opens and fails fast (good), but the error propagates up as a 500 to the user (bad). The product page is completely broken even though catalog, pricing, and inventory are all healthy. 95% of the page works, but users see 0% of it. This is the most common circuit breaker mistake I see: teams add the breaker but forget the fallback.

Detection: check whether circuit_breaker_open events correlate 1:1 with user-facing 5xx errors. They should not.

Fix: build a fallback hierarchy. Return cached recommendations (stale but functional), then popular items (generic but useful), then an empty section (the page still renders). Never let a non-critical dependency failure crash the critical path. Design APIs to return partial responses with degradation flags rather than hard failures.
Capacity Planning
Resilience patterns are not free. They add overhead that needs to be budgeted for.
| Pattern | Memory Overhead | Latency Overhead | CPU Overhead |
|---|---|---|---|
| Circuit breaker | ~1 KB per breaker (sliding window) | <0.1ms (state check) | Negligible |
| Retry (3 attempts) | Request buffer per retry | 0-6x original call latency | 1-4x original CPU |
| Bulkhead (thread pool) | ~1 MB per pool (20 threads) | Queue wait time when pool is full | Thread context switching |
| Timeout | Timer per request | None (reduces latency via early termination) | Negligible |
Key thresholds. Circuit breaker sliding windows above 1,000 calls increase memory linearly. Retry policies with a max of 5 attempts and a 2-second base delay can add up to 62 seconds of latency (2 + 4 + 8 + 16 + 32 = 62s, before jitter), so verify the caller's timeout budget actually accommodates that (it probably does not, and it is worth checking). Bulkhead thread pools should be sized at steady_state_concurrency * 1.5: too small and the system gets unnecessary rejections during normal traffic spikes; too large and it defeats the whole point of isolation.

Monitor bulkhead_rejected_calls and circuit_breaker_not_permitted_calls. These represent intentional load shedding, not errors. Do not alert on them the same way as 500s.
Key Points
- Circuit breakers stop cascading failures by fast-failing requests to dependencies that are already down
- Three states (Closed, Open, Half-Open) with transitions driven by failure rate thresholds
- Retry with exponential backoff plus jitter prevents hammering a recovering service with synchronized retries
- Bulkheads isolate failures by capping concurrent requests per dependency so one bad actor can't eat all the threads
- Timeout propagation passes deadlines across service boundaries. Without it, downstream services waste work on requests the caller already gave up on
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Resilience4j | Open Source | Java/Kotlin, lightweight, functional API, Spring Boot integration | Small-Enterprise |
| Hystrix (Netflix) | Open Source | Pioneered the pattern, now deprecated. Migrate to Resilience4j | Legacy |
| Envoy Circuit Breaking | Open Source | Infrastructure-level, language-agnostic, mesh-native | Medium-Enterprise |
| Polly (.NET) | Open Source | .NET ecosystem, fluent API, wide range of policy types | Small-Enterprise |
Common Mistakes
- Thresholds too tight. A 1% failure-rate threshold over a 10-request window means a single failed call trips the circuit on normal noise
- No fallbacks. An open circuit that throws a raw error at users is worse than returning stale data or a degraded response
- Retrying non-idempotent operations. Retrying a payment charge without idempotency keys means double charges. Expect a 3am page
- One timeout for every dependency. A database query and an external API have completely different baseline latencies. Treat them differently
- Not propagating deadlines. Service A sets 5s, but service B calls C with a fresh 5s of its own. C can burn its full budget on work A abandoned long ago