Retry with Exponential Backoff and Jitter
Retry transient failures with delays that double each attempt (exponential backoff). Add jitter (randomness) so a thousand clients don't synchronise their retries and DDoS the recovering server. Cap the total attempts and the total elapsed time. Only retry idempotent operations.
What it is
Retry is the reflex response to "the call failed". Done badly, it makes outages worse: every failing client retries at the same time, the downstream cannot recover, the outage cascades.
Done well, retry is one of the cheapest ways to improve reliability. The recipe is a few lines of code:
- Try the operation.
- If it succeeds, return.
- If it fails with a non-transient error, give up.
- If it fails with a transient error and there are attempts left, wait and try again.
- The wait is random(0, base * 2^attempt): the window doubles each attempt, and the randomness (jitter) keeps clients from synchronising.
- Cap the total attempts AND the total elapsed time.
Stick to that and most of the failure modes disappear.
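A minimal hand-rolled sketch of that loop, using the "typical" knob values from the policy section further down. The name withRetry and the choice of IOException as the transient signal are illustrative, not prescriptive:

import java.io.IOException;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

static <T> T withRetry(Callable<T> op) throws Exception {
    long baseMs = 100;                                   // starting delay
    int maxAttempts = 5;                                 // includes the first try
    Instant deadline = Instant.now().plus(Duration.ofSeconds(30)); // elapsed cap

    for (int attempt = 0; ; attempt++) {
        try {
            return op.call();                            // try the operation
        } catch (IOException transientFailure) {         // transient errors only;
            // anything else propagates immediately and is never retried
            long capMs = baseMs << attempt;              // base * 2^attempt
            long waitMs = ThreadLocalRandom.current().nextLong(capMs + 1); // full jitter
            boolean outOfAttempts = attempt + 1 >= maxAttempts;
            boolean outOfTime = Instant.now().plusMillis(waitMs).isAfter(deadline);
            if (outOfAttempts || outOfTime) {
                throw transientFailure;                  // give up, surface the error
            }
            Thread.sleep(waitMs);                        // wait, then loop and retry
        }
    }
}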
Why exponential backoff
The downstream is failing. Hitting it harder makes it fail more. Exponential delay means: the first retry is fast (it might have been a one-off glitch), subsequent retries are increasingly patient (the failure is real; the downstream needs time).
attempt   wait before retry
-------   -----------------
   1            100 ms
   2            200 ms
   3            400 ms
   4            800 ms
   5           1600 ms
The wait doubles each time. Linear backoff (1s, 2s, 3s, 4s) is sometimes used; exponential backs off harder as failures persist, which is gentler on a struggling downstream.
Why jitter (the thundering herd)
Without jitter, every failing client uses the same delay schedule, so they all retry at exactly the same moments. After 100ms, all of them retry. After 300ms, all of them retry again. Each of those moments is a spike of synchronised traffic. The downstream is being recovered by humans pushing a fix; the moment it comes back online, the next spike crushes it again.
The picture is sharpest as two arrival-rate charts. Imagine 1000 clients that all started failing at the same time, and plot how many of their retries arrive in each 25ms window after the crash.
Without jitter: one tall spike at t = 100ms. The downstream sees 1000 requests in a single moment, fails again, and the next round of retries keeps the cycle going.
Now add a small random delay to each client's wait time. The same 1000 retries happen, but now each client picks its own random offset between 0 and 100ms.
Same total traffic. Now the downstream is seeing roughly 250 requests every 25ms instead of 1000 in one shot. That is well within what a recovering service can handle. As soon as it starts answering successfully, those clients are done; the rest of the retries land smoothly behind them.
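The arithmetic is easy to sanity-check with a few throwaway lines (the printed counts are illustrative):

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Spread 1000 retries uniformly over a 100 ms window and count
// arrivals per 25 ms bucket: each bucket lands near 250.
int[] buckets = new int[4];
for (int i = 0; i < 1000; i++) {
    long offsetMs = ThreadLocalRandom.current().nextLong(100); // random in [0, 100)
    buckets[(int) (offsetMs / 25)]++;
}
System.out.println(Arrays.toString(buckets)); // e.g. [261, 247, 240, 252]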
Full jitter is the simplest version of this idea: delay = random(0, base * 2^attempt). Spread the retries uniformly across the whole backoff window. If the downstream recovers at any point during the window, some clients see it and exit; the rest get shaped naturally into the next attempt. No thundering herd, no synchronised waves.
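Decorrelated jitter, mentioned in the knobs below, bases each delay on the previous one instead of the attempt number. A sketch of both strategies (method names are mine; the decorrelated formula follows the formulation popularised by the AWS Architecture Blog):

import java.util.concurrent.ThreadLocalRandom;

class Jitter {
    // Full jitter: uniform over the whole exponentially growing window.
    static long fullJitterMs(long baseMs, int attempt) {
        long cap = baseMs << attempt;                          // base * 2^attempt
        return ThreadLocalRandom.current().nextLong(cap + 1);  // random in [0, cap]
    }

    // Decorrelated jitter: each delay is random in [base, 3 * previous],
    // clamped to a cap, so delays drift apart instead of marching in step.
    static long decorrelatedMs(long baseMs, long prevMs, long capMs) {
        long next = ThreadLocalRandom.current().nextLong(baseMs, 3 * prevMs + 1);
        return Math.min(capMs, next);
    }
}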
When not to retry
Three cases:
Permanent errors. 4xx HTTP responses, validation failures, "user not found", auth failures. Retrying never helps. Surface the error immediately.
Non-idempotent operations. POST without an idempotency key. PUT and DELETE are usually safe (repeating the same operation has the same effect); POST might create duplicate orders, charge the customer twice, send the email twice. If retrying POSTs is necessary, generate an idempotency key on the client and have the server deduplicate (a sketch follows after these cases).
Out of time. The request has a budget (an SLA, a user-facing deadline). If the next retry's delay would push past the budget, give up.
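The sketch promised for the second case: make a POST safe to retry by attaching a client-generated idempotency key. The endpoint and payload are invented; the Idempotency-Key header is a common convention, and the server must deduplicate on it:

import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

// Generate the key once per logical operation, before the first attempt,
// and reuse the same request across retries so the server can
// recognise duplicates.
HttpRequest request = HttpRequest.newBuilder(URI.create("https://api.example.com/orders"))
    .header("Idempotency-Key", UUID.randomUUID().toString())
    .POST(HttpRequest.BodyPublishers.ofString("{\"sku\":\"A-1\"}"))
    .build();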
Layers and double-retry
Many libraries (HTTP clients, gRPC, database drivers) have built-in retry. Wrapping their calls in another retry produces exponential blowup: 5 attempts × 5 attempts = 25 actual calls. Either disable the library's retry or trust it and don't add another. Don't stack.
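For example, if the caller owns the retry policy and the transport is OkHttp, its connection-level retry can be switched off (one library among many, shown purely as an illustration):

import okhttp3.OkHttpClient;

// Let the caller's retry policy be the only one in play:
// disable OkHttp's automatic retry on connection failures.
OkHttpClient client = new OkHttpClient.Builder()
    .retryOnConnectionFailure(false)
    .build();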
Same warning at the system level: client retries → load balancer retries → service retries. Each layer multiplies. Be deliberate about where retries live.
The four knobs of a retry policy
- base: starting delay. Typical: 100ms.
- max_attempts: total tries (including the first). Typical: 3-5.
- max_duration: total elapsed cap. Typical: a fraction of the latency budget.
- jitter: full jitter (random in [0, max]) or decorrelated (each delay drawn from a range based on the previous delay).
Pick these per call site based on what the operation costs and what the downstream can take. Don't reuse one policy for every call; an internal RPC and an external email send have different shapes.
Implementations
Hand-rolling retry is fine, but production code usually uses a library (Failsafe, Resilience4j, Spring Retry). They provide backoff, jitter, max-attempts, max-duration, and metrics in one configuration block.
import dev.failsafe.Failsafe;
import dev.failsafe.RetryPolicy;
import java.io.IOException;
import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.TimeoutException;

// HttpResponse, client and request are assumed to be defined by the
// surrounding code.
RetryPolicy<HttpResponse> policy = RetryPolicy.<HttpResponse>builder()
    .handle(IOException.class, TimeoutException.class)  // transient: connection errors, timeouts
    .handleResultIf(r -> r.statusCode() >= 500)          // transient: 5xx responses
    .withBackoff(100, 5_000, ChronoUnit.MILLIS)          // 100 ms, doubling, capped at 5 s
    .withJitter(0.5)                                     // ±50% jitter
    .withMaxAttempts(5)
    .withMaxDuration(Duration.ofSeconds(30))             // total elapsed cap
    .build();

HttpResponse r = Failsafe.with(policy).get(() -> client.send(request));

Key points
- Exponential backoff: delay = base * 2^attempt. Avoids hammering the failing service.
- Jitter is required, not optional. Without it, retries synchronise into a thundering herd.
- Bound retries: max attempts AND max total elapsed time. Both, not either.
- Only retry transient errors (5xx, timeouts, connection refused). Never retry 4xx (client errors).
- Only retry idempotent operations. POST is dangerous; GET, PUT, DELETE are usually safe.
Follow-up questions
- Why is jitter important?
- Full jitter vs decorrelated jitter?
- When NOT to retry?
- How is a transient error identified?
Gotchas
- No jitter = thundering herd when downstream recovers
- Retrying non-idempotent operations (POST without idempotency key) double-charges users
- No total-time bound = retries last longer than the original request budget
- Retrying 4xx errors wastes everyone's time
- Retry inside a retry (caller retry plus the HTTP client's) creates exponential blowup