Retry with Exponential Backoff and Jitter
Retry transient failures with delays that double each attempt (exponential backoff). Add jitter (randomness) so a thousand clients don't synchronise their retries and DDoS the recovering server. Cap the total attempts and the total elapsed time. Only retry idempotent operations.
What it is
Retry is the reflex response to "the call failed". Done badly, it makes outages worse: every failing client retries at the same time, the downstream cannot recover, the outage cascades.
Done well, retry is one of the cheapest ways to improve reliability. The recipe is a few lines of code:
- Try the operation.
- If it succeeds, return.
- If it fails with a non-transient error, give up.
- If it fails with a transient error and there are attempts left, wait and try again.
- The wait is random(0, base * 2^attempt): the window doubles each attempt, and the randomness (jitter) keeps clients from synchronising.
- Cap the total attempts AND the total elapsed time.
Stick to that and most of the failure modes disappear.
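A minimal hand-rolled sketch of that loop, using the "typical" knob values from the policy section further down. The name withRetry and the choice of IOException as the transient signal are illustrative, not prescriptive:

import java.io.IOException;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

static <T> T withRetry(Callable<T> op) throws Exception {
    long baseMs = 100;                                   // starting delay
    int maxAttempts = 5;                                 // includes the first try
    Instant deadline = Instant.now().plus(Duration.ofSeconds(30)); // elapsed cap

    for (int attempt = 0; ; attempt++) {
        try {
            return op.call();                            // try the operation
        } catch (IOException transientFailure) {         // transient errors only;
            // anything else propagates immediately and is never retried
            long capMs = baseMs << attempt;              // base * 2^attempt
            long waitMs = ThreadLocalRandom.current().nextLong(capMs + 1); // full jitter
            boolean outOfAttempts = attempt + 1 >= maxAttempts;
            boolean outOfTime = Instant.now().plusMillis(waitMs).isAfter(deadline);
            if (outOfAttempts || outOfTime) {
                throw transientFailure;                  // give up, surface the error
            }
            Thread.sleep(waitMs);                        // wait, then loop and retry
        }
    }
}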
Why exponential backoff
The downstream is failing. Hitting it harder makes it fail more. Exponential delay means: the first retry is fast (it might have been a one-off glitch), subsequent retries are increasingly patient (the failure is real; the downstream needs time).
attempt   wait before retry
-------   -----------------
   1            100 ms
   2            200 ms
   3            400 ms
   4            800 ms
   5           1600 ms
The wait doubles each time. Linear backoff (1s, 2s, 3s, 4s) is sometimes used; exponential backs off harder as failures persist, which is gentler on a struggling downstream.
Why jitter (the thundering herd)
Without jitter, every failing client uses the same delay schedule, so they all retry at exactly the same moments. After 100ms, all of them retry. After 300ms, all of them retry again. Each of those moments is a spike of synchronised traffic. The downstream is being recovered by humans pushing a fix; the moment it comes back online, the next spike crushes it again.
The picture is sharpest as two arrival-rate charts. Imagine 1000 clients that all started failing at the same time, and plot how many of their retries arrive in each 25ms window after the crash.
Without jitter: one tall spike at t = 100ms. The downstream sees 1000 requests in a single moment, fails again, and the next round of retries keeps the cycle going.
Now add a small random delay to each client's wait time. The same 1000 retries happen, but now each client picks its own random offset between 0 and 100ms.
Same total traffic. Now the downstream is seeing roughly 250 requests every 25ms instead of 1000 in one shot. That is well within what a recovering service can handle. As soon as it starts answering successfully, those clients are done; the rest of the retries land smoothly behind them.
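The arithmetic is easy to sanity-check with a few throwaway lines (the printed counts are illustrative):

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Spread 1000 retries uniformly over a 100 ms window and count
// arrivals per 25 ms bucket: each bucket lands near 250.
int[] buckets = new int[4];
for (int i = 0; i < 1000; i++) {
    long offsetMs = ThreadLocalRandom.current().nextLong(100); // random in [0, 100)
    buckets[(int) (offsetMs / 25)]++;
}
System.out.println(Arrays.toString(buckets)); // e.g. [261, 247, 240, 252]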
Full jitter is the simplest version of this idea: delay = random(0, base * 2^attempt). Spread the retries uniformly across the whole backoff window. If the downstream recovers at any point during the window, some clients see it and exit; the rest get shaped naturally into the next attempt. No thundering herd, no synchronised waves.
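Decorrelated jitter, mentioned in the knobs below, bases each delay on the previous one instead of the attempt number. A sketch of both strategies (method names are mine; the decorrelated formula follows the formulation popularised by the AWS Architecture Blog):

import java.util.concurrent.ThreadLocalRandom;

class Jitter {
    // Full jitter: uniform over the whole exponentially growing window.
    static long fullJitterMs(long baseMs, int attempt) {
        long cap = baseMs << attempt;                          // base * 2^attempt
        return ThreadLocalRandom.current().nextLong(cap + 1);  // random in [0, cap]
    }

    // Decorrelated jitter: each delay is random in [base, 3 * previous],
    // clamped to a cap, so delays drift apart instead of marching in step.
    static long decorrelatedMs(long baseMs, long prevMs, long capMs) {
        long next = ThreadLocalRandom.current().nextLong(baseMs, 3 * prevMs + 1);
        return Math.min(capMs, next);
    }
}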
When not to retry
Three cases:
Permanent errors. 4xx HTTP responses, validation failures, "user not found", auth failures. Retrying never helps. Surface the error immediately.
Non-idempotent operations. POST without an idempotency key. PUT and DELETE are usually safe (repeating the same operation has the same effect); POST might create duplicate orders, charge the customer twice, send the email twice. If retrying POSTs is necessary, generate an idempotency key on the client and have the server deduplicate (a sketch follows after these cases).
Out of time. The request has a budget (an SLA, a user-facing deadline). If the next retry's delay would push past the budget, give up.
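The sketch promised for the second case: make a POST safe to retry by attaching a client-generated idempotency key. The endpoint and payload are invented; the Idempotency-Key header is a common convention, and the server must deduplicate on it:

import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

// Generate the key once per logical operation, before the first attempt,
// and reuse the same request across retries so the server can
// recognise duplicates.
HttpRequest request = HttpRequest.newBuilder(URI.create("https://api.example.com/orders"))
    .header("Idempotency-Key", UUID.randomUUID().toString())
    .POST(HttpRequest.BodyPublishers.ofString("{\"sku\":\"A-1\"}"))
    .build();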
Layers and double-retry
Many libraries (HTTP clients, gRPC, database drivers) have built-in retry. Wrapping their calls in another retry produces exponential blowup: 5 attempts × 5 attempts = 25 actual calls. Either disable the library's retry or trust it and don't add another. Don't stack.
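For example, if the caller owns the retry policy and the transport is OkHttp, its connection-level retry can be switched off (one library among many, shown purely as an illustration):

import okhttp3.OkHttpClient;

// Let the caller's retry policy be the only one in play:
// disable OkHttp's automatic retry on connection failures.
OkHttpClient client = new OkHttpClient.Builder()
    .retryOnConnectionFailure(false)
    .build();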
Same warning at the system level: client retries → load balancer retries → service retries. Each layer multiplies. Be deliberate about where retries live.
The four knobs of a retry policy
- base: starting delay. Typical: 100ms.
- max_attempts: total tries (including the first). Typical: 3-5.
- max_duration: total elapsed cap. Typical: a fraction of the latency budget.
- jitter: full jitter (random in [0, max]) or decorrelated (each delay drawn from a range based on the previous delay).
Pick these per call site based on what the operation costs and what the downstream can take. Don't reuse one policy for every call; an internal RPC and an external email send have different shapes.
Implementations
Hand-rolling retry is fine, but production code usually uses a library (Failsafe, Resilience4j, Spring Retry). They provide backoff, jitter, max-attempts, max-duration, and metrics in one configuration block.
import dev.failsafe.Failsafe;
import dev.failsafe.RetryPolicy;
import java.io.IOException;
import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.TimeoutException;

// HttpResponse, client and request are assumed to be defined by the
// surrounding code.
RetryPolicy<HttpResponse> policy = RetryPolicy.<HttpResponse>builder()
    .handle(IOException.class, TimeoutException.class)  // transient: connection errors, timeouts
    .handleResultIf(r -> r.statusCode() >= 500)          // transient: 5xx responses
    .withBackoff(100, 5_000, ChronoUnit.MILLIS)          // 100 ms, doubling, capped at 5 s
    .withJitter(0.5)                                     // ±50% jitter
    .withMaxAttempts(5)
    .withMaxDuration(Duration.ofSeconds(30))             // total elapsed cap
    .build();

HttpResponse r = Failsafe.with(policy).get(() -> client.send(request));

Key points
- Exponential backoff: delay = base * 2^attempt. Avoids hammering the failing service.
- Jitter is required, not optional. Without it, retries synchronise into a thundering herd.
- Bound retries: max attempts AND max total elapsed time. Both, not either.
- Only retry transient errors (5xx, timeouts, connection refused). Never retry 4xx (client errors).
- Only retry idempotent operations. POST is dangerous; GET, PUT, DELETE are usually safe.
Follow-up questions
- Why is jitter important?
- Full jitter vs decorrelated jitter?
- When NOT to retry?
- How is a transient error identified?
Gotchas
- No jitter = thundering herd when downstream recovers
- Retrying non-idempotent operations (POST without idempotency key) double-charges users
- No total-time bound = retries last longer than the original request budget
- Retrying 4xx errors wastes everyone's time
- Retry inside a retry (caller retry plus the HTTP client's) creates exponential blowup