Retry Library: Production Implementation
Build a retry library: classify errors (transient vs permanent), exponential backoff with jitter, max-attempts AND max-duration, per-call deadline budget, optional circuit breaker integration. Two ways to get it wrong: retry permanent errors, or retry without jitter.
What it is
A production retry library does more than "wrap a call in a for-loop". It encodes assumptions about which errors are worth retrying, how aggressive to be, when to give up, and how to play nicely with the rest of the system.
The contract: caller hands the library a function; library decides whether and when to call it again on failure; returns the first success or the last error if everything failed.
What goes in the policy
Six knobs cover most cases (a policy sketch in code follows the lists below):
- isTransient(err) classifier. Returns true for "this might succeed on retry". It is per-protocol: HTTP, gRPC, database drivers, and any custom error types each need their own classifier.
- maxAttempts. Total tries including the first. Typical: 3-5.
- maxDuration. Total elapsed cap. The retry stops if the next attempt would exceed it.
- baseDelay. Starting delay for backoff. Typical: 50-200ms.
- jitter. Always full jitter in production. Random delay in [0, base * 2^attempt].
- respectContext. The retry must abort if the caller's context is cancelled.
Some libraries add:
- timeout per attempt. So one slow attempt doesn't eat the whole budget.
- retryAfter parsing. If the server sends a 429 with Retry-After: 30, honour it instead of computing a local backoff.
- circuit breaker integration. Call the breaker before each attempt; if open, fail fast.
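Concretely, these knobs tend to end up in a single policy object built once per downstream dependency and reused for every call to it. A minimal sketch, assuming nothing beyond the list above; the RetryPolicy name and field names are illustrative, not a standard API (record syntax needs Java 16+):

import java.time.Duration;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical policy object bundling the knobs above; names are illustrative.
public record RetryPolicy(
        Predicate<Throwable> isTransient,     // classifier: "might succeed on retry"
        int maxAttempts,                      // total tries including the first, e.g. 3-5
        Duration maxDuration,                 // total elapsed cap across all attempts
        Duration baseDelay,                   // starting delay for backoff, e.g. 50-200ms
        boolean fullJitter,                   // random delay in [0, base * 2^attempt)
        boolean respectContext,               // abort when the caller cancels
        Optional<Duration> perAttemptTimeout  // optional: cap each individual attempt
) {}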
The error classifier matters most
A retry policy with the wrong classifier is worse than no retry. Two failure modes:
Too lax: a classifier that treats validation errors as transient retries them. Each retry sends the same bad input, fails the same way, and eventually exhausts the attempts. Best case: wasted time; worst case: hammering an already struggling server with useless traffic.
Too strict: a classifier that treats timeouts as permanent never retries them. A genuine network blip becomes a hard failure for the user, and the cheap win retry exists for is missed.
The classifier should be explicit about every error type it knows: this one is transient, that one is permanent, this one depends on the response code. Avoid catch-all rules.
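A sketch of an explicit classifier for an HTTP client, using the three categories named under Implementations below (TRANSIENT, PERMANENT, UNKNOWN). HttpStatusException is an assumed application-level type, not part of any standard library:

enum Verdict { TRANSIENT, PERMANENT, UNKNOWN }

// Assumed app-level exception carrying the HTTP status code (not a standard type).
class HttpStatusException extends RuntimeException {
    final int status;
    HttpStatusException(int status) { this.status = status; }
}

class HttpClassifier {
    static Verdict classify(Throwable e) {
        if (e instanceof java.net.SocketTimeoutException
                || e instanceof java.net.ConnectException) {
            return Verdict.TRANSIENT;                    // network blip: might succeed on retry
        }
        if (e instanceof IllegalArgumentException) {
            return Verdict.PERMANENT;                    // bad input: retrying cannot help
        }
        if (e instanceof HttpStatusException) {          // decide by the response code
            int code = ((HttpStatusException) e).status;
            if (code == 429) return Verdict.TRANSIENT;   // and honour Retry-After if present
            if (code >= 400 && code < 500) return Verdict.PERMANENT;
            if (code >= 500) return Verdict.TRANSIENT;
        }
        return Verdict.UNKNOWN;                          // unrecognised: safest to treat as permanent
    }
}

It plugs into a boolean isTransient knob as err -> HttpClassifier.classify(err) == Verdict.TRANSIENT, which makes UNKNOWN fall on the permanent (non-retried) side.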
Composition with other patterns
Retry doesn't live alone. The full client stack:
client.call()
> bulkhead (cap concurrency)
> retry (handle transient failures)
> circuit breaker (fail fast if downstream is down)
> timeout (cap per-attempt latency)
> actual HTTP call
Each layer compensates for the others' weaknesses. Bulkhead caps damage; circuit breaker prevents repeated failures; retry handles single-shot blips; timeout caps any one attempt.
A common mistake is putting these in the wrong order. Retry wrapping the circuit breaker (so every attempt consults the breaker, and repeated failures trip it faster) is right. Retry wrapping the per-attempt timeout (so the timeout kills a slow attempt and the next one starts fresh) is right. Retry wrapping another retry layer is wrong (multiplicative blowup).
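In code the ordering is just nesting. A sketch of the wiring, assuming placeholder Bulkhead and CircuitBreaker interfaces (not any specific library) and the Retrier class sketched under Implementations below:

import java.util.concurrent.Callable;

// Hypothetical layer interfaces; only the nesting order matters here.
interface Bulkhead { <T> T run(Callable<T> work) throws Exception; }         // caps concurrency
interface CircuitBreaker { <T> T call(Callable<T> work) throws Exception; }  // fails fast when open

class GuardedClient {
    static <T> T guardedCall(Bulkhead bulkhead, Retrier retrier, CircuitBreaker breaker,
                             Callable<T> timedHttpCall) throws Exception {
        return bulkhead.run(() ->                 // outermost: cap in-flight work
            retrier.execute(() ->                 // retry wraps the breaker: every attempt consults it
                breaker.call(timedHttpCall)));    // innermost: the call itself enforces a per-attempt timeout
    }
}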
What every implementation should do
Three things separate production-grade retry from textbook retry:
- Honour context/cancellation. When the caller cancels, the retry stops immediately. Even mid-backoff.
- Respect deadlines. Don't start an attempt if the remaining budget is too small.
- Pass per-attempt timeout downstream. So one slow attempt doesn't deny the next its chance.
Get these right and the retry layer becomes invisible: it improves reliability without surprising the caller with extra latency or stacked failures.
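One way to get the per-attempt timeout without blocking the retry loop forever is to run each attempt on an executor and wait with a bounded get, capped by whichever is smaller: the per-attempt timeout or the remaining overall budget. A minimal sketch, not tied to any particular library; the AttemptRunner name and method are illustrative:

import java.time.Duration;
import java.util.concurrent.*;

class AttemptRunner {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Run one attempt with its own timeout, never exceeding the remaining overall budget.
    <T> T attempt(Callable<T> fn, Duration perAttempt, Duration remainingBudget) throws Exception {
        Duration cap = perAttempt.compareTo(remainingBudget) < 0 ? perAttempt : remainingBudget;
        Future<T> future = pool.submit(fn);
        try {
            // get() is interruptible, so caller cancellation is honoured while waiting.
            return future.get(cap.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException slow) {
            future.cancel(true);   // abandon the slow attempt so the next one starts fresh
            throw slow;            // let the classifier decide whether a timeout is retryable
        } catch (ExecutionException wrapped) {
            Throwable cause = wrapped.getCause();
            if (cause instanceof Exception) throw (Exception) cause;  // unwrap the real failure
            throw wrapped;                                            // Errors stay wrapped, not retried
        }
    }
}

Note that cancel(true) only interrupts the worker thread; if the attempt ignores interruption, the abandoned work keeps running in the background, so this bounds the caller's latency rather than the downstream load.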
Implementations
The classifier is the heart of correctness. Three categories: PERMANENT (give up), TRANSIENT (retry), UNKNOWN (configurable; usually treat as permanent to be safe). Per-attempt budget caps individual call latency.
import java.time.Duration;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public class Retrier {
    private final int maxAttempts;
    private final Duration maxTotal;
    private final Duration base;
    private final Predicate<Throwable> isTransient;
    private final Random rnd = new Random();

    public Retrier(int maxAttempts, Duration maxTotal, Duration base,
                   Predicate<Throwable> isTransient) {
        this.maxAttempts = maxAttempts;
        this.maxTotal = maxTotal;
        this.base = base;
        this.isTransient = isTransient;
    }

    public <T> T execute(Callable<T> fn) throws Exception {
        long deadline = System.nanoTime() + maxTotal.toNanos();
        Throwable last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (deadline - System.nanoTime() <= 0) break;         // total elapsed budget spent
            try {
                return fn.call();                                 // first success wins
            } catch (Exception e) {
                last = e;
                if (!isTransient.test(e)) throw e;                // permanent error: give up now
                if (attempt == maxAttempts - 1) break;            // that was the last attempt

                // Full jitter: random delay in [0, base * 2^attempt).
                long maxDelay = base.toNanos() * (1L << attempt);
                long delay = (long) (rnd.nextDouble() * maxDelay);
                if (System.nanoTime() + delay > deadline) break;  // backoff would bust the deadline
                Thread.sleep(delay / 1_000_000);                  // interruptible: cancellation aborts mid-backoff
            }
        }
        throw new RetriesExhausted(last);
    }

    /** All attempts failed or the time budget ran out; wraps the last error seen. */
    public static class RetriesExhausted extends Exception {
        RetriesExhausted(Throwable last) { super("retries exhausted", last); }
    }
}
Key points
- Classify errors: only retry transient failures (5xx, timeout, connection refused). Never retry 4xx; a 429 with Retry-After is the usual exception.
- Exponential backoff with FULL jitter: delay = random(0, base * 2^attempt).
- Bound both attempts AND total elapsed time. Either alone is not enough.
- Pass the remaining deadline to each attempt; don't run a 5s call when 1s of budget remains.
- Compose with the circuit breaker: retries hit the breaker first, fail fast if open.
Follow-up questions
- What goes in the retryable-error classifier?
- Why should I prefer a deadline over a fixed max-attempts?
- Should retry be in the client SDK or in the application code?
Gotchas
- Retrying without a per-attempt timeout means slow downstream + slow retries = unbounded latency
- Retrying a non-idempotent POST without an idempotency key duplicates the side effect (see the sketch after this list)
- Stacked retries (application retry on top of the HTTP client's built-in retry) multiply attempts: 3 x 3 = 9 calls for one user request
- Using elapsed time but not respecting context cancellation: the retry keeps going after the caller gave up
- Treating all exceptions as retryable: validation errors and OOM get retried too
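For the non-idempotent POST gotcha, the usual fix is a client-generated key the server can use to deduplicate retried requests. A sketch using java.net.http and the Retrier above; the Idempotency-Key header is a common convention rather than a standard, the URL is made up, and server-side support is assumed:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

class PaymentClient {
    private final HttpClient http = HttpClient.newHttpClient();

    // Generate the key once per logical operation, then reuse it on every retry attempt,
    // so a server that supports idempotency keys applies the side effect at most once.
    HttpResponse<String> createPayment(Retrier retrier, String body) throws Exception {
        String idempotencyKey = UUID.randomUUID().toString();
        return retrier.execute(() -> {
            HttpRequest req = HttpRequest.newBuilder(URI.create("https://api.example.com/payments"))
                    .header("Idempotency-Key", idempotencyKey)   // same key on every attempt
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            return http.send(req, HttpResponse.BodyHandlers.ofString());
        });
    }
}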