Retry Library: Production Implementation
Build a retry library: classify errors (transient vs permanent), exponential backoff with jitter, max-attempts AND max-duration, per-call deadline budget, optional circuit breaker integration. Two ways to get it wrong: retry permanent errors, or retry without jitter.
What it is
A production retry library does more than "wrap a call in a for-loop". It encodes assumptions about which errors are worth retrying, how aggressive to be, when to give up, and how to play nicely with the rest of the system.
The contract: caller hands the library a function; library decides whether and when to call it again on failure; returns the first success or the last error if everything failed.
What goes in the policy
Six knobs cover most cases (a policy sketch in code follows the lists below):
- isTransient(err) classifier. Returns true for "this might succeed on retry". It is per-protocol: HTTP, gRPC, database drivers, and any custom error types each need their own classifier.
- maxAttempts. Total tries including the first. Typical: 3-5.
- maxDuration. Total elapsed cap. The retry stops if the next attempt would exceed it.
- baseDelay. Starting delay for backoff. Typical: 50-200ms.
- jitter. Always full jitter in production. Random delay in [0, base * 2^attempt].
- respectContext. The retry must abort if the caller's context is cancelled.
Some libraries add:
- timeout per attempt. So one slow attempt doesn't eat the whole budget.
- retryAfter parsing. If the server sends a 429 with Retry-After: 30, honour it instead of computing a local backoff.
- circuit breaker integration. Call the breaker before each attempt; if open, fail fast.
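Concretely, these knobs tend to end up in a single policy object built once per downstream dependency and reused for every call to it. A minimal sketch, assuming nothing beyond the list above; the RetryPolicy name and field names are illustrative, not a standard API (record syntax needs Java 16+):

import java.time.Duration;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical policy object bundling the knobs above; names are illustrative.
public record RetryPolicy(
        Predicate<Throwable> isTransient,     // classifier: "might succeed on retry"
        int maxAttempts,                      // total tries including the first, e.g. 3-5
        Duration maxDuration,                 // total elapsed cap across all attempts
        Duration baseDelay,                   // starting delay for backoff, e.g. 50-200ms
        boolean fullJitter,                   // random delay in [0, base * 2^attempt)
        boolean respectContext,               // abort when the caller cancels
        Optional<Duration> perAttemptTimeout  // optional: cap each individual attempt
) {}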
The error classifier matters most
A retry policy with the wrong classifier is worse than no retry. Two failure modes:
Too lax: a classifier that treats validation errors as transient retries them. Each retry sends the same bad input, fails the same way, and eventually exhausts the attempts. Best case: wasted time; worst case: hammering an already struggling server with useless traffic.
Too strict: a classifier that treats timeouts as permanent never retries them. A genuine network blip becomes a hard failure for the user, and the cheap win retry exists for is missed.
The classifier should be explicit about every error type it knows: this one is transient, that one is permanent, this one depends on the response code. Avoid catch-all rules.
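A sketch of an explicit classifier for an HTTP client, using the three categories named under Implementations below (TRANSIENT, PERMANENT, UNKNOWN). HttpStatusException is an assumed application-level type, not part of any standard library:

enum Verdict { TRANSIENT, PERMANENT, UNKNOWN }

// Assumed app-level exception carrying the HTTP status code (not a standard type).
class HttpStatusException extends RuntimeException {
    final int status;
    HttpStatusException(int status) { this.status = status; }
}

class HttpClassifier {
    static Verdict classify(Throwable e) {
        if (e instanceof java.net.SocketTimeoutException
                || e instanceof java.net.ConnectException) {
            return Verdict.TRANSIENT;                    // network blip: might succeed on retry
        }
        if (e instanceof IllegalArgumentException) {
            return Verdict.PERMANENT;                    // bad input: retrying cannot help
        }
        if (e instanceof HttpStatusException) {          // decide by the response code
            int code = ((HttpStatusException) e).status;
            if (code == 429) return Verdict.TRANSIENT;   // and honour Retry-After if present
            if (code >= 400 && code < 500) return Verdict.PERMANENT;
            if (code >= 500) return Verdict.TRANSIENT;
        }
        return Verdict.UNKNOWN;                          // unrecognised: safest to treat as permanent
    }
}

It plugs into a boolean isTransient knob as err -> HttpClassifier.classify(err) == Verdict.TRANSIENT, which makes UNKNOWN fall on the permanent (non-retried) side.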
Composition with other patterns
Retry doesn't live alone. The full client stack:
client.call()
> bulkhead (cap concurrency)
> retry (handle transient failures)
> circuit breaker (fail fast if downstream is down)
> timeout (cap per-attempt latency)
> actual HTTP call
Each layer compensates for the others' weaknesses. Bulkhead caps damage; circuit breaker prevents repeated failures; retry handles single-shot blips; timeout caps any one attempt.
A common mistake is putting these in the wrong order. Retry wrapping the circuit breaker (so every attempt consults the breaker, and repeated failures trip it faster) is right. Retry wrapping the per-attempt timeout (so the timeout kills a slow attempt and the next one starts fresh) is right. Retry wrapping another retry layer is wrong (multiplicative blowup).
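In code the ordering is just nesting. A sketch of the wiring, assuming placeholder Bulkhead and CircuitBreaker interfaces (not any specific library) and the Retrier class sketched under Implementations below:

import java.util.concurrent.Callable;

// Hypothetical layer interfaces; only the nesting order matters here.
interface Bulkhead { <T> T run(Callable<T> work) throws Exception; }         // caps concurrency
interface CircuitBreaker { <T> T call(Callable<T> work) throws Exception; }  // fails fast when open

class GuardedClient {
    static <T> T guardedCall(Bulkhead bulkhead, Retrier retrier, CircuitBreaker breaker,
                             Callable<T> timedHttpCall) throws Exception {
        return bulkhead.run(() ->                 // outermost: cap in-flight work
            retrier.execute(() ->                 // retry wraps the breaker: every attempt consults it
                breaker.call(timedHttpCall)));    // innermost: the call itself enforces a per-attempt timeout
    }
}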
What every implementation should do
Three things separate production-grade retry from textbook retry:
- Honour context/cancellation. When the caller cancels, the retry stops immediately. Even mid-backoff.
- Respect deadlines. Don't start an attempt if the remaining budget is too small.
- Pass per-attempt timeout downstream. So one slow attempt doesn't deny the next its chance.
Get these right and the retry layer becomes invisible: it improves reliability without surprising the caller with extra latency or stacked failures.
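One way to get the per-attempt timeout without blocking the retry loop forever is to run each attempt on an executor and wait with a bounded get, capped by whichever is smaller: the per-attempt timeout or the remaining overall budget. A minimal sketch, not tied to any particular library; the AttemptRunner name and method are illustrative:

import java.time.Duration;
import java.util.concurrent.*;

class AttemptRunner {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Run one attempt with its own timeout, never exceeding the remaining overall budget.
    <T> T attempt(Callable<T> fn, Duration perAttempt, Duration remainingBudget) throws Exception {
        Duration cap = perAttempt.compareTo(remainingBudget) < 0 ? perAttempt : remainingBudget;
        Future<T> future = pool.submit(fn);
        try {
            // get() is interruptible, so caller cancellation is honoured while waiting.
            return future.get(cap.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException slow) {
            future.cancel(true);   // abandon the slow attempt so the next one starts fresh
            throw slow;            // let the classifier decide whether a timeout is retryable
        } catch (ExecutionException wrapped) {
            Throwable cause = wrapped.getCause();
            if (cause instanceof Exception) throw (Exception) cause;  // unwrap the real failure
            throw wrapped;                                            // Errors stay wrapped, not retried
        }
    }
}

Note that cancel(true) only interrupts the worker thread; if the attempt ignores interruption, the abandoned work keeps running in the background, so this bounds the caller's latency rather than the downstream load.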
Implementations
The classifier is the heart of correctness. Three categories: PERMANENT (give up), TRANSIENT (retry), UNKNOWN (configurable; usually treat as permanent to be safe). Per-attempt budget caps individual call latency.
import java.time.Duration;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public class Retrier {
    private final int maxAttempts;
    private final Duration maxTotal;
    private final Duration base;
    private final Predicate<Throwable> isTransient;
    private final Random rnd = new Random();

    public Retrier(int maxAttempts, Duration maxTotal, Duration base,
                   Predicate<Throwable> isTransient) {
        this.maxAttempts = maxAttempts;
        this.maxTotal = maxTotal;
        this.base = base;
        this.isTransient = isTransient;
    }

    public <T> T execute(Callable<T> fn) throws Exception {
        long deadline = System.nanoTime() + maxTotal.toNanos();
        Throwable last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (deadline - System.nanoTime() <= 0) break;         // total elapsed budget spent
            try {
                return fn.call();                                 // first success wins
            } catch (Exception e) {
                last = e;
                if (!isTransient.test(e)) throw e;                // permanent error: give up now
                if (attempt == maxAttempts - 1) break;            // that was the last attempt

                // Full jitter: random delay in [0, base * 2^attempt).
                long maxDelay = base.toNanos() * (1L << attempt);
                long delay = (long) (rnd.nextDouble() * maxDelay);
                if (System.nanoTime() + delay > deadline) break;  // backoff would bust the deadline
                Thread.sleep(delay / 1_000_000);                  // interruptible: cancellation aborts mid-backoff
            }
        }
        throw new RetriesExhausted(last);
    }

    /** All attempts failed or the time budget ran out; wraps the last error seen. */
    public static class RetriesExhausted extends Exception {
        RetriesExhausted(Throwable last) { super("retries exhausted", last); }
    }
}
Key points
- Classify errors: only retry transient failures (5xx, timeout, connection refused). Never retry 4xx; a 429 with Retry-After is the usual exception.
- Exponential backoff with FULL jitter: delay = random(0, base * 2^attempt).
- Bound both attempts AND total elapsed time. Either alone is not enough.
- Pass the remaining deadline to each attempt; don't run a 5s call when 1s of budget remains.
- Compose with the circuit breaker: retries hit the breaker first, fail fast if open.
Follow-up questions
- What goes in the retryable-error classifier?
- Why should I prefer a deadline over a fixed max-attempts?
- Should retry be in the client SDK or in the application code?
Gotchas
- Retrying without a per-attempt timeout means slow downstream + slow retries = unbounded latency
- Retrying a non-idempotent POST without an idempotency key duplicates the side effect (see the sketch after this list)
- Stacked retries (application retry on top of the HTTP client's built-in retry) multiply attempts: 3 x 3 = 9 calls for one user request
- Using elapsed time but not respecting context cancellation: the retry keeps going after the caller gave up
- Treating all exceptions as retryable: validation errors and OOM get retried too
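For the non-idempotent POST gotcha, the usual fix is a client-generated key the server can use to deduplicate retried requests. A sketch using java.net.http and the Retrier above; the Idempotency-Key header is a common convention rather than a standard, the URL is made up, and server-side support is assumed:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

class PaymentClient {
    private final HttpClient http = HttpClient.newHttpClient();

    // Generate the key once per logical operation, then reuse it on every retry attempt,
    // so a server that supports idempotency keys applies the side effect at most once.
    HttpResponse<String> createPayment(Retrier retrier, String body) throws Exception {
        String idempotencyKey = UUID.randomUUID().toString();
        return retrier.execute(() -> {
            HttpRequest req = HttpRequest.newBuilder(URI.create("https://api.example.com/payments"))
                    .header("Idempotency-Key", idempotencyKey)   // same key on every attempt
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            return http.send(req, HttpResponse.BodyHandlers.ofString());
        });
    }
}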