Tail Latency: Why p99 Matters More Than Average
p99 (the 99th percentile latency) is the user experience; the average is a lie. Tail latency amplifies in fan-out: a request that fans out to 10 services has a p99 dominated by the slowest of 10 backends, not by their average. Mitigations: hedged requests, tied requests, micro-partitioning, smaller fan-out, better tail metrics.
What it is
Tail latency is the slow end of the latency distribution: the p99, the p99.9, the worst 1% or 0.1% of requests. It's invariably much worse than the average, and for user-facing systems it IS the user experience for everyone unlucky enough to hit it.
The famous Dean & Barroso paper "The Tail at Scale" (2013) is the canonical reference. The basic insight: averages lie, and the tail amplifies as systems get more complex.
Why averages lie
A typical service might have:
- p50 (median): 10ms
- p95: 50ms
- p99: 200ms
- p99.9: 800ms
- max: 5000ms (a GC pause, a connection timeout)
Average is something like 30ms. Looks fine. But:
- 1% of users see 200ms+
- 0.1% of users see 800ms+
- A few users per million wait 5 seconds
At a million requests per day, that's 10,000 requests over 200ms, 1,000 over 800ms, and a handful waiting seconds. Scale that by real traffic and the absolute numbers grow quickly.
The average is also pulled upward by the tail: outliers skew the mean, so neither the typical experience nor the tail can be read off it.
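To make the skew concrete, here's a tiny sketch (synthetic numbers, invented for illustration) computing the mean and the nearest-rank p99 of a 100-request sample shaped like the distribution above:

import java.util.Arrays;

class MeanVsTail {
    public static void main(String[] args) {
        // 97 fast requests plus three tail outliers (synthetic data)
        long[] ms = new long[100];
        Arrays.fill(ms, 0, 97, 10);  // 97 requests at 10ms
        ms[97] = 200;
        ms[98] = 800;
        ms[99] = 5000;               // a GC pause or a timeout

        double mean = Arrays.stream(ms).average().orElse(0);
        Arrays.sort(ms);
        long p99 = ms[98];           // nearest-rank p99 of 100 samples

        System.out.printf("mean: %.1f ms, p99: %d ms%n", mean, p99);
        // prints: mean: 69.7 ms, p99: 800 ms
        // The mean looks tolerable; 1 request in 100 waited 800ms or more.
    }
}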
Tail amplification in fan-out
The really nasty effect happens when one user request fans out to many backends.
Suppose each backend has p99 = 100ms (so 1% of requests are slower). One request fans out to N backends and waits for all. The probability that ALL are fast is 0.99^N.
- N=1: 99% fast, 1% slow → fan-out p99 ≈ backend p99 (100ms)
- N=10: 90% all-fast, 10% at least one slow → fan-out p99 ≈ backend p99.9 (much higher)
- N=100: 37% all-fast, 63% at least one slow → fan-out p99 ≈ backend p99.99
A user-facing endpoint that touches 10 internal services has its p99 dominated by the worst of 10 draws. The math is unforgiving.
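To sanity-check that math, a short sketch computing the all-fast probability for the three fan-out sizes above:

class FanoutTail {
    public static void main(String[] args) {
        double pFast = 0.99;  // each backend beats its p99 threshold 99% of the time
        for (int n : new int[] {1, 10, 100}) {
            double allFast = Math.pow(pFast, n);
            System.out.printf("N=%3d: P(all fast)=%.2f  P(at least one slow)=%.2f%n",
                              n, allFast, 1 - allFast);
        }
        // N=  1: P(all fast)=0.99  P(at least one slow)=0.01
        // N= 10: P(all fast)=0.90  P(at least one slow)=0.10
        // N=100: P(all fast)=0.37  P(at least one slow)=0.63
    }
}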
Mitigations
Hedged requests. Send to one replica; if it hasn't responded within some hedging delay (often the per-replica p95), send a duplicate to another replica and take whichever responds first. Trades a small load increase (~5%) for big p99 wins. Used in Google's storage layers; the benchmark example in the paper is a BigTable read workload. A full implementation appears in the Implementations section below.
Tied requests. Like hedging, but cancel the loser. Reduces wasted work. Requires the backend to support cancellation.
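A minimal sketch of the tied-request idea, reusing the hypothetical Replica.callAsync API from the implementation section below. The paper's scheme is more refined (replicas cross-cancel at dequeue time); here CompletableFuture.cancel only marks the future cancelled, so reclaiming in-flight server work still needs backend support:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Tied requests: send to two replicas at once; the first response wins
// and the winner cancels the loser to reclaim the duplicated work.
CompletableFuture<Result> tied(List<Replica> replicas, Request req) {
    CompletableFuture<Result> a = replicas.get(0).callAsync(req);
    CompletableFuture<Result> b = replicas.get(1).callAsync(req);

    return a.applyToEither(b, r -> r)
            .whenComplete((r, e) -> {
                // Cancelling an already-completed future is a no-op,
                // so it is safe to cancel both unconditionally.
                a.cancel(true);
                b.cancel(true);
            });
}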
Smaller fan-out. Combine multiple backend calls into one batch endpoint. Aggregate at the data layer. Eliminate optional calls.
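For instance (hypothetical userService API, names invented), replacing ten point lookups with one batch call turns ten independent draws from the latency distribution into a single draw:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Before: each id is a separate call. Ten independent tail draws,
// so the page's p99 approaches the backend's p99.9.
List<CompletableFuture<User>> calls = ids.stream()
    .map(userService::getUserAsync)
    .toList();

// After: one batch call. A single draw; the amplification term disappears.
CompletableFuture<List<User>> batched = userService.getUsersAsync(ids);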
Faster slow path. Often the tail is dominated by specific causes (GC pause, connection establishment, cold cache, slow query). Each is fixable individually.
Per-backend p99 work. Reducing per-backend p99 from 200ms to 50ms is the highest-leverage move in fan-out scenarios: the fan-out p99 follows.
Measurement matters
The load test must not suffer from coordinated omission. Closed-loop testers (wrk, ab, JMeter in its default mode) wait for each response before sending the next request; when the server slows down, the tester slows down with it and never records the backlog, so the tail vanishes from the report. Use open-loop tools that send at a constant rate regardless of response times: wrk2, vegeta, k6 (constant-arrival-rate executor), or Gatling's open workload model.
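A sketch of the open-loop idea (hypothetical target callback, pool size and rates invented): requests fire on a fixed schedule whether or not earlier responses have returned, and latency is measured from the intended send time, so a stalled server is charged for its backlog instead of pausing the tester:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Open-loop load generation: constant send rate, latency measured from
// the intended send time. (This is the correction wrk2 applies.)
void openLoop(int rps, int seconds, Runnable target) throws InterruptedException {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(8);
    long periodNanos = 1_000_000_000L / rps;
    long start = System.nanoTime();

    // Sketch only: schedules the whole run up front, fine for short tests
    for (long i = 0; i < (long) rps * seconds; i++) {
        long intendedSend = start + i * periodNanos;
        pool.schedule(() -> {
            target.run();  // issue one request
            long latencyNanos = System.nanoTime() - intendedSend;
            recordLatency(latencyNanos);  // e.g. into the HDR Histogram below;
            // queueing delay behind slow responses is included, by design
        }, intendedSend - System.nanoTime(), TimeUnit.NANOSECONDS);
    }
    pool.shutdown();
    pool.awaitTermination(seconds + 30L, TimeUnit.SECONDS);
}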
Metrics must capture the tail. HDR Histogram (Java), Prometheus quantile summaries (with care), and Datadog/New Relic distributions all work. Avoid:
- Simple averages.
- Prometheus histogram_quantile with default buckets (it interpolates linearly between buckets, so the tail buckets must be configured explicitly).
- "p99 over the last 5 minutes" with default Prometheus summaries (the percentile drifts as new samples arrive).
The average is meaningless for user-facing latency, so track p50, p99, and p99.9 instead. Fan-out amplifies the tail; reducing fan-out or per-backend p99 wins more than reducing the average. And coordinated omission in load tests will quietly lie about the numbers, so use open-loop tools when the answer matters.
For services that fan out to many backends, hedged requests are the single biggest available win without changing the backends themselves. Implement them once, reap years of latency improvement.
Implementations
Send the request to one replica; if it doesn't respond within the hedging delay (the per-replica p95 here), send a duplicate to another replica and take whichever responds first. (Cancelling the loser is the tied-request variant sketched earlier.) Trades ~5% extra load for a dramatic p99 improvement.
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Assumes at least two replicas and an async client API (Replica.callAsync)
CompletableFuture<Result> hedged(List<Replica> replicas, Request req,
                                 Duration hedgeAfter,
                                 ScheduledExecutorService scheduler) {
    // Start the first request immediately
    CompletableFuture<Result> first = replicas.get(0).callAsync(req);

    // After hedgeAfter, start a second request if the first hasn't completed
    CompletableFuture<Result> hedge = new CompletableFuture<>();
    scheduler.schedule(() -> {
        if (!first.isDone()) {
            replicas.get(1).callAsync(req)
                .whenComplete((r, e) -> {
                    // Propagate only success: a failed hedge must not fail
                    // the combined future while `first` is still in flight
                    if (e == null) hedge.complete(r);
                });
        }
    }, hedgeAfter.toMillis(), TimeUnit.MILLISECONDS);

    // Return whichever finishes first
    return CompletableFuture.anyOf(first, hedge)
        .thenApply(o -> (Result) o);
}
// Pattern from "The Tail at Scale": hedge after the p95 of the per-replica
// latency. For a backend with p50=10ms, p95=50ms, p99=200ms, hedging at
// 50ms typically cuts p99 by 5-10x at ~5% extra load.

HDR Histogram records latencies in fixed-precision buckets and supports arbitrary percentiles cheaply. It is the standard for capturing tail metrics in Java services. Avoid simple averages or Prometheus summaries that aggregate dynamically; they lose the tail.
import org.HdrHistogram.Histogram;

// 3 significant digits, highest trackable value 60 seconds (in nanos)
Histogram hist = new Histogram(60_000_000_000L, 3);

void recordLatency(long nanos) {
    hist.recordValue(nanos);
}

void report() {
    System.out.printf("p50:   %d ms%n", hist.getValueAtPercentile(50) / 1_000_000);
    System.out.printf("p99:   %d ms%n", hist.getValueAtPercentile(99) / 1_000_000);
    System.out.printf("p99.9: %d ms%n", hist.getValueAtPercentile(99.9) / 1_000_000);
    System.out.printf("max:   %d ms%n", hist.getMaxValue() / 1_000_000);
}

// Reset for the next reporting interval with hist.reset(), or use
// org.HdrHistogram.Recorder for atomic interval snapshots.

Key points
- Averages hide outliers. The p99 IS the user experience for 1% of requests, which is millions of users at scale.
- Tail amplification: a request fanning out to N backends has a p99 ≈ the max of N independent draws. With N=10, the fan-out p99 ≈ the p99.9 of one backend.
- Mitigations: hedged requests (send to two replicas, take the first), tied requests (cancel the loser), micro-partitioning, smaller fan-out.
- Coordinated omission: load-testing tools that wait for slow responses miss tail latency. Use open-loop tools (wrk2, k6) and record with HDR Histogram.
- Read "The Tail at Scale" by Dean & Barroso (2013); foundational paper, still relevant.
Follow-up questions
- Why does p99 amplify with fan-out?
- What's the simplest way to reduce tail latency?
- What is coordinated omission?
- Optimise for p50 or p99?
Gotchas
- Reporting average latency in dashboards: hides the tail entirely.
- Coordinated omission in load tests: the reported p99 understates the real p99 by 10-100x.
- Fan-out without thinking about amplification: the p99 of the user request is much worse than the p99 of any single backend.
- Hedging non-idempotent operations: the duplicated request means double execution; hedge only idempotent work or deduplicate server-side.
- Prometheus summary metrics with default config: percentiles drift over the reporting window.