Tail Latency: Why p99 Matters More Than Average
p99 (the 99th percentile latency) is the user experience; the average is a lie. Tail latency amplifies in fan-out: a request that fans out to 10 services has a p99 dominated by the slowest of 10 backends, not by their average. Mitigations: hedged requests, tied requests, micro-partitioning, smaller fan-out, better tail metrics.
What it is
Tail latency is the slow end of the latency distribution: the p99, the p99.9, the worst 1% or 0.1% of requests. It's invariably much worse than the average, and for user-facing systems it IS the user experience for everyone unlucky enough to hit it.
The famous Dean & Barroso paper "The Tail at Scale" (2013) is the canonical reference. The basic insight: averages lie, and the tail amplifies as systems get more complex.
Why averages lie
A typical service might have:
- p50 (median): 10ms
- p95: 50ms
- p99: 200ms
- p99.9: 800ms
- max: 5000ms (a GC pause, a connection timeout)
Average is something like 30ms. Looks fine. But:
- 1% of users see 200ms+
- 0.1% of users see 800ms+
- A few users per million wait 5 seconds
At a million requests per day, that's 10,000 requests over 200ms, 1,000 over 800ms, and a handful waiting seconds. Scale that by real traffic and the absolute numbers grow quickly.
The average is also pulled upward by the tail: outliers skew the mean, so neither the typical experience nor the tail can be read off it.
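To make the skew concrete, here's a tiny sketch (synthetic numbers, invented for illustration) computing the mean and the nearest-rank p99 of a 100-request sample shaped like the distribution above:

import java.util.Arrays;

class MeanVsTail {
    public static void main(String[] args) {
        // 97 fast requests plus three tail outliers (synthetic data)
        long[] ms = new long[100];
        Arrays.fill(ms, 0, 97, 10);  // 97 requests at 10ms
        ms[97] = 200;
        ms[98] = 800;
        ms[99] = 5000;               // a GC pause or a timeout

        double mean = Arrays.stream(ms).average().orElse(0);
        Arrays.sort(ms);
        long p99 = ms[98];           // nearest-rank p99 of 100 samples

        System.out.printf("mean: %.1f ms, p99: %d ms%n", mean, p99);
        // prints: mean: 69.7 ms, p99: 800 ms
        // The mean looks tolerable; 1 request in 100 waited 800ms or more.
    }
}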
Tail amplification in fan-out
The really nasty effect happens when one user request fans out to many backends.
Suppose each backend has p99 = 100ms (so 1% of requests are slower). One request fans out to N backends and waits for all. The probability that ALL are fast is 0.99^N.
- N=1: 99% fast, 1% slow → fan-out p99 ≈ backend p99 (100ms)
- N=10: 90% all-fast, 10% at least one slow → fan-out p99 ≈ backend p99.9 (much higher)
- N=100: 37% all-fast, 63% at least one slow → fan-out p99 ≈ backend p99.99
A user-facing endpoint that touches 10 internal services has its p99 dominated by the worst of 10 draws. The math is unforgiving.
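To sanity-check that math, a short sketch computing the all-fast probability for the three fan-out sizes above:

class FanoutTail {
    public static void main(String[] args) {
        double pFast = 0.99;  // each backend beats its p99 threshold 99% of the time
        for (int n : new int[] {1, 10, 100}) {
            double allFast = Math.pow(pFast, n);
            System.out.printf("N=%3d: P(all fast)=%.2f  P(at least one slow)=%.2f%n",
                              n, allFast, 1 - allFast);
        }
        // N=  1: P(all fast)=0.99  P(at least one slow)=0.01
        // N= 10: P(all fast)=0.90  P(at least one slow)=0.10
        // N=100: P(all fast)=0.37  P(at least one slow)=0.63
    }
}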
Mitigations
Hedged requests. Send to one replica; if it hasn't responded within some hedging delay (often the per-replica p95), send a duplicate to another replica and take whichever responds first. Trades a small load increase (~5%) for big p99 wins. Used in Google's storage layers; the benchmark example in the paper is a BigTable read workload. A full implementation appears in the Implementations section below.
Tied requests. Like hedging, but cancel the loser. Reduces wasted work. Requires the backend to support cancellation.
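A minimal sketch of the tied-request idea, reusing the hypothetical Replica.callAsync API from the implementation section below. The paper's scheme is more refined (replicas cross-cancel at dequeue time); here CompletableFuture.cancel only marks the future cancelled, so reclaiming in-flight server work still needs backend support:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Tied requests: send to two replicas at once; the first response wins
// and the winner cancels the loser to reclaim the duplicated work.
CompletableFuture<Result> tied(List<Replica> replicas, Request req) {
    CompletableFuture<Result> a = replicas.get(0).callAsync(req);
    CompletableFuture<Result> b = replicas.get(1).callAsync(req);

    return a.applyToEither(b, r -> r)
            .whenComplete((r, e) -> {
                // Cancelling an already-completed future is a no-op,
                // so it is safe to cancel both unconditionally.
                a.cancel(true);
                b.cancel(true);
            });
}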
Smaller fan-out. Combine multiple backend calls into one batch endpoint. Aggregate at the data layer. Eliminate optional calls.
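For instance (hypothetical userService API, names invented), replacing ten point lookups with one batch call turns ten independent draws from the latency distribution into a single draw:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Before: each id is a separate call. Ten independent tail draws,
// so the page's p99 approaches the backend's p99.9.
List<CompletableFuture<User>> calls = ids.stream()
    .map(userService::getUserAsync)
    .toList();

// After: one batch call. A single draw; the amplification term disappears.
CompletableFuture<List<User>> batched = userService.getUsersAsync(ids);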
Faster slow path. Often the tail is dominated by specific causes (GC pause, connection establishment, cold cache, slow query). Each is fixable individually.
Per-backend p99 work. Reducing per-backend p99 from 200ms to 50ms is the highest-leverage move in fan-out scenarios: the fan-out p99 follows.
Measurement matters
The load test must not suffer from coordinated omission. Closed-loop testers (wrk, ab, JMeter in its default mode) wait for each response before sending the next request; when the server slows down, the tester slows down with it and never records the backlog, so the tail vanishes from the report. Use open-loop tools that send at a constant rate regardless of response times: wrk2, vegeta, k6 (constant-arrival-rate executor), or Gatling's open workload model.
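A sketch of the open-loop idea (hypothetical target callback, pool size and rates invented): requests fire on a fixed schedule whether or not earlier responses have returned, and latency is measured from the intended send time, so a stalled server is charged for its backlog instead of pausing the tester:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Open-loop load generation: constant send rate, latency measured from
// the intended send time. (This is the correction wrk2 applies.)
void openLoop(int rps, int seconds, Runnable target) throws InterruptedException {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(8);
    long periodNanos = 1_000_000_000L / rps;
    long start = System.nanoTime();

    // Sketch only: schedules the whole run up front, fine for short tests
    for (long i = 0; i < (long) rps * seconds; i++) {
        long intendedSend = start + i * periodNanos;
        pool.schedule(() -> {
            target.run();  // issue one request
            long latencyNanos = System.nanoTime() - intendedSend;
            recordLatency(latencyNanos);  // e.g. into the HDR Histogram below;
            // queueing delay behind slow responses is included, by design
        }, intendedSend - System.nanoTime(), TimeUnit.NANOSECONDS);
    }
    pool.shutdown();
    pool.awaitTermination(seconds + 30L, TimeUnit.SECONDS);
}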
Metrics must capture the tail. HDR Histogram (Java), Prometheus quantile summaries (with care), and Datadog/New Relic distributions all work. Avoid:
- Simple averages.
- Prometheus histogram_quantile with default buckets (it interpolates linearly between buckets, so the tail buckets must be configured explicitly).
- "p99 over the last 5 minutes" with default Prometheus summaries (the percentile drifts as new samples arrive).
The average is meaningless for user-facing latency, so track p50, p99, and p99.9 instead. Fan-out amplifies the tail; reducing fan-out or per-backend p99 wins more than reducing the average. And coordinated omission in load tests will quietly lie about the numbers, so use open-loop tools when the answer matters.
For services that fan out to many backends, hedged requests are the single biggest available win without changing the backends themselves. Implement them once, reap years of latency improvement.
Implementations
Send the request to one replica; if it doesn't respond within the hedging delay (the per-replica p95 here), send a duplicate to another replica and take whichever responds first. (Cancelling the loser is the tied-request variant sketched earlier.) Trades ~5% extra load for a dramatic p99 improvement.
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Assumes at least two replicas and an async client API (Replica.callAsync)
CompletableFuture<Result> hedged(List<Replica> replicas, Request req,
                                 Duration hedgeAfter,
                                 ScheduledExecutorService scheduler) {
    // Start the first request immediately
    CompletableFuture<Result> first = replicas.get(0).callAsync(req);

    // After hedgeAfter, start a second request if the first hasn't completed
    CompletableFuture<Result> hedge = new CompletableFuture<>();
    scheduler.schedule(() -> {
        if (!first.isDone()) {
            replicas.get(1).callAsync(req)
                .whenComplete((r, e) -> {
                    // Propagate only success: a failed hedge must not fail
                    // the combined future while `first` is still in flight
                    if (e == null) hedge.complete(r);
                });
        }
    }, hedgeAfter.toMillis(), TimeUnit.MILLISECONDS);

    // Return whichever finishes first
    return CompletableFuture.anyOf(first, hedge)
        .thenApply(o -> (Result) o);
}
// Pattern from "The Tail at Scale": hedge after the p95 of the per-replica
// latency. For a backend with p50=10ms, p95=50ms, p99=200ms, hedging at
// 50ms typically cuts p99 by 5-10x at ~5% extra load.

HDR Histogram records latencies in fixed-precision buckets and supports arbitrary percentiles cheaply. It is the standard for capturing tail metrics in Java services. Avoid simple averages or Prometheus summaries that aggregate dynamically; they lose the tail.
import org.HdrHistogram.Histogram;

// 3 significant digits, highest trackable value 60 seconds (in nanos)
Histogram hist = new Histogram(60_000_000_000L, 3);

void recordLatency(long nanos) {
    hist.recordValue(nanos);
}

void report() {
    System.out.printf("p50:   %d ms%n", hist.getValueAtPercentile(50) / 1_000_000);
    System.out.printf("p99:   %d ms%n", hist.getValueAtPercentile(99) / 1_000_000);
    System.out.printf("p99.9: %d ms%n", hist.getValueAtPercentile(99.9) / 1_000_000);
    System.out.printf("max:   %d ms%n", hist.getMaxValue() / 1_000_000);
}

// Reset for the next reporting interval with hist.reset(), or use
// org.HdrHistogram.Recorder for atomic interval snapshots.

Key points
- Averages hide outliers. The p99 IS the user experience for 1% of requests, which is millions of users at scale.
- Tail amplification: a request fanning out to N backends has a p99 ≈ the max of N independent draws. With N=10, the fan-out p99 ≈ the p99.9 of one backend.
- Mitigations: hedged requests (send to two replicas, take the first), tied requests (cancel the loser), micro-partitioning, smaller fan-out.
- Coordinated omission: load-testing tools that wait for slow responses miss tail latency. Use open-loop tools (wrk2, k6) and record with HDR Histogram.
- Read "The Tail at Scale" by Dean & Barroso (2013); foundational paper, still relevant.
Follow-up questions
- Why does p99 amplify with fan-out?
- What's the simplest way to reduce tail latency?
- What is coordinated omission?
- Optimise for p50 or p99?
Gotchas
- Reporting average latency in dashboards: hides the tail entirely.
- Coordinated omission in load tests: the reported p99 understates the real p99 by 10-100x.
- Fan-out without thinking about amplification: the p99 of the user request is much worse than the p99 of any single backend.
- Hedging non-idempotent operations: the duplicated request means double execution; hedge only idempotent work or deduplicate server-side.
- Prometheus summary metrics with default config: percentiles drift over the reporting window.