Streaming Quantiles: T-Digest & DDSketch
Why Percentiles Break When You Aggregate
Here is a scenario that plays out at every company running microservices at scale. You have 50 instances of a service behind a load balancer. Each instance reports its own P99 latency to a monitoring system. The dashboard averages those 50 P99 values and displays "P99: 50ms." An oncall engineer sees this and thinks latency is fine.
But the real P99, computed over all requests across all 50 instances, is 380ms. What happened? The averaging hid the tail. Most instances were handling easy requests at low latency. A few instances were handling expensive queries and reporting P99s of 200-400ms. Averaging 47 values of ~30ms with 3 values of ~350ms gives you about 50ms. The actual user experience at the 99th percentile was nearly 8x worse than what the dashboard showed.
This is not an edge case. It is the default failure mode of every monitoring system that tries to aggregate percentiles naively. Percentiles are not additive. You cannot average them, sum them, or combine them with any simple arithmetic operation and get a correct answer. The only way to compute a correct global percentile is to have access to the full sorted dataset, or to use a data structure specifically designed to approximate quantiles in a mergeable way.
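The failure is easy to reproduce numerically. A small demonstration (plain NumPy; the shard counts and distributions are invented to mirror the scenario above):

```python
import numpy as np

rng = np.random.default_rng(42)

# 47 instances serving cheap requests, 3 serving expensive ones
fast = [rng.exponential(6.5, 10_000) for _ in range(47)]    # P99 ~ 30ms
slow = [rng.normal(350.0, 30.0, 10_000) for _ in range(3)]  # P99 ~ 420ms
shards = fast + slow

avg_of_p99s = np.mean([np.percentile(s, 99) for s in shards])
true_p99 = np.percentile(np.concatenate(shards), 99)

print(f"average of per-shard P99s: {avg_of_p99s:5.0f} ms")  # ~ 53 ms
print(f"true global P99:           {true_p99:5.0f} ms")     # ~380 ms
```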
T-Digest: Concentrating Resolution Where It Matters
T-Digest, created by Ted Dunning in 2013, takes a clever approach to the quantile problem. Instead of storing all data points or using fixed-width bins, it represents the distribution as a collection of weighted centroids. The key insight is about where to spend resolution.
For latency monitoring, the P50 does not need much precision. Whether the median is 22ms or 24ms rarely changes any decision. But the difference between a P99 of 150ms and 300ms is the difference between meeting your SLO and breaching it. T-Digest exploits this by placing more centroids near the tails (near P0 and P100) and fewer centroids in the middle.
A centroid has two values: a mean and a weight (count of data points it represents). When a new data point arrives, T-Digest finds the nearest centroid and tries to absorb the point into it by updating the mean and incrementing the weight. But there is a constraint: centroids near the tails are allowed to absorb fewer points than centroids in the middle. This is controlled by a scale function that maps quantile position to maximum centroid size.
The result is a structure that might have 300 centroids total, with 50 of them covering the bottom 1% of the distribution and another 50 covering the top 1%, while only 200 cover the middle 98%. When you query "what is P99?", you are looking at a region with dense centroid coverage, so the interpolation between centroids is accurate.
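To make the mechanics concrete, here is a deliberately minimal sketch of the data structure. It is not Dunning's full algorithm: there is no input buffering or periodic re-clustering, and it uses the original paper's 4nq(1-q)/delta size bound rather than the newer scale functions discussed below.

```python
class Centroid:
    def __init__(self, mean, weight=1):
        self.mean = mean
        self.weight = weight

    def absorb(self, x):
        # running weighted-mean update
        self.weight += 1
        self.mean += (x - self.mean) / self.weight

class TinyTDigest:
    """Simplified t-digest: nearest-centroid insertion with a
    weight cap that shrinks toward the tails."""

    def __init__(self, delta=100):
        self.delta = delta
        self.centroids = []  # kept sorted by mean
        self.total = 0

    def _max_weight(self, q):
        # original t-digest bound: 4 * n * q * (1-q) / delta.
        # Smallest near q=0 and q=1, so tail centroids stay tiny.
        return max(1, 4 * self.total * q * (1 - q) / self.delta)

    def add(self, x):
        self.total += 1
        if not self.centroids:
            self.centroids.append(Centroid(x))
            return
        nearest = min(self.centroids, key=lambda c: abs(c.mean - x))
        # approximate quantile position of the nearest centroid
        before = sum(c.weight for c in self.centroids
                     if c.mean < nearest.mean)
        q = (before + nearest.weight / 2) / self.total
        if nearest.weight + 1 <= self._max_weight(q):
            nearest.absorb(x)
        else:
            self.centroids.append(Centroid(x))
            self.centroids.sort(key=lambda c: c.mean)
```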
T-Digest Compression and Scale Functions
The compression parameter delta (typically 100-300) controls how many centroids T-Digest maintains. Higher delta means more centroids, more memory, and more accuracy. Lower delta saves memory but reduces tail accuracy.
The relationship between delta and accuracy is not uniform across quantiles. At compression 100:
| Quantile | Typical Relative Error |
|---|---|
| P50 | < 0.1% |
| P90 | < 0.5% |
| P99 | 1-3% |
| P99.9 | 3-10% |
At compression 300, the P99.9 error drops to roughly 1-3%. For SLO monitoring where you care about P99.9, that jump from 100 to 300 compression is worth the extra memory (roughly 3x more centroids, which means going from ~2.4 KB to ~7.2 KB).
T-Digest has gone through several scale function iterations. The original (k0) used a simple linear mapping. The k1 function used an arcsin-based mapping that concentrated centroids more aggressively at the tails. k2 and k3 are more recent refinements that improve accuracy for extreme quantiles (P99.99+). Most implementations default to k2 or k3. Unless you are doing something unusual, the default is fine.
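The first two generations are compact enough to write down. A centroid may cover the quantile range [q_lo, q_hi] only if k(q_hi) - k(q_lo) <= 1, so wherever k is steep, centroids must stay small. A sketch following the published definitions:

```python
import math

def k0(q, delta):
    # original linear scale: constant slope means every centroid
    # gets the same weight budget, uniform accuracy everywhere
    return delta * q / 2

def k1(q, delta):
    # arcsin scale: slope diverges as q -> 0 or q -> 1, forcing
    # tiny centroids (dense resolution) at the tails
    return (delta / (2 * math.pi)) * math.asin(2 * q - 1)

# a centroid spanning [q_lo, q_hi] is legal only if the scale
# function rises by at most 1 across its span
def span_ok(k, q_lo, q_hi, delta):
    return k(q_hi, delta) - k(q_lo, delta) <= 1
```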
Merging T-Digests
This is the property that makes T-Digest useful in distributed systems. Given T-Digest sketches from 50 different servers, you can merge them into a single T-Digest that approximates the quantiles of the combined dataset. The merge algorithm collects all centroids from all sketches, sorts them by mean, and then re-compresses using the same scale function constraints. The merged sketch maintains the same accuracy guarantees as if all data had been fed into a single sketch.
In practice, the merge pipeline looks like this: each server maintains a local T-Digest that absorbs incoming measurements. Every 10-60 seconds, the server serializes its T-Digest (a few KB) and sends it to a central aggregator. The aggregator merges all incoming sketches into a global T-Digest and queries it for percentiles. The global percentiles are correct, not averages of local percentiles.
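In code, the merge step looks roughly like this (reusing the Centroid class and the simplified size bound from the sketch above; real implementations stream through the sorted centroids rather than materializing the full list):

```python
def merge_digests(digests, delta=100):
    """Pool all centroids, sort by mean, and re-compress under
    the same size bound used at insert time."""
    pooled = sorted((c for d in digests for c in d.centroids),
                    key=lambda c: c.mean)
    total = sum(c.weight for c in pooled)

    merged, seen = [], 0
    for c in pooled:
        q = (seen + c.weight / 2) / total  # quantile position
        limit = max(1, 4 * total * q * (1 - q) / delta)
        if merged and merged[-1].weight + c.weight <= limit:
            # fold into the previous centroid (weighted mean)
            m = merged[-1]
            m.mean = ((m.mean * m.weight + c.mean * c.weight)
                      / (m.weight + c.weight))
            m.weight += c.weight
        else:
            merged.append(Centroid(c.mean, c.weight))
        seen += c.weight
    return merged
```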
Datadog uses this exact pattern, except they chose DDSketch instead of T-Digest for reasons discussed below.
DDSketch: Relative Error, Not Absolute
DDSketch (Masson et al., 2019, developed at Datadog) takes a fundamentally different approach. Instead of centroids, it uses a histogram with logarithmically-spaced bin boundaries. Each bin covers a range of values where the ratio between the maximum and minimum is at most (1 + alpha), where alpha is the relative error parameter.
With alpha = 0.01 (1% relative error), the bins might look like: [100, 101], [101, 102.01], [102.01, 103.03], and so on. Each bin is 1% wider than the last. When a value arrives, it lands in the bin whose range contains it. To query a quantile, walk the bins in order, accumulating counts until you reach the desired percentile, and return the bin midpoint.
The guarantee: the returned value is within alpha of the true quantile. If the true P99 is 200ms, DDSketch with alpha=0.01 returns a value between 198ms and 202ms. If the true P99 is 2000ms, it returns between 1980ms and 2020ms. The error scales with the value itself, which is exactly what you want for latency monitoring.
Why relative error is superior for most monitoring use cases: an absolute error of 2ms is wonderful when latency is 200ms (1% error) but useless when latency is 2ms (100% error). Relative error gives proportional precision everywhere.
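The whole structure fits in a few lines. A minimal version (positive values only, no bin-count limit yet, using the simplified gamma = 1 + alpha binning described above):

```python
import math
from collections import defaultdict

class TinyDDSketch:
    def __init__(self, alpha=0.01):
        self.gamma = 1 + alpha        # max/min ratio within one bin
        self.bins = defaultdict(int)  # bin index -> count
        self.count = 0

    def add(self, x):
        # x lands in bin i, where gamma^(i-1) < x <= gamma^i
        i = math.ceil(math.log(x, self.gamma))
        self.bins[i] += 1
        self.count += 1

    def quantile(self, q):
        # walk bins in order, accumulating counts up to the target
        # rank; any value in the landing bin is within alpha
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.bins):
            seen += self.bins[i]
            if seen > rank:
                # representative value near the middle of the bin
                return 2 * self.gamma ** i / (1 + self.gamma)
        raise ValueError("empty sketch")
```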
DDSketch Memory Management
The logarithmic binning of DDSketch means the number of bins grows with the range of observed values, not the number of observations. If all your latencies fall between 1ms and 10,000ms, the number of bins with alpha=0.01 is approximately log(10000/1) / log(1.01), which is about 926 bins. At 4 bytes per counter, that is ~3.7 KB.
But what if an outlier pushes the range to 1ms-1,000,000ms? The number of bins jumps to about 1,388. To bound memory, DDSketch implements collapsing: when the number of bins exceeds a limit, the lowest bins (representing the fastest requests) are merged together. This sacrifices accuracy at the low end of the distribution, which is usually acceptable because you care more about tail latency than about whether the fastest requests took 1ms or 2ms.
Datadog's production implementation uses a maximum of 2048 bins, which provides sub-1% relative error for latency values spanning five orders of magnitude. The serialized sketch is typically 1-4 KB, comparable to T-Digest.
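Collapsing bolts onto the sketch above in a few lines. The max_bins parameter here is hypothetical, with Datadog's published default of 2048 used for concreteness:

```python
def collapse(sketch, max_bins=2048):
    """Merge the lowest-index bins until the budget is met,
    sacrificing resolution for the fastest requests."""
    while len(sketch.bins) > max_bins:
        lowest, second = sorted(sketch.bins)[:2]
        sketch.bins[second] += sketch.bins.pop(lowest)
```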
GK Summary: The Classic Algorithm
Before T-Digest and DDSketch, there was the Greenwald-Khanna (GK) summary from 2001. It provides epsilon-approximate quantiles with O((1/epsilon) * log(epsilon*N)) space. The guarantee is absolute: a query for quantile phi returns a value whose true rank is within epsilon*N of the target rank phi*N.
GK works by maintaining a sorted summary of tuples (value, gap, delta) where gap and delta bound the uncertainty in each value's rank. When the summary grows too large, tuples are merged (compressed) while preserving the error guarantee.
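In code, the summary is just a sorted list of those tuples, and the query side is straightforward; the insert and compress rules, where the difficulty lives, are omitted here. A sketch of the standard quantile query:

```python
import math

def gk_query(summary, phi, n, eps):
    """Query a GK summary: a sorted list of (value, g, delta)
    tuples maintained by the insert/compress rules (not shown).
    r_min(i) = g[0] + ... + g[i]; r_max(i) = r_min(i) + delta[i]."""
    target = math.ceil(phi * n)
    r_min = 0
    for i, (value, g, delta) in enumerate(summary):
        r_min += g
        if i + 1 == len(summary):
            return value  # last tuple holds the maximum
        g_next, d_next = summary[i + 1][1], summary[i + 1][2]
        # return v_i just before the next tuple's max rank overshoots
        if r_min + g_next + d_next > target + eps * n:
            return value
```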
GK is less popular than T-Digest and DDSketch in modern systems for several reasons. The absolute error guarantee means you get the same absolute precision at P50 and P99, which wastes resolution in the middle. Merge operations are more complex than T-Digest. And the implementation is trickier to get right compared to DDSketch's simple binning approach.
That said, GK is the theoretically clean solution and appears in academic work as the baseline. Apache Spark's approxQuantile uses a GK-based algorithm.
Comparison Table
| Property | T-Digest | DDSketch | GK Summary |
|---|---|---|---|
| Error type | Heuristic, better at tails | Relative (alpha) | Absolute (epsilon) |
| Typical memory | 2-8 KB | 1-4 KB | 1-10 KB |
| Merge support | Yes, with re-compression | Yes, trivial (add bins) | Yes, but complex |
| Best for | General percentile monitoring | SLO compliance, relative precision | Academic baselines, Spark |
| Worst case accuracy | P99.9 can degrade if compression is low | Outlier ranges expand bin count | Uniform precision wastes resolution at median |
| Implementation complexity | Moderate (scale functions, buffer management) | Low (logarithmic binning) | High (tuple compression logic) |
| Used by | Elasticsearch, ClickHouse | Datadog, Envoy (proposed) | Apache Spark |
DDSketch wins on simplicity and theoretical cleanliness. T-Digest wins on ecosystem adoption and has a longer track record. For a new system today, DDSketch is the easier choice unless you need compatibility with tools that already use T-Digest.
Prometheus Histograms: A Cautionary Tale
Prometheus, the most widely deployed metrics system in the Kubernetes ecosystem, does not use any of these algorithms by default. Instead, it uses manually configured histogram buckets with fixed boundaries. You define buckets like [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s], and Prometheus counts how many observations fall into each bucket.
The histogram_quantile() function then uses linear interpolation within buckets to estimate quantiles. If 1% of observations fall between the 100ms and 250ms bucket boundaries, and you ask for P99, Prometheus linearly interpolates within that 100ms-250ms range. But latency is not linearly distributed within a bucket. The actual P99 might be 230ms while Prometheus reports 175ms.
This gets worse with poorly chosen boundaries. If your P99 is actually 87ms but your nearest bucket boundaries are 50ms and 100ms, the interpolation produces an answer somewhere between 50ms and 100ms depending on the counts. With enough traffic concentration at the boundary, the estimate can be off by 50% or more.
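The interpolation is simple enough to reproduce, which makes the failure mode easy to see. A sketch of the same linear interpolation histogram_quantile() performs, with invented cumulative bucket counts:

```python
def interpolate_quantile(bounds, cumulative, q):
    """bounds: bucket upper bounds; cumulative: cumulative counts.
    Linear interpolation within the bucket containing rank q,
    mirroring what histogram_quantile() does."""
    rank = q * cumulative[-1]
    lo = 0.0
    for ub, c, prev in zip(bounds, cumulative, [0] + cumulative[:-1]):
        if c >= rank:
            return lo + (ub - lo) * (rank - prev) / (c - prev)
        lo = ub
    return bounds[-1]

# 10,000 requests; the 100-250ms bucket holds the tail, which in
# reality is clustered near 230ms:
bounds = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25]      # seconds
cumulative = [1000, 3000, 7000, 9000, 9800, 10000]
print(interpolate_quantile(bounds, cumulative, 0.99))  # 0.175 (175ms)
```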
Prometheus native histograms (introduced in Prometheus 2.40+) attempt to fix this by using exponentially-spaced buckets, which is conceptually similar to DDSketch. If your organization is running Prometheus and cares about accurate tail latency, migrating to native histograms is one of the highest-impact monitoring improvements available.
Production Architecture for Global Percentiles
The architecture for correct global percentiles across a distributed fleet looks like this.
Each application instance maintains a local quantile sketch (T-Digest or DDSketch). The sketch processes every request's latency measurement in constant time with constant memory. Every 10-60 seconds, the instance serializes the sketch (a few KB) and ships it to the metrics backend, then resets its local sketch for the next window.
The metrics backend receives sketches from all instances for a given time window. It merges them into a single global sketch and computes percentiles from the merged result. The global P99 is now correct, not an average of local P99s.
For alerting, you query the merged sketch. For dashboards, you store the merged sketch per time window and query it at display time. Some systems (like Datadog) store the raw sketches and merge on read, which gives you flexibility to re-slice by arbitrary dimensions (datacenter, service version, customer tier) without pre-aggregation.
The critical detail: all instances must use the same sketch parameters. If server A uses T-Digest with compression 100 and server B uses compression 300, merging them is technically possible but the accuracy guarantees become murky. Standardize your sketch parameters in a shared library and do not let individual services override them.
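Putting it together, the per-instance side of the pipeline might look like the following (the class name and the serialize() helper are hypothetical; TinyDDSketch is the minimal sketch from earlier):

```python
import threading
import time

# fixed fleet-wide in a shared library; if instances disagree on
# alpha, merged sketches lose their accuracy guarantee
SKETCH_ALPHA = 0.01

class LatencyReporter:
    def __init__(self, ship, interval_s=30):
        self.sketch = TinyDDSketch(alpha=SKETCH_ALPHA)
        self.lock = threading.Lock()
        self.ship = ship  # callable that sends bytes to the backend
        threading.Thread(target=self._flush_loop,
                         args=(interval_s,), daemon=True).start()

    def observe(self, latency_ms):
        with self.lock:
            self.sketch.add(latency_ms)

    def _flush_loop(self, interval_s):
        while True:
            time.sleep(interval_s)
            with self.lock:
                # swap in a fresh sketch for the next window
                full, self.sketch = (self.sketch,
                                     TinyDDSketch(alpha=SKETCH_ALPHA))
            self.ship(serialize(full))  # serialize() is a placeholder
```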
Choosing Between T-Digest and DDSketch
For most teams, the choice comes down to ecosystem. If you are already using Elasticsearch or ClickHouse for metrics, T-Digest is built in. If you are shipping to Datadog, DDSketch is what their agent uses natively. If you are building from scratch, DDSketch is the simpler implementation with cleaner theoretical guarantees.
There is one technical distinction worth considering. T-Digest's accuracy is empirically validated but not formally proven. The scale functions work well in practice across a wide range of distributions, but the error bounds are heuristic. DDSketch has a formal proof: relative error is bounded by alpha regardless of the input distribution. If you need to put a number in an SLO document ("our P99 estimate is within 1% of the true value"), DDSketch lets you make that claim with mathematical backing. T-Digest does not.
On the flip side, T-Digest tends to produce better absolute accuracy at extreme quantiles (P99.99) for typical latency distributions, because the centroid concentration at the tails is adaptive rather than predetermined. DDSketch's accuracy at P99.99 depends on whether the bins in that region have enough resolution, which depends on the alpha parameter. For very extreme tails, you might need alpha = 0.001 (0.1% relative error), which increases the bin count.
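The cost of tightening alpha is easy to quantify with the bin-count formula from the memory section (same 1ms-10s value range assumed):

```python
import math

span = 10_000  # ratio of max to min observed value (1ms .. 10s)
for alpha in (0.01, 0.001):
    bins = math.log(span) / math.log(1 + alpha)
    print(f"alpha={alpha}: ~{bins:.0f} bins")
# alpha=0.01:  ~926 bins
# alpha=0.001: ~9215 bins -- 10x the memory for 10x tighter error
```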
Neither structure handles negative values naturally, though both have been extended to support them. For latency monitoring, this is irrelevant. For financial metrics (profit/loss), you would need the signed variants.
Key Points
- You cannot average percentiles. The P99 of shard A and the P99 of shard B does not give you the global P99. This is the fundamental reason streaming quantile sketches exist
- T-Digest uses weighted centroids that cluster more tightly at the tails (P1, P99) where accuracy matters most for latency monitoring. The compression parameter (typically 100-300) controls the accuracy-memory trade-off
- DDSketch uses logarithmic bin boundaries to provide a relative error guarantee. If relative error is 1%, your P99 estimate is within 1% of the true value regardless of the underlying distribution shape
- Both T-Digest and DDSketch support merging from multiple servers, which makes them suitable for computing global percentiles across a distributed fleet without centralizing raw data
- Datadog (DDSketch), Elasticsearch and ClickHouse (T-Digest), and Spark (GK) all run streaming quantile algorithms in production; Prometheus's classic histogram_quantile() does not, relying instead on linear interpolation over fixed buckets
Common Mistakes
- ✗ Averaging percentiles across shards or time windows. This is mathematically wrong and produces values that can be arbitrarily far from the true global percentile. A system with 100 shards each reporting P99 = 50ms might actually have a global P99 of 500ms if the tail latency is concentrated on a few shards
- ✗ Using fixed-width histogram bins for long-tail latency distributions. Bins of [0-10ms, 10-20ms, 20-30ms, ...] waste resolution on the low end where everything clusters and have no resolution at the tail where the interesting behavior lives. Exponential or logarithmic bins are almost always better
- ✗ Setting T-Digest compression too low for tail accuracy. A compression parameter of 25 might seem fine when you look at P50, but the P99.9 estimate can be off by 30% or more. For SLO monitoring, use compression 200+ and verify tail accuracy against known distributions
- ✗ Treating Prometheus default histogram buckets as precise quantile estimates. Prometheus histograms use linear interpolation within fixed buckets, and the default buckets (5ms, 10ms, 25ms, 50ms, ...) produce quantile estimates that can be wildly off if your actual latency distribution does not align with the bucket boundaries