HyperLogLog & Cardinality Estimation
The Problem: Counting Unique Things at Scale
You want to know how many unique users visited your site today. With 10,000 users, store them in a HashSet. With 100 million users across 50 services, that HashSet eats gigabytes of RAM per counter. And you have hundreds of counters: unique users per page, per campaign, per country, per hour.
Exact counting at this scale is a memory problem disguised as a counting problem. HyperLogLog solves it by trading a small amount of accuracy for a massive reduction in memory. About 12KB per counter, regardless of whether you are counting thousands or billions of distinct elements.
How It Actually Works
Take any element and hash it to a uniformly distributed binary string. Now look at the leading zeros. If the hash starts with 0001..., that is a run of three zeros before the first 1. The key insight: the longer the run of leading zeros you have observed, the more distinct elements you have probably seen.
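Here is a minimal sketch of that observation, assuming a 64-bit hash. The truncated SHA-1 and the helper names are purely illustrative; real implementations use fast non-cryptographic hashes such as MurmurHash or xxHash.

```python
import hashlib

def hash64(element: str) -> int:
    # Illustrative 64-bit hash: the first 8 bytes of a SHA-1 digest.
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:8], "big")

def leading_zero_run(h: int, bits: int = 64) -> int:
    # Number of zero bits before the first 1 in a `bits`-wide value.
    return bits if h == 0 else bits - h.bit_length()

print(leading_zero_run(hash64("user-42")))  # zero-run length for one element's hash
```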
Think of it like coin flips. Getting one head in a row is common. Getting ten heads in a row means you have been flipping for a while. If someone tells you the longest streak of heads they observed was 20 in a row, you can estimate they flipped the coin roughly 2^20 times (about a million).
HyperLogLog improves on this raw idea in three ways.
Stochastic averaging. Instead of keeping one maximum, split elements into m buckets (called registers) using the first p bits of the hash. Each register independently tracks its own longest zero-run from the remaining bits. This reduces variance dramatically. With m = 16384 registers (p = 14 bits), you get 16384 independent experiments instead of one.
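A sketch of the update path under those assumptions: 64-bit hashes, p = 14, the top 14 bits pick the register, and the zero-run is measured over the remaining 50 bits. Names and hash choice are illustrative, not any particular library's API.

```python
import hashlib

P = 14                  # index bits
M = 1 << P              # 16384 registers
registers = [0] * M

def hash64(element: str) -> int:
    return int.from_bytes(hashlib.sha1(element.encode()).digest()[:8], "big")

def add(element: str) -> None:
    h = hash64(element)
    idx = h >> (64 - P)                      # top 14 bits choose the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 50 bits
    rho = (64 - P) - rest.bit_length() + 1   # position of the first 1-bit (1-based)
    registers[idx] = max(registers[idx], rho)
```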
Harmonic mean correction. Averaging the register values directly would give you a biased estimate. HyperLogLog uses the harmonic mean instead, which handles outliers much better. The formula is: E = alpha_m * m^2 / sum(2^(-M[j])) where M[j] is the value in register j and alpha_m is a bias correction constant that depends on m.
Small range and large range corrections. When the estimate is very low (below 5/2 * m), some registers might still be at zero, which means the harmonic mean estimate is too high. HyperLogLog switches to a linear counting method for this range. At the high end (above 2^32 / 30 for 32-bit hashes), hash collisions start to matter, and a large range correction kicks in. The HyperLogLog++ paper from Google mostly eliminates the need for the large range correction by using 64-bit hashes.
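Putting the pieces together, here is a sketch of the estimator with the small-range correction. The alpha constants come from the original paper; the large-range branch is omitted, which is what 64-bit-hash implementations like HLL++ effectively do.

```python
import math

def alpha(m: int) -> float:
    # Bias-correction constants from the original HyperLogLog paper.
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)

def estimate(registers: list[int]) -> float:
    m = len(registers)
    raw = alpha(m) * m * m / sum(2.0 ** -r for r in registers)  # harmonic-mean estimate
    if raw <= 2.5 * m:
        zeros = registers.count(0)
        if zeros:
            return m * math.log(m / zeros)  # small range: fall back to linear counting
    return raw  # with 64-bit hashes the large-range correction is effectively unnecessary
```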
Memory and Accuracy
Each register stores a single number: the longest zero-run seen (more precisely, the position of the first 1-bit). With a 64-bit hash and 14 bits spent on the register index, the run is measured over the remaining 50 bits, so the value never exceeds about 50 and fits in 6 bits. With m = 16384 registers, that is 16384 * 6 / 8 = 12288 bytes. About 12KB.
The standard error formula is 1.04 / sqrt(m):
| Registers (m) | Bits (p) | Memory | Standard Error |
|---|---|---|---|
| 256 | 8 | 192 bytes | 6.50% |
| 1024 | 10 | 768 bytes | 3.25% |
| 4096 | 12 | 3 KB | 1.63% |
| 16384 | 14 | 12 KB | 0.81% |
| 65536 | 16 | 48 KB | 0.41% |
Redis chose 16384 registers as the default for PFADD/PFCOUNT. That gives you 0.81% error in 12KB. Most analytics workloads do not need better than 1% accuracy for unique counts, so this lands in a sweet spot.
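With Redis it is a handful of commands. A quick sketch using the redis-py client; the key names here are made up for the example.

```python
import redis

r = redis.Redis()

# PFADD folds elements into the sketch stored at the key.
r.pfadd("uniques:2024-06-01", "user-1", "user-2", "user-3")
r.pfadd("uniques:2024-06-01", "user-2")   # duplicates do not change the count

print(r.pfcount("uniques:2024-06-01"))    # approximate distinct count
```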
The Merge Property
This is where HyperLogLog becomes genuinely powerful in distributed systems. Merging two HLL sketches is trivial: take the maximum of each corresponding register. If sketch A has register[5] = 7 and sketch B has register[5] = 9, the merged sketch has register[5] = 9.
This means you can:
- Run HLL on each partition of a sharded database, then merge to get the global count
- Maintain per-hour HLL sketches and merge them to get daily or weekly unique counts
- Compute the intersection of two HLL sketches (via inclusion-exclusion: |A union B| = |A| + |B| - |A intersect B|), though intersection estimates get noisy when the sets overlap significantly
Compare this to exact counting: to merge two exact sets, you need to transmit every element and deduplicate. With HLL, you transmit 12KB and take element-wise max.
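In code the merge is a one-liner over the register arrays. A minimal sketch; the comment shows the inclusion-exclusion estimate mentioned above, using an `estimate()` function like the one sketched earlier.

```python
def merge(a: list[int], b: list[int]) -> list[int]:
    # Union of two sketches: element-wise maximum of their registers.
    assert len(a) == len(b), "sketches must use the same register count"
    return [max(x, y) for x, y in zip(a, b)]

# Intersection via inclusion-exclusion (noisy when the overlap is small):
#   |A ∩ B| ≈ estimate(a) + estimate(b) - estimate(merge(a, b))
```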
Sparse vs Dense Representation
When an HLL sketch is first created, most registers are at zero. Storing all 16384 registers wastes memory. Redis and Google's HLL++ use a sparse representation for low-cardinality cases: they store only the non-zero registers as a sorted list of (index, value) pairs.
Once enough registers fill up (typically when the list exceeds ~40% of the dense size), the implementation switches to the dense 12KB array. This optimization makes HLL viable even when you have millions of counters, most of which track small cardinalities.
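A toy version of the sparse path, with an illustrative conversion threshold; the exact trigger and encoding vary by implementation.

```python
DENSE_BYTES = 16384 * 6 // 8            # 12 KB dense array
SPARSE_LIMIT = int(0.4 * DENSE_BYTES)   # illustrative switch-over point

sparse: dict[int, int] = {}             # register index -> value, non-zero entries only

def sparse_add(idx: int, rho: int) -> None:
    if rho > sparse.get(idx, 0):
        sparse[idx] = rho

def should_convert(bytes_per_entry: int = 4) -> bool:
    # Convert to the dense array once the sparse encoding stops saving memory.
    return len(sparse) * bytes_per_entry > SPARSE_LIMIT

def to_dense(m: int = 16384) -> list[int]:
    dense = [0] * m
    for idx, rho in sparse.items():
        dense[idx] = rho
    return dense
```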
When to Use HLL
Use it when you need to count distinct elements at scale and can tolerate about 1% error. Unique visitors, unique IP addresses, unique search queries, unique events. Anything where "approximately 14.2 million" is as useful as "exactly 14,198,347."
Do not use it when you need exact counts (billing, compliance reporting), when you need to know which elements are in the set (use a Bloom filter or an actual set), or when your cardinality is small enough that exact counting is cheap (under a few million, just use a set).
Do not use it for intersection estimation when the sets have very different sizes or low overlap. The inclusion-exclusion formula amplifies errors. If set A has 100 million elements and set B has 200 million, but they only share 1000, the intersection estimate will be dominated by noise. For intersection, consider MinHash or purpose-built sketch structures.
Variants Worth Knowing
HyperLogLog++ (Google, 2013) improved the original with 64-bit hashes (eliminating the large range correction), a more accurate bias correction for medium cardinalities using empirical data rather than a formula, and the sparse representation. This is what most modern implementations use.
Streaming HLL implementations in Flink, Kafka Streams, and Spark process elements in a single pass with constant memory. The sketch updates are associative and commutative, which means they parallelize perfectly across stream partitions.
LogLog-Beta (2016) is a simpler variant that achieves similar accuracy to HLL++ without the piecewise bias correction. It keeps the harmonic mean but replaces the range-specific corrections with a single formula based on the number of empty registers, which naturally handles small and large ranges. Not widely adopted yet, but worth tracking.
Key Points
- Estimates the count of distinct elements in a stream using roughly 12KB of memory regardless of cardinality. That is 12KB for 1 million uniques or 1 billion uniques
- Standard error of 1.04/sqrt(m) where m is the number of registers. With 16384 registers (the Redis default), you get about 0.81% error
- The core insight: the longest run of leading zeros in a hash tells you something about how many distinct values you have seen. More distinct values means longer runs become more likely
- HyperLogLog structures are fully mergeable. You can compute cardinality of a union by merging HLL sketches from different servers without touching the original data
- Redis PFCOUNT, ClickHouse uniqHLL12, BigQuery APPROX_COUNT_DISTINCT, and Presto approx_distinct all use HyperLogLog under the hood
Common Mistakes
- ✗ Using exact COUNT(DISTINCT) on datasets above 100M rows when approximate is acceptable. A query that takes 45 seconds with exact counting takes 200ms with HLL
- ✗ Not understanding that HLL cannot tell you WHICH elements are in the set, only how many. If you need membership testing, you want a Bloom filter instead
- ✗ Forgetting that the error is relative to the true cardinality. 0.81% error on 1 billion uniques means you could be off by 8 million. For dashboards that is fine. For billing, it is not
- ✗ Using too few registers to save memory. Going from 16384 to 256 registers saves 12KB but increases error from 0.81% to 6.5%. The memory savings is almost never worth the accuracy loss