Count-Min Sketch & Frequency Estimation
The Problem: Counting Frequencies in a Firehose
You are processing a network tap at a large ISP. Packets fly past at millions per second, each with a source IP. Someone asks: "Which IPs sent the most traffic in the last hour?" You cannot store a counter per IP address because there are 4 billion possible IPv4 addresses and the stream never stops. Even if most IPs are inactive, you do not know which ones will appear, so you cannot pre-allocate.
You could sample. Take every 100th packet and count those. But sampling misses bursty traffic patterns and gives poor estimates for IPs that send just above the threshold you care about. What you want is a data structure that processes every single packet in constant time, uses bounded memory, and gives you a reasonable frequency estimate for any IP you ask about later.
Count-Min Sketch does exactly this. It was introduced by Cormode and Muthukrishnan in 2005, and it has since become the default frequency estimation structure in streaming systems.
The Two-Dimensional Counter Array
A Count-Min Sketch is a two-dimensional array of counters with d rows and w columns. Each row has its own hash function that maps elements to column positions in that row.
To record that element x appeared, hash x with each of the d hash functions to get d column indices, one per row. Increment the counter at each of those d positions. To estimate the frequency of x, hash x the same way, look up the d counters, and return the smallest one.
Why the minimum? Every counter in the sketch only goes up. When two different elements hash to the same position in some row, that counter gets inflated by both elements' counts. Taking the minimum across all d rows gives you the position with the least collision noise. At least one of the d counters might be free of collisions for your element, and the minimum finds it.
The guarantee is one-sided: the returned estimate is always greater than or equal to the true count. It never underestimates. The question is only "by how much does it overestimate?"
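As a concrete reference point, here is a minimal Python sketch of the structure just described. The class name, the per-row hashing scheme (blake2b with the row number folded into the input), and the string-typed items are illustrative choices, not taken from any particular library.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.counters = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Each row gets its own hash function by folding the row number into the digest.
        digest = hashlib.blake2b(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.counters[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # The minimum across rows is the least collision-inflated counter.
        return min(self.counters[row][self._index(row, item)]
                   for row in range(self.depth))


cms = CountMinSketch(width=2719, depth=5)
cms.add("10.0.0.1", 3)
print(cms.estimate("10.0.0.1"))   # always >= 3, never less
```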
Error Bounds and Parameter Tuning
The Count-Min Sketch provides a clean probabilistic guarantee:
Estimate(x) <= TrueCount(x) + epsilon * N with probability at least 1 - delta
where N is the total number of events inserted, w = ceil(e / epsilon), and d = ceil(ln(1 / delta)).
The epsilon controls accuracy. If epsilon is 0.001 and you have processed 1 billion events, the overestimate for any element is at most 1 million with high probability. The delta controls confidence. With delta = 0.01, you have a 99% chance that the estimate falls within the epsilon bound.
Here is what this looks like for different parameter choices:
| Width (w) | Depth (d) | Memory | epsilon | delta | Max overcount (N=1B) |
|---|---|---|---|---|---|
| 272 | 5 | 5.4 KB | 0.01 | 0.01 | 10,000,000 |
| 2,719 | 5 | 54 KB | 0.001 | 0.01 | 1,000,000 |
| 27,183 | 5 | 544 KB | 0.0001 | 0.01 | 100,000 |
| 27,183 | 7 | 760 KB | 0.0001 | 0.001 | 100,000 |
| 271,829 | 5 | 5.4 MB | 0.00001 | 0.01 | 10,000 |
Notice that increasing width (more columns) improves accuracy directly. Increasing depth (more rows) improves confidence. If you need both tight accuracy and high confidence, you pay in both dimensions. But even the most aggressive setting here costs only 5.4 MB to give you per-element error bounded at 10,000 on a stream of 1 billion events. Storing exact counts for 1 billion distinct elements would cost tens of gigabytes.
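To make the sizing arithmetic concrete, the small helper below turns a target epsilon and delta into width, depth, and memory using the formulas above. The function name and the 4-byte counter default are illustrative assumptions.

```python
import math


def cms_dimensions(epsilon: float, delta: float, counter_bytes: int = 4):
    # w = ceil(e / epsilon), d = ceil(ln(1 / delta))
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth, width * depth * counter_bytes


# epsilon=0.0001, delta=0.01 reproduces the 544 KB row in the table above.
print(cms_dimensions(0.0001, 0.01))   # (27183, 5, 543660)
```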
Why Minimum, Not Average
Taking the average of the d counters seems natural. After all, averaging reduces noise in most statistical settings. But CMS counters have a crucial property: they can only be overcounted, never undercounted. The noise is always positive. Averaging positive noise gives you a positive bias. The minimum gives you the counter with the least positive bias.
Consider an element with true count 500 and three row counters showing 500, 723, and 512. The average is 578, which is 78 over. The minimum is 500, which is exact. The minimum found a row where no other heavy hitter collided with your element. That happens more often than you might expect, especially when you tune d to be large enough (5-7 rows is typical).
There is one caveat. The minimum is not always correct. Sometimes all d rows have collisions, and the minimum still overcounts. That is what the delta parameter controls: the probability that all rows collide simultaneously.
Conservative Update: Reducing Overcounts
Standard CMS increments all d counters unconditionally. Conservative update (Estan and Varghese, 2002) is smarter: when you increment, check all d counters for element x, find the current minimum, and only increment counters that are at the minimum value.
The intuition: if counter[row2] is already 500 and the current minimum is 300, then counter[row2] has accumulated 200 extra counts from collisions. Incrementing it further just adds more noise. Only increment the counters that are at the minimum, because those are the ones most likely to reflect the true count.
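A hedged sketch of that rule, written against the same counter layout as the class above (the helper names are illustrative): every counter is raised to at most the current minimum plus the increment, which is equivalent to incrementing only the counters sitting at the minimum.

```python
import hashlib


def _row_index(row: int, item: str, width: int) -> int:
    digest = hashlib.blake2b(f"{row}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % width


def conservative_add(counters, item: str, count: int = 1) -> None:
    depth, width = len(counters), len(counters[0])
    positions = [(row, _row_index(row, item, width)) for row in range(depth)]
    floor = min(counters[row][col] for row, col in positions) + count
    for row, col in positions:
        # Counters already inflated above the new floor are left untouched.
        counters[row][col] = max(counters[row][col], floor)
```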
Conservative update dramatically reduces overestimation for heavy hitters. In experiments on Zipfian workloads, conservative update reduces the average error by 50-80% compared to standard update. The downside: conservative update sketches cannot be merged by simply adding corresponding counters. This matters in distributed settings where you want to maintain a CMS per stream partition and merge them periodically. If you need mergeability, you are stuck with standard update.
Count-Mean-Min: Handling Skewed Distributions
Real-world frequency distributions are typically Zipfian. A few elements dominate, and there is a long tail of rare elements. CMS struggles with the long tail because the heavy hitters contaminate counters across all rows. A rare element with a true count of 2 might share a counter with a heavy hitter counted 10 million times, making the noise-to-signal ratio absurd.
Count-Mean-Min (Deng and Rafiei, 2007) addresses this by subtracting an estimate of the noise floor from each counter. For row i, the noise estimate is (N - counter[row_i][hash_i(x)]) / (w - 1), which represents the average count contributed by all other elements hashing to that row. The final estimate is the median of the noise-corrected values across all rows, clipped to zero.
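A sketch of that query, reusing the _row_index helper from the conservative-update snippet and assuming the caller maintains the running total N. The clipping to zero follows the description above; some implementations additionally cap the result at the plain Count-Min estimate.

```python
import statistics


def count_mean_min_estimate(counters, item: str, total_n: int) -> float:
    depth, width = len(counters), len(counters[0])
    corrected = []
    for row in range(depth):
        c = counters[row][_row_index(row, item, width)]
        noise = (total_n - c) / (width - 1)   # average count contributed by other elements
        corrected.append(c - noise)
    return max(0.0, statistics.median(corrected))
```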
In practice, Count-Mean-Min gives much better estimates for infrequent items, sometimes by an order of magnitude. The cost is computational: you need the total count N and a median computation per query. For heavy hitters detection where you only care about the top items, standard CMS is usually sufficient. For queries like "how many times did this specific low-volume user trigger an event," Count-Mean-Min is worth the extra work.
Heavy Hitters: Finding the Top-K
One of the most common CMS applications is finding heavy hitters: elements whose frequency exceeds some fraction of the total stream. The algorithm is straightforward.
Maintain a CMS and a min-heap of size k. For each element in the stream, increment it in the CMS, query its estimated frequency, and if the estimate exceeds the smallest element in the heap, insert it (evicting the smallest). At the end, the heap contains the approximate top-k elements.
This works because CMS overcounts. Any element truly in the top-k will have an estimated frequency at least as high as its true frequency, so it will not be missed. Some elements slightly below the threshold might sneak in due to overcounting, but they will be close to the boundary anyway. The RedisBloom module for Redis ships a Count-Min Sketch data type (CMS.INCRBY and CMS.QUERY) that can back this pattern; its separate Top-K type solves the same problem but is built on the HeavyKeeper algorithm rather than a CMS plus a heap.
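Sketched below is one way to wire this loop together on top of the CountMinSketch class from earlier. Heap maintenance is kept deliberately simple (a tracked item's entry is rebuilt in O(k) when it reappears), which is fine for illustration but not tuned for production.

```python
import heapq


def top_k(stream, cms, k: int):
    heap = []        # min-heap of (estimated_count, item), one entry per tracked item
    tracked = set()
    for item in stream:
        cms.add(item)
        est = cms.estimate(item)
        if item in tracked:
            # Refresh this item's stale heap entry with the new estimate.
            heap = [(e, i) for e, i in heap if i != item]
            heapq.heapify(heap)
            heapq.heappush(heap, (est, item))
        elif len(heap) < k:
            heapq.heappush(heap, (est, item))
            tracked.add(item)
        elif est > heap[0][0]:
            _, evicted = heapq.heapreplace(heap, (est, item))
            tracked.discard(evicted)
            tracked.add(item)
    return sorted(heap, reverse=True)   # approximate top-k, highest estimate first
```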
For streaming top-k, there are also purpose-built algorithms like SpaceSaving and Lossy Counting. SpaceSaving is often more accurate than CMS+heap for the specific heavy-hitters problem, but CMS is more general because it also lets you query arbitrary elements, not just the top ones.
CMS vs Bloom Filter vs HyperLogLog
These three probabilistic structures answer fundamentally different questions:
| Question | Structure | Answer type |
|---|---|---|
| "Is x in the set?" | Bloom Filter | Yes/No (with false positives) |
| "How many distinct elements?" | HyperLogLog | Approximate count |
| "How many times did x appear?" | Count-Min Sketch | Approximate frequency |
They are complementary, not competing. A system might use a Bloom filter to check whether a user has been seen at all, an HLL to count total unique users, and a CMS to find which users are generating the most traffic. Choosing between them is about which question you are answering, not which one is "better."
Production Deployment Considerations
Memory budgeting. CMS memory is deterministic: w * d * counter_size bytes. With 4-byte counters, a 27,183 x 5 sketch uses 544 KB. You can plan capacity exactly, which makes CMS attractive for embedded systems and edge devices where memory is constrained.
Counter overflow. If you use 32-bit counters, they overflow at about 4.3 billion. For a stream processing 100,000 events per second, that is 12 hours of operation. Use 64-bit counters for anything long-running. The memory doubles, but overflow becomes effectively impossible.
Sliding windows. A CMS accumulates counts forever, which is often not what you want. For "frequency in the last hour," you need to decay old counts. One approach: maintain multiple CMS instances for time buckets (per minute, per 10 minutes) and rotate them. Another approach: multiply all counters by a decay factor periodically. The former preserves error bounds; the latter does not but is simpler.
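The bucket-rotation variant might look roughly like this, reusing the CountMinSketch class from earlier; the 60-bucket, one-rotation-per-minute schedule is an illustrative choice for an hour-long window.

```python
from collections import deque


class WindowedCMS:
    def __init__(self, width: int, depth: int, buckets: int = 60):
        self.width, self.depth = width, depth
        self.buckets = deque((CountMinSketch(width, depth) for _ in range(buckets)),
                             maxlen=buckets)

    def add(self, item: str) -> None:
        self.buckets[-1].add(item)            # always write into the newest bucket

    def rotate(self) -> None:
        # Call once per minute: the bounded deque drops the oldest bucket automatically.
        self.buckets.append(CountMinSketch(self.width, self.depth))

    def estimate(self, item: str) -> int:
        # A sum of per-bucket upper bounds is still an upper bound for the window.
        return sum(b.estimate(item) for b in self.buckets)
```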
Hash function selection. As with Bloom filters, use a fast hash with good distribution. MurmurHash3 or xxHash are standard choices. Do not use cryptographic hashes; they are orders of magnitude slower for no benefit in this context. The d hash functions can be derived from a single 128-bit hash using the Kirsch-Mitzenmacher technique, same as with Bloom filters.
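For completeness, here is roughly what that derivation looks like: split one wide digest into two 64-bit halves and combine them linearly per row. A non-cryptographic 128-bit hash such as MurmurHash3 or xxHash would be the production choice; hashlib is used here only to keep the snippet dependency-free.

```python
import hashlib


def row_indices(item: str, depth: int, width: int) -> list[int]:
    digest = hashlib.blake2b(item.encode(), digest_size=16).digest()   # 128 bits
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:], "little") | 1   # odd step avoids degenerate cycles when width is a power of two
    # Kirsch-Mitzenmacher: h_i(x) = h1(x) + i * h2(x) mod w for each row i.
    return [(h1 + row * h2) % width for row in range(depth)]
```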
When CMS Falls Short
CMS has a blind spot for infrequent items on heavy streams. If your stream has 10 billion events and you are looking for something that appeared 50 times, the error bound (even with aggressive tuning) might be larger than the true count. The estimate is noise, not signal.
For these use cases, consider exact counting on a sampled stream (if uniform sampling preserves the items you care about), or a different structure entirely. Lossy Counting gives exact counts for all items above a threshold using bounded memory, though it is more complex to implement and does not support arbitrary point queries.
CMS also does not natively support decrements. If you need to track net counts (additions minus removals), you need a variant like Count-Sketch, which uses +1/-1 randomization and median-based estimation. Count-Sketch is unbiased (unlike CMS's systematic overcount), but has higher variance. The choice depends on whether bias or variance is more damaging for your application.
Key Points
- •Estimates how often an element appears in a data stream using a fixed amount of memory regardless of stream length. The structure never needs to know the universe of possible elements in advance
- •Uses d independent hash functions mapping elements to d rows of w counters each. Insertion increments d counters, query returns the minimum of d counters. The minimum is the least-overcounted estimate
- •Overestimates frequency by at most epsilon * N with probability at least 1 - delta, where w = ceil(e/epsilon) and d = ceil(ln(1/delta)). There are no underestimates
- •Conservative update optimization only increments counters that equal the current minimum, significantly reducing overestimation for frequent items at the cost of losing the strict mergeability property
- •Used in production at Cisco, Apache Flink, Redis (RedisBloom), AT&T, and Twitter for network monitoring, streaming analytics, fraud detection, and trending topic detection
Common Mistakes
- ✗Using Count-Min Sketch for low-frequency items where the noise from hash collisions dominates the actual count. If an item appears 3 times and the overcounting noise is 50, your estimate is useless. CMS works best for heavy hitters
- ✗Not tuning width and depth for the actual workload. Default parameters from a library might target epsilon=0.01 and delta=0.01 (width 272, depth 5), but your stream might have 10 billion events, making the absolute error bound 100 million. Think in absolute terms, not just relative
- ✗Treating frequency estimates as exact for billing or compliance. CMS always overcounts. An estimate of 1,247 means the true count is at most 1,247 and at least something lower. If you are charging per event, that overcount becomes a financial dispute
- ✗Forgetting that conservative update breaks mergeability. If you need to merge sketches from different stream partitions, you must use standard update, not conservative update. Choose one or the other based on your architecture