Bloom Filters & Probabilistic Membership
The Problem: Avoiding Expensive Lookups
You have a database with 200 million user records across dozens of SSTables on disk. A request comes in asking whether user_abc123 exists. Without any filtering, you read from disk, decompress, binary search, and discover the user is not there. Multiply that by a few thousand requests per second for users that do not exist, and you have burned a massive amount of I/O on nothing.
Bloom filters exist to short-circuit that path. Before touching disk, you ask a tiny in-memory data structure: "Could this user possibly be in this SSTable?" If the answer is no, you skip the disk read entirely. If the answer is yes, you do the read, knowing there is a small chance it was a false positive. In Cassandra's case, this eliminates 90%+ of unnecessary SSTable reads on a typical workload.
How the Bit Array Works
Start with an array of m bits, all set to 0. Pick k independent hash functions, each mapping an element to a position in the array (a number from 0 to m-1).
To insert an element, compute all k hash positions and set each of those bits to 1. To check whether an element might be in the set, compute all k hash positions and check whether every one of them is 1. If any position is 0, the element was never inserted. If all positions are 1, the element was probably inserted, but there is a chance those bits were set by other elements.
That "chance those bits were set by other elements" is the false positive. It is the entire trade-off of a Bloom filter: you get constant-time membership checks in very little memory, at the cost of occasionally getting a "yes" when the true answer is "no." You never get a "no" when the true answer is "yes."
Understanding the False Positive Math
The false positive probability after inserting n elements into an m-bit array with k hash functions is approximately:
p ≈ (1 - e^(-kn/m))^k
Here is the intuition behind it. After inserting n elements with k hash functions each, you have set approximately k*n bit positions (with some collisions). The fraction of bits still at 0 is roughly e^(-kn/m). When you check a new element, each of its k hash positions independently needs to land on a 1-bit for a false positive. The probability of one position being a 1 is (1 - e^(-kn/m)), and you need all k positions to be 1, so you raise it to the power of k.
In practical terms: if you have 1 million elements and want a 1% false positive rate, you need about 9.6 million bits (1.2 MB) and 7 hash functions. For 0.1%, you need about 14.4 million bits (1.8 MB) and 10 hash functions. The memory grows linearly with the number of elements and logarithmically with the inverse of the false positive rate.
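The formula is easy to check directly. A short Python sketch (the function name is just for illustration) reproduces the numbers above:

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false positive probability: p ≈ (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


# 1M elements, ~9.6M bits, 7 hashes -> ~1%; ~14.4M bits, 10 hashes -> ~0.1%.
print(false_positive_rate(1_000_000, 9_585_059, 7))    # ~0.0101
print(false_positive_rate(1_000_000, 14_377_588, 10))  # ~0.0010
```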
Choosing Optimal k and m
Given n elements and a desired false positive rate p, the optimal parameters are:
m = -n * ln(p) / (ln(2))^2
k = (m / n) * ln(2)
| Elements (n) | Target FP Rate | Bits (m) | Hash Functions (k) | Memory |
|---|---|---|---|---|
| 100,000 | 1% | 958,506 | 7 | 117 KB |
| 1,000,000 | 1% | 9,585,059 | 7 | 1.14 MB |
| 1,000,000 | 0.1% | 14,377,588 | 10 | 1.71 MB |
| 10,000,000 | 1% | 95,850,584 | 7 | 11.4 MB |
| 100,000,000 | 1% | 958,505,838 | 7 | 114 MB |
These numbers reveal why Bloom filters are popular: 1 million elements at 1% false positive rate costs just over 1 MB. Compare that to storing the actual elements, which might cost 50-100 MB depending on element size.
The trick with choosing k: more hash functions reduce false positives up to the optimum, but too many hash functions fill the bit array faster and actually increase false positives. The formula k = (m/n) * ln(2) finds the sweet spot.
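A small sizing helper built directly from the two formulas reproduces the table rows above. The function name and rounding choices are illustrative assumptions:

```python
import math


def size_bloom_filter(n: int, p: float) -> tuple[int, int]:
    """Optimal bit count m and hash count k for n elements at target FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


# Matches the table rows above.
print(size_bloom_filter(1_000_000, 0.01))    # (9585059, 7)
print(size_bloom_filter(1_000_000, 0.001))   # (14377588, 10)
```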
Hash Function Engineering
In theory, you need k independent hash functions. In practice, nobody uses k separate hash functions. Kirsch and Mitzenmacher showed in 2006 that you can use two hash functions and derive the rest as linear combinations: h_i(x) = h1(x) + i * h2(x). This gives you the same asymptotic false positive guarantees with two hash computations instead of k.
Most implementations use a single 128-bit hash (like MurmurHash3's 128-bit variant) and split it into two 64-bit halves. Guava's BloomFilter does exactly this. The quality of the hash matters enormously here. A hash function with poor avalanche properties will create correlated bit positions. You will see false positive rates 2-5x higher than the theoretical prediction and spend days wondering why your Bloom filter is not performing as expected.
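A sketch of the derivation, assuming blake2b as a stand-in for the 128-bit hash (the hash choice and the trick of forcing h2 odd are illustrative, not any specific library's scheme):

```python
import hashlib


def k_positions(item: bytes, k: int, m: int) -> list[int]:
    """Kirsch-Mitzenmacher double hashing: derive k probe positions from one
    128-bit digest split into two 64-bit halves."""
    digest = hashlib.blake2b(item, digest_size=16).digest()
    h1 = int.from_bytes(digest[:8], "little")
    # Force h2 odd so it can never be zero, which would collapse all k probes
    # onto the same position.
    h2 = int.from_bytes(digest[8:], "little") | 1
    return [(h1 + i * h2) % m for i in range(k)]


print(k_positions(b"user_abc123", k=7, m=9_585_059))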
Counting Bloom Filters: Enabling Deletion
Standard Bloom filters have a hard limitation: you cannot remove elements. If you clear the bits for element A, you might be clearing bits that element B also set. The next lookup for element B returns "not in set" when B is actually present. You have introduced false negatives, which violates the core guarantee.
Counting Bloom filters replace each bit with a small counter (typically 4 bits). Insertion increments the k counters. Deletion decrements them. A position is considered "set" if its counter is above zero. This works because decrementing only reduces the counter for one element; other elements that hash to the same position still keep the counter above zero.
The cost is memory. A 4-bit counter per position means 4x the memory of a standard Bloom filter. For the 1 million element / 1% false positive case, that jumps from 1.14 MB to about 4.56 MB. And there is a subtle risk: counter overflow. With 4-bit counters, the maximum value is 15. If more than 15 elements hash to the same position, the counter saturates and deletion becomes unsafe. This is rare with properly sized filters, but it can happen in adversarial scenarios or with poor hash functions.
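A counting-filter sketch under the same illustrative assumptions as the earlier example; it uses one byte per counter for readability where a real implementation would pack 4-bit counters:

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a counting Bloom filter: one 8-bit counter per position."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = bytearray(m)

    def _positions(self, item: str):
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < 255:      # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only safe for items that were actually inserted (and never saturated).
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```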
Cuckoo Filters: A Modern Alternative
Cuckoo filters (Fan et al., 2014) store fingerprints of elements in a cuckoo hash table instead of setting bits in an array. They support deletion natively (remove the fingerprint), achieve slightly better space efficiency than Bloom filters for false positive rates below about 3%, and provide faster lookups due to better cache locality (only two memory accesses regardless of the desired false positive rate).
The trade-off is complexity. Cuckoo filters have a maximum capacity. Once the table reaches about 95% occupancy, insertions start failing. You need to either resize (expensive) or accept that the filter is full. Bloom filters degrade smoothly when overfull; the false positive rate just increases. Cuckoo filters fall off a cliff.
For new projects where deletion is a requirement, Cuckoo filters are usually the better choice. For append-only workloads like SSTable filtering, standard Bloom filters remain simpler and well-understood.
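A compressed sketch of the cuckoo filter mechanics, assuming 1-byte fingerprints, 4-slot buckets, and a power-of-two bucket count so the partial-key XOR trick works; every parameter choice here is illustrative, not tuned:

```python
import hashlib
import random


class CuckooFilter:
    """Minimal cuckoo filter sketch: fingerprints stored in a 2-choice table."""

    def __init__(self, num_buckets: int, max_kicks: int = 500):
        assert num_buckets & (num_buckets - 1) == 0, "use a power-of-two bucket count"
        self.num_buckets = num_buckets
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]      # up to 4 fingerprints each

    def _fingerprint(self, item: str) -> int:
        fp = hashlib.blake2b(item.encode(), digest_size=1).digest()[0]
        return fp or 1                                        # reserve 0 as "empty"

    def _index(self, item: str) -> int:
        h = hashlib.blake2b(item.encode(), digest_size=8, person=b"bucket").digest()
        return int.from_bytes(h, "little") & (self.num_buckets - 1)

    def _alt_index(self, index: int, fp: int) -> int:
        # XOR with a hash of the fingerprint; applying it twice returns the
        # original index, so either bucket can be recovered from the other.
        h = hashlib.blake2b(bytes([fp]), digest_size=8, person=b"bucket").digest()
        return (index ^ int.from_bytes(h, "little")) & (self.num_buckets - 1)

    def add(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < 4:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: evict a fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < 4:
                self.buckets[i].append(fp)
                return True
        return False                                          # filter is effectively full

    def might_contain(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def remove(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```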
Partitioned Bloom Filters
A standard Bloom filter has one contiguous bit array. A partitioned Bloom filter splits the m bits into k separate arrays of m/k bits each, with each hash function targeting exactly one partition. The advantage is cleaner analysis: each partition is independent, which simplifies the false positive probability calculation and can improve cache behavior.
The downside is that partitioned Bloom filters have slightly higher false positive rates than standard Bloom filters with the same total memory. The independence between partitions means you lose some of the beneficial "smoothing" that happens when hash functions share the same array. In practice the difference is small, typically less than 10% higher false positive rate, but it is worth knowing about.
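The partitioned variant is a small change to the earlier sketch: each hash function owns its own slice of the bit array. As before, the class name and hash choice are illustrative assumptions:

```python
import hashlib


class PartitionedBloomFilter:
    """Sketch: m bits split into k slices of m // k bits, one slice per hash."""

    def __init__(self, m: int, k: int):
        self.k = k
        self.slice_bits = m // k
        self.slices = [bytearray((self.slice_bits + 7) // 8) for _ in range(k)]

    def _position(self, item: str, i: int) -> int:
        digest = hashlib.blake2b(item.encode(), salt=i.to_bytes(4, "little"),
                                 digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.slice_bits

    def add(self, item: str) -> None:
        for i in range(self.k):
            pos = self._position(item, i)
            self.slices[i][pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        for i in range(self.k):
            pos = self._position(item, i)
            if not self.slices[i][pos // 8] & (1 << (pos % 8)):
                return False
        return True
```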
Scalable Bloom Filters
What happens when you do not know n in advance? Standard Bloom filters require you to commit to n upfront. If you underestimate, the false positive rate blows up. If you overestimate, you waste memory.
Scalable Bloom filters (Almeida et al., 2007) solve this by chaining multiple Bloom filters together. When the current filter reaches its target fill ratio, a new filter is created with tighter false positive parameters. The overall false positive rate is bounded by the sum of the individual filter rates, so each successive filter uses a geometrically decreasing false positive target (typically each new filter has half the false positive rate of the previous one).
Lookups must check all filters in the chain, which costs more than a single filter. But the total false positive rate converges to a bounded value, and you never need to know the final element count upfront. LevelDB's Bloom filter implementation takes a different approach to the same problem: it creates a new Bloom filter per SSTable, so each filter has a known and bounded n.
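A sketch of the chaining idea, assuming the doubling/halving schedule described above; the class name, the growth factors, and the internal filter layout are illustrative assumptions:

```python
import hashlib
import math


class ScalableBloomFilter:
    """Sketch of a scalable Bloom filter: a chain of fixed-size filters where each
    new filter doubles capacity and halves its false positive target, keeping the
    total FP rate bounded (roughly twice the first filter's target)."""

    def __init__(self, initial_capacity: int = 1000, initial_fp: float = 0.005):
        self.capacity = initial_capacity
        self.fp = initial_fp
        self.filters = []            # each entry: {"bits", "m", "k", "count", "capacity"}
        self._grow()

    def _grow(self):
        m = math.ceil(-self.capacity * math.log(self.fp) / math.log(2) ** 2)
        k = max(1, round((m / self.capacity) * math.log(2)))
        self.filters.append({"bits": bytearray((m + 7) // 8), "m": m, "k": k,
                             "count": 0, "capacity": self.capacity})
        self.capacity *= 2           # next filter holds twice as many elements
        self.fp /= 2                 # ...at half the false positive rate

    def _positions(self, item: str, m: int, k: int):
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % m for i in range(k)]

    def add(self, item: str) -> None:
        current = self.filters[-1]
        if current["count"] >= current["capacity"]:
            self._grow()
            current = self.filters[-1]
        for pos in self._positions(item, current["m"], current["k"]):
            current["bits"][pos // 8] |= 1 << (pos % 8)
        current["count"] += 1

    def might_contain(self, item: str) -> bool:
        # Lookups must consult every filter in the chain.
        return any(
            all(f["bits"][pos // 8] & (1 << (pos % 8))
                for pos in self._positions(item, f["m"], f["k"]))
            for f in self.filters
        )
```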
Production Patterns
Negative cache in front of a database. This is the canonical use case. Cassandra checks a Bloom filter before reading each SSTable. If the filter says the partition key is not in this SSTable, skip it. With 10 SSTables and a 1% false positive rate each, a lookup for a key that lives in one SSTable reads about 1.1 SSTables on average instead of 10, and a lookup for a key that exists nowhere reads about 0.1. That is an order of magnitude less I/O.
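The arithmetic behind that claim, worked out in a couple of lines (numbers from the example above):

```python
# Expected SSTable reads per lookup with 10 SSTables and a 1% FP rate each.
sstables, fp = 10, 0.01
print(1 + (sstables - 1) * fp)   # key exists in one SSTable: ~1.09 reads instead of 10
print(sstables * fp)             # key exists nowhere: ~0.10 reads instead of 10
```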
Chrome Safe Browsing. Chrome's original Safe Browsing implementation maintained a local Bloom filter of known malicious URLs (later versions switched to a more compact prefix-set structure, but the pattern is the same). Every URL you visit gets checked against it. If the filter says "possibly malicious," the browser sends a hashed prefix to Google's servers for a definitive check. The filter keeps the common case (safe URLs) entirely local and fast.
Network deduplication. When processing a stream of events and you need to deduplicate, a Bloom filter can tell you "definitely new" or "probably seen before." For event processing where occasional duplicate processing is acceptable, this is much cheaper than maintaining an exact set of all seen event IDs.
Distributed join optimization. Before shuffling data across the network for a distributed join, build a Bloom filter on the smaller table's join key. Send the filter to nodes holding the larger table. Each node filters out rows that definitely will not match, dramatically reducing network transfer. Spark and Presto both support this optimization.
When Not To Use Bloom Filters
If your set fits comfortably in memory as a hash set, just use a hash set. The complexity of managing a Bloom filter is not worth it when you have 50,000 elements and 64 bytes per element, which comes to about 3 MB. Bloom filters shine when the set is too large for exact representation, or when you need many filters (one per SSTable, one per partition, one per time window).
Also avoid Bloom filters when false positives carry real costs beyond an extra lookup. In a rate limiter, a false positive means blocking a legitimate user. In a cache, a false positive means a pointless backend request. The former is much worse than the latter. Think carefully about what happens downstream when the filter lies.
Sizing Failures in Practice
The most common production incident with Bloom filters is undersizing. Someone provisions a filter for 1 million elements, the dataset grows to 5 million, and the false positive rate climbs from the designed 1% to over 80%. The system still works, technically. But performance degrades so badly that the filter might as well not exist.
Monitor two things: the number of elements inserted and the observed false positive rate. If you are running Cassandra, the BloomFilterFalsePositives and BloomFilterFalseRatio metrics are your early warning. When the false ratio starts climbing, it is time to rebuild the SSTable with a larger filter. RocksDB makes this easier because new filters are created during compaction, so you can adjust the bits-per-key setting and let compaction handle the migration.
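A useful back-of-envelope check, sketched below: given the observed fraction of bits set, the current false positive rate for absent keys is roughly that fraction raised to the k-th power (the function name and the 5-million-element scenario are illustrative):

```python
import math


def estimated_fp_rate(bits_set: int, m: int, k: int) -> float:
    """With a fraction f = bits_set / m of bits set, a lookup for an absent key
    finds all k probed bits set with probability roughly f ** k."""
    return (bits_set / m) ** k


# A filter sized for 1M elements at 1% (m = 9,585,059, k = 7), loaded with 5M:
m, k = 9_585_059, 7
fill = 1 - math.exp(-k * 5_000_000 / m)          # expected fill fraction, ~0.97
print(estimated_fp_rate(round(fill * m), m, k))  # ~0.83: the filter is nearly useless
```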
Key Points
- •Answers 'is this element in the set?' with zero false negatives and a tunable false positive rate. If a Bloom filter says no, the element is definitely not there. If it says yes, the element is probably there
- •A bit array of m bits with k independent hash functions. Insertion hashes the element k times and sets those k bit positions to 1. Lookup hashes the element k times and checks whether all k positions are 1
- •False positive rate is approximately (1 - e^(-kn/m))^k where n is the number of inserted elements, m is the bit array size, and k is the number of hash functions. Tuning m and k for a target false positive rate is straightforward given expected n
- •Standard Bloom filters do not support deletion. Removing an element would require clearing bits that might be shared with other elements. Counting Bloom filters replace each bit with a counter to enable deletion at the cost of 3-4x more memory
- •Cassandra, Chrome Safe Browsing, RocksDB, PostgreSQL, and Bitcoin SPV nodes all use Bloom filters in production to avoid expensive lookups against data that is almost certainly not present
Common Mistakes
- ✗Not sizing the filter correctly for the expected number of elements. An undersized Bloom filter degrades gracefully in theory but catastrophically in practice: once the fill fraction climbs well past the ~50% an optimally sized filter is designed around, false positive rates spike well above your design target
- ✗Assuming you can delete elements from a standard Bloom filter. Clearing a bit that belongs to multiple elements corrupts the filter silently. If you need deletion, use a Counting Bloom filter or a Cuckoo filter from the start
- ✗Using a Bloom filter where false positives affect correctness, not just performance. A Bloom filter as a negative cache is fine (false positive means an unnecessary lookup). A Bloom filter deciding whether to charge a customer is not fine
- ✗Ignoring hash function quality. Bloom filters need hash functions that produce independent, uniformly distributed outputs. Using a weak hash function creates correlated bit positions and inflates your actual false positive rate well beyond the theoretical bound