Bloom Filters & Probabilistic Membership
The Problem: Avoiding Expensive Lookups
You have a database with 200 million user records across dozens of SSTables on disk. A request comes in asking whether user_abc123 exists. Without any filtering, you read from disk, decompress, binary search, and discover the user is not there. Multiply that by a few thousand requests per second for users that do not exist, and you have burned a massive amount of I/O on nothing.
Bloom filters exist to short-circuit that path. Before touching disk, you ask a tiny in-memory data structure: "Could this user possibly be in this SSTable?" If the answer is no, you skip the disk read entirely. If the answer is yes, you do the read, knowing there is a small chance it was a false positive. In Cassandra's case, this eliminates 90%+ of unnecessary SSTable reads on a typical workload.
How the Bit Array Works
Start with an array of m bits, all set to 0. Pick k independent hash functions, each mapping an element to a position in the array (a number from 0 to m-1).
To insert an element, compute all k hash positions and set each of those bits to 1. To check whether an element might be in the set, compute all k hash positions and check whether every one of them is 1. If any position is 0, the element was never inserted. If all positions are 1, the element was probably inserted, but there is a chance those bits were set by other elements.
That "chance those bits were set by other elements" is the false positive. It is the entire trade-off of a Bloom filter: you get constant-time membership checks in very little memory, at the cost of occasionally getting a "yes" when the true answer is "no." You never get a "no" when the true answer is "yes."
Understanding the False Positive Math
The false positive probability after inserting n elements into an m-bit array with k hash functions is approximately:
p ≈ (1 - e^(-kn/m))^k
Here is the intuition behind it. After inserting n elements with k hash functions each, you have set approximately k*n bit positions (with some collisions). The fraction of bits still at 0 is roughly e^(-kn/m). When you check a new element, each of its k hash positions independently needs to land on a 1-bit for a false positive. The probability of one position being a 1 is (1 - e^(-kn/m)), and you need all k positions to be 1, so you raise it to the power of k.
In practical terms: if you have 1 million elements and want a 1% false positive rate, you need about 9.6 million bits (1.2 MB) and 7 hash functions. For 0.1%, you need about 14.4 million bits (1.8 MB) and 10 hash functions. The memory grows linearly with the number of elements and logarithmically with the inverse of the false positive rate.
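The formula is easy to check directly. A short Python sketch (the function name is just for illustration) reproduces the numbers above:

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false positive probability: p ≈ (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


# 1M elements, ~9.6M bits, 7 hashes -> ~1%; ~14.4M bits, 10 hashes -> ~0.1%.
print(false_positive_rate(1_000_000, 9_585_059, 7))    # ~0.0101
print(false_positive_rate(1_000_000, 14_377_588, 10))  # ~0.0010
```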
Choosing Optimal k and m
Given n elements and a desired false positive rate p, the optimal parameters are:
m = -n * ln(p) / (ln(2))^2
k = (m / n) * ln(2)
| Elements (n) | Target FP Rate | Bits (m) | Hash Functions (k) | Memory |
|---|---|---|---|---|
| 100,000 | 1% | 958,506 | 7 | 117 KB |
| 1,000,000 | 1% | 9,585,059 | 7 | 1.14 MB |
| 1,000,000 | 0.1% | 14,377,588 | 10 | 1.71 MB |
| 10,000,000 | 1% | 95,850,584 | 7 | 11.4 MB |
| 100,000,000 | 1% | 958,505,838 | 7 | 114 MB |
These numbers reveal why Bloom filters are popular: 1 million elements at 1% false positive rate costs just over 1 MB. Compare that to storing the actual elements, which might cost 50-100 MB depending on element size.
The trick with choosing k: more hash functions reduce false positives up to the optimum, but too many hash functions fill the bit array faster and actually increase false positives. The formula k = (m/n) * ln(2) finds the sweet spot.
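A small sizing helper built directly from the two formulas reproduces the table rows above. The function name and rounding choices are illustrative assumptions:

```python
import math


def size_bloom_filter(n: int, p: float) -> tuple[int, int]:
    """Optimal bit count m and hash count k for n elements at target FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


# Matches the table rows above.
print(size_bloom_filter(1_000_000, 0.01))    # (9585059, 7)
print(size_bloom_filter(1_000_000, 0.001))   # (14377588, 10)
```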
Hash Function Engineering
In theory, you need k independent hash functions. In practice, nobody uses k separate hash functions. Kirsch and Mitzenmacher showed in 2006 that you can use two hash functions and derive the rest as linear combinations: h_i(x) = h1(x) + i * h2(x). This gives you the same asymptotic false positive guarantees with two hash computations instead of k.
Most implementations use a single 128-bit hash (like MurmurHash3's 128-bit variant) and split it into two 64-bit halves. Guava's BloomFilter does exactly this. The quality of the hash matters enormously here. A hash function with poor avalanche properties will create correlated bit positions. You will see false positive rates 2-5x higher than the theoretical prediction and spend days wondering why your Bloom filter is not performing as expected.
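A sketch of the derivation, assuming blake2b as a stand-in for the 128-bit hash (the hash choice and the trick of forcing h2 odd are illustrative, not any specific library's scheme):

```python
import hashlib


def k_positions(item: bytes, k: int, m: int) -> list[int]:
    """Kirsch-Mitzenmacher double hashing: derive k probe positions from one
    128-bit digest split into two 64-bit halves."""
    digest = hashlib.blake2b(item, digest_size=16).digest()
    h1 = int.from_bytes(digest[:8], "little")
    # Force h2 odd so it can never be zero, which would collapse all k probes
    # onto the same position.
    h2 = int.from_bytes(digest[8:], "little") | 1
    return [(h1 + i * h2) % m for i in range(k)]


print(k_positions(b"user_abc123", k=7, m=9_585_059))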
Counting Bloom Filters: Enabling Deletion
Standard Bloom filters have a hard limitation: you cannot remove elements. If you clear the bits for element A, you might be clearing bits that element B also set. The next lookup for element B returns "not in set" when B is actually present. You have introduced false negatives, which violates the core guarantee.
Counting Bloom filters replace each bit with a small counter (typically 4 bits). Insertion increments the k counters. Deletion decrements them. A position is considered "set" if its counter is above zero. This works because decrementing only reduces the counter for one element; other elements that hash to the same position still keep the counter above zero.
The cost is memory. A 4-bit counter per position means 4x the memory of a standard Bloom filter. For the 1 million element / 1% false positive case, that jumps from 1.14 MB to about 4.56 MB. And there is a subtle risk: counter overflow. With 4-bit counters, the maximum value is 15. If more than 15 elements hash to the same position, the counter saturates and deletion becomes unsafe. This is rare with properly sized filters, but it can happen in adversarial scenarios or with poor hash functions.
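A counting-filter sketch under the same illustrative assumptions as the earlier example; it uses one byte per counter for readability where a real implementation would pack 4-bit counters:

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a counting Bloom filter: one 8-bit counter per position."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = bytearray(m)

    def _positions(self, item: str):
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < 255:      # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only safe for items that were actually inserted (and never saturated).
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```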
Cuckoo Filters: A Modern Alternative
Cuckoo filters (Fan et al., 2014) store fingerprints of elements in a cuckoo hash table instead of setting bits in an array. They support deletion natively (remove the fingerprint), achieve slightly better space efficiency than Bloom filters for false positive rates below about 3%, and provide faster lookups due to better cache locality (only two memory accesses regardless of the desired false positive rate).
The trade-off is complexity. Cuckoo filters have a maximum capacity. Once the table reaches about 95% occupancy, insertions start failing. You need to either resize (expensive) or accept that the filter is full. Bloom filters degrade smoothly when overfull; the false positive rate just increases. Cuckoo filters fall off a cliff.
For new projects where deletion is a requirement, Cuckoo filters are usually the better choice. For append-only workloads like SSTable filtering, standard Bloom filters remain simpler and well-understood.
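A compressed sketch of the cuckoo filter mechanics, assuming 1-byte fingerprints, 4-slot buckets, and a power-of-two bucket count so the partial-key XOR trick works; every parameter choice here is illustrative, not tuned:

```python
import hashlib
import random


class CuckooFilter:
    """Minimal cuckoo filter sketch: fingerprints stored in a 2-choice table."""

    def __init__(self, num_buckets: int, max_kicks: int = 500):
        assert num_buckets & (num_buckets - 1) == 0, "use a power-of-two bucket count"
        self.num_buckets = num_buckets
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]      # up to 4 fingerprints each

    def _fingerprint(self, item: str) -> int:
        fp = hashlib.blake2b(item.encode(), digest_size=1).digest()[0]
        return fp or 1                                        # reserve 0 as "empty"

    def _index(self, item: str) -> int:
        h = hashlib.blake2b(item.encode(), digest_size=8, person=b"bucket").digest()
        return int.from_bytes(h, "little") & (self.num_buckets - 1)

    def _alt_index(self, index: int, fp: int) -> int:
        # XOR with a hash of the fingerprint; applying it twice returns the
        # original index, so either bucket can be recovered from the other.
        h = hashlib.blake2b(bytes([fp]), digest_size=8, person=b"bucket").digest()
        return (index ^ int.from_bytes(h, "little")) & (self.num_buckets - 1)

    def add(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < 4:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: evict a fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < 4:
                self.buckets[i].append(fp)
                return True
        return False                                          # filter is effectively full

    def might_contain(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def remove(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```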
Partitioned Bloom Filters
A standard Bloom filter has one contiguous bit array. A partitioned Bloom filter splits the m bits into k separate arrays of m/k bits each, with each hash function targeting exactly one partition. The advantage is cleaner analysis: each partition is independent, which simplifies the false positive probability calculation and can improve cache behavior.
The downside is that partitioned Bloom filters have slightly higher false positive rates than standard Bloom filters with the same total memory. The independence between partitions means you lose some of the beneficial "smoothing" that happens when hash functions share the same array. In practice the difference is small, typically less than 10% higher false positive rate, but it is worth knowing about.
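The partitioned variant is a small change to the earlier sketch: each hash function owns its own slice of the bit array. As before, the class name and hash choice are illustrative assumptions:

```python
import hashlib


class PartitionedBloomFilter:
    """Sketch: m bits split into k slices of m // k bits, one slice per hash."""

    def __init__(self, m: int, k: int):
        self.k = k
        self.slice_bits = m // k
        self.slices = [bytearray((self.slice_bits + 7) // 8) for _ in range(k)]

    def _position(self, item: str, i: int) -> int:
        digest = hashlib.blake2b(item.encode(), salt=i.to_bytes(4, "little"),
                                 digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.slice_bits

    def add(self, item: str) -> None:
        for i in range(self.k):
            pos = self._position(item, i)
            self.slices[i][pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        for i in range(self.k):
            pos = self._position(item, i)
            if not self.slices[i][pos // 8] & (1 << (pos % 8)):
                return False
        return True
```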
Scalable Bloom Filters
What happens when you do not know n in advance? Standard Bloom filters require you to commit to n upfront. If you underestimate, the false positive rate blows up. If you overestimate, you waste memory.
Scalable Bloom filters (Almeida et al., 2007) solve this by chaining multiple Bloom filters together. When the current filter reaches its target fill ratio, a new filter is created with tighter false positive parameters. The overall false positive rate is bounded by the sum of the individual filter rates, so each successive filter uses a geometrically decreasing false positive target (typically each new filter has half the false positive rate of the previous one).
Lookups must check all filters in the chain, which costs more than a single filter. But the total false positive rate converges to a bounded value, and you never need to know the final element count upfront. LevelDB's Bloom filter implementation takes a different approach to the same problem: it creates a new Bloom filter per SSTable, so each filter has a known and bounded n.
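A sketch of the chaining idea, assuming the doubling/halving schedule described above; the class name, the growth factors, and the internal filter layout are illustrative assumptions:

```python
import hashlib
import math


class ScalableBloomFilter:
    """Sketch of a scalable Bloom filter: a chain of fixed-size filters where each
    new filter doubles capacity and halves its false positive target, keeping the
    total FP rate bounded (roughly twice the first filter's target)."""

    def __init__(self, initial_capacity: int = 1000, initial_fp: float = 0.005):
        self.capacity = initial_capacity
        self.fp = initial_fp
        self.filters = []            # each entry: {"bits", "m", "k", "count", "capacity"}
        self._grow()

    def _grow(self):
        m = math.ceil(-self.capacity * math.log(self.fp) / math.log(2) ** 2)
        k = max(1, round((m / self.capacity) * math.log(2)))
        self.filters.append({"bits": bytearray((m + 7) // 8), "m": m, "k": k,
                             "count": 0, "capacity": self.capacity})
        self.capacity *= 2           # next filter holds twice as many elements
        self.fp /= 2                 # ...at half the false positive rate

    def _positions(self, item: str, m: int, k: int):
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % m for i in range(k)]

    def add(self, item: str) -> None:
        current = self.filters[-1]
        if current["count"] >= current["capacity"]:
            self._grow()
            current = self.filters[-1]
        for pos in self._positions(item, current["m"], current["k"]):
            current["bits"][pos // 8] |= 1 << (pos % 8)
        current["count"] += 1

    def might_contain(self, item: str) -> bool:
        # Lookups must consult every filter in the chain.
        return any(
            all(f["bits"][pos // 8] & (1 << (pos % 8))
                for pos in self._positions(item, f["m"], f["k"]))
            for f in self.filters
        )
```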
Production Patterns
Negative cache in front of a database. This is the canonical use case. Cassandra checks a Bloom filter before reading each SSTable. If the filter says the partition key is not in this SSTable, skip it. With 10 SSTables and a 1% false positive rate each, a lookup for a key that lives in one SSTable reads about 1.1 SSTables on average instead of 10, and a lookup for a key that exists nowhere reads about 0.1. That is an order of magnitude less I/O.
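The arithmetic behind that claim, worked out in a couple of lines (numbers from the example above):

```python
# Expected SSTable reads per lookup with 10 SSTables and a 1% FP rate each.
sstables, fp = 10, 0.01
print(1 + (sstables - 1) * fp)   # key exists in one SSTable: ~1.09 reads instead of 10
print(sstables * fp)             # key exists nowhere: ~0.10 reads instead of 10
```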
Chrome Safe Browsing. Chrome's original Safe Browsing implementation maintained a local Bloom filter of known malicious URLs (later versions switched to a more compact prefix-set structure, but the pattern is the same). Every URL you visit gets checked against it. If the filter says "possibly malicious," the browser sends a hashed prefix to Google's servers for a definitive check. The filter keeps the common case (safe URLs) entirely local and fast.
Network deduplication. When processing a stream of events and you need to deduplicate, a Bloom filter can tell you "definitely new" or "probably seen before." For event processing where occasional duplicate processing is acceptable, this is much cheaper than maintaining an exact set of all seen event IDs.
Distributed join optimization. Before shuffling data across the network for a distributed join, build a Bloom filter on the smaller table's join key. Send the filter to nodes holding the larger table. Each node filters out rows that definitely will not match, dramatically reducing network transfer. Spark and Presto both support this optimization.
When Not To Use Bloom Filters
If your set fits comfortably in memory as a hash set, just use a hash set. The complexity of managing a Bloom filter is not worth it when you have 50,000 elements and 64 bytes per element, which comes to about 3 MB. Bloom filters shine when the set is too large for exact representation, or when you need many filters (one per SSTable, one per partition, one per time window).
Also avoid Bloom filters when false positives carry real costs beyond an extra lookup. In a rate limiter, a false positive means blocking a legitimate user. In a cache, a false positive means a pointless backend request. The former is much worse than the latter. Think carefully about what happens downstream when the filter lies.
Sizing Failures in Practice
The most common production incident with Bloom filters is undersizing. Someone provisions a filter for 1 million elements, the dataset grows to 5 million, and the false positive rate climbs from the designed 1% to over 80%. The system still works, technically. But performance degrades so badly that the filter might as well not exist.
Monitor two things: the number of elements inserted and the observed false positive rate. If you are running Cassandra, the BloomFilterFalsePositives and BloomFilterFalseRatio metrics are your early warning. When the false ratio starts climbing, it is time to rebuild the SSTable with a larger filter. RocksDB makes this easier because new filters are created during compaction, so you can adjust the bits-per-key setting and let compaction handle the migration.
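A useful back-of-envelope check, sketched below: given the observed fraction of bits set, the current false positive rate for absent keys is roughly that fraction raised to the k-th power (the function name and the 5-million-element scenario are illustrative):

```python
import math


def estimated_fp_rate(bits_set: int, m: int, k: int) -> float:
    """With a fraction f = bits_set / m of bits set, a lookup for an absent key
    finds all k probed bits set with probability roughly f ** k."""
    return (bits_set / m) ** k


# A filter sized for 1M elements at 1% (m = 9,585,059, k = 7), loaded with 5M:
m, k = 9_585_059, 7
fill = 1 - math.exp(-k * 5_000_000 / m)          # expected fill fraction, ~0.97
print(estimated_fp_rate(round(fill * m), m, k))  # ~0.83: the filter is nearly useless
```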
Key Points
- •Answers 'is this element in the set?' with zero false negatives and a tunable false positive rate. If a Bloom filter says no, the element is definitely not there. If it says yes, the element is probably there
- •A bit array of m bits with k independent hash functions. Insertion hashes the element k times and sets those k bit positions to 1. Lookup hashes the element k times and checks whether all k positions are 1
- •False positive rate is approximately (1 - e^(-kn/m))^k where n is the number of inserted elements, m is the bit array size, and k is the number of hash functions. Tuning m and k for a target false positive rate is straightforward given expected n
- •Standard Bloom filters do not support deletion. Removing an element would require clearing bits that might be shared with other elements. Counting Bloom filters replace each bit with a counter to enable deletion at the cost of 3-4x more memory
- •Cassandra, Chrome Safe Browsing, RocksDB, PostgreSQL, and Bitcoin SPV nodes all use Bloom filters in production to avoid expensive lookups against data that is almost certainly not present
Common Mistakes
- ✗Not sizing the filter correctly for the expected number of elements. An undersized Bloom filter degrades gracefully in theory but catastrophically in practice: once the fill fraction climbs well past the ~50% an optimally sized filter is designed around, false positive rates spike well above your design target
- ✗Assuming you can delete elements from a standard Bloom filter. Clearing a bit that belongs to multiple elements corrupts the filter silently. If you need deletion, use a Counting Bloom filter or a Cuckoo filter from the start
- ✗Using a Bloom filter where false positives affect correctness, not just performance. A Bloom filter as a negative cache is fine (false positive means an unnecessary lookup). A Bloom filter deciding whether to charge a customer is not fine
- ✗Ignoring hash function quality. Bloom filters need hash functions that produce independent, uniformly distributed outputs. Using a weak hash function creates correlated bit positions and inflates your actual false positive rate well beyond the theoretical bound