MinHash & Locality-Sensitive Hashing
The All-Pairs Problem
You have 100 million web pages and need to find near-duplicates. Maybe someone scraped your content, or you want to deduplicate crawl results, or you are building a plagiarism detector. The naive approach compares every pair of pages. That is 100 million times 100 million divided by 2 = 5 x 10^15 comparisons. Even at a million comparisons per second, this takes 158 years.
MinHash and LSH reduce this to a tractable problem. Instead of comparing all pairs, you build compact signatures that preserve similarity information, then use a clever indexing scheme to only compare items that are likely to be similar.
Jaccard Similarity and Shingling
Before MinHash, you need a similarity measure. Jaccard similarity is the ratio of the intersection to the union of two sets:
J(A, B) = |A ∩ B| / |A ∪ B|
If sets A and B have 8 elements in common out of 10 total unique elements, the Jaccard similarity is 0.8.
For documents, the "set" is typically a set of shingles (n-grams). A k-shingle is a contiguous sequence of k characters or k words. For the sentence "the cat sat on the mat," the 2-word shingles are: {"the cat", "cat sat", "sat on", "on the", "the mat"}.
Shingling converts a document into a set, and Jaccard similarity measures how much two documents share. Documents with 90% shingle overlap are near-duplicates. Documents with 5% overlap are unrelated.
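A minimal sketch of both steps (word-level shingling and exact Jaccard; the sentence pair is illustrative):

```python
def shingles(text, k=2):
    """Word-level k-shingles: the set of contiguous k-word sequences."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a or b else 1.0

s1 = shingles("the cat sat on the mat")
s2 = shingles("the cat sat on a mat")
print(sorted(s1))
print(jaccard(s1, s2))  # 3 shared shingles out of 7 total: 3/7 ≈ 0.43
```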
How MinHash Works
Computing Jaccard similarity directly requires iterating through both sets. For large sets (a web page might have 10,000 shingles), this is expensive per pair. MinHash compresses each set into a fixed-size signature that preserves Jaccard similarity.
The Core Idea
Take a random permutation of all possible elements in the universe. The MinHash of a set is the first element in the permuted order that belongs to the set.
The critical property: the probability that two sets have the same MinHash equals their Jaccard similarity.
Pr[MinHash(A) = MinHash(B)] = J(A, B)
Why? MinHash(A) = MinHash(B) exactly when the first element of A ∪ B in the permuted order also lies in A ∩ B. Under a random permutation, every element of A ∪ B is equally likely to come first, so this happens with probability |A ∩ B| / |A ∪ B|, which is exactly the Jaccard similarity.
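The property is easy to check empirically. A sketch with explicit random permutations (the two sets are arbitrary; with enough trials the match rate converges to J):

```python
import random

def minhash_rank(s, perm):
    """MinHash under one permutation: the smallest rank among s's elements."""
    return min(perm[x] for x in s)

A, B = set(range(0, 8)), set(range(4, 14))   # |A ∩ B| = 4, |A ∪ B| = 14
universe = sorted(A | B)

random.seed(0)
trials, matches = 20_000, 0
for _ in range(trials):
    order = universe[:]
    random.shuffle(order)                     # a fresh random permutation
    perm = {elem: rank for rank, elem in enumerate(order)}
    if minhash_rank(A, perm) == minhash_rank(B, perm):
        matches += 1

print(matches / trials)   # converges to J(A, B) = 4/14 ≈ 0.286
```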
Building the Signature
One MinHash value gives you a single bit of information. To get a good estimate, use k independent hash functions (each simulating a random permutation). For each hash function, compute the minimum hash value across all elements in the set.
The result is a signature: a vector of k minimum hash values. To estimate J(A, B), count how many positions have the same value in both signatures and divide by k.
Signature A: [3, 7, 2, 9, 1, 4, 8, 3, 6, 2]
Signature B: [3, 4, 2, 9, 5, 4, 8, 7, 6, 2]
Matches: Y N Y Y N Y Y N Y Y → 7/10 = 0.7
Estimated Jaccard similarity: 0.7
With k=200, the standard error is 1/sqrt(200) ≈ 7%. Good enough for most deduplication tasks.
Practical Hash Functions
You do not actually generate k independent random permutations. Instead, use k hash functions. For each element in the set, compute all k hashes. The MinHash for hash function i is the minimum value of hash_i across all elements.
In practice, people use a single hash function with k different seeds, or the Kirsch-Mitzenmacher trick (two base hashes combined linearly, same as for Bloom filters).
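A sketch of this scheme, with a salted MD5 standing in for k independent hash functions (the element strings and set sizes are illustrative):

```python
import hashlib

def h(seed, elem):
    """Hash function number `seed`, simulated by salting one base hash."""
    return int.from_bytes(hashlib.md5(f"{seed}:{elem}".encode()).digest()[:8], "big")

def minhash_signature(elements, k=200):
    """For each of k hash functions, keep the minimum hash over the set."""
    return [min(h(seed, e) for e in elements) for seed in range(k)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = {f"shingle{i}" for i in range(100)}
B = {f"shingle{i}" for i in range(20, 120)}   # true Jaccard = 80/120 ≈ 0.667
sa, sb = minhash_signature(A), minhash_signature(B)
print(estimate_jaccard(sa, sb))   # close to 0.667, within about 1/sqrt(200)
```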
LSH: From Signatures to Candidates
MinHash gives you compact signatures, but you still need to compare every pair of signatures to find similar items. With 100 million items and 200-dimensional signatures, that is still 5 x 10^15 comparisons.
Locality-Sensitive Hashing eliminates most comparisons by grouping similar signatures into buckets.
The Banding Technique
Split each signature into b bands of r rows each (k = b * r). For each band, hash the r values together into a bucket. Two items are "candidate pairs" if they hash to the same bucket in at least one band.
The probability that two items with Jaccard similarity s become candidates is:
P(candidate) = 1 - (1 - s^r)^b
This is an S-shaped curve. Below a threshold similarity, the probability drops quickly toward zero. Above the threshold, it rises quickly toward one. The threshold is approximately (1/b)^(1/r).
Example: With b=20 bands of r=5 rows each (k=100):
- Threshold ≈ (1/20)^(1/5) ≈ 0.55
- Two items with J=0.8: P(candidate) ≈ 0.9996 (almost certainly caught)
- Two items with J=0.3: P(candidate) ≈ 0.048 (almost certainly skipped)
You can shift the threshold by adjusting b and r. More bands with fewer rows lowers the threshold (catches more pairs, including more false positives). Fewer bands with more rows raises it (misses marginal pairs but has fewer false positives).
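The S-curve and threshold are simple enough to compute directly; a quick check of the numbers above:

```python
def p_candidate(s, r, b):
    """Probability that a pair with similarity s collides in at least one band."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
print(round((1 / b) ** (1 / r), 2))      # threshold ≈ 0.55
print(round(p_candidate(0.8, r, b), 4))  # ≈ 0.9996
print(round(p_candidate(0.3, r, b), 4))  # ≈ 0.0475
```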
Index Structure
For each band, maintain a hash table mapping bucket IDs to lists of item IDs. To query a new item, compute its signature, hash each band, and look up matching items in each band's hash table. The union of all matches across bands is the candidate set.
For 100 million items with 20 bands, you have 20 hash tables, each with roughly 100 million entries. Total memory is manageable (a few GB with integer IDs and hash values).
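An in-memory sketch of the structure (Python dicts standing in for the per-band hash tables; a production index would shard or persist these):

```python
from collections import defaultdict

class LSHIndex:
    """One hash table per band; the bucket key is that band's signature slice."""

    def __init__(self, bands, rows):
        self.bands, self.rows = bands, rows
        self.tables = [defaultdict(list) for _ in range(bands)]

    def _keys(self, sig):
        r = self.rows
        return [tuple(sig[i * r:(i + 1) * r]) for i in range(self.bands)]

    def add(self, item_id, sig):
        for table, key in zip(self.tables, self._keys(sig)):
            table[key].append(item_id)

    def candidates(self, sig):
        """Union over bands of all items sharing a bucket with sig."""
        found = set()
        for table, key in zip(self.tables, self._keys(sig)):
            found.update(table.get(key, ()))
        return found

idx = LSHIndex(bands=20, rows=5)
idx.add("a", [1] * 100)              # identical to the query
idx.add("b", [1] * 95 + [2] * 5)     # differs only in the last band
idx.add("c", list(range(100)))       # shares no band with the query
print(idx.candidates([1] * 100))     # {'a', 'b'}
```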
SimHash: Cosine Similarity's Counterpart
MinHash estimates Jaccard similarity (set overlap). SimHash estimates cosine similarity (angular distance between vectors).
SimHash produces a b-bit fingerprint. For each of the b bit positions, compute a weighted sum of the input vector's components using a random hyperplane. If the sum is positive, the bit is 1; if negative, the bit is 0.
The Hamming distance between two SimHash fingerprints approximates the cosine distance: each bit differs with probability θ/π, where θ is the angle between the vectors. Two fingerprints differing in 10 out of 64 bits indicate high similarity (cosine similarity ≈ cos(π × 10/64) ≈ 0.88).
Google used SimHash for web page deduplication in their crawl pipeline. Each page is represented as a TF-IDF vector and SimHashed down to a 64-bit fingerprint. Pages with fingerprints differing in at most 3 bits are flagged as near-duplicates.
SimHash's advantage over MinHash for text: it works directly on weighted vectors (TF-IDF scores) without needing to convert to sets first. Its disadvantage: it only captures cosine similarity, which is blind to set size differences that Jaccard catches.
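A sketch of the text variant, where each feature's hash bits stand in for random-hyperplane signs (the word features and uniform weights are illustrative; TF-IDF scores would go where the 1.0s are):

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Each feature's hash contributes +weight or -weight per bit position;
    the fingerprint bit is 1 where the weighted sum is positive."""
    totals = [0.0] * bits
    for feature, weight in weighted_features.items():
        h = int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i, t in enumerate(totals) if t > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

words = "minhash lsh shingle signature band bucket jaccard cosine dedup crawl index hash"
base = {w: 1.0 for w in words.split()}
near = dict(base); del near["crawl"]; near["crawler"] = 1.0   # one feature changed
unrelated = {w: 1.0 for w in "apple pear plum fig grape melon kiwi mango lime peach date guava".split()}

print(hamming(simhash(base), simhash(near)))       # few bits differ: near-duplicate
print(hamming(simhash(base), simhash(unrelated)))  # many bits differ: unrelated
```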
Production Patterns
Web deduplication. Google's approach: shingle each page, compute a SimHash fingerprint, and compare it against a database of existing fingerprints using tables of permuted fingerprints, so that matches within a few bits can be found without scanning every entry. Near-duplicates are flagged for removal from search results.
Spotify playlist similarity. Represent each playlist as a set of track IDs. MinHash the set, LSH-index the signatures. Finding playlists similar to a given one becomes a bucket lookup instead of an all-pairs comparison.
Entity resolution. Matching records across databases (is "John Smith, 123 Main St" the same as "J. Smith, 123 Main Street"?). Shingle the text fields, MinHash them, and use LSH to find candidate pairs. Then run a more expensive fuzzy matcher on just the candidates.
Malware detection. Compute SimHash of binary files or code behavior vectors. Similar SimHash fingerprints suggest the same malware family with minor mutations. Faster than comparing full binaries and scales to millions of samples.
Scaling Considerations
MinHash signature computation is embarrassingly parallel. Each document's signature is independent. Map over all documents, compute signatures, done.
LSH index building is also parallel by band. Each band's hash table is independent and can be built and queried in parallel.
The bottleneck is usually the candidate verification step. LSH gives you candidate pairs, but you still need to verify them (compute exact similarity and apply your threshold). If your LSH parameters produce too many false positive candidates, verification becomes the bottleneck. Tune b and r carefully to balance candidate volume against verification cost.
Apache Spark's MLlib includes a MinHashLSH implementation that distributes both signature computation and candidate generation across a cluster. For datasets beyond a single machine, this is the standard approach.
Key Points
- MinHash estimates the Jaccard similarity between two sets without comparing them element by element. Two documents with 80% word overlap will have matching MinHash values about 80% of the time. This turns a potentially expensive set intersection into a fast signature comparison
- An LSH (Locality-Sensitive Hashing) index groups similar items into the same hash bucket with high probability. Unlike regular hash functions that spread similar inputs apart, LSH deliberately creates collisions for similar items. This enables sub-linear similarity search
- The banding technique controls the trade-off between false positives and false negatives. Split the signature into b bands of r rows each. Two items are candidates if they match in at least one band. More bands with fewer rows catches more true pairs but also more false ones
- SimHash (Charikar, 2002) is the cosine similarity counterpart to MinHash. It produces a fixed-size binary fingerprint where the Hamming distance between fingerprints approximates the cosine distance between the original vectors. Google used SimHash for web page deduplication at scale
- The combination of MinHash signatures and LSH indexing reduces near-duplicate detection from O(n²) pairwise comparisons to roughly O(n), making it practical on datasets with billions of items
Common Mistakes
- Using too few hash functions in the MinHash signature. The estimation error of Jaccard similarity is proportional to 1/sqrt(k) where k is the number of hash functions. With 100 hash functions, your error is about 10%. With 400, it is about 5%. Skimping on k for speed gives unreliable results
- Applying MinHash to weighted or ordered data without adaptation. Standard MinHash works on unweighted sets (is this element present or not). For weighted sets (TF-IDF vectors), you need weighted MinHash. For ordered sequences (sentences), you need shingling first
- Setting LSH bands and rows without understanding the S-curve. The probability of two items becoming candidates is 1-(1-s^r)^b where s is the true similarity. Plotting this curve for your chosen b and r shows you exactly where the threshold falls and how steep it is
- Running all-pairs similarity on a large dataset without LSH. Pairwise comparison of 10 million items is 50 trillion comparisons. No amount of hardware makes this feasible. LSH reduces it to checking only items that fall in the same bucket, which is typically a few orders of magnitude fewer pairs