MinHash & Locality-Sensitive Hashing
The All-Pairs Problem
You have 100 million web pages and need to find near-duplicates. Maybe someone scraped your content, or you want to deduplicate crawl results, or you are building a plagiarism detector. The naive approach compares every pair of pages. That is 100 million times 100 million divided by 2 = 5 x 10^15 comparisons. Even at a million comparisons per second, this takes 158 years.
MinHash and LSH reduce this to a tractable problem. Instead of comparing all pairs, you build compact signatures that preserve similarity information, then use a clever indexing scheme to only compare items that are likely to be similar.
Jaccard Similarity and Shingling
Before MinHash, you need a similarity measure. Jaccard similarity is the ratio of the intersection to the union of two sets:
J(A, B) = |A ∩ B| / |A ∪ B|
If sets A and B have 8 elements in common out of 10 total unique elements, the Jaccard similarity is 0.8.
For documents, the "set" is typically a set of shingles (n-grams). A k-shingle is a contiguous sequence of k characters or k words. For the sentence "the cat sat on the mat," the 2-word shingles are: {"the cat", "cat sat", "sat on", "on the", "the mat"}.
Shingling converts a document into a set, and Jaccard similarity measures how much two documents share. Documents with 90% shingle overlap are near-duplicates. Documents with 5% overlap are unrelated.
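A minimal sketch of both steps (word-level shingling and exact Jaccard; the sentence pair is illustrative):

```python
def shingles(text, k=2):
    """Word-level k-shingles: the set of contiguous k-word sequences."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a or b else 1.0

s1 = shingles("the cat sat on the mat")
s2 = shingles("the cat sat on a mat")
print(sorted(s1))
print(jaccard(s1, s2))  # 3 shared shingles out of 7 total: 3/7 ≈ 0.43
```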
How MinHash Works
Computing Jaccard similarity directly requires iterating through both sets. For large sets (a web page might have 10,000 shingles), this is expensive per pair. MinHash compresses each set into a fixed-size signature that preserves Jaccard similarity.
The Core Idea
Take a random permutation of all possible elements in the universe. The MinHash of a set is the first element in the permuted order that belongs to the set.
The critical property: the probability that two sets have the same MinHash equals their Jaccard similarity.
Pr[MinHash(A) = MinHash(B)] = J(A, B)
Why? MinHash(A) = MinHash(B) exactly when the first element of A ∪ B in the permuted order also lies in A ∩ B. Under a random permutation, every element of A ∪ B is equally likely to come first, so this happens with probability |A ∩ B| / |A ∪ B|, which is exactly the Jaccard similarity.
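The property is easy to check empirically. A sketch with explicit random permutations (the two sets are arbitrary; with enough trials the match rate converges to J):

```python
import random

def minhash_rank(s, perm):
    """MinHash under one permutation: the smallest rank among s's elements."""
    return min(perm[x] for x in s)

A, B = set(range(0, 8)), set(range(4, 14))   # |A ∩ B| = 4, |A ∪ B| = 14
universe = sorted(A | B)

random.seed(0)
trials, matches = 20_000, 0
for _ in range(trials):
    order = universe[:]
    random.shuffle(order)                     # a fresh random permutation
    perm = {elem: rank for rank, elem in enumerate(order)}
    if minhash_rank(A, perm) == minhash_rank(B, perm):
        matches += 1

print(matches / trials)   # converges to J(A, B) = 4/14 ≈ 0.286
```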
Building the Signature
One MinHash value gives you a single bit of information. To get a good estimate, use k independent hash functions (each simulating a random permutation). For each hash function, compute the minimum hash value across all elements in the set.
The result is a signature: a vector of k minimum hash values. To estimate J(A, B), count how many positions have the same value in both signatures and divide by k.
Signature A: [3, 7, 2, 9, 1, 4, 8, 3, 6, 2]
Signature B: [3, 4, 2, 9, 5, 4, 8, 7, 6, 2]
Matches: Y N Y Y N Y Y N Y Y → 7/10 = 0.7
Estimated Jaccard similarity: 0.7
With k=200, the standard error is 1/sqrt(200) ≈ 7%. Good enough for most deduplication tasks.
Practical Hash Functions
You do not actually generate k independent random permutations. Instead, use k hash functions. For each element in the set, compute all k hashes. The MinHash for hash function i is the minimum value of hash_i across all elements.
In practice, people use a single hash function with k different seeds, or the Kirsch-Mitzenmacher trick (two base hashes combined linearly, same as for Bloom filters).
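A sketch of this scheme, with a salted MD5 standing in for k independent hash functions (the element strings and set sizes are illustrative):

```python
import hashlib

def h(seed, elem):
    """Hash function number `seed`, simulated by salting one base hash."""
    return int.from_bytes(hashlib.md5(f"{seed}:{elem}".encode()).digest()[:8], "big")

def minhash_signature(elements, k=200):
    """For each of k hash functions, keep the minimum hash over the set."""
    return [min(h(seed, e) for e in elements) for seed in range(k)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = {f"shingle{i}" for i in range(100)}
B = {f"shingle{i}" for i in range(20, 120)}   # true Jaccard = 80/120 ≈ 0.667
sa, sb = minhash_signature(A), minhash_signature(B)
print(estimate_jaccard(sa, sb))   # close to 0.667, within about 1/sqrt(200)
```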
LSH: From Signatures to Candidates
MinHash gives you compact signatures, but you still need to compare every pair of signatures to find similar items. With 100 million items and 200-dimensional signatures, that is still 5 x 10^15 comparisons.
Locality-Sensitive Hashing eliminates most comparisons by grouping similar signatures into buckets.
The Banding Technique
Split each signature into b bands of r rows each (k = b * r). For each band, hash the r values together into a bucket. Two items are "candidate pairs" if they hash to the same bucket in at least one band.
The probability that two items with Jaccard similarity s become candidates is:
P(candidate) = 1 - (1 - s^r)^b
This is an S-shaped curve. Below a threshold similarity, the probability drops quickly toward zero. Above the threshold, it rises quickly toward one. The threshold is approximately (1/b)^(1/r).
Example: With b=20 bands of r=5 rows each (k=100):
- Threshold ≈ (1/20)^(1/5) ≈ 0.55
- Two items with J=0.8: P(candidate) ≈ 0.9996 (almost certainly caught)
- Two items with J=0.3: P(candidate) ≈ 0.048 (almost certainly skipped)
You can shift the threshold by adjusting b and r. More bands with fewer rows lowers the threshold (catches more pairs, including more false positives). Fewer bands with more rows raises it (misses marginal pairs but has fewer false positives).
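The S-curve and threshold are simple enough to compute directly; a quick check of the numbers above:

```python
def p_candidate(s, r, b):
    """Probability that a pair with similarity s collides in at least one band."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
print(round((1 / b) ** (1 / r), 2))      # threshold ≈ 0.55
print(round(p_candidate(0.8, r, b), 4))  # ≈ 0.9996
print(round(p_candidate(0.3, r, b), 4))  # ≈ 0.0475
```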
Index Structure
For each band, maintain a hash table mapping bucket IDs to lists of item IDs. To query a new item, compute its signature, hash each band, and look up matching items in each band's hash table. The union of all matches across bands is the candidate set.
For 100 million items with 20 bands, you have 20 hash tables, each with roughly 100 million entries. Total memory is manageable (a few GB with integer IDs and hash values).
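An in-memory sketch of the structure (Python dicts standing in for the per-band hash tables; a production index would shard or persist these):

```python
from collections import defaultdict

class LSHIndex:
    """One hash table per band; the bucket key is that band's signature slice."""

    def __init__(self, bands, rows):
        self.bands, self.rows = bands, rows
        self.tables = [defaultdict(list) for _ in range(bands)]

    def _keys(self, sig):
        r = self.rows
        return [tuple(sig[i * r:(i + 1) * r]) for i in range(self.bands)]

    def add(self, item_id, sig):
        for table, key in zip(self.tables, self._keys(sig)):
            table[key].append(item_id)

    def candidates(self, sig):
        """Union over bands of all items sharing a bucket with sig."""
        found = set()
        for table, key in zip(self.tables, self._keys(sig)):
            found.update(table.get(key, ()))
        return found

idx = LSHIndex(bands=20, rows=5)
idx.add("a", [1] * 100)              # identical to the query
idx.add("b", [1] * 95 + [2] * 5)     # differs only in the last band
idx.add("c", list(range(100)))       # shares no band with the query
print(idx.candidates([1] * 100))     # {'a', 'b'}
```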
SimHash: Cosine Similarity's Counterpart
MinHash estimates Jaccard similarity (set overlap). SimHash estimates cosine similarity (angular distance between vectors).
SimHash produces a b-bit fingerprint. For each of the b bit positions, compute a weighted sum of the input vector's components using a random hyperplane. If the sum is positive, the bit is 1; if negative, the bit is 0.
The Hamming distance between two SimHash fingerprints approximates the cosine distance: each bit differs with probability θ/π, where θ is the angle between the vectors. Two fingerprints differing in 10 out of 64 bits indicate high similarity (cosine similarity ≈ cos(π × 10/64) ≈ 0.88).
Google used SimHash for web page deduplication in their crawl pipeline. Each page is represented as a TF-IDF vector and SimHashed down to a 64-bit fingerprint. Pages with fingerprints differing in at most 3 bits are flagged as near-duplicates.
SimHash's advantage over MinHash for text: it works directly on weighted vectors (TF-IDF scores) without needing to convert to sets first. Its disadvantage: it only captures cosine similarity, which is blind to set size differences that Jaccard catches.
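A sketch of the text variant, where each feature's hash bits stand in for random-hyperplane signs (the word features and uniform weights are illustrative; TF-IDF scores would go where the 1.0s are):

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Each feature's hash contributes +weight or -weight per bit position;
    the fingerprint bit is 1 where the weighted sum is positive."""
    totals = [0.0] * bits
    for feature, weight in weighted_features.items():
        h = int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i, t in enumerate(totals) if t > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

words = "minhash lsh shingle signature band bucket jaccard cosine dedup crawl index hash"
base = {w: 1.0 for w in words.split()}
near = dict(base); del near["crawl"]; near["crawler"] = 1.0   # one feature changed
unrelated = {w: 1.0 for w in "apple pear plum fig grape melon kiwi mango lime peach date guava".split()}

print(hamming(simhash(base), simhash(near)))       # few bits differ: near-duplicate
print(hamming(simhash(base), simhash(unrelated)))  # many bits differ: unrelated
```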
Production Patterns
Web deduplication. Google's approach: shingle each page, compute a SimHash fingerprint, and compare it against a database of existing fingerprints using tables of permuted fingerprints, so that matches within a few bits can be found without scanning every entry. Near-duplicates are flagged for removal from search results.
Spotify playlist similarity. Represent each playlist as a set of track IDs. MinHash the set, LSH-index the signatures. Finding playlists similar to a given one becomes a bucket lookup instead of an all-pairs comparison.
Entity resolution. Matching records across databases (is "John Smith, 123 Main St" the same as "J. Smith, 123 Main Street"?). Shingle the text fields, MinHash them, and use LSH to find candidate pairs. Then run a more expensive fuzzy matcher on just the candidates.
Malware detection. Compute SimHash of binary files or code behavior vectors. Similar SimHash fingerprints suggest the same malware family with minor mutations. Faster than comparing full binaries and scales to millions of samples.
Scaling Considerations
MinHash signature computation is embarrassingly parallel. Each document's signature is independent. Map over all documents, compute signatures, done.
LSH index building is also parallel by band. Each band's hash table is independent and can be built and queried in parallel.
The bottleneck is usually the candidate verification step. LSH gives you candidate pairs, but you still need to verify them (compute exact similarity and apply your threshold). If your LSH parameters produce too many false positive candidates, verification becomes the bottleneck. Tune b and r carefully to balance candidate volume against verification cost.
Apache Spark's MLlib includes a MinHashLSH implementation that distributes both signature computation and candidate generation across a cluster. For datasets beyond a single machine, this is the standard approach.
Key Points
- MinHash estimates the Jaccard similarity between two sets without comparing them element by element. Two documents with 80% word overlap will have matching MinHash values about 80% of the time. This turns a potentially expensive set intersection into a fast signature comparison
- An LSH (Locality-Sensitive Hashing) index groups similar items into the same hash bucket with high probability. Unlike regular hash functions that spread similar inputs apart, LSH deliberately creates collisions for similar items. This enables sub-linear similarity search
- The banding technique controls the trade-off between false positives and false negatives. Split the signature into b bands of r rows each. Two items are candidates if they match in at least one band. More bands with fewer rows catches more true pairs but also more false ones
- SimHash (Charikar, 2002) is the cosine similarity counterpart to MinHash. It produces a fixed-size binary fingerprint where the Hamming distance between fingerprints approximates the cosine distance between the original vectors. Google used SimHash for web page deduplication at scale
- The combination of MinHash signatures and LSH indexing reduces near-duplicate detection from O(n²) pairwise comparisons to roughly O(n), making it practical on datasets with billions of items
Common Mistakes
- Using too few hash functions in the MinHash signature. The estimation error of Jaccard similarity is proportional to 1/sqrt(k) where k is the number of hash functions. With 100 hash functions, your error is about 10%. With 400, it is about 5%. Skimping on k for speed gives unreliable results
- Applying MinHash to weighted or ordered data without adaptation. Standard MinHash works on unweighted sets (is this element present or not). For weighted sets (TF-IDF vectors), you need weighted MinHash. For ordered sequences (sentences), you need shingling first
- Setting LSH bands and rows without understanding the S-curve. The probability of two items becoming candidates is 1-(1-s^r)^b where s is the true similarity. Plotting this curve for your chosen b and r shows you exactly where the threshold falls and how steep it is
- Running all-pairs similarity on a large dataset without LSH. Pairwise comparison of 10 million items is 50 trillion comparisons. No amount of hardware makes this feasible. LSH reduces it to checking only items that fall in the same bucket, which is typically a few orders of magnitude fewer pairs