Comparing sets cheaply
Comparing large sets, such as the words in two documents, by exact overlap is costly when there are many of them. MinHash builds a small signature for each set so that comparing signatures estimates their similarity.
The Jaccard connection
Set similarity is often measured by the Jaccard index, the size of the intersection divided by the size of the union. MinHash is designed so that two signatures agree in a position with probability exactly equal to this index.
How a signature is made
Imagine applying a random permutation to the universe of all possible elements. For each set, record the element that comes first under that permutation, its minimum. Repeat with many independent permutations to form a vector of minimums, the signature.
- The chance two sets share a minimum equals their Jaccard similarity.
- In practice random hash functions stand in for full permutations.
Estimating similarity
To estimate similarity between two sets, count the fraction of signature positions where their minimums match. More permutations give a more accurate estimate. The compact signatures also feed directly into locality sensitive hashing for fast near duplicate detection at scale.
Key idea
MinHash records the minimum element under many random permutations to form a signature whose matching fraction equals the Jaccard similarity in expectation, turning expensive set comparison into cheap signature matching.