Hashing that respects similarity
Ordinary hashing scatters even near identical inputs to unrelated buckets. Locality sensitive hashing, or LSH, deliberately does the opposite: it makes similar items likely to share a bucket and dissimilar items unlikely to.
The core property
A family of hash functions is locality sensitive if the probability of a collision grows with similarity. Two close points collide often, while far apart points collide rarely. The exact family depends on the distance measure, with different constructions for cosine, Jaccard, and Euclidean distances.
Amplifying the signal
A single function is too coarse. LSH combines functions in two stages.
- AND: concatenate several hashes so only very similar items match, cutting false matches.
- OR: use several such concatenated tables so true neighbors are caught in at least one.
This tuning sharpens the gap between near and far collision rates.
Searching
To find neighbors of a query, hash it into each table and inspect only the items in its buckets, then verify true distance on those few candidates. This replaces a full scan with examining a small candidate set, which is what makes approximate near neighbor search fast at scale.
Key idea
Locality sensitive hashing uses hash families where collision probability rises with similarity, amplified by combining tables, so near neighbor search inspects only a small candidate set of bucket mates instead of scanning everything.