Finding the needles that look alike

foundations
algorithms
Spotting near-duplicates in a few million items sounds like a job for comparing every pair — which is a job you can’t afford. Locality-sensitive hashing turns a quadratic similarity search into a near-linear one by hashing similar things into the same bucket on purpose.
Author

Umut Altun

Published

May 23, 2023

Here’s a problem that sounds easy until you do the arithmetic. You have a few million items — documents, records, ad creatives, user profiles, whatever — and you want to find the ones that are near-duplicates of each other. Not exact copies, which a hash table finds instantly; near copies, the ones that are 90% the same with small differences. Almost-the-same.

The obvious approach is to compare every item to every other item and keep the pairs that score similar. For a few million items, “every pair” is a few trillion comparisons. Even at a nanosecond each you’re looking at hours to days, and you usually have far more than a few million. The quadratic blowup makes the obvious approach a non-starter, and the frustrating part is that almost all of that work is wasted: the overwhelming majority of pairs aren’t remotely similar, and you knew that before you started. You’re spending trillions of comparisons to confirm that random pairs of things are different.

What you want is a way to only compare pairs that have a real chance of being similar — to skip the trillions of obvious non-matches without looking at them. That’s exactly what locality-sensitive hashing does, and the idea behind it is one of those things that felt like sleight of hand the first time I understood it.

A normal hash function tries to scatter inputs apart: similar inputs should get wildly different hashes, so the buckets fill evenly and collisions are rare. LSH does the opposite on purpose. It uses a hash function designed so that similar items are likely to collide — to land in the same bucket — and dissimilar items are likely not to. You flip the goal of hashing: collisions stop being an accident to avoid and become the signal you’re looking for. Hash everything once, and the near-duplicates have herded themselves into shared buckets. Now you only compare within buckets, which is a tiny fraction of all pairs.

For set-similarity — “how much overlap do these two things have” — the classic construction is MinHash. The trick: represent each item as a set of features (the words in a document, the shingles of a string, the tags on a creative), then hash that set down to a small signature where the probability that two signatures match equals the actual overlap of the original sets. A pair that’s 80% similar has roughly an 80% chance of matching on any given signature slot. Stack a few slots into bands and you get a tunable filter — items only become “candidates” if they collide in a whole band, which you can dial to catch 85%-similar pairs while ignoring 60%-similar ones.

# normal hashing: scatter, so collisions are rare and meaningless.
# LSH/MinHash: collide on purpose, so a shared bucket *means* "probably similar".
for item in items:
    sig = minhash(features(item))          # similarity-preserving signature
    for band in bands(sig):
        buckets[band].add(item)            # similar items pile into shared buckets

candidates = pairs_within_each_bucket(buckets)   # tiny fraction of all pairs
results = [p for p in candidates if real_similarity(p) >= threshold]  # verify

The shape of the win is the thing to hold onto. You hash every item once — linear — then do exact, expensive similarity checks only on the handful of candidate pairs that survived the bucketing, instead of on all of them. A search that was quadratic and impossible becomes near-linear and routine. The trillions of comparisons that would have all said “not similar” simply never happen, because LSH never put those pairs in the same bucket to begin with.

There’s a cost, and it’s the honest tradeoff at the centre of the whole technique: LSH is approximate. It can miss a genuine near-duplicate (two similar items that unluckily never shared a bucket — a false negative) and it surfaces some candidates that turn out not to match (false positives, which the verification step catches). You tune the bands and rows to trade these off — more bands catches more true pairs at the cost of more candidates to verify — but you can’t drive both errors to zero. What you’re buying with that little bit of inexactness is the difference between “computable” and “not.” For finding near-duplicates at scale there usually isn’t an exact algorithm that’s also affordable, so approximate-and-fast wins by default, because exact-and-impossible doesn’t actually solve anyone’s problem.

I keep LSH in my head as the first thing to reach for whenever a problem smells quadratic — anywhere the naive solution is “compare everything to everything.” Near-duplicate detection is the textbook case, but the same shape shows up in clustering, in nearest-neighbour search, in deduplication, in matching records across messy datasets. The move is always the same: stop trying to be exact about the trillions of pairs that don’t matter, and find a cheap, approximate way to look only at the ones that might.1


One of the algorithms from my data-mining coursework that turned out to earn its keep in practice. Code is illustrative.

Footnotes

  1. MinHash is for set overlap (Jaccard similarity). The LSH family is broader — there are constructions for cosine similarity (random hyperplane / SimHash, useful for embeddings) and for Euclidean distance. The unifying idea is the same across all of them: a hash that collides with probability tied to the similarity measure you care about. Pick the construction that matches your notion of “similar,” and the bucketing does the rest.↩︎