Counting a billion things in a kilobyte

foundations

algorithms

Counting how many distinct things you’ve seen sounds like it needs to remember all of them. HyperLogLog counts the unique elements in a massive stream using a fixed sliver of memory — by paying attention to the luckiest hash it has seen.

Author

Umut Altun

Published

August 22, 2023

“How many distinct users did we see today?” is a question that sounds trivial and quietly isn’t. Counting events is easy — a single number you increment. Counting distinct things is different, because to know whether the user in front of you is new, you have to remember everyone you’ve already counted. The naive distinct-count keeps a set of every unique ID seen, and that set grows with the number of distinct items. At a billion uniques, you’re holding a billion IDs in memory just to answer “how many.” Across many such counts — per country, per campaign, per hour — the memory bill gets absurd fast.

The thing that surprised me when I first met HyperLogLog is that you can answer “how many distinct” without remembering what you’ve seen at all — in a fixed, tiny amount of memory, a couple of kilobytes, whether the true count is a thousand or a billion. It gives up exactness to do it, in the same spirit as hashing things into buckets to avoid comparing every pair: trade a little precision for an enormous saving in resources. And the core idea is genuinely beautiful, so let me try to do it justice.

Start with a coin-flipping intuition. If I tell you I flipped a coin a bunch of times and the longest run of heads I ever saw was 3, you’d guess I didn’t flip very many times. If I said my longest run was 20 heads in a row, you’d guess I flipped an enormous number of times, because a run of 20 is rare — you only stumble onto it if you’ve had a lot of attempts. The length of the rarest streak you’ve encountered is evidence of how many tries you’ve made.

HyperLogLog runs that argument on hashes. Hash each item to a string of random-looking bits. Count the leading zeros in each hash — a hash starting with one zero is unremarkable (happens half the time), a hash starting with ten zeros is rare (one in about a thousand). The single most important thing you track is the maximum number of leading zeros you’ve seen across all items. If somewhere in your stream a hash showed up with ten leading zeros, you’ve probably seen on the order of a thousand distinct items — because you’d only expect to hit something that rare after roughly that many tries. The luckiest hash you’ve seen is a fossil of how many distinct things produced it.

# don't remember the items. remember the rarest hash pattern they produced.
max_leading_zeros = 0
for item in stream:
    h = hash(item)
    max_leading_zeros = max(max_leading_zeros, leading_zeros(h))
# a max of ~k leading zeros  ->  roughly 2^k distinct items seen

And here’s the part that makes it work regardless of scale: duplicates are free. If the same user shows up a million times, they hash to the same value with the same leading-zero count every time, so they move the maximum exactly zero times after the first. The estimate only responds to distinct items, automatically, without ever checking whether something is a repeat — which is the entire thing that made distinct-counting expensive in the first place. You’re not deduplicating. You’re tracking a statistic that duplicates can’t budge.

One max-leading-zeros counter on its own is a wildly noisy estimator — one unusually lucky hash early on throws it way off. So HyperLogLog splits the stream into many independent buckets (using part of each hash to pick the bucket), keeps a max-leading-zeros counter in each, and combines them with a harmonic mean that tames the outliers. That’s where the “couple of kilobytes” goes: a few thousand small counters, each a handful of bits. With that, you get distinct-count estimates accurate to a couple of percent, on streams of any size, in fixed memory — and you can merge two HyperLogLogs by taking the per-bucket maxima, which means you can count uniques on shards in parallel and combine them losslessly. That mergeability is the unsung feature; it’s what makes it work across a distributed pipeline.

The honest cost, again, is exactness: HyperLogLog gives you “about 9.8 million uniques, give or take a couple percent,” never “9,814,237 exactly.” For a surprising number of real questions that’s completely fine — “roughly how many distinct devices, within a couple percent” is a number you act on the same way whether it’s exact or not, and nobody was going to make a different decision over the last few thousand. When you genuinely need the exact figure (billing, audit, dedup you’ll act on per-item), HyperLogLog is the wrong tool and you pay for the real set. The skill is knowing which of the two questions you actually have.

The thing that stuck with me wasn’t HyperLogLog itself — it was noticing that “count the distinct things” and “estimate how many distinct things” are different questions with wildly different costs, and that I’d been defaulting to the expensive one out of habit. A lot of “we can’t compute that at our scale” turns out to be “we assumed we needed the exact answer,” and the exact answer was never the thing anyone needed.¹

From the data-mining and big-data side of my coursework — sketch algorithms are the part I’ve reached for most since. Code is illustrative.

Footnotes

HyperLogLog is one of a family of “sketch” algorithms — small, fixed-memory summaries of huge streams that answer one question approximately. Count-Min Sketch does the same trick for frequency (“roughly how many times did this item appear”), Bloom filters for membership (“have I probably seen this before”). They all make the same bargain: a bounded, tunable error rate in exchange for memory that doesn’t grow with the data. Once you’ve internalised the bargain, you start seeing places to make it everywhere.↩︎