Counting unique items cheaply
HyperLogLog estimates the number of distinct elements in a stream using only a few kilobytes, even for billions of items. Exact counting would need memory proportional to the cardinality, which it avoids entirely.
The leading zero intuition
Hash each item to a uniform bit string. In random bit strings, seeing a long run of leading zeros is rare. If the longest run you ever observe is r, then you have probably seen on the order of two to the r distinct values.
- One register would be noisy.
- So the stream is split into many buckets.
Buckets and harmonic mean
The first few hash bits pick a register, and the rest determine the leading zero count stored there, keeping the maximum seen. To combine registers into one estimate, HyperLogLog uses a harmonic mean, which suppresses the influence of unusually large outlier registers.
- More registers gives a smaller relative error.
- Corrections handle very small and very large ranges.
Why it scales
Each register holds only a tiny count, so memory stays fixed regardless of how many items stream through. Sketches also merge by taking the elementwise maximum of registers, which makes distributed counting trivial.
Key idea
HyperLogLog hashes items, routes them to registers, and tracks the longest leading zero run per register, then combines registers with a harmonic mean to estimate distinct counts in fixed tiny memory that merges cleanly across machines.