← Lessons

quiz vs the machine

Platinum1780

Algorithms

HyperLogLog Deep Dive

Count distinct items in massive streams using leading zero patterns and tiny registers.

7 min read · advanced · beat Platinum to climb

Counting unique items cheaply

HyperLogLog estimates the number of distinct elements in a stream using only a few kilobytes, even for billions of items. Exact counting would need memory proportional to the cardinality, which it avoids entirely.

The leading zero intuition

Hash each item to a uniform bit string. In random bit strings, seeing a long run of leading zeros is rare. If the longest run you ever observe is r, then you have probably seen on the order of two to the r distinct values.

  • One register would be noisy.
  • So the stream is split into many buckets.

Buckets and harmonic mean

The first few hash bits pick a register, and the rest determine the leading zero count stored there, keeping the maximum seen. To combine registers into one estimate, HyperLogLog uses a harmonic mean, which suppresses the influence of unusually large outlier registers.

  • More registers gives a smaller relative error.
  • Corrections handle very small and very large ranges.

Why it scales

Each register holds only a tiny count, so memory stays fixed regardless of how many items stream through. Sketches also merge by taking the elementwise maximum of registers, which makes distributed counting trivial.

Key idea

HyperLogLog hashes items, routes them to registers, and tracks the longest leading zero run per register, then combines registers with a harmonic mean to estimate distinct counts in fixed tiny memory that merges cleanly across machines.

Check yourself

Answer to earn rating on the learn ladder.

1. What observation does HyperLogLog use to estimate cardinality?

2. Why does HyperLogLog combine its registers with a harmonic mean?

3. How can two HyperLogLog sketches be merged?