← Lessons

quiz vs the machine

Platinum1760

System Design

Approximate Count with HyperLogLog

Estimating the number of distinct items in massive streams using tiny memory.

5 min read · advanced · beat Platinum to climb

The problem

Counting distinct items, like unique visitors, exactly needs a set holding every value seen, which is huge for billions of items. HyperLogLog estimates the count in kilobytes with a small, bounded error.

The intuition

  • Hash each item to a uniform bit string.
  • Track the position of the leftmost run of leading zeros seen. A long run is rare, so seeing one hints that many distinct items passed through.
  • Split the hash into registers and average their estimates with a harmonic mean to cut variance.

What you get

  • Memory is fixed regardless of how many items arrive.
  • Typical error is around a couple of percent, tunable by register count.
  • Sketches are mergeable: combine per shard counts by taking the max of each register.

For dashboards that ask how many unique users, an exact answer is overkill and HyperLogLog is the practical choice.

Key idea

HyperLogLog estimates distinct counts in fixed tiny memory by tracking leading zero runs across mergeable registers.

Check yourself

Answer to earn rating on the learn ladder.

1. What does HyperLogLog estimate?

2. Why can per shard HyperLogLog sketches be merged?

3. What is the cost of using HyperLogLog instead of an exact set?