Counting in small space
A count min sketch estimates how often each item appears in a stream while using far less memory than an exact table. It is a grid of counters with one row per hash function and many columns.
Update and estimate
To record an item, hash it once per row, and for each row increment the counter in the chosen column. To estimate its count, hash it the same way and take the minimum of the counters it touches.
- Collisions can only add to a counter, never subtract.
- So each counter is an overestimate, and the minimum is the tightest of them.
Why the minimum
Different items that collide in one row likely do not collide in the others, so at least one of an item's counters tends to be only mildly inflated. Taking the minimum picks that cleanest estimate, bounding the overcount with high probability.
- More columns reduce the overestimate size.
- More rows reduce the chance the bound is exceeded.
Strengths and limits
The sketch never undercounts, which makes it ideal for finding heavy hitters. It cannot reliably report items with very small counts, since their estimates are dominated by collision noise.
Key idea
A count min sketch keeps a grid of counters, increments one per row by hashing, and estimates frequency as the minimum across rows, which only ever overcounts and tightly bounds the error for frequent items.