Sampling from an unbounded stream
Suppose items arrive one at a time and you must keep a uniform random sample of fixed size, but you never learn the total count in advance and cannot store everything. Reservoir sampling solves this with a small fixed buffer called the reservoir, deciding on the fly whether each new item replaces an existing one.
The single item case
To keep just one item uniformly, hold the first item, then for the item at position k replace the held item with probability one over k.
- The first item starts as the sample.
- The second replaces it with probability one half.
- The kth item replaces the current sample with probability one over k.
A short induction shows that after processing n items, every item is held with probability exactly one over n, which is uniform.
Larger reservoirs
For a sample of size m, fill the reservoir with the first m items. Then for the item at position k beyond m, keep it with probability m over k, and if kept, evict a uniformly chosen current member. The same uniformity argument extends to size m.
Key idea
Replace reservoir members with shrinking probability as the stream grows, so every item ends up equally likely without ever knowing the total count.