What it does
A bloom filter is a compact bit array that answers set-membership queries with one twist: it can say "definitely not present" or "possibly present", but it never says "definitely present". In other words, it allows false positives but never false negatives.
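To make the two possible answers concrete, here is a minimal sketch in Python. The class name, the method names, and the choice of deriving bit positions from a SHA-256 digest are assumptions made for illustration; real engines typically use faster non-cryptographic hashes and tuned layouts.

```python
import hashlib


class BloomFilter:
    """Illustrative bloom filter over strings (hypothetical names and sizes)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from two 64-bit halves of one
        # digest (double hashing). This is one common choice, assumed here.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means only possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter(num_bits=1024, num_hashes=3)
for word in ["alpha", "beta", "gamma"]:
    seen.add(word)

# Every inserted item reports True: there are no false negatives.
inserted_ok = all(seen.might_contain(w) for w in ["alpha", "beta", "gamma"])
```

An item that was never added usually reports False, but it can report True if all of its bit positions happen to have been set by other items. That collision is the false positive.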
How analytics uses it
- A columnar file stores a bloom filter per data block. Before scanning a block for a value, the engine checks the filter and skips the block if the value is definitely absent.
- A join can build a filter on the keys of one side and ship it to the other side, where it prunes non-matching rows early.
- A deduplication step checks the filter first: an item the filter reports as definitely absent is new, so the step can skip the expensive lookup against the full set of items already seen.
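The block-skipping case can be sketched as follows. This is a toy model, not any particular file format's layout: the filter size, the hash count, the use of a Python integer as the bit array, and the helper names (positions, build_filter, might_contain, scan) are all assumptions for illustration.

```python
import hashlib

NUM_BITS = 256   # assumed filter size per block
NUM_HASHES = 3   # assumed number of hash positions per value


def positions(value: str):
    # Double hashing over one SHA-256 digest (an illustrative choice).
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_filter(values) -> int:
    # The filter is an int used as a bit array, built once per block.
    bits = 0
    for value in values:
        for pos in positions(value):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, value: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(value))


def scan(blocks, target: str):
    """Return (matching rows, number of blocks actually read)."""
    hits, blocks_read = [], 0
    for block_filter, rows in blocks:
        if not might_contain(block_filter, target):
            continue  # definitely absent: skip without reading the rows
        blocks_read += 1  # possibly present: pay for the read
        hits.extend(row for row in rows if row == target)
    return hits, blocks_read


raw_blocks = [["ada", "bob"], ["cy", "dee"], ["eve", "ada"]]
blocks = [(build_filter(rows), rows) for rows in raw_blocks]
hits, blocks_read = scan(blocks, "ada")
```

The two blocks that hold "ada" are always read. The middle block is skipped unless its filter produces a false positive, in which case it is read and simply contributes no rows.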
The trade-off
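A "definitely not" answer is exact and saves a real scan. A "possibly" answer occasionally triggers a scan that finds nothing; that wasted scan is the cost of a false positive. The size of the bit array and the number of hash functions, relative to the number of items inserted, control the false positive rate.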
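The relationship between size, hash count, and false positive rate can be estimated with the standard approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions, and n the number of inserted items. The sketch below applies it; the function names are hypothetical, and the formula assumes independent, uniformly distributed hashes.

```python
import math


def false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    # Standard approximation: p ~= (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def size_for(num_items: int, target_rate: float):
    # Bits that minimise size for a target rate: m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))
    # Matching hash count: k = (m / n) * ln 2, rounded to a whole number.
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes


num_bits, num_hashes = size_for(num_items=1_000_000, target_rate=0.01)
estimated = false_positive_rate(num_bits, num_hashes, 1_000_000)
```

For a 1% target this works out to a little under 10 bits per item with 7 hash functions. Because the hash count is rounded to an integer, the estimated rate lands close to the target rather than exactly on it.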
In analytics, the win is avoiding expensive reads on the many blocks that cannot contain a value, which turns a full scan into a targeted one.
Key idea
A bloom filter cheaply rules out absent values so analytic engines skip blocks and rows that cannot match, accepting rare false positives.