← Lessons

quiz vs the machine

Platinum1820

Databases

Bloom Filters in Analytics

Probabilistic membership tests that prune joins and skip files.

6 min read · advanced · beat Platinum to climb

A Compact Membership Test

A bloom filter is a small bit array that answers does this set probably contain a value. It can report definitely not present or possibly present, never a false negative but sometimes a false positive. This tiny structure lets analytics engines avoid reading data that cannot match.

Where It Helps

  • File skipping: a per file bloom filter on a column lets a point filter skip files that definitely lack the value.
  • Join pruning: the build side of a join broadcasts a bloom filter of its keys so the probe side discards non matching rows early, before a shuffle.

The Tradeoff

A bloom filter uses fixed bits and several hash functions. More bits and more keys raise the false positive rate, which only causes wasted reads, never wrong answers. Tuning size against expected keys balances memory and pruning power.

Key idea

Bloom filters give a compact no false negative membership test that lets analytics engines skip files and prune join probes early, trading a tunable false positive rate for far less data read.

Check yourself

Answer to earn rating on the learn ladder.

1. What answer can a bloom filter never give incorrectly?

2. How do bloom filters help joins?

3. What does a higher false positive rate cause?