← Lessons

quiz vs the machine

Platinum1740

System Design

Data Partitioning and Bucketing

Two ways to physically organize table files so queries scan less data.

5 min read · advanced · beat Platinum to climb

Cutting the data you scan

A query that filters on a date should not read years of files. Partitioning and bucketing physically arrange a table so engines can skip irrelevant files.

  • Partitioning splits a table into folders by a column, often a date. A filter on that column means the engine reads only the matching folders, called partition pruning.
  • Bucketing hashes a column into a fixed number of files within each partition. Rows with the same key always land in the same bucket, which speeds joins and removes shuffles.

Choosing a key

Partition on a low cardinality column you filter by, like event date. Avoid high cardinality keys such as user id, which create millions of tiny files and slow everything down, a problem called the small files problem. Bucket on high cardinality join keys instead, since the bucket count is fixed.

Key idea

Partition on a low cardinality filter column to prune folders, and bucket on a high cardinality join key into a fixed number of files to speed joins without exploding file counts.

Check yourself

Answer to earn rating on the learn ladder.

1. What is partition pruning?

2. Why is partitioning by a high cardinality column like user id a bad idea?

3. What does bucketing help with?