Data Partitioning and Bucketing

Cutting the data you scan

A query that filters on a date should not read years of files. Partitioning and bucketing physically arrange a table so engines can skip irrelevant files.

Partitioning splits a table into folders by a column, often a date. A filter on that column means the engine reads only the matching folders, called partition pruning.
Bucketing hashes a column into a fixed number of files within each partition. Rows with the same key always land in the same bucket, which speeds joins and removes shuffles.

Choosing a key

Partition on a low cardinality column you filter by, like event date. Avoid high cardinality keys such as user id, which create millions of tiny files and slow everything down, a problem called the small files problem. Bucket on high cardinality join keys instead, since the bucket count is fixed.

Key idea

Partition on a low cardinality filter column to prune folders, and bucket on a high cardinality join key into a fixed number of files to speed joins without exploding file counts.

Data Partitioning and Bucketing

Cutting the data you scan

Choosing a key

Key idea

Check yourself