Cutting the data you scan
A query that filters on a date should not read years of files. Partitioning and bucketing physically arrange a table so engines can skip irrelevant files.
- Partitioning splits a table into folders by a column, often a date. A filter on that column means the engine reads only the matching folders, called partition pruning.
- Bucketing hashes a column into a fixed number of files within each partition. Rows with the same key always land in the same bucket, which speeds joins and removes shuffles.
Choosing a key
Partition on a low cardinality column you filter by, like event date. Avoid high cardinality keys such as user id, which create millions of tiny files and slow everything down, a problem called the small files problem. Bucket on high cardinality join keys instead, since the bucket count is fixed.
Key idea
Partition on a low cardinality filter column to prune folders, and bucket on a high cardinality join key into a fixed number of files to speed joins without exploding file counts.