Data Partitioning Strategy

Splitting large tables by key so queries scan only the data they need.

Why partition

Partitioning divides a large dataset into smaller physical chunks based on the value of a chosen column, the partition key. The payoff is partition pruning: a query that filters on the partition column reads only the relevant chunks instead of the whole table.
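
To make pruning concrete, here is a minimal in-memory sketch. The rows, the event_date partition column, and the helper names are all hypothetical; a real engine prunes files or table segments rather than Python lists, but the idea is the same.

```python
from collections import defaultdict

# Hypothetical event rows: (event_date, user_id, amount)
ROWS = [
    ("2024-01-01", "u1", 10),
    ("2024-01-01", "u2", 20),
    ("2024-01-02", "u1", 5),
    ("2024-01-03", "u3", 7),
]


def build_partitions(rows):
    """Group rows into chunks keyed by the partition column (event_date)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return dict(partitions)


def query_total(partitions, event_date):
    """Sum amounts for one date, reading only the matching chunk.

    Returns (total, rows_read) so the effect of pruning is visible.
    """
    chunk = partitions.get(event_date, [])  # other chunks are never touched
    total = sum(amount for _, _, amount in chunk)
    return total, len(chunk)


partitions = build_partitions(ROWS)
total, rows_read = query_total(partitions, "2024-01-01")
print(total, rows_read)  # 30 2
```

The query reads 2 of the 4 rows. A filter on user_id alone would get no such help here, because the data is not partitioned on that column.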

Common strategies

  • Time partitioning splits by date, such as one folder per day. It suits append-heavy event data and makes retention easy, because old data can be dropped one partition at a time.
  • Hash partitioning spreads rows evenly by hashing a key, avoiding hot spots.
  • Range partitioning groups rows by contiguous ranges of a key's values, which is useful for ordered scans.
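
The three strategies can each be sketched as a function that maps a row's key to a partition. This is an illustrative sketch only: the function names, the date=... folder naming, and the choice of SHA-256 as the hash are assumptions, not the convention of any particular system.

```python
import bisect
import hashlib
from datetime import datetime


def time_partition(ts: datetime) -> str:
    """Time partitioning: one partition (for example, a folder) per day."""
    return ts.strftime("date=%Y-%m-%d")


def hash_partition(key: str, num_partitions: int) -> int:
    """Hash partitioning: a stable hash of the key, modulo the partition count.

    Python's built-in hash() is randomized per process for strings,
    so a fixed digest is used to keep the assignment repeatable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def range_partition(value: int, boundaries: list) -> int:
    """Range partitioning over sorted split points.

    Partition 0 holds values below boundaries[0], partition i holds
    values in [boundaries[i-1], boundaries[i]), and the last partition
    holds everything at or above the final boundary.
    """
    return bisect.bisect_right(boundaries, value)


print(time_partition(datetime(2024, 1, 5, 13, 30)))  # date=2024-01-05
print([range_partition(v, [100, 200]) for v in (50, 100, 150, 250)])  # [0, 1, 1, 2]
```

Note the trade-off: hash partitioning balances load but scatters neighboring keys, so a filter on a value range cannot prune, while range partitioning keeps neighbors together at the risk of uneven partition sizes.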

Pitfalls to avoid

  • Too many small partitions create huge metadata overhead and many tiny files that slow reads.
  • Skew happens when one partition holds most of the data, overloading a single worker.
  • Partitioning on a column that queries rarely filter on forfeits the benefit entirely: nothing can be pruned, so every query still scans every partition.
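
One rough way to catch the first two pitfalls before committing to a partition column is to count rows per candidate key. The sketch below is a hypothetical helper; the small_threshold cutoff is an arbitrary illustrative value, and real thresholds depend on file sizes and the engine in use.

```python
from collections import Counter


def partition_report(partition_keys, small_threshold=2):
    """Summarize rows per partition to spot skew and tiny partitions.

    partition_keys: iterable with one partition key per row.
    small_threshold: illustrative cutoff; partitions with fewer rows
    than this are reported as tiny.
    """
    counts = Counter(partition_keys)
    if not counts:
        raise ValueError("no rows to analyze")
    total = sum(counts.values())
    largest_key, largest_rows = counts.most_common(1)[0]
    return {
        "partitions": len(counts),
        "largest_key": largest_key,
        "largest_share": largest_rows / total,  # near 1.0 means heavy skew
        "tiny_partitions": sorted(
            key for key, rows in counts.items() if rows < small_threshold
        ),
    }


# Hypothetical country column: one value dominates, two are nearly empty.
keys = ["US"] * 8 + ["DE"] + ["FR"]
report = partition_report(keys)
print(report["largest_key"], report["largest_share"])  # US 0.8
print(report["tiny_partitions"])  # ['DE', 'FR']
```

Here a single partition would hold 80% of the rows while two others hold one row each, which shows both skew and tiny partitions in the same candidate column.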

Practical guidance

Pick the column most queries filter on, and aim for partitions large enough that each one is read efficiently but small enough that pruning still skips most of the data. Partitioning by time plus one evenly distributed key is a common pattern.
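
The "time plus one balanced key" pattern might look like the sketch below. It assumes a daily folder layout with hash buckets inside each day. The path format, the CRC32 bucketing, and the sizing rule (daily volume divided by a target partition size) are illustrative assumptions, not a standard.

```python
import math
import zlib
from datetime import date


def buckets_per_day(daily_bytes: int, target_partition_bytes: int) -> int:
    """Pick a bucket count so each day/bucket partition is near the target size."""
    return max(1, math.ceil(daily_bytes / target_partition_bytes))


def partition_path(day: date, key: str, num_buckets: int) -> str:
    """Time plus balanced key: a day folder, then a hash bucket of the key."""
    bucket = zlib.crc32(key.encode("utf-8")) % num_buckets
    return f"date={day.isoformat()}/bucket={bucket}"


# Hypothetical volumes: 10 GiB of events per day, 512 MiB target partitions.
num_buckets = buckets_per_day(10 * 1024**3, 512 * 1024**2)
print(num_buckets)  # 20
path = partition_path(date(2024, 1, 5), "user-42", num_buckets)
```

A query filtering on the date prunes to one day's folder; a query that also filters on the key can narrow further to a single bucket within that day.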

Key idea

Partition on the column queries filter on so that pruning reads only the chunks you need, and avoid partitions that are tiny or skewed.

Check yourself

1. What benefit does partition pruning provide?

2. What problem do too many small partitions cause?