Shuffle Optimization

Why the shuffle dominates distributed job cost and how to shrink the data that crosses the network.

What the shuffle is

A shuffle redistributes data across the cluster so that rows sharing a key land on the same node. Operations like group by, join, and repartition trigger it. The shuffle writes intermediate data to disk and sends it over the network, making it the most expensive step in most jobs.

Why it hurts

Network transfer of all keyed data is slow compared to local compute.
Disk spill of shuffle files adds input and output cost.
A heavy key can overload one node, called skew.

How to reduce it

Filter and project early so less data reaches the shuffle.
Pre aggregate with a map side combine so each partition sends partial sums instead of raw rows.
Use a broadcast for small tables to avoid shuffling the large one.
Choose a partition count that matches cluster cores to keep tasks balanced.

The mindset

Treat every shuffle as a budget. The cheapest shuffle is the one you avoid or the one carrying the least data.

Key idea

The shuffle moves keyed data across the network and dominates job cost, so the main optimization is shrinking or eliminating the data that must cross it.

Shuffle Optimization

What the shuffle is

Why it hurts

How to reduce it

The mindset

Key idea

Check yourself