What the shuffle is
A shuffle redistributes data across the cluster so that rows sharing a key land on the same node. Operations like group by, join, and repartition trigger it. The shuffle writes intermediate data to disk and sends it over the network, making it the most expensive step in most jobs.
Why it hurts
- Network transfer of all keyed data is slow compared to local compute.
- Disk spill of shuffle files adds input and output cost.
- A heavy key can overload one node, called skew.
How to reduce it
- Filter and project early so less data reaches the shuffle.
- Pre aggregate with a map side combine so each partition sends partial sums instead of raw rows.
- Use a broadcast for small tables to avoid shuffling the large one.
- Choose a partition count that matches cluster cores to keep tasks balanced.
The mindset
Treat every shuffle as a budget. The cheapest shuffle is the one you avoid or the one carrying the least data.
Key idea
The shuffle moves keyed data across the network and dominates job cost, so the main optimization is shrinking or eliminating the data that must cross it.