System Design

Shuffle Optimization

Why the shuffle dominates distributed job cost and how to shrink the data that crosses the network.

What the shuffle is

A shuffle redistributes data across the cluster so that rows sharing a key land on the same node. Operations such as group by, join, and repartition trigger it. Because the shuffle writes intermediate data to disk and then sends it over the network, it is often the most expensive step in a job.
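To make the routing concrete, here is a minimal single-process sketch of hash partitioning. It is an illustration only, not any engine's actual implementation: the names partition_for and shuffle, the choice of CRC32 as the hash, and the sample records are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a target partition from a stable hash of the key (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(map_outputs, num_partitions):
    """Route every (key, value) record to the partition that owns its key."""
    partitions = defaultdict(list)
    for records in map_outputs:  # one list per upstream task
        for key, value in records:
            partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical output of three upstream tasks.
map_outputs = [
    [("apple", 1), ("pear", 2)],
    [("apple", 3), ("plum", 4)],
    [("pear", 5), ("apple", 6)],
]
shuffled = shuffle(map_outputs, num_partitions=2)

for partition_id, records in sorted(shuffled.items()):
    print(partition_id, records)
```

Every record with the same key ends up in the same partition, whichever upstream task produced it. In a real cluster each append in this sketch would be a disk write followed by a network transfer.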

Why it hurts

  • Transferring all keyed data over the network is slow compared to local compute.
  • Writing shuffle files to disk, and spilling when memory runs short, adds I/O cost.
  • A heavy key can overload one node, a problem called skew.
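The skew point can be illustrated with a small sketch that counts how many records each partition receives. The workload (one hot key carrying 90% of the rows), the helper names, and the CRC32 hash are assumptions chosen for the example, not measurements from a real job.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioner (CRC32 is an assumption for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Return the number of records landing in each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: one hot key plus 1000 distinct light keys.
keys = ["hot_user"] * 9000 + [f"user_{i}" for i in range(1000)]
sizes = partition_sizes(keys, num_partitions=8)

# Ratio of the largest partition to the average partition size.
skew_ratio = max(sizes) / (sum(sizes) / len(sizes))
print(sizes, round(skew_ratio, 2))
```

All 9000 rows for the hot key must land in one partition, so that partition is at least about seven times larger than the average. The stage finishes only when that one task does.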

How to reduce it

  • Filter and project early so less data reaches the shuffle.
  • Pre-aggregate with a map-side combine so each partition sends partial sums instead of raw rows.
  • Broadcast the small table in a join so the large table does not have to be shuffled.
  • Choose a partition count scaled to the cluster's cores (often a small multiple of the total) so tasks stay balanced.
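The map-side combine from the list above can be sketched in plain Python. This is a simplified model under stated assumptions: the function names map_side_combine and reduce_side_merge and the sample partitions are hypothetical, and record count stands in for bytes sent over the network.

```python
from collections import Counter


def map_side_combine(records):
    """Sum values per key inside one partition before anything is sent."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return list(partial.items())


def reduce_side_merge(all_partials):
    """Merge the partial sums that arrive from every partition."""
    total = Counter()
    for partial in all_partials:
        for key, value in partial:
            total[key] += value
    return dict(total)


# Hypothetical input: two partitions of (key, value) rows.
partitions = [
    [("a", 1), ("a", 2), ("b", 3), ("a", 4)],
    [("b", 5), ("b", 6), ("a", 7)],
]

# Without a combine, every raw row crosses the shuffle.
raw_records_sent = sum(len(p) for p in partitions)

# With a combine, each partition sends at most one record per key.
combined = [map_side_combine(p) for p in partitions]
combined_records_sent = sum(len(p) for p in combined)

totals = reduce_side_merge(combined)
print(raw_records_sent, combined_records_sent, totals)
```

Here 7 raw rows shrink to 4 partial sums, and the final totals are unchanged. The saving grows with the number of rows per key in each partition. This only works for aggregations that can be merged from partial results, such as sums, counts, minima, and maxima.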

The mindset

Treat every shuffle as a budget. The cheapest shuffle is the one you avoid or the one carrying the least data.

Key idea

The shuffle moves keyed data across the network and dominates job cost, so the main optimization is shrinking or eliminating the data that must cross it.

Check yourself

1. Why is the shuffle usually the most expensive step?

2. How does map-side pre-aggregation help?

3. What is the cheapest shuffle?