← Lessons

quiz vs the machine

Gold1430

System Design

The Shuffle and Sort Phase

The expensive cross network step that moves map output to the right reducers.

5 min read · core · beat Gold to climb

Where it sits

Between map and reduce lies the shuffle and sort phase. It is the step that takes every key value pair a mapper produced and delivers it to the reducer responsible for that key.

What happens

  • Each mapper partitions its output, usually by hashing the key, so a key always lands on the same reducer.
  • Output is sorted by key locally so each reducer receives its keys in order and can process groups one at a time.
  • Reducers fetch their partitions across the network from every mapper.

Why it dominates cost

This phase moves data across the network and to disk, so it is often the slowest part of a job. A combiner that pre aggregates map output before the shuffle, and good partitioning, are the main levers to cut shuffle volume.

Understanding the shuffle is the key to tuning any distributed analytic job, because that is where time and bandwidth go.

Key idea

Shuffle and sort partitions, orders, and ships map output to reducers, and it usually dominates a job cost.

Check yourself

Answer to earn rating on the learn ladder.

1. What does the shuffle phase accomplish?

2. How does a combiner reduce shuffle cost?