Two functions and a shuffle
MapReduce structures a distributed computation as two user functions. Map turns each input record into zero or more key value pairs. Reduce receives all values for a single key and combines them. Between them sits the shuffle, which groups every value by key and moves it to the right reducer.
Why it scales
The model scales because map tasks are fully independent: each processes its own input split with no coordination. The framework runs thousands of map tasks across machines, then partitions their output by key for the reducers.
- Map phase: parallel, embarrassingly so, one task per input split.
- Shuffle phase: sort and route pairs by key across the network.
- Reduce phase: parallel across distinct keys.
Combiners and skew
A combiner pre aggregates map output locally, like a mini reduce, cutting the data shuffled across the network. The main hazard is skew: if one key holds far more values than others, its reducer becomes a straggler that the whole job waits on. Good key design spreads load evenly.
Key idea
MapReduce expresses big computations as independent map tasks feeding key grouped reduce tasks through a shuffle, scaling by parallelism while combiners cut network cost and skew threatens balance.