The model
MapReduce expresses a large batch computation as two pure functions. The map function turns each input record into zero or more key-value pairs. The reduce function takes all the values that share a key and combines them into a result for that key.
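As a rough illustration of the model, the sketch below pushes the two functions through a single-process driver. The names run_mapreduce, map_fn, and reduce_fn, and the sample log records, are made up for this example and do not belong to any particular framework. A real system would spread the same three steps (map, group by key, reduce) across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

Record = TypeVar("Record")
Key = TypeVar("Key", bound=Hashable)
Value = TypeVar("Value")
Result = TypeVar("Result")


def run_mapreduce(
    records: Iterable[Record],
    map_fn: Callable[[Record], Iterable[Tuple[Key, Value]]],
    reduce_fn: Callable[[Key, List[Value]], Result],
) -> Dict[Key, Result]:
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups: Dict[Key, List[Value]] = defaultdict(list)
    for record in records:
        # map: each record yields zero or more (key, value) pairs
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce: one call per key, given every value that shares that key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Example: total bytes served per HTTP status code (made-up records).
def bytes_by_status(record):
    status, size = record
    if size > 0:  # a record may produce no pairs at all
        yield status, size


def total(key, values):
    return sum(values)


logs = [(200, 512), (404, 0), (200, 128), (500, 64)]
totals = run_mapreduce(logs, bytes_by_status, total)
```

The 404 record emits nothing, so no result appears for that key. That is the "zero or more" part of the map contract.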
Why it scales
- Map tasks are independent, so the framework runs them in parallel across many machines, each working on data stored locally.
- The framework groups pairs by key, then runs reduce tasks in parallel per key group.
- Failed tasks can simply be rerun, because both functions are deterministic on their input.
A canonical example
In word count, the map function emits the pair (word, 1) for each word it sees, and the reduce function sums the ones for each word. The same shape handles log analysis, index building, and aggregation.
The appeal of the model is that the engineer writes only map and reduce, while the framework handles distribution, scheduling, and fault tolerance.
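A minimal word count along these lines might look as follows. Only map_words and reduce_counts are the part an engineer would write. The word_count helper is a hypothetical stand-in for the framework that groups pairs by sorting them on the key, and lowercasing and whitespace splitting are simplifying assumptions of this sketch.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map: emit (word, 1) for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, ones):
    """Reduce: sum the ones collected for a single word."""
    return sum(ones)


def word_count(lines):
    """Stand-in for the framework: map, sort and group by key, reduce."""
    pairs = [pair for line in lines for pair in map_words(line)]
    pairs.sort(key=itemgetter(0))  # brings equal keys next to each other
    return {
        word: reduce_counts(word, [one for _, one in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    }


counts = word_count(["the quick fox", "the lazy dog", "The end"])
```

Swapping in a different pair of functions, for example one that emits (status code, 1) from a log line, reuses the same driver unchanged.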
Key idea
MapReduce expresses big batch work as a parallel map followed by a grouped reduce, letting the framework handle scaling and failures.