Parallel Map Filter Reduce
Map, filter, and reduce are the building blocks of many data pipelines, and each parallelizes in a distinct way. Understanding their differences tells you which parts of a pipeline scale freely and which need care.
Map applies a function to every element independently. Because no element depends on another, you can split the input across cores with no coordination. Map is the easiest stage to parallelize.
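As a minimal sketch of the idea above, the following Python uses the standard library's `concurrent.futures` to hand elements to a pool of workers. The names `square` and `parallel_map` are hypothetical, chosen only for illustration. A thread pool keeps the example self-contained; for CPU-bound work in CPython you would likely swap in `ProcessPoolExecutor` to actually use several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Hypothetical per-element function: it reads only its own input,
    # so calls never need to coordinate with each other.
    return x * x


def parallel_map(func, items, workers=4):
    # Executor.map returns results in input order, even though the
    # individual calls may finish in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


squares = parallel_map(square, range(8))
print(squares)
```

Because each call is independent, the only thing the pool has to manage is which worker gets which element.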
Filter keeps only elements that pass a test. The test on each element is independent, so it parallelizes like map, but the number of surviving elements is unknown in advance, so workers cannot simply write into preallocated output positions.
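One common way to handle the unknown output size, sketched here under assumptions, is to let each worker filter its own chunk into a private list and concatenate the lists afterwards. The helper names (`chunked`, `filter_chunk`, `parallel_filter`, `is_even`) and the chunk size are hypothetical, and this is only one of several possible strategies.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def chunked(items, size):
    # Split a list into consecutive slices of at most `size` elements.
    return [items[i:i + size] for i in range(0, len(items), size)]


def is_even(x):
    # Hypothetical test; it looks at one element only.
    return x % 2 == 0


def filter_chunk(pred, chunk):
    # Each worker builds its own result list, so its length can vary
    # without affecting any other worker.
    return [x for x in chunk if pred(x)]


def parallel_filter(pred, items, chunk_size=3, workers=4):
    chunks = chunked(list(items), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        kept_per_chunk = list(pool.map(partial(filter_chunk, pred), chunks))
    # Concatenating in chunk order preserves the original element order.
    return [x for kept in kept_per_chunk for x in kept]


evens = parallel_filter(is_even, range(10))
print(evens)
```

The per-chunk lists avoid any shared write position; the cost is one extra concatenation pass at the end.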
Reduce combines all elements into a single value, such as a sum. This requires combining results across workers, so it is not embarrassingly parallel. If the combining operation is associative, meaning (a op b) op c equals a op (b op c), the grouping of elements does not affect the result, so you can reduce each chunk separately and then merge the partial results in a tree.
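The chunk-then-merge scheme can be sketched as follows. This is an illustrative outline, not a tuned implementation: `tree_merge` and `parallel_reduce` are hypothetical names, the chunk size is arbitrary, and the merge rounds here run sequentially for simplicity (in a real system each round's pairs could also be combined in parallel).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial, reduce
import operator


def tree_merge(op, partials):
    # Combine neighbouring partial results pairwise, round after round,
    # until a single value remains. Neighbours stay in order, so the
    # operation needs to be associative but not commutative.
    while len(partials) > 1:
        merged = [op(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2 == 1:
            merged.append(partials[-1])  # odd one out moves up unchanged
        partials = merged
    return partials[0]


def parallel_reduce(op, items, chunk_size=4, workers=4):
    items = list(items)
    if not items:
        raise ValueError("parallel_reduce needs at least one element")
    chunks = [items[i:i + chunk_size]
              for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker reduces its own chunk independently.
        partials = list(pool.map(partial(reduce, op), chunks))
    return tree_merge(op, partials)


total = parallel_reduce(operator.add, range(1, 11))
print(total)
```

With an associative operation such as addition or string concatenation, this matches a sequential left-to-right reduce. With a non-associative one such as subtraction, the regrouping changes the answer, which is why the requirement matters.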
- Map: Fully independent, scales cleanly.
- Filter: Independent tests, but variable output length.
- Reduce: Needs an associative combiner to split safely.
A common pattern is map then filter then reduce, where the first two stages fan out across cores and the final reduce merges partial answers.
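The pattern can be sketched end to end as below. The specific stages (square each number, keep the even squares, sum them) are hypothetical stand-ins, as are the names `process_chunk` and `pipeline`. As one possible design, each worker here also reduces its own chunk locally, so only one partial value per chunk reaches the final merge.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def process_chunk(chunk):
    # Map and filter run per element with no coordination.
    mapped = (x * x for x in chunk)             # map: square
    kept = (y for y in mapped if y % 2 == 0)    # filter: even squares
    # Local reduce; the identity 0 covers chunks where nothing survives.
    return reduce(operator.add, kept, 0)


def pipeline(items, chunk_size=4, workers=4):
    items = list(items)
    chunks = [items[i:i + chunk_size]
              for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Final reduce merges the per-chunk partial sums.
    return reduce(operator.add, partials, 0)


result = pipeline(range(10))
print(result)
```

Supplying an identity value (0 for addition) is an assumption of this sketch; it lets a chunk whose elements are all filtered out still contribute a valid partial result.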
Key idea
Map and filter parallelize cleanly because elements are independent, while reduce parallelizes only when its combining operation is associative.