The Parallel Prefix Scan
A prefix scan, also called a prefix sum, turns a sequence into its running totals. From the input three one two it produces three four six. Each output seems to depend on all the elements before it, which makes it look strictly sequential, yet it parallelizes elegantly.
The trick relies on the combining operation being associative. Because grouping does not change the result, partial sums of separate ranges can be computed independently and then stitched together. This lets a parallel scan run in two sweeps over the data.
- Up sweep Build a tree of partial sums, combining pairs upward.
- Down sweep Push the accumulated offsets back down to each leaf.
- Associativity Required so regrouped partial sums still give correct totals.
A naive sequential scan does one addition per element for a total of n steps. A work efficient parallel scan does about twice the additions but spreads them so the longest dependency chain is only about log n deep, giving a large speedup on wide hardware.
Scan is a workhorse primitive. It underlies stream compaction, where you compute output positions for kept elements, radix sorting, and allocating variable sized output. Once you can scan in parallel, many problems that looked sequential become tractable on many cores or a GPU.
Key idea
A parallel prefix scan exploits associativity to compute running totals in two tree sweeps, cutting the dependency depth from n to about log n.