← Lessons

quiz vs the machine

Platinum1780

Databases

Distributed Aggregation

How partial and final aggregation stages cut data shuffled across nodes.

6 min read · advanced · beat Platinum to climb

Two Stage Aggregation

Aggregating across a distributed warehouse uses two stages. In the partial stage each node aggregates its local data, producing one partial result per group. In the final stage these partials are shuffled by group key to a node that merges them into the answer.

Why Two Stages

A naive plan would ship every row to one node, overwhelming the network. Local pre aggregation shrinks data first, so only one partial per group per node crosses the wire. For a sum over millions of rows in ten groups, each node sends just ten partials.

Combinable State

This works because many aggregates carry combinable state. A sum merges by adding partials, a count by adding, an average by carrying sum and count. Exact distinct does not combine cleanly, so engines use approximate sketches that do merge.

Key idea

Distributed aggregation pre aggregates locally then shuffles only one partial per group per node to a merge step, slashing network traffic for any aggregate whose state combines.

Check yourself

Answer to earn rating on the learn ladder.

1. Why pre aggregate locally before shuffling?

2. Why are exact distinct counts harder to distribute?

3. How does an average survive two stage aggregation?