← Lessons

quiz vs the machine

Platinum1750

System Design

The Broadcast Join

Joining a huge table to a small one by copying the small side to every node.

5 min read · advanced · beat Platinum to climb

The normal cost of a join

A distributed join usually shuffles both tables so matching keys meet on the same node. Shuffling two large tables across the network is expensive and a common source of skew.

The broadcast shortcut

When one side is small enough to fit in memory, the engine broadcasts a full copy of it to every node holding the large table. Each node then joins its local partition against the in memory small table with no shuffle of the big side.

Why it wins

  • The large table never moves, only the small table is copied once per node.
  • There is no shuffle skew on the big side because partitions stay put.
  • It turns a costly network heavy join into a local hash lookup.

When not to use it

If the small side is not actually small, broadcasting it floods memory and the network, so engines guard it with a size threshold and fall back to a shuffle join.

Picking a broadcast join for a big to small join is one of the highest impact tuning choices in analytics.

Key idea

A broadcast join copies a small table to every node so the large table joins locally without an expensive shuffle.

Check yourself

Answer to earn rating on the learn ladder.

1. When is a broadcast join the right choice?

2. What does a broadcast join avoid?