← Lessons

quiz vs the machine

Gold1440

System Design

Broadcast Joins Deep Dive

Replicating a small table to every node so a large table joins locally without a shuffle.

5 min read · core · beat Gold to climb

The idea

A normal join shuffles both tables so matching keys meet, which is expensive. A broadcast join avoids this when one table is small. The engine copies the small table to every node, then each node joins its local slice of the large table against the in memory copy. No shuffle of the large table is needed.

Why it is fast

  • The large table stays partitioned in place, avoiding its costly network shuffle.
  • Each node does a fast in memory hash lookup against the small table.
  • It removes the stage boundary that a shuffle join would create.

When to use it

Broadcast joins shine for dimension joins, like enriching a huge fact table with a small lookup of countries or product names. Engines often auto broadcast when the small side fits under a size threshold.

The limit

If the broadcast side is too large, copying it to every node wastes memory and can crash executors. Beyond the threshold a shuffle join is safer.

Key idea

A broadcast join replicates a small table to every node so the large table joins locally with no shuffle, ideal for small dimension lookups within a size limit.

Check yourself

Answer to earn rating on the learn ladder.

1. What makes a broadcast join fast?

2. When is a broadcast join appropriate?

3. What happens if the broadcast side is too large?