Broadcast Joins Deep Dive

Replicating a small table to every node so a large table joins locally without a shuffle.

The idea

A normal join shuffles both tables so matching keys meet, which is expensive. A broadcast join avoids this when one table is small. The engine copies the small table to every node, then each node joins its local slice of the large table against the in memory copy. No shuffle of the large table is needed.

Why it is fast

The large table stays partitioned in place, avoiding its costly network shuffle.
Each node does a fast in memory hash lookup against the small table.
It removes the stage boundary that a shuffle join would create.

When to use it

Broadcast joins shine for dimension joins, like enriching a huge fact table with a small lookup of countries or product names. Engines often auto broadcast when the small side fits under a size threshold.

The limit

If the broadcast side is too large, copying it to every node wastes memory and can crash executors. Beyond the threshold a shuffle join is safer.

Key idea

A broadcast join replicates a small table to every node so the large table joins locally with no shuffle, ideal for small dimension lookups within a size limit.

Broadcast Joins Deep Dive

The idea

Why it is fast

When to use it

The limit

Key idea

Check yourself