← Lessons

quiz vs the machine

Platinum1880

Databases

Cost Based Optimization in Warehouses

Using statistics to pick join orders and methods that scan less.

6 min read · advanced · beat Platinum to climb

Choosing Among Plans

A query can run many ways with very different costs. A cost based optimizer estimates the cost of candidate plans using statistics about the data, then picks the cheapest. In a warehouse, cost is dominated by bytes scanned and shuffled, not CPU.

What Statistics Drive

  • Cardinality: how many rows each filter and join produces, which sizes every later step.
  • Distribution: histograms estimate how selective a filter is.
  • Join order: filter and join the smallest intermediates first to keep data tiny.
  • Join method: pick a broadcast join when one side is small or a shuffle join when both are large.

Why Estimates Matter

A wrong cardinality estimate cascades. If the optimizer thinks a join yields a thousand rows but it yields a billion, every downstream choice is wrong. Stale statistics are a common cause of bad warehouse plans, so engines refresh them as data changes.

Key idea

A cost based optimizer uses data statistics to estimate plan costs and choose join orders and methods that minimize bytes scanned and shuffled, so accurate up to date statistics are essential to good plans.

Check yourself

Answer to earn rating on the learn ladder.

1. What dominates query cost in a data warehouse?

2. Why are cardinality estimates so important?

3. When does a broadcast join make sense?