Choosing Among Plans
A query can run many ways with very different costs. A cost based optimizer estimates the cost of candidate plans using statistics about the data, then picks the cheapest. In a warehouse, cost is dominated by bytes scanned and shuffled, not CPU.
What Statistics Drive
- Cardinality: how many rows each filter and join produces, which sizes every later step.
- Distribution: histograms estimate how selective a filter is.
- Join order: filter and join the smallest intermediates first to keep data tiny.
- Join method: pick a broadcast join when one side is small or a shuffle join when both are large.
Why Estimates Matter
A wrong cardinality estimate cascades. If the optimizer thinks a join yields a thousand rows but it yields a billion, every downstream choice is wrong. Stale statistics are a common cause of bad warehouse plans, so engines refresh them as data changes.
Key idea
A cost based optimizer uses data statistics to estimate plan costs and choose join orders and methods that minimize bytes scanned and shuffled, so accurate up to date statistics are essential to good plans.