Statistics Collection

The Optimizer Needs Estimates

A cost based optimizer compares plans by estimating how many rows each step produces. Those estimates come from statistics the engine gathers about each table and column.

What Gets Collected

Row counts and table sizes for scan cost.
Histograms describing the distribution of values in a column.
Most common values lists for skewed columns.
Distinct value estimates to predict join and group sizes.

These are built by sampling rows rather than scanning everything, which keeps collection cheap but introduces some error.

Staleness and Selectivity

The optimizer multiplies statistics into a selectivity, the fraction of rows a predicate keeps. If statistics are stale after heavy changes, estimates drift and plans go wrong, for example choosing a nested loop when a hash join would win. Engines refresh statistics automatically after enough rows change.

Key idea

The optimizer relies on sampled statistics such as histograms and distinct counts to estimate selectivity and plan costs, so stale statistics lead to bad plans and must be refreshed.

Statistics Collection

The Optimizer Needs Estimates

What Gets Collected

Staleness and Selectivity

Key idea

Check yourself