← Lessons

quiz vs the machine

Gold1410

Databases

Statistics Collection

How the optimizer estimates rows from sampled data distributions.

5 min read · core · beat Gold to climb

The Optimizer Needs Estimates

A cost based optimizer compares plans by estimating how many rows each step produces. Those estimates come from statistics the engine gathers about each table and column.

What Gets Collected

  • Row counts and table sizes for scan cost.
  • Histograms describing the distribution of values in a column.
  • Most common values lists for skewed columns.
  • Distinct value estimates to predict join and group sizes.

These are built by sampling rows rather than scanning everything, which keeps collection cheap but introduces some error.

Staleness and Selectivity

The optimizer multiplies statistics into a selectivity, the fraction of rows a predicate keeps. If statistics are stale after heavy changes, estimates drift and plans go wrong, for example choosing a nested loop when a hash join would win. Engines refresh statistics automatically after enough rows change.

Key idea

The optimizer relies on sampled statistics such as histograms and distinct counts to estimate selectivity and plan costs, so stale statistics lead to bad plans and must be refreshed.

Check yourself

Answer to earn rating on the learn ladder.

1. What is selectivity?

2. Why do stale statistics cause problems?