Stochastic Gradient Descent

The idea

Stochastic gradient descent (SGD) updates parameters using the gradient from a single training example rather than the whole dataset.

Each example gives a noisy estimate of the true gradient.
Updates are cheap and frequent.
Over many steps the noise averages out, so the updates move in the right direction on average.
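
The update described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name sgd_linear_fit, the toy data, the squared-error loss, and the values of lr and epochs are all assumptions chosen for the example.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b with SGD on squared error, one example per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a fresh random order each pass
        for i in indices:
            err = (w * xs[i] + b) - ys[i]  # prediction error on this single example
            # Gradient of 0.5 * err**2 with respect to w and b, from this example only
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b

# Toy data lying exactly on y = 2x + 1 (hypothetical, no label noise)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_linear_fit(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each inner-loop step touches one example, so the parameters change len(xs) times per pass instead of once. On this toy data the fit should land close to w = 2 and b = 1.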

Why use it

Computing the full gradient over millions of examples is expensive. SGD makes many small updates per pass, so the model starts improving long before seeing all the data.

It scales to huge datasets.
The noise can help the optimizer escape shallow local minima.

The tradeoff

The path is jittery because each step trusts a single sample. A decreasing learning rate helps the updates settle near a minimum instead of bouncing around it.
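
A small experiment can make the effect of a decreasing learning rate concrete. The sketch below assumes a deliberately simple problem: estimating the mean of noisy samples by minimizing 0.5 * (theta - x)**2 one sample at a time. The names sgd_mean and schedule, the 1/t decay, and the constant step of 0.5 are illustrative choices, not something the text prescribes.

```python
import random

def sgd_mean(samples, schedule):
    """Minimize the average of 0.5 * (theta - x)**2 by SGD, one sample per step."""
    theta = 0.0
    for t, x in enumerate(samples, start=1):
        grad = theta - x  # gradient from this single sample
        theta -= schedule(t) * grad
    return theta

rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(5000)]  # hypothetical noisy data

theta_const = sgd_mean(samples, lambda t: 0.5)      # constant step size
theta_decay = sgd_mean(samples, lambda t: 1.0 / t)  # step size shrinks as 1/t
sample_mean = sum(samples) / len(samples)

print(f"constant step: {theta_const:.4f}")
print(f"decaying step: {theta_decay:.4f}")
print(f"sample mean:   {sample_mean:.4f}")
```

With the 1/t schedule each update reduces to a running average, so the final estimate matches the sample mean up to rounding. With a constant step the estimate keeps tracking the most recent samples and never stops fluctuating around the minimum.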

SGD trades exact gradients for speed, and in practice that trade usually wins on large problems.

Key idea

SGD estimates the gradient from one example per step, giving cheap noisy updates that converge in aggregate and scale to large datasets.
