The idea
Stochastic gradient descent (SGD) updates parameters using the gradient from a single training example rather than the whole dataset.
- Each example gives a noisy estimate of the true gradient.
- Updates are cheap and frequent.
- Over many steps the noise averages out, so the updates move in the right direction on average.
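The averaging claim can be checked numerically. The following is a minimal sketch, assuming a one-parameter model w*x with squared error on made-up data (y = 3x plus small noise); the names example_gradient and full_gradient are hypothetical helpers, not part of any library.

```python
import random

# Hypothetical toy data: y = 3x plus small noise (illustrative only).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

def example_gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

def full_gradient(w, xs, ys):
    """Gradient of the mean squared error over the whole dataset."""
    return sum(example_gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0
true_grad = full_gradient(w, xs, ys)

# A single example gives a noisy estimate of the full gradient.
single = example_gradient(w, xs[0], ys[0])

# Averaging many randomly drawn single-example gradients gets close to it.
samples = [rng.randrange(len(xs)) for _ in range(20000)]
avg_of_singles = sum(example_gradient(w, xs[i], ys[i]) for i in samples) / len(samples)

print(f"full gradient:          {true_grad:.4f}")
print(f"one-example estimate:   {single:.4f}")
print(f"average of 20000 draws: {avg_of_singles:.4f}")
```

Any one estimate can be far from the full gradient, but the average of many random draws lands close to it, which is why a long sequence of noisy steps still makes progress.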
Why use it
Computing the full gradient over millions of examples is expensive. SGD makes many small updates per pass, so the model starts improving long before seeing all the data.
- It scales to huge datasets.
- The noise can help the optimizer escape shallow local minima.
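To make the cost argument concrete, here is a rough sketch comparing one pass of full-batch gradient descent with one pass of SGD. It assumes the same kind of toy problem (one parameter, squared error, synthetic data with true slope 3) and an arbitrarily chosen learning rate of 0.1; none of these values come from the text above.

```python
import random

# Hypothetical toy data: y = 3x plus small noise (illustrative only).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

def mse(w):
    """Mean squared error of the model w*x over the whole dataset."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

lr = 0.1  # assumed learning rate, chosen only for this illustration

# Full-batch gradient descent: one pass over the data yields one update.
w_gd = 0.0
grad = sum(2.0 * (w_gd * x - y) * x for x, y in zip(xs, ys)) / len(xs)
w_gd -= lr * grad

# SGD: the same pass yields one update per example, in shuffled order.
w_sgd = 0.0
order = list(range(len(xs)))
rng.shuffle(order)
for i in order:
    w_sgd -= lr * 2.0 * (w_sgd * xs[i] - ys[i]) * xs[i]

gd_loss = mse(w_gd)
sgd_loss = mse(w_sgd)
print(f"after one pass, full-batch: w={w_gd:.3f}, loss={gd_loss:.4f}")
print(f"after one pass, SGD:        w={w_sgd:.3f}, loss={sgd_loss:.4f}")
```

Both loops read every example once, but the SGD loop has made 1000 small corrections by the end of the pass while the full-batch loop has made one.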
The tradeoff
The path is jittery because each step trusts a single sample. A decreasing learning rate helps the updates settle near a minimum instead of bouncing around it.
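One common way to decrease the learning rate is inverse-time decay. The sketch below is one possible schedule, not the only one; the function name decayed_lr and the constants lr0=0.5 and decay=0.01 are assumptions picked for this toy problem, which again uses synthetic data with true slope 3.

```python
import random

def decayed_lr(step, lr0=0.5, decay=0.01):
    """Inverse-time decay: lr0 at step 0, shrinking toward 0 as step grows."""
    return lr0 / (1.0 + decay * step)

rng = random.Random(0)
w = 0.0
for step in range(5000):
    # Draw one noisy example from the hypothetical model y = 3x + noise.
    x = rng.uniform(-1.0, 1.0)
    y = 3.0 * x + rng.gauss(0.0, 0.5)
    # Single-example gradient of the squared error (w*x - y)**2.
    grad = 2.0 * (w * x - y) * x
    # Early steps are large; later steps are small and settle down.
    w -= decayed_lr(step) * grad

print(f"learning rate at step 0:    {decayed_lr(0):.4f}")
print(f"learning rate at step 4999: {decayed_lr(4999):.4f}")
print(f"final w: {w:.3f}")
```

Large early steps move w quickly toward the minimum, and the shrinking step size then damps the jitter caused by individual noisy samples.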
SGD trades exact gradients for speed, and in practice that trade usually wins on large problems.
Key idea
SGD estimates the gradient from one example per step, giving cheap noisy updates that converge in aggregate and scale to large datasets.