Why iterate
The closed-form solution for linear regression (the normal equation) builds and inverts a matrix whose size grows with the number of features, which becomes slow or infeasible with millions of features or examples. Gradient descent instead nudges the weights step by step toward lower error.
The update rule
- Compute the gradient, the direction of steepest increase of the loss.
- Move the weights a small step in the opposite direction.
- Repeat until the loss stops dropping.
The step size is the learning rate. Too large and the updates overshoot and diverge; too small and training crawls.
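The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a single feature, a mean squared error loss, and a fixed number of steps in place of a "loss stopped dropping" check. The function name `gradient_descent`, its default values, and the toy data are all made up for the example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move a small step in the direction opposite to the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data (illustrative only), generated exactly from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

On this toy data the loop is stable only for learning rates below roughly 0.15; above that, each update overshoots by more than it corrects and the weights blow up, which is the divergence described above.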
Batch and stochastic flavors
- Batch gradient descent uses the whole dataset for each step: smooth but expensive.
- Stochastic gradient descent uses one example per step: noisy but fast.
- Mini-batch gradient descent uses a small group of examples per step, the common compromise.
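The three flavors differ only in how many examples feed each gradient estimate, so one loop can cover all of them. The sketch below assumes the same single-feature squared-error setup; `minibatch_gd`, the `batch_size` parameter, the fixed seed, and the toy data are hypothetical choices made for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.02, epochs=2000, seed=0):
    """Fit y ~ w * x + b; batch_size selects the flavor of gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average squared-error gradient over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy, noise-free data (illustrative only), generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

w_batch, b_batch = minibatch_gd(xs, ys, batch_size=len(xs))  # batch: all data per step
w_mini, b_mini = minibatch_gd(xs, ys, batch_size=2)          # mini-batch: small groups
w_sgd, b_sgd = minibatch_gd(xs, ys, batch_size=1)            # stochastic: one example
```

Because this data has no noise, all three settings end up at essentially the same weights. With noisy data, the smaller batch sizes would keep jittering around the minimum unless the learning rate is reduced over time.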
Convergence tips
- Scale features so the error surface is round, not a stretched valley.
- Decay the learning rate over time to settle near the minimum.
- Least squares is a convex problem, so with a small enough learning rate gradient descent converges to the global optimum rather than a local one.
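The first two tips can be sketched as small helpers. This assumes standardization (zero mean, unit variance) as the scaling method and inverse-time decay as the schedule; both are common choices but not the only ones. The names `standardize` and `decayed_rate`, the `decay` constant, and the sample columns are hypothetical.

```python
import statistics


def standardize(column):
    """Rescale one feature column to zero mean and unit variance.

    Assumes the column is not constant; a zero standard deviation
    would cause a division by zero here.
    """
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


def decayed_rate(initial_rate, step, decay=0.01):
    """Inverse-time decay: the learning rate shrinks as the step count grows."""
    return initial_rate / (1.0 + decay * step)


# Two made-up feature columns on very different scales.
incomes = [20_000.0, 35_000.0, 50_000.0, 80_000.0]
ages = [25.0, 32.0, 47.0, 51.0]

scaled_incomes = standardize(incomes)
scaled_ages = standardize(ages)

# The rate starts at its initial value and falls as training proceeds.
rates = [decayed_rate(0.1, step) for step in (0, 100, 1000)]
```

After scaling, both columns have comparable spread, so a single learning rate suits both weights instead of being too large for one direction and too small for the other.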
Key idea
Gradient descent fits regression by repeatedly stepping downhill on the loss. The learning rate and feature scaling decide whether it converges smoothly or diverges.