Gradient Descent for Regression

Why iterate

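The closed-form solution for linear regression builds and inverts a matrix whose size grows with the number of features. That is slow or infeasible with millions of features, and even building the matrix is costly with millions of examples. Gradient descent instead nudges the weights step by step toward lower error.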

The update rule

Compute the gradient, the direction of steepest increase of the loss.
Move the weights a small step in the opposite direction.
Repeat until the loss stops dropping.

The step size is the learning rate. Too large and the updates overshoot and diverge; too small and training crawls.
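
The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes a linear model with a bias term and mean squared error as the loss, and the names (predict, mse_loss, gradient_descent, tol, max_steps) and the toy data are made up for the example.

```python
def predict(weights, bias, row):
    """Linear model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, row)) + bias


def mse_loss(weights, bias, rows, targets):
    """Mean squared error over the whole data set (assumed loss)."""
    n = len(rows)
    return sum((predict(weights, bias, r) - t) ** 2
               for r, t in zip(rows, targets)) / n


def gradient(weights, bias, rows, targets):
    """Gradient of the mean squared error with respect to weights and bias."""
    n = len(rows)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for row, target in zip(rows, targets):
        error = predict(weights, bias, row) - target
        for j, x in enumerate(row):
            grad_w[j] += 2.0 * error * x / n
        grad_b += 2.0 * error / n
    return grad_w, grad_b


def gradient_descent(rows, targets, learning_rate=0.05, max_steps=5000, tol=1e-12):
    weights = [0.0] * len(rows[0])
    bias = 0.0
    previous = mse_loss(weights, bias, rows, targets)
    for _ in range(max_steps):
        # Step 1: compute the gradient (direction of steepest increase).
        grad_w, grad_b = gradient(weights, bias, rows, targets)
        # Step 2: move a small step in the opposite direction.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
        # Step 3: stop once the loss has stopped changing.
        current = mse_loss(weights, bias, rows, targets)
        if abs(previous - current) < tol:
            break
        previous = current
    return weights, bias


# Toy data generated from y = 2x + 1 (illustrative only).
rows = [[0.0], [1.0], [2.0], [3.0]]
targets = [1.0, 3.0, 5.0, 7.0]

weights, bias = gradient_descent(rows, targets)
print("fitted:", weights, bias)

# The same data with a learning rate that is too large: the loss grows.
bad_weights, bad_bias = gradient_descent(rows, targets, learning_rate=0.3, max_steps=20)
print("loss after 20 oversized steps:", mse_loss(bad_weights, bad_bias, rows, targets))
```

On this toy data the default rate recovers a slope near 2 and an intercept near 1, while the larger rate overshoots on every step and ends with a higher loss than it started with. The run with the large rate is capped at 20 steps on purpose, since the values keep growing.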

Batch and stochastic flavors

Batch gradient descent uses all the data for each step: smooth but expensive.
Stochastic gradient descent uses one example per step: noisy but cheap.
Mini-batch gradient descent uses a small group of examples per step, the common compromise.
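
The three flavors differ only in how many examples feed each step, so one loop with a batch size parameter can cover all of them. The sketch below assumes the same linear model and squared error loss as is usual for least squares. The function name minibatch_gradient_descent, its parameters, and the toy data are hypothetical.

```python
import random


def minibatch_gradient_descent(rows, targets, batch_size,
                               learning_rate=0.02, epochs=2000, seed=0):
    """batch_size=len(rows) is batch, 1 is stochastic, anything between is mini-batch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    indices = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            # Average the squared-error gradient over this batch only.
            for i in batch:
                prediction = sum(w * x for w, x in zip(weights, rows[i])) + bias
                error = prediction - targets[i]
                for j, x in enumerate(rows[i]):
                    grad_w[j] += 2.0 * error * x / len(batch)
                grad_b += 2.0 * error / len(batch)
            weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
            bias -= learning_rate * grad_b
    return weights, bias


# Toy data generated from y = 2x + 1 (illustrative only).
rows = [[0.0], [1.0], [2.0], [3.0]]
targets = [1.0, 3.0, 5.0, 7.0]

batch_fit = minibatch_gradient_descent(rows, targets, batch_size=len(rows))
stochastic_fit = minibatch_gradient_descent(rows, targets, batch_size=1)
mini_fit = minibatch_gradient_descent(rows, targets, batch_size=2)
print(batch_fit, stochastic_fit, mini_fit)
```

Because this toy data has no noise, all three settings end up at essentially the same line. On real data the stochastic and mini-batch runs would keep jittering around the minimum, which is where a decaying learning rate helps.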

Convergence tips

Scale features so the error surface is round, not a stretched valley.
Decay the learning rate over time to settle near the minimum.
Least squares is convex, so with a small enough learning rate gradient descent converges to the global optimum instead of getting stuck in a local one.
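
The first two tips can be sketched as small helpers. Both are assumptions chosen for illustration: standardization (zero mean, unit spread per feature) is one common way to scale, and the inverse-time schedule is one of several possible decay rules. The names standardize, decayed_learning_rate, and decay are hypothetical.

```python
import statistics


def standardize(rows):
    """Rescale each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    spreads = [statistics.pstdev(column) or 1.0 for column in columns]
    scaled = [[(x - m) / s for x, m, s in zip(row, means, spreads)]
              for row in rows]
    return scaled, means, spreads


def decayed_learning_rate(initial_rate, step, decay=0.01):
    """Inverse-time decay: the rate shrinks as the step count grows."""
    return initial_rate / (1.0 + decay * step)


# Two features on very different scales (illustrative values).
rows = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled, means, spreads = standardize(rows)
print(scaled)

for step in (0, 100, 1000):
    print(step, decayed_learning_rate(0.1, step))
```

After scaling, both columns cover the same range, so a single learning rate suits every weight. The means and spreads are returned so the same transformation can be applied to new inputs at prediction time.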

Key idea

Gradient descent fits regression by repeatedly stepping downhill on the loss. The learning rate and feature scaling decide whether it converges smoothly or diverges.
