Machine Learning

Second-Order Methods: Newton

Use curvature from the Hessian to take smarter, better scaled steps.

6 min read · advanced

Beyond the gradient

First-order methods use only the slope, that is, the gradient. Newton's method also uses curvature from the Hessian, the matrix of second derivatives, to choose both direction and step size.

  • The step is the negative of the inverse Hessian times the gradient.
  • Near a minimum where the Hessian is positive definite, convergence can be quadratic: the error is roughly squared at each step.
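To make the step concrete, here is a minimal sketch in plain Python. The objective f(x, y) = exp(x + y) + x**2 + y**2 is a toy function chosen for illustration, not something from the lesson, and the helper names (`grad`, `hessian`, `newton_step`, `newton`) are hypothetical. The 2x2 linear system is solved with Cramer's rule rather than by forming an explicit inverse.

```python
import math

def grad(x, y):
    """Gradient of the toy objective f(x, y) = exp(x + y) + x**2 + y**2."""
    e = math.exp(x + y)
    return (e + 2 * x, e + 2 * y)

def hessian(x, y):
    """Hessian (matrix of second derivatives) of the same toy objective."""
    e = math.exp(x + y)
    return ((e + 2, e), (e, e + 2))

def newton_step(x, y):
    """Return the next iterate: current point minus inverse Hessian times gradient."""
    g0, g1 = grad(x, y)
    (a, b), (c, d) = hessian(x, y)
    det = a * d - b * c
    # Solve H p = g with Cramer's rule instead of building the inverse.
    p0 = (d * g0 - b * g1) / det
    p1 = (a * g1 - c * g0) / det
    return (x - p0, y - p1)

def newton(x, y, steps=6):
    """Run a fixed number of Newton steps, printing the gradient norm each time."""
    for k in range(steps):
        g0, g1 = grad(x, y)
        print(k, math.hypot(g0, g1))
        x, y = newton_step(x, y)
    return x, y

x_min, y_min = newton(0.0, 0.0)
```

Watching the printed gradient norms is a quick way to see the fast convergence near the minimum: once the iterate is close, the number of correct digits grows rapidly from one step to the next.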

Why curvature helps

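The Hessian rescales the step by the inverse of the curvature, so steep directions get short steps and flat directions get long ones. Plain gradient descent struggles when the surface is much steeper in some directions than others, because one learning rate has to serve every direction; Newton's method corrects for that automatically.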
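A small comparison shows the effect. This sketch assumes a toy quadratic, f(x, y) = 0.5 * (x**2 + 100 * y**2), which is 100 times steeper in y than in x; the function, the learning rate, and the helper names are illustrative choices, not part of the lesson.

```python
def grad(x, y):
    """Gradient of the toy quadratic f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    return (x, 100.0 * y)

def gradient_descent(x, y, lr=0.015, steps=100):
    """Plain gradient descent; lr must stay below 2/100 to be stable in y."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

def newton_step(x, y):
    """One Newton step. The Hessian is diag(1, 100), so its inverse is diag(1, 1/100)."""
    gx, gy = grad(x, y)
    return x - gx / 1.0, y - gy / 100.0

gd_point = gradient_descent(1.0, 1.0)
newton_point = newton_step(1.0, 1.0)
print("gradient descent after 100 steps:", gd_point)
print("Newton after 1 step:", newton_point)
```

The steep y direction forces a small learning rate on gradient descent, so progress along the flat x direction is slow. The Newton step divides each gradient component by its own curvature and lands on the minimum of this quadratic in a single step.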

The catch

For a model with millions of parameters, forming and inverting the full Hessian is far too expensive in time and memory.

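  • Storing the Hessian takes memory quadratic in the parameter count, and inverting it takes time cubic in the parameter count.
  • The Hessian may not be positive definite away from a minimum, in which case the Newton step is not guaranteed to point downhill.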
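Both points can be checked with a few lines of arithmetic. The sketch below assumes a dense Hessian stored in 4-byte floats and treats a dense solve as roughly n**3 operations, ignoring constant factors. The 1D function f(x) = x**4 - 2 * x**2 is a toy example picked because its second derivative is negative near x = 0. All names are hypothetical.

```python
def hessian_memory_bytes(n_params, bytes_per_entry=4):
    """Memory for a dense n x n Hessian (4 bytes per entry assumes float32)."""
    return n_params * n_params * bytes_per_entry

def inversion_ops(n_params):
    """Order-of-magnitude operation count for a dense solve: about n**3."""
    return n_params ** 3

# Toy 1D function with minima at x = -1 and x = 1 and a local maximum at x = 0.
def f(x):
    return x ** 4 - 2 * x ** 2

def d1(x):
    """First derivative of f."""
    return 4 * x ** 3 - 4 * x

def d2(x):
    """Second derivative of f (the 1D Hessian)."""
    return 12 * x ** 2 - 4

def newton_step_1d(x):
    """Newton step in 1D: subtract first derivative over second derivative."""
    return x - d1(x) / d2(x)

n = 1_000_000
print("Hessian memory (TB):", hessian_memory_bytes(n) / 1e12)
print("approximate operations to solve:", inversion_ops(n))

x0 = 0.2
x1 = newton_step_1d(x0)
print("curvature at x0:", d2(x0))
print("f before and after the step:", f(x0), f(x1))
```

With a million parameters the dense Hessian alone needs terabytes of memory. In the 1D example the curvature at the starting point is negative, so the raw Newton step heads toward the local maximum and the function value goes up instead of down.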

Quasi-Newton methods like BFGS build an approximation to the inverse Hessian from successive gradient differences, never computing second derivatives, and so capture much of the benefit at lower cost.
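For a feel of how this works, here is a minimal BFGS sketch in plain Python. It is an illustration under stated assumptions, not a production optimizer: the objective is a toy convex function, the line search is simple backtracking with an Armijo condition, and every name (`f`, `grad`, `bfgs_update`, `bfgs`) is hypothetical. Only gradients are evaluated.

```python
import math

def f(x):
    """Toy objective: exp(x0 + x1) + x0**2 + x1**2 (smooth and convex)."""
    return math.exp(x[0] + x[1]) + x[0] ** 2 + x[1] ** 2

def grad(x):
    """Gradient of the toy objective."""
    e = math.exp(x[0] + x[1])
    return [e + 2 * x[0], e + 2 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian estimate H (assumed symmetric).

    Expanded form of (I - rho s y^T) H (I - rho y s^T) + rho s s^T,
    where s is the change in x and y is the change in the gradient.
    """
    rho = 1.0 / dot(y, s)
    Hy = matvec(H, y)
    scale = rho * rho * dot(y, Hy) + rho
    n = len(s)
    return [
        [H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + scale * s[i] * s[j]
         for j in range(n)]
        for i in range(n)
    ]

def bfgs(x, max_iters=50, tol=1e-8):
    """Minimize f from x, starting with the identity as the inverse-Hessian estimate."""
    n = len(x)
    H = [[float(i == j) for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(max_iters):
        if math.sqrt(dot(g, g)) < tol:
            break
        p = [-v for v in matvec(H, g)]  # quasi-Newton direction
        # Backtracking line search: halve t until the Armijo condition holds.
        t, fx, slope = 1.0, f(x), dot(g, p)
        while t > 1e-12 and f([a + t * b for a, b in zip(x, p)]) > fx + 1e-4 * t * slope:
            t *= 0.5
        x_new = [a + t * b for a, b in zip(x, p)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        # Skip the update if the curvature condition y.s > 0 is not safely met.
        if dot(y, s) > 1e-12:
            H = bfgs_update(H, s, y)
        x, g = x_new, g_new
    return x

x_min = bfgs([0.0, 0.0])
print(x_min, grad(x_min))
```

Each update costs on the order of n**2 operations for n parameters, compared with roughly n**3 for solving with an exact Hessian. In practice a library routine would normally be used instead of a hand-written loop like this.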

Key idea

Newton's method scales steps by the inverse Hessian for fast convergence, but the full Hessian is too costly at scale, so quasi-Newton approximations are used to capture much of the gain.

Check yourself

1. What extra information does Newton's method use beyond the gradient?

2. Why is full Newton impractical for large models?

3. What do quasi-Newton methods like BFGS do?