Second Order Methods Overview

First order methods use only the gradient, the slope. Second order methods also use curvature, the rate at which the slope changes, to choose better step sizes and directions.

The Hessian idea

The Hessian is the matrix of second derivatives of the loss.
It tells you how steeply the gradient itself changes in every direction.
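Newton's method multiplies the gradient by the inverse Hessian and subtracts the result from the current parameters. On a quadratic loss with positive curvature in every direction, this single step lands exactly on the minimum.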
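The Newton step can be sketched on a small example. The snippet below is a minimal illustration, not a production optimizer: the two-variable quadratic loss, its coefficients (HESSIAN, LINEAR), and the helper names are all assumptions chosen for this sketch. It solves a 2x2 linear system rather than forming the inverse explicitly.

```python
# Hypothetical quadratic loss for illustration:
#   f(v) = 0.5 * v^T H v - b^T v,  with gradient H v - b and constant Hessian H.

def gradient(point, hessian, linear):
    """Gradient of the quadratic loss at `point`."""
    x, y = point
    (a, b), (c, d) = hessian
    p, q = linear
    return (a * x + b * y - p, c * x + d * y - q)

def solve_2x2(matrix, rhs):
    """Solve matrix @ s = rhs for a 2x2 matrix in closed form."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("Hessian is singular; the Newton step is undefined")
    r0, r1 = rhs
    return ((d * r0 - b * r1) / det, (a * r1 - c * r0) / det)

def newton_step(point, hessian, linear):
    """Return point - H^{-1} g, the Newton update."""
    g = gradient(point, hessian, linear)
    step = solve_2x2(hessian, g)
    return (point[0] - step[0], point[1] - step[1])

# Assumed example values (positive definite Hessian).
HESSIAN = ((4.0, 1.0), (1.0, 3.0))
LINEAR = (1.0, 2.0)

start = (10.0, -7.0)
after_one_step = newton_step(start, HESSIAN, LINEAR)
residual = gradient(after_one_step, HESSIAN, LINEAR)  # close to (0, 0)
```

Because this loss is exactly quadratic, one step reaches the point where the gradient vanishes. On a real network loss the Hessian changes from point to point, so the step is only a local estimate and is normally repeated.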

Why it can help

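In a ravine the curvature is large across the walls and small along the floor. A gradient step small enough to stay stable across the walls makes only slow progress along the floor. Curvature information lets the optimizer take a long step along the floor and a short one across the walls, typically reaching the minimum in far fewer steps than a first order method with a single global step size.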
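The ravine picture can be made concrete with a toy comparison. The sketch below assumes a two-variable quadratic with curvature 100 across the walls and 1 along the floor. These numbers, the starting point, and the function names are illustrative choices only.

```python
# Ravine-shaped quadratic: f(x, y) = 0.5 * (CURV_WALL * x**2 + CURV_FLOOR * y**2)
CURV_WALL = 100.0   # steep direction, across the walls
CURV_FLOOR = 1.0    # shallow direction, along the floor

def grad(x, y):
    """Gradient of the ravine-shaped quadratic."""
    return CURV_WALL * x, CURV_FLOOR * y

def gradient_descent_steps(x, y, lr, tol=1e-6, max_steps=100_000):
    """Count plain gradient steps until both coordinates are within tol of 0.

    The single step size must stay below 2 / CURV_WALL or the iteration
    diverges across the walls.
    """
    steps = 0
    while max(abs(x), abs(y)) > tol and steps < max_steps:
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        steps += 1
    return steps

def curvature_scaled_steps(x, y, tol=1e-6, max_steps=100_000):
    """Count steps when each coordinate is divided by its own curvature.

    For this diagonal Hessian, that is exactly the Newton step.
    """
    steps = 0
    while max(abs(x), abs(y)) > tol and steps < max_steps:
        gx, gy = grad(x, y)
        x, y = x - gx / CURV_WALL, y - gy / CURV_FLOOR
        steps += 1
    return steps

gd_steps = gradient_descent_steps(1.0, 1.0, lr=0.01)
newton_steps = curvature_scaled_steps(1.0, 1.0)
print(gd_steps, newton_steps)
```

In this toy setting gradient descent needs over a thousand steps, because the step size that is safe for the walls shrinks the floor coordinate by only 1% per step. The curvature-scaled update finishes in one.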

Why it is rare in deep learning

The Hessian of a network with n parameters has n × n entries, so for a large network it is too big to store, let alone invert. So practitioners use approximations. L-BFGS estimates curvature from a short history of recent parameter and gradient changes. K-FAC approximates the Fisher information matrix, a curvature matrix closely related to the Hessian, with a per-layer Kronecker-factored form. These reduce the cost but still rarely beat well tuned Adam at scale, so first order methods remain dominant while second order ideas inspire better preconditioners.
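To show how curvature can be estimated without ever forming the Hessian, here is a minimal sketch of the two-loop recursion at the core of L-BFGS. It approximates the product of the inverse Hessian with a gradient using only stored pairs (s, y), where s is a change in parameters and y is the matching change in gradient. The toy gradient function, the probe points, and all names are assumptions for illustration. Practical implementations also add a line search and skip pairs with non-positive dot(y, s), both omitted here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lbfgs_direction(grad, history):
    """Approximate H^{-1} @ grad from (s, y) pairs, oldest first."""
    q = list(grad)
    coeffs = []
    # First loop: newest pair to oldest.
    for s, y in reversed(history):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        coeffs.append((alpha, rho))
    # Scale by a scalar guess for the inverse Hessian, taken from the newest pair.
    if history:
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), (alpha, rho) in zip(history, reversed(coeffs)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r

# Hypothetical toy gradient: a quadratic with curvatures 100 and 1.
def toy_grad(p):
    return [100.0 * p[0], 1.0 * p[1]]

# Build (s, y) pairs from gradients observed at three assumed probe points.
points = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
history = []
for prev, curr in zip(points, points[1:]):
    s = [c - p for c, p in zip(curr, prev)]
    y = [gc - gp for gc, gp in zip(toy_grad(curr), toy_grad(prev))]
    history.append((s, y))

current = [2.0, 3.0]
direction = lbfgs_direction(toy_grad(current), history)
next_point = [c - d for c, d in zip(current, direction)]
```

Here the two stored pairs happen to capture the curvature in both coordinates, so the estimated direction matches the exact Newton direction and next_point lands on the minimum. With many parameters and only a handful of stored pairs, the result is a rough approximation.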

Key idea

Second order methods use curvature from the Hessian to take smarter steps, but the Hessian's size forces approximations that rarely beat tuned first order optimizers at scale.
