Backpropagation

The problem it solves

A neural network has many layers and thousands or millions of weights. To train it with gradient descent, you need the gradient of the loss with respect to every weight. Backpropagation computes all of them efficiently in one backward sweep.

Forward then backward

Training each batch has two phases:

The forward pass feeds inputs through the layers to produce a prediction and a loss
The backward pass applies the chain rule from calculus, propagating the error gradient from the output back through each layer

By reusing intermediate results, backpropagation finds every gradient in time comparable to a single forward pass, rather than recomputing from scratch for each weight.

Why the chain rule

Each layer's output depends on the previous layer's output. The chain rule multiplies local derivatives along this path, so the gradient at an early layer is a product of many terms. This is also why very deep networks can suffer from vanishing or exploding gradients.

Key idea

Backpropagation uses the chain rule in a single backward pass to compute every weight's gradient efficiently, enabling deep networks to learn.

The problem it solves

Forward then backward

Why the chain rule

Key idea

Check yourself