Softmax and Cross Entropy

Softmax

Softmax converts a vector of raw scores, called logits, into a probability distribution.

It exponentiates each logit and divides by the sum of all exponentials.
Outputs are positive and sum to one.
A higher logit gets a larger share of the probability, and the size of the gaps between logits sets how sharply the distribution peaks.
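
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `softmax` and the example logits are assumptions chosen for the demo, and subtracting the maximum logit is a common stability trick that the text does not itself describe.

```python
import math

def softmax(logits):
    """Map a list of raw scores to probabilities that sum to one."""
    # Subtracting the largest logit leaves the result unchanged
    # (it cancels in the ratio) but keeps exp() from overflowing.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits (assumed values, not from the text).
probs = softmax([2.0, 1.0, 0.1])

# Doubling every logit doubles the gaps, which sharpens the distribution.
sharper = softmax([4.0, 2.0, 0.2])
```

Comparing `probs` with `sharper` shows the effect of the gaps: the ranking of the classes is the same, but the top class takes a larger share when the logits are further apart.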

Cross entropy

Cross entropy measures how far a predicted distribution is from the true distribution.

For a one-hot target it reduces to the negative log probability of the correct class.
Minimizing it pushes the model to assign high probability to the right answer.
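
A small sketch may make the one-hot reduction concrete. The function name `cross_entropy` and the example distributions are assumptions for illustration; the formula used is the standard definition, the negative sum of target probability times log predicted probability.

```python
import math

def cross_entropy(target, predicted):
    """Return -sum_i target_i * log(predicted_i)."""
    # Terms with zero target weight contribute nothing, so they are skipped.
    # A predicted probability of exactly 0 on a class with positive target
    # weight would make math.log raise ValueError (the loss is unbounded).
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Illustrative values (assumed): the model puts 0.7 on the correct class.
predicted = [0.7, 0.2, 0.1]
one_hot = [1.0, 0.0, 0.0]

loss = cross_entropy(one_hot, predicted)
correct_class_term = -math.log(predicted[0])
```

Here `loss` and `correct_class_term` agree, since only the correct class carries target weight. Raising the probability on that class lowers the loss toward zero.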

A clean gradient

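Combining softmax with cross entropy yields a simple gradient with respect to the logits: the predicted probability minus the target. Libraries typically fuse the two into a single operation, which lets them compute the log probability directly from the logits instead of taking the log of a possibly tiny softmax output.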
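
One way to convince yourself of the predicted-minus-target result is to compare it with a numerical derivative. The sketch below assumes a one-hot target and uses hypothetical helper names (`softmax`, `fused_loss`). The fused loss is written with the log-sum-exp form, which is one common way to combine the two steps, not necessarily how any particular library does it.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_loss(logits, target_index):
    """Cross entropy of softmax(logits) against a one-hot target.

    Computed as log(sum(exp(logits))) - logits[target_index], so no
    small probability is ever passed through log().
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Illustrative inputs (assumed values).
logits = [2.0, 1.0, 0.1]
target_index = 0

# Analytic gradient: predicted probability minus the one-hot target.
probs = softmax(logits)
analytic = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

# Numerical gradient by central differences, one logit at a time.
eps = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    up[i] += eps
    down = list(logits)
    down[i] -= eps
    numeric.append((fused_loss(up, target_index) - fused_loss(down, target_index)) / (2 * eps))
```

The entries of `analytic` and `numeric` should match to within the accuracy of the finite-difference approximation. The gradient is negative for the correct class and positive for the others, so a gradient descent step raises the correct logit and lowers the rest.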

Key idea

Softmax maps logits to a probability distribution, and cross entropy scores that distribution against the truth. Together they give the clean predicted-minus-target gradient that drives classification.
