Softmax
Softmax converts a vector of raw scores, called logits, into a probability distribution.
- It exponentiates each logit and divides by the sum of all exponentials.
- Outputs are positive and sum to one.
- A higher logit gets a larger share, and the size of the gaps between logits sets how sharply the distribution peaks.
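The points above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation; the function name `softmax` and the example logits are chosen here for demonstration. Subtracting the maximum logit before exponentiating is a common stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Return softmax probabilities for a list of raw scores (logits)."""
    # Shifting by the max does not change the output, because the shift
    # cancels in the ratio, but it keeps exp() from overflowing.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits (assumed values, not from any dataset).
probs = softmax([2.0, 1.0, 0.1])

# Same logits scaled by 2: the gaps widen, so the distribution sharpens.
sharper = softmax([4.0, 2.0, 0.2])
```

Here `probs` is positive everywhere and sums to one, and the largest entry of `sharper` is larger than the largest entry of `probs`, which is the sharpening effect of wider gaps.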
Cross entropy
Cross entropy measures how far a predicted distribution is from the true distribution.
- For a one-hot target it reduces to the negative log probability of the correct class.
- Minimizing it pushes the model to assign high probability to the right answer.
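As a rough sketch of the definition, cross entropy between a target distribution and a prediction is the negative sum of target times log prediction. The helper name `cross_entropy` and the probabilities below are assumptions made for illustration only.

```python
import math

def cross_entropy(target, predicted):
    """Cross entropy: -sum_i target_i * log(predicted_i)."""
    # Terms with target_i == 0 contribute nothing, so they are skipped,
    # which also avoids evaluating log(0) for classes with zero prediction.
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Illustrative values (assumed): class 0 is the correct class.
one_hot = [1.0, 0.0, 0.0]
predicted = [0.7, 0.2, 0.1]

loss = cross_entropy(one_hot, predicted)

# A more confident correct prediction gives a smaller loss.
better_loss = cross_entropy(one_hot, [0.9, 0.05, 0.05])
```

With a one-hot target only the correct class survives the sum, so `loss` equals `-math.log(0.7)`, and `better_loss` is smaller because more probability sits on the right answer.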
A clean gradient
Combining softmax with cross entropy yields a simple gradient with respect to the logits: the predicted probabilities minus the target distribution. Libraries commonly fuse the two into a single operation for numerical stability.
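The gradient claim can be checked numerically. The sketch below assumes a one-hot target given by a class index, computes the loss directly from the logits using the log-sum-exp form (one common way a fused version might be written, not a specific library's code), and compares the analytic gradient with a central finite difference. All function names here are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_loss(logits, target_index):
    """Cross entropy of softmax(logits) against a one-hot target,
    computed from the logits without forming the probabilities."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def analytic_grad(logits, target_index):
    """Predicted probabilities minus the one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0)
            for i, p in enumerate(probs)]

def numeric_grad(logits, target_index, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(logit_i)."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        diff = fused_loss(up, target_index) - fused_loss(down, target_index)
        grad.append(diff / (2 * eps))
    return grad

# Illustrative logits (assumed); class 0 is the correct class.
logits = [2.0, 1.0, 0.1]
analytic = analytic_grad(logits, 0)
numeric = numeric_grad(logits, 0)
```

The two gradients agree to within finite-difference error, and the analytic entries sum to zero because both the probabilities and the target sum to one. The fused form also stays finite for extreme logits such as `[1000.0, 0.0]`, where exponentiating the raw logits would overflow.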
Key idea
Softmax maps logits to a probability distribution, and cross entropy scores that distribution against the truth. Together they give the clean predicted-minus-target gradient that drives classification.