From two classes to many
Softmax regression, also called multinomial logistic regression, extends logistic regression to more than two classes. Each class gets its own weight vector, producing one score per class.
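As a minimal sketch of the "one score per class" idea, the snippet below computes each score as the dot product of that class's weight vector with the features. The bias term, the function name class_scores, and the example numbers are illustrative assumptions, not part of the text above.

```python
def class_scores(weights, biases, x):
    """Return one linear score per class: score_k = w_k . x + b_k."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]

# Hypothetical 3-class model over 2 features (values chosen for illustration).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.5, 0.0]
scores = class_scores(W, b, [2.0, 1.0])  # one score for each of the 3 classes
```

With two classes the same construction collapses to ordinary logistic regression, since only the difference between the two scores matters.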
The softmax function
The softmax turns a vector of class scores into probabilities that are positive and sum to one. It exponentiates each score and divides by the sum of all the exponentials, so a higher score always yields a higher probability and the largest score marks the most probable class.
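The definition can be sketched directly in a few lines. This is a plain, unprotected version for illustration only; the function name and the example scores are assumptions.

```python
import math

def softmax(scores):
    """Exponentiate each score and divide by the sum of the exponentials."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
# probs is roughly [0.659, 0.242, 0.099]: all positive, summing to one,
# with the largest score receiving the largest probability.
```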
Training
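- The loss is categorical cross-entropy, which compares the predicted probabilities to the one-hot true label. For a single example it equals the negative log of the probability assigned to the true class.
- The loss stays convex in the weights, so gradient-based optimization cannot get trapped in a poor local minimum.
- The gradient again has the same form as in logistic regression: for each class, predicted probability minus true probability, times the feature vector.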
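The points above can be sketched for a single training example as follows. This is an illustrative sketch, not a full training loop: it omits bias terms, batching, and overflow protection, and the names loss_and_grad, W, and label are assumptions.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(W, x, label):
    """Cross-entropy loss and its gradient for one example.

    W: list of per-class weight vectors, x: feature vector,
    label: index of the true class (the position of the 1 in the one-hot label).
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    p = softmax(scores)
    # Against a one-hot label, cross-entropy is -log of the true-class probability.
    loss = -math.log(p[label])
    # Gradient for class k: (predicted - true probability) times the features.
    grad = [
        [(p[k] - (1.0 if k == label else 0.0)) * x_i for x_i in x]
        for k in range(len(W))
    ]
    return loss, grad

# With all-zero weights every class gets probability 1/3, so the loss is log(3).
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss, grad = loss_and_grad(W, [1.0, 2.0], label=0)
```

A gradient descent step would subtract a small multiple of grad from W. The gradient rows always sum to zero across classes, because both the predicted and the true probabilities sum to one.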
Practical details
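- Softmax is shift-invariant: adding the same constant to every score leaves the probabilities unchanged, so subtracting the maximum score before exponentiating prevents numerical overflow without changing the result.
- One weight vector is redundant since probabilities sum to one, but keeping all of them simplifies code, and a regularizer such as an L2 penalty removes the resulting ambiguity by picking out a single solution.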
- Softmax is the standard output layer for multiclass neural networks for these same reasons.
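The max-subtraction trick from the first point can be sketched as below. The function name stable_softmax and the example scores are assumptions chosen to make the overflow visible.

```python
import math

def stable_softmax(scores):
    """Softmax with the maximum score subtracted before exponentiating."""
    m = max(scores)
    # Every shifted score is <= 0, so math.exp never overflows here.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) alone would raise OverflowError, but the shifted
# version handles these scores without trouble.
p_big = stable_softmax([1000.0, 1001.0, 1002.0])

# Shift invariance: the same scores moved down by 1000 give the same result.
p_small = stable_softmax([0.0, 1.0, 2.0])
```

Both calls reduce to exponentiating the same shifted scores, which is why the two outputs agree.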
Key idea
Softmax regression gives each class its own weight vector and uses the softmax to produce probabilities that sum to one. It is trained by minimizing categorical cross-entropy, which is convex in the weights, and it is the canonical multiclass output layer.