The Softmax Regression

Generalizing logistic regression to many classes with one weight vector per class.

From two classes to many

Softmax regression, also called multinomial logistic regression, extends logistic regression to more than two classes. Each class gets its own weight vector, producing one score per class.
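As a minimal sketch of "one weight vector per class", the snippet below computes one linear score per class as a dot product. The function name `class_scores`, the example numbers, and the omission of a bias term are assumptions made for illustration; a bias can be folded in by appending a constant feature of 1.

```python
def class_scores(weights, x):
    """Return one linear score per class: the dot product of each weight vector with x."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

# Three classes, two features (illustrative numbers only).
weights = [
    [1.0, 0.0],    # weight vector for class 0
    [0.0, 1.0],    # weight vector for class 1
    [-1.0, -1.0],  # weight vector for class 2
]
x = [2.0, 1.0]

scores = class_scores(weights, x)
print(scores)  # [2.0, 1.0, -3.0]
```

With two classes this produces two scores, and only their difference matters, which is how the binary logistic model is recovered.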

The softmax function

The softmax turns a vector of class scores into probabilities that are positive and sum to one. It exponentiates each score and divides by the sum of all the exponentials, so the class with the largest score is the most probable.
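The definition can be sketched directly in a few lines. This is the plain textbook form, written with only the standard library; the name `softmax` and the example scores are chosen here for illustration.

```python
import math

def softmax(scores):
    """Exponentiate each score and divide by the sum of the exponentials."""
    exps = [math.exp(z) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -3.0])

# Every probability is positive, they sum to one (up to rounding),
# and the largest score (index 0) receives the largest probability.
print(probs)
print(sum(probs))
```

Because the exponential is increasing, the ordering of the scores is preserved in the probabilities.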

Training

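The loss is categorical cross entropy, which compares the predicted probabilities to the one hot true label and reduces to the negative log of the probability assigned to the true class.
The loss is convex in the weights, so any local minimum is a global one and gradient descent is well behaved.
As in logistic regression, the gradient for each class's weight vector reduces to the predicted probability minus the true probability, times the feature vector.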
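The training recipe above can be sketched for a single example as follows. All names (`predict`, `cross_entropy`, `gradient`, `sgd_step`), the learning rate, and the data are hypothetical and chosen for illustration; a real implementation would average over a batch and would typically add regularization.

```python
import math

def softmax(scores):
    exps = [math.exp(z) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Class probabilities for feature vector x, one weight vector per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    return softmax(scores)

def cross_entropy(weights, x, label):
    """With a one hot target, the loss is -log of the true class probability."""
    return -math.log(predict(weights, x)[label])

def gradient(weights, x, label):
    """Gradient per class: (predicted - true probability) times the features."""
    probs = predict(weights, x)
    return [
        [(p - (1.0 if k == label else 0.0)) * x_i for x_i in x]
        for k, p in enumerate(probs)
    ]

def sgd_step(weights, x, label, lr=0.1):
    """One plain gradient descent step on a single example."""
    grad = gradient(weights, x, label)
    return [[w_i - lr * g_i for w_i, g_i in zip(w, g)]
            for w, g in zip(weights, grad)]

# Three classes, two features, all weights starting at zero.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x, label = [2.0, 1.0], 1

before = cross_entropy(weights, x, label)  # equals log(3) for uniform predictions
after = cross_entropy(sgd_step(weights, x, label), x, label)
print(before, after)  # the loss decreases after the step
```

Comparing `gradient` against a finite difference of `cross_entropy` is a quick way to confirm the "predicted minus true, times the feature" form.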

Practical details

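The softmax is invariant to adding the same constant to every score, so subtracting the maximum score before exponentiating prevents numerical overflow without changing the probabilities.
Because the probabilities sum to one, one weight vector is redundant: subtracting the same vector from every class's weights leaves the predictions unchanged. Keeping all of them still simplifies code, and a penalty such as L2 regularization then picks out a single solution.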
Softmax is the standard output layer for multiclass neural networks for these same reasons.
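Both practical points can be checked with a short sketch. The names `stable_softmax` and `predict` and the numbers are assumptions for illustration. In pure Python, `math.exp(1000.0)` raises `OverflowError`, which is the failure the max subtraction avoids.

```python
import math

def stable_softmax(scores):
    """Softmax with the maximum score subtracted before exponentiating."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    return stable_softmax(scores)

# Shift invariance: adding 1000 to every score leaves the probabilities unchanged,
# and the shifted version no longer overflows.
small = stable_softmax([0.0, 1.0, 2.0])
big = stable_softmax([1000.0, 1001.0, 1002.0])

# Redundancy: subtracting the last class's weight vector from every class
# pins that class to all zeros without changing any prediction.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
ref = weights[-1]
reduced = [[w_i - r_i for w_i, r_i in zip(w, ref)] for w in weights]

x = [2.0, 1.0]
full_probs = predict(weights, x)
reduced_probs = predict(reduced, x)
```

Here `reduced` has an all-zero last row, which is the form used when only K - 1 weight vectors are stored for K classes.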

Key idea

Softmax regression gives each class its own weight vector and uses the softmax to produce probabilities that sum to one. It trains with convex categorical cross entropy and is the canonical multiclass output.
