The Sigmoid and Softmax Functions
Classifiers produce raw scores, but we usually want probabilities: values that stay between zero and one and sum to one across the possible outcomes. Two functions perform this conversion: the sigmoid and the softmax.
The sigmoid function takes a single real number and squashes it into the range zero to one. It is shaped like a smooth S. Large positive inputs map near one, large negative inputs map near zero, and an input of zero maps to one half. This makes sigmoid the natural choice for binary classification, where one output expresses the probability of the positive class.
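The behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `sigmoid` and the sign-based branching (a common trick to avoid overflow in `math.exp`) are choices made here, not something the text prescribes.

```python
import math

def sigmoid(x):
    """Map a real-valued score to a probability-like value."""
    # Branch on the sign so math.exp only ever sees a non-positive argument,
    # which avoids OverflowError for large-magnitude inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

half = sigmoid(0.0)     # an input of zero maps to one half
high = sigmoid(10.0)    # large positive input: close to one
low = sigmoid(-10.0)    # large negative input: close to zero
```

In exact arithmetic the output never reaches zero or one; in floating point, extreme inputs can round to exactly 0.0 or 1.0.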
The softmax function generalizes this to many classes. Given a vector of scores, it exponentiates each and divides by the sum of all exponentials. The result is a set of probabilities that are all positive and add up to one. Softmax is the standard final step for multiclass classification.
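As a rough sketch of the exponentiate-and-normalize step, assuming the scores arrive as a plain Python list: the name `softmax` and the subtraction of the maximum score are choices made for this example. Subtracting the maximum does not change the result mathematically, since the common factor cancels in the division, but it keeps `math.exp` from overflowing on large scores.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to one."""
    # Shift by the largest score so every exponent is <= 0 (no overflow).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    # Each exponential is divided by the sum of all exponentials.
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # three classes, highest score first
```

Here `probs` holds three positive values whose order matches the order of the input scores and whose sum is one up to floating-point rounding.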
Two properties are worth remembering:
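- Both functions are monotonic: raising a score raises its probability, and softmax preserves the ranking of the scores within a vector
- Softmax over two classes reduces to the sigmoid of the difference between the two scores, so the sigmoid is the two-class special case of softmax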
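The two-class relationship can be checked numerically: softmax over two scores assigns the first class the same probability as the sigmoid of the score difference. The sketch below redefines both functions so it runs on its own; the helper names and the sample scores `a` and `b` are illustrative only.

```python
import math

def sigmoid(x):
    # Sign-based branching keeps math.exp from overflowing.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softmax(scores):
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

a, b = 1.7, -0.4                     # arbitrary example scores
p_softmax = softmax([a, b])[0]       # probability of the first class
p_sigmoid = sigmoid(a - b)           # sigmoid of the score difference
difference = abs(p_softmax - p_sigmoid)
```

The two values agree up to floating-point rounding, because e^a / (e^a + e^b) equals 1 / (1 + e^(b - a)) after dividing numerator and denominator by e^a.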
Because both functions are smooth, they are differentiable everywhere, which is what gradient-based training requires. They turn arbitrary scores into calibrated-looking probabilities that the loss can act on.
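To illustrate how simple the derivative is, the sigmoid's slope can be written in terms of its own output: the derivative at x is s(x) * (1 - s(x)). The sketch below compares that closed form with a central finite difference; the helper names `sigmoid_grad` and `numeric_grad` and the step size `h` are assumptions made for this example.

```python
import math

def sigmoid(x):
    # Plain form; adequate for the moderate inputs used here.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Largest gap between the closed form and the numerical estimate.
max_gap = max(
    abs(sigmoid_grad(x) - numeric_grad(sigmoid, x))
    for x in (-2.0, 0.0, 3.0)
)
```

The slope peaks at 0.25 when the input is zero and shrinks toward zero as the input moves far from zero in either direction.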
Key idea
Sigmoid maps one score to a probability for binary tasks, while softmax maps a vector of scores to probabilities that sum to one for multiclass tasks.