Maximum Likelihood Estimation
Maximum likelihood estimation, or MLE, is a principled way to fit a model. It asks which parameter values make the observed data most probable.
The procedure
- Write the likelihood, the probability of the data as a function of the parameters.
- Find the parameter values that maximize it.
Because data points are usually assumed independent, the likelihood is a product over examples. In practice we almost always maximize the log likelihood instead: the log turns the product into a sum, which avoids the floating-point underflow caused by multiplying many small probabilities and is easier to differentiate. Because the logarithm is strictly increasing, the maximizing parameters are identical either way.
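A minimal sketch of why the log matters numerically. The data here is made up for illustration (2000 alternating coin flips), and the names `flips`, `p`, `likelihood`, and `log_likelihood` are hypothetical.

```python
import math

# Hypothetical data: 2000 coin flips, alternating heads (1) and tails (0).
flips = [1, 0] * 1000
p = 0.5  # candidate head probability to evaluate

# Direct product of per-flip probabilities: each factor is 0.5, so the
# running product shrinks below the smallest representable double.
likelihood = 1.0
for x in flips:
    likelihood *= p if x == 1 else 1.0 - p

# Sum of per-flip log probabilities: an ordinary finite number.
log_likelihood = sum(math.log(p if x == 1 else 1.0 - p) for x in flips)

print(likelihood)      # the product has underflowed to zero
print(log_likelihood)  # roughly 2000 * log(0.5)
```

With the raw product, every candidate value of p would score zero and could not be compared. The log likelihood keeps the candidates distinguishable.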
A simple example
If you flip a coin ten times and see seven heads, the maximum likelihood estimate of the head probability is seven tenths, the observed frequency. MLE recovers the intuitive answer.
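The coin example can be checked numerically. The sketch below assumes independent flips and uses a simple grid search rather than calculus. The grid resolution and the names `log_likelihood`, `grid`, and `p_hat` are illustrative choices.

```python
import math

heads, tails = 7, 3  # ten flips, seven heads, as in the example

def log_likelihood(p):
    """Bernoulli log likelihood of the observed flips for head probability p."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Candidate head probabilities strictly inside (0, 1), in steps of 0.001.
grid = [i / 1000 for i in range(1, 1000)]

# The maximum likelihood estimate is the candidate with the highest score.
p_hat = max(grid, key=log_likelihood)

print(p_hat)
```

The same answer follows analytically: setting the derivative 7/p - 3/(1 - p) to zero gives p = 7/10.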
Link to machine learning
Many training objectives are MLE in disguise.
- Minimizing mean squared error in linear regression is MLE when the noise is assumed independent and Gaussian with constant variance.
- Minimizing cross entropy in classification is MLE for a categorical model.
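Both correspondences can be verified on a toy example. The sketch below assumes a fixed noise standard deviation `sigma` for the Gaussian case and integer class labels for the categorical case. All data values and function names are hypothetical.

```python
import math

def mse(y, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)

def gaussian_nll(y, y_pred, sigma=1.0):
    """Negative log likelihood assuming y_i ~ Normal(y_pred_i, sigma^2)."""
    n = len(y)
    squared_error = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)

def cross_entropy(labels, probs):
    """Mean cross entropy between one-hot targets and predicted probabilities."""
    total = 0.0
    for k, p in zip(labels, probs):
        one_hot = [1.0 if j == k else 0.0 for j in range(len(p))]
        total -= sum(t * math.log(q) for t, q in zip(one_hot, p))
    return total / len(labels)

def categorical_nll(labels, probs):
    """Negative log likelihood of the labels under the predicted probabilities."""
    return -sum(math.log(p[k]) for k, p in zip(labels, probs))

# Made-up regression targets and predictions.
y, y_pred = [1.0, 2.0, 3.5], [1.5, 1.5, 3.0]
n = len(y)
# With sigma = 1, the NLL is a constant plus (n / 2) * MSE.
gaussian_gap = gaussian_nll(y, y_pred) - (0.5 * n * math.log(2 * math.pi) + 0.5 * n * mse(y, y_pred))

# Made-up class labels and predicted class probabilities.
labels = [0, 2]
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
# The NLL is the mean cross entropy times the number of examples.
categorical_gap = categorical_nll(labels, probs) - len(labels) * cross_entropy(labels, probs)

print(gaussian_gap, categorical_gap)  # both are zero up to rounding
```

Because each loss differs from the corresponding negative log likelihood only by a positive scale factor and an additive constant that do not depend on the predictions, minimizing one minimizes the other.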
Cautions
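MLE can overfit when data is scarce, since it trusts the sample completely: a single flip that lands heads gives an estimated head probability of one. Adding a prior over the parameters turns it into maximum a posteriori (MAP) estimation, in which the log prior acts as a regularization penalty on the objective.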
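A small sketch of the contrast for a coin. It assumes a Beta(a, b) prior on the head probability, a choice made here for illustration. The posterior mode then has the closed form (heads + a - 1) / (n + a + b - 2). The function names and the default a = b = 2 are hypothetical.

```python
def mle_heads(heads, n):
    """Maximum likelihood estimate: the observed frequency of heads."""
    return heads / n

def map_heads(heads, n, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior, with a > 1 and b > 1."""
    return (heads + a - 1.0) / (n + a + b - 2.0)

# One flip, one head: MLE commits fully, MAP is pulled toward one half.
print(mle_heads(1, 1), map_heads(1, 1))

# With plenty of data the prior barely matters and the two nearly agree.
print(mle_heads(7000, 10000), map_heads(7000, 10000))
```

The prior behaves like a couple of imaginary flips added to the sample, which is why its influence fades as real data accumulates.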
Key idea
Maximum likelihood estimation picks parameters that make the observed data most probable, and common losses like squared error and cross entropy are MLE in disguise.