Machine Learning

Maximum Likelihood Estimation

Choosing parameters that make the observed data most probable.

Maximum likelihood estimation, or MLE, is a principled way to fit a model. It asks which parameter values make the observed data most probable.

The procedure

  • Write the likelihood, the probability of the data as a function of the parameters.
  • Find the parameter values that maximize it.

Because data points are usually assumed independent, the likelihood is a product over examples. We almost always maximize the log likelihood instead: the logarithm turns the product into a sum, which avoids floating-point underflow from multiplying many small probabilities and is easier to differentiate. Because the logarithm is strictly increasing, the maximizing parameters are identical either way.
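
To make the numerical point concrete, here is a minimal sketch using only the Python standard library. The data are hypothetical: 2000 independent observations, each assigned probability 0.5 by the model. The raw product underflows, while the sum of logs stays finite.

```python
import math

# Hypothetical data: 2000 independent observations, each with
# probability 0.5 under the model (for example, fair-coin flips).
probs = [0.5] * 2000

# Raw likelihood: a product of many numbers below 1.
# 0.5 ** 2000 is smaller than the smallest positive double, so it becomes 0.0.
likelihood = math.prod(probs)

# Log likelihood: the same information as a sum of logs, which stays finite.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0
print(log_likelihood)  # roughly -1386.29
```

Once the likelihood has collapsed to 0.0, every parameter setting looks equally bad, so there is nothing left to maximize. The log likelihood does not have this problem.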

A simple example

If you flip a coin ten times and see seven heads, the maximum likelihood estimate of the head probability is seven tenths, the observed frequency. MLE recovers the intuitive answer.
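
The coin example can be checked numerically. The sketch below is one possible way to do it, not the only one: it evaluates the log likelihood on a grid of candidate head probabilities and keeps the best. The function name `coin_log_likelihood` and the grid spacing are illustrative choices. The binomial coefficient is left out because it does not depend on the parameter.

```python
import math

def coin_log_likelihood(p, heads, flips):
    """Log likelihood of head probability p, up to a constant that does not depend on p."""
    tails = flips - heads
    return heads * math.log(p) + tails * math.log(1.0 - p)

heads, flips = 7, 10

# Candidate values 0.01, 0.02, ..., 0.99 (the endpoints are excluded to avoid log(0)).
grid = [i / 100 for i in range(1, 100)]

# Pick the candidate with the highest log likelihood.
p_hat = max(grid, key=lambda p: coin_log_likelihood(p, heads, flips))

print(p_hat)  # 0.7
```

Setting the derivative of the log likelihood to zero gives the same answer in closed form: heads divided by flips.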

Link to machine learning

Many training objectives are MLE in disguise.

  • Minimizing mean squared error in linear regression is MLE under additive Gaussian noise with a fixed variance.
  • Minimizing cross entropy in classification is MLE for a categorical model.
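
The first of these links can be illustrated with a small sketch. The toy data, the one-parameter model y ≈ w * x, and the fixed noise level `SIGMA` are all assumptions made for the example. With the noise level fixed, the Gaussian negative log likelihood is the squared error scaled and shifted by constants, so both are minimized by the same w.

```python
import math

# Hypothetical toy data for a one-parameter model y ≈ w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
SIGMA = 1.0  # assumed, fixed noise standard deviation

def mse(w):
    """Mean squared error of the predictions w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gaussian_nll(w, sigma=SIGMA):
    """Negative log likelihood when y = w * x + Gaussian noise with std sigma."""
    total = 0.0
    for x, y in zip(xs, ys):
        resid = y - w * x
        total += 0.5 * math.log(2 * math.pi * sigma**2) + resid**2 / (2 * sigma**2)
    return total

# Search the same grid of slopes (1.000 to 3.000) with both objectives.
grid = [i / 1000 for i in range(1000, 3001)]
w_mse = min(grid, key=mse)
w_nll = min(grid, key=gaussian_nll)

print(w_mse, w_nll)  # the two minimizers coincide
```

Cross entropy follows the same pattern: it is the average negative log probability that a categorical model assigns to the correct labels.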

Cautions

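MLE can overfit with little data, since it trusts the sample completely: a coin that lands heads on both of two flips gets an estimated head probability of 1. Adding a prior turns it into maximum a posteriori (MAP) estimation, which acts like regularization. A Gaussian prior on the weights, for example, corresponds to an L2 penalty.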
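
As a rough illustration of the difference, the sketch below compares the two estimates for a coin. The Beta(2, 2) prior and the function names are assumptions chosen for the example. The MAP formula is the mode of the Beta posterior and is valid when both prior parameters exceed 1.

```python
def mle_head_prob(heads, flips):
    """Maximum likelihood estimate: the observed frequency of heads."""
    return heads / flips

def map_head_prob(heads, flips, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior, with a > 1 and b > 1."""
    return (heads + a - 1) / (flips + a + b - 2)

# Two flips, both heads: MLE jumps to certainty, MAP is pulled toward 0.5.
print(mle_head_prob(2, 2))   # 1.0
print(map_head_prob(2, 2))   # 0.75

# With more data the two estimates move closer together.
print(mle_head_prob(7, 10))  # 0.7
print(map_head_prob(7, 10))  # about 0.667
```

The prior behaves like a few extra imaginary flips, which matter a lot when real data are scarce and fade as the sample grows.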

Key idea

Maximum likelihood estimation picks parameters that make the observed data most probable, and common losses like squared error and cross entropy are MLE in disguise.

Check yourself

1. What does maximum likelihood estimation maximize?

2. Why is the log likelihood usually maximized instead of the likelihood?

3. Minimizing mean squared error corresponds to MLE under what noise assumption?