Machine Learning

The Mixture of Experts

Routing each token to a few expert subnetworks so capacity grows without proportional cost.

5 min read · advanced

Sparse capacity

A mixture of experts (MoE) layer replaces a dense layer with many parallel expert subnetworks plus a small router. For each token, the router picks only a few experts to run. The model holds a very large number of parameters, yet each token activates only a fraction of them, so per-token compute stays modest while capacity grows.
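The gap between stored and active parameters can be made concrete with some rough arithmetic. The sketch below assumes each expert is a two-matrix feed-forward block and ignores biases; the function names and the sizes (`d_model=1024`, `d_hidden=4096`, 8 experts, 2 active) are illustrative choices, not values from this lesson.

```python
def ffn_params(d_model, d_hidden):
    # One expert, assumed to be two weight matrices (up- and down-projection); biases ignored.
    return 2 * d_model * d_hidden

def moe_params(d_model, d_hidden, num_experts, top_k):
    expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts           # one score per expert
    total = num_experts * expert + router    # parameters that must be stored
    active = top_k * expert + router         # parameters a single token actually uses
    return total, active

total, active = moe_params(d_model=1024, d_hidden=4096, num_experts=8, top_k=2)
print(f"stored: {total:,}  active per token: {active:,}  ratio: {active / total:.2f}")
```

With these assumed sizes, each token touches roughly a quarter of the layer's parameters, while all of them still have to be held in memory.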

The router

The router scores every expert for a token and selects the top few, often one or two. Only those experts process the token, and their outputs are combined as a weighted sum using the router scores. The unchosen experts do no work for that token, which is why the layer is called sparse.
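The routing step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the router is taken to be a single linear map, the gate weights are a softmax over only the selected experts (some designs normalize over all experts instead), and the toy experts, `moe_forward`, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, router_w, experts, top_k=2):
    """x: (tokens, d_model); router_w: (d_model, num_experts); experts: list of callables."""
    logits = x @ router_w                                   # router score per token and expert
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]       # indices of the top_k experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                    # weights over the chosen experts only
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)          # tokens that selected expert e
        if token_ids.size == 0:
            continue                                        # this expert does no work for the batch
        out[token_ids] += gates[token_ids, slot][:, None] * expert(x[token_ids])
    return out, top_idx, gates

# Toy setup: 4 experts, each a small nonlinear map (assumed for illustration).
rng = np.random.default_rng(0)
d_model, num_experts = 8, 4
weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
experts = [lambda h, w=w: np.tanh(h @ w) for w in weights]
x = rng.normal(size=(5, d_model))
router_w = rng.normal(size=(d_model, num_experts))

out, top_idx, gates = moe_forward(x, router_w, experts, top_k=2)
print(top_idx)  # which two experts each of the 5 tokens was sent to
```

Grouping tokens by expert, rather than looping over tokens, mirrors how each expert would run once over its own batch of tokens.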

Balancing the load

Left alone, the router may favor a few popular experts and starve the rest. Training therefore adds a load balancing loss term that encourages tokens to spread across experts, so that all of them learn and hardware stays evenly used.
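The lesson does not give a formula for this term. One widely used form, assumed here for illustration, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert, sums over experts, and scales by the number of experts. The function name and the toy inputs below are hypothetical, and the sketch covers top-1 routing only.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index):
    """router_probs: (tokens, num_experts) softmax outputs; expert_index: (tokens,) top-1 choice."""
    num_tokens, num_experts = router_probs.shape
    # f[i]: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_index, minlength=num_experts) / num_tokens
    # p[i]: average probability the router assigned to expert i
    p = router_probs.mean(axis=0)
    return num_experts * np.sum(f * p)

# Perfectly spread routing: each of 4 tokens goes to a different expert.
uniform = np.full((4, 4), 0.25)
balanced = load_balancing_loss(uniform, np.array([0, 1, 2, 3]))

# Collapsed routing: every token goes to expert 0 with high probability.
skewed = np.tile([0.97, 0.01, 0.01, 0.01], (4, 1))
collapsed = load_balancing_loss(skewed, np.zeros(4, dtype=int))

print(balanced, collapsed)  # about 1.0 when balanced, about 3.88 when collapsed
```

In training this value would be multiplied by a small coefficient and added to the main loss, so the router is nudged toward an even spread without overriding what it learns about the task.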

Trade-offs

  • Far more parameters at similar compute per token.
  • All experts must still fit in memory, so memory cost is high.
  • Routing adds complexity and the risk of uneven load.

Key idea

A mixture of experts routes each token to a few expert subnetworks, growing parameter count without growing per-token compute, while a load balancing loss keeps expert usage even.

Check yourself

1. How many experts process a typical token?

2. Why add a load balancing loss?

3. What stays roughly constant per token in a mixture of experts?