Machine Learning

Mixture of Experts

Scaling parameters without scaling compute by routing tokens to a few experts.

5 min read · advanced

The motivation

Larger models tend to perform better, but running every parameter for every token is expensive. A mixture-of-experts layer splits a layer into many parallel sub-networks called experts and runs only a few of them for each token.

How routing works

A small gating network, also called the router, looks at each token and chooses the top few experts to handle it. Only those experts run, so the model has a huge parameter count but a modest cost per token. This is called sparse activation.

  • The router scores all experts for a token
  • It picks the top experts, often one or two
  • Only the selected experts compute, and their outputs are combined
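
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production layer: the function name `moe_forward`, the single-matrix `tanh` experts, and the choice to renormalize gate weights with a softmax over only the selected experts are all assumptions made for the example. Real systems differ in these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, expert_ws, k=2):
    """Hypothetical top-k MoE layer.

    tokens:    (n_tokens, d)      input vectors
    router_w:  (d, n_experts)     router (gating) weights
    expert_ws: (n_experts, d, d)  one toy weight matrix per expert
    """
    # 1. The router scores all experts for every token.
    logits = tokens @ router_w                          # (n_tokens, n_experts)

    # 2. Pick the k highest-scoring experts per token.
    top_idx = np.argsort(-logits, axis=-1)[:, :k]       # (n_tokens, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                # weights over chosen experts

    # 3. Only the selected experts compute; outputs are combined by gate weight.
    out = np.zeros_like(tokens)
    for e in range(expert_ws.shape[0]):
        token_pos, slot = np.nonzero(top_idx == e)      # tokens routed to expert e
        if token_pos.size == 0:
            continue                                    # idle expert does no work
        expert_out = np.tanh(tokens[token_pos] @ expert_ws[e])
        out[token_pos] += gates[token_pos, slot][:, None] * expert_out
    return out, top_idx, gates

# Usage: 8 tokens of width 4, 4 experts, 2 experts per token.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y, chosen, weights = moe_forward(
    x, rng.normal(size=(4, 4)), rng.normal(size=(4, 4, 4)), k=2
)
```

Each expert only sees the rows routed to it, which is where the compute saving comes from: with k = 2 of 4 experts, each token touches half of the expert parameters.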

Balancing the load

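Left alone, the router may send most tokens to a few favorite experts while the others sit idle and learn little. An auxiliary load-balancing loss, added to the training objective, encourages even usage. A capacity limit caps how many tokens each expert takes per batch so that none is overwhelmed; in many implementations, tokens that overflow an expert's capacity skip that expert and pass through on the residual connection.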
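
To make the two mechanisms concrete, here is a rough sketch assuming top-1 routing. The loss follows one common formulation (the number of experts times the sum, over experts, of the fraction of tokens sent to each expert multiplied by its mean router probability); the lesson does not say which formulation it has in mind, so treat this as one option. The names `load_balance_loss` and `apply_capacity`, the `capacity_factor` default, and the first-come-first-served overflow rule are assumptions for illustration.

```python
import math
import numpy as np

def load_balance_loss(router_probs, expert_idx):
    """Auxiliary loss for top-1 routing (one common formulation).

    router_probs: (n_tokens, n_experts) softmax output of the router
    expert_idx:   (n_tokens,) index of the expert each token was sent to
    """
    n_tokens, n_experts = router_probs.shape
    # f[i]: fraction of tokens actually dispatched to expert i.
    f = np.bincount(expert_idx, minlength=n_experts) / n_tokens
    # p[i]: average router probability assigned to expert i.
    p = router_probs.mean(axis=0)
    # Equals 1.0 when routing is perfectly uniform; grows as routing collapses.
    return n_experts * float(np.sum(f * p))

def apply_capacity(expert_idx, n_experts, capacity_factor=1.25):
    """Mark which tokens fit within each expert's capacity (hypothetical rule)."""
    n_tokens = expert_idx.shape[0]
    capacity = math.ceil(capacity_factor * n_tokens / n_experts)
    keep = np.zeros(n_tokens, dtype=bool)
    counts = np.zeros(n_experts, dtype=int)
    for t, e in enumerate(expert_idx):
        if counts[e] < capacity:        # earlier tokens claim slots first
            keep[t] = True
            counts[e] += 1
    return keep, capacity
```

In training, the auxiliary loss is usually scaled by a small coefficient and added to the main loss, so it nudges the router toward balance without dominating the objective.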

Trade-offs

  • Upside: much larger effective capacity for similar compute per token
  • Downside: more memory to hold all experts, and added complexity in routing and distributed training
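
A little arithmetic shows both sides of the trade-off. The sizes below are hypothetical, chosen only to make the numbers easy to follow; each expert is assumed to be a two-matrix feed-forward block with biases ignored, and `moe_param_counts` is an illustrative helper.

```python
def moe_param_counts(d_model, d_ff, n_experts, k):
    """Count stored vs. per-token active parameters for one MoE layer."""
    per_expert = 2 * d_model * d_ff          # up-projection + down-projection
    router = d_model * n_experts             # one score per expert
    total = n_experts * per_expert + router  # must all be held in memory
    active = k * per_expert + router         # actually used for a single token
    return total, active

# Hypothetical sizes: 8 experts, 2 active per token.
total, active = moe_param_counts(d_model=1024, d_ff=4096, n_experts=8, k=2)
print(f"stored: {total:,}  active per token: {active:,}")
# stored: 67,117,056  active per token: 16,785,408
```

With these numbers the layer stores about four times as many parameters as it uses for any one token: the upside is the extra capacity, and the downside is that all of it must still sit in memory.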

Many recent large language models use mixture-of-experts layers in place of some dense feed-forward layers to grow capacity cheaply.

Key idea

Mixture of experts routes each token to a few specialized sub-networks, growing the parameter count while keeping per-token compute low.

Check yourself

1. What does the gating network in a mixture of experts decide?

2. What is the main benefit of sparse expert activation?

3. Why add an auxiliary load balancing loss?