Mixture of Experts

The motivation

Bigger models tend to be more capable, but running every parameter for every token is expensive. Mixture of experts splits a layer into many parallel sub-networks called experts and uses only a few of them for each token.

How routing works

A small gating network, also called the router, looks at each token and chooses the top few experts to handle it. Only those experts run, so the model has a huge parameter count but a modest cost per token. This is called sparse activation.

The router scores all experts for a token
It picks the top experts, often one or two
Only the selected experts compute, and their outputs are combined
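
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation. The names route_token, router_weights and experts are hypothetical. It assumes the router is a single linear layer and that the gate weights are a softmax taken over the selected experts only, which is one common choice among several.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, router_weights, experts, k=2):
    """Send one token through its top-k experts and mix their outputs."""
    # 1. The router scores every expert: one dot product per expert.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # 2. Keep the k highest-scoring experts (ties keep the lower index first).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # 3. Assumption: gate weights are renormalized over the selected experts.
    gates = softmax([scores[i] for i in top])
    # 4. Only the selected experts run; outputs are combined as a weighted sum.
    output = [0.0] * len(token)
    for gate, i in zip(gates, top):
        expert_out = experts[i](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, top, gates

# Toy usage: four "experts" that just scale the token by a constant.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
toy_router = [[0.1, 0.0], [2.0, 0.0], [0.5, 0.0], [2.0, 0.0]]
mixed, chosen, weights = route_token([1.0, 0.0], toy_router, toy_experts, k=2)
```

In the toy call, experts 1 and 3 tie for the highest score, so each gets a gate weight of 0.5 and the other two experts never run.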

Balancing the load

Left alone, the router may send most tokens to a few favorite experts while the others sit idle. An auxiliary load-balancing loss encourages even usage, and a capacity limit caps how many tokens each expert accepts per batch so that none is overwhelmed.
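
As a rough sketch of both mechanisms, the code below uses one widely used formulation of the auxiliary loss: the number of experts times the sum, over experts, of the fraction of tokens sent to that expert multiplied by the mean router probability for it. The text above does not say which formulation is meant, so treat this choice as an assumption. The capacity rule, the capacity_factor parameter and the function names are also illustrative. Dropping overflow tokens is only one possible policy.

```python
import math

def load_balance_loss(router_probs, assignments, num_experts):
    """Assumed form: num_experts * sum_i(token fraction_i * mean router prob_i)."""
    n_tokens = len(assignments)
    loss = 0.0
    for i in range(num_experts):
        # Fraction of tokens actually dispatched to expert i.
        frac_tokens = sum(1 for a in assignments if a == i) / n_tokens
        # Average probability the router assigned to expert i.
        mean_prob = sum(p[i] for p in router_probs) / n_tokens
        loss += frac_tokens * mean_prob
    return num_experts * loss

def apply_capacity(assignments, num_experts, capacity_factor=1.25):
    """Keep at most `capacity` tokens per expert, in arrival order."""
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            # Overflow token: this sketch simply drops it.
            dropped.append(token_idx)
    return kept, dropped, capacity

# Balanced routing gives a lower loss than routing everything to expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
skewed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
```

With perfectly even routing this loss evaluates to 1, and it grows as tokens and router probability concentrate on fewer experts, which is what pushes training toward even usage.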

Trade-offs

Upside: much larger effective capacity for similar compute per token
Downside: more memory to hold all experts, plus added complexity in routing and distributed training
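
A back-of-the-envelope calculation makes the trade-off concrete. The sketch below assumes each expert is a standard two-matrix feed-forward block without biases and that the router is a single linear layer. The function name and the layer sizes in the usage line are hypothetical, chosen only for illustration.

```python
def moe_ffn_params(d_model, d_ff, num_experts, k):
    """Return (total, active) parameter counts for one MoE feed-forward layer."""
    # One expert: an up-projection and a down-projection, biases ignored.
    per_expert = 2 * d_model * d_ff
    # Router: one score per expert from a d_model-sized token vector.
    router = d_model * num_experts
    # Every expert must be held in memory...
    total = num_experts * per_expert + router
    # ...but only k of them run for any given token.
    active = k * per_expert + router
    return total, active

# Hypothetical sizes: 8 experts, 2 active per token.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, k=2)
ratio = total / active
```

With these sizes the layer stores roughly four times as many parameters as it uses for any single token, which is the memory-for-compute exchange described above.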

Many recent large language models use mixture of experts layers in place of some dense feed-forward layers to grow capacity without a matching increase in compute per token.

Key idea