The motivation
Larger models tend to perform better, but running every parameter for every token is expensive. Mixture of experts splits a layer into many parallel subnetworks called experts and uses only a few of them per input.
How routing works
A small gating network, also called the router, looks at each token and chooses the top few experts to handle it. Only those experts run, so the model has a huge parameter count but a modest cost per token. This is called sparse activation.
- The router scores all experts for a token
- It picks the top experts, often one or two
- Only the selected experts compute, and their outputs are combined
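The three steps above can be sketched in plain Python. This is a minimal illustration, not a production layer: the router scores are passed in directly (in a real model a small learned layer would compute them from the token), the names `route_token` and `moe_layer` are hypothetical, and renormalizing the top-k weights so they sum to 1 is one common choice among several.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_scores, experts, k=2):
    """Run only the selected experts and sum their outputs, weighted by the gate."""
    output = [0.0] * len(token)
    for idx, weight in route_token(router_scores, k):
        expert_out = experts[idx](token)  # experts that were not selected never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Hypothetical toy experts: each one scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
scores = [0.1, 2.0, 0.3, 2.0]  # made-up router scores for four experts

selected = route_token(scores, k=2)
result = moe_layer(token, scores, experts, k=2)
```

With these made-up scores, experts 1 and 3 tie for the highest score, so each receives half the weight and the other two experts are skipped entirely.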
Balancing the load
Left alone, the router may send most tokens to a few favorite experts while others sit idle. An auxiliary load-balancing loss encourages even usage, and a capacity limit caps how many tokens each expert accepts per batch so that none is overwhelmed.
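To make both mechanisms concrete, here is a small sketch under stated assumptions. The loss uses one common formulation (the number of experts times the sum, over experts, of the fraction of tokens routed to that expert multiplied by its mean router probability); other formulations exist. The capacity rule (a capacity factor times tokens divided by experts, rounded up) and the choice to mark overflow tokens as `None` are likewise illustrative, and the function names are hypothetical. How overflow tokens are handled varies between implementations.

```python
import math

def load_balance_loss(router_probs, assignments, num_experts):
    """num_experts * sum_i f_i * p_i; equals 1.0 when usage is perfectly uniform."""
    n_tokens = len(assignments)
    frac = [0.0] * num_experts       # f_i: fraction of tokens sent to expert i
    mean_prob = [0.0] * num_experts  # p_i: mean router probability for expert i
    for probs, chosen in zip(router_probs, assignments):
        frac[chosen] += 1.0 / n_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

def apply_capacity(assignments, num_experts, capacity_factor=1.0):
    """Accept tokens in order until an expert is full; overflow tokens become None."""
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    counts = [0] * num_experts
    kept = []
    for chosen in assignments:
        if counts[chosen] < capacity:
            counts[chosen] += 1
            kept.append(chosen)
        else:
            kept.append(None)  # this token exceeded its expert's capacity
    return kept, capacity

# Four tokens, two experts: an even split versus a collapsed router.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
skewed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)

kept, capacity = apply_capacity([0, 0, 0, 1], num_experts=2)
```

The skewed router yields a larger loss than the balanced one, which is the signal that pushes training toward even usage. In the capacity example each expert accepts at most two tokens, so the third token sent to expert 0 overflows.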
Trade-offs
- Upside: much larger effective capacity for similar compute per token
- Downside: more memory to hold all experts, and added complexity in routing and distributed training
Many recent large language models use mixture of experts layers in place of some dense feed-forward layers to grow capacity cheaply.
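A rough back-of-the-envelope calculation shows why memory grows while per-token compute does not. The sketch below assumes each expert is a two-matrix feed-forward block with biases ignored and that the router is a single linear layer. The sizes are made up for illustration and do not describe any particular model.

```python
def ffn_params(d_model, d_hidden):
    # Two weight matrices (d_model x d_hidden and d_hidden x d_model), biases ignored.
    return 2 * d_model * d_hidden

def moe_param_counts(d_model, d_hidden, num_experts, top_k):
    """Return (parameters stored, parameters used per token) for one MoE layer."""
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts             # assumption: one linear routing layer
    total = num_experts * per_expert + router  # must all be held in memory
    active = top_k * per_expert + router       # actually used for a single token
    return total, active

# Hypothetical sizes: 8 experts, 2 of them active per token.
total, active = moe_param_counts(d_model=1024, d_hidden=4096, num_experts=8, top_k=2)
ratio = total / active
```

With these numbers the layer stores about 67 million parameters but touches only about 17 million per token, roughly a fourfold gap between stored capacity and per-token work.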
Key idea
Mixture of experts routes each token to a few specialized subnetworks, growing the parameter count while keeping per-token compute low.