The core trade
A dense model uses every weight on every token. A mixture-of-experts (MoE) layer holds many parallel expert sub-networks but routes each token to only a few of them, so total parameters grow while compute per token stays low.
Anatomy of an MoE layer
- A router scores the experts for each token.
- The top-k experts (often k = 1 or 2) are selected.
- Their outputs are combined, usually as a sum weighted by the router scores.
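The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production layer: the function name moe_forward, the softmax router, the renormalization of the selected scores, and the single-matrix tanh "experts" are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, expert_ws, k=2):
    """Hypothetical MoE layer.

    tokens:    (T, d) token vectors
    router_w:  (d, E) router weights
    expert_ws: (E, d, d) one weight matrix per toy expert
    """
    # Step 1: the router scores every expert for every token.
    scores = softmax(tokens @ router_w)              # (T, E)
    # Step 2: keep the k highest-scoring experts per token.
    topk = np.argsort(-scores, axis=-1)[:, :k]       # (T, k)
    # Step 3: combine the chosen experts' outputs, weighted by score.
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        chosen = topk[t]
        gate = scores[t, chosen]
        gate = gate / gate.sum()   # assumption: renormalize over the chosen experts
        for g, e in zip(gate, chosen):
            out[t] += g * np.tanh(tokens[t] @ expert_ws[e])
    return out, topk

rng = np.random.default_rng(0)
T, d, E = 4, 8, 6
tokens = rng.normal(size=(T, d))
router_w = rng.normal(size=(d, E))
expert_ws = rng.normal(size=(E, d, d))
out, topk = moe_forward(tokens, router_w, expert_ws, k=2)
```

Only k of the E expert matrices are touched for each token, which is where the compute saving comes from. Real implementations batch tokens per expert instead of looping token by token.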
Why it helps scaling
- Capacity, loosely the amount of knowledge the model can store, grows with the number of experts.
- Active compute scales only with k, the number of experts actually used per token.
- This decouples total parameter count from per-token compute, a powerful lever when scaling models up.
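A rough parameter count makes the decoupling concrete. The sketch below assumes each expert is a two-matrix feed-forward block with biases ignored; the helper names and the sizes (d_model, d_ff, 64 experts, k = 2) are hypothetical values chosen for illustration.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (up and down projection); biases ignored.
    return 2 * d_model * d_ff

def moe_param_counts(d_model, d_ff, num_experts, k):
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total = num_experts * per_expert + router   # parameters that must be stored
    active = k * per_expert + router            # parameters used for one token
    return total, active

total, active = moe_param_counts(d_model=1024, d_ff=4096, num_experts=64, k=2)
print(f"total={total:,} active={active:,} ratio={total / active:.1f}x")
```

Under these assumptions the layer stores about 537 million parameters but uses about 17 million per token. Doubling the number of experts roughly doubles the first figure and leaves the second almost unchanged.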
The costs
- Experts are typically spread across many devices, so routing tokens to them adds communication.
- Memory must hold all experts even though most are idle for any given token.
- Training can become unstable if routing is unbalanced and some experts are starved of tokens.
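Expert starvation can be made visible by measuring how tokens are distributed. The sketch below computes the fraction of routed tokens per expert and a balance score in the style of the auxiliary load-balancing loss used in some MoE models (number of experts times the sum of load fraction times mean router probability). The source does not prescribe this remedy; the function name load_stats and the toy probabilities are assumptions for illustration.

```python
import numpy as np

def load_stats(router_probs, k=1):
    """router_probs: (T, E) array whose rows sum to 1."""
    T, E = router_probs.shape
    topk = np.argsort(-router_probs, axis=-1)[:, :k]
    counts = np.bincount(topk.ravel(), minlength=E)
    f = counts / (T * k)             # share of routed slots each expert receives
    p = router_probs.mean(axis=0)    # mean router probability per expert
    balance = E * float(np.sum(f * p))   # 1.0 when perfectly uniform, larger when skewed
    return f, p, balance

E, T = 4, 8
# Balanced case: token t prefers expert t % E.
balanced = np.full((T, E), 0.1)
balanced[np.arange(T), np.arange(T) % E] = 0.7
# Collapsed case: every token prefers expert 0, starving the rest.
collapsed = np.full((T, E), 0.1)
collapsed[:, 0] = 0.7

f_bal, _, score_bal = load_stats(balanced)
f_col, _, score_col = load_stats(collapsed)
```

In the collapsed case three of the four experts receive no tokens and the score rises well above 1. Adding a small multiple of such a score to the training loss is one way to push the router toward more even usage.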
Where it shines
MoE is most attractive when you want very large knowledge capacity but must keep latency and per-token FLOPs bounded, as in large production language models.
Key idea
A mixture of experts routes each token to a few of many expert networks, growing total parameters and capacity while keeping per-token compute close to that of a much smaller dense model.