The Mixture Of Experts Deep

Growing total parameters while keeping per-token compute roughly fixed by activating only a few experts per token.

The core trade

A dense model uses every weight on every token. A mixture-of-experts (MoE) layer holds many parallel expert subnetworks but routes each token to only a few of them, so total parameters grow with the number of experts while compute per token depends only on how many experts are active.

Anatomy of an MoE layer

A router scores the experts for each token.
The top-k experts are selected, where k is often one or two.
Their outputs are combined, usually weighted by the router scores.
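
The three steps above can be sketched for a single token in plain Python. This is a minimal illustration, not a reference implementation. It assumes a linear router, a softmax over the router logits, and renormalisation of the selected scores so the mixing weights sum to one. The names moe_forward, router_weights, and the toy experts are hypothetical.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    # Step 1: the router scores every expert (assumed: a linear layer).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Step 2: keep the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Step 3: combine expert outputs, weighted by renormalised router scores.
    norm = sum(probs[i] for i in top)
    output = [0.0] * len(token)
    for i in top:
        gate = probs[i] / norm
        expert_out = experts[i](token)  # only the selected experts run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, top

# Toy experts (hypothetical): each scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

out, chosen = moe_forward([2.0, 1.0], router_weights, experts, k=2)
print(chosen, out)
```

Only the experts in chosen are evaluated, which is where the compute saving comes from. Real systems batch this over many tokens and differ in how they normalise the router scores.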

Why it helps scaling

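Capacity, loosely the amount of knowledge a model can store, grows with total parameter count and therefore with the number of experts.
Active compute scales with k, the number of experts each token actually uses, not with the total number of experts.
This largely decouples model size from per-token compute, a powerful lever for scaling laws.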
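
A rough parameter count makes the decoupling concrete. The sketch below assumes each expert is a two-matrix feed-forward block (up and down projection, biases ignored) and that the router is a single linear layer. The function name and the sizes are hypothetical, chosen only for illustration.

```python
def moe_ffn_params(d_model, d_ff, num_experts, k):
    """Return (total, active) feed-forward parameters for one MoE layer."""
    per_expert = 2 * d_model * d_ff   # up- and down-projection, biases ignored
    router = d_model * num_experts    # assumed: one linear router
    total = num_experts * per_expert + router   # what must be stored
    active = k * per_expert + router            # what one token touches
    return total, active

# Hypothetical sizes: 8 experts vs 16 experts, both with k = 2.
total_8, active_8 = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, k=2)
total_16, active_16 = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=16, k=2)

print(total_8, active_8)    # 67117056 16785408
print(total_16, active_16)  # 134234112 16793600
```

Doubling the number of experts roughly doubles the stored parameters, while the parameters touched per token change only by the slightly larger router.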

The costs

Experts are typically sharded across many devices, so sending each token to its chosen experts and gathering the results adds communication.
All experts must stay in memory even though most are idle for any given token.
Training can be unstable if some experts get starved of tokens.
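
One simple way to spot starved experts is to measure the share of tokens each expert receives in a batch. The sketch below is a monitoring aid under stated assumptions, not a fix for the instability. The helper names, the 0.5 threshold, and the sample assignments are all hypothetical.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Fraction of routed tokens that each expert received."""
    counts = Counter(assignments)
    n = len(assignments)
    return [counts[e] / n for e in range(num_experts)]

def starved_experts(load, threshold=0.5):
    """Experts receiving less than `threshold` times the uniform share."""
    uniform = 1.0 / len(load)
    return [e for e, frac in enumerate(load) if frac < threshold * uniform]

# Hypothetical top-1 assignments for 12 tokens across 4 experts.
assignments = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2]
load = expert_load(assignments, num_experts=4)
starved = starved_experts(load)
print(load, starved)
```

Here expert 0 takes most of the tokens while experts 2 and 3 fall below half of the uniform share, so they would see little or no gradient signal from this batch.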

Where it shines

MoE is most attractive when you want very large knowledge capacity but must keep latency and per-token FLOPs bounded, as in large production language models.

Key idea

A mixture of experts routes each token to a few of many expert networks, growing total parameters and capacity while keeping per-token compute close to that of a much smaller dense model.