Machine Learning

The Mixture Of Experts Deep

Growing total parameters while keeping per-token compute roughly fixed, using sparsely activated experts.

The core trade

A dense model uses every weight on every token. A mixture-of-experts (MoE) layer holds many parallel expert subnetworks but routes each token to only a few of them, so total parameters grow while compute per token stays low.

Anatomy of an MoE layer

  • A router scores the experts for each token.
  • The top-k experts, often with k equal to one or two, are selected.
  • Their outputs are combined, usually weighted by the router scores.
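
The three steps above can be sketched for a single token in plain Python. This is a minimal illustration, not a reference implementation: the names `moe_forward`, `router_weights`, and `make_linear_expert` are hypothetical, each expert is a single random linear map standing in for a real feed-forward network, and the gate weights come from a softmax over only the selected scores, which is one common choice among several.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def moe_forward(token, router_weights, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    # 1. The router scores every expert for this token.
    logits = matvec(router_weights, token)
    # 2. Keep the indices of the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # 3. Renormalise the selected scores so the mixing weights sum to 1
    #    (assumption: softmax over the selected logits only).
    gates = softmax([logits[i] for i in top])
    # 4. Run only the selected experts and combine their weighted outputs.
    out = [0.0] * len(token)
    for gate, i in zip(gates, top):
        expert_out = experts[i](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, top, gates

def make_linear_expert(dim, rng):
    # Hypothetical stand-in expert: one random square linear map.
    weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: matvec(weights, x)

rng = random.Random(0)
dim, num_experts = 4, 8
router_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_linear_expert(dim, rng) for _ in range(num_experts)]
token = [rng.gauss(0, 1) for _ in range(dim)]

output, chosen, gates = moe_forward(token, router_weights, experts, k=2)
print("experts used:", chosen, "gate weights:", gates)
```

Only two of the eight experts are evaluated for this token; the other six contribute parameters but no compute.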

Why it helps scaling

  • Capacity, loosely the amount of knowledge a model can store, grows with the number of experts.
  • Active compute per token scales only with k, the number of experts actually used.
  • This decouples total parameter count from per-token inference compute, a useful lever when scaling models up.
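
A rough parameter count makes the decoupling concrete. The sketch below assumes each expert is a two-matrix feed-forward block with biases ignored and that the router is a single linear map; the function name `moe_ffn_params` and the sizes used are illustrative choices, not values from any particular model.

```python
def moe_ffn_params(d_model, d_ff, num_experts, k):
    """Return (total, active) parameter counts for one MoE feed-forward layer."""
    # Assumption: each expert is two weight matrices, d_model x d_ff and back.
    per_expert = 2 * d_model * d_ff
    # Assumption: the router is one linear map from d_model to num_experts.
    router = d_model * num_experts
    total = num_experts * per_expert + router   # everything held in memory
    active = k * per_expert + router            # what one token actually touches
    return total, active

total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, k=2)
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"ratio:             {total / active:.2f}x")
```

With these sizes the layer stores about four times as many parameters as any single token uses, and raising `num_experts` widens that gap without changing the active count.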

The costs

  • Experts are often sharded across many devices, so routing tokens to them adds communication.
  • Memory must hold all experts even though most are idle for any given token.
  • Training can be unstable if the router starves some experts of tokens.
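
Expert starvation can be measured directly from routing decisions. The sketch below is an assumption-laden illustration: it uses top-1 routing, hand-written router probabilities, and a hypothetical `load_stats` helper. The penalty it computes, the number of experts times the sum over experts of dispatch fraction times mean router probability, is one widely used form of auxiliary load-balancing loss; the lesson itself does not say which remedy, if any, it has in mind.

```python
def load_stats(router_probs, assignments, num_experts):
    """router_probs: per-token probabilities over experts.
    assignments: per-token index of the chosen (top-1) expert."""
    n = len(assignments)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / n for i in range(num_experts)]
    # p[i]: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / n for i in range(num_experts)]
    # Penalty is 1.0 when both f and p are uniform and grows as load concentrates.
    aux = num_experts * sum(fi * pi for fi, pi in zip(f, p))
    return f, aux

# Eight tokens, four experts, evenly spread.
balanced_probs = [[0.25, 0.25, 0.25, 0.25] for _ in range(8)]
balanced_assign = [0, 1, 2, 3, 0, 1, 2, 3]

# Eight tokens that all go to expert 0, so experts 1 to 3 are starved.
skewed_probs = [[0.85, 0.05, 0.05, 0.05] for _ in range(8)]
skewed_assign = [0] * 8

balanced_f, balanced_aux = load_stats(balanced_probs, balanced_assign, 4)
skewed_f, skewed_aux = load_stats(skewed_probs, skewed_assign, 4)
print("balanced:", balanced_f, balanced_aux)
print("skewed:  ", skewed_f, skewed_aux)
```

Adding a small multiple of such a penalty to the training loss nudges the router toward spreading tokens more evenly, at the cost of one more hyperparameter to tune.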

Where it shines

MoE is most attractive when you want very large knowledge capacity but must keep latency and per-token FLOPs bounded, as in large production language models.

Key idea

A mixture of experts routes each token to a few of many expert networks, growing total parameters and capacity while keeping per-token compute close to that of a much smaller dense model.

Check yourself

1. What does a mixture of experts decouple?

2. In a top-k MoE layer, what does the router do?

3. Which is a real cost of MoE?