Sparse capacity
A mixture of experts (MoE) replaces a dense layer with many parallel expert subnetworks plus a small router. For each token, the router picks only a few experts to run. The model holds a huge number of parameters, yet each token uses only a fraction of them, so compute per token stays modest while capacity grows.
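The gap between stored and active parameters can be made concrete with a little arithmetic. The sketch below assumes each expert is a two-layer feed-forward block with no biases and that the router is a single linear map; the function names and layer sizes are illustrative, not taken from any particular model.

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Weights in one two-layer feed-forward expert (biases ignored)."""
    return 2 * d_model * d_hidden


def moe_param_counts(d_model: int, d_hidden: int, num_experts: int, top_k: int):
    """Return (total, active) parameter counts for one MoE layer.

    total  = every expert plus the router (what must be stored)
    active = the top_k chosen experts plus the router (what one token uses)
    """
    expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts  # assumed: one linear score per expert
    total = num_experts * expert + router
    active = top_k * expert + router
    return total, active


# Illustrative sizes only.
total, active = moe_param_counts(d_model=1024, d_hidden=4096, num_experts=8, top_k=2)
print(f"stored: {total:,}  used per token: {active:,}  ratio: {total / active:.2f}")
```

With these illustrative sizes the layer stores roughly four times as many parameters as any single token touches.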
The router
The router scores every expert for a token and selects the top few, often one or two. Only those experts process the token, and their outputs are combined in a weighted sum using the router scores. The unchosen experts do no work for that token, which is why the layer is called sparse.
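The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the weights come from a softmax taken over the selected scores only (some designs normalize over all experts before selecting), and the toy experts, which just scale their input, are hypothetical stand-ins for real subnetworks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route(scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = order[:top_k]
    # Assumption: renormalize over the chosen experts only.
    weights = softmax([scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(token, scores, experts, top_k=2):
    """Run only the chosen experts and sum their outputs by router weight."""
    out = [0.0] * len(token)
    for idx, w in route(scores, top_k):
        y = experts[idx](token)  # unchosen experts are never called
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts: expert i multiplies the token by a fixed scale.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, -1.0, 2.0]  # router scores for one token
result = moe_forward([1.0, 2.0], scores, experts, top_k=2)
print(route(scores, 2), result)
```

Here experts 1 and 3 tie for the highest score, so each receives weight 0.5 and the other two experts are skipped entirely.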
Balancing the load
Left alone, the router may favor a few popular experts and starve the rest. Training therefore adds a load-balancing term to the loss that encourages tokens to spread across experts, so that all of them learn and the hardware stays evenly used.
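One widely used form of such a term can be sketched as follows. The specific formula is an assumption here, since the text does not name a particular loss: for each expert, multiply the fraction of tokens routed to it by the mean router probability it receives, sum over experts, and scale by the number of experts. The sketch assumes top-1 routing and plain Python lists; the function name is illustrative.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i(frac_i * mean_prob_i).

    router_probs: per-token list of probabilities over experts
    assignments:  per-token index of the expert actually chosen (top-1)
    """
    num_tokens = len(assignments)
    counts = [0] * num_experts
    for e in assignments:
        counts[e] += 1
    frac = [c / num_tokens for c in counts]  # share of tokens per expert
    mean_prob = [
        sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Evenly spread routing versus routing that collapses onto expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(balanced, collapsed)
```

Under this formulation, perfectly uniform routing gives a value of 1, and the value rises as both the assignments and the router probabilities concentrate on the same experts, so minimizing it pushes the router toward an even spread.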
Trade-offs
- Far more parameters at similar compute per token.
- All experts must still fit in memory, so memory cost is high.
- Routing adds complexity and the risk of uneven load.
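The uneven-load risk in the last point can be measured directly from the routing decisions. The sketch below is a simple diagnostic under assumed conditions (top-1 routing, assignments given as a list of expert indices); the imbalance ratio it reports, busiest expert over the ideal even share, is an illustrative metric rather than a standard one.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Number of tokens sent to each expert, including experts that got none."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance(assignments, num_experts):
    """Busiest expert's load divided by the ideal even share (1.0 = balanced)."""
    load = expert_load(assignments, num_experts)
    even_share = len(assignments) / num_experts
    return max(load) / even_share


skewed = [0, 0, 0, 1, 0, 2, 0, 0]  # most tokens pile onto expert 0
even = [0, 1, 2, 3, 0, 1, 2, 3]
print(expert_load(skewed, 4), imbalance(skewed, 4), imbalance(even, 4))
```

When experts run in parallel on separate devices, the busiest expert tends to set the pace, so a ratio well above 1 means the other devices spend part of each step waiting.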
Key idea
A mixture of experts routes each token to a few expert subnetworks, growing the parameter count while keeping per-token compute roughly constant, and relies on a load-balancing loss to keep the experts evenly used.