The Mixture of Experts

Routing each token to a few expert subnetworks so capacity grows without proportional cost.

Sparse capacity

A mixture of experts replaces a dense layer with many parallel expert subnetworks plus a small router. For each token the router picks only a few experts to run. The model holds a huge number of parameters, yet each token uses only a fraction, so compute stays modest while capacity grows.
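The gap between parameters held and parameters used can be made concrete with a little arithmetic. The sketch below is illustrative only: the layer sizes, the expert count, and the assumption that each expert is a two-matrix feed-forward block are hypothetical choices, not figures from any particular model.

```python
# Hypothetical sizes, chosen only for illustration.
d_model = 1024      # width of a token vector
d_hidden = 4096     # hidden width inside each expert
num_experts = 8     # experts in the layer
top_k = 2           # experts actually run per token

# Assume each expert is a feed-forward block with two weight matrices
# (biases ignored), and the router is a single d_model x num_experts matrix.
params_per_expert = 2 * d_model * d_hidden
router_params = d_model * num_experts

# Parameters the layer holds versus parameters one token touches.
total_params = num_experts * params_per_expert + router_params
active_params = top_k * params_per_expert + router_params
active_fraction = active_params / total_params

print(f"total parameters:  {total_params:,}")
print(f"active per token:  {active_params:,}")
print(f"active fraction:   {active_fraction:.3f}")
```

With these numbers each token touches roughly a quarter of the layer's parameters. Adding experts raises the total while the active count stays fixed.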

The router

The router scores every expert for a token and selects the top few, often one or two. Only those experts process the token, and their outputs are combined in a sum weighted by the router scores. The unchosen experts do no work for that token, which is why the layer is called sparse.
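The routing step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the router is assumed to be a linear scorer followed by a softmax, the gates of the chosen experts are renormalised to sum to one (one common choice among several), and the names route, moe_layer, and the toy experts are all hypothetical.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # One score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalise so the selected gates sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_weights, experts, k=2):
    """Run only the selected experts and sum their outputs, weighted by the gates."""
    out = [0.0] * len(token)
    for idx, gate in route(token, router_weights, k):
        expert_out = experts[idx](token)   # unchosen experts are never called
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

# Toy experts: each one scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

chosen = route(token, router_weights, k=2)
output = moe_layer(token, router_weights, experts, k=2)
print(chosen)
print(output)
```

For this token the router scores are 2, 1, -2, and -1, so experts 0 and 1 are selected and experts 2 and 3 are skipped entirely.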

Balancing the load

Left alone, the router may favor a few popular experts and starve the rest. Training adds a load-balancing term to the loss that encourages tokens to spread across experts, so all of them learn and hardware stays evenly used.
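The text does not specify the form of the balancing term. One widely used formulation, assumed here purely for illustration, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert, sums over experts, and scales by the number of experts. The function name load_balance_loss and the toy batches are hypothetical, and top-1 routing is assumed for simplicity.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary balancing term for top-1 routing (one common formulation).

    router_probs: per-token list of router probabilities over the experts.
    assignments:  per-token index of the expert the token was sent to.
    """
    n_tokens = len(assignments)
    # Fraction of tokens dispatched to each expert.
    frac = [0.0] * num_experts
    for a in assignments:
        frac[a] += 1.0 / n_tokens
    # Mean router probability given to each expert across the batch.
    mean_prob = [sum(p[i] for p in router_probs) / n_tokens
                 for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Evenly spread batch: two experts, each receives half the tokens.
balanced = load_balance_loss(
    router_probs=[[0.5, 0.5]] * 4,
    assignments=[0, 1, 0, 1],
    num_experts=2,
)

# Collapsed batch: every token goes to expert 0.
collapsed = load_balance_loss(
    router_probs=[[0.9, 0.1]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=2,
)

print(balanced, collapsed)
```

The evenly spread batch scores lower than the collapsed one, so adding this term to the training loss, typically scaled by a small coefficient, pushes the router toward an even spread.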

Trade-offs

Far more parameters at similar compute per token.
All experts must still fit in memory, so memory cost is high.
Routing adds complexity and the risk of uneven load.

Key idea

A mixture of experts routes each token to a few expert subnetworks, growing parameter count without a proportional growth in per-token compute, while a load-balancing loss keeps the experts evenly used.
