The collapse problem
Left alone, a router may send most tokens to a handful of favored experts. Those experts receive more gradient updates, improve, and attract still more tokens, while the rest starve. The model wastes much of its expert capacity, and training can become unstable.
The load balancing loss
The standard fix adds an auxiliary loss term that penalizes uneven token assignment.
- For each expert, it multiplies the fraction of tokens routed to that expert by the mean router probability assigned to it, then sums these products over all experts.
- Minimizing it pushes the router toward a uniform assignment.
- A small weight keeps it from overriding the main task loss.
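The bullets above can be sketched in a few lines of NumPy. This is a minimal illustration assuming top-1 routing and a scaling by the number of experts so that a perfectly balanced router scores 1.0; the function names and the weight value `aux_weight = 0.01` are assumptions made for this example, not values given in these notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def load_balancing_loss(router_logits):
    """Auxiliary loss for router_logits of shape (num_tokens, num_experts)."""
    num_tokens, num_experts = router_logits.shape
    probs = softmax(router_logits)
    top1 = probs.argmax(axis=-1)  # assumed top-1 routing
    # f[i]: fraction of tokens actually routed to expert i.
    f = np.bincount(top1, minlength=num_experts) / num_tokens
    # p[i]: mean router probability placed on expert i.
    p = probs.mean(axis=0)
    # Scaled so that a uniform assignment gives exactly 1.0.
    return num_experts * np.sum(f * p)

# Balanced router: tokens cycle through the 4 experts.
balanced = 10.0 * np.eye(4)[np.arange(8) % 4]
# Collapsed router: every token prefers expert 0.
collapsed = 10.0 * np.eye(4)[np.zeros(8, dtype=int)]

aux_weight = 0.01  # hypothetical small weight on the auxiliary term
balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
```

In training, the term would be added as `task_loss + aux_weight * load_balancing_loss(logits)`. In a differentiable framework, the gradient flows through `p` only, since `f` comes from a hard argmax.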
Expert capacity and dropping
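- Each expert has a capacity: the maximum number of tokens it will process per batch.
- Tokens beyond capacity are dropped, which typically means they skip the expert and pass through unchanged via the residual connection.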
- A capacity factor above one leaves slack so fewer tokens are dropped.
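As a rough sketch of how capacity and dropping interact, the code below computes a per-expert capacity and marks which tokens fit. It assumes top-1 routing, a first-come-first-served order within the batch, and the common convention capacity = ceil(capacity_factor * tokens / experts); the helper names and the example values are hypothetical.

```python
import math
import numpy as np

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    # Even share of tokens per expert, scaled by the capacity factor.
    return math.ceil(capacity_factor * num_tokens / num_experts)

def keep_mask(expert_ids, num_experts, capacity):
    """True where a token fits within its expert's capacity."""
    one_hot = np.eye(num_experts, dtype=np.int64)[expert_ids]  # (T, E)
    # 1-based position of each token in its chosen expert's queue.
    position = (np.cumsum(one_hot, axis=0) * one_hot).sum(axis=1)
    return position <= capacity

# Eight tokens, four experts; expert 0 is oversubscribed.
expert_ids = np.array([0, 0, 0, 1, 0, 2, 0, 3])

tight = expert_capacity(8, 4, capacity_factor=1.0)   # no slack
slack = expert_capacity(8, 4, capacity_factor=1.25)  # some slack

dropped_tight = int((~keep_mask(expert_ids, 4, tight)).sum())
dropped_slack = int((~keep_mask(expert_ids, 4, slack)).sum())
```

With these example inputs, raising the capacity factor from 1.0 to 1.25 lets one more token through to expert 0. The cost is extra compute and memory reserved for every expert, including the underused ones.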
Other stabilizers
- Noise on router logits during training encourages exploration.
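- A router z-loss penalizes the squared log-sum-exp of the router logits, which keeps them from growing too large.
- Some schemes, such as expert-choice routing, let each expert select its own top tokens, which guarantees balance by construction.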
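The three stabilizers above can each be sketched briefly. The code below is a simplified illustration, not a reference implementation: Gaussian noise is one possible noise choice, the z-loss is taken as the mean squared log-sum-exp of the logits, and the expert-side routing simply gives each expert its top-k tokens by raw score. All function and parameter names are assumptions for this example.

```python
import numpy as np

def noisy_logits(logits, rng, noise_std=1.0, training=True):
    # Add noise only during training; inference stays deterministic.
    if not training:
        return logits
    return logits + noise_std * rng.standard_normal(logits.shape)

def router_z_loss(logits):
    # Numerically stable log-sum-exp per token, then mean of squares.
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return np.mean(lse ** 2)

def expert_choice_assign(scores, tokens_per_expert):
    """Each expert picks its top-k tokens; scores has shape (T, E).

    Returns token indices of shape (E, tokens_per_expert).
    """
    order = np.argsort(-scores, axis=0)  # per expert, tokens by descending score
    return order[:tokens_per_expert].T

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 4))

explored = noisy_logits(logits, rng, noise_std=0.5)
small_z = router_z_loss(np.zeros((8, 4)))
large_z = router_z_loss(np.full((8, 4), 10.0))
assignment = expert_choice_assign(logits, tokens_per_expert=2)
```

In the expert-side scheme, every expert receives exactly `tokens_per_expert` tokens, so the load is balanced by construction. The trade-off is that a given token may be picked by several experts or by none.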
Key idea
Routers tend to collapse onto a few experts, so an auxiliary load balancing loss plus capacity limits and logit regularizers keep token assignment roughly uniform and stable.