Machine Learning

Expert Routing Balancing

Keeping a mixture-of-experts (MoE) model from collapsing onto a few overused experts.

The collapse problem

Left alone, a router may send most tokens to a handful of favorite experts. Those experts get more training signal, improve, and attract still more tokens, while the rest starve. The model wastes the capacity of the idle experts, and training can become unstable.

The load balancing loss

The standard fix adds an auxiliary loss term that penalizes uneven token distribution across experts.

  • For each expert, it multiplies the fraction of tokens sent to that expert by the mean router probability assigned to it, then sums these products over all experts.
  • Minimizing it pushes the router toward a uniform assignment.
  • A small weight keeps it from overriding the main task loss.
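
The bullets above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes top-1 routing and scales the sum by the number of experts so that a perfectly balanced router scores 1.0. The function names, the toy logits, and the `aux_weight` value are all hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits):
    """router_logits: one list of per-expert logits for each token."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability placed on expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    # Sum of per-expert products, scaled by the expert count (assumed scaling).
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Toy inputs: 4 tokens, 2 experts.
balanced = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
collapsed = [[5.0, 0.0]] * 4  # every token prefers expert 0

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)

# Hypothetical small weight so the auxiliary term does not dominate the task loss.
aux_weight = 0.01
weighted_aux = aux_weight * collapsed_loss
```

With this scaling, the balanced toy batch scores about 1.0 and the collapsed one scores close to 2.0, so minimizing the term favors the balanced router.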

Expert capacity and dropping

  • Each expert has a capacity: a limit on how many tokens it processes per batch.
  • Tokens routed to an expert that is already full are dropped or passed through unchanged.
  • A capacity factor above one leaves slack so fewer tokens are dropped.
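
A small sketch of how a capacity limit might be applied. It assumes capacity is computed as the capacity factor times the average number of tokens per expert, rounded up, and that tokens are admitted in arrival order. Both are illustrative choices, and `expert_capacity` and `dispatch` are hypothetical names.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    # Assumed formula: slack factor times the even share of tokens per expert.
    return math.ceil(capacity_factor * num_tokens / num_experts)

def dispatch(assignments, num_experts, capacity_factor):
    """assignments[t] is the expert chosen for token t (top-1 routing)."""
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            # Over capacity: this token skips the expert.
            dropped.append(token_idx)
    return capacity, kept, dropped

# 8 tokens, 2 experts: expert 0 is chosen 5 times, expert 1 is chosen 3 times.
assignments = [0, 0, 0, 1, 0, 1, 0, 1]

cap_tight, kept_tight, dropped_tight = dispatch(assignments, 2, 1.0)
cap_slack, kept_slack, dropped_slack = dispatch(assignments, 2, 1.25)
```

In this toy batch a capacity factor of 1.0 gives each expert 4 slots, so the fifth token sent to expert 0 is dropped. Raising the factor to 1.25 gives 5 slots and nothing is dropped.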

Other stabilizers

  • Noise on router logits during training encourages exploration.
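  • A router z-loss penalizes large router logits, which keeps them from growing too large.
  • Some schemes route from the expert side, letting each expert pick a fixed number of tokens, which guarantees balance by construction.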
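
Two of these stabilizers can be sketched briefly. The code below assumes the z-loss is the mean squared log-sum-exp of each token's router logits, and that expert-side routing lets each expert take its highest-scoring tokens. Ranking on raw logits is a simplification, and the function names and toy values are hypothetical.

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def router_z_loss(router_logits):
    # Mean over tokens of the squared log-sum-exp of the router logits.
    return sum(logsumexp(row) ** 2 for row in router_logits) / len(router_logits)

def expert_choice(router_logits, tokens_per_expert):
    # Each expert picks its top-scoring tokens, so every expert
    # receives exactly tokens_per_expert tokens.
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    chosen = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens),
                        key=lambda t: router_logits[t][e],
                        reverse=True)
        chosen.append(sorted(ranked[:tokens_per_expert]))
    return chosen

# Larger logits produce a larger z-loss penalty.
z_small = router_z_loss([[1.0, 0.0]])
z_large = router_z_loss([[10.0, 0.0]])

# Every token prefers expert 0, so token-side top-1 routing would collapse.
logits = [[2.0, 0.0], [1.5, 0.1], [1.0, 0.2], [0.5, 0.3]]
picks = expert_choice(logits, tokens_per_expert=2)
```

Here expert 0 takes tokens 0 and 1 while expert 1 takes tokens 2 and 3, even though every token scores expert 0 highest. The trade-off in this sketch is that a token may be picked by several experts or by none.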

Key idea

Routers tend to collapse onto a few experts, so an auxiliary load balancing loss plus capacity limits and logit regularizers keep token assignment roughly uniform and stable.

Check yourself

1. What problem does load balancing in MoE prevent?

2. What does an expert capacity factor above one provide?

3. Why is the load balancing loss given a small weight?