Balancing Expert Routing

The collapse problem

Left alone, a router may send most tokens to a handful of favored experts. Those experts receive more gradient signal, improve, and attract still more tokens, while the rest starve. The model wastes capacity and training can become unstable.

The load balancing loss

The standard fix adds an auxiliary term that rewards spreading tokens evenly.

For each expert, it multiplies the fraction of tokens dispatched to that expert by the mean router probability assigned to it, then sums these products over all experts.
Minimizing it pushes the router toward a uniform assignment.
A small weight keeps it from overriding the main task loss.
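
The loss described above can be sketched in a few lines. This is a minimal illustration, assuming top-1 routing, a softmax router, and a scaling by the number of experts so that a perfectly balanced batch scores exactly the weight; the function names and the default weight of 0.01 are illustrative assumptions, not taken from any particular library.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, weight=0.01):
    """router_logits: one list of per-expert logits for each token.

    Hypothetical helper: assumes top-1 routing and a softmax router.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability placed on expert i across tokens.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    # Sum of per-expert products, scaled by the expert count and the weight.
    return weight * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Four tokens, four experts: each token prefers a different expert.
balanced = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Four tokens that all prefer expert 0.
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(load_balancing_loss(balanced))
print(load_balancing_loss(collapsed))
```

The collapsed batch scores close to four times the balanced one, which is the gradient signal that pushes the router back toward an even spread. In a real model only the probability term is differentiable, since the token fraction comes from a hard argmax.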

Expert capacity and dropping

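Each expert has a capacity, a cap on how many tokens it processes per batch, typically set to the capacity factor times the number of tokens divided by the number of experts.
Tokens beyond capacity are dropped: they skip the expert and typically pass through unchanged via the residual connection.
A capacity factor above one leaves slack so fewer tokens are dropped, at the cost of extra compute and memory.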
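
A small sketch of capacity-limited dispatch may make the trade-off concrete. It assumes top-1 assignments that are already computed, a first-come-first-served policy within the batch, and rounding the capacity up; real implementations differ on all three, and the function names here are hypothetical.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    # Assumed convention: round up so a factor of 1.0 never under-allocates.
    return math.ceil(capacity_factor * num_tokens / num_experts)

def dispatch_with_capacity(assignments, num_experts, capacity_factor=1.25):
    """assignments[t] is the expert index chosen for token t (top-1 routing)."""
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token_idx, expert))
        else:
            # Overflow token: it skips the expert and passes through unchanged.
            dropped.append(token_idx)
    return kept, dropped, capacity

# Eight tokens, four experts, with five tokens crowding onto expert 0.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
_, dropped_tight, cap_tight = dispatch_with_capacity(assignments, 4, capacity_factor=1.0)
_, dropped_slack, cap_slack = dispatch_with_capacity(assignments, 4, capacity_factor=1.25)
print(cap_tight, dropped_tight)
print(cap_slack, dropped_slack)
```

Raising the capacity factor from 1.0 to 1.25 lifts the per-expert cap from 2 to 3 here, so one fewer token is dropped.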

Other stabilizers

Noise on router logits during training encourages exploration.
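The router z-loss penalizes the squared log-sum-exp of the router logits, which keeps them from growing too large.
Some schemes, such as expert-choice routing, let each expert select its top-scoring tokens rather than each token select experts, which guarantees balance by construction.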
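
Two of these stabilizers are easy to sketch. The first function below computes a z-loss as the mean squared log-sum-exp of each token's router logits; the second routes from the expert side by letting every expert take its highest-scoring tokens. Both are illustrative sketches under assumed conventions (the function names, the default weight, and the fixed per-expert token budget are hypothetical), not a reference implementation.

```python
import math

def router_z_loss(router_logits, weight=0.001):
    """Mean over tokens of (log-sum-exp of that token's logits) squared."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        # Stable log-sum-exp: factor out the max before exponentiating.
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return weight * total / len(router_logits)

def expert_choice_route(scores, tokens_per_expert):
    """scores[t][e] is the router score of token t for expert e.

    Each expert picks its own top tokens, so every expert receives
    exactly tokens_per_expert tokens regardless of token preferences.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    chosen = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        chosen.append(ranked[:tokens_per_expert])
    return chosen

# Every token prefers expert 0, yet each expert still ends up with two tokens.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(expert_choice_route(scores, tokens_per_expert=2))
```

Routing from the expert side fixes each expert's load in advance, but a token may be picked by several experts or by none, which is the price paid for guaranteed balance.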

Key idea

Routers tend to collapse onto a few experts, so an auxiliary load balancing loss plus capacity limits and logit regularizers keep token assignment roughly uniform and stable.
