Dense versus sparse
In a dense network every parameter participates in every forward pass. In a sparsely activated network only a selected subset runs for a given input, so a model can be enormous in total yet cheap per example.
Forms of sparsity
- Conditional computation selects which blocks to run, as in mixture of experts.
- Activation sparsity means many neuron outputs are zero, so their downstream work can be skipped.
- Structured sparsity zeroes weights in regular patterns that hardware can exploit.
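To make conditional computation concrete, here is a minimal sketch of top-k routing in the style of a mixture of experts. Every name in it (`top_k_route`, `moe_forward`, the toy experts, the router scores) is hypothetical and chosen for illustration. Real systems learn the router scores and route batches of tokens rather than single numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize over the chosen experts only, so the weights sum to 1.
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_scores, k=2):
    """Run only the k selected experts and mix their outputs."""
    out = 0.0
    calls = 0
    for idx, w in top_k_route(router_scores, k):
        out += w * experts[idx](x)  # unselected experts are never called
        calls += 1
    return out, calls

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, -1.0, 2.0]  # assumed; normally produced by a learned router

y, calls = moe_forward(3.0, experts, router_scores, k=2)
print(calls, "of", len(experts), "experts ran")
```

With four experts and `k=2`, half of the expert computation is skipped for this input, although all four experts must still be held in memory.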
Why it matters for scaling
- The active parameter count, not the total, drives FLOPs and latency.
- A sparse model can approach the capacity of a dense model with a similar total parameter count while spending only a fraction of the compute per example.
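- Sparsity shifts the bottleneck from raw compute toward memory and communication, because every parameter must still be stored and inputs must be moved to whichever blocks are selected.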
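The gap between total and active parameters can be worked through with simple arithmetic. The sketch below assumes a hypothetical mixture-of-experts layout with some always-on shared parameters plus equally sized experts; the sizes are made up for illustration and do not describe any particular model. The FLOPs figure uses the common rule of thumb of roughly 2 FLOPs per active parameter per token in the forward pass, which is only an approximation.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active) parameter counts for a hypothetical MoE layout."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Assumed, illustrative sizes.
total, active = moe_param_counts(
    shared_params=1_000_000_000,
    params_per_expert=500_000_000,
    num_experts=16,
    top_k=2,
)

# Rough forward-pass cost: about 2 FLOPs per active parameter per token.
approx_flops_per_token = 2 * active

print(f"total parameters:  {total:,}")
print(f"active parameters: {active:,}")
print(f"active fraction:   {active / total:.1%}")
```

Under these assumptions the model stores 9 billion parameters but touches only 2 billion per token, so per-token compute tracks the smaller number while memory footprint tracks the larger one.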
The practical catch
Sparsity helps only if hardware can skip the unused work. Irregular sparsity is hard to accelerate, so real systems favor structured patterns and block-level routing that map cleanly onto GPUs.
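One way to see why block structure helps is a block-sparse matrix-vector product that stores only the nonzero blocks and never visits the rest. This is a plain-Python sketch under assumed conventions (the dictionary layout and the name `block_sparse_matvec` are hypothetical), not how a GPU kernel is written, but the skipping logic is the same idea: work is dropped one whole dense block at a time instead of one scattered element at a time.

```python
def block_sparse_matvec(blocks, block_size, n_block_rows, x):
    """Multiply a block-sparse matrix by a vector.

    blocks maps (block_row, block_col) to a dense block given as a list of rows.
    Blocks absent from the dict are treated as all zero and are skipped entirely.
    """
    y = [0.0] * (n_block_rows * block_size)
    for (br, bc), block in blocks.items():
        x_slice = x[bc * block_size:(bc + 1) * block_size]
        for i, row in enumerate(block):
            y[br * block_size + i] += sum(w * v for w, v in zip(row, x_slice))
    return y

# A 4x4 matrix split into 2x2 blocks; only the two diagonal blocks are stored.
blocks = {
    (0, 0): [[1.0, 0.0], [0.0, 1.0]],
    (1, 1): [[2.0, 0.0], [0.0, 2.0]],
}
x = [1.0, 2.0, 3.0, 4.0]
y = block_sparse_matvec(blocks, block_size=2, n_block_rows=2, x=x)
print(y)  # [1.0, 2.0, 6.0, 8.0]
```

Each stored block is a small dense multiply, which is the regular, contiguous kind of work that accelerators handle well; an element-wise sparse layout would need per-element index lookups instead.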
Key idea
Sparse activation runs only a selected subset of a large network per input, so the active parameters, not the total, set the cost; the gains are real only when the sparsity is structured enough for hardware to skip the unused work.