Sparse Activation

Dense versus sparse

In a dense network every parameter participates in every forward pass. In a sparsely activated network only a selected subset runs for a given input, so a model can be enormous in total yet cheap per example.

Forms of sparsity

Conditional computation selects which blocks to run, as in mixture of experts.
Activation sparsity means many neuron outputs are zero, so their downstream work can be skipped.
Structured sparsity zeroes weights in regular patterns that hardware can exploit.
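
To make conditional computation concrete, here is a minimal sketch of top-k routing over a few experts for a single input vector. It is an illustration under stated assumptions, not a reference implementation: the names `matvec`, `moe_forward`, `gate_w`, and `experts` are hypothetical, each expert is assumed to be a single weight matrix, and the router is assumed to be a plain linear layer followed by a softmax over the selected experts.

```python
import math

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def moe_forward(x, gate_w, experts, top_k=1):
    """Run only the top_k highest-scoring experts on x and mix their outputs.

    gate_w:  one row of router weights per expert (assumed linear router)
    experts: one weight matrix per expert (assumed single-layer experts)
    """
    scores = matvec(gate_w, x)  # one router score per expert
    # Indices of the top_k experts, best first.
    chosen = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:top_k]

    # Softmax over the selected experts only, shifted for numerical stability.
    peak = max(scores[e] for e in chosen)
    exps = [math.exp(scores[e] - peak) for e in chosen]
    total = sum(exps)

    out = [0.0] * len(experts[0])
    for e, w in zip(chosen, exps):
        y = matvec(experts[e], x)  # unselected experts are never evaluated
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out, chosen

# Three toy experts that scale the input by 1, 2, and 3.
gate_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [[[1, 0], [0, 1]], [[2, 0], [0, 2]], [[3, 0], [0, 3]]]
out, chosen = moe_forward([0.5, 2.0], gate_w, experts, top_k=1)
print(chosen, out)
```

With `top_k=1` only one of the three expert matrices is multiplied, which is the sense in which the model is large in total but cheap per example.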

Why it matters for scaling

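The active parameter count, not the total, drives per-input FLOPs and, to a large extent, latency.
A sparse model can approach the capacity of a dense model with a similar total parameter count while spending only a fraction of the compute per input.
Sparsity shifts the bottleneck from raw compute toward memory and communication, because every parameter must still be stored and inputs must be moved to the blocks that process them.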
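
The gap between total and active parameters can be worked out with simple arithmetic. The sketch below assumes one mixture-of-experts feed-forward layer in which each expert is a two-matrix MLP, biases are ignored, and the router is a single linear layer. The function name `moe_ffn_params` and the layer sizes are illustrative choices, not values from any particular model.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active) weight counts for one MoE feed-forward layer.

    Assumes each expert holds a d_model x d_ff and a d_ff x d_model matrix,
    with biases ignored, plus a d_model x n_experts linear router.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    total = n_experts * per_expert + router   # must all be stored in memory
    active = top_k * per_expert + router      # actually used for one input
    return total, active

total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
print(f"total={total:,} active={active:,} ratio={active / total:.2f}")
```

With these assumed sizes, roughly a quarter of the layer's weights take part in any one forward pass, while all of them still have to be held in memory. That is why the bottleneck moves toward memory and communication.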

The practical catch

Sparsity helps only if hardware can skip the unused work. Irregular sparsity is hard to accelerate, so real systems favor structured patterns and block-level routing that map cleanly onto GPUs.
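
One example of a regular pattern is 2:4 sparsity, where exactly two of every four consecutive weights are zero. Some recent GPUs can exploit this layout. The sketch below only shows how such a pattern could be produced by magnitude pruning. The function name `prune_2_of_4` is hypothetical, and the code illustrates the layout rather than any vendor's pruning tool.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude values in each group of four; zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

weights = [1.0, -5.0, 2.0, 0.5, 3.0, 3.5, -0.1, 0.2]
print(prune_2_of_4(weights))
```

Because every group of four has the same number of zeros, the position of the surviving weights can be stored compactly and the skipped work is predictable. Irregular sparsity offers neither property.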

Key idea

Sparse activation runs only a selected subset of a large network per input, so the active parameters set the cost; the gains are real only when the sparsity is structured enough for hardware to skip the unused work.
