Machine Learning

The Sparse Activation

Using only a fraction of a network per input to make huge models affordable.

Dense versus sparse

In a dense network every parameter participates in every forward pass. In a sparsely activated network only a selected subset runs for a given input, so a model can be enormous in total yet cheap per example.

Forms of sparsity

  • Conditional computation selects which blocks to run, as in mixture of experts.
  • Activation sparsity means many neuron outputs are zero, so their downstream work can be skipped.
  • Structured sparsity zeroes weights in regular patterns that hardware can exploit.
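
As a rough illustration of conditional computation, here is a minimal sketch of top-k routing over a handful of toy experts. Everything in it is an assumption made for the example: the names `top_k_route` and `sparse_forward`, the scalar "experts", and the hand-picked gate scores. A real mixture-of-experts layer would use learned gates and tensor-valued experts.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]

def sparse_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_scores, k)
    out = 0.0
    for idx, weight in routes:
        out += weight * experts[idx](x)  # unselected experts are never called
    return out, [idx for idx, _ in routes]

# Four hypothetical experts; only two of them run for this input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
y, used = sparse_forward(3.0, experts, gate_scores=[0.1, 2.0, 1.5, -1.0], k=2)
print(used)  # [1, 2]
```

The total model here has four experts, but the work done per input is that of two, which is the sense in which the model is large in total yet cheap per example.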

Why it matters for scaling

  • The active parameter count, not the total, drives FLOPs and latency.
  • A sparse model can approach the capacity of a dense model with a similar total parameter count while spending only a fraction of the compute per example.
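  • Sparsity shifts the bottleneck from raw compute toward memory and communication, since all parameters must still be held in memory and inputs routed to the blocks that process them.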
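
The gap between total and active parameters can be made concrete with a little arithmetic. The sketch below assumes a hypothetical mixture-of-experts model with 1B shared parameters and 8 experts of 0.5B each, 2 of which run per token. It also uses the common rule of thumb of about 2 FLOPs (one multiply, one add) per active weight per token in the forward pass, which is an approximation rather than an exact count.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a toy MoE configuration."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

def approx_forward_flops(active_params):
    # Rule-of-thumb assumption: one multiply and one add per active weight per token.
    return 2 * active_params

# Hypothetical sizes chosen only for illustration.
total, active = moe_param_counts(
    shared_params=1_000_000_000,
    params_per_expert=500_000_000,
    num_experts=8,
    experts_per_token=2,
)
print(total)                         # 5000000000
print(active)                        # 2000000000
print(approx_forward_flops(active))  # 4000000000
print(active / total)                # 0.4
```

Under these assumptions, per-token compute tracks the 2B active parameters, while memory still has to hold all 5B.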

The practical catch

Sparsity helps only if hardware can skip the unused work. Irregular sparsity is hard to accelerate, so real systems favor structured patterns and block-level routing that map cleanly onto GPUs.
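
To show why structure makes skipping easy, here is a small sketch of a block-sparse matrix-vector product in plain Python. The storage layout (a dict from block coordinates to dense blocks) and the name `block_sparse_matvec` are assumptions for the example, not any library's format. Real GPU kernels are far more involved, but the idea is the same: whole zero blocks are never stored or visited.

```python
def block_sparse_matvec(blocks, x, n_rows, block_size):
    """Multiply a block-sparse matrix by vector x.

    blocks maps (block_row, block_col) -> dense block given as a list of rows.
    Blocks that are absent are treated as all zeros and skipped entirely.
    Returns the result vector and the number of multiply-adds performed.
    """
    y = [0.0] * n_rows
    macs = 0
    for (br, bc), block in blocks.items():
        row0, col0 = br * block_size, bc * block_size
        for i, row in enumerate(block):
            acc = 0.0
            for j, w in enumerate(row):
                acc += w * x[col0 + j]
                macs += 1
            y[row0 + i] += acc
    return y, macs

# A 4x4 matrix split into 2x2 blocks; only the two diagonal blocks are nonzero.
blocks = {
    (0, 0): [[1.0, 2.0], [3.0, 4.0]],
    (1, 1): [[5.0, 6.0], [7.0, 8.0]],
}
x = [1.0, 1.0, 1.0, 1.0]
y, macs = block_sparse_matvec(blocks, x, n_rows=4, block_size=2)
print(y)     # [3.0, 7.0, 11.0, 15.0]
print(macs)  # 8, versus 16 for the dense 4x4 product
```

Because each stored block is a small dense tile, the inner loops stay regular. With the same number of zeros scattered irregularly, the skip decision would have to be made per element, which is much harder to run efficiently.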

Key idea

Sparse activation runs only a selected subset of a large network per input, so active parameters set the cost; the gains materialize only when the sparsity is structured enough for hardware to skip the unused work.

Check yourself

1. In a sparsely activated model, what mainly drives per-example FLOPs?

2. Why do real systems prefer structured sparsity?