The Linear Attention

The quadratic culprit

Softmax attention forms a score matrix of size length by length, which is the source of quadratic cost and memory. Linear attention avoids ever building that matrix by changing the math.

Replacing softmax with a kernel

Softmax couples queries and keys in a way that forces the full matrix. Linear attention replaces the softmax similarity with a feature map applied to queries and keys, so the similarity becomes a dot product of mapped features. Because dot products are associative, the order of operations can change.

Reordering the multiply

Instead of computing query times key first, which is length by length, you compute key times value first, producing a small fixed size summary, then multiply by the query. Cost becomes linear in length and memory no longer grows with the square.

Form a running summary of mapped keys times values.
Each query reads from that summary.

The recurrent view

This summary can be updated token by token, turning attention into a fast recurrence. That makes linear attention attractive for streaming and very long sequences.

The catch

The feature map only approximates softmax, so quality can lag full attention on tasks needing sharp, selective focus. Research keeps improving the maps.

Key idea

Linear attention swaps softmax for a feature map so similarities are dot products, then reorders the multiply to summarize keys and values once and read with each query, achieving linear cost and a recurrent form at the price of approximating sharp softmax focus.