Attention Sinks

A strange habit

Trained transformers often assign a large share of their attention to the first few tokens of a sequence, even when those tokens carry little meaning. These tokens are called attention sinks.

Why sinks form

Softmax forces attention weights to sum to one, so a head must put its weight somewhere even when nothing is truly relevant. The early tokens are always present and, under causal masking, visible to every later position, so they become a convenient dumping ground for leftover attention mass.
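The sum-to-one constraint is easy to see in a toy example. The sketch below uses made-up scores for a single query (the numbers are illustrative assumptions, not measurements from any model): every key is a poor match, yet the weights still have to add up to one, so the least-bad position absorbs most of the mass.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for one query. All are low (nothing is
# really relevant), but position 0 is slightly less low than the rest.
scores = [-2.0, -5.0, -5.0, -5.0, -5.0, -5.0]
weights = softmax(scores)

print(sum(weights))   # always 1.0 up to rounding, whatever the scores are
print(weights[0])     # position 0 takes the majority of the weight
```

With these particular numbers, position 0 ends up with roughly 80% of the attention even though its raw score is negative. The head has no way to say "attend to nothing".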

Why it matters for streaming

When serving very long or unbounded streams, a common trick is to evict old tokens from the KV cache to save memory. If you naively evict the first tokens, you remove the sinks. The softmax still has to sum to one, so the leftover weight lands on tokens that were never meant to absorb it, and quality collapses.

The fix

Always keep the first few sink tokens in the cache.
Slide a window over the recent tokens.

Keeping the sinks plus a recent window lets a model handle streams far longer than its training length with stable quality, while the cache stays at a fixed size. Some models are trained with a dedicated learned sink token, so that one known position plays this role.
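The sinks-plus-window policy can be sketched as a small eviction rule. This is a minimal illustration, not any library's implementation: the class name SinkWindowCache and the parameters num_sinks and window are hypothetical, and each cache entry is a stand-in for the per-token key/value tensors a real model would store.

```python
from collections import deque

class SinkWindowCache:
    """Toy KV cache that keeps the first num_sinks entries forever
    and a sliding window of the most recent entries."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops automatically

    def append(self, entry):
        # The first num_sinks entries become the permanent sinks;
        # everything after that goes through the sliding window.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        """Entries the model attends over: sinks first, then the window."""
        return self.sinks + list(self.recent)

# Stream ten token positions through a cache with 2 sinks and a window of 3.
cache = SinkWindowCache(num_sinks=2, window=3)
for position in range(10):
    cache.append(position)

print(cache.entries())  # [0, 1, 7, 8, 9]
```

Positions 2 through 6 have been evicted, but positions 0 and 1 are still there for the softmax to dump weight onto. A plain sliding window of the same total size would instead hold only the five most recent positions and lose the sinks as soon as the stream outgrew it.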

Key idea

Attention sinks are the early tokens that softmax dumps leftover weight onto. Retaining them alongside a recent window when trimming the KV cache keeps streaming generation stable far beyond the training length.
