Machine Learning

Attention Sinks

Why models dump attention on the first tokens and how to exploit it.

5 min read · advanced

A strange habit

Trained transformers often pour a large share of attention onto the very first few tokens, even when those tokens carry little meaning. These are called attention sinks.

Why sinks form

Softmax forces each head's attention weights to sum to one, so a head must put its weight somewhere even when no token is truly relevant. The earliest tokens are always present and, under causal masking, visible to every later position, so they become a convenient dumping ground for leftover attention mass.
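A toy sketch of the sum-to-one constraint, using made-up scores rather than values from any real model. The `softmax` helper and the numbers are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up case: no key is relevant, every score is equally low.
irrelevant = [-9.0, -9.0, -9.0, -9.0]
weights = softmax(irrelevant)
total = sum(weights)  # still 1 up to rounding: the head cannot abstain

# Give position 0 a modestly higher score and it absorbs most of the mass.
with_sink = softmax([-6.0, -9.0, -9.0, -9.0])
sink_share = with_sink[0]
```

However low the raw scores are, the weights still add up to one. A small edge for the first position is enough for it to soak up most of that mass.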

Why it matters for streaming

When serving very long or endless streams, a common trick is to drop old tokens from the KV cache to save memory. If you naively drop the first tokens, you remove the sinks, the softmax loses its dumping ground, and quality collapses.
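A toy illustration of what eviction does to one head's weights, again with invented scores. It shows only how the mass gets redistributed once the sink is gone. It does not reproduce the quality collapse itself, which depends on a trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: position 0 acts as the sink, the rest are weak matches.
scores = [3.0, 0.0, 0.0, 0.0, 0.0]

with_sink = softmax(scores)
without_sink = softmax(scores[1:])  # sink evicted from the cache

sink_share = with_sink[0]       # mass parked on the sink
weight_before = with_sink[1]    # weight on a weak match, sink present
weight_after = without_sink[0]  # same token's weight after eviction
```

With the sink gone, the weak matches are forced to share the full unit of attention, so each one gets several times the weight it had before. The head now mixes in values it used to mostly ignore.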

The fix

  • Always keep the first few sink tokens in the cache.
  • Keep a sliding window of the most recent tokens and evict everything in between.

Keeping the sinks plus a recent window lets a model handle streams far longer than its training length with stable quality. Some models add a dedicated learned sink token to make this clean.
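The two rules above can be sketched as a small eviction policy. This is a minimal sketch, not any library's API: `SinkWindowCache`, `num_sinks`, and `window` are hypothetical names, and plain integers stand in for per-token key/value entries.

```python
from collections import deque

class SinkWindowCache:
    """Toy cache: pins the first `num_sinks` entries, keeps a recent window."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # oldest entry drops automatically

    def append(self, entry):
        # The first few entries become the permanent sinks.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        # What attention would see: sinks first, then the recent window.
        return self.sinks + list(self.recent)

# Stream ten "tokens" through a cache with 2 sinks and a window of 3.
cache = SinkWindowCache(num_sinks=2, window=3)
for t in range(10):
    cache.append(t)

kept = cache.entries()
```

The cache size stays bounded at `num_sinks + window` no matter how long the stream runs. A real implementation would also have to decide how position encodings are assigned to the retained entries, which this sketch ignores.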

Key idea

Attention sinks are the early tokens that softmax dumps leftover weight onto. Retaining them alongside a recent window when trimming the KV cache keeps streaming generation stable far beyond the training length.

Check yourself


1. Why do attention sinks form on early tokens?

2. What breaks if you drop the sink tokens during streaming?