A strange habit
Trained transformers often pour a large share of attention onto the very first few tokens, even when those tokens carry little meaning. These are called attention sinks.
Why sinks form
Softmax forces attention weights to sum to one, so a head must put its weight somewhere even when nothing is truly relevant. The early tokens, always present and visible to everyone, become a convenient dumping ground for leftover attention mass.
Why it matters for streaming
When serving very long or endless streams, a common trick is to drop old tokens from the kv cache to save memory. If you naively drop the first tokens, you remove the sinks, the softmax loses its dumping ground, and quality collapses.
The fix
- Always keep the first few sink tokens in the cache.
- Slide a window over the recent tokens.
Keeping the sinks plus a recent window lets a model handle streams far longer than its training length with stable quality. Some models add a dedicated learned sink token to make this clean.
Key idea
Attention sinks are the early tokens that softmax dumps leftover weight onto, and retaining them alongside a recent window when trimming the kv cache keeps streaming generation stable far beyond the training length.