The fragmentation problem
A serving system must hold a KV cache for every active request. If it reserves one contiguous block per request, sized for the maximum sequence length, most of that block sits empty while the request is still short (internal fragmentation). Because requests also differ in length and finish at different times, the free space between blocks breaks into gaps too small to hold a new reservation (external fragmentation).
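A rough sense of the scale comes from a small calculation. The sketch below uses made-up numbers (a 2048-token reservation and four short requests) and a hypothetical helper, wasted_fraction, to show how much of the reserved space holds no tokens.

```python
def wasted_fraction(max_len, actual_lens):
    """Fraction of reserved KV slots left empty when every request
    reserves max_len slots up front (counted in tokens, not bytes)."""
    reserved = max_len * len(actual_lens)
    used = sum(actual_lens)
    return 1 - used / reserved

# Illustrative values only: four requests, each reserving 2048 token slots.
lens = [100, 350, 1200, 60]
print(f"{wasted_fraction(2048, lens):.1%}")  # prints 79.1%
```

The figure depends entirely on the assumed lengths, but the pattern holds whenever typical requests are much shorter than the maximum.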
Pages and a block table
Paged attention borrows the idea of paging from operating-system virtual memory. It splits the KV cache into fixed-size pages that can live anywhere in memory and keeps a block table that maps each request to its scattered pages:
- Pages are allocated only as a request grows.
- Pages need not be contiguous, so any free page can be used and no unusable gaps form between requests.
- The block table lets attention find the right pages.
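The three points above can be sketched as a small bookkeeping class. This is a minimal illustration, not the implementation of any particular serving system: the class name PagedKVCache, the page size of 16 tokens, and the method names are all assumptions, and the sketch tracks page ids only, not the key and value tensors themselves.

```python
PAGE_SIZE = 16  # tokens per page (assumed value)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # pool of physical page ids
        self.block_table = {}  # request id -> list of physical page ids
        self.lengths = {}      # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve a slot for one more token, allocating a page only when needed."""
        n = self.lengths.get(req_id, 0)
        pages = self.block_table.setdefault(req_id, [])
        if n % PAGE_SIZE == 0:  # first token, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free pages")
            pages.append(self.free_pages.pop())
        self.lengths[req_id] = n + 1
        return self.lookup(req_id, n)

    def lookup(self, req_id, token_idx):
        """Translate a logical token index into (physical page, offset)."""
        if not 0 <= token_idx < self.lengths[req_id]:
            raise IndexError(token_idx)
        page = self.block_table[req_id][token_idx // PAGE_SIZE]
        return page, token_idx % PAGE_SIZE

    def free(self, req_id):
        """Return all of a finished request's pages to the pool."""
        self.free_pages.extend(self.block_table.pop(req_id, []))
        self.lengths.pop(req_id, None)

# Two requests growing in lockstep end up with interleaved physical pages.
cache = PagedKVCache(num_pages=8)
for _ in range(20):
    cache.append_token("a")
    cache.append_token("b")
print(cache.block_table["a"])  # [7, 5]
print(cache.block_table["b"])  # [6, 4]
print(cache.lookup("a", 17))   # (5, 1)
```

Request "a" holds pages 7 and 5, which are not adjacent, yet lookup still resolves token 17 to the second slot of page 5. Each request has taken only two pages for its 20 tokens, and the other four pages stay free for new requests.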
Benefits
- Waste is limited to the unfilled part of each request's last page, so many more requests fit at once.
- Pages can be shared between requests with a common prefix, such as the same system prompt, saving even more.
This higher memory efficiency directly raises how many requests a GPU can serve in parallel.
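Prefix sharing can be sketched with a reference count per page. The class SharedPagePool and its methods below are hypothetical names for illustration. The sketch shares whole pages only and leaves out what happens when a request needs to write into a shared page (copy-on-write is one possible approach).

```python
from collections import Counter

class SharedPagePool:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.refcount = Counter()  # physical page id -> number of requests using it

    def allocate(self):
        """Take a fresh page from the pool for a single owner."""
        page = self.free_pages.pop()
        self.refcount[page] = 1
        return page

    def share(self, pages):
        """Let another request reference existing pages without copying them."""
        for p in pages:
            self.refcount[p] += 1
        return list(pages)

    def release(self, pages):
        """Drop one reference per page and free pages nobody uses any more."""
        for p in pages:
            self.refcount[p] -= 1
            if self.refcount[p] == 0:
                del self.refcount[p]
                self.free_pages.append(p)

pool = SharedPagePool(num_pages=10)
prefix = [pool.allocate() for _ in range(4)]  # e.g. a system prompt filling 4 pages
req_a = prefix + [pool.allocate()]             # prefix plus one private page
req_b = pool.share(prefix) + [pool.allocate()] # same prefix pages, own private page
print(10 - len(pool.free_pages))  # 6 pages in use, versus 10 without sharing
```

A prefix page returns to the pool only after every request that references it has been released.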
Key idea
Paged attention stores the KV cache in fixed-size pages tracked by a block table, which removes nearly all fragmentation and enables prefix sharing across requests.