Paged Attention

Managing the KV cache in fixed-size pages, like virtual memory, to cut waste.

The fragmentation problem

A serving system must hold a KV cache for every active request. If it reserves one contiguous block per request, sized for the maximum sequence length, most of that block sits empty while the request is still short, which wastes memory inside each reservation. Because requests have different lengths, the free space left between blocks also breaks into gaps too small for any new reservation to use.
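To put a rough number on the waste inside reservations, here is a minimal sketch. The function name, the request lengths, and the maximum length of 2048 are all made up for illustration; they do not come from any particular serving system.

```python
def reserved_waste(request_lengths, max_len):
    """Fraction of reserved KV slots left unused when every request
    gets one contiguous block of max_len token slots."""
    reserved = max_len * len(request_lengths)
    used = sum(request_lengths)
    return (reserved - used) / reserved


# Hypothetical current lengths (in tokens) of four active requests.
lengths = [120, 300, 45, 900]
print(f"{reserved_waste(lengths, max_len=2048):.2f}")  # roughly 0.83
```

With these assumed numbers, about five sixths of the reserved slots hold nothing, and that is before counting the unusable gaps between blocks.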

Pages and a block table

Paged attention borrows ideas from operating system virtual memory. It splits the KV cache into fixed-size pages that can live anywhere in memory, and it keeps a block table that maps each request to its scattered pages:

  • Allocate pages only as a request grows.
  • Pages need not be contiguous, so gaps vanish.
  • A lookup table lets attention find the right pages.
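The three points above can be sketched as a toy allocator. This is a simplified illustration, not the implementation of any real library: the class PagedKVCache, its method names, and the tiny page size are assumptions, and it only tracks page ids rather than storing actual key and value tensors.

```python
class PagedKVCache:
    """Toy page allocator: tracks which physical pages each request owns."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids not in use
        self.block_tables = {}                    # request id -> list of page ids
        self.lengths = {}                         # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (physical_page, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # Allocate a new page only when the last one is full (or none exists).
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return self.lookup(request_id, length)

    def lookup(self, request_id, position):
        """Translate a logical token position into (physical_page, offset)."""
        if not 0 <= position < self.lengths[request_id]:
            raise IndexError("position outside the stored sequence")
        table = self.block_tables[request_id]
        return table[position // self.page_size], position % self.page_size

    def free(self, request_id):
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_pages=4, page_size=2)
for _ in range(3):
    cache.append_token("a")       # request "a" grows to 3 tokens
cache.append_token("b")           # request "b" gets its own page
print(cache.block_tables)         # pages for "a" need not be adjacent
print(cache.lookup("a", 2))       # where token 2 of "a" physically lives
```

The block table plays the role of a page table in an operating system: attention asks for a logical position, and the table resolves it to a physical page and an offset within that page.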

Benefits

  • Almost no wasted memory: at most the unfilled part of each request's last page goes unused, so many more requests fit at once.
  • Pages can be shared between requests with a common prefix, such as the same system prompt, saving even more.

This higher memory efficiency directly raises how many requests a GPU can serve in parallel.
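Prefix sharing can be sketched with per-page reference counts. This is a minimal sketch under assumptions: SharedPagePool and its methods are hypothetical names, only page ids are tracked, and the copy step a real system would need before writing into a shared page is left out.

```python
from collections import Counter


class SharedPagePool:
    """Toy page pool where several requests may point at the same page."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.ref_counts = Counter()  # page id -> number of requests using it

    def allocate(self):
        """Hand out a fresh page owned by one request."""
        page = self.free_pages.pop()
        self.ref_counts[page] = 1
        return page

    def share(self, pages):
        """Let another request reuse already-filled pages."""
        for page in pages:
            self.ref_counts[page] += 1
        return list(pages)

    def release(self, pages):
        """Drop one request's references; free pages nobody uses anymore."""
        for page in pages:
            self.ref_counts[page] -= 1
            if self.ref_counts[page] == 0:
                del self.ref_counts[page]
                self.free_pages.append(page)


pool = SharedPagePool(num_pages=8)
table_a = [pool.allocate() for _ in range(3)]  # request A fills 3 prefix pages
table_b = pool.share(table_a)                  # request B reuses the same prefix
table_a.append(pool.allocate())                # each request then gets a
table_b.append(pool.allocate())                # private page for its own tokens
print(8 - len(pool.free_pages))                # pages in use: 5 instead of 8
```

In this example the two block tables differ only in their last entry, so the shared prefix is stored once no matter how many requests start with it.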

Key idea

Paged attention stores the KV cache in fixed-size pages tracked by a block table, eliminating nearly all fragmentation and enabling prefix sharing across requests.

Check yourself

1. What operating system idea does paged attention borrow?

2. What can paged attention share across requests?

3. What problem does it mainly solve?