Paged Attention

Creator
Seonglae Cho
Created
2024 Mar 8 16:06
Edited
2026 Apr 9 10:52
Refs
KV cache memory management is one of the key bottlenecks in LLM serving systems, where existing systems allocate KV cache in contiguous memory spaces, causing severe internal and external fragmentation. Inspired by virtual memory and paging techniques from operating systems, PagedAttention manages KV cache non-contiguously in fixed-size blocks.
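This block-table indirection can be sketched in a few lines of Python. The class and method names below are illustrative, not vLLM's actual internals; the 16-token block size matches vLLM's default.

```python
# Sketch of PagedAttention-style block-table indirection: a sequence's
# logical KV positions map to fixed-size physical blocks that need not
# be contiguous. Names are illustrative, not vLLM's API.

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default)

class BlockAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        return self.free.pop()  # any free block; no contiguity required

class SequenceKVCache:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Grow by one token; a new block is allocated only on a block
        # boundary, so at most BLOCK_SIZE - 1 slots per sequence are idle.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, logical_pos):
        # Translate a logical token position to (physical block, offset).
        return (self.block_table[logical_pos // BLOCK_SIZE],
                logical_pos % BLOCK_SIZE)
```

Because allocation happens one block at a time on demand, internal fragmentation is bounded by the last partially filled block, instead of the whole-context overallocation that contiguous schemes require.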
The core of PagedAttention is a custom CUDA kernel that performs attention computation at the block level. When computing the dot product between the query and the key cache, each thread group handles the operation between one query token q and one key token k_j in the form q · k_j. Softmax normalization is performed in a numerically stable manner by first computing m = max_j(q · k_j), then calculating exp(q · k_j − m), and dividing by the sum ℓ = Σ_j exp(q · k_j − m) to obtain the final softmax weights. This process efficiently aggregates the max and the sum across the entire thread block using intra-warp shuffle operations and cross-warp reduction via shared memory. The final computation with the values is accumulated as acc = Σ_j softmax_j · v_j and written out after inter-warp reduction.
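The steps above can be sketched in plain Python for a single query over block-partitioned keys and values. This is a numerical sketch only: the warp-level reductions and the usual 1/√d scaling are omitted, and the function names are ours, not the kernel's.

```python
import math

def paged_attention_one_query(q, k_blocks, v_blocks):
    """Numerically stable attention for one query over block-partitioned
    K/V, mirroring the kernel's max/sum reductions. q is a list of floats;
    k_blocks and v_blocks are lists of blocks, each block a list of
    per-token vectors. Pure-Python sketch; 1/sqrt(d) scaling omitted."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Pass 1: logits q.k_j and the running max m for stability.
    logits, m = [], float("-inf")
    for block in k_blocks:
        for k in block:
            s = dot(q, k)
            logits.append(s)
            m = max(m, s)
    # Pass 2: exp(q.k_j - m), their sum l, and the weighted V accumulation.
    l = 0.0
    acc = [0.0] * len(v_blocks[0][0])
    j = 0
    for block in v_blocks:
        for v in block:
            w = math.exp(logits[j] - m)
            l += w
            acc = [a + w * x for a, x in zip(acc, v)]
            j += 1
    # Divide by l to finish the softmax normalization.
    return [a / l for a in acc]
```

Subtracting the max m before exponentiating is what keeps the kernel stable: every exponent is at most zero, so no logit can overflow regardless of magnitude.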
Architecturally, the KV cache is partitioned into fixed-size blocks of B tokens each, with each block placed non-contiguously in physical memory. The key cache uses a [num_blocks, num_heads, head_size/x, block_size, x] layout (where x is the number of key elements packed per vectorized load), and the value cache uses a [num_blocks, num_heads, head_size, block_size] layout, to optimize memory coalescing. Through a three-level parallelization hierarchy of thread group, warp, and thread block, a single thread block handles the full-context attention for one sequence and one head, so the grid is configured as (num_heads, num_seqs).
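Assuming the layouts described above, the flat-offset arithmetic can be sketched as follows. The parameter names and the default x = 8 are illustrative, not taken from the kernel source.

```python
def key_cache_index(block, head, dim, tok,
                    num_heads, head_size, block_size, x=8):
    """Flat offset into a key cache laid out as
    [num_blocks, num_heads, head_size // x, block_size, x], where x groups
    contiguous elements of one token's key so vectorized loads coalesce.
    Sketch under assumed parameter names."""
    return ((((block * num_heads + head) * (head_size // x)
              + dim // x) * block_size + tok) * x + dim % x)

def value_cache_index(block, head, dim, tok,
                      num_heads, head_size, block_size):
    """Flat offset into a value cache laid out as
    [num_blocks, num_heads, head_size, block_size]: tokens are contiguous
    in the last axis, so reading one dim across a block's tokens coalesces."""
    return ((block * num_heads + head) * head_size + dim) * block_size + tok
```

The two layouts differ because the access patterns differ: the QK dot product reads many elements of one key token (hence the x-packed last axis), while the value accumulation reads one dim across all tokens of a block (hence tokens last).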
Applying PagedAttention in the vLLM system achieved substantially higher throughput than both FasterTransformer and HuggingFace Transformers, while reducing KV cache memory waste to a small fraction of what existing contiguous-allocation systems incur. The memory-efficiency advantage is greatest with long sequences and large batch sizes.
Paged Attention - vLLM