KV caching is mainly used for Transformer inference. During training, the whole input sequence is processed in a single forward pass (unlike generation, which revisits the same tokens step by step), and computation for all positions is batched, so the benefit of a KV cache is small; memory is also needed for many other things (such as activations, gradients, and optimizer states), so the cache is not used there. Dynamically growing memory would also be risky in that setting.
Storing the Key/Value tensors in GPU memory for reuse often takes a lot of memory; in some cases the cache grows larger than the model weights themselves.
In Hugging Face Transformers, the corresponding `use_cache` option of the `generate` function defaults to `True`.
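As a minimal sketch (the model name, prompt, and timing harness below are illustrative choices, not from this note), the flag can be toggled explicitly to compare generation with and without the cache:

```python
# Minimal sketch: comparing generation with and without the KV cache.
# The model name ("gpt2") and prompt are illustrative assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("The KV cache stores", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```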
KV cache memory usage grows with several factors (see the size estimate after this list):
- Number of decoder layers: more layers mean more KV caches to keep
- Sequence length: longer sequences store more key/value pairs
- Model dimension: larger key and value dimensions mean more data per cached token
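A hedged back-of-the-envelope estimate of how these factors combine: per token the cache holds 2 (keys and values) × layers × heads × head_dim × bytes per element. The configuration below is an illustrative LLaMA-2-7B-like assumption, not a figure from this note:

```python
# Back-of-the-envelope KV cache size; the configuration values are
# illustrative (roughly LLaMA-2-7B-like in fp16).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # The factor 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # ~2.0 GiB for a single 4096-token sequence
```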
Several techniques address this: FastGen compresses the KV cache differently depending on the attention head type, and PyramidKV keeps more cache in lower layers and less in higher layers. Combining such methods enables memory-optimized Transformer inference.
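A minimal sketch of the pyramid-style idea only (not the actual PyramidKV algorithm; the linear budget schedule and the token-importance scoring are assumptions): each layer keeps only its top-scoring cached tokens, with a budget that shrinks toward higher layers.

```python
# Minimal sketch of a pyramid-style per-layer KV budget. Assumption: a simple
# linear schedule from a large budget at layer 0 to a small one at the top
# layer; the real PyramidKV allocation and scoring rules differ.
import torch

def layer_budgets(num_layers, max_budget, min_budget):
    # More cache in lower layers, less in higher layers.
    return [int(max_budget - (max_budget - min_budget) * l / (num_layers - 1))
            for l in range(num_layers)]

def prune_kv(keys, values, scores, budget):
    # keys/values: [seq, heads, head_dim]; scores: importance per cached token.
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    return keys[keep], values[keep]

print(layer_budgets(num_layers=8, max_budget=1024, min_budget=128))
```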
Attention KV Caches

KV Cache in production
LLM inference speed of light
In the process of working on calm, a minimal from-scratch fast CUDA implementation of transformer-based language model inference, a critical consideration was establishing the speed of light for the inference process, and measuring the progress relative to that speed of light. In this post we’ll cover this theoretical limit and its implications.
https://zeux.io/2024/03/15/llm-inference-sol
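The core argument of that post is that single-sequence decoding is memory-bandwidth bound: every generated token must read all weights plus the current KV cache from memory. A hedged sketch of the resulting lower bound on per-token latency, using illustrative hardware and model numbers (7B parameters in fp16, ~2 GiB of KV cache, ~3.35 TB/s for an H100-class GPU), not measurements:

```python
# Hedged sketch: per-token latency lower bound for bandwidth-bound decoding.
# time_per_token >= bytes_read / memory_bandwidth; all figures are assumptions.
weight_bytes = 7e9 * 2        # 7B parameters in fp16
cache_bytes = 2 * 2**30       # example KV cache for a long sequence
bandwidth = 3.35e12           # bytes/s, H100-class memory bandwidth

t = (weight_bytes + cache_bytes) / bandwidth
print(f"lower bound: {t * 1e3:.2f} ms/token, i.e. at most {1 / t:.0f} tokens/s")
```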
An Overview of LLM Inference (2) - KV Cache
The previous post covered why Large Language Model (LLM) inference matters and why we should run LLMs efficiently. It also explained that text generation with an LLM is autoregressive generation and introduced the various decoding strategies that can be used during that process. See the previous post: https://dytis.tistory.com/53 (An Overview of LLM Inference (1) - Text Generation with LLMs) …
https://dytis.tistory.com/54
Problems with Large KV Caches under Long Context and Their Solutions: Part I - KV Cache Memory Requirements
An auto-regressive model predicts the output of the next step using the outputs of previous steps. GPT is an auto-regressive model that generates the next token based on previously generated tokens. GPT reuses the intermediate values produced when generating earlier tokens, namely…
https://moon-walker.medium.com/long-context로-인한-large-kv-cache의-문제점과-해결-방안-part-i-kv-cache의-메모리-요구량-025f3d5dea93
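A hedged sketch of that reuse with Hugging Face Transformers: once a cache exists, only the newest token is fed to the model. The model name ("gpt2") and the greedy decoding loop are illustrative assumptions.

```python
# Minimal sketch of incremental decoding that reuses cached key/value tensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The KV cache lets us", return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(32):
        # With a cache, only the most recent token needs to be processed.
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```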

Cross Layer KV-sharing with stateful caching
Optimizing AI Inference at Character.AI
At Character.AI, we're building toward AGI. In that future state, large language models (LLMs) will enhance daily life, providing business productivity and entertainment and helping people with everything from education to coaching, support, brainstorming, creative writing and more. To make that a reality globally, it's critical to achieve highly
https://research.character.ai/optimizing-inference
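A minimal sketch of the cross-layer sharing idea under assumed details (grouping consecutive layers so only the first layer in each group computes and stores K/V while the rest reuse it; the group size and module layout are not from the Character.AI post):

```python
# Minimal sketch of cross-layer KV sharing: within each group of consecutive
# layers, only the first layer projects (and would cache) K/V; the rest reuse it.
# Group size and projection shapes are illustrative assumptions.
import torch
import torch.nn as nn

class SharedKVAttention(nn.Module):
    def __init__(self, d_model, owns_kv):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.owns_kv = owns_kv
        if owns_kv:
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.owns_kv:
            shared_kv = (self.k_proj(x), self.v_proj(x))  # shared downstream
        k, v = shared_kv
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v, shared_kv

layers = [SharedKVAttention(64, owns_kv=(i % 3 == 0)) for i in range(6)]
x, kv = torch.randn(1, 10, 64), None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)  # layers 1-2 reuse layer 0's K/V, and so on
```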

Inference and KV cache
All About Transformer Inference | How To Scale Your Model
Performing inference on a Transformer can be very different from training. Partly this is because inference adds a new factor to consider: latency. In this section, we will go all the way from sampling a single new token from a model to efficiently scaling a large Transformer across many slices of accelerators as part of an inference engine.
https://jax-ml.github.io/scaling-book/inference/

Seonglae Cho
