KV Cache
Understand how key-value caching makes autoregressive LLM inference faster by reusing attention work from previous tokens.
After this, you will understand
What a KV cache stores, what tradeoff it introduces, and where it appears in AI systems.
How to describe the idea in plain English before adding machinery.
How the terms KV cache, key-value cache, and autoregressive inference relate to each other.
How the cache connects to prompt length, output length, serving latency, and memory cost.
Concepts Covered
- KV cache
- Key-value cache
- Autoregressive inference
- Prefill phase
- Decode phase
- Attention keys and values
- Cache memory growth
- Serving latency tradeoffs
Definition
A KV cache stores the key and value tensors produced by attention layers for previous tokens during autoregressive generation.
The plain-English version:
do not recompute the attention keys and values for old tokens every time
store them once and reuse them for the next token
It is a serving optimization for transformer language models that generate one token at a time.
Why This Concept Exists
LLMs generate text step by step.
Suppose the model has already processed:
The database connection timed out because
To generate the next token, the model needs context from those previous tokens.
After it generates one more token, it still needs context from the same previous tokens again.
Without caching, the model would recompute the keys and values of tokens it has already processed at every generation step. The KV cache exists because that repeated work grows with every token generated.
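To get a feel for how quickly the repeated work adds up, the sketch below counts key/value projections for a single layer with and without a cache. It assumes the example prompt is 6 tokens long (real tokenizers may split it differently) and that 100 tokens are generated. Both function names are hypothetical and exist only for this illustration.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Count key/value projections when every step rebuilds them for the whole context."""
    total = 0
    context_len = prompt_len
    for _ in range(new_tokens):
        total += context_len  # every token currently in context is projected again
        context_len += 1      # the generated token joins the context
    return total


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Count key/value projections when past keys and values are stored and reused."""
    total = 0
    uncached = prompt_len     # the first step has to project the whole prompt
    for _ in range(new_tokens):
        total += uncached     # only tokens not yet in the cache are projected
        uncached = 1          # afterwards, only the newest token is missing
    return total


# Assumed numbers: a 6-token prompt and 100 generated tokens.
print(kv_projections_without_cache(6, 100))  # 5550
print(kv_projections_with_cache(6, 100))     # 105
```

The uncached count grows roughly with the square of the output length, while the cached count grows linearly.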
The Beginner Mental Model
A beginner may think:
The model reads the prompt once and then just writes the answer.
That hides the serving loop.
For many language models, generation looks more like:
read current context
predict one token
append that token
repeat
The KV cache is one of the tricks that makes that loop fast enough to serve real users.
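The loop above can be sketched in a few lines. This is a toy illustration, not a real model: `toy_predict_next` is a hypothetical lookup table standing in for the network, and the `<eos>` stop token is an assumption of this example.

```python
from typing import Callable, List


def generate(
    prompt: List[str],
    predict_next: Callable[[List[str]], str],
    max_new_tokens: int,
    stop: str = "<eos>",
) -> List[str]:
    """Run the read / predict / append / repeat loop."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = predict_next(context)  # read current context, predict one token
        if token == stop:
            break
        context.append(token)          # append that token, then repeat
    return context


def toy_predict_next(context: List[str]) -> str:
    """Hypothetical stand-in for a model: looks only at the last token."""
    table = {"because": "the", "the": "pool", "pool": "was", "was": "exhausted"}
    return table.get(context[-1], "<eos>")


prompt = ["The", "database", "connection", "timed", "out", "because"]
print(" ".join(generate(prompt, toy_predict_next, max_new_tokens=10)))
# The database connection timed out because the pool was exhausted
```

Note that `predict_next` receives the whole context on every iteration. That repeated pass over old tokens is the part a KV cache makes cheaper.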
What Gets Cached
Attention uses queries, keys, and values.
During generation, the current token produces a new query. That query attends over the keys and values of the current token and all previous tokens.
The useful observation is:
past keys and values do not need to be rebuilt from scratch every step
So the serving system stores them per layer.
A simplified step is:
new token -> compute current key and value
cache <- append current key and value
current query attends over cached keys and values
The exact tensor shapes are implementation details, but the engineering idea is stable: reuse past attention state.
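The simplified step above can be written out for a single attention head in a single layer. The sketch below is a minimal pure-Python illustration under stated assumptions: the projection matrices `W_Q`, `W_K`, `W_V`, the token vectors, and the `KVCache` class are all made up for this example. It checks that attending over cached keys and values gives the same output as rebuilding them from scratch at every step.

```python
import math
from typing import List

Vector = List[float]

# Hypothetical 2x2 projection matrices for one attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.5], [-0.5, 0.5]]
W_V = [[1.0, 2.0], [0.0, 1.0]]


def matvec(w: List[Vector], x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over a list of keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


class KVCache:
    """Stores the keys and values of every token seen so far (one layer, one head)."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)


def step_with_cache(x: Vector, cache: KVCache) -> Vector:
    # new token -> compute current key and value, append them to the cache
    cache.append(matvec(W_K, x), matvec(W_V, x))
    # current query attends over cached keys and values
    return attend(matvec(W_Q, x), cache.keys, cache.values)


def step_without_cache(xs: List[Vector]) -> Vector:
    # Rebuild keys and values for every token in the context, old and new.
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    return attend(matvec(W_Q, xs[-1]), keys, values)


token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token embeddings
cache = KVCache()
cached_outputs = [step_with_cache(x, cache) for x in token_vectors]
recomputed_outputs = [
    step_without_cache(token_vectors[: i + 1]) for i in range(len(token_vectors))
]
```

Both lists hold the same attention outputs. The cached path simply computes each key and value once instead of once per step. A real model keeps one such cache per layer and per head.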
Prefill And Decode
LLM serving is usually described in two phases.
Prefill processes the prompt:
prompt tokens -> build initial model state and KV cache
Decode generates new tokens:
use cache + current token -> predict next token
append new key and value to cache
repeat
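Prefill can process all prompt tokens in parallel. Decode is more sequential because each new token depends on the previously generated token.
That is one reason long prompts and long outputs create different performance pressures: a long prompt adds prefill compute before the first token appears, while a long output adds sequential decode steps.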
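The two phases differ mainly in how many tokens enter the cache per call. The sketch below shows only that control flow. `project` and `pick_next` are hypothetical stand-ins for the model's key/value projections and output head, and the token ids are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LayerCache:
    keys: List[float] = field(default_factory=list)
    values: List[float] = field(default_factory=list)


def project(token_id: int) -> Tuple[float, float]:
    """Hypothetical stand-in for computing one token's key and value."""
    return token_id * 0.1, token_id * 0.2


def pick_next(cache: LayerCache) -> int:
    """Hypothetical stand-in for the model head: derives a token id from the cache."""
    return len(cache.keys) % 50


def prefill(prompt_ids: List[int], cache: LayerCache) -> int:
    # All prompt tokens are known up front, so their keys and values
    # can be built together in a single pass.
    projected = [project(t) for t in prompt_ids]
    cache.keys.extend(k for k, _ in projected)
    cache.values.extend(v for _, v in projected)
    return pick_next(cache)


def decode_step(token_id: int, cache: LayerCache) -> int:
    # Only the newest token is projected; everything older is reused.
    key, value = project(token_id)
    cache.keys.append(key)
    cache.values.append(value)
    return pick_next(cache)


cache = LayerCache()
token = prefill([11, 12, 13, 14, 15, 16], cache)  # one call, six cache entries
generated = [token]
for _ in range(3):
    token = decode_step(token, cache)             # one call, one cache entry
    generated.append(token)

print(len(cache.keys))  # 9
```

Each `decode_step` call needs the token returned by the previous call, which is why decode cannot be parallelized across steps the way prefill can across prompt tokens.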
Why KV Cache Helps Latency
Without a cache, every new token step would spend work recomputing old keys and values.
With a cache, the model still attends to previous context, but it can reuse stored keys and values.
That reduces the time per generated token.
The tradeoff is memory:
more tokens + more layers + more heads + larger head dimension -> larger cache
So KV cache reduces repeated computation, but it increases memory pressure.
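The relationship above can be turned into a back-of-the-envelope estimate. The function below is a rough sketch: the factor of 2 accounts for storing both keys and values, and the configuration in the example (32 layers, 32 heads, head dimension 128, 16-bit elements) is a hypothetical one chosen for round numbers, not a claim about any specific model.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_heads: int,          # heads that store their own keys and values
    head_dim: int,
    bytes_per_element: int,  # 2 for 16-bit floats
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size: keys plus values, for every layer, head, and token."""
    keys_and_values = 2
    return (
        keys_and_values
        * num_layers
        * num_heads
        * head_dim
        * bytes_per_element
        * num_tokens
        * batch_size
    )


# Hypothetical configuration, one request with 4096 tokens in context.
size = kv_cache_bytes(
    num_tokens=4096, num_layers=32, num_heads=32, head_dim=128, bytes_per_element=2
)
print(size / 1024**3)  # 2.0 (GiB)
```

Under these assumed numbers each token costs 0.5 MiB of cache, so a single 4096-token request holds 2 GiB, and the total scales linearly with the number of concurrent requests.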
Product And Infrastructure Pressure
In a ChatGPT-style assistant, every active generation request may hold a KV cache.
For an AI coding assistant, a long file context and long completion can make the cache large.
For a document assistant, retrieved context increases prompt length, which increases prefill work and cache size.
This is why LLM serving teams care about:
- batching active requests
- prompt length
- output length
- GPU memory
- cache eviction or paging
- quantized caches
- attention kernels
KV cache is not a tiny implementation detail. It shapes real serving cost and latency.
Common Confusions
KV cache is not model memory.
It does not mean the model permanently remembers a user. It is temporary inference state for the current generation.
KV cache is not the same as retrieval.
Retrieval fetches external information to place into context. KV cache stores attention state for tokens already inside the current context.
KV cache does not remove attention cost entirely.
The model still has to attend over available cached context. The cache avoids recomputing past keys and values.
KV cache is mostly an inference concern.
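Training has different parallelism and memory patterns: it typically processes all positions of a sequence in one pass, so there is no token-by-token loop to reuse state across. Caching previous token states is mainly useful for autoregressive decoding.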
What This Does Not Mean
KV cache does not make long context free.
Longer context still consumes memory and can create latency, batching, and quality challenges. The cache helps reuse work, but the serving system still has to store and manage the growing state.