AI Concepts

KV Cache

Understand how key-value caching makes autoregressive LLM inference faster by reusing attention work from previous tokens.

Intermediate | 4 min read | Updated 2026-05-26
Topics: Mechanics, Inference, Operations, Tradeoffs
Tags: KV Cache, Key-Value Cache, Autoregressive Inference, Prefill, Decode, Attention Cache, Memory Tradeoffs

After this, you will understand

How a KV cache works, which mechanism does the work, what tradeoff it introduces, and where it appears in AI systems.

Beginner version

Start with the word in plain English before adding machinery.

Confusion point

The idea becomes unclear when the terms KV cache, key-value cache, and autoregressive inference are all introduced at once.

Better mental model

Connect the word to inputs, outputs, model behavior, product boundaries, and evaluation.

Think before reading: before learning the mechanics, what should a beginner understand about the KV cache (also called the key-value cache)?
As you read, separate the vocabulary from the implementation details. The word should feel clear before the system design gets complex.



Concepts Covered

  • KV cache
  • Key-value cache
  • Autoregressive inference
  • Prefill phase
  • Decode phase
  • Attention keys and values
  • Cache memory growth
  • Serving latency tradeoffs

Definition

A KV cache stores the key and value tensors produced by attention layers for previous tokens during autoregressive generation.

The plain-English version:

do not recompute the attention keys and values for old tokens every time
store them once and reuse them for the next token

It is a serving optimization for transformer language models that generate one token at a time.

Why This Concept Exists

LLMs generate text step by step.

Suppose the model has already processed:

The database connection timed out because

To generate the next token, the model needs context from those previous tokens.

After it generates one more token, it still needs context from the same previous tokens again.

Without caching, the model would recompute the keys and values for every earlier token at every generation step, even though those tokens have not changed. KV cache exists because that repeated work grows with the length of the context and becomes expensive during generation.

The Beginner Mental Model

A beginner may think:

The model reads the prompt once and then just writes the answer.

That hides the serving loop.

For many language models, generation looks more like:

read current context
predict one token
append that token
repeat

The KV cache is one of the tricks that makes that loop fast enough to serve real users.
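The loop above can be sketched in a few lines of Python. This is only an illustration: `toy_next_token` is a hypothetical stand-in for a real model, and the token ids are made up.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a model: derives a token id from the context."""
    return (sum(context) + len(context)) % 50


def generate(prompt, max_new_tokens):
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = toy_next_token(context)  # read current context, predict one token
        context.append(token)            # append that token
    return context                       # the loop above is the "repeat"


out = generate([3, 14, 15], max_new_tokens=4)
print(out)
```

Note that every call to `toy_next_token` receives the whole context again. In a real transformer, that repeated pass over old tokens is the work a KV cache avoids.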

What Gets Cached

Attention uses queries, keys, and values.

During generation, the current token produces a new query. It attends over keys and values from the current and previous tokens.

The useful observation is:

past keys and values do not need to be rebuilt from scratch every step

So the serving system stores them per layer.

A simplified step is:

new token -> compute current key and value
cache <- append current key and value
current query attends over cached keys and values

The exact tensor shapes are implementation details, but the engineering idea is stable: reuse past attention state.
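A minimal sketch of that step, assuming a single attention head in a single layer and plain Python lists instead of tensors. The weight matrices `W_Q`, `W_K`, `W_V` and the toy embeddings are invented for illustration. The sketch compares the cached path with a path that rebuilds all keys and values at every step.

```python
import math


def project(x, weights):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def attend(query, keys, values):
    """Softmax-weighted sum of values, scored by scaled dot product with keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


# Hypothetical toy weights and token embeddings.
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_V = [[1.0, 0.0], [0.5, 0.5]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With a cache: project only the new token, append, then attend.
cache_k, cache_v, cached_outputs = [], [], []
for x in tokens:
    cache_k.append(project(x, W_K))
    cache_v.append(project(x, W_V))
    cached_outputs.append(attend(project(x, W_Q), cache_k, cache_v))

# Without a cache: rebuild every key and value at every step.
recomputed_outputs = []
for step in range(1, len(tokens) + 1):
    keys = [project(t, W_K) for t in tokens[:step]]
    values = [project(t, W_V) for t in tokens[:step]]
    recomputed_outputs.append(attend(project(tokens[step - 1], W_Q), keys, values))
```

Both paths produce the same attention outputs. The cached path simply skips re-projecting tokens it has already seen.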

Prefill And Decode

LLM serving often talks about two phases.

Prefill processes the prompt:

prompt tokens -> build initial model state and KV cache

Decode generates new tokens:

use cache + current token -> predict next token
append new key and value to cache
repeat

Prefill can process all prompt tokens in parallel, because the whole prompt is known up front. Decode is sequential, because each new token depends on the previously generated token.

That is one reason long prompts and long outputs create different performance pressures.
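The two phases can be sketched as two functions that write into the same cache. This is a simplified illustration, not a serving implementation: `toy_kv` is a hypothetical placeholder for a layer's key and value projection, and `NUM_LAYERS` is an arbitrary choice.

```python
NUM_LAYERS = 2  # arbitrary, for illustration


def toy_kv(token, layer):
    """Hypothetical stand-in for one layer's key/value projection of one token."""
    return ("k", layer, token), ("v", layer, token)


def new_cache():
    return [{"keys": [], "values": []} for _ in range(NUM_LAYERS)]


def write_tokens(cache, tokens):
    """Append one key and one value per token to every layer's cache."""
    for layer, slot in enumerate(cache):
        for token in tokens:
            key, value = toy_kv(token, layer)
            slot["keys"].append(key)
            slot["values"].append(value)


def prefill(cache, prompt_tokens):
    # The whole prompt is known, so a real system handles it in one batched pass.
    write_tokens(cache, prompt_tokens)


def decode_step(cache, token):
    # Only one new token exists per step; the next one depends on this step's output.
    write_tokens(cache, [token])


cache = new_cache()
prefill(cache, ["The", "database", "connection"])
lengths = [len(cache[0]["keys"])]
for generated in ["timed", "out"]:
    decode_step(cache, generated)
    lengths.append(len(cache[0]["keys"]))
print(lengths)
```

The cache starts at the prompt length after prefill and grows by one entry per layer on every decode step.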

Why KV Cache Helps Latency

Without a cache, every new token step would spend work recomputing old keys and values.

With a cache, the model still attends to previous context, but it can reuse stored keys and values.

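That lowers the time per generated token, because each decode step computes keys and values only for the newest token instead of for the whole sequence.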
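A rough way to see the saving is to count key/value projections per layer. The counting model below is a simplifying assumption: it ignores the attention computation itself and every other part of the forward pass, and the function names are made up for this sketch.

```python
def kv_projections_without_cache(prompt_len, new_tokens):
    # Step i re-runs the whole sequence: the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


def kv_projections_with_cache(prompt_len, new_tokens):
    # Prefill projects the prompt once; each later step projects only the newest token.
    return prompt_len + max(new_tokens - 1, 0)


without_cache = kv_projections_without_cache(prompt_len=1000, new_tokens=100)
with_cache = kv_projections_with_cache(prompt_len=1000, new_tokens=100)
print(without_cache, with_cache)
```

Under these assumptions, a 1000-token prompt with 100 generated tokens needs 104950 projections per layer without a cache and 1099 with one.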

The tradeoff is memory:

more tokens + more layers + more heads + larger head dimension -> larger cache

So KV cache reduces repeated computation, but it increases memory pressure.
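The growth above can be turned into a back-of-the-envelope estimate. The sketch assumes one key and one value vector per token, per layer, per KV head, stored at 2 bytes per number (16-bit). The model configuration in the example is hypothetical. In models that share keys and values across query heads, the head count that matters is the number of KV heads.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for one model under the stated assumptions."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_value)


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
size = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32, head_dim=128)
print(size / 1024**3, "GiB")
```

For this configuration the estimate is 2 GiB for a single 4096-token request, and it scales linearly with batch size and sequence length.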

Product And Infrastructure Pressure

In a ChatGPT-style assistant, every active generation request may hold a KV cache.

For an AI coding assistant, a long file context and long completion can make the cache large.

For a document assistant, retrieved context increases prompt length, which increases prefill work and cache size.

This is why LLM serving teams care about:

  • batching active requests
  • prompt length
  • output length
  • GPU memory
  • cache eviction or paging
  • quantized caches
  • attention kernels

KV cache is not a tiny implementation detail. It shapes real serving cost and latency.

Common Confusions

KV cache is not long-term model memory.

It does not mean the model permanently remembers a user. It is temporary inference state for the current generation.

KV cache is not the same as retrieval.

Retrieval fetches external information to place into context. KV cache stores attention state for tokens already inside the current context.

KV cache does not remove attention cost entirely.

The model still has to attend over available cached context. The cache avoids recomputing past keys and values.

KV cache is mostly an inference concern.

Training processes whole sequences in parallel, so there is no token-by-token loop whose past keys and values could be reused. Caching previous token states is mainly useful for autoregressive decoding.

What This Does Not Mean

KV cache does not make long context free.

Longer context still consumes memory and can create latency, batching, and quality challenges. The cache helps reuse work, but the serving system still has to store and manage the growing state.
