Multi-Head Attention
Learn why transformers run several attention heads in parallel so token representations can mix different learned context signals.
After this, you will understand
What mechanism multi-head attention uses, what tradeoff it introduces, and where it appears in AI systems.
What the term means in plain English, before any machinery is added.
How multi-head attention, attention heads, and parallel attention relate, so the terms do not blur together early on.
How the concept connects to inputs, outputs, model behavior, product boundaries, and evaluation.
Think before reading: Before learning the mechanics, what should a beginner understand about multi-head attention and attention heads?
Concepts Covered
- Multi-head attention
- Attention heads
- Learned projections
- Parallel context mixing
- Representation subspaces
- Combining head outputs
- Capacity and compute tradeoffs
Definition
Multi-head attention runs several attention computations in parallel over learned projections of the current representations, then combines their outputs.
The beginner version is:
one attention operation -> one learned way to mix context
multiple heads -> several learned ways to mix context in the same layer
No human assigns a grammar head, a reference head, or a code head. What each head ends up doing emerges from training.
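For readers who want the definition in symbols, one common way to write it is sketched below. The notation is an assumption added for illustration, not something this page defines: X is the matrix of input representations, h is the number of heads, W_i^Q, W_i^K, and W_i^V are the learned projections for head i, d_k is the per-head key width, and W^O is the learned output projection. Masking and biases are left out.

```latex
\[
\begin{aligned}
Q_i &= X W_i^{Q}, \quad K_i = X W_i^{K}, \quad V_i = X W_i^{V} \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
\end{aligned}
\]
```

Each head produces its own mixture of values, and the final line is the "combine their outputs" step from the definition.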
Why This Concept Exists
One token position may need more than one kind of context signal at once.
In a sentence, a token might need:
- nearby phrase structure
- a far-away subject
- the object being referenced
- punctuation or formatting boundaries
Multi-head attention gives a transformer layer more room to form different attention patterns and value mixtures in parallel instead of squeezing every relationship through one attention view.
The Mechanical Shape
Each head has its own learned query, key, and value projections for its attention computation.
A simplified layer flow is:
input representations
-> head 1 attention output
-> head 2 attention output
-> head N attention output
-> combine head outputs
-> project back into the model representation
The heads are parallel parts of one layer. Their outputs are combined before later transformer computation continues.
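The flow above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the weights are random stand-ins for learned parameters, heads are combined by concatenation (the usual choice), there is no masking, and the loop over heads stands in for what real kernels compute in parallel. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights; a trained model would supply these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def attention_head(x, w_q, w_k, w_v):
    """One head: project, score each position against every position, mix values."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_o):
    """Run every head on the same input, concatenate per position, project back."""
    outputs = [attention_head(x, *h) for h in heads]
    combined = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(combined, w_o)


rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

x = random_matrix(seq_len, d_model, rng)
# Each head gets its own (w_q, w_k, w_v) triple.
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(num_heads * d_head, d_model, rng)

y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 4 8
```

The output has the same shape as the input, which is what lets later transformer computation continue on the combined result.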
Why Learned Projections Matter
Heads do not all look at an identical representation through identical parameters.
Learned projections create different query, key, and value views for the heads. That is why "multi-head" means more than repeating the exact same comparison several times.
The architecture gives capacity for different relationships. Training decides what becomes useful for the objective and data.
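A small experiment makes the point concrete. The sketch below, which uses random matrices as hypothetical stand-ins for learned projections, computes attention weights for the same input under two different sets of query and key projections, then once more with the first set reused. The names and sizes are illustrative assumptions.

```python
import math
import random


def project(x, w):
    """Multiply token rows by a projection matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def attention_weights(x, w_q, w_k):
    """Softmax of scaled query-key scores for one head."""
    q, k = project(x, w_q), project(x, w_k)
    scale = math.sqrt(len(q[0]))
    weights = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


rng = random.Random(1)
seq_len, d_model, d_head = 3, 4, 2
x = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]


def random_projection():
    # Hypothetical stand-in for a learned projection matrix.
    return [[rng.uniform(-1, 1) for _ in range(d_head)] for _ in range(d_model)]


w_q_a, w_k_a = random_projection(), random_projection()
w_q_b, w_k_b = random_projection(), random_projection()

head_a = attention_weights(x, w_q_a, w_k_a)
head_b = attention_weights(x, w_q_b, w_k_b)
copy_of_a = attention_weights(x, w_q_a, w_k_a)

for name, w in (("head A", head_a), ("head B", head_b)):
    print(name, [[round(p, 2) for p in row] for row in w])
print("same parameters, same weights:", copy_of_a == head_a)
print("different parameters, different weights:", head_a != head_b)
```

Reusing identical parameters reproduces the same attention pattern, so a second head only adds something when its projections differ.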
A Careful Mental Model
It is tempting to say:
head 1 handles grammar
head 2 handles facts
head 3 handles code
That can make the first picture intuitive, but it is too rigid.
A better mental model is:
different heads can learn different context-mixing patterns
Some patterns may become interpretable. Others are distributed across heads, layers, and feed-forward computation.
Capacity, Compute, And Design
Multi-head attention increases the structure available inside an attention layer.
Engineers care because architecture choices affect:
- representation capacity
- memory movement
- attention-kernel efficiency
- KV cache shape during inference
- how model width is divided across heads
Those details become sharper when we reach KV cache and attention optimizations. For now, remember that more architectural structure is not free. It must fit the model's quality, training, and serving budget.
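As a rough illustration of how model width and head count interact, the sketch below works through the arithmetic for a hypothetical configuration. It assumes the common convention that each head's width is the model width divided by the number of heads, that every head caches its own keys and values, and that values are stored in 2 bytes each. None of the numbers come from a specific model.

```python
def head_dim(d_model, num_heads):
    """Per-head width under the convention d_head = d_model / num_heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must divide evenly across heads")
    return d_model // num_heads


def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Keys plus values cached per layer, per head, per position."""
    return (2 * num_layers * num_kv_heads * d_head
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration, not taken from any specific model.
d_model, num_layers, seq_len = 1024, 12, 2048
for num_heads in (8, 16):
    d_head = head_dim(d_model, num_heads)
    size = kv_cache_bytes(num_layers, num_heads, d_head, seq_len)
    print(num_heads, d_head, size)
# prints:
# 8 128 100663296
# 16 64 100663296
```

Under these assumptions, doubling the head count at a fixed model width leaves the cache size unchanged but halves each head's width, which is one way the "not free" tradeoff shows up.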
Common Confusions
A head is not a separate model.
It is a component inside an attention layer.
Adding more heads does not automatically mean better product behavior.
Model quality depends on data, objective, scale, training, architecture balance, evaluation, and product system design.
Multi-head attention is not a committee of human-readable specialists.
It is parallel learned computation over projected representations.