Multi-Head Attention

Learn why transformers run several attention heads in parallel so token representations can mix different learned context signals.

Intermediate · 3 min read · Updated 2026-05-26
Mechanics, Modeling, Inference, Tradeoffs
Multi-Head Attention, Attention Heads, Parallel Attention, Learned Projections, Representation Subspaces, Context Mixing

After this, you will understand

Which mechanism does the work in multi-head attention, what tradeoff it introduces, and where it appears in AI systems.

Think before reading

Before learning the mechanics, what should a beginner understand about Multi-Head Attention and Attention Heads?
As you read, separate the vocabulary from the implementation details. The word should feel clear before the system design gets complex.

Study path

Read these in order

Start with the mechanics, then move into the patterns that explain why the system is shaped this way.

  1. Masked Attention

Concepts Covered

  • Multi-head attention
  • Attention heads
  • Learned projections
  • Parallel context mixing
  • Representation subspaces
  • Combining head outputs
  • Capacity and compute tradeoffs

Definition

Multi-head attention runs several attention computations in parallel over learned projections of the current representations, then combines their outputs.

The beginner version is:

one attention operation -> one learned way to mix context
multiple heads -> several learned ways to mix context in the same layer

Nobody assigns a grammar head, a reference head, or a code head by hand. Training determines what each head ends up being useful for.

Why This Concept Exists

One token position may need more than one kind of context signal at once.

In a sentence, a token might need:

  • nearby phrase structure
  • a far-away subject
  • the object being referenced
  • punctuation or formatting boundaries

Multi-head attention gives a transformer layer more room to form different attention patterns and value mixtures in parallel instead of squeezing every relationship through one attention view.

The Mechanical Shape

Each head has its own learned query, key, and value projections, and runs its attention computation on those projected views of the input.

A simplified layer flow is:

input representations
  -> head 1 attention output
  -> head 2 attention output
  -> head N attention output
  -> combine head outputs
  -> project back into the model representation

The heads are parallel parts of one layer. Their outputs are combined before later transformer computation continues.
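
The flow above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the weights are random stand-ins for learned parameters, the helper names (matmul, attention_head, multi_head_attention) are hypothetical, heads are combined by concatenation, and no attention mask is applied. The loop runs the heads one after another for readability; real implementations typically compute them together in batched matrix operations.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: small random values (an assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def attention_head(x, w_q, w_k, w_v):
    """One head: project the input, score all position pairs, mix the values."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))  # scaled dot-product attention
    weights = [
        softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        for q_i in q
    ]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_o):
    """Run every head on the same input, concatenate, then project back."""
    head_outputs = [attention_head(x, w_q, w_k, w_v) for w_q, w_k, w_v in heads]
    # Concatenate the head outputs position by position along the feature axis.
    combined = [
        [value for output in head_outputs for value in output[pos]]
        for pos in range(len(x))
    ]
    return matmul(combined, w_o)


rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2  # illustrative sizes only
head_dim = d_model // num_heads

x = random_matrix(seq_len, d_model, rng)
# Each head owns three projection matrices: query, key, and value.
heads = [
    tuple(random_matrix(d_model, head_dim, rng) for _ in range(3))
    for _ in range(num_heads)
]
w_o = random_matrix(num_heads * head_dim, d_model, rng)

output = multi_head_attention(x, heads, w_o)
print(len(output), len(output[0]))  # 4 8
```

The output has the same shape as the input (one vector of width d_model per position), which is what lets the rest of the transformer layer continue unchanged.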

Why Learned Projections Matter

Every head reads the same input representations, but each head applies its own parameters to them.

Learned projections create different query, key, and value views for the heads. That is why "multi-head" means more than repeating the exact same comparison several times.

The architecture gives capacity for different relationships. Training decides what becomes useful for the objective and data.
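
A small example can make this concrete. The sketch below feeds the same three token vectors through two heads whose query and key projections differ. The projection matrices are hand-picked for illustration (in a real model they are learned), and the function name attention_weights is hypothetical. Head A only looks at the first input feature, head B only at the second, so the same token ends up attending to different positions in each head.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x, w_q, w_k):
    """Attention pattern of one head: one row of weights per query position."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    return [
        softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        for q_i in q
    ]


# Three tokens, model width 2. Values are made up for the example.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hand-picked projections (not learned): head A reads feature 0, head B reads feature 1.
w_q_a = w_k_a = [[1.0], [0.0]]
w_q_b = w_k_b = [[0.0], [1.0]]

weights_a = attention_weights(x, w_q_a, w_k_a)
weights_b = attention_weights(x, w_q_b, w_k_b)

# Attention pattern of the last token in each head.
print([round(w, 3) for w in weights_a[2]])  # [0.422, 0.155, 0.422]
print([round(w, 3) for w in weights_b[2]])  # [0.155, 0.422, 0.422]
```

If both heads shared the same projections, the two rows would be identical and the second head would add nothing. Different projections are what let heads form different patterns over the same input.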

A Careful Mental Model

It is tempting to say:

head 1 handles grammar
head 2 handles facts
head 3 handles code

That can make the first picture intuitive, but it is too rigid.

A better mental model is:

different heads can learn different context-mixing patterns

Some patterns may become interpretable. Others are distributed across heads, layers, and feed-forward computation.

Capacity, Compute, And Design

Multi-head attention adds structure inside an attention layer. In the usual design, the layer's width is divided among the heads, so each head works in a smaller space.

Engineers care because architecture choices affect:

  • representation capacity
  • memory movement
  • attention-kernel efficiency
  • KV cache shape during inference
  • how model width is divided across heads

Those details become sharper when we reach KV cache and attention optimizations. For now, remember that more architectural structure is not free. It must fit the model's quality, training, and serving budget.
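
Two of the items above reduce to simple arithmetic. The sketch below assumes the common convention that model width is split evenly across heads and that every head caches its own keys and values for every layer and position. The function names and the configuration numbers are hypothetical and do not describe any particular model; real systems may use different layouts.

```python
def per_head_width(d_model, num_heads):
    """Width each head gets when model width is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value):
    """Cache size for one sequence when every head stores its own keys and values."""
    keys_and_values = 2  # one cached tensor for keys, one for values
    return keys_and_values * num_layers * num_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
d_model, num_heads, num_layers, seq_len = 4096, 32, 32, 4096
head_dim = per_head_width(d_model, num_heads)
cache = kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2)

print(head_dim)                      # 128
print(f"{cache / 2**30:.1f} GiB")    # 2.0 GiB
```

Under these assumptions, changing the number of heads while keeping d_model fixed changes the width of each head but not the total cache size, because num_heads * head_dim stays equal to d_model.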

Common Confusions

A head is not a separate model.

It is a component inside an attention layer.

Adding more heads does not automatically mean better product behavior.

Model quality depends on data, objective, scale, training, architecture balance, evaluation, and product system design.

Multi-head attention is not a committee of human-readable specialists.

It is parallel learned computation over projected representations.
