AI Concepts

Attention

Understand attention as the mechanism that lets token positions choose which context signals matter when their representations are updated.

Intermediate | 3 min read | Updated 2026-05-22 | Mechanics | Modeling | Inference | Tradeoffs

Attention | Self-Attention | Queries | Keys | Values | Context Mixing

After this, you will understand

How attention works as a mechanism, what tradeoff it introduces, and where it appears in AI systems.

Beginner version

Start with the word in plain English before adding machinery.

Confusion point

The idea becomes unclear when queries, keys, and values are introduced before the plain-English meaning of attention is settled.

Better mental model

Connect the word to inputs, outputs, model behavior, product boundaries, and evaluation.

Think before reading

Before learning the mechanics, what should a beginner understand about Attention and Self-Attention?

As you read, separate the vocabulary from the implementation details. The word should feel clear before the system design gets complex.

Study path

Read these in order

Start with the mechanics, then move into the patterns that explain why the system is shaped this way.

1. Multi-Head Attention (ai-concepts)

Concepts Covered

Attention
Self-attention
Queries
Keys
Values
Attention weights
Context mixing
Causal boundaries
Why attention cost grows with context

Definition

Attention is a mechanism that updates a representation by weighting information from other available representations.

In a transformer, self-attention lets token positions use other token positions in the same sequence as context.

Keep the plain-English question:

For this token position, what other positions should matter right now?

Attention turns that question into learned computation.

Why This Concept Exists

A token can be ambiguous until context settles it.

In:

The bank approved the loan.

"bank" should connect to a financial meaning.

In:

They sat by the river bank.

the surrounding tokens point to the edge of a river instead.

Attention gives the model a way to update token representations using relevant context instead of forcing all context through one fixed summary.

Queries, Keys, And Values

Attention is often introduced with three names:

query
key
value

Use a retrieval-shaped mental model, carefully:

query -> what this position is looking for
key -> what each available position advertises
value -> what information can be mixed in

The model creates these learned projections from token representations.

The query is compared with keys. Those comparisons become attention weights. The weights control how values are combined into an updated representation.
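
One common way to write this comparison is the scaled dot-product form used in standard transformers. The symbols here are assumptions for illustration and do not come from this page: X holds the token representations (one row per position), W_Q, W_K, and W_V are the learned projection matrices, and d_k is the key dimension.

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax is applied row by row, so row i of its output holds the attention weights for position i, and those weights sum to 1.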

A Small Attention Flow

A simplified self-attention step looks like this:

token representations
  -> build queries, keys, values
  -> compare each query with allowed keys
  -> turn scores into weights
  -> mix values using those weights

The output is not usually a copied sentence fragment. It is another numeric representation that carries context-shaped information forward into later model computation.
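
The flow above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: the helper names (`softmax`, `matvec`, `self_attention`), the identity projection matrices, and the toy token vectors are all hypothetical, and real models use learned projections and batched tensor operations.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over token vectors, with no mask."""
    # Build queries, keys, and values from the token representations.
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])

    outputs, all_weights = [], []
    for q in queries:
        # Compare this query with every key (scaled dot product).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        # Turn scores into weights.
        weights = softmax(scores)
        # Mix values using those weights.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy input: three positions, two dimensions, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1, and each entry of `outputs` is a new vector for one position: a weighted blend of the value vectors, not a copy of any single token.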

Attention Weights Are Not Human Explanations

Attention weights show how an attention operation distributes weight over available positions.

That can be useful for understanding the mechanic.

But a high weight is not automatically a complete human explanation for why the whole model produced a final answer. Later layers, multiple heads, feed-forward transformations, output scoring, and product layers still shape behavior.

Self-Attention And Available Context

Self-attention means token positions attend over the sequence representations available in that attention operation.

The word "available" matters.

Some transformer setups allow a position to use tokens on both sides. Autoregressive language-model generation uses a causal boundary so a position cannot read the future tokens it is supposed to predict.

That boundary becomes important when we discuss masked attention.
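
The causal boundary can be sketched as a restriction on which scores each position is allowed to use. The function name `causal_weights` and the score values below are hypothetical. Real implementations typically set the disallowed scores to negative infinity before the softmax, which has the same effect as the slicing used here.

```python
import math

def causal_weights(scores):
    """Softmax each row of a square score matrix under a causal boundary.

    scores[i][j] is the raw score of query position i against key position j.
    Position i may only use positions j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = row[: i + 1]  # drop scores for future positions
        peak = max(allowed)
        exps = [math.exp(s - peak) for s in allowed]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        padding = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + padding)
    return weights

# Hypothetical raw scores for a three-position sequence.
scores = [[0.0, 2.0, 1.0],
          [1.0, 0.0, 3.0],
          [0.5, 0.5, 0.5]]
weights = causal_weights(scores)
```

In this sketch the first position can only attend to itself, so its weight row is all on position 0, even though its raw score against position 1 was higher.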

Why Attention Changed The Shape Of Language Models

Attention creates direct context interactions between token positions.

That makes long-range relationships easier to represent than in an architecture where every earlier signal must survive a single step-by-step path through the sequence.

It also creates scaling pressure. In standard attention each position is compared with every allowed position, so the number of comparisons grows roughly with the square of the context length. That is one reason context length, KV cache behavior, FlashAttention, and other optimizations matter later.
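
A rough count makes the scaling pressure concrete. The sketch below counts query-key comparisons for one standard attention operation. The function name `score_count` is hypothetical, and the count ignores heads, layers, and any optimized kernels.

```python
def score_count(context_length, causal=False):
    """Number of query-key comparisons in one standard attention operation."""
    n = context_length
    if causal:
        # Position i compares against positions 0..i, so 1 + 2 + ... + n.
        return n * (n + 1) // 2
    # Without a mask, every position compares against every position.
    return n * n

for n in (1_000, 2_000, 4_000):
    print(n, score_count(n), score_count(n, causal=True))
```

Doubling the context length quadruples the unmasked count. The causal count is about half as large but grows at the same quadratic rate.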

Common Confusions

Attention is not consciousness or focus like a person feels it.

It is learned weighted computation over representations.

Attention is not vector search over your document database.

Both use comparison ideas, but transformer attention operates inside model computation over available representations. Retrieval systems search external stored items and add selected context through a product pipeline.

Attention does not replace the rest of the model.

It routes context signals. Other layers transform those signals and the training objective shapes the parameters.

What to study next

These links keep the session moving: read prerequisites first, then open the systems, concepts, and patterns that deepen this page.

Prerequisites

Read these first if the mechanics feel unfamiliar.

Transformer Architecture: Start here if Transformer Architecture is still fuzzy.
Tokens And Tokenization: Start here if Tokens And Tokenization is still fuzzy.
