AI Concepts
Attention
Understand attention as the mechanism that lets token positions choose which context signals matter when their representations are updated.
After this, you will understand
How attention does its work, what tradeoff it introduces, and where it appears in AI systems.
Start with the word in plain English before adding machinery.
The idea becomes unclear when self-attention, queries, keys, and values are introduced too early.
Connect the word to inputs, outputs, model behavior, product boundaries, and evaluation.
Think before reading
Before learning the mechanics, what should a beginner understand about attention and self-attention?
Concepts Covered
- Attention
- Self-attention
- Queries
- Keys
- Values
- Attention weights
- Context mixing
- Causal boundaries
- Why attention cost grows with context
Definition
Attention is a mechanism that updates a representation by weighting information from other available representations.
In a transformer, self-attention lets token positions use other token positions in the same sequence as context.
Keep the plain-English question:
For this token position, what other positions should matter right now?
Attention turns that question into learned computation.
Why This Concept Exists
A token can be ambiguous until context settles it.
In:
The bank approved the loan.
"bank" should connect to a financial meaning.
In:
They sat by the river bank.
the surrounding tokens point to a different meaning: the edge of a river.
Attention gives the model a way to update token representations using relevant context instead of forcing all context through one fixed summary.
Queries, Keys, And Values
Attention is often introduced with three names:
- query
- key
- value
Use a retrieval-shaped mental model, carefully:
query -> what this position is looking for
key -> what each available position advertises
value -> what information can be mixed in
The model creates all three by applying learned projections to token representations.
The query is compared with keys. Those comparisons become attention weights. The weights control how values are combined into an updated representation.
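The projection step can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the helper `matvec`, the matrices `W_q`, `W_k`, `W_v`, and the two-dimensional token vectors are all hypothetical values chosen by hand. In a trained transformer the projection matrices are learned parameters.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2-dimensional representations for a 3-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical projection matrices (hand-picked here, learned in a real model).
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]
W_v = [[2.0, 0.0], [0.0, 2.0]]

# Every token position gets its own query, key, and value.
queries = [matvec(W_q, t) for t in tokens]
keys = [matvec(W_k, t) for t in tokens]
values = [matvec(W_v, t) for t in tokens]

# Raw comparison scores for position 0: its query against every key.
scores_for_position_0 = [dot(queries[0], k) for k in keys]
```

The same token vector produces three different vectors because each projection plays a different role: one for asking, one for being matched, one for being mixed in.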
A Small Attention Flow
A simplified self-attention step looks like this:
token representations
-> build queries, keys, values
-> compare each query with allowed keys
-> turn scores into weights
-> mix values using those weights
The output is not usually a copied sentence fragment. It is another numeric representation that carries context-shaped information forward into later model computation.
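The flow above can be sketched as a single-head attention function in plain Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the inputs are already-projected queries, keys, and values, and the division by the square root of the key dimension follows standard scaled dot-product attention, a detail the simplified flow above does not spell out.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head attention with no mask: every query sees every key."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this query with every key (scaled dot product).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # Turn scores into weights.
        weights = softmax(scores)
        # Mix values using those weights, one output dimension at a time.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy usage with hand-picked vectors (assumed values, not from a real model).
outputs, weights = attention(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

Each row of `weights` sums to 1, and each row of `outputs` is a weighted blend of the value vectors rather than a copy of any single one.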
Attention Weights Are Not Human Explanations
Attention weights show how an attention operation distributes weight over available positions.
That can be useful for understanding the mechanism.
But a high weight is not automatically a complete human explanation for why the whole model produced a final answer. Later layers, multiple heads, feed-forward transformations, output scoring, and product layers still shape behavior.
Self-Attention And Available Context
Self-attention means token positions attend over the sequence representations available in that attention operation.
The word "available" matters.
Some transformer setups allow a position to use tokens on both sides. Autoregressive language-model generation uses a causal boundary so a position cannot read the future tokens it is supposed to predict.
That boundary becomes important when we discuss masked attention.
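A causal boundary can be sketched as follows. This is an illustrative sketch with a hypothetical function name, `causal_weights`: it computes the softmax only over positions up to and including the current one and assigns exactly zero weight to later positions. Real implementations more commonly set the disallowed scores to negative infinity before the softmax, which yields the same weights.

```python
import math

def causal_weights(scores):
    """Row i of `scores` holds query i's raw scores against every key.

    Returns weights where position i attends only to positions 0..i.
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = row[: i + 1]  # keys at or before the current position
        largest = max(allowed)
        exps = [math.exp(s - largest) for s in allowed]
        total = sum(exps)
        future = [0.0] * (len(row) - i - 1)  # later positions get zero weight
        weights.append([e / total for e in exps] + future)
    return weights

# Toy usage: equal raw scores, so each row splits its weight evenly
# across the positions it is allowed to see.
masked = causal_weights([[0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0]])
```

The first position can only attend to itself, while the last position can attend to the whole sequence, which is what "available" means under a causal boundary.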
Why Attention Changed The Shape Of Language Models
Attention creates direct context interactions between token positions.
That makes long-range relationships easier to represent than a mental model where every earlier signal must survive a single step-by-step path through a sequence.
It also creates scaling pressure. In standard attention, each position's query is compared with every allowed key, so the number of comparisons grows roughly with the square of the context length. That is one reason context length, KV cache behavior, FlashAttention, and other optimizations matter later.
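The growth in comparisons can be made concrete with a small counting sketch. The function name `comparison_counts` is hypothetical, and the counts cover query-key pairs for a single attention head in a single layer only. They say nothing about memory layout or the cost of optimized kernels.

```python
def comparison_counts(n):
    """Count query-key comparisons for a sequence of n positions.

    full:   every position compared with every position (n * n)
    causal: position i compared with positions 0..i (n * (n + 1) / 2)
    """
    full = n * n
    causal = n * (n + 1) // 2
    return full, causal

# Doubling the context length roughly quadruples the comparisons.
for n in (1_000, 2_000, 4_000):
    full, causal = comparison_counts(n)
    print(f"n={n}: full={full}, causal={causal}")
```

A causal boundary roughly halves the count, but the growth is still quadratic in the number of positions.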
Common Confusions
Attention is not consciousness or focus like a person feels it.
It is learned weighted computation over representations.
Attention is not vector search over your document database.
Both use comparison ideas, but transformer attention operates inside model computation over available representations. Retrieval systems search external stored items and add selected context through a product pipeline.
Attention does not replace the rest of the model.
It routes context signals. Other layers transform those signals and the training objective shapes the parameters.