AI Foundations

Tokens And Tokenization

Tokens and tokenization explained in plain English, so software engineers understand how language models read, price, limit, and generate text.

foundation · 7 min read · Updated 2026-05-22
Foundations · Vocabulary · Mechanics · Inference
Tokens · Tokenization · Text Generation · Context Window · Prompt · Language Model

After this, you will understand

Tokens explain why language models do not read text exactly like humans, why context has limits, and why generation streams piece by piece.

Beginner version

A token is a piece of text that a language model processes; tokenization is the step that splits text into those pieces.

Confusion point

Beginners assume models read full words or sentences directly, then get confused by context limits, pricing, and strange text boundaries.

Better mental model

Treat text as model-readable pieces, then reason about prompt size, context windows, output length, streaming, latency, and cost.

Think before reading

Why does a long prompt cost more and sometimes fail to fit inside a model's context window?

Because the model receives text as tokens. More text usually means more tokens, and the model can only process a limited number of tokens at once.

Study path

Read these in order

Start with the mechanics, then move into the patterns that explain why the system is shaped this way.

  1. Parameters And Weights (ai-foundations)
  2. Prompts, Context, And Completions (ai-foundations)

Concepts Covered

  • Tokens
  • Tokenization
  • Language models
  • Prompt length
  • Context windows
  • Text generation
  • Streaming output
  • Cost and latency
  • Why tokens are not always words
  • Why this matters in AI products

1. Plain-English Definition

A token is a piece of text that a language model processes.

Tokenization is the step that splits text into tokens.

A token can be a whole word, part of a word, punctuation, whitespace, or another small text piece depending on the tokenizer.

For example, a sentence like:

AI is changing software.

may be split into pieces like:

AI | is | changing | software | .

But a longer or less common word might be split into smaller pieces.
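
To make the splitting concrete, here is a minimal sketch of a toy tokenizer. The vocabulary and the function name toy_tokenize are made up for illustration. Real tokenizers learn vocabularies with tens of thousands of pieces and use more careful rules, so treat this as a picture of the idea, not as how any particular model splits text.

```python
import re

# Hypothetical toy vocabulary, invented for this example.
TOY_VOCAB = {"AI", "is", "changing", "software", ".", "token", "ization"}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into words and punctuation, then break each chunk into the
    longest vocabulary pieces available, falling back to single characters."""
    tokens = []
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        start = 0
        while start < len(chunk):
            # Try the longest remaining piece first, then shorter ones.
            for end in range(len(chunk), start, -1):
                piece = chunk[start:end]
                if piece in vocab or end - start == 1:
                    tokens.append(piece)
                    start = end
                    break
    return tokens

print(toy_tokenize("AI is changing software."))  # ['AI', 'is', 'changing', 'software', '.']
print(toy_tokenize("tokenization"))              # ['token', 'ization']
```

The common words come out whole because they are in the toy vocabulary. The word "tokenization" is not, so it is covered by two smaller known pieces.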

The important beginner idea is:

Language models do not directly process text exactly as humans see it. They process tokens.

2. Why This Idea Exists

Tokens exist because models need a structured way to handle text.

Humans see language as words, phrases, punctuation, tone, and meaning. Computers need a representation they can process.

A language model cannot directly receive "meaning" as a clean object. The text first has to be converted into pieces the model knows how to work with.

Tokenization is that conversion step.

It gives the model a vocabulary of text pieces. The model learns patterns between those pieces during training. During inference, the model receives tokens as input and predicts tokens as output.

This is why tokens show up everywhere in language model products:

  • prompt limits
  • context windows
  • pricing
  • output length
  • streaming
  • latency
  • memory usage

If you do not understand tokens, a lot of LLM behavior feels random.

If you do understand tokens, many product constraints become easier to reason about.

3. The Beginner Mental Model

Think of tokens as the model's reading units.

Humans read words and sentences. Language models process tokens.

text -> tokenizer -> tokens -> model

And when a model generates text:

model -> next token -> next token -> next token -> text

That is why generated answers often stream gradually. The model is producing pieces of text one after another.
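
The streaming behavior can be sketched with a Python generator. The lookup table below is a hypothetical stand-in for a trained model, and the "<end>" marker is an assumption made for this example. A real model computes the next token from the whole context, not from a fixed table.

```python
# Hypothetical table standing in for a trained model: last token -> next token.
TOY_NEXT_TOKEN = {
    "AI": " is",
    " is": " changing",
    " changing": " software",
    " software": ".",
    ".": "<end>",
}

def stream_tokens(first_token, max_new_tokens=10):
    """Yield generated tokens one at a time, the way a streaming API
    hands pieces of the answer to the caller as they are produced."""
    current = first_token
    for _ in range(max_new_tokens):
        next_token = TOY_NEXT_TOKEN.get(current, "<end>")
        if next_token == "<end>":
            return
        yield next_token
        current = next_token

for piece in stream_tokens("AI"):
    print(repr(piece))  # each piece arrives separately
```

The caller can display each piece as soon as it arrives, which is why answers appear gradually instead of all at once.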

This does not mean the model is thinking one English word at a time. It is predicting token sequences based on patterns it learned.

For beginner purposes, the useful model is:

The tokenizer turns text into model-readable pieces. The model reads and writes those pieces.

4. What That Mental Model Misses

The reading-unit model is useful, but it hides some important details.

First, tokens are not always words. A common word may be one token. A rare word may be split into several tokens. Punctuation and spaces can matter too.

Second, different models can use different tokenizers. The exact split is not universal across every AI system.

Third, token count is not the same thing as character count. A short-looking string can be many tokens if it contains unusual symbols, code, IDs, or text in certain languages.
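
One rough way to see why character count misleads: some widely used tokenizers work on UTF-8 bytes underneath, and many characters outside basic English text take several bytes each. The sketch below only compares characters with bytes. It does not compute a real token count, which depends on the specific tokenizer.

```python
def char_and_byte_counts(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

for sample in ["software", "こんにちは"]:
    chars, size = char_and_byte_counts(sample)
    print(f"{sample!r}: {chars} characters, {size} UTF-8 bytes")
```

Both strings look short, but the second one is five characters and fifteen bytes, so a byte-based tokenizer has more raw material to cover.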

Fourth, tokens do not by themselves contain understanding. They are pieces of text. The model learns patterns over token sequences during training.

Fifth, token limits shape product design. If a model can only receive a certain number of tokens, the product must choose what context to include and what to leave out.

That last point is huge. Context engineering exists partly because tokens are limited.

5. A Concrete Example

Imagine you are building a coding assistant.

The user asks:

Why is this checkout function failing?

The product might want to send the model:

  • the user's question
  • the current function
  • nearby code
  • error logs
  • package versions
  • relevant tests
  • previous conversation

All of that becomes tokens.

If the product sends too little, the model may not have enough context.

If the product sends too much, the request may become expensive, slow, or exceed the model's context window.

So the product has to choose.

available project context -> select useful pieces -> tokenize -> send to model

This is why "just send the whole codebase" is usually not a serious product plan. The model has token limits, latency limits, and cost limits.

Tokens turn context selection into an engineering problem.
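
A minimal sketch of that selection step might look like the following. The names estimate_tokens and select_context are hypothetical, the priorities are chosen by hand, and the "about 4 characters per token" figure is only a rough rule of thumb for English text. A real product would count tokens with the model's own tokenizer and rank pieces by relevance.

```python
def estimate_tokens(text):
    # Rough assumption: about 4 characters per token for English text.
    return max(1, len(text) // 4)

def select_context(pieces, budget):
    """Pick context pieces in priority order until the token budget is used.

    pieces: list of (priority, name, text); a lower number is more important.
    Returns (names of selected pieces, estimated tokens used).
    """
    selected, used = [], 0
    for _priority, name, text in sorted(pieces, key=lambda piece: piece[0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            selected.append(name)
            used += cost
    return selected, used

pieces = [
    (1, "question", "x" * 40),
    (2, "current function", "x" * 400),
    (3, "error logs", "x" * 4000),
    (4, "relevant tests", "x" * 200),
]
print(select_context(pieces, budget=200))
```

With a budget of 200 estimated tokens, the large error log is skipped while the smaller pieces still fit. Whether that is the right trade-off depends on the product.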

6. How It Works At A Practical Level

At a practical level, a language model request goes through a flow like this:

user text -> tokenizer -> input tokens -> model -> output tokens -> text

The tokenizer maps text pieces into token IDs. A token ID is a numeric identifier for a token in the model's vocabulary.

You do not need to memorize token IDs as a beginner. The important thing is that the model receives numbers representing tokens, not raw human meaning.
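
As an illustration of token IDs, the sketch below builds a tiny vocabulary from a list of tokens and converts between tokens and numbers. The helper names build_vocab, encode, and decode are made up here. Real vocabularies are fixed when the tokenizer is trained, not built per request.

```python
def build_vocab(tokens):
    """Assign each distinct token the next unused integer ID."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Turn tokens into the numeric IDs the model actually receives."""
    return [vocab[token] for token in tokens]

def decode(ids, vocab):
    """Turn numeric IDs back into token strings."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return [id_to_token[token_id] for token_id in ids]

tokens = ["AI", " is", " changing", " software", "."]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)                          # [0, 1, 2, 3, 4]
print("".join(decode(ids, vocab)))  # AI is changing software.
```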

During inference, the model looks at the input tokens and predicts what token should come next. Then it can use the new token as part of the context to predict the next one, and so on.

This repeated next-token generation is one reason language models can write paragraphs, code, summaries, and answers.

Tokens also affect cost. Many model providers price usage by input tokens and output tokens.
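
The arithmetic behind token pricing can be sketched as follows. The prices in the usage line are placeholders, not any provider's real rates, and the per-million-token unit is an assumption about how the prices are quoted.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one request when input and output tokens
    are billed at separate per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost

# Placeholder prices, chosen only to show the calculation.
print(estimate_cost(1000, 500, input_price_per_million=2.0, output_price_per_million=8.0))
```

In this made-up case the 500 output tokens cost more than the 1000 input tokens, because the output price is higher.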

Tokens affect latency too. More input tokens can take longer to process. More output tokens take longer to generate.
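
A very simplified latency model, under the assumption that input tokens are processed at one rate and output tokens are generated at a slower rate, could look like this. The parameter names and the speeds in the usage line are hypothetical. Real latency also depends on hardware, load, and network time.

```python
def estimate_latency_seconds(input_tokens, output_tokens,
                             input_tokens_per_second, output_tokens_per_second):
    """Estimate request time as input processing time plus generation time."""
    input_time = input_tokens / input_tokens_per_second
    output_time = output_tokens / output_tokens_per_second
    return input_time + output_time

# Placeholder speeds, chosen only to show the calculation.
print(estimate_latency_seconds(2000, 300, 4000.0, 50.0))  # 6.5
```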

Tokens affect context windows. A context window is the maximum number of tokens the model can consider in one request.
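
A simple pre-flight check can be sketched like this. It assumes the input tokens and the generated output share one window, which is common but worth confirming for the specific model. The function names and the window size in the example are hypothetical.

```python
def fits_context(input_tokens, max_output_tokens, context_window):
    """Return True if the input plus the reserved output fits in the window."""
    return input_tokens + max_output_tokens <= context_window

def max_input_tokens(context_window, reserved_output_tokens):
    """Return how many input tokens remain after reserving room for the answer."""
    return max(0, context_window - reserved_output_tokens)

print(fits_context(7000, 1000, context_window=8000))  # True
print(fits_context(7500, 1000, context_window=8000))  # False
print(max_input_tokens(8000, 1000))                   # 7000
```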

So tokens are not just an internal detail. They shape the product.

7. Where You See This In Real AI Products

In a ChatGPT-style assistant, your message, the conversation history, hidden instructions, tool results, and generated answer all involve tokens.

In a Perplexity-style search product, retrieved passages are added to the prompt as tokens. The product has to decide which passages are worth spending context on.

In a coding assistant, file snippets and error logs become tokens. The assistant cannot always include every file, so context selection matters.

In a document Q&A system, the system may split documents into chunks because the full document may not fit into the model context.
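
Chunking can be sketched as slicing a token list into fixed-size windows, optionally with some overlap so that a sentence cut at a boundary still appears whole in one chunk. The function name chunk_tokens and the sizes below are illustrative. Real systems often split on paragraph or section boundaries instead.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a list of tokens into chunks of at most chunk_size tokens,
    repeating the last `overlap` tokens of each chunk in the next one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks

print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```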

In an AI agent, the agent's instructions, tool descriptions, intermediate results, and conversation history consume tokens.

In all of these products, token usage affects quality, cost, latency, and reliability.

8. Common Confusions

A token is not always a word.

Some words are one token. Some are multiple tokens. Some tokens are punctuation or whitespace-like pieces.

Tokenization is not the same thing as embedding.

Tokenization splits text into pieces. Embedding maps a token, or a larger piece of text, to a list of numbers (a vector) that can capture useful meaning.

The context window is not the same thing as model memory.

The context window is what the model can consider in a request. Long-term product memory usually requires external storage and retrieval.
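
The difference shows up as soon as a conversation outgrows the window. The sketch below keeps only the most recent messages that fit a token budget. The function name trim_history is hypothetical, and counting whitespace-separated words is a stand-in for a real tokenizer.

```python
def trim_history(messages, budget, count_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

def count_words(message):
    # Stand-in for a real token counter.
    return len(message.split())

history = ["my name is Ada", "what is a token", "and a context window"]
print(trim_history(history, budget=8, count_tokens=count_words))
# ['what is a token', 'and a context window']
```

The oldest message, which contained the user's name, no longer reaches the model. If the product needs to remember it, it has to store that fact somewhere else and bring it back when relevant.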

More tokens are not always better.

More context can help, but irrelevant context can distract the model, increase cost, and slow the product.

Output tokens are not free.

Long answers cost more and take longer to generate.

9. What This Does Not Mean

This does not mean you need to manually split every prompt into tokens.

Most product code uses model APIs or libraries that handle tokenization.

This does not mean tokenization explains all model behavior.

It explains the text representation layer, not the full model.

This does not mean bigger context windows remove the need for good context selection.

Large context windows help, but products still need to choose useful information, control cost, avoid leaking private data, and evaluate output quality.

This does not mean tokens are only a billing concept.

Billing is one visible place tokens appear, but tokens also shape generation, limits, latency, and architecture.

10. What To Learn Next

Next, learn prompts, context, and completions.

Tokens explain how text is broken into model-readable pieces.

Prompts explain what input you give the model.

Context explains what information the model can see during a request.

Completions explain the output the model produces.

Together, these ideas make language model products much easier to understand:

prompt + context -> tokens -> model -> completion

Once that is clear, embeddings, retrieval, RAG, and agents become less intimidating because you can see how information enters the model and how output comes back.
