Concepts

Inverted Index

Map searchable terms to the documents that contain them so search can find candidates without scanning every document.

intermediate4 min readUpdated 2026-05-15DataCapacityOperationsTradeoffs

SearchTokenizationPostings ListsQuery PlanningWrite Amplification

After this, you will understand

How Inverted Index helps you see where this idea appears in production systems, what problem forces it, and how to reason about the tradeoffs.

Naive mental model

Treat the idea as a definition to memorize.

Production pressure

Real systems force the idea to handle Search, Tokenization, and Postings Lists.

Better reasoning

Use the concept to decide what the system guarantees, what it risks, and what it costs to operate.

Think before readingWhere would Inverted Index appear in a real production system, and what failure or bottleneck would it help you reason about?

As you read, look for the pressure that creates the idea first. The mechanics matter more once the reason is clear.

Reading in progress

This page is saved in your local study history so you can continue later.

Next: Live Playback Window

Concepts Covered

Forward index vs inverted index
Terms, tokens, and documents
Postings lists
Candidate retrieval
Search freshness
Read and write tradeoffs
Why this is different from a normal database index

Definition

An inverted index maps a searchable term to the documents that contain that term.

Instead of asking "what words are inside document 123?", an inverted index asks "which documents contain the word kafka?"

Example:

kafka        -> [tweet_9, tweet_31, tweet_88]
replication  -> [tweet_31, tweet_52]
latency      -> [tweet_9, tweet_77, tweet_88]

That structure is the heart of most text search systems. It lets the search service find a small candidate set before ranking, filtering, and returning results.

The Pain That Forces Inverted Indexes

A naive search system stores posts in a table and runs a query like:

SELECT *
FROM posts
WHERE body ILIKE '%kafka%';

This works in a toy system. It fails when the corpus grows.

For every search request, the database may need to inspect huge amounts of text. The query cost grows with the number of posts, not just the number of matching posts. At social-network scale, this becomes impossible: every user query would become a large scan across a constantly growing body of text.

An inverted index exists because search must start from the query terms, not from the full document collection.

Mental Model

A normal row-oriented table is optimized around documents:

tweet_id -> author_id, body, created_at, engagement, language

An inverted index turns the access pattern around:

term -> documents containing that term

The search engine tokenizes the text, normalizes each token, and appends the document id to each term's postings list.

For a post like:

Kafka helps absorb traffic spikes

the index might receive:

kafka   -> tweet_123
absorb  -> tweet_123
traffic -> tweet_123
spike   -> tweet_123

Now a query for kafka traffic can retrieve candidate documents by intersecting or combining the postings lists for kafka and traffic.

What Lives In A Posting

A postings list can store more than document ids. It may include:

document id
term frequency
term positions
field information, such as title, body, hashtag, or username
timestamp
lightweight quality or filtering metadata

Term positions matter for phrase queries. If a user searches for "distributed systems", the engine needs to know whether those words appear next to each other, not merely somewhere in the same document.

Why It Is Not Just A B-Tree

A B-tree is excellent for ordered key lookups and range scans. It can help a database find a row by id, created_at, or short_code.

Text search has a different shape. The query is not "find the row where primary key equals X." The query is "find documents that contain these terms, maybe in this order, maybe with filters, maybe ranked by freshness, quality, and personalization."

An inverted index is built for term-to-document retrieval. B-trees may still appear around the system for metadata, dictionaries, checkpoints, or storage engines, but the core search retrieval structure is the inverted index.

Tradeoffs

Inverted indexes make reads fast, but writes become more expensive.

Every new document must be:

parsed
tokenized
normalized
written into multiple term lists
made visible to search
possibly replicated across serving nodes

One post can update many postings lists. That is write amplification. Search systems accept this cost because query-time scanning would be far worse.

Operational Reality

Important signals:

indexing lag
postings list size for hot terms
query latency by term
merge pressure
shard imbalance
memory used by term dictionaries
dropped or delayed index updates

Hot terms can behave like hot keys. A query for a very common term may touch enormous postings lists, while a rare term may be cheap. This is one reason search systems need query planning, caches, sharding, ranking cutoffs, and careful operational limits.

What to study next

These links keep the session moving: read prerequisites first, then open the systems, concepts, and patterns that deepen this page.

Used In Systems

System studies where this idea appears in context.

Twitter/X Real-Time Search SystemSee the idea under full production pressure.

Related Concepts

Core ideas that connect to this topic.

Database IndexingUnderstand the concept behind the design decision.Search Indexing PipelineUnderstand the concept behind the design decision.Query Fan-OutUnderstand the concept behind the design decision.