Concepts
Inverted Index
Map searchable terms to the documents that contain them so search can find candidates without scanning every document.
Concepts Covered
- Forward index vs inverted index
- Terms, tokens, and documents
- Postings lists
- Candidate retrieval
- Search freshness
- Read and write tradeoffs
- Why this is different from a normal database index
Definition
An inverted index maps a searchable term to the documents that contain that term.
Instead of asking "what words are inside document 123?", an inverted index asks "which documents contain the word kafka?"
Example:
kafka -> [tweet_9, tweet_31, tweet_88]
replication -> [tweet_31, tweet_52]
latency -> [tweet_9, tweet_77, tweet_88]
That structure is the heart of most text search systems. It lets the search service find a small candidate set before ranking, filtering, and returning results.
The Pain That Forces Inverted Indexes
A naive search system stores posts in a table and runs a query like:
SELECT *
FROM posts
WHERE body ILIKE '%kafka%';
This works in a toy system. It fails when the corpus grows.
For every search request, the database may need to inspect huge amounts of text. The query cost grows with the number of posts, not just the number of matching posts. At social-network scale, this becomes impossible: every user query would become a large scan across a constantly growing body of text.
An inverted index exists because search must start from the query terms, not from the full document collection.
Mental Model
A normal row-oriented table is optimized around documents:
tweet_id -> author_id, body, created_at, engagement, language
An inverted index turns the access pattern around:
term -> documents containing that term
The search engine tokenizes the text, normalizes each token, and appends the document id to each term's postings list.
For a post like:
Kafka helps absorb traffic spikes
the index might receive:
kafka -> tweet_123
absorb -> tweet_123
traffic -> tweet_123
spike -> tweet_123
Now a query for kafka traffic can retrieve candidate documents by intersecting or combining the postings lists for kafka and traffic.
What Lives In A Posting
A postings list can store more than document ids. It may include:
- document id
- term frequency
- term positions
- field information, such as title, body, hashtag, or username
- timestamp
- lightweight quality or filtering metadata
Term positions matter for phrase queries. If a user searches for "distributed systems", the engine needs to know whether those words appear next to each other, not merely somewhere in the same document.
Why It Is Not Just A B-Tree
A B-tree is excellent for ordered key lookups and range scans. It can help a database find a row by id, created_at, or short_code.
Text search has a different shape. The query is not "find the row where primary key equals X." The query is "find documents that contain these terms, maybe in this order, maybe with filters, maybe ranked by freshness, quality, and personalization."
An inverted index is built for term-to-document retrieval. B-trees may still appear around the system for metadata, dictionaries, checkpoints, or storage engines, but the core search retrieval structure is the inverted index.
Tradeoffs
Inverted indexes make reads fast, but writes become more expensive.
Every new document must be:
- parsed
- tokenized
- normalized
- written into multiple term lists
- made visible to search
- possibly replicated across serving nodes
One post can update many postings lists. That is write amplification. Search systems accept this cost because query-time scanning would be far worse.
Operational Reality
Important signals:
- indexing lag
- postings list size for hot terms
- query latency by term
- merge pressure
- shard imbalance
- memory used by term dictionaries
- dropped or delayed index updates
Hot terms can behave like hot keys. A query for a very common term may touch enormous postings lists, while a rare term may be cheap. This is one reason search systems need query planning, caches, sharding, ranking cutoffs, and careful operational limits.
Knowledge links
Use these links to understand what to know first, where this idea appears, and what to study next.
Used In Systems
System studies where this idea appears in context.
Related Concepts
Core ideas that connect to this topic.