Search Indexing Pipeline
Move newly created or updated documents from the write path into searchable indexes with bounded freshness and retryable processing.
Concepts Covered
- Document write path
- Index update events
- Tokenization and enrichment
- Fresh index segments
- Backfills
- Replay and retry
- Indexing lag
Definition
A search indexing pipeline is the asynchronous path that turns newly written documents into searchable index data.
When a user creates a post, the post usually becomes durable in a source-of-truth store first. Search visibility is a separate concern. The indexing pipeline consumes the new document, processes it, writes it into an index, and makes it available for queries.
The Pain That Forces An Indexing Pipeline
A naive implementation tries to update the search index directly inside the user request:
create post
-> write post to database
-> tokenize text
-> update search index
-> replicate index update
-> return success
That creates a fragile write path. If the search index is slow, the user cannot post. If tokenization or enrichment is expensive, the request gets slower. If the index update fails after the post database commit succeeds, the post exists but cannot be searched.
The search pipeline exists because durable posting and searchable visibility are related, but they should not be the same synchronous operation.
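A minimal sketch of the decoupled shape, in Python, with an in-memory dict and queue standing in for the real source-of-truth store and event stream:

import time
import uuid
from collections import deque

posts = {}               # stand-in for the source-of-truth store
index_events = deque()   # stand-in for the durable event stream the pipeline consumes

def create_post(author_id: str, text: str) -> dict:
    # 1. Make the post durable in the source of truth.
    post = {
        "id": str(uuid.uuid4()),
        "author_id": author_id,
        "text": text,
        "created_at": time.time(),
    }
    posts[post["id"]] = post

    # 2. Record that search needs to catch up. Indexing happens later, off the request path.
    index_events.append({"doc_id": post["id"], "event_time": post["created_at"]})

    # 3. Return success without waiting for tokenization, index writes, or replication.
    return post

If the search index is slow or briefly down, this request path does not notice; the queue simply grows and indexing lag rises until the processors catch up.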
Mental Model
Think of indexing as a durable conveyor belt:
source write -> index event -> processor -> fresh index segment -> search serving
The source write records the truth. The event records that search needs to catch up. Processors transform the document into searchable terms and metadata. Search serving nodes load the resulting index data.
The system is allowed to be briefly behind, but it should know how far behind it is.
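One way to make the belt concrete is the event record that travels on it; the field names below are illustrative, not a required schema:

from dataclasses import dataclass

@dataclass
class IndexEvent:
    doc_id: str        # which source document search needs to catch up on
    operation: str     # "create", "update", or "delete"
    version: int       # source version, so stale or duplicate events can be ignored
    event_time: float  # when the source write happened; the basis for knowing how far behind search is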
What The Pipeline Does
Common stages:
- read document change events
- fetch or receive the document payload
- tokenize text
- normalize terms
- detect language
- extract hashtags, mentions, URLs, and media signals
- attach metadata such as author, timestamp, visibility, and safety state
- write terms into index structures
- publish or load index segments into search serving nodes
The exact stages vary, but the important split is stable: the product write path makes the post durable; the indexing path makes it discoverable.
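A compressed sketch of the middle stages, with illustrative field names and a regex tokenizer standing in for a real analyzer; language detection and safety state are left out for brevity:

import re

def process_document(doc: dict) -> dict:
    text = doc["text"]

    # Tokenize and normalize: lowercase, split on non-word characters, drop empties.
    tokens = [t for t in re.split(r"[^\w#@]+", text.lower()) if t]

    # Extract structured signals that queries and ranking will want.
    hashtags = [t for t in tokens if t.startswith("#")]
    mentions = [t for t in tokens if t.startswith("@")]
    urls = re.findall(r"https?://\S+", text)

    # Attach metadata from the source document so the index can filter on it.
    return {
        "doc_id": doc["id"],
        "terms": [t.lstrip("#@") for t in tokens],
        "hashtags": hashtags,
        "mentions": mentions,
        "urls": urls,
        "author_id": doc["author_id"],
        "created_at": doc["created_at"],
        "visibility": doc.get("visibility", "public"),
    }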
Freshness
Freshness describes how quickly a new or updated document becomes searchable.
For a real-time social search product, freshness matters. Users expect a post about a live event to appear quickly. But "quickly" is not the same as "inside the original write transaction."
The system usually defines an acceptable freshness target:
p50 searchable in 1 second
p95 searchable in 5 seconds
p99 searchable in 30 seconds
Those numbers are product and scale decisions. The key is that freshness is measured and operated against, not assumed.
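Measuring it is mostly bookkeeping: keep the write timestamp and the first-searchable timestamp per document, then report percentiles over the differences. A small sketch with sample data:

# (written_at, searchable_at) pairs sampled from recent documents; sample values here.
recent_documents = [(0.0, 0.8), (0.0, 1.2), (0.0, 4.0), (0.0, 6.5), (0.0, 28.0)]

def percentile(values: list[float], p: float) -> float:
    # Nearest-rank percentile; good enough for an operational dashboard.
    ordered = sorted(values)
    rank = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[rank]

freshness = [searchable_at - written_at for written_at, searchable_at in recent_documents]
for p in (50, 95, 99):
    print(f"p{p} searchable in {percentile(freshness, p):.1f} seconds")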
Failure Handling
Indexing workers should be idempotent. If the same event is processed twice, the resulting index should not contain duplicate documents.
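Idempotency usually comes from keying the index by document id and only applying newer versions, so a duplicate event is a harmless overwrite. A sketch with a dict standing in for the index:

index = {}  # doc_id -> indexed record; stand-in for the real index structure

def apply_index_write(doc_id: str, version: int, record: dict) -> None:
    existing = index.get(doc_id)
    # A duplicate or out-of-order older event changes nothing.
    if existing is not None and existing["version"] >= version:
        return
    index[doc_id] = {"version": version, **record}

apply_index_write("post-123", 1, {"terms": ["hello"]})
apply_index_write("post-123", 1, {"terms": ["hello"]})  # processed twice: still one document
print(len(index))  # 1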
Failures should be retryable:
- temporary processor failures retry
- bad documents move to a dead-letter queue
- missed events can be replayed
- full backfills can rebuild an index from source data
This is why search pipelines often lean on event streams, idempotent consumers, dead-letter queues, and backpressure.
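A sketch of that consumer shape: transient failures retry up to a cap, documents that keep failing are parked in a dead-letter queue instead of blocking the stream, and a backfill is just replaying events for every source document. The process function is a placeholder for the real tokenize, enrich, and index-write steps.

from collections import deque

events = deque()        # incoming index events
dead_letters = deque()  # events that repeatedly failed and need attention
MAX_ATTEMPTS = 3

def process(event: dict) -> None:
    # Placeholder for the real processing work; raises on failure.
    if event.get("poison"):
        raise ValueError("unprocessable document")

def run_consumer() -> None:
    while events:
        event = events.popleft()
        try:
            process(event)
        except Exception:
            event["attempts"] = event.get("attempts", 0) + 1
            if event["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(event)  # park it; keep the rest of the stream moving
            else:
                events.append(event)        # retry later

def backfill(source_documents: list[dict]) -> None:
    # Rebuild the index by replaying a synthetic event for every document in the source of truth.
    for doc in source_documents:
        events.append({"doc_id": doc["id"]})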
Operational Reality
Important signals:
- indexing lag
- events processed per second
- failed documents
- dead-letter queue depth
- processor CPU and memory
- index segment merge pressure
- freshness percentiles
- difference between source document count and indexed document count
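Most of these reduce to simple gauges. Two examples, assuming the oldest pending event carries its source timestamp and both stores can be counted cheaply:

import time

def indexing_lag_seconds(oldest_pending_event_time: float) -> float:
    # How far behind search is: the age of the oldest event not yet applied to a segment.
    return time.time() - oldest_pending_event_time

def document_count_gap(source_count: int, indexed_count: int) -> int:
    # A gap that keeps growing means documents are being dropped somewhere in the pipeline.
    return source_count - indexed_count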