Inverted Index
What Problem Does It Solve?
You have 100 million documents. Someone searches for "consensus protocol." Without an index, you scan every document, check every word, and return matches. That takes minutes, maybe hours. With an inverted index, you look up "consensus" in what amounts to a hash table, get back a sorted list of document IDs that contain it, do the same for "protocol," intersect the two lists, and return results. That takes milliseconds.
The inverted index is the single most important data structure in information retrieval. Google, Elasticsearch, Solr, PostgreSQL full-text search, SQLite FTS: every search system you have used is built on top of one.
The name "inverted" comes from the reversal of perspective. A forward index maps documents to their terms: "document 1 contains words [cat, sat, mat]." An inverted index maps terms to their documents: "the word cat appears in documents [1, 47, 2093]." This inversion is what makes search fast.
Anatomy of a Posting List
The inverted index is a dictionary of terms, each pointing to a posting list. The simplest posting list is just a sorted array of document IDs:
"database" → [3, 17, 42, 88, 156, 302, 1847]
But real posting lists carry more information per entry:
- Document ID: which document contains the term
- Term frequency: how many times the term appears in this document (needed for BM25 scoring)
- Positions: the exact word positions within the document (needed for phrase queries and proximity queries)
- Offsets: character-level start and end positions (needed for highlighting search results)
Lucene stores these in separate files: .doc for document IDs and frequencies, .pos for positions, .pay for payloads and offsets. This separation lets you read only what you need. A simple Boolean query only needs document IDs. A BM25 ranking query needs IDs and frequencies. Phrase queries additionally need positions.
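To make the structure concrete, here is a minimal in-memory sketch that records document IDs, term frequencies, and positions. The names and layout are illustrative, not Lucene's API:

```python
from collections import defaultdict

# term -> posting list; each posting is (doc_id, term_frequency, positions)
index = defaultdict(list)

def add_document(doc_id, text):
    """Tokenize naively on whitespace and record one posting per unique term."""
    positions = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        positions[token].append(pos)
    for term, pos_list in positions.items():
        # Documents arrive in increasing doc_id order, so each
        # posting list stays sorted by document ID automatically.
        index[term].append((doc_id, len(pos_list), pos_list))

add_document(1, "the cat sat on the mat")
add_document(2, "the cat chased the cat")

print(index["cat"])
# [(1, 1, [1]), (2, 2, [1, 4])]
```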
Posting List Compression
An uncompressed posting list of 10 million document IDs at 4 bytes each takes 40 MB. Multiply that by hundreds of thousands of unique terms in a real corpus, and you are looking at terabytes of raw posting data. Compression is not optional.
The key insight is that posting lists are sorted. Instead of storing absolute document IDs [3, 17, 42, 88], you store gaps between consecutive IDs [3, 14, 25, 46]. These gaps (called d-gaps or deltas) are much smaller numbers, and smaller numbers compress better.
Variable Byte (VByte) encoding is the simplest approach. Each byte uses 7 bits for data and 1 bit to signal "more bytes follow." Small gaps (under 128) take 1 byte; larger gaps take 2-5 bytes. VByte is fast to decode because it works byte-by-byte with no bit manipulation.
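A sketch of both ideas together: delta-encode the sorted document IDs, then VByte-encode each gap. This uses one common convention (stop bit set on the final byte); some variants flip the bit's meaning:

```python
from itertools import accumulate

def vbyte_encode(numbers):
    """Encode each number as 7-bit groups, least significant first;
    the high bit marks the final byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)      # 7 data bits, high bit clear: more to come
            n >>= 7
        out.append(n | 0x80)          # final byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:               # high bit set: this byte completes a number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

doc_ids = [3, 17, 42, 88]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]  # [3, 14, 25, 46]
encoded = vbyte_encode(gaps)                                         # 4 bytes: every gap < 128
assert list(accumulate(vbyte_decode(encoded))) == doc_ids            # prefix sums restore the IDs
```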
PForDelta (Patched Frame of Reference) packs blocks of 128 gaps using the minimum number of bits needed for most of them. If 120 out of 128 gaps fit in 6 bits, use 6 bits for the block and store the 8 exceptions separately. This achieves better compression than VByte and is SIMD-friendly.
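A much-simplified sketch of the idea. Real PForDelta packs the bits contiguously and decodes with SIMD; this only shows the width-and-exception split:

```python
def pfor_plan(gaps, exception_rate=0.1):
    """Choose a bit width covering ~90% of the gaps; the rest become exceptions."""
    widths = sorted(g.bit_length() for g in gaps)
    b = widths[max(int(len(widths) * (1 - exception_rate)) - 1, 0)]
    exceptions = [(i, g) for i, g in enumerate(gaps) if g.bit_length() > b]
    packed_bits = len(gaps) * b          # the main block: b bits per entry
    return b, exceptions, packed_bits

gaps = [5, 12, 9, 3, 7, 11, 2, 1000]     # one outlier
print(pfor_plan(gaps))
# (4, [(7, 1000)], 32) -- 4 bits cover the small gaps; 1000 is stored as an exception
```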
Roaring Bitmaps take a different approach entirely. They split the document ID space into chunks of 65536 and choose between a bitmap, a sorted array, or a run-length encoding for each chunk depending on density. Roaring is popular for filter-style queries (give me all documents matching this tag) where you need fast set operations.
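A sketch of the container-selection rule. The thresholds approximately follow the published Roaring heuristics (arrays for up to 4096 values per chunk, 4 bytes per run, a flat 8 KB bitmap otherwise):

```python
def choose_container(ids_in_chunk):
    """Pick a representation for one 65,536-ID chunk, Roaring-style.
    ids_in_chunk: sorted low 16 bits of IDs sharing the same high 16 bits."""
    n = len(ids_in_chunk)
    if n <= 4096:
        return "array"       # 2 bytes/entry: at most 8 KB, never worse than a bitmap
    runs = 1 + sum(1 for a, b in zip(ids_in_chunk, ids_in_chunk[1:]) if b != a + 1)
    if 4 * runs < 8192:      # each run costs ~4 bytes; the bitmap is a flat 8 KB
        return "runs"
    return "bitmap"          # 65,536 bits = 8 KB, O(1) membership tests
```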
Lucene uses a combination of these. It moved from VByte to a PForDelta-style block format in the Lucene 4.x series for better throughput and has continued evolving its codec, with the current default using SIMD-accelerated block decoding.
Building the Index: The Analyzer Chain
Before anything can be indexed, raw text must be transformed into a sequence of terms (tokens). The analyzer chain controls this transformation, and every decision here is permanent for the life of that index.
A typical analyzer chain for English:
- Character filters: strip HTML tags, normalize Unicode
- Tokenizer: split on whitespace and punctuation ("don't" becomes "don" and "t", or "dont", depending on your tokenizer)
- Token filters: lowercase everything, apply stemming ("running" becomes "run"), remove stop words ("the", "is", "at"), expand synonyms
The same analyzer must run on both the indexed documents and the query terms. If you stem "running" to "run" at index time, your query for "running" must also stem to "run" or it will not match. This is the most common source of "my search returns no results" bugs.
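A toy analyzer illustrating why both sides must agree. The stemming here is a crude suffix strip standing in for a real stemmer like Porter:

```python
STOP_WORDS = {"the", "is", "at", "a", "an", "of"}

def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, strip punctuation,
    drop stop words, crudely stem -ing/-s suffixes."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,!?\"'()")
        if not token or token in STOP_WORDS:
            continue
        for suffix in ("ing", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

# The same function must run at index time and at query time:
doc_terms = analyze("The cat is running.")   # ['cat', 'runn'] -- crude stem, but consistent
query_terms = analyze("Running")             # ['runn']
assert set(query_terms) <= set(doc_terms)    # matches only because both sides stemmed
```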
Different fields often need different analyzers. A product SKU field should use a keyword analyzer (no tokenization at all). A title field might use a standard analyzer with stemming. A description field might add synonym expansion. Choosing the wrong analyzer for a field silently destroys search quality in ways that are hard to diagnose after the fact.
Query Execution: Set Operations on Posting Lists
Once you have an inverted index, query execution reduces to operations on sorted lists.
AND query ("database AND replication"): retrieve the posting list for "database" and the posting list for "replication," then intersect them. Since both lists are sorted by document ID, you can walk through them simultaneously with two pointers. When both pointers point to the same document ID, that document matches. Advance whichever pointer is behind. This runs in O(m + n) where m and n are the list lengths.
OR query ("database OR replication"): merge both posting lists. Every document in either list matches. Walk through both lists, outputting every unique document ID. Also O(m + n).
NOT query ("database NOT mysql"): walk through the "database" list and skip any document that also appears in the "mysql" list. Straightforward with sorted lists.
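The three Boolean operations as pointer walks over sorted lists. This is a sketch; real engines add skip pointers and block-at-a-time decoding:

```python
def intersect(a, b):
    """AND: advance whichever pointer is behind; emit on equality. O(m + n)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge both lists, emitting each unique doc ID once. O(m + n)."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i]); i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j]); j += 1
        else:
            out.append(a[i]); i += 1; j += 1   # equal IDs: emit once, advance both
    return out

def difference(a, b):
    """NOT: keep entries of a that never appear in b."""
    out, j = [], 0
    for doc in a:
        while j < len(b) and b[j] < doc:
            j += 1
        if j == len(b) or b[j] != doc:
            out.append(doc)
    return out

database    = [3, 17, 42, 88, 156]
replication = [17, 42, 90, 156]
print(intersect(database, replication))   # [17, 42, 156]
```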
Phrase query ("distributed consensus" as an exact phrase): first intersect the posting lists for "distributed" and "consensus." Then, for each document in the intersection, check the position lists. "Distributed" at position 5 and "consensus" at position 6 means the phrase appears. "Distributed" at position 5 and "consensus" at position 47 does not.
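A sketch of the position check for a two-term phrase, given one document's position lists for both terms:

```python
def phrase_match(positions_a, positions_b):
    """True if some position p of term A has p + 1 in term B: adjacent and in order."""
    b_positions = set(positions_b)
    return any(p + 1 in b_positions for p in positions_a)

# "distributed" at [5, 30], "consensus" at [6, 47]: positions 5 and 6 are adjacent
assert phrase_match([5, 30], [6, 47])
assert not phrase_match([5], [47])
```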
Skip pointers accelerate intersection on very long posting lists. Instead of walking through every entry, skip pointers let you jump ahead. If you are intersecting a short list (1000 entries) with a long list (10 million entries), you can use the short list's document IDs to binary search or skip through the long list instead of scanning it linearly.
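A sketch of that asymmetric case, using binary search from the short list into the long one. Real skip pointers are embedded in the compressed list itself; `bisect` stands in for them here:

```python
from bisect import bisect_left

def intersect_short_long(short, long):
    """For each ID in the short list, binary-search the long list.
    Runs in O(m log n) rather than O(m + n): a win when m << n."""
    out = []
    lo = 0                             # only search the remaining suffix of `long`
    for doc in short:
        lo = bisect_left(long, doc, lo)
        if lo < len(long) and long[lo] == doc:
            out.append(doc)
    return out

print(intersect_short_long([42, 156], list(range(0, 10_000_000, 2))))  # [42, 156]
```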
Segment Architecture: Why Immutability Matters
Lucene builds its inverted index as a sequence of immutable segments. When you index a document, it goes into an in-memory buffer. When the buffer fills up (or a flush is triggered), it writes a new segment to disk. Each segment is a complete, self-contained inverted index.
Why immutable? Because updating a sorted posting list in place is expensive. If "database" has a posting list of 10 million entries and you need to insert document 5,000,001 in the middle, you would need to shift millions of entries. Immutable segments avoid this entirely. New documents go into new segments. Deletions are recorded in a separate bitset (a "live docs" bitmap). Searches query all segments and skip deleted documents.
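A sketch of how that plays out at query time: each segment's postings never change, and a separate live-docs set filters out deletions (illustrative structure, not Lucene's classes):

```python
class Segment:
    def __init__(self, postings):
        self.postings = postings          # term -> sorted doc IDs, immutable
        self.live_docs = set()            # doc IDs still alive in this segment
        for ids in postings.values():
            self.live_docs.update(ids)

    def delete(self, doc_id):
        self.live_docs.discard(doc_id)    # postings untouched; a merge reclaims the space

    def search(self, term):
        return [d for d in self.postings.get(term, []) if d in self.live_docs]

seg = Segment({"database": [3, 17, 42]})
seg.delete(17)
print(seg.search("database"))             # [3, 42] -- 17 is filtered out at query time
```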
Over time, segments accumulate. Too many segments means too many posting lists to merge at query time. Lucene runs a background merge process that combines small segments into larger ones, drops deleted documents, and compacts the index. The merge policy controls how aggressively this happens. Tiered merge (the default) groups segments by size and merges similar-sized segments together.
Elasticsearch adds a layer on top. Each Elasticsearch shard is a Lucene index with its own segments. An Elasticsearch index with 5 primary shards has 5 independent Lucene indices, each with its own merge process. Search queries fan out to all shards, each shard returns its top results, and a coordinating node merges and re-ranks.
The Real-Time Search Problem
Immutable segments create a latency gap. A document written to the in-memory buffer is not searchable until a refresh turns that buffer into a segment. Elasticsearch refreshes every second by default (configurable via refresh_interval).
This one-second delay is the "near real-time" in Lucene's NRT search. For most applications, it is unnoticeable. For real-time dashboards or chat search, it can matter.
Reducing the refresh interval below one second is possible but creates more segments, which means more merging, more file handles, and more overhead per query. The right balance depends on your latency requirements and indexing throughput.
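In Elasticsearch the trade-off is exposed as an index setting. For example, relaxing the interval on a bulk-ingest-heavy index (the index name is a placeholder):

```
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```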
Distributed Inverted Indexes
When a single inverted index gets too large for one machine, you shard it. There are two approaches:
Document-partitioned (the common approach): split documents across shards. Each shard has a complete inverted index for its subset of documents. A query goes to all shards, each returns local top results, and the coordinator merges them. This is what Elasticsearch does. The advantage is that each shard is independent and can be queried in parallel. The disadvantage is that IDF statistics are per-shard, which can cause scoring inconsistencies on small indices.
Term-partitioned: split the dictionary across shards. Shard 1 owns terms A-M, shard 2 owns N-Z. A query for "database replication" goes to shard 1 (for "database") and shard 2 (for "replication"), and the results are intersected. The advantage is that term statistics are global. The disadvantage is that multi-term queries require cross-shard coordination, and hot terms create load imbalance.
Document partitioning won in practice. Nearly every production search system uses it because the operational simplicity outweighs the minor scoring differences.
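A sketch of the document-partitioned scatter/gather step: each shard returns its local top-k by score, and the coordinator merges them (shard-local scoring and transport omitted):

```python
import heapq
from itertools import chain

def coordinate(shard_results, k=10):
    """Merge per-shard top-k lists of (score, doc_id) into a global top-k.
    Correct because any globally top-k hit must appear in its own shard's local top-k."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))

shard1 = [(0.91, "doc-3"), (0.40, "doc-7")]
shard2 = [(0.88, "doc-12"), (0.85, "doc-9")]
print(coordinate([shard1, shard2], k=3))
# [(0.91, 'doc-3'), (0.88, 'doc-12'), (0.85, 'doc-9')]
```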
When a B-Tree is Better
Inverted indexes are optimized for "which documents contain this term?" queries. They are not good at range queries on numeric fields ("price between 10 and 50"), spatial queries ("restaurants within 5 miles"), or sorted access patterns ("all documents ordered by date").
For these, Lucene uses different data structures alongside the inverted index. BKD trees handle numeric and geo ranges. Doc values (column-oriented storage) handle sorting and aggregations. An Elasticsearch index is not just an inverted index; it is a combination of data structures, each optimized for a specific access pattern.
The inverted index handles the text matching. Everything else uses purpose-built structures. Knowing where the inverted index ends and other structures begin helps you design better schemas and write better queries.
Key Points
- An inverted index flips the natural document-to-words relationship. Instead of asking 'what words are in this document?' you can ask 'which documents contain this word?' and get an answer in milliseconds, even across billions of documents
- Posting lists are the core data structure: sorted arrays of document IDs paired with term frequencies, positions, and offsets. Compression techniques like PForDelta and Variable Byte encoding shrink the document-ID data to a few bits per posting on dense lists
- Query execution boils down to set operations on posting lists. AND queries intersect sorted lists using skip pointers. OR queries merge them. The sorted order makes these operations efficient even when individual lists contain millions of entries
- Every Elasticsearch shard is a Lucene index, and every Lucene index is a collection of immutable segments. Each segment has its own inverted index. Search fans out to all segments, each one queries its local index, and results merge back
- Building the index requires choosing an analyzer chain: tokenizer, lowercasing, stemming, stop word removal. These choices are permanent. You cannot retroactively change how documents were tokenized without a full reindex
Common Mistakes
- Changing the analyzer after indexing data and expecting old documents to match new queries. The analyzer must be identical at index time and query time, or you get silent mismatches where documents exist but queries cannot find them
- Indexing everything as a single giant field. Separate title, body, tags, and metadata into distinct fields so you can boost and query them independently. A match in the title is worth more than a match buried in paragraph 47
- Ignoring stop words in phrase queries. If you remove stop words during indexing, the phrase 'to be or not to be' becomes an empty query. Either keep stop words for fields that need phrase matching, or use a shingle-based approach
- Not planning for updates. Inverted indexes are append-optimized. Updating a document means marking the old version as deleted and inserting a new one. High-update workloads cause segment bloat and need aggressive merge policies