BM25 — Best Matching 25
Architecture
Why TF-IDF Falls Short
Before BM25, search engines ranked documents using TF-IDF: term frequency times inverse document frequency. The idea is sound. If a word appears frequently in a document (high TF) and rarely across the whole corpus (high IDF), that document is probably relevant for that word.
The problem shows up fast in practice. Raw term frequency is unbounded. A 10,000-word article that mentions "database" 40 times gets 4x the TF score of a 500-word article that mentions it 10 times. But is the long article really more relevant? Usually not. It is just longer.
TF-IDF has no built-in way to handle this. People tried log-scaling the term frequency (1 + log(tf)), which helps but still does not properly account for document length. A 10,000-word document has more room to mention any given term, so raw counts are always biased toward longer content.
BM25 solves both problems with a single, elegant formula.
The BM25 Formula
The score for a document D given a query Q with terms q1, q2, ..., qn is:
score(D, Q) = Σ IDF(qi) × (tf(qi, D) × (k1 + 1)) / (tf(qi, D) + k1 × (1 - b + b × |D| / avgdl))
That looks dense, so let's break it apart.
IDF component: IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
N is the total number of documents. n(qi) is the number of documents containing term qi. This is the standard inverse document frequency idea: rare terms get higher weights. The +0.5 smoothing prevents division by zero, and the +1 inside the log keeps the IDF non-negative even when a term appears in more than half the documents.
Term frequency saturation: tf(qi, D) × (k1 + 1) / (tf(qi, D) + k1 × ...)
This is the key insight of BM25. As tf grows, the numerator grows, but the denominator grows at the same rate. The result asymptotically approaches (k1 + 1). So the first occurrence of a term in a document matters a lot. The second occurrence matters less. By the 10th occurrence, you are barely getting any additional score. This is term frequency saturation, and it is what prevents long documents from dominating.
The parameter k1 controls how fast saturation happens. With k1=0, term frequency is completely ignored (every document with the term scores the same). With k1=1.2 (the common default), you get reasonable saturation. With k1=100, you are back to nearly linear TF behavior.
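To make the saturation curve concrete, here is a minimal sketch in plain Python that evaluates just the TF component for a document of exactly average length (so the length-normalization factor is 1):

```python
# Term-frequency saturation: the TF component of BM25 for a document
# of exactly average length (length-normalization factor = 1).
def tf_component(tf: float, k1: float) -> float:
    return tf * (k1 + 1) / (tf + k1)

for k1 in (0.0, 1.2, 100.0):
    scores = [round(tf_component(tf, k1), 2) for tf in (1, 2, 5, 10, 50)]
    print(f"k1={k1:>5}: {scores}")

# k1=  0.0: [1.0, 1.0, 1.0, 1.0, 1.0]      -> TF completely ignored
# k1=  1.2: [1.0, 1.38, 1.77, 1.96, 2.15]  -> saturates toward k1+1 = 2.2
# k1=100.0: [1.0, 1.98, 4.81, 9.18, 33.67] -> nearly linear again
```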
Document length normalization: 1 - b + b × |D| / avgdl
|D| is the length of document D (in terms or tokens). avgdl is the average document length across the corpus. When b=0.75 (the default), a document that is twice as long as average gets penalized, while a document half the average length gets a small boost. This normalizes for the fact that longer documents naturally contain more term occurrences.
Setting b=0 disables length normalization entirely. Setting b=1 gives full normalization. The value 0.75 works well as a starting point for most corpora, but it is not universal.
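Putting all three components together, here is a self-contained sketch of the full formula. The whitespace tokenizer and toy corpus are illustrative only; a real engine uses proper analyzers:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with plain BM25."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # n(qi): number of documents containing each term
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "the database stores rows in a table",
    "a long article mentioning database many times database database",
    "cooking recipes with no relevant terms",
]
print(bm25_scores("database", docs))
```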
Parameter Tuning in Practice
The defaults (k1=1.2, b=0.75) come from experiments on TREC document collections in the 1990s. They work surprisingly well on a wide range of data, but "surprisingly well" is not the same as "optimal for your use case."
Short-form content like product names, email subjects, or tweet-length text: lower b (0.3-0.5) and lower k1 (0.5-1.0). These documents are all roughly the same length, so heavy length normalization hurts more than it helps. And with short text, even a single term occurrence is highly significant.
Long-form content like research papers, legal documents, or wiki articles: the defaults work well, or try slightly higher b (0.8) to more aggressively penalize verbose documents that mention terms incidentally.
Mixed-length corpora are the hardest. A corpus with both 50-word product descriptions and 5,000-word user manuals will struggle with any single b value. The practical solution is to index them into separate fields or separate indices with different BM25 parameters.
How do you actually evaluate? Build a test set of queries with known relevant documents (even 50-100 queries helps), compute metrics like NDCG@10 or MAP, and grid-search over k1 and b values. Elasticsearch's rank evaluation API makes this straightforward.
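The grid search itself is a small loop. In this sketch, `run_query` and `ndcg_at_10` are placeholders for your own search harness and metric implementation (or for calls into Elasticsearch's rank evaluation API):

```python
from itertools import product

# Hypothetical harness: `run_query(q, k1=..., b=...)` executes a search with
# the given BM25 parameters; `ndcg_at_10(results, judgments)` scores the
# ranking against human relevance judgments. Both are placeholders.
def grid_search(queries, judgments, run_query, ndcg_at_10):
    best = (None, -1.0)
    for k1, b in product([0.5, 0.9, 1.2, 1.6, 2.0], [0.0, 0.3, 0.5, 0.75, 1.0]):
        mean_ndcg = sum(
            ndcg_at_10(run_query(q, k1=k1, b=b), judgments[q]) for q in queries
        ) / len(queries)
        if mean_ndcg > best[1]:
            best = ((k1, b), mean_ndcg)
    return best  # ((k1, b), score) of the best-performing combination
```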
How Query Processing Works
BM25 does not magically score all documents at once. The actual execution follows the structure of the inverted index.
For each term in the query, the search engine looks up that term's posting list: a sorted list of (document_id, term_frequency) pairs. It walks through the posting list, computes the BM25 contribution of that term for each document, and accumulates the score.
For a multi-term query, the engine needs to combine scores across multiple posting lists. The naive approach iterates through every posting in every list. But if you only need the top 10 results, processing millions of postings is wasteful.
WAND (Weak AND) is the optimization that makes this tractable. It maintains a threshold score (the score of the current 10th-best document) and uses upper bounds on per-term BM25 contributions to skip documents that cannot possibly beat the threshold. If a document's maximum possible score from remaining terms is below the threshold, skip it.
Block-Max WAND takes this further by precomputing maximum term frequencies within blocks of the posting list. If an entire block of 128 documents cannot produce a score above the threshold, skip the whole block. In practice, Block-Max WAND skips 90%+ of the posting list for selective queries, making top-k retrieval nearly independent of corpus size.
Elasticsearch uses these optimizations under the hood. When you run a match query, you are not doing a brute-force scan. The query planner decides the execution strategy based on term statistics.
BM25F: Multi-Field Scoring
Real documents have structure. A web page has a title, a URL, headers, and body text. A product listing has a name, category, description, and reviews. Matching "wireless headphones" in the title is very different from matching it deep in a review.
BM25F (BM25 for structured documents) extends BM25 to handle multiple fields. Instead of treating the document as a flat bag of words, it computes a weighted combination of term frequencies across fields before applying the BM25 formula.
The weighted term frequency becomes:
tf_combined = w_title × tf_title + w_body × tf_body + w_url × tf_url
Then you plug tf_combined into the standard BM25 formula with a single set of k1 and b parameters.
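Here is a sketch of that combination for a single term, with arbitrary example field weights. Note that full BM25F also normalizes each field's term frequency by that field's own length before combining; that step is omitted to keep the sketch short:

```python
def bm25f_term_score(field_tfs: dict, field_weights: dict,
                     idf: float, k1: float = 1.2) -> float:
    """field_tfs: {"title": tf, "body": tf, ...} for one term in one document.
    Per-field length normalization is omitted here for brevity; full BM25F
    normalizes each field's tf by that field's own length first."""
    tf_combined = sum(field_weights[f] * tf for f, tf in field_tfs.items())
    return idf * tf_combined * (k1 + 1) / (tf_combined + k1)

# "wireless" appears twice in the title and once in the body; the title
# weight of 3.0 is an arbitrary example, as is idf=1.8.
score = bm25f_term_score({"title": 2, "body": 1},
                         {"title": 3.0, "body": 1.0}, idf=1.8)
print(round(score, 2))
```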
Elasticsearch does not implement BM25F directly. Instead, it computes BM25 independently per field and combines the scores. The result is similar but not identical. The combined_fields query type in Elasticsearch 7.13+ gets closer to true BM25F by modeling a single synthetic field from multiple source fields.
IDF Across Shards: The Coordination Problem
Here is where BM25 meets distributed systems. Elasticsearch splits an index into shards, and each shard is a separate Lucene index with its own IDF statistics. A term that appears in 1% of documents on shard 1 might appear in 5% on shard 2. If each shard uses its local IDF, the same document would get different scores depending on which shard it lives on.
For small indices or uneven shard sizes, this matters. A rare term on one shard might be common on another, leading to inconsistent ranking.
Elasticsearch offers search_type=dfs_query_then_fetch to fix this. The "DFS" (distributed frequency statistics) phase collects term statistics from all shards before scoring. Each shard then uses the global IDF instead of its local one. The cost is an extra round trip to collect statistics, which adds latency. Most production deployments skip DFS because their indices are large enough that per-shard statistics converge to the global distribution.
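Enabling the DFS phase is just a query-string parameter on the search request. A minimal sketch over raw HTTP, where the cluster address, index name, and query are placeholders:

```python
import requests

# Hypothetical local cluster and index; search_type=dfs_query_then_fetch
# tells Elasticsearch to collect global term statistics before scoring.
resp = requests.get(
    "http://localhost:9200/articles/_search",
    params={"search_type": "dfs_query_then_fetch"},
    json={"query": {"match": {"body": "database indexing"}}},
)
print(resp.json()["hits"]["hits"][:3])
```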
BM25 vs. Vector Search
BM25 finds documents that contain your exact query terms. Vector search (using embeddings from models like sentence-transformers) finds documents that are semantically similar, even without shared vocabulary. "Automobile repair shop" can match "car mechanic garage" in vector space but never in BM25.
So why not just use vector search for everything?
Because BM25 handles precision cases that vector search gets wrong. Searching for error code "NullPointerException" or product SKU "XJ-4829" or a person's name "Rakesh Rathi" needs exact lexical matching. Vector embeddings collapse these into a dense representation that loses the specific token identity.
The industry has converged on hybrid search: run both BM25 and vector search, then combine the results using reciprocal rank fusion (RRF) or a learned re-ranker. Elasticsearch added vector search alongside BM25 in version 8.0, and this hybrid approach has become the standard architecture.
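RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) across the result lists it appears in, with k=60 as the constant from the original RRF paper. A minimal sketch:

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """result_lists: ranked doc-id lists, e.g. [bm25_ids, vector_ids]."""
    fused: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: -kv[1])

bm25_ids = ["d3", "d1", "d7"]       # lexical top results
vector_ids = ["d7", "d2", "d3"]     # semantic top results
print(rrf([bm25_ids, vector_ids]))  # d3 and d7 rise to the top
```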
The combination is powerful because the failure modes are complementary. BM25 fails on vocabulary mismatch. Vector search fails on precision queries. Together, they cover each other's weaknesses.
When BM25 Breaks Down
BM25 assumes that query terms are independent. The score for "New York" is just the sum of the scores for "New" and "York" separately. This means a document about "New Zealand" and "York University" could outscore a document about "New York City" if the individual term statistics work out that way.
Phrase queries, proximity boosting, and span queries are the standard workarounds. Elasticsearch's match_phrase query requires terms to appear adjacent, and proximity queries allow a configurable slop between terms. These are post-BM25 filters and boosters, not part of the BM25 formula itself.
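For reference, the phrase and proximity variants look like this in Elasticsearch's query DSL (shown as Python dicts; the field name is an example):

```python
# Exact phrase: terms must appear adjacent, in order.
phrase_query = {"match_phrase": {"body": "new york"}}

# Proximity: allow up to 2 positions of slop between the terms.
proximity_query = {"match_phrase": {"body": {"query": "new york", "slop": 2}}}
```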
BM25 also has no concept of entity or intent. A query for "apple" returns documents about the fruit and the company in whatever order BM25 decides, with no understanding that these are different concepts. Handling this requires query classification or personalization layers above BM25.
Production Patterns
Elasticsearch relevance tuning. Start with default BM25 (k1=1.2, b=0.75). Index your most important fields (title, name) separately from body content. Apply field boosts in your query (title^3, body^1). Build a test query set and iterate on boosts and parameters using the rank eval API. This gets you 80% of the way. The remaining 20% comes from synonyms, custom analyzers, and function score queries that blend BM25 with business signals like popularity or recency.
SQLite FTS5 for embedded search. SQLite's FTS5 extension uses BM25 natively. If you are building a mobile app or a desktop tool that needs local full-text search, FTS5 gives you BM25 ranking out of the box with zero infrastructure. The rank auxiliary function returns BM25 scores directly.
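A self-contained sketch using Python's built-in sqlite3 module (this assumes FTS5 is compiled into your SQLite build, which it is in most modern distributions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("BM25 basics", "term frequency saturation and length normalization"),
        ("Cooking", "recipes with no search terms at all"),
        ("Search tuning", "bm25 parameters k1 and b control saturation"),
    ],
)
# `rank` is FTS5's built-in BM25 relevance; smaller (more negative) is a
# better match, so ORDER BY rank returns the best matches first.
for row in conn.execute(
    "SELECT title, rank FROM docs WHERE docs MATCH 'saturation' ORDER BY rank"
):
    print(row)
```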
Two-stage retrieval. Use BM25 as the first stage to retrieve the top 1000 candidates cheaply, then re-rank with a more expensive model (cross-encoder, learning-to-rank, or LLM-based re-ranker). BM25 is fast enough to scan posting lists with millions of entries. The expensive model only needs to look at 1000 documents. This is how most production search systems work at scale.
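The glue code for the pattern is short. In this sketch, `bm25_search` and `cross_encoder_score` are stand-ins for your first-stage index and your re-ranking model:

```python
# Hypothetical components: `bm25_search(query, k)` returns (doc_id, text)
# pairs from the inverted index; `cross_encoder_score(query, text)` is an
# expensive relevance model applied only to the candidate set.
def two_stage_search(query, bm25_search, cross_encoder_score,
                     candidates=1000, k=10):
    stage_one = bm25_search(query, k=candidates)  # cheap, recall-oriented
    reranked = sorted(
        stage_one,
        key=lambda doc: cross_encoder_score(query, doc[1]),
        reverse=True,  # expensive, precision-oriented
    )
    return reranked[:k]
```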
Key Points
- BM25 ranks documents by combining term frequency saturation, inverse document frequency, and document length normalization. It fixes the biggest problem with raw TF-IDF: long documents no longer dominate results, and repeating a word 50 times does not give you 50x the score
- Two parameters control everything. k1 (typically 1.2-2.0) sets how quickly term frequency saturates. b (typically 0.75) controls how much document length penalizes the score. Setting b=0 turns off length normalization entirely
- Elasticsearch switched from TF-IDF to BM25 as the default scoring function in version 5.0 (Lucene 6.0). The reason was simple: BM25 consistently beat TF-IDF across every standard information retrieval benchmark they tested
- BM25 scores are decomposable per query term, which enables aggressive query-time optimizations like WAND and Block-Max WAND. These skip large chunks of posting lists that cannot possibly produce top-k results
- BM25 is a lexical matching algorithm. It cannot match 'automobile' to 'car' unless you add synonyms explicitly. Modern search systems combine BM25 with vector embeddings in a hybrid approach to get the best of both worlds
Common Mistakes
- Running with default k1 and b without tuning for your actual data. Short product titles need very different parameters than long-form blog posts. Always run relevance evaluation on a representative query set before shipping
- Ignoring how tokenization and stemming affect BM25 scores. If your indexing analyzer stems 'running' to 'run' but your query analyzer does not, you get zero matches on a query that should return thousands of results
- Treating BM25 scores as absolute relevance measures. A score of 12.5 means nothing on its own. BM25 scores are only meaningful for ranking documents against each other within the same query
- Not accounting for multi-field scoring. A title match should almost always score higher than a body match, but BM25 does not know about fields by itself. Elasticsearch applies BM25 per field and combines them, and those field boost weights need careful thought