Search Index Engine
Inverted index with a Composite field structure, Visitor-based query operations, swappable tokenization strategies, and a Decorator chain for query enrichment. The same core machinery that powers engines like Elasticsearch and Lucene.
Key Abstractions
Composite : root index contains field-level sub-indices
Leaf : inverted index for a single field: term → doc IDs
Visitor : SearchVisitor for search, StatsVisitor for index statistics
Strategy : WhitespaceTokenizer, NgramTokenizer, LowercaseTokenizer
Builder : add_field declares each field with its tokenizer; build() validates the schema and returns the index
Decorator : BoostDecorator multiplies scores, FilterDecorator excludes results
Class Diagram
How It Works
A search index engine maps content to locations. When you index a document, each field (title, body, tags) is tokenized into terms, and those terms are stored in an inverted index: a mapping from term to the set of documents containing it. When you search, the engine looks up your query terms in the inverted index and scores matching documents using TF-IDF.
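The term-to-documents mapping and the TF-IDF score can be sketched in a few lines, independent of the full implementation below (a toy example; the `postings` and `tf_idf` names are illustrative):

```python
from collections import defaultdict
import math

# Toy inverted index: term -> set of doc IDs containing that term
postings: dict[str, set[str]] = defaultdict(set)
docs = {
    "d1": "search engines use inverted indices",
    "d2": "inverted indices map terms to documents",
}
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)

def tf_idf(term: str, doc_id: str) -> float:
    """TF-IDF for one (term, doc) pair: tf = count/len, idf = log(N/df)."""
    tokens = docs[doc_id].split()
    tf = tokens.count(term) / len(tokens)
    df = len(postings[term])          # document frequency of the term
    return tf * math.log(len(docs) / df)

print(sorted(postings["inverted"]))   # both documents contain "inverted"
```

A term appearing in every document gets idf = log(N/N) = 0, so ubiquitous terms contribute nothing to the score, which is exactly the discrimination TF-IDF is designed to provide.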
The Composite pattern structures the index itself. A SearchIndex is the root that contains multiple FieldIndex children, one per field. Querying the root fans out to all field indices automatically. You can weight title matches differently from body matches without changing any search logic.
The Visitor pattern separates operations from structure. SearchVisitor traverses the index tree to find and score documents. StatsVisitor walks the same tree to collect statistics. Neither modifies any index class. Adding a new operation like an export visitor or a validation visitor means writing one new class.
The Strategy pattern makes tokenization per-field. Title uses lowercase splitting. Body might use ngram tokenization for partial matching. Tags use exact whitespace splitting. The tokenizer is injected at field creation time and used consistently for both indexing and querying.
The Builder pattern assembles the index schema step by step. You declare fields with their tokenizers, then call build() to get a validated SearchIndex. This prevents half-configured indices from leaking into production.
The Decorator pattern enriches query results after the core search. A BoostDecorator multiplies scores. A FilterDecorator excludes documents. Decorators compose: boost wraps filter, and the pipeline processes results through each layer without modifying the search visitor.
Requirements
Functional
- Index documents with multiple fields (title, body, tags) using per-field tokenization
- Build inverted indices mapping terms to document IDs with term frequency tracking
- Search across all fields and return documents ranked by TF-IDF score
- Support different tokenization strategies per field (whitespace, lowercase, ngram)
- Collect index statistics (unique terms, document counts) without modifying index classes
- Post-process search results with composable decorators (boosting, filtering)
Non-Functional
- Adding a new index operation (export, validation) should not modify any index class
- Changing tokenization for a field should not require changes to search or scoring logic
- Index schema must be validated at build time to prevent runtime inconsistencies
- Result decoration must compose without modifying the core search pipeline
Design Decisions
Couldn't we use a flat lookup table instead of Composite?
A search index is not a flat lookup table. It is a tree: the root index contains field-level sub-indices, each with its own inverted index and tokenizer. The Composite pattern lets you treat a single FieldIndex and the entire SearchIndex uniformly through the IndexNode interface.
This matters when you search. Calling accept(visitor) on the root automatically fans out to every field. You do not write loops over fields in your search logic. The composite handles traversal. When you add a new field, zero search code changes.
What if we put search methods directly on IndexNode?
If you add a search() method directly to IndexNode, it works. But then you need statistics collection. Edit FieldIndex and SearchIndex. Then you want index validation. Edit both again. Every new operation touches every node class.
With Visitor, SearchVisitor scores documents, StatsVisitor counts terms, and a future ExportVisitor could serialize the index. Each is its own class. The index hierarchy stays frozen.
Does one tokenizer work for all fields?
Tokenization is not one-size-fits-all. English titles split on whitespace. Body text benefits from ngram tokenization for partial matches. Tags need exact matching. Hardcoding any of these means you cannot adapt to new content types or languages.
Strategy lets you inject the tokenizer at field creation time. The same FieldIndex class works with any tokenizer. Swapping from whitespace to ngram means passing a different strategy object, not rewriting the index.
Why not just construct the index directly?
An index without fields is useless. An index where some documents have a "title" field and others do not is a source of silent bugs. The Builder pattern forces you to declare the schema upfront: which fields exist, which tokenizer each uses. Calling build() validates the schema and returns a ready-to-use SearchIndex. No half-initialized objects escape into the system.
Couldn't the search visitor handle boosting and filtering?
Boosting, filtering, deduplication, and personalization are all post-processing concerns. Putting them inside the search visitor creates a monolithic searcher that knows about business rules. Decorator keeps each concern in its own class. You compose them: BoostDecorator(2.0, FilterDecorator(excluded)). Each layer wraps the previous one. Adding a new post-processing step means writing one new decorator, not editing the search pipeline.
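The composition order is worth pinning down: the innermost (wrapped) decorator runs first, because each layer delegates before transforming. A self-contained sketch of that delegate-then-transform chain (simplified stand-ins for the BoostDecorator/FilterDecorator in the implementation below; the `Boost` and `Exclude` names are illustrative):

```python
from abc import ABC, abstractmethod

Results = list[tuple[str, float]]

class Decorator(ABC):
    def __init__(self, wrapped: "Decorator | None" = None):
        self.wrapped = wrapped

    def _delegate(self, results: Results) -> Results:
        # Run the wrapped layer first, then let the subclass transform
        return self.wrapped.execute(results) if self.wrapped else results

    @abstractmethod
    def execute(self, results: Results) -> Results: ...

class Boost(Decorator):
    def __init__(self, factor: float, wrapped: "Decorator | None" = None):
        super().__init__(wrapped)
        self.factor = factor

    def execute(self, results: Results) -> Results:
        return [(d, s * self.factor) for d, s in self._delegate(results)]

class Exclude(Decorator):
    def __init__(self, banned: set[str], wrapped: "Decorator | None" = None):
        super().__init__(wrapped)
        self.banned = banned

    def execute(self, results: Results) -> Results:
        return [(d, s) for d, s in self._delegate(results) if d not in self.banned]

raw = [("doc1", 1.0), ("doc4", 0.8)]
pipeline = Boost(2.0, Exclude({"doc4"}))  # Exclude runs first, then Boost
print(pipeline.execute(raw))              # [('doc1', 2.0)]
```

Here the order doesn't change the surviving results, but it would for an order-sensitive pair (say, a top-K truncator wrapped around a filter), so the convention matters.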
Interview Follow-ups
- "How would you implement BM25 instead of TF-IDF?" Replace the scoring formula in
SearchVisitor. BM25 adds document length normalization and a saturation function on term frequency. The visitor interface stays the same. Only the math insidevisit_field_indexchanges. You could even make scoring itself a Strategy if you want to switch between TF-IDF and BM25 at runtime. - "How would you shard the index across multiple machines?" Each shard is a
SearchIndexholding a subset of documents (by hash of doc ID). A coordinator node fans out queries to all shards, each returns local top-K results, and the coordinator merges them. The Composite pattern extends naturally: aDistributedIndexis a composite whose children are remote shards instead of localFieldIndexobjects. - "How would you handle real-time indexing while serving queries?" Use a two-phase approach: an in-memory buffer index for recent writes and a larger immutable on-disk index for historical data. Queries search both and merge results. Periodically, the buffer is flushed and merged into the main index. This is exactly how Lucene segments work.
- "How would you add faceted search?" A faceted search counts how many results fall into each category. Add a
FacetVisitorthat traverses the index tree and groups matched documents by a designated facet field value. The Visitor pattern means this is one new class. No changes to the index structure or the search visitor.
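The BM25 follow-up can be made concrete. A minimal sketch of one term's contribution to one document's score, using the common k1/b defaults (the `bm25_term_score` function and its parameters are illustrative, not part of the engine above):

```python
import math

def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    doc_count: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of a single term to a single document."""
    # Smoothed IDF: rarer terms (small doc_freq) score higher
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Saturating term frequency with document-length normalization:
    # long documents are penalized, repeated terms have diminishing returns
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# Rarer terms score higher; extra occurrences saturate rather than
# growing linearly as they would under raw term frequency
common = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100.0,
                         doc_count=1000, doc_freq=500)
rare = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100.0,
                       doc_count=1000, doc_freq=5)
```

Swapping this in for `tf * idf` inside `visit_field_index` is the whole change, which is the point of keeping scoring localized in the visitor.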
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from collections import defaultdict
import math


# ── Tokenizer Strategy ────────────────────────────────────────────

class TokenizerStrategy(ABC):
    @abstractmethod
    def tokenize(self, text: str) -> list[str]:
        ...


class WhitespaceTokenizer(TokenizerStrategy):
    def tokenize(self, text: str) -> list[str]:
        return text.split()


class LowercaseTokenizer(TokenizerStrategy):
    def tokenize(self, text: str) -> list[str]:
        return text.lower().split()


class NgramTokenizer(TokenizerStrategy):
    def __init__(self, n: int = 3):
        self.n = n

    def tokenize(self, text: str) -> list[str]:
        text = text.lower()
        tokens = []
        for word in text.split():
            for i in range(len(word) - self.n + 1):
                tokens.append(word[i:i + self.n])
        return tokens if tokens else [text]


# ── Composite: IndexNode hierarchy ────────────────────────────────

class IndexNode(ABC):
    @abstractmethod
    def accept(self, visitor: "IndexVisitor"):
        ...

    @abstractmethod
    def add_document(self, doc_id: str, text: str) -> None:
        ...


class FieldIndex(IndexNode):
    def __init__(self, name: str, tokenizer: TokenizerStrategy):
        self.name = name
        self.tokenizer = tokenizer
        self.inverted_index: dict[str, set[str]] = defaultdict(set)
        self.tf_scores: dict[str, dict[str, float]] = defaultdict(dict)
        self.doc_count = 0

    def accept(self, visitor: "IndexVisitor"):
        return visitor.visit_field_index(self)

    def add_document(self, doc_id: str, text: str) -> None:
        tokens = self.tokenizer.tokenize(text)
        self.doc_count += 1
        term_freq: dict[str, int] = defaultdict(int)
        for token in tokens:
            term_freq[token] += 1
        total = len(tokens) if tokens else 1
        for term, count in term_freq.items():
            self.inverted_index[term].add(doc_id)
            self.tf_scores[term][doc_id] = count / total

    def search(self, term: str) -> set[str]:
        tokens = self.tokenizer.tokenize(term)
        if not tokens:
            return set()
        return self.inverted_index.get(tokens[0], set())

    def get_tf(self, term: str, doc_id: str) -> float:
        return self.tf_scores.get(term, {}).get(doc_id, 0.0)

    def get_idf(self, term: str) -> float:
        df = len(self.inverted_index.get(term, set()))
        if df == 0 or self.doc_count == 0:
            return 0.0
        return math.log(self.doc_count / df)


class SearchIndex(IndexNode):
    def __init__(self):
        self.fields: dict[str, FieldIndex] = {}

    def accept(self, visitor: "IndexVisitor"):
        return visitor.visit_search_index(self)

    def add_field(self, field_index: FieldIndex) -> None:
        self.fields[field_index.name] = field_index

    def get_field(self, name: str) -> FieldIndex:
        return self.fields[name]

    def add_document(self, doc_id: str, text: str) -> None:
        for fi in self.fields.values():
            fi.add_document(doc_id, text)

    def add_document_fields(self, doc_id: str, fields: dict[str, str]) -> None:
        for field_name, text in fields.items():
            if field_name in self.fields:
                self.fields[field_name].add_document(doc_id, text)


# ── Visitor ───────────────────────────────────────────────────────

class IndexVisitor(ABC):
    @abstractmethod
    def visit_field_index(self, field_index: FieldIndex):
        ...

    @abstractmethod
    def visit_search_index(self, index: SearchIndex):
        ...


class SearchVisitor(IndexVisitor):
    def __init__(self, query: str):
        self.query = query
        self.results: dict[str, float] = defaultdict(float)

    def visit_field_index(self, field_index: FieldIndex) -> None:
        tokens = field_index.tokenizer.tokenize(self.query)
        for token in tokens:
            idf = field_index.get_idf(token)
            for doc_id in field_index.inverted_index.get(token, set()):
                tf = field_index.get_tf(token, doc_id)
                self.results[doc_id] += tf * idf

    def visit_search_index(self, index: SearchIndex) -> None:
        for fi in index.fields.values():
            fi.accept(self)

    def get_results(self) -> list[tuple[str, float]]:
        return sorted(self.results.items(), key=lambda x: -x[1])


class StatsVisitor(IndexVisitor):
    def __init__(self):
        self.total_terms = 0
        self.total_docs = 0
        self.field_stats: list[dict] = []

    def visit_field_index(self, field_index: FieldIndex) -> None:
        terms = len(field_index.inverted_index)
        self.total_terms += terms
        self.total_docs = max(self.total_docs, field_index.doc_count)
        self.field_stats.append({
            "field": field_index.name,
            "unique_terms": terms,
            "documents": field_index.doc_count,
        })

    def visit_search_index(self, index: SearchIndex) -> None:
        for fi in index.fields.values():
            fi.accept(self)


# ── Builder ───────────────────────────────────────────────────────

class IndexBuilder:
    def __init__(self):
        self._fields: dict[str, FieldIndex] = {}

    def add_field(self, name: str, tokenizer: TokenizerStrategy) -> "IndexBuilder":
        self._fields[name] = FieldIndex(name, tokenizer)
        return self

    def build(self) -> SearchIndex:
        if not self._fields:
            raise ValueError("Index must have at least one field")
        index = SearchIndex()
        for fi in self._fields.values():
            index.add_field(fi)
        return index


# ── Query Decorator ───────────────────────────────────────────────

class QueryDecorator(ABC):
    def __init__(self, wrapped: "QueryDecorator | None" = None):
        self.wrapped = wrapped

    @abstractmethod
    def execute(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        ...

    def _delegate(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        if self.wrapped:
            return self.wrapped.execute(results)
        return results


class BoostDecorator(QueryDecorator):
    def __init__(self, multiplier: float, wrapped: QueryDecorator | None = None):
        super().__init__(wrapped)
        self.multiplier = multiplier

    def execute(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        results = self._delegate(results)
        return [(doc_id, score * self.multiplier) for doc_id, score in results]


class FilterDecorator(QueryDecorator):
    def __init__(self, excluded: set[str], wrapped: QueryDecorator | None = None):
        super().__init__(wrapped)
        self.excluded = excluded

    def execute(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        results = self._delegate(results)
        return [(doc_id, score) for doc_id, score in results if doc_id not in self.excluded]


# ── Demo ──────────────────────────────────────────────────────────

if __name__ == "__main__":
    # Build index with different tokenizers per field
    index = (
        IndexBuilder()
        .add_field("title", LowercaseTokenizer())
        .add_field("body", LowercaseTokenizer())
        .add_field("tags", WhitespaceTokenizer())
        .build()
    )

    # Add documents with per-field content
    docs = [
        {"id": "doc1", "title": "Introduction to Search Engines",
         "body": "Search engines use inverted indices to map terms to documents efficiently",
         "tags": "search indexing fundamentals"},
        {"id": "doc2", "title": "Building Inverted Indices",
         "body": "An inverted index stores a mapping from content to its location in documents",
         "tags": "indexing data-structures"},
        {"id": "doc3", "title": "TF-IDF Scoring Explained",
         "body": "Term frequency inverse document frequency measures how important a term is to a document",
         "tags": "scoring ranking search"},
        {"id": "doc4", "title": "Distributed Search Architecture",
         "body": "Sharding distributes the search index across multiple nodes for horizontal scaling",
         "tags": "distributed architecture scaling"},
    ]

    for doc in docs:
        index.add_document_fields(doc["id"], {
            "title": doc["title"],
            "body": doc["body"],
            "tags": doc["tags"],
        })

    # Search with TF-IDF scoring via Visitor
    print("=== Search: 'search' ===")
    visitor = SearchVisitor("search")
    index.accept(visitor)
    for doc_id, score in visitor.get_results():
        print(f"  {doc_id}: {score:.4f}")

    print("\n=== Search: 'inverted index' ===")
    visitor = SearchVisitor("inverted index")
    index.accept(visitor)
    for doc_id, score in visitor.get_results():
        print(f"  {doc_id}: {score:.4f}")

    # Stats via Visitor
    print("\n=== Index Statistics ===")
    stats = StatsVisitor()
    index.accept(stats)
    print(f"  Total unique terms: {stats.total_terms}")
    print(f"  Total documents: {stats.total_docs}")
    for fs in stats.field_stats:
        print(f"  Field '{fs['field']}': {fs['unique_terms']} terms, {fs['documents']} docs")

    # Query Decorators: boost then filter
    print("\n=== Decorated Query: 'search' (boost 2x, filter out doc4) ===")
    visitor = SearchVisitor("search")
    index.accept(visitor)
    raw_results = visitor.get_results()
    pipeline = BoostDecorator(2.0, FilterDecorator({"doc4"}))
    decorated = pipeline.execute(raw_results)
    for doc_id, score in decorated:
        print(f"  {doc_id}: {score:.4f}")

    # Ngram tokenizer demo
    print("\n=== Ngram Tokenizer Demo ===")
    ngram_index = IndexBuilder().add_field("content", NgramTokenizer(3)).build()
    ngram_index.add_document_fields("d1", {"content": "elasticsearch"})
    ngram_index.add_document_fields("d2", {"content": "elastic band"})
    visitor = SearchVisitor("ela")
    ngram_index.accept(visitor)
    print("  Search 'ela':")
    for doc_id, score in visitor.get_results():
        print(f"  {doc_id}: {score:.4f}")

Common Mistakes
- ✗ Using a single flat index for all fields: you can't weight title matches higher than body matches
- ✗ Putting search logic inside index classes: adding TF-IDF scoring requires modifying FieldIndex
- ✗ Hardcoding tokenization: English whitespace splitting doesn't work for CJK languages
- ✗ Not building the index schema upfront: leads to inconsistent field types across documents
Key Points
- ✓ Composite: a SearchIndex contains multiple FieldIndex objects (title, body, tags). Querying the root traverses all relevant sub-indices.
- ✓ Visitor: search, scoring, and statistics collection are visitors that traverse the index tree without modifying index classes.
- ✓ Strategy: tokenization is per-field. Title might use a whitespace tokenizer, body ngram, tags exact match.
- ✓ Builder: the index schema is assembled step by step; build() validates the schema before returning a usable index.
- ✓ Decorator: boosting and filtering are composable post-processing layers wrapped around raw results, leaving the core search pipeline untouched.