Search Index Engine
Inverted index with a Composite field structure, Visitor-based query operations, swappable tokenization strategies, and a Decorator chain for query enrichment. The same core machinery that powers engines like Elasticsearch and Lucene.
Key Abstractions
Composite : root index contains field-level sub-indices
Leaf : inverted index for a single field: term → doc IDs
Visitor : SearchVisitor for search, StatsVisitor for index statistics
Strategy : WhitespaceTokenizer, NgramTokenizer, LowercaseTokenizer
Builder : add_field declares each field with its tokenizer; build() validates the schema and returns the index
Decorator : BoostDecorator multiplies scores, FilterDecorator excludes results
Class Diagram
How It Works
A search index engine maps content to locations. When you index a document, each field (title, body, tags) is tokenized into terms, and those terms are stored in an inverted index: a mapping from term to the set of documents containing it. When you search, the engine looks up your query terms in the inverted index and scores matching documents using TF-IDF.
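The term-to-documents mapping and the TF-IDF score can be sketched in a few lines, independent of the full implementation below (a toy example; the `postings` and `tf_idf` names are illustrative):

```python
from collections import defaultdict
import math

# Toy inverted index: term -> set of doc IDs containing that term
postings: dict[str, set[str]] = defaultdict(set)
docs = {
    "d1": "search engines use inverted indices",
    "d2": "inverted indices map terms to documents",
}
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)

def tf_idf(term: str, doc_id: str) -> float:
    """TF-IDF for one (term, doc) pair: tf = count/len, idf = log(N/df)."""
    tokens = docs[doc_id].split()
    tf = tokens.count(term) / len(tokens)
    df = len(postings[term])          # document frequency of the term
    return tf * math.log(len(docs) / df)

print(sorted(postings["inverted"]))   # both documents contain "inverted"
```

A term appearing in every document gets idf = log(N/N) = 0, so ubiquitous terms contribute nothing to the score, which is exactly the discrimination TF-IDF is designed to provide.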
The Composite pattern structures the index itself. A SearchIndex is the root that contains multiple FieldIndex children, one per field. Querying the root fans out to all field indices automatically. You can weight title matches differently from body matches without changing any search logic.
The Visitor pattern separates operations from structure. SearchVisitor traverses the index tree to find and score documents. StatsVisitor walks the same tree to collect statistics. Neither modifies any index class. Adding a new operation like an export visitor or a validation visitor means writing one new class.
The Strategy pattern makes tokenization per-field. Title uses lowercase splitting. Body might use ngram tokenization for partial matching. Tags use exact whitespace splitting. The tokenizer is injected at field creation time and used consistently for both indexing and querying.
The Builder pattern assembles the index schema step by step. You declare fields with their tokenizers, then call build() to get a validated SearchIndex. This prevents half-configured indices from leaking into production.
The Decorator pattern enriches query results after the core search. A BoostDecorator multiplies scores. A FilterDecorator excludes documents. Decorators compose: boost wraps filter, and the pipeline processes results through each layer without modifying the search visitor.
Requirements
Functional
- Index documents with multiple fields (title, body, tags) using per-field tokenization
- Build inverted indices mapping terms to document IDs with term frequency tracking
- Search across all fields and return documents ranked by TF-IDF score
- Support different tokenization strategies per field (whitespace, lowercase, ngram)
- Collect index statistics (unique terms, document counts) without modifying index classes
- Post-process search results with composable decorators (boosting, filtering)
Non-Functional
- Adding a new index operation (export, validation) should not modify any index class
- Changing tokenization for a field should not require changes to search or scoring logic
- Index schema must be validated at build time to prevent runtime inconsistencies
- Result decoration must compose without modifying the core search pipeline
Design Decisions
Couldn't we use a flat lookup table instead of Composite?
A search index is not a flat lookup table. It is a tree: the root index contains field-level sub-indices, each with its own inverted index and tokenizer. The Composite pattern lets you treat a single FieldIndex and the entire SearchIndex uniformly through the IndexNode interface.
This matters when you search. Calling accept(visitor) on the root automatically fans out to every field. You do not write loops over fields in your search logic. The composite handles traversal. When you add a new field, zero search code changes.
What if we put search methods directly on IndexNode?
If you add a search() method directly to IndexNode, it works. But then you need statistics collection. Edit FieldIndex and SearchIndex. Then you want index validation. Edit both again. Every new operation touches every node class.
With Visitor, SearchVisitor scores documents, StatsVisitor counts terms, and a future ExportVisitor could serialize the index. Each is its own class. The index hierarchy stays frozen.
Does one tokenizer work for all fields?
Tokenization is not one-size-fits-all. English titles split on whitespace. Body text benefits from ngram tokenization for partial matches. Tags need exact matching. Hardcoding any of these means you cannot adapt to new content types or languages.
Strategy lets you inject the tokenizer at field creation time. The same FieldIndex class works with any tokenizer. Swapping from whitespace to ngram means passing a different strategy object, not rewriting the index.
Why not just construct the index directly?
An index without fields is useless. An index where some documents have a "title" field and others do not is a source of silent bugs. The Builder pattern forces you to declare the schema upfront: which fields exist, which tokenizer each uses. Calling build() validates the schema and returns a ready-to-use SearchIndex. No half-initialized objects escape into the system.
Couldn't the search visitor handle boosting and filtering?
Boosting, filtering, deduplication, and personalization are all post-processing concerns. Putting them inside the search visitor creates a monolithic searcher that knows about business rules. Decorator keeps each concern in its own class. You compose them: BoostDecorator(2.0, FilterDecorator(excluded)). Each layer wraps the previous one. Adding a new post-processing step means writing one new decorator, not editing the search pipeline.
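The composition order is worth pinning down: the innermost (wrapped) decorator runs first, because each layer delegates before transforming. A self-contained sketch of that delegate-then-transform chain (simplified stand-ins for the BoostDecorator/FilterDecorator in the implementation below; the `Boost` and `Exclude` names are illustrative):

```python
from abc import ABC, abstractmethod

Results = list[tuple[str, float]]

class Decorator(ABC):
    def __init__(self, wrapped: "Decorator | None" = None):
        self.wrapped = wrapped

    def _delegate(self, results: Results) -> Results:
        # Run the wrapped layer first, then let the subclass transform
        return self.wrapped.execute(results) if self.wrapped else results

    @abstractmethod
    def execute(self, results: Results) -> Results: ...

class Boost(Decorator):
    def __init__(self, factor: float, wrapped: "Decorator | None" = None):
        super().__init__(wrapped)
        self.factor = factor

    def execute(self, results: Results) -> Results:
        return [(d, s * self.factor) for d, s in self._delegate(results)]

class Exclude(Decorator):
    def __init__(self, banned: set[str], wrapped: "Decorator | None" = None):
        super().__init__(wrapped)
        self.banned = banned

    def execute(self, results: Results) -> Results:
        return [(d, s) for d, s in self._delegate(results) if d not in self.banned]

raw = [("doc1", 1.0), ("doc4", 0.8)]
pipeline = Boost(2.0, Exclude({"doc4"}))  # Exclude runs first, then Boost
print(pipeline.execute(raw))              # [('doc1', 2.0)]
```

Here the order doesn't change the surviving results, but it would for an order-sensitive pair (say, a top-K truncator wrapped around a filter), so the convention matters.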
Interview Follow-ups
- "How would you implement BM25 instead of TF-IDF?" Replace the scoring formula in
SearchVisitor. BM25 adds document length normalization and a saturation function on term frequency. The visitor interface stays the same. Only the math insidevisit_field_indexchanges. You could even make scoring itself a Strategy if you want to switch between TF-IDF and BM25 at runtime. - "How would you shard the index across multiple machines?" Each shard is a
SearchIndexholding a subset of documents (by hash of doc ID). A coordinator node fans out queries to all shards, each returns local top-K results, and the coordinator merges them. The Composite pattern extends naturally: aDistributedIndexis a composite whose children are remote shards instead of localFieldIndexobjects. - "How would you handle real-time indexing while serving queries?" Use a two-phase approach: an in-memory buffer index for recent writes and a larger immutable on-disk index for historical data. Queries search both and merge results. Periodically, the buffer is flushed and merged into the main index. This is exactly how Lucene segments work.
- "How would you add faceted search?" A faceted search counts how many results fall into each category. Add a
FacetVisitorthat traverses the index tree and groups matched documents by a designated facet field value. The Visitor pattern means this is one new class. No changes to the index structure or the search visitor.
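The BM25 follow-up can be made concrete. A minimal sketch of one term's contribution to one document's score, using the common k1/b defaults (the `bm25_term_score` function and its parameters are illustrative, not part of the engine above):

```python
import math

def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    doc_count: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of a single term to a single document."""
    # Smoothed IDF: rarer terms (small doc_freq) score higher
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Saturating term frequency with document-length normalization:
    # long documents are penalized, repeated terms have diminishing returns
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# Rarer terms score higher; extra occurrences saturate rather than
# growing linearly as they would under raw term frequency
common = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100.0,
                         doc_count=1000, doc_freq=500)
rare = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100.0,
                       doc_count=1000, doc_freq=5)
```

Swapping this in for `tf * idf` inside `visit_field_index` is the whole change, which is the point of keeping scoring localized in the visitor.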
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from collections import defaultdict
import math


# ── Tokenizer Strategy ────────────────────────────────────────────

class TokenizerStrategy(ABC):
    @abstractmethod
    def tokenize(self, text: str) -> list[str]:
        ...


class WhitespaceTokenizer(TokenizerStrategy):
    def tokenize(self, text: str) -> list[str]:
        return text.split()


class LowercaseTokenizer(TokenizerStrategy):
    def tokenize(self, text: str) -> list[str]:
        return text.lower().split()


class NgramTokenizer(TokenizerStrategy):
    def __init__(self, n: int = 3):
        self.n = n

    def tokenize(self, text: str) -> list[str]:
        text = text.lower()
        tokens = []
        for word in text.split():
            for i in range(len(word) - self.n + 1):
                tokens.append(word[i:i + self.n])
        return tokens if tokens else [text]


# ── Composite: IndexNode hierarchy ────────────────────────────────

class IndexNode(ABC):
    @abstractmethod
    def accept(self, visitor: "IndexVisitor"):
        ...

    @abstractmethod
    def add_document(self, doc_id: str, text: str) -> None:
        ...


class FieldIndex(IndexNode):
    def __init__(self, name: str, tokenizer: TokenizerStrategy):
        self.name = name
        self.tokenizer = tokenizer
        self.inverted_index: dict[str, set[str]] = defaultdict(set)
        self.tf_scores: dict[str, dict[str, float]] = defaultdict(dict)
        self.doc_count = 0

    def accept(self, visitor: "IndexVisitor"):
        return visitor.visit_field_index(self)

    def add_document(self, doc_id: str, text: str) -> None:
        tokens = self.tokenizer.tokenize(text)
        self.doc_count += 1
        term_freq: dict[str, int] = defaultdict(int)
        for token in tokens:
            term_freq[token] += 1
        total = len(tokens) if tokens else 1
        for term, count in term_freq.items():
            self.inverted_index[term].add(doc_id)
            self.tf_scores[term][doc_id] = count / total

    def search(self, term: str) -> set[str]:
        tokens = self.tokenizer.tokenize(term)
        if not tokens:
            return set()
        return self.inverted_index.get(tokens[0], set())

    def get_tf(self, term: str, doc_id: str) -> float:
        return self.tf_scores.get(term, {}).get(doc_id, 0.0)

    def get_idf(self, term: str) -> float:
        df = len(self.inverted_index.get(term, set()))
        if df == 0 or self.doc_count == 0:
            return 0.0
        return math.log(self.doc_count / df)


class SearchIndex(IndexNode):
    def __init__(self):
        self.fields: dict[str, FieldIndex] = {}

    def accept(self, visitor: "IndexVisitor"):
        return visitor.visit_search_index(self)

    def add_field(self, field_index: FieldIndex) -> None:
        self.fields[field_index.name] = field_index

    def get_field(self, name: str) -> FieldIndex:
        return self.fields[name]

    def add_document(self, doc_id: str, text: str) -> None:
        for fi in self.fields.values():
            fi.add_document(doc_id, text)

    def add_document_fields(self, doc_id: str, fields: dict[str, str]) -> None:
        for field_name, text in fields.items():
            if field_name in self.fields:
                self.fields[field_name].add_document(doc_id, text)


# ── Visitor ───────────────────────────────────────────────────────

class IndexVisitor(ABC):
    @abstractmethod
    def visit_field_index(self, field_index: FieldIndex):
        ...

    @abstractmethod
    def visit_search_index(self, index: SearchIndex):
        ...


class SearchVisitor(IndexVisitor):
    def __init__(self, query: str):
        self.query = query
        self.results: dict[str, float] = defaultdict(float)

    def visit_field_index(self, field_index: FieldIndex) -> None:
        tokens = field_index.tokenizer.tokenize(self.query)
        for token in tokens:
            idf = field_index.get_idf(token)
            for doc_id in field_index.inverted_index.get(token, set()):
                tf = field_index.get_tf(token, doc_id)
                self.results[doc_id] += tf * idf

    def visit_search_index(self, index: SearchIndex) -> None:
        for fi in index.fields.values():
            fi.accept(self)

    def get_results(self) -> list[tuple[str, float]]:
        return sorted(self.results.items(), key=lambda x: -x[1])


class StatsVisitor(IndexVisitor):
    def __init__(self):
        self.total_terms = 0
        self.total_docs = 0
        self.field_stats: list[dict] = []

    def visit_field_index(self, field_index: FieldIndex) -> None:
        terms = len(field_index.inverted_index)
        self.total_terms += terms
        self.total_docs = max(self.total_docs, field_index.doc_count)
        self.field_stats.append({
            "field": field_index.name,
            "unique_terms": terms,
            "documents": field_index.doc_count,
        })

    def visit_search_index(self, index: SearchIndex) -> None:
        for fi in index.fields.values():
            fi.accept(self)


# ── Builder ───────────────────────────────────────────────────────

class IndexBuilder:
    def __init__(self):
        self._fields: dict[str, FieldIndex] = {}

    def add_field(self, name: str, tokenizer: TokenizerStrategy) -> "IndexBuilder":
        self._fields[name] = FieldIndex(name, tokenizer)
        return self

    def build(self) -> SearchIndex:
        if not self._fields:
            raise ValueError("Index must have at least one field")
        index = SearchIndex()
        for fi in self._fields.values():
            index.add_field(fi)
        return index


# ── Query Decorator ───────────────────────────────────────────────

class QueryDecorator(ABC):
    def __init__(self, wrapped: "QueryDecorator | None" = None):
        self.wrapped = wrapped

    @abstractmethod
    def execute(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        ...

    def _delegate(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        if self.wrapped:
            return self.wrapped.execute(results)
        return results


class BoostDecorator(QueryDecorator):
    def __init__(self, multiplier: float, wrapped: QueryDecorator | None = None):
        super().__init__(wrapped)
        self.multiplier = multiplier

    def execute(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        results = self._delegate(results)
        return [(doc_id, score * self.multiplier) for doc_id, score in results]


class FilterDecorator(QueryDecorator):
    def __init__(self, excluded: set[str], wrapped: QueryDecorator | None = None):
        super().__init__(wrapped)
        self.excluded = excluded

    def execute(self, results: list[tuple[str, float]]) -> list[tuple[str, float]]:
        results = self._delegate(results)
        return [(doc_id, score) for doc_id, score in results if doc_id not in self.excluded]


# ── Demo ──────────────────────────────────────────────────────────

if __name__ == "__main__":
    # Build index with different tokenizers per field
    index = (
        IndexBuilder()
        .add_field("title", LowercaseTokenizer())
        .add_field("body", LowercaseTokenizer())
        .add_field("tags", WhitespaceTokenizer())
        .build()
    )

    # Add documents with per-field content
    docs = [
        {"id": "doc1", "title": "Introduction to Search Engines",
         "body": "Search engines use inverted indices to map terms to documents efficiently",
         "tags": "search indexing fundamentals"},
        {"id": "doc2", "title": "Building Inverted Indices",
         "body": "An inverted index stores a mapping from content to its location in documents",
         "tags": "indexing data-structures"},
        {"id": "doc3", "title": "TF-IDF Scoring Explained",
         "body": "Term frequency inverse document frequency measures how important a term is to a document",
         "tags": "scoring ranking search"},
        {"id": "doc4", "title": "Distributed Search Architecture",
         "body": "Sharding distributes the search index across multiple nodes for horizontal scaling",
         "tags": "distributed architecture scaling"},
    ]

    for doc in docs:
        index.add_document_fields(doc["id"], {
            "title": doc["title"],
            "body": doc["body"],
            "tags": doc["tags"],
        })

    # Search with TF-IDF scoring via Visitor
    print("=== Search: 'search' ===")
    visitor = SearchVisitor("search")
    index.accept(visitor)
    for doc_id, score in visitor.get_results():
        print(f"  {doc_id}: {score:.4f}")

    print("\n=== Search: 'inverted index' ===")
    visitor = SearchVisitor("inverted index")
    index.accept(visitor)
    for doc_id, score in visitor.get_results():
        print(f"  {doc_id}: {score:.4f}")

    # Stats via Visitor
    print("\n=== Index Statistics ===")
    stats = StatsVisitor()
    index.accept(stats)
    print(f"  Total unique terms: {stats.total_terms}")
    print(f"  Total documents: {stats.total_docs}")
    for fs in stats.field_stats:
        print(f"  Field '{fs['field']}': {fs['unique_terms']} terms, {fs['documents']} docs")

    # Query Decorators: boost then filter
    print("\n=== Decorated Query: 'search' (boost 2x, filter out doc4) ===")
    visitor = SearchVisitor("search")
    index.accept(visitor)
    raw_results = visitor.get_results()
    pipeline = BoostDecorator(2.0, FilterDecorator({"doc4"}))
    decorated = pipeline.execute(raw_results)
    for doc_id, score in decorated:
        print(f"  {doc_id}: {score:.4f}")

    # Ngram tokenizer demo
    print("\n=== Ngram Tokenizer Demo ===")
    ngram_index = IndexBuilder().add_field("content", NgramTokenizer(3)).build()
    ngram_index.add_document_fields("d1", {"content": "elasticsearch"})
    ngram_index.add_document_fields("d2", {"content": "elastic band"})
    visitor = SearchVisitor("ela")
    ngram_index.accept(visitor)
    print("  Search 'ela':")
    for doc_id, score in visitor.get_results():
        print(f"  {doc_id}: {score:.4f}")

Common Mistakes
- ✗ Using a single flat index for all fields: you can't weight title matches higher than body matches
- ✗ Putting search logic inside index classes: adding TF-IDF scoring requires modifying FieldIndex
- ✗ Hardcoding tokenization: English whitespace splitting doesn't work for CJK languages
- ✗ Not building the index schema upfront: leads to inconsistent field types across documents
Key Points
- ✓ Composite: a SearchIndex contains multiple FieldIndex objects (title, body, tags). Querying the root traverses all relevant sub-indices.
- ✓ Visitor: search, scoring, and statistics collection are visitors that traverse the index tree without modifying index classes.
- ✓ Strategy: tokenization is per-field. Title might use a whitespace tokenizer, body ngram, tags exact match.
- ✓ Builder: the index schema is assembled step by step; build() validates the schema before returning a usable index.
- ✓ Decorator: boosting and filtering are composable post-processing layers wrapped around raw results, leaving the core search pipeline untouched.