Regex Matcher
Parse pattern into an AST, compile to NFA, simulate against input. Supports literals, '.', '*', '+', '?', character classes, and grouping — without catastrophic backtracking.
Key Abstractions
- AST nodes — `Literal`, `AnyChar`, `CharClass`, `Star`, `Plus`, `Optional`, `Concat`, `Alternate`, `StartAnchor`, `EndAnchor`
- Parser — builds the AST; handles precedence, grouping, and desugars bounded `{n,m}` into concat + `Optional`
- NFA builder — Thompson's construction: AST → NFA graph, including position-conditional edges for anchors
- NFA — state machine with epsilon and epsilon-pos transitions; simulation runs all reachable states in lockstep
- Matcher — walks the input; epsilon closure is position-aware so `^` and `$` fire only at boundaries
The Key Insight
The textbook approach — recursive backtracking over the pattern — is the wrong answer. It works on a*b but on (a|aa)*b against aaaaaaaaaaaaaaa it takes exponential time. Real regex engines avoid this with Thompson's construction: compile the pattern once to an NFA, then simulate the NFA by advancing all active states in parallel.
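To make the blowup concrete, here is a throwaway sketch (not part of the design) of the call count a naive backtracker incurs for `(a|aa)*b` against `"a" * n` with no trailing `b`: every way of splitting the a's into singles and pairs is tried, so the count grows like a Fibonacci sequence in n.

```python
def backtrack_count(text: str) -> int:
    """Count calls a naive backtracking matcher for (a|aa)*b makes on `text`."""
    calls = 0

    def match_star(i: int) -> bool:
        nonlocal calls
        calls += 1
        if i < len(text) and text[i] == "b":
            return True  # loop exits, 'b' matches
        # Try consuming one 'a', then two, backtracking between choices.
        if i < len(text) and text[i] == "a" and match_star(i + 1):
            return True
        if i + 1 < len(text) and text[i:i + 2] == "aa" and match_star(i + 2):
            return True
        return False

    match_star(0)
    return calls

for n in (5, 10, 15, 20):
    print(n, backtrack_count("a" * n))
```

The counts roughly multiply by ~11 every five characters; the Thompson simulation below touches each NFA state at most once per character instead.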
Separate the three phases cleanly. The lexer/parser builds an AST from the pattern string. The NFA builder walks the AST, producing a fragment per node and wiring them up with epsilon transitions. The matcher walks the input, using epsilon-closure to find all states reachable without consuming a character. Each phase is self-contained and individually testable.
The beauty of Thompson's construction is that quantifiers become graph shapes. a* is a cycle through 'a' with a bypass edge. a+ is 'a' with a back edge. a|b is a diamond. No combinatorial explosion — each operator adds a constant number of states and edges. The resulting NFA has size proportional to the pattern, and simulation visits each state at most once per input character.
Requirements
Functional
- Literals: `abc` matches the string `abc`
- Anchored full-match semantics (the pattern must match the entire input)
- Metacharacters: `.`, `*`, `+`, `?`
- Alternation: `a|b`
- Grouping: `(abc)+`
- Character classes: `[a-z]`, `[^0-9]`, with ranges
- Escapes: `\.` matches a literal dot
Non-Functional
- No exponential blowup on pathological patterns
- Parse once, match many
- Each phase (parse, build, match) independently testable
- Clear error messages on malformed patterns
Design Decisions
Why Thompson's NFA over backtracking?
Backtracking is conceptually simple but has degenerate cases that run in exponential time. Thompson's construction guarantees O(mn) — where m is pattern size and n is input size — and has no performance cliffs. The trade-off is that features like backreferences (`\1`) don't fit in a pure NFA; real engines either limit them or fall back to backtracking for patterns that use them.
Why split parser and NFA builder?
The parser's job is "understand the pattern grammar." The builder's job is "turn that understanding into a runtime artifact." A new language feature (e.g., {2,5} bounded quantifier) adds a node type and a builder case — without touching the matcher. Tight coupling would spread the change across five files.
Why epsilon closure?
Without epsilon transitions, implementing a|b requires branching logic in the matcher. With epsilons, branching is expressed in the graph itself — a single state with two epsilon outs. The matcher becomes a simple loop: "advance all current states by one char, compute the epsilon closure." That simplicity is the whole point.
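As a standalone illustration of that loop's core, here is the closure computation on a toy graph encoded as a plain dict (not the `State` class used in the implementation below) — just a DFS with a visited set:

```python
def epsilon_closure(states: set[int], eps: dict[int, list[int]]) -> set[int]:
    """All states reachable from `states` via epsilon edges alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for nxt in eps.get(s, []):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

# a|b as a diamond: state 0 has epsilon edges to both branch starts.
print(sorted(epsilon_closure({0}, {0: [1, 2]})))  # [0, 1, 2]
```

The matcher calls this once per input position; the branching for `a|b` lives entirely in the `{0: [1, 2]}` shape of the graph.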
Why sets of states, not lists?
An NFA can reach the same state multiple ways in one step. Using sets means one copy of each active state, no duplicate work, no double-accept. The closure computation uses a set-check to avoid revisiting.
Why recursive descent for the parser?
The regex grammar is small enough that recursive descent fits. Parser combinators or Pratt parsing would be overkill. The precedence layering here (alternation < concat < quantifier) matches how humans read regex.
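Each precedence level corresponds to one grammar rule; a sketch of the grammar this parser implements:

```
alternation  :=  concat ( '|' concat )*                      -- lowest precedence
concat       :=  quantified+
quantified   :=  atom ( '*' | '+' | '?' | '{' bounds '}' )*  -- highest precedence
atom         :=  literal | '.' | '[' class ']'
               | '(' alternation ')' | '\' escaped | '^' | '$'
```

Each rule becomes one `_parse_*` method, and "lower precedence" simply means "calls the next method down."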
Interview Follow-ups
- "How would you add character-class shorthand like
\d,\w,\s?" Extend the lexer to recognize\das a class-token, and wire it to aCharClasswith the digit set. All downstream phases stay the same. - "What about anchors
^and$?" AddStartAnchorandEndAnchorAST nodes. The matcher currently assumes full-match; anchors become no-ops, and a partial-match API would consult them. - "How do backreferences work?" They break pure NFA — you need a PDA (pushdown automaton) or backtracking. Most engines split: NFA path for backref-free patterns, backtrack path otherwise.
- "How would you compile an NFA to a DFA for speed?" Subset construction — each DFA state is a set of NFA states, transitions computed by closing over inputs. Uses more memory but avoids per-match closure computation.
- "How do you handle Unicode?" Characters are code points, not bytes. Classes need range operations over code points. Java's character class behavior is good reference; Python's
reflags forre.UNICODE.
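For the subset-construction follow-up, a minimal sketch on a toy NFA (epsilon-free for brevity; the states and alphabet are illustrative, not tied to the classes below):

```python
from collections import deque

def nfa_to_dfa(trans, start, accepts, alphabet):
    """trans: dict[state -> dict[symbol -> set of next states]]."""
    start_set = frozenset([start])
    dfa, queue = {}, deque([start_set])
    while queue:
        cur = queue.popleft()
        if cur in dfa:
            continue
        dfa[cur] = {}
        for sym in alphabet:
            # Each DFA state is a set of NFA states; its transition is the union.
            nxt = frozenset(t for s in cur for t in trans.get(s, {}).get(sym, ()))
            dfa[cur][sym] = nxt
            queue.append(nxt)
    return dfa, start_set, {S for S in dfa if S & accepts}

# Toy NFA for "strings over {a,b} ending in ab":
# state 0 loops on anything and nondeterministically guesses where the final "ab" starts.
nfa = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}}
dfa, d0, d_accepts = nfa_to_dfa(nfa, 0, {2}, "ab")

def dfa_match(text: str) -> bool:
    cur = d0
    for ch in text:
        cur = dfa[cur][ch]
    return cur in d_accepts

print(dfa_match("aab"), dfa_match("aba"))  # True False
```

After construction, matching is one dict lookup per character — no closure computation at match time, at the cost of potentially exponentially many DFA states.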
Code Implementation
from __future__ import annotations
from abc import ABC
from dataclasses import dataclass


# ------- AST nodes -------

class Node(ABC): pass

@dataclass
class Literal(Node):
    ch: str

@dataclass
class AnyChar(Node):
    pass

@dataclass
class CharClass(Node):
    chars: frozenset[str]
    negated: bool = False

    def matches(self, ch: str) -> bool:
        return (ch in self.chars) != self.negated

@dataclass
class Star(Node):
    child: Node

@dataclass
class Plus(Node):
    child: Node

@dataclass
class Optional(Node):
    child: Node

@dataclass
class Concat(Node):
    left: Node
    right: Node

@dataclass
class Alternate(Node):
    left: Node
    right: Node

class StartAnchor(Node):
    """Zero-width match: only fires at position 0."""

class EndAnchor(Node):
    """Zero-width match: only fires when cursor is at len(input)."""

# ------- Parser (recursive descent) -------

class Parser:
    """Precedence: | (lowest) < concat < * + ? (highest). Group with ()."""

    def __init__(self, pattern: str):
        self._s = pattern
        self._i = 0

    def parse(self) -> Node:
        node = self._parse_alt()
        if self._i != len(self._s):
            raise ValueError(f"trailing characters at position {self._i}")
        return node

    def _parse_alt(self) -> Node:
        left = self._parse_concat()
        while self._peek() == "|":
            self._consume("|")
            right = self._parse_concat()
            left = Alternate(left, right)
        return left

    def _parse_concat(self) -> Node:
        atoms = []
        while self._peek() is not None and self._peek() not in ("|", ")"):
            atoms.append(self._parse_quantified())
        if not atoms:
            return Literal("")  # empty pattern matches empty string
        node = atoms[0]
        for a in atoms[1:]:
            node = Concat(node, a)
        return node

    def _parse_quantified(self) -> Node:
        atom = self._parse_atom()
        while self._peek() in ("*", "+", "?", "{"):
            op = self._peek()
            if op == "{":
                n, m = self._parse_bounds()
                atom = self._desugar_bounded(atom, n, m)
            else:
                self._consume_any()
                if op == "*": atom = Star(atom)
                elif op == "+": atom = Plus(atom)
                elif op == "?": atom = Optional(atom)
        return atom

    def _parse_bounds(self) -> tuple[int, int | None]:
        self._consume("{")
        n_str = ""
        while self._peek() and self._peek().isdigit():
            n_str += self._consume_any()
        if not n_str:
            raise ValueError(f"empty lower bound at {self._i}")
        n = int(n_str)
        if self._peek() == "}":
            self._consume("}")
            return n, n  # {n}
        if self._peek() != ",":
            raise ValueError(f"expected ',' or '}}' in bounds at {self._i}")
        self._consume(",")
        m_str = ""
        while self._peek() and self._peek().isdigit():
            m_str += self._consume_any()
        self._consume("}")
        if not m_str:
            return n, None  # {n,} — unbounded
        m = int(m_str)
        if m < n:
            raise ValueError(f"upper bound {m} < lower {n}")
        return n, m

    @staticmethod
    def _desugar_bounded(child: Node, n: int, m: int | None) -> Node:
        # X{n,m} = X...X (n times) followed by Optional(X)*(m-n).
        # X{n,}  = X...X (n times) followed by Star(X).
        atoms: list[Node] = [child] * n
        if m is None:
            atoms.append(Star(child))
        else:
            atoms.extend([Optional(child)] * (m - n))
        if not atoms:
            return Literal("")  # {0,0}
        node = atoms[0]
        for a in atoms[1:]:
            node = Concat(node, a)
        return node

    def _parse_atom(self) -> Node:
        ch = self._peek()
        if ch is None:
            raise ValueError("unexpected end of pattern")
        if ch == "(":
            self._consume("(")
            inner = self._parse_alt()
            self._consume(")")
            return inner
        if ch == "[":
            return self._parse_class()
        if ch == ".":
            self._consume(".")
            return AnyChar()
        if ch == "^":
            self._consume("^")
            return StartAnchor()
        if ch == "$":
            self._consume("$")
            return EndAnchor()
        if ch == "\\":
            self._consume("\\")
            escaped = self._consume_any()
            return Literal(escaped)
        if ch in "*+?)|{}":
            raise ValueError(f"unexpected '{ch}' at {self._i}")
        self._consume_any()
        return Literal(ch)

    def _parse_class(self) -> CharClass:
        self._consume("[")
        negated = False
        if self._peek() == "^":
            self._consume("^")
            negated = True
        chars: set[str] = set()
        while self._peek() and self._peek() != "]":
            a = self._consume_any()
            if self._peek() == "-" and self._i + 1 < len(self._s) and self._s[self._i + 1] != "]":
                self._consume("-")
                b = self._consume_any()
                for code in range(ord(a), ord(b) + 1):
                    chars.add(chr(code))
            else:
                chars.add(a)
        self._consume("]")
        return CharClass(frozenset(chars), negated)

    def _peek(self) -> str | None:
        return self._s[self._i] if self._i < len(self._s) else None

    def _consume(self, expected: str) -> None:
        if self._peek() != expected:
            raise ValueError(f"expected '{expected}' at {self._i}")
        self._i += 1

    def _consume_any(self) -> str:
        if self._i >= len(self._s):
            raise ValueError("unexpected end of pattern")
        ch = self._s[self._i]
        self._i += 1
        return ch

# ------- NFA -------

class State:
    __slots__ = ("id", "transitions", "epsilon", "epsilon_pos")
    _counter = 0

    def __init__(self):
        State._counter += 1
        self.id = State._counter
        # transitions: list of (matcher_callable, next_state). Matcher takes a char, returns bool.
        self.transitions: list[tuple[object, "State"]] = []
        self.epsilon: list["State"] = []
        # Position-conditional epsilons: (predicate(index, total), next_state). Used for anchors.
        self.epsilon_pos: list[tuple[object, "State"]] = []


@dataclass
class NFA:
    start: State
    accept: State


class NFABuilder:
    """Thompson's construction. Every AST node becomes a small NFA fragment."""

    def build(self, node: Node) -> NFA:
        if isinstance(node, Literal):
            if node.ch == "":
                s = State()
                return NFA(s, s)  # empty pattern matches empty string
            start, accept = State(), State()
            start.transitions.append((lambda c, lit=node.ch: c == lit, accept))
            return NFA(start, accept)

        if isinstance(node, AnyChar):
            start, accept = State(), State()
            start.transitions.append((lambda c: c != "\n", accept))
            return NFA(start, accept)

        if isinstance(node, CharClass):
            start, accept = State(), State()
            start.transitions.append((lambda c, cls=node: cls.matches(c), accept))
            return NFA(start, accept)

        if isinstance(node, Concat):
            left = self.build(node.left)
            right = self.build(node.right)
            left.accept.epsilon.append(right.start)
            return NFA(left.start, right.accept)

        if isinstance(node, Alternate):
            start, accept = State(), State()
            left = self.build(node.left)
            right = self.build(node.right)
            start.epsilon.extend([left.start, right.start])
            left.accept.epsilon.append(accept)
            right.accept.epsilon.append(accept)
            return NFA(start, accept)

        if isinstance(node, Star):
            start, accept = State(), State()
            inner = self.build(node.child)
            start.epsilon.extend([inner.start, accept])
            inner.accept.epsilon.extend([inner.start, accept])
            return NFA(start, accept)

        if isinstance(node, Plus):
            inner = self.build(node.child)
            inner.accept.epsilon.append(inner.start)
            accept = State()
            inner.accept.epsilon.append(accept)
            return NFA(inner.start, accept)

        if isinstance(node, Optional):
            inner = self.build(node.child)
            start, accept = State(), State()
            start.epsilon.extend([inner.start, accept])
            inner.accept.epsilon.append(accept)
            return NFA(start, accept)

        if isinstance(node, StartAnchor):
            start, accept = State(), State()
            start.epsilon_pos.append((lambda i, n: i == 0, accept))
            return NFA(start, accept)

        if isinstance(node, EndAnchor):
            start, accept = State(), State()
            start.epsilon_pos.append((lambda i, n: i == n, accept))
            return NFA(start, accept)

        raise ValueError(f"unknown node {node}")

# ------- Matcher (NFA simulation) -------

class Matcher:
    def __init__(self, pattern: str):
        ast = Parser(pattern).parse()
        self._nfa = NFABuilder().build(ast)

    def match(self, text: str) -> bool:
        n = len(text)
        current = self._epsilon_closure({self._nfa.start}, 0, n)
        for i, ch in enumerate(text):
            if not current:
                return False
            next_states: set[State] = set()
            for state in current:
                for matcher_fn, target in state.transitions:
                    if matcher_fn(ch):  # type: ignore[misc]
                        next_states.add(target)
            current = self._epsilon_closure(next_states, i + 1, n)
        return self._nfa.accept in current

    @staticmethod
    def _epsilon_closure(states: set[State], index: int, total: int) -> set[State]:
        stack = list(states)
        closure = set(states)
        while stack:
            s = stack.pop()
            for nxt in s.epsilon:
                if nxt not in closure:
                    closure.add(nxt)
                    stack.append(nxt)
            for pred, nxt in s.epsilon_pos:
                if nxt not in closure and pred(index, total):  # type: ignore[misc]
                    closure.add(nxt)
                    stack.append(nxt)
        return closure


if __name__ == "__main__":
    cases = [
        ("a", "a", True),
        ("a", "b", False),
        ("a*", "", True),
        ("a*", "aaaa", True),
        ("a+b", "aaab", True),
        ("a+b", "b", False),
        ("ab?c", "ac", True),
        ("ab?c", "abc", True),
        ("ab?c", "abbc", False),
        ("(ab|cd)+", "ababab", True),
        ("(ab|cd)+", "abcd", True),
        ("(ab|cd)+", "abx", False),
        ("[a-c]+", "abcabc", True),
        ("[^0-9]+", "hello", True),
        ("[^0-9]+", "he1lo", False),
        (".*@.*", "a@b", True),
        (".*@.*", "no-at-sign", False),
        ("\\.", ".", True),
        ("\\.", "a", False),
        # Hard case for backtracking engines — runs in linear time here.
        ("(a|aa)*b", "a" * 30 + "b", True),
        # Anchors — start and end.
        ("^abc$", "abc", True),
        ("^abc$", "xabc", False),
        ("^abc$", "abcd", False),
        ("^a.*z$", "a--z", True),
        # Bounded quantifier {n,m}, {n}, {n,}.
        ("^a{2,4}b$", "ab", False),
        ("^a{2,4}b$", "aab", True),
        ("^a{2,4}b$", "aaab", True),
        ("^a{2,4}b$", "aaaab", True),
        ("^a{2,4}b$", "aaaaab", False),
        ("^a{3}$", "aaa", True),
        ("^a{3}$", "aa", False),
        ("^a{2,}$", "aaaaa", True),
        ("^a{2,}$", "a", False),
    ]
    for pattern, text, expected in cases:
        actual = Matcher(pattern).match(text)
        status = "PASS" if actual == expected else "FAIL"
        print(f"{status} /{pattern}/ on {text!r:<32} -> {actual} (expected {expected})")
        assert actual == expected, f"/{pattern}/ on {text!r}"
    print("All assertions passed.")

Common Mistakes
- ✗Implementing regex with recursion and backtracking. Works on simple patterns, explodes on `(a|a)*b` against `aaaaaa`.
- ✗Mixing lexer, parser, and matcher. Every feature addition touches three places at once.
- ✗Modeling '*' as 'zero or more' without the NFA framing. The implementation becomes a maze of nested loops.
- ✗Leaving match semantics implicit. Decide up front whether a bare `a` means full-match or substring-match; `^`/`$` only make sense relative to that choice.
Key Points
- ✓Thompson's NFA simulation is O(mn), where m is pattern length and n is input length. Backtracking regex is exponential in the worst case.
- ✓Parse the pattern once, match many inputs. Separating compile from match is the key performance decision.
- ✓State machine over the input — advance all active states per input char. No stack, no recursion.
- ✓Epsilon transitions handle quantifiers cleanly: a* is 'go forward or loop back' as free state moves.
- ✓Anchors are zero-width. They're position-conditional epsilons — fire only when the cursor is at start (^) or end ($).
- ✓{n,m} is desugared in the parser into n copies of the child plus (m−n) Optional copies — no new NFA shape required.