FoundationalProblem 19 of 20

FoundationalAdvanced

Regex Matcher

Interpreter PatternComposite PatternVisitor Pattern

Parse pattern into an AST, compile to NFA, simulate against input. Supports literals, '.', '*', '+', '?', character classes, and grouping — without catastrophic backtracking.

Key Abstractions

Node

AST node — Literal, AnyChar, CharClass, Star, Plus, Optional, Concat, Alternate, StartAnchor, EndAnchor

Parser

Builds the AST. Handles precedence, grouping, and desugars bounded {n,m} into concat + Optional.

NFABuilder

Thompson's construction — AST → NFA graph, including position-conditional edges for anchors

NFA

State machine with epsilon and epsilon-pos transitions. Simulation runs all reachable states in lockstep.

Matcher

Walks the input; epsilon closure is position-aware so ^ and $ fire only at boundaries.

Class Diagram

The Key Insight

The textbook approach — recursive backtracking over the pattern — is the wrong answer. It works on a*b but on (a|aa)*b against aaaaaaaaaaaaaaa it takes exponential time. Real regex engines avoid this with Thompson's construction: compile the pattern once to an NFA, then simulate the NFA by advancing all active states in parallel.

Separate the three phases cleanly. The lexer/parser builds an AST from the pattern string. The NFA builder walks the AST, producing a fragment per node and wiring them up with epsilon transitions. The matcher walks the input, using epsilon-closure to find all states reachable without consuming a character. Each phase is self-contained and individually testable.

The beauty of Thompson's construction is that quantifiers become graph shapes. a* is a cycle through 'a' with a bypass edge. a+ is 'a' with a back edge. a|b is a diamond. No combinatorial explosion — each operator adds a constant number of states and edges. The resulting NFA has size proportional to the pattern, and simulation visits each state at most once per input character.

Requirements

Functional

Literals: abc matches the string abc
Anchored full-match semantics (pattern must match the entire input)
Metacharacters: ., *, +, ?
Alternation: a|b
Grouping: (abc)+
Character classes: [a-z], [^0-9], with ranges
Escape: \. matches a literal dot

Non-Functional

No exponential blowup on pathological patterns
Parse once, match many
Each phase (parse, build, match) independently testable
Clear error messages on malformed patterns

Design Decisions

Why Thompson's NFA over backtracking?

Backtracking is conceptually simple but has degenerate cases that run exponentially. Thompson's construction is linear — O(mn) where m is pattern size and n is input size — and has no performance cliffs. The trade-off is that features like backreferences (\1) don't fit in a pure NFA; real engines either limit them or fall back to backtracking for patterns that use them.

Why split parser and NFA builder?

The parser's job is "understand the pattern grammar." The builder's job is "turn that understanding into a runtime artifact." A new language feature (e.g., {2,5} bounded quantifier) adds a node type and a builder case — without touching the matcher. Tight coupling would spread the change across five files.

Why epsilon closure?

Without epsilon transitions, implementing a|b requires branching logic in the matcher. With epsilons, branching is expressed in the graph itself — a single state with two epsilon outs. The matcher becomes a simple loop: "advance all current states by one char, compute the epsilon closure." That simplicity is the whole point.

Why sets of states, not lists?

An NFA can reach the same state multiple ways in one step. Using sets means one copy of each active state, no duplicate work, no double-accept. The closure computation uses a set-check to avoid revisiting.

Why parse recursive-descent?

Regex grammar is small enough that recursive descent fits. Parser combinators or Pratt parsing would be overkill. The precedence climbing here (alternation < concat < quantifier) matches how humans read regex.

Interview Follow-ups

"How would you add character-class shorthand like \d, \w, \s?" Extend the lexer to recognize \d as a class-token, and wire it to a CharClass with the digit set. All downstream phases stay the same.
"What about anchors ^ and $?" Add StartAnchor and EndAnchor AST nodes. The matcher currently assumes full-match; anchors become no-ops, and a partial-match API would consult them.
"How do backreferences work?" They break pure NFA — you need a PDA (pushdown automaton) or backtracking. Most engines split: NFA path for backref-free patterns, backtrack path otherwise.
"How would you compile an NFA to a DFA for speed?" Subset construction — each DFA state is a set of NFA states, transitions computed by closing over inputs. Uses more memory but avoids per-match closure computation.
"How do you handle Unicode?" Characters are code points, not bytes. Classes need range operations over code points. Java's character class behavior is good reference; Python's re flags for re.UNICODE.

Code Implementation

  1  from __future__ import annotations
  2  from abc import ABC
  3  from dataclasses import dataclass, field
  4  from typing import Union
  5  
  6  
  7  # ------- AST nodes -------
  8  
  9  class Node(ABC): pass
 10  
 11  @dataclass
 12  class Literal(Node):
 13      ch: str
 14  
 15  @dataclass
 16  class AnyChar(Node):
 17      pass
 18  
 19  @dataclass
 20  class CharClass(Node):
 21      chars: frozenset[str]
 22      negated: bool = False
 23  
 24      def matches(self, ch: str) -> bool:
 25          return (ch in self.chars) != self.negated
 26  
 27  @dataclass
 28  class Star(Node):
 29      child: Node
 30  
 31  @dataclass
 32  class Plus(Node):
 33      child: Node
 34  
 35  @dataclass
 36  class Optional(Node):
 37      child: Node
 38  
 39  @dataclass
 40  class Concat(Node):
 41      left: Node
 42      right: Node
 43  
 44  @dataclass
 45  class Alternate(Node):
 46      left: Node
 47      right: Node
 48  
 49  class StartAnchor(Node):
 50      """Zero-width match: only fires at position 0."""
 51  
 52  class EndAnchor(Node):
 53      """Zero-width match: only fires when cursor is at len(input)."""
 54  
 55  
 56  # ------- Parser (recursive descent) -------
 57  
 58  class Parser:
 59      """Precedence: | (lowest) < concat < * + ? (highest). Group with ()."""
 60  
 61      def __init__(self, pattern: str):
 62          self._s = pattern
 63          self._i = 0
 64  
 65      def parse(self) -> Node:
 66          node = self._parse_alt()
 67          if self._i != len(self._s):
 68              raise ValueError(f"trailing characters at position {self._i}")
 69          return node
 70  
 71      def _parse_alt(self) -> Node:
 72          left = self._parse_concat()
 73          while self._peek() == "|":
 74              self._consume("|")
 75              right = self._parse_concat()
 76              left = Alternate(left, right)
 77          return left
 78  
 79      def _parse_concat(self) -> Node:
 80          atoms = []
 81          while self._peek() is not None and self._peek() not in ("|", ")"):
 82              atoms.append(self._parse_quantified())
 83          if not atoms:
 84              return Literal("")      # empty pattern matches empty string
 85          node = atoms[0]
 86          for a in atoms[1:]:
 87              node = Concat(node, a)
 88          return node
 89  
 90      def _parse_quantified(self) -> Node:
 91          atom = self._parse_atom()
 92          while self._peek() in ("*", "+", "?", "{"):
 93              op = self._peek()
 94              if op == "{":
 95                  n, m = self._parse_bounds()
 96                  atom = self._desugar_bounded(atom, n, m)
 97              else:
 98                  self._consume_any()
 99                  if op == "*": atom = Star(atom)
100                  elif op == "+": atom = Plus(atom)
101                  elif op == "?": atom = Optional(atom)
102          return atom
103  
104      def _parse_bounds(self) -> tuple[int, int | None]:
105          self._consume("{")
106          n_str = ""
107          while self._peek() and self._peek().isdigit():
108              n_str += self._consume_any()
109          if not n_str:
110              raise ValueError(f"empty lower bound at {self._i}")
111          n = int(n_str)
112          if self._peek() == "}":
113              self._consume("}")
114              return n, n                  # {n}
115          if self._peek() != ",":
116              raise ValueError(f"expected ',' or '}}' in bounds at {self._i}")
117          self._consume(",")
118          m_str = ""
119          while self._peek() and self._peek().isdigit():
120              m_str += self._consume_any()
121          self._consume("}")
122          if not m_str:
123              return n, None               # {n,} — unbounded
124          m = int(m_str)
125          if m < n:
126              raise ValueError(f"upper bound {m} < lower {n}")
127          return n, m
128  
129      @staticmethod
130      def _desugar_bounded(child: Node, n: int, m: int | None) -> Node:
131          # X{n,m} = X...X (n times) followed by Optional(X)*(m-n).
132          # X{n,}  = X...X (n times) followed by Star(X).
133          atoms: list[Node] = [child] * n
134          if m is None:
135              atoms.append(Star(child))
136          else:
137              atoms.extend([Optional(child)] * (m - n))
138          if not atoms:
139              return Literal("")           # {0,0}
140          node = atoms[0]
141          for a in atoms[1:]:
142              node = Concat(node, a)
143          return node
144  
145      def _parse_atom(self) -> Node:
146          ch = self._peek()
147          if ch is None:
148              raise ValueError("unexpected end of pattern")
149          if ch == "(":
150              self._consume("(")
151              inner = self._parse_alt()
152              self._consume(")")
153              return inner
154          if ch == "[":
155              return self._parse_class()
156          if ch == ".":
157              self._consume(".")
158              return AnyChar()
159          if ch == "^":
160              self._consume("^")
161              return StartAnchor()
162          if ch == "$":
163              self._consume("$")
164              return EndAnchor()
165          if ch == "\\":
166              self._consume("\\")
167              escaped = self._consume_any()
168              return Literal(escaped)
169          if ch in "*+?)|{}":
170              raise ValueError(f"unexpected '{ch}' at {self._i}")
171          self._consume_any()
172          return Literal(ch)
173  
174      def _parse_class(self) -> CharClass:
175          self._consume("[")
176          negated = False
177          if self._peek() == "^":
178              self._consume("^")
179              negated = True
180          chars: set[str] = set()
181          while self._peek() and self._peek() != "]":
182              a = self._consume_any()
183              if self._peek() == "-" and self._i + 1 < len(self._s) and self._s[self._i + 1] != "]":
184                  self._consume("-")
185                  b = self._consume_any()
186                  for code in range(ord(a), ord(b) + 1):
187                      chars.add(chr(code))
188              else:
189                  chars.add(a)
190          self._consume("]")
191          return CharClass(frozenset(chars), negated)
192  
193      def _peek(self) -> str | None:
194          return self._s[self._i] if self._i < len(self._s) else None
195  
196      def _consume(self, expected: str) -> None:
197          if self._peek() != expected:
198              raise ValueError(f"expected '{expected}' at {self._i}")
199          self._i += 1
200  
201      def _consume_any(self) -> str:
202          ch = self._s[self._i]
203          self._i += 1
204          return ch
205  
206  
207  # ------- NFA -------
208  
209  class State:
210      __slots__ = ("id", "transitions", "epsilon", "epsilon_pos")
211      _counter = 0
212  
213      def __init__(self):
214          State._counter += 1
215          self.id = State._counter
216          # transitions: list of (matcher_callable, next_state). Matcher takes a char, returns bool.
217          self.transitions: list[tuple[object, "State"]] = []
218          self.epsilon: list["State"] = []
219          # Position-conditional epsilons: (predicate(index, total), next_state). Used for anchors.
220          self.epsilon_pos: list[tuple[object, "State"]] = []
221  
222  
223  @dataclass
224  class NFA:
225      start: State
226      accept: State
227  
228  
229  class NFABuilder:
230      """Thompson's construction. Every AST node becomes a small NFA fragment."""
231  
232      def build(self, node: Node) -> NFA:
233          if isinstance(node, Literal):
234              if node.ch == "":
235                  s = State()
236                  return NFA(s, s)   # empty pattern matches empty string
237              start, accept = State(), State()
238              start.transitions.append((lambda c, lit=node.ch: c == lit, accept))
239              return NFA(start, accept)
240  
241          if isinstance(node, AnyChar):
242              start, accept = State(), State()
243              start.transitions.append((lambda c: c != "\n", accept))
244              return NFA(start, accept)
245  
246          if isinstance(node, CharClass):
247              start, accept = State(), State()
248              start.transitions.append((lambda c, cls=node: cls.matches(c), accept))
249              return NFA(start, accept)
250  
251          if isinstance(node, Concat):
252              left = self.build(node.left)
253              right = self.build(node.right)
254              left.accept.epsilon.append(right.start)
255              return NFA(left.start, right.accept)
256  
257          if isinstance(node, Alternate):
258              start, accept = State(), State()
259              left = self.build(node.left)
260              right = self.build(node.right)
261              start.epsilon.extend([left.start, right.start])
262              left.accept.epsilon.append(accept)
263              right.accept.epsilon.append(accept)
264              return NFA(start, accept)
265  
266          if isinstance(node, Star):
267              start, accept = State(), State()
268              inner = self.build(node.child)
269              start.epsilon.extend([inner.start, accept])
270              inner.accept.epsilon.extend([inner.start, accept])
271              return NFA(start, accept)
272  
273          if isinstance(node, Plus):
274              inner = self.build(node.child)
275              inner.accept.epsilon.append(inner.start)
276              accept = State()
277              inner.accept.epsilon.append(accept)
278              return NFA(inner.start, accept)
279  
280          if isinstance(node, Optional):
281              inner = self.build(node.child)
282              start, accept = State(), State()
283              start.epsilon.extend([inner.start, accept])
284              inner.accept.epsilon.append(accept)
285              return NFA(start, accept)
286  
287          if isinstance(node, StartAnchor):
288              start, accept = State(), State()
289              start.epsilon_pos.append((lambda i, n: i == 0, accept))
290              return NFA(start, accept)
291  
292          if isinstance(node, EndAnchor):
293              start, accept = State(), State()
294              start.epsilon_pos.append((lambda i, n: i == n, accept))
295              return NFA(start, accept)
296  
297          raise ValueError(f"unknown node {node}")
298  
299  
300  # ------- Matcher (NFA simulation) -------
301  
302  class Matcher:
303      def __init__(self, pattern: str):
304          ast = Parser(pattern).parse()
305          self._nfa = NFABuilder().build(ast)
306  
307      def match(self, text: str) -> bool:
308          n = len(text)
309          current = self._epsilon_closure({self._nfa.start}, 0, n)
310          for i, ch in enumerate(text):
311              if not current:
312                  return False
313              next_states: set[State] = set()
314              for state in current:
315                  for matcher_fn, target in state.transitions:
316                      if matcher_fn(ch):   # type: ignore[misc]
317                          next_states.add(target)
318              current = self._epsilon_closure(next_states, i + 1, n)
319          return self._nfa.accept in current
320  
321      @staticmethod
322      def _epsilon_closure(states: set[State], index: int, total: int) -> set[State]:
323          stack = list(states)
324          closure = set(states)
325          while stack:
326              s = stack.pop()
327              for nxt in s.epsilon:
328                  if nxt not in closure:
329                      closure.add(nxt)
330                      stack.append(nxt)
331              for pred, nxt in s.epsilon_pos:
332                  if nxt not in closure and pred(index, total):   # type: ignore[misc]
333                      closure.add(nxt)
334                      stack.append(nxt)
335          return closure
336  
337  
338  if __name__ == "__main__":
339      cases = [
340          ("a", "a", True),
341          ("a", "b", False),
342          ("a*", "", True),
343          ("a*", "aaaa", True),
344          ("a+b", "aaab", True),
345          ("a+b", "b", False),
346          ("ab?c", "ac", True),
347          ("ab?c", "abc", True),
348          ("ab?c", "abbc", False),
349          ("(ab|cd)+", "ababab", True),
350          ("(ab|cd)+", "abcd", True),
351          ("(ab|cd)+", "abx", False),
352          ("[a-c]+", "abcabc", True),
353          ("[^0-9]+", "hello", True),
354          ("[^0-9]+", "he1lo", False),
355          (".*@.*", "a@b", True),
356          (".*@.*", "no-at-sign", False),
357          ("\\.", ".", True),
358          ("\\.", "a", False),
359          # Hard case for backtracking engines — runs in linear time here.
360          ("(a|aa)*b", "a" * 30 + "b", True),
361          # Anchors — start and end.
362          ("^abc$", "abc", True),
363          ("^abc$", "xabc", False),
364          ("^abc$", "abcd", False),
365          ("^a.*z$", "a--z", True),
366          # Bounded quantifier {n,m}, {n}, {n,}.
367          ("^a{2,4}b$", "ab", False),
368          ("^a{2,4}b$", "aab", True),
369          ("^a{2,4}b$", "aaab", True),
370          ("^a{2,4}b$", "aaaab", True),
371          ("^a{2,4}b$", "aaaaab", False),
372          ("^a{3}$", "aaa", True),
373          ("^a{3}$", "aa", False),
374          ("^a{2,}$", "aaaaa", True),
375          ("^a{2,}$", "a", False),
376      ]
377      for pattern, text, expected in cases:
378          actual = Matcher(pattern).match(text)
379          status = "PASS" if actual == expected else "FAIL"
380          print(f"{status}  /{pattern}/ on {text!r:<32} -> {actual}  (expected {expected})")
381          assert actual == expected, f"/{pattern}/ on {text!r}"
382      print("All assertions passed.")

Common Mistakes

✗Implementing regex with recursion and backtracking. Works on simple patterns, explodes on `(a|a)*b` against `aaaaaa`.
✗Mixing lexer, parser, and matcher. Every feature addition touches three places at once.
✗Modeling '*' as 'zero or more' without the NFA framing. The implementation becomes a maze of nested loops.
✗Forgetting anchors. `a` without `^`/`$` context must report whether it's substring-match or full-match.

Key Points

✓Thompson's NFA is O(n*m) where n is pattern length and m is input length. Backtracking regex is exponential.
✓Parse the pattern once, match many inputs. Separating compile from match is the key performance decision.
✓State machine over the input — advance all active states per input char. No stack, no recursion.
✓Epsilon transitions handle quantifiers cleanly: a* is 'go forward or loop back' as free state moves.
✓Anchors are zero-width. They're position-conditional epsilons — fire only when the cursor is at start (^) or end ($).
✓{n,m} is desugared in the parser into n copies of the child plus (m−n) Optional copies — no new NFA shape required.

Related Designs