Regex Matcher
Parse pattern into an AST, compile to NFA, simulate against input. Supports literals, '.', '*', '+', '?', character classes, and grouping — without catastrophic backtracking.
Key Abstractions
AST node — Literal, AnyChar, CharClass, Star, Plus, Optional, Concat, Alternate, StartAnchor, EndAnchor
Builds the AST. Handles precedence, grouping, and desugars bounded {n,m} into concat + Optional.
Thompson's construction — AST → NFA graph, including position-conditional edges for anchors
State machine with epsilon and epsilon-pos transitions. Simulation runs all reachable states in lockstep.
Walks the input; epsilon closure is position-aware so ^ and $ fire only at boundaries.
Class Diagram
The Key Insight
The textbook approach — recursive backtracking over the pattern — is the wrong answer. It works on a*b but on (a|aa)*b against aaaaaaaaaaaaaaa it takes exponential time. Real regex engines avoid this with Thompson's construction: compile the pattern once to an NFA, then simulate the NFA by advancing all active states in parallel.
Separate the three phases cleanly. The lexer/parser builds an AST from the pattern string. The NFA builder walks the AST, producing a fragment per node and wiring them up with epsilon transitions. The matcher walks the input, using epsilon-closure to find all states reachable without consuming a character. Each phase is self-contained and individually testable.
The beauty of Thompson's construction is that quantifiers become graph shapes. a* is a cycle through 'a' with a bypass edge. a+ is 'a' with a back edge. a|b is a diamond. No combinatorial explosion — each operator adds a constant number of states and edges. The resulting NFA has size proportional to the pattern, and simulation visits each state at most once per input character.
Requirements
Functional
- Literals:
abcmatches the stringabc - Anchored full-match semantics (pattern must match the entire input)
- Metacharacters:
.,*,+,? - Alternation:
a|b - Grouping:
(abc)+ - Character classes:
[a-z],[^0-9], with ranges - Escape:
\.matches a literal dot
Non-Functional
- No exponential blowup on pathological patterns
- Parse once, match many
- Each phase (parse, build, match) independently testable
- Clear error messages on malformed patterns
Design Decisions
Why Thompson's NFA over backtracking?
Backtracking is conceptually simple but has degenerate cases that run exponentially. Thompson's construction is linear — O(mn) where m is pattern size and n is input size — and has no performance cliffs. The trade-off is that features like backreferences (\1) don't fit in a pure NFA; real engines either limit them or fall back to backtracking for patterns that use them.
Why split parser and NFA builder?
The parser's job is "understand the pattern grammar." The builder's job is "turn that understanding into a runtime artifact." A new language feature (e.g., {2,5} bounded quantifier) adds a node type and a builder case — without touching the matcher. Tight coupling would spread the change across five files.
Why epsilon closure?
Without epsilon transitions, implementing a|b requires branching logic in the matcher. With epsilons, branching is expressed in the graph itself — a single state with two epsilon outs. The matcher becomes a simple loop: "advance all current states by one char, compute the epsilon closure." That simplicity is the whole point.
Why sets of states, not lists?
An NFA can reach the same state multiple ways in one step. Using sets means one copy of each active state, no duplicate work, no double-accept. The closure computation uses a set-check to avoid revisiting.
Why parse recursive-descent?
Regex grammar is small enough that recursive descent fits. Parser combinators or Pratt parsing would be overkill. The precedence climbing here (alternation < concat < quantifier) matches how humans read regex.
Interview Follow-ups
- "How would you add character-class shorthand like
\d,\w,\s?" Extend the lexer to recognize\das a class-token, and wire it to aCharClasswith the digit set. All downstream phases stay the same. - "What about anchors
^and$?" AddStartAnchorandEndAnchorAST nodes. The matcher currently assumes full-match; anchors become no-ops, and a partial-match API would consult them. - "How do backreferences work?" They break pure NFA — you need a PDA (pushdown automaton) or backtracking. Most engines split: NFA path for backref-free patterns, backtrack path otherwise.
- "How would you compile an NFA to a DFA for speed?" Subset construction — each DFA state is a set of NFA states, transitions computed by closing over inputs. Uses more memory but avoids per-match closure computation.
- "How do you handle Unicode?" Characters are code points, not bytes. Classes need range operations over code points. Java's character class behavior is good reference; Python's
reflags forre.UNICODE.
Code Implementation
from __future__ import annotations
from abc import ABC
from dataclasses import dataclass, field
from typing import Union
# ------- AST nodes -------
class Node(ABC): pass
@dataclass
class Literal(Node):
ch: str
@dataclass
class AnyChar(Node):
pass
@dataclass
class CharClass(Node):
chars: frozenset[str]
negated: bool = False
def matches(self, ch: str) -> bool:
return (ch in self.chars) != self.negated
@dataclass
class Star(Node):
child: Node
@dataclass
class Plus(Node):
child: Node
@dataclass
class Optional(Node):
child: Node
@dataclass
class Concat(Node):
left: Node
right: Node
@dataclass
class Alternate(Node):
left: Node
right: Node
class StartAnchor(Node):
"""Zero-width match: only fires at position 0."""
class EndAnchor(Node):
"""Zero-width match: only fires when cursor is at len(input)."""
# ------- Parser (recursive descent) -------
class Parser:
"""Precedence: | (lowest) < concat < * + ? (highest). Group with ()."""
def __init__(self, pattern: str):
self._s = pattern
self._i = 0
def parse(self) -> Node:
node = self._parse_alt()
if self._i != len(self._s):
raise ValueError(f"trailing characters at position {self._i}")
return node
def _parse_alt(self) -> Node:
left = self._parse_concat()
while self._peek() == "|":
self._consume("|")
right = self._parse_concat()
left = Alternate(left, right)
return left
def _parse_concat(self) -> Node:
atoms = []
while self._peek() is not None and self._peek() not in ("|", ")"):
atoms.append(self._parse_quantified())
if not atoms:
return Literal("") # empty pattern matches empty string
node = atoms[0]
for a in atoms[1:]:
node = Concat(node, a)
return node
def _parse_quantified(self) -> Node:
atom = self._parse_atom()
while self._peek() in ("*", "+", "?", "{"):
op = self._peek()
if op == "{":
n, m = self._parse_bounds()
atom = self._desugar_bounded(atom, n, m)
else:
self._consume_any()
if op == "*": atom = Star(atom)
elif op == "+": atom = Plus(atom)
elif op == "?": atom = Optional(atom)
return atom
def _parse_bounds(self) -> tuple[int, int | None]:
self._consume("{")
n_str = ""
while self._peek() and self._peek().isdigit():
n_str += self._consume_any()
if not n_str:
raise ValueError(f"empty lower bound at {self._i}")
n = int(n_str)
if self._peek() == "}":
self._consume("}")
return n, n # {n}
if self._peek() != ",":
raise ValueError(f"expected ',' or '}}' in bounds at {self._i}")
self._consume(",")
m_str = ""
while self._peek() and self._peek().isdigit():
m_str += self._consume_any()
self._consume("}")
if not m_str:
return n, None # {n,} — unbounded
m = int(m_str)
if m < n:
raise ValueError(f"upper bound {m} < lower {n}")
return n, m
@staticmethod
def _desugar_bounded(child: Node, n: int, m: int | None) -> Node:
# X{n,m} = X...X (n times) followed by Optional(X)*(m-n).
# X{n,} = X...X (n times) followed by Star(X).
atoms: list[Node] = [child] * n
if m is None:
atoms.append(Star(child))
else:
atoms.extend([Optional(child)] * (m - n))
if not atoms:
return Literal("") # {0,0}
node = atoms[0]
for a in atoms[1:]:
node = Concat(node, a)
return node
def _parse_atom(self) -> Node:
ch = self._peek()
if ch is None:
raise ValueError("unexpected end of pattern")
if ch == "(":
self._consume("(")
inner = self._parse_alt()
self._consume(")")
return inner
if ch == "[":
return self._parse_class()
if ch == ".":
self._consume(".")
return AnyChar()
if ch == "^":
self._consume("^")
return StartAnchor()
if ch == "$":
self._consume("$")
return EndAnchor()
if ch == "\\":
self._consume("\\")
escaped = self._consume_any()
return Literal(escaped)
if ch in "*+?)|{}":
raise ValueError(f"unexpected '{ch}' at {self._i}")
self._consume_any()
return Literal(ch)
def _parse_class(self) -> CharClass:
self._consume("[")
negated = False
if self._peek() == "^":
self._consume("^")
negated = True
chars: set[str] = set()
while self._peek() and self._peek() != "]":
a = self._consume_any()
if self._peek() == "-" and self._i + 1 < len(self._s) and self._s[self._i + 1] != "]":
self._consume("-")
b = self._consume_any()
for code in range(ord(a), ord(b) + 1):
chars.add(chr(code))
else:
chars.add(a)
self._consume("]")
return CharClass(frozenset(chars), negated)
def _peek(self) -> str | None:
return self._s[self._i] if self._i < len(self._s) else None
def _consume(self, expected: str) -> None:
if self._peek() != expected:
raise ValueError(f"expected '{expected}' at {self._i}")
self._i += 1
def _consume_any(self) -> str:
ch = self._s[self._i]
self._i += 1
return ch
# ------- NFA -------
class State:
__slots__ = ("id", "transitions", "epsilon", "epsilon_pos")
_counter = 0
def __init__(self):
State._counter += 1
self.id = State._counter
# transitions: list of (matcher_callable, next_state). Matcher takes a char, returns bool.
self.transitions: list[tuple[object, "State"]] = []
self.epsilon: list["State"] = []
# Position-conditional epsilons: (predicate(index, total), next_state). Used for anchors.
self.epsilon_pos: list[tuple[object, "State"]] = []
@dataclass
class NFA:
start: State
accept: State
class NFABuilder:
"""Thompson's construction. Every AST node becomes a small NFA fragment."""
def build(self, node: Node) -> NFA:
if isinstance(node, Literal):
if node.ch == "":
s = State()
return NFA(s, s) # empty pattern matches empty string
start, accept = State(), State()
start.transitions.append((lambda c, lit=node.ch: c == lit, accept))
return NFA(start, accept)
if isinstance(node, AnyChar):
start, accept = State(), State()
start.transitions.append((lambda c: c != "\n", accept))
return NFA(start, accept)
if isinstance(node, CharClass):
start, accept = State(), State()
start.transitions.append((lambda c, cls=node: cls.matches(c), accept))
return NFA(start, accept)
if isinstance(node, Concat):
left = self.build(node.left)
right = self.build(node.right)
left.accept.epsilon.append(right.start)
return NFA(left.start, right.accept)
if isinstance(node, Alternate):
start, accept = State(), State()
left = self.build(node.left)
right = self.build(node.right)
start.epsilon.extend([left.start, right.start])
left.accept.epsilon.append(accept)
right.accept.epsilon.append(accept)
return NFA(start, accept)
if isinstance(node, Star):
start, accept = State(), State()
inner = self.build(node.child)
start.epsilon.extend([inner.start, accept])
inner.accept.epsilon.extend([inner.start, accept])
return NFA(start, accept)
if isinstance(node, Plus):
inner = self.build(node.child)
inner.accept.epsilon.append(inner.start)
accept = State()
inner.accept.epsilon.append(accept)
return NFA(inner.start, accept)
if isinstance(node, Optional):
inner = self.build(node.child)
start, accept = State(), State()
start.epsilon.extend([inner.start, accept])
inner.accept.epsilon.append(accept)
return NFA(start, accept)
if isinstance(node, StartAnchor):
start, accept = State(), State()
start.epsilon_pos.append((lambda i, n: i == 0, accept))
return NFA(start, accept)
if isinstance(node, EndAnchor):
start, accept = State(), State()
start.epsilon_pos.append((lambda i, n: i == n, accept))
return NFA(start, accept)
raise ValueError(f"unknown node {node}")
# ------- Matcher (NFA simulation) -------
class Matcher:
def __init__(self, pattern: str):
ast = Parser(pattern).parse()
self._nfa = NFABuilder().build(ast)
def match(self, text: str) -> bool:
n = len(text)
current = self._epsilon_closure({self._nfa.start}, 0, n)
for i, ch in enumerate(text):
if not current:
return False
next_states: set[State] = set()
for state in current:
for matcher_fn, target in state.transitions:
if matcher_fn(ch): # type: ignore[misc]
next_states.add(target)
current = self._epsilon_closure(next_states, i + 1, n)
return self._nfa.accept in current
@staticmethod
def _epsilon_closure(states: set[State], index: int, total: int) -> set[State]:
stack = list(states)
closure = set(states)
while stack:
s = stack.pop()
for nxt in s.epsilon:
if nxt not in closure:
closure.add(nxt)
stack.append(nxt)
for pred, nxt in s.epsilon_pos:
if nxt not in closure and pred(index, total): # type: ignore[misc]
closure.add(nxt)
stack.append(nxt)
return closure
if __name__ == "__main__":
cases = [
("a", "a", True),
("a", "b", False),
("a*", "", True),
("a*", "aaaa", True),
("a+b", "aaab", True),
("a+b", "b", False),
("ab?c", "ac", True),
("ab?c", "abc", True),
("ab?c", "abbc", False),
("(ab|cd)+", "ababab", True),
("(ab|cd)+", "abcd", True),
("(ab|cd)+", "abx", False),
("[a-c]+", "abcabc", True),
("[^0-9]+", "hello", True),
("[^0-9]+", "he1lo", False),
(".*@.*", "a@b", True),
(".*@.*", "no-at-sign", False),
("\\.", ".", True),
("\\.", "a", False),
# Hard case for backtracking engines — runs in linear time here.
("(a|aa)*b", "a" * 30 + "b", True),
# Anchors — start and end.
("^abc$", "abc", True),
("^abc$", "xabc", False),
("^abc$", "abcd", False),
("^a.*z$", "a--z", True),
# Bounded quantifier {n,m}, {n}, {n,}.
("^a{2,4}b$", "ab", False),
("^a{2,4}b$", "aab", True),
("^a{2,4}b$", "aaab", True),
("^a{2,4}b$", "aaaab", True),
("^a{2,4}b$", "aaaaab", False),
("^a{3}$", "aaa", True),
("^a{3}$", "aa", False),
("^a{2,}$", "aaaaa", True),
("^a{2,}$", "a", False),
]
for pattern, text, expected in cases:
actual = Matcher(pattern).match(text)
status = "PASS" if actual == expected else "FAIL"
print(f"{status} /{pattern}/ on {text!r:<32} -> {actual} (expected {expected})")
assert actual == expected, f"/{pattern}/ on {text!r}"
print("All assertions passed.")Common Mistakes
- ✗Implementing regex with recursion and backtracking. Works on simple patterns, explodes on `(a|a)*b` against `aaaaaa`.
- ✗Mixing lexer, parser, and matcher. Every feature addition touches three places at once.
- ✗Modeling '*' as 'zero or more' without the NFA framing. The implementation becomes a maze of nested loops.
- ✗Forgetting anchors. `a` without `^`/`$` context must report whether it's substring-match or full-match.
Key Points
- ✓Thompson's NFA is O(n*m) where n is pattern length and m is input length. Backtracking regex is exponential.
- ✓Parse the pattern once, match many inputs. Separating compile from match is the key performance decision.
- ✓State machine over the input — advance all active states per input char. No stack, no recursion.
- ✓Epsilon transitions handle quantifiers cleanly: a* is 'go forward or loop back' as free state moves.
- ✓Anchors are zero-width. They're position-conditional epsilons — fire only when the cursor is at start (^) or end ($).
- ✓{n,m} is desugared in the parser into n copies of the child plus (m−n) Optional copies — no new NFA shape required.