String Matching (KMP)
The Intuition
The brute-force substring search is O(n * m): at each starting position in the text, compare up to m characters of the pattern. The advanced algorithms below all reach O(n + m) by avoiding redundant comparisons after a partial match fails.
KMP (Knuth-Morris-Pratt). The key observation: when a mismatch occurs after matching k characters, some suffix of those k matched characters may also be a prefix of the pattern. The text pointer never moves backward; only the pattern pointer rewinds, and only to a position determined by the precomputed failure function (or LPS array).
The failure function lps[i] is defined as the length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i]. It is computed in O(m) time by matching the pattern against itself. During the actual search, when a mismatch happens at pattern index j, set j = lps[j-1] (instead of resetting to 0) and re-compare without moving the text pointer.
The total search runs in O(n + m). The text pointer monotonically advances, and the pattern pointer can only fall back along the failure-function chain, so total work is linear even for adversarial inputs like searching for "aaaaab" in "aaaaaaaaaa".
Rabin-Karp. Hash the first m characters of the text and compare to the hash of the pattern. If hashes match, verify by direct comparison (to handle hash collisions). Slide the window one character at a time, updating the hash in O(1) using a rolling hash function. Average-case O(n + m), worst-case O(n * m) if every window collides. Rabin-Karp's main advantage is multi-pattern search: hash all patterns once into a set, then check each window's hash against the set in O(1).
Z-algorithm. Computes the Z-array, where Z[i] is the length of the longest substring starting at i that matches a prefix of the input. To search for pattern in text, run Z-algorithm on pattern + '#' + text and look for any Z[i] >= m. Comparable to KMP in performance, often easier to implement, and useful for problems involving prefix-suffix relationships within a single string.
Substring search, pattern detection in strings, repeated pattern identification, plagiarism detection, or any problem requiring efficient text search without backtracking.
Pattern has complex regex features (use regex engine), need approximate/fuzzy matching (use edit distance), or searching for multiple patterns simultaneously (use Aho-Corasick).
- Empty pattern or empty text should be handled gracefully
- Pattern longer than the text returns no matches
- Pattern and text are identical (single full match)
- All characters in text and pattern are the same (e.g., "aaaa" in "aaaaaa")
- Single-character pattern or single-character text
- Pattern appears at the very beginning or very end of the text
- Overlapping matches (e.g., "aa" in "aaaa" yields indices 0, 1, 2)
Key Points
- •KMP builds a failure function (partial match table) to skip redundant comparisons
- •Rabin-Karp uses rolling hash for average-case O(n + m) multi-pattern matching
- •KMP guarantees worst-case O(n + m) with no backtracking on the text
- •The failure function represents the longest proper prefix that is also a suffix
- •Rabin-Karp can degrade to O(n * m) with many hash collisions
Code Template
def build_kmp_table(pattern):
"""Build the KMP partial match (failure) table."""
m = len(pattern)
table = [0] * m
length = 0
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
table[i] = length
i += 1
elif length > 0:
length = table[length - 1]
else:
table[i] = 0
i += 1
return table
def kmp_search(text, pattern):
"""Find all occurrences of pattern in text using KMP."""
n, m = len(text), len(pattern)
if m == 0:
return [0]
table = build_kmp_table(pattern)
results = []
j = 0 # pointer in pattern
for i in range(n):
while j > 0 and text[i] != pattern[j]:
j = table[j - 1]
if text[i] == pattern[j]:
j += 1
if j == m:
results.append(i - m + 1)
j = table[j - 1]
return results
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
"""Find all occurrences of pattern in text using Rabin-Karp."""
n, m = len(text), len(pattern)
if m > n:
return []
# Compute hash of pattern and first window
p_hash = 0
t_hash = 0
power = 1
for i in range(m - 1):
power = (power * base) % mod
for i in range(m):
p_hash = (p_hash * base + ord(pattern[i])) % mod
t_hash = (t_hash * base + ord(text[i])) % mod
results = []
for i in range(n - m + 1):
if p_hash == t_hash and text[i:i + m] == pattern:
results.append(i)
if i < n - m:
t_hash = (t_hash - ord(text[i]) * power) % mod
t_hash = (t_hash * base + ord(text[i + m])) % mod
return resultsCommon Mistakes
- Off-by-one errors when building the KMP failure table
- Using a weak hash function in Rabin-Karp causing excessive collisions
- Forgetting to handle the case where the pattern is longer than the text
- Not resetting the failure pointer correctly when a mismatch occurs