Z-Algorithm
The Intuition
Most string-search problems boil down to one question: where does the pattern P appear inside the text T? The simple answer is to slide P along T character by character, but that takes time proportional to the product of the two lengths. The Z-Algorithm replaces the sliding with a one-time table lookup that runs in linear time.
The table is the Z-array of a string s. For each position i in s, the entry Z[i] is the length of the longest piece starting at i that exactly matches the start of s. If s starts with "abc" and position 7 also starts with "abc...", then Z[7] is at least 3.
To search for a pattern P in a text T, glue them together with a separator that cannot appear in either: build the string P + sep + T and compute the Z-array of the result. Wherever Z[i] equals the length of P, you have found a copy of P inside T at the corresponding position.
The reason the build is linear is that the algorithm keeps a window of "stuff we have already matched" between two pointers l and r. When the loop reaches position i and i is still inside that window, it can copy the answer from earlier (because the window matches the prefix). When it leaves the window, it falls back to character-by-character comparison, but only on characters it has not seen before. Each character of s contributes at most one comparison total. The total work across the whole loop is O(n).
The pattern transfers wherever you would think about pattern matching. Counting how often a substring appears, finding the longest piece that is both a prefix and a suffix, splitting a string into the smallest building block, building the shortest palindrome that can be made by adding characters in front. They all become "build the Z-array of some cleverly constructed string and read off the answer".
KMP is the more famous algorithm in the same space and runs in the same linear time. Pick whichever feels simpler to remember. Z is often easier because the table has a direct, intuitive meaning.
Searching for one or many occurrences of a pattern in a text. Counting matches. Finding the longest prefix that is also a suffix. Periodicity questions. Shortest palindrome by prefix.
- The text is enormous and lives in storage (use a streaming algorithm or a suffix structure)
- You need fuzzy matching (this is for exact matches only)
- The "pattern" changes constantly (build a suffix automaton instead)
Variations:
- Counting matches: Sum over Z[i] equals length of P, no need to store positions.
- Borders: A "border" is a piece that is both a prefix and a suffix. The Z-array can be turned into the border array with one extra pass.
- Period detection: A string of length n has a period p (a repeating unit) if and only if
n - pshows up as a Z value at the right place.
- empty pattern (define behavior up front; the code treats it as "matches everywhere")
- pattern longer than text (no matches)
- pattern equals text (one match at position 0)
- separator character that could appear in either string (use a value that cannot, like null byte or a code point outside the alphabet)
- characters past the end of the string (always check bounds inside the extension loop)
Key Points
- •Build a "Z-array" once. Z[i] is the length of the longest piece starting at position i that matches the start of the string.
- •With that array, finding any pattern in any text becomes very fast.
- •To search for pattern P in text T, build the combined string P + special_char + T and run the Z algorithm on it. Wherever Z[i] equals the length of P, that is a match.
- •The algorithm is "linear" because it reuses what it learned from earlier positions, never re-comparing characters that are already known to match.
- •Compared to KMP, the math here is easier to follow. Both run in linear time.
Code Template
1 def z_array(s):
2 """
3 Build the Z-array of s.
4 Z[i] = length of the longest substring starting at i that matches a prefix of s.
5 Z[0] is left at 0 by convention.
6 """
7 n = len(s)
8 z = [0] * n
9 l, r = 0, 0 # current matched window [l..r-1] of s
10
11 for i in range(1, n):
12 if i < r:
13 # We are inside the previously matched window. Reuse what we know.
14 # The character at i corresponds to s[i - l] in the prefix.
15 z[i] = min(r - i, z[i - l])
16
17 # Try to extend the match character by character.
18 while i + z[i] < n and s[z[i]] == s[i + z[i]]:
19 z[i] += 1
20
21 # Update the window if this match extends past r.
22 if i + z[i] > r:
23 l, r = i, i + z[i]
24 return z
25
26 def find_all_occurrences(text, pattern):
27 """Return every starting index where pattern appears in text."""
28 if not pattern:
29 return list(range(len(text) + 1))
30 sep = "\x00" # a character that cannot appear in either string
31 combined = pattern + sep + text
32 z = z_array(combined)
33 m = len(pattern)
34 hits = []
35 # In the combined string, the text starts at index m + 1.
36 for i in range(m + 1, len(combined)):
37 if z[i] == m:
38 hits.append(i - m - 1)
39 return hitsCommon Mistakes
- Using a separator character that can also appear in the pattern (pick something that cannot, like a special marker)
- Off-by-one in the window boundaries l and r (the window is closed-open or closed-closed depending on the variant; pick one and stay consistent)
- Setting Z[0] to anything other than 0 (the convention is to leave it 0 because comparing the string to itself is trivial)
- Forgetting that Z[i] for i past the window must be re-built character by character