String Encoding (Length-Prefix Framing)
The Intuition
Encode and Decode Strings (LC 271). Given a list of strings (each may contain any character including punctuation, digits, or even the delimiter you might pick), design a single function encode that turns the list into one string, and another function decode that recovers the original list. The encoded form must allow exact recovery for any input.
The naive idea is to join the strings with a separator like a comma. That fails the moment one of the input strings contains a comma. Try escaping commas with a backslash, and you have to handle escaped backslashes too. The complexity grows fast and there is always one more edge case.
Length-prefix framing sidesteps the entire escape problem. Before each string, write its length in base 10, followed by a fixed delimiter character (# is conventional). Then write the string verbatim. The decoder reads digits until it sees the delimiter, parses the length, then reads exactly that many bytes for the value. The string itself can contain anything, including the delimiter character or all-digit content, because the decoder never scans inside the value to look for boundaries.
Why this works: the length tells the decoder how many bytes to consume. The delimiter is only meaningful in the prefix, never in the body. Once the decoder reads length bytes, it knows the next character is either the start of the next length prefix or the end of input.
This pattern is everywhere in production systems. Redis's RESP protocol uses $<length>\r\n<bulk>. HTTP/2 frames begin with a 24-bit length field. Protocol Buffers prefix every field with a wire-type tag and a length when applicable. The reason is the same in every case: length-prefix framing is the cheapest unambiguous way to delimit variable-length values in a stream.
For interviews, reaching for length-prefix framing immediately on this problem is the right move. Mentioning to the interviewer that the same pattern shows up in real protocols signals you understand why the technique exists, not just that it works.
- Need to serialize a list of variable-length strings or byte arrays
- Input characters are unrestricted (could contain any byte including separators)
- Designing a wire protocol for streams or files
- Encoding or decoding tree or graph structures into a single string
- All strings are known to be a fixed length (just concatenate)
- Input is restricted to a known alphabet that excludes a safe separator (e.g. all lowercase letters; use
,) - You can use a structured format like JSON or Protocol Buffers and the language has a library
Variations:
- Fixed-width length prefix: Use a 4-byte big-endian integer for the length. Slightly faster to parse, no delimiter needed, but caps string length at 2^32 bytes.
- Two-pass encode: First pass computes total size, second pass writes into a pre-sized buffer. Useful in low-level languages to avoid resizing.
- Tree serialization: Encode tree structure with length-prefixed nodes plus parent-child markers. Comes up in Serialize and Deserialize Binary Tree (LC 297).
- Versioned framing: Add a one-byte version prefix before the length. Lets you change the encoding later without breaking old decoders.
- Empty input list (encode returns empty string, decode returns empty list)
- Empty strings inside the list (encoded as
0#) - Very long strings (length prefix grows but the algorithm still works)
- Unicode characters (use byte length, not character count, if the language distinguishes)
Key Points
- •Prefix each string with its byte length and a delimiter
- •Decoder reads digits until the delimiter, then reads exactly that many bytes
- •Works for any input including empty strings and strings containing the delimiter
- •Same idea is used in Redis RESP, HTTP/2 frames, and protocol buffers
- •Avoid join-with-separator approaches because no separator is safe for arbitrary input
Code Template
1 # Encode and Decode Strings (LC 271). Length-prefix framing.
2 def encode(strs):
3 """Encode list of strings into a single string."""
4 parts = []
5 for s in strs:
6 parts.append(f"{len(s)}#{s}")
7 return "".join(parts)
8
9 def decode(s):
10 """Decode the single string back into a list."""
11 result = []
12 i = 0
13 while i < len(s):
14 # Read digits up to the '#' delimiter.
15 j = s.index('#', i)
16 length = int(s[i:j])
17 # The value is the next `length` characters after '#'.
18 start = j + 1
19 end = start + length
20 result.append(s[start:end])
21 i = end
22 return resultCommon Mistakes
- Picking a "safe" separator like comma or pipe and assuming inputs will not contain it
- Using base64 or escaping when length-prefix is simpler and faster
- Off-by-one errors in the substring index when reading the value back
- Forgetting that the length itself is variable-width (parse digits until delimiter)