String Encoding (Length-Prefix Framing)

The Intuition

Encode and Decode Strings (LC 271). Given a list of strings (each may contain any character including punctuation, digits, or even the delimiter you might pick), design a single function encode that turns the list into one string, and another function decode that recovers the original list. The encoded form must allow exact recovery for any input.

The naive idea is to join the strings with a separator like a comma. That fails the moment one of the input strings contains a comma. Try escaping commas with a backslash, and you have to handle escaped backslashes too. The complexity grows fast and there is always one more edge case.

Length-prefix framing sidesteps the entire escape problem. Before each string, write its length in base 10, followed by a fixed delimiter character (# is conventional). Then write the string verbatim. The decoder reads digits until it sees the delimiter, parses the length, then reads exactly that many bytes for the value. The string itself can contain anything, including the delimiter character or all-digit content, because the decoder never scans inside the value to look for boundaries.

Why this works: the length tells the decoder how many bytes to consume. The delimiter is only meaningful in the prefix, never in the body. Once the decoder reads length bytes, it knows the next character is either the start of the next length prefix or the end of input.

This pattern is everywhere in production systems. Redis's RESP protocol uses $<length>\r\n<bulk>. HTTP/2 frames begin with a 24-bit length field. Protocol Buffers prefix every field with a wire-type tag and a length when applicable. The reason is the same in every case: length-prefix framing is the cheapest unambiguous way to delimit variable-length values in a stream.

For interviews, reaching for length-prefix framing immediately on this problem is the right move. Mentioning to the interviewer that the same pattern shows up in real protocols signals you understand why the technique exists, not just that it works.

When to use

Need to serialize a list of variable-length strings or byte arrays
Input characters are unrestricted (could contain any byte including separators)
Designing a wire protocol for streams or files
Encoding or decoding tree or graph structures into a single string

When NOT to use

All strings are known to be a fixed length (just concatenate)
Input is restricted to a known alphabet that excludes a safe separator (e.g. all lowercase letters; use ,)
You can use a structured format like JSON or Protocol Buffers and the language has a library

Pattern Recognition

encode a list of strings into one stringdesign a serialization protocoltransmit variable-length recordsdecode without knowing in advance how many records there are

Variations:

Fixed-width length prefix: Use a 4-byte big-endian integer for the length. Slightly faster to parse, no delimiter needed, but caps string length at 2^32 bytes.
Two-pass encode: First pass computes total size, second pass writes into a pre-sized buffer. Useful in low-level languages to avoid resizing.
Tree serialization: Encode tree structure with length-prefixed nodes plus parent-child markers. Comes up in Serialize and Deserialize Binary Tree (LC 297).
Versioned framing: Add a one-byte version prefix before the length. Lets you change the encoding later without breaking old decoders.

Edge Cases

Empty input list (encode returns empty string, decode returns empty list)
Empty strings inside the list (encoded as 0#)
Very long strings (length prefix grows but the algorithm still works)
Unicode characters (use byte length, not character count, if the language distinguishes)

Practice Problems

MediumEncode and Decode Strings

HardSerialize and Deserialize Binary Tree

HardSerialize and Deserialize N-ary Tree

MediumEncode N-ary Tree to Binary Tree

Key Points

•Prefix each string with its byte length and a delimiter
•Decoder reads digits until the delimiter, then reads exactly that many bytes
•Works for any input including empty strings and strings containing the delimiter
•Same idea is used in Redis RESP, HTTP/2 frames, and protocol buffers
•Avoid join-with-separator approaches because no separator is safe for arbitrary input

Code Template

 1  # Encode and Decode Strings (LC 271). Length-prefix framing.
 2  def encode(strs):
 3      """Encode list of strings into a single string."""
 4      parts = []
 5      for s in strs:
 6          parts.append(f"{len(s)}#{s}")
 7      return "".join(parts)
 8  
 9  def decode(s):
10      """Decode the single string back into a list."""
11      result = []
12      i = 0
13      while i < len(s):
14          # Read digits up to the '#' delimiter.
15          j = s.index('#', i)
16          length = int(s[i:j])
17          # The value is the next `length` characters after '#'.
18          start = j + 1
19          end = start + length
20          result.append(s[start:end])
21          i = end
22      return result

Common Mistakes

Picking a "safe" separator like comma or pipe and assuming inputs will not contain it
Using base64 or escaping when length-prefix is simpler and faster
Off-by-one errors in the substring index when reading the value back
Forgetting that the length itself is variable-width (parse digits until delimiter)

Related Patterns

CrackingWalnuts

Tries & String AlgorithmsTemplate 4 of 4

Trie & StringsBasicO(n) time, O(n) space