Real-World SystemsProblem 5 of 16

Real-World SystemsIntermediate

File Compression System

Strategy PatternDecorator PatternComposite PatternBuilder PatternTemplate Method

Swappable compression strategies, Decorator-layered stream processing (compress, encrypt, checksum), Composite archive tree, and Builder for step-by-step archive construction.

Key Abstractions

CompressionStrategy

Interface defining compress/decompress on raw bytes. Strategies are swappable at runtime.

RLECompression / HuffmanCompression

Concrete strategies. RLE for repetitive data, Huffman for general text with frequency-based encoding.

StreamDecorator

Wraps a data stream to add processing layers. EncryptionDecorator and ChecksumDecorator compose transparently.

ArchiveEntry

Composite node. Files are leaves holding compressed data; directories contain children for recursive traversal.

ArchiveBuilder

Fluent builder for archive construction. addFile, addDirectory, setCompression, setEncryption, build with validation.

Compressor

Template Method skeleton: read chunk, compress, write output, update checksum. Subclasses override hooks.

Class Diagram

How It Works

A file compression system layers multiple design patterns to handle three distinct concerns: how to compress data, how to structure an archive, and how to process data through a pipeline.

Different files benefit from different compression algorithms. Repetitive data like log files compresses well with Run-Length Encoding. General text does better with Huffman coding, which assigns shorter bit sequences to more frequent bytes. By extracting the compression algorithm behind an interface, the system can pick the right strategy per file without changing any other code.

Stream processing is where Decorator shines. After compressing data, you might want to add a checksum for integrity verification, then encode it as Base64 for safe text transport. Each of these is a decorator that wraps the data, and they compose in any order. The critical point is ordering: compress first, then encrypt, then checksum. Encrypting before compressing destroys the statistical patterns that compression exploits.

The archive itself is a tree. Directories contain files and other directories, which is the textbook Composite pattern. When you ask a directory for its total size, it recursively sums up all children. No special-casing needed. Files return their compressed size directly, directories delegate downward.

The Builder pattern handles archive construction because creating an archive involves multiple steps with constraints. You need to set a compression strategy, optionally configure encryption, add files and directories, and validate that the configuration is consistent. A constructor with that many parameters would be unreadable. The builder's fluent API makes the intent clear and catches invalid configurations at build time.

Requirements

Functional

Support multiple compression algorithms swappable at runtime (RLE, Huffman)
Layer processing steps on compressed data: checksum, encoding, encryption
Build archives with nested directories and files
Track compression ratios per file and across the archive
Validate builder constraints before producing the archive
Round-trip fidelity: decompress(compress(data)) must equal the original data

Non-Functional

Stream-friendly design: avoid loading entire large files into memory at once
Composable decorators: adding a new processing layer should not modify existing ones
Extensible strategies: adding a new compression algorithm requires only a new class
Clear error messages when validation fails

Design Decisions

What's wrong with an if/else chain for picking the algorithm?

An if/else chain means every caller that compresses data needs to know about every algorithm. Adding LZ77 means finding every if algorithm == "rle" block and adding another branch. Strategy isolates each algorithm in its own class. New algorithms are new classes. Existing code stays untouched. The strategy is also runtime-configurable, so you can pick the best algorithm based on file content analysis.

Does the order of decorators matter?

Yes, and it's critical. Compressing before encrypting is essential because encrypted data looks random and won't compress. Similarly, checksumming should happen after encryption so the checksum covers the final form of the data. Decorator makes this ordering explicit in the pipeline construction without baking it into the class hierarchy.

Why not just use a constructor for building the archive?

An archive needs a compression strategy, optional encryption key, optional decorators, and a tree of files and directories. That's too many parameters for a constructor, especially since most are optional. Builder lets callers set only what they need, validates constraints at build time (encryption key length, strategy compatibility), and produces a fully configured archive in one final step.

How does Template Method help with the compression pipeline?

The compression pipeline always follows the same steps: pre-process, compress, post-process. What varies is what happens at each hook. A checksum compressor adds hash computation after compression. A logging compressor prints sizes before and after. Template Method defines the skeleton once, and subclasses override only the hooks they care about.

Interview Follow-ups

"How would you handle streaming for large files?" Process data in fixed-size chunks rather than loading the entire file. Each chunk is compressed independently with a chunk header storing the compressed and original sizes. The Decorator pipeline processes each chunk as it flows through, keeping memory usage bounded regardless of file size.
"How would you add parallel compression?" Divide the file into blocks and compress each block on a separate thread. Use a thread pool or fork-join framework. The challenge is reassembly: blocks must be written in order, so use a sequence number per block and a merging step that waits for blocks in order. The Strategy interface stays unchanged since each thread gets its own strategy instance.
"How would you support incremental archives?" Track file modification timestamps. On a second archive pass, only compress files that changed since the last archive. Store a manifest mapping file paths to their last-known hash and timestamp. The Builder can accept a previous manifest and skip unchanged files.
"How would you add deduplication?" Hash each file (or each chunk) before compression. Maintain a content-addressable store keyed by hash. If a hash already exists, store a reference instead of the data. This is how systems like rsync and modern backup tools minimize storage. The Composite tree stores references for duplicate entries, pointing to the same underlying compressed data.

Code Implementation

  1  from __future__ import annotations
  2  from abc import ABC, abstractmethod
  3  from collections import Counter
  4  import heapq
  5  import struct
  6  import hashlib
  7  import base64
  8  
  9  
 10  # ── Strategy Pattern: Compression algorithms ──────────────────────────
 11  
 12  class CompressionStrategy(ABC):
 13      """Interface for pluggable compression algorithms."""
 14  
 15      @abstractmethod
 16      def compress(self, data: bytes) -> bytes: ...
 17  
 18      @abstractmethod
 19      def decompress(self, data: bytes) -> bytes: ...
 20  
 21  
 22  class RLECompression(CompressionStrategy):
 23      """Run-Length Encoding. Best for data with long runs of repeated bytes."""
 24  
 25      def compress(self, data: bytes) -> bytes:
 26          if not data:
 27              return b""
 28          result = bytearray()
 29          i = 0
 30          while i < len(data):
 31              byte = data[i]
 32              count = 1
 33              while i + count < len(data) and data[i + count] == byte and count < 255:
 34                  count += 1
 35              result.append(count)
 36              result.append(byte)
 37              i += count
 38          return bytes(result)
 39  
 40      def decompress(self, data: bytes) -> bytes:
 41          result = bytearray()
 42          for i in range(0, len(data), 2):
 43              count = data[i]
 44              byte = data[i + 1]
 45              result.extend([byte] * count)
 46          return bytes(result)
 47  
 48  
 49  class HuffmanCompression(CompressionStrategy):
 50      """Huffman coding. Builds a frequency-based prefix tree for encoding."""
 51  
 52      def compress(self, data: bytes) -> bytes:
 53          if not data:
 54              return b""
 55          freq = Counter(data)
 56          # Build Huffman tree using a heap
 57          heap: list[list] = [[count, [byte, ""]] for byte, count in freq.items()]
 58          heapq.heapify(heap)
 59          if len(heap) == 1:
 60              heap[0][1][1] = "0"
 61          else:
 62              while len(heap) > 1:
 63                  lo = heapq.heappop(heap)
 64                  hi = heapq.heappop(heap)
 65                  for pair in lo[1:]:
 66                      pair[1] = "0" + pair[1]
 67                  for pair in hi[1:]:
 68                      pair[1] = "1" + pair[1]
 69                  heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
 70          codes = {byte: code for byte, code in heap[0][1:]}
 71          # Encode: store table size, table entries, then bit string as bytes
 72          header = bytearray()
 73          header.append(len(codes))
 74          for byte_val, code in codes.items():
 75              header.append(byte_val)
 76              header.append(len(code))
 77              # Pack code bits into the header as a string for simplicity
 78              header.extend(code.encode("ascii"))
 79          bit_string = "".join(codes[b] for b in data)
 80          # Pad bit string to multiple of 8
 81          padding = (8 - len(bit_string) % 8) % 8
 82          bit_string += "0" * padding
 83          encoded = bytearray()
 84          for i in range(0, len(bit_string), 8):
 85              encoded.append(int(bit_string[i:i + 8], 2))
 86          # Store: padding (1 byte) + header_len (2 bytes) + header + encoded
 87          header_bytes = bytes(header)
 88          result = struct.pack(">BH", padding, len(header_bytes)) + header_bytes + bytes(encoded)
 89          return result
 90  
 91      def decompress(self, data: bytes) -> bytes:
 92          if not data:
 93              return b""
 94          padding = data[0]
 95          header_len = struct.unpack(">H", data[1:3])[0]
 96          header = data[3:3 + header_len]
 97          encoded = data[3 + header_len:]
 98          # Rebuild code table
 99          idx = 0
100          num_codes = header[idx]; idx += 1
101          codes: dict[str, int] = {}
102          for _ in range(num_codes):
103              byte_val = header[idx]; idx += 1
104              code_len = header[idx]; idx += 1
105              code = header[idx:idx + code_len].decode("ascii"); idx += code_len
106              codes[code] = byte_val
107          # Decode bit string
108          bit_string = "".join(f"{b:08b}" for b in encoded)
109          if padding:
110              bit_string = bit_string[:-padding]
111          result = bytearray()
112          current = ""
113          for bit in bit_string:
114              current += bit
115              if current in codes:
116                  result.append(codes[current])
117                  current = ""
118          return bytes(result)
119  
120  
121  # ── Decorator Pattern: Stream processing layers ───────────────────────
122  
123  class StreamDecorator(ABC):
124      """Base decorator for layered data processing."""
125  
126      @abstractmethod
127      def process(self, data: bytes) -> bytes: ...
128  
129      @abstractmethod
130      def unprocess(self, data: bytes) -> bytes: ...
131  
132  
133  class ChecksumDecorator(StreamDecorator):
134      """Appends a SHA-256 checksum to the data for integrity verification."""
135  
136      HASH_SIZE = 32
137  
138      def process(self, data: bytes) -> bytes:
139          checksum = hashlib.sha256(data).digest()
140          return data + checksum
141  
142      def unprocess(self, data: bytes) -> bytes:
143          payload = data[:-self.HASH_SIZE]
144          stored_checksum = data[-self.HASH_SIZE:]
145          computed = hashlib.sha256(payload).digest()
146          if computed != stored_checksum:
147              raise ValueError("Checksum verification failed : data is corrupted")
148          return payload
149  
150      def verify(self, data: bytes) -> bool:
151          try:
152              self.unprocess(data)
153              return True
154          except ValueError:
155              return False
156  
157  
158  class Base64Decorator(StreamDecorator):
159      """Encodes data as Base64 for safe text transport."""
160  
161      def process(self, data: bytes) -> bytes:
162          return base64.b64encode(data)
163  
164      def unprocess(self, data: bytes) -> bytes:
165          return base64.b64decode(data)
166  
167  
168  # ── Composite Pattern: Archive tree structure ─────────────────────────
169  
170  class ArchiveEntry(ABC):
171      """Abstract node in the archive tree."""
172  
173      def __init__(self, name: str):
174          self._name = name
175  
176      @property
177      def name(self) -> str:
178          return self._name
179  
180      @abstractmethod
181      def is_directory(self) -> bool: ...
182  
183      @abstractmethod
184      def total_size(self) -> int: ...
185  
186      @abstractmethod
187      def display(self, indent: int = 0) -> str: ...
188  
189  
190  class FileEntry(ArchiveEntry):
191      """Leaf node holding original and compressed data."""
192  
193      def __init__(self, name: str, data: bytes, compressed: bytes):
194          super().__init__(name)
195          self._data = data
196          self._compressed = compressed
197  
198      def is_directory(self) -> bool:
199          return False
200  
201      def total_size(self) -> int:
202          return len(self._compressed)
203  
204      @property
205      def original_size(self) -> int:
206          return len(self._data)
207  
208      @property
209      def compressed_data(self) -> bytes:
210          return self._compressed
211  
212      def display(self, indent: int = 0) -> str:
213          ratio = (1 - len(self._compressed) / len(self._data)) * 100 if self._data else 0
214          return f"{'  ' * indent}FILE {self._name} ({len(self._data)}B -> {len(self._compressed)}B, {ratio:.1f}% saved)"
215  
216  
217  class DirectoryEntry(ArchiveEntry):
218      """Composite node containing files and sub-directories."""
219  
220      def __init__(self, name: str):
221          super().__init__(name)
222          self._children: list[ArchiveEntry] = []
223  
224      def is_directory(self) -> bool:
225          return True
226  
227      def add_child(self, entry: ArchiveEntry) -> None:
228          self._children.append(entry)
229  
230      @property
231      def children(self) -> list[ArchiveEntry]:
232          return list(self._children)
233  
234      def total_size(self) -> int:
235          return sum(child.total_size() for child in self._children)
236  
237      def display(self, indent: int = 0) -> str:
238          lines = [f"{'  ' * indent}DIR  {self._name}/ ({self.total_size()}B compressed)"]
239          for child in self._children:
240              lines.append(child.display(indent + 1))
241          return "\n".join(lines)
242  
243  
244  # ── Builder Pattern: Archive construction ─────────────────────────────
245  
246  class ArchiveBuilder:
247      """Fluent builder for constructing archives with validation."""
248  
249      def __init__(self):
250          self._root = DirectoryEntry("archive")
251          self._strategy: CompressionStrategy = RLECompression()
252          self._encryption_key: str | None = None
253          self._decorators: list[StreamDecorator] = []
254  
255      def set_compression(self, strategy: CompressionStrategy) -> "ArchiveBuilder":
256          self._strategy = strategy
257          return self
258  
259      def set_encryption(self, key: str) -> "ArchiveBuilder":
260          if len(key) < 4:
261              raise ValueError("Encryption key must be at least 4 characters")
262          self._encryption_key = key
263          return self
264  
265      def add_decorator(self, decorator: StreamDecorator) -> "ArchiveBuilder":
266          self._decorators.append(decorator)
267          return self
268  
269      def add_file(self, name: str, data: bytes, directory: DirectoryEntry | None = None) -> "ArchiveBuilder":
270          compressed = self._strategy.compress(data)
271          # Apply decorators in order (compress first, then layer on top)
272          for dec in self._decorators:
273              compressed = dec.process(compressed)
274          entry = FileEntry(name, data, compressed)
275          target = directory if directory else self._root
276          target.add_child(entry)
277          return self
278  
279      def add_directory(self, name: str, parent: DirectoryEntry | None = None) -> DirectoryEntry:
280          directory = DirectoryEntry(name)
281          target = parent if parent else self._root
282          target.add_child(directory)
283          return directory
284  
285      def build(self) -> DirectoryEntry:
286          if self._encryption_key and not any(isinstance(d, Base64Decorator) for d in self._decorators):
287              print("  [Builder Warning] Encryption key set but no encoding decorator added")
288          return self._root
289  
290  
291  # ── Template Method: Compressor skeleton ──────────────────────────────
292  
293  class Compressor(ABC):
294      """Template Method for file processing. Defines the skeleton:
295      read -> pre-process -> compress -> post-process -> return."""
296  
297      def __init__(self, strategy: CompressionStrategy):
298          self._strategy = strategy
299  
300      def process_file(self, data: bytes) -> bytes:
301          """Template method : fixed sequence, customizable hooks."""
302          data = self.before_compress(data)
303          compressed = self._strategy.compress(data)
304          compressed = self.after_compress(compressed)
305          return compressed
306  
307      def before_compress(self, data: bytes) -> bytes:
308          """Hook: override to pre-process data before compression."""
309          return data
310  
311      def after_compress(self, data: bytes) -> bytes:
312          """Hook: override to post-process data after compression."""
313          return data
314  
315  
316  class ChecksumCompressor(Compressor):
317      """Adds checksum computation after compression."""
318  
319      def __init__(self, strategy: CompressionStrategy):
320          super().__init__(strategy)
321          self.last_checksum: str = ""
322  
323      def after_compress(self, data: bytes) -> bytes:
324          self.last_checksum = hashlib.sha256(data).hexdigest()[:16]
325          return data
326  
327  
328  class LoggingCompressor(Compressor):
329      """Logs each processing step for debugging."""
330  
331      def before_compress(self, data: bytes) -> bytes:
332          print(f"    [LoggingCompressor] Input: {len(data)} bytes")
333          return data
334  
335      def after_compress(self, data: bytes) -> bytes:
336          print(f"    [LoggingCompressor] Output: {len(data)} bytes")
337          return data
338  
339  
340  # ── Demo ──────────────────────────────────────────────────────────────
341  
342  if __name__ == "__main__":
343      print("=" * 60)
344      print("FILE COMPRESSION SYSTEM : DESIGN PATTERN DEMO")
345      print("=" * 60)
346  
347      # 1. Strategy Pattern: Compare compression algorithms
348      print("\n--- Strategy Pattern: Swappable Compression ---")
349      repetitive = b"AAAAAABBBBCCCCCCCCDDEE" * 5
350      text_data = b"the quick brown fox jumps over the lazy dog near the river" * 3
351  
352      rle = RLECompression()
353      huffman = HuffmanCompression()
354  
355      for label, data in [("Repetitive", repetitive), ("Text", text_data)]:
356          rle_out = rle.compress(data)
357          huf_out = huffman.compress(data)
358          print(f"\n  {label} data ({len(data)}B):")
359          print(f"    RLE:     {len(rle_out)}B ({(1 - len(rle_out)/len(data))*100:.1f}% saved)")
360          print(f"    Huffman: {len(huf_out)}B ({(1 - len(huf_out)/len(data))*100:.1f}% saved)")
361          # Verify round-trip
362          assert rle.decompress(rle_out) == data, "RLE round-trip failed"
363          assert huffman.decompress(huf_out) == data, "Huffman round-trip failed"
364      print("  Round-trip verification: PASSED")
365  
366      # 2. Decorator Pattern: Layered stream processing
367      print("\n--- Decorator Pattern: Layered Processing ---")
368      raw = b"Sensitive financial report data 2026"
369      print(f"  Raw data: {len(raw)}B")
370  
371      checksum_dec = ChecksumDecorator()
372      base64_dec = Base64Decorator()
373  
374      # Layer: compress -> checksum -> base64
375      step1 = rle.compress(raw)
376      print(f"  After RLE compression: {len(step1)}B")
377      step2 = checksum_dec.process(step1)
378      print(f"  After checksum layer:  {len(step2)}B (+32B SHA-256)")
379      step3 = base64_dec.process(step2)
380      print(f"  After Base64 layer:    {len(step3)}B (text-safe)")
381  
382      # Reverse the pipeline
383      r1 = base64_dec.unprocess(step3)
384      r2 = checksum_dec.unprocess(r1)
385      r3 = rle.decompress(r2)
386      assert r3 == raw, "Decorator pipeline round-trip failed"
387      print("  Pipeline round-trip: PASSED")
388      print(f"  Checksum valid: {checksum_dec.verify(r1)}")
389  
390      # 3. Composite Pattern: Archive tree
391      print("\n--- Composite Pattern: Archive Tree ---")
392      builder = (
393          ArchiveBuilder()
394          .set_compression(HuffmanCompression())
395          .add_decorator(ChecksumDecorator())
396      )
397  
398      src_dir = builder.add_directory("src")
399      builder.add_file("main.py", b"import sys\nprint('hello world')\n" * 4, src_dir)
400      builder.add_file("utils.py", b"def helper(): return True\n" * 6, src_dir)
401  
402      docs_dir = builder.add_directory("docs")
403      builder.add_file("README.md", b"# Project\nThis is the readme.\n" * 5, docs_dir)
404  
405      builder.add_file("config.yaml", b"key: value\nport: 8080\n" * 3)
406  
407      archive = builder.build()
408      print(f"\n{archive.display()}")
409      print(f"\n  Total archive size: {archive.total_size()}B")
410  
411      # 4. Builder validation
412      print("\n--- Builder Pattern: Validation ---")
413      try:
414          ArchiveBuilder().set_encryption("ab")
415      except ValueError as e:
416          print(f"  Caught: {e}")
417  
418      valid_builder = ArchiveBuilder().set_compression(RLECompression()).set_encryption("secure-key-2026")
419      valid_builder.add_file("data.bin", b"\x00" * 100 + b"\xFF" * 100)
420      result = valid_builder.build()
421      print(f"  Valid build: {result.total_size()}B compressed")
422  
423      # 5. Template Method: Processing hooks
424      print("\n--- Template Method: Compressor Hooks ---")
425      sample = b"AAAAABBBBBCCCCC" * 10
426  
427      print("  ChecksumCompressor:")
428      cc = ChecksumCompressor(RLECompression())
429      compressed = cc.process_file(sample)
430      print(f"    {len(sample)}B -> {len(compressed)}B, checksum: {cc.last_checksum}")
431  
432      print("\n  LoggingCompressor:")
433      lc = LoggingCompressor(HuffmanCompression())
434      compressed = lc.process_file(sample)
435  
436      print("\n" + "=" * 60)
437      print("All operations completed successfully.")

Common Mistakes

✗Hardcoding compression algorithm: you can't optimize per file type
✗Applying encryption before compression: encrypted data doesn't compress well
✗Not using Composite for directories: leads to flat file lists with path string parsing
✗Loading entire file into memory: you need streaming for large archives

Key Points

✓Strategy makes compression algorithm swappable: choose RLE for repetitive data, Huffman for general text
✓Decorator for stream processing layers that compose: encryption wraps compression wraps raw data
✓Composite for archive structure: directories contain files and sub-directories, recursive traversal
✓Builder validates constraints: encryption requires a key, compression level must be valid

Related Designs