File Compression System
Swappable compression strategies, Decorator-layered stream processing (compress, encrypt, checksum), Composite archive tree, and Builder for step-by-step archive construction.
Key Abstractions
- CompressionStrategy: interface defining compress/decompress on raw bytes. Strategies are swappable at runtime.
- RLECompression, HuffmanCompression: concrete strategies. RLE for repetitive data, Huffman for general text with frequency-based encoding.
- StreamDecorator: wraps a data stream to add processing layers. ChecksumDecorator and Base64Decorator compose transparently.
- ArchiveEntry: composite node. FileEntry leaves hold compressed data; DirectoryEntry nodes contain children for recursive traversal.
- ArchiveBuilder: fluent builder for archive construction. add_file, add_directory, set_compression, set_encryption, build with validation.
- Compressor: Template Method skeleton (pre-process, compress, post-process). Subclasses override only the hooks they need.
Class Diagram
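A rough text sketch of the main relationships in the implementation below ("<|--" reads "is implemented by", "o-->" reads "holds a"):

CompressionStrategy <|-- RLECompression, HuffmanCompression
StreamDecorator     <|-- ChecksumDecorator, Base64Decorator
ArchiveEntry        <|-- FileEntry (leaf), DirectoryEntry (composite, holds child ArchiveEntry nodes)
ArchiveBuilder      o--> CompressionStrategy, a list of StreamDecorator, and the DirectoryEntry root
Compressor          <|-- ChecksumCompressor, LoggingCompressor; Compressor o--> CompressionStrategy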
How It Works
A file compression system layers multiple design patterns to handle three distinct concerns: how to compress data, how to structure an archive, and how to process data through a pipeline.
Different files benefit from different compression algorithms. Repetitive data like log files compresses well with Run-Length Encoding. General text does better with Huffman coding, which assigns shorter bit sequences to more frequent bytes. By extracting the compression algorithm behind an interface, the system can pick the right strategy per file without changing any other code.
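A minimal sketch of that per-file selection, using the strategy classes from the implementation below. The 4 KiB sample window and the 40% run threshold are illustrative assumptions, not tuned values:

def choose_strategy(data: bytes) -> CompressionStrategy:
    # Hypothetical heuristic: count adjacent repeated bytes in a small sample.
    sample = data[:4096]
    repeats = sum(1 for a, b in zip(sample, sample[1:]) if a == b)
    run_fraction = repeats / max(len(sample) - 1, 1)
    # Long runs favor RLE; anything else falls back to Huffman.
    return RLECompression() if run_fraction > 0.4 else HuffmanCompression()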
Stream processing is where Decorator shines. After compressing data, you might want to add a checksum for integrity verification, then encode the result as Base64 for safe text transport. Each of these is a decorator that wraps the data, and decorators can be stacked mechanically in any order. The critical point is choosing the right order: compress first, then encrypt, then checksum. Encrypting before compressing destroys the statistical patterns that compression exploits.
The archive itself is a tree. Directories contain files and other directories, which is the textbook Composite pattern. When you ask a directory for its total size, it recursively sums its children with no special-casing: files return their compressed size directly, and directories delegate downward.
The Builder pattern handles archive construction because creating an archive involves multiple steps with constraints. You need to set a compression strategy, optionally configure encryption, add files and directories, and validate that the configuration is consistent. A constructor with that many parameters would be unreadable. The builder's fluent API makes the intent clear and catches invalid configurations at build time.
Requirements
Functional
- Support multiple compression algorithms swappable at runtime (RLE, Huffman)
- Layer processing steps on compressed data: checksum, encoding, encryption
- Build archives with nested directories and files
- Track compression ratios per file and across the archive
- Validate builder constraints before producing the archive
- Round-trip fidelity: decompress(compress(data)) must equal the original data
Non-Functional
- Stream-friendly design: avoid loading entire large files into memory at once
- Composable decorators: adding a new processing layer should not modify existing ones
- Extensible strategies: adding a new compression algorithm requires only a new class
- Clear error messages when validation fails
Design Decisions
What's wrong with an if/else chain for picking the algorithm?
An if/else chain means every caller that compresses data needs to know about every algorithm. Adding LZ77 means finding every if algorithm == "rle" block and adding another branch. Strategy isolates each algorithm in its own class. New algorithms are new classes. Existing code stays untouched. The strategy is also runtime-configurable, so you can pick the best algorithm based on file content analysis.
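A sketch of a registry that replaces the if/else chain, built on the classes from the implementation below; the string names are assumptions:

STRATEGIES: dict[str, type[CompressionStrategy]] = {
    "rle": RLECompression,
    "huffman": HuffmanCompression,
    # Adding LZ77 later means one new class and one new entry here.
}

def make_strategy(name: str) -> CompressionStrategy:
    try:
        return STRATEGIES[name]()
    except KeyError:
        raise ValueError(f"Unknown compression algorithm: {name}") from None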
Does the order of decorators matter?
Yes, and it's critical. Compressing before encrypting is essential because encrypted data looks random and won't compress. Similarly, checksumming should happen after encryption so the checksum covers the final form of the data. Decorator makes this ordering explicit in the pipeline construction without baking it into the class hierarchy.
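To make the ordering concrete, a small sketch with the decorators from the implementation below (Base64 standing in for the transport-encoding layer); decoding peels the layers off in reverse:

raw = b"example payload"
strategy = HuffmanCompression()
layers = [ChecksumDecorator(), Base64Decorator()]  # applied on top of compression, in order

payload = strategy.compress(raw)            # 1. compress first
for layer in layers:
    payload = layer.process(payload)        # 2. then wrap each layer

for layer in reversed(layers):
    payload = layer.unprocess(payload)      # 3. unwrap in reverse order
original = strategy.decompress(payload)     # 4. decompress last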
Why not just use a constructor for building the archive?
An archive needs a compression strategy, optional encryption key, optional decorators, and a tree of files and directories. That's too many parameters for a constructor, especially since most are optional. Builder lets callers set only what they need, validates constraints at build time (encryption key length, strategy compatibility), and produces a fully configured archive in one final step.
How does Template Method help with the compression pipeline?
The compression pipeline always follows the same steps: pre-process, compress, post-process. What varies is what happens at each hook. A checksum compressor adds hash computation after compression. A logging compressor prints sizes before and after. Template Method defines the skeleton once, and subclasses override only the hooks they care about.
Interview Follow-ups
- "How would you handle streaming for large files?" Process data in fixed-size chunks rather than loading the entire file. Each chunk is compressed independently with a chunk header storing the compressed and original sizes. The Decorator pipeline processes each chunk as it flows through, keeping memory usage bounded regardless of file size.
- "How would you add parallel compression?" Divide the file into blocks and compress each block on a separate thread. Use a thread pool or fork-join framework. The challenge is reassembly: blocks must be written in order, so use a sequence number per block and a merging step that waits for blocks in order. The Strategy interface stays unchanged since each thread gets its own strategy instance.
- "How would you support incremental archives?" Track file modification timestamps. On a second archive pass, only compress files that changed since the last archive. Store a manifest mapping file paths to their last-known hash and timestamp. The Builder can accept a previous manifest and skip unchanged files.
- "How would you add deduplication?" Hash each file (or each chunk) before compression. Maintain a content-addressable store keyed by hash. If a hash already exists, store a reference instead of the data. This is how systems like rsync and modern backup tools minimize storage. The Composite tree stores references for duplicate entries, pointing to the same underlying compressed data.
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from collections import Counter
import heapq
import struct
import hashlib
import base64


# ── Strategy Pattern: Compression algorithms ──────────────────────────

class CompressionStrategy(ABC):
    """Interface for pluggable compression algorithms."""

    @abstractmethod
    def compress(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decompress(self, data: bytes) -> bytes: ...


class RLECompression(CompressionStrategy):
    """Run-Length Encoding. Best for data with long runs of repeated bytes."""

    def compress(self, data: bytes) -> bytes:
        if not data:
            return b""
        result = bytearray()
        i = 0
        while i < len(data):
            byte = data[i]
            count = 1
            while i + count < len(data) and data[i + count] == byte and count < 255:
                count += 1
            result.append(count)
            result.append(byte)
            i += count
        return bytes(result)

    def decompress(self, data: bytes) -> bytes:
        result = bytearray()
        for i in range(0, len(data), 2):
            count = data[i]
            byte = data[i + 1]
            result.extend([byte] * count)
        return bytes(result)


class HuffmanCompression(CompressionStrategy):
    """Huffman coding. Builds a frequency-based prefix tree for encoding.

    Note: the one-byte table size limits this toy format to inputs with
    at most 255 distinct byte values.
    """

    def compress(self, data: bytes) -> bytes:
        if not data:
            return b""
        freq = Counter(data)
        # Build Huffman tree using a heap
        heap: list[list] = [[count, [byte, ""]] for byte, count in freq.items()]
        heapq.heapify(heap)
        if len(heap) == 1:
            heap[0][1][1] = "0"
        else:
            while len(heap) > 1:
                lo = heapq.heappop(heap)
                hi = heapq.heappop(heap)
                for pair in lo[1:]:
                    pair[1] = "0" + pair[1]
                for pair in hi[1:]:
                    pair[1] = "1" + pair[1]
                heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        codes = {byte: code for byte, code in heap[0][1:]}
        # Encode: store table size, table entries, then bit string as bytes
        header = bytearray()
        header.append(len(codes))
        for byte_val, code in codes.items():
            header.append(byte_val)
            header.append(len(code))
            # Store each code as an ASCII bit-string (not bit-packed) for simplicity
            header.extend(code.encode("ascii"))
        bit_string = "".join(codes[b] for b in data)
        # Pad bit string to multiple of 8
        padding = (8 - len(bit_string) % 8) % 8
        bit_string += "0" * padding
        encoded = bytearray()
        for i in range(0, len(bit_string), 8):
            encoded.append(int(bit_string[i:i + 8], 2))
        # Store: padding (1 byte) + header_len (2 bytes) + header + encoded
        header_bytes = bytes(header)
        result = struct.pack(">BH", padding, len(header_bytes)) + header_bytes + bytes(encoded)
        return result

    def decompress(self, data: bytes) -> bytes:
        if not data:
            return b""
        padding = data[0]
        header_len = struct.unpack(">H", data[1:3])[0]
        header = data[3:3 + header_len]
        encoded = data[3 + header_len:]
        # Rebuild code table
        idx = 0
        num_codes = header[idx]; idx += 1
        codes: dict[str, int] = {}
        for _ in range(num_codes):
            byte_val = header[idx]; idx += 1
            code_len = header[idx]; idx += 1
            code = header[idx:idx + code_len].decode("ascii"); idx += code_len
            codes[code] = byte_val
        # Decode bit string
        bit_string = "".join(f"{b:08b}" for b in encoded)
        if padding:
            bit_string = bit_string[:-padding]
        result = bytearray()
        current = ""
        for bit in bit_string:
            current += bit
            if current in codes:
                result.append(codes[current])
                current = ""
        return bytes(result)


# ── Decorator Pattern: Stream processing layers ───────────────────────

class StreamDecorator(ABC):
    """Base decorator for layered data processing."""

    @abstractmethod
    def process(self, data: bytes) -> bytes: ...

    @abstractmethod
    def unprocess(self, data: bytes) -> bytes: ...


class ChecksumDecorator(StreamDecorator):
    """Appends a SHA-256 checksum to the data for integrity verification."""

    HASH_SIZE = 32

    def process(self, data: bytes) -> bytes:
        checksum = hashlib.sha256(data).digest()
        return data + checksum

    def unprocess(self, data: bytes) -> bytes:
        payload = data[:-self.HASH_SIZE]
        stored_checksum = data[-self.HASH_SIZE:]
        computed = hashlib.sha256(payload).digest()
        if computed != stored_checksum:
            raise ValueError("Checksum verification failed: data is corrupted")
        return payload

    def verify(self, data: bytes) -> bool:
        try:
            self.unprocess(data)
            return True
        except ValueError:
            return False


class Base64Decorator(StreamDecorator):
    """Encodes data as Base64 for safe text transport."""

    def process(self, data: bytes) -> bytes:
        return base64.b64encode(data)

    def unprocess(self, data: bytes) -> bytes:
        return base64.b64decode(data)


# ── Composite Pattern: Archive tree structure ─────────────────────────

class ArchiveEntry(ABC):
    """Abstract node in the archive tree."""

    def __init__(self, name: str):
        self._name = name

    @property
    def name(self) -> str:
        return self._name

    @abstractmethod
    def is_directory(self) -> bool: ...

    @abstractmethod
    def total_size(self) -> int: ...

    @abstractmethod
    def display(self, indent: int = 0) -> str: ...


class FileEntry(ArchiveEntry):
    """Leaf node holding original and compressed data."""

    def __init__(self, name: str, data: bytes, compressed: bytes):
        super().__init__(name)
        self._data = data
        self._compressed = compressed

    def is_directory(self) -> bool:
        return False

    def total_size(self) -> int:
        return len(self._compressed)

    @property
    def original_size(self) -> int:
        return len(self._data)

    @property
    def compressed_data(self) -> bytes:
        return self._compressed

    def display(self, indent: int = 0) -> str:
        ratio = (1 - len(self._compressed) / len(self._data)) * 100 if self._data else 0
        return f"{'  ' * indent}FILE {self._name} ({len(self._data)}B -> {len(self._compressed)}B, {ratio:.1f}% saved)"


class DirectoryEntry(ArchiveEntry):
    """Composite node containing files and sub-directories."""

    def __init__(self, name: str):
        super().__init__(name)
        self._children: list[ArchiveEntry] = []

    def is_directory(self) -> bool:
        return True

    def add_child(self, entry: ArchiveEntry) -> None:
        self._children.append(entry)

    @property
    def children(self) -> list[ArchiveEntry]:
        return list(self._children)

    def total_size(self) -> int:
        return sum(child.total_size() for child in self._children)

    def display(self, indent: int = 0) -> str:
        lines = [f"{'  ' * indent}DIR {self._name}/ ({self.total_size()}B compressed)"]
        for child in self._children:
            lines.append(child.display(indent + 1))
        return "\n".join(lines)


# ── Builder Pattern: Archive construction ─────────────────────────────

class ArchiveBuilder:
    """Fluent builder for constructing archives with validation."""

    def __init__(self):
        self._root = DirectoryEntry("archive")
        self._strategy: CompressionStrategy = RLECompression()
        self._encryption_key: str | None = None
        self._decorators: list[StreamDecorator] = []

    def set_compression(self, strategy: CompressionStrategy) -> "ArchiveBuilder":
        self._strategy = strategy
        return self

    def set_encryption(self, key: str) -> "ArchiveBuilder":
        if len(key) < 4:
            raise ValueError("Encryption key must be at least 4 characters")
        self._encryption_key = key
        return self

    def add_decorator(self, decorator: StreamDecorator) -> "ArchiveBuilder":
        self._decorators.append(decorator)
        return self

    def add_file(self, name: str, data: bytes, directory: DirectoryEntry | None = None) -> "ArchiveBuilder":
        compressed = self._strategy.compress(data)
        # Apply decorators in order (compress first, then layer on top)
        for dec in self._decorators:
            compressed = dec.process(compressed)
        entry = FileEntry(name, data, compressed)
        target = directory if directory else self._root
        target.add_child(entry)
        return self

    def add_directory(self, name: str, parent: DirectoryEntry | None = None) -> DirectoryEntry:
        directory = DirectoryEntry(name)
        target = parent if parent else self._root
        target.add_child(directory)
        return directory

    def build(self) -> DirectoryEntry:
        if self._encryption_key and not any(isinstance(d, Base64Decorator) for d in self._decorators):
            print("  [Builder Warning] Encryption key set but no encoding decorator added")
        return self._root


# ── Template Method: Compressor skeleton ──────────────────────────────

class Compressor(ABC):
    """Template Method for file processing. Defines the skeleton:
    read -> pre-process -> compress -> post-process -> return."""

    def __init__(self, strategy: CompressionStrategy):
        self._strategy = strategy

    def process_file(self, data: bytes) -> bytes:
        """Template method: fixed sequence, customizable hooks."""
        data = self.before_compress(data)
        compressed = self._strategy.compress(data)
        compressed = self.after_compress(compressed)
        return compressed

    def before_compress(self, data: bytes) -> bytes:
        """Hook: override to pre-process data before compression."""
        return data

    def after_compress(self, data: bytes) -> bytes:
        """Hook: override to post-process data after compression."""
        return data


class ChecksumCompressor(Compressor):
    """Adds checksum computation after compression."""

    def __init__(self, strategy: CompressionStrategy):
        super().__init__(strategy)
        self.last_checksum: str = ""

    def after_compress(self, data: bytes) -> bytes:
        self.last_checksum = hashlib.sha256(data).hexdigest()[:16]
        return data


class LoggingCompressor(Compressor):
    """Logs each processing step for debugging."""

    def before_compress(self, data: bytes) -> bytes:
        print(f"  [LoggingCompressor] Input: {len(data)} bytes")
        return data

    def after_compress(self, data: bytes) -> bytes:
        print(f"  [LoggingCompressor] Output: {len(data)} bytes")
        return data


# ── Demo ──────────────────────────────────────────────────────────────

if __name__ == "__main__":
    print("=" * 60)
    print("FILE COMPRESSION SYSTEM: DESIGN PATTERN DEMO")
    print("=" * 60)

    # 1. Strategy Pattern: Compare compression algorithms
    print("\n--- Strategy Pattern: Swappable Compression ---")
    repetitive = b"AAAAAABBBBCCCCCCCCDDEE" * 5
    text_data = b"the quick brown fox jumps over the lazy dog near the river" * 3

    rle = RLECompression()
    huffman = HuffmanCompression()

    for label, data in [("Repetitive", repetitive), ("Text", text_data)]:
        rle_out = rle.compress(data)
        huf_out = huffman.compress(data)
        print(f"\n  {label} data ({len(data)}B):")
        print(f"    RLE:     {len(rle_out)}B ({(1 - len(rle_out)/len(data))*100:.1f}% saved)")
        print(f"    Huffman: {len(huf_out)}B ({(1 - len(huf_out)/len(data))*100:.1f}% saved)")
        # Verify round-trip
        assert rle.decompress(rle_out) == data, "RLE round-trip failed"
        assert huffman.decompress(huf_out) == data, "Huffman round-trip failed"
        print("    Round-trip verification: PASSED")

    # 2. Decorator Pattern: Layered stream processing
    print("\n--- Decorator Pattern: Layered Processing ---")
    raw = b"Sensitive financial report data 2026"
    print(f"  Raw data: {len(raw)}B")

    checksum_dec = ChecksumDecorator()
    base64_dec = Base64Decorator()

    # Layer: compress -> checksum -> base64
    step1 = rle.compress(raw)
    print(f"  After RLE compression: {len(step1)}B")
    step2 = checksum_dec.process(step1)
    print(f"  After checksum layer: {len(step2)}B (+32B SHA-256)")
    step3 = base64_dec.process(step2)
    print(f"  After Base64 layer: {len(step3)}B (text-safe)")

    # Reverse the pipeline
    r1 = base64_dec.unprocess(step3)
    r2 = checksum_dec.unprocess(r1)
    r3 = rle.decompress(r2)
    assert r3 == raw, "Decorator pipeline round-trip failed"
    print("  Pipeline round-trip: PASSED")
    print(f"  Checksum valid: {checksum_dec.verify(r1)}")

    # 3. Composite Pattern: Archive tree
    print("\n--- Composite Pattern: Archive Tree ---")
    builder = (
        ArchiveBuilder()
        .set_compression(HuffmanCompression())
        .add_decorator(ChecksumDecorator())
    )

    src_dir = builder.add_directory("src")
    builder.add_file("main.py", b"import sys\nprint('hello world')\n" * 4, src_dir)
    builder.add_file("utils.py", b"def helper(): return True\n" * 6, src_dir)

    docs_dir = builder.add_directory("docs")
    builder.add_file("README.md", b"# Project\nThis is the readme.\n" * 5, docs_dir)

    builder.add_file("config.yaml", b"key: value\nport: 8080\n" * 3)

    archive = builder.build()
    print(f"\n{archive.display()}")
    print(f"\n  Total archive size: {archive.total_size()}B")

    # 4. Builder validation
    print("\n--- Builder Pattern: Validation ---")
    try:
        ArchiveBuilder().set_encryption("ab")
    except ValueError as e:
        print(f"  Caught: {e}")

    valid_builder = ArchiveBuilder().set_compression(RLECompression()).set_encryption("secure-key-2026")
    valid_builder.add_file("data.bin", b"\x00" * 100 + b"\xFF" * 100)
    result = valid_builder.build()
    print(f"  Valid build: {result.total_size()}B compressed")

    # 5. Template Method: Processing hooks
    print("\n--- Template Method: Compressor Hooks ---")
    sample = b"AAAAABBBBBCCCCC" * 10

    print("  ChecksumCompressor:")
    cc = ChecksumCompressor(RLECompression())
    compressed = cc.process_file(sample)
    print(f"    {len(sample)}B -> {len(compressed)}B, checksum: {cc.last_checksum}")

    print("\n  LoggingCompressor:")
    lc = LoggingCompressor(HuffmanCompression())
    compressed = lc.process_file(sample)

    print("\n" + "=" * 60)
    print("All operations completed successfully.")

Common Mistakes
- ✗ Hardcoding compression algorithm: you can't optimize per file type
- ✗ Applying encryption before compression: encrypted data doesn't compress well
- ✗ Not using Composite for directories: leads to flat file lists with path string parsing
- ✗ Loading entire file into memory: you need streaming for large archives
Key Points
- ✓ Strategy makes the compression algorithm swappable: choose RLE for repetitive data, Huffman for general text
- ✓ Decorator builds stream processing layers that compose: encryption wraps compression wraps raw data
- ✓ Composite models the archive structure: directories contain files and sub-directories, traversed recursively
- ✓ Builder validates constraints: the encryption key must meet a minimum length, and encryption without an encoding layer triggers a warning