File Compression System
Swappable compression strategies, Decorator-layered stream processing (compress, encrypt, checksum), Composite archive tree, and Builder for step-by-step archive construction.
Key Abstractions
Interface defining compress/decompress on raw bytes. Strategies are swappable at runtime.
Concrete strategies. RLE for repetitive data, Huffman for general text with frequency-based encoding.
Wraps a data stream to add processing layers. EncryptionDecorator and ChecksumDecorator compose transparently.
Composite node. Files are leaves holding compressed data; directories contain children for recursive traversal.
Fluent builder for archive construction. addFile, addDirectory, setCompression, setEncryption, build with validation.
Template Method skeleton: read chunk, compress, write output, update checksum. Subclasses override hooks.
Class Diagram
How It Works
A file compression system layers multiple design patterns to handle three distinct concerns: how to compress data, how to structure an archive, and how to process data through a pipeline.
Different files benefit from different compression algorithms. Repetitive data like log files compresses well with Run-Length Encoding. General text does better with Huffman coding, which assigns shorter bit sequences to more frequent bytes. By extracting the compression algorithm behind an interface, the system can pick the right strategy per file without changing any other code.
Stream processing is where Decorator shines. After compressing data, you might want to add a checksum for integrity verification, then encode it as Base64 for safe text transport. Each of these is a decorator that wraps the data, and they compose in any order. The critical point is ordering: compress first, then encrypt, then checksum. Encrypting before compressing destroys the statistical patterns that compression exploits.
The archive itself is a tree. Directories contain files and other directories, which is the textbook Composite pattern. When you ask a directory for its total size, it recursively sums up all children. No special-casing needed. Files return their compressed size directly, directories delegate downward.
The Builder pattern handles archive construction because creating an archive involves multiple steps with constraints. You need to set a compression strategy, optionally configure encryption, add files and directories, and validate that the configuration is consistent. A constructor with that many parameters would be unreadable. The builder's fluent API makes the intent clear and catches invalid configurations at build time.
Requirements
Functional
- Support multiple compression algorithms swappable at runtime (RLE, Huffman)
- Layer processing steps on compressed data: checksum, encoding, encryption
- Build archives with nested directories and files
- Track compression ratios per file and across the archive
- Validate builder constraints before producing the archive
- Round-trip fidelity: decompress(compress(data)) must equal the original data
Non-Functional
- Stream-friendly design: avoid loading entire large files into memory at once
- Composable decorators: adding a new processing layer should not modify existing ones
- Extensible strategies: adding a new compression algorithm requires only a new class
- Clear error messages when validation fails
Design Decisions
What's wrong with an if/else chain for picking the algorithm?
An if/else chain means every caller that compresses data needs to know about every algorithm. Adding LZ77 means finding every if algorithm == "rle" block and adding another branch. Strategy isolates each algorithm in its own class. New algorithms are new classes. Existing code stays untouched. The strategy is also runtime-configurable, so you can pick the best algorithm based on file content analysis.
Does the order of decorators matter?
Yes, and it's critical. Compressing before encrypting is essential because encrypted data looks random and won't compress. Similarly, checksumming should happen after encryption so the checksum covers the final form of the data. Decorator makes this ordering explicit in the pipeline construction without baking it into the class hierarchy.
Why not just use a constructor for building the archive?
An archive needs a compression strategy, optional encryption key, optional decorators, and a tree of files and directories. That's too many parameters for a constructor, especially since most are optional. Builder lets callers set only what they need, validates constraints at build time (encryption key length, strategy compatibility), and produces a fully configured archive in one final step.
How does Template Method help with the compression pipeline?
The compression pipeline always follows the same steps: pre-process, compress, post-process. What varies is what happens at each hook. A checksum compressor adds hash computation after compression. A logging compressor prints sizes before and after. Template Method defines the skeleton once, and subclasses override only the hooks they care about.
Interview Follow-ups
- "How would you handle streaming for large files?" Process data in fixed-size chunks rather than loading the entire file. Each chunk is compressed independently with a chunk header storing the compressed and original sizes. The Decorator pipeline processes each chunk as it flows through, keeping memory usage bounded regardless of file size.
- "How would you add parallel compression?" Divide the file into blocks and compress each block on a separate thread. Use a thread pool or fork-join framework. The challenge is reassembly: blocks must be written in order, so use a sequence number per block and a merging step that waits for blocks in order. The Strategy interface stays unchanged since each thread gets its own strategy instance.
- "How would you support incremental archives?" Track file modification timestamps. On a second archive pass, only compress files that changed since the last archive. Store a manifest mapping file paths to their last-known hash and timestamp. The Builder can accept a previous manifest and skip unchanged files.
- "How would you add deduplication?" Hash each file (or each chunk) before compression. Maintain a content-addressable store keyed by hash. If a hash already exists, store a reference instead of the data. This is how systems like rsync and modern backup tools minimize storage. The Composite tree stores references for duplicate entries, pointing to the same underlying compressed data.
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from collections import Counter
import heapq
import struct
import hashlib
import base64
# ── Strategy Pattern: Compression algorithms ──────────────────────────
class CompressionStrategy(ABC):
"""Interface for pluggable compression algorithms."""
@abstractmethod
def compress(self, data: bytes) -> bytes: ...
@abstractmethod
def decompress(self, data: bytes) -> bytes: ...
class RLECompression(CompressionStrategy):
"""Run-Length Encoding. Best for data with long runs of repeated bytes."""
def compress(self, data: bytes) -> bytes:
if not data:
return b""
result = bytearray()
i = 0
while i < len(data):
byte = data[i]
count = 1
while i + count < len(data) and data[i + count] == byte and count < 255:
count += 1
result.append(count)
result.append(byte)
i += count
return bytes(result)
def decompress(self, data: bytes) -> bytes:
result = bytearray()
for i in range(0, len(data), 2):
count = data[i]
byte = data[i + 1]
result.extend([byte] * count)
return bytes(result)
class HuffmanCompression(CompressionStrategy):
"""Huffman coding. Builds a frequency-based prefix tree for encoding."""
def compress(self, data: bytes) -> bytes:
if not data:
return b""
freq = Counter(data)
# Build Huffman tree using a heap
heap: list[list] = [[count, [byte, ""]] for byte, count in freq.items()]
heapq.heapify(heap)
if len(heap) == 1:
heap[0][1][1] = "0"
else:
while len(heap) > 1:
lo = heapq.heappop(heap)
hi = heapq.heappop(heap)
for pair in lo[1:]:
pair[1] = "0" + pair[1]
for pair in hi[1:]:
pair[1] = "1" + pair[1]
heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
codes = {byte: code for byte, code in heap[0][1:]}
# Encode: store table size, table entries, then bit string as bytes
header = bytearray()
header.append(len(codes))
for byte_val, code in codes.items():
header.append(byte_val)
header.append(len(code))
# Pack code bits into the header as a string for simplicity
header.extend(code.encode("ascii"))
bit_string = "".join(codes[b] for b in data)
# Pad bit string to multiple of 8
padding = (8 - len(bit_string) % 8) % 8
bit_string += "0" * padding
encoded = bytearray()
for i in range(0, len(bit_string), 8):
encoded.append(int(bit_string[i:i + 8], 2))
# Store: padding (1 byte) + header_len (2 bytes) + header + encoded
header_bytes = bytes(header)
result = struct.pack(">BH", padding, len(header_bytes)) + header_bytes + bytes(encoded)
return result
def decompress(self, data: bytes) -> bytes:
if not data:
return b""
padding = data[0]
header_len = struct.unpack(">H", data[1:3])[0]
header = data[3:3 + header_len]
encoded = data[3 + header_len:]
# Rebuild code table
idx = 0
num_codes = header[idx]; idx += 1
codes: dict[str, int] = {}
for _ in range(num_codes):
byte_val = header[idx]; idx += 1
code_len = header[idx]; idx += 1
code = header[idx:idx + code_len].decode("ascii"); idx += code_len
codes[code] = byte_val
# Decode bit string
bit_string = "".join(f"{b:08b}" for b in encoded)
if padding:
bit_string = bit_string[:-padding]
result = bytearray()
current = ""
for bit in bit_string:
current += bit
if current in codes:
result.append(codes[current])
current = ""
return bytes(result)
# ── Decorator Pattern: Stream processing layers ───────────────────────
class StreamDecorator(ABC):
"""Base decorator for layered data processing."""
@abstractmethod
def process(self, data: bytes) -> bytes: ...
@abstractmethod
def unprocess(self, data: bytes) -> bytes: ...
class ChecksumDecorator(StreamDecorator):
"""Appends a SHA-256 checksum to the data for integrity verification."""
HASH_SIZE = 32
def process(self, data: bytes) -> bytes:
checksum = hashlib.sha256(data).digest()
return data + checksum
def unprocess(self, data: bytes) -> bytes:
payload = data[:-self.HASH_SIZE]
stored_checksum = data[-self.HASH_SIZE:]
computed = hashlib.sha256(payload).digest()
if computed != stored_checksum:
raise ValueError("Checksum verification failed : data is corrupted")
return payload
def verify(self, data: bytes) -> bool:
try:
self.unprocess(data)
return True
except ValueError:
return False
class Base64Decorator(StreamDecorator):
"""Encodes data as Base64 for safe text transport."""
def process(self, data: bytes) -> bytes:
return base64.b64encode(data)
def unprocess(self, data: bytes) -> bytes:
return base64.b64decode(data)
# ── Composite Pattern: Archive tree structure ─────────────────────────
class ArchiveEntry(ABC):
"""Abstract node in the archive tree."""
def __init__(self, name: str):
self._name = name
@property
def name(self) -> str:
return self._name
@abstractmethod
def is_directory(self) -> bool: ...
@abstractmethod
def total_size(self) -> int: ...
@abstractmethod
def display(self, indent: int = 0) -> str: ...
class FileEntry(ArchiveEntry):
"""Leaf node holding original and compressed data."""
def __init__(self, name: str, data: bytes, compressed: bytes):
super().__init__(name)
self._data = data
self._compressed = compressed
def is_directory(self) -> bool:
return False
def total_size(self) -> int:
return len(self._compressed)
@property
def original_size(self) -> int:
return len(self._data)
@property
def compressed_data(self) -> bytes:
return self._compressed
def display(self, indent: int = 0) -> str:
ratio = (1 - len(self._compressed) / len(self._data)) * 100 if self._data else 0
return f"{' ' * indent}FILE {self._name} ({len(self._data)}B -> {len(self._compressed)}B, {ratio:.1f}% saved)"
class DirectoryEntry(ArchiveEntry):
"""Composite node containing files and sub-directories."""
def __init__(self, name: str):
super().__init__(name)
self._children: list[ArchiveEntry] = []
def is_directory(self) -> bool:
return True
def add_child(self, entry: ArchiveEntry) -> None:
self._children.append(entry)
@property
def children(self) -> list[ArchiveEntry]:
return list(self._children)
def total_size(self) -> int:
return sum(child.total_size() for child in self._children)
def display(self, indent: int = 0) -> str:
lines = [f"{' ' * indent}DIR {self._name}/ ({self.total_size()}B compressed)"]
for child in self._children:
lines.append(child.display(indent + 1))
return "\n".join(lines)
# ── Builder Pattern: Archive construction ─────────────────────────────
class ArchiveBuilder:
"""Fluent builder for constructing archives with validation."""
def __init__(self):
self._root = DirectoryEntry("archive")
self._strategy: CompressionStrategy = RLECompression()
self._encryption_key: str | None = None
self._decorators: list[StreamDecorator] = []
def set_compression(self, strategy: CompressionStrategy) -> "ArchiveBuilder":
self._strategy = strategy
return self
def set_encryption(self, key: str) -> "ArchiveBuilder":
if len(key) < 4:
raise ValueError("Encryption key must be at least 4 characters")
self._encryption_key = key
return self
def add_decorator(self, decorator: StreamDecorator) -> "ArchiveBuilder":
self._decorators.append(decorator)
return self
def add_file(self, name: str, data: bytes, directory: DirectoryEntry | None = None) -> "ArchiveBuilder":
compressed = self._strategy.compress(data)
# Apply decorators in order (compress first, then layer on top)
for dec in self._decorators:
compressed = dec.process(compressed)
entry = FileEntry(name, data, compressed)
target = directory if directory else self._root
target.add_child(entry)
return self
def add_directory(self, name: str, parent: DirectoryEntry | None = None) -> DirectoryEntry:
directory = DirectoryEntry(name)
target = parent if parent else self._root
target.add_child(directory)
return directory
def build(self) -> DirectoryEntry:
if self._encryption_key and not any(isinstance(d, Base64Decorator) for d in self._decorators):
print(" [Builder Warning] Encryption key set but no encoding decorator added")
return self._root
# ── Template Method: Compressor skeleton ──────────────────────────────
class Compressor(ABC):
"""Template Method for file processing. Defines the skeleton:
read -> pre-process -> compress -> post-process -> return."""
def __init__(self, strategy: CompressionStrategy):
self._strategy = strategy
def process_file(self, data: bytes) -> bytes:
"""Template method : fixed sequence, customizable hooks."""
data = self.before_compress(data)
compressed = self._strategy.compress(data)
compressed = self.after_compress(compressed)
return compressed
def before_compress(self, data: bytes) -> bytes:
"""Hook: override to pre-process data before compression."""
return data
def after_compress(self, data: bytes) -> bytes:
"""Hook: override to post-process data after compression."""
return data
class ChecksumCompressor(Compressor):
"""Adds checksum computation after compression."""
def __init__(self, strategy: CompressionStrategy):
super().__init__(strategy)
self.last_checksum: str = ""
def after_compress(self, data: bytes) -> bytes:
self.last_checksum = hashlib.sha256(data).hexdigest()[:16]
return data
class LoggingCompressor(Compressor):
"""Logs each processing step for debugging."""
def before_compress(self, data: bytes) -> bytes:
print(f" [LoggingCompressor] Input: {len(data)} bytes")
return data
def after_compress(self, data: bytes) -> bytes:
print(f" [LoggingCompressor] Output: {len(data)} bytes")
return data
# ── Demo ──────────────────────────────────────────────────────────────
if __name__ == "__main__":
print("=" * 60)
print("FILE COMPRESSION SYSTEM : DESIGN PATTERN DEMO")
print("=" * 60)
# 1. Strategy Pattern: Compare compression algorithms
print("\n--- Strategy Pattern: Swappable Compression ---")
repetitive = b"AAAAAABBBBCCCCCCCCDDEE" * 5
text_data = b"the quick brown fox jumps over the lazy dog near the river" * 3
rle = RLECompression()
huffman = HuffmanCompression()
for label, data in [("Repetitive", repetitive), ("Text", text_data)]:
rle_out = rle.compress(data)
huf_out = huffman.compress(data)
print(f"\n {label} data ({len(data)}B):")
print(f" RLE: {len(rle_out)}B ({(1 - len(rle_out)/len(data))*100:.1f}% saved)")
print(f" Huffman: {len(huf_out)}B ({(1 - len(huf_out)/len(data))*100:.1f}% saved)")
# Verify round-trip
assert rle.decompress(rle_out) == data, "RLE round-trip failed"
assert huffman.decompress(huf_out) == data, "Huffman round-trip failed"
print(" Round-trip verification: PASSED")
# 2. Decorator Pattern: Layered stream processing
print("\n--- Decorator Pattern: Layered Processing ---")
raw = b"Sensitive financial report data 2026"
print(f" Raw data: {len(raw)}B")
checksum_dec = ChecksumDecorator()
base64_dec = Base64Decorator()
# Layer: compress -> checksum -> base64
step1 = rle.compress(raw)
print(f" After RLE compression: {len(step1)}B")
step2 = checksum_dec.process(step1)
print(f" After checksum layer: {len(step2)}B (+32B SHA-256)")
step3 = base64_dec.process(step2)
print(f" After Base64 layer: {len(step3)}B (text-safe)")
# Reverse the pipeline
r1 = base64_dec.unprocess(step3)
r2 = checksum_dec.unprocess(r1)
r3 = rle.decompress(r2)
assert r3 == raw, "Decorator pipeline round-trip failed"
print(" Pipeline round-trip: PASSED")
print(f" Checksum valid: {checksum_dec.verify(r1)}")
# 3. Composite Pattern: Archive tree
print("\n--- Composite Pattern: Archive Tree ---")
builder = (
ArchiveBuilder()
.set_compression(HuffmanCompression())
.add_decorator(ChecksumDecorator())
)
src_dir = builder.add_directory("src")
builder.add_file("main.py", b"import sys\nprint('hello world')\n" * 4, src_dir)
builder.add_file("utils.py", b"def helper(): return True\n" * 6, src_dir)
docs_dir = builder.add_directory("docs")
builder.add_file("README.md", b"# Project\nThis is the readme.\n" * 5, docs_dir)
builder.add_file("config.yaml", b"key: value\nport: 8080\n" * 3)
archive = builder.build()
print(f"\n{archive.display()}")
print(f"\n Total archive size: {archive.total_size()}B")
# 4. Builder validation
print("\n--- Builder Pattern: Validation ---")
try:
ArchiveBuilder().set_encryption("ab")
except ValueError as e:
print(f" Caught: {e}")
valid_builder = ArchiveBuilder().set_compression(RLECompression()).set_encryption("secure-key-2026")
valid_builder.add_file("data.bin", b"\x00" * 100 + b"\xFF" * 100)
result = valid_builder.build()
print(f" Valid build: {result.total_size()}B compressed")
# 5. Template Method: Processing hooks
print("\n--- Template Method: Compressor Hooks ---")
sample = b"AAAAABBBBBCCCCC" * 10
print(" ChecksumCompressor:")
cc = ChecksumCompressor(RLECompression())
compressed = cc.process_file(sample)
print(f" {len(sample)}B -> {len(compressed)}B, checksum: {cc.last_checksum}")
print("\n LoggingCompressor:")
lc = LoggingCompressor(HuffmanCompression())
compressed = lc.process_file(sample)
print("\n" + "=" * 60)
print("All operations completed successfully.")Common Mistakes
- ✗Hardcoding compression algorithm: you can't optimize per file type
- ✗Applying encryption before compression: encrypted data doesn't compress well
- ✗Not using Composite for directories: leads to flat file lists with path string parsing
- ✗Loading entire file into memory: you need streaming for large archives
Key Points
- ✓Strategy makes compression algorithm swappable: choose RLE for repetitive data, Huffman for general text
- ✓Decorator for stream processing layers that compose: encryption wraps compression wraps raw data
- ✓Composite for archive structure: directories contain files and sub-directories, recursive traversal
- ✓Builder validates constraints: encryption requires a key, compression level must be valid