Pastebin
Short ID, optional expiry, optional password. The scaling trick is generating unique IDs without a central counter — base62 of a random source does the job.
Key Abstractions
- Paste: record whose content never changes after creation: id, content, created, expires, visibility, password hash
- IdGenerator: strategy for unique short IDs (base62 random, Snowflake, or counter)
- Storage: repository abstraction (in-memory, SQL, or blob store behind the same interface)
- PasswordHasher: salt + hash generator used to store paste passwords without exposing the plaintext
- PastebinService: facade. Creates, fetches, and deletes pastes.
The Key Insight
Pastebin looks like a URL shortener: the same "store blob, get ID" shape. The difference is that pastes are expiring, password-protected, and often large. So the design leans into three layers that a URL shortener can skip: an ID generator that's unguessable, an access gate that's information-hiding, and a storage interface that supports size-aware backends.
The ID generator is the interesting call. An auto-increment integer is catastrophic — anyone can iterate IDs and scrape every public paste. Base62 of 7 characters gives 62^7 ≈ 3.5 trillion possible IDs. At one billion pastes, the chance of collision on a random new ID is about 1 in 3,500. A retry loop handles the rest.
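The figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes IDs are drawn uniformly at random; the function names are illustrative and not part of the implementation further down.

```python
ALPHABET_SIZE = 62  # digits + lowercase + uppercase
ID_LENGTH = 7


def id_space(length: int = ID_LENGTH) -> int:
    """Number of distinct IDs of the given length."""
    return ALPHABET_SIZE ** length


def collision_probability(existing: int, length: int = ID_LENGTH) -> float:
    """Chance that one uniformly random new ID matches an existing paste."""
    return existing / id_space(length)


def all_retries_fail(existing: int, retries: int, length: int = ID_LENGTH) -> float:
    """Chance that every one of `retries` independent draws collides."""
    return collision_probability(existing, length) ** retries


if __name__ == "__main__":
    print(id_space())                              # 3521614606208
    print(1 / collision_probability(10**9))        # roughly 3,500
    print(all_retries_fail(10**9, retries=5))      # far below 1e-17
```

With a billion pastes stored, a single draw collides about once in 3,500 attempts, and five consecutive collisions are unlikely enough that a small retry budget is sufficient.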
The access gate is a security decision. "Paste not found" is the only error an outside caller should see: not "paste expired," not "wrong password," not "rate limited." Distinct error messages leak the existence of private pastes to attackers probing IDs.
Requirements
Functional
- Create a paste with content, optional TTL, optional password, optional burn-on-read
- Fetch a paste by ID, verifying password if set
- Automatic deletion on read for burn-on-read pastes
- Background sweep of expired pastes
- Enforce content size limits
Non-Functional
- Short, unguessable IDs
- Passwords stored hashed, never plaintext
- Size bound per paste (prevent pathological inputs)
- Storage backend swappable (memory for tests, durable store for prod)
- Thread-safe creation and fetch
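For the "passwords stored hashed" requirement, a slow, salted key-derivation function is the usual choice. The sketch below uses `hashlib.scrypt` from the standard library as a stand-in for argon2 or bcrypt, which need third-party packages. The function names and the cost parameters are assumptions chosen for illustration, not tuned recommendations.

```python
import hashlib
import hmac
import secrets

# Illustrative cost parameters (assumption); tune them for your hardware.
SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1
DKLEN = 32


def _derive(password: str, salt: bytes) -> bytes:
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P, dklen=DKLEN,
    )


def hash_password(password: str) -> str:
    """Return 'salt$digest', both hex-encoded, with a fresh random salt."""
    salt = secrets.token_bytes(16)
    return f"{salt.hex()}${_derive(password, salt).hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    try:
        salt_hex, digest_hex = stored.split("$", 1)
        salt = bytes.fromhex(salt_hex)
        expected = bytes.fromhex(digest_hex)
    except ValueError:
        return False  # malformed record
    return hmac.compare_digest(_derive(password, salt), expected)
```

Because the salt is random per call, hashing the same password twice yields two different stored strings, and both verify.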
Design Decisions
Why random base62 instead of a counter?
A counter exposes the total paste count and allows enumeration. Random generation yields unguessable IDs at the cost of a collision check on insert. At 7 characters and a billion stored pastes, a new ID collides about 0.03% of the time, which a retry loop absorbs.
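To see why a counter is enumerable, consider a hypothetical scheme that base62-encodes an auto-increment integer. This sketch is for contrast only and is not what the implementation uses; the helper names are made up for illustration.

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_base62(n: int) -> str:
    """Encode a non-negative integer in base62."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


def decode_base62(s: str) -> int:
    """Inverse of encode_base62."""
    n = 0
    for ch in s:
        n = n * 62 + BASE62.index(ch)
    return n


def neighbours(paste_id: str) -> tuple[str, str]:
    """With counter IDs, the previous and next paste are one step away."""
    n = decode_base62(paste_id)
    return encode_base62(n - 1), encode_base62(n + 1)


if __name__ == "__main__":
    print(neighbours("10"))  # ('Z', '11')
```

Anyone holding one valid ID can decode it, learn roughly how many pastes exist, and walk the whole ID space in order. Random IDs give no such foothold.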
Why lazy expiry plus a sweep?
Per-paste timer threads scale badly (a million pastes means a million idle timers). Instead, a single background sweep runs every N minutes to reclaim storage, and a lazy expiry check on every read covers the gap between sweeps. An expired paste may linger in storage until the next sweep, but it can never be returned to a reader.
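One way to run that single background sweep is a daemon thread that wakes on an interval. A minimal sketch: `start_sweeper` is a hypothetical helper, and `sweep` is assumed to be any zero-argument callable such as the service's `sweep_expired` method.

```python
import threading
from typing import Callable


def start_sweeper(sweep: Callable[[], int], interval_seconds: float) -> threading.Event:
    """Call `sweep` every `interval_seconds` until the returned event is set."""
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout (time to sweep)
        # and True once stop has been set (time to exit).
        while not stop.wait(interval_seconds):
            sweep()

    threading.Thread(target=loop, name="paste-sweeper", daemon=True).start()
    return stop
```

Usage would look like `stop = start_sweeper(service.sweep_expired, 300)` at startup and `stop.set()` at shutdown. One thread covers every paste regardless of how many exist.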
Why a unified "not found" error?
An attacker trying to find private pastes benefits from any differentiation: "expired" tells them an ID existed, "wrong password" confirms the paste is real. Returning the same error for all failure modes kills the oracle.
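A sketch of how the distinction can be kept for operators while hiding it from callers: the real reason goes to a server-side log, and every path raises the same exception with the same message. `DenyReason`, `PasteNotFound`, and `deny` are hypothetical names introduced here for illustration.

```python
import logging
from enum import Enum

log = logging.getLogger("pastebin.access")


class DenyReason(Enum):
    MISSING = "missing"
    EXPIRED = "expired"
    BAD_PASSWORD = "bad_password"


class PasteNotFound(LookupError):
    """The only failure a caller ever sees."""

    def __init__(self) -> None:
        super().__init__("Paste not found")


def deny(paste_id: str, reason: DenyReason) -> PasteNotFound:
    """Log the real reason internally; return an identical error for all of them."""
    log.info("denied paste=%s reason=%s", paste_id, reason.value)
    return PasteNotFound()


if __name__ == "__main__":
    for reason in DenyReason:
        print(repr(str(deny("abc1234", reason))))  # 'Paste not found' each time
```

A fetch path would then `raise deny(paste_id, DenyReason.EXPIRED)` and so on, so the response body and exception type never vary with the cause.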
Why Storage as an interface?
Tests need a fast, deterministic store. Prod needs durability, replication, size efficiency. Interface-driven storage lets the service stay identical across both — the only thing that changes is the constructor argument.
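As an illustration of a second backend, here is a sketch of a durable store built on the standard library's `sqlite3`, standing in for whatever database production would use. It is a simplification: it stores plain fields rather than full paste objects, the class name and schema are assumptions, and an adapter would be needed to plug it into the service's storage interface.

```python
import sqlite3
from typing import Optional


class SqlitePasteStore:
    """Hypothetical durable backend with the same four operations."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS pastes ("
            " id TEXT PRIMARY KEY,"
            " content TEXT NOT NULL,"
            " expires_at REAL)"  # Unix timestamp; NULL means never expires
        )

    def save(self, paste_id: str, content: str,
             expires_at: Optional[float] = None) -> None:
        try:
            with self._db:  # commits on success, rolls back on error
                self._db.execute(
                    "INSERT INTO pastes VALUES (?, ?, ?)",
                    (paste_id, content, expires_at),
                )
        except sqlite3.IntegrityError:
            # PRIMARY KEY violation: the database enforces ID uniqueness.
            raise ValueError("ID collision") from None

    def find(self, paste_id: str) -> Optional[tuple]:
        return self._db.execute(
            "SELECT id, content, expires_at FROM pastes WHERE id = ?",
            (paste_id,),
        ).fetchone()

    def delete(self, paste_id: str) -> None:
        with self._db:
            self._db.execute("DELETE FROM pastes WHERE id = ?", (paste_id,))

    def all_expired(self, now: float) -> list[str]:
        rows = self._db.execute(
            "SELECT id FROM pastes"
            " WHERE expires_at IS NOT NULL AND expires_at <= ?",
            (now,),
        ).fetchall()
        return [row[0] for row in rows]
```

Letting the primary key reject duplicates moves the collision check into the database, so the "find, then save" pair no longer needs an application-level lock to be safe.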
Why bound content size?
Without a cap, someone uploads a 10GB paste and exhausts the server. The limit is a business decision (Pastebin.com caps at 512KB for free tier) enforced at the boundary — far cheaper than post-hoc cleanup.
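Enforcing the cap at the boundary can be as small as the sketch below. One detail worth getting right is that the limit is in bytes, while `len()` on a Python string counts characters, so the check encodes first. The function and constant names are illustrative assumptions.

```python
MAX_CONTENT_BYTES = 512 * 1024  # 512 KB, matching the cap discussed above


def validate_content(content: str, max_bytes: int = MAX_CONTENT_BYTES) -> bytes:
    """Reject empty or oversized content; return the UTF-8 bytes to store."""
    if not content:
        raise ValueError("content is required")
    encoded = content.encode("utf-8")
    # Non-ASCII characters take 2 to 4 bytes each, so compare the encoded size.
    if len(encoded) > max_bytes:
        raise ValueError(f"content exceeds {max_bytes} bytes")
    return encoded
```

For example, six copies of "é" are 6 characters but 12 bytes, so they pass a 10-character check and fail a 10-byte one.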
Interview Follow-ups
- "How would you scale writes to millions per day?" Partition storage by first char of the ID. Each shard handles 1/62 of writes. ID generator stays random; partition is derived.
- "How do you handle syntax highlighting for different languages?" A language field on the paste. Storage is unchanged; rendering is a UI concern.
- "What about full-text search?" Separate index (Elasticsearch) indexed on public pastes only. Private and password-protected pastes stay out of the index.
- "How would you prevent abuse?" Rate-limit by IP on create (token bucket), content-scan for phishing/malware signatures, require captcha above N pastes/hour.
- "How do you support edits?" Pastes are immutable by design (that's why URLs are linkable). An "edit" is actually a new paste that links back to the original via a parent_id column.
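The partitioning answer in the first follow-up can be sketched as a pure function of the ID. `shard_for` is a hypothetical helper; it assumes the same base62 alphabet as the ID generator and one shard per leading character by default.

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def shard_for(paste_id: str, shard_count: int = 62) -> int:
    """Derive the shard from the first character of the ID."""
    if not paste_id:
        raise ValueError("paste_id is empty")
    # str.index raises ValueError for characters outside the alphabet.
    return BASE62.index(paste_id[0]) % shard_count


if __name__ == "__main__":
    print(shard_for("0abcdef"))  # 0
    print(shard_for("a000000"))  # 10
    print(shard_for("Zzzzzzz"))  # 61
```

Because IDs are uniformly random, each leading character is equally likely and writes spread roughly evenly. With a shard count that does not divide 62, some shards receive one more leading character than others, so the split is slightly uneven.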
Code Implementation
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from threading import RLock
import hashlib
import secrets
import time


BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


class Visibility(Enum):
    PUBLIC = "public"
    UNLISTED = "unlisted"
    PRIVATE = "private"


@dataclass
class Paste:
    id: str
    content: str
    created_at: datetime
    expires_at: datetime | None
    visibility: Visibility
    password_hash: str | None = None
    burn_on_read: bool = False
    views: int = 0
    view_limit: int | None = None  # None = unlimited


class IdGenerator(ABC):
    @abstractmethod
    def generate(self) -> str: ...


class Base62RandomGenerator(IdGenerator):
    """7 chars of base62 = ~3.5 trillion IDs. Collisions retry."""

    def __init__(self, length: int = 7):
        if length < 4:
            raise ValueError("length too short: collisions will be common")
        self._length = length

    def generate(self) -> str:
        # secrets.choice: cryptographically secure, unguessable.
        return "".join(secrets.choice(BASE62) for _ in range(self._length))


class Storage(ABC):
    @abstractmethod
    def save(self, paste: Paste) -> None: ...

    @abstractmethod
    def find(self, paste_id: str) -> Paste | None: ...

    @abstractmethod
    def delete(self, paste_id: str) -> None: ...

    @abstractmethod
    def all_expired(self, now: datetime) -> list[Paste]: ...


class InMemoryStorage(Storage):
    def __init__(self):
        self._map: dict[str, Paste] = {}
        self._lock = RLock()

    def save(self, paste: Paste) -> None:
        with self._lock:
            if paste.id in self._map:
                raise ValueError("ID collision")
            self._map[paste.id] = paste

    def find(self, paste_id: str) -> Paste | None:
        with self._lock:
            return self._map.get(paste_id)

    def delete(self, paste_id: str) -> None:
        with self._lock:
            self._map.pop(paste_id, None)

    def all_expired(self, now: datetime) -> list[Paste]:
        with self._lock:
            return [p for p in self._map.values() if p.expires_at and p.expires_at <= now]


class PasswordHasher:
    """Pluggable hasher. SHA-256 + salt here for brevity; production should use argon2/bcrypt."""

    def hash(self, password: str) -> str:
        salt = secrets.token_hex(16)
        digest = hashlib.sha256((salt + password).encode()).hexdigest()
        return f"{salt}${digest}"

    def verify(self, password: str, stored: str) -> bool:
        try:
            salt, digest = stored.split("$", 1)
        except ValueError:
            return False
        expected = hashlib.sha256((salt + password).encode()).hexdigest()
        # Constant-time compare defeats timing attacks.
        return secrets.compare_digest(expected, digest)


@dataclass
class CreateRequest:
    content: str
    ttl: timedelta | None = None
    password: str | None = None
    visibility: Visibility = Visibility.UNLISTED
    burn_on_read: bool = False
    view_limit: int | None = None


class PastebinService:
    MAX_CONTENT = 512 * 1024  # 512 KB per paste
    MAX_ID_RETRIES = 5

    def __init__(self, storage: Storage, id_gen: IdGenerator | None = None):
        self._storage = storage
        self._id_gen = id_gen or Base62RandomGenerator()
        self._hasher = PasswordHasher()
        self._lock = RLock()

    def create(self, req: CreateRequest) -> str:
        if not req.content:
            raise ValueError("content is required")
        # The limit is in bytes, so measure the encoded size, not the character count.
        if len(req.content.encode("utf-8")) > self.MAX_CONTENT:
            raise ValueError(f"content exceeds {self.MAX_CONTENT} bytes")
        if req.view_limit is not None and req.view_limit < 1:
            raise ValueError("view_limit must be positive")

        pw_hash = self._hasher.hash(req.password) if req.password else None
        # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
        now = datetime.now(timezone.utc)
        expires = (now + req.ttl) if req.ttl is not None else None

        # The lock makes "check the ID is free, then save" a single atomic step.
        with self._lock:
            for _ in range(self.MAX_ID_RETRIES):
                paste_id = self._id_gen.generate()
                if self._storage.find(paste_id) is None:
                    paste = Paste(
                        id=paste_id,
                        content=req.content,
                        created_at=now,
                        expires_at=expires,
                        visibility=req.visibility,
                        password_hash=pw_hash,
                        burn_on_read=req.burn_on_read,
                        view_limit=req.view_limit,
                    )
                    self._storage.save(paste)
                    return paste_id
        raise RuntimeError("Failed to generate unique ID after retries")

    def fetch(self, paste_id: str, password: str | None = None) -> str:
        # Unified "not found" for missing, expired, and wrong-password.
        # This prevents oracle attacks that probe for existence.
        now = datetime.now(timezone.utc)
        # Hold the lock so the view counter and the burn / view_limit delete
        # are atomic; otherwise two concurrent reads could both get a burn paste.
        with self._lock:
            paste = self._storage.find(paste_id)
            if paste is None:
                raise LookupError("Paste not found")
            if paste.expires_at and paste.expires_at <= now:
                # Lazy expiry: reclaim the paste on the read that finds it expired.
                self._storage.delete(paste_id)
                raise LookupError("Paste not found")
            if paste.password_hash:
                if password is None or not self._hasher.verify(password, paste.password_hash):
                    raise LookupError("Paste not found")

            # Successful read: bump counter, apply burn / view_limit.
            paste.views += 1
            if paste.burn_on_read:
                self._storage.delete(paste_id)
            elif paste.view_limit is not None and paste.views >= paste.view_limit:
                self._storage.delete(paste_id)
            return paste.content

    def view_count(self, paste_id: str) -> int:
        paste = self._storage.find(paste_id)
        return paste.views if paste else 0

    def delete(self, paste_id: str) -> None:
        self._storage.delete(paste_id)

    def sweep_expired(self) -> int:
        """Background hygiene. Safe to run periodically; lazy expiry also protects reads."""
        now = datetime.now(timezone.utc)
        count = 0
        for paste in self._storage.all_expired(now):
            self._storage.delete(paste.id)
            count += 1
        return count


if __name__ == "__main__":
    service = PastebinService(InMemoryStorage())

    # Simple paste.
    id1 = service.create(CreateRequest(content="print('hello world')"))
    print(f"Created: {id1}")
    print(f"Fetch: {service.fetch(id1)!r}")

    # Password-protected paste.
    id2 = service.create(CreateRequest(content="api_key=super-secret", password="hunter2"))
    try:
        service.fetch(id2)
    except LookupError:
        print("Correctly rejected fetch without password")
    print(f"With password: {service.fetch(id2, 'hunter2')!r}")

    # Burn on read.
    id3 = service.create(CreateRequest(content="one-time token", burn_on_read=True))
    print(f"Burn read 1: {service.fetch(id3)!r}")
    try:
        service.fetch(id3)
    except LookupError:
        print("Burn paste correctly gone after one read")

    # Expired paste is swept.
    id4 = service.create(CreateRequest(content="short-lived", ttl=timedelta(milliseconds=1)))
    time.sleep(0.01)
    count = service.sweep_expired()
    print(f"Swept {count} expired paste(s)")

    # View count + view limit. Third fetch hits the limit and auto-deletes.
    id5 = service.create(CreateRequest(content="counted", view_limit=3))
    service.fetch(id5)
    service.fetch(id5)
    print(f"Views after 2 fetches: {service.view_count(id5)}")
    service.fetch(id5)  # third fetch: limit hit, paste auto-deleted
    try:
        service.fetch(id5)
    except LookupError:
        print("Fourth fetch correctly rejected: view_limit reached.")

    print("All operations passed.")
Common Mistakes
- ✗ Auto-increment integer IDs. They're guessable — anyone can scrape every paste by iterating.
- ✗ Storing password in plaintext or with a weak hash. Compromise leaks everyone's secrets.
- ✗ Checking expiry only on the sweep thread. A just-expired paste can leak between ticks.
- ✗ Forgetting to return the same 'not found' response for missing, expired, and wrong-password pastes. Oracle attack.
Key Points
- ✓ Random base62 of 7 chars gives 62^7 ≈ 3.5 trillion IDs. Collisions are rare and a retry loop absorbs them.
- ✓ Lazy expiry: delete on read, sweep periodically. No per-paste timer threads.
- ✓ Password is hashed (bcrypt/argon2), never stored in plaintext. Verify on read.
- ✓ Storage interface means the same service works in-memory for tests and against Postgres in prod.