URL Shortener
Base62 encode a counter and store the mapping. Reads dominate writes 100:1, so optimize your lookup path. The strategy pattern lets you swap encoding schemes without rewiring anything.
Key Abstractions
Orchestrator that coordinates code generation, storage, and analytics
Strategy interface for generating short codes: base62, hash-based, or random
Persistence interface for storing and retrieving URL mappings
Tracks click counts, timestamps, and referrer data per short URL
Class Diagram
The Key Insight
A URL shortener is really two separate problems glued together. Problem one: generate a unique short code for a given URL. Problem two: look it up fast when someone clicks it. That second problem dominates your design because reads outnumber writes by a factor of 100x or more.
For code generation, base62 encoding a monotonic counter is the simplest approach that actually works. You get short, readable codes with zero collision risk. No need for retry loops or uniqueness checks. Hash-based approaches seem clever until you realize truncated hashes collide way more often than people expect, and now you need collision resolution logic cluttering up your write path.
The Strategy pattern on the code generator is not just academic tidiness. It is genuinely useful here. You might start with base62 counters, then switch to random codes for vanity URLs, or hash-based codes for idempotent shortening. The shortener itself stays identical.
Requirements
Functional
shorten(longUrl)accepts a valid URL and returns a shortened URLresolve(shortCode)redirects a short code to the original URL- Same long URL submitted twice returns the same short URL (deduplication)
- Basic click analytics: total clicks per short URL
Non-Functional
- Resolve must be fast since reads vastly outnumber writes
- Short codes should be compact, ideally 7 characters or fewer
- Thread-safe for concurrent access
- Code generation strategy should be swappable without modifying core logic
Design Decisions
Why base62 encoding over MD5 hashing?
Base62 on a counter gives you guaranteed uniqueness. Every increment produces a different code. MD5 hashing requires you to truncate the output to keep it short, which raises collision probability significantly. With truncated hashes, you need a check-and-retry loop on every write. Base62 avoids all of that complexity.
Why maintain a reverse mapping (long to short)?
Without it, submitting the same URL twice creates two different short codes pointing to the same destination. That wastes storage and confuses users who expect a stable mapping. The reverse lookup adds some memory overhead, but it keeps the system predictable and idempotent.
Why separate URLStore behind an interface?
Today it is an in-memory HashMap. Tomorrow it could be Redis, DynamoDB, or Postgres. The shortener logic does not care. This separation also makes testing trivial since you can inject a mock store that simulates failures or latency.
Why track analytics separately from the store?
Analytics and URL resolution have different access patterns and scaling needs. Bundling them together means your redirect latency pays for analytics overhead. Keeping them separate lets you make the redirect path as lean as possible and process analytics asynchronously if needed.
Interview Follow-ups
- "How would you handle custom aliases?" Add a
customShorten(alias, longUrl)method that checks if the alias is taken, then stores it directly. Skip the code generator entirely for custom requests. - "What happens when the counter overflows?" With a 7-character base62 code, you have 3.5 trillion unique values. If you somehow exhaust that, increase the code length by one character and you get 218 trillion more.
- "How do you prevent abuse?" Add a Rate Limiter per IP or API key. Validate URLs against a blocklist. Set expiration policies so abandoned short URLs get cleaned up.
- "How would you scale this for billions of URLs?" Partition the counter range across multiple servers. Each server owns a non-overlapping range, so they generate codes independently without coordination.
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from threading import Lock
from datetime import datetime
import hashlib
import string
class CodeGenerator(ABC):
"""Strategy interface for short code generation."""
@abstractmethod
def generate(self, url: str) -> str: ...
class Base62Generator(CodeGenerator):
"""Counter-based base62 encoding. No collisions possible."""
CHARS = string.ascii_letters + string.digits # a-zA-Z0-9
def __init__(self, start: int = 100000):
self._counter = start
self._lock = Lock()
def generate(self, url: str) -> str:
with self._lock:
self._counter += 1
return self._encode(self._counter)
def _encode(self, num: int) -> str:
if num == 0:
return self.CHARS[0]
result = []
while num > 0:
result.append(self.CHARS[num % 62])
num //= 62
return "".join(reversed(result))
class HashGenerator(CodeGenerator):
"""MD5-based generation. Needs collision handling."""
def __init__(self, length: int = 7):
self._length = length
def generate(self, url: str) -> str:
digest = hashlib.md5(url.encode()).hexdigest()
num = int(digest[:12], 16)
chars = string.ascii_letters + string.digits
result = []
for _ in range(self._length):
result.append(chars[num % 62])
num //= 62
return "".join(result)
class ClickStats:
def __init__(self, short_code: str):
self.short_code = short_code
self.total_clicks = 0
self.created_at = datetime.now()
def __repr__(self) -> str:
return f"ClickStats({self.short_code}, clicks={self.total_clicks})"
class Analytics:
"""Tracks click data per short code."""
def __init__(self):
self._data: dict[str, ClickStats] = {}
def init_tracking(self, short_code: str) -> None:
self._data[short_code] = ClickStats(short_code)
def record_click(self, short_code: str) -> None:
if short_code in self._data:
self._data[short_code].total_clicks += 1
def get_stats(self, short_code: str) -> ClickStats | None:
return self._data.get(short_code)
class URLStore(ABC):
"""Persistence interface for URL mappings."""
@abstractmethod
def save(self, short_code: str, long_url: str) -> None: ...
@abstractmethod
def find(self, short_code: str) -> str | None: ...
@abstractmethod
def exists_by_long_url(self, long_url: str) -> str | None: ...
class InMemoryURLStore(URLStore):
def __init__(self):
self._short_to_long: dict[str, str] = {}
self._long_to_short: dict[str, str] = {}
def save(self, short_code: str, long_url: str) -> None:
self._short_to_long[short_code] = long_url
self._long_to_short[long_url] = short_code
def find(self, short_code: str) -> str | None:
return self._short_to_long.get(short_code)
def exists_by_long_url(self, long_url: str) -> str | None:
return self._long_to_short.get(long_url)
class URLShortener:
"""Orchestrates URL shortening with pluggable code generation."""
def __init__(
self,
base_url: str = "https://short.ly/",
generator: CodeGenerator | None = None,
store: URLStore | None = None,
):
self._base_url = base_url
self._generator = generator or Base62Generator()
self._store = store or InMemoryURLStore()
self._analytics = Analytics()
def shorten(self, long_url: str) -> str:
if not long_url or not long_url.startswith(("http://", "https://")):
raise ValueError("Invalid URL")
existing = self._store.exists_by_long_url(long_url)
if existing:
return self._base_url + existing
code = self._generator.generate(long_url)
self._store.save(code, long_url)
self._analytics.init_tracking(code)
return self._base_url + code
def resolve(self, short_code: str) -> str | None:
long_url = self._store.find(short_code)
if long_url:
self._analytics.record_click(short_code)
return long_url
def get_stats(self, short_code: str) -> ClickStats | None:
return self._analytics.get_stats(short_code)
if __name__ == "__main__":
shortener = URLShortener()
# Shorten some URLs
s1 = shortener.shorten("https://example.com/very/long/path/to/resource")
s2 = shortener.shorten("https://another.com/page")
print(f"Shortened URL 1: {s1}")
print(f"Shortened URL 2: {s2}")
# Same URL returns same short link
s1_again = shortener.shorten("https://example.com/very/long/path/to/resource")
print(f"Same URL again: {s1_again}")
assert s1 == s1_again, "Same URL should return same short link"
# Resolve short codes
code1 = s1.replace("https://short.ly/", "")
resolved = shortener.resolve(code1)
print(f"Resolved: {resolved}")
assert resolved == "https://example.com/very/long/path/to/resource"
# Click tracking
shortener.resolve(code1)
shortener.resolve(code1)
stats = shortener.get_stats(code1)
print(f"Stats: {stats}")
assert stats.total_clicks == 3
# Invalid URL handling
try:
shortener.shorten("not-a-url")
except ValueError as e:
print(f"Caught expected error: {e}")
print("All assertions passed.")Common Mistakes
- ✗Using a full MD5/SHA hash and truncating it. Truncation dramatically increases collision probability.
- ✗Not handling the same long URL submitted twice. Decide upfront: same short code or different ones.
- ✗Skipping input validation. Malformed URLs, empty strings, and excessively long inputs all need guardrails.
- ✗Ignoring the read path. If you only optimize writes, your redirect latency will suffer under real traffic.
Key Points
- ✓Base62 encoding gives you [a-zA-Z0-9] characters. A 7-character code supports 62^7 = 3.5 trillion unique URLs.
- ✓Counter-based generation guarantees uniqueness without collision checks. Hash-based needs collision handling.
- ✓Read-to-write ratio is roughly 100:1. Cache the short-to-long mapping aggressively.
- ✓Strategy pattern on CodeGenerator means you can swap base62 for MD5-based hashing without touching the shortener.