Design an Ad Click Aggregator
Difficulty: Medium. Category: Data Processing.
Design an ad click aggregation system that processes billions of click events daily. The system must count clicks per ad in real-time, deduplicate clicks, detect fraud, and provide accurate billing data to advertisers.
Key Topics
- Stream Processing
- Lambda Architecture
- Click Deduplication
- Time Windows
- Fraud Detection
- Exactly-Once Semantics
Hints for Design an Ad Click Aggregator
- Ingest click events via Kafka (partitioned by ad_id for ordering). Use a stream processor (Flink for exactly-once guarantees, Spark Streaming for simpler setup) to aggregate clicks in time windows: 1-minute (real-time dashboard), 1-hour (billing reports), 1-day (daily summaries).
- Deduplicate clicks: same user clicking the same ad within 30 seconds = one click. Use a composite key (user_id, ad_id) in a Redis SET with TTL. On each click event, SADD both checks and records membership in a single command — it returns 0 if the member already exists, in which case skip the click. Bloom filter alternative if memory is constrained (accept ~1% under-count from false positives).
- Lambda architecture: real-time stream path (Flink aggregation for live dashboards, serves queries immediately) + batch path (hourly Spark job for accurate billing reconciliation). The batch path corrects any stream processing errors. Billing is always based on the batch path.
- Fraud detection pipeline (runs in parallel): flag suspicious patterns — same user clicking >5 times/minute (click spam), many clicks from one IP subnet (click farm), abnormal CTR spikes for an ad (competitor sabotage), clicks from known bot user agents. ML model trained on labeled fraud data improves over time.
- Store aggregated results in an OLAP database (ClickHouse for speed) or pre-aggregated tables. Schema: (ad_id, campaign_id, advertiser_id, time_bucket, click_count, unique_users, impressions). Advertisers query dashboards with rollups by campaign, ad group, creative, geo, device.
- Scale math: 10B ad impressions/day, 1% CTR = 100M clicks/day = ~1,160 clicks/sec. Peak during US business hours: 5-10x = ~10K clicks/sec. Each click event is ~200 bytes. Kafka throughput needed: ~0.25MB/sec sustained, ~2MB/sec peak. Easily manageable with a small Kafka cluster.
- Bottleneck: the deduplication layer. At 10K clicks/sec, Redis must handle 10K SADD operations/sec with TTL. A single Redis instance handles this easily. But if you also need to dedup across longer windows (daily unique clicks per ad), the SET sizes grow — millions of entries per popular ad.
- What breaks: late-arriving events. A click event from a mobile user on a poor connection arrives 5 minutes after it happened. The 1-minute window has already closed. Solution: use event-time processing (Flink watermarks) instead of processing-time. Allow late events up to a configurable threshold (e.g., 10 minutes). Re-aggregate affected windows.
- Don't use: database writes per click event (too slow, too expensive at scale). Don't skip the batch reconciliation path — stream processing can have subtle bugs that only batch processing catches. Don't aggregate by processing time — event time is what matters for accurate billing. Don't trust client-side click timestamps (clock skew, manipulation).
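The 30-second dedup rule above can be sketched in-process. This is a minimal stand-in for the Redis approach: a plain dict plays the role of the TTL'd set, and the class and method names (`ClickDeduper`, `accept`) are illustrative, not from the source.

```python
import time

class ClickDeduper:
    """In-memory sketch of the 30s dedup rule. Production would use a
    Redis/Valkey SET with TTL so state survives restarts and is shared
    across stream-processor instances."""

    def __init__(self, window_secs=30):
        self.window = window_secs
        self.seen = {}  # (user_id, ad_id) -> timestamp of last counted click

    def accept(self, user_id, ad_id, now=None):
        """Return True if this click should be counted, False if it is a
        duplicate within the dedup window."""
        now = time.time() if now is None else now
        key = (user_id, ad_id)
        last = self.seen.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate: same user + ad within the window
        self.seen[key] = now
        return True
```

Note the rejected click does not refresh the window, matching "same user clicking the same ad within 30 seconds = one click" counted from the first click.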
Design AI Agent Platform for Restaurant Ops
Difficulty: Hard. Category: AI & Agents.
Design a multi-tenant AI agent platform that monitors restaurant operations across thousands of restaurants. The system ingests real-time data from POS systems, delivery platforms (DoorDash, Uber Eats, Grubhub), payment processors, and inventory systems. It detects anomalies (revenue leaks, unusual refund patterns, commission errors), investigates them using LLM-powered agents with tool access, and triggers automated actions like dispute filing and reorder requests. Handle 50M+ events/day with sub-60-second detection latency.
Key Topics
- Multi-Agent Orchestration
- Real-Time Anomaly Detection
- Multi-Tenant Data Isolation
- LLM Tool Use & MCP
- Event-Driven Data Pipeline
- Cost-Optimized LLM Inference
Hints for Design AI Agent Platform for Restaurant Ops
- The core problem is four-way reconciliation: POS system (what the restaurant sold), delivery platforms (DoorDash, Uber Eats, Grubhub — what they say was ordered), payment processor (what money actually moved), and inventory system (what was consumed). Discrepancies between these four sources represent revenue leaks — missed payouts, phantom charges, incorrect commission rates. The platform detects these automatically.
- Data pipeline: connectors pull from 5+ third-party APIs (each with different schemas, rate limits, auth) → Kafka for ingestion → Flink for real-time normalization and anomaly detection → ClickHouse for analytics storage → PostgreSQL for transactional state. The normalization layer is critical: a 'Medium Pizza' in the POS, 'MED PIZ' on DoorDash, and 'Pizza (M)' in inventory are the same item. Fuzzy matching plus tenant-specific mapping tables.
- Anomaly detection runs on streaming data, not batch. Use statistical baselines per restaurant per metric (avg order value, refund rate, delivery time). When a metric exceeds 2-3 standard deviations from the rolling baseline, trigger an investigation. Don't use fixed thresholds — a $50 average ticket is normal for a steakhouse but anomalous for a taco shop.
- An AI agent is an LLM with a reasoning loop and tool access. The agent receives an anomaly alert, queries ClickHouse for historical data, pulls recent orders from the POS API, checks delivery platform records, and correlates signals. It doesn't hallucinate answers — it investigates using tools and cites specific data points. Each investigation is a sequence of tool calls, not a single prompt.
- Multi-agent collaboration for complex investigations: a refund agent spots unusual refund patterns, an inventory agent checks stock levels, a delivery agent checks platform-side disputes, and a pricing agent verifies commission rates. An orchestrator agent decomposes the investigation and delegates to specialists. Each agent has its own system prompt, tool set, and context window.
- Multi-tenant isolation is non-negotiable. Restaurant A's data must never leak into Restaurant B's investigation. Enforce tenant_id filtering at every layer: Kafka topic partitioning, ClickHouse row-level filtering, agent context construction, and tool call parameter injection. The agent never sees raw tenant IDs — they're injected server-side.
- Cost math: each investigation uses ~2K-5K input tokens and ~500-1K output tokens. At 10 investigations/day per restaurant across 1,000 restaurants, that's 10K investigations/day. Use a small model (Haiku/GPT-4o-mini) for triage and simple lookups, escalate to a larger model only for complex multi-agent investigations. Model routing cuts LLM costs by 50-70%.
- Scale: 50M+ events/day across 1,000 restaurants. Each restaurant generates 500-2,000 orders/day, each order has 5-10 associated events (created, modified, payment, delivery status updates). Detection latency target: <60 seconds from event to anomaly alert. Investigation completion target: <3 minutes. The pipeline must handle bursty traffic — Friday dinner rush is 5-10x the Tuesday lunch baseline.
- Don't use: a single monolithic agent for all investigation types — it will exceed context limits and lose focus. Don't poll third-party APIs on a fixed schedule — use webhooks where available and poll with exponential backoff where not. Don't store raw LLM conversation logs without PII redaction — restaurant financial data is sensitive. Don't skip agent guardrails — cap iterations (3-5), tool calls (10-15), and tokens (50K) per investigation to prevent runaway costs.
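The per-restaurant statistical baseline from the hints above can be sketched with a rolling window and a z-score test. This is a simplified illustration (class name, window size, and the choice to exclude anomalous points from the baseline are assumptions, not from the source); a production system would run this inside the stream processor.

```python
from collections import defaultdict, deque
import statistics

class BaselineDetector:
    """Rolling baseline per (restaurant, metric). Flags values more than
    z_threshold standard deviations from the rolling mean, so a $50 ticket
    is judged against that restaurant's own history, not a global number."""

    def __init__(self, window=50, z_threshold=3.0, min_samples=10):
        self.z = z_threshold
        self.min_samples = min_samples
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, restaurant_id, metric, value):
        """Record a metric sample; return True if it is anomalous."""
        hist = self.history[(restaurant_id, metric)]
        anomaly = False
        if len(hist) >= self.min_samples:
            mean = statistics.fmean(hist)
            stdev = statistics.pstdev(hist)
            if stdev > 0 and abs(value - mean) > self.z * stdev:
                anomaly = True
        if not anomaly:
            hist.append(value)  # keep anomalies out of the baseline
        return anomaly
```

An anomaly returning True is what would trigger the agent investigation pipeline.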
Design an AI Software Engineer
Difficulty: Hard. Category: AI & Agents.
Design a complete AI Software Engineer system that operates at three levels: inline autocomplete in 300ms (serving 100M completions/day), a codebase-aware agent that refactors across 12 files in 45 seconds, and a fully autonomous builder that takes a natural language spec, asks clarifying questions, designs the architecture, scaffolds the project, builds each module with error recovery, responds to feedback over hours, and deploys to production. Cover the local context engine (tree-sitter AST, dependency graphs, LSP, git), multi-model routing, agent tool system with sandbox execution, long-running checkpointing and crash recovery, persistent project memory, multi-agent orchestration, and the economics of serving 1M developers at $20/month.
Key Topics
- Three Levels (Autocomplete, Agent, Autonomous)
- Local Context Engine (AST, LSP, Git)
- Multi-Model Routing & Inference
- Agent Loop (Think, Act, Observe)
- Long-Running Memory & Checkpointing
- Multi-Agent Orchestration
Hints for Design an AI Software Engineer
- This system operates at three levels. Level 1 is autocomplete: predict next lines in 300ms. Level 2 is a codebase agent: search, read, edit, and test across files in 45 seconds. Level 3 is an autonomous software engineer: build entire apps from a spec over hours, with crash recovery and deployment. Each level subsumes the previous. As you go from L1 to L3, intelligence shifts from the model to the system around it. At L1 the model does half the work. At L3 it does maybe 10%.
- The local context engine is where the real magic happens, before the LLM ever sees a token. It includes: file system indexing with FS watchers for live updates, tree-sitter AST parsing (incremental, sub-millisecond re-parse on every keystroke, works on broken code), a dependency graph built from import analysis (who imports this file? who calls this function?), git integration (uncommitted diff tells the model what you're working on right now), and LSP diagnostics (type errors, lint warnings are free high-quality context most assistants ignore). Getting the right 2,000 tokens out of 500,000 lines is the real competitive moat.
- Multi-model routing is essential for economics. Inline completions use a 7B INT4 quantized model at $0.001 per request. Multi-line completions use a 34B INT8 model at $0.005. Agent tasks use a 70B+ FP16 model at $0.05. If everything is down, fall back to LSP completions. A loading spinner means developers disable the feature. Always have a response.
- The agent loop is: think, act, observe, repeat. The agent calls tools (search_files, read_file, edit_file, run_command) and the system executes them in a sandbox. After editing, it runs tests and type checks. If tests fail, it reads the error, fixes the code, and retries. The '3 strikes' rule prevents infinite loops: same error 3 times means stop and ask the developer.
- Level 3 requires systems you never need at L1 or L2. A task scheduler decomposes 'build me a SaaS app' into a DAG of 200 subtasks. Checkpointing saves state every 10 steps so the agent can recover from crashes. Persistent memory (stored in a project file like CLAUDE.md) remembers architecture decisions across sessions. Without memory, the agent forgets its own decisions and contradicts itself after 100 steps.
- Multi-agent orchestration solves the context window problem for large L3 tasks. A planner agent decomposes the spec into tasks. Backend and frontend agents work in parallel on different parts of the codebase. A reviewer agent checks all diffs before committing. File-level locks prevent two agents from editing the same file simultaneously.
- Post-processing is the unsung hero. Every completion passes 5 gates: syntax validation via tree-sitter, bracket balancing, import validation against the actual project dependency tree (this single check eliminates 30% of user complaints), style matching (indentation, naming convention), and deduplication. If your assistant doesn't validate imports, it's a toy.
- Track persistence rate, not just acceptance rate. Developers sometimes Tab-accept then immediately delete. The real metric is: did they keep the suggestion after 30 seconds? GitHub reports 88% of Copilot-generated code stays in the final version. Use persistence rate as the north star for model improvement, not raw acceptance.
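The agent loop with the '3 strikes' rule described above can be reduced to a small control function. The tool calls are abstracted behind two callables here (`run_tests` returns an error string or None, `fix_attempt` edits the code); the names and the iteration cap are illustrative assumptions.

```python
def agent_loop(run_tests, fix_attempt, max_strikes=3, max_iters=20):
    """Think-act-observe sketch: run tests, and on failure let the agent
    attempt a fix. Seeing the same error max_strikes times means the agent
    is looping, so stop and hand control back to the developer."""
    seen = {}  # error message -> times observed
    for _ in range(max_iters):
        err = run_tests()          # observe
        if err is None:
            return "done"
        seen[err] = seen.get(err, 0) + 1
        if seen[err] >= max_strikes:
            return "ask_developer"  # 3 strikes: stop, don't loop forever
        fix_attempt(err)            # think + act
    return "ask_developer"          # global budget exhausted
```

Counting per distinct error (rather than consecutive failures) lets the agent keep working as long as each fix changes the failure mode.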
Design a Chat System with E2EE
Difficulty: Medium. Category: Security.
Design a real-time chat application (like Signal or WhatsApp) that supports 1:1 and group messaging with end-to-end encryption. The server should never be able to read message contents. Consider key management, offline delivery, and multi-device support.
Key Topics
- End-to-End Encryption
- WebSocket
- Message Queue
- Key Exchange
- Signal Protocol
Hints for Design a Chat System with E2EE
- Use the Signal Protocol (Double Ratchet + X3DH key agreement) for E2EE. It provides forward secrecy (past messages stay safe if keys are compromised later) and break-in recovery (future messages become safe again after a compromise).
- WebSockets maintain persistent connections for real-time delivery. Design a connection gateway service that maps userId → WebSocket connection. Use consistent hashing to route messages to the correct gateway server.
- For offline users, store encrypted messages server-side (the server only sees ciphertext, never plaintext). Deliver on reconnect. Set a retention limit (e.g., 30 days) to bound storage growth.
- Group chats: Sender Keys (one key per sender, shared with all members) = O(1) encrypt cost but weaker forward secrecy. Pairwise encryption (encrypt message once per member) = O(n) cost but stronger security. WhatsApp uses Sender Keys for efficiency.
- Scale math: 1B users, 100B messages/day = ~1.15M messages/sec. Each message is ~1KB encrypted = ~1.15GB/sec write throughput. Use Cassandra (write-optimized, partition by chatId) for message storage.
- Bottleneck: the connection gateway. Each server handles ~500K concurrent WebSocket connections (kernel tuning needed). For 500M concurrent users, you need ~1,000 gateway servers. A separate presence service tracks which gateway hosts each user.
- What breaks: key distribution at scale. When a user adds a new device, it needs key exchange with all contacts. Use a key server to store public prekeys, but clients MUST verify key fingerprints out-of-band (QR code scan) to prevent man-in-the-middle attacks.
- Don't use: symmetric encryption alone (no forward secrecy), HTTP polling for real-time chat (too much latency and server overhead), or any scheme where the server can read messages even temporarily. Don't store encryption keys on the server.
- Multi-device: each device has its own key pair. Messages are encrypted separately for each of the recipient's devices. Use a device registry per user. When a device is removed, rotate the Sender Key for all groups that user is in.
- Message ordering: use Lamport timestamps or vector clocks to order messages correctly across devices. Client-side timestamps are unreliable (clock skew). The server assigns a monotonic sequence number per chat for display ordering.
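The Lamport-timestamp ordering mentioned in the last hint is small enough to sketch directly; this is the textbook rule (increment on send, take max-plus-one on receive), not anything protocol-specific to Signal or WhatsApp.

```python
class LamportClock:
    """Logical clock for ordering messages across devices without trusting
    wall-clock time. Ties (equal timestamps) still need a deterministic
    tie-break, e.g. by device ID."""

    def __init__(self):
        self.t = 0

    def send(self):
        """Increment before stamping an outgoing message."""
        self.t += 1
        return self.t

    def recv(self, remote_t):
        """On receipt, jump ahead of the sender's timestamp."""
        self.t = max(self.t, remote_t) + 1
        return self.t
```

Any message stamped later by this clock is guaranteed to be causally after any message it has seen, which is exactly what client-side wall clocks cannot guarantee.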
Design a Distributed Cache
Difficulty: Medium. Category: Infrastructure.
Design a distributed caching system like Memcached or Redis Cluster. It should support fast key-value lookups across multiple machines, handle node failures gracefully, and scale horizontally. Consider eviction policies, data partitioning, and replication strategies.
Key Topics
- Consistent Hashing
- Eviction Policies
- Replication
- Cache Invalidation
- LRU/LFU
Hints for Design a Distributed Cache
- Use consistent hashing with virtual nodes (150-200 per physical node) to distribute keys evenly. Without virtual nodes, adding/removing a server causes massive key redistribution and a thundering herd of cache misses.
- Cache invalidation is the hardest problem in CS. Patterns ranked by complexity: TTL-based (simplest, tolerate staleness), cache-aside (app manages both cache and DB), write-through (write to cache + DB atomically), write-behind (write to cache, async DB sync — risky if cache node crashes before DB write).
- For high availability, replicate each partition to 2-3 nodes. Async replication: lower write latency but stale reads possible. Sync replication: guarantees consistency but adds ~1-2ms per write. Choose based on your consistency requirements.
- Hot key problem: a single viral key (celebrity profile, flash sale item) can overload one cache node. Solutions: replicate hot keys to multiple shards, add an L1 local cache on each app server with 5-10s TTL, or use request coalescing (only one request fetches from origin, others wait on a promise).
- Scale math: a single Redis node handles ~100K ops/sec and stores ~25GB. For 1M ops/sec you need ~10 shards. For 500GB of cached data you need ~20 shards. Consistent hashing lets you add shards without full redistribution.
- Eviction policies: LRU is the safe default (evict least recently used). LFU is better when frequency matters (keeps consistently popular items longer). Random eviction is surprisingly effective and avoids the overhead of tracking access metadata.
- What breaks at scale: thundering herd — a popular key expires and thousands of concurrent requests simultaneously miss cache and slam the database. Fix with cache locking (mutex on miss — only one request fetches from DB), staggered TTLs (add random jitter ±30s), or background refresh (never expire, refresh before TTL).
- Don't use: a distributed cache for data that changes every second (just query the DB). Don't cache objects with complex invalidation dependencies (cache A depends on cache B depends on cache C → guaranteed stale data cascades). Don't use Memcached if you need persistence or complex data structures — use Redis.
- Monitoring: track cache hit ratio (should be >95%), p99 latency, eviction rate, and memory usage. If hit ratio drops below 90%, your cache is too small or your access pattern isn't cache-friendly.
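The consistent-hashing-with-virtual-nodes scheme from the first hint can be sketched in a few lines. MD5 and the vnode count are illustrative choices; any uniform hash works.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes. Each physical node appears vnodes
    times on the ring, so keys spread evenly and adding a node only
    remaps the keys that fall on its vnodes."""

    def __init__(self, nodes=(), vnodes=150):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#vnode{i}"), node))

    def get(self, key):
        """Route a key to the first vnode clockwise from its hash."""
        i = bisect.bisect(self.ring, (self._hash(key), ""))
        if i == len(self.ring):
            i = 0  # wrap around the ring
        return self.ring[i][1]
```

Adding a fourth node to a three-node ring should remap roughly a quarter of the keys, versus nearly all of them under naive `hash(key) % N`.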
Design Dropbox
Difficulty: Hard. Category: Storage & Sync.
Design a cloud file storage and sync service like Dropbox. Users can upload files from one device and access them from any other device. Support file versioning, sharing, and efficient sync.
Key Topics
- File Chunking
- Deduplication
- Sync Protocol
- Conflict Resolution
- Metadata Service
- Block Storage
Hints for Design Dropbox
- Split files into fixed-size chunks (4MB). Only upload changed chunks, not the entire file. Editing 1 byte in a 1GB file should upload 4MB, not 1GB. Use content-defined chunking (Rabin fingerprinting) for even better deduplication across similar files.
- Content-based hashing (SHA-256 per chunk) enables deduplication. If two users upload the same file, store chunks once, point both file records to the same chunk IDs. Saves 30-50% storage in practice. Also saves bandwidth — client computes hash first, asks server 'do you have this chunk?', only uploads if no.
- Metadata service is the brain: maps files → list of chunk IDs → storage locations. Use PostgreSQL with ACID guarantees. Partition by userId for horizontal scaling. This is the most critical service — if metadata is lost or corrupted, files are unrecoverable.
- Conflict resolution: last-writer-wins is dangerous for important files. Dropbox creates a 'conflicted copy' when two devices edit the same file before syncing. Always preserve both versions and notify the user. Data loss is worse than a conflict notification.
- Real-time sync via WebSocket or long polling. When user edits a file on laptop, a notification pushes to all other devices within seconds. The device then fetches only the changed chunks from the server.
- Scale math: 500M users, average 2GB stored = 1EB total. With dedup, actual storage ~500PB. At 4MB chunks = ~125 billion chunks. Store chunks in object storage (S3/GCS). Store chunk metadata in Cassandra (chunk_hash → storage_location).
- Bottleneck: the metadata service. Every file operation hits it. Cache aggressively in Redis (user's file tree). Read from replicas for listing operations. All writes go through the primary for consistency.
- What breaks: sync loops — Device A syncs to B, B syncs back to A, repeat forever. Fix with vector clocks or logical timestamps. Each device tracks the last seen version; only sync if the remote version is strictly newer.
- Don't use: direct S3 uploads for every small edit — too expensive and slow. Use a staging/upload service that receives chunks, deduplicates, then writes to cold storage in batches. Don't store file contents in a relational database — use object storage. Don't skip chunking — it's the foundation of efficient sync.
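The ask-before-upload dedup flow from the hashing hint can be sketched with fixed-size chunks. A dict stands in for the server-side chunk store, and the chunk size is shrunk for the example; real Dropbox-style systems use 4MB chunks and content-defined boundaries.

```python
import hashlib

def chunk_hashes(data, chunk_size):
    """SHA-256 per fixed-size chunk; the chunk hash is the dedup key."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def upload(data, server_chunks, chunk_size):
    """Client computes hashes first and only sends chunks the server
    doesn't already have. Returns the number of chunks transferred."""
    sent = 0
    for i, h in enumerate(chunk_hashes(data, chunk_size)):
        if h not in server_chunks:
            server_chunks[h] = data[i * chunk_size:(i + 1) * chunk_size]
            sent += 1
    return sent
```

Re-uploading an unchanged file transfers zero chunks, and editing one chunk's worth of data transfers exactly one chunk, which is the whole point of chunking.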
Design FB Post Search
Difficulty: Medium. Category: Search & Discovery.
Design the search functionality for a social platform like Facebook. Users can search for posts, people, pages, and groups. Results should be ranked by relevance, recency, and social connection. Support typeahead suggestions.
Key Topics
- Inverted Index
- Elasticsearch
- Tokenization
- Search Ranking
- Typeahead
- Index Partitioning
Hints for Design FB Post Search
- Build an inverted index mapping words → post IDs. Use Elasticsearch (built on Lucene) as the search engine. It handles tokenization, inverted indexing, and relevance scoring out of the box. Don't build your own search engine from scratch.
- Index updates: tokenize posts on write (lowercase, remove stop words, stem words, extract entities). Don't update the search index synchronously — publish a post-created event to Kafka, and a separate indexing service consumes and updates Elasticsearch asynchronously. Indexing latency of 1-5 seconds is acceptable.
- Ranking model: relevance score (BM25/TF-IDF from Elasticsearch) + recency (newer posts rank higher) + social signals (posts from friends rank above strangers) + engagement (likes, comments, shares weighted by recency) + content type (posts with photos rank higher). Use Elasticsearch's function_score query for custom ranking.
- Partition the search index by time range. Recent posts (last 30 days) are searched 10x more than older posts. Keep recent posts in a hot index (more shards, more RAM), and older posts in a cold index (fewer resources). This reduces cost by 60-70%.
- Typeahead: combine a prefix trie for instant suggestions with Elasticsearch's completion suggester. Return mixed results — posts, people, pages, groups — ranked by relevance and user relationship. Debounce keystrokes on the client (200ms) to reduce backend queries.
- Scale math: 3B users, 1B posts/day. Search index size: trillions of documents over 10+ years. Even just the recent 30-day index is ~30B documents. Elasticsearch cluster needs 100s of nodes. Query rate: ~100K searches/sec.
- Bottleneck: indexing throughput. 1B new posts/day = ~11,500 index writes/sec. Each post needs tokenization, entity extraction, and Elasticsearch indexing. Use a pipeline: Kafka → NLP enrichment service → Elasticsearch bulk indexer (batch 1000 docs per bulk request).
- What breaks: search relevance for ambiguous queries. 'Apple' — the company, the fruit, or your friend named Apple? Social context is critical. Boost results from friends, pages you follow, and groups you're in. Use the user's profile and activity history to disambiguate.
- Don't use: SQL LIKE queries (full table scan, impossible at scale). Don't index every field (index only searchable text, store metadata separately). Don't return all matches (paginate, show top 20 results). Don't skip access control — users should only see posts they have permission to view (filter by privacy settings in the query).
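The inverted index and tokenization pipeline described above reduce to a small sketch (lowercase, drop stop words, map token to post IDs). Stemming and entity extraction are omitted, and the stop-word list is a tiny illustrative sample; Elasticsearch does all of this for real.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "to", "and"}  # sample list only

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]

class InvertedIndex:
    """Maps token -> set of post IDs; AND semantics across query terms."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, post_id, text):
        for tok in tokenize(text):
            self.postings[tok].add(post_id)

    def search(self, query):
        toks = tokenize(query)
        if not toks:
            return set()
        result = set(self.postings[toks[0]])
        for t in toks[1:]:
            result &= self.postings[t]  # intersect posting lists
        return result
```

Ranking (BM25, recency, social signals) would then reorder the matched set; the index only answers "which posts match".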
Design E-Commerce Flash Sales
Difficulty: Hard. Category: E-Commerce.
Design an e-commerce flash sale platform handling 10M concurrent users with atomic inventory management via Valkey Lua scripts, a coupon system supporting limited pools (500K codes) with LPOP + SET NX one-per-user enforcement, four coupon types with stacking rules, virtual queue traffic shaping, checkout saga orchestration via Temporal with compensating transactions, and hot-key mitigation through 16-slot inventory sharding.
Key Topics
- Virtual Queue (Traffic Shaping)
- Atomic Inventory (Valkey Lua)
- Coupon Pool (LPOP + SET NX)
- Coupon Types & Stacking
- Checkout Saga (Temporal)
- Hot-Key Mitigation
Hints for Design E-Commerce Flash Sales
- The #1 invariant: inventory can never go negative. 10M users hitting 'Buy Now' on 100K items simultaneously. A naive SELECT-then-UPDATE creates a race window where 5x the inventory gets sold. The fix: a Valkey Lua script that atomically checks and decrements in a single operation. DECRBY returns the new value; if it drops below zero, INCRBY rolls it back and the request is rejected. Zero network round-trips between check and decrement.
- Coupon pool management with two-layer defense. Layer 1 (Valkey, fast path): LPOP pulls the next code from a pre-populated list, SET NX on coupon_claim:{campaign}:{user} enforces one-per-user atomically. Layer 2 (PostgreSQL, durable path): UNIQUE constraint on (campaign_id, user_id) catches any edge case that slips past Valkey (e.g., Valkey restart between claim and DB write). Both layers must agree before the coupon is confirmed.
- Virtual queue using a Valkey sorted set. At T-0, 10M users call ZADD queue:{sale_id} {timestamp} {user_id}. An admission controller reads the queue in batches (ZRANGEBYSCORE), issues short-lived access tokens, and advances the cursor. Adaptive batch sizing: start at 100 users/batch, measure backend latency, increase batch size if p99 stays under threshold. WebSocket pushes queue position updates every 5 seconds.
- Coupon types and stacking rules engine. Four types: percent_off (capped at max_discount), fixed_amount, bogo (buy-one-get-one), free_shipping. Stacking rules: at most one percent_off + one fixed_amount per cart. free_shipping stacks with everything. Apply in priority order: percent_off first (on original price), then fixed_amount, then free_shipping. Final price must never drop below a minimum (e.g., $0.01). Validate the full stack atomically before checkout.
- Checkout saga orchestrated by Temporal. Five steps: (1) reserve inventory (Lua DECRBY), (2) apply coupon (SET NX + DB insert), (3) process payment, (4) confirm order (PostgreSQL INSERT), (5) emit events (Kafka). If payment fails at step 3, compensating transactions fire in reverse: return coupon to pool (RPUSH code, DEL claim key, SREM from claimed set), release inventory (INCRBY). Temporal guarantees exactly-once execution of the full saga.
- Hot-key mitigation for doorbuster SKUs. A single popular item concentrates all traffic on one Valkey shard. Solution: shard inventory across 16 virtual slots — inv:{sale_id}:{sku_id}:slot:{hash(user_id) mod 16}. Each slot holds total_inventory / 16 units. A wrapper Lua script tries the user's assigned slot first; if empty, scans adjacent slots. This spreads 150K ops/sec across multiple shards instead of hammering one.
- Reservation timeout (FR-17): if checkout doesn't complete within 10 minutes, a background job reclaims the reserved inventory. It queries orders WHERE status = 'inventory_reserved' AND created_at < now() - interval '10 minutes', releases inventory via INCRBY, and marks the order as 'expired'. Without this, abandoned checkouts permanently leak inventory.
- Bot prevention with 7-layer defense: (1) CDN rate limiting per IP, (2) geo-blocking of known proxy ranges, (3) WAF rules for automated patterns, (4) CAPTCHA at queue entry, (5) proof-of-work challenge (client computes a hash), (6) device fingerprinting to detect multi-account abuse, (7) behavioral analysis scoring (0-100 scale) that flags suspicious velocity, identical cart patterns, and headless browser signatures.
- Scale math: 10M concurrent users → 500K CDN req/sec at T-0 (static sale page). Queue admits users in adaptive batches. Backend sustains 833 checkouts/sec (50K orders/min / 60). Each checkout: 3-4 Valkey ops (inventory DECR + coupon SET NX + claim tracking) + 1 PostgreSQL INSERT + 1 payment API call. Valkey cluster: 3+3 nodes handling 25K ops/sec. PostgreSQL: db.r6g.2xlarge at 833 writes/sec, 5K IOPS.
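The #1 invariant (inventory never goes negative) can be demonstrated in Python. A lock stands in for the single-threaded execution that Valkey gives a Lua script; the decrement-then-rollback shape mirrors the DECRBY/INCRBY pattern described in the first hint.

```python
import threading

class AtomicInventory:
    """Python stand-in for the Valkey Lua script: decrement, and if the
    result goes negative, roll back and reject. The lock plays the role
    of Lua's atomic execution on the server, closing the race window a
    naive SELECT-then-UPDATE leaves open."""

    def __init__(self, stock):
        self.stock = stock
        self.lock = threading.Lock()

    def reserve(self, qty=1):
        with self.lock:
            self.stock -= qty          # DECRBY
            if self.stock < 0:
                self.stock += qty      # INCRBY rollback: oversold, reject
                return False
            return True
```

With 3 units of stock and 5 buyers, exactly 3 reservations succeed no matter how the calls interleave; that is the whole invariant.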
Design Google Docs
Difficulty: Hard. Category: Collaboration.
Design a real-time collaborative document editor like Google Docs. Multiple users can edit the same document simultaneously, see each other's cursors, and all changes are synced in real-time without conflicts.
Key Topics
- Operational Transform
- CRDT
- Real-Time Sync
- Cursor Presence
- Version History
- Conflict Resolution
Hints for Design Google Docs
- The core challenge is concurrent editing without conflicts. Two main approaches: Operational Transform (OT) — transforms operations against each other to converge. CRDTs (Conflict-free Replicated Data Types) — data structures that merge automatically. Google Docs uses OT. Figma uses CRDTs.
- OT transforms operations: if User A inserts 'X' at position 5 and User B deletes character at position 3, OT adjusts A's insert to position 4 (since B's delete shifted everything). Each operation has a revision number. The server is the single source of truth for operation ordering.
- Use WebSocket for real-time sync. The server acts as a central relay and transformer — clients send operations, server transforms against concurrent ops, broadcasts the result. Latency budget: operations should appear on other users' screens within 100-300ms.
- Cursor presence: show each collaborator's cursor position and selection in real-time. Send presence updates (cursor position, user info, color) on a separate lightweight channel. Throttle to ~5 updates/sec per user to avoid flooding.
- Store the document as a sequence of operations (event sourcing). Periodically snapshot the full document state for fast loading (every 100-500 operations). Loading a document = fetch latest snapshot + replay operations since snapshot.
- Scale math: a popular document might have 50 concurrent editors producing 10 ops/sec each = 500 ops/sec per document. Across millions of documents, total throughput is high but per-document load is manageable. The bottleneck is the operation log per document.
- Bottleneck: the OT server for a single popular document. All operations must be serialized through one server to maintain consistent ordering. Solutions: assign one server per document (partition by docId), use in-memory operation buffer with periodic persistence.
- What breaks: network partitions — a user goes offline, types extensively, then reconnects with 500 buffered operations. All must be transformed against concurrent operations from other users. Solution: client-side OT that transforms local operations against incoming server operations, then sends the transformed batch on reconnect.
- Don't use: last-writer-wins (destroys concurrent edits). Don't use locking (blocks other users). Don't send full document state on every edit (bandwidth explosion). Don't use REST for real-time sync — WebSocket is essential. Don't implement OT from scratch in production — use a proven library like ShareDB or Yjs (CRDT).
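The position-shifting logic of OT from the second hint can be sketched for single-character operations. This is a minimal illustration, not a production transform function (real OT must also handle ranges, tie-breaks by site ID for concurrent inserts at the same position, and transform both operations symmetrically).

```python
def transform(op, other):
    """Adjust op as if the concurrent operation `other` had been applied
    first. Ops are ('ins' | 'del', position) on a shared text buffer."""
    kind, pos = op
    other_kind, other_pos = other
    if other_kind == "ins":
        if other_pos <= pos:
            pos += 1   # an insert before us shifts us right
    else:  # 'del'
        if other_pos < pos:
            pos -= 1   # a delete before us shifts us left
    return (kind, pos)
```

This reproduces the hint's example: A's insert at position 5, transformed against B's delete at position 3, lands at position 4, so both replicas converge to the same document.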
Design Instagram
Difficulty: Hard. Category: Social & Media.
Design a photo and video sharing social network like Instagram. Users can upload photos/videos, follow others, browse a personalized feed, view stories, and explore trending content. Handle billions of photos and millions of concurrent users.
Key Topics
- Photo Upload Pipeline
- CDN
- News Feed
- Fan-Out
- Stories
- Explore/Recommendation
Hints for Design Instagram
- Photo upload pipeline: client uploads to a media service → store original in object storage (S3) → async worker generates multiple resolutions (150px thumbnail, 640px feed, 1080px full) → serve via CDN. Process upload + resizing in <30 seconds for good UX.
- Feed is a hybrid fan-out system. For regular users (below a follower threshold, e.g. 10K): fan-out-on-write — pre-compute feeds in Redis on post. For celebrities (above the threshold): fan-out-on-read — merge their posts into feeds at query time. This is the Facebook/Instagram architecture.
- Stories are ephemeral (24h TTL). Store separately in a fast-access store with automatic expiry (Redis with TTL or Cassandra with TTL). Pre-generate story trays (list of stories from followed users) per user. A user opening the app should see story trays load in <200ms.
- Explore/Discover uses a recommendation engine: collaborative filtering (users who liked X also liked Y) + content-based (image embeddings via CNN, hashtags, location) + engagement signals (like rate, save rate, share rate in first hour). Ranked by a lightweight ML model.
- Handle high write throughput for social actions: likes, comments, and follower counts don't need real-time accuracy. Process asynchronously via Kafka. Show approximate counts (eventual consistency). A like counter that's 2 seconds behind is fine.
- Scale math: 2B MAU, 500M DAU, 100M photos uploaded/day. Average upload ≈ 2MB original; with the generated resolutions, budget ~8MB of storage per photo. 100M × 8MB = 800TB/day new storage. After a year: ~300PB. CDN serves >95% of image reads. Without CDN, the system is impossible.
- Bottleneck: the feed generation service during peak hours. When 500M users check their feed 5-10x/day, that's 2.5-5B feed reads/day. Each feed read fetches from Redis (pre-computed) + merges celebrity posts (real-time). Redis latency must be <5ms.
- What breaks: image processing backlog. If the transcoding workers can't keep up, uploaded photos don't appear for minutes. Fix: auto-scale transcoding workers based on queue depth. Show a 'processing' placeholder immediately, swap to the real image when ready.
- Don't use: a relational database for image storage (use S3/GCS). Don't serve images directly from your servers (use CDN). Don't fan-out-on-write for celebrities — a Cristiano Ronaldo post would trigger 600M+ cache writes, taking minutes and overwhelming the fan-out service.
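The stories hint above can be sketched in a few lines. This is a toy stand-in, assuming a plain dict with lazy expiry in place of Redis SETEX / Cassandra TTL; `StoryStore` and the injected clock are illustrative names for the sketch, not a real Instagram API.

```python
import time

STORY_TTL = 24 * 3600  # seconds; stories expire 24h after posting

class StoryStore:
    """In-memory stand-in for a TTL store (Redis SETEX / Cassandra TTL)."""
    def __init__(self, clock=time.time):
        self.clock = clock
        self.stories = {}  # author_id -> list of (story_id, posted_at)

    def post(self, author_id, story_id):
        self.stories.setdefault(author_id, []).append((story_id, self.clock()))

    def tray(self, followed_ids):
        """Build a story tray: unexpired stories from followed users."""
        now = self.clock()
        tray = {}
        for author in followed_ids:
            live = [(sid, ts) for sid, ts in self.stories.get(author, ())
                    if now - ts < STORY_TTL]
            self.stories[author] = live  # lazy expiry, like a TTL key
            if live:
                tray[author] = [sid for sid, _ in live]
        return tray
```

In production the tray would be pre-generated per user and cached, per the <200ms load target above.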
Design a Job Scheduler
Difficulty: Medium. Category: Infrastructure.
Design a distributed job scheduler that can execute millions of scheduled tasks reliably. Support cron schedules, one-time jobs, job dependencies (DAGs), retries with backoff, and priority-based execution.
Key Topics
- Cron Scheduling
- Task Queue
- Exactly-Once Execution
- Worker Pool
- Failure Recovery
- Priority Queue
Hints for Design a Job Scheduler
- Separate the scheduler (decides WHEN to run) from the executor (decides HOW to run). The scheduler polls for jobs where next_run <= now and enqueues them into a task queue (Kafka, SQS, or Redis). Workers pull from the queue and execute. This decoupling allows independent scaling.
- For cron-like schedules, store the next execution time in the database. A scheduler process polls every second: SELECT jobs WHERE next_run <= NOW() AND status = 'scheduled' LIMIT 1000. After enqueuing, update next_run to the next cron occurrence. Use batch operations to handle thousands of jobs/sec.
- Exactly-once execution is the hardest requirement. Use distributed locks (Redis SETNX with TTL, or database row-level locks). When a worker claims a job, it acquires a lock. Only the lock holder can execute. TTL prevents deadlocks if the worker crashes mid-execution.
- Handle worker failures: workers send heartbeats every 10-30 seconds. If a worker stops heartbeating, the scheduler marks its in-progress jobs as 'failed' and makes them eligible for retry. Use exponential backoff (1m, 5m, 30m, 2h) with a max retry count (e.g., 5 attempts).
- Job dependencies (DAGs): Job B runs only after Job A succeeds. Model as a directed acyclic graph. Use topological ordering. The scheduler tracks execution state per DAG run. When Job A completes, check if all of Job B's dependencies are satisfied before enqueuing B.
- Scale math: 10M scheduled jobs, 1M execute per hour = ~278 executions/sec. Each execution takes 1-60 seconds. At 30 seconds average, you need ~8,300 worker slots to keep up. Auto-scale workers based on queue depth — if queue grows, add workers.
- Bottleneck: the scheduler polling loop. If 10M jobs need next_run checks, the database query becomes slow. Solution: partition jobs by next_run time bucket (minute-level buckets). Only query the current bucket. Index on (status, next_run). Use a separate 'delayed job' queue for far-future jobs.
- What breaks: clock skew across scheduler nodes. If two scheduler instances have different clocks, jobs might be enqueued twice or missed entirely. Solution: use a single leader scheduler (elected via ZooKeeper/etcd), or use database-level SELECT FOR UPDATE SKIP LOCKED to ensure each job is picked by exactly one scheduler.
- Don't use: in-memory job storage (lost on restart). Don't use sleep-based scheduling (inaccurate and wasteful). Don't allow unbounded retries (a permanently failing job retrying forever wastes resources). Don't skip dead-letter queues — failed jobs after max retries should be moved to a DLQ for manual inspection.
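The exactly-once hint above (Redis SETNX with TTL) can be sketched as follows. This is an in-memory stand-in; `TTLLockTable` and the injected clock are assumptions for the sketch, and production code would issue `SET key owner NX EX ttl` against Redis instead.

```python
import time

class TTLLockTable:
    """In-memory stand-in for Redis `SET key owner NX EX ttl`."""
    def __init__(self, clock=time.time):
        self.clock = clock
        self.locks = {}  # job_id -> (owner, expires_at)

    def acquire(self, job_id, owner, ttl):
        now = self.clock()
        holder = self.locks.get(job_id)
        if holder and holder[1] > now:
            return False  # a live lock is already held
        self.locks[job_id] = (owner, now + ttl)  # claim (or reclaim expired)
        return True

    def release(self, job_id, owner):
        # Only the current owner may release: prevents a crashed worker's
        # stale release from freeing a lock another worker re-acquired.
        holder = self.locks.get(job_id)
        if holder and holder[0] == owner:
            del self.locks[job_id]
```

The TTL is what prevents deadlock: if the holder crashes mid-execution, the lock simply expires and another worker can claim the job.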
Design LeetCode
Difficulty: Hard. Category: Platform.
Design an online coding platform like LeetCode. Users can browse problems, write code in multiple languages, submit solutions, and see if they pass all test cases. Support contests with real-time leaderboards.
Key Topics
- Code Execution Sandbox
- Container Isolation
- Test Case Runner
- Queue-Based Processing
- Leaderboard
- Anti-Cheat
Hints for Design LeetCode
- The hardest part is safe code execution. Run user code in sandboxed containers (Docker with gVisor or Firecracker) with strict resource limits: CPU (2 cores max), memory (256MB max), time (10 seconds), no network access, no filesystem writes outside /tmp. A malicious fork bomb or infinite loop must be killed cleanly.
- Use a job queue (Kafka or SQS) between the submission API and execution workers. Flow: user submits → API validates and enqueues → worker pulls job → runs code in sandbox → reports result → API notifies user via WebSocket. Decoupling prevents a burst of submissions from crashing the execution tier.
- Pre-compile test cases and expected outputs. Run user code against them sequentially, stopping at first failure for efficiency. Show which test case failed, the expected vs actual output, and execution time. Hidden test cases prevent hardcoding answers.
- Support multiple languages with language-specific Docker images (Python 3.11, Java 17, C++20, Go, Rust, etc.) pre-installed. Warm up containers — pulling and starting a cold Docker image takes 5-10 seconds. Keep a pool of warm containers per language ready to accept submissions.
- Contest leaderboard: use a Redis ZSET. A ZSET stores one numeric score per member, so encode the composite ordering (problems_solved DESC, total_time ASC) into a single score, e.g. problems_solved × large_constant − total_time. Update atomically on each accepted submission. Show real-time rank to users via WebSocket push. For large contests (10K+ users), paginate the leaderboard.
- Scale math: 1M submissions/day = ~12 submissions/sec average. During a contest: 50K users × 5 problems × 3 attempts = 750K submissions in 2 hours = ~100 submissions/sec peak. Each execution takes 2-10 seconds. You need 200-1000 execution workers during contests.
- Bottleneck: execution workers during contests. All 50K users submit simultaneously. Solution: auto-scale workers based on queue depth, prioritize contest submissions over practice submissions, and set a maximum queue wait time (show 'queue position' to the user).
- What breaks: users can try to escape the sandbox (container breakout), consume excessive resources (memory bombs, fork bombs), or read test case files. Use gVisor or Firecracker for kernel-level isolation. Mount test cases as read-only. Kill any process exceeding resource limits instantly.
- Don't use: exec() or eval() directly on the server to run user code — it's a security nightmare. Don't give containers network access (users could make outbound API calls to fetch answers). Don't trust execution time on shared machines for competitive ranking — normalize by CPU benchmark or use dedicated instances for contests.
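The contest-leaderboard hint can be made concrete: a ZSET holds one float score per member, so the composite ordering (problems_solved DESC, total_time ASC) has to be packed into a single number. A minimal sketch, assuming a bounded penalty time; the constant and function names are illustrative.

```python
MAX_TIME = 10**7  # assumed upper bound on total penalty time in seconds

def leaderboard_score(problems_solved, total_time_seconds):
    """Encode (problems_solved DESC, total_time ASC) as one number,
    usable directly as a Redis ZSET score (higher = better rank)."""
    assert 0 <= total_time_seconds < MAX_TIME
    return problems_solved * MAX_TIME - total_time_seconds

def rank(entries):
    """entries: list of (user, solved, total_time). Returns users best-first.
    Stands in for ZREVRANGE on the contest ZSET."""
    return [u for u, s, t in
            sorted(entries, key=lambda e: -leaderboard_score(e[1], e[2]))]
```

Because more problems always dominates (one extra solve outweighs any time penalty below MAX_TIME), a single ZADD per accepted submission keeps the ranking correct.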
Design a Local Delivery Service
Difficulty: Hard. Category: Real-Time Systems.
Design a local delivery service like DoorDash or Instacart. Users can order from local restaurants/stores, and nearby couriers pick up and deliver the orders. Support real-time tracking, ETA estimation, and driver matching.
Key Topics
- Geospatial Indexing
- ETA Estimation
- Order Matching
- Real-Time Tracking
- Surge Pricing
- Dispatch Optimization
Hints for Design a Local Delivery Service
- Use geohashing or quadtrees to efficiently find nearby drivers/couriers and restaurants. Geohash cells provide O(1) lookup for 'find all drivers in this area'. Store driver locations in Redis, partitioned by geohash prefix.
- Separate into microservices: order service (CRUD, status), dispatch service (matching), tracking service (live location), payment service (charges), and restaurant service (menus, prep time). The dispatch service is the brain.
- Driver location updates stream at high frequency (every 3-5 seconds). Store in Redis (in-memory, fast writes). Don't persist every update to a database — batch-write to a time-series store every 30s for historical analytics.
- The dispatch/matching algorithm optimizes for: minimize delivery time (not just distance — road routing matters), driver proximity to restaurant, order batching (two orders at same restaurant → one driver), and driver earnings fairness (don't starve idle drivers).
- ETA estimation needs: real-time traffic data, historical delivery times for this restaurant, restaurant prep time (varies by order complexity and current queue), and driver travel time. ML models trained on historical data outperform simple distance/speed calculations by 30-40%.
- Scale math: 10M orders/day peak, 1M concurrent drivers sending location every 4s = 250K location writes/sec. During dinner rush (6-9pm), order volume spikes 5-10x. The dispatch service must match an order in <30 seconds.
- Bottleneck: the dispatch service during peak hours. Thousands of new orders per second, each needs a geospatial query + ETA calculation + driver selection. Pre-compute a driver availability heat map (refreshed every 5s) for fast initial matching.
- What breaks: restaurant prep time estimates are wrong — driver arrives but order isn't ready, or order is ready but no driver is nearby. Fix: real-time prep time updates from the restaurant POS system. If a driver is assigned too early, they waste time waiting. If too late, food gets cold.
- Don't use: a relational DB for live driver locations (too slow). Don't match purely by distance (a driver 1km away across a river is worse than one 2km away on the same road). Don't use a single dispatch algorithm for all markets — dense urban areas need different optimization than suburban areas.
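The geohash hint can be illustrated with a standard base32 geohash encoder, interleaving longitude and latitude bits. Nearby points share a common prefix, which is what makes "find all drivers in this cell" a prefix lookup.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a,i,l,o)

def geohash_encode(lat, lon, precision=11):
    """Standard geohash: binary-search each axis, interleave bits
    (longitude first), pack every 5 bits into one base32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True  # even-indexed bits refine longitude, odd refine latitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = n * 2 + b
        chars.append(BASE32[n])
    return "".join(chars)
```

Truncating a geohash widens the cell, so driver keys stored under their geohash can be scanned by prefix (e.g. all drivers matching `u4pru*`).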
Design a Metrics Monitoring System
Difficulty: Medium. Category: Infrastructure.
Design a metrics monitoring and alerting system like Datadog or Prometheus. Collect metrics from thousands of servers, store time-series data, display real-time dashboards, and fire alerts when thresholds are breached.
Key Topics
- Time-Series Database
- Data Collection Agent
- Alerting Engine
- Dashboard
- Downsampling
- Push vs Pull Model
Hints for Design a Metrics Monitoring System
- Lightweight agents on each server push metrics (CPU, memory, disk, network, request latency, error rate, custom app metrics) to a collector service. Use UDP for fire-and-forget high-throughput collection (like StatsD). TCP for guaranteed delivery of critical metrics.
- Store metrics in a time-series database (Prometheus for pull-based, InfluxDB or TimescaleDB for push-based). Schema: (metric_name, tags/labels, timestamp, value). Tags enable filtering: host=web-01, service=api, region=us-east. Choose a TSDB that supports tag-based queries efficiently.
- Downsample old data aggressively: 1-second resolution for the last 2 hours, 1-minute for last 7 days, 5-minute for last 30 days, 1-hour for last year. This reduces storage by 100-1000x. A metric producing 1 data point/sec = 2.6M points/month raw, but only ~8,700 after downsampling to 5-min.
- Alerting engine evaluates rules against incoming data: 'error_rate{service=api} > 5% for 3 minutes'. Use a sliding window evaluation. Support severity levels (warning, critical, page). Route alerts to Slack (warning), PagerDuty (critical). Implement alert grouping and dedup — don't send 500 alerts when 500 servers have the same issue.
- Pull vs Push: Prometheus uses pull (scrapes HTTP endpoints on targets every 15-30s). Datadog uses push (agents send data). Pull is simpler for service discovery and prevents overloading the collector. Push handles ephemeral containers (Lambda, Kubernetes pods) better since they may die before being scraped.
- Scale math: 10K servers, each producing 200 metrics at 10-second intervals = 200K data points/sec ingestion. Each data point = ~50 bytes = ~10MB/sec. After downsampling, long-term storage is manageable. Query load: 1,000 dashboard refreshes/min, each querying 10-50 metrics.
- Bottleneck: query performance for complex dashboards. A dashboard with 20 panels, each querying a metric across 500 hosts for the last 24 hours = 20 × 500 × 8,640 data points = 86M data points to scan. Solution: pre-aggregate common queries, use rollup tables, and cache dashboard results for 10-30 seconds.
- What breaks: alert fatigue. Too many alerts = engineers ignore them all. Solution: proper alert tuning (avoid noisy thresholds), alert correlation (group related alerts into incidents), auto-remediation for known issues (restart service on OOM), and alert-on-alert-absence (detect if monitoring itself fails).
- Don't use: a relational database for time-series storage (write throughput too low, queries too slow for time-range scans). Don't skip downsampling (storage costs explode). Don't collect every possible metric (focus on the Four Golden Signals: latency, traffic, errors, saturation). Don't alert on a single data point — always use windows to avoid flapping.
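The "for 3 minutes" alert rule above can be sketched as a duration-gated threshold: a minimal evaluator that only fires once a breach has persisted for the whole hold window, so a single bad data point never pages anyone. Names are illustrative.

```python
class ThresholdAlert:
    """Fires only after value > threshold has held for a full window,
    e.g. 'error_rate > 5% for 3 minutes'. One-point spikes never fire."""
    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold = hold_seconds
        self.breach_start = None  # timestamp when the current breach began

    def observe(self, ts, value):
        if value > self.threshold:
            if self.breach_start is None:
                self.breach_start = ts  # breach begins now
            return ts - self.breach_start >= self.hold
        self.breach_start = None  # recovered; any new breach starts over
        return False
```

A real engine would evaluate many such rules against the incoming stream and hand fired alerts to grouping/dedup before routing to Slack or PagerDuty.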
Design a Multi-Tenant SaaS Platform
Difficulty: Hard. Category: Platform.
Design a multi-tenant SaaS platform where multiple organizations share the same infrastructure but have isolated data and customizable configurations. Consider tenant isolation strategies, authentication/authorization, billing, and how to handle tenants of vastly different sizes.
Key Topics
- Multi-Tenancy
- Data Isolation
- RBAC
- Horizontal Scaling
- Tenant-Aware Routing
Hints for Design a Multi-Tenant SaaS Platform
- Three isolation models (pick based on compliance needs and cost): shared database with row-level tenant_id column (cheapest, weakest isolation), schema-per-tenant (moderate cost, good isolation), database-per-tenant (most expensive, strongest isolation — required for healthcare/finance). Most B2B SaaS starts with shared DB, offers dedicated DB for enterprise tier.
- Tenant-aware middleware: extract tenant context from subdomain (acme.app.com), JWT claim, or API key. Inject tenant_id into every database query automatically. Critical security rule: NEVER serve data without a tenant_id filter. A missing WHERE tenant_id = X clause exposes all tenants' data.
- Noisy neighbor problem: one tenant's heavy query slows down everyone. Solutions: per-tenant rate limiting (100 API calls/min), per-tenant resource quotas (max 10% of DB connections), separate connection pools per tier (enterprise tenants get dedicated DB resources), and tenant-based sharding for large customers.
- RBAC (Role-Based Access Control): hierarchical permission model — Organization → Workspace → Team → Role → Permission. Store in a dedicated auth service (or use something like Oso/Casbin). Common roles: Owner, Admin, Member, Viewer. Support custom roles for enterprise clients.
- Tenant onboarding/offboarding: automate everything. Provisioning a new tenant should take <30 seconds (create org record, seed default data, generate API keys). Offboarding: soft-delete with 30-day recovery window, then hard-delete all data (GDPR compliance).
- Scale math: 10K tenants, top 1% generate 50% of traffic (power law). Average tenant: 100 users, 1K API calls/day. Top tenant: 50K users, 5M API calls/day. Total: ~60M API calls/day = ~700 TPS average, ~3,500 TPS peak. The top 100 tenants need special attention.
- Bottleneck: shared database at scale. When all tenants share one PostgreSQL instance, the top tenant's complex queries block everyone. Solution: shard by tenant_id. Small tenants share shards (multi-tenant shard). Large tenants get dedicated shards. Use a routing layer to direct queries to the right shard.
- What breaks: schema migrations. In a shared DB, migrating the schema affects ALL tenants simultaneously. A failed migration takes everyone down. Solution: rolling migrations (add column → backfill → deploy code → remove old column). For schema-per-tenant: automate migration across all schemas with rollback capability.
- Don't use: a single-tenant architecture that you 'multi-tenantify' later (extremely painful retrofit). Don't skip tenant_id in every table (you'll have data leaks). Don't give all tenants the same resource limits (the biggest tenant will degrade experience for the smallest). Don't hard-code tenant configuration — use a feature flag service for per-tenant customization.
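The "NEVER serve data without a tenant_id filter" rule can be enforced mechanically. A toy sketch, assuming rows as plain dicts; in a real system this guard lives in the ORM layer or query middleware, not in application code.

```python
class TenantScopedDB:
    """Wrapper that refuses any query lacking tenant context and applies
    the tenant_id filter itself, closing the 'missing WHERE tenant_id'
    class of data leaks."""
    def __init__(self, rows):
        self.rows = rows  # every row dict must carry a 'tenant_id'

    def query(self, tenant_id, **filters):
        if tenant_id is None:
            raise ValueError("refusing to run a query without tenant context")
        return [r for r in self.rows
                if r["tenant_id"] == tenant_id
                and all(r.get(k) == v for k, v in filters.items())]
```

The point of the design: application code physically cannot express a cross-tenant read, so a forgotten filter becomes an exception instead of a breach.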
Design a News Aggregator
Difficulty: Medium. Category: Data Processing.
Design a news aggregator like Google News. The system crawls 100K RSS sources, ingests 5M articles per day, deduplicates near-identical stories using MinHash + LSH, ranks articles using exponential decay scoring, detects breaking news via stream processing, and serves personalized feeds to 50M daily users at 10K requests/sec.
Key Topics
- Adaptive Polling (Valkey Sorted Set)
- MinHash + LSH Deduplication
- Exponential Decay Ranking
- Breaking News Detection (Flink)
- Personalized Feed Assembly
- Fan-out Architecture
Hints for Design a News Aggregator
- Adaptive polling with a Valkey sorted set priority queue. Each source gets a next_poll_time score. Workers pop the lowest-score source (ZPOPMIN), fetch its RSS feed, and re-insert with a new score based on how many new articles were found. Active sources (CNN, BBC) poll every 5 minutes; dormant blogs stretch to 6 hours. This eliminates 80% of wasted polls compared to fixed-interval crawling.
- Near-duplicate detection using MinHash + Locality-Sensitive Hashing (LSH). Each article is shingled (5-character n-grams), then 128 MinHash functions produce a 512-byte signature. The signature is split into 32 bands of 4 rows each. Articles sharing at least one identical band are candidate duplicates. A final Jaccard similarity check (threshold > 0.4) confirms the match. This runs in O(1) per lookup against Valkey-stored LSH buckets.
- Ranking uses an exponential decay formula with a 6-hour half-life: score = (0.30 * freshness) + (0.25 * source_authority) + (0.20 * engagement) + (0.15 * topic_relevance) + (0.10 * geographic_relevance). Freshness decays exponentially -- an article loses 50% of its freshness score every 6 hours. A batch re-ranking job runs every 15 minutes to recompute scores across the active article set.
- Breaking news detection via Apache Flink. Flink consumes the classified article stream and maintains 5-minute tumbling windows per category. When the article count in a window exceeds 5x the 7-day rolling average for that category AND the absolute count exceeds 10, the system flags it as breaking news. Notifications push to 8M subscribers with rate limits (1 push per topic per hour, 5 per user per day).
- Personalized feed assembly blends global and personal signals: final_score = 0.7 * global_score + 0.3 * interest_boost. The interest_boost comes from a user interest vector stored in Valkey (30-day TTL), built from ClickHouse aggregation of read events. The 70/30 split is deliberate -- aggressive personalization creates filter bubbles. Pre-compute feeds for the top 2% most active users; assemble on-demand for the rest.
- Topic classification uses fastText (not BERT) for speed -- it keeps pace with the full ingest stream (5M articles/day ≈ 58 articles/sec) at < 1ms latency per article. fastText handles the volume at 1/100th the cost of transformer models. Named entity recognition and sentiment analysis run as separate Kafka consumers downstream, enriching articles without blocking the main ingestion pipeline.
- Fan-out architecture with three independent paths branching from Kafka: (1) Write path: Crawler -> Kafka -> Dedup -> Classifier -> Ranker -> PostgreSQL/Valkey/Elasticsearch. (2) Detection path: Kafka -> Flink -> trending topics + notifications. (3) Read path: Client -> Feed Service -> Valkey (cache hit 95%) or PostgreSQL (cache miss fallback). The read path never directly interacts with the write path.
- Scale math: 100K sources at 83 polls/sec sustained. 5M articles/day ingested, dedup reduces to 2M unique. Each unique article ~1.2KB = ~2.4GB/day storage. Valkey: 15GB across 6 shards (feeds, interests, LSH buckets). Feed serving: 10K requests/sec peak, p99 < 200ms from Valkey cache. Kafka: 12 partitions, 3-day retention, ~47GB total storage.
- What to avoid: fixed-interval polling (80% wasted fetches), exact-match dedup (only catches identical copies, misses paraphrased articles -- recall drops to 10%), BERT embeddings for dedup (GPU cost is prohibitive at 5M articles/day), recency-only ranking (promotes low-quality clickbait from fast publishers).
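The MinHash + LSH hint can be sketched at reduced scale (16 hash functions in 4 bands of 4, versus the 128 / 32×4 configuration above); the shingling, signature, and banding logic are the same. `zlib.crc32` with a seed prefix stands in for a proper hash family.

```python
import zlib

NUM_HASHES, BANDS, ROWS = 16, 4, 4  # scaled down from 128 / 32x4 above

def shingles(text, k=5):
    """Character 5-grams, as in the hint."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text):
    """One min-hash per seeded hash function over the shingle set."""
    return [min(zlib.crc32(f"{seed}:{sh}".encode()) for sh in shingles(text))
            for seed in range(NUM_HASHES)]

def lsh_band_keys(sig):
    """Bucket keys: articles sharing any one band are candidate duplicates."""
    return [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES
```

In the pipeline above, band keys index into Valkey-stored buckets; only articles that collide on a band ever pay for the final Jaccard check.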
Design a News Feed / Timeline
Difficulty: Medium. Category: Social & Media.
Design a news feed system like Facebook's Timeline or Twitter's Home Feed. Users should see a personalized feed of posts from people they follow, ranked by relevance. Consider fan-out strategies, caching, ranking algorithms, and handling users who follow millions of accounts.
Key Topics
- Fan-out
- Push vs Pull
- Ranking Algorithm
- Cache Layer
- News Feed Generation
Hints for Design a News Feed / Timeline
- Fan-out-on-write (push model): when a user posts, write the post ID to every follower's feed cache. Fast reads (O(1) feed fetch) but extremely expensive writes for users with millions of followers. Good for users with <10K followers.
- Fan-out-on-read (pull model): compute the feed at read time by fetching recent posts from all followed users, then merge-sort. Cheap writes but slow reads for users following many accounts. Better for celebrity accounts.
- Hybrid approach (what Facebook/Twitter actually use): push for regular users, pull for celebrities (>100K followers). At read time, merge the pre-computed feed with real-time pull results from celebrity accounts. This solves the 'Lady Gaga problem' — one post shouldn't trigger 50M cache writes.
- Cache the pre-computed feed in Redis as a sorted set (score = timestamp or ML rank). Limit to the most recent 500-1000 items. Older items fall off the cache and are fetched from the database on deep scroll.
- Ranking model: engagement signals (likes, comments, shares weighted by recency), relationship strength (close friends > acquaintances), content type (video > photo > text), diversity (don't show 10 posts from the same person), and negative signals (hide, report, unfollow).
- Scale math: 500M DAU, each checks feed 10x/day = 5B feed reads/day = ~58K reads/sec. With push model, each post fans out to average 200 followers = 200 cache writes per post. At 10M posts/day = 2B cache writes/day.
- Bottleneck: the fan-out service for popular users. A celebrity posting triggers millions of async writes. Use Kafka to buffer fan-out tasks, process with a pool of workers. Acceptable for follower feeds to update with 1-5 second delay.
- What breaks: feed consistency — user posts but doesn't see it in their own feed (fan-out hasn't completed). Fix: always inject the user's own recent posts at read time, regardless of fan-out status (read-your-writes consistency).
- Don't use: polling the database for every feed request — it kills DB performance at scale. Don't use a single monolithic feed table — partition by userId. Don't rank purely chronologically — engagement-based ranking increases time-on-site by 2-3x.
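The hybrid read path above (pre-computed feed merged with pulled celebrity posts, plus the reader's own posts for read-your-writes) can be sketched as a k-way merge. Items are (timestamp, post_id) pairs; Redis access and post hydration are omitted, and the function name is illustrative.

```python
import heapq

def read_feed(precomputed, celebrity_posts, own_posts, limit=20):
    """Merge the pushed feed with pull-time celebrity posts and the
    reader's own recent posts, newest first, deduplicated."""
    merged = heapq.merge(
        sorted(precomputed, reverse=True),      # from the Redis feed cache
        sorted(celebrity_posts, reverse=True),  # pulled at read time
        sorted(own_posts, reverse=True),        # read-your-writes injection
        reverse=True,
    )
    seen, out = set(), []
    for ts, post_id in merged:
        if post_id not in seen:
            seen.add(post_id)
            out.append(post_id)
        if len(out) == limit:
            break
    return out
```

Injecting `own_posts` unconditionally is what guarantees a user sees their own post even while fan-out is still in flight.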
Design a Notification System
Difficulty: Hard. Category: Infrastructure.
Design a notification platform that can send 100 million notifications per second across web push (WebSocket/SSE), Android (FCM), and iOS (APNs). Support real-time delivery, scheduled notifications, broadcast to millions of users, and handle failures gracefully with at-least-once delivery semantics.
Key Topics
- WebSocket vs SSE
- Fan-Out
- Message Queue (Kafka)
- Connection Registry
- Scheduled Notifications
- Multi-Channel Delivery
Hints for Design a Notification System
- Use WebSocket for real-time web push with SSE as fallback. WebSocket enables bidirectional communication for inline ACKs -- each delivery confirmation is a tiny frame on the existing connection rather than a separate HTTP round-trip. At 100M notifications/sec, this saves 30M extra HTTP requests/sec for acknowledgments alone.
- Maintain a centralized connection registry in Redis. When a user connects via WebSocket, register the mapping user_id → {server_id, connection_id} in a Redis hash. When a notification fires from any server, look up the user's connection in Redis and relay via Redis Pub/Sub to the exact gateway pod. O(1) lookup, O(1) delivery, zero broadcast waste.
- Use Kafka as the central nervous system. Separate topics per delivery channel (web, FCM, APNs) enable independent scaling. With RF=3 and acks=all, no notification is lost once accepted. Kafka decouples ingestion from delivery -- the system returns 202 Accepted after Kafka persistence, not after actual delivery.
- Fan-out for broadcasts to millions of users requires staged processing. Don't fan out from a single worker — it will OOM. Instead: scan subscriber set → produce micro-batches of 10K users to Kafka → 500+ consumers pick up batches in parallel. Total broadcast to 10M users completes in < 5 seconds.
- Scheduled notifications use Redis sorted sets (fire time as score) partitioned by minute-level buckets, backed by ScyllaDB for durability. A leader-elected scheduler service ticks every second, fires pending notifications into Kafka. A consistency sweep every 5 minutes catches anything Redis missed.
- Scale math: 100M notifications/sec, 500M registered users, 100M concurrent WebSocket connections. Each WS connection needs ~20KB memory = 2TB total. 200 gateway servers at 500K connections each. 500+ Kafka brokers. Every layer must scale independently.
- Bottleneck: WebSocket node selection. An event can originate on any server, but the user's connection lives on one specific gateway pod. Four approaches: sticky hashing (fragile), broadcast to all pods (catastrophically wasteful), centralized registry in Redis (correct — sub-ms lookup), gossip protocol (doesn't scale). The registry approach handles multi-device seamlessly.
- What breaks: stale connection registry entries when pods crash without clean deregistration. Three recovery mechanisms: heartbeat PING/PONG (detects in 30-40s), pod health reaper (detects in 10-30s), Redis key TTL safety net (24h worst case). Pending notifications stored in ScyllaDB, delivered on reconnect.
- Don't use: HTTP polling for real-time notifications (too much latency and server overhead). Don't use a single database for all state (partition by concern: ScyllaDB for device tokens and status, Redis for connection registry and rate limiting, ClickHouse for analytics). Don't skip idempotency keys -- network retries will cause duplicate deliveries without them.
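The connection-registry hint can be sketched with a dict standing in for the Redis hash. It handles multi-device naturally: one user may hold several live connections on different gateway pods, and routing returns every pod that must relay.

```python
class ConnectionRegistry:
    """In-memory stand-in for the Redis mapping
    user_id -> {connection_id: server_id}."""
    def __init__(self):
        self.conns = {}  # user_id -> {connection_id: server_id}

    def register(self, user_id, connection_id, server_id):
        """Called when a WebSocket connects to a gateway pod."""
        self.conns.setdefault(user_id, {})[connection_id] = server_id

    def deregister(self, user_id, connection_id):
        """Called on clean disconnect; TTLs/reapers cover crashes."""
        self.conns.get(user_id, {}).pop(connection_id, None)

    def route(self, user_id):
        """Gateway pods that should relay a notification for this user."""
        return sorted(set(self.conns.get(user_id, {}).values()))
```

In production these entries carry TTLs and are swept by the pod health reaper, per the stale-entry discussion above.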
Design Object Storage
Difficulty: Hard. Category: Storage & Sync.
Design an object storage system like Amazon S3 or Google Cloud Storage. The system must store exabytes of data with 11 nines of durability using erasure coding, support a flat namespace with bucket-based organization, handle multipart uploads for large files, and provide strong read-after-write consistency. Consider lifecycle policies, versioning, and cross-region replication.
Key Topics
- Erasure Coding (10+4)
- Flat Namespace + CRUSH Placement
- Control Plane vs Data Plane
- Strong Consistency (Raft)
- Multipart Upload
- CDN vs Presigned URLs
- Lifecycle Policies
- Immutability
Hints for Design Object Storage
- Object storage is not a filesystem. No directories, no inodes, no POSIX semantics. An object is a key, a value, and headers. The key 'photos/2024/beach.jpg' is an opaque string, not a directory path. This simplicity is deliberate: every constraint removed from the API is a constraint removed from the storage engine. Flat namespace with sorted LSM-tree keys gives near-constant-time lookups and efficient prefix-based listing.
- Immutability is the core design decision. PUT always creates a new version with entirely new shards. The old shards sit until GC cleans them up. This removes distributed locking, makes caching safe (ETags never go stale), and eliminates conflict resolution from replication. If two clients upload the same key, they create independent shards on independent nodes. Raft orders the metadata commits. No data-level conflict.
- Durability target: 11 nines (99.999999999%). Achieved through Reed-Solomon erasure coding (10+4): split a 10MB object into 10 data shards (1MB each) + 4 parity shards = 14 shards, 1.4x overhead. Any 10 of 14 shards can reconstruct the original. This tolerates 4 simultaneous failures while costing $28M/month per EB vs $60M for 3x replication. But EC alone does not guarantee 11 nines. Fast repair, multi-AZ isolation, and failure independence are also required.
- Placement uses CRUSH (Controlled Replication Under Scalable Hashing). Objects hash to placement groups (PGs), and each PG maps to 14 storage nodes via CRUSH using the cluster topology (AZ, rack, node). This is deterministic: any node can compute placement locally with no central lookup. When a node is added or removed, only the affected PGs move data. With 100K PGs across 19,500 nodes, adding one node moves roughly 5 PGs worth of data.
- Separate control plane from data plane. Control plane: Metadata Service (sharded LSM-tree + Raft, ~1.4M partitions), Placement Service (CRUSH map), IAM Engine, Lifecycle/GC. Data plane: Data Router fleet (stateless, handles EC encode/decode) and Storage Nodes (36 disks each, shards stored in 256MB extent files). The Data Router is a logical role. In presigned URL flows, the storage node takes on this role directly.
- Three request paths. CDN path (reads): CDN-signed URL → CDN edge validates signature → cache hit: serve / miss: fetch from storage origin via OAC. CDN never sees internal routing. API path (writes): Client → WAF → API Router → IAM → Metadata (PG lookup) → Data Router (EC encode) → 14 shard writes (quorum 11/14) → Raft commit. Presigned path (direct uploads): Client → Storage Node (validates sig, CRUSH, EC encode, distribute shards, Raft commit). Zero bytes through API fleet.
- Metadata is the hardest problem. ~700 bytes per object × 100T objects = ~70 PB raw. With LSM overhead: ~140 PB across ~1.4M Raft partitions at 100GB each. Metadata stores the PG ID, not node locations. Node placement is derived at read/write time by CRUSH + cluster map. The shard_map in metadata is a cached optimization.
- Multipart upload: each part is independently erasure-coded and durable on upload (not raw bytes waiting). Parts upload in parallel, out of order, from different machines. CompleteMultipartUpload validates ETags, links part shard locations into final object metadata (no byte copy), computes composite ETag, and commits atomically via Raft. Auto-abort cleans up incomplete uploads after 7 days.
- Storage presigned URLs go direct to storage nodes, never through CDN. CDN-signed URLs (separate signing system) go through CDN with caching. These are two completely different systems. For private content with high read traffic, use CDN-signed URLs + Origin Access Control (bucket is private, only CDN can fetch from storage). For uploads and low-traffic downloads, use storage presigned URLs.
- GC is always async. Dead shards get a 24-hour grace period before deletion (protects in-flight reads, clock skew, race conditions). Extent compaction at >30% dead ratio. GC budget: 20% of cluster I/O. Weekly orphan detection via metadata export reconciliation.
- Lifecycle: per-partition scanner runs hourly. Matching objects get re-encoded with target EC (e.g., 16+4 for Glacier at 1.25x overhead) and moved to cheaper storage. Storage classes: Standard ($23/TB/mo) → IA ($12.50) → Glacier Instant ($4) → Glacier Flexible ($3.60) → Deep Archive ($0.99, tape-backed, 12-48h retrieval).
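The 10+4 erasure-coding numbers above can be checked with a short calculation. The loss model assumes independent shard failures within a repair window, an idealized assumption; as the durability hint notes, real 11-nines also depends on repair speed and AZ isolation.

```python
from math import comb

def storage_overhead(data_shards, parity_shards):
    """Raw bytes stored per logical byte: (10+4)/10 = 1.4x."""
    return (data_shards + parity_shards) / data_shards

def object_loss_probability(data_shards, parity_shards, p_shard_loss):
    """P(losing more shards than parity can absorb), i.e. more than
    `parity_shards` of the n shards fail before repair, assuming
    independent failures (an idealized model)."""
    n = data_shards + parity_shards
    return sum(comb(n, k) * p_shard_loss**k * (1 - p_shard_loss)**(n - k)
               for k in range(parity_shards + 1, n + 1))
```

At the same per-shard loss rate, 10+4 EC both costs less than 3x replication (1.4x vs 3x raw bytes) and loses objects less often, which is why it wins for warm data.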
Design an Online Auction
Difficulty: Hard. Category: E-Commerce.
Design an online auction platform like eBay auctions. Users can list items for auction, place bids in real-time, and the highest bidder wins when the timer expires. Handle bid sniping, concurrent bids, and real-time updates.
Key Topics
- Real-Time Bidding
- Concurrency Control
- Timer Service
- Bid Validation
- Notification System
- Fraud Detection
Hints for Design an Online Auction
- The core challenge is concurrent bids. Use optimistic concurrency control: each bid includes the current highest price as a condition — 'place bid of $105 IF current_highest = $100'. If another bid was placed in between, reject and ask the user to retry with the updated price.
- Use WebSocket or SSE to push bid updates to all watchers in real-time. Don't rely on polling — auction final minutes have rapid-fire bids. Each auction is a pub/sub channel. When a new bid is accepted, broadcast to all subscribers within 100ms.
- Auction end time must be precise. Use a distributed timer/scheduler service. Critical: implement anti-sniping — if a bid arrives in the last 30 seconds, extend the auction by 2 minutes. This prevents last-second bids that give no one time to respond.
- CQRS pattern: separate bid processing (strong consistency, single-threaded per auction) from bid display (eventually consistent, cached). Bid writes are serialized per auction_id through a single processor. Bid reads can be served from cache with 100ms staleness.
- Support multiple auction types: English (ascending bids, most common), Dutch (price starts high, decreases until someone bids), sealed-bid (all bids submitted blind, highest wins). Each has different validation logic and winner determination.
- Scale math: 10M active auctions, average 50 watchers each = 500M watch subscriptions (a user watching several auctions shares one WebSocket connection, but fan-out still scales with subscriptions). During auction close: bids arrive in bursts of 10-100/sec for popular items. The bid processor for a single hot auction must handle serialized writes at this rate.
- Bottleneck: the bid processing service. Each auction must process bids serially (to prevent overselling/double-bidding). Solution: partition by auction_id — each auction is handled by exactly one processor (consistent hashing). Processors can handle thousands of auctions each.
- What breaks: the timer service. If the auction close event fires late, bids might be accepted after the auction should have ended. Use a high-precision timer (not cron) with redundancy — two independent timers, use the first one that fires. Clock sync via NTP across servers.
- Don't use: database-level locking for bid processing at scale (too slow). Don't trust client timestamps for bid ordering (use server-side monotonic timestamps). Don't allow bids below the current highest + minimum increment — validate server-side. Don't send push notifications for every bid — batch notifications for watchers.
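The optimistic concurrency check from the first hint can be sketched in a few lines of Python. The lock stands in for the single serialized processor per auction_id; all names are illustrative:

```python
import threading

class AuctionBidProcessor:
    """Minimal sketch of optimistic bidding: each bid carries the highest
    price the client saw; it is rejected as stale if the auction has moved
    on, and the client retries with the fresh price."""

    def __init__(self, starting_price: int, min_increment: int = 1):
        self.highest = starting_price
        self.highest_bidder = None
        self.min_increment = min_increment
        self._lock = threading.Lock()  # stands in for one processor per auction_id

    def place_bid(self, bidder: str, amount: int, expected_highest: int) -> dict:
        with self._lock:
            if expected_highest != self.highest:
                # another bid landed in between: reject, client retries
                return {"ok": False, "reason": "stale", "current": self.highest}
            if amount < self.highest + self.min_increment:
                return {"ok": False, "reason": "too_low", "current": self.highest}
            self.highest, self.highest_bidder = amount, bidder
            return {"ok": True, "current": amount}
```

The same condition ("bid $105 IF current_highest = $100") maps directly to a conditional UPDATE or compare-and-swap in whatever store holds auction state.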
Design a Payment System
Difficulty: Hard. Category: FinTech.
Design a payment processing system like Stripe or Square. It should handle credit card payments, refunds, and payouts reliably. The system must guarantee no double charges, handle partial failures gracefully, and maintain an accurate financial ledger at all times.
Key Topics
- Idempotency
- Double-Entry Ledger
- Saga Pattern
- Exactly-Once Processing
- Reconciliation
Hints for Design a Payment System
- Idempotency is non-negotiable. Every payment request must include a client-generated idempotency key. If a retry arrives with the same key, return the original result. Store idempotency keys with TTL (24-48h) in Redis or the database. Without this, network retries WILL cause double charges.
- Use a double-entry ledger: every transaction creates two entries — a debit from one account and a credit to another. The sum of all entries must always be zero. This makes auditing trivial and catches bugs immediately (if the books don't balance, something is wrong).
- Payment flow: authorize (verify card, hold funds) → capture (actually charge) → settle (move money between banks). Use the Saga pattern with compensating transactions — if capture fails after authorization, issue a void. Never leave money in an ambiguous state.
- Build a reconciliation pipeline that runs daily: compare your internal ledger with external payment provider (Stripe, bank) records. Flag discrepancies automatically. In practice, ~0.1-1% of transactions will have mismatches that need investigation.
- Scale math: 1M transactions/day = ~12 TPS average. But Black Friday peaks at 50-100x normal = ~1,200 TPS burst. Use Kafka as a buffer between the API and the payment processor to smooth out spikes. The payment service must handle bursts gracefully.
- Bottleneck: the ledger database. Every transaction = at least 2 writes (debit + credit) with ACID guarantees. Use PostgreSQL with row-level locking. Partition the ledger by account_id for horizontal scaling. Avoid cross-partition transactions — they're 10x slower.
- What breaks: network timeouts to payment providers. You send a charge request but get no response — did it succeed or fail? NEVER assume. Implement a 'pending' state, then use the provider's query API or webhooks to reconcile. Timeout ≠ failure.
- Don't use: eventually consistent databases for the financial ledger (money MUST be strongly consistent). Don't process payments synchronously in the API request — use async processing with status polling or webhooks. Don't skip audit logging — regulators require immutable transaction records.
- Fraud detection: run async checks in parallel — velocity checks (too many transactions too fast), device fingerprinting, amount anomalies, geolocation mismatches. If fraud is detected post-authorization, reverse before settlement. ML models improve detection over time.
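The idempotency-key mechanic from the first hint, sketched minimally. An in-memory dict stands in for the Redis/database store with TTL, and all names are illustrative:

```python
class PaymentService:
    """Sketch of idempotency-key handling: a retry carrying the same key
    returns the stored result instead of charging again. A real system
    persists keys durably with a 24-48h TTL."""

    def __init__(self, charge_fn):
        self._results = {}        # idempotency_key -> stored result
        self._charge = charge_fn  # the actual provider call

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # replay: no second charge
        result = self._charge(amount_cents)
        self._results[idempotency_key] = result
        return result
```

The key point is that the store is checked before the provider call and written after it, so a network retry can never reach the provider twice with the same key.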
Design a Price Tracking Service
Difficulty: Medium. Category: Data Processing.
Design a price tracking service like CamelCamelCamel or Honey. The system monitors product prices across multiple retailers, stores price history, and alerts users when prices drop below their target.
Key Topics
- Web Scraping
- Scheduler
- Price History
- Alert System
- Time-Series Data
- Rate Limiting
Hints for Design a Price Tracking Service
- Crawl product pages on a schedule using a job scheduler. Respect per-retailer rate limits (politeness). Amazon might allow 1 req/sec; a small shop might only tolerate 1 req/min. Store rate limit configs per domain. Use a priority queue to schedule high-demand products more frequently.
- Store price history as time-series data. TimescaleDB (PostgreSQL extension) or InfluxDB are ideal. Schema: (product_id, retailer, timestamp, price, currency, availability). Partition by product_id and time range. Keep raw data for the last 90 days; downsample older data to daily averages for long-term history.
- Alert system: users set price targets ('notify me when < $X'). When a new price is scraped, check against active alerts for that product. Use Redis for fast lookup: key = product_id, value = sorted set of (target_price, user_id). If new price < any target, trigger notification via push/email.
- Handle anti-scraping defenses: rotate proxy IPs (pool of 1000+ residential proxies), randomize request intervals, use headless browsers (Playwright/Puppeteer) for JS-rendered pages, respect robots.txt. Some retailers expose APIs or affiliate feeds — prefer these when available.
- Price history charts: pre-aggregate data into daily/weekly/monthly buckets for fast chart rendering. Store pre-aggregated data separately (materialized views). A user viewing a 1-year price chart shouldn't pull every raw scrape (~1,500 rows at one check per 6 hours) — use the daily aggregate (365 points) or the weekly aggregate (52 points).
- Scale math: tracking 100M products across 50 retailers. If each product is checked once every 6 hours, that's ~4,600 scrapes/sec sustained. Each scrape returns ~1KB of data = 4.6MB/sec ingestion. Price change events (subset of scrapes) trigger alert checks.
- Bottleneck: the scraping infrastructure. 4,600 reqs/sec distributed across 50 retailers means some retailers see hundreds of req/sec. You'll get IP-banned quickly. Solution: use multiple scraping strategies per retailer (API, RSS, browser scrape) and maintain a healthy proxy rotation pool.
- What breaks: product page structure changes. When Amazon redesigns their product page, all scrapers break simultaneously. Fix: separate scraping logic into per-retailer parsers with schema validation. Alert on parser failures (if extraction fails for >10% of a retailer's products, page the on-call engineer).
- Don't use: a single IP for all scraping (instant ban). Don't scrape every product at the same frequency — high-demand products (with active alerts) get checked every hour, dormant products every 24 hours. Don't block on synchronous scraping — use async HTTP clients and a worker pool.
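The sorted-set alert check from the hints can be sketched with a sorted list per product, mirroring the Redis structure. Names are illustrative, and this sketch fires an alert when the scraped price is at or below a target:

```python
import bisect

class PriceAlerts:
    """Per-product alert targets kept sorted by price (mirroring a Redis
    sorted set). A new scraped price triggers every alert whose target
    is at or above that price; fired alerts are removed (one-shot)."""

    def __init__(self):
        self._targets = {}  # product_id -> sorted list of (target_price, user_id)

    def add_alert(self, product_id: str, target_price: float, user_id: str):
        bisect.insort(self._targets.setdefault(product_id, []),
                      (target_price, user_id))

    def on_new_price(self, product_id: str, price: float) -> list:
        targets = self._targets.get(product_id, [])
        # slice off every entry whose target_price >= the new price
        i = bisect.bisect_left(targets, (price, ""))
        fired = targets[i:]
        del targets[i:]  # one-shot: an alert fires at most once
        return [user for _, user in fired]
```

With Redis this is a ZRANGEBYSCORE from the new price to +inf followed by a ZREMRANGEBYSCORE, keeping the check O(log n + matches) per scrape.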
Design a RAG Platform
Difficulty: Hard. Category: AI & Agents.
Design an enterprise RAG (Retrieval-Augmented Generation) platform that serves 10M queries per day across 500+ engineering teams. The platform ingests 2M+ internal documents from Confluence, GitHub, Slack, Google Docs, and S3, lets engineers ask natural language questions, and returns accurate, cited answers grounded in internal knowledge. Handle hybrid retrieval, multi-strategy chunking, model routing for cost optimization, hallucination mitigation with citation enforcement, and continuous evaluation loops.
Key Topics
- Multi-Strategy Chunking
- Hybrid Retrieval (BM25 + Vector)
- Cross-Encoder Re-Ranking
- Model Routing
- Hallucination Mitigation
- Evaluation Pipeline
Hints for Design a RAG Platform
- A RAG platform has six layers: ingestion (connectors pull docs from Confluence, GitHub, Slack, etc.), chunking & embedding (split docs into chunks, convert to vectors), storage (vector DB + keyword index + metadata store + cache), retrieval (hybrid search + re-ranking + ACL filtering), generation (model routing + prompt assembly + streaming), and evaluation (user feedback + LLM-as-judge). The LLM is the last 20% of the work. Retrieval quality determines answer quality.
- Chunking is the single most impactful decision. No single strategy works for all document types. Use recursive chunking for well-structured docs (split by headings), semantic chunking for long-form prose (split where topic changes using embedding similarity), AST-aware chunking for code (split by function/class boundaries), and thread-level chunking for Slack (keep full thread as one unit). Classify document type first, then route to the right chunker.
- Hybrid retrieval combines BM25 keyword search and dense vector search in parallel. Each has blind spots: vector search misses exact identifiers like 'ErrorCode 4032' (no meaningful semantic embedding), BM25 misses intent like 'how to handle failures gracefully' (keywords don't match 'Retry and Circuit Breaker Patterns'). Merge with Reciprocal Rank Fusion (RRF), then re-rank top-20 with a cross-encoder that reads query-document pairs jointly.
- Model routing is not optional. 60-80% of enterprise queries are simple factual lookups a small model handles in 200ms. Only 5-10% need frontier-model reasoning. Route based on query complexity: simple → Haiku/GPT-4o-mini, moderate → Sonnet/GPT-4o, complex → Opus/GPT-4. This cuts LLM costs by 50-70%. Without routing, you either overpay on every query or underserve complex ones.
- Hallucination mitigation has three tiers based on retrieval confidence. High confidence (>0.6): generate grounded answer with citations. Medium (0.3-0.6): generate but hedge — 'Based on what I found... I'd recommend verifying with [team].' Low (<0.3): abstain entirely — 'I couldn't find reliable information. Here are the closest matches.' Most RAG systems only have answer-or-error. The hedged middle tier prevents the worst hallucinations.
- Access control is non-negotiable. Pre-filter vector search by the user's permission groups using payload indexes. Post-filtering is simpler but risks loading inaccessible document IDs into memory. One leaked internal doc through the knowledge assistant and the platform is dead. Sync permissions from source systems (Confluence spaces, GitHub repos) to the vector store within minutes.
- Evaluation requires both offline and online loops. Offline: golden dataset of 500+ curated Q&A pairs, run on every pipeline change, block deploy on regression. Online: user feedback (thumbs up/down), LLM-as-judge on 5-10% of production traffic scoring correctness/completeness/citations. Feed low-rated answers back into chunking quality audits and prompt tuning. Without evaluation you are flying blind.
- Agentic RAG handles the 30-40% of queries too complex for single-shot retrieval. The agent decomposes multi-hop questions into sub-queries, searches iteratively, evaluates whether context is sufficient, and refines. 'How does payments handle retries, and what changed after Q3 migration?' requires two separate searches across different doc spaces. Cap agent loops: max 3 iterations, 15 tool calls, 50K tokens, 15s wall-clock. Without guardrails, agents burn tokens endlessly.
- Scale math: 10M queries/day = ~115 QPS average, ~500 QPS peak. 2M documents → 10M chunks. Storage: 18-57 GB in Qdrant (depending on quantization), 54 GB in Elasticsearch, 16 GB in PostgreSQL. Per-query latency budget: embedding 5-10ms, vector search 10-50ms, BM25 5-20ms, re-ranking 80-150ms, LLM generation 400-1500ms. Total P50 ~1.0-1.5s, P99 ~3-5s. Semantic cache hits 5-25% of queries, saving the full LLM inference cost.
- Don't use: a single embedding model for both ingestion and queries without versioning — model upgrades silently break retrieval. Don't skip re-ranking — bi-encoder retrieval alone misses 10-15% of relevant results that cross-encoders catch. Don't embed full documents as single vectors — retrieval recall drops 20-40%. Don't use fixed chunking for everything — Slack threads need thread-level, code needs AST-aware, prose needs semantic boundaries.
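Reciprocal Rank Fusion, used in the hybrid-retrieval hint to merge the BM25 and vector result lists, is only a few lines (k=60 is the constant commonly used with RRF; document IDs here are illustrative):

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked result lists (e.g., BM25 and dense vector
    search) by summing 1/(k + rank) per document across lists. Documents
    ranked well by multiple retrievers float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused top-20 then goes to the cross-encoder re-ranker; RRF needs no score normalization, which is why it is preferred over naive score averaging across heterogeneous retrievers.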
Design a Rate Limiter
Difficulty: Easy. Category: Infrastructure.
Design a rate limiting service that protects APIs from abuse. It should support different rate limit rules (per user, per IP, per endpoint), work across multiple servers, and handle high throughput with minimal latency overhead.
Key Topics
- Token Bucket
- Sliding Window
- Redis
- Distributed Systems
- API Gateway
Hints for Design a Rate Limiter
- Token Bucket is the most common algorithm — simple, memory-efficient, allows bursts. Each user gets a bucket that refills at a fixed rate. Sliding Window Log is more precise but uses more memory (stores every request timestamp).
- For distributed rate limiting across multiple servers, use Redis with atomic Lua scripts (INCR + EXPIRE in one call). Without atomicity, race conditions cause over-counting or under-counting.
- Placement matters: API gateway level (NGINX, Kong) handles it before your app code runs — lowest latency. Application middleware gives more flexibility (per-endpoint rules). A dedicated rate limit service (like Envoy's global rate limit service) is best for microservices.
- Return proper HTTP headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and 429 Too Many Requests status code. Clients need these to implement backoff correctly.
- Scale math: if you have 10M users and track per-user counters in Redis, that's ~10M keys × ~100 bytes = ~1GB RAM. Fits in a single Redis instance. For per-endpoint rules, multiply by number of endpoints.
- What breaks: a single Redis instance is a SPOF. Use Redis Cluster or replicate across nodes. If Redis goes down, fail-open (allow all traffic) is usually safer than fail-closed (block all traffic) in production.
- Don't use: a local in-memory counter on each app server — requests from the same user hit different servers, so limits aren't enforced globally. Sticky sessions are a hack, not a solution.
- Advanced: sliding window counter algorithm combines fixed window counters with weighted overlap — nearly as accurate as sliding window log, but uses only 2 counters per window instead of storing every timestamp.
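A minimal token bucket sketch for the first hint. A production version runs the same refill-and-spend logic atomically per user key in a Redis Lua script; the injectable clock here is only for testability:

```python
import time

class TokenBucket:
    """Minimal token bucket: capacity bounds the burst, refill_rate
    enforces the sustained limit. Tokens refill lazily on each call,
    so no background timer is needed."""

    def __init__(self, capacity: float, refill_rate: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity          # start full: allow an initial burst
        self._now = now
        self._last = now()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._now()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self._last) * self.refill_rate)
        self._last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Only two numbers (token count, last refill timestamp) are stored per key, which is what makes the algorithm memory-efficient at millions of users.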
Design Robinhood
Difficulty: Hard. Category: FinTech.
Design a stock trading platform like Robinhood. Users can view real-time stock quotes, place buy/sell orders, manage their portfolio, and see account balances. The system must handle market-hours traffic spikes and maintain financial accuracy.
Key Topics
- Order Matching
- Market Data Feed
- Portfolio Service
- Idempotency
- Regulatory Compliance
- Real-Time Quotes
Hints for Design Robinhood
- Separate the order service (place/cancel orders) from the execution service (route to exchanges like NYSE, NASDAQ). Use an event-driven architecture with Kafka between them. The order service validates and persists; the execution service handles exchange communication.
- Market data streams at very high throughput — thousands of price updates per second for thousands of tickers. Use WebSocket to push real-time quotes to clients. Aggregate and sample: send updates every 100ms (not every tick) for most stocks, and send full tick-level data only for the stocks the user currently has open.
- Order idempotency is critical. A network retry must NOT place a duplicate order. Use unique client-generated order IDs. Before placing an order, check if an order with that ID already exists. Store order IDs with the result for 24h.
- Portfolio balance and positions must be strongly consistent — users must see accurate buying power before placing orders. Use ACID transactions in PostgreSQL for balance changes. Double-entry bookkeeping: every trade debits cash and credits stock (or vice versa).
- Regulatory requirements: store ALL order events immutably for compliance (SEC, FINRA). Use an append-only log (Kafka with infinite retention or S3). Support full order audit trails — when was it placed, modified, filled, canceled, and by whom.
- Scale math: 10M active traders, peak market open generates ~50K orders/sec across all users. Market data: ~50K price updates/sec across all tickers. Portfolio queries: ~100K reads/sec (users refreshing). The market data fan-out to 10M WebSocket connections is the hardest scaling problem.
- Bottleneck: market data fan-out. If 10M users are watching 10 stocks each, and each stock updates 10x/sec, that's 1B WebSocket messages/sec. Solution: topic-based pub/sub — group users by their watchlist, push updates per-ticker channel. Only send updates for stocks the user is watching.
- What breaks: market open thundering herd. At 9:30 AM ET, millions of users simultaneously open the app, place orders, and request quotes. Pre-warm caches, pre-establish WebSocket connections, and use a queue to buffer order bursts. Fail gracefully — show stale prices rather than errors.
- Don't use: eventually consistent databases for account balances (regulatory violation). Don't allow margin trading without real-time balance checks (Robinhood's 2020 'infinite money' glitch). Don't skip circuit breakers — if the exchange API is slow, reject new orders rather than queue indefinitely.
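The topic-based pub/sub fix for the fan-out bottleneck can be sketched as follows. The delivered dict stands in for actual WebSocket sends, and all names are illustrative:

```python
from collections import defaultdict

class TickerFanout:
    """Sketch of topic-based pub/sub for market data: connections
    subscribe to the tickers on their watchlist, and a price update is
    delivered only to connections subscribed to that ticker channel."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # ticker -> set of connection ids
        self.delivered = defaultdict(list)    # connection id -> received updates

    def subscribe(self, conn_id: str, tickers: list):
        for t in tickers:
            self._subscribers[t].add(conn_id)

    def publish(self, ticker: str, price: float):
        # only the connections watching this ticker receive the update
        for conn_id in self._subscribers.get(ticker, ()):
            self.delivered[conn_id].append((ticker, price))  # stands in for a send
```

Grouping by ticker turns "every update to every user" into "every update to its channel's subscribers," which is the only way the 1B-messages/sec naive number becomes tractable.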
Design a Search Autocomplete
Difficulty: Medium. Category: Search & Discovery.
Design a search autocomplete system like Google's search suggestions. As users type, show the top 5-10 relevant suggestions in real-time. The system should handle billions of queries per day, support personalization, and update suggestions based on trending topics.
Key Topics
- Trie Data Structure
- Prefix Matching
- Elasticsearch
- Caching
- Ranking
Hints for Design a Search Autocomplete
- A Trie (prefix tree) is the classic data structure. Each node stores a character; paths from root represent prefixes. Store the top K suggestions at each trie node to avoid traversing the entire subtree on every query. This makes lookups O(length of prefix) instead of O(all matching words).
- Pre-compute top K suggestions for each prefix and cache them in Redis (key = prefix, value = sorted list). Update asynchronously every 15-30 minutes from a batch pipeline that processes search logs. Don't rebuild on every search.
- For personalization, layer user-specific suggestions on top of global results. Keep a small per-user cache (last 50 searches). Blend formula: 70% global trending + 30% personal history. Personal results always rank first.
- Latency is everything: suggestions must appear within 100-200ms. Use CDN edge caching for the top 10K most common prefixes — this handles ~80% of requests without hitting your backend. Only cache misses go to the trie service.
- Scale math: 5B searches/day, average 4 characters typed per query = 20B prefix lookups/day = ~230K lookups/sec. Redis can handle this on a few nodes, but CDN caching reduces backend load by 80%+.
- Bottleneck: updating the trie with fresh data. Don't rebuild on every search query. Collect search logs in Kafka → aggregate with Flink/MapReduce hourly → rebuild the trie offline → swap atomically (blue-green deploy of the trie). Users see updated suggestions within 1 hour.
- What breaks: long-tail prefixes that aren't in the pre-computed cache. For prefixes beyond 3-4 characters, fall back to Elasticsearch prefix + fuzzy queries. This handles typos and rare queries that the trie can't cover.
- Don't use: database LIKE queries for autocomplete — full table scans at scale. Don't trigger a backend call on every keystroke — debounce on client (200-300ms delay) to reduce requests by ~60%. Don't return 50 results — top 5-10 is the sweet spot.
- Content safety: filter offensive, NSFW, and dangerous suggestions. Maintain a blocklist. Suppress suggestions related to ongoing tragedies or sensitive events. Run an async content filter on all suggestion candidates before caching.
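The trie-with-cached-top-K idea from the first hint, as a minimal sketch: each node stores its best K completions at insert time, so suggest() never walks a subtree.

```python
class TrieNode:
    __slots__ = ("children", "top_k")
    def __init__(self):
        self.children = {}
        self.top_k = []  # [(count, query)] best suggestions under this prefix

class AutocompleteTrie:
    """Each node caches its top-K completions, making a lookup
    O(prefix length) with no subtree traversal at query time."""

    def __init__(self, k: int = 5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query: str, count: int):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k.append((count, query))
            node.top_k.sort(reverse=True)  # highest count first
            del node.top_k[self.k:]        # keep only the top K

    def suggest(self, prefix: str) -> list:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [q for _, q in node.top_k]
```

In production this structure is rebuilt offline from aggregated search logs and swapped in atomically, exactly as the update hint describes.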
Design Shazam
Difficulty: Hard. Category: Search & Discovery.
Design an audio recognition system like Shazam. A user holds their phone up to a speaker for 5-10 seconds, the system identifies the song from a catalog of 50M+ tracks, even in noisy environments like bars and concerts. The system must handle 100M+ recognition requests per day with sub-3-second identification latency. Cover audio fingerprinting, spectrogram analysis, noise-robust matching, and large-scale index lookup.
Key Topics
- Audio Fingerprinting
- Spectrogram Analysis
- Locality-Sensitive Hashing
- Noise-Robust Matching
- Large-Scale Index Lookup
- Real-Time Audio Processing
Hints for Design Shazam
- The core algorithm is audio fingerprinting, not ML classification. Convert the audio clip into a spectrogram (time-frequency representation), identify prominent peaks (local maxima in energy), and create fingerprints from pairs of nearby peaks. Each fingerprint encodes: frequency of two peaks + time difference between them. This is extremely compact and noise-resistant. The original paper by Avery Wang (2003) is the foundation.
- Spectrogram generation: apply Short-Time Fourier Transform (STFT) to the audio signal with overlapping windows (e.g., 1024-sample window, 512-sample hop). This gives a time-frequency grid. Each cell represents the energy at a particular frequency during a particular time window. The human-audible range (20Hz-20kHz) is divided into frequency bands.
- Peak picking: find local maxima in the spectrogram — points that have higher energy than their neighbors in both time and frequency dimensions. These peaks are robust to noise because background noise raises the energy floor uniformly but doesn't change where the peaks are relative to each other. Typically extract 10-30 peaks per second of audio.
- Fingerprint construction: pair nearby peaks (within a target zone — e.g., peaks within 1-5 seconds of each other). Each pair creates a hash: hash(freq1, freq2, time_delta). This combinatorial pairing produces hundreds of fingerprints per second of audio. Store each fingerprint with a pointer to (song_id, time_offset). The time offset is critical for alignment verification.
- Matching: the user records 5-10 seconds of audio. Generate fingerprints from the sample. Look up each fingerprint hash in the database. Multiple songs will match individual fingerprints (collisions). The key insight: correct matches will have consistent time offsets. If fingerprint A matches song X at offset 42s and fingerprint B matches song X at offset 43s, the time difference matches. Plot (song_id, time_offset_difference) and look for clusters. A cluster of 5+ aligned matches is a confident identification.
- Storage and indexing: a catalog of 50M songs, each producing ~100K fingerprints = 5 trillion fingerprint entries. Store as a hash table: fingerprint_hash → list of (song_id, time_offset). Use a distributed key-value store (DynamoDB, Cassandra, or sharded Redis). Each lookup is O(1) by hash. The total index is large (tens of TB) but the access pattern is simple: point lookups by hash, no range scans.
- Noise robustness: the algorithm works in noisy environments (bars, concerts, cars) because fingerprints are based on spectral peaks, not absolute energy levels. Background noise raises the floor but peaks still protrude. The combinatorial pairing adds redundancy — even if 50% of peaks are missed due to noise, enough fingerprint pairs survive for matching. Time-offset alignment further filters false positives.
- Scale math: 100M+ recognition requests/day. Each request generates ~200 fingerprints from a 5-second clip. Each fingerprint requires one hash lookup. That's 20B hash lookups/day, or ~230K lookups/sec average, ~1M/sec at peak. The hash table must be partitioned across many nodes. Latency target: <3 seconds from audio capture to song identification. Audio upload ~500ms, fingerprint generation ~100ms, matching ~500ms, result return ~100ms.
- Don't use: raw audio comparison (waveform matching) — it doesn't work with ambient noise, different recordings, or live performances. Don't use ML-based audio classification as the primary path — it requires massive training data per song and doesn't generalize to new catalog additions. Don't store fingerprints in a relational database with LIKE queries — the scale demands a hash-based index. Don't skip the time-offset alignment step — without it, false positive rates are unacceptably high.
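The time-offset alignment step can be sketched as a voting histogram. Fingerprints and the index layout are simplified, and all names are illustrative:

```python
from collections import Counter

def identify_song(sample_fps: list, index: dict, min_cluster: int = 5):
    """Each sample fingerprint is (hash, time_in_sample); the index maps
    hash -> [(song_id, time_in_song)]. A true match produces many hits
    sharing the same (time_in_song - time_in_sample) offset, while
    random hash collisions scatter across offsets."""
    votes = Counter()
    for fp_hash, t_sample in sample_fps:
        for song_id, t_song in index.get(fp_hash, ()):
            # quantize offsets to 0.1s bins so near-equal offsets cluster
            votes[(song_id, round(t_song - t_sample, 1))] += 1
    if not votes:
        return None
    (song_id, _offset), count = votes.most_common(1)[0]
    return song_id if count >= min_cluster else None
```

The min_cluster threshold is what keeps false positives low: colliding hashes vote for many (song, offset) pairs, but only the true song accumulates a tall aligned cluster.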
Design Strava
Difficulty: Medium. Category: Real-Time Systems.
Design a fitness tracking platform like Strava. Users can record GPS-based activities (running, cycling), view their routes on a map, compete on segments/leaderboards, and follow other athletes.
Key Topics
- GPS Tracking
- Activity Recording
- Segment Matching
- Leaderboard
- Social Feed
- Map Rendering
Hints for Design Strava
- Record GPS coordinates at regular intervals (every 1-5 seconds) on the client device. Batch upload the complete activity when finished — don't stream in real-time to save battery and bandwidth. A 1-hour run at 1Hz = 3,600 GPS points × 20 bytes each = ~72KB per activity.
- Store activities as polylines (ordered list of lat/lng/timestamp/elevation points). Calculate derived metrics server-side: total distance (Haversine formula between consecutive points), pace, elevation gain/loss, calories burned. Don't trust client calculations — they vary by device.
- Segments are predefined route sections (e.g., a popular hill climb). Match activities to segments using geospatial corridor matching: check if the activity polyline passes through the segment's start, end, and stays within a tolerance corridor (50m). This is a computational geometry problem.
- Leaderboard per segment: use Redis ZSET (score = time in seconds, member = userId). When an activity matches a segment, compute the elapsed time for that section and update the leaderboard. Support filters: all-time, this year, this month, age groups, weight classes.
- Social feed is a fan-out-on-write system. When a user completes an activity, push it to followers' feeds in Redis. Include activity summary (distance, time, map thumbnail). Detailed activity data is fetched on-demand when the user taps on it.
- Scale math: 100M users, 10M activities uploaded/day. Average activity = 3,600 GPS points = 72KB raw data + derived metrics. That's ~720GB/day of new activity data. Segment matching: 10M activities × 10 nearby segments each = 100M segment checks/day.
- Bottleneck: segment matching at upload time. Checking an activity against all segments globally is impossible (millions of segments). Solution: use geospatial indexing to find only segments near the activity's bounding box. Typically 5-20 candidate segments per activity.
- What breaks: GPS accuracy. Urban canyons (tall buildings), tunnels, and poor satellite coverage cause GPS drift — an activity might show you running through buildings. Fix: apply Kalman filtering to smooth GPS tracks. Flag impossible speeds (>40 km/h for a run) as GPS errors.
- Don't use: real-time streaming of GPS during recording (drains battery, wastes bandwidth). Don't compute leaderboard rankings on every query (pre-compute and cache). Don't store raw GPS points row-by-row in a relational DB — use a compact encoding such as Google's encoded polyline format or Protocol Buffers.
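The Haversine distance mentioned in the derived-metrics hint, plus the sum over consecutive track points:

```python
import math

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two GPS points."""
    r = 6_371_000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def track_distance(points: list) -> float:
    """Total distance of an activity: sum the Haversine distance over
    consecutive (lat, lng) points of the uploaded polyline."""
    return sum(haversine_m(*a, *b) for a, b in zip(points, points[1:]))
```

Running this server-side over the uploaded polyline is what makes the distance, pace, and elevation figures consistent across devices.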
Design a Tagging System
Difficulty: Medium. Category: Platform.
Design a multi-tenant tagging system where tenants independently manage tags and associate them with content. The system must prevent duplicate tags globally within a tenant, serve product discovery in under 25ms, and track tag usage analytics across millions of interactions. Consider consistency trade-offs: tag operations need strong consistency, discovery can tolerate caching, and analytics can be eventual.
Key Topics
- Multi-Tenancy
- Global Deduplication
- Caching Strategy
- Event-Driven Analytics
- Consistency Tiers
Hints for Design a Tagging System
- Tag deduplication is the core constraint. Within a tenant, tag names must be unique (case-insensitive). Use a unique index on (tenant_id, normalized_tag_name) in PostgreSQL. Normalize at write time (lowercase, trim whitespace). Without this, 'JavaScript', 'javascript', and ' JavaScript ' become three different tags.
- Three consistency tiers, because not everything needs the same guarantees. Tag CRUD operations need strong consistency (read-after-write, served from primary DB). Product discovery (which products have this tag) can tolerate short staleness (cached, refreshed every few seconds). Tag usage analytics are eventual (event-driven pipeline, seconds to minutes of delay is fine).
- Tag-to-content association is a many-to-many relationship. The junction table (content_tags) has a composite primary key: (tenant_id, content_id, tag_id). Sharding by tenant_id keeps all of a tenant's data co-located. Queries like 'all content with tag X' are fast because they hit a single shard.
- Popular tags dashboard is an analytics problem, not a transactional one. Click events flow through Kafka into a counting pipeline (Flink or similar). Pre-aggregated counts stored in a fast-read store (Redis or materialized view). The dashboard reads from the aggregate, not from the transactional DB.
- Caching strategy for discovery: tag-to-content mappings cached in Redis with a short TTL (5-30 seconds). Cache key: tenant_id:tag_id:content_list. Invalidation: on tag association/dissociation, publish an event that clears the relevant cache entries. Stale reads during the TTL window are acceptable for discovery but not for tag CRUD.
- Scale math: 10K tenants, average 500 tags per tenant = 5M tags total. Average 50K content items per tenant = 500M content items. Average 3 tags per content item = 1.5B association rows. At 100 bytes per row, the junction table is about 150GB. Fits on a single sharded PostgreSQL cluster.
- Multi-region: tag writes go to the primary region (strong consistency). Reads replicated to edge regions with a replication lag of 1-2 seconds. For discovery queries that tolerate staleness, read from the nearest replica. For tag creation, always route to primary.
- Common mistake: storing tags as a comma-separated string on the content row (e.g., 'javascript,react,frontend'). This breaks querying ('find all content tagged react'), prevents deduplication, and makes rename/delete operations require scanning every content row.
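The normalization that backs the unique index on (tenant_id, normalized_tag_name) can be sketched minimally. An in-memory dict stands in for PostgreSQL, and all names are illustrative:

```python
class TagStore:
    """Per-tenant tag dedup: normalize at write time, then enforce
    uniqueness on (tenant_id, normalized_name) — the in-memory
    equivalent of a PostgreSQL unique index."""

    @staticmethod
    def normalize(name: str) -> str:
        # trim, collapse internal whitespace, lowercase
        return " ".join(name.split()).lower()

    def __init__(self):
        self._tags = {}  # (tenant_id, normalized_name) -> display name

    def create_tag(self, tenant_id: str, name: str) -> bool:
        key = (tenant_id, self.normalize(name))
        if key in self._tags:
            return False  # duplicate within this tenant
        self._tags[key] = name.strip()  # keep original casing for display
        return True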
Design Ticketmaster
Difficulty: Hard. Category: E-Commerce.
Design an event ticketing platform like Ticketmaster. Users can browse events, select seats, and purchase tickets. The system must handle extreme traffic spikes when popular events go on sale without overselling any seats.
Key Topics
- Inventory Management
- Seat Reservation
- Distributed Locking
- Queue Management
- High Concurrency
- Payment Integration
Hints for Design Ticketmaster
- The core challenge: tens of thousands of users trying to book the same seats simultaneously. The system must prevent overselling while remaining fast. This is fundamentally a distributed concurrency problem.
- Use temporary seat reservations with TTL (e.g., 10 minutes). When a user selects seats, hold them with a timer. If payment doesn't complete within the TTL, release the seats back to inventory. This prevents seats being locked forever by abandoned checkouts.
- Virtual waiting room for high-demand events: when a Taylor Swift concert goes on sale, queue users randomly before letting them in. Admit in batches of 1,000-5,000. This converts a thundering herd into controlled traffic. Communicate queue position and estimated wait time.
- Seat locking: optimistic locking (compare-and-swap — 'book seat 42 IF current_status = available') is faster but can fail repeatedly under extreme contention, forcing retries. Pessimistic locking (SELECT FOR UPDATE) avoids retry storms but serializes access. For hot events, use a combination: optimistic for general admission, pessimistic for specific seat selection.
- Separate read path from write path (CQRS). Read path (browse events, view seat maps) is served from cache — eventually consistent, heavily cached in CDN. Write path (book tickets) goes to the primary DB with strong consistency. 99% of traffic is reads.
- Scale math: a major concert has 80K seats. On-sale moment: 500K-2M concurrent users. ~80K successful transactions in 5-10 minutes. That's ~250 TPS for booking. The real bottleneck isn't TPS — it's contention on individual seats.
- Bottleneck: the seat inventory database. All booking writes contend for rows in the same table. Solutions: partition inventory by section/zone, use Redis for real-time seat availability checks (fast reads), and only hit the DB for the actual booking transaction.
- What breaks: the payment service is slow (3-5 seconds per charge) while seats are held. If the payment provider has an outage, seats get locked and expire, causing a cascade of retries. Fix: separate reservation from payment — confirm the booking immediately, charge asynchronously. If payment fails, cancel the booking with a compensation flow.
- Don't use: a single relational DB for everything — it will collapse under on-sale load. Don't allow unlimited concurrent seat selections — cap the number of users actively selecting seats. Don't show real-time exact seat availability to everyone (too expensive) — show approximate availability, validate at booking time.
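The TTL-hold-then-book flow above can be sketched as follows. This is an in-memory stand-in, assuming a real system would use Redis (SET with NX and PX) or a database row with SELECT FOR UPDATE; class and method names are illustrative.

```python
import time

class SeatInventory:
    """Sketch of TTL seat holds with a compare-and-swap style booking step."""

    def __init__(self, seats, hold_ttl=600):
        # seat -> (status, holder, hold_expiry)
        self.state = {s: ("available", None, 0.0) for s in seats}
        self.ttl = hold_ttl

    def hold(self, seat, user, now=None):
        if now is None:
            now = time.time()
        status, holder, expiry = self.state[seat]
        if status == "booked":
            return False
        if status == "held" and now < expiry and holder != user:
            return False                # someone else holds it, and the hold is live
        self.state[seat] = ("held", user, now + self.ttl)
        return True

    def book(self, seat, user, now=None):
        if now is None:
            now = time.time()
        status, holder, expiry = self.state[seat]
        # CAS-style check: book only if this user's hold is still live.
        if status == "held" and holder == user and now < expiry:
            self.state[seat] = ("booked", user, 0.0)
            return True
        return False
```

An expired hold is simply overwritable by the next user, which is what releases abandoned-checkout seats back to inventory without a cleanup job.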
Design Tinder
Difficulty: Medium. Category: Social & Media.
Design a location-based dating app like Tinder. Users create profiles, set preferences, and swipe through nearby candidates. When two users swipe right on each other, it's a match and they can chat.
Key Topics
- Geospatial Queries
- Recommendation Engine
- Swipe Queue
- Match Detection
- Real-Time Messaging
- Profile Ranking
Hints for Design Tinder
- Use geohashing to find users within a configurable radius (1-100 miles). Pre-compute a swipe queue of nearby candidates that match the user's age, gender, and distance preferences. Store the queue in Redis — when a user opens the app, the deck loads instantly.
- Recommendation engine factors: distance (closer = higher score), age preference match, mutual interests, historical swipe patterns (ELO-like scoring — if many people swipe right on you, you're shown to more people), profile completeness, and activity recency (active users first).
- Match detection: when User A swipes right on User B, check if B already swiped right on A. Store 'likes' as a set per user (Redis SET or a simple DB table). On right-swipe: SISMEMBER check. If mutual → create match and notify both users instantly via push notification.
- Pre-generate swipe stacks asynchronously (batch job every 30-60 min). When a user exhausts their current stack, replenish in the background. The key UX requirement: the card deck should NEVER be empty in an active area. Load next batch before current batch is exhausted.
- After a match, enable real-time chat via WebSocket. Store messages in a separate chat service (Cassandra, partitioned by match_id). Chat is a separate microservice from the matching/swiping service.
- Scale math: 75M MAU, 2B swipes/day = ~23K swipes/sec. Each swipe = a read (fetch next card) + a write (record the swipe). Match check = 1 Redis SISMEMBER. Profile images served via CDN — each profile has 6 photos × 3 sizes = 18 images.
- Bottleneck: the recommendation engine for dense cities. In Manhattan, there might be 500K eligible profiles within 10 miles. You can't score all 500K for every user. Solution: pre-filter by hard constraints (age, gender, distance), then score the top 1,000 candidates, cache the top 100 as the swipe stack.
- What breaks: stale swipe queues. User A's queue shows User B, but B already deleted their account or changed preferences. Fix: validate profiles in the swipe queue lazily — if a swiped profile is invalid, skip and fetch the next. Don't block the user experience for validation.
- Don't use: SQL LIKE queries for profile search. Don't show the same profile twice (track seen profiles per user in a bloom filter — memory efficient). Don't expose exact location (show 'X miles away', round to nearest mile). Don't let inactive accounts pollute the swipe queue — filter by last-active timestamp.
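The mutual-like match check from the hints can be sketched directly. Here Python sets stand in for the per-user Redis SETs, and a plain membership test stands in for SISMEMBER; class and method names are illustrative.

```python
class MatchService:
    """Sketch of mutual-like match detection with per-user like sets."""

    def __init__(self):
        self.likes = {}       # user_id -> set of user_ids they liked
        self.matches = set()  # unordered pairs, stored as frozensets

    def swipe_right(self, user, target):
        self.likes.setdefault(user, set()).add(target)
        # The SISMEMBER check: did the target already like this user?
        if user in self.likes.get(target, set()):
            self.matches.add(frozenset((user, target)))
            return True   # a real system would push-notify both users here
        return False
```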
Design Uber
Difficulty: Hard. Category: Real-Time Systems.
Design a ride-sharing platform like Uber. Riders request a ride, the system matches them with nearby drivers, tracks the trip in real-time, calculates fares with surge pricing, and processes payments. Handle millions of concurrent trips.
Key Topics
- Geospatial Matching
- Supply-Demand Balancing
- ETA Service
- Trip Service
- Surge Pricing
- Driver Location Tracking
Hints for Design Uber
- Use a geospatial index (Google S2 cells, geohash, or quadtree) to match riders with nearby available drivers. S2 cells handle Earth's curvature accurately and support variable-resolution indexing. Geohash is simpler but has edge-case issues at cell boundaries.
- Driver location updates arrive every 3-5 seconds. Store in an in-memory grid (Redis) partitioned by geographic cells. Don't write every update to a persistent database — buffer and batch-write every 30 seconds for analytics. Live matching only needs the latest location.
- The matching algorithm combines: proximity (straight-line distance as initial filter), ETA (actual road distance via routing engine — much more accurate), driver rating, and vehicle type. Don't match the closest driver — match the fastest-arriving one.
- Surge pricing: monitor supply (available drivers) vs demand (ride requests) per geographic zone in real-time. When demand/supply ratio exceeds threshold, apply a multiplier (1.5x, 2x, etc.). Use Kafka for real-time supply/demand event streams and Flink for windowed aggregation.
- Trip lifecycle state machine: REQUEST → MATCHED → DRIVER_EN_ROUTE → ARRIVED → TRIP_STARTED → TRIP_COMPLETED → PAYMENT. Each state transition is an event logged immutably. This enables replaying trips for disputes and analytics.
- Scale math: 20M rides/day globally, 5M concurrent drivers sending location updates every 4 seconds = 1.25M location updates/sec. Peak hours in NYC alone: ~50K concurrent trips. The matching service must respond in <3 seconds for acceptable UX.
- Bottleneck: the matching service during peak demand. Thousands of ride requests per second in a single city, each needing a geospatial query + ETA calculation. Solution: pre-compute a driver availability grid (refreshed every 5s), use it for fast initial matching, then refine with precise ETA.
- What breaks: ETA accuracy degrades with traffic. An ETA estimated at request time becomes stale by the time the driver is en route. Fix: continuously update ETA using real-time traffic data and driver's live location. Push updated ETAs to the rider every 10-15 seconds.
- Don't use: a relational database for real-time location storage — too slow for millions of writes/sec. Don't compute road-distance ETA for every driver in the city — filter by radius first (geospatial index), then compute ETA only for the nearest 10-20 candidates. Don't use polling for trip tracking — use WebSocket/SSE push.
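The grid-then-refine matching flow can be sketched with a simple cell grid. This is a deliberately flat approximation: real systems use S2 cells or geohash, distance here is straight-line on lat/lng degrees (the initial filter only), and the top candidates would then get a precise road-network ETA. All names and the cell size are illustrative.

```python
import math

CELL = 0.01  # cell size in degrees, roughly 1 km at mid latitudes

def cell_of(lat, lng):
    return (math.floor(lat / CELL), math.floor(lng / CELL))

class DriverGrid:
    """In-memory sketch of a driver location grid (stands in for Redis)."""

    def __init__(self):
        self.cells = {}  # cell -> {driver_id: (lat, lng)}
        self.where = {}  # driver_id -> current cell

    def update(self, driver, lat, lng):
        old = self.where.get(driver)
        if old is not None:
            self.cells.get(old, {}).pop(driver, None)  # drop stale position
        c = cell_of(lat, lng)
        self.cells.setdefault(c, {})[driver] = (lat, lng)
        self.where[driver] = c

    def nearby(self, lat, lng, k=10):
        cx, cy = cell_of(lat, lng)
        found = []
        for dx in (-1, 0, 1):          # search the 3x3 cell neighborhood
            for dy in (-1, 0, 1):
                cell = self.cells.get((cx + dx, cy + dy), {})
                for d, (dlat, dlng) in cell.items():
                    found.append((math.hypot(lat - dlat, lng - dlng), d))
        found.sort()
        return [d for _, d in found[:k]]  # candidates for precise ETA ranking
```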
Design a URL Shortener
Difficulty: Easy. Category: Foundational.
Design a service like bit.ly that takes long URLs and creates short, unique aliases. When users visit the short URL, they are redirected to the original. Consider scale (billions of URLs), latency (sub-100ms redirects), and analytics (click tracking).
Key Topics
- Hashing
- Base62 Encoding
- Database Design
- Caching
- Read-Heavy System
Hints for Design a URL Shortener
- Use base62 encoding (a-z, A-Z, 0-9) to generate short codes — 7 characters gives ~3.5 trillion unique URLs. Avoid base64, since its + and / characters cause problems in URLs.
- Collision handling: check-and-retry works at low scale. At high scale, use a pre-generated unique ID service (Snowflake, Twitter's ID generator, or a counter with base62 conversion) to guarantee uniqueness without retries.
- Read-to-write ratio is extreme (~100:1 to 1000:1). Put Redis in front as a read-through cache for hot URLs. A single Redis instance handles ~100K reads/sec. LRU eviction works since popular URLs follow a power-law distribution.
- Database choice: a simple key-value store (DynamoDB, Cassandra) works well since you only do point lookups by short code. Avoid relational DBs unless you need analytics joins. Partition by short code hash for horizontal scaling.
- Scale math: 100M new URLs/day = ~1,200 writes/sec. 10B redirects/day = ~115K reads/sec. A single DB won't handle reads — cache layer is essential, not optional.
- Bottleneck: the ID generation service becomes a single point of failure. Solutions: run multiple ID generators with different ID ranges, or use a hash-based approach (MD5 of the long URL, keeping the first 7 characters of its base62 encoding).
- For analytics (click counts, geo, referrer), don't update the main table on every redirect. Write click events to Kafka, aggregate with Flink/Spark into an analytics table. This keeps the redirect path fast (<10ms).
- What breaks at scale: cache stampede when a viral link's cache expires — millions of requests hit the DB simultaneously. Fix with cache locking (only one request fetches from DB, others wait) or probabilistic early expiration.
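The counter-plus-base62 scheme above is small enough to show in full. A minimal sketch, assuming IDs come from a unique ID service; function names are illustrative.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n, width=7):
    """Convert a numeric ID to a short code, zero-padded to `width` chars."""
    if n == 0:
        return ALPHABET[0] * width
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out)).rjust(width, ALPHABET[0])

def decode_base62(code):
    """Recover the numeric ID from a short code."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Because IDs are unique by construction, there is no collision check on the write path, which is the whole point versus hash-and-retry.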
Design a Web Crawler
Difficulty: Medium. Category: Data Processing.
Design a web crawler that can crawl billions of web pages. The system must discover new URLs, download pages efficiently, handle politeness constraints, avoid traps, and store content for indexing.
Key Topics
- URL Frontier
- Politeness Policy
- Deduplication
- DNS Resolution
- Content Parsing
- Distributed Architecture
Hints for Design a Web Crawler
- Use a URL frontier (priority queue) to manage which URLs to crawl next. Prioritize by: domain importance (PageRank), freshness requirements (news sites > static pages), and crawl depth (homepage > deep links). The frontier is the brain of the crawler.
- Enforce politeness: never crawl a single domain faster than its robots.txt allows. Use per-domain rate limiting (typically 1 request every 1-10 seconds). Each crawler worker is assigned specific domains via consistent hashing — this naturally enforces per-domain limits and simplifies dedup.
- Deduplicate URLs using a Bloom filter (fast, ~1% false positive rate, minimal memory). For 10B URLs at a 1% false-positive rate, a Bloom filter needs roughly 10 bits per URL — about 12GB of RAM, versus hundreds of GB for storing the URLs themselves. Also deduplicate content: compute SimHash of page content to detect mirror sites and duplicate content hosted on different URLs.
- Distribute work across hundreds of crawl workers. Partition by domain via consistent hashing: one worker per domain cluster. This simplifies politeness enforcement (one worker controls the rate for its domains) and enables local DNS caching per worker.
- Handle crawler traps: infinite URL spaces (calendar pages with every date, shopping sites with every filter combination). Mitigations: max URL depth limit (e.g., 15 levels), max URLs per domain per crawl cycle, URL normalization (remove tracking params, sort query params, lowercase), and detect repetitive URL patterns.
- Scale math: to crawl 10B pages in 1 month, you need ~3,850 pages/sec sustained. Each page is ~100KB = ~385MB/sec download bandwidth. DNS lookups: ~3,850/sec (cache aggressively — local DNS resolver per worker). Storage: 10B × 100KB = ~1PB raw HTML.
- Bottleneck: DNS resolution. At 3,850 pages/sec, each page needs a DNS lookup. Public DNS resolvers will rate-limit you. Solution: run local DNS resolvers on each crawler worker, cache DNS results aggressively (TTL of hours, not seconds), and pre-resolve DNS for queued URLs in batch.
- What breaks: a single slow domain can bottleneck a worker. If a worker is assigned to a domain that responds slowly (2-3 seconds per page), it wastes time waiting. Solution: async HTTP clients (make hundreds of concurrent requests), and redistribute unresponsive domains to a 'slow lane' worker pool.
- Don't use: a single-threaded synchronous crawler (way too slow). Don't ignore robots.txt (ethical and legal issues, plus you'll get IP-banned). Don't store raw HTML in a relational database (use object storage like S3 with an index in the DB). Don't recrawl unchanged pages — use HTTP ETag/If-Modified-Since headers to skip unchanged content.
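The Bloom filter used for URL dedup above can be sketched minimally. The size here is tiny for illustration; sizing for 10B URLs at ~1% false positives means roughly 10 bits per URL. Class name and parameters are illustrative.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch for URL deduplication."""

    def __init__(self, size_bits=1 << 20, hashes=7):
        self.size = size_bits
        self.k = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # No false negatives; false positives mean a URL is skipped (~1%).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```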
Design Yelp
Difficulty: Medium. Category: Search & Discovery.
Design a local business discovery and review platform like Yelp. Users can search for nearby businesses, read and write reviews, upload photos, and filter by category, rating, and distance.
Key Topics
- Geospatial Search
- Quadtree / Geohash
- Review System
- Search Ranking
- Photo Storage
- Business Profiles
Hints for Design Yelp
- Use a quadtree or geohash index to efficiently answer 'find restaurants near me within 5km' queries. A quadtree recursively divides 2D space into four quadrants. Geohash converts lat/lng to a string prefix for range queries. Both give O(log n) spatial lookups.
- Separate the search/discovery service from the review/rating service. Search needs fast reads with geospatial indexes (Elasticsearch with geo queries). Reviews need write consistency and anti-spam validation (PostgreSQL). Different scaling requirements.
- Aggregate ratings asynchronously. Don't recompute the average on every new review — it locks the business row. Use a running weighted average: new_avg = old_avg + (new_rating - old_avg) / (total_reviews + 1). Or batch-recompute with a Kafka consumer on new-review events.
- Search ranking combines: distance (closer = higher), rating (weighted by review count — a 4.5 with 500 reviews beats a 5.0 with 2 reviews), relevance to search query (text match on name, cuisine, keywords), and sponsored/promoted status. Use Elasticsearch for scoring.
- Store photos in object storage (S3) with CDN delivery. Generate multiple sizes on upload: 150px thumbnail (list view), 600px medium (business page), 1200px full (lightbox). Average business has 20-50 photos. Total: billions of photos.
- Scale math: 200M businesses, 50M reviews/year, 500M search queries/day = ~5,800 QPS. Geospatial queries are compute-heavy — each checks distance for thousands of nearby businesses. Cache popular queries (city + category combinations) in Redis.
- Bottleneck: geospatial index for dense urban areas. Manhattan has 50K+ restaurants. A query for 'pizza near Times Square' must filter, score, and rank thousands of candidates in <200ms. Solution: pre-computed geohash-based indexes with materialized search results for popular areas.
- What breaks: fake reviews. Spam accounts posting 5-star reviews for payment. Detection: flag accounts with unusual patterns (only 5-star reviews, all for new businesses, posted within minutes). Use ML models trained on known fake reviews. Rate-limit reviews per user per day.
- Don't use: database LIKE queries for text search (use Elasticsearch). Don't compute distances in the application layer for every business in the DB (use spatial indexes). Don't show raw average ratings — use Bayesian average to prevent manipulation by businesses with few reviews.
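The incremental average and the Bayesian display rating from the hints fit in one small class. The prior mean and prior weight are illustrative choices, not prescribed values, and class and method names are assumptions.

```python
class RatingAggregate:
    """Sketch: incremental average plus a Bayesian-damped display rating."""

    def __init__(self, prior_mean=3.5, prior_weight=10):
        self.count = 0
        self.avg = 0.0
        self.m = prior_mean    # global average rating, the prior
        self.C = prior_weight  # how many reviews the prior 'counts as'

    def add_review(self, rating):
        # Running average: no rescan of past reviews, no row lock contention.
        self.count += 1
        self.avg += (rating - self.avg) / self.count
        return self.avg

    def bayesian(self):
        # Pulls low-review-count businesses toward the prior mean, so a
        # 5.0 with 2 reviews can't outrank a 4.5 with 500 reviews.
        return (self.C * self.m + self.count * self.avg) / (self.C + self.count)
```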
Design YouTube Top K
Difficulty: Hard. Category: Data Processing.
Design a system that computes the Top K most-viewed videos on YouTube in real-time. The system processes billions of view events per day and must return the current top K videos for various time windows (last hour, day, week).
Key Topics
- Min-Heap
- Apache Kafka
- Apache Flink
- Time Windows
- Exactly-Once Semantics
- Stateful Stream Processing
Hints for Design YouTube Top K
- Kafka as source of truth — partition by video_id, RF=3, acks=all, 128 partitions. Everything downstream is derived state.
- Flink + RocksDB — stateful stream processing for 50M+ keys without GC pressure. Incremental checkpoints to S3 every 60s.
- Min-heap of size K — O(log K) per update, not O(N log N) sort. Runs inside Flink, selects top K from exact per-video counts.
- ZADD, not ZINCRBY — Flink computes absolute counts, writes to Redis with ZADD (idempotent). Redis is a display board, never a counter.
- Time windows — 10-min sliding, daily/monthly tumbling with continuous triggers, lifetime counter (no window). Each emits every 30s.
- Exactly-once chain — idempotent Kafka producer → checkpointed offsets → 2PC Postgres sink. Redis writes are at-least-once but idempotent via ZADD.
- Watermarks + late events — BoundedOutOfOrderness(30s), allowedLateness(5min), sideOutputLateData for monitoring. Use server_timestamp, not client clock.
- 3-tier API fallback — Redis (5ms timeout) → Postgres → stale cache. Never hard-fail. CDN caches for 10s, absorbs 99%+ of reads.
- Failure recovery — Flink crashes: restore checkpoint from S3 (~30s), replay Kafka, re-emit to idempotent sinks. Redis crashes: API falls back to Postgres, Flink repopulates Redis in ≤30s. Zero data loss in every scenario.
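The size-K min-heap selection above is the core algorithmic trick, shown here as a batch function over a dict of exact per-video counts. Inside Flink this runs incrementally over keyed state; the standalone function is a simplification.

```python
import heapq

def top_k(counts, k):
    """Select the k highest-count videos in O(n log k), no full sort.

    The heap root is always the smallest count currently in the top k,
    so each new count needs only one comparison against the root.
    """
    heap = []
    for video, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, video))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, video))  # evict the smallest
    return sorted(heap, reverse=True)  # highest count first
```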
Design YouTube
Difficulty: Hard. Category: Social & Media.
Design a video sharing and streaming platform like YouTube. Users upload videos, the system transcodes and stores them, and viewers can stream in adaptive quality. Support recommendations, comments, subscriptions, and billions of daily views.
Key Topics
- Video Upload Pipeline
- Transcoding
- Adaptive Bitrate Streaming
- CDN
- Recommendation Engine
- View Count Service
Hints for Design YouTube
- Video upload pipeline: client uploads to a staging service → store raw video in object storage (S3) → enqueue in Kafka for processing → transcode to multiple resolutions (360p, 720p, 1080p, 4K) and codecs (H.264, VP9, AV1). Each resolution × codec = a separate file. A 10-minute video might produce 12+ variants.
- Use adaptive bitrate streaming (HLS or DASH). Break each transcoded video into small segments (2-10 seconds). Client requests a manifest file listing all quality levels, then dynamically switches between them based on network bandwidth. This is why YouTube quality changes mid-video.
- Serve videos via CDN with edge caching. Top 10% of videos get ~90% of views (power law). Cache popular videos at edge PoPs worldwide. Long-tail content is served from regional origin servers. CDN cost is the #1 expense for a video platform.
- Recommendation engine: watch history (what you've watched), collaborative filtering (users similar to you watched X), content similarity (video embeddings from titles, thumbnails, transcripts), and engagement signals (click-through rate, watch time, completion rate). Watch time is the most important signal.
- View counts don't need real-time accuracy. A 5-minute delay is fine. Ingest view events via Kafka → aggregate with Flink in 1-minute tumbling windows → write to a counter service. Show approximate counts. Exact counts are reconciled in a daily batch job.
- Scale math: 500 hours of video uploaded per minute. Average video: 10 min × 12 variants × ~100MB each = ~1.2GB per video. That's ~3,000 videos per minute, or ~3.6TB of new transcoded content per minute. CDN serves 1B+ hours of video daily. Bandwidth: ~1 petabit/sec aggregate.
- Bottleneck: transcoding. It's CPU-intensive (1 hour of video takes 3-5 hours of CPU time for all variants). Use a large pool of GPU-accelerated transcoding workers. Auto-scale based on queue depth. Prioritize short videos and popular creators.
- What breaks: viral videos cause CDN cache misses at edge PoPs that haven't cached the video yet. All requests fall through to the origin, overwhelming it. Fix: pre-warm CDN edges when a video starts trending (monitor view velocity, proactively push to edges).
- Don't use: a single server to serve video files — impossible at scale. Don't transcode synchronously during upload — it takes too long. Don't store videos in a database — use object storage (S3/GCS). Don't skip adaptive bitrate — fixed quality means buffering on slow connections and wasted bandwidth on fast ones.
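The 1-minute tumbling-window view aggregation described above can be sketched as a batch function. A real pipeline would do this in Flink over a Kafka topic with watermarks; here events are simply (video_id, epoch_seconds) pairs and the function name is illustrative.

```python
from collections import defaultdict

def tumbling_counts(events, window_sec=60):
    """Count view events per (video, window) bucket.

    Each event lands in exactly one window, keyed by the window's
    start time: bucket = ts - (ts % window_sec).
    """
    counts = defaultdict(int)
    for video, ts in events:
        bucket = ts - (ts % window_sec)
        counts[(video, bucket)] += 1
    return dict(counts)
```

The per-window counts are what the counter service sums into the approximate totals shown to viewers; the daily batch job reconciles the exact numbers.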