Design an Ad Click Aggregator
Difficulty: Medium. Category: Data Processing.
Design an ad click aggregation system that processes billions of click events daily. The system must count clicks per ad in real-time, deduplicate clicks, detect fraud, and provide accurate billing data to advertisers.
Key Topics
- Stream Processing
- Lambda Architecture
- Click Deduplication
- Time Windows
- Fraud Detection
- Exactly-Once Semantics
Hints for Design an Ad Click Aggregator
- Ingest click events via Kafka (partitioned by ad_id for ordering). Use a stream processor (Flink for exactly-once guarantees, Spark Streaming for simpler setup) to aggregate clicks in time windows: 1-minute (real-time dashboard), 1-hour (billing reports), 1-day (daily summaries).
- Deduplicate clicks: same user clicking the same ad within 30 seconds = one click. Use a composite key (user_id, ad_id) in Redis with a TTL. On each click event, SADD (or SET NX) and check the return value: if the entry already exists, skip the click (see the dedup sketch after these hints). A Bloom filter is an alternative if memory is constrained, but accept a small under-count, since false positives drop some legitimate clicks.
- Lambda architecture: real-time stream path (Flink aggregation for live dashboards, serves queries immediately) + batch path (hourly Spark job for accurate billing reconciliation). The batch path corrects any stream processing errors. Billing is always based on the batch path.
- Fraud detection pipeline (runs in parallel): flag suspicious patterns — same user clicking >5 times/minute (click spam), many clicks from one IP subnet (click farm), abnormal CTR spikes for an ad (competitor sabotage), clicks from known bot user agents. ML model trained on labeled fraud data improves over time.
- Store aggregated results in an OLAP database (ClickHouse for speed) or pre-aggregated tables. Schema: (ad_id, campaign_id, advertiser_id, time_bucket, click_count, unique_users, impressions). Advertisers query dashboards with rollups by campaign, ad group, creative, geo, device.
- Scale math: 10B ad impressions/day, 1% CTR = 100M clicks/day = ~1,160 clicks/sec. Peak during US business hours: 5-10x = ~10K clicks/sec. Each click event is ~200 bytes. Kafka throughput needed: ~0.25MB/sec sustained, ~2MB/sec peak. Manageable with a small Kafka cluster.
- Bottleneck: the deduplication layer. At 10K clicks/sec, Redis must handle 10K SADD operations/sec with TTL. A single Redis instance handles this easily. But if you also need to dedup across longer windows (daily unique clicks per ad), the SET sizes grow — millions of entries per popular ad.
- What breaks: late-arriving events. A click event from a mobile user on a poor connection arrives 5 minutes after it happened. The 1-minute window has already closed. Solution: use event-time processing (Flink watermarks) instead of processing-time. Allow late events up to a configurable threshold (e.g., 10 minutes). Re-aggregate affected windows.
- Don't use: database writes per click event (too slow, too expensive at scale). Don't skip the batch reconciliation path — stream processing can have subtle bugs that only batch processing catches. Don't aggregate by processing time — event time is what matters for accurate billing. Don't trust client-side click timestamps (clock skew, manipulation).
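A minimal sketch of the dedup step above, assuming the redis-py client. It uses a per-(user_id, ad_id) key with SET NX and a TTL instead of one shared SET, so each pair gets its own 30-second window; the Kafka producer is omitted.

```python
import redis  # assumes the redis-py client is available

r = redis.Redis()
DEDUP_WINDOW_SECONDS = 30  # "same user + same ad within 30 seconds = one click"

def is_new_click(user_id: str, ad_id: str) -> bool:
    """True the first time a (user_id, ad_id) pair is seen inside the window.

    SET NX writes the key only if it does not exist and the TTL lets it expire
    on its own; a None return means the key was already there, i.e. a duplicate.
    """
    key = f"click_dedup:{user_id}:{ad_id}"
    return r.set(key, 1, nx=True, ex=DEDUP_WINDOW_SECONDS) is not None

# In the stream job: only unique clicks are forwarded to the aggregation topic.
event = {"user_id": "u123", "ad_id": "ad456"}
if is_new_click(event["user_id"], event["ad_id"]):
    print("unique click: forward to the aggregation topic")  # producer omitted
else:
    print("duplicate within 30s: drop")
```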
Design a Chat System with E2EE
Difficulty: Medium. Category: Security & Messaging.
Design a real-time chat application (like Signal or WhatsApp) that supports 1:1 and group messaging with end-to-end encryption. The server should never be able to read message contents. Consider key management, offline delivery, and multi-device support.
Key Topics
- End-to-End Encryption
- WebSocket
- Message Queue
- Key Exchange
- Signal Protocol
Hints for Design a Chat System with E2EE
- Use the Signal Protocol (Double Ratchet + X3DH key agreement) for E2EE. It provides forward secrecy (past messages stay safe if keys are compromised later) and break-in recovery (future messages become safe again after a compromise).
- WebSockets maintain persistent connections for real-time delivery. Design a connection gateway service that maps userId → WebSocket connection. Use consistent hashing to route messages to the correct gateway server.
- For offline users, store encrypted messages server-side (the server only sees ciphertext, never plaintext). Deliver on reconnect. Set a retention limit (e.g., 30 days) to bound storage growth.
- Group chats: Sender Keys (one key per sender, shared with all members) = O(1) encrypt cost but weaker forward secrecy. Pairwise encryption (encrypt message once per member) = O(n) cost but stronger security. WhatsApp uses Sender Keys for efficiency.
- Scale math: 1B users, 100B messages/day = ~1.15M messages/sec. Each message is ~1KB encrypted = ~1.15GB/sec write throughput. Use Cassandra (write-optimized, partition by chatId) for message storage.
- Bottleneck: the connection gateway. Each server handles ~500K concurrent WebSocket connections (kernel tuning needed). For 500M concurrent users, you need ~1,000 gateway servers. A separate presence service tracks which gateway hosts each user.
- What breaks: key distribution at scale. When a user adds a new device, it needs key exchange with all contacts. Use a key server to store public prekeys, but clients MUST verify key fingerprints out-of-band (QR code scan) to prevent man-in-the-middle attacks.
- Don't use: symmetric encryption alone (no forward secrecy), HTTP polling for real-time chat (too much latency and server overhead), or any scheme where the server can read messages even temporarily. Don't store encryption keys on the server.
- Multi-device: each device has its own key pair. Messages are encrypted separately for each of the recipient's devices. Use a device registry per user. When a device is removed, rotate the Sender Key for all groups that user is in.
- Message ordering: use Lamport timestamps or vector clocks to order messages correctly across devices. Client-side timestamps are unreliable (clock skew). The server assigns a monotonic sequence number per chat for display ordering.
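The message-ordering hint above (Lamport timestamps) in a minimal form; the message dicts and field names are illustrative, not part of any real protocol.

```python
from dataclasses import dataclass

@dataclass
class LamportClock:
    """Minimal Lamport clock for ordering chat events across devices."""
    time: int = 0

    def tick(self) -> int:
        # Local event (this device sends a message).
        self.time += 1
        return self.time

    def observe(self, remote_time: int) -> int:
        # On receiving a message, advance past the sender's timestamp.
        self.time = max(self.time, remote_time) + 1
        return self.time

def display_order(messages):
    """Sort by (lamport_time, sender_id) so concurrent messages break ties deterministically."""
    return sorted(messages, key=lambda m: (m["lamport_time"], m["sender_id"]))

# Example: device A sends, device B receives and replies; B's reply sorts after A's message.
a, b = LamportClock(), LamportClock()
t1 = a.tick()
b.observe(t1)
t2 = b.tick()
assert t2 > t1
```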
Design a Distributed Cache
Difficulty: Medium. Category: Infrastructure.
Design a distributed caching system like Memcached or Redis Cluster. It should support fast key-value lookups across multiple machines, handle node failures gracefully, and scale horizontally. Consider eviction policies, data partitioning, and replication strategies.
Key Topics
- Consistent Hashing
- Eviction Policies
- Replication
- Cache Invalidation
- LRU/LFU
Hints for Design a Distributed Cache
- Use consistent hashing with virtual nodes (150-200 per physical node) to distribute keys evenly. Without virtual nodes, adding/removing a server causes massive key redistribution and a thundering herd of cache misses.
- Cache invalidation is famously one of the two hard problems in computer science. Patterns ranked by complexity: TTL-based (simplest, tolerate staleness), cache-aside (app manages both cache and DB), write-through (write to cache + DB atomically), write-behind (write to cache, async DB sync — risky if cache node crashes before DB write).
- For high availability, replicate each partition to 2-3 nodes. Async replication: lower write latency but stale reads possible. Sync replication: guarantees consistency but adds ~1-2ms per write. Choose based on your consistency requirements.
- Hot key problem: a single viral key (celebrity profile, flash sale item) can overload one cache node. Solutions: replicate hot keys to multiple shards, add an L1 local cache on each app server with 5-10s TTL, or use request coalescing (only one request fetches from origin, others wait on a promise).
- Scale math: a single Redis node handles ~100K ops/sec and stores ~25GB. For 1M ops/sec you need ~10 shards. For 500GB of cached data you need ~20 shards. Consistent hashing lets you add shards without full redistribution.
- Eviction policies: LRU is the safe default (evict least recently used). LFU is better when frequency matters (keeps consistently popular items longer). Random eviction is surprisingly effective and avoids the overhead of tracking access metadata.
- What breaks at scale: thundering herd — a popular key expires and thousands of concurrent requests simultaneously miss cache and slam the database. Fix with cache locking (mutex on miss — only one request fetches from DB), staggered TTLs (add random jitter ±30s), or background refresh (never expire, refresh before TTL).
- Don't use: a distributed cache for data that changes every second (just query the DB). Don't cache objects with complex invalidation dependencies (cache A depends on cache B depends on cache C → guaranteed stale data cascades). Don't use Memcached if you need persistence or complex data structures — use Redis.
- Monitoring: track cache hit ratio (should be >95%), p99 latency, eviction rate, and memory usage. If hit ratio drops below 90%, your cache is too small or your access pattern isn't cache-friendly.
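A small consistent-hash ring with virtual nodes, as described in the first hint above; the node names and the 150-vnode default are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes (a sketch; 150 vnodes per node)."""

    def __init__(self, nodes, vnodes: int = 150):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str, vnodes: int = 150) -> None:
        # Each physical node appears many times on the ring for even key spread.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get_node("user:42"))  # the same key always maps to the same node
```

Adding a fourth node only remaps the keys that land on its virtual-node arcs, which is exactly the property the first hint is after.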
Design Dropbox
Difficulty: Hard. Category: Storage & Sync.
Design a cloud file storage and sync service like Dropbox. Users can upload files from one device and access them from any other device. Support file versioning, sharing, and efficient sync.
Key Topics
- File Chunking
- Deduplication
- Sync Protocol
- Conflict Resolution
- Metadata Service
- Block Storage
Hints for Design Dropbox
- Split files into fixed-size chunks (4MB). Only upload changed chunks, not the entire file. Editing 1 byte in a 1GB file should upload 4MB, not 1GB. Use content-defined chunking (Rabin fingerprinting) for even better deduplication across similar files.
- Content-based hashing (SHA-256 per chunk) enables deduplication. If two users upload the same file, store chunks once, point both file records to the same chunk IDs. Saves 30-50% storage in practice. Also saves bandwidth — the client computes the hash first, asks the server whether it already has that chunk, and uploads only if not.
- Metadata service is the brain: maps files → list of chunk IDs → storage locations. Use PostgreSQL with ACID guarantees. Partition by userId for horizontal scaling. This is the most critical service — if metadata is lost or corrupted, files are unrecoverable.
- Conflict resolution: last-writer-wins is dangerous for important files. Dropbox creates a 'conflicted copy' when two devices edit the same file before syncing. Always preserve both versions and notify the user. Data loss is worse than a conflict notification.
- Real-time sync via WebSocket or long polling. When user edits a file on laptop, a notification pushes to all other devices within seconds. The device then fetches only the changed chunks from the server.
- Scale math: 500M users, average 2GB stored = 1EB total. With dedup, actual storage ~500PB. At 4MB chunks = ~125 billion chunks. Store chunks in object storage (S3/GCS). Store chunk metadata in Cassandra (chunk_hash → storage_location).
- Bottleneck: the metadata service. Every file operation hits it. Cache aggressively in Redis (user's file tree). Read from replicas for listing operations. All writes go through the primary for consistency.
- What breaks: sync loops — Device A syncs to B, B syncs back to A, repeat forever. Fix with vector clocks or logical timestamps. Each device tracks the last seen version; only sync if the remote version is strictly newer.
- Don't use: direct S3 uploads for every small edit — too expensive and slow. Use a staging/upload service that receives chunks, deduplicates, then writes to cold storage in batches. Don't store file contents in a relational database — use object storage. Don't skip chunking — it's the foundation of efficient sync.
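A sketch of fixed-size chunking plus the "ask before upload" dedup handshake from the hints above; `server_has_chunk` and `upload_chunk` are hypothetical stand-ins for the metadata-service and block-store calls.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB fixed-size chunks, per the hint above

def chunk_file(path: str):
    """Yield (sha256_hex, chunk_bytes) for each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield hashlib.sha256(chunk).hexdigest(), chunk

def sync_file(path: str, server_has_chunk, upload_chunk) -> list[str]:
    """Upload only the chunks the server is missing; return the file's chunk manifest.

    `server_has_chunk(digest) -> bool` and `upload_chunk(digest, data)` are
    placeholder interfaces for the metadata service and block store.
    """
    manifest = []
    for digest, data in chunk_file(path):
        if not server_has_chunk(digest):   # ask before uploading (dedup + bandwidth savings)
            upload_chunk(digest, data)
        manifest.append(digest)            # a file is just an ordered list of chunk IDs
    return manifest
```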
Design FB Post Search
Difficulty: Medium. Category: Search & Discovery.
Design the search functionality for a social platform like Facebook. Users can search for posts, people, pages, and groups. Results should be ranked by relevance, recency, and social connection. Support typeahead suggestions.
Key Topics
- Inverted Index
- Elasticsearch
- Tokenization
- Search Ranking
- Typeahead
- Index Partitioning
Hints for Design FB Post Search
- Build an inverted index mapping words → post IDs. Use Elasticsearch (built on Lucene) as the search engine. It handles tokenization, inverted indexing, and relevance scoring out of the box. Don't build your own search engine from scratch.
- Index updates: tokenize posts on write (lowercase, remove stop words, stem words, extract entities). Don't update the search index synchronously — publish a post-created event to Kafka, and a separate indexing service consumes and updates Elasticsearch asynchronously. Indexing latency of 1-5 seconds is acceptable.
- Ranking model: relevance score (BM25/TF-IDF from Elasticsearch) + recency (newer posts rank higher) + social signals (posts from friends rank above strangers) + engagement (likes, comments, shares weighted by recency) + content type (posts with photos rank higher). Use Elasticsearch's function_score query for custom ranking.
- Partition the search index by time range. Recent posts (last 30 days) are searched 10x more than older posts. Keep recent posts in a hot index (more shards, more RAM), and older posts in a cold index (fewer resources). This reduces cost by 60-70%.
- Typeahead: combine a prefix trie for instant suggestions with Elasticsearch's completion suggester. Return mixed results — posts, people, pages, groups — ranked by relevance and user relationship. Debounce keystrokes on the client (200ms) to reduce backend queries.
- Scale math: 3B users, 1B posts/day. Search index size: trillions of documents over 10+ years. Even just the recent 30-day index is ~30B documents. Elasticsearch cluster needs 100s of nodes. Query rate: ~100K searches/sec.
- Bottleneck: indexing throughput. 1B new posts/day = ~11,500 index writes/sec. Each post needs tokenization, entity extraction, and Elasticsearch indexing. Use a pipeline: Kafka → NLP enrichment service → Elasticsearch bulk indexer (batch 1000 docs per bulk request).
- What breaks: search relevance for ambiguous queries. 'Apple' — the company, the fruit, or your friend named Apple? Social context is critical. Boost results from friends, pages you follow, and groups you're in. Use the user's profile and activity history to disambiguate.
- Don't use: SQL LIKE queries (full table scan, impossible at scale). Don't index every field (index only searchable text, store metadata separately). Don't return all matches (paginate, show top 20 results). Don't skip access control — users should only see posts they have permission to view (filter by privacy settings in the query).
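One way the ranking hint could translate into an Elasticsearch function_score query; the field names (content, created_at, author_id, like_count) and the weights are assumptions for illustration, not a documented ranking function.

```python
def build_search_query(text: str, friend_ids: list[str]) -> dict:
    """Relevance (BM25) combined with recency decay, a friend boost, and engagement."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"content": text}},  # BM25 relevance from Elasticsearch
                "functions": [
                    # Newer posts score higher (illustrative 7-day half-life).
                    {"gauss": {"created_at": {"origin": "now", "scale": "7d", "decay": 0.5}}},
                    # Posts from friends get a flat boost.
                    {"filter": {"terms": {"author_id": friend_ids}}, "weight": 2.0},
                    # Mild, log-dampened engagement boost.
                    {"field_value_factor": {"field": "like_count", "modifier": "log1p", "factor": 0.1}},
                ],
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        },
        "size": 20,  # paginate; never return all matches
    }
```

The resulting dict would be passed as the body of a search request against the posts index, with a privacy filter added to the inner query in a real deployment.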
Design Google Docs
Difficulty: Hard. Category: Collaboration.
Design a real-time collaborative document editor like Google Docs. Multiple users can edit the same document simultaneously, see each other's cursors, and all changes are synced in real-time without conflicts.
Key Topics
- Operational Transform
- CRDT
- Real-Time Sync
- Cursor Presence
- Version History
- Conflict Resolution
Hints for Design Google Docs
- The core challenge is concurrent editing without conflicts. Two main approaches: Operational Transform (OT) — transforms operations against each other to converge. CRDTs (Conflict-free Replicated Data Types) — data structures that merge automatically. Google Docs uses OT. Figma uses CRDTs.
- OT transforms operations: if User A inserts 'X' at position 5 and User B deletes character at position 3, OT adjusts A's insert to position 4 (since B's delete shifted everything). Each operation has a revision number. The server is the single source of truth for operation ordering.
- Use WebSocket for real-time sync. The server acts as a central relay and transformer — clients send operations, server transforms against concurrent ops, broadcasts the result. Latency budget: operations should appear on other users' screens within 100-300ms.
- Cursor presence: show each collaborator's cursor position and selection in real-time. Send presence updates (cursor position, user info, color) on a separate lightweight channel. Throttle to ~5 updates/sec per user to avoid flooding.
- Store the document as a sequence of operations (event sourcing). Periodically snapshot the full document state for fast loading (every 100-500 operations). Loading a document = fetch latest snapshot + replay operations since snapshot.
- Scale math: a popular document might have 50 concurrent editors producing 10 ops/sec each = 500 ops/sec per document. Across millions of documents, total throughput is high but per-document load is manageable. The bottleneck is the operation log per document.
- Bottleneck: the OT server for a single popular document. All operations must be serialized through one server to maintain consistent ordering. Solutions: assign one server per document (partition by docId), use in-memory operation buffer with periodic persistence.
- What breaks: network partitions — a user goes offline, types extensively, then reconnects with 500 buffered operations. All must be transformed against concurrent operations from other users. Solution: client-side OT that transforms local operations against incoming server operations, then sends the transformed batch on reconnect.
- Don't use: last-writer-wins (destroys concurrent edits). Don't use locking (blocks other users). Don't send full document state on every edit (bandwidth explosion). Don't use REST for real-time sync — WebSocket is essential. Don't implement OT from scratch in production — use a proven library like ShareDB or Yjs (CRDT).
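The position-shifting core of the OT example in the hints (A inserts 'X' at 5, B deletes at 3, so A's insert moves to 4), reduced to string inserts and single-character deletes. Real OT libraries such as ShareDB also handle tie-breaking, ranges, and revision bookkeeping.

```python
from dataclasses import dataclass

@dataclass
class Op:
    kind: str       # "insert" or "delete"
    pos: int        # character index
    text: str = ""  # inserted text (empty for a one-character delete)

def transform(op: Op, against: Op) -> Op:
    """Shift `op` so it still applies after `against` has been applied.

    Only the position adjustment is shown; ties between two inserts at the same
    position need an explicit convention (here the earlier op wins the slot).
    """
    shift = 0
    if against.kind == "insert" and against.pos <= op.pos:
        shift = len(against.text)
    elif against.kind == "delete" and against.pos < op.pos:
        shift = -1
    return Op(op.kind, op.pos + shift, op.text)

# The example from the hint: A inserts "X" at 5, B deletes the character at 3.
a = Op("insert", 5, "X")
b = Op("delete", 3)
print(transform(a, b))  # Op(kind='insert', pos=4, text='X')
```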
Design Instagram
Difficulty: Hard. Category: Social & Media.
Design a photo and video sharing social network like Instagram. Users can upload photos/videos, follow others, browse a personalized feed, view stories, and explore trending content. Handle billions of photos and millions of concurrent users.
Key Topics
- Photo Upload Pipeline
- CDN
- News Feed
- Fan-Out
- Stories
- Explore/Recommendation
Hints for Design Instagram
- Photo upload pipeline: client uploads to a media service → store original in object storage (S3) → async worker generates multiple resolutions (150px thumbnail, 640px feed, 1080px full) → serve via CDN. Process upload + resizing in <30 seconds for good UX.
- Feed is a hybrid fan-out system. For regular users (<10K followers): fan-out-on-write — pre-compute feeds in Redis on post. For celebrities (>100K followers): fan-out-on-read — merge their posts into feeds at query time. This is the Facebook/Instagram architecture.
- Stories are ephemeral (24h TTL). Store separately in a fast-access store with automatic expiry (Redis with TTL or Cassandra with TTL). Pre-generate story trays (list of stories from followed users) per user. A user opening the app should see story trays load in <200ms.
- Explore/Discover uses a recommendation engine: collaborative filtering (users who liked X also liked Y) + content-based (image embeddings via CNN, hashtags, location) + engagement signals (like rate, save rate, share rate in first hour). Ranked by a lightweight ML model.
- Handle high write throughput for social actions: likes, comments, and follower counts don't need real-time accuracy. Process asynchronously via Kafka. Show approximate counts (eventual consistency). A like counter that's 2 seconds behind is fine.
- Scale math: 2B MAU, 500M DAU, 100M photos uploaded/day. Average photo = 2MB × 4 resolutions = 8MB storage. 100M × 8MB = 800TB/day new storage. After a year: ~300PB. CDN serves >95% of image reads. Without CDN, the system is impossible.
- Bottleneck: the feed generation service during peak hours. When 500M users check their feed 5-10x/day, that's 2.5-5B feed reads/day. Each feed read fetches from Redis (pre-computed) + merges celebrity posts (real-time). Redis latency must be <5ms.
- What breaks: image processing backlog. If the transcoding workers can't keep up, uploaded photos don't appear for minutes. Fix: auto-scale transcoding workers based on queue depth. Show a 'processing' placeholder immediately, swap to the real image when ready.
- Don't use: a relational database for image storage (use S3/GCS). Don't serve images directly from your servers (use CDN). Don't fan-out-on-write for celebrities — a Cristiano Ronaldo post would trigger 600M+ cache writes, taking minutes and overwhelming the fan-out service.
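A sketch of the hybrid feed read described above: the pre-computed Redis ZSET merged with celebrity posts pulled at read time. redis-py is assumed, and `get_followed_celebrities` / `get_recent_posts` are hypothetical service calls.

```python
import heapq

def read_feed(user_id, redis_client, get_followed_celebrities, get_recent_posts, limit=50):
    """Hybrid feed read: pre-computed fan-out portion + celebrity pull portion."""
    # 1. Pre-computed portion: a ZSET of post_id -> timestamp written at post time.
    #    (Members come back as bytes unless the client uses decode_responses=True.)
    precomputed = redis_client.zrevrange(f"feed:{user_id}", 0, limit - 1, withscores=True)
    candidates = [(score, post_id) for post_id, score in precomputed]

    # 2. Pull portion: celebrity accounts are excluded from fan-out, so fetch them now.
    for celeb_id in get_followed_celebrities(user_id):
        for post in get_recent_posts(celeb_id, limit=20):
            candidates.append((post["created_at"], post["id"]))

    # 3. Merge by recency; a ranking model would replace this sort in production.
    top = heapq.nlargest(limit, candidates, key=lambda c: c[0])
    return [post_id for _, post_id in top]
```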
Design a Job Scheduler
Difficulty: Medium. Category: Infrastructure.
Design a distributed job scheduler that can execute millions of scheduled tasks reliably. Support cron schedules, one-time jobs, job dependencies (DAGs), retries with backoff, and priority-based execution.
Key Topics
- Cron Scheduling
- Task Queue
- Exactly-Once Execution
- Worker Pool
- Failure Recovery
- Priority Queue
Hints for Design a Job Scheduler
- Separate the scheduler (decides WHEN to run) from the executor (decides HOW to run). The scheduler polls for jobs where next_run <= now and enqueues them into a task queue (Kafka, SQS, or Redis). Workers pull from the queue and execute. This decoupling allows independent scaling.
- For cron-like schedules, store the next execution time in the database. A scheduler process polls every second: SELECT jobs WHERE next_run <= NOW() AND status = 'scheduled' LIMIT 1000. After enqueuing, update next_run to the next cron occurrence. Use batch operations to handle thousands of jobs/sec.
- Exactly-once execution is the hardest requirement. Use distributed locks (Redis SETNX with TTL, or database row-level locks). When a worker claims a job, it acquires a lock. Only the lock holder can execute. TTL prevents deadlocks if the worker crashes mid-execution.
- Handle worker failures: workers send heartbeats every 10-30 seconds. If a worker stops heartbeating, the scheduler marks its in-progress jobs as 'failed' and makes them eligible for retry. Use exponential backoff (1m, 5m, 30m, 2h) with a max retry count (e.g., 5 attempts).
- Job dependencies (DAGs): Job B runs only after Job A succeeds. Model as a directed acyclic graph. Use topological ordering. The scheduler tracks execution state per DAG run. When Job A completes, check if all of Job B's dependencies are satisfied before enqueuing B.
- Scale math: 10M scheduled jobs, 1M execute per hour = ~278 executions/sec. Each execution takes 1-60 seconds. At 30 seconds average, you need ~8,300 worker slots to keep up. Auto-scale workers based on queue depth — if queue grows, add workers.
- Bottleneck: the scheduler polling loop. If 10M jobs need next_run checks, the database query becomes slow. Solution: partition jobs by next_run time bucket (minute-level buckets). Only query the current bucket. Index on (status, next_run). Use a separate 'delayed job' queue for far-future jobs.
- What breaks: clock skew across scheduler nodes. If two scheduler instances have different clocks, jobs might be enqueued twice or missed entirely. Solution: use a single leader scheduler (elected via ZooKeeper/etcd), or use database-level SELECT FOR UPDATE SKIP LOCKED to ensure each job is picked by exactly one scheduler.
- Don't use: in-memory job storage (lost on restart). Don't use sleep-based scheduling (inaccurate and wasteful). Don't allow unbounded retries (a permanently failing job retrying forever wastes resources). Don't skip dead-letter queues — failed jobs after max retries should be moved to a DLQ for manual inspection.
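A possible shape of the scheduler tick, assuming psycopg2 and a `jobs` table with `status` and `next_run` columns. FOR UPDATE SKIP LOCKED is what lets several scheduler instances poll concurrently without claiming the same row, as the clock-skew hint suggests.

```python
import time
import psycopg2  # assumed driver; table and column names are illustrative

CLAIM_SQL = """
    SELECT id, payload FROM jobs
    WHERE status = 'scheduled' AND next_run <= NOW()
    ORDER BY next_run
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
"""

def poll_and_enqueue(conn, enqueue):
    """One scheduler tick: claim due jobs and hand them to the task queue.

    `enqueue` stands in for the Kafka/SQS producer.
    """
    with conn:  # commit (or roll back) the whole claim atomically
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL)
            for job_id, payload in cur.fetchall():
                enqueue(job_id, payload)
                # For cron jobs, this is where next_run advances to the next occurrence.
                cur.execute("UPDATE jobs SET status = 'enqueued' WHERE id = %s", (job_id,))

def run_scheduler(conn, enqueue, interval_seconds=1):
    while True:  # the hint's "poll every second" loop
        poll_and_enqueue(conn, enqueue)
        time.sleep(interval_seconds)
```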
Design LeetCode
Difficulty: Hard. Category: Platform.
Design an online coding platform like LeetCode. Users can browse problems, write code in multiple languages, submit solutions, and see if they pass all test cases. Support contests with real-time leaderboards.
Key Topics
- Code Execution Sandbox
- Container Isolation
- Test Case Runner
- Queue-Based Processing
- Leaderboard
- Anti-Cheat
Hints for Design LeetCode
- The hardest part is safe code execution. Run user code in sandboxed containers or microVMs (Docker with the gVisor runtime, or Firecracker) with strict resource limits: CPU (2 cores max), memory (256MB max), time (10 seconds), no network access, no filesystem writes outside /tmp. A malicious fork bomb or infinite loop must be killed cleanly.
- Use a job queue (Kafka or SQS) between the submission API and execution workers. Flow: user submits → API validates and enqueues → worker pulls job → runs code in sandbox → reports result → API notifies user via WebSocket. Decoupling prevents a burst of submissions from crashing the execution tier.
- Pre-compile test cases and expected outputs. Run user code against them sequentially, stopping at first failure for efficiency. Show which test case failed, the expected vs actual output, and execution time. Hidden test cases prevent hardcoding answers.
- Support multiple languages with language-specific Docker images (Python 3.11, Java 17, C++20, Go, Rust, etc.) pre-installed. Warm up containers — pulling and starting a cold Docker image takes 5-10 seconds. Keep a pool of warm containers per language ready to accept submissions.
- Contest leaderboard: use Redis ZSET keyed by (problems_solved DESC, total_time ASC). Update atomically on each accepted submission. Show real-time rank to users via WebSocket push. For large contests (10K+ users), paginate the leaderboard.
- Scale math: 1M submissions/day = ~12 submissions/sec average. During a contest: 50K users × 5 problems × 3 attempts = 750K submissions in 2 hours = ~100 submissions/sec peak. Each execution takes 2-10 seconds. You need 200-1000 execution workers during contests.
- Bottleneck: execution workers during contests. All 50K users submit simultaneously. Solution: auto-scale workers based on queue depth, prioritize contest submissions over practice submissions, and set a maximum queue wait time (show 'queue position' to the user).
- What breaks: users can try to escape the sandbox (container breakout), consume excessive resources (memory bombs, fork bombs), or read test case files. Use gVisor or Firecracker for kernel-level isolation. Mount test cases as read-only. Kill any process exceeding resource limits instantly.
- Don't use: exec() or eval() directly on the server to run user code — it's a security nightmare. Don't give containers network access (users could make outbound API calls to fetch answers). Don't trust execution time on shared machines for competitive ranking — normalize by CPU benchmark or use dedicated instances for contests.
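A sketch of the resource limits from the first hint applied via Docker flags. The image name, mount path, and entrypoint are placeholders; gVisor would be enabled by adding `--runtime=runsc` where it is installed on the host.

```python
import subprocess

def run_submission(source_path: str, language_image: str = "python:3.11-slim") -> dict:
    """Run user code in a locked-down container and classify the outcome."""
    cmd = [
        "docker", "run", "--rm",
        "--network=none",        # no outbound calls
        "--cpus=2",              # CPU cap
        "--memory=256m",         # memory cap
        "--pids-limit=64",       # blunt fork bombs
        "--read-only",           # no writes outside the tmpfs below
        "--tmpfs", "/tmp:size=16m",
        "-v", f"{source_path}:/sandbox/main.py:ro",   # source_path must be an absolute host path
        language_image, "python", "/sandbox/main.py",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return {"status": "ok" if proc.returncode == 0 else "runtime_error",
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # The docker CLI is killed here; a real runner would also `docker kill` the container.
        return {"status": "time_limit_exceeded", "stdout": "", "stderr": ""}
```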
Design a Local Delivery Service
Difficulty: Hard. Category: Real-Time Systems.
Design a local delivery service like DoorDash or Instacart. Users can order from local restaurants/stores, and nearby couriers pick up and deliver the orders. Support real-time tracking, ETA estimation, and driver matching.
Key Topics
- Geospatial Indexing
- ETA Estimation
- Order Matching
- Real-Time Tracking
- Surge Pricing
- Dispatch Optimization
Hints for Design a Local Delivery Service
- Use geohashing or quadtrees to efficiently find nearby drivers/couriers and restaurants. Geohash cells provide O(1) lookup for 'find all drivers in this area'. Store driver locations in Redis, partitioned by geohash prefix.
- Separate into microservices: order service (CRUD, status), dispatch service (matching), tracking service (live location), payment service (charges), and restaurant service (menus, prep time). The dispatch service is the brain.
- Driver location updates stream at high frequency (every 3-5 seconds). Store in Redis (in-memory, fast writes). Don't persist every update to a database — batch-write to a time-series store every 30s for historical analytics.
- The dispatch/matching algorithm optimizes for: minimize delivery time (not just distance — road routing matters), driver proximity to restaurant, order batching (two orders at same restaurant → one driver), and driver earnings fairness (don't starve idle drivers).
- ETA estimation needs: real-time traffic data, historical delivery times for this restaurant, restaurant prep time (varies by order complexity and current queue), and driver travel time. ML models trained on historical data outperform simple distance/speed calculations by 30-40%.
- Scale math: 10M orders/day peak, 1M concurrent drivers sending location every 4s = 250K location writes/sec. During dinner rush (6-9pm), order volume spikes 5-10x. The dispatch service must match an order in <30 seconds.
- Bottleneck: the dispatch service during peak hours. Thousands of new orders per second, each needs a geospatial query + ETA calculation + driver selection. Pre-compute a driver availability heat map (refreshed every 5s) for fast initial matching.
- What breaks: restaurant prep time estimates are wrong — driver arrives but order isn't ready, or order is ready but no driver is nearby. Fix: real-time prep time updates from the restaurant POS system. If a driver is assigned too early, they waste time waiting. If too late, food gets cold.
- Don't use: a relational DB for live driver locations (too slow). Don't match purely by distance (a driver 1km away across a river is worse than one 2km away on the same road). Don't use a single dispatch algorithm for all markets — dense urban areas need different optimization than suburban areas.
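A toy, in-memory version of the geohash-prefix index from the first hint: drivers are bucketed into coarse lat/lng cells, and a lookup scans the cell plus its eight neighbours. A production system would keep these buckets in Redis and use real geohashes; the cell size here is an illustrative choice.

```python
from collections import defaultdict

CELL_DEG = 0.01  # ~1.1 km of latitude per cell; stands in for a geohash prefix

def cell_of(lat: float, lng: float) -> tuple[int, int]:
    return int(lat // CELL_DEG), int(lng // CELL_DEG)

class DriverIndex:
    def __init__(self):
        self._cells = defaultdict(set)  # cell -> driver ids
        self._where = {}                # driver id -> current cell

    def update_location(self, driver_id: str, lat: float, lng: float) -> None:
        cell = cell_of(lat, lng)
        old = self._where.get(driver_id)
        if old and old != cell:
            self._cells[old].discard(driver_id)   # driver moved to a new cell
        self._cells[cell].add(driver_id)
        self._where[driver_id] = cell

    def nearby(self, lat: float, lng: float) -> set[str]:
        # Search the cell containing the point plus its 8 neighbours.
        cx, cy = cell_of(lat, lng)
        found = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found |= self._cells[(cx + dx, cy + dy)]
        return found

idx = DriverIndex()
idx.update_location("driver-7", 37.7749, -122.4194)
print(idx.nearby(37.7755, -122.4180))  # {'driver-7'}
```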
Design a Metrics Monitoring System
Difficulty: Medium. Category: Infrastructure.
Design a metrics monitoring and alerting system like Datadog or Prometheus. Collect metrics from thousands of servers, store time-series data, display real-time dashboards, and fire alerts when thresholds are breached.
Key Topics
- Time-Series Database
- Data Collection Agent
- Alerting Engine
- Dashboard
- Downsampling
- Push vs Pull Model
Hints for Design a Metrics Monitoring System
- Lightweight agents on each server push metrics (CPU, memory, disk, network, request latency, error rate, custom app metrics) to a collector service. Use UDP for fire-and-forget high-throughput collection (like StatsD). TCP for guaranteed delivery of critical metrics.
- Store metrics in a time-series database (Prometheus for pull-based, InfluxDB or TimescaleDB for push-based). Schema: (metric_name, tags/labels, timestamp, value). Tags enable filtering: host=web-01, service=api, region=us-east. Choose a TSDB that supports tag-based queries efficiently.
- Downsample old data aggressively: 1-second resolution for the last 2 hours, 1-minute for last 7 days, 5-minute for last 30 days, 1-hour for last year. This reduces storage by 100-1000x. A metric producing 1 data point/sec = 2.6M points/month raw, but only ~8,700 after downsampling to 5-min.
- Alerting engine evaluates rules against incoming data: 'error_rate{service=api} > 5% for 3 minutes'. Use a sliding window evaluation. Support severity levels (warning, critical, page). Route alerts to Slack (warning), PagerDuty (critical). Implement alert grouping and dedup — don't send 500 alerts when 500 servers have the same issue.
- Pull vs Push: Prometheus uses pull (scrapes HTTP endpoints on targets every 15-30s). Datadog uses push (agents send data). Pull is simpler for service discovery and prevents overloading the collector. Push handles ephemeral workloads (Lambda functions, short-lived Kubernetes pods) better since they may die before being scraped.
- Scale math: 10K servers, each producing 200 metrics at 10-second intervals = 200K data points/sec ingestion. Each data point = ~50 bytes = ~10MB/sec. After downsampling, long-term storage is manageable. Query load: 1,000 dashboard refreshes/min, each querying 10-50 metrics.
- Bottleneck: query performance for complex dashboards. A dashboard with 20 panels, each querying a metric across 500 hosts for the last 24 hours = 20 × 500 × 8,640 data points = 86M data points to scan. Solution: pre-aggregate common queries, use rollup tables, and cache dashboard results for 10-30 seconds.
- What breaks: alert fatigue. Too many alerts = engineers ignore them all. Solution: proper alert tuning (avoid noisy thresholds), alert correlation (group related alerts into incidents), auto-remediation for known issues (restart service on OOM), and alert-on-alert-absence (detect if monitoring itself fails).
- Don't use: a relational database for time-series storage (write throughput too low, queries too slow for time-range scans). Don't skip downsampling (storage costs explode). Don't collect every possible metric (focus on the Four Golden Signals: latency, traffic, errors, saturation). Don't alert on a single data point — always use windows to avoid flapping.
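A minimal sliding-window evaluation of a rule like "error_rate > 5% for 3 minutes" from the alerting hint; the threshold and the 90% window-coverage check are illustrative choices.

```python
import time
from collections import deque

class ThresholdRule:
    """Fires only when the condition holds for the whole window (no single-point alerts)."""

    def __init__(self, threshold: float, window_seconds: int = 180):
        self.threshold = threshold
        self.window = window_seconds
        self.samples = deque()  # (timestamp, value)

    def observe(self, value: float, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop samples that fell out of the evaluation window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
        window_covered = now - self.samples[0][0] >= self.window * 0.9
        breached = all(v > self.threshold for _, v in self.samples)
        return window_covered and breached  # alert only on a sustained breach

rule = ThresholdRule(threshold=0.05)
for t in range(0, 200, 15):               # a 5.5% error rate sustained past 3 minutes
    fired = rule.observe(0.055, now=1_000 + t)
print(fired)                               # True once the window is covered
```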
Design a Multi-Tenant SaaS Platform
Difficulty: Hard. Category: Architecture.
Design a multi-tenant SaaS platform where multiple organizations share the same infrastructure but have isolated data and customizable configurations. Consider tenant isolation strategies, authentication/authorization, billing, and how to handle tenants of vastly different sizes.
Key Topics
- Multi-Tenancy
- Data Isolation
- RBAC
- Horizontal Scaling
- Tenant-Aware Routing
Hints for Design a Multi-Tenant SaaS Platform
- Three isolation models (pick based on compliance needs and cost): shared database with row-level tenant_id column (cheapest, weakest isolation), schema-per-tenant (moderate cost, good isolation), database-per-tenant (most expensive, strongest isolation — required for healthcare/finance). Most B2B SaaS starts with shared DB, offers dedicated DB for enterprise tier.
- Tenant-aware middleware: extract tenant context from subdomain (acme.app.com), JWT claim, or API key. Inject tenant_id into every database query automatically. Critical security rule: NEVER serve data without a tenant_id filter. A missing WHERE tenant_id = X clause exposes all tenants' data.
- Noisy neighbor problem: one tenant's heavy query slows down everyone. Solutions: per-tenant rate limiting (100 API calls/min), per-tenant resource quotas (max 10% of DB connections), separate connection pools per tier (enterprise tenants get dedicated DB resources), and tenant-based sharding for large customers.
- RBAC (Role-Based Access Control): hierarchical permission model — Organization → Workspace → Team → Role → Permission. Store in a dedicated auth service (or use something like Oso/Casbin). Common roles: Owner, Admin, Member, Viewer. Support custom roles for enterprise clients.
- Tenant onboarding/offboarding: automate everything. Provisioning a new tenant should take <30 seconds (create org record, seed default data, generate API keys). Offboarding: soft-delete with 30-day recovery window, then hard-delete all data (GDPR compliance).
- Scale math: 10K tenants, top 1% generate 50% of traffic (power law). Average tenant: 100 users, 1K API calls/day. Top tenant: 50K users, 5M API calls/day. Total: ~60M API calls/day = ~700 TPS average, ~3,500 TPS peak. The top 100 tenants need special attention.
- Bottleneck: shared database at scale. When all tenants share one PostgreSQL instance, the top tenant's complex queries block everyone. Solution: shard by tenant_id. Small tenants share shards (multi-tenant shard). Large tenants get dedicated shards. Use a routing layer to direct queries to the right shard.
- What breaks: schema migrations. In a shared DB, migrating the schema affects ALL tenants simultaneously. A failed migration takes everyone down. Solution: rolling migrations (add column → backfill → deploy code → remove old column). For schema-per-tenant: automate migration across all schemas with rollback capability.
- Don't use: a single-tenant architecture that you 'multi-tenantify' later (extremely painful retrofit). Don't skip tenant_id in every table (you'll have data leaks). Don't give all tenants the same resource limits (the biggest tenant will degrade experience for the smallest). Don't hard-code tenant configuration — use a feature flag service for per-tenant customization.
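A sketch of the tenant-aware middleware idea above: resolve the tenant from the subdomain and force all data access through a helper that always includes the tenant_id filter. Table and column names are assumptions; a real implementation would also check JWT claims and API keys.

```python
from dataclasses import dataclass

@dataclass
class TenantContext:
    tenant_id: str

def tenant_from_host(host: str) -> TenantContext:
    """Resolve the tenant from a subdomain like 'acme.app.com'.
    In production this would be a lookup against the tenants table."""
    return TenantContext(tenant_id=host.split(".")[0])

def scoped_query(ctx: TenantContext, table: str, where: str = "TRUE"):
    """Build a query that can never omit the tenant filter.

    `table` and `where` come from application code (not user input); the
    tenant_id is passed as a bound parameter, never interpolated.
    """
    sql = f"SELECT * FROM {table} WHERE tenant_id = %s AND ({where})"
    return sql, (ctx.tenant_id,)

ctx = tenant_from_host("acme.app.com")
print(scoped_query(ctx, "projects", "archived = FALSE"))
# ('SELECT * FROM projects WHERE tenant_id = %s AND (archived = FALSE)', ('acme',))
```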
Design a News Aggregator
Difficulty: Medium. Category: Data Processing.
Design a news aggregator like Google News. The system crawls and aggregates news from thousands of sources, deduplicates similar stories, and presents a personalized feed to users.
Key Topics
- Web Crawling
- RSS/Atom Feeds
- Deduplication
- Ranking Algorithm
- Personalization
- Content Categorization
Hints for Design a News Aggregator
- Crawl RSS feeds and news APIs on a schedule. Use a URL frontier (priority queue) to manage crawl priorities. High-authority sources (NYT, BBC) are crawled every 5 minutes; smaller blogs every few hours. Respect robots.txt and crawl-delay directives.
- Deduplication is critical — many outlets publish the same AP/Reuters story. Use content fingerprinting with MinHash or SimHash to detect near-duplicate articles (80%+ similarity). Group duplicates into story clusters, show the best source (highest authority + most detail).
- Ranking model: freshness (newer = higher, with rapid decay — a 6-hour-old article loses 50% of its freshness score), source authority (established outlets > unknown blogs), user engagement (clicks, read-time, shares), topic relevance to user interests, and geographic relevance.
- Personalization: collaborative filtering (users with similar reading patterns) + content-based (articles matching user's topic interests inferred from reading history). Maintain a user interest vector (topic weights) updated on every article read. Blend: 60% personalized + 40% trending/editorial picks.
- Content categorization using NLP: topic classification (politics, sports, tech, etc.) with a fine-tuned transformer model. Named entity recognition for people, companies, locations. Sentiment analysis. These enrichments power filtering, trending detection, and personalization.
- Scale math: 100K news sources, average 50 articles/day each = 5M new articles/day. Dedup reduces this to ~2M unique stories. Each article = ~5KB text + metadata. Storage: ~10GB/day raw text. The crawler fetches ~60 articles/sec sustained.
- Bottleneck: the deduplication pipeline. Every new article must be compared against recent articles to detect duplicates. Computing SimHash for 5M articles and comparing pairwise is expensive. Solution: use locality-sensitive hashing (LSH) to find candidate duplicates in O(1), then verify with detailed comparison.
- What breaks: breaking news events. A major event generates 10,000 articles in an hour from every outlet. The dedup system and crawler are overwhelmed. Solution: detect event spikes (surge in articles with similar entities/keywords), switch to high-frequency crawling for related sources, and fast-track dedup for the event cluster.
- Don't use: a simple reverse-chronological feed (boring, no personalization). Don't show duplicate stories (frustrating UX — users see the same headline 10 times). Don't crawl every source at the same frequency (waste of resources). Don't rely solely on RSS — many modern news sites don't have RSS feeds, so you also need web scraping.
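A compact SimHash, as named in the dedup hint: token-level 64-bit fingerprints whose Hamming distance is small for near-duplicate articles. Production systems weight tokens or shingles and pair this with LSH bucketing to avoid pairwise comparisons.

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over word tokens: similar texts produce nearby fingerprints."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash("Central bank raises interest rates by 25 basis points to curb inflation, Reuters reports")
b = simhash("Central bank raises interest rates by 25 basis points to curb inflation, Reuters reports today")
c = simhash("Local team wins the championship after a dramatic penalty shootout")
# The first distance is much smaller than the second; unrelated text averages ~32 of 64 bits.
print(hamming(a, b), hamming(a, c))
```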
Design a News Feed / Timeline
Difficulty: Medium. Category: Social Media.
Design a news feed system like Facebook's Timeline or Twitter's Home Feed. Users should see a personalized feed of posts from people they follow, ranked by relevance. Consider fan-out strategies, caching, ranking algorithms, and handling users who follow millions of accounts.
Key Topics
- Fan-out
- Push vs Pull
- Ranking Algorithm
- Cache Layer
- News Feed Generation
Hints for Design a News Feed / Timeline
- Fan-out-on-write (push model): when a user posts, write the post ID to every follower's feed cache. Fast reads (O(1) feed fetch) but extremely expensive writes for users with millions of followers. Good for users with <10K followers.
- Fan-out-on-read (pull model): compute the feed at read time by fetching recent posts from all followed users, then merge-sort. Cheap writes but slow reads for users following many accounts. Better for celebrity accounts.
- Hybrid approach (what Facebook/Twitter actually use): push for regular users, pull for celebrities (>100K followers). At read time, merge the pre-computed feed with real-time pull results from celebrity accounts. This solves the 'Lady Gaga problem' — one post shouldn't trigger 50M cache writes.
- Cache the pre-computed feed in Redis as a sorted set (score = timestamp or ML rank). Limit to the most recent 500-1000 items. Older items fall off the cache and are fetched from the database on deep scroll.
- Ranking model: engagement signals (likes, comments, shares weighted by recency), relationship strength (close friends > acquaintances), content type (video > photo > text), diversity (don't show 10 posts from the same person), and negative signals (hide, report, unfollow).
- Scale math: 500M DAU, each checks feed 10x/day = 5B feed reads/day = ~58K reads/sec. With push model, each post fans out to average 200 followers = 200 cache writes per post. At 10M posts/day = 2B cache writes/day.
- Bottleneck: the fan-out service for popular users. A celebrity posting triggers millions of async writes. Use Kafka to buffer fan-out tasks, process with a pool of workers. Acceptable for follower feeds to update with 1-5 second delay.
- What breaks: feed consistency — user posts but doesn't see it in their own feed (fan-out hasn't completed). Fix: always inject the user's own recent posts at read time, regardless of fan-out status (read-your-writes consistency).
- Don't use: polling the database for every feed request — it kills DB performance at scale. Don't use a single monolithic feed table — partition by userId. Don't rank purely chronologically — engagement-based ranking increases time-on-site by 2-3x.
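The write side of fan-out-on-write as a sketch, assuming redis-py: push the new post into each follower's ZSET and trim to the most recent ~1000 items. The follower list would come from a graph-service lookup, and this whole function would run on async workers consuming a fan-out topic, skipping celebrity accounts entirely.

```python
import redis  # redis-py assumed

FEED_MAX_ITEMS = 1000  # keep only the most recent items per cached feed (hint above)

def fan_out_post(r: redis.Redis, post_id: str, created_at: float, follower_ids) -> None:
    """Push a new post into each follower's cached feed (fan-out-on-write)."""
    pipe = r.pipeline(transaction=False)       # batch the writes, no need for atomicity
    for follower_id in follower_ids:
        key = f"feed:{follower_id}"
        pipe.zadd(key, {post_id: created_at})  # score = timestamp (or an ML rank)
        pipe.zremrangebyrank(key, 0, -(FEED_MAX_ITEMS + 1))  # drop the oldest entries
    pipe.execute()
```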
Design an Online Auction
Difficulty: Hard. Category: E-Commerce.
Design an online auction platform like eBay auctions. Users can list items for auction, place bids in real-time, and the highest bidder wins when the timer expires. Handle bid sniping, concurrent bids, and real-time updates.
Key Topics
- Real-Time Bidding
- Concurrency Control
- Timer Service
- Bid Validation
- Notification System
- Fraud Detection
Hints for Design an Online Auction
- The core challenge is concurrent bids. Use optimistic concurrency control: each bid includes the current highest price as a condition — 'place bid of $105 IF current_highest = $100'. If another bid was placed in between, reject and ask the user to retry with the updated price.
- Use WebSocket or SSE to push bid updates to all watchers in real-time. Don't rely on polling — auction final minutes have rapid-fire bids. Each auction is a pub/sub channel. When a new bid is accepted, broadcast to all subscribers within 100ms.
- Auction end time must be precise. Use a distributed timer/scheduler service. Critical: implement anti-sniping — if a bid arrives in the last 30 seconds, extend the auction by 2 minutes. This prevents last-second bids that give no one time to respond.
- CQRS pattern: separate bid processing (strong consistency, single-threaded per auction) from bid display (eventually consistent, cached). Bid writes are serialized per auction_id through a single processor. Bid reads can be served from cache with 100ms staleness.
- Support multiple auction types: English (ascending bids, most common), Dutch (price starts high, decreases until someone bids), sealed-bid (all bids submitted blind, highest wins). Each has different validation logic and winner determination.
- Scale math: 10M active auctions, average 50 watchers each = 500M WebSocket connections. During auction close: bids arrive in bursts of 10-100/sec for popular items. The bid processor for a single hot auction must handle serialized writes at this rate.
- Bottleneck: the bid processing service. Each auction must process bids serially (to prevent overselling/double-bidding). Solution: partition by auction_id — each auction is handled by exactly one processor (consistent hashing). Processors can handle thousands of auctions each.
- What breaks: the timer service. If the auction close event fires late, bids might be accepted after the auction should have ended. Use a high-precision timer (not cron) with redundancy — two independent timers, use the first one that fires. Clock sync via NTP across servers.
- Don't use: database-level locking for bid processing at scale (too slow). Don't trust client timestamps for bid ordering (use server-side monotonic timestamps). Don't allow bids below the current highest + minimum increment — validate server-side. Don't send push notifications for every bid — batch notifications for watchers.
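The optimistic-concurrency bid from the first hint expressed as a conditional UPDATE; the `auctions`/`bids` tables and the `min_increment` column are assumed, and a psycopg2-style connection is used. A rejected bid means the client should refresh the current price and retry.

```python
def place_bid(conn, auction_id: int, bidder_id: int, amount: float, expected_highest: float) -> bool:
    """Accept the bid only if the highest price the client saw is still the highest price."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE auctions
                   SET current_highest = %s, highest_bidder = %s
                 WHERE id = %s
                   AND current_highest = %s                      -- the optimistic check
                   AND %s >= current_highest + min_increment      -- server-side validation
                   AND status = 'open'
                """,
                (amount, bidder_id, auction_id, expected_highest, amount),
            )
            if cur.rowcount != 1:
                return False   # lost the race or the bid was invalid
            cur.execute(
                "INSERT INTO bids (auction_id, bidder_id, amount) VALUES (%s, %s, %s)",
                (auction_id, bidder_id, amount),
            )
            return True
```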
Design a Payment System
Difficulty: Hard. Category: Fintech.
Design a payment processing system like Stripe or Square. It should handle credit card payments, refunds, and payouts reliably. The system must guarantee no double charges, handle partial failures gracefully, and maintain an accurate financial ledger at all times.
Key Topics
- Idempotency
- Double-Entry Ledger
- Saga Pattern
- Exactly-Once Processing
- Reconciliation
Hints for Design a Payment System
- Idempotency is non-negotiable. Every payment request must include a client-generated idempotency key. If a retry arrives with the same key, return the original result. Store idempotency keys with TTL (24-48h) in Redis or the database. Without this, network retries WILL cause double charges.
- Use a double-entry ledger: every transaction creates two entries — a debit from one account and a credit to another. The sum of all entries must always be zero. This makes auditing trivial and catches bugs immediately (if the books don't balance, something is wrong).
- Payment flow: authorize (verify card, hold funds) → capture (actually charge) → settle (move money between banks). Use the Saga pattern with compensating transactions — if capture fails after authorization, issue a void. Never leave money in an ambiguous state.
- Build a reconciliation pipeline that runs daily: compare your internal ledger with external payment provider (Stripe, bank) records. Flag discrepancies automatically. In practice, ~0.1-1% of transactions will have mismatches that need investigation.
- Scale math: 1M transactions/day = ~12 TPS average. But Black Friday peaks at 50-100x normal = ~1,200 TPS burst. Use Kafka as a buffer between the API and the payment processor to smooth out spikes. The payment service must handle bursts gracefully.
- Bottleneck: the ledger database. Every transaction = at least 2 writes (debit + credit) with ACID guarantees. Use PostgreSQL with row-level locking. Partition the ledger by account_id for horizontal scaling. Avoid cross-partition transactions — they're 10x slower.
- What breaks: network timeouts to payment providers. You send a charge request but get no response — did it succeed or fail? NEVER assume. Implement a 'pending' state, then use the provider's query API or webhooks to reconcile. Timeout ≠ failure.
- Don't use: eventually consistent databases for the financial ledger (money MUST be strongly consistent). Don't process payments synchronously in the API request — use async processing with status polling or webhooks. Don't skip audit logging — regulators require immutable transaction records.
- Fraud detection: run async checks in parallel — velocity checks (too many transactions too fast), device fingerprinting, amount anomalies, geolocation mismatches. If fraud is detected post-authorization, reverse before settlement. ML models improve detection over time.
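A sketch of the idempotency-key handling from the first hint, assuming redis-py: SET NX claims the key atomically so two concurrent retries cannot both charge, and the stored result is replayed on later retries. Error handling around `charge_fn` (releasing or marking the key on failure) is omitted.

```python
import json
import redis  # redis-py assumed

r = redis.Redis()
IDEMPOTENCY_TTL = 48 * 3600  # keep keys for 48h, per the hint above

def charge_with_idempotency(idempotency_key: str, charge_fn, request: dict) -> dict:
    """Return the cached result for a retried request instead of charging twice.

    `charge_fn` stands in for the actual payment-provider call.
    """
    cache_key = f"idem:{idempotency_key}"
    claimed = r.set(cache_key, json.dumps({"status": "in_progress"}),
                    nx=True, ex=IDEMPOTENCY_TTL)
    if not claimed:
        cached = json.loads(r.get(cache_key))
        if cached["status"] == "in_progress":
            return {"status": "pending", "detail": "original request still processing"}
        return cached                      # replay the original outcome

    result = charge_fn(request)            # e.g. authorize + capture
    r.set(cache_key, json.dumps(result), ex=IDEMPOTENCY_TTL)
    return result
```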
Design a Price Tracking Service
Difficulty: Medium. Category: Data Processing.
Design a price tracking service like CamelCamelCamel or Honey. The system monitors product prices across multiple retailers, stores price history, and alerts users when prices drop below their target.
Key Topics
- Web Scraping
- Scheduler
- Price History
- Alert System
- Time-Series Data
- Rate Limiting
Hints for Design a Price Tracking Service
- Crawl product pages on a schedule using a job scheduler. Respect per-retailer rate limits (politeness). Amazon might allow 1 req/sec; a small shop might only tolerate 1 req/min. Store rate limit configs per domain. Use a priority queue to schedule high-demand products more frequently.
- Store price history as time-series data. TimescaleDB (PostgreSQL extension) or InfluxDB are ideal. Schema: (product_id, retailer, timestamp, price, currency, availability). Partition by product_id and time range. Keep raw data for last 90 days, downsample to daily averages for historical.
- Alert system: users set price targets ('notify me when < $X'). When a new price is scraped, check against active alerts for that product. Use Redis for fast lookup: key = product_id, value = sorted set of (target_price, user_id). If new price < any target, trigger notification via push/email.
- Handle anti-scraping defenses: rotate proxy IPs (pool of 1000+ residential proxies), randomize request intervals, use headless browsers (Playwright/Puppeteer) for JS-rendered pages, respect robots.txt. Some retailers expose APIs or affiliate feeds — prefer these when available.
- Price history charts: pre-aggregate data into daily/weekly/monthly buckets for fast chart rendering. Store pre-aggregated data separately (materialized views). A user viewing a 1-year price chart shouldn't query 365 raw data points — use the daily aggregate (365 points) or weekly aggregate (52 points).
- Scale math: tracking 100M products across 50 retailers. If each product is checked once every 6 hours, that's ~4,600 scrapes/sec sustained. Each scrape returns ~1KB of data = 4.6MB/sec ingestion. Price change events (subset of scrapes) trigger alert checks.
- Bottleneck: the scraping infrastructure. 4,600 reqs/sec distributed across 50 retailers means some retailers see hundreds of req/sec. You'll get IP-banned quickly. Solution: use multiple scraping strategies per retailer (API, RSS, browser scrape) and maintain a healthy proxy rotation pool.
- What breaks: product page structure changes. When Amazon redesigns their product page, all scrapers break simultaneously. Fix: separate scraping logic into per-retailer parsers with schema validation. Alert on parser failures (if extraction fails for >10% of a retailer's products, page the on-call engineer).
- Don't use: a single IP for all scraping (instant ban). Don't scrape every product at the same frequency — high-demand products (with active alerts) get checked every hour, dormant products every 24 hours. Don't block on synchronous scraping — use async HTTP clients and a worker pool.
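The Redis alert lookup from the hints above: a per-product ZSET of user targets, scanned with ZRANGEBYSCORE when a new price arrives. redis-py is assumed, and `notify` is a placeholder for the push/email sender.

```python
import redis  # redis-py assumed

r = redis.Redis(decode_responses=True)

def set_alert(user_id: str, product_id: str, target_price: float) -> None:
    # ZSET per product: member = user, score = that user's target price.
    r.zadd(f"alerts:{product_id}", {user_id: target_price})

def on_new_price(product_id: str, price: float, notify) -> None:
    """Fire alerts for every user whose target is at or above the new price, then clear them."""
    key = f"alerts:{product_id}"
    hit_users = r.zrangebyscore(key, min=price, max="+inf")
    for user_id in hit_users:
        notify(user_id, product_id, price)
    if hit_users:
        r.zremrangebyscore(key, min=price, max="+inf")  # treat alerts as one-shot
```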
Design a Rate Limiter
Difficulty: Easy. Category: Infrastructure.
Design a rate limiting service that protects APIs from abuse. It should support different rate limit rules (per user, per IP, per endpoint), work across multiple servers, and handle high throughput with minimal latency overhead.
Key Topics
- Token Bucket
- Sliding Window
- Redis
- Distributed Systems
- API Gateway
Hints for Design a Rate Limiter
- Token Bucket is the most common algorithm — simple, memory-efficient, allows bursts. Each user gets a bucket that refills at a fixed rate. Sliding Window Log is more precise but uses more memory (stores every request timestamp).
- For distributed rate limiting across multiple servers, use Redis with atomic Lua scripts (INCR + EXPIRE in one call). Without atomicity, race conditions cause over-counting or under-counting.
- Placement matters: API gateway level (NGINX, Kong) handles it before your app code runs — lowest latency. Application middleware gives more flexibility (per-endpoint rules). A dedicated rate limit service (e.g., Envoy's global rate limiting backed by a central rate limit service) is best for microservices.
- Return proper HTTP headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and 429 Too Many Requests status code. Clients need these to implement backoff correctly.
- Scale math: if you have 10M users and track per-user counters in Redis, that's ~10M keys × ~100 bytes = ~1GB RAM. Fits in a single Redis instance. For per-endpoint rules, multiply by number of endpoints.
- What breaks: a single Redis instance is a SPOF. Use Redis Cluster or replicate across nodes. If Redis goes down, fail-open (allow all traffic) is usually safer than fail-closed (block all traffic) in production.
- Don't use: a local in-memory counter on each app server — requests from the same user hit different servers, so limits aren't enforced globally. Sticky sessions are a hack, not a solution.
- Advanced: sliding window counter algorithm combines fixed window counters with weighted overlap — nearly as accurate as sliding window log, but uses only 2 counters per window instead of storing every timestamp.
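A single-process token bucket matching the first hint; a distributed deployment would move the same arithmetic into a Redis Lua script so the refill-and-decrement happens atomically across app servers.

```python
import time

class TokenBucket:
    """Token bucket: refills at a fixed rate, allows bursts up to capacity."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller returns HTTP 429 plus the X-RateLimit-* headers above

bucket = TokenBucket(capacity=10, refill_per_second=5)   # 5 req/s with bursts of 10
print([bucket.allow() for _ in range(12)].count(True))   # 10 allowed, 2 rejected
```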
Design Robinhood
Difficulty: Hard. Category: FinTech.
Design a stock trading platform like Robinhood. Users can view real-time stock quotes, place buy/sell orders, manage their portfolio, and see account balances. The system must handle market-hours traffic spikes and maintain financial accuracy.
Key Topics
- Order Matching
- Market Data Feed
- Portfolio Service
- Idempotency
- Regulatory Compliance
- Real-Time Quotes
Hints for Design Robinhood
- Separate the order service (place/cancel orders) from the execution service (route to exchanges like NYSE, NASDAQ). Use an event-driven architecture with Kafka between them. The order service validates and persists; the execution service handles exchange communication.
- Market data streams at very high throughput — thousands of price updates per second for thousands of tickers. Use WebSocket to push real-time quotes to clients. Aggregate and sample: send updates every 100ms (not every tick) for most stocks. Real-time tick data for stocks the user has open.
- Order idempotency is critical. A network retry must NOT place a duplicate order. Use unique client-generated order IDs. Before placing an order, check if an order with that ID already exists. Store order IDs with the result for 24h.
- Portfolio balance and positions must be strongly consistent — users must see accurate buying power before placing orders. Use ACID transactions in PostgreSQL for balance changes. Double-entry bookkeeping: every trade debits cash and credits stock (or vice versa).
- Regulatory requirements: store ALL order events immutably for compliance (SEC, FINRA). Use an append-only log (Kafka with infinite retention or S3). Support full order audit trails — when was it placed, modified, filled, canceled, and by whom.
- Scale math: 10M active traders, peak market open generates ~50K orders/sec across all users. Market data: ~50K price updates/sec across all tickers. Portfolio queries: ~100K reads/sec (users refreshing). The market data fan-out to 10M WebSocket connections is the hardest scaling problem.
- Bottleneck: market data fan-out. If 10M users are watching 10 stocks each, and each stock updates 10x/sec, that's 1B WebSocket messages/sec. Solution: topic-based pub/sub — group users by their watchlist, push updates per-ticker channel. Only send updates for stocks the user is watching.
- What breaks: market open thundering herd. At 9:30 AM ET, millions of users simultaneously open the app, place orders, and request quotes. Pre-warm caches, pre-establish WebSocket connections, and use a queue to buffer order bursts. Fail gracefully — show stale prices rather than errors.
- Don't use: eventually consistent databases for account balances (regulatory violation). Don't allow margin trading without real-time balance checks (Robinhood's 2019 'infinite leverage' glitch). Don't skip circuit breakers — if the exchange API is slow, reject new orders rather than queue indefinitely.
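One way to sketch the per-ticker fan-out with conflation from the bottleneck hint: users subscribe per symbol, ticks overwrite the latest price, and a 100ms flush loop pushes only the most recent quote. The asyncio queues stand in for WebSocket connections.

```python
import asyncio
from collections import defaultdict

class QuoteFanout:
    """Per-ticker pub/sub with conflation: intermediate ticks inside an interval are dropped."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # symbol -> set of asyncio.Queue (one per socket)
        self.latest = {}                     # symbol -> most recent price

    def subscribe(self, symbol: str) -> asyncio.Queue:
        q = asyncio.Queue()
        self.subscribers[symbol].add(q)
        return q

    def on_tick(self, symbol: str, price: float) -> None:
        self.latest[symbol] = price          # overwrite; only the latest survives the interval

    async def flush_loop(self, interval: float = 0.1):
        while True:
            await asyncio.sleep(interval)
            for symbol, price in self.latest.items():
                for queue in self.subscribers.get(symbol, ()):
                    queue.put_nowait({"symbol": symbol, "price": price})
            self.latest.clear()

async def demo():
    fan = QuoteFanout()
    q = fan.subscribe("AAPL")
    asyncio.create_task(fan.flush_loop())
    for price in (189.10, 189.12, 189.15):   # three ticks inside one flush interval
        fan.on_tick("AAPL", price)
    await asyncio.sleep(0.15)
    print(q.get_nowait())                    # only the conflated latest quote arrives

asyncio.run(demo())
```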
Design a Search Autocomplete
Difficulty: Medium. Category: Search.
Design a search autocomplete system like Google's search suggestions. As users type, show the top 5-10 relevant suggestions in real-time. The system should handle billions of queries per day, support personalization, and update suggestions based on trending topics.
Key Topics
- Trie Data Structure
- Prefix Matching
- Elasticsearch
- Caching
- Ranking
Hints for Design a Search Autocomplete
- A Trie (prefix tree) is the classic data structure. Each node stores a character; paths from root represent prefixes. Store the top K suggestions at each trie node to avoid traversing the entire subtree on every query. This makes lookups O(length of prefix) instead of O(all matching words).
- Pre-compute top K suggestions for each prefix and cache them in Redis (key = prefix, value = sorted list). Update asynchronously every 15-30 minutes from a batch pipeline that processes search logs. Don't rebuild on every search.
- For personalization, layer user-specific suggestions on top of global results. Keep a small per-user cache (last 50 searches). Blend formula: 70% global trending + 30% personal history. Personal results always rank first.
- Latency is everything: suggestions must appear within 100-200ms. Use CDN edge caching for the top 10K most common prefixes — this handles ~80% of requests without hitting your backend. Only cache misses go to the trie service.
- Scale math: 5B searches/day, average 4 characters typed per query = 20B prefix lookups/day = ~230K lookups/sec. Redis can handle this on a few nodes, but CDN caching reduces backend load by 80%+.
- Bottleneck: updating the trie with fresh data. Don't rebuild on every search query. Collect search logs in Kafka → aggregate with Flink/MapReduce hourly → rebuild the trie offline → swap atomically (blue-green deploy of the trie). Users see updated suggestions within 1 hour.
- What breaks: long-tail prefixes that aren't in the pre-computed cache. For prefixes beyond 3-4 characters, fall back to Elasticsearch prefix + fuzzy queries. This handles typos and rare queries that the trie can't cover.
- Don't use: database LIKE queries for autocomplete — full table scans at scale. Don't trigger a backend call on every keystroke — debounce on client (200-300ms delay) to reduce requests by ~60%. Don't return 50 results — top 5-10 is the sweet spot.
- Content safety: filter offensive, NSFW, and dangerous suggestions. Maintain a blocklist. Suppress suggestions related to ongoing tragedies or sensitive events. Run an async content filter on all suggestion candidates before caching.
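A minimal sketch of the trie-with-precomputed-top-K lookup described in the hints above; the popularity scores, insert API, and K=3 in the example are illustrative assumptions.

```python
import heapq

class TrieNode:
    __slots__ = ("children", "top_k")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.top_k = []      # min-heap of (score, phrase), capped at K entries

class AutocompleteTrie:
    """Stores the top K suggestions at every prefix node, so a lookup is
    O(len(prefix)) with no subtree traversal."""

    def __init__(self, k: int = 5):
        self.root = TrieNode()
        self.k = k

    def insert(self, phrase: str, score: int) -> None:
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            # Keep only the K best-scored candidates at this prefix.
            heapq.heappush(node.top_k, (score, phrase))
            if len(node.top_k) > self.k:
                heapq.heappop(node.top_k)  # drop the lowest-scored entry

    def suggest(self, prefix: str) -> list[str]:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in sorted(node.top_k, reverse=True)]

trie = AutocompleteTrie(k=3)
for phrase, score in [("weather", 900), ("weather today", 700),
                      ("web design", 500), ("webpack", 300)]:
    trie.insert(phrase, score)
print(trie.suggest("we"))   # ['weather', 'weather today', 'web design']
print(trie.suggest("web"))  # ['web design', 'webpack']
```

In practice this structure is rebuilt offline from search logs and swapped in atomically, with the per-prefix results also pushed into Redis/CDN caches as described above.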
Design Strava
Difficulty: Medium. Category: Real-Time Systems.
Design a fitness tracking platform like Strava. Users can record GPS-based activities (running, cycling), view their routes on a map, compete on segments/leaderboards, and follow other athletes.
Key Topics
- GPS Tracking
- Activity Recording
- Segment Matching
- Leaderboard
- Social Feed
- Map Rendering
Hints for Design Strava
- Record GPS coordinates at regular intervals (every 1-5 seconds) on the client device. Batch upload the complete activity when finished — don't stream in real-time to save battery and bandwidth. A 1-hour run at 1Hz = 3,600 GPS points × 20 bytes each = ~72KB per activity.
- Store activities as polylines (ordered lists of lat/lng/timestamp/elevation points). Calculate derived metrics server-side: total distance (Haversine formula between consecutive points; see the sketch after this list), pace, elevation gain/loss, calories burned. Don't trust client calculations — they vary by device.
- Segments are predefined route sections (e.g., a popular hill climb). Match activities to segments using geospatial corridor matching: check if the activity polyline passes through the segment's start, end, and stays within a tolerance corridor (50m). This is a computational geometry problem.
- Leaderboard per segment: use Redis ZSET (score = time in seconds, member = userId). When an activity matches a segment, compute the elapsed time for that section and update the leaderboard. Support filters: all-time, this year, this month, age groups, weight classes.
- Social feed is a fan-out-on-write system. When a user completes an activity, push it to followers' feeds in Redis. Include activity summary (distance, time, map thumbnail). Detailed activity data is fetched on-demand when the user taps on it.
- Scale math: 100M users, 10M activities uploaded/day. Average activity = 3,600 GPS points = 72KB raw data + derived metrics. That's ~720GB/day of new activity data. Segment matching: 10M activities × 10 nearby segments each = 100M segment checks/day.
- Bottleneck: segment matching at upload time. Checking an activity against all segments globally is impossible (millions of segments). Solution: use geospatial indexing to find only segments near the activity's bounding box. Typically 5-20 candidate segments per activity.
- What breaks: GPS accuracy. Urban canyons (tall buildings), tunnels, and poor satellite coverage cause GPS drift — an activity might show you running through buildings. Fix: apply Kalman filtering to smooth GPS tracks. Flag impossible speeds (>40 km/h for a run) as GPS errors.
- Don't use: real-time streaming of GPS during recording (drains battery, wastes bandwidth). Don't compute leaderboard rankings on every query (pre-compute and cache). Don't store raw GPS in a relational DB (use a specialized format like GeoJSON or Protocol Buffers for compact storage).
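A minimal sketch of the server-side distance calculation from the derived-metrics hint, applying the Haversine formula between consecutive GPS points; the (lat, lng)-tuple point format and the sample coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(p1: tuple[float, float], p2: tuple[float, float]) -> float:
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    lat1, lng1 = map(math.radians, p1)
    lat2, lng2 = map(math.radians, p2)
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def total_distance_m(polyline: list[tuple[float, float]]) -> float:
    """Sum of segment distances along an ordered list of GPS points."""
    return sum(haversine_m(a, b) for a, b in zip(polyline, polyline[1:]))

# Illustrative three-point track (coordinates are arbitrary examples).
track = [(37.7694, -122.4862), (37.7701, -122.4830), (37.7712, -122.4799)]
print(f"{total_distance_m(track):.0f} m")
```

Elevation gain, pace, and GPS smoothing (Kalman filtering, impossible-speed checks) would run over the same point list in the same server-side job.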
Design Ticketmaster
Difficulty: Hard. Category: E-Commerce.
Design an event ticketing platform like Ticketmaster. Users can browse events, select seats, and purchase tickets. The system must handle extreme traffic spikes when popular events go on sale without overselling any seats.
Key Topics
- Inventory Management
- Seat Reservation
- Distributed Locking
- Queue Management
- High Concurrency
- Payment Integration
Hints for Design Ticketmaster
- The core challenge: tens of thousands of users trying to book the same seats simultaneously. The system must prevent overselling while remaining fast. This is fundamentally a distributed concurrency problem.
- Use temporary seat reservations with a TTL (e.g., 10 minutes). When a user selects seats, hold them with a timer (see the sketch after this list). If payment doesn't complete within the TTL, release the seats back to inventory. This prevents seats from being locked forever by abandoned checkouts.
- Virtual waiting room for high-demand events: when a Taylor Swift concert goes on sale, queue users randomly before letting them in. Admit in batches of 1,000-5,000. This converts a thundering herd into controlled traffic. Communicate queue position and estimated wait time.
- Seat locking: optimistic locking (compare-and-swap — 'book seat 42 IF current_status = available') is faster but can fail under extreme contention. Pessimistic locking (SELECT FOR UPDATE) guarantees success but serializes access. For hot events, use a combination: optimistic for general admission, pessimistic for specific seat selection.
- Separate read path from write path (CQRS). Read path (browse events, view seat maps) is served from cache — eventually consistent, heavily cached in CDN. Write path (book tickets) goes to the primary DB with strong consistency. 99% of traffic is reads.
- Scale math: a major concert has 80K seats. On-sale moment: 500K-2M concurrent users. ~80K successful transactions in 5-10 minutes. That's ~250 TPS for booking. The real bottleneck isn't TPS — it's contention on individual seats.
- Bottleneck: the seat inventory database. All booking writes contend for rows in the same table. Solutions: partition inventory by section/zone, use Redis for real-time seat availability checks (fast reads), and only hit the DB for the actual booking transaction.
- What breaks: the payment service is slow (3-5 seconds per charge) while seats are held. If the payment provider has an outage, seats get locked and expire, causing a cascade of retries. Fix: separate reservation from payment — confirm the booking immediately, charge asynchronously. If payment fails, cancel the booking with a compensation flow.
- Don't use: a single relational DB for everything — it will collapse under on-sale load. Don't allow unlimited concurrent seat selections — cap the number of users actively selecting seats. Don't show real-time exact seat availability to everyone (too expensive) — show approximate availability, validate at booking time.
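A minimal sketch of the TTL-based seat hold from the hints above, assuming a locally running Redis and the redis-py client; the key naming scheme is an illustrative assumption, while the 10-minute TTL matches the hint.

```python
import redis

HOLD_TTL_SECONDS = 600  # 10-minute hold, per the hint above

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def hold_seat(event_id: str, seat_id: str, user_id: str) -> bool:
    """Atomically place a hold: SET ... NX EX succeeds only if the seat is not
    already held. Redis expires the key automatically, releasing the seat if
    checkout is abandoned."""
    key = f"hold:{event_id}:{seat_id}"
    return bool(r.set(key, user_id, nx=True, ex=HOLD_TTL_SECONDS))

def release_seat(event_id: str, seat_id: str, user_id: str) -> bool:
    """Release a hold only if this user still owns it; check-and-delete runs
    atomically in a small Lua script so we never release someone else's hold."""
    script = """
    if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('DEL', KEYS[1])
    end
    return 0
    """
    return bool(r.eval(script, 1, f"hold:{event_id}:{seat_id}", user_id))

# Second caller fails to grab a seat that is already held.
print(hold_seat("concert-123", "sec-A-row-4-seat-42", "user-1"))  # True
print(hold_seat("concert-123", "sec-A-row-4-seat-42", "user-2"))  # False
```

The durable booking record still goes through the primary database transaction; the Redis hold only gates who is allowed to attempt that transaction.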
Design Tinder
Difficulty: Medium. Category: Social & Matching.
Design a location-based dating app like Tinder. Users create profiles, set preferences, and swipe through nearby candidates. When two users swipe right on each other, it's a match and they can chat.
Key Topics
- Geospatial Queries
- Recommendation Engine
- Swipe Queue
- Match Detection
- Real-Time Messaging
- Profile Ranking
Hints for Design Tinder
- Use geohashing to find users within a configurable radius (1-100 miles). Pre-compute a swipe queue of nearby candidates that match the user's age, gender, and distance preferences. Store the queue in Redis — when a user opens the app, the deck loads instantly.
- Recommendation engine factors: distance (closer = higher score), age preference match, mutual interests, historical swipe patterns (ELO-like scoring — if many people swipe right on you, you're shown to more people), profile completeness, and activity recency (active users first).
- Match detection: when User A swipes right on User B, check if B already swiped right on A. Store 'likes' as a set per user (Redis SET or a simple DB table). On right-swipe: SISMEMBER check (see the sketch after this list). If mutual → create the match and notify both users instantly via push notification.
- Pre-generate swipe stacks asynchronously (batch job every 30-60 min). When a user exhausts their current stack, replenish in the background. The key UX requirement: the card deck should NEVER be empty in an active area. Load next batch before current batch is exhausted.
- After a match, enable real-time chat via WebSocket. Store messages in a separate chat service (Cassandra, partitioned by match_id). Chat is a separate microservice from the matching/swiping service.
- Scale math: 75M MAU, 2B swipes/day = ~23K swipes/sec. Each swipe = a read (fetch next card) + a write (record the swipe). Match check = 1 Redis SISMEMBER. Profile images served via CDN — each profile has 6 photos × 3 sizes = 18 images.
- Bottleneck: the recommendation engine for dense cities. In Manhattan, there might be 500K eligible profiles within 10 miles. You can't score all 500K for every user. Solution: pre-filter by hard constraints (age, gender, distance), then score the top 1,000 candidates, cache the top 100 as the swipe stack.
- What breaks: stale swipe queues. User A's queue shows User B, but B already deleted their account or changed preferences. Fix: validate profiles in the swipe queue lazily — if a swiped profile is invalid, skip and fetch the next. Don't block the user experience for validation.
- Don't use: SQL LIKE queries for profile search. Don't show the same profile twice (track seen profiles per user in a bloom filter — memory efficient). Don't expose exact location (show 'X miles away', round to nearest mile). Don't let inactive accounts pollute the swipe queue — filter by last-active timestamp.
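A minimal sketch of the mutual-like check from the match-detection hint, using Redis SETs and SISMEMBER via redis-py against a local Redis; the key naming and match storage are illustrative assumptions.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def swipe_right(swiper_id: str, target_id: str) -> bool:
    """Record the like and return True if it completes a mutual match."""
    # Record that swiper liked target.
    r.sadd(f"likes:{swiper_id}", target_id)
    # A match exists only if the target had already liked the swiper.
    if r.sismember(f"likes:{target_id}", swiper_id):
        # Canonical match id so both users resolve to the same record.
        match_id = ":".join(sorted([swiper_id, target_id]))
        r.sadd("matches", match_id)
        # In production: enqueue push notifications to both users here.
        return True
    return False

print(swipe_right("alice", "bob"))  # False -- bob hasn't liked alice yet
print(swipe_right("bob", "alice"))  # True  -- mutual like, match created
```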
Design Uber
Difficulty: Hard. Category: Real-Time Systems.
Design a ride-sharing platform like Uber. Riders request a ride, the system matches them with nearby drivers, tracks the trip in real-time, calculates fares with surge pricing, and processes payments. Handle millions of concurrent trips.
Key Topics
- Geospatial Matching
- Supply-Demand Balancing
- ETA Service
- Trip Service
- Surge Pricing
- Driver Location Tracking
Hints for Design Uber
- Use a geospatial index (Google S2 cells, geohash, or quadtree) to match riders with nearby available drivers. S2 cells handle Earth's curvature accurately and support variable-resolution indexing. Geohash is simpler but has edge-case issues at cell boundaries.
- Driver location updates arrive every 3-5 seconds. Store in an in-memory grid (Redis) partitioned by geographic cells. Don't write every update to a persistent database — buffer and batch-write every 30 seconds for analytics. Live matching only needs the latest location.
- The matching algorithm combines: proximity (straight-line distance as initial filter), ETA (actual road distance via routing engine — much more accurate), driver rating, and vehicle type. Don't match the closest driver — match the fastest-arriving one.
- Surge pricing: monitor supply (available drivers) vs demand (ride requests) per geographic zone in real-time. When demand/supply ratio exceeds threshold, apply a multiplier (1.5x, 2x, etc.). Use Kafka for real-time supply/demand event streams and Flink for windowed aggregation.
- Trip lifecycle state machine: REQUEST → MATCHED → DRIVER_EN_ROUTE → ARRIVED → TRIP_STARTED → TRIP_COMPLETED → PAYMENT (see the sketch after this list). Each state transition is an event logged immutably. This enables replaying trips for disputes and analytics.
- Scale math: 20M rides/day globally, 5M concurrent drivers sending location updates every 4 seconds = 1.25M location updates/sec. Peak hours in NYC alone: ~50K concurrent trips. The matching service must respond in <3 seconds for acceptable UX.
- Bottleneck: the matching service during peak demand. Thousands of ride requests per second in a single city, each needing a geospatial query + ETA calculation. Solution: pre-compute a driver availability grid (refreshed every 5s), use it for fast initial matching, then refine with precise ETA.
- What breaks: ETA accuracy degrades with traffic. An ETA estimated at request time becomes stale by the time the driver is en route. Fix: continuously update ETA using real-time traffic data and driver's live location. Push updated ETAs to the rider every 10-15 seconds.
- Don't use: a relational database for real-time location storage — too slow for millions of writes/sec. Don't compute road-distance ETA for every driver in the city — filter by radius first (geospatial index), then compute ETA only for the nearest 10-20 candidates. Don't use polling for trip tracking — use WebSocket/SSE push.
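A minimal sketch of the trip lifecycle state machine from the hints above, enforcing only the listed transitions and appending each one to an event log; the in-memory list is a stand-in for the immutable log (Kafka/S3) mentioned earlier.

```python
from datetime import datetime, timezone

# Allowed transitions, taken from the lifecycle in the hint above.
TRANSITIONS = {
    "REQUEST": {"MATCHED"},
    "MATCHED": {"DRIVER_EN_ROUTE"},
    "DRIVER_EN_ROUTE": {"ARRIVED"},
    "ARRIVED": {"TRIP_STARTED"},
    "TRIP_STARTED": {"TRIP_COMPLETED"},
    "TRIP_COMPLETED": {"PAYMENT"},
    "PAYMENT": set(),
}

class Trip:
    def __init__(self, trip_id: str):
        self.trip_id = trip_id
        self.state = "REQUEST"
        self.events = []  # append-only; in production this is published to Kafka

    def transition(self, new_state: str) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.events.append({
            "trip_id": self.trip_id,
            "from": self.state,
            "to": new_state,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.state = new_state

trip = Trip("trip-42")
for s in ["MATCHED", "DRIVER_EN_ROUTE", "ARRIVED",
          "TRIP_STARTED", "TRIP_COMPLETED", "PAYMENT"]:
    trip.transition(s)
print(trip.state, len(trip.events))  # PAYMENT 6
```

A real system would also model cancellation and no-show branches; they are omitted here because the hint lists only the happy path.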
Design a URL Shortener
Difficulty: Easy. Category: System Design.
Design a service like bit.ly that takes long URLs and creates short, unique aliases. When users visit the short URL, they are redirected to the original. Consider scale (billions of URLs), latency (sub-100ms redirects), and analytics (click tracking).
Key Topics
- Hashing
- Base62 Encoding
- Database Design
- Caching
- Read-Heavy System
Hints for Design a URL Shortener
- Use base62 encoding (a-z, A-Z, 0-9) to generate short codes — 7 characters gives ~3.5 trillion unique URLs (see the sketch after this list). Avoid base64, since '+' and '/' are not URL-safe.
- Collision handling: check-and-retry works at low scale. At high scale, use a pre-generated unique ID service (Snowflake, Twitter's ID generator, or a counter with base62 conversion) to guarantee uniqueness without retries.
- Read-to-write ratio is extreme (~100:1 to 1000:1). Put Redis in front as a read-through cache for hot URLs. A single Redis instance handles ~100K reads/sec. LRU eviction works since popular URLs follow a power-law distribution.
- Database choice: a simple key-value store (DynamoDB, Cassandra) works well since you only do point lookups by short code. Avoid relational DBs unless you need analytics joins. Partition by short code hash for horizontal scaling.
- Scale math: 100M new URLs/day = ~1,200 writes/sec. 10B redirects/day = ~115K reads/sec. A single DB won't handle reads — cache layer is essential, not optional.
- Bottleneck: the ID generation service becomes a single point of failure. Solutions: run multiple ID generators with disjoint ID ranges, or use a hash-based approach (e.g., MD5 the long URL and take the first 7 base62 characters of the digest, retrying on the rare collision).
- For analytics (click counts, geo, referrer), don't update the main table on every redirect. Write click events to Kafka, aggregate with Flink/Spark into an analytics table. This keeps the redirect path fast (<10ms).
- What breaks at scale: cache stampede when a viral link's cache expires — millions of requests hit the DB simultaneously. Fix with cache locking (only one request fetches from DB, others wait) or probabilistic early expiration.
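A minimal sketch of base62-encoding a numeric ID (e.g., from a Snowflake-style generator) into a 7-character short code, as described in the first hint; the fixed-width padding is an illustrative choice.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars
BASE = len(ALPHABET)

def encode_base62(n: int, width: int = 7) -> str:
    """Convert a non-negative integer into a base62 short code, left-padded
    to a fixed width so every code is the same length."""
    if n == 0:
        return ALPHABET[0] * width
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars)).rjust(width, ALPHABET[0])

def decode_base62(code: str) -> int:
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n

code = encode_base62(125_000_000_000)
print(code, decode_base62(code) == 125_000_000_000)  # 7-char code, True

# Capacity of 7 base62 characters:
print(f"{62**7:,}")  # 3,521,614,606,208 (~3.5 trillion)
```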
Design a Web Crawler
Difficulty: Medium. Category: Data Processing.
Design a web crawler that can crawl billions of web pages. The system must discover new URLs, download pages efficiently, handle politeness constraints, avoid traps, and store content for indexing.
Key Topics
- URL Frontier
- Politeness Policy
- Deduplication
- DNS Resolution
- Content Parsing
- Distributed Architecture
Hints for Design a Web Crawler
- Use a URL frontier (priority queue) to manage which URLs to crawl next. Prioritize by: domain importance (PageRank), freshness requirements (news sites > static pages), and crawl depth (homepage > deep links). The frontier is the brain of the crawler.
- Enforce politeness: never crawl a single domain faster than its robots.txt allows. Use per-domain rate limiting (typically 1 request every 1-10 seconds). Each crawler worker is assigned specific domains via consistent hashing — this naturally enforces per-domain limits and simplifies dedup.
- Deduplicate URLs using a Bloom filter (fast, ~1% false positive rate, modest memory). At 1% false positives you need roughly 9.6 bits per URL, so 10B URLs fit in ~12GB of RAM (see the sizing sketch after this list). Also deduplicate content: compute SimHash of page content to detect mirror sites and duplicate content hosted on different URLs.
- Distribute work across hundreds of crawl workers. Partition by domain via consistent hashing: one worker per domain cluster. This simplifies politeness enforcement (one worker controls the rate for its domains) and enables local DNS caching per worker.
- Handle crawler traps: infinite URL spaces (calendar pages with every date, shopping sites with every filter combination). Mitigations: max URL depth limit (e.g., 15 levels), max URLs per domain per crawl cycle, URL normalization (remove tracking params, sort query params, lowercase), and detect repetitive URL patterns.
- Scale math: to crawl 10B pages in 1 month, you need ~3,850 pages/sec sustained. Each page is ~100KB = ~385MB/sec download bandwidth. DNS lookups: ~3,850/sec (cache aggressively — local DNS resolver per worker). Storage: 10B × 100KB = ~1PB raw HTML.
- Bottleneck: DNS resolution. At 3,850 pages/sec, each page needs a DNS lookup. Public DNS resolvers will rate-limit you. Solution: run local DNS resolvers on each crawler worker, cache DNS results aggressively (TTL of hours, not seconds), and pre-resolve DNS for queued URLs in batch.
- What breaks: a single slow domain can bottleneck a worker. If a worker is assigned to a domain that responds slowly (2-3 seconds per page), it wastes time waiting. Solution: async HTTP clients (make hundreds of concurrent requests), and redistribute unresponsive domains to a 'slow lane' worker pool.
- Don't use: a single-threaded synchronous crawler (way too slow). Don't ignore robots.txt (ethical and legal issues, plus you'll get IP-banned). Don't store raw HTML in a relational database (use object storage like S3 with an index in the DB). Don't recrawl unchanged pages — use HTTP ETag/If-Modified-Since headers to skip unchanged content.
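A small worked sizing calculation backing the Bloom filter estimate in the dedup hint, using the standard formulas m = -n·ln(p)/(ln 2)^2 bits and k = (m/n)·ln 2 hash functions; treat it as a sketch, not a capacity plan.

```python
import math

def bloom_filter_size(n_items: int, fp_rate: float) -> tuple[float, int]:
    """Return (gigabytes of RAM, number of hash functions) for an optimally
    sized Bloom filter holding n_items at the given false-positive rate."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)   # optimal bit count
    hashes = round((bits / n_items) * math.log(2))             # optimal hash count
    return bits / 8 / 1e9, hashes

gb, k = bloom_filter_size(10_000_000_000, 0.01)
print(f"10B URLs @ 1% false positives: ~{gb:.0f} GB of RAM, {k} hash functions")
# ~12 GB, 7 hash functions
```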
Design Yelp
Difficulty: Medium. Category: Search & Discovery.
Design a local business discovery and review platform like Yelp. Users can search for nearby businesses, read and write reviews, upload photos, and filter by category, rating, and distance.
Key Topics
- Geospatial Search
- Quadtree / Geohash
- Review System
- Search Ranking
- Photo Storage
- Business Profiles
Hints for Design Yelp
- Use a quadtree or geohash index to efficiently answer 'find restaurants near me within 5km' queries. A quadtree recursively divides 2D space into four quadrants. Geohash converts lat/lng to a string prefix for range queries. Both give O(log n) spatial lookups.
- Separate the search/discovery service from the review/rating service. Search needs fast reads with geospatial indexes (Elasticsearch with geo queries). Reviews need write consistency and anti-spam validation (PostgreSQL). Different scaling requirements.
- Aggregate ratings asynchronously. Don't recompute the average over all reviews on every new review — it locks the business row. Use an incremental running average: new_avg = old_avg + (new_rating - old_avg) / (total_reviews + 1), where total_reviews is the count before the new review (see the sketch after this list). Or batch-recompute with a Kafka consumer on new-review events.
- Search ranking combines: distance (closer = higher), rating (weighted by review count — a 4.5 with 500 reviews beats a 5.0 with 2 reviews), relevance to search query (text match on name, cuisine, keywords), and sponsored/promoted status. Use Elasticsearch for scoring.
- Store photos in object storage (S3) with CDN delivery. Generate multiple sizes on upload: 150px thumbnail (list view), 600px medium (business page), 1200px full (lightbox). Average business has 20-50 photos. Total: billions of photos.
- Scale math: 200M businesses, 50M reviews/year, 500M search queries/day = ~5,800 QPS. Geospatial queries are compute-heavy — each checks distance for thousands of nearby businesses. Cache popular queries (city + category combinations) in Redis.
- Bottleneck: geospatial index for dense urban areas. Manhattan has 50K+ restaurants. A query for 'pizza near Times Square' must filter, score, and rank thousands of candidates in <200ms. Solution: pre-computed geohash-based indexes with materialized search results for popular areas.
- What breaks: fake reviews. Spam accounts posting 5-star reviews for payment. Detection: flag accounts with unusual patterns (only 5-star reviews, all for new businesses, posted within minutes). Use ML models trained on known fake reviews. Rate-limit reviews per user per day.
- Don't use: database LIKE queries for text search (use Elasticsearch). Don't compute distances in the application layer for every business in the DB (use spatial indexes). Don't show raw average ratings — use Bayesian average to prevent manipulation by businesses with few reviews.
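A minimal sketch of the incremental rating update from the hints above, plus the Bayesian average mentioned in the last hint; the prior mean and prior weight values are illustrative assumptions.

```python
def update_running_average(old_avg: float, total_reviews: int, new_rating: float) -> float:
    """Incremental mean; total_reviews is the review count *before* new_rating."""
    return old_avg + (new_rating - old_avg) / (total_reviews + 1)

def bayesian_average(avg: float, count: int,
                     prior_mean: float = 3.7, prior_weight: int = 20) -> float:
    """Pull businesses with few reviews toward a site-wide prior so a 5.0 with
    2 reviews doesn't outrank a 4.5 with 500 reviews."""
    return (prior_weight * prior_mean + count * avg) / (prior_weight + count)

avg = update_running_average(old_avg=4.2, total_reviews=99, new_rating=5.0)
print(round(avg, 3))                        # 4.208
print(round(bayesian_average(5.0, 2), 2))   # 3.82 -- pulled hard toward the prior
print(round(bayesian_average(4.5, 500), 2)) # 4.47 -- barely moves with many reviews
```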
Design YouTube Top K
Difficulty: Medium. Category: Data Processing.
Design a system that computes the Top K most-viewed videos on YouTube in real-time. The system processes billions of view events per day and must return the current top K videos for various time windows (last hour, day, week).
Key Topics
- Count-Min Sketch
- Min-Heap
- Streaming Aggregation
- Time Windows
- MapReduce
- Approximate Counting
Hints for Design YouTube Top K
- Use a Count-Min Sketch (CMS) for approximate frequency counting — a probabilistic data structure whose memory depends only on the error bounds you choose (kilobytes to a few megabytes), not on how many distinct items flow through it. Much more memory-efficient than an exact hash map over billions of videos. Accept a small, bounded over-count (~1%) in exchange for massive memory savings.
- Maintain a min-heap of size K alongside the CMS. For each view event, increment the count in the CMS and check whether the new estimate exceeds the heap minimum. If it does, insert the video (evicting the minimum); if the video is already tracked, just update its count. The heap always contains the approximate top K items (see the sketch after this list).
- Define time windows: top K in the last hour, last day, last week. Use sliding or tumbling windows with separate counters per window. For hourly: maintain 60 one-minute buckets, sum the latest 60 for the hourly result. Rotate buckets every minute.
- For exact counts at scale, use a MapReduce/Spark batch pipeline: map (videoId, 1) → shuffle by videoId → reduce (sum) → sort → top K. Run hourly for final rankings. The real-time approximate path serves the dashboard; the batch path provides ground truth for billing/reporting.
- Hybrid approach: approximate real-time (CMS + heap, updated every second) for the user-facing dashboard, exact batch (Spark/Flink, runs hourly) for final official rankings. Show 'Updated X minutes ago' on the dashboard. Exact and approximate agree within 1-2% for truly popular videos.
- Scale math: 10B views/day = ~115K view events/sec. Each event is ~50 bytes (videoId, timestamp, userId). That's ~5.7MB/sec ingestion. The CMS for tracking 100M unique videos uses ~10MB RAM. The min-heap for K=1000 uses negligible memory.
- Bottleneck: the ingestion pipeline. 115K events/sec must be processed, counted, and the top-K updated. A single-node CMS can handle this, but for fault tolerance, shard by videoId hash across multiple counter nodes. Periodically merge partial top-K results from each shard.
- What breaks: coordinating top-K across shards. If events are partitioned by videoId, each video's counts live entirely on one shard and the global top-K is a straightforward merge of the per-shard top-Ks. If events are partitioned any other way (randomly, by region), a video's counts are split: a video that's #50 on shard A and #50 on shard B might be #1 globally. Solution: shard by videoId where possible, and have each shard report its top-K (with counts) to a coordinator that merges and recomputes the global top-K every 5-10 seconds.
- Don't use: exact counting with a hash map for billions of videos (requires 100s of GB RAM). Don't use a database counter updated on every view (115K writes/sec to a single row). Don't compute top-K by sorting all videos — use a heap (O(n log K) instead of O(n log n)).
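A minimal sketch of the Count-Min Sketch plus top-K tracking from the first two hints; the width/depth parameters and the blake2b-based row hashing are illustrative, and for readability the heap eviction is done with a dict plus heapq.nlargest rather than a hand-maintained min-heap.

```python
import hashlib
import heapq

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item: str):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item: str) -> int:
        for row, col in self._indexes(item):
            self.table[row][col] += 1
        return self.estimate(item)

    def estimate(self, item: str) -> int:
        # True count <= estimate; over-counts come only from hash collisions.
        return min(self.table[row][col] for row, col in self._indexes(item))

class TopK:
    def __init__(self, k: int):
        self.k = k
        self.cms = CountMinSketch()
        self.counts = {}  # best current estimates for items believed to be in the top K

    def record_view(self, video_id: str) -> None:
        est = self.cms.add(video_id)
        if video_id in self.counts or len(self.counts) < self.k:
            self.counts[video_id] = est
        else:
            weakest = min(self.counts, key=self.counts.get)
            if est > self.counts[weakest]:
                del self.counts[weakest]        # evict the current minimum
                self.counts[video_id] = est

    def top_k(self):
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])

topk = TopK(k=2)
for vid in ["a", "b", "a", "c", "a", "b", "b", "b"]:
    topk.record_view(vid)
print(topk.top_k())  # [('b', 4), ('a', 3)] -- approximate counts
```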
Design YouTube
Difficulty: Hard. Category: Social & Media.
Design a video sharing and streaming platform like YouTube. Users upload videos, the system transcodes and stores them, and viewers can stream in adaptive quality. Support recommendations, comments, subscriptions, and billions of daily views.
Key Topics
- Video Upload Pipeline
- Transcoding
- Adaptive Bitrate Streaming
- CDN
- Recommendation Engine
- View Count Service
Hints for Design YouTube
- Video upload pipeline: client uploads to a staging service → store raw video in object storage (S3) → enqueue in Kafka for processing → transcode to multiple resolutions (360p, 720p, 1080p, 4K) and codecs (H.264, VP9, AV1). Each resolution × codec = a separate file. A 10-minute video might produce 12+ variants.
- Use adaptive bitrate streaming (HLS or DASH). Break each transcoded video into small segments (2-10 seconds). Client requests a manifest file listing all quality levels, then dynamically switches between them based on network bandwidth. This is why YouTube quality changes mid-video.
- Serve videos via CDN with edge caching. Top 10% of videos get ~90% of views (power law). Cache popular videos at edge PoPs worldwide. Long-tail content is served from regional origin servers. CDN cost is the #1 expense for a video platform.
- Recommendation engine: watch history (what you've watched), collaborative filtering (users similar to you watched X), content similarity (video embeddings from titles, thumbnails, transcripts), and engagement signals (click-through rate, watch time, completion rate). Watch time is the most important signal.
- View counts don't need real-time accuracy. A 5-minute delay is fine. Ingest view events via Kafka → aggregate with Flink in 1-minute tumbling windows → write to a counter service. Show approximate counts. Exact counts are reconciled in a daily batch job.
- Scale math: 500 hours of video uploaded per minute. Average video: 10 min × 12 variants × ~100MB each = ~1.2GB per video. That's ~3.6TB of new transcoded content per minute, or roughly 216TB/hour (see the worked math after this list). CDN serves 1B+ hours of video daily. Bandwidth: ~1 petabit/sec aggregate.
- Bottleneck: transcoding. It's CPU-intensive (1 hour of video takes 3-5 hours of CPU time for all variants). Use a large pool of GPU-accelerated transcoding workers. Auto-scale based on queue depth. Prioritize short videos and popular creators.
- What breaks: viral videos cause CDN cache misses at edge PoPs that haven't cached the video yet. All requests fall through to the origin, overwhelming it. Fix: pre-warm CDN edges when a video starts trending (monitor view velocity, proactively push to edges).
- Don't use: a single server to serve video files — impossible at scale. Don't transcode synchronously during upload — it takes too long. Don't store videos in a database — use object storage (S3/GCS). Don't skip adaptive bitrate — fixed quality means buffering on slow connections and wasted bandwidth on fast ones.
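A small worked calculation showing where the transcoded-volume number in the scale-math hint comes from, using only the figures stated there.

```python
UPLOAD_HOURS_PER_MINUTE = 500   # hours of raw video uploaded each minute
AVG_VIDEO_MINUTES = 10          # average video length
VARIANTS = 12                   # resolutions x codecs per video
MB_PER_VARIANT = 100            # ~100MB per 10-minute variant

videos_per_minute = UPLOAD_HOURS_PER_MINUTE * 60 / AVG_VIDEO_MINUTES  # 3,000 videos/min
gb_per_video = VARIANTS * MB_PER_VARIANT / 1_000                      # ~1.2 GB per video
tb_per_minute = videos_per_minute * gb_per_video / 1_000              # ~3.6 TB/min

print(f"{videos_per_minute:,.0f} videos/min -> {tb_per_minute:.1f} TB transcoded per minute")
print(f"~{tb_per_minute * 60:,.0f} TB/hour, ~{tb_per_minute * 60 * 24 / 1_000:,.1f} PB/day")
# 3,000 videos/min -> 3.6 TB transcoded per minute
# ~216 TB/hour, ~5.2 PB/day
```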