Design YouTube Top K
Design a system that computes the Top K most-viewed videos on YouTube in real-time. The system processes billions of view events per day and must return the current top K videos for various time windows (last hour, day, week).
Key Topics
Interview Cheat Sheet
60s skim · 3min careful readKafka is the only source of truth, everything downstream is derived state. Flink with RocksDB keyed state holds exact per-video counts across 4 windows (10-min sliding, daily and monthly tumbling, lifetime), a min-heap of size K=100 inside Flink selects top-K every 30 seconds. Dual sink: Redis via ZADD overwrite (idempotent display board), PostgreSQL via 2PC (durable). The load-bearing constraint is exactly-once end-to-end, which is why ZADD beats ZINCRBY.
- Ingest · 81K view events/sec sustained · Kafka source of truth
Every video view emits an event. The API attaches the server timestamp (client clocks lie), and the event lands in Kafka. Kafka is the single source of truth; everything downstream is derived state and can be rebuilt from Kafka.
View eventAPI attaches server_timestampidempotent Kafka producer with sequence numberKafka view_events topic (128 partitions, partitioned by video_id, RF=3, acks=all)consumers downstream are derived state, never the source of truth - Aggregate · Flink + RocksDB · min-heap top-K · emit every 30s
Flink reads the Kafka stream and keeps exact view counts per video in RocksDB. A min-heap of size K=100 picks the top videos per window without sorting the full set. Every 30 seconds it emits the latest top-K to two sinks (Redis for fast reads, Postgres for durability).
KafkaFlink keyBy(video_id) with BoundedOutOfOrderness 30s watermark + allowedLateness 5minRocksDB keyed state holds exact per-video counts across 4 windows (10-min sliding, daily tumbling, monthly tumbling, lifetime)min-heap of size K=100 inside Flink (O(log K) per update, not O(N log N) sort)emit every 30sdual sink: Redis ZADD (idempotent overwrite, NOT ZINCRBY) + Postgres 2PC sink (durable)incremental checkpoint to S3 every 60s - Read · 3-tier fallback · CDN absorbs 99%
Reads never hard-fail. The CDN serves most requests at the edge for 10 seconds at a time. Misses hit Redis with a tight 5ms timeout. Redis trouble falls back to Postgres, and Postgres trouble falls back to stale cache.
ClientCDN edge cache (10s TTL, absorbs >99% of read traffic)cache missRead APIRedis ZRANGE with 5ms timeout (display board, idempotent ZADD-written)if Redis slow/down: Postgres with 50ms timeout (durable, 2PC-written by Flink)if Postgres slow/down: stale cache last-known-goodnever hard-fail
- •7B view events/day, ~81K events/sec sustained, ~810K peak, ~2M viral spike
- •K=100 across 4 window types, emit every 30 seconds
- •Flink state: ~50M unique videos x ~200 bytes = ~10 GB in RocksDB
- •Incremental checkpoint to S3 every 60s, ~2-5 GB per checkpoint
- •Min-heap update: O(log K) = ~7 ops vs O(N log N) = ~1.3B ops for sort
- •Watermarks: BoundedOutOfOrderness 30s, allowedLateness 5min
- •Read fallback timeouts: Redis 5ms, Postgres 50ms, then stale cache
- •ZADD overwrite, not ZINCRBY (Flink is the counter, Redis is a display board)
- •Min-heap inside Flink, not Redis-side sorting (O(log K) vs O(N log N))
- •Exact counts via Flink keyed state, not Count-Min Sketch (50M keys fits in 10 GB)
- •server_timestamp for watermarks, event_timestamp for windowing (client clocks lie)
- •2PC sink to Postgres, idempotent ZADD to Redis (different guarantees per sink)
- •4 separate window branches, not one window with parameterized size
- •Exactly-once is the load-bearing constraint, every choice ties back to it
- •Multi-dim top-K cost is combinatorial, ship only dimensions the product needs
- •Ingest tier-1, freshness tier-2 (30-60s acceptable), durability tier-1 via 2PC
- ZADD over ZINCRBY (the load-bearing exactly-once choice)
ZINCRBY makes Redis the counter, so on Flink restart the replayed Kafka events would increment Redis again and double-count. ZADD makes Flink the counter and Redis a display board: replay just overwrites with the same value. Idempotent by construction. This is the single most common bug in streaming systems; the fix isn't trying harder to dedup, it's redesigning so the sink is idempotent. Same effectively-once pattern as job-scheduler fencing tokens and notification-system Bloom filter: make duplicates harmless rather than impossible.
- Min-heap inside Flink (not Redis-side sorting)
K=100 across 50M unique videos. Redis-side ZRANGE 0 99 sorts the full sorted set, O(N log N) per query, ~1.3B ops for 50M keys, untenable at 30-second emission cadence. Min-heap of size K maintained inside Flink: each update is O(log K) = ~7 ops, push-if-larger-than-min pop-if-over-size. Flink computes the actual top-K and writes only those 100 to Redis. Redis stays a tiny display board (100 ZADDs per emission) instead of a 50M-element sorted set.
- Exact counts over Count-Min Sketch (cardinality earns it)
50M unique videos × ~200 bytes each = ~10 GB in Flink RocksDB state, which fits cleanly in checkpointed local state. We get exactness AND per-key drill-down for free. If the key space were billions of IPs (web crawler, ad fraud) we'd switch to HeavyKeeper or CMS and accept ~1% error. Approximate algorithms save memory but are harder to debug (overcounts, no per-key drill-down, harder to verify in production). At our cardinality, the trade clearly goes to exact.
- Exactly-once end-to-end (the chain hop by hop)
Producer→Kafka: idempotent producer with sequence numbers per partition dedups retries. Kafka→Flink: offsets commit only on checkpoint success, bundled atomically with RocksDB state. Flink→Redis: at-least-once but idempotent because ZADD overwrites. Flink→Postgres: 2PC sink, pre-commit on checkpoint barrier, commit on next checkpoint. End-to-end is effectively exactly-once. Watermark on server_timestamp (not event_timestamp, which is client clock) prevents future-dated events from breaking window closure. Ingest tier-1, freshness tier-2 (30-60s acceptable), durability tier-1 via 2PC.
40-Minute Interview Playbook
Each phase is what the interviewer expects you to do and say. Concrete steps, not topic hints. The diagrams are what you sketch on the board.
- 15 min
Clarify Requirements and Scale
GoalPin down exact-vs-approximate, the window taxonomy, and the exactly-once expectation before you draw a single box. This is a correctness problem disguised as a throughput problem, and the interviewer needs to see you frame it that way.
Do & Say- ASK·1ASK: Are we computing exact top-K or approximate? Approximate (Count-Min Sketch, HeavyKeeper) is cheaper but undercounts. Exact requires per-key state in Flink, which is fine for 50M unique videos but breaks for top-K over billions of IPs. Land on exact because video counts are bounded.
- SAY·2Pin the scale: 7B view events/day = ~81K events/sec sustained, ~810K peak (10x), ~2M during a viral moment. K=100 across 4 window types: 10-min sliding, daily tumbling, monthly tumbling, lifetime counter. Write 81K avg, 2M spike, K=100 on the board.
- SAY·3Pin the freshness expectation: Emit top-K every 30 seconds. End-user sees top-K with 30-60s staleness, which is fine for a leaderboard. Write emit every 30s next to the QPS.
- SAY·4Pin the correctness requirement: Exactly-once end-to-end. A view event must affect the count once and only once, even through Flink restarts and Kafka rebalances. Write exactly-once under everything else. This is the load-bearing constraint.
- SAY·5Scope in: ingestion, dedup, windowed aggregation, top-K selection, serving with fallback. Out: spam/bot filtering (assume upstream), recommendation engine, monetization counts. Wait for confirmation.
Interviewer is grading: You ask exact-vs-approximate before drawing anything. You bring up exactly-once as the load-bearing constraint, not an afterthought. You name the 4 window types explicitly (sliding vs tumbling vs lifetime) and don't conflate them.
- 25 min
API and Data Model
GoalOne write endpoint (event ingest), one read endpoint (get top-K for a window). Justify Kafka as source of truth, Flink keyed state as the only place exact per-video counts live, Redis sorted set as a display board not a counter.
Do & Say- WRITE·1Write the ingest endpoint: POST /v1/events {video_id, user_id, event_timestamp, watch_duration_sec, session_id, region} returns a 202 Accepted after a synchronous Kafka produce with acks=all. Latency target sub-50ms p99 at the API.
- WRITE·2Write the read endpoint: GET /v1/topk?window=10min®ion=US&k=100 returns the current top-K with a freshness_ms field. CDN-cached for 10 seconds with stale-while-revalidate=30, so most traffic never hits the API.
- DRAW·3Sketch the Avro event schema (~150 bytes serialized): event_id, video_id, user_id, event_timestamp (client clock), server_timestamp (API clock, used for watermarks), region, category, device, watch_duration_sec, session_id. Two timestamps because client clocks are skewed and watermarks need a reliable one.
- SAY·4Justify Kafka as source of truth: 128 partitions on video.watches topic, RF=3, acks=all, partition key = video_id. Co-partitioning means Flink can keyBy(video_id) without repartitioning. 7 days hot retention, 30 days tiered to S3. Everything downstream is derived state. If Flink dies we replay.
- DRAW·5Sketch the Redis structure: A sorted set per (window, dimension) at key video:topk:{window}:{dimension}:current. Members are video_ids, scores are exact counts. Flink writes via ZADD (overwrite), never ZINCRBY. Redis is a display board, not a counter. The counter lives in Flink RocksDB.
Interviewer is grading: You make Kafka the only source of truth and treat Redis as a derived cache. You insist on ZADD (overwrite) not ZINCRBY (increment) and can explain why. You volunteer the dual timestamp (event_timestamp + server_timestamp) for watermarks before being asked.
- 310 min
High-Level Design (draw the streaming pipeline)
GoalOne diagram showing the Kafka -> Flink -> dual sink architecture, with the 4 parallel window branches and the 3-tier read fallback. The exactly-once chain should be visible in the arrows.
Draw on the boardDo & Say- DRAW·1Draw the ingest path. API Gateway uses Kafka idempotent producer. Settings: acks=all, retries=MAX_INT, max.in.flight.requests=5 with idempotence on so retries don't duplicate. Partition by video_id, 128 partitions across 12 brokers. ~15K events/sec per partition at viral spike, well within the single-thread ~50K/sec ceiling.
- DRAW·2Draw the Flink topology. Kafka source with checkpointed offsets (60-second incremental checkpoints to S3). Dedup operator keyed by event_id with 60-second state TTL catches producer retries that slipped through. Then keyBy(video_id), a no-op because the topic is already partitioned by video_id.
- DRAW·3Draw the four window branches: 10-min sliding window slides every 30s with BoundedOutOfOrderness 30s and allowedLateness 5 min, daily and monthly tumbling with continuous-trigger every 30s so live count is visible mid-window, lifetime is a per-video atomic counter in RocksDB.
- SAY·4Defend per-video counts: Flink keyed state stores 50M (video_id, count) pairs per window, 200 bytes each = 10 GB total in RocksDB, incremental checkpoints 2-5 GB. Top-K is recomputed every 30s via a min-heap of size K. Count-Min Sketch saves memory but at 50M keys we don't need to approximate.
- SAY·5Defend min-heap over sorting: O(log K) per update vs O(N log N) per emission. For K=100 and N=50M, that's 7 ops vs 1.3 billion. Heap runs in Flink, selects top-K from exact counts. Flink ZADD-overwrites Redis with the K=100 entries. Redis never runs the heap.
- DRAW·6Draw the dual sink: Flink writes to Redis via ZADD (at-least-once is fine because overwrite is idempotent), Flink writes to Postgres via 2PC sink: pre-commit on barrier, commit on checkpoint success. Redis is the fast path, Postgres is the durable record and the API fallback.
- DRAW·7Draw the 3-tier read fallback: Read API hits Redis with 5ms timeout, Redis slow → fall through to Postgres at 50ms, Postgres down → serve stale cached response or 503. Never hard-fail. CDN with 10s max-age plus stale-while-revalidate=30s absorbs 99%+ of traffic anyway.
Interviewer is grading: You volunteer the exactly-once chain (idempotent producer + checkpointed offsets + ZADD + 2PC) without being asked. You draw all 4 windows in parallel branches, not as a single window with different sizes. You insist on min-heap inside Flink, not sorting in Redis. You volunteer the 3-tier read fallback.
- 415 min
Deep Dive: Exactly-Once, Watermarks, Failure Recovery
GoalThree sub-dives. First, exactly-once end-to-end with each hop's guarantee. Second, watermarks and late-event handling with the BoundedOutOfOrderness logic. Third, failure recovery: Flink crash, Redis crash, Kafka rebalance.
Draw on the boardDo & Say- SAY·1Pivot to the exactly-once chain. Producer to Kafka uses idempotent producer (enable.idempotence=true). Per-partition sequence numbers stop retry duplicates. Kafka to Flink: Flink commits offsets only on checkpoint success, bundling offsets with state atomically. A Flink crash replays from last checkpoint, every event processed exactly once.
- SAY·2Continue: Flink to Redis is technically at-least-once but ZADD is idempotent (overwrite, not increment), so re-emitting the same top-K is safe. Flink to Postgres uses a 2PC sink: pre-commit writes the row uncommitted, commit fires on checkpoint success. Crash between pre-commit and commit rolls back. End-to-end is effectively exactly-once.
- SAY·3Cover the ZADD-not-ZINCRBY decision: ZINCRBY makes Redis the counter. On Flink restart, replayed events would increment again. ZADD makes Flink the counter. Replay just rewrites the same value. Idempotent. This is the single most common bug in streaming systems. Make Redis a display board, never a counter.
- SAY·4Pivot to watermarks: Events arrive late. A view at 12:00:00 might reach Kafka at 12:00:45 due to mobile network delay. If the 10-min window [11:50:00, 12:00:00] has already closed, the event is too late.
- SAY·5Defend BoundedOutOfOrderness 30s: Flink waits 30s past latest event time before closing a window. 99% arrive within 30s per real network data. allowedLateness=5min recomputes the window for late events. sideOutputLateData ships >5min events to a monitoring sink so we alarm on bot replays.
- SAY·6Defend server_timestamp over event_timestamp for watermarks: event_timestamp comes from the client clock, can be in the future, past, or jump. server_timestamp is when our API received the event, reliable. event_timestamp for windowing semantics, server_timestamp for watermarks. The 30s OutOfOrderness is on server_timestamp.
- SAY·7Defend withIdleness 1min: If one Kafka partition has no traffic for 60 seconds (low-traffic region at 3 AM), Flink advances the watermark anyway. Otherwise the whole pipeline stalls waiting for that partition.
- SAY·8Failure recovery. Flink crash: TaskManager dies. JobManager detects within 30s, restarts from last S3 checkpoint (~30s including RocksDB warmup). Kafka source resumes from checkpointed offset, events replay. ZADD re-overwrites Redis. 2PC to Postgres rolls back uncommitted txns, replay produces the same commit. Zero data loss.
- SAY·9Redis crash: Redis Cluster failover within 15s. During the gap, Read API falls back to Postgres (slower but durable). Flink keeps writing top-K to both sinks. New primary up, next 30s emission repopulates fully. Net effect: degraded p99 for ~15-45s, no data loss.
- SAY·10Kafka broker fail: One broker dies. RF=3, leader election promotes a healthy replica within seconds. Producers buffer up to 5s during failover. Events land at new leader after reconnect. No loss since acks=all and min.insync.replicas=2 means writes already had 2 replicas.
- SAY·11Cover the dedup caveat: Producer retries can still happen if the network blips between client and our API Gateway before idempotence kicks in. Flink has a Dedup operator keyed by event_id with 60-second state TTL. Catches retries within 60s, which covers the realistic retry window.
Interviewer is grading: You walk the exactly-once chain hop-by-hop, naming the guarantee at each link (idempotent producer, checkpointed offsets, ZADD overwrite, 2PC). You volunteer server_timestamp over event_timestamp for watermarks. You walk through all three failure modes (Flink, Redis, Kafka) with concrete recovery times in seconds.
- 55 min
Trade-offs, Risks, and Wrap-up
GoalName three deliberate trade-offs (exact counts via per-key state, 30-second emit cadence, exactly-once checkpoint pause), give the multi-dimensional top-K cost, close in one breath.
Do & Say- SAY·1Exact counts trade-off: We pick exact since 50M videos fit in 10 GB of Flink RocksDB state. Billions of IPs would force HeavyKeeper or Count-Min Sketch at ~1% error. Approximate is cheaper but harder to reason about (overcounts, no drill-down). 10 GB buys exactness and debuggability.
- SAY·230s emit cadence trade-off: User sees top-K with 30-60s staleness, not real-time. We accept this because emit-on-every-event means 81K Redis writes/sec instead of 3.3. CDN (10s max-age plus stale-while-revalidate=30s) absorbs 99% of traffic and tolerates the staleness.
- SAY·3Exactly-once checkpoint pause: Exactly-once mode pauses processing briefly during barrier alignment, 100-500ms per 60s checkpoint. Negligible for a 30s emit cycle. At-least-once skips barrier alignment for a small latency win but forces us to handle duplicates downstream. Not worth it.
- SAY·4Multi-dimensional cost: Per-region or per-category top-K adds a Flink branch keyed by region+video_id or category+video_id. Each dimension doubles state. 4 regions × 10 categories = 40x state. Limit to what the product actually ships. Avoid region × category × device = 200 branches.
- SAY·5Close: Kafka as source of truth, Flink RocksDB for exact per-video counts across 4 windows, min-heap K=100 emits every 30s, dual sink Redis (ZADD) plus Postgres (2PC). Exactly-once via idempotent producer + checkpointed offsets + idempotent sinks. 3-tier read Redis → Postgres → stale cache. Risk: late events past 30s.
Interviewer is grading: You name trade-offs as 'we accept X because Y'. You volunteer the multi-dimensional cost warning (don't multiply dimensions blindly). You close in one breath naming Kafka source-of-truth, Flink keyed state, min-heap, dual sink, exactly-once, 3-tier fallback, and the load-bearing risk.
Interview Grading by Level
What an interviewer at each level expects to see in your answer. Use this to calibrate, not to perform.
Mid-Level Engineer (L4 / SDE-II)
Reaches a working pipeline that counts views and emits a top list, but the correctness story (exactly-once, ZADD vs ZINCRBY, watermarks) is loose.
- Picks Kafka for ingestion and a stream processor (Flink or Spark) for aggregation.
- Knows to use Redis sorted sets for fast top-K serving.
- Adds a database for durable counts behind the cache.
- Knows windows exist (mentions 'tumbling' or 'sliding' but doesn't distinguish them).
- CDN-caches the read endpoint.
- Uses ZINCRBY (increment) on Redis instead of ZADD (overwrite), introduces double-counting on Flink restart.
- No mention of exactly-once semantics or how Kafka offsets are committed.
- Conflates the 4 window types or only handles one (usually 'daily').
- Doesn't bring up watermarks or late events, so windows close too early or never.
- Doesn't separate the min-heap (selection) from the count storage.
Senior Engineer (L5 / SDE-III)
Drives the Kafka-source-of-truth + Flink-keyed-state model, names the 4 windows explicitly, picks ZADD over ZINCRBY, and outlines the exactly-once chain end-to-end.
- Names Kafka as the source of truth and every downstream store as derived, within the first 5 minutes.
- Picks Flink (not Spark Streaming) and gives a one-line reason (true streaming, lower latency, RocksDB state).
- Insists on ZADD overwrite from Flink to Redis, never ZINCRBY, and can explain why (idempotency on restart).
- Names the 4 window types explicitly: 10-min sliding, daily tumbling, monthly tumbling, lifetime counter, with a separate Flink branch for each.
- Picks min-heap of size K inside Flink, not Redis-side sorting, with O(log K) per update vs O(N log N) per emission.
- Brings up exactly-once with the chain: idempotent producer -> checkpointed offsets -> ZADD idempotent sink -> 2PC Postgres sink.
- Brings up watermarks with BoundedOutOfOrderness and allowedLateness, distinguishes event_timestamp from server_timestamp.
- Brings up the 3-tier read fallback (Redis -> Postgres -> stale cache) only when asked about Redis failure.
- Doesn't volunteer the multi-dimensional top-K cost warning unless probed.
- Quantifies failure modes in words but not in numbers (e.g. 'Flink restart is fast' without 'about 30s including RocksDB warmup').
Staff+ Engineer (L6+)
Owns the room, treats exactly-once as the load-bearing constraint from the first minute, frames every trade-off as 'we accept X because Y', and brings operational depth (recovery times in seconds, multi-dim cost warning).
- Volunteers exactly-once as the load-bearing constraint within the first 5 minutes and ties every downstream choice (ZADD, 2PC, idempotent producer) back to it.
- Brings up server_timestamp vs event_timestamp for watermarks unprompted, with the client-clock-skew failure mode explicit.
- Sets explicit SLO tiers: ingestion is tier-1 (sub-50ms p99 on API), top-K freshness is tier-2 (30-60s staleness acceptable), exact-count durability is tier-1 (Postgres 2PC).
- Volunteers the multi-dimensional top-K cost warning ('region x category x device x window = combinatorial explosion, ship only the dimensions the product needs') without being asked.
- Pushes back on requirements where appropriate: 'Do we really need 100% exact counts, or is HeavyKeeper with 99% accuracy acceptable for a leaderboard? Exact gives debuggability, approximate saves 5x memory.' Picks exact with a clear reason.
- Names operational rehearsal of the Flink restore drill (30s S3 checkpoint, Kafka replay, sink idempotency confirms via integration test) as part of the design.
- Closes with a one-breath summary covering Kafka source-of-truth, Flink keyed state, min-heap, dual sink, exactly-once chain, 3-tier fallback, the load-bearing risk (late events stretching beyond 30s), and offers the deeper dives (2PC internals, multi-dim cost, dedup TTL) explicitly.
Common Follow-up Questions
click to expandQuestions an interviewer is likely to ask after your walkthrough. Rehearse the short answer.