Design YouTube
Design a video sharing and streaming platform like YouTube. Users upload videos, the system transcodes and stores them, and viewers can stream in adaptive quality. Support recommendations, comments, subscriptions, and billions of daily views.
Interview Cheat Sheet
60s skim · 3min careful read
Three independent paths sharing infra. Upload uses a Temporal-orchestrated DAG of ~1,800 tiny encoding tasks per 10-min video (4-sec segments x 12 variants) on a 5,000-GPU spot fleet, so a video is playable in 3-5 minutes wall-clock instead of hours. Playback runs multi-CDN with origin shield request coalescing for viral hits. Recommendation is two-stage retrieval-then-rank at sub-200ms p99. View counts split approximate (Flink to Valkey) from exact (daily ClickHouse reconcile).
- Upload + transcode · 50 videos/sec · 3-5 min playable
The client uploads directly to S3 using a presigned URL, so the API never touches the bytes. A Temporal workflow then splits the raw video into 4-second segments and fans them out to a 5,000-GPU spot fleet, where each tiny task encodes one segment into one of the 12 variants. The video becomes playable as early segments finish.
Client → presigned S3 PUT (direct upload, bypasses API) → Kafka media.uploaded → Temporal workflow → split raw into 4-sec segments (~150 segments per 10-min video) → fan out 150 × 12 variants = 1,800 tiny encode tasks to 5,000-GPU T4 spot fleet → each task ~2s on a T4 → write encoded segments back to S3 → CDN warming → playable in 3-5 min (progressive availability as earlier segments finish)
- Playback · multi-CDN · origin shield coalesces viral hits
The client fetches a manifest, then pulls 2-10 second segments adaptively based on bandwidth. Most segments come from CDN edges. For viral videos where edges all miss, an origin shield in front of S3 coalesces concurrent fetches so 1,000 edges asking for the same segment fire one S3 read, not 1,000.
Client → DNS GSLB picks healthy CDN (multi-CDN failover in ~10s on regional outage) → fetch HLS/DASH manifest → adaptive bitrate: client picks variant by bandwidth → CDN edge serves segment (95%+ hit) → cache miss: origin shield with request coalescing (1,000 edges → 1 S3 fetch, 1000x origin reduction) → S3 origin · proactive warming: view-velocity monitor pushes first 2 min of trending videos to edges
- View counts · approximate UI + exact daily reconcile
View heartbeats fire at 1.39M events/sec. Real-time exact counting would need cross-region coordination, expensive for a metric where 0.1% drift is invisible. Flink does fast approximate aggregation for the UI; a daily ClickHouse batch reconciles exact counts for creator payouts.
Player → view heartbeat every 30s (1.39M events/sec) → Kafka view.heartbeats → Flink 1-min tumbling windows → ZADD into Valkey (UI displays approximate within about a minute) → daily ClickHouse batch over full Kafka archive → exact counts → reconcile drift → write to PostgreSQL creator_payouts (payouts run on exact, never on approximate)
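The approximate-then-exact split can be illustrated without Flink or ClickHouse. The following is a minimal in-memory sketch, assuming heartbeats arrive as hypothetical `(timestamp_sec, video_id)` tuples and that the streaming path occasionally misses a late event; it counts heartbeats only and does not model how heartbeats map to billable views.

```python
from collections import Counter, defaultdict

WINDOW_SEC = 60  # 1-minute tumbling windows, as in the text


def window_start(ts: int) -> int:
    """Align an event timestamp (seconds) to the start of its tumbling window."""
    return ts - (ts % WINDOW_SEC)


def aggregate_windows(heartbeats):
    """Streaming path: group heartbeats into per-window, per-video counts."""
    windows = defaultdict(Counter)
    for ts, video_id in heartbeats:
        windows[window_start(ts)][video_id] += 1
    return windows


def exact_counts(heartbeats):
    """Batch path: recount everything from the full archive."""
    return Counter(video_id for _, video_id in heartbeats)


# Hypothetical archive of (timestamp_sec, video_id) heartbeats.
archive = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (95, "b"), (120, "a")]

# Assumption for the demo: the streaming path never saw the last event.
seen_by_stream = archive[:-1]

approx = Counter()
for _, counts in sorted(aggregate_windows(seen_by_stream).items()):
    approx.update(counts)  # what the UI-facing counter would accumulate

exact = exact_counts(archive)
drift = {video: exact[video] - approx[video] for video in exact}
```

Here `drift` is what the daily reconcile would correct; payouts would read `exact`, never `approx`.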
- 500 hours uploaded/min = 50 videos/sec ingest (at ~10 min per video), 1.5 GB raw per video
- 1B hours watched/day = ~42M avg concurrent. At ~1 Mbps weighted bitrate (mobile-heavy global mix), egress is ~40 Tbps
- Transcoding: 150 segments x 12 variants = 1,800 tasks per 10-min video
- GPU fleet: 5,000 T4 spot instances, $7M/year vs $25M on-demand
- CDN cache hit rate ~95%+ for popular content
- View heartbeats: 1B hr/day x 2/min = 1.39M events/sec
- Catalog: ~800M videos, ~616 PB transcoded in S3
- Segment-based DAG, not monolithic (7-hr latency) or per-variant (75 GB redundant egress)
- Temporal for orchestration, not Kafka-with-retries (need workflow state for ~90K new tasks/sec, ~180K in flight)
- Multi-CDN with DNS GSLB, not single-vendor (regional outage would be global)
- ScyllaDB for video metadata at millions of reads/sec, PostgreSQL for users/billing
- Presigned S3 upload, not proxy through API (would be 75 GB/sec ingest)
- Approximate view counts in UI, exact in daily ClickHouse batch for creator payouts
- Viral-video CDN miss cascade is the load-bearing risk, origin shield + warming mitigates
- Playback startup tier-1 (sub-2s), transcoding tier-2 (under 30 min), view counts tier-3
- Spot GPU economics ($7M vs $25M) made viable by Temporal segment-level retry
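The headline numbers in this list can be re-derived in a few lines, which is worth rehearsing before the interview. This is a back-of-envelope sketch: the 10-minute average video length is an assumption implied by 500 hours/min mapping to 50 videos/sec, and decimal units (1 PB = 10^9 MB) are assumed.

```python
UPLOAD_HOURS_PER_MIN = 500
AVG_VIDEO_MIN = 10            # assumption implied by 500 h/min -> 50 videos/sec
WATCH_HOURS_PER_DAY = 1e9
AVG_BITRATE_MBPS = 1.0        # weighted bitrate from the text
HEARTBEATS_PER_MIN = 2        # one heartbeat every 30 s
CATALOG_VIDEOS = 800e6
TRANSCODED_MB_PER_VIDEO = 770
RAW_GB_PER_VIDEO = 1.5

# Ingest: minutes of video uploaded per minute, divided into 10-minute videos.
videos_per_sec = UPLOAD_HOURS_PER_MIN * 60 / AVG_VIDEO_MIN / 60

# Playback: watch-hours per day spread evenly over 24 hours.
avg_concurrent_viewers = WATCH_HOURS_PER_DAY / 24
egress_tbps = avg_concurrent_viewers * AVG_BITRATE_MBPS / 1e6

# View heartbeats per second.
heartbeats_per_sec = WATCH_HOURS_PER_DAY * 60 * HEARTBEATS_PER_MIN / 86_400

# Storage (MB -> PB) and what proxying uploads through the API would cost.
catalog_pb = CATALOG_VIDEOS * TRANSCODED_MB_PER_VIDEO / 1e9
proxied_ingest_gb_per_sec = videos_per_sec * RAW_GB_PER_VIDEO
```

The exact results land slightly off the rounded figures (about 41.7M concurrent and 41.7 Tbps), which is why the list quotes them with a tilde.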
- Segment-based DAG (not monolithic, not per-variant)
Monolithic encode takes up to ~7 hours of wall-clock for a long 4K upload (the 30-min, 15-variant case). Per-variant parallel works for latency, but each of 15 workers reads the full raw file: with a 5 GB raw that is 75 GB of redundant S3 egress per video, up to 225 TB/min wasted if every upload at our 3,000 videos/min ingest rate were that large. Segment-based splits the raw once, then 1,800 tiny tasks each read one 4-sec segment. That buys massive parallelism, fine-grained retry, and progressive availability (video becomes playable as early segments finish while later segments still encode). The segment-level DAG is what makes 3-5 min playable wall-clock possible.
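The fan-out arithmetic for a 10-minute video can be sketched directly. A minimal illustration, assuming 4-second segments, 12 variants, and ~2 s per task as stated; `plan_tasks` and the 10-segment playability threshold are hypothetical names and values used only for this example.

```python
import math

SEGMENT_SEC = 4
VARIANTS = 12
TASK_SEC = 2  # approximate per-task encode time, from the text


def plan_tasks(duration_sec: int):
    """Return (segment_index, variant_index) pairs, earliest segments first."""
    segments = math.ceil(duration_sec / SEGMENT_SEC)
    return [(seg, var) for seg in range(segments) for var in range(VARIANTS)]


tasks = plan_tasks(10 * 60)                  # a 10-minute video
segment_count = len({seg for seg, _ in tasks})
sequential_gpu_sec = len(tasks) * TASK_SEC   # one worker running every task back to back

# Progressive availability: the first 10 segments (40 s of video) need only
# 10 * 12 = 120 of the tasks to finish before playback can start.
first_playable = [task for task in tasks if task[0] < 10]
```

Ordering tasks by segment index is what lets early segments finish first; a scheduler that picked tasks at random would still complete the video but would delay the first playable second.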
- Temporal over Kafka-with-retries (workflow state at ~90K new tasks/sec)
1,800 tasks per video × 50 videos/sec ingest = ~90K new tasks per second; at ~2s per task, roughly 180K are in flight at any moment. We need workflow state (which segments done, which retrying, which dead), automatic retry with backoff, and ops visibility. Temporal does this natively as durable workflow state with deterministic replay. Kafka with consumer retries works for stateless event handling but not for a long-running DAG with cross-task dependencies (a segment must finish encoding before the HLS manifest that lists it is assembled). Same orchestrator-vs-choreography pattern as the payment system's saga.
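To make "workflow state" concrete, here is a small plain-Python sketch of the bookkeeping an orchestrator has to keep per video: which tasks are done, retrying, or dead, plus exponential backoff. It deliberately does not use the Temporal SDK; `SegmentWorkflow`, `max_attempts`, and `base_backoff_sec` are hypothetical names and defaults chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    DONE = "done"
    RETRYING = "retrying"
    DEAD = "dead"


@dataclass
class SegmentWorkflow:
    total_tasks: int
    max_attempts: int = 3           # hypothetical retry budget
    base_backoff_sec: float = 1.0   # hypothetical backoff base
    state: dict = field(default_factory=dict)
    attempts: dict = field(default_factory=dict)

    def __post_init__(self):
        for task_id in range(self.total_tasks):
            self.state[task_id] = State.PENDING
            self.attempts[task_id] = 0

    def record(self, task_id: int, ok: bool) -> float:
        """Record one attempt; return seconds to wait before retrying (0 if none)."""
        self.attempts[task_id] += 1
        if ok:
            self.state[task_id] = State.DONE
            return 0.0
        if self.attempts[task_id] >= self.max_attempts:
            self.state[task_id] = State.DEAD
            return 0.0
        self.state[task_id] = State.RETRYING
        return self.base_backoff_sec * 2 ** (self.attempts[task_id] - 1)

    def ready_for_manifest(self) -> bool:
        """The manifest step depends on every encode task having finished."""
        return all(s is State.DONE for s in self.state.values())


wf = SegmentWorkflow(total_tasks=3)
wf.record(0, ok=True)
first_backoff = wf.record(1, ok=False)    # e.g. a spot instance was reclaimed
second_backoff = wf.record(1, ok=False)
wf.record(1, ok=True)
ready_before_last = wf.ready_for_manifest()
wf.record(2, ok=True)
ready_after_last = wf.ready_for_manifest()
```

A workflow engine adds what this sketch lacks: the state survives process crashes, and the retry timers are durable rather than held in memory.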
- Spot GPU economics ($7M vs $25M made viable by retry granularity)
5,000-GPU T4 spot fleet runs $7M/year vs $25M on-demand. AWS reclaims ~30% of spot capacity during demand spikes. Monolithic encodes would never survive spot reclamation (lose a 7-hour job halfway through). Segment-based DAG makes spot viable because each task runs ~2s; worst case Temporal retries on a healthy instance and a 5-min transcode takes 15 min. Creator SLO is 30 min, still inside it. The architecture choice (segment DAG + Temporal) is what unlocks the cost choice (spot GPU).
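The cost figures can be sanity-checked with simple arithmetic. A sketch using the $0.16/hr spot rate quoted in the GPU sizing step; the on-demand hourly rate is not given in the text, so it is back-computed from the stated $25M total rather than taken from a price list.

```python
HOURS_PER_YEAR = 24 * 365
FLEET_GPUS = 5_000
SPOT_USD_PER_HR = 0.16            # spot rate quoted in the GPU sizing step
ON_DEMAND_TOTAL_USD = 25e6        # stated on-demand annual cost

spot_total_usd = FLEET_GPUS * SPOT_USD_PER_HR * HOURS_PER_YEAR
implied_on_demand_usd_per_hr = ON_DEMAND_TOTAL_USD / (FLEET_GPUS * HOURS_PER_YEAR)
annual_savings_usd = ON_DEMAND_TOTAL_USD - spot_total_usd

# Worst-case latency under reclamation, using the figures from the text.
NORMAL_TRANSCODE_MIN = 5
WORST_CASE_TRANSCODE_MIN = 15
CREATOR_SLO_MIN = 30
slo_headroom = CREATOR_SLO_MIN / WORST_CASE_TRANSCODE_MIN
```

The spot total comes out at about $7.0M and the implied on-demand rate at about $0.57/hr, so the two headline figures are consistent with each other; the worst case still leaves a 2x margin against the 30-minute SLO.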
- Origin shield with request coalescing (viral CDN miss cascade)
Viral video hits 10M viewers in 60 seconds, every edge misses simultaneously. Without coalescing, every edge slams S3, origin melts, every viewer stalls. Origin shield in front of S3 sees 1,000 edges asking for the same segment in the same second, fires one S3 fetch, broadcasts the response. 1,000x origin reduction. Pairs with proactive warming: view-velocity monitor pushes first 2 min of trending videos to edges as they trend, so the first wave of viewers never even misses. Same coalescing pattern as the URL shortener's hot-key request coalescing via Valkey lock.
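Request coalescing is sometimes called single-flight: the first request for a key becomes the leader and fetches from origin, and every concurrent request for the same key waits on the leader's result. The following is an in-process sketch of that idea, not a CDN configuration; `OriginShield` and `slow_origin` are hypothetical names, and the demo holds the origin fetch open until all other requests have queued so the outcome is deterministic.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class OriginShield:
    """Coalesce concurrent fetches for the same key into one origin read."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Future shared by all waiters
        self.coalesced = 0    # requests that piggybacked on an in-flight fetch

    def get(self, key):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
            else:
                self.coalesced += 1
        if leader:
            try:
                fut.set_result(self._fetch(key))
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
        return fut.result()


EDGES = 20
origin_reads = []


def slow_origin(key):
    """Stand-in for an S3 read; waits until every other edge has queued."""
    origin_reads.append(key)
    deadline = time.monotonic() + 5
    while shield.coalesced < EDGES - 1 and time.monotonic() < deadline:
        time.sleep(0.001)
    return f"bytes-of-{key}"


shield = OriginShield(slow_origin)
with ThreadPoolExecutor(max_workers=EDGES) as pool:
    results = list(pool.map(shield.get, ["seg-0001.ts"] * EDGES))
```

Twenty simulated edges produce one origin read. The sketch does not cache: once the fetch completes, the next request for the same key starts a new one, which is the job of the edge cache in the real design.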
40-Minute Interview Playbook
Each phase is what the interviewer expects you to do and say. Concrete steps, not topic hints. The diagrams are what you sketch on the board.
- 1 · 5 min
Clarify Requirements and Scale
Goal: Separate the three independent paths (upload + transcode, playback via CDN, recommendations) and pin the transcoding DAG decision early. Most candidates blur these together; the interviewer wants to see you carve them up.
Do & Say
- SAY·1 This is three independent systems sharing infra. The write path: upload + transcode, compute-bound and async. The read path: playback, bandwidth-bound and CDN-driven. The recommendation path: ML-inference-bound and sub-200ms. Write all three on the board.
- SAY·2 Pin the scale: 500 hours/min upload = 50 videos/sec, 1B hours/day watched = 42M concurrent, 1 Mbps weighted bitrate → 40 Tbps egress, 800M videos in catalog, 50M peak concurrent viewers. Write 40 Tbps egress, 50 videos/sec ingest on the board.
- SAY·3 Pin the SLOs: time-to-playable under 30 min for a 10-min video, first few segments playable within 60 sec, playback startup sub-2s TTFB, recommendation feed sub-200ms p99.
- SAY·4 Pin the storage decision: raw uploads to object storage, transcoded segments to object storage with CDN in front, video metadata in ScyllaDB (not PostgreSQL) because we need millions of metadata reads/sec for page loads. PostgreSQL stays for users and billing where ACID matters.
- SAY·5 Scope in: upload + transcode pipeline, adaptive bitrate playback via CDN, view count aggregation, basic recommendation architecture. Out: monetization/ads, content moderation internals, live streaming, Shorts. Wait for confirmation.
Interviewer is grading: You carve up the three paths before drawing anything. You volunteer the transcoding-is-async constraint early. You pick object storage for video data and ScyllaDB for metadata, and can give one-line reasons for both.
- 2 · 5 min
API and Data Model
Goal: Three endpoints: upload init, playback manifest, view heartbeat. Justify the presigned-URL pattern for upload, the HLS manifest for playback, and ScyllaDB partitioning for metadata.
Do & Say
- WRITE·1 Write the upload init: POST /api/v1/videos {title, description} returns {video_id, presigned_upload_url}. Creator's app PUTs raw file directly to S3 via multipart upload (no proxy through our servers). Then POST /api/v1/videos/{id}/complete fires UploadComplete event to Kafka, which triggers the Temporal workflow.
- WRITE·2 Write the playback endpoint: GET /api/v1/videos/{id}/manifest returns signed CDN URL to master.m3u8. The player fetches master.m3u8, picks a variant based on bandwidth, then fetches 4-second segments from CDN. Player switches variants mid-stream if bandwidth changes (adaptive bitrate).
- WRITE·3 Write the view heartbeat: POST /api/v1/views/heartbeat {video_id, position_sec, session_id} fires every 30 seconds during playback. Goes to Kafka topic view.heartbeats. Flink aggregates in 1-minute tumbling windows, writes approximate counts to Valkey. Daily batch job reconciles exact counts in ClickHouse.
- SAY·4 Justify presigned URL: Creator uploads 1.5 GB on average. Proxying through our servers means 75 GB/sec ingest at 50 videos/sec. Presigned URL means S3 takes the upload directly, our servers see only the metadata POST. Standard pattern for any large-file upload.
- DRAW·5 Sketch the ScyllaDB videos table. Partition key video_id. Columns: channel_id, title, description, status (uploading/processing/playable/removed), visibility, view_count, like_count, manifest_url, thumbnail_urls (map<resolution, url>), captions (list<frozen<caption_track>>). With captions and full descriptions embedded, ~50 KB per row average. 800M rows is ~40 TB raw, ~120 TB at RF=3 across a 50-node cluster.
Interviewer is grading: You use the presigned-URL pattern for upload (don't proxy 1.5 GB files through your API). You return only a manifest URL for playback and let the CDN serve the segments. You volunteer ScyllaDB over PostgreSQL for video metadata with a throughput reason.
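The playback contract is easier to defend with a concrete master playlist in hand. Below is a minimal sketch that builds an HLS master playlist and picks a starting variant from a bandwidth estimate. The ladder's resolutions and bitrates, the `index.m3u8` paths, and the 0.8 safety factor are illustrative assumptions, not values from this design; real players use more elaborate ABR heuristics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    width: int
    height: int
    bandwidth_bps: int  # bitrate advertised to the player


# Hypothetical H.264 ladder; the numbers are illustrative only.
LADDER = [
    Variant("360p", 640, 360, 800_000),
    Variant("480p", 854, 480, 1_400_000),
    Variant("720p", 1280, 720, 2_800_000),
    Variant("1080p", 1920, 1080, 5_000_000),
]


def master_playlist(variants) -> str:
    """Render a master playlist with one EXT-X-STREAM-INF entry per variant."""
    lines = ["#EXTM3U"]
    for v in variants:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={v.bandwidth_bps},"
            f"RESOLUTION={v.width}x{v.height}"
        )
        lines.append(f"{v.name}/index.m3u8")
    return "\n".join(lines) + "\n"


def pick_variant(variants, measured_bps: float, safety: float = 0.8) -> Variant:
    """Pick the highest variant that fits within a fraction of measured bandwidth."""
    budget = measured_bps * safety
    fitting = [v for v in variants if v.bandwidth_bps <= budget]
    if not fitting:
        return min(variants, key=lambda v: v.bandwidth_bps)  # fall back to the lowest
    return max(fitting, key=lambda v: v.bandwidth_bps)


playlist = master_playlist(LADDER)
on_wifi = pick_variant(LADDER, measured_bps=10_000_000)
on_4g = pick_variant(LADDER, measured_bps=4_000_000)
on_weak_signal = pick_variant(LADDER, measured_bps=500_000)
```

The API only ever returns the signed URL of this playlist; every segment request after that goes to the CDN, which is the point of the design.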
- 3 · 10 min
High-Level Design (draw the three paths)
Goal: One diagram with three labeled paths: upload + transcode (Client -> S3 -> Kafka -> Temporal -> GPU fleet -> S3 segments), playback (Client -> CDN -> Origin shield -> S3), recommendation (Client -> Reco API -> two-stage retrieval+ranking).
Do & Say
- DRAW·1 Draw the upload + transcode path first, it's the most complex. Creator uploads to S3 via presigned URL. Then UploadComplete fires to Kafka. Then Temporal orchestrates the transcoding DAG. Key insight: split the video into 4-second segments first, not transcode the whole thing.
- SAY·2 Defend the segment-based DAG: 10-min video → 150 segments, each encodes to 12 variants (5 H.264, 4 VP9, 3 AV1), 1,800 independent tasks per video, parallelized across 5,000 T4 GPUs on spot, ~2s per task on a T4, wall-clock 3-5 min instead of 30+ min sequential.
- SAY·3 Defend the variant set: H.264 at every resolution for universal decoder support, VP9 at 720p+ for 30-40% bitrate savings (hardware decoders since 2015), AV1 at 720p+ for another 30-50%, player falls back if not supported, 360p/480p get H.264 only since low-bitrate savings are negligible.
- SAY·4 Pivot to playback. Viewer requests manifest. Streaming Service returns signed master.m3u8 listing all variants. Player fetches from CDN, picks initial variant from bandwidth probe, fetches 4-second segments. If bandwidth drops, switch variants on next segment boundary. Standard HLS adaptive bitrate.
- SAY·5 Defend multi-CDN: Akamai, Cloudflare, Fastly run in parallel. DNS GSLB steers viewers to nearest healthy CDN. Akamai EU down → traffic shifts to Cloudflare EU in ~10s. 95%+ cache hit on popular content. Origin shield with request coalescing absorbs viral misses: 1000 edges asking for the same segment fan into one S3 fetch.
- SAY·6 Pivot to recommendation. Two-stage. Retrieval: narrow 800M videos to ~5000 candidates using two-tower embeddings (user embedding + video embedding, ANN lookup). Ranker: score those 5000 with a gradient-boosted tree or DNN using watch-time-weighted features, returns top 50. Feature store is Valkey for sub-ms lookups. Total p99 ~150ms.
- SAY·7 Pivot to view counts: heartbeat every 30s into Kafka view.heartbeats, partitioned by video_id. Flink does 1-min tumbling aggregation, writes approximate per-video counts to Valkey for the UI. Daily ClickHouse batch reconciles exact. View-count accuracy is not tier-1 so a 1-min lag is fine.
Interviewer is grading: You separate the four paths visibly (upload + transcode, playback, recommendation, view counts). You volunteer the 4-second-segment DAG before being asked. You name Temporal (or equivalent workflow engine) for orchestration, not 'just retry with Kafka'. You volunteer origin shield + request coalescing for viral playback.
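The two-stage recommendation shape is simple enough to sketch end to end. In this toy version a brute-force dot product stands in for the ANN index, and a hand-written score stands in for the gradient-boosted or DNN ranker; the embeddings, the feature names `expected_watch_min` and `click_prob`, and the scoring formula are all hypothetical.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec, video_vecs, k):
    """Stage 1: top-k videos by embedding similarity (brute force, not ANN)."""
    return heapq.nlargest(k, video_vecs, key=lambda vid: dot(user_vec, video_vecs[vid]))


def rank(candidates, features, top_n):
    """Stage 2: score only the retrieved candidates with a heavier model."""
    def score(vid):
        f = features[vid]
        return f["expected_watch_min"] * f["click_prob"]  # toy watch-time weighting
    return sorted(candidates, key=score, reverse=True)[:top_n]


# Hypothetical 2-d embeddings and ranking features.
video_vecs = {
    "v1": [1.0, 0.0],
    "v2": [0.9, 0.1],
    "v3": [0.0, 1.0],
    "v4": [0.7, 0.3],
}
features = {
    "v1": {"expected_watch_min": 2.0, "click_prob": 0.1},
    "v2": {"expected_watch_min": 8.0, "click_prob": 0.1},
    "v3": {"expected_watch_min": 9.0, "click_prob": 0.9},
    "v4": {"expected_watch_min": 5.0, "click_prob": 0.1},
}
user_vec = [1.0, 0.0]

candidates = retrieve(user_vec, video_vecs, k=3)
feed = rank(candidates, features, top_n=2)
```

Note that `v3` has the best ranking score but never reaches the ranker because retrieval filtered it out. That is the trade-off of the two-stage shape: the expensive model runs on thousands of candidates instead of the whole catalog, at the cost of being bounded by retrieval quality.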
- 4 · 15 min
Deep Dive: Transcoding DAG, CDN Strategy, View Counts
Goal: Three sub-dives. First, why the segment-based DAG beats two simpler approaches. Second, multi-CDN with traffic steering and cache warming. Third, view-count counting strategy (approximate real-time + exact batch).
Do & Say
- SAY·1 Pivot to transcoding. Sync monolithic: one worker produces all variants sequentially, 30-min 4K with 15 variants takes 2.5 to 7.5 hours, crash at variant 14 → restart. At 500 hours/min uploaded, queue grows faster than we drain. Dead.
- WATCH·2 Knock out parallel-per-variant: Better latency, but each of the 15 workers reads the entire raw file from S3. 5 GB raw × 15 reads = 75 GB S3 egress per video. At 3000 videos/min that's 225 TB/min of redundant reads. Wasteful.
- SAY·3 Defend segment-based DAG: Split into 4-second segments, each fans out to 12 encoding tasks, 150 × 12 = 1800 tiny tasks, parallelism on 5000 T4 spot GPUs, fine-grained retry, progressive availability so first 10 segments play while later ones encode.
- SAY·4 Defend Temporal: 1800 tasks per video × 50 videos/sec = 90K new tasks per second, about 180K in flight at ~2s each. Needs: workflow state (done, retrying, dead), backoff retries, ops visibility. All native in Temporal. Airflow and Step Functions work, but Temporal's worker model fits long-running encoding better.
- SAY·5 Cover GPU sizing: 50 videos/sec × 1 GPU-hour each = 180K GPU-seconds/sec, 5000 T4s at ~36 NVENC sessions each = 180K concurrent sessions, balanced, spot at $0.16/hr = $7M/year. 4-second segments survive reclamation since Temporal retries elsewhere.
- SAY·6 Pivot to CDN strategy. Multi-CDN is mandatory for video at this scale. Single-CDN means a regional outage is a global incident. Akamai, Cloudflare, Fastly run in parallel. DNS-based GSLB steers each viewer to nearest healthy CDN, weighted by per-CDN metrics (RTT, error rate, cost).
- SAY·7 Defend origin shield + coalescing: Viral video hits 10M viewers in 60s. Edges miss, haven't cached yet. Without coalescing, every edge slams S3. Origin shield sees 1000 edges asking for the same segment, fires one S3 fetch, broadcasts. 1000x origin load reduction.
- SAY·8 Defend proactive warming: Flink tracks view-velocity per video. When trending (>100 views/sec or >50% acceleration), push first 2 min of segments to edge PoPs. First 2 min carries ~60% of views, high-leverage warm. Segments at edge by the time 10M viewers arrive.
- SAY·9 Pivot to view counts: heartbeats fire every 30 seconds during playback, 1B hours/day × 2 heartbeats/min = 120B heartbeats/day = 1.39M events/sec, Kafka partitioned by video_id with 200 partitions across 50 brokers, Flink does 1-min tumbling aggregation and writes per-video counts to Valkey.
- SAY·10 Defend approximate-then-exact: UI shows approximate from Valkey within 1 min, daily ClickHouse batch scans Kafka and reconciles to exact, fixing Flink drift, approximate is fine for UI, no user notices 0.1% drift, exact is needed for creator payouts which run on daily batch.
- SAY·11 Cover hot-metadata failure: Viral video's ScyllaDB row gets millions of reads/sec on one vnode → hot partition. Mitigation: Valkey with 1-hour TTL absorbs reads, ScyllaDB sees one read per hour per Valkey shard. Super Bowl scale: Flink writes the live counter to Valkey directly and bypasses ScyllaDB on that field.
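The proactive-warming trigger reduces to a small predicate over view velocity. A sketch using the thresholds from the walkthrough (more than 100 views/sec, or more than 50% acceleration); the comparison window is not specified in the text, so consecutive measurement windows are assumed, and `is_trending` and `segments_to_warm` are hypothetical helper names.

```python
RATE_THRESHOLD = 100.0    # views/sec, from the text
ACCEL_THRESHOLD = 0.5     # 50% growth between windows, from the text


def is_trending(prev_views_per_sec: float, curr_views_per_sec: float) -> bool:
    """True if the video crosses either the absolute or the acceleration threshold."""
    if curr_views_per_sec > RATE_THRESHOLD:
        return True
    if prev_views_per_sec > 0:
        acceleration = (curr_views_per_sec - prev_views_per_sec) / prev_views_per_sec
        return acceleration > ACCEL_THRESHOLD
    return False


def segments_to_warm(warm_sec: int = 120, segment_sec: int = 4) -> int:
    """How many leading segments cover the first two minutes of the video."""
    return warm_sec // segment_sec


fast_and_big = is_trending(10.0, 150.0)       # above the absolute threshold
accelerating = is_trending(40.0, 70.0)        # +75% between windows
steady = is_trending(80.0, 90.0)              # +12.5%, below both thresholds
warm_count = segments_to_warm()
```

As written, the acceleration rule also fires for tiny videos (2 to 4 views/sec is +100%), so a production version would likely add a minimum rate floor; that floor is not part of the source design.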
Interviewer is grading: You volunteer the three transcoding architectures and kill the first two with one-line reasons. You name Temporal (or equivalent) as the workflow engine, not 'just Kafka with retries'. You bring up origin shield + request coalescing for viral content unprompted. You separate approximate (Flink + Valkey) from exact (daily ClickHouse) for view counts and can defend each.
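The GPU sizing step is an application of Little's law: work in flight equals arrival rate times time in system. A sketch with the figures from the walkthrough; the ~36 sessions per T4 is taken from the text as given rather than verified against hardware specifications.

```python
VIDEOS_PER_SEC = 50
TASKS_PER_VIDEO = 150 * 12     # segments x variants
TASK_SEC = 2.0                 # approximate encode time per task
FLEET_GPUS = 5_000
SESSIONS_PER_GPU = 36          # figure quoted in the walkthrough

task_arrivals_per_sec = VIDEOS_PER_SEC * TASKS_PER_VIDEO     # new tasks each second
tasks_in_flight = task_arrivals_per_sec * TASK_SEC           # Little's law: L = lambda * W
gpu_sec_per_video = TASKS_PER_VIDEO * TASK_SEC               # one GPU-hour per video
fleet_capacity = FLEET_GPUS * SESSIONS_PER_GPU               # concurrent encode sessions
utilization = tasks_in_flight / fleet_capacity
```

With these inputs the fleet runs at exactly 100% utilization, so "balanced" here means no headroom: any reclamation wave or slower-than-2s task pushes work into a queue, which is where the 30-minute SLO margin gets spent.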
- 5 · 5 min
Trade-offs, Risks, and Wrap-up
Goal: Name three deliberate trade-offs (approximate view counts, spot GPU instances, ScyllaDB tunable consistency), give a failure plan for a CDN outage, close in one breath.
Do & Say
- SAY·1 Approximate view-count trade-off: UI from Valkey via Flink with 1-min lag, exact via daily ClickHouse batch. We accept a 13:00 count off by 0.1% until next-day reconcile. No user notices, payouts run on exact. Real-time exact would need cross-region sync, too expensive.
- SAY·2 Spot GPU trade-off: 5000 T4s on spot = $7M/year vs $25M on-demand. We accept AWS reclaiming ~30% of the fleet during capacity events. Temporal retries interrupted segments on healthy instances. Worst case: 5-min transcode takes 15 min. Creator SLO is 30 min, still inside it. Savings are real money.
- SAY·3 ScyllaDB consistency trade-off: metadata reads use LOCAL_ONE (any replica), view count writes from Flink use LOCAL_QUORUM (2 of 3 replicas). We accept slightly stale view counts on read in exchange for sub-ms p99 reads. Source of truth for exact counts is ClickHouse, so any ScyllaDB drift is reconciled daily.
- SAY·4 CDN outage plan: Akamai EU down 15 min. GSLB catches it in 10s, shifts EU to Cloudflare EU. Cloudflare cache cold, hit rate drops to ~60% for 1-2 min. Origin shield absorbs misses. Cache warms organically by 5 min. Users see 10-30s stall at failover then normal playback.
- SAY·5 Close: upload via Temporal DAG of 1800 tasks on 5000 spot GPUs, playback via multi-CDN with origin shield and proactive warming, recommendation as two-stage retrieval-then-rank with Valkey at sub-200ms p99. ScyllaDB metadata, S3 video, ClickHouse exact counts. Risk: viral CDN miss cascade, mitigated by coalescing and warming.
Interviewer is grading: You name trade-offs as 'we accept X because Y'. You volunteer the CDN failover plan (DNS GSLB + cache warming) without being asked. The summary mentions all three paths plus the load-bearing CDN risk in one breath.
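The failover plan rests on one decision the GSLB makes on every health-check cycle: which CDN should a region's viewers be steered to. A minimal sketch of that decision, assuming health is a boolean plus an error-rate ceiling and that the tie-breaker is lowest RTT; `CdnStatus`, `pick_cdn`, the 5% error ceiling, and the sample metrics are hypothetical, and cost weighting is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CdnStatus:
    name: str
    healthy: bool        # result of the latest health check
    rtt_ms: float        # measured round-trip time for this region
    error_rate: float    # fraction of failed requests


def pick_cdn(statuses, max_error_rate: float = 0.05) -> Optional[str]:
    """Return the lowest-RTT CDN that passes health checks, or None if none do."""
    eligible = [s for s in statuses if s.healthy and s.error_rate <= max_error_rate]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s.rtt_ms).name


# Hypothetical EU metrics before and during an Akamai EU outage.
normal = [
    CdnStatus("akamai-eu", True, 18.0, 0.001),
    CdnStatus("cloudflare-eu", True, 22.0, 0.002),
    CdnStatus("fastly-eu", True, 30.0, 0.001),
]
outage = [
    CdnStatus("akamai-eu", False, 18.0, 0.900),
    CdnStatus("cloudflare-eu", True, 22.0, 0.002),
    CdnStatus("fastly-eu", True, 30.0, 0.001),
]

before = pick_cdn(normal)
during = pick_cdn(outage)
```

The ~10s failover figure in the plan depends on how quickly this decision propagates, which is bounded by the health-check interval plus the DNS TTL rather than by the selection logic itself.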
Interview Grading by Level
What an interviewer at each level expects to see in your answer. Use this to calibrate, not to perform.
Mid-Level Engineer (L4 / SDE-II)
Reaches a working video platform but transcoding is synchronous, CDN is single-vendor, and view counts hand-wave the scale.
- Knows to use object storage (S3) for raw video and transcoded segments.
- Adds a CDN for video delivery and mentions edge caching.
- Picks adaptive bitrate (HLS or DASH) for playback.
- Adds a queue (Kafka or SQS) for upload events.
- Knows recommendation needs ML but draws it as a single model call.
- Transcodes synchronously or one variant at a time, doesn't split into segments.
- Single CDN, no failover plan, treats CDN as a magic black box.
- Doesn't quantify 40 Tbps egress or do the per-video storage math (770 MB transcoded x 800M videos = 616 PB).
- View counts are 'just increment a counter', no separation between approximate and exact.
- Treats recommendation as a single ranking step, no retrieval stage to narrow 800M videos to 5000 candidates.
Senior Engineer (L5 / SDE-III)
Drives the three-path separation, picks segment-based DAG transcoding, multi-CDN with failover, and separates approximate from exact view counts.
- Separates upload + transcode, playback, and recommendation in the first 5 minutes and gives each independent scaling characteristics.
- Picks segment-based DAG (4-second segments x 12 variants = 1800 tasks per 10-min video) over monolithic or per-variant approaches, with a one-line reason for each rejection.
- Names a workflow engine (Temporal, Airflow, Step Functions) for DAG orchestration and gives a reason.
- Uses multi-CDN with DNS-based GSLB and explains the regional-outage blast-radius reason.
- Picks ScyllaDB over PostgreSQL for video metadata with the millions-of-reads/sec throughput reason.
- Separates approximate view counts (Flink + Valkey, 1-min lag) from exact (daily ClickHouse batch) and can defend each.
- Names origin shield + request coalescing for handling viral video cache misses.
- Brings up proactive cache warming on view-velocity only when prompted, not as part of the initial CDN design.
- Doesn't volunteer the spot-GPU economics ($7M/year vs $25M/year) until cost comes up.
- Quantifies failure modes in words but not in numbers (e.g. 'CDN failover is fast' without 'about 10s DNS TTL + cache warming over 1-2 minutes').
Staff+ Engineer (L6+)
Owns the room, names viral-video CDN miss cascade as the load-bearing risk before being asked, frames trade-offs as 'we accept X because Y', and brings operational depth (spot economics, GSLB health checks, ScyllaDB hot-vnode mitigation).
- Volunteers the viral-video CDN miss cascade as the load-bearing risk in the first 5 minutes, with origin shield + request coalescing + proactive warming already drawn before being asked.
- Brings up segment-based DAG with the per-architecture rejection (monolithic = 7-hour latency, per-variant = 75 GB redundant egress) unprompted.
- Sets explicit SLO tiers: playback startup is tier-1 (sub-2s p99), transcoding-to-playable is tier-2 (under 30 min), view counts are tier-3 (1-min approximate, exact within 24h), recommendation is tier-2 (sub-200ms but not critical).
- Volunteers the spot-GPU economics ($7M vs $25M) with the Temporal-retry safety net that makes spot viable.
- Pushes back on requirements where appropriate: 'Do we need AV1 from day one, or is H.264 + VP9 enough for launch? AV1 saves 30-50% bitrate but encoding is 5-10x slower. Add later when GPU fleet has headroom.'
- Names operational rehearsal of the multi-CDN failover (DNS TTL, GSLB health-check thresholds, cache warming rate) and the spot reclamation drill, not just the technical mechanism.
- Closes with a one-breath summary covering all three paths plus the load-bearing risk (viral CDN miss), and offers the deeper dives (transcoding DAG internals, two-tower retrieval, GSLB traffic steering) explicitly.
Common Follow-up Questions
Questions an interviewer is likely to ask after your walkthrough. Rehearse the short answer.