Design LeetCode
Design an online judge handling 50M submissions/day across 20+ languages. The hard parts: executing untrusted code safely in Firecracker microVMs with hardware KVM isolation and native snapshot warm-start (~25ms restore), real-time contest leaderboards with atomic Valkey Lua updates, post-contest Elo rating, and premium priority queues, all while keeping end-to-end latency under 5 seconds at 2K submissions/sec peak.
Key Topics
Interview Cheat Sheet
60s skim · 3min careful readFirecracker microVMs with hardware KVM isolation run untrusted code with 100% syscall compatibility and zero known escape CVEs. A warm pool with ~25ms snapshot restore and ~5ms diff-snapshot reset makes the boot-time math work. Four Kafka topics with priority pools plus practice-worker dual-consume give contest fairness. The load-bearing decision is Firecracker, justified by AWS Lambda's billion-function precedent.
- Submit · 580/s avg · 2K/s peak · 4 priority topics
When a user submits code, the API just drops it into the right Kafka topic based on what kind of submission it is. Contest submissions go to a critical topic, premium to high, practice to normal, custom-test to low. Dedicated worker pools read from each topic so practice spam never blocks a contest run.
ClientPOST /submitAPI service validates + persists submissionKafka publish to one of 4 priority topics (contest > premium > practice > custom-test based on submission_type)202 returnedpractice workers can dual-consume contest topic during spikes (adaptive with hysteresis) but never the reverse - Execute · warm VM pool · ~5ms claim, ~5ms reset
A worker grabs an already-running Firecracker microVM from a warm pool keyed by language. It writes the user's code in over a virtual socket, runs it with strict resource limits, captures the verdict, then resets the VM to a clean snapshot. The pool stays warm so almost no submission ever pays the boot cost.
Worker consumes Kafkaclaim idle microVM from warm pool by language (~2ms channel receive)write code via vsockexecute under limits (256MB RAM, 2 vCPU, no network, language time multiplier 1.0x C++ / 2.0x Java / 3.0x Python)capture verdict + stdoutdiff-snapshot reset (~5ms)publish verdict eventVM hard caps: 60-min age, 100 executions, then destroyed - Leaderboard · Valkey ZSET · live WebSocket push
Every accepted submission updates a single Valkey sorted set. The score is a clever single number that puts solved-count in the high bits and penalty in the low bits, so one ZADD does both the sort and the tiebreak. Rank changes get published over WebSocket to anyone watching the leaderboard.
Verdict=ACCEPTEDLua atomic on Valkey (check duplicate via SET NX accepted:{user}:{problem}compute penalty = solve_time + wrong_attempts × 5minZADD leaderboard:{contest} {solved×10^9 - penalty} {user_id}PUBLISH rank.changed)WebSocket gateways subscribedpush delta to clientsasync post-contest Elo: pairwise expected vs actual rank, K∈[20,80] decay, clamp delta ∈ [-150, +150]
- •50M submissions/day, 580/sec avg, 2K/sec peak, 20K concurrent during contests
- •SLAs: p95 verdict under 5s, p99 under 10s, 99.99% contest availability, zero escapes
- •Firecracker: ~125ms cold boot, ~25ms snapshot restore, ~5ms diff-snapshot reset
- •Warm pool defaults: Python 200, Java 150, C++ 150, Go 100 idle VMs
- •Fleet: 1,500 worker pods × 8 VM slots = 12,000 concurrent VMs, 14.5% steady utilization
- •VM hard caps: 60 min max age, 100 max executions, 256MB RAM, 2 vCPU, no network
- •Language time multipliers: C++ 1.0x, Java 2.0x, Python 3.0x
- •Contest pre-warm 15 min ahead: Python 45%, C++ 30%, Java 15% of historical mix
- •Firecracker (not Docker+seccomp, not gVisor) for the sandbox
- •Native Firecracker snapshot/restore, not CRIU
- •Kata Containers to wrap Firecracker as a K8s RuntimeClass (no custom orchestrator)
- •Four Kafka topics by priority (not one with a field) with dedicated worker pools
- •Valkey Lua atomic for leaderboard updates, score = solved × 10^9 minus penalty
- •Test cases on worker SSDs cached from S3, never inside the rootfs image
- •Elo runs async post-contest; live ranking uses penalty-time only
- •Diff-snapshot VM reset (~5ms) plus 60-min age cap is a security rotation guarantee, not a perf tweak
- •Premium subscribers do NOT cut contest lines; fairness is non-negotiable during contests
- •KEDA scales on Kafka consumer lag, not CPU, because pod spin-up lags CPU by minutes
- Firecracker over gVisor or Docker+seccomp
Firecracker runs a full Linux kernel inside KVM, giving 100% syscall compatibility, hardware-enforced isolation, and zero known escape CVEs. gVisor's user-space Go Sentry only gets 95% syscall compatibility, adds 10-30% overhead on I/O-heavy code, and has tricky debugging. For 50M arbitrary programs/day across 20+ languages the gVisor edge cases would surface daily. Docker+seccomp shares the host kernel, which is the actual attack surface. AWS Lambda runs billions of Firecracker functions/day, which is the proof point.
- Warm pool + snapshot restore + diff-snapshot reset
Cold boot ~125ms is too slow for 580/sec sustained. Native Firecracker snapshot/restore brings boot to ~25ms (Python stdlib imported, JVM classloader + JIT warmed). The warm pool keeps N pre-restored microVMs per language, so >95% of submissions claim in ~2ms (channel receive) and never touch the restore path. After execution, diff-snapshot reset reverts filesystem and memory in ~5ms; one VM serves many submissions back-to-back without destroy.
- 60-minute VM age cap (security rotation, not perf)
VMs get destroyed at 60 minutes max or 100 executions max, whichever comes first, even if healthy. This is security rotation: hard age cap limits the blast radius of any future Firecracker CVE. Combined with no network, 256MB RAM, 2 vCPU, and the diff-snapshot reset between every submission, no VM is reused enough to accumulate state-based attacks. The 60-minute window is short enough that an attacker who finds an escape has limited time to weaponize before replacement.
- Contest fairness and premium queue separation
Premium priority only applies to practice submissions; during contests, all submissions go to the contest topic regardless of subscriber tier and are equal priority. Practice workers can dual-consume the contest topic during spikes (adaptive with hysteresis), but premium tier never gets queue priority on contest work. Pre-warm starts 15 min before a contest based on historical language mix (Python 45%, C++ 30%, Java 15%); dynamic rebalancing steals idle VMs from low-demand languages mid-contest.
40-Minute Interview Playbook
Each phase is what the interviewer expects you to do and say. Concrete steps, not topic hints. The diagrams are what you sketch on the board.
- 15 min
Clarify Requirements and Scale
GoalForce the security question first. Running untrusted code is the whole ballgame, and most candidates skip past it to the queue.
Do & Say- SAY·1SAY: the hard part is not the queue, it's running untrusted code safely. Confirm: arbitrary code from millions of strangers including fork bombs, kernel exploits, infinite loops. Get a yes.
- SAY·2Pin scale: 50M submissions/day, 580/sec average, 2K/sec peak, 20K concurrent during contest windows, 20+ programming languages. Write on the board.
- SAY·3Pin SLAs: p95 submission-to-verdict < 5s, p99 < 10s, 99.99% availability during contests, zero sandbox escapes (hard requirement). Wait for nod.
- SAY·4State scope. In scope: secure execution, multi-language compile + run, real-time contest leaderboard, post-contest Elo rating, premium priority queue. Out of scope: problem editing UI, payment for premium, discussion forums, editorial solutions.
- SAY·5Drop the bait: I'm going to argue for Firecracker microVMs over gVisor and Docker-with-seccomp. The choice drives the entire pool architecture. OK to defend that now? Get them to confirm now.
Interviewer is grading: You name the sandbox-escape rate as a hard requirement, not a soft SLA. You force the isolation-tech question early because it drives everything else.
- 27 min
Sandbox Isolation Comparison
GoalThree isolation candidates, knock out two, defend Firecracker with the AWS Lambda precedent and the boot-time math.
Do & Say- SAY·1Write three candidates: Docker + seccomp, gVisor, Firecracker microVMs. All three exist in production. The decision is driven by syscall compatibility, cold-start latency, sandbox-escape blast radius.
- WATCH·2Kill Docker + seccomp: shared kernel, CVE-2019-5736 (runc), CVE-2020-15257 (containerd), 6+ known escape CVEs in last 5 years. Even with a 50-syscall seccomp allowlist, every kernel zero-day is a host compromise. For 50M arbitrary programs/day, the math doesn't work. Cross it off.
- SAY·3gVisor next: Google's user-space kernel, Sentry (Go) intercepts and reimplements every syscall, high security but only 95% syscall compatibility, I/O-heavy code sees 10-30% overhead. Viable but not optimal.
- SAY·4Land on Firecracker: 100% syscall compat (guest runs a real Linux kernel), hardware KVM boundary, not software Sentry, zero known escape CVEs, ~5MB VMM, ~35MB per microVM. AWS Lambda runs billions on it, not theoretical.
- SAY·5Boot-time math: cold boot ~125ms (start VMM, load ~4MB kernel, mount rootfs, init runtime) too slow at 2K/sec. Firecracker native snapshot/restore (no CRIU) restores a fully-initialized VM in ~25ms (Python stdlib imported, JVM JIT-warm). 5x faster, makes the warm pool feasible.
- WATCH·6Get ahead of why not Kata?: Kata Containers wraps Firecracker behind the OCI runtime interface, K8s sees it as a normal pod, HPA + KEDA + rolling deployments work unchanged. Without Kata, we'd need a custom orchestrator.
Interviewer is grading: Three options compared with one-line kills (CVE count for Docker, 95% compatibility for gVisor). You quote AWS Lambda as precedent. You explain why Kata Containers bridges Firecracker into Kubernetes instead of building a custom orchestrator.
- 310 min
High-Level Design
GoalDraw three independent paths. Submission async via Kafka. Execution on warm pool. Contest leaderboard with atomic Lua + WebSocket.
Draw on the boardDo & Say- SAY·1Walk submit: API validates code length + language, Valkey rate limit (5/min practice, 10/min contest), write submission row to Postgres as PENDING, produce to one of four Kafka topics by priority, return 202 with submission_id. HTTP request completes in under 100ms.
- SAY·2Why four topics: single topic can't prioritize within a partition. Four topics with dedicated worker pools: contest (CRITICAL), premium (HIGH), practice (NORMAL), custom-test (LOW). Practice workers dual-consume contest on spike via adaptive consumer with hysteresis, so contest never sits in lag.
- SAY·3Walk execution: worker pulls Kafka, claims a warm microVM, 95%+ hit a ready VM in ~2ms, burst falls back to ~25ms snapshot restore. VM flow: write code via vsock, compile, run test cases via stdin/stdout, capture wall-time + peak memory from cgroup, stop on first failure unless Accepted.
- SAY·4Drop the killer line on VM reset: after each submission, restore the VM from a pre-execution diff snapshot, reverts all filesystem + memory changes in ~5ms, one VM serves multiple submissions back-to-back without being destroyed. Saves the full ~25ms snapshot restore.
- SAY·5Walk contest path on Accepted: Valkey Lua script atomically checks duplicate via HGET problem_accepted, marks accepted, increments problems_solved, computes penalty = solve_time + wrong_attempts*5min, ZADD score = problems_solved*1e9 - total_penalty, PUBLISH contest:id:update. WebSocket gateways push live deltas.
- SAY·6Why score = solved × 10^9 minus penalty: sort by problems_solved descending, tiebreak by penalty ascending, one numeric score lets Valkey ZSET do the sort for free. Wrong_attempts only increment if the problem isn't already accepted, in a separate Lua script.
- SAY·7Storage math: PostgreSQL ~1.5TB hot (30 days, weekly partitions), S3 ~36.5TB/year source code, lifecycle to S3-IA after 90 days, Valkey ~150MB total (leaderboards, rate limits, sessions), one node, 1,500 worker pods × 8 VM slots = 12K concurrent VMs, 14.5% steady-state util.
Interviewer is grading: Four Kafka topics with dedicated pools, not one with priority field. Diff snapshot VM reset at ~5ms is named explicitly. Lua script atomicity is justified, not assumed. The score = solved × 10^9 minus penalty encoding is volunteered, not asked for.
- 413 min
Deep Dive: Warm Pool, Contest Spikes, and Elo
GoalThree sub-dives with concrete algorithms and numbers.
Draw on the boardDo & Say- SAY·1Warm pool sub-dive: pool maintains N pre-restored microVMs per language. Default sizes: python3=200, java=150, cpp=150, go=100. Pool adjusts every 5 seconds: target_idle = max(min_idle_floor, demand_5min_avg × 1.5, contest_pre_warm). Walk the diagram.
- SAY·2Contest pre-warming: 15 minutes before a weekly contest, pool boosts by historical language distribution: Python 45%, C++ 30%, Java 15%, others 10%. Dynamic rebalancing steals idle VMs from low-demand languages.
- SAY·3VM health scoring: each VM tracks age (max 60 min) and execution count (max 100). Health below 0.3 → kill and replace from snapshot, no VM lives past an hour. This is a security rotation, not a perf tweak. Limits blast radius of any future Firecracker CVE.
- SAY·4Spike handling sub-dive: contest start moves traffic from 580/sec to ~3,000/sec. KEDA auto-scales worker pods on Kafka consumer lag, not on CPU. CPU lags Kafka lag by minutes because pods spin up, restore VMs, then start consuming. Lag is the right signal.
- SAY·5Practice workers dual-consume: non-blocking poll on contest topic first, then blocking poll on practice, steal contest work when contest lag > threshold, stop when lag < threshold/3. Hysteresis prevents flapping. Contest submissions are never delayed by practice traffic.
- WATCH·6Be ready for the fairness question: during contests, premium subscribers do NOT jump the line, all contest submissions are equal priority regardless of premium tier, fairness is non-negotiable. Premium priority only applies to practice submissions, not contests.
- SAY·7Elo sub-dive: live leaderboard uses penalty-time during the contest, Elo runs after, async. For N participants, treat as round-robin: Expected Rank = sum of 1/(1+10^((opp-player)/400)) over all opponents, Delta = K*(ln(expected) - ln(actual)), clamped to [-150, +150].
- SAY·8Elo K-factor: K = max(20, 80 - contests_played × 2). Starts at 80 for new players, decays to 20 with experience, starting rating 1500. High volatility early, stable late. Same approach LeetCode introduced in 2020.
- WATCH·9Be ready for cheating: same code, two machines, 48ms vs 72ms from noisy neighbors, raw exec-time ranking is unfair. We normalize by language multiplier (C++ 1.0x, Java 2.0x, Python 3.0x) but rank by problems_solved then penalty, never raw time. Exec time is shown to the user, not used for ranking.
Interviewer is grading: You name health scoring with the 60-min age cap and explain it as a security rotation, not a perf tweak. You volunteer that premium does NOT cut contest lines. You quote LeetCode's 2020 Elo introduction and walk the multi-player adaptation.
- 55 min
Trade-offs and Wrap-up
GoalThree deliberate trade-offs, one operational tail risk, one-sentence summary.
Do & Say- SAY·1Bare-metal trade-off: Firecracker needs /dev/kvm, bare-metal K8s nodes (c5.metal, m5.metal) or nested-virt instances, 2-3x EC2 cost. We pay for the hardware-enforced boundary. gVisor on regular nodes is cheaper, but 10-30% I/O overhead and 95% syscall compat kill it for a multi-language platform.
- SAY·2Time-limit multiplier trade-off: same algo in C++ runs 5ms, Python takes 50ms. We multiply base time by language (C++ 1.0x, Java 2.0x, Python 3.0x) so Python gets 3x. LeetCode skips this and Python users complain. We accept setter complexity because per-language fairness matters on a learning platform.
- SAY·3Test case storage trade-off: test cases live in S3, cached on worker SSDs, never in the rootfs image. Inside the image, users could read expected outputs and hardcode. SSD hit sub-ms, cold S3 fetch 50-100ms, total under 2GB fits on worker SSDs.
- SAY·4Operational risk: Kafka outage during contest is tier-1: no ingest, no verdicts, no leaderboard. Mitigations: RF=3 across 3 brokers, MirrorMaker to standby, idempotent producer. 5-min outage buffers in API memory with 30s timeout then client retries. We don't extend contests but credit penalty time.
- SAY·5Close: Firecracker microVMs with hardware KVM (100% syscall compat + zero-CVE radius at 50M/day), warm pool with ~25ms snapshot restore makes boot math work, four Kafka topics with priority pools for contest fairness, Valkey Lua + PUBLISH + WebSocket for sub-second leaderboard, Elo runs async post-contest.
- SAY·6Offer deeper dives: Firecracker snapshot API details, dual-consume hysteresis math, Elo multi-player adaptation, or per-language rootfs build pipeline.
Interviewer is grading: Bare-metal cost trade-off is acknowledged with a number (2-3x). Test cases are kept out of the image with the security reason given (hardcoded answers). The closing summary names Firecracker as load-bearing and explains why no alternative meets all three constraints.
Interview Grading by Level
What an interviewer at each level expects to see in your answer. Use this to calibrate, not to perform.
Mid-Level Engineer (L4 / SDE-II)
Reaches a queue-and-worker design with a Docker sandbox, but the security story is thin and the warm-pool dynamics are missing.
- Identifies that user code must run in a container, not on the host.
- Adds a queue (Kafka or SQS) between submission and execution to handle bursts.
- Knows test cases shouldn't be inside the user's container, even if the reasoning is vague.
- Separates contest leaderboard reads from submission writes.
- Mentions auto-scaling workers based on queue depth.
- Picks default Docker for the sandbox without acknowledging shared-kernel CVE risk.
- Cold-boots a container per submission; doesn't propose any warm pool.
- Doesn't separate priority queues; treats premium and contest as the same lane.
- Leaderboard updates are 'increment a counter'; no atomicity story.
- Has no plan for cheating prevention beyond 'check if outputs match'.
Senior Engineer (L5 / SDE-III)
Drives the design, picks Firecracker over Docker with reasons, names the warm pool with snapshot restore, and quantifies the contest spike.
- Compares Docker, gVisor, and Firecracker with concrete CVE history and syscall-compatibility numbers, lands on Firecracker.
- Names native Firecracker snapshot/restore at ~25ms (no CRIU) and explains the warm pool that brings claim time to ~2ms.
- Designs four Kafka topics by priority with dedicated worker pools, not one topic with a priority field.
- Atomic Valkey Lua for leaderboard updates with score = solved × 10^9 minus penalty encoding.
- Adds a /dev/kvm requirement and explicitly notes bare-metal Kubernetes nodes (c5.metal).
- Test cases on worker SSDs cached from S3, not in the rootfs image, with the security reason.
- KEDA auto-scaling on Kafka consumer lag (not CPU) and explains why CPU lags.
- Diff-snapshot VM reset (~5ms) is mentioned only after being asked about reset overhead.
- Doesn't bring up VM health rotation (max 60 min age) as a security control.
- Elo computation is mentioned but the multi-player round-robin adaptation isn't walked end-to-end.
Staff+ Engineer (L6+)
Owns the design, surfaces Firecracker as the load-bearing decision unprompted, brings concrete numbers for every layer, and names contest-fairness rules.
- Volunteers Firecracker as the load-bearing decision and ties it to AWS Lambda's billion-function precedent without prompting.
- Names diff-snapshot VM reset (~5ms) and the 60-minute VM age cap as a security rotation guarantee, not a perf tweak.
- Quantifies the warm pool formula (target = max(min_idle, demand_5min_avg × 1.5, contest_pre_warm)) and the per-language defaults (Python 200, Java 150, C++ 150, Go 100).
- Specifies that premium subscribers do NOT cut contest lines unprompted: fairness is non-negotiable during contests.
- Brings up the dual-consume practice-worker pattern with hysteresis (engage at lag > N, disengage at lag < N/3) and explains why it prevents flapping.
- Pushes back on time-limit fairness across languages and walks the multiplier table (C++ 1.0x, Java 2.0x, Python 3.0x), citing the LeetCode-doesn't-do-this trade-off.
- Closes with a one-sentence summary that names Firecracker as load-bearing and explains why no alternative meets all three constraints (100% syscall compat, zero-CVE escape rate, fast warm restore).
Common Follow-up Questions
click to expandQuestions an interviewer is likely to ask after your walkthrough. Rehearse the short answer.