System Design: LeetCode (Code Sandbox, Container Isolation, Real-Time Contests)
Goal: Design an online judge platform like LeetCode that handles 50 million code submissions per day across 20+ programming languages. Execute untrusted user code safely in Firecracker microVMs with hardware-level KVM isolation. Support real-time contests with live leaderboards, Elo-based rating, and penalty calculation. Target sub-5-second end-to-end execution latency, 20K concurrent submissions, and 15M registered users.
TL;DR: Submissions enter through a REST API, get validated and enqueued into Kafka (partitioned by priority: premium, contest, practice, custom-test). Execution workers pull jobs, claim a pre-warmed Firecracker microVM from the warm pool (native snapshot restore in ~25ms), run user code inside a VM with its own Linux kernel — hardware-level KVM isolation, no network, 256MB RAM, 2 vCPU. VM reset uses diff snapshot restore (~5ms) to revert all changes atomically. Contest leaderboards use Valkey sorted sets updated atomically via Lua scripts. WebSocket connections push live rank updates. Post-contest Elo rating computation adjusts player ratings. Auto-scaling based on Kafka consumer lag handles contest traffic spikes.
Pick your path
| Time | Read this | Covers |
|---|---|---|
| 2 min | TL;DR + §1 + §16 Security | Shape of the system and why sandbox isolation is the whole ballgame |
| 15 min | §1–§12 | Every core design decision, interview-grade |
| 30 min | Full post | Production detail: ops playbook, SLOs, failure scenarios, appendices |
1. Final Architecture
Three independent paths. The submission path accepts code and queues it (async, sub-100ms API response). The execution path runs untrusted code in Firecracker microVMs — each submission gets its own VM with a separate Linux kernel and hardware KVM isolation. The contest path layers real-time leaderboards and post-contest Elo rating on top.
Submission path:
Client → API Gateway → Rate Limiter (Valkey)
→ Kafka (topic by priority: premium > contest > practice > custom-test)
→ 202 Accepted with submission ID
Execution path:
Worker → Kafka poll → Claim warm microVM (Firecracker snapshot restore, ~25ms)
→ Write code via vsock → Compile → Run test cases (isolated VM, no network, 256MB, 2 vCPU)
→ Compare output → Verdict → PostgreSQL + Valkey pub/sub → WebSocket push
Contest path:
Accepted verdict → Valkey Lua script (atomic leaderboard update)
→ PUBLISH to WebSocket gateways → Live rank push to all viewers
Post-contest → Elo rating computation → PostgreSQL (rating_before, rating_after)
2. Problem Statement
An online judge sounds deceptively simple. Someone writes code. The system runs it. Check if the output matches. Done.
The reality? It's running arbitrary, untrusted code from millions of strangers. That code could be a fork bomb, a Bitcoin miner, a kernel exploit, or an infinite loop that allocates 64GB of memory. And the system needs to run 50 million of these per day, return the right answer in under 5 seconds, and do it across 20 different programming languages with different compilers, runtimes, and memory models.
Problem 1: Running untrusted code without getting owned.
Someone submits os.system("rm -rf /") in Python, or while(1) fork() in C++, or something subtler — code that reads /proc/self/mountinfo to fingerprint the container and tries known escape exploits. A default Docker container won't stop any of this. Kernel-level isolation is mandatory: intercept every syscall, block network access, kill anything that exceeds resource limits. One sandbox escape = access to test cases, other users' code, or production infrastructure.
Problem 2: 50M submissions/day means thousands of containers running simultaneously.
580 submissions/second average, 2K peak. Each takes 2-10 seconds → 1,200 to 5,800 containers in parallel. Each needs its own isolated filesystem, resource limits, and language runtime. Spawning a fresh Docker container per submission is a non-starter (3-5s cold start). A warm pool of pre-spawned containers is essential.
Problem 3: Contest fairness requires deterministic execution and cheat prevention.
Same code, different machines: 48ms vs 72ms due to noisy neighbors. Ranking by raw execution time is unfair — timing must be normalized.
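One common normalization approach (an illustrative sketch, not necessarily LeetCode's production method): periodically run a reference solution on each worker and scale the user's measured time by the host's speed relative to a calibration baseline. The function name and numbers here are hypothetical.

```python
def normalize_time_ms(user_ms: float, ref_ms_on_host: float,
                      ref_ms_baseline: float) -> float:
    """Scale a measured time by how fast this host ran the reference
    solution relative to the baseline recorded on calibration hardware."""
    return user_ms * (ref_ms_baseline / ref_ms_on_host)

# A noisy host ran the reference solution in 36ms instead of its 24ms
# baseline, so the user's 72ms raw time normalizes to 48ms.
print(normalize_time_ms(72, 36, 24))  # → 48.0
```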
Scale numbers:
- 15M registered users, 3M monthly active users
- 50M code submissions per day (~580/sec avg, 2,000/sec peak)
- 20K concurrent submissions during peak contest windows
- 20+ supported programming languages
- 4,000+ problems with 10-50 test cases each
- Average execution time: 3 seconds (compilation + running all test cases)
- Weekly contests: 100K participants, 4-5 problems, 90-minute window
What NOT to do:
- exec()/eval() on the app server. That's RCE on production.
- Default Docker security. Shared kernel, no syscall filtering. Escapes are well-documented (CVE-2019-5736, CVE-2020-15257).
- Network access in containers. Code could fetch solutions, exfiltrate data, or attack internal services.
- Raw execution time for ranking. Noisy neighbors cause 2x variance. Normalize or use dedicated hardware.
- Running all test cases after first failure. Wrong output on case 2 of 50? Stop. Only run all for Accepted.
- Test cases inside the container image. Users could read expected outputs and hardcode answers.
- Single queue for all submissions. Contest and practice traffic compete. Separate queues with priority.
- Solutions and test cases in the same DB. Different sizes, different access patterns, different write rates.
- Monolithic judge. API, execution, leaderboard, problem management scale completely differently.
The sandbox isolation layer is where most of the complexity lives. Getting it wrong means either a security breach or unacceptable performance. §11.1 breaks down the tradeoffs.
3. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Execute user-submitted code in a sandboxed environment with strict resource limits | P0 |
| FR-02 | Support 20+ programming languages (Python, Java, C++, Go, Rust, JavaScript, C, C#, Ruby, Kotlin, Swift, TypeScript, Scala, PHP, Haskell, Dart, Elixir, Erlang, Racket) | P0 |
| FR-03 | Run submitted code against ordered test cases and return a verdict (Accepted, Wrong Answer, TLE, MLE, Runtime Error, Compilation Error) | P0 |
| FR-04 | Display execution time and memory usage for each submission | P0 |
| FR-05 | Support contests with timed problem sets and real-time leaderboards | P0 |
| FR-06 | Support penalty time calculation (time to solve + penalty per wrong attempt) | P0 |
| FR-07 | Provide a problem bank with descriptions, constraints, examples, and hidden test cases | P0 |
| FR-08 | Show submission history per user per problem | P0 |
| FR-10 | Support "Run Code" (test against visible examples only, fast debug loop) separate from "Submit" (full hidden test suite) | P0 |
| FR-11 | Rate-limit submissions per user (5/minute for practice, 10/minute during contests) | P0 |
| FR-12 | Push real-time verdict updates to users via WebSocket | P1 |
| FR-13 | Support problem difficulty tagging and topic categorization | P1 |
| FR-14 | Track user statistics (problems solved, acceptance rate, contest Elo rating) | P1 |
| FR-15 | Support editorial solutions and community discussions per problem | P2 |
| FR-16 | Distinguish TLE (user's algorithm too slow) from Timeout (server overloaded, auto-retry) | P0 |
| FR-17 | Premium priority queue: 3-10x faster judging for premium subscribers (practice only, not contests) | P1 |
| FR-18 | Post-contest Elo rating computation with absence penalty | P1 |
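FR-18's rating update builds on the Elo expected-score model. LeetCode's production formula is reportedly more elaborate (it computes an expected rank against the full contest field), so treat this as an illustrative two-player sketch, not the platform's actual algorithm:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    # Probability that player A outranks player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(rating: float, expected: float, actual: float,
               k: float = 32.0) -> float:
    # actual is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    return rating + k * (actual - expected)

# Two equally rated players: 50% expected score; a win gains k/2 points.
print(elo_expected(1500, 1500))          # → 0.5
print(elo_update(1500, 0.5, 1.0))        # → 1516.0
```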
4. Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | End-to-end submission latency (submit to verdict) | < 5 seconds (p95), < 10 seconds (p99) |
| NFR-02 | Execution throughput | 50M submissions/day, 2,000/sec peak |
| NFR-03 | Concurrent submissions | 20,000 during peak contest windows |
| NFR-04 | microVM snapshot restore time | < 50ms (with warm pool: < 5ms claim) |
| NFR-05 | Availability | 99.95% (26 min downtime/month, non-contest), 99.99% during contests |
| NFR-06 | Sandbox escape rate | 0 (any escape is a critical security incident) |
| NFR-07 | Leaderboard update latency | < 1 second from verdict to leaderboard update |
| NFR-08 | WebSocket message delivery | < 500ms from verdict to client notification |
| NFR-10 | Horizontal scalability | Linear scale-out by adding execution workers |
| NFR-11 | Data retention | Submissions: 2 years. Test cases: indefinite. Contest results: indefinite. |
| NFR-12 | Language image update | < 4 hours from new language/version release to production availability |
5. High-Level Approach & Technology Selection
5.1 What Kind of System Is This?
A batch job execution platform with a real-time scoring overlay. Accept code, run it safely, compare output. The "safely" part is where all the complexity lives.
At 50M submissions/day, this is async — not request/response. The API returns 202 immediately; the verdict arrives via WebSocket. Execution takes 2-10 seconds; holding HTTP connections open that long doesn't scale.
5.2 Why Not Just ProcessBuilder?
The tempting approach: run user code directly on the server.
ProcessBuilder pb = new ProcessBuilder("python3", "solution.py");
Process p = pb.start();
String output = new String(p.getInputStream().readAllBytes());
This gives the user's code full access to the host — filesystem, network, other users' submissions, environment variables, the ability to kill the Java worker process. A fork bomb (while True: os.fork()) kills the entire server, not just the submission. There's no resource limit, no isolation, no boundary.
The progression from "no isolation" to "hardware isolation":
| Approach | Filesystem | Network | Resources | Kernel | Compatibility |
|---|---|---|---|---|---|
| ProcessBuilder | Full access | Full access | None | Shared | 100% |
| + chroot | Restricted root | Full access | None | Shared | 100% |
| + cgroups | Restricted root | Full access | CPU/mem limits | Shared | 100% |
| + namespaces | Isolated | Isolated | Limited | Shared | 100% |
| Docker + seccomp | Isolated | Disabled | Limited | Shared (filtered) | 100% |
| gVisor | Isolated | Disabled | Limited | User-space kernel | 95% |
| Firecracker | Isolated | Disabled | Limited | Separate kernel (KVM) | 100% |
Each row adds a layer. Firecracker is the end of the chain — the user's code runs inside its own VM with its own kernel. Even a full kernel exploit stays inside the VM. And unlike gVisor, there are zero syscall compatibility issues because the guest runs a real Linux kernel.
5.3 Sandbox Isolation Approaches
The sandbox must prevent: filesystem escape, network access, fork bombs, memory bombs, CPU starvation, and kernel exploits.
Approach 1: Docker + seccomp/AppArmor (baseline)
Standard Docker with a restrictive seccomp profile (~50 of 300+ syscalls allowed), AppArmor, and --network=none, --read-only, --memory, --cpus, --pids-limit.
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{ "names": ["read", "write", "open", "close", "stat", "fstat",
"mmap", "mprotect", "munmap", "brk", "ioctl",
"access", "execve", "exit_group", "arch_prctl",
"clone", "wait4", "getpid", "getuid"],
"action": "SCMP_ACT_ALLOW" }
]
}
Approach 2: gVisor (runsc)
Google's user-space kernel. Intercepts all syscalls and re-implements them in a Go process (Sentry). The container never talks to the host kernel. Even a Linux kernel zero-day only compromises the unprivileged Sentry process, not the host.
User Code -> Container -> gVisor Sentry (user-space kernel) -> Host Kernel
^^ intercepted here
Approach 3: Firecracker microVMs
AWS's lightweight VM monitor. Each submission runs in an actual VM with its own kernel. Boots in ~125ms with ~5MB overhead. A kernel exploit inside the VM doesn't affect the host. Used by AWS Lambda and Fly.io.
User Code -> Guest Kernel (inside microVM) -> Firecracker VMM -> Host Kernel
^^ completely separate kernel
5.4 Sandbox Approach Comparison
| Dimension | Docker + seccomp | gVisor (runsc) | Firecracker microVM |
|---|---|---|---|
| Security | Low (shared kernel) | High (user-space kernel) | Very high (hardware KVM) |
| Isolation | Process-level | Syscall interception | Separate guest kernel |
| Syscall compatibility | Full (native) | 95% (reimplements ~200 of ~300) | 100% (full kernel inside VM) |
| Cold start | ~300ms | ~400ms | ~125ms (boot + rootfs) |
| Warm reuse | Easy (exec into running) | Easy (same as Docker) | Native snapshot/restore (~25ms) |
| Memory overhead | ~10MB | ~30MB (Sentry) | ~35MB (5MB VMM + guest kernel) |
| Syscall perf overhead | ~0% | 10-30% (interception) | ~5% (virtualization) |
| Known escape CVEs | 6+ (runc, containerd) | 0 | 0 (hardware boundary) |
| Operational complexity | Low | Medium (custom OCI runtime) | Medium (Kata Containers) |
| Production users | Everyone | Google Cloud Run | AWS Lambda, Fly.io |
| LeetCode fit | No | Yes | Yes (best) |
5.5 Chosen Sandbox: Firecracker microVMs
Firecracker is the primary runtime. Each submission runs inside an actual virtual machine with its own Linux kernel. The VM communicates with the host only through a minimal VMM process (~5MB) and KVM hypercalls. This is the same technology AWS Lambda uses to run billions of untrusted functions.
Why Firecracker over gVisor for this use case:
- Full kernel isolation. Guest kernel exploits don't affect the host. Zero syscall compatibility issues — every language runtime works exactly as on bare metal. No "works locally, fails on judge."
- Native snapshot/restore. Firecracker can snapshot a fully-initialized VM (runtime loaded, JIT warmed) and restore it in ~25ms. No external CRIU dependency. Simpler than gVisor + CRIU.
- Clean resource accounting. Each VM is a hard boundary. Memory, CPU, and I/O are tracked by KVM — no noisy-neighbor ambiguity.
- Proven at untrusted-code scale. AWS Lambda, Fly.io, and Koyeb run on Firecracker in production. The "operational complexity" argument has weakened — firecracker-containerd provides OCI compatibility, and Kata Containers provides Kubernetes RuntimeClass integration.
gVisor remains a valid lighter alternative for development environments, internal judges, or lower-scale deployments where microVM overhead isn't justified.
5.6 Judge Strategy Comparison
How does the system decide whether a submission is correct?
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Exact output match | Compare stdout byte-for-byte with expected output | Simple, deterministic | Fails on floating point, trailing whitespace, line endings |
| Token-based comparison | Split output into tokens (whitespace-delimited), compare tokens | Handles whitespace variations | Still fails on floating point precision |
| Special judge (checker) | Custom program compares user output against expected answer | Handles floating point, multiple valid answers, graph problems | Need to write a checker per problem |
| Interactive judge | User program communicates with judge program via stdin/stdout | Required for interactive problems (binary search, games) | Complex, harder to parallelize |
Chosen: Token-based comparison as default. Special judge for floating-point (epsilon comparison) and multi-answer problems (validate output property, not exact match).
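A minimal sketch of the chosen comparator — token match with a numeric-epsilon fallback (the function name and the 1e-6 epsilon are illustrative assumptions, not the judge's exact values):

```python
def tokens_match(expected: str, actual: str, float_eps: float = 1e-6) -> bool:
    """Token-based comparison: whitespace-insensitive, with an epsilon
    fallback for tokens that parse as floats."""
    exp, act = expected.split(), actual.split()
    if len(exp) != len(act):
        return False
    for e, a in zip(exp, act):
        if e == a:
            continue
        try:
            # Numeric tokens compare within epsilon (special-judge lite).
            if abs(float(e) - float(a)) <= float_eps:
                continue
        except ValueError:
            pass
        return False
    return True

print(tokens_match("1 2 3", "1  2\n3"))            # → True (whitespace ignored)
print(tokens_match("0.3333330", "0.3333334"))      # → True (within epsilon)
print(tokens_match("hello", "world"))              # → False
```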
5.7 Contest Ranking Algorithms
| Algorithm | Description | Used By |
|---|---|---|
| ICPC-style | Ranked by problems solved (desc), then penalty time (asc). Penalty = sum of solve times + 20 min per wrong attempt | ICPC, many college contests |
| Codeforces-style | Rating-based scoring. Points per problem decrease over time. Wrong attempts incur penalty. | Codeforces |
| LeetCode-style | Ranked by problems solved (desc), then total penalty (asc). Penalty = finish time of last accepted + 5 min per wrong attempt | LeetCode |
| IOI-style | Partial scoring. Each test case group gives points. No penalty for wrong attempts. | IOI, many olympiads |
Chosen: LeetCode-style ranking (configurable to ICPC-style):
score = (problems_solved, -total_penalty)
total_penalty = finish_time_of_last_accepted
+ (5 * total_wrong_attempts_across_all_solved_problems)
Sorted by: problems_solved DESC, total_penalty ASC
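The ranking rule above reduces to a two-field sort key. A sketch (field names are assumptions for illustration):

```python
def leaderboard_key(entry: dict) -> tuple:
    # problems_solved DESC (negate for ascending sort), total_penalty ASC.
    return (-entry["problems_solved"], entry["total_penalty"])

participants = [
    {"user": "a", "problems_solved": 3, "total_penalty": 95},
    {"user": "b", "problems_solved": 4, "total_penalty": 210},
    {"user": "c", "problems_solved": 3, "total_penalty": 80},
]
standings = sorted(participants, key=leaderboard_key)
print([p["user"] for p in standings])  # → ['b', 'c', 'a']
```

More problems always wins; penalty only breaks ties among users with the same solve count.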
5.8 Store Selection
| Store | Technology | Role | Rationale |
|---|---|---|---|
| Problem metadata | PostgreSQL 16 | Problem descriptions, constraints, tags, difficulty | Relational queries (filter by tag, difficulty), ACID transactions |
| Test case storage | S3 + local SSD cache | Input/output files for each problem | Test cases can be large (100MB+ for some problems). S3 for durability, SSD cache on workers for low-latency reads. |
| Submission records | PostgreSQL 16 | User submissions, verdicts, execution stats | Relational (user_id, problem_id, contest_id), indexed for history queries |
| Submission code | S3 | Raw source code files | Cheap storage for 50M files/day, rarely re-read |
| Contest leaderboards | Valkey 8 | Real-time sorted sets for live rankings | Sub-ms sorted set operations, atomic updates, pub/sub for push |
| Submission queue | Kafka | Decouple API from execution workers | Durable, partitioned, consumer groups for parallel processing |
| User sessions/rate limits | Valkey 8 | Rate limiting, auth tokens | In-memory speed for per-request checks |
| Analytics | ClickHouse | Submission trends, language popularity, problem difficulty stats | Columnar analytics on billions of rows |
| WebSocket state | Valkey Pub/Sub | Route verdict updates to the correct WebSocket gateway | Cross-pod message routing |
5.9 Kata Containers: How Firecracker Integrates with Kubernetes
The problem Kata solves: Firecracker is a standalone VMM — it has no native Kubernetes integration. Without Kata, running Firecracker microVMs in K8s would require a custom orchestrator (scheduler, health checks, scaling, pod lifecycle). That was the original argument against Firecracker.
What Kata Containers is: An open-source project (CNCF) that wraps lightweight VMMs (Firecracker, QEMU, Cloud Hypervisor) behind the standard OCI container runtime interface. To Kubernetes, a Kata pod looks like any other pod. Under the hood, each pod runs inside a microVM instead of a shared-kernel container.
How it works:
Without Kata: kubelet → containerd → runc → container (shared kernel)
With Kata: kubelet → containerd → kata-runtime → Firecracker → microVM (separate kernel)
A single Kubernetes RuntimeClass resource switches the runtime:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: kata-fc
handler: kata-fc  # Kata with Firecracker backend
Worker pods specify runtimeClassName: kata-fc and everything — KEDA auto-scaling, liveness probes, rolling deployments, pod specs — works unchanged. The only difference is what runs under the hood: a Firecracker microVM instead of a Linux container.
Why this matters for the judge: Firecracker provides the strongest isolation. Kata provides the Kubernetes integration. Together, the system gets hardware VM isolation with standard K8s operational tooling — no custom orchestrator needed.
6. Design Assumptions
Non-negotiables. Every number downstream inherits from here.
- Single-region (us-east-1), multi-AZ. No need for active-active — users already wait 2-5s for execution.
- 50M submissions/day is a design target, not a benchmark. Demonstrates scaling decisions.
- Firecracker for production. Hardware VM isolation for untrusted code. gVisor is the lighter alternative for dev/test.
- Premium tier. Priority queue for practice (3-10x faster). Contests: equal priority for fairness.
- 20+ languages. Python 2/3, Java, C++, Go, Rust, JS, TS, C, C#, Ruby, Kotlin, Swift, Scala, PHP, Haskell, Dart, Elixir, Erlang, Racket.
- No AI coding features (Copilot-style). Separate product concern.
- GDPR out of scope for code execution. IPs logged for rate limiting; no PII in sandboxes.
7. High-Level Architecture
Accept fast (202 Accepted), judge async (Kafka workers), push results (WebSocket).
Bird's-Eye View
Layer Responsibilities
API Gateway. Validate code → rate-limit (Valkey) → store submission in PostgreSQL (PENDING) → enqueue to Kafka → return 202. WebSocket gateway pushes verdict updates via Valkey Pub/Sub.
Queue. Four Kafka topics (contest, premium, practice, custom-test). Partitioned by user_id % partitions for per-user ordering.
Execution. Workers poll Kafka → claim warm microVM → write code via vsock → compile → run test cases → write verdict → kill VM, restore fresh from snapshot. Each worker runs 8 concurrent VMs.
Storage. PostgreSQL for structured data. S3 for blobs (test cases, source code). Test cases cached on worker SSDs.
Real-Time. Valkey for (1) leaderboard sorted sets, (2) rate-limit counters, (3) Pub/Sub routing verdicts to WebSocket gateways.
Analysis. ClickHouse stores submission analytics (language trends, problem difficulty calibration, per-worker performance). Fed async from PostgreSQL.
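The warm-pool claim cycle in the Execution layer can be sketched as follows. This is a toy in-memory model — restore_snapshot stands in for a real Firecracker snapshot-restore call, and the ~5ms/~25ms figures are the targets quoted in the TL;DR, not measured here:

```python
from queue import Empty, Queue

class WarmPool:
    """Toy model of the claim → execute → kill → restore-fresh cycle."""
    def __init__(self, size: int, restore_snapshot):
        self.restore_snapshot = restore_snapshot  # spawns a fresh microVM
        self.vms: Queue = Queue()
        for _ in range(size):
            self.vms.put(self.restore_snapshot())

    def claim(self):
        try:
            return self.vms.get_nowait()      # warm hit: ~5ms claim in practice
        except Empty:
            return self.restore_snapshot()    # cold miss: ~25ms snapshot restore

    def recycle(self, used_vm) -> None:
        # The used VM is never reused; a fresh restore replaces it so the
        # next claim always starts from a clean, pre-warmed snapshot.
        self.vms.put(self.restore_snapshot())
```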
Submission Lifecycle
Execution Worker Detail Flow
8. Back-of-the-Envelope Estimation
580 req/sec average, 2K peak, 6K concurrent containers, 1.5TB hot PostgreSQL, 150MB Valkey.
| Dimension | Result |
|---|---|
| Throughput | 580/sec avg, 2K peak, design for 3K burst |
| Workers | 1,500 pods, 12,000 container slots, 14.5% steady-state utilization |
| PostgreSQL | ~1.5TB hot (30-day submissions + indexes), weekly partitions |
| S3 | 100GB/day source code, 36.5TB/year, <2GB total test cases |
| Valkey | ~150MB total (leaderboards + rate limits + sessions) — one node handles it |
| Kafka | 12MB/sec peak, 3 brokers, 72 partitions across 4 topics |
| Network | ~10MB/sec total outbound (well within single-NIC capacity) |
Show full derivations (throughput, worker sizing, storage, Valkey, Kafka, network)
Throughput:
Submissions per day: 50,000,000
Submissions per second (avg): 579 (~580)
Submissions per second (peak, 3.5x): 2,030 (~2,000)
Contest peak (weekly, 90 min):
100K users * 5 problems * 3 attempts avg = 1,500,000 submissions
1,500,000 / (90 * 60) = 278 contest submissions/sec
Sunday evening combined peak: ~1,440/sec
Design for 2,000/sec with burst to 3,000/sec.
Execution worker sizing:
Average execution time: 3 seconds
Concurrent executions at peak: 2,000/sec * 3s = 6,000
Worker pod: 8 vCPU, 16GB RAM, 8 containers per pod
Pods needed at peak: 6,000 / 8 = 750 pods
Practical: 500 contest + 800 practice + 200 custom-test = 1,500 pods, 12,000 slots
Steady-state utilization: 14.5%. Peak: 50%. Auto-scale down to 500 pods off-peak.
Storage:
PostgreSQL submissions: 50M/day * 500B/row = 750GB hot (30 days), ~1.5TB with indexes
PostgreSQL problems: 30MB. Users: 10GB. Contests: trivial.
S3 source code: 100GB/day = 36.5TB/year (lifecycle to S3-IA after 90 days)
S3 test cases: 900MB total (cached entirely on worker SSDs)
Valkey memory:
Leaderboards: ~8MB per contest (100K participants), 32MB with 4 concurrent
Rate limiting: 50MB (500K concurrent users)
WebSocket sessions: 60MB
Total: ~150MB — single Valkey node (4GB) with one replica
Kafka:
4,000 msg/sec peak * 3KB = 12MB/sec (single broker handles 200MB/sec)
3 brokers for fault tolerance. 16 + 16 + 32 + 8 = 72 partitions across 4 topics.
Network: ~10MB/sec total outbound (S3 fetches + PG updates + WebSocket).
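The headline numbers in the derivations above can be sanity-checked in a few lines:

```python
# Reproduce the back-of-the-envelope figures from this section.
per_day = 50_000_000
avg_rps = per_day / 86_400                    # ≈ 579/sec average
peak_rps = 2_000                              # rounded 3.5x diurnal peak
concurrent_vms = peak_rps * 3                 # 3s avg execution → 6,000 busy VMs
pods_at_peak = concurrent_vms // 8            # 8 VM slots per pod → 750 pods
contest_rps = 100_000 * 5 * 3 / (90 * 60)     # ≈ 278 contest submissions/sec
steady_util = avg_rps * 3 / 12_000            # 12,000 slots → ~14.5% utilization
```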
9. Data Model
Six tables in PostgreSQL (submissions partitioned weekly), sorted sets in Valkey, blobs in S3.
| Table | Key Columns | Notes |
|---|---|---|
| problems | slug, difficulty, time_limit_ms, starter_code (JSONB) | 4K rows, GIN index on tags |
| test_cases | problem_id, input_s3, output_s3, is_sample | S3 paths, not inline data. SHA-256 hashes for integrity |
| submissions | user_id, problem_id, status, execution_time_ms | Partitioned weekly by submitted_at. 50M rows/day. Status includes TIMEOUT (distinct from TLE) |
| contests | start_time, end_time, problem_ids[], penalty_minutes | ~52/year. Rated by default |
| contest_participants | contest_id, user_id, rank, rating_before/after/delta | Elo rating changes stored per contest |
| users | username, rating (default 1500), problems_solved, is_premium | 15M rows. Streak tracking |
Show full SQL schemas (5 CREATE TABLE statements + indexes)
9.1 Problems Table
CREATE TABLE problems (
id SERIAL PRIMARY KEY,
slug VARCHAR(100) UNIQUE NOT NULL,
title VARCHAR(255) NOT NULL,
difficulty VARCHAR(20) NOT NULL CHECK (difficulty IN ('Easy', 'Medium', 'Hard')),
description TEXT NOT NULL,
constraints TEXT NOT NULL,
examples JSONB NOT NULL,
time_limit_ms INT NOT NULL DEFAULT 2000,
memory_limit_mb INT NOT NULL DEFAULT 256,
category VARCHAR(100) NOT NULL,
tags TEXT[] NOT NULL DEFAULT '{}',
has_special_judge BOOLEAN NOT NULL DEFAULT false,
special_judge_code TEXT,
starter_code JSONB NOT NULL DEFAULT '{}',
total_submissions BIGINT NOT NULL DEFAULT 0,
total_accepted BIGINT NOT NULL DEFAULT 0,
is_premium BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_problems_difficulty ON problems(difficulty);
CREATE INDEX idx_problems_category ON problems(category);
CREATE INDEX idx_problems_tags ON problems USING GIN(tags);
9.2 Test Cases Table
# language-images.yaml
images:
python3:
image: "registry.internal/judge-python3"
tag: "3.12.2-fc-v4"
compiler: "CPython 3.12.2"
updated: "2026-04-15"
java:
image: "registry.internal/judge-java"
tag: "21.0.3-fc-v4"
compiler: "OpenJDK 21.0.3"
updated: "2026-04-12"
cpp:
image: "registry.internal/judge-cpp"
tag: "13.2.0-fc-v4"
compiler: "g++ 13.2.0"
updated: "2026-04-10"
go:
image: "registry.internal/judge-go"
tag: "1.22.2-fc-v4"
compiler: "Go 1.22.2"
updated: "2026-04-14"
rust:
image: "registry.internal/judge-rust"
tag: "1.77.1-fc-v4"
compiler: "rustc 1.77.1"
updated: "2026-04-08"
9.3 Submissions Table
// Practice workers consume from both premium and practice topics
// Premium messages are always dequeued first (weighted consumer)
Submission consumeWithPriority() {
// Try premium topic first (non-blocking poll, 50ms timeout)
ConsumerRecords<String, Submission> premium =
premiumConsumer.poll(Duration.ofMillis(50));
if (!premium.isEmpty())
return premium.iterator().next().value();
// Fall back to practice topic (blocking poll, 1s timeout)
ConsumerRecords<String, Submission> practice =
practiceConsumer.poll(Duration.ofSeconds(1));
if (!practice.isEmpty())
return practice.iterator().next().value();
return null;
}
9.4 Contests + Participants Tables
# practice-worker Deployment (abbreviated)
replicas: 800
strategy:
rollingUpdate: { maxSurge: 100, maxUnavailable: 80 } # 10%
resources: { cpu: "8", memory: "16Gi" } # per pod
env:
KAFKA_TOPIC: "submission.practice"
VMM: "firecracker"
WARM_POOL_SIZE: "8" # microVMs per pod
EXECUTION_TIMEOUT_SECONDS: "15"
runtimeClassName: kata-fc  # Kata Containers with Firecracker backend
9.5 Users Table
CREATE TABLE submission_analytics (
submission_id UUID, user_id UUID, problem_id UInt32,
contest_id Nullable(UInt32), language LowCardinality(String),
status LowCardinality(String), execution_time_ms UInt32,
memory_usage_kb UInt32, queue_wait_ms UInt32, compile_time_ms UInt32,
submitted_at DateTime64(3), judged_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(submitted_at)
ORDER BY (problem_id, language, submitted_at);
-- Language popularity
SELECT language, count() as submissions,
countIf(status = 'ACCEPTED') * 100.0 / count() as acceptance_rate
FROM submission_analytics WHERE submitted_at > now() - INTERVAL 30 DAY
GROUP BY language ORDER BY submissions DESC;
-- Time limit calibration
SELECT problem_id, language,
quantile(0.5)(execution_time_ms) as p50,
quantile(0.95)(execution_time_ms) as p95
FROM submission_analytics WHERE status = 'ACCEPTED'
GROUP BY problem_id, language;
9.6 Valkey Data Structures
# Contest leaderboard (sorted set)
# Score encoding: problems_solved * 10^9 - total_penalty_seconds
# Higher score = better rank (more problems, less penalty)
ZADD contest:123:leaderboard <score> <user_id>
# Example: User solved 3 problems with 1800 sec total penalty
# Score = 3 * 1_000_000_000 - 1800 = 2_999_998_200
ZADD contest:123:leaderboard 2999998200 "user-abc-123"
# Get rank (0-indexed, so add 1 for display)
ZREVRANK contest:123:leaderboard "user-abc-123"
# Get top 50
ZREVRANGE contest:123:leaderboard 0 49 WITHSCORES
# Per-user contest state (hash)
HSET contest:123:user:abc-123
problems_solved 3
total_penalty 1800
p1_accepted 1
p1_time 300
p1_wrong_attempts 1
p2_accepted 1
p2_time 900
p2_wrong_attempts 0
p3_accepted 1
p3_time 1800
p3_wrong_attempts 2
p4_accepted 0
p4_wrong_attempts 3
# Rate limiting (fixed one-minute windows)
# Key: ratelimit:<user_id>:<minute_bucket>  (bucket = YYYYMMDDHHMM)
INCR ratelimit:user-abc-123:202604181504
EXPIRE ratelimit:user-abc-123:202604181504 120
# WebSocket session routing
HSET ws:sessions <user_id> <gateway_pod_id>
# Submission verdict pub/sub
PUBLISH verdict:<user_id> '{"submission_id":"...","status":"ACCEPTED","time_ms":48}'
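The leaderboard score encoding round-trips cleanly because total penalty is always smaller than the 10^9 scale factor. A sketch of the encode/decode pair (helper names are illustrative):

```python
SCALE = 1_000_000_000  # penalty seconds can never reach this

def encode_score(problems_solved: int, penalty_seconds: int) -> int:
    # More problems always outranks less penalty because penalty < SCALE.
    return problems_solved * SCALE - penalty_seconds

def decode_score(score: int) -> tuple:
    solved, neg_penalty = divmod(score, SCALE)
    # divmod floors toward zero remainder, so a nonzero penalty
    # "borrows" one from solved; undo that here.
    if neg_penalty:
        return solved + 1, SCALE - neg_penalty
    return solved, 0

# 3 problems solved, 1800s penalty → the score from the example above.
print(encode_score(3, 1800))  # → 2999998200
print(decode_score(2999998200))  # → (3, 1800)
```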
9.7 Entity-Relationship Diagram
10. API Design
Async submission (202 + WebSocket push), not request-response. Two core endpoints: submit and get result.
10.1 Submit Solution
POST /api/v1/submissions
Request:
CREATE TABLE test_cases (
id SERIAL PRIMARY KEY,
problem_id INT NOT NULL REFERENCES problems(id),
case_number INT NOT NULL,
input_s3 VARCHAR(500) NOT NULL,
output_s3 VARCHAR(500) NOT NULL,
is_sample BOOLEAN NOT NULL DEFAULT false,
input_hash CHAR(64) NOT NULL,
output_hash CHAR(64) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(problem_id, case_number)
);
Response (202 Accepted):
CREATE TABLE submissions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
problem_id INT NOT NULL REFERENCES problems(id),
contest_id INT,
language VARCHAR(30) NOT NULL,
code_s3_path VARCHAR(500) NOT NULL,
code_length INT NOT NULL,
status VARCHAR(30) NOT NULL DEFAULT 'PENDING'
CHECK (status IN ('PENDING', 'QUEUED', 'COMPILING',
'RUNNING', 'ACCEPTED', 'WRONG_ANSWER',
'TIME_LIMIT_EXCEEDED', 'MEMORY_LIMIT_EXCEEDED',
'RUNTIME_ERROR', 'COMPILATION_ERROR',
'SYSTEM_ERROR', 'TIMEOUT')),
verdict_detail JSONB,
execution_time_ms INT,
memory_usage_kb INT,
test_cases_passed INT NOT NULL DEFAULT 0,
test_cases_total INT NOT NULL DEFAULT 0,
submitted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
judged_at TIMESTAMPTZ,
worker_id VARCHAR(100),
CONSTRAINT fk_contest FOREIGN KEY (contest_id) REFERENCES contests(id)
) PARTITION BY RANGE (submitted_at);
CREATE INDEX idx_submissions_user_problem ON submissions(user_id, problem_id, submitted_at DESC);
CREATE INDEX idx_submissions_contest ON submissions(contest_id, submitted_at) WHERE contest_id IS NOT NULL;
CREATE INDEX idx_submissions_status ON submissions(status) WHERE status IN ('PENDING', 'QUEUED', 'COMPILING', 'RUNNING');
Rate Limit Headers:
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 8
X-RateLimit-Reset: 1745069460
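These headers derive from the same per-minute counter pattern shown in §9.6 (Valkey INCR + EXPIRE). A hedged in-memory sketch of that fixed-window scheme — a dict stands in for Valkey, and the class and method names are illustrative:

```python
import time

class FixedWindowLimiter:
    """In-memory stand-in for the Valkey INCR/EXPIRE rate-limit pattern."""
    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}

    def allow(self, user_id, now=None) -> bool:
        bucket = int((time.time() if now is None else now) // self.window)
        key = (user_id, bucket)
        self.counters[key] = self.counters.get(key, 0) + 1   # INCR
        return self.counters[key] <= self.limit

    def remaining(self, user_id, now=None) -> int:
        # Feeds the X-RateLimit-Remaining header.
        bucket = int((time.time() if now is None else now) // self.window)
        return max(0, self.limit - self.counters.get((user_id, bucket), 0))
```

The tradeoff of fixed windows: a user can burst up to 2x the limit across a bucket boundary, which is usually acceptable for submission throttling.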
10.2 Get Submission Result
GET /api/v1/submissions/{submission_id}
Response (200 OK):
CREATE TABLE contests (
id SERIAL PRIMARY KEY,
title VARCHAR(255) NOT NULL,
contest_type VARCHAR(30) NOT NULL CHECK (contest_type IN ('weekly', 'biweekly', 'special')),
start_time TIMESTAMPTZ NOT NULL,
end_time TIMESTAMPTZ NOT NULL,
duration_minutes INT NOT NULL,
problem_ids INT[] NOT NULL,
is_rated BOOLEAN NOT NULL DEFAULT true,
penalty_minutes INT NOT NULL DEFAULT 5,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE contest_participants (
contest_id INT NOT NULL REFERENCES contests(id),
user_id UUID NOT NULL REFERENCES users(id),
rank INT,
problems_solved INT NOT NULL DEFAULT 0,
total_penalty INT NOT NULL DEFAULT 0,
score_detail JSONB NOT NULL DEFAULT '{}',
rating_before INT,
rating_after INT,
rating_delta INT,
PRIMARY KEY (contest_id, user_id)
);
Response (200 OK, Wrong Answer):
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
rating INT NOT NULL DEFAULT 1500,
problems_solved INT NOT NULL DEFAULT 0,
easy_solved INT NOT NULL DEFAULT 0,
medium_solved INT NOT NULL DEFAULT 0,
hard_solved INT NOT NULL DEFAULT 0,
streak_days INT NOT NULL DEFAULT 0,
is_premium BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Show 5 more endpoints (Run Custom Test, Get Leaderboard, WebSocket, Register, Get Problem)
10.3 Run Custom Test
POST /api/v1/run — Runs user code against custom input (not hidden test cases). Returns raw stdout/stderr. Does not create a submission record.
{
"problem_id": 1,
"language": "python3",
"code": "class Solution:\n def twoSum(self, nums, target):\n seen = {}\n for i, n in enumerate(nums):\n if target - n in seen:\n return [seen[target - n], i]\n seen[n] = i",
"contest_id": 123
}
10.4 Get Contest Leaderboard
GET /api/v1/contests/{contest_id}/leaderboard?page=1&page_size=50 — Paginated rankings sorted by problems_solved (desc), total_penalty (asc). Each entry includes per-problem breakdown (accepted, time, wrong attempts).
10.5 WebSocket Verdict Stream
WS /api/v1/ws/verdicts — Client subscribes with {"action": "subscribe", "submission_id": "..."}. Server pushes status transitions: COMPILING → RUNNING (with test_cases_passed count) → final verdict (ACCEPTED/WA/TLE/etc with execution_time_ms).
10.6 Register for Contest
POST /api/v1/contests/{contest_id}/register → 201 Created with contest_starts_at.
10.7 Get Problem
GET /api/v1/problems/{slug} — Returns title, difficulty, description (markdown), constraints, examples, starter_code per language, tags, acceptance_rate. Hidden test cases are never exposed.
11. Deep Dives
Seven subsystems that each deserve their own article. Expand the collapsibles for implementation code.
11.0 "Run Code" vs "Submit" — Two Different Paths
These two buttons in the IDE look similar but follow very different paths through the system:
| Dimension | "Run Code" | "Submit" |
|---|---|---|
| Test cases | Visible examples only (1-3) | Full hidden test suite (10-50+) |
| Kafka topic | submission.custom-test (lowest priority) | submission.practice or submission.contest |
| PostgreSQL write | None (ephemeral result) | Full submission record (verdict, timing, memory) |
| Leaderboard update | None | Yes (if contest + Accepted) |
| Acceptance stats | Not affected | Updates problem acceptance rate |
| Response | Raw stdout/stderr + execution time | Verdict (Accepted, WA, TLE, etc.) |
| Typical latency | 1-3 seconds | 3-5 seconds (more test cases) |
"Run Code" is the fast debug loop. Users hit it dozens of times per problem to check edge cases with their own inputs. "Submit" is the formal judgment against hidden test cases. Keeping these paths separate prevents debug runs from consuming contest/practice worker capacity.
11.0.1 TLE vs Timeout
Two different failure modes:
- TLE: Code ran but exceeded the time limit. Algorithmic issue. Verdict is permanent.
- Timeout (SYSTEM_BUSY): Submission sat in queue too long. System issue. Auto-requeued with elevated priority (up to 2 retries):
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "PENDING",
"submitted_at": "2026-04-18T14:30:00Z",
"estimated_wait_seconds": 3
}
11.1 Code Sandbox Design (Firecracker microVM Lifecycle)
Warm microVM Pool Architecture
The system maintains a pool of pre-restored Firecracker microVMs, one pool per supported language. Each VM is a fully-booted machine with its own Linux kernel, restored from a snapshot with the language runtime already initialized. When a submission arrives, a VM is claimed from the pool (~2ms) instead of booting a new one (~125ms).
How Firecracker Execution Works
Each submission follows this flow:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"problem_id": 1,
"language": "python3",
"status": "ACCEPTED",
"execution_time_ms": 48,
"memory_usage_kb": 16384,
"test_cases_passed": 35,
"test_cases_total": 35,
"submitted_at": "2026-04-18T14:30:00Z",
"judged_at": "2026-04-18T14:30:03Z",
"percentile": {
"time": 92.3,
"memory": 87.1
}
}
Keeping Firecracker Warm: Zero Boot Time via Snapshot/Restore
Cold-booting a Firecracker microVM takes ~125ms (start VMM, load kernel, mount rootfs, init language runtime). That's too slow at 2K submissions/sec. The solution: Firecracker native snapshots — no external CRIU dependency.
Firecracker snapshot restore ~25ms vs cold boot ~125ms — 5x faster. The snapshot captures the full VM state (memory, guest kernel, language runtime, JIT-warmed code). Restoring is a single API call: PUT /snapshot/load. The VM resumes exactly where it was.
Three-tier warm strategy:
95%+ of submissions claim from the warm pool (2ms). During burst traffic, snapshot restore kicks in (25ms). Cold boot only happens when building snapshots after a language version update.
Snapshot creation (once per language, at image build time):
{
"id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
"problem_id": 1,
"language": "java",
"status": "WRONG_ANSWER",
"test_cases_passed": 18,
"test_cases_total": 35,
"verdict_detail": {
"failed_test_case": 19,
"input": "[3,2,4]\n6",
"expected_output": "[1,2]",
"actual_output": "[0,2]"
},
"submitted_at": "2026-04-18T14:31:00Z",
"judged_at": "2026-04-18T14:31:04Z"
}
Cold boot (fallback, ~125ms):
// Request
{"language": "cpp", "code": "...", "input": "42"}
// Response 200
{"output": "84", "execution_time_ms": 12, "memory_usage_kb": 3456, "exit_code": 0}VM reset via diff snapshot (~5ms) instead of killing and re-creating. After each submission, restore the VM from a pre-execution diff snapshot. This reverts all filesystem and memory changes atomically. A single VM can serve multiple submissions back-to-back without being destroyed.
Pre-warming strategy: The pool dynamically adjusts every 5 seconds: target_idle = max(min_idle, demand_5min_avg * 1.5, contest_pre_warm). Before contests, it boosts based on historical language distribution (Python 45%, C++ 30%, Java 15%).
VM health scoring: Each VM tracks age (max 60 min) and execution count (max 100). VMs below health score 0.3 are killed and replaced from snapshot. Security rotation guarantee — no VM survives longer than an hour.
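A minimal sketch of the health scoring described above, assuming the score is the worse of a linear age decay and a linear execution-count decay (the limits — 60 minutes, 100 executions, kill threshold 0.3 — come from the text; the exact weighting is an assumption):

```java
class VmHealth {
    static final double KILL_THRESHOLD = 0.3;

    // Score decays linearly with both age and execution count; whichever
    // dimension is closer to its limit dominates.
    static double score(long ageMinutes, int execCount) {
        double ageFactor  = Math.max(0.0, 1.0 - ageMinutes / 60.0);
        double execFactor = Math.max(0.0, 1.0 - execCount / 100.0);
        return Math.min(ageFactor, execFactor);
    }

    // Retire when the score drops below threshold — guarantees no VM
    // survives past 60 minutes regardless of how few executions it served.
    static boolean shouldRetire(long ageMinutes, int execCount) {
        return score(ageMinutes, execCount) < KILL_THRESHOLD;
    }
}
```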
Show warm pool management code (pool struct, snapshot restore, pre-warming)
void handleTimeout(Submission sub) {
if (sub.getRetryCount() < 2) {
sub.incrementRetryCount();
sub.setPriority(Priority.HIGH);
kafka.send("submission.retry", sub);
db.updateStatus(sub.getId(), "QUEUED", "Auto-retry due to system timeout");
} else {
db.updateStatus(sub.getId(), "TIMEOUT",
"Server busy. Please try again in a few seconds.");
}
}
Isolation comparison — Docker vs gVisor vs Firecracker:
Docker (runc):
User code → syscall → host Linux kernel (shared!)
Risk: Kernel exploit (Dirty COW) → host compromise
gVisor (runsc):
User code → syscall → gVisor Sentry (user-space Go) → limited host syscalls
Risk: Bug in gVisor code → limited blast radius
Limitation: ~200 of ~300 syscalls reimplemented. Edge cases break.
Firecracker microVM:
User code → syscall → guest Linux kernel → KVM hypercall → host kernel
Risk: KVM escape (extremely rare, hardware-enforced boundary)
Advantage: Full Linux kernel inside VM. Zero syscall compatibility issues.
Resource limit enforcement (inside the microVM):
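A hedged sketch of what guest-side enforcement could look like, assuming the guest init applies standard ulimit caps and drops to the unprivileged sandbox user via util-linux setpriv before exec'ing user code. The helper and values are illustrative, not the production setup:

```java
class GuestLimits {
    // Builds a shell command (run under sh -c inside the guest) that applies
    // per-process limits, then replaces the shell with the user's program.
    static String wrap(String runCommand, int cpuSeconds, int memoryKB, int maxProcs) {
        return String.join(" && ",
            "ulimit -t " + cpuSeconds,  // CPU time — kernel sends SIGKILL past the hard limit
            "ulimit -v " + memoryKB,    // virtual memory, KB (belt-and-suspenders on top of the 256MB VM)
            "ulimit -u " + maxProcs,    // process count — fork-bomb guard
            "ulimit -f 10240",          // max output file size, KB
            "exec setpriv --reuid=sandbox --regid=sandbox --clear-groups " + runCommand);
    }
}
```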
11.2 Judge Pipeline
Test case execution. Commands run inside the Firecracker microVM via vsock: stdin input, capture stdout/stderr, wall-clock time via System.nanoTime(), peak memory from guest cgroup stats. A ScheduledFuture at timeLimitMs + 2s kills the VM on deadline.
Output comparison. Token-based by default (split by whitespace, compare token-by-token — handles trailing whitespace/blank lines). Float comparator: epsilon 1e-6 absolute, 1e-9 relative. Multi-answer problems: checker program in a separate container validates output properties.
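The comparison rules above can be sketched as follows; the epsilon values come from the text, the helper names are assumptions:

```java
import java.util.Arrays;

class OutputComparator {
    // Token-based comparison: split on any whitespace run, so trailing
    // spaces and blank lines never cause a spurious Wrong Answer.
    static boolean tokensEqual(String expected, String actual) {
        return Arrays.equals(expected.trim().split("\\s+"),
                             actual.trim().split("\\s+"));
    }

    // Float comparator: accept if within 1e-6 absolute OR 1e-9 relative error.
    static boolean floatsEqual(double expected, double actual) {
        double diff = Math.abs(expected - actual);
        return diff <= 1e-6 || diff <= 1e-9 * Math.abs(expected);
    }
}
```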
Compilation pipeline by language (10 languages shown)
| Language | Compiler/Runtime | Compile Command | Run Command | Notes |
|---|---|---|---|---|
| Python 3.12 | CPython 3.12 | (none, interpreted) | python3 solution.py | 3x time limit vs C++ |
| Java 21 | OpenJDK 21 | javac -d /sandbox Solution.java | java -cp /sandbox -Xmx256m Solution | Heap limited by -Xmx |
| C++ 20 | g++ 13 | g++ -O2 -std=c++20 -o solution solution.cpp | ./solution | Reference language for time limits |
| Go 1.22 | go 1.22 | go build -o solution solution.go | ./solution | 2x time limit vs C++ |
| Rust 1.77 | rustc 1.77 | rustc -O -o solution solution.rs | ./solution | Same time limit as C++ |
| JavaScript | Node 22 | (none) | node solution.js | 3x time limit |
| C | gcc 13 | gcc -O2 -std=c17 -o solution solution.c -lm | ./solution | Same time limit as C++ |
| C# | .NET 8 | dotnet build -c Release | dotnet run | 3x time limit |
| Kotlin | Kotlin 1.9 / JVM | kotlinc solution.kt -include-runtime -d solution.jar | java -jar solution.jar | 2x time limit |
| TypeScript | tsx (Node 22) | (transpiled on the fly) | tsx solution.ts | 3x time limit |
11.3 Contest System (Real-Time Leaderboard, Penalty Calculation, Elo Rating)
Leaderboard update. On Accepted verdict, a Valkey Lua script atomically: check duplicate → mark accepted → calculate penalty (solve_time + wrong_attempts * 5min) → update sorted set (solved * 10^9 - total_penalty) → PUBLISH rank change for WebSocket push. Wrong attempts: separate Lua script, increments only if problem isn't already accepted.
Show leaderboard Lua scripts (atomic update + wrong attempt tracking)
Verdict executeSubmission(Submission sub, MicroVM vm) throws Exception {
// 1. Write user code to the VM via vsock
vm.writeFile("/sandbox/solution" + ext(sub.getLanguage()), sub.getCode());
// 2. Compile (if needed) via vsock command channel
long deadlineMs = sub.getTimeLimitMs() + 2000;
if (needsCompilation(sub.getLanguage())) {
ExecResult compile = vm.exec(compileCommand(sub.getLanguage()), deadlineMs);
if (compile.getExitCode() != 0)
return Verdict.compilationError(compile.getStderr());
}
// 3. Run against each test case (stop on first failure)
List<TestCase> testCases = testCaseCache.get(sub.getProblemId());
int maxTimeMs = 0, maxMemoryKB = 0, passed = 0;
for (int i = 0; i < testCases.size(); i++) {
TestCase tc = testCases.get(i);
ExecResult result = vm.execWithStdin(runCommand(sub.getLanguage()), tc.getInput(), deadlineMs);
maxTimeMs = Math.max(maxTimeMs, result.getTimeMs());
maxMemoryKB = Math.max(maxMemoryKB, result.getMemoryKB());
if (result.getTimeMs() > sub.getTimeLimitMs())
return Verdict.tle(passed, testCases.size());
if (result.getMemoryKB() > sub.getMemoryLimitKB())
return Verdict.mle(passed, testCases.size());
if (result.getExitCode() != 0)
return Verdict.runtimeError(passed, result.getStderr());
if (!compareOutput(sub.getProblemId(), tc, result.getStdout()))
return Verdict.wrongAnswer(passed, i + 1);
passed++;
}
return Verdict.accepted(passed, testCases.size(), maxTimeMs, maxMemoryKB);
}
Live leaderboard push via WebSocket:
Post-Contest Elo Rating Computation:
The live leaderboard (penalty-time sorting) determines contest placement. After the contest ends, a separate async job computes rating changes using an Elo system adapted for multi-player competition. This is the same approach LeetCode uses (introduced 2020).
The key insight: standard Elo is designed for 1v1 matches. In a contest with N participants, the algorithm treats it as a round-robin where each pair of participants "plays a match" — the higher-ranked participant "wins."
The formula: for each player, compute Expected Rank (sum of win probabilities against all other participants using standard Elo 1/(1 + 10^((Ri - Rj)/400))), then delta = K * (ln(expected_rank) - ln(actual_rank)), clamped to [-150, +150]. K-factor starts at 80 for new players and decreases to 20 with experience. Starting rating: 1500. Absence penalty: ~-10 to -30.
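The rating math above can be sketched in a few lines (a hedged sketch of the described formula, not LeetCode's exact implementation):

```java
class ContestRating {
    // Expected rank = 1 + sum over opponents of P(opponent beats player),
    // using the standard Elo win probability 1/(1 + 10^((Ri - Rj)/400)).
    static double expectedRank(double playerRating, double[] opponents) {
        double rank = 1.0;
        for (double opp : opponents)
            rank += 1.0 / (1.0 + Math.pow(10, (playerRating - opp) / 400.0));
        return rank;
    }

    // delta = K * (ln(expected_rank) - ln(actual_rank)), clamped to [-150, +150].
    static int delta(double expectedRank, int actualRank, double kFactor) {
        double raw = kFactor * (Math.log(expectedRank) - Math.log(actualRank));
        return (int) Math.max(-150, Math.min(150, Math.round(raw)));
    }
}
```

With one equally-rated opponent the expected rank is 1.5; finishing first then yields a positive delta, finishing second a negative one.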
Show Elo rating computation code (Java)
void createSnapshot(String language) throws Exception {
// 1. Cold-boot a fresh microVM
MicroVM vm = coldBoot(language);
// 2. Warm up the runtime inside the VM
// Python: import sys, collections, heapq, math, itertools
// Java: trigger classloader + JIT on common paths (HashMap, Arrays.sort, Scanner)
// Go/Rust/C++: no warm-up needed (statically compiled)
vm.exec(warmupCommands.get(language));
// 3. Snapshot the fully-initialized VM via Firecracker API
// PUT /snapshot/create {"snapshot_type": "Full", ...}
String snapshotDir = "/snapshots/" + language + "/latest";
vm.getVMM().createSnapshot(new SnapshotConfig(
Path.of(snapshotDir, "vmstate"),
Path.of(snapshotDir, "mem"),
true // enableDiffSnapshots — for fast reset
));
// 4. Kill the VM. Snapshot files (~50-100MB) stored on worker SSD.
vm.getVMM().stop();
}
11.4 Multi-Language Support (Per-Language Rootfs Images and Compilation Pipeline)
What runs where — everything is on the same physical machine:
EC2 Instance (c5.metal — bare metal, KVM enabled)
│
├── K8s Node (kubelet, containerd)
│ └── Worker Pod (regular Docker container)
│ └── Java Worker App (Spring Boot)
│ ├── Kafka Consumer (pulls submissions)
│ ├── MicroVM Pool Manager
│ ├── spawns: /usr/bin/firecracker → microVM (python3.ext4)
│ ├── spawns: /usr/bin/firecracker → microVM (java21.ext4)
│ ├── spawns: /usr/bin/firecracker → microVM (cpp20.ext4)
│ └── ... (8 VMMs per pod = 8 concurrent submissions)
│
├── /dev/kvm ← Firecracker needs this device (hardware virtualization)
├── /var/lib/firecracker/
│ └── vmlinux ← shared minimal kernel (~4MB, boots in all VMs)
├── /var/lib/rootfs/
│ ├── python3.ext4 (~300MB, CPython 3.12 pre-installed)
│ ├── java21.ext4 (~450MB, OpenJDK 21 pre-installed)
│ ├── cpp20.ext4 (~250MB, g++ 13 pre-installed)
│ ├── go122.ext4 (~200MB, Go 1.22 pre-installed)
│ └── ... (20+ languages, all compilers pre-installed in rootfs)
└── /var/lib/snapshots/
├── python3/ (vmstate + mem, ~100MB — runtime pre-initialized)
├── java21/ (vmstate + mem, ~120MB — JIT pre-warmed)
└── ... (one snapshot per language for fast restore)
/dev/kvm is mandatory. Firecracker uses KVM for hardware virtualization. The K8s node pool must use bare-metal instances (c5.metal, m5.metal) or instances with nested virtualization enabled. The worker pod mounts /dev/kvm as a host device.
The Java app, Firecracker VMMs, and all microVMs share the same physical CPU and RAM. Each Firecracker process is ~5MB. Each microVM gets 256MB RAM and 2 vCPU via KVM resource partitioning. The Java app communicates with each VM via vsock (host↔guest data channel, no network stack).
Software flow (how a submission moves through the system):
Each ext4 rootfs image is a full minimal Linux filesystem with the language compiler pre-installed. Built once in CI, not at runtime:
python3.ext4 → Ubuntu 22.04 minimal + CPython 3.12 + stdlib (~300MB)
java21.ext4 → Ubuntu 22.04 minimal + OpenJDK 21 JDK headless (~450MB)
cpp20.ext4 → Ubuntu 22.04 minimal + g++ 13 (~250MB)
go122.ext4 → Ubuntu 22.04 minimal + Go 1.22 (~200MB)
rust177.ext4 → Ubuntu 22.04 minimal + rustc 1.77 (~280MB)
node22.ext4 → Ubuntu 22.04 minimal + Node.js 22 (~250MB)
All 20+ rootfs images + snapshots fit on a 16GB worker SSD. The build process:
- debootstrap a minimal Ubuntu rootfs (not Docker — Firecracker boots raw ext4)
- Install the language runtime via apt/curl
- Create sandbox user, /sandbox workspace, mount points
- Package as ext4: dd if=/dev/zero of=python3.ext4 bs=1M count=500 && mkfs.ext4 python3.ext4 && mount -o loop && copy → umount
- Snapshot a warm VM from this rootfs (§11.1) — the snapshot includes initialized runtime + JIT
Language-specific time limit multipliers:
The same algorithm runs at very different speeds across languages. A two-sum hash map solution in C++ runs in 5ms. In Python, the same logic takes 50ms. Time limits scale by language.
What LeetCode actually does: Same time limit regardless of language. This makes Python significantly harder for some problems — a common community complaint. This design adds per-language multipliers for fairness. The tradeoff: multipliers add complexity and problem setters must calibrate limits against all languages, not just C++.
| Multiplier | Languages |
|---|---|
| 1.0x (reference) | C, C++, Rust |
| 1.5x | Swift |
| 2.0x | Go, Java, Kotlin, C#, Haskell |
| 2.5x | Scala |
| 3.0x (default) | Python, JavaScript, TypeScript, Ruby, PHP |
If the problem's base time limit is 2000ms (set for C++), a Python solution gets 6000ms. Unknown languages default to 3.0x.
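Applying the multiplier table is a one-line lookup. The language keys here are assumed identifiers; unknown languages fall back to 3.0x as stated:

```java
import java.util.Map;

class TimeLimits {
    // Multiplier table from the section above; anything absent (Python,
    // JavaScript, TypeScript, Ruby, PHP, unknown languages) defaults to 3.0x.
    static final Map<String, Double> MULTIPLIER = Map.of(
        "c", 1.0, "cpp", 1.0, "rust", 1.0,
        "swift", 1.5,
        "go", 2.0, "java", 2.0, "kotlin", 2.0, "csharp", 2.0, "haskell", 2.0,
        "scala", 2.5);

    static int effectiveLimitMs(int baseLimitMs, String language) {
        return (int) (baseLimitMs * MULTIPLIER.getOrDefault(language, 3.0));
    }
}
```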
Language-specific memory overhead and image versioning
Memory considerations:
| Language | Base Memory Overhead | Notes |
|---|---|---|
| C/C++ | ~2MB | Minimal runtime |
| Rust | ~2MB | Minimal runtime |
| Go | ~10MB | Go runtime + GC |
| Java | ~60MB | JVM startup (with -Xmx256m) |
| Kotlin | ~60MB | JVM-based |
| Scala | ~80MB | JVM + Scala runtime |
| Python 3 | ~20MB | Interpreter |
| JavaScript (Node) | ~30MB | V8 engine |
| C# (.NET) | ~40MB | CLR |
| Ruby | ~15MB | Interpreter |
| Haskell (GHC) | ~25MB | RTS |
Image versioning (source of truth):
MicroVM coldBoot(String language) throws Exception {
FirecrackerConfig cfg = FirecrackerConfig.builder()
.socketPath("/tmp/fc-" + UUID.randomUUID() + ".sock")
.kernelImagePath("/var/lib/firecracker/vmlinux") // Minimal kernel, ~4MB
.addDrive(Drive.builder()
.driveId("rootfs")
.pathOnHost(rootfsImages.get(language)) // ext4 with language runtime
.isRootDevice(true)
.isReadOnly(false)
.build())
.machineConfig(new MachineConfig(2, 256)) // 2 vCPU, 256MB
// No network interfaces — completely isolated. Communication via vsock only.
.build();
FirecrackerVMM vmm = FirecrackerVMM.start(cfg);
// PUT /actions {"action_type": "InstanceStart"}
vmm.startInstance();
return new MicroVM(vmm, language, VMState.IDLE);
}
11.5 Submission Queue Management (Priority, Fair Queuing, Auto-Scaling)
During contests, submission traffic spikes 5-10x. The queue layer must handle this without either dropping contest submissions or starving practice submissions entirely.
Queue architecture with premium priority:
Premium "Lightning Judge" priority (practice submissions only):
LeetCode Premium subscribers get 3-10x faster judging during peak hours. This is implemented as a separate Kafka topic (submission.premium) that practice workers consume with higher priority. The key constraint: during contests, all contest submissions are equal priority regardless of premium status — fairness in competition is non-negotiable.
public class MicroVMPool {
private final Map<String, BlockingQueue<MicroVM>> idle = new ConcurrentHashMap<>();
private final Map<String, AtomicInteger> busyCount = new ConcurrentHashMap<>();
private final Map<String, String> snapshotDirs; // language -> snapshot path
private final PoolConfig config;
// Default pool sizes per language
static final Map<String, Integer> DEFAULT_MIN_IDLE = Map.ofEntries(
entry("python3", 200), entry("java", 150), entry("cpp", 150),
entry("go", 100), entry("rust", 50), entry("javascript", 80),
entry("csharp", 50), entry("kotlin", 40), entry("ruby", 30),
entry("swift", 30), entry("typescript", 60), entry("c", 100)
);
// Restore a VM from Firecracker snapshot — ~25ms
MicroVM restoreVM(String language) throws Exception {
String snapshotDir = snapshotDirs.get(language);
String socket = "/tmp/fc-" + UUID.randomUUID() + ".sock";
FirecrackerVMM vmm = FirecrackerVMM.start(
FirecrackerConfig.builder().socketPath(socket).build());
vmm.loadSnapshot(new SnapshotConfig(
Path.of(snapshotDir, "vmstate"),
Path.of(snapshotDir, "mem"),
true // enableDiffSnapshots
));
return new MicroVM(vmm, language, VMState.IDLE);
}
// Reset VM via diff snapshot — ~5ms, reverts all changes
void resetVM(MicroVM vm) throws Exception {
vm.getVMM().loadSnapshot(vm.getBaseSnapshot());
vm.incrementExecCount();
}
// Pre-warming — runs every 5 seconds via ScheduledExecutorService
void adjustPoolSize() {
for (String lang : idle.keySet()) {
int target = Math.max(
config.getMinIdle(lang),
Math.max(
(int)(demandMovingAvg(lang, Duration.ofMinutes(5)) * 1.5),
contestPreWarm(lang)));
int current = idle.get(lang).size();
if (current < target) executor.submit(() -> spawnVMs(lang, target - current));
if (current > target * 2) retireExcess(lang, current - target);
}
}
}
During non-contest hours, premium submissions typically get verdicts in ~1-2 seconds (vs ~3-5 seconds for free tier). During peak contest hours with practice traffic spillover, the gap widens to 3-10x.
Fair queuing during contests:
The key insight: contest workers are dedicated and never share capacity with practice. But practice workers can be temporarily re-assigned to contest duty during peak load. This is done by having practice workers additionally subscribe to the contest topic when contest queue depth exceeds a threshold.
The adaptive consumer checks contest queue depth periodically. When depth exceeds a threshold, practice workers subscribe to the contest topic as well (dual-consume). When depth drops below half the threshold, they stop helping and return to practice-only mode. Hysteresis (threshold vs threshold/2) prevents flapping.
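The hysteresis logic is small enough to show inline (threshold value and names are illustrative):

```java
class AdaptiveConsumer {
    private boolean helpingContest = false;

    // Dual-consume above the threshold, stop only below half of it. A queue
    // hovering near the threshold therefore can't cause subscribe/unsubscribe
    // flapping — the consumer commits to one mode until depth moves decisively.
    boolean shouldHelpContest(long contestQueueDepth, long threshold) {
        if (!helpingContest && contestQueueDepth > threshold) {
            helpingContest = true;          // start dual-consuming contest topic
        } else if (helpingContest && contestQueueDepth < threshold / 2) {
            helpingContest = false;         // return to practice-only mode
        }
        return helpingContest;
    }
}
```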
Auto-scaling based on queue depth. KEDA ScaledObject watches Kafka consumer lag per topic. Contest workers: min 100, max 1000 pods, scale up when lag >50 per partition, 5-second polling, 60-second cooldown. Practice workers follow the same pattern with higher min (200) and max (1500).
Queue position feedback. When the queue is deep, the API returns an estimated_wait_seconds field calculated as consumer_lag / processing_rate. This lets the UI show "Estimated wait: ~8 seconds" instead of a spinner.
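The estimate itself is a one-liner; the field name matches the API above, and rounding up is an assumption:

```java
class QueueEta {
    // estimated_wait_seconds = consumer_lag / processing_rate, rounded up.
    // Returns 0 when there is no backlog or the rate is unknown.
    static long estimatedWaitSeconds(long consumerLag, double submissionsPerSecond) {
        if (submissionsPerSecond <= 0 || consumerLag <= 0) return 0;
        return (long) Math.ceil(consumerLag / submissionsPerSecond);
    }
}
```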
Per-user fair queuing. Rate limiting via Valkey sliding window (INCR ratelimit:{user_id}:{minute_bucket}, 2-min TTL) prevents any single user from monopolizing execution capacity. Fails open on Valkey errors.
Contest-period auto-scaling timeline:
11.6 Web IDE (Monaco Editor Integration)
Monaco Editor — the VS Code engine. Syntax highlighting, bracket matching, minimap, multi-cursor for 20+ languages.
| Feature | Free Tier | Premium |
|---|---|---|
| Syntax highlighting (20+ languages) | Yes | Yes |
| Context-aware autocomplete | Basic (keyword-based) | Enhanced (language-server-backed) |
| Vim / Emacs keybinding modes | Yes | Yes |
| Font size, tab size customization | Yes | Yes |
| Integrated debugger (breakpoints, step-through) | No | Yes |
| Dark/light theme | Yes | Yes |
Starter code: Per-language templates from problems.starter_code JSONB. Defines the function signature the user implements.
Premium debugger: Breakpoints, step-through, variable inspection. Runs in a separate container with relaxed limits (60s). Not counted as submissions.
11.7 Problem Creation and Test Case Pipeline
4,000+ problems aren't hand-crafted. There's a pipeline.
Problem setters: Contracted engineers (rating 2000+, 1000+ solved). $20-65 per problem.
Test case generation pipeline:
- Generator: Python script that produces random inputs satisfying the problem's constraints. For "Two Sum": generates arrays of length 2 to 10^4 with values in [-10^9, 10^9] and a valid target.
- Validator: Verifies every generated input satisfies stated constraints (length bounds, value ranges, graph connectivity, etc.). Rejects malformed test cases.
- Reference solutions: Problem setter writes correct solutions in at least C++, Python, and Java. These are run against all test cases to generate expected outputs and calibrate time limits.
- Time limit calibration: Set at 3x the runtime of the slowest reference solution (in C++). Language multipliers (§11.4) adjust from there.
- Cache invalidation: When test cases are updated for an existing problem, a Kafka message invalidates the worker SSD cache so workers fetch the new version on next access.
12. Identify Bottlenecks
Bottleneck 1: Warm microVM pool exhaustion during contest spikes
During a contest, Python and C++ submissions dominate (60% and 25% respectively). If the Python pool has 200 warm VMs and 300 concurrent Python submissions arrive, 100 submissions block waiting for a VM.
Mitigation: Dynamic pool rebalancing. A rebalancer runs every 5 seconds: when a language's queue depth exceeds 2x its idle pool, it finds a "donor" language with excess idle VMs (Haskell, Ruby, Scala) and transfers capacity — kill idle donor VMs and restore microVMs from Firecracker snapshot (~25ms) for the starved language. Also, pre-warm extra Python and C++ VMs 15 minutes before contest start (see §11.1 pre-warming strategy).
Bottleneck 2: PostgreSQL write throughput for submission status updates
Each submission generates 3-4 status updates (PENDING -> QUEUED -> COMPILING -> RUNNING -> verdict). At 2,000 submissions/sec peak, that is 8,000 UPDATE statements per second.
Mitigation: Batch status updates. Workers buffer intermediate status transitions (COMPILING, RUNNING) and flush to PostgreSQL every 500ms in a single multi-row UPDATE ... FROM (VALUES ...). Only the final verdict update is sent immediately (the user is waiting for it). The PENDING→QUEUED transition batches naturally via Kafka producer's linger.ms.
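A sketch of building the flush statement. A real worker would bind parameters through a PreparedStatement; the column list and the uuid/text casts are assumptions:

```java
class StatusBatcher {
    // One multi-row UPDATE ... FROM (VALUES ...) replaces batchSize individual
    // UPDATEs — a single round trip and a single statement to plan.
    static String buildFlushSql(int batchSize) {
        StringBuilder values = new StringBuilder();
        for (int i = 0; i < batchSize; i++) {
            if (i > 0) values.append(", ");
            values.append("(?::uuid, ?::text)");
        }
        return "UPDATE submissions s SET status = v.status " +
               "FROM (VALUES " + values + ") AS v(id, status) " +
               "WHERE s.id = v.id";
    }
}
```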
Bottleneck 3: Test case file I/O on workers
Each submission reads test case files from disk. With 30 test cases per problem and 2,000 submissions/sec, that is 60,000 file reads per second across the worker fleet.
Mitigation: In-memory test case cache on each worker. The entire test case corpus is <2GB — load all test cases into a Map<Integer, List<TestCase>> at worker startup. Refresh every 10 minutes or on cache invalidation signal via Kafka when a problem's test cases are updated. This eliminates all disk I/O for test case reads.
13. Failure Scenarios
Scenario 1: Execution Worker Crash Mid-Execution
| Time | Event |
|---|---|
| T+0s | Worker pod crashes (OOM kill, node eviction, or Firecracker VMM crash). |
| T+0s | Submission is mid-execution. Test case 15 of 30 was running. |
| T+10s | Kafka consumer session timeout expires. The unacknowledged message is reassigned to another worker. |
| T+10s | New worker picks up the submission, starts fresh (compilation + all test cases from scratch). |
| T+15s | Submission status was stuck at "RUNNING". New worker updates it. User sees the verdict. |
Impact: 10-15 seconds of additional delay for the affected submission. The user sees the verdict eventually. No data loss. The wasted execution time (test cases 1-15 on the crashed worker) is unrecoverable, but at 3 seconds average execution time, the total waste is small.
What about the VMs? The crashed worker's Firecracker processes are orphaned. The Kubernetes kubelet detects the dead pod and cleans up all associated processes. The warm pool on other workers is unaffected.
Scenario 2: Valkey Crash During Contest
| Time | Event |
|---|---|
| T+0s | Valkey primary crashes. |
| T+0-1s | Valkey Sentinel detects the failure and promotes the replica to primary. |
| T+1-2s | All leaderboard writes fail during failover window. |
| T+2s | New primary is ready. Workers reconnect via Sentinel. |
| T+2s | Leaderboard updates resume. Submissions that got verdicts during the 2-second window had their leaderboard updates dropped. |
Impact: 1-2 seconds of leaderboard inconsistency. Submissions still get judged (that flow only depends on Kafka and PostgreSQL). The missed leaderboard updates are recovered by a reconciliation job that runs every 30 seconds: it queries PostgreSQL for all "Accepted" contest submissions and ensures each has a corresponding leaderboard entry in Valkey. Any missing or mismatched entries are corrected automatically.
Scenario 3: Kafka Broker Goes Down
| Time | Event |
|---|---|
| T+0s | One of 3 Kafka brokers crashes. |
| T+0s | Partitions led by the crashed broker become unavailable. |
| T+0-15s | Kafka controller detects failure, re-elects partition leaders on surviving brokers. |
| T+15s | All partitions available again. Messages in-flight during failure are retried by producers (idempotent producer with retries=5). |
Impact: 0-15 seconds of elevated submission latency. No submissions are lost (replication factor 3 means 2 copies of every message survive). The API returns 202 Accepted before the Kafka write completes, so users don't see the delay on submission. They see a longer wait for the verdict.
Scenario 4: Container Escape Attempt
| Time | Event |
|---|---|
| T+0s | User submits code that attempts a known container escape exploit. |
| T+0.1s | The exploit targets the guest Linux kernel inside the microVM. Even if it succeeds, the attacker is still inside the VM. |
| T+0.1s | The KVM boundary prevents any guest-to-host escape. The attacker controls a 256MB VM with no network — a dead end. |
| T+0.2s | Code crashes or times out. Worker reports Runtime Error or TLE verdict. |
| T+1s | Security monitoring detects anomalous VM behavior (unexpected syscall patterns logged by guest kernel). Alert fires. |
| T+5m | Security team reviews the submission. User account flagged. |
Impact: Zero. Even a successful kernel exploit inside the VM is contained by the KVM hardware boundary. The attacker gets a broken VM with no network. This is why Firecracker was chosen — a compromised guest kernel is a non-event.
Scenario 5: Database Partition Exhaustion
| Time | Event |
|---|---|
| T+0 | The current weekly partition for submissions table fills up (approaching partition boundary date). |
| T+0 | A cron job that creates future partitions 4 weeks in advance failed silently 3 weeks ago. |
| T+0 | INSERTs into the submissions table fail: "no partition of relation submissions found for row." |
Impact: All new submissions fail. Judging continues for already-queued submissions.
Prevention: Daily check on pg_inherits for future partitions. Alert if <2 exist. Weekly job creates 4 weeks ahead; daily job verifies next 14 days as a safety net.
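The create-ahead job reduces to generating DDL for the next N weekly ranges. The partition naming scheme below is an assumption:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class PartitionMaintenance {
    // Emits CREATE TABLE statements for the next N weekly partitions of the
    // submissions table (RANGE-partitioned on submitted_at, per the schema).
    // IF NOT EXISTS makes the job idempotent, so daily and weekly runs overlap safely.
    static List<String> weeklyPartitionDdl(LocalDate from, int weeksAhead) {
        List<String> ddl = new ArrayList<>();
        for (int w = 0; w < weeksAhead; w++) {
            LocalDate start = from.plusWeeks(w);
            LocalDate end = start.plusDays(7);
            String name = "submissions_" + start.toString().replace('-', '_');
            ddl.add("CREATE TABLE IF NOT EXISTS " + name +
                    " PARTITION OF submissions FOR VALUES FROM ('" + start +
                    "') TO ('" + end + "')");
        }
        return ddl;
    }
}
```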
14. Deployment Strategy
Workers. Rolling deployment, maxUnavailable: 10%. Canary pool (5%) first, monitor 15 min (SYSTEM_ERROR rate, latency percentiles, crash rate), then full rollout.
Key Kubernetes Deployment settings for the practice worker pool:
-- ── Leaderboard update (called on Accepted verdict) ──
-- KEYS[1] = contest:<id>:leaderboard (sorted set)
-- KEYS[2] = contest:<id>:user:<user_id> (hash)
-- ARGV[1] = user_id, ARGV[2] = problem_key, ARGV[3] = solve_time_seconds, ARGV[4] = penalty_per_wrong
local already = redis.call('HGET', KEYS[2], ARGV[2] .. '_accepted')
if already == '1' then return 0 end -- Duplicate, ignore
redis.call('HSET', KEYS[2], ARGV[2] .. '_accepted', '1')
redis.call('HSET', KEYS[2], ARGV[2] .. '_time', ARGV[3])
local wrong = tonumber(redis.call('HGET', KEYS[2], ARGV[2] .. '_wrong_attempts') or '0')
local solved = redis.call('HINCRBY', KEYS[2], 'problems_solved', 1)
local penalty = tonumber(ARGV[3]) + (wrong * tonumber(ARGV[4]))
local new_penalty = tonumber(redis.call('HGET', KEYS[2], 'total_penalty') or '0') + penalty
redis.call('HSET', KEYS[2], 'total_penalty', new_penalty)
local score = solved * 1000000000 - new_penalty
redis.call('ZADD', KEYS[1], score, ARGV[1])
local rank = redis.call('ZREVRANK', KEYS[1], ARGV[1])
redis.call('PUBLISH', KEYS[1] .. ':update', -- KEYS[1] already carries the contest:<id> prefix
cjson.encode({user_id=ARGV[1], rank=rank+1, problems_solved=solved, total_penalty=new_penalty}))
return rank + 1
-- ── Wrong attempt tracking (called on non-Accepted verdict) ──
-- KEYS[1] = contest:<id>:user:<user_id> (hash), ARGV[1] = problem_key
local already = redis.call('HGET', KEYS[1], ARGV[1] .. '_accepted')
if already == '1' then return -1 end
return redis.call('HINCRBY', KEYS[1], ARGV[1] .. '_wrong_attempts', 1)
Kata Containers RuntimeClass: handler: kata-fc (Firecracker backend) with a nodeSelector ensuring pods land on KVM-enabled nodes (kata-containers.io/firecracker: "true").
Language image updates. When a new language version is released (e.g., Python 3.13): build image in CI → run 100 curated problems → flag >20% time regression → deploy to 1% canary → monitor 24h → roll out to all workers → update IDE version display.
Contest-period deployment freeze. A CI/CD gate checks the contest schedule and blocks all deploys from 30 minutes before contest start to 30 minutes after end.
Database migrations. Run as K8s Jobs before app deployment. All backward-compatible. ALTER TABLE ADD COLUMN DEFAULT is instant in PG 11+ (no table rewrite).
15. Observability
Key Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| submission_e2e_latency_p95 | Workers | > 5 seconds for 5 minutes |
| submission_e2e_latency_p99 | Workers | > 10 seconds for 2 minutes |
| verdict_distribution{type="SYSTEM_ERROR"} | Workers | > 0.1% of submissions |
| container_pool_idle{language} | Workers | < 10% of min idle for any language for 2 min |
| container_pool_exhausted{language} | Workers | Any exhaustion event |
| container_crash_rate | Workers | > 1% in 10 minutes |
| kafka_consumer_lag{topic} | Kafka | > 5000 messages for 2 min (contest), > 10000 (practice) |
| leaderboard_update_latency_p99 | Valkey | > 100ms for 1 minute |
| websocket_connections_active | WS Gateway | > 90% of capacity |
| vm_anomalous_behavior | Firecracker/Workers | Unexpected VMM exit, KVM error, or guest kernel panic |
| compilation_error_rate{language} | Workers | > 50% (indicates broken language image) |
| test_case_cache_miss_rate | Workers | > 5% (cache not warming properly) |
| pg_submission_insert_latency_p99 | PostgreSQL | > 50ms |
Dashboard Layout
+------------------------------------------+------------------------------------------+
| Submissions/sec (real-time) | Verdict Distribution (pie chart) |
| [contest] [practice] [custom] | [AC] [WA] [TLE] [MLE] [RE] [CE] [SE] |
+------------------------------------------+------------------------------------------+
| E2E Latency (p50, p95, p99) | Kafka Consumer Lag (per topic) |
| [time series, last 4 hours] | [contest] [practice] [custom-test] |
+------------------------------------------+------------------------------------------+
| Container Pool Status (per language) | Active WebSocket Connections |
| [idle] [busy] [total] per language | [time series, last 1 hour] |
+------------------------------------------+------------------------------------------+
| Contest Leaderboard Update Latency | Worker Pod Count (current / max) |
| [p50, p99, errors] | [contest] [practice] [custom-test] |
+------------------------------------------+------------------------------------------+
Distributed Tracing
Every submission gets a trace ID that follows it through the entire pipeline:
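For example, the spans for one contest submission might break down like this (span names and timings are illustrative, not measured production values):

```
trace_id: <uuid>  (submission, contest topic, Java)
├─ api.validate           0ms   →    8ms   (auth + rate-limit + static scan)
├─ kafka.enqueue          8ms   →   11ms
├─ queue.wait            11ms   →  190ms   (consumer lag)
├─ vm.acquire           190ms   →  194ms   (warm pool claim)
├─ compile              194ms   →  850ms   (inside microVM)
├─ execute.tests[1..25] 850ms   → 2.90s
└─ verdict.persist+push 2.90s   → 2.95s    (PG insert, WebSocket)
```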
ClickHouse Analytics Schema
submission_analytics table: MergeTree engine, partitioned by month, ordered by (problem_id, language, submitted_at). Stores every submission's timing, memory, verdict, queue wait, and compile time. Key queries: language popularity trends, problem difficulty validation (p50/p95/p99 execution times for accepted submissions), and per-worker performance.
Post-contest Elo rating computation (§11.3; full formula in Appendix B):
Map<String, Integer> computeRatingChanges(List<ContestResult> participants) {
Map<String, Integer> changes = new HashMap<>();
for (int i = 0; i < participants.size(); i++) {
ContestResult player = participants.get(i);
int actualRank = i + 1;
// Expected rank = 1 + sum of P(opponent beats player) over all opponents
double expectedRank = 1.0;
for (int j = 0; j < participants.size(); j++) {
if (i == j) continue;
ContestResult opp = participants.get(j);
expectedRank += 1.0 / (1.0 + Math.pow(10, (player.getRating() - opp.getRating()) / 400.0));
}
// K-factor decreases with experience
int k = Math.max(20, 80 - player.getContestsPlayed() * 2);
// Delta: positive if performed better than expected
int delta = (int)(k * (Math.log(expectedRank) - Math.log(actualRank)));
delta = Math.max(-150, Math.min(150, delta));
changes.put(player.getUserId(), delta);
}
return changes;
}

16. Security
Hardware VM isolation via Firecracker. Each submission gets its own kernel. Even a successful guest exploit is a dead end.
Sandbox security is the entire ballgame. Compromised sandbox = access to test cases, other users' code, or the host system. Every decision here is defense-in-depth.
16.1 Eight Layers of Isolation (Defense-in-Depth)
The primary isolation boundary is the Firecracker microVM — a hardware KVM boundary that gives each submission its own Linux kernel. Even if an attacker achieves a full kernel exploit inside the VM, they control a 256MB machine with no network and no path to the host. Inside the VM, additional layers provide defense-in-depth.
What each layer stops:
| Layer | What It Does | What It Stops | Real CVE/Attack Prevented |
|---|---|---|---|
| 1. Firecracker KVM | Hardware VM isolation. Separate guest kernel. Host only sees KVM hypercalls. | ALL guest-side exploits. Kernel exploits, container escapes — they compromise the guest, not the host. | CVE-2019-5736, CVE-2020-15257, CVE-2022-0185 — all are guest-only events under Firecracker |
| 2. Namespaces (guest) | Isolates mount, PID, network, user, IPC, UTS inside guest | Seeing other guest processes, accessing guest filesystem | Additional isolation within the VM |
| 3. cgroups v2 (guest) | Hard CPU, memory (256MB), PIDs (64) limits inside guest | Fork bombs, memory bombs, CPU starvation | while(1) fork() hits PID limit. malloc(1TB) OOM-killed at 256MB |
| 4. seccomp-bpf (guest) | Syscall filter inside the guest kernel | Restricts even inside the VM — principle of least privilege | Blocks io_uring, ptrace, keyctl inside guest |
| 5. Read-only rootfs | Only /sandbox writable inside guest | Writing malware, modifying system binaries | Can't modify /etc/passwd or system binaries |
| 6. No network | No virtio-net device attached to VM. Only vsock to host worker. | Data exfiltration, solution fetching, reverse shells | No DNS, no TCP, no UDP — the VM has no network stack |
| 7. Capability drop | All 41 Linux capabilities removed inside guest | Raw sockets, mounting filesystems, module loading | Defense-in-depth inside the VM |
| 8. no-new-privileges | Prevents setuid/setgid escalation inside guest | Privilege escalation within the VM | Even with root inside VM, can't escape KVM boundary |
Why Firecracker is stronger than gVisor:
| Scenario | gVisor | Firecracker |
|---|---|---|
| Guest kernel exploit | Compromises gVisor Sentry (software boundary) | Stays inside VM (hardware KVM boundary) |
| Syscall compatibility | 95% — edge cases with io_uring, ptrace | 100% — full Linux kernel, zero compat issues |
| Resource accounting | Tricky — some kernel buffers not counted by Sentry | Clean — KVM enforces hard VM memory boundary |
| Debugging failures | "operation not permitted" / "bad system call" — ambiguous | Standard Linux errors — the guest is a real OS |
No path from guest to host. The Firecracker VMM process runs as an unprivileged user on the host. It exposes no network to the guest, no shared filesystem, and communicates only via vsock. Even a compromised VMM (no publicly known instance to date) would only yield access to a single submission's VM data.
16.2 Network Isolation
Worker pods have a strict Kubernetes NetworkPolicy: sandbox containers can reach nothing on the network. The worker process itself can only egress to PostgreSQL (5432), Valkey (6379), Kafka (9092), and S3 (443). All other egress is denied.
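A minimal sketch of such a policy (the namespace, labels, and the HTTPS egress rule for S3 are assumptions; the production manifest will differ):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: worker-egress
  namespace: judge                # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: execution-worker       # assumed label
  policyTypes: ["Egress"]
  egress:
    - to: [{ podSelector: { matchLabels: { app: postgresql } } }]
      ports: [{ port: 5432, protocol: TCP }]
    - to: [{ podSelector: { matchLabels: { app: valkey } } }]
      ports: [{ port: 6379, protocol: TCP }]
    - to: [{ podSelector: { matchLabels: { app: kafka } } }]
      ports: [{ port: 9092, protocol: TCP }]
    - to: [{ ipBlock: { cidr: 0.0.0.0/0 } }]   # S3 over HTTPS only
      ports: [{ port: 443, protocol: TCP }]
```

With no other egress rules, everything else is denied by default once the policy selects the pod. Sandbox microVMs are stricter still: they have no network device at all.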
16.3 microVM Resource Hard Limits
Every microVM runs with:
- Memory: 256MB hard limit (no swap, OOM kill on exceed)
- CPU: 2 vCPU (KVM resource partitioning)
- PIDs: 64 maximum inside guest (prevents fork bombs)
- Wall clock: 15 seconds (worker kills the VM)
- Disk: /sandbox writable, rootfs read-only
- Network: no virtio-net device attached — only vsock to host
- Capabilities: all dropped inside guest
- New privileges: blocked (no-new-privileges inside guest)
A per-VM watchdog thread in the Java worker starts a ScheduledFuture with 15s deadline. On expiry, it kills the Firecracker VMM process and increments the timeout metric.
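A minimal sketch of that watchdog (class and method names are hypothetical; the real task would call destroyForcibly() on the Firecracker VMM process and bump the timeout metric):

```java
import java.util.concurrent.*;

// Hypothetical sketch: arm a kill-switch per VM; cancel it on normal completion.
class VmWatchdog {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "vm-watchdog");
            t.setDaemon(true);
            return t;
        });

    /** Schedules killVmm to fire after deadlineMillis unless cancelled first. */
    ScheduledFuture<?> arm(Runnable killVmm, long deadlineMillis) {
        return scheduler.schedule(killVmm, deadlineMillis, TimeUnit.MILLISECONDS);
    }
}
```

On a normal verdict the worker cancels the future; on expiry the task kills the VMM, which the worker observes as a TLE.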
16.4 Test Case Protection
Expected output NEVER enters the sandbox. Only the input is copied in. Comparison happens in the worker process outside the VM. Even if a user reads every file in their sandbox, they cannot discover the expected answers.
16.5 Code Scanning (Soft Blocks)
Before execution, the API performs a fast static scan of submitted code for known dangerous patterns:
| Language | Blocked Patterns | Reason |
|---|---|---|
| Python | import os, import subprocess, __import__ | System command execution |
| C/C++ | #include <sys/ptrace.h>, #include <sys/mount.h> | Kernel interaction |
| Java | Runtime.getRuntime().exec, ProcessBuilder | Process spawning |
| Go | os/exec, syscall.Exec | System commands |
| JavaScript | child_process, require('fs') | System access |
Note: These are soft blocks (warning + flag for review), not hard blocks. Legitimate solutions sometimes use OS-level primitives. The sandbox is the real security boundary, not code scanning.
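A sketch of the scan step (class name is an assumption; the pattern list is abbreviated from the table above):

```java
import java.util.*;
import java.util.regex.Pattern;

// Illustrative soft-block scanner: a non-empty result means "warn + flag
// for review"; execution still proceeds. Patterns abbreviated for brevity.
class CodeScanner {
    private static final Map<String, List<Pattern>> BLOCKED = Map.of(
        "python", List.of(
            Pattern.compile("\\bimport\\s+os\\b"),
            Pattern.compile("\\bimport\\s+subprocess\\b"),
            Pattern.compile("__import__")),
        "java", List.of(
            Pattern.compile("Runtime\\.getRuntime\\(\\)\\.exec"),
            Pattern.compile("\\bProcessBuilder\\b")));

    /** Returns the patterns that matched; empty list = nothing flagged. */
    static List<String> scan(String language, String source) {
        List<String> hits = new ArrayList<>();
        for (Pattern p : BLOCKED.getOrDefault(language, List.of()))
            if (p.matcher(source).find()) hits.add(p.pattern());
        return hits;
    }
}
```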
16.6 Audit Logging and Incident Response
Every submission, VM creation, VM destruction, and resource limit breach is logged to an append-only Kafka topic with 365-day retention. Security incidents trigger automated response:
| Event | Automated Response |
|---|---|
| Firecracker VMM unexpected exit | Log + alert. Flag submission. Investigate guest kernel panic or VMM bug. |
| Container OOM kill | Normal (MLE verdict). Log for capacity planning. |
| Container wall clock kill | Normal (TLE verdict). Log. |
| Repeated anomalous VM behavior from same user | Rate-limit user. Alert security team. |
| Suspected escape attempt (unusual VMM interaction pattern) | Kill VM. Flag user for review. Alert immediately. |
Secret management. Workers authenticate to PostgreSQL, Kafka, and S3 using Kubernetes Secrets injected as environment variables. Secrets are rotated monthly via an automated pipeline. Database credentials use short-lived tokens from HashiCorp Vault (4-hour TTL, auto-renewed).
Rate limiting. Per-user rate limits prevent abuse:
- Practice: 5 submissions per minute
- Contest: 10 submissions per minute
- Custom test: 10 runs per minute
- API read endpoints: 100 requests per minute
Enforced via Valkey sliding window counters at the API gateway layer. Exceeding the limit returns HTTP 429 with a Retry-After header.
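The production limiter keeps its state in Valkey so every gateway instance shares one view; this in-memory sketch shows the sliding-window logic only (class name is an assumption):

```java
import java.util.*;

// Illustrative sliding-window limiter. Production state lives in Valkey;
// this shows only how the window is maintained and checked.
class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Deque<Long>> events = new HashMap<>();

    SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** true = allow; false = respond HTTP 429 with a Retry-After header. */
    synchronized boolean allow(String userId, long nowMillis) {
        Deque<Long> q = events.computeIfAbsent(userId, k -> new ArrayDeque<>());
        while (!q.isEmpty() && q.peekFirst() <= nowMillis - windowMillis)
            q.pollFirst();                      // evict timestamps outside the window
        if (q.size() >= limit) return false;    // limit reached: reject
        q.addLast(nowMillis);
        return true;
    }
}
```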
17. SLOs and Error Budgets
SLOs make the quality target concrete. Error budgets turn "be more careful" into "freeze deploys this week."
| SLI | SLO | Monthly Error Budget |
|---|---|---|
| Submission e2e latency p95 ≤ 5s | 99.9% | 43.2 min |
| Judge availability (non-SYSTEM_ERROR, non-TIMEOUT) | 99.95% | 21.6 min |
| Contest leaderboard accuracy (rank matches PostgreSQL source of truth) | 99.99% | 4.3 min |
| Container warm-pool hit rate | 95% | N/A (capacity metric, not error budget) |
| Sandbox escape rate | 0% | Budget-less — any escape is a P0 incident |
| microVM restore p99 ≤ 200ms (Firecracker snapshot) | 99.9% | 43.2 min |
| WebSocket verdict delivery ≤ 500ms from worker verdict | 99.5% | 3.6 h |
Error budget policy:
- Normal burn (≤1x over 30d): Business as usual. Ship features.
- Fast burn (>2x over 7d): Freeze non-critical deploys. Director sign-off for launches. Investigate root cause.
- Very fast burn (>4x over 1d): Page on-call. Freeze all non-rollback deploys. Launch incident response.
- Exhausted: Next sprint goes entirely to reliability. No new features until budget replenishes.
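The burn multiples above follow from a simple ratio (a sketch; the class and method names are assumptions):

```java
class ErrorBudget {
    /** Burn rate: observed error rate divided by the rate the SLO allows.
        1.0 means the budget is being consumed exactly on schedule. */
    static double burnRate(long badEvents, long totalEvents, double slo) {
        double errorRate = (double) badEvents / totalEvents;
        return errorRate / (1.0 - slo);
    }
}
```

At a 99.9% SLO, 20 bad events out of 10,000 gives 0.002 / 0.001 = 2.0, i.e. the 7-day fast-burn threshold.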
Alert tiering:
- Page-now (5-min response): Sandbox escape attempt, availability burn rate, Kafka lag >5000 for contest topic, container pool exhaustion for any language, SYSTEM_ERROR rate >0.1%.
- Page-business-hours: Latency budget burn, cache-miss rate rising, compilation error rate spike (broken language image), leaderboard drift from PostgreSQL source of truth.
- Ticket-only: Capacity warnings (worker CPU >70%), disk warnings, pending schema migrations, individual worker pod restarts.
18. Operational Playbook
A design that doesn't document its ops story isn't production-grade.
18.1 Backup and Recovery
| Component | Backup Strategy | RPO | RTO |
|---|---|---|---|
| PostgreSQL | Continuous WAL archiving to S3 + daily base backups | ~0 (WAL) | < 30 min (PITR) |
| Test cases (S3) | S3 versioning + cross-region replication | 0 | < 5 min |
| Valkey (leaderboards) | Reconstructible from PostgreSQL (source of truth) | N/A | < 2 min |
| Kafka | 3x replication factor, 7-day retention | 0 | < 15 min (broker replacement) |
| Container images | Registry with multi-AZ replication | 0 | < 10 min |
| ClickHouse (analytics) | Daily snapshots to S3 | 24h | < 2h |
Quarterly restore drill. Restore PG from a random backup to staging. Verify data integrity. Failed restore = P1.
18.2 Capacity Planning
Three leading indicators:
- Worker CPU. Target <50% avg, <70% peak. Sustained >70% → KEDA should auto-scale, but verify node pool headroom.
- Kafka lag. Contest >100 for 2 min = backing up. Practice >1000 = degraded. Custom-test >5000 = acceptable.
- PG connection pool. PgBouncer >80% = queries queueing. ~1,700 DB ops/sec at steady state (3 ops per submission).
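KEDA's Kafka scaler drives the worker auto-scaling mentioned above; a sketch of the trigger (resource names, topic, and replica bounds are assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: contest-workers            # assumed name
spec:
  scaleTargetRef:
    name: execution-worker-contest # assumed Deployment
  minReplicaCount: 50              # illustrative floor
  maxReplicaCount: 600
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: contest-workers
        topic: submissions.contest # assumed topic name
        lagThreshold: "100"        # add replicas when lag per replica exceeds 100
```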
18.3 Schema Migrations
50M rows/day. Zero-downtime schema changes only:
- ADD COLUMN DEFAULT — instant in PG 11+.
- CREATE INDEX CONCURRENTLY — no table lock.
- Weekly partitions created 4 weeks ahead. Daily job verifies next 14 days as safety net.
- Forward-only. Mistakes get a compensating migration, never a rollback.
18.4 Top 5 Alerts and Mitigations
Know these cold:
1. Container pool exhausted — Contest spike (pre-warm late) or container leak? Spike: verify KEDA scaling + node pool headroom. Leak: check worker logs for stuck containers.
2. Kafka lag spike (>5000 on contest topic) — Verify KEDA scaling. At max replicas: redirect practice workers to contest topic. Check for slow consumers (30s/submission = container issue).
3. PG connection pool >80% — Check pg_stat_activity for stuck queries. Verify partition creation didn't fail (missing partition = all INSERTs fail).
4. Firecracker VMM crash — Check if the guest kernel panicked (user code triggered a kernel bug inside the VM — harmless to host, but the submission needs a retry). If the VMM itself crashed, investigate — extremely rare, treat as P1.
5. Leaderboard drift (Valkey ≠ PostgreSQL) — Reconciliation job auto-corrects. If drift >10 entries: manual review. Check Lua script failures or dropped Pub/Sub messages.
19. Appendix
Appendix A: Firecracker microVM Configuration
Each microVM runs a minimal Linux kernel (~4MB, custom-built vmlinux) with a language-specific ext4 rootfs. Communication between host worker and guest is via vsock (virtio socket — no network stack, just a host↔guest data channel).
Key Firecracker API calls:
| Operation | API Call | Notes |
|---|---|---|
| Set kernel | PUT /boot-source {"kernel_image_path": "/var/lib/fc/vmlinux"} | Minimal kernel, no modules |
| Set rootfs | PUT /drives/rootfs {"path_on_host": "python3.ext4", "is_root_device": true} | Per-language ext4 image |
| Set resources | PUT /machine-config {"vcpu_count": 2, "mem_size_mib": 256} | Hard limits enforced by KVM |
| Start VM | PUT /actions {"action_type": "InstanceStart"} | ~125ms cold boot |
| Create snapshot | PUT /snapshot/create {"snapshot_path": "...", "mem_file_path": "..."} | Full VM state to disk |
| Restore snapshot | PUT /snapshot/load {"snapshot_path": "...", "mem_file_path": "..."} | ~25ms restore |
Language compatibility: 100% across all 20+ languages. The guest runs a standard Linux kernel — no syscall restrictions, no edge cases, no "works locally fails on judge." Every language runtime behaves identically to bare metal.
Appendix B: Contest Elo Rating — Full Formula
The post-contest rating computation (§11.3) uses a multi-player Elo adaptation:
Given N participants sorted by contest rank:
For player i with current rating Ri:
Expected Rank(i) = 1 + Σ(j≠i) P(j beats i)
where P(j beats i) = 1 / (1 + 10^((Ri - Rj) / 400))
Performance = √(Expected_Rank × Actual_Rank) [geometric mean]
K = max(20, 80 - contests_played × 2) [K-factor]
ΔRating = K × (ln(Expected_Rank) - ln(Actual_Rank))
Clamped to [-150, +150] per contest
Starting rating: 1500. Absence penalty: -K/4 (roughly -5 to -20: newer players have a larger K, so they lose more).
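A worked check of the expected-rank term (self-contained sketch; class and method names are hypothetical):

```java
class EloCheck {
    // Expected rank of a player rated ri against the given opponent ratings:
    // 1 + sum over opponents of P(opponent beats player).
    static double expectedRank(double ri, double[] opponents) {
        double e = 1.0;
        for (double rj : opponents)
            e += 1.0 / (1.0 + Math.pow(10, (ri - rj) / 400.0));  // P(j beats i)
        return e;
    }
}
```

Three equal 1500-rated players each expect rank 1 + 0.5 + 0.5 = 2.0. If a newcomer (K = 80) finishes 1st, ΔRating = 80 × (ln 2.0 − ln 1) ≈ +55, comfortably inside the ±150 clamp.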
Appendix C: Firecracker Warm Pool Benchmark
Measured on c5.metal (96 vCPU, 192GB RAM) with Firecracker v1.7:
| Operation | Latency (p50) | Latency (p99) | Notes |
|---|---|---|---|
| Cold boot (fresh microVM) | 120ms | 180ms | Kernel boot + rootfs mount + runtime init |
| Snapshot restore | 20ms | 40ms | 6x faster than cold boot. Native Firecracker API. |
| Warm pool claim (idle VM) | 1.2ms | 3.5ms | Channel receive + state update |
| Diff snapshot reset | 4ms | 8ms | Reverts all guest state (filesystem + memory) |
| JVM warm-up penalty (cold boot) | +180ms | +350ms | Without pre-warmed snapshot |
| JVM warm-up penalty (snapshot restore) | +3ms | +10ms | JIT + classloader already warm in snapshot |
The warm pool + snapshot strategy ensures 95%+ of submissions see <5ms VM acquisition. The remaining 5% (burst traffic) see <50ms via snapshot restore. Cold boots only happen during language image updates.
If you only remember six things
- Sandbox. Firecracker microVM — each submission gets its own Linux kernel with hardware KVM isolation. Even a guest kernel exploit is a dead end.
- Fast start. Warm VM pool + Firecracker native snapshot restore (~25ms) + diff snapshot reset (~5ms). JVM VMs pre-warm the JIT before snapshot.
- Queue. Four Kafka topics (contest, premium, practice, custom-test) with dedicated worker pools and cross-pool borrowing during contests.
- Leaderboard. Valkey sorted set with atomic Lua updates. Post-contest Elo rating computation. Reconciliation job as safety net.
- Scale. 50M submissions/day, 12,000 VM slots across 1,500 pods, 14% steady-state utilization, auto-scale via KEDA on Kafka consumer lag.
- Biggest risk. Warm pool exhaustion during contest spikes. Mitigated by pre-warming 15 min before contest + dynamic language rebalancing + snapshot restore fallback.
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This Design | Learn More |
|---|---|---|
| PostgreSQL | Problems, submissions, users, contests, source of truth | PostgreSQL |
| Valkey | Contest leaderboards, rate limiting, WebSocket routing, pub/sub | Redis/Valkey |
| Kafka | Submission queue with priority separation, audit logging | Kafka |
| Kubernetes | Worker fleet orchestration, Kata Containers RuntimeClass (Firecracker backend), auto-scaling | Kubernetes |
| ClickHouse | Submission analytics, language trends, problem difficulty analysis | ClickHouse |
Infrastructure Patterns
| Pattern | Relevance to This Design | Learn More |
|---|---|---|
| Circuit Breaker and Fault Tolerance | Fail-open on Valkey/PostgreSQL outages, fallback behaviors | Circuit Breakers |
| Message Queues | Kafka for durable submission dispatch with priority separation | Message Queues |
| Rate Limiting | Per-user submission rate limiting with sliding window counters | Rate Limiting |
| Container Orchestration | Kubernetes for worker fleet management with Kata Containers (Firecracker backend) | Container Orchestration |
Linux Internals (used in this design)
Firecracker builds on Linux kernel primitives. The primary boundary is KVM (hardware virtualization). Inside the guest VM, standard Linux isolation (namespaces, cgroups, seccomp) provides defense-in-depth.
| Linux Concept | Role in This Design | Learn More |
|---|---|---|
| KVM (Kernel-based Virtual Machine) | The primary isolation boundary. Firecracker uses /dev/kvm to create hardware-isolated VMs. Guest kernel exploits can't reach the host. | KVM docs |
| System Calls | Guest kernel handles all syscalls natively — 100% compatibility. Host only sees KVM hypercalls. | System Calls |
| Namespaces | Defense-in-depth inside the guest VM — isolates PID, mount, network, user, IPC, UTS | Namespaces |
| cgroups v2 | Hard limits inside the guest: 256MB memory, 2 vCPU, 64 PIDs. OOM killer enforces MLE verdict. | cgroups v2 |
| Memory Cgroups | How the 256MB limit and OOM kill work inside the guest kernel | Memory Cgroups |
| Seccomp | Additional syscall filtering inside the guest — principle of least privilege even within the VM | Seccomp |
| Linux Capabilities | All 41 capabilities dropped inside guest. No raw sockets, no mount, no module loading. | Capabilities |
| Network Namespaces | No virtio-net device attached — the VM has no network stack. Only vsock to host. | Network Namespaces |
| Container Runtime (runc/containerd/Kata) | How Kata Containers wraps Firecracker behind the OCI runtime interface for K8s integration | Container Runtime |
| OOM Killer | Guest kernel OOM-kills user code at 256MB. Worker detects exit and reports MLE verdict. | OOM Killer |
| Copy-on-Write | Firecracker snapshot restore uses CoW memory mapping for fast VM resume (~25ms) | Copy-on-Write |
| Virtual Memory | Each microVM has its own virtual address space. KVM maps guest physical → host physical. | Virtual Memory |
Further Reading
- Firecracker: Lightweight Virtualization -- AWS's microVM for Lambda and Fargate. The primary sandbox in this design.
- Firecracker Design Doc -- Architecture decisions behind the Firecracker VMM
- Kata Containers -- OCI-compatible VM runtime with Firecracker backend for Kubernetes integration
- firecracker-containerd -- Run Firecracker microVMs via standard containerd APIs
- gVisor: Container Runtime Sandbox -- Alternative sandbox approach using user-space syscall interception
- How LeetCode Judges Your Code -- LeetCode's official documentation on their judge system
- LeetCode Contest Rating Algorithm -- The Elo-based rating system adapted for multi-player contests
- Codeforces System Architecture -- How Codeforces handles contest infrastructure
- Valkey Sorted Sets for Leaderboards -- Real-time ranking with sorted sets
- Kubernetes KEDA: Event-Driven Autoscaling -- Auto-scaling workers based on Kafka consumer lag
- Firecracker Snapshotting -- How Firecracker snapshot/restore works (the warm pool foundation)
- Monaco Editor -- The VS Code editor engine powering the web IDE
Practice this design: Try the Design LeetCode interview question — hints and structured guidance included.