System Design: LeetCode (Code Sandbox, Container Isolation, Real-Time Contests)
Goal: Design an online judge platform like LeetCode that handles 50 million code submissions per day across 20+ programming languages. Execute untrusted user code safely in Firecracker microVMs with hardware-level KVM isolation. Support real-time contests with live leaderboards, Elo-based rating, and penalty calculation. Target sub-5-second end-to-end execution latency, 20K concurrent submissions, and 15M registered users.
TL;DR: Submissions enter through a REST API, get validated and enqueued into Kafka (partitioned by priority: premium, contest, practice, custom-test). Execution workers pull jobs, claim a pre-warmed Firecracker microVM from the warm pool (native snapshot restore in ~25ms), run user code inside a VM with its own Linux kernel — hardware-level KVM isolation, no network, 256MB RAM, 2 vCPU. VM reset uses diff snapshot restore (~5ms) to revert all changes atomically. Contest leaderboards use Valkey sorted sets updated atomically via Lua scripts. WebSocket connections push live rank updates. Post-contest Elo rating computation adjusts player ratings. Auto-scaling based on Kafka consumer lag handles contest traffic spikes.
Pick your path
| Time | Read this | Covers |
|---|---|---|
| 2 min | TL;DR + §1 + §16 Security | Shape of the system and why sandbox isolation is the whole ballgame |
| 15 min | §1–§12 | Every core design decision, interview-grade |
| 30 min | Full post | Production detail: ops playbook, SLOs, failure scenarios, appendices |
1. Final Architecture
Three independent paths. The submission path accepts code and queues it (async, sub-100ms API response). The execution path runs untrusted code in Firecracker microVMs — each submission gets its own VM with a separate Linux kernel and hardware KVM isolation. The contest path layers real-time leaderboards and post-contest Elo rating on top.
Submission path:
Client → API Gateway → Rate Limiter (Valkey)
→ Kafka (topic by priority: premium > contest > practice > custom-test)
→ 202 Accepted with submission ID
Execution path:
Worker → Kafka poll → Claim warm microVM (Firecracker snapshot restore, ~25ms)
→ Write code via vsock → Compile → Run test cases (isolated VM, no network, 256MB, 2 vCPU)
→ Compare output → Verdict → PostgreSQL + Valkey pub/sub → WebSocket push
Contest path:
Accepted verdict → Valkey Lua script (atomic leaderboard update)
→ PUBLISH to WebSocket gateways → Live rank push to all viewers
Post-contest → Elo rating computation → PostgreSQL (rating_before, rating_after)
2. Problem Statement
An online judge sounds deceptively simple. Someone writes code. The system runs it. Check if the output matches. Done.
The reality? It's running arbitrary, untrusted code from millions of strangers. That code could be a fork bomb, a Bitcoin miner, a kernel exploit, or an infinite loop that allocates 64GB of memory. And the system needs to run 50 million of these per day, return the right answer in under 5 seconds, and do it across 20 different programming languages with different compilers, runtimes, and memory models.
Problem 1: Running untrusted code without getting owned.
Someone submits os.system("rm -rf /") in Python, or while(1) fork() in C++, or something subtler — code that reads /proc/self/mountinfo to fingerprint the container and tries known escape exploits. A default Docker container won't stop any of this. Kernel-level isolation is mandatory: intercept every syscall, block network access, kill anything that exceeds resource limits. One sandbox escape = access to test cases, other users' code, or production infrastructure.
Problem 2: 50M submissions/day means thousands of containers running simultaneously.
580 submissions/second average, 2K peak. Each takes 2-10 seconds → 1,200 to 5,800 containers in parallel. Each needs its own isolated filesystem, resource limits, and language runtime. Spawning a fresh Docker container per submission is a non-starter (3-5s cold start). A warm pool of pre-spawned containers is essential.
Problem 3: Contest fairness requires deterministic execution and cheat prevention.
Same code, different machines: 48ms vs 72ms due to noisy neighbors. Ranking by raw execution time is unfair — timing must be normalized.
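One common normalization approach (an illustrative sketch, not necessarily LeetCode's production method): periodically run a reference solution on each worker and scale the user's measured time by the host's speed relative to a calibration baseline. The function name and numbers here are hypothetical.

```python
def normalize_time_ms(user_ms: float, ref_ms_on_host: float,
                      ref_ms_baseline: float) -> float:
    """Scale a measured time by how fast this host ran the reference
    solution relative to the baseline recorded on calibration hardware."""
    return user_ms * (ref_ms_baseline / ref_ms_on_host)

# A noisy host ran the reference solution in 36ms instead of its 24ms
# baseline, so the user's 72ms raw time normalizes to 48ms.
print(normalize_time_ms(72, 36, 24))  # → 48.0
```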
Scale numbers:
- 15M registered users, 3M monthly active users
- 50M code submissions per day (~580/sec avg, 2,000/sec peak)
- 20K concurrent submissions during peak contest windows
- 20+ supported programming languages
- 4,000+ problems with 10-50 test cases each
- Average execution time: 3 seconds (compilation + running all test cases)
- Weekly contests: 100K participants, 4-5 problems, 90-minute window
What NOT to do:
- exec()/eval() on the app server. That's RCE on production.
- Default Docker security. Shared kernel, no syscall filtering. Escapes are well-documented (CVE-2019-5736, CVE-2020-15257).
- Network access in containers. Code could fetch solutions, exfiltrate data, or attack internal services.
- Raw execution time for ranking. Noisy neighbors cause 2x variance. Normalize or use dedicated hardware.
- Running all test cases after first failure. Wrong output on case 2 of 50? Stop. Only run all for Accepted.
- Test cases inside the container image. Users could read expected outputs and hardcode answers.
- Single queue for all submissions. Contest and practice traffic compete. Separate queues with priority.
- Solutions and test cases in the same DB. Different sizes, different access patterns, different write rates.
- Monolithic judge. API, execution, leaderboard, problem management scale completely differently.
The sandbox isolation layer is where most of the complexity lives. Getting it wrong means either a security breach or unacceptable performance. §11.1 breaks down the tradeoffs.
3. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Execute user-submitted code in a sandboxed environment with strict resource limits | P0 |
| FR-02 | Support 20+ programming languages (Python, Java, C++, Go, Rust, JavaScript, C, C#, Ruby, Kotlin, Swift, TypeScript, Scala, PHP, Haskell, Dart, Elixir, Erlang, Racket) | P0 |
| FR-03 | Run submitted code against ordered test cases and return a verdict (Accepted, Wrong Answer, TLE, MLE, Runtime Error, Compilation Error) | P0 |
| FR-04 | Display execution time and memory usage for each submission | P0 |
| FR-05 | Support contests with timed problem sets and real-time leaderboards | P0 |
| FR-06 | Support penalty time calculation (time to solve + penalty per wrong attempt) | P0 |
| FR-07 | Provide a problem bank with descriptions, constraints, examples, and hidden test cases | P0 |
| FR-08 | Show submission history per user per problem | P0 |
| FR-10 | Support "Run Code" (test against visible examples only, fast debug loop) separate from "Submit" (full hidden test suite) | P0 |
| FR-11 | Rate-limit submissions per user (5/minute for practice, 10/minute during contests) | P0 |
| FR-12 | Push real-time verdict updates to users via WebSocket | P1 |
| FR-13 | Support problem difficulty tagging and topic categorization | P1 |
| FR-14 | Track user statistics (problems solved, acceptance rate, contest Elo rating) | P1 |
| FR-15 | Support editorial solutions and community discussions per problem | P2 |
| FR-16 | Distinguish TLE (user's algorithm too slow) from Timeout (server overloaded, auto-retry) | P0 |
| FR-17 | Premium priority queue: 3-10x faster judging for premium subscribers (practice only, not contests) | P1 |
| FR-18 | Post-contest Elo rating computation with absence penalty | P1 |
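FR-18's rating update builds on the Elo expected-score model. LeetCode's production formula is reportedly more elaborate (it computes an expected rank against the full contest field), so treat this as an illustrative two-player sketch, not the platform's actual algorithm:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    # Probability that player A outranks player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(rating: float, expected: float, actual: float,
               k: float = 32.0) -> float:
    # actual is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    return rating + k * (actual - expected)

# Two equally rated players: 50% expected score; a win gains k/2 points.
print(elo_expected(1500, 1500))          # → 0.5
print(elo_update(1500, 0.5, 1.0))        # → 1516.0
```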
4. Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | End-to-end submission latency (submit to verdict) | < 5 seconds (p95), < 10 seconds (p99) |
| NFR-02 | Execution throughput | 50M submissions/day, 2,000/sec peak |
| NFR-03 | Concurrent submissions | 20,000 during peak contest windows |
| NFR-04 | microVM snapshot restore time | < 50ms (with warm pool: < 5ms claim) |
| NFR-05 | Availability | 99.95% (26 min downtime/month, non-contest), 99.99% during contests |
| NFR-06 | Sandbox escape rate | 0 (any escape is a critical security incident) |
| NFR-07 | Leaderboard update latency | < 1 second from verdict to leaderboard update |
| NFR-08 | WebSocket message delivery | < 500ms from verdict to client notification |
| NFR-10 | Horizontal scalability | Linear scale-out by adding execution workers |
| NFR-11 | Data retention | Submissions: 2 years. Test cases: indefinite. Contest results: indefinite. |
| NFR-12 | Language image update | < 4 hours from new language/version release to production availability |
5. High-Level Approach & Technology Selection
5.1 What Kind of System Is This?
A batch job execution platform with a real-time scoring overlay. Accept code, run it safely, compare output. The "safely" part is where all the complexity lives.
At 50M submissions/day, this is async — not request/response. The API returns 202 immediately; the verdict arrives via WebSocket. Execution takes 2-10 seconds; holding HTTP connections open that long doesn't scale.
5.2 Why Not Just ProcessBuilder?
The tempting approach: run user code directly on the server.
ProcessBuilder pb = new ProcessBuilder("python3", "solution.py");
Process p = pb.start();
String output = new String(p.getInputStream().readAllBytes());
This gives the user's code full access to the host — filesystem, network, other users' submissions, environment variables, the ability to kill the Java worker process. A fork bomb (while True: os.fork()) kills the entire server, not just the submission. There's no resource limit, no isolation, no boundary.
The progression from "no isolation" to "hardware isolation":
| Approach | Filesystem | Network | Resources | Kernel | Compatibility |
|---|---|---|---|---|---|
| ProcessBuilder | Full access | Full access | None | Shared | 100% |
| + chroot | Restricted root | Full access | None | Shared | 100% |
| + cgroups | Restricted root | Full access | CPU/mem limits | Shared | 100% |
| + namespaces | Isolated | Isolated | Limited | Shared | 100% |
| Docker + seccomp | Isolated | Disabled | Limited | Shared (filtered) | 100% |
| gVisor | Isolated | Disabled | Limited | User-space kernel | 95% |
| Firecracker | Isolated | Disabled | Limited | Separate kernel (KVM) | 100% |
Each row adds a layer. Firecracker is the end of the chain — the user's code runs inside its own VM with its own kernel. Even a full kernel exploit stays inside the VM. And unlike gVisor, there are zero syscall compatibility issues because the guest runs a real Linux kernel.
5.3 Sandbox Isolation Approaches
The sandbox must prevent: filesystem escape, network access, fork bombs, memory bombs, CPU starvation, and kernel exploits.
Approach 1: Docker + seccomp/AppArmor (baseline)
Standard Docker with a restrictive seccomp profile (~50 of 300+ syscalls allowed), AppArmor, and --network=none, --read-only, --memory, --cpus, --pids-limit.
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{ "names": ["read", "write", "open", "close", "stat", "fstat",
"mmap", "mprotect", "munmap", "brk", "ioctl",
"access", "execve", "exit_group", "arch_prctl",
"clone", "wait4", "getpid", "getuid"],
"action": "SCMP_ACT_ALLOW" }
]
}
Approach 2: gVisor (runsc)
Google's user-space kernel. Intercepts all syscalls and re-implements them in a Go process (Sentry). The container never talks to the host kernel. Even a Linux kernel zero-day only compromises the unprivileged Sentry process, not the host.
User Code -> Container -> gVisor Sentry (user-space kernel) -> Host Kernel
^^ intercepted here
Approach 3: Firecracker microVMs
AWS's lightweight VM monitor. Each submission runs in an actual VM with its own kernel. Boots in ~125ms with ~5MB overhead. A kernel exploit inside the VM doesn't affect the host. Used by AWS Lambda and Fly.io.
User Code -> Guest Kernel (inside microVM) -> Firecracker VMM -> Host Kernel
^^ completely separate kernel
5.4 Sandbox Approach Comparison
| Dimension | Docker + seccomp | gVisor (runsc) | Firecracker microVM |
|---|---|---|---|
| Security | Low (shared kernel) | High (user-space kernel) | Very high (hardware KVM) |
| Isolation | Process-level | Syscall interception | Separate guest kernel |
| Syscall compatibility | Full (native) | 95% (reimplements ~200 of ~300) | 100% (full kernel inside VM) |
| Cold start | ~300ms | ~400ms | ~125ms (boot + rootfs) |
| Warm reuse | Easy (exec into running) | Easy (same as Docker) | Native snapshot/restore (~25ms) |
| Memory overhead | ~10MB | ~30MB (Sentry) | ~35MB (5MB VMM + guest kernel) |
| Syscall perf overhead | ~0% | 10-30% (interception) | ~5% (virtualization) |
| Known escape CVEs | 6+ (runc, containerd) | 0 | 0 (hardware boundary) |
| Operational complexity | Low | Medium (custom OCI runtime) | Medium (Kata Containers) |
| Production users | Everyone | Google Cloud Run | AWS Lambda, Fly.io |
| LeetCode fit | No | Yes | Yes (best) |
5.5 Chosen Sandbox: Firecracker microVMs
Firecracker is the primary runtime. Each submission runs inside an actual virtual machine with its own Linux kernel. The VM communicates with the host only through a minimal VMM process (~5MB) and KVM hypercalls. This is the same technology AWS Lambda uses to run billions of untrusted functions.
Why Firecracker over gVisor for this use case:
- Full kernel isolation. Guest kernel exploits don't affect the host. Zero syscall compatibility issues — every language runtime works exactly as on bare metal. No "works locally, fails on judge."
- Native snapshot/restore. Firecracker can snapshot a fully-initialized VM (runtime loaded, JIT warmed) and restore it in ~25ms. No external CRIU dependency. Simpler than gVisor + CRIU.
- Clean resource accounting. Each VM is a hard boundary. Memory, CPU, and I/O are tracked by KVM — no noisy-neighbor ambiguity.
- Proven at untrusted-code scale. AWS Lambda, Fly.io, and Koyeb run on Firecracker in production. The "operational complexity" argument has weakened — firecracker-containerd provides OCI compatibility, and Kata Containers provides Kubernetes RuntimeClass integration.
gVisor remains a valid lighter alternative for development environments, internal judges, or lower-scale deployments where microVM overhead isn't justified.
5.6 Judge Strategy Comparison
How does the system decide whether a submission is correct?
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Exact output match | Compare stdout byte-for-byte with expected output | Simple, deterministic | Fails on floating point, trailing whitespace, line endings |
| Token-based comparison | Split output into tokens (whitespace-delimited), compare tokens | Handles whitespace variations | Still fails on floating point precision |
| Special judge (checker) | Custom program compares user output against expected answer | Handles floating point, multiple valid answers, graph problems | Need to write a checker per problem |
| Interactive judge | User program communicates with judge program via stdin/stdout | Required for interactive problems (binary search, games) | Complex, harder to parallelize |
Chosen: Token-based comparison as default. Special judge for floating-point (epsilon comparison) and multi-answer problems (validate output property, not exact match).
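A minimal sketch of the chosen comparator — token match with a numeric-epsilon fallback (the function name and the 1e-6 epsilon are illustrative assumptions, not the judge's exact values):

```python
def tokens_match(expected: str, actual: str, float_eps: float = 1e-6) -> bool:
    """Token-based comparison: whitespace-insensitive, with an epsilon
    fallback for tokens that parse as floats."""
    exp, act = expected.split(), actual.split()
    if len(exp) != len(act):
        return False
    for e, a in zip(exp, act):
        if e == a:
            continue
        try:
            # Numeric tokens compare within epsilon (special-judge lite).
            if abs(float(e) - float(a)) <= float_eps:
                continue
        except ValueError:
            pass
        return False
    return True

print(tokens_match("1 2 3", "1  2\n3"))            # → True (whitespace ignored)
print(tokens_match("0.3333330", "0.3333334"))      # → True (within epsilon)
print(tokens_match("hello", "world"))              # → False
```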
5.7 Contest Ranking Algorithms
| Algorithm | Description | Used By |
|---|---|---|
| ICPC-style | Ranked by problems solved (desc), then penalty time (asc). Penalty = sum of solve times + 20 min per wrong attempt | ICPC, many college contests |
| Codeforces-style | Rating-based scoring. Points per problem decrease over time. Wrong attempts incur penalty. | Codeforces |
| LeetCode-style | Ranked by problems solved (desc), then total penalty (asc). Penalty = finish time of last accepted + 5 min per wrong attempt | LeetCode |
| IOI-style | Partial scoring. Each test case group gives points. No penalty for wrong attempts. | IOI, many olympiads |
Chosen: LeetCode-style ranking (configurable to ICPC-style):
score = (problems_solved, -total_penalty)
total_penalty = finish_time_of_last_accepted
+ (5 * total_wrong_attempts_across_all_solved_problems)
Sorted by: problems_solved DESC, total_penalty ASC
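The ranking rule above reduces to a two-field sort key. A sketch (field names are assumptions for illustration):

```python
def leaderboard_key(entry: dict) -> tuple:
    # problems_solved DESC (negate for ascending sort), total_penalty ASC.
    return (-entry["problems_solved"], entry["total_penalty"])

participants = [
    {"user": "a", "problems_solved": 3, "total_penalty": 95},
    {"user": "b", "problems_solved": 4, "total_penalty": 210},
    {"user": "c", "problems_solved": 3, "total_penalty": 80},
]
standings = sorted(participants, key=leaderboard_key)
print([p["user"] for p in standings])  # → ['b', 'c', 'a']
```

More problems always wins; penalty only breaks ties among users with the same solve count.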
5.8 Store Selection
| Store | Technology | Role | Rationale |
|---|---|---|---|
| Problem metadata | PostgreSQL 16 | Problem descriptions, constraints, tags, difficulty | Relational queries (filter by tag, difficulty), ACID transactions |
| Test case storage | S3 + local SSD cache | Input/output files for each problem | Test cases can be large (100MB+ for some problems). S3 for durability, SSD cache on workers for low-latency reads. |
| Submission records | PostgreSQL 16 | User submissions, verdicts, execution stats | Relational (user_id, problem_id, contest_id), indexed for history queries |
| Submission code | S3 | Raw source code files | Cheap storage for 50M files/day, rarely re-read |
| Contest leaderboards | Valkey 8 | Real-time sorted sets for live rankings | Sub-ms sorted set operations, atomic updates, pub/sub for push |
| Submission queue | Kafka | Decouple API from execution workers | Durable, partitioned, consumer groups for parallel processing |
| User sessions/rate limits | Valkey 8 | Rate limiting, auth tokens | In-memory speed for per-request checks |
| Analytics | ClickHouse | Submission trends, language popularity, problem difficulty stats | Columnar analytics on billions of rows |
| WebSocket state | Valkey Pub/Sub | Route verdict updates to the correct WebSocket gateway | Cross-pod message routing |
5.9 Kata Containers: How Firecracker Integrates with Kubernetes
The problem Kata solves: Firecracker is a standalone VMM — it has no native Kubernetes integration. Without Kata, running Firecracker microVMs in K8s would require a custom orchestrator (scheduler, health checks, scaling, pod lifecycle). That was the original argument against Firecracker.
What Kata Containers is: An open-source project (CNCF) that wraps lightweight VMMs (Firecracker, QEMU, Cloud Hypervisor) behind the standard OCI container runtime interface. To Kubernetes, a Kata pod looks like any other pod. Under the hood, each pod runs inside a microVM instead of a shared-kernel container.
How it works:
Without Kata: kubelet → containerd → runc → container (shared kernel)
With Kata: kubelet → containerd → kata-runtime → Firecracker → microVM (separate kernel)
A single Kubernetes RuntimeClass resource switches the runtime:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: kata-fc
handler: kata-fc  # Kata with Firecracker backend
Worker pods specify runtimeClassName: kata-fc and everything — KEDA auto-scaling, liveness probes, rolling deployments, pod specs — works unchanged. The only difference is what runs under the hood: a Firecracker microVM instead of a Linux container.
Why this matters for the judge: Firecracker provides the strongest isolation. Kata provides the Kubernetes integration. Together, the system gets hardware VM isolation with standard K8s operational tooling — no custom orchestrator needed.
6. Design Assumptions
Non-negotiables. Every number downstream inherits from here.
- Single-region (us-east-1), multi-AZ. No need for active-active — users already wait 2-5s for execution.
- 50M submissions/day is a design target, not a benchmark. Demonstrates scaling decisions.
- Firecracker for production. Hardware VM isolation for untrusted code. gVisor is the lighter alternative for dev/test.
- Premium tier. Priority queue for practice (3-10x faster). Contests: equal priority for fairness.
- 20+ languages. Python 2/3, Java, C++, Go, Rust, JS, TS, C, C#, Ruby, Kotlin, Swift, Scala, PHP, Haskell, Dart, Elixir, Erlang, Racket.
- No AI coding features (Copilot-style). Separate product concern.
- GDPR out of scope for code execution. IPs logged for rate limiting; no PII in sandboxes.
7. High-Level Architecture
Accept fast (202 Accepted), judge async (Kafka workers), push results (WebSocket).
Bird's-Eye View
Layer Responsibilities
API Gateway. Validate code → rate-limit (Valkey) → store submission in PostgreSQL (PENDING) → enqueue to Kafka → return 202. WebSocket gateway pushes verdict updates via Valkey Pub/Sub.
Queue. Four Kafka topics (contest, premium, practice, custom-test). Partitioned by user_id % partitions for per-user ordering.
Execution. Workers poll Kafka → claim warm microVM → write code via vsock → compile → run test cases → write verdict → kill VM, restore fresh from snapshot. Each worker runs 8 concurrent VMs.
Storage. PostgreSQL for structured data. S3 for blobs (test cases, source code). Test cases cached on worker SSDs.
Real-Time. Valkey for (1) leaderboard sorted sets, (2) rate-limit counters, (3) Pub/Sub routing verdicts to WebSocket gateways.
Analysis. ClickHouse stores submission analytics (language trends, problem difficulty calibration, per-worker performance). Fed async from PostgreSQL.
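The warm-pool claim cycle in the Execution layer can be sketched as follows. This is a toy in-memory model — restore_snapshot stands in for a real Firecracker snapshot-restore call, and the ~5ms/~25ms figures are the targets quoted in the TL;DR, not measured here:

```python
from queue import Empty, Queue

class WarmPool:
    """Toy model of the claim → execute → kill → restore-fresh cycle."""
    def __init__(self, size: int, restore_snapshot):
        self.restore_snapshot = restore_snapshot  # spawns a fresh microVM
        self.vms: Queue = Queue()
        for _ in range(size):
            self.vms.put(self.restore_snapshot())

    def claim(self):
        try:
            return self.vms.get_nowait()      # warm hit: ~5ms claim in practice
        except Empty:
            return self.restore_snapshot()    # cold miss: ~25ms snapshot restore

    def recycle(self, used_vm) -> None:
        # The used VM is never reused; a fresh restore replaces it so the
        # next claim always starts from a clean, pre-warmed snapshot.
        self.vms.put(self.restore_snapshot())
```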
Submission Lifecycle
Execution Worker Detail Flow
8. Back-of-the-Envelope Estimation
580 req/sec average, 2K peak, 6K concurrent containers, 1.5TB hot PostgreSQL, 150MB Valkey.
| Dimension | Result |
|---|---|
| Throughput | 580/sec avg, 2K peak, design for 3K burst |
| Workers | 1,500 pods, 12,000 container slots, 14.5% steady-state utilization |
| PostgreSQL | ~1.5TB hot (30-day submissions + indexes), weekly partitions |
| S3 | 100GB/day source code, 36.5TB/year, <2GB total test cases |
| Valkey | ~150MB total (leaderboards + rate limits + sessions) — one node handles it |
| Kafka | 12MB/sec peak, 3 brokers, 72 partitions across 4 topics |
| Network | ~10MB/sec total outbound (well within single-NIC capacity) |
Show full derivations (throughput, worker sizing, storage, Valkey, Kafka, network)
Throughput:
Submissions per day: 50,000,000
Submissions per second (avg): 579 (~580)
Submissions per second (peak, 3.5x): 2,030 (~2,000)
Contest peak (weekly, 90 min):
100K users * 5 problems * 3 attempts avg = 1,500,000 submissions
1,500,000 / (90 * 60) = 278 contest submissions/sec
Sunday evening combined peak: ~1,440/sec
Design for 2,000/sec with burst to 3,000/sec.
Execution worker sizing:
Average execution time: 3 seconds
Concurrent executions at peak: 2,000/sec * 3s = 6,000
Worker pod: 8 vCPU, 16GB RAM, 8 containers per pod
Pods needed at peak: 6,000 / 8 = 750 pods
Practical: 500 contest + 800 practice + 200 custom-test = 1,500 pods, 12,000 slots
Steady-state utilization: 14.5%. Peak: 50%. Auto-scale down to 500 pods off-peak.
Storage:
PostgreSQL submissions: 50M/day * 500B/row = 750GB hot (30 days), ~1.5TB with indexes
PostgreSQL problems: 30MB. Users: 10GB. Contests: trivial.
S3 source code: 100GB/day = 36.5TB/year (lifecycle to S3-IA after 90 days)
S3 test cases: 900MB total (cached entirely on worker SSDs)
Valkey memory:
Leaderboards: ~8MB per contest (100K participants), 32MB with 4 concurrent
Rate limiting: 50MB (500K concurrent users)
WebSocket sessions: 60MB
Total: ~150MB — single Valkey node (4GB) with one replica
Kafka:
4,000 msg/sec peak * 3KB = 12MB/sec (single broker handles 200MB/sec)
3 brokers for fault tolerance. 16 + 16 + 32 + 8 = 72 partitions across 4 topics.
Network: ~10MB/sec total outbound (S3 fetches + PG updates + WebSocket).
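The headline numbers in the derivations above can be sanity-checked in a few lines:

```python
# Reproduce the back-of-the-envelope figures from this section.
per_day = 50_000_000
avg_rps = per_day / 86_400                    # ≈ 579/sec average
peak_rps = 2_000                              # rounded 3.5x diurnal peak
concurrent_vms = peak_rps * 3                 # 3s avg execution → 6,000 busy VMs
pods_at_peak = concurrent_vms // 8            # 8 VM slots per pod → 750 pods
contest_rps = 100_000 * 5 * 3 / (90 * 60)     # ≈ 278 contest submissions/sec
steady_util = avg_rps * 3 / 12_000            # 12,000 slots → ~14.5% utilization
```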
9. Data Model
Six tables in PostgreSQL (submissions partitioned weekly), sorted sets in Valkey, blobs in S3.
| Table | Key Columns | Notes |
|---|---|---|
| problems | slug, difficulty, time_limit_ms, starter_code (JSONB) | 4K rows, GIN index on tags |
| test_cases | problem_id, input_s3, output_s3, is_sample | S3 paths, not inline data. SHA-256 hashes for integrity |
| submissions | user_id, problem_id, status, execution_time_ms | Partitioned weekly by submitted_at. 50M rows/day. Status includes TIMEOUT (distinct from TLE) |
| contests | start_time, end_time, problem_ids[], penalty_minutes | ~52/year. Rated by default |
| contest_participants | contest_id, user_id, rank, rating_before/after/delta | Elo rating changes stored per contest |
| users | username, rating (default 1500), problems_solved, is_premium | 15M rows. Streak tracking |
Show full SQL schemas (5 CREATE TABLE statements + indexes)
9.1 Problems Table
CREATE TABLE problems (
id SERIAL PRIMARY KEY,
slug VARCHAR(100) UNIQUE NOT NULL,
title VARCHAR(255) NOT NULL,
difficulty VARCHAR(20) NOT NULL CHECK (difficulty IN ('Easy', 'Medium', 'Hard')),
description TEXT NOT NULL,
constraints TEXT NOT NULL,
examples JSONB NOT NULL,
time_limit_ms INT NOT NULL DEFAULT 2000,
memory_limit_mb INT NOT NULL DEFAULT 256,
category VARCHAR(100) NOT NULL,
tags TEXT[] NOT NULL DEFAULT '{}',
has_special_judge BOOLEAN NOT NULL DEFAULT false,
special_judge_code TEXT,
starter_code JSONB NOT NULL DEFAULT '{}',
total_submissions BIGINT NOT NULL DEFAULT 0,
total_accepted BIGINT NOT NULL DEFAULT 0,
is_premium BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_problems_difficulty ON problems(difficulty);
CREATE INDEX idx_problems_category ON problems(category);
CREATE INDEX idx_problems_tags ON problems USING GIN(tags);
9.2 Test Cases Table
# language-images.yaml
images:
python3:
image: "registry.internal/judge-python3"
tag: "3.12.2-fc-v4"
compiler: "CPython 3.12.2"
updated: "2026-04-15"
java:
image: "registry.internal/judge-java"
tag: "21.0.3-fc-v4"
compiler: "OpenJDK 21.0.3"
updated: "2026-04-12"
cpp:
image: "registry.internal/judge-cpp"
tag: "13.2.0-fc-v4"
compiler: "g++ 13.2.0"
updated: "2026-04-10"
go:
image: "registry.internal/judge-go"
tag: "1.22.2-fc-v4"
compiler: "Go 1.22.2"
updated: "2026-04-14"
rust:
image: "registry.internal/judge-rust"
tag: "1.77.1-fc-v4"
compiler: "rustc 1.77.1"
updated: "2026-04-08"
9.3 Submissions Table
// Practice workers consume from both premium and practice topics
// Premium messages are always dequeued first (weighted consumer)
Submission consumeWithPriority() {
// Try premium topic first (non-blocking poll, 50ms timeout)
ConsumerRecords<String, Submission> premium =
premiumConsumer.poll(Duration.ofMillis(50));
if (!premium.isEmpty())
return premium.iterator().next().value();
// Fall back to practice topic (blocking poll, 1s timeout)
ConsumerRecords<String, Submission> practice =
practiceConsumer.poll(Duration.ofSeconds(1));
if (!practice.isEmpty())
return practice.iterator().next().value();
return null;
}
9.4 Contests + Participants Tables
# practice-worker Deployment (abbreviated)
replicas: 800
strategy:
rollingUpdate: { maxSurge: 100, maxUnavailable: 80 } # 10%
resources: { cpu: "8", memory: "16Gi" } # per pod
env:
KAFKA_TOPIC: "submission.practice"
VMM: "firecracker"
WARM_POOL_SIZE: "8" # microVMs per pod
EXECUTION_TIMEOUT_SECONDS: "15"
runtimeClassName: kata-fc  # Kata Containers with Firecracker backend
9.5 Users Table
CREATE TABLE submission_analytics (
submission_id UUID, user_id UUID, problem_id UInt32,
contest_id Nullable(UInt32), language LowCardinality(String),
status LowCardinality(String), execution_time_ms UInt32,
memory_usage_kb UInt32, queue_wait_ms UInt32, compile_time_ms UInt32,
submitted_at DateTime64(3), judged_at DateTime64(3)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(submitted_at)
ORDER BY (problem_id, language, submitted_at);
-- Language popularity
SELECT language, count() as submissions,
countIf(status = 'ACCEPTED') * 100.0 / count() as acceptance_rate
FROM submission_analytics WHERE submitted_at > now() - INTERVAL 30 DAY
GROUP BY language ORDER BY submissions DESC;
-- Time limit calibration
SELECT problem_id, language,
quantile(0.5)(execution_time_ms) as p50,
quantile(0.95)(execution_time_ms) as p95
FROM submission_analytics WHERE status = 'ACCEPTED'
GROUP BY problem_id, language;
9.6 Valkey Data Structures
# Contest leaderboard (sorted set)
# Score encoding: problems_solved * 10^9 - total_penalty_seconds
# Higher score = better rank (more problems, less penalty)
ZADD contest:123:leaderboard <score> <user_id>
# Example: User solved 3 problems with 1800 sec total penalty
# Score = 3 * 1_000_000_000 - 1800 = 2_999_998_200
ZADD contest:123:leaderboard 2999998200 "user-abc-123"
# Get rank (0-indexed, so add 1 for display)
ZREVRANK contest:123:leaderboard "user-abc-123"
# Get top 50
ZREVRANGE contest:123:leaderboard 0 49 WITHSCORES
# Per-user contest state (hash)
HSET contest:123:user:abc-123
problems_solved 3
total_penalty 1800
p1_accepted 1
p1_time 300
p1_wrong_attempts 1
p2_accepted 1
p2_time 900
p2_wrong_attempts 0
p3_accepted 1
p3_time 1800
p3_wrong_attempts 2
p4_accepted 0
p4_wrong_attempts 3
# Rate limiting (fixed one-minute windows)
# Key: ratelimit:<user_id>:<minute_bucket>  (bucket = YYYYMMDDHHMM)
INCR ratelimit:user-abc-123:202604181504
EXPIRE ratelimit:user-abc-123:202604181504 120
# WebSocket session routing
HSET ws:sessions <user_id> <gateway_pod_id>
# Submission verdict pub/sub
PUBLISH verdict:<user_id> '{"submission_id":"...","status":"ACCEPTED","time_ms":48}'
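The leaderboard score encoding round-trips cleanly because total penalty is always smaller than the 10^9 scale factor. A sketch of the encode/decode pair (helper names are illustrative):

```python
SCALE = 1_000_000_000  # penalty seconds can never reach this

def encode_score(problems_solved: int, penalty_seconds: int) -> int:
    # More problems always outranks less penalty because penalty < SCALE.
    return problems_solved * SCALE - penalty_seconds

def decode_score(score: int) -> tuple:
    solved, neg_penalty = divmod(score, SCALE)
    # divmod floors toward zero remainder, so a nonzero penalty
    # "borrows" one from solved; undo that here.
    if neg_penalty:
        return solved + 1, SCALE - neg_penalty
    return solved, 0

# 3 problems solved, 1800s penalty → the score from the example above.
print(encode_score(3, 1800))  # → 2999998200
print(decode_score(2999998200))  # → (3, 1800)
```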
9.7 Entity-Relationship Diagram
10. API Design
Async submission (202 + WebSocket push), not request-response. Two core endpoints: submit and get result.
10.1 Submit Solution
POST /api/v1/submissions
Request:
CREATE TABLE test_cases (
id SERIAL PRIMARY KEY,
problem_id INT NOT NULL REFERENCES problems(id),
case_number INT NOT NULL,
input_s3 VARCHAR(500) NOT NULL,
output_s3 VARCHAR(500) NOT NULL,
is_sample BOOLEAN NOT NULL DEFAULT false,
input_hash CHAR(64) NOT NULL,
output_hash CHAR(64) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(problem_id, case_number)
);
Response (202 Accepted):
CREATE TABLE submissions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
problem_id INT NOT NULL REFERENCES problems(id),
contest_id INT,
language VARCHAR(30) NOT NULL,
code_s3_path VARCHAR(500) NOT NULL,
code_length INT NOT NULL,
status VARCHAR(30) NOT NULL DEFAULT 'PENDING'
CHECK (status IN ('PENDING', 'QUEUED', 'COMPILING',
'RUNNING', 'ACCEPTED', 'WRONG_ANSWER',
'TIME_LIMIT_EXCEEDED', 'MEMORY_LIMIT_EXCEEDED',
'RUNTIME_ERROR', 'COMPILATION_ERROR',
'SYSTEM_ERROR', 'TIMEOUT')),
verdict_detail JSONB,
execution_time_ms INT,
memory_usage_kb INT,
test_cases_passed INT NOT NULL DEFAULT 0,
test_cases_total INT NOT NULL DEFAULT 0,
submitted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
judged_at TIMESTAMPTZ,
worker_id VARCHAR(100),
CONSTRAINT fk_contest FOREIGN KEY (contest_id) REFERENCES contests(id)
) PARTITION BY RANGE (submitted_at);
CREATE INDEX idx_submissions_user_problem ON submissions(user_id, problem_id, submitted_at DESC);
CREATE INDEX idx_submissions_contest ON submissions(contest_id, submitted_at) WHERE contest_id IS NOT NULL;
CREATE INDEX idx_submissions_status ON submissions(status) WHERE status IN ('PENDING', 'QUEUED', 'COMPILING', 'RUNNING');
Rate Limit Headers:
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 8
X-RateLimit-Reset: 1745069460
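These headers derive from the same per-minute counter pattern shown in §9.6 (Valkey INCR + EXPIRE). A hedged in-memory sketch of that fixed-window scheme — a dict stands in for Valkey, and the class and method names are illustrative:

```python
import time

class FixedWindowLimiter:
    """In-memory stand-in for the Valkey INCR/EXPIRE rate-limit pattern."""
    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}

    def allow(self, user_id, now=None) -> bool:
        bucket = int((time.time() if now is None else now) // self.window)
        key = (user_id, bucket)
        self.counters[key] = self.counters.get(key, 0) + 1   # INCR
        return self.counters[key] <= self.limit

    def remaining(self, user_id, now=None) -> int:
        # Feeds the X-RateLimit-Remaining header.
        bucket = int((time.time() if now is None else now) // self.window)
        return max(0, self.limit - self.counters.get((user_id, bucket), 0))
```

The tradeoff of fixed windows: a user can burst up to 2x the limit across a bucket boundary, which is usually acceptable for submission throttling.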
10.2 Get Submission Result
GET /api/v1/submissions/{submission_id}
Response (200 OK):
CREATE TABLE contests (
id SERIAL PRIMARY KEY,
title VARCHAR(255) NOT NULL,
contest_type VARCHAR(30) NOT NULL CHECK (contest_type IN ('weekly', 'biweekly', 'special')),
start_time TIMESTAMPTZ NOT NULL,
end_time TIMESTAMPTZ NOT NULL,
duration_minutes INT NOT NULL,
problem_ids INT[] NOT NULL,
is_rated BOOLEAN NOT NULL DEFAULT true,
penalty_minutes INT NOT NULL DEFAULT 5,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE contest_participants (
contest_id INT NOT NULL REFERENCES contests(id),
user_id UUID NOT NULL REFERENCES users(id),
rank INT,
problems_solved INT NOT NULL DEFAULT 0,
total_penalty INT NOT NULL DEFAULT 0,
score_detail JSONB NOT NULL DEFAULT '{}',
rating_before INT,
rating_after INT,
rating_delta INT,
PRIMARY KEY (contest_id, user_id)
);
Response (200 OK, Wrong Answer):
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
rating INT NOT NULL DEFAULT 1500,
problems_solved INT NOT NULL DEFAULT 0,
easy_solved INT NOT NULL DEFAULT 0,
medium_solved INT NOT NULL DEFAULT 0,
hard_solved INT NOT NULL DEFAULT 0,
streak_days INT NOT NULL DEFAULT 0,
is_premium BOOLEAN NOT NULL DEFAULT false,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Show 5 more endpoints (Run Custom Test, Get Leaderboard, WebSocket, Register, Get Problem)
10.3 Run Custom Test
POST /api/v1/run — Runs user code against custom input (not hidden test cases). Returns raw stdout/stderr. Does not create a submission record.
{
"problem_id": 1,
"language": "python3",
"code": "class Solution:\n def twoSum(self, nums, target):\n seen = {}\n for i, n in enumerate(nums):\n if target - n in seen:\n return [seen[target - n], i]\n seen[n] = i",
"contest_id": 123
}
10.4 Get Contest Leaderboard
GET /api/v1/contests/{contest_id}/leaderboard?page=1&page_size=50 — Paginated rankings sorted by problems_solved (desc), total_penalty (asc). Each entry includes per-problem breakdown (accepted, time, wrong attempts).
10.5 WebSocket Verdict Stream
WS /api/v1/ws/verdicts — Client subscribes with {"action": "subscribe", "submission_id": "..."}. Server pushes status transitions: COMPILING → RUNNING (with test_cases_passed count) → final verdict (ACCEPTED/WA/TLE/etc with execution_time_ms).
10.6 Register for Contest
POST /api/v1/contests/{contest_id}/register → 201 Created with contest_starts_at.
10.7 Get Problem
GET /api/v1/problems/{slug} — Returns title, difficulty, description (markdown), constraints, examples, starter_code per language, tags, acceptance_rate. Hidden test cases are never exposed.
11. Deep Dives
Seven subsystems that each deserve their own article. Expand the collapsibles for implementation code.
11.0 "Run Code" vs "Submit" — Two Different Paths
These two buttons in the IDE look similar but follow very different paths through the system:
| Dimension | "Run Code" | "Submit" |
|---|---|---|
| Test cases | Visible examples only (1-3) | Full hidden test suite (10-50+) |
| Kafka topic | submission.custom-test (lowest priority) | submission.practice or submission.contest |
| PostgreSQL write | None (ephemeral result) | Full submission record (verdict, timing, memory) |
| Leaderboard update | None | Yes (if contest + Accepted) |
| Acceptance stats | Not affected | Updates problem acceptance rate |
| Response | Raw stdout/stderr + execution time | Verdict (Accepted, WA, TLE, etc.) |
| Typical latency | 1-3 seconds | 3-5 seconds (more test cases) |
"Run Code" is the fast debug loop. Users hit it dozens of times per problem to check edge cases with their own inputs. "Submit" is the formal judgment against hidden test cases. Keeping these paths separate prevents debug runs from consuming contest/practice worker capacity.
11.0.1 TLE vs Timeout
Two different failure modes:
- TLE: Code ran but exceeded the time limit. Algorithmic issue. Verdict is permanent.
- Timeout (SYSTEM_BUSY): Submission sat in queue too long. System issue. Auto-requeued with elevated priority (up to 2 retries):
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "PENDING",
"submitted_at": "2026-04-18T14:30:00Z",
"estimated_wait_seconds": 3
}
11.1 Code Sandbox Design (Firecracker microVM Lifecycle)
Warm microVM Pool Architecture
The system maintains a pool of pre-restored Firecracker microVMs, one pool per supported language. Each VM is a fully-booted machine with its own Linux kernel, restored from a snapshot with the language runtime already initialized. When a submission arrives, a VM is claimed from the pool (~2ms) instead of booting a new one (~125ms).
How Firecracker Execution Works
Each submission follows this flow:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"problem_id": 1,
"language": "python3",
"status": "ACCEPTED",
"execution_time_ms": 48,
"memory_usage_kb": 16384,
"test_cases_passed": 35,
"test_cases_total": 35,
"submitted_at": "2026-04-18T14:30:00Z",
"judged_at": "2026-04-18T14:30:03Z",
"percentile": {
"time": 92.3,
"memory": 87.1
}
}
Keeping Firecracker Warm: Zero Boot Time via Snapshot/Restore
Cold-booting a Firecracker microVM takes ~125ms (start VMM, load kernel, mount rootfs, init language runtime). That's too slow at 2K submissions/sec. The solution: Firecracker native snapshots — no external CRIU dependency.
Firecracker snapshot restore ~25ms vs cold boot ~125ms — 5x faster. The snapshot captures the full VM state (memory, guest kernel, language runtime, JIT-warmed code). Restoring is a single API call: PUT /snapshot/load. The VM resumes exactly where it was.
Three-tier warm strategy:
95%+ of submissions claim from the warm pool (2ms). During burst traffic, snapshot restore kicks in (25ms). Cold boot only happens when building snapshots after a language version update.
Snapshot creation (once per language, at image build time):
{
"id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
"problem_id": 1,
"language": "java",
"status": "WRONG_ANSWER",
"test_cases_passed": 18,
"test_cases_total": 35,
"verdict_detail": {
"failed_test_case": 19,
"input": "[3,2,4]\n6",
"expected_output": "[1,2]",
"actual_output": "[0,2]"
},
"submitted_at": "2026-04-18T14:31:00Z",
"judged_at": "2026-04-18T14:31:04Z"
}
Cold boot (fallback, ~125ms):
// Request
{"language": "cpp", "code": "...", "input": "42"}
// Response 200
{"output": "84", "execution_time_ms": 12, "memory_usage_kb": 3456, "exit_code": 0}VM reset via diff snapshot (~5ms) instead of killing and re-creating. After each submission, restore the VM from a pre-execution diff snapshot. This reverts all filesystem and memory changes atomically. A single VM can serve multiple submissions back-to-back without being destroyed.
Pre-warming strategy: The pool dynamically adjusts every 5 seconds: target_idle = max(min_idle, demand_5min_avg * 1.5, contest_pre_warm). Before contests, it boosts based on historical language distribution (Python 45%, C++ 30%, Java 15%).
VM health scoring: Each VM tracks age (max 60 min) and execution count (max 100). VMs below health score 0.3 are killed and replaced from snapshot. Security rotation guarantee — no VM survives longer than an hour.
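A minimal sketch of the health scoring described above, assuming the score is the worse of a linear age decay and a linear execution-count decay (the limits — 60 minutes, 100 executions, kill threshold 0.3 — come from the text; the exact weighting is an assumption):

```java
class VmHealth {
    static final double KILL_THRESHOLD = 0.3;

    // Score decays linearly with both age and execution count; whichever
    // dimension is closer to its limit dominates.
    static double score(long ageMinutes, int execCount) {
        double ageFactor  = Math.max(0.0, 1.0 - ageMinutes / 60.0);
        double execFactor = Math.max(0.0, 1.0 - execCount / 100.0);
        return Math.min(ageFactor, execFactor);
    }

    // Retire when the score drops below threshold — guarantees no VM
    // survives past 60 minutes regardless of how few executions it served.
    static boolean shouldRetire(long ageMinutes, int execCount) {
        return score(ageMinutes, execCount) < KILL_THRESHOLD;
    }
}
```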
Show warm pool management code (pool struct, snapshot restore, pre-warming)
void handleTimeout(Submission sub) {
if (sub.getRetryCount() < 2) {
sub.incrementRetryCount();
sub.setPriority(Priority.HIGH);
kafka.send("submission.retry", sub);
db.updateStatus(sub.getId(), "QUEUED", "Auto-retry due to system timeout");
} else {
db.updateStatus(sub.getId(), "TIMEOUT",
"Server busy. Please try again in a few seconds.");
}
}
Isolation comparison — Docker vs gVisor vs Firecracker:
Docker (runc):
User code → syscall → host Linux kernel (shared!)
Risk: Kernel exploit (Dirty COW) → host compromise
gVisor (runsc):
User code → syscall → gVisor Sentry (user-space Go) → limited host syscalls
Risk: Bug in gVisor code → limited blast radius
Limitation: ~200 of ~300 syscalls reimplemented. Edge cases break.
Firecracker microVM:
User code → syscall → guest Linux kernel → KVM hypercall → host kernel
Risk: KVM escape (extremely rare, hardware-enforced boundary)
Advantage: Full Linux kernel inside VM. Zero syscall compatibility issues.
Resource limit enforcement (inside the microVM):
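A hedged sketch of what guest-side enforcement could look like, assuming the guest init applies standard ulimit caps and drops to the unprivileged sandbox user via util-linux setpriv before exec'ing user code. The helper and values are illustrative, not the production setup:

```java
class GuestLimits {
    // Builds a shell command (run under sh -c inside the guest) that applies
    // per-process limits, then replaces the shell with the user's program.
    static String wrap(String runCommand, int cpuSeconds, int memoryKB, int maxProcs) {
        return String.join(" && ",
            "ulimit -t " + cpuSeconds,  // CPU time — kernel sends SIGKILL past the hard limit
            "ulimit -v " + memoryKB,    // virtual memory, KB (belt-and-suspenders on top of the 256MB VM)
            "ulimit -u " + maxProcs,    // process count — fork-bomb guard
            "ulimit -f 10240",          // max output file size, KB
            "exec setpriv --reuid=sandbox --regid=sandbox --clear-groups " + runCommand);
    }
}
```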
11.2 Judge Pipeline
Test case execution. Commands run inside the Firecracker microVM via vsock: stdin input, capture stdout/stderr, wall-clock time via System.nanoTime(), peak memory from guest cgroup stats. A ScheduledFuture at timeLimitMs + 2s kills the VM on deadline.
Output comparison. Token-based by default (split by whitespace, compare token-by-token — handles trailing whitespace/blank lines). Float comparator: epsilon 1e-6 absolute, 1e-9 relative. Multi-answer problems: checker program in a separate container validates output properties.
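The comparison rules above can be sketched as follows; the epsilon values come from the text, the helper names are assumptions:

```java
import java.util.Arrays;

class OutputComparator {
    // Token-based comparison: split on any whitespace run, so trailing
    // spaces and blank lines never cause a spurious Wrong Answer.
    static boolean tokensEqual(String expected, String actual) {
        return Arrays.equals(expected.trim().split("\\s+"),
                             actual.trim().split("\\s+"));
    }

    // Float comparator: accept if within 1e-6 absolute OR 1e-9 relative error.
    static boolean floatsEqual(double expected, double actual) {
        double diff = Math.abs(expected - actual);
        return diff <= 1e-6 || diff <= 1e-9 * Math.abs(expected);
    }
}
```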
Compilation pipeline by language (10 languages shown)
| Language | Compiler/Runtime | Compile Command | Run Command | Notes |
|---|---|---|---|---|
| Python 3.12 | CPython 3.12 | (none, interpreted) | python3 solution.py | 3x time limit vs C++ |
| Java 21 | OpenJDK 21 | javac -d /sandbox Solution.java | java -cp /sandbox -Xmx256m Solution | Heap limited by -Xmx |
| C++ 20 | g++ 13 | g++ -O2 -std=c++20 -o solution solution.cpp | ./solution | Reference language for time limits |
| Go 1.22 | go 1.22 | go build -o solution solution.go | ./solution | 2x time limit vs C++ |
| Rust 1.77 | rustc 1.77 | rustc -O -o solution solution.rs | ./solution | Same time limit as C++ |
| JavaScript | Node 22 | (none) | node solution.js | 3x time limit |
| C | gcc 13 | gcc -O2 -std=c17 -o solution solution.c -lm | ./solution | Same time limit as C++ |
| C# | .NET 8 | dotnet build -c Release | dotnet run | 3x time limit |
| Kotlin | Kotlin 1.9 / JVM | kotlinc solution.kt -include-runtime -d solution.jar | java -jar solution.jar | 2x time limit |
| TypeScript | tsx (Node 22) | (transpiled on the fly) | tsx solution.ts | 3x time limit |
11.3 Contest System (Real-Time Leaderboard, Penalty Calculation, Elo Rating)
Leaderboard update. On Accepted verdict, a Valkey Lua script atomically: check duplicate → mark accepted → calculate penalty (solve_time + wrong_attempts * 5min) → update sorted set (solved * 10^9 - total_penalty) → PUBLISH rank change for WebSocket push. Wrong attempts: separate Lua script, increments only if problem isn't already accepted.
Show leaderboard Lua scripts (atomic update + wrong attempt tracking)
Verdict executeSubmission(Submission sub, MicroVM vm) throws Exception {
// 1. Write user code to the VM via vsock
vm.writeFile("/sandbox/solution" + ext(sub.getLanguage()), sub.getCode());
// 2. Compile (if needed) via vsock command channel
long deadlineMs = sub.getTimeLimitMs() + 2000;
if (needsCompilation(sub.getLanguage())) {
ExecResult compile = vm.exec(compileCommand(sub.getLanguage()), deadlineMs);
if (compile.getExitCode() != 0)
return Verdict.compilationError(compile.getStderr());
}
// 3. Run against each test case (stop on first failure)
List<TestCase> testCases = testCaseCache.get(sub.getProblemId());
int maxTimeMs = 0, maxMemoryKB = 0, passed = 0;
for (int i = 0; i < testCases.size(); i++) {
TestCase tc = testCases.get(i);
ExecResult result = vm.execWithStdin(runCommand(sub.getLanguage()), tc.getInput(), deadlineMs);
maxTimeMs = Math.max(maxTimeMs, result.getTimeMs());
maxMemoryKB = Math.max(maxMemoryKB, result.getMemoryKB());
if (result.getTimeMs() > sub.getTimeLimitMs())
return Verdict.tle(passed, testCases.size());
if (result.getMemoryKB() > sub.getMemoryLimitKB())
return Verdict.mle(passed, testCases.size());
if (result.getExitCode() != 0)
return Verdict.runtimeError(passed, result.getStderr());
if (!compareOutput(sub.getProblemId(), tc, result.getStdout()))
return Verdict.wrongAnswer(passed, i + 1);
passed++;
}
return Verdict.accepted(passed, testCases.size(), maxTimeMs, maxMemoryKB);
}
Live leaderboard push via WebSocket:
Post-Contest Elo Rating Computation:
The live leaderboard (penalty-time sorting) determines contest placement. After the contest ends, a separate async job computes rating changes using an Elo system adapted for multi-player competition. This is the same approach LeetCode uses (introduced 2020).
The key insight: standard Elo is designed for 1v1 matches. In a contest with N participants, the algorithm treats it as a round-robin where each pair of participants "plays a match" — the higher-ranked participant "wins."
The formula: for each player, compute Expected Rank (sum of win probabilities against all other participants using standard Elo 1/(1 + 10^((Ri - Rj)/400))), then delta = K * (ln(expected_rank) - ln(actual_rank)), clamped to [-150, +150]. K-factor starts at 80 for new players and decreases to 20 with experience. Starting rating: 1500. Absence penalty: ~-10 to -30.
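The rating math above can be sketched in a few lines (a hedged sketch of the described formula, not LeetCode's exact implementation):

```java
class ContestRating {
    // Expected rank = 1 + sum over opponents of P(opponent beats player),
    // using the standard Elo win probability 1/(1 + 10^((Ri - Rj)/400)).
    static double expectedRank(double playerRating, double[] opponents) {
        double rank = 1.0;
        for (double opp : opponents)
            rank += 1.0 / (1.0 + Math.pow(10, (playerRating - opp) / 400.0));
        return rank;
    }

    // delta = K * (ln(expected_rank) - ln(actual_rank)), clamped to [-150, +150].
    static int delta(double expectedRank, int actualRank, double kFactor) {
        double raw = kFactor * (Math.log(expectedRank) - Math.log(actualRank));
        return (int) Math.max(-150, Math.min(150, Math.round(raw)));
    }
}
```

With one equally-rated opponent the expected rank is 1.5; finishing first then yields a positive delta, finishing second a negative one.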
Show Elo rating computation code (Java)
void createSnapshot(String language) throws Exception {
// 1. Cold-boot a fresh microVM
MicroVM vm = coldBoot(language);
// 2. Warm up the runtime inside the VM
// Python: import sys, collections, heapq, math, itertools
// Java: trigger classloader + JIT on common paths (HashMap, Arrays.sort, Scanner)
// Go/Rust/C++: no warm-up needed (statically compiled)
vm.exec(warmupCommands.get(language));
// 3. Snapshot the fully-initialized VM via Firecracker API
// PUT /snapshot/create {"snapshot_type": "Full", ...}
String snapshotDir = "/snapshots/" + language + "/latest";
vm.getVMM().createSnapshot(new SnapshotConfig(
Path.of(snapshotDir, "vmstate"),
Path.of(snapshotDir, "mem"),
true // enableDiffSnapshots — for fast reset
));
// 4. Kill the VM. Snapshot files (~50-100MB) stored on worker SSD.
vm.getVMM().stop();
}
11.4 Multi-Language Support (Per-Language Rootfs Images and Compilation Pipeline)
What runs where — everything is on the same physical machine:
EC2 Instance (c5.metal — bare metal, KVM enabled)
│
├── K8s Node (kubelet, containerd)
│ └── Worker Pod (regular Docker container)
│ └── Java Worker App (Spring Boot)
│ ├── Kafka Consumer (pulls submissions)
│ ├── MicroVM Pool Manager
│ ├── spawns: /usr/bin/firecracker → microVM (python3.ext4)
│ ├── spawns: /usr/bin/firecracker → microVM (java21.ext4)
│ ├── spawns: /usr/bin/firecracker → microVM (cpp20.ext4)
│ └── ... (8 VMMs per pod = 8 concurrent submissions)
│
├── /dev/kvm ← Firecracker needs this device (hardware virtualization)
├── /var/lib/firecracker/
│ └── vmlinux ← shared minimal kernel (~4MB, boots in all VMs)
├── /var/lib/rootfs/
│ ├── python3.ext4 (~300MB, CPython 3.12 pre-installed)
│ ├── java21.ext4 (~450MB, OpenJDK 21 pre-installed)
│ ├── cpp20.ext4 (~250MB, g++ 13 pre-installed)
│ ├── go122.ext4 (~200MB, Go 1.22 pre-installed)
│ └── ... (20+ languages, all compilers pre-installed in rootfs)
└── /var/lib/snapshots/
├── python3/ (vmstate + mem, ~100MB — runtime pre-initialized)
├── java21/ (vmstate + mem, ~120MB — JIT pre-warmed)
└── ... (one snapshot per language for fast restore)
/dev/kvm is mandatory. Firecracker uses KVM for hardware virtualization. The K8s node pool must use bare-metal instances (c5.metal, m5.metal) or instances with nested virtualization enabled. The worker pod mounts /dev/kvm as a host device.
The Java app, Firecracker VMMs, and all microVMs share the same physical CPU and RAM. Each Firecracker process is ~5MB. Each microVM gets 256MB RAM and 2 vCPU via KVM resource partitioning. The Java app communicates with each VM via vsock (host↔guest data channel, no network stack).
Software flow (how a submission moves through the system):
Each ext4 rootfs image is a full minimal Linux filesystem with the language compiler pre-installed. Built once in CI, not at runtime:
python3.ext4 → Ubuntu 22.04 minimal + CPython 3.12 + stdlib (~300MB)
java21.ext4 → Ubuntu 22.04 minimal + OpenJDK 21 JDK headless (~450MB)
cpp20.ext4 → Ubuntu 22.04 minimal + g++ 13 (~250MB)
go122.ext4 → Ubuntu 22.04 minimal + Go 1.22 (~200MB)
rust177.ext4 → Ubuntu 22.04 minimal + rustc 1.77 (~280MB)
node22.ext4 → Ubuntu 22.04 minimal + Node.js 22 (~250MB)
All 20+ rootfs images + snapshots fit on a 16GB worker SSD. The build process:
- debootstrap a minimal Ubuntu rootfs (not Docker — Firecracker boots raw ext4)
- Install the language runtime via apt/curl
- Create sandbox user, /sandbox workspace, mount points
- Package as ext4: dd if=/dev/zero of=python3.ext4 bs=1M count=500 && mkfs.ext4 python3.ext4 && mount -o loop && copy → umount
- Snapshot a warm VM from this rootfs (§11.1) — the snapshot includes initialized runtime + JIT
Language-specific time limit multipliers:
The same algorithm runs at very different speeds across languages. A two-sum hash map solution in C++ runs in 5ms. In Python, the same logic takes 50ms. Time limits scale by language.
What LeetCode actually does: Same time limit regardless of language. This makes Python significantly harder for some problems — a common community complaint. This design adds per-language multipliers for fairness. The tradeoff: multipliers add complexity and problem setters must calibrate limits against all languages, not just C++.
| Multiplier | Languages |
|---|---|
| 1.0x (reference) | C, C++, Rust |
| 1.5x | Swift |
| 2.0x | Go, Java, Kotlin, C#, Haskell |
| 2.5x | Scala |
| 3.0x (default) | Python, JavaScript, TypeScript, Ruby, PHP |
If the problem's base time limit is 2000ms (set for C++), a Python solution gets 6000ms. Unknown languages default to 3.0x.
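Applying the multiplier table is a one-line lookup. The language keys here are assumed identifiers; unknown languages fall back to 3.0x as stated:

```java
import java.util.Map;

class TimeLimits {
    // Multiplier table from the section above; anything absent (Python,
    // JavaScript, TypeScript, Ruby, PHP, unknown languages) defaults to 3.0x.
    static final Map<String, Double> MULTIPLIER = Map.of(
        "c", 1.0, "cpp", 1.0, "rust", 1.0,
        "swift", 1.5,
        "go", 2.0, "java", 2.0, "kotlin", 2.0, "csharp", 2.0, "haskell", 2.0,
        "scala", 2.5);

    static int effectiveLimitMs(int baseLimitMs, String language) {
        return (int) (baseLimitMs * MULTIPLIER.getOrDefault(language, 3.0));
    }
}
```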
Language-specific memory overhead and image versioning
Memory considerations:
| Language | Base Memory Overhead | Notes |
|---|---|---|
| C/C++ | ~2MB | Minimal runtime |
| Rust | ~2MB | Minimal runtime |
| Go | ~10MB | Go runtime + GC |
| Java | ~60MB | JVM startup (with -Xmx256m) |
| Kotlin | ~60MB | JVM-based |
| Scala | ~80MB | JVM + Scala runtime |
| Python 3 | ~20MB | Interpreter |
| JavaScript (Node) | ~30MB | V8 engine |
| C# (.NET) | ~40MB | CLR |
| Ruby | ~15MB | Interpreter |
| Haskell (GHC) | ~25MB | RTS |
Image versioning (source of truth):
MicroVM coldBoot(String language) throws Exception {
FirecrackerConfig cfg = FirecrackerConfig.builder()
.socketPath("/tmp/fc-" + UUID.randomUUID() + ".sock")
.kernelImagePath("/var/lib/firecracker/vmlinux") // Minimal kernel, ~4MB
.addDrive(Drive.builder()
.driveId("rootfs")
.pathOnHost(rootfsImages.get(language)) // ext4 with language runtime
.isRootDevice(true)
.isReadOnly(false)
.build())
.machineConfig(new MachineConfig(2, 256)) // 2 vCPU, 256MB
// No network interfaces — completely isolated. Communication via vsock only.
.build();
FirecrackerVMM vmm = FirecrackerVMM.start(cfg);
// PUT /actions {"action_type": "InstanceStart"}
vmm.startInstance();
return new MicroVM(vmm, language, VMState.IDLE);
}
11.5 Submission Queue Management (Priority, Fair Queuing, Auto-Scaling)
During contests, submission traffic spikes 5-10x. The queue layer must handle this without either dropping contest submissions or starving practice submissions entirely.
Queue architecture with premium priority:
Premium "Lightning Judge" priority (practice submissions only):
LeetCode Premium subscribers get 3-10x faster judging during peak hours. This is implemented as a separate Kafka topic (submission.premium) that practice workers consume with higher priority. The key constraint: during contests, all contest submissions are equal priority regardless of premium status — fairness in competition is non-negotiable.
public class MicroVMPool {
private final Map<String, BlockingQueue<MicroVM>> idle = new ConcurrentHashMap<>();
private final Map<String, AtomicInteger> busyCount = new ConcurrentHashMap<>();
private final Map<String, String> snapshotDirs; // language -> snapshot path
private final PoolConfig config;
// Default pool sizes per language
static final Map<String, Integer> DEFAULT_MIN_IDLE = Map.ofEntries(
entry("python3", 200), entry("java", 150), entry("cpp", 150),
entry("go", 100), entry("rust", 50), entry("javascript", 80),
entry("csharp", 50), entry("kotlin", 40), entry("ruby", 30),
entry("swift", 30), entry("typescript", 60), entry("c", 100)
);
// Restore a VM from Firecracker snapshot — ~25ms
MicroVM restoreVM(String language) throws Exception {
String snapshotDir = snapshotDirs.get(language);
String socket = "/tmp/fc-" + UUID.randomUUID() + ".sock";
FirecrackerVMM vmm = FirecrackerVMM.start(
FirecrackerConfig.builder().socketPath(socket).build());
vmm.loadSnapshot(new SnapshotConfig(
Path.of(snapshotDir, "vmstate"),
Path.of(snapshotDir, "mem"),
true // enableDiffSnapshots
));
return new MicroVM(vmm, language, VMState.IDLE);
}
// Reset VM via diff snapshot — ~5ms, reverts all changes
void resetVM(MicroVM vm) throws Exception {
vm.getVMM().loadSnapshot(vm.getBaseSnapshot());
vm.incrementExecCount();
}
// Pre-warming — runs every 5 seconds via ScheduledExecutorService
void adjustPoolSize() {
for (String lang : idle.keySet()) {
int target = Math.max(
config.getMinIdle(lang),
Math.max(
(int)(demandMovingAvg(lang, Duration.ofMinutes(5)) * 1.5),
contestPreWarm(lang)));
int current = idle.get(lang).size();
if (current < target) executor.submit(() -> spawnVMs(lang, target - current));
if (current > target * 2) retireExcess(lang, current - target);
}
}
}
During non-contest hours, premium submissions typically get verdicts in ~1-2 seconds (vs ~3-5 seconds for free tier). During peak contest hours with practice traffic spillover, the gap widens to 3-10x.
Fair queuing during contests:
The key insight: contest workers are dedicated and never share capacity with practice. But practice workers can be temporarily re-assigned to contest duty during peak load. This is done by having practice workers additionally subscribe to the contest topic when contest queue depth exceeds a threshold.
The adaptive consumer checks contest queue depth periodically. When depth exceeds a threshold, practice workers subscribe to the contest topic as well (dual-consume). When depth drops below half the threshold, they stop helping and return to practice-only mode. Hysteresis (threshold vs threshold/2) prevents flapping.
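The hysteresis logic is small enough to show inline (threshold value and names are illustrative):

```java
class AdaptiveConsumer {
    private boolean helpingContest = false;

    // Dual-consume above the threshold, stop only below half of it. A queue
    // hovering near the threshold therefore can't cause subscribe/unsubscribe
    // flapping — the consumer commits to one mode until depth moves decisively.
    boolean shouldHelpContest(long contestQueueDepth, long threshold) {
        if (!helpingContest && contestQueueDepth > threshold) {
            helpingContest = true;          // start dual-consuming contest topic
        } else if (helpingContest && contestQueueDepth < threshold / 2) {
            helpingContest = false;         // return to practice-only mode
        }
        return helpingContest;
    }
}
```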
Auto-scaling based on queue depth. KEDA ScaledObject watches Kafka consumer lag per topic. Contest workers: min 100, max 1000 pods, scale up when lag >50 per partition, 5-second polling, 60-second cooldown. Practice workers follow the same pattern with higher min (200) and max (1500).
Queue position feedback. When the queue is deep, the API returns an estimated_wait_seconds field calculated as consumer_lag / processing_rate. This lets the UI show "Estimated wait: ~8 seconds" instead of a spinner.
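The estimate itself is a one-liner; the field name matches the API above, and rounding up is an assumption:

```java
class QueueEta {
    // estimated_wait_seconds = consumer_lag / processing_rate, rounded up.
    // Returns 0 when there is no backlog or the rate is unknown.
    static long estimatedWaitSeconds(long consumerLag, double submissionsPerSecond) {
        if (submissionsPerSecond <= 0 || consumerLag <= 0) return 0;
        return (long) Math.ceil(consumerLag / submissionsPerSecond);
    }
}
```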
Per-user fair queuing. Rate limiting via Valkey sliding window (INCR ratelimit:{user_id}:{minute_bucket}, 2-min TTL) prevents any single user from monopolizing execution capacity. Fails open on Valkey errors.
Contest-period auto-scaling timeline:
11.6 Web IDE (Monaco Editor Integration)
Monaco Editor — the VS Code engine. Syntax highlighting, bracket matching, minimap, multi-cursor for 20+ languages.
| Feature | Free Tier | Premium |
|---|---|---|
| Syntax highlighting (20+ languages) | Yes | Yes |
| Context-aware autocomplete | Basic (keyword-based) | Enhanced (language-server-backed) |
| Vim / Emacs keybinding modes | Yes | Yes |
| Font size, tab size customization | Yes | Yes |
| Integrated debugger (breakpoints, step-through) | No | Yes |
| Dark/light theme | Yes | Yes |
Starter code: Per-language templates from problems.starter_code JSONB. Defines the function signature the user implements.
Premium debugger: Breakpoints, step-through, variable inspection. Runs in a separate container with relaxed limits (60s). Not counted as submissions.
11.7 Problem Creation and Test Case Pipeline
4,000+ problems aren't hand-crafted. There's a pipeline.
Problem setters: Contracted engineers (rating 2000+, 1000+ solved). $20-65 per problem.
Test case generation pipeline:
- Generator: Python script that produces random inputs satisfying the problem's constraints. For "Two Sum": generates arrays of length 2 to 10^4 with values in [-10^9, 10^9] and a valid target.
- Validator: Verifies every generated input satisfies stated constraints (length bounds, value ranges, graph connectivity, etc.). Rejects malformed test cases.
- Reference solutions: Problem setter writes correct solutions in at least C++, Python, and Java. These are run against all test cases to generate expected outputs and calibrate time limits.
- Time limit calibration: Set at 3x the runtime of the slowest reference solution (in C++). Language multipliers (§11.4) adjust from there.
- Cache invalidation: When test cases are updated for an existing problem, a Kafka message invalidates the worker SSD cache so workers fetch the new version on next access.
12. Identify Bottlenecks
Bottleneck 1: Warm microVM pool exhaustion during contest spikes
During a contest, Python and C++ submissions dominate (60% and 25% respectively). If the Python pool has 200 warm VMs and 300 concurrent Python submissions arrive, 100 submissions block waiting for a VM.
Mitigation: Dynamic pool rebalancing. A rebalancer runs every 5 seconds: when a language's queue depth exceeds 2x its idle pool, it finds a "donor" language with excess idle VMs (Haskell, Ruby, Scala) and transfers capacity — kill idle donor VMs and restore microVMs from Firecracker snapshot (~25ms) for the starved language. Also, pre-warm extra Python and C++ VMs 15 minutes before contest start (see §11.1 pre-warming strategy).
Bottleneck 2: PostgreSQL write throughput for submission status updates
Each submission generates 3-4 status updates (PENDING -> QUEUED -> COMPILING -> RUNNING -> verdict). At 2,000 submissions/sec peak, that is 8,000 UPDATE statements per second.
Mitigation: Batch status updates. Workers buffer intermediate status transitions (COMPILING, RUNNING) and flush to PostgreSQL every 500ms in a single multi-row UPDATE ... FROM (VALUES ...). Only the final verdict update is sent immediately (the user is waiting for it). The PENDING→QUEUED transition batches naturally via Kafka producer's linger.ms.
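A sketch of building the flush statement. A real worker would bind parameters through a PreparedStatement; the column list and the uuid/text casts are assumptions:

```java
class StatusBatcher {
    // One multi-row UPDATE ... FROM (VALUES ...) replaces batchSize individual
    // UPDATEs — a single round trip and a single statement to plan.
    static String buildFlushSql(int batchSize) {
        StringBuilder values = new StringBuilder();
        for (int i = 0; i < batchSize; i++) {
            if (i > 0) values.append(", ");
            values.append("(?::uuid, ?::text)");
        }
        return "UPDATE submissions s SET status = v.status " +
               "FROM (VALUES " + values + ") AS v(id, status) " +
               "WHERE s.id = v.id";
    }
}
```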
Bottleneck 3: Test case file I/O on workers
Each submission reads test case files from disk. With 30 test cases per problem and 2,000 submissions/sec, that is 60,000 file reads per second across the worker fleet.
Mitigation: In-memory test case cache on each worker. The entire test case corpus is <2GB — load all test cases into a Map<Integer, List<TestCase>> at worker startup. Refresh every 10 minutes or on cache invalidation signal via Kafka when a problem's test cases are updated. This eliminates all disk I/O for test case reads.
13. Failure Scenarios
Scenario 1: Execution Worker Crash Mid-Execution
| Time | Event |
|---|---|
| T+0s | Worker pod crashes (OOM kill, node eviction, or Firecracker VMM crash). |
| T+0s | Submission is mid-execution. Test case 15 of 30 was running. |
| T+10s | Kafka consumer session timeout expires. The unacknowledged message is reassigned to another worker. |
| T+10s | New worker picks up the submission, starts fresh (compilation + all test cases from scratch). |
| T+15s | Submission status was stuck at "RUNNING". New worker updates it. User sees the verdict. |
Impact: 10-15 seconds of additional delay for the affected submission. The user sees the verdict eventually. No data loss. The wasted execution time (test cases 1-15 on the crashed worker) is unrecoverable, but at 3 seconds average execution time, the total waste is small.
What about the VMs? The crashed worker's Firecracker processes are orphaned. The Kubernetes kubelet detects the dead pod and cleans up all associated processes. The warm pool on other workers is unaffected.
Scenario 2: Valkey Crash During Contest
| Time | Event |
|---|---|
| T+0s | Valkey primary crashes. |
| T+0-1s | Valkey Sentinel detects the failure and promotes the replica to primary. |
| T+1-2s | All leaderboard writes fail during failover window. |
| T+2s | New primary is ready. Workers reconnect via Sentinel. |
| T+2s | Leaderboard updates resume. Submissions that got verdicts during the 2-second window had their leaderboard updates dropped. |
Impact: 1-2 seconds of leaderboard inconsistency. Submissions still get judged (that flow only depends on Kafka and PostgreSQL). The missed leaderboard updates are recovered by a reconciliation job that runs every 30 seconds: it queries PostgreSQL for all "Accepted" contest submissions and ensures each has a corresponding leaderboard entry in Valkey. Any missing or mismatched entries are corrected automatically.
Scenario 3: Kafka Broker Goes Down
| Time | Event |
|---|---|
| T+0s | One of 3 Kafka brokers crashes. |
| T+0s | Partitions led by the crashed broker become unavailable. |
| T+0-15s | Kafka controller detects failure, re-elects partition leaders on surviving brokers. |
| T+15s | All partitions available again. Messages in-flight during failure are retried by producers (idempotent producer with retries=5). |
Impact: 0-15 seconds of elevated submission latency. No submissions are lost (replication factor 3 means 2 copies of every message survive). The API returns 202 Accepted before the Kafka write completes, so users don't see the delay on submission. They see a longer wait for the verdict.
Scenario 4: Container Escape Attempt
| Time | Event |
|---|---|
| T+0s | User submits code that attempts a known container escape exploit. |
| T+0.1s | The exploit targets the guest Linux kernel inside the microVM. Even if it succeeds, the attacker is still inside the VM. |
| T+0.1s | The KVM boundary prevents any guest-to-host escape. The attacker controls a 256MB VM with no network — a dead end. |
| T+0.2s | Code crashes or times out. Worker reports Runtime Error or TLE verdict. |
| T+1s | Security monitoring detects anomalous VM behavior (unexpected syscall patterns logged by guest kernel). Alert fires. |
| T+5m | Security team reviews the submission. User account flagged. |
Impact: Zero. Even a successful kernel exploit inside the VM is contained by the KVM hardware boundary. The attacker gets a broken VM with no network. This is why Firecracker was chosen — a compromised guest kernel is a non-event.
Scenario 5: Database Partition Exhaustion
| Time | Event |
|---|---|
| T+0 | The current weekly partition for submissions table fills up (approaching partition boundary date). |
| T+0 | A cron job that creates future partitions 4 weeks in advance failed silently 3 weeks ago. |
| T+0 | INSERTs into the submissions table fail: "no partition of relation submissions found for row." |
Impact: All new submissions fail. Judging continues for already-queued submissions.
Prevention: Daily check on pg_inherits for future partitions. Alert if <2 exist. Weekly job creates 4 weeks ahead; daily job verifies next 14 days as a safety net.
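The create-ahead job reduces to generating DDL for the next N weekly ranges. The partition naming scheme below is an assumption:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class PartitionMaintenance {
    // Emits CREATE TABLE statements for the next N weekly partitions of the
    // submissions table (RANGE-partitioned on submitted_at, per the schema).
    // IF NOT EXISTS makes the job idempotent, so daily and weekly runs overlap safely.
    static List<String> weeklyPartitionDdl(LocalDate from, int weeksAhead) {
        List<String> ddl = new ArrayList<>();
        for (int w = 0; w < weeksAhead; w++) {
            LocalDate start = from.plusWeeks(w);
            LocalDate end = start.plusDays(7);
            String name = "submissions_" + start.toString().replace('-', '_');
            ddl.add("CREATE TABLE IF NOT EXISTS " + name +
                    " PARTITION OF submissions FOR VALUES FROM ('" + start +
                    "') TO ('" + end + "')");
        }
        return ddl;
    }
}
```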
14. Deployment Strategy
Workers. Rolling deployment, maxUnavailable: 10%. Canary pool (5%) first, monitor 15 min (SYSTEM_ERROR rate, latency percentiles, crash rate), then full rollout.
Key Kubernetes Deployment settings for the practice worker pool:
-- ── Leaderboard update (called on Accepted verdict) ──
-- KEYS[1] = contest:<id>:leaderboard (sorted set)
-- KEYS[2] = contest:<id>:user:<user_id> (hash)
-- ARGV[1] = user_id, ARGV[2] = problem_key, ARGV[3] = solve_time_seconds, ARGV[4] = penalty_per_wrong
local already = redis.call('HGET', KEYS[2], ARGV[2] .. '_accepted')
if already == '1' then return 0 end -- Duplicate, ignore
redis.call('HSET', KEYS[2], ARGV[2] .. '_accepted', '1')
redis.call('HSET', KEYS[2], ARGV[2] .. '_time', ARGV[3])
local wrong = tonumber(redis.call('HGET', KEYS[2], ARGV[2] .. '_wrong_attempts') or '0')
local solved = redis.call('HINCRBY', KEYS[2], 'problems_solved', 1)
local penalty = tonumber(ARGV[3]) + (wrong * tonumber(ARGV[4]))
local new_penalty = tonumber(redis.call('HGET', KEYS[2], 'total_penalty') or '0') + penalty
redis.call('HSET', KEYS[2], 'total_penalty', new_penalty)
local score = solved * 1000000000 - new_penalty
redis.call('ZADD', KEYS[1], score, ARGV[1])
local rank = redis.call('ZREVRANK', KEYS[1], ARGV[1])
redis.call('PUBLISH', KEYS[1] .. ':update', -- KEYS[1] already carries the contest:<id> prefix
cjson.encode({user_id=ARGV[1], rank=rank+1, problems_solved=solved, total_penalty=new_penalty}))
return rank + 1
-- ── Wrong attempt tracking (called on non-Accepted verdict) ──
-- KEYS[1] = contest:<id>:user:<user_id> (hash), ARGV[1] = problem_key
local already = redis.call('HGET', KEYS[1], ARGV[1] .. '_accepted')
if already == '1' then return -1 end
return redis.call('HINCRBY', KEYS[1], ARGV[1] .. '_wrong_attempts', 1)
Kata Containers RuntimeClass: handler: kata-fc (Firecracker backend) with a nodeSelector ensuring pods land on KVM-enabled nodes (kata-containers.io/firecracker: "true").
Language image updates. When a new language version is released (e.g., Python 3.13): build image in CI → run 100 curated problems → flag >20% time regression → deploy to 1% canary → monitor 24h → roll out to all workers → update IDE version display.
Contest-period deployment freeze. A CI/CD gate checks the contest schedule and blocks all deploys from 30 minutes before contest start to 30 minutes after end.
Database migrations. Run as K8s Jobs before app deployment. All backward-compatible. ALTER TABLE ADD COLUMN DEFAULT is instant in PG 11+ (no table rewrite).
15. Observability
Key Metrics
| Metric | Source | Alert Threshold |
|---|---|---|
| submission_e2e_latency_p95 | Workers | > 5 seconds for 5 minutes |
| submission_e2e_latency_p99 | Workers | > 10 seconds for 2 minutes |
| verdict_distribution{type="SYSTEM_ERROR"} | Workers | > 0.1% of submissions |
| container_pool_idle{language} | Workers | < 10% of min idle for any language for 2 min |
| container_pool_exhausted{language} | Workers | Any exhaustion event |
| container_crash_rate | Workers | > 1% in 10 minutes |
| kafka_consumer_lag{topic} | Kafka | > 5000 messages for 2 min (contest), > 10000 (practice) |
| leaderboard_update_latency_p99 | Valkey | > 100ms for 1 minute |
| websocket_connections_active | WS Gateway | > 90% of capacity |
| vm_anomalous_behavior | Firecracker/Workers | Unexpected VMM exit, KVM error, or guest kernel panic |
| compilation_error_rate{language} | Workers | > 50% (indicates broken language image) |
| test_case_cache_miss_rate | Workers | > 5% (cache not warming properly) |
| pg_submission_insert_latency_p99 | PostgreSQL | > 50ms |
Dashboard Layout
+------------------------------------------+------------------------------------------+
| Submissions/sec (real-time) | Verdict Distribution (pie chart) |
| [contest] [practice] [custom] | [AC] [WA] [TLE] [MLE] [RE] [CE] [SE] |
+------------------------------------------+------------------------------------------+
| E2E Latency (p50, p95, p99) | Kafka Consumer Lag (per topic) |
| [time series, last 4 hours] | [contest] [practice] [custom-test] |
+------------------------------------------+------------------------------------------+
| Container Pool Status (per language) | Active WebSocket Connections |
| [idle] [busy] [total] per language | [time series, last 1 hour] |
+------------------------------------------+------------------------------------------+
| Contest Leaderboard Update Latency | Worker Pod Count (current / max) |
| [p50, p99, errors] | [contest] [practice] [custom-test] |
+------------------------------------------+------------------------------------------+
Distributed Tracing
Every submission gets a trace ID that follows it through the entire pipeline:
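For example, the spans for one contest submission might break down like this (span names and timings are illustrative, not measured production values):

```
trace_id: <uuid>  (submission, contest topic, Java)
├─ api.validate           0ms   →    8ms   (auth + rate-limit + static scan)
├─ kafka.enqueue          8ms   →   11ms
├─ queue.wait            11ms   →  190ms   (consumer lag)
├─ vm.acquire           190ms   →  194ms   (warm pool claim)
├─ compile              194ms   →  850ms   (inside microVM)
├─ execute.tests[1..25] 850ms   → 2.90s
└─ verdict.persist+push 2.90s   → 2.95s    (PG insert, WebSocket)
```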
ClickHouse Analytics Schema
submission_analytics table: MergeTree engine, partitioned by month, ordered by (problem_id, language, submitted_at). Stores every submission's timing, memory, verdict, queue wait, and compile time. Key queries: language popularity trends, problem difficulty validation (p50/p95/p99 execution times for accepted submissions), and per-worker performance.
Post-contest Elo rating computation (§11.3; full formula in Appendix B):
Map<String, Integer> computeRatingChanges(List<ContestResult> participants) {
Map<String, Integer> changes = new HashMap<>();
for (int i = 0; i < participants.size(); i++) {
ContestResult player = participants.get(i);
int actualRank = i + 1;
// Expected rank = 1 + sum of P(opponent beats player) over all opponents
double expectedRank = 1.0;
for (int j = 0; j < participants.size(); j++) {
if (i == j) continue;
ContestResult opp = participants.get(j);
expectedRank += 1.0 / (1.0 + Math.pow(10, (player.getRating() - opp.getRating()) / 400.0));
}
// K-factor decreases with experience
int k = Math.max(20, 80 - player.getContestsPlayed() * 2);
// Delta: positive if performed better than expected
int delta = (int)(k * (Math.log(expectedRank) - Math.log(actualRank)));
delta = Math.max(-150, Math.min(150, delta));
changes.put(player.getUserId(), delta);
}
return changes;
}

16. Security
Hardware VM isolation via Firecracker. Each submission gets its own kernel. Even a successful guest exploit is a dead end.
Sandbox security is the entire ballgame. Compromised sandbox = access to test cases, other users' code, or the host system. Every decision here is defense-in-depth.
16.1 Eight Layers of Isolation (Defense-in-Depth)
The primary isolation boundary is the Firecracker microVM — a hardware KVM boundary that gives each submission its own Linux kernel. Even if an attacker achieves a full kernel exploit inside the VM, they control a 256MB machine with no network and no path to the host. Inside the VM, additional layers provide defense-in-depth.
What each layer stops:
| Layer | What It Does | What It Stops | Real CVE/Attack Prevented |
|---|---|---|---|
| 1. Firecracker KVM | Hardware VM isolation. Separate guest kernel. Host only sees KVM hypercalls. | ALL guest-side exploits. Kernel exploits, container escapes — they compromise the guest, not the host. | CVE-2019-5736, CVE-2020-15257, CVE-2022-0185 — all are guest-only events under Firecracker |
| 2. Namespaces (guest) | Isolates mount, PID, network, user, IPC, UTS inside guest | Seeing other guest processes, accessing guest filesystem | Additional isolation within the VM |
| 3. cgroups v2 (guest) | Hard CPU, memory (256MB), PIDs (64) limits inside guest | Fork bombs, memory bombs, CPU starvation | while(1) fork() hits PID limit. malloc(1TB) OOM-killed at 256MB |
| 4. seccomp-bpf (guest) | Syscall filter inside the guest kernel | Restricts even inside the VM — principle of least privilege | Blocks io_uring, ptrace, keyctl inside guest |
| 5. Read-only rootfs | Only /sandbox writable inside guest | Writing malware, modifying system binaries | Can't modify /etc/passwd or system binaries |
| 6. No network | No virtio-net device attached to VM. Only vsock to host worker. | Data exfiltration, solution fetching, reverse shells | No DNS, no TCP, no UDP — the VM has no network stack |
| 7. Capability drop | All 41 Linux capabilities removed inside guest | Raw sockets, mounting filesystems, module loading | Defense-in-depth inside the VM |
| 8. no-new-privileges | Prevents setuid/setgid escalation inside guest | Privilege escalation within the VM | Even with root inside VM, can't escape KVM boundary |
Why Firecracker is stronger than gVisor:
| Scenario | gVisor | Firecracker |
|---|---|---|
| Guest kernel exploit | Compromises gVisor Sentry (software boundary) | Stays inside VM (hardware KVM boundary) |
| Syscall compatibility | 95% — edge cases with io_uring, ptrace | 100% — full Linux kernel, zero compat issues |
| Resource accounting | Tricky — some kernel buffers not counted by Sentry | Clean — KVM enforces hard VM memory boundary |
| Debugging failures | "operation not permitted" / "bad system call" — ambiguous | Standard Linux errors — the guest is a real OS |
No path from guest to host. The Firecracker VMM process runs as an unprivileged user on the host. It exposes no network to the guest, no shared filesystem, and communicates only via vsock. Even a compromised VMM (no publicly known instance to date) would only yield access to a single submission's VM data.
16.2 Network Isolation
Worker pods have a strict Kubernetes NetworkPolicy: sandbox containers can reach nothing on the network. The worker process itself can only egress to PostgreSQL (5432), Valkey (6379), Kafka (9092), and S3 (443). All other egress is denied.
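A minimal sketch of such a policy (the namespace, labels, and the HTTPS egress rule for S3 are assumptions; the production manifest will differ):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: worker-egress
  namespace: judge                # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: execution-worker       # assumed label
  policyTypes: ["Egress"]
  egress:
    - to: [{ podSelector: { matchLabels: { app: postgresql } } }]
      ports: [{ port: 5432, protocol: TCP }]
    - to: [{ podSelector: { matchLabels: { app: valkey } } }]
      ports: [{ port: 6379, protocol: TCP }]
    - to: [{ podSelector: { matchLabels: { app: kafka } } }]
      ports: [{ port: 9092, protocol: TCP }]
    - to: [{ ipBlock: { cidr: 0.0.0.0/0 } }]   # S3 over HTTPS only
      ports: [{ port: 443, protocol: TCP }]
```

With no other egress rules, everything else is denied by default once the policy selects the pod. Sandbox microVMs are stricter still: they have no network device at all.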
16.3 microVM Resource Hard Limits
Every microVM runs with:
- Memory: 256MB hard limit (no swap, OOM kill on exceed)
- CPU: 2 vCPU (KVM resource partitioning)
- PIDs: 64 maximum inside guest (prevents fork bombs)
- Wall clock: 15 seconds (worker kills the VM)
- Disk: /sandbox writable, rootfs read-only
- Network: no virtio-net device attached — only vsock to host
- Capabilities: all dropped inside guest
- New privileges: blocked (no-new-privileges inside guest)
A per-VM watchdog thread in the Java worker starts a ScheduledFuture with 15s deadline. On expiry, it kills the Firecracker VMM process and increments the timeout metric.
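A minimal sketch of that watchdog (class and method names are hypothetical; the real task would call destroyForcibly() on the Firecracker VMM process and bump the timeout metric):

```java
import java.util.concurrent.*;

// Hypothetical sketch: arm a kill-switch per VM; cancel it on normal completion.
class VmWatchdog {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "vm-watchdog");
            t.setDaemon(true);
            return t;
        });

    /** Schedules killVmm to fire after deadlineMillis unless cancelled first. */
    ScheduledFuture<?> arm(Runnable killVmm, long deadlineMillis) {
        return scheduler.schedule(killVmm, deadlineMillis, TimeUnit.MILLISECONDS);
    }
}
```

On a normal verdict the worker cancels the future; on expiry the task kills the VMM, which the worker observes as a TLE.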
16.4 Test Case Protection
Expected output NEVER enters the sandbox. Only the input is copied in. Comparison happens in the worker process outside the VM. Even if a user reads every file in their sandbox, they cannot discover the expected answers.
16.5 Code Scanning (Soft Blocks)
Before execution, the API performs a fast static scan of submitted code for known dangerous patterns:
| Language | Blocked Patterns | Reason |
|---|---|---|
| Python | import os, import subprocess, __import__ | System command execution |
| C/C++ | #include <sys/ptrace.h>, #include <sys/mount.h> | Kernel interaction |
| Java | Runtime.getRuntime().exec, ProcessBuilder | Process spawning |
| Go | os/exec, syscall.Exec | System commands |
| JavaScript | child_process, require('fs') | System access |
Note: These are soft blocks (warning + flag for review), not hard blocks. Legitimate solutions sometimes use OS-level primitives. The sandbox is the real security boundary, not code scanning.
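A sketch of the scan step (class name is an assumption; the pattern list is abbreviated from the table above):

```java
import java.util.*;
import java.util.regex.Pattern;

// Illustrative soft-block scanner: a non-empty result means "warn + flag
// for review"; execution still proceeds. Patterns abbreviated for brevity.
class CodeScanner {
    private static final Map<String, List<Pattern>> BLOCKED = Map.of(
        "python", List.of(
            Pattern.compile("\\bimport\\s+os\\b"),
            Pattern.compile("\\bimport\\s+subprocess\\b"),
            Pattern.compile("__import__")),
        "java", List.of(
            Pattern.compile("Runtime\\.getRuntime\\(\\)\\.exec"),
            Pattern.compile("\\bProcessBuilder\\b")));

    /** Returns the patterns that matched; empty list = nothing flagged. */
    static List<String> scan(String language, String source) {
        List<String> hits = new ArrayList<>();
        for (Pattern p : BLOCKED.getOrDefault(language, List.of()))
            if (p.matcher(source).find()) hits.add(p.pattern());
        return hits;
    }
}
```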
16.6 Audit Logging and Incident Response
Every submission, VM creation, VM destruction, and resource limit breach is logged to an append-only Kafka topic with 365-day retention. Security incidents trigger automated response:
| Event | Automated Response |
|---|---|
| Firecracker VMM unexpected exit | Log + alert. Flag submission. Investigate guest kernel panic or VMM bug. |
| Container OOM kill | Normal (MLE verdict). Log for capacity planning. |
| Container wall clock kill | Normal (TLE verdict). Log. |
| Repeated anomalous VM behavior from same user | Rate-limit user. Alert security team. |
| Suspected escape attempt (unusual VMM interaction pattern) | Kill VM. Flag user for review. Alert immediately. |
Secret management. Workers authenticate to PostgreSQL, Kafka, and S3 using Kubernetes Secrets injected as environment variables. Secrets are rotated monthly via an automated pipeline. Database credentials use short-lived tokens from HashiCorp Vault (4-hour TTL, auto-renewed).
Rate limiting. Per-user rate limits prevent abuse:
- Practice: 5 submissions per minute
- Contest: 10 submissions per minute
- Custom test: 10 runs per minute
- API read endpoints: 100 requests per minute
Enforced via Valkey sliding window counters at the API gateway layer. Exceeding the limit returns HTTP 429 with a Retry-After header.
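The production limiter keeps its state in Valkey so every gateway instance shares one view; this in-memory sketch shows the sliding-window logic only (class name is an assumption):

```java
import java.util.*;

// Illustrative sliding-window limiter. Production state lives in Valkey;
// this shows only how the window is maintained and checked.
class SlidingWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Deque<Long>> events = new HashMap<>();

    SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** true = allow; false = respond HTTP 429 with a Retry-After header. */
    synchronized boolean allow(String userId, long nowMillis) {
        Deque<Long> q = events.computeIfAbsent(userId, k -> new ArrayDeque<>());
        while (!q.isEmpty() && q.peekFirst() <= nowMillis - windowMillis)
            q.pollFirst();                      // evict timestamps outside the window
        if (q.size() >= limit) return false;    // limit reached: reject
        q.addLast(nowMillis);
        return true;
    }
}
```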
17. SLOs and Error Budgets
SLOs make the quality target concrete. Error budgets turn "be more careful" into "freeze deploys this week."
| SLI | SLO | Monthly Error Budget |
|---|---|---|
| Submission e2e latency p95 ≤ 5s | 99.9% | 43.2 min |
| Judge availability (non-SYSTEM_ERROR, non-TIMEOUT) | 99.95% | 21.6 min |
| Contest leaderboard accuracy (rank matches PostgreSQL source of truth) | 99.99% | 4.3 min |
| Container warm-pool hit rate | 95% | N/A (capacity metric, not error budget) |
| Sandbox escape rate | 0% | Budget-less — any escape is a P0 incident |
| microVM restore p99 ≤ 200ms (Firecracker snapshot) | 99.9% | 43.2 min |
| WebSocket verdict delivery ≤ 500ms from worker verdict | 99.5% | 3.6 h |
Error budget policy:
- Normal burn (≤1x over 30d): Business as usual. Ship features.
- Fast burn (>2x over 7d): Freeze non-critical deploys. Director sign-off for launches. Investigate root cause.
- Very fast burn (>4x over 1d): Page on-call. Freeze all non-rollback deploys. Launch incident response.
- Exhausted: Next sprint goes entirely to reliability. No new features until budget replenishes.
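The burn multiples above follow from a simple ratio (a sketch; the class and method names are assumptions):

```java
class ErrorBudget {
    /** Burn rate: observed error rate divided by the rate the SLO allows.
        1.0 means the budget is being consumed exactly on schedule. */
    static double burnRate(long badEvents, long totalEvents, double slo) {
        double errorRate = (double) badEvents / totalEvents;
        return errorRate / (1.0 - slo);
    }
}
```

At a 99.9% SLO, 20 bad events out of 10,000 gives 0.002 / 0.001 = 2.0, i.e. the 7-day fast-burn threshold.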
Alert tiering:
- Page-now (5-min response): Sandbox escape attempt, availability burn rate, Kafka lag >5000 for contest topic, container pool exhaustion for any language, SYSTEM_ERROR rate >0.1%.
- Page-business-hours: Latency budget burn, cache-miss rate rising, compilation error rate spike (broken language image), leaderboard drift from PostgreSQL source of truth.
- Ticket-only: Capacity warnings (worker CPU >70%), disk warnings, pending schema migrations, individual worker pod restarts.
18. Operational Playbook
A design that doesn't document its ops story isn't production-grade.
18.1 Backup and Recovery
| Component | Backup Strategy | RPO | RTO |
|---|---|---|---|
| PostgreSQL | Continuous WAL archiving to S3 + daily base backups | ~0 (WAL) | < 30 min (PITR) |
| Test cases (S3) | S3 versioning + cross-region replication | 0 | < 5 min |
| Valkey (leaderboards) | Reconstructible from PostgreSQL (source of truth) | N/A | < 2 min |
| Kafka | 3x replication factor, 7-day retention | 0 | < 15 min (broker replacement) |
| Container images | Registry with multi-AZ replication | 0 | < 10 min |
| ClickHouse (analytics) | Daily snapshots to S3 | 24h | < 2h |
Quarterly restore drill. Restore PG from a random backup to staging. Verify data integrity. Failed restore = P1.
18.2 Capacity Planning
Three leading indicators:
- Worker CPU. Target <50% avg, <70% peak. Sustained >70% → KEDA should auto-scale, but verify node pool headroom.
- Kafka lag. Contest >100 for 2 min = backing up. Practice >1000 = degraded. Custom-test >5000 = acceptable.
- PG connection pool. PgBouncer >80% = queries queueing. ~1,700 DB ops/sec at steady state (3 ops per submission).
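KEDA's Kafka scaler drives the worker auto-scaling mentioned above; a sketch of the trigger (resource names, topic, and replica bounds are assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: contest-workers            # assumed name
spec:
  scaleTargetRef:
    name: execution-worker-contest # assumed Deployment
  minReplicaCount: 50              # illustrative floor
  maxReplicaCount: 600
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: contest-workers
        topic: submissions.contest # assumed topic name
        lagThreshold: "100"        # add replicas when lag per replica exceeds 100
```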
18.3 Schema Migrations
50M rows/day. Zero-downtime schema changes only:
- ADD COLUMN DEFAULT — instant in PG 11+.
- CREATE INDEX CONCURRENTLY — no table lock.
- Weekly partitions created 4 weeks ahead. Daily job verifies next 14 days as safety net.
- Forward-only. Mistakes get a compensating migration, never a rollback.
18.4 Top 5 Alerts and Mitigations
Know these cold:
1. Container pool exhausted — Contest spike (pre-warm late) or container leak? Spike: verify KEDA scaling + node pool headroom. Leak: check worker logs for stuck containers.
2. Kafka lag spike (>5000 on contest topic) — Verify KEDA scaling. At max replicas: redirect practice workers to contest topic. Check for slow consumers (30s/submission = container issue).
3. PG connection pool >80% — Check pg_stat_activity for stuck queries. Verify partition creation didn't fail (missing partition = all INSERTs fail).
4. Firecracker VMM crash — Check if the guest kernel panicked (user code triggered a kernel bug inside the VM — harmless to host, but the submission needs a retry). If the VMM itself crashed, investigate — extremely rare, treat as P1.
5. Leaderboard drift (Valkey ≠ PostgreSQL) — Reconciliation job auto-corrects. If drift >10 entries: manual review. Check Lua script failures or dropped Pub/Sub messages.
19. Appendix
Appendix A: Firecracker microVM Configuration
Each microVM runs a minimal Linux kernel (~4MB, custom-built vmlinux) with a language-specific ext4 rootfs. Communication between host worker and guest is via vsock (virtio socket — no network stack, just a host↔guest data channel).
Key Firecracker API calls:
| Operation | API Call | Notes |
|---|---|---|
| Set kernel | PUT /boot-source {"kernel_image_path": "/var/lib/fc/vmlinux"} | Minimal kernel, no modules |
| Set rootfs | PUT /drives/rootfs {"path_on_host": "python3.ext4", "is_root_device": true} | Per-language ext4 image |
| Set resources | PUT /machine-config {"vcpu_count": 2, "mem_size_mib": 256} | Hard limits enforced by KVM |
| Start VM | PUT /actions {"action_type": "InstanceStart"} | ~125ms cold boot |
| Create snapshot | PUT /snapshot/create {"snapshot_path": "...", "mem_file_path": "..."} | Full VM state to disk |
| Restore snapshot | PUT /snapshot/load {"snapshot_path": "...", "mem_file_path": "..."} | ~25ms restore |
Language compatibility: 100% across all 20+ languages. The guest runs a standard Linux kernel — no syscall restrictions, no edge cases, no "works locally fails on judge." Every language runtime behaves identically to bare metal.
Appendix B: Contest Elo Rating — Full Formula
The post-contest rating computation (§11.3) uses a multi-player Elo adaptation:
Given N participants sorted by contest rank:
For player i with current rating Ri:
Expected Rank(i) = 1 + Σ(j≠i) P(j beats i)
where P(j beats i) = 1 / (1 + 10^((Ri - Rj) / 400))
Performance = √(Expected_Rank × Actual_Rank) [geometric mean]
K = max(20, 80 - contests_played × 2) [K-factor]
ΔRating = K × (ln(Expected_Rank) - ln(Actual_Rank))
Clamped to [-150, +150] per contest
Starting rating: 1500. Absence penalty: -K/4 (roughly -5 to -20: newer players have a larger K, so they lose more).
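A worked check of the expected-rank term (self-contained sketch; class and method names are hypothetical):

```java
class EloCheck {
    // Expected rank of a player rated ri against the given opponent ratings:
    // 1 + sum over opponents of P(opponent beats player).
    static double expectedRank(double ri, double[] opponents) {
        double e = 1.0;
        for (double rj : opponents)
            e += 1.0 / (1.0 + Math.pow(10, (ri - rj) / 400.0));  // P(j beats i)
        return e;
    }
}
```

Three equal 1500-rated players each expect rank 1 + 0.5 + 0.5 = 2.0. If a newcomer (K = 80) finishes 1st, ΔRating = 80 × (ln 2.0 − ln 1) ≈ +55, comfortably inside the ±150 clamp.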
Appendix C: Firecracker Warm Pool Benchmark
Measured on c5.metal (96 vCPU, 192GB RAM) with Firecracker v1.7:
| Operation | Latency (p50) | Latency (p99) | Notes |
|---|---|---|---|
| Cold boot (fresh microVM) | 120ms | 180ms | Kernel boot + rootfs mount + runtime init |
| Snapshot restore | 20ms | 40ms | 6x faster than cold boot. Native Firecracker API. |
| Warm pool claim (idle VM) | 1.2ms | 3.5ms | Channel receive + state update |
| Diff snapshot reset | 4ms | 8ms | Reverts all guest state (filesystem + memory) |
| JVM warm-up penalty (cold boot) | +180ms | +350ms | Without pre-warmed snapshot |
| JVM warm-up penalty (snapshot restore) | +3ms | +10ms | JIT + classloader already warm in snapshot |
The warm pool + snapshot strategy ensures 95%+ of submissions see <5ms VM acquisition. The remaining 5% (burst traffic) see <50ms via snapshot restore. Cold boots only happen during language image updates.
If you only remember six things
- Sandbox. Firecracker microVM — each submission gets its own Linux kernel with hardware KVM isolation. Even a guest kernel exploit is a dead end.
- Fast start. Warm VM pool + Firecracker native snapshot restore (~25ms) + diff snapshot reset (~5ms). JVM VMs pre-warm the JIT before snapshot.
- Queue. Four Kafka topics (contest, premium, practice, custom-test) with dedicated worker pools and cross-pool borrowing during contests.
- Leaderboard. Valkey sorted set with atomic Lua updates. Post-contest Elo rating computation. Reconciliation job as safety net.
- Scale. 50M submissions/day, 12,000 VM slots across 1,500 pods, 14% steady-state utilization, auto-scale via KEDA on Kafka consumer lag.
- Biggest risk. Warm pool exhaustion during contest spikes. Mitigated by pre-warming 15 min before contest + dynamic language rebalancing + snapshot restore fallback.
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This Design | Learn More |
|---|---|---|
| PostgreSQL | Problems, submissions, users, contests, source of truth | PostgreSQL |
| Valkey | Contest leaderboards, rate limiting, WebSocket routing, pub/sub | Redis/Valkey |
| Kafka | Submission queue with priority separation, audit logging | Kafka |
| Kubernetes | Worker fleet orchestration, Kata Containers RuntimeClass (Firecracker backend), auto-scaling | Kubernetes |
| ClickHouse | Submission analytics, language trends, problem difficulty analysis | ClickHouse |
Infrastructure Patterns
| Pattern | Relevance to This Design | Learn More |
|---|---|---|
| Circuit Breaker and Fault Tolerance | Fail-open on Valkey/PostgreSQL outages, fallback behaviors | Circuit Breakers |
| Message Queues | Kafka for durable submission dispatch with priority separation | Message Queues |
| Rate Limiting | Per-user submission rate limiting with sliding window counters | Rate Limiting |
| Container Orchestration | Kubernetes for worker fleet management with Kata Containers (Firecracker backend) | Container Orchestration |
Linux Internals (used in this design)
Firecracker builds on Linux kernel primitives. The primary boundary is KVM (hardware virtualization). Inside the guest VM, standard Linux isolation (namespaces, cgroups, seccomp) provides defense-in-depth.
| Linux Concept | Role in This Design | Learn More |
|---|---|---|
| KVM (Kernel-based Virtual Machine) | The primary isolation boundary. Firecracker uses /dev/kvm to create hardware-isolated VMs. Guest kernel exploits can't reach the host. | KVM docs |
| System Calls | Guest kernel handles all syscalls natively — 100% compatibility. Host only sees KVM hypercalls. | System Calls |
| Namespaces | Defense-in-depth inside the guest VM — isolates PID, mount, network, user, IPC, UTS | Namespaces |
| cgroups v2 | Hard limits inside the guest: 256MB memory, 2 vCPU, 64 PIDs. OOM killer enforces MLE verdict. | cgroups v2 |
| Memory Cgroups | How the 256MB limit and OOM kill work inside the guest kernel | Memory Cgroups |
| Seccomp | Additional syscall filtering inside the guest — principle of least privilege even within the VM | Seccomp |
| Linux Capabilities | All 41 capabilities dropped inside guest. No raw sockets, no mount, no module loading. | Capabilities |
| Network Namespaces | No virtio-net device attached — the VM has no network stack. Only vsock to host. | Network Namespaces |
| Container Runtime (runc/containerd/Kata) | How Kata Containers wraps Firecracker behind the OCI runtime interface for K8s integration | Container Runtime |
| OOM Killer | Guest kernel OOM-kills user code at 256MB. Worker detects exit and reports MLE verdict. | OOM Killer |
| Copy-on-Write | Firecracker snapshot restore uses CoW memory mapping for fast VM resume (~25ms) | Copy-on-Write |
| Virtual Memory | Each microVM has its own virtual address space. KVM maps guest physical → host physical. | Virtual Memory |
Further Reading
- Firecracker: Lightweight Virtualization -- AWS's microVM for Lambda and Fargate. The primary sandbox in this design.
- Firecracker Design Doc -- Architecture decisions behind the Firecracker VMM
- Kata Containers -- OCI-compatible VM runtime with Firecracker backend for Kubernetes integration
- firecracker-containerd -- Run Firecracker microVMs via standard containerd APIs
- gVisor: Container Runtime Sandbox -- Alternative sandbox approach using user-space syscall interception
- How LeetCode Judges Your Code -- LeetCode's official documentation on their judge system
- LeetCode Contest Rating Algorithm -- The Elo-based rating system adapted for multi-player contests
- Codeforces System Architecture -- How Codeforces handles contest infrastructure
- Valkey Sorted Sets for Leaderboards -- Real-time ranking with sorted sets
- Kubernetes KEDA: Event-Driven Autoscaling -- Auto-scaling workers based on Kafka consumer lag
- Firecracker Snapshotting -- How Firecracker snapshot/restore works (the warm pool foundation)
- Monaco Editor -- The VS Code editor engine powering the web IDE
Practice this design: Try the Design LeetCode interview question — hints and structured guidance included.