Online Coding Judge
LeetCode-style judge. Problems, submissions, and a sandboxed runner that executes untrusted code against test cases without the judge catching fire.
Key Abstractions
Statement, input/output schema, limits (time, memory), and test cases
Input plus expected output, with optional hidden flag for final grading
Code + language + user + problem. Has state (queued, running, verdict).
Strategy per language — compile(code) plus execute(code, stdin, limits) in a sandbox
Value object — success flag plus compile errors. Default success for interpreted languages.
Accepted, WrongAnswer, TLE, MLE, RuntimeError, CompileError — one per submission
Facade. Compiles once, then grades test cases fail-fast; produces the verdict.
Class Diagram
The Key Insight
An online judge has two hard problems: running untrusted code safely and producing deterministic verdicts from noisy output. Everything else (problems, users, submissions) is CRUD.
Safe execution demands isolation. A subprocess with a timeout is a starting point — it's not enough for production. Real judges use gVisor, nsjail, or Firecracker microVMs to box untrusted code away from the filesystem, network, other submissions, and kernel-level escapes. The LanguageRunner interface hides the mechanism; whatever hardening a platform chooses lives behind it.
Output comparison is where WA vs AC gets murky. Trailing newlines, Windows line endings, extra spaces — all need normalization before comparison. The normalize helper on LanguageRunner is a one-place fix. Without it, correct solutions fail on rendering artifacts and beginners lose faith in the platform.
Fail-fast on first wrong test case is a throughput lever. A contest with 10k submissions can't afford to run all 30 test cases when case 2 already failed. The verdict can't improve; stop there. Subtle but it halves average grading time.
Requirements
Functional
- Register problems with test cases, time/memory limits, per-case points
- Accept submissions with source code and language
- Grade against visible and hidden test cases
- Return verdict plus per-case timing and memory stats
- Support multiple languages with different toolchains
Non-Functional
- Untrusted code cannot affect judge, filesystem, or other submissions
- Grading is parallelizable — N submissions, N workers
- Wall-clock enforcement of time limits
- Deterministic verdict — same code + same test case = same outcome
Design Decisions
Why Language as a Strategy?
Each language has a different compile/run story: Python interprets, Java compiles to bytecode then runs on the JVM, C++ compiles to a binary. Hiding this behind LanguageRunner.execute(code, stdin, limits) lets the judge treat them uniformly. Adding Rust is a new Runner class and a Language enum value.
Why run test cases serially?
Running in parallel sounds faster but the verdict on case 2 is pointless if case 1 already decided WA. Serial execution with fail-fast bounds total work. Parallelism comes from running different submissions in parallel, not different cases of one submission.
Why hash-then-compare instead of byte equality?
For output comparison, normalization is the right move (strip trailing whitespace, unify line endings). For code deduplication (detecting repeat submissions), hashing the code does catch equivalent texts. Separate concerns.
Why separate Submission from TestRun?
A submission has one verdict but many runs. Flattening them would drop per-case detail, and per-case detail is the user-visible feedback ("your code timed out on case 3"). Separating preserves the drill-down view.
Why memory limits alongside time?
A submission that passes time limits by allocating gigabytes per test case still abuses the platform. Memory limits enforced via cgroups or resource limits on the sandbox prevent one abusive submission from starving others.
Interview Follow-ups
- "How would you scale to 10k submissions per second during a contest?" Submission queue (Kafka/SQS) with N worker pods pulling. Each pod runs a single submission in a fresh sandbox container, then terminates. Stateless workers scale horizontally.
- "How do you prevent cheating?" Rate-limit submissions per user per problem, diff-check submissions for near-duplicates (MOSS or minhash), monitor timing patterns for cross-user plagiarism.
- "What about interactive problems (judge communicates with submitted code)?" Runner gives the judge a pipe in both directions. The judge process speaks a protocol with the submission — useful for interactive proofs or adaptive test cases.
- "How do you handle custom checkers (e.g., multiple valid outputs)?" A per-problem
ValidatorFunctionreplaces the default string-compare. The checker gets input, submission output, and reference output; returns accepted/rejected. - "How do you support partial credit?" Test cases carry points; a submission's score is the sum of accepted cases. Verdict stays the "worst outcome" for display but scoring is additive.
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path
class Language(Enum):
PYTHON = "python"
JAVA = "java"
GO = "go"
class Verdict(Enum):
PENDING = "pending"
ACCEPTED = "accepted"
WRONG_ANSWER = "wrong_answer"
TIME_LIMIT = "time_limit"
MEMORY_LIMIT = "memory_limit"
RUNTIME_ERROR = "runtime_error"
COMPILE_ERROR = "compile_error"
class SubmissionStatus(Enum):
QUEUED = "queued"
COMPILING = "compiling"
RUNNING = "running"
DONE = "done"
@dataclass
class TestCase:
id: str
input: str
expected_output: str
hidden: bool = False
points: int = 1
@dataclass
class Problem:
id: str
title: str
time_limit_ms: int
memory_limit_mb: int
test_cases: list[TestCase]
@dataclass
class RunResult:
stdout: str
stderr: str
exit_code: int
duration_ms: int
memory_kb: int
timed_out: bool
@dataclass
class TestRun:
test_case_id: str
verdict: Verdict
time_ms: int
memory_kb: int
output: str
@dataclass
class Submission:
id: str
user_id: str
problem_id: str
language: Language
code: str
status: SubmissionStatus = SubmissionStatus.QUEUED
verdict: Verdict = Verdict.PENDING
runs: list[TestRun] = field(default_factory=list)
submitted_at: datetime = field(default_factory=datetime.utcnow)
score: int = 0
@dataclass
class CompileResult:
success: bool
errors: str = ""
class LanguageRunner(ABC):
"""Sandboxed execution of untrusted code. Separate compile step so static
languages (Java/C++) can report compile errors once, not per test case."""
def compile(self, code: str) -> CompileResult:
"""Default: interpreted languages compile trivially. Override for javac/g++."""
return CompileResult(success=True)
@abstractmethod
def execute(self, code: str, stdin: str, time_limit_ms: int,
memory_limit_mb: int) -> RunResult: ...
@staticmethod
def normalize(text: str) -> str:
# Strip trailing whitespace per line; trim trailing blank lines.
lines = [ln.rstrip() for ln in text.splitlines()]
while lines and lines[-1] == "":
lines.pop()
return "\n".join(lines)
class PythonRunner(LanguageRunner):
"""Spawns a subprocess, wall-clock limit, stdin piped in. Production needs real isolation."""
def execute(self, code: str, stdin: str, time_limit_ms: int,
memory_limit_mb: int) -> RunResult:
start = datetime.utcnow()
with tempfile.NamedTemporaryFile(
mode="w", suffix=".py", delete=False
) as tmp:
tmp.write(code)
path = tmp.name
try:
# WARNING: subprocess is NOT a secure sandbox — production needs gVisor, nsjail, or similar.
# This demonstrates the interface; replace with a hardened runner in production.
proc = subprocess.run(
[sys.executable, path],
input=stdin,
capture_output=True,
text=True,
timeout=time_limit_ms / 1000.0,
)
duration_ms = int((datetime.utcnow() - start).total_seconds() * 1000)
return RunResult(
stdout=proc.stdout,
stderr=proc.stderr,
exit_code=proc.returncode,
duration_ms=duration_ms,
memory_kb=0, # memory tracking would require rusage / cgroups
timed_out=False,
)
except subprocess.TimeoutExpired:
duration_ms = int((datetime.utcnow() - start).total_seconds() * 1000)
return RunResult("", "Timeout", -1, duration_ms, 0, timed_out=True)
finally:
Path(path).unlink(missing_ok=True)
class Judge:
def __init__(self, runners: dict[Language, LanguageRunner] | None = None):
self._runners: dict[Language, LanguageRunner] = runners or {Language.PYTHON: PythonRunner()}
self._problems: dict[str, Problem] = {}
self._submissions: dict[str, Submission] = {}
def add_problem(self, problem: Problem) -> None:
self._problems[problem.id] = problem
def submit(self, user_id: str, problem_id: str, language: Language, code: str) -> Submission:
if problem_id not in self._problems:
raise ValueError("unknown problem")
if language not in self._runners:
raise ValueError(f"language {language.value} not supported")
submission = Submission(
id=str(uuid.uuid4())[:8],
user_id=user_id,
problem_id=problem_id,
language=language,
code=code,
)
self._submissions[submission.id] = submission
# Grading runs inline here. In production it's queued to a worker pool.
self._grade(submission)
return submission
def _grade(self, submission: Submission) -> None:
problem = self._problems[submission.problem_id]
runner = self._runners[submission.language]
# Compile once before any test runs. For static languages a failure here
# is the canonical reason for COMPILE_ERROR; no point hitting tests.
submission.status = SubmissionStatus.COMPILING
compile_result = runner.compile(submission.code)
if not compile_result.success:
submission.verdict = Verdict.COMPILE_ERROR
submission.status = SubmissionStatus.DONE
# Store the compile error as a single pseudo-run for the UI.
submission.runs.append(TestRun(
test_case_id="compile",
verdict=Verdict.COMPILE_ERROR,
time_ms=0, memory_kb=0,
output=compile_result.errors,
))
return
submission.status = SubmissionStatus.RUNNING
overall = Verdict.ACCEPTED
for tc in problem.test_cases:
result = runner.execute(
submission.code, tc.input,
problem.time_limit_ms, problem.memory_limit_mb,
)
if result.timed_out:
verdict = Verdict.TIME_LIMIT
elif result.exit_code != 0:
verdict = Verdict.RUNTIME_ERROR
elif LanguageRunner.normalize(result.stdout) == LanguageRunner.normalize(tc.expected_output):
verdict = Verdict.ACCEPTED
else:
verdict = Verdict.WRONG_ANSWER
submission.runs.append(TestRun(
test_case_id=tc.id,
verdict=verdict,
time_ms=result.duration_ms,
memory_kb=result.memory_kb,
output=result.stdout if not tc.hidden else "",
))
if verdict == Verdict.ACCEPTED:
submission.score += tc.points
else:
overall = verdict
break # fail-fast: remaining test cases don't change the verdict
submission.verdict = overall
submission.status = SubmissionStatus.DONE
def get_submission(self, submission_id: str) -> Submission | None:
return self._submissions.get(submission_id)
if __name__ == "__main__":
problem = Problem(
id="sum-two",
title="Sum Two Numbers",
time_limit_ms=2000,
memory_limit_mb=64,
test_cases=[
TestCase("t1", input="2 3\n", expected_output="5\n"),
TestCase("t2", input="10 15\n", expected_output="25\n"),
TestCase("t3", input="-1 1\n", expected_output="0\n", hidden=True),
],
)
judge = Judge()
judge.add_problem(problem)
# Correct solution.
correct = "a, b = map(int, input().split())\nprint(a + b)"
sub1 = judge.submit("alice", "sum-two", Language.PYTHON, correct)
print(f"Correct solution -> {sub1.verdict.value}, score {sub1.score}")
for run in sub1.runs:
print(f" {run.test_case_id}: {run.verdict.value}, {run.time_ms} ms")
# Wrong answer.
wrong = "a, b = map(int, input().split())\nprint(a - b)"
sub2 = judge.submit("bob", "sum-two", Language.PYTHON, wrong)
print(f"\nWrong solution -> {sub2.verdict.value}, score {sub2.score}")
print(f"Failed on {sub2.runs[-1].test_case_id}")
# Runtime error.
runtime_err = "a, b = map(int, input().split())\nprint(1 / 0)"
sub3 = judge.submit("carol", "sum-two", Language.PYTHON, runtime_err)
print(f"\nRuntime error -> {sub3.verdict.value}")
# TLE (infinite loop).
tle = "while True: pass"
sub4 = judge.submit("dan", "sum-two", Language.PYTHON, tle)
print(f"\nInfinite loop -> {sub4.verdict.value}")
# Compile error path — register a custom runner that rejects any code
# containing the literal "BROKEN". Static-language runners (javac/g++)
# plug in the same way.
class CompileCheckingRunner(PythonRunner):
def compile(self, code: str) -> CompileResult:
if "BROKEN" in code:
return CompileResult(success=False, errors="line 1: BROKEN token rejected")
return CompileResult(success=True)
judge._runners[Language.PYTHON] = CompileCheckingRunner()
sub5 = judge.submit("eve", "sum-two", Language.PYTHON, "BROKEN")
print(f"\nCompile-error solution -> {sub5.verdict.value}, runs: {len(sub5.runs)}")
assert sub5.verdict == Verdict.COMPILE_ERROR
assert sub5.runs[0].output.startswith("line 1")Common Mistakes
- ✗Running untrusted code in the same process as the judge. One infinite loop or fork bomb kills the platform.
- ✗Sharing stdin/stdout with the runner. Need pipes with strict byte limits to prevent output flooding.
- ✗Running all test cases even after the first failure. Wastes CPU; verdict is decided.
- ✗Comparing output with ==. Whitespace and line-ending differences cause spurious WA. Normalize first.
Key Points
- ✓Sandbox isolation is non-negotiable — untrusted code can't touch filesystem, network, or other submissions.
- ✓Language as a Strategy lets Python, Java, Go run the same submission contract with different toolchains.
- ✓Test cases run serially per submission (fail-fast on first WA/TLE) but submissions run in parallel across workers.
- ✓Verdicts are ordered by severity — compile error dominates WA dominates TLE dominates anything else.