Online Coding Judge
LeetCode-style judge. Problems, submissions, and a sandboxed runner that executes untrusted code against test cases without the judge catching fire.
Key Abstractions
- Problem — statement, input/output schema, limits (time, memory), and test cases
- TestCase — input plus expected output, with an optional hidden flag for final grading
- Submission — code + language + user + problem; has state (queued, running, verdict)
- LanguageRunner — strategy per language: compile(code) plus execute(code, stdin, limits) in a sandbox
- CompileResult — value object: success flag plus compile errors; defaults to success for interpreted languages
- Verdict — Accepted, WrongAnswer, TLE, MLE, RuntimeError, CompileError; one per submission
- Judge — facade; compiles once, then grades test cases fail-fast and produces the verdict
The Key Insight
An online judge has two hard problems: running untrusted code safely and producing deterministic verdicts from noisy output. Everything else (problems, users, submissions) is CRUD.
Safe execution demands isolation. A subprocess with a timeout is a starting point — it's not enough for production. Real judges use gVisor, nsjail, or Firecracker microVMs to box untrusted code away from the filesystem, network, other submissions, and kernel-level escapes. The LanguageRunner interface hides the mechanism; whatever hardening a platform chooses lives behind it.
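To make the "hides the mechanism" point concrete, here is a hedged sketch of the pattern most judges use: prefix the child's argv with a sandbox wrapper. The sandbox-run binary and its flags are placeholders, not a real nsjail or gVisor invocation; real configurations also pin mounts, namespaces, and seccomp policies.

import subprocess
import sys

# Placeholder wrapper -- stands in for a real nsjail/gVisor/Firecracker launcher.
SANDBOX_PREFIX = ["sandbox-run", "--no-network", "--readonly-fs"]

def run_sandboxed(script_path: str, stdin: str, timeout_s: float) -> subprocess.CompletedProcess:
    # Same call shape as PythonRunner.execute below; only the argv changes,
    # which is exactly why the mechanism can hide behind LanguageRunner.
    return subprocess.run(
        [*SANDBOX_PREFIX, sys.executable, script_path],
        input=stdin, capture_output=True, text=True, timeout=timeout_s,
    )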
Output comparison is where WA vs AC gets murky. Trailing newlines, Windows line endings, extra spaces — all need normalization before comparison. The normalize helper on LanguageRunner is a one-place fix. Without it, correct solutions fail on rendering artifacts and beginners lose faith in the platform.
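A quick check of what that buys, using the normalize helper defined in the implementation below: CRLF endings, trailing spaces, and trailing blank lines all compare equal after normalization.

# All three normalize to "5" and therefore match expected output "5\n".
assert LanguageRunner.normalize("5\r\n") == "5"
assert LanguageRunner.normalize("5  \n\n\n") == "5"
assert LanguageRunner.normalize("5") == "5"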
Fail-fast on first wrong test case is a throughput lever. A contest with 10k submissions can't afford to run all 30 test cases when case 2 already failed; the verdict can't improve, so stop there. It's subtle, but it roughly halves average grading time for wrong submissions: if failures land uniformly across 30 cases, expected work drops from 30 runs to about 15.
Requirements
Functional
- Register problems with test cases, time/memory limits, per-case points
- Accept submissions with source code and language
- Grade against visible and hidden test cases
- Return verdict plus per-case timing and memory stats
- Support multiple languages with different toolchains
Non-Functional
- Untrusted code cannot affect judge, filesystem, or other submissions
- Grading is parallelizable — N submissions, N workers
- Wall-clock enforcement of time limits
- Deterministic verdict — same code + same test case = same outcome
Design Decisions
Why Language as a Strategy?
Each language has a different compile/run story: Python interprets, Java compiles to bytecode then runs on the JVM, C++ compiles to a binary. Hiding this behind LanguageRunner.execute(code, stdin, limits) lets the judge treat them uniformly. Adding Rust is a new Runner class and a Language enum value.
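For a compiled language, the compile override does real work. Here is a minimal sketch that reuses CompileResult from the implementation below; it assumes javac is on PATH and that submissions define public class Main (both are assumptions of this sketch, not of the design).

import subprocess
import tempfile
from pathlib import Path

def compile_java(code: str) -> CompileResult:
    """Sketch of a compiled-language compile() override (see LanguageRunner below)."""
    workdir = Path(tempfile.mkdtemp())
    # Java requires the file name to match the public class, hence Main.java.
    (workdir / "Main.java").write_text(code)
    proc = subprocess.run(
        ["javac", "Main.java"],
        cwd=workdir, capture_output=True, text=True,
    )
    if proc.returncode != 0:
        # javac writes diagnostics to stderr; surface them as the compile error.
        return CompileResult(success=False, errors=proc.stderr)
    return CompileResult(success=True)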
Why run test cases serially?
Running in parallel sounds faster but the verdict on case 2 is pointless if case 1 already decided WA. Serial execution with fail-fast bounds total work. Parallelism comes from running different submissions in parallel, not different cases of one submission.
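A hedged sketch of that split, using a thread pool as a stand-in for a real worker fleet. Note that the in-memory Judge below is not thread-safe as written, so treat this as the shape of the parallelism, not production code.

from concurrent.futures import ThreadPoolExecutor

def grade_batch(judge: Judge, batch: list[tuple[str, str, Language, str]]) -> None:
    # One worker per submission; inside each worker, that submission's
    # test cases still run serially with fail-fast.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for user_id, problem_id, language, code in batch:
            pool.submit(judge.submit, user_id, problem_id, language, code)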
Why hash-then-compare instead of byte equality?
For output comparison, normalization is the right move (strip trailing whitespace, unify line endings). For code deduplication (detecting repeat submissions), hashing the code does catch equivalent texts. Separate concerns.
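A sketch of the dedup side: hash lightly canonicalized source. The canonicalization rule here (drop blank lines, strip trailing spaces) is an illustrative choice; real plagiarism detection normalizes much more aggressively.

import hashlib

def code_fingerprint(code: str) -> str:
    # Catch trivial resubmissions (whitespace tweaks); MOSS/minhash go much further.
    canonical = "\n".join(
        line.rstrip() for line in code.splitlines() if line.strip()
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

seen: set[str] = set()
fp = code_fingerprint("a, b = map(int, input().split())\nprint(a + b)")
print("duplicate" if fp in seen else "new submission")
seen.add(fp)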
Why separate Submission from TestRun?
A submission has one verdict but many runs. Flattening them would drop per-case detail, and per-case detail is the user-visible feedback ("your code timed out on case 3"). Separating preserves the drill-down view.
Why memory limits alongside time?
A submission that passes time limits by allocating gigabytes per test case still abuses the platform. Memory limits enforced via cgroups or resource limits on the sandbox prevent one abusive submission from starving others.
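On POSIX, the standard library's resource module gives the simplest version of this: set rlimits in a preexec_fn hook so the kernel enforces them on the child. A minimal sketch, assuming Linux; the solution.py path and the limit values are placeholders, and cgroups are the more robust production mechanism.

import resource
import subprocess
import sys

def limit_child(memory_limit_mb: int, cpu_limit_s: int):
    # The returned callable runs in the child between fork() and exec().
    def preexec():
        mem = memory_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))                   # cap address space
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_limit_s, cpu_limit_s))  # cap CPU seconds
    return preexec

proc = subprocess.run(
    [sys.executable, "solution.py"],   # placeholder script path
    input="2 3\n", capture_output=True, text=True,
    timeout=2.0,                       # wall-clock limit, catches sleeping code
    preexec_fn=limit_child(64, 2),     # kernel-enforced memory/CPU caps
)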
Interview Follow-ups
- "How would you scale to 10k submissions per second during a contest?" Submission queue (Kafka/SQS) with N worker pods pulling. Each pod runs a single submission in a fresh sandbox container, then terminates. Stateless workers scale horizontally.
- "How do you prevent cheating?" Rate-limit submissions per user per problem, diff-check submissions for near-duplicates (MOSS or minhash), monitor timing patterns for cross-user plagiarism.
- "What about interactive problems (judge communicates with submitted code)?" Runner gives the judge a pipe in both directions. The judge process speaks a protocol with the submission — useful for interactive proofs or adaptive test cases.
- "How do you handle custom checkers (e.g., multiple valid outputs)?" A per-problem
ValidatorFunctionreplaces the default string-compare. The checker gets input, submission output, and reference output; returns accepted/rejected. - "How do you support partial credit?" Test cases carry points; a submission's score is the sum of accepted cases. Verdict stays the "worst outcome" for display but scoring is additive.
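A minimal sketch of that checker hook. The Checker alias and outputs_match helper are illustrative names, not part of the implementation below; the default path reuses LanguageRunner.normalize.

from typing import Callable, Optional

# Hypothetical signature: (test input, submission output, reference output) -> accepted?
Checker = Callable[[str, str, str], bool]

def outputs_match(checker: Optional[Checker], tc_input: str,
                  actual: str, expected: str) -> bool:
    if checker is not None:
        return checker(tc_input, actual, expected)  # problem-specific rule
    return LanguageRunner.normalize(actual) == LanguageRunner.normalize(expected)

# Example: accept any answer within 1e-6 of the reference value.
float_checker: Checker = lambda _inp, out, ref: abs(float(out) - float(ref)) < 1e-6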
Code Implementation
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path


class Language(Enum):
    PYTHON = "python"
    JAVA = "java"
    GO = "go"


class Verdict(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    WRONG_ANSWER = "wrong_answer"
    TIME_LIMIT = "time_limit"
    MEMORY_LIMIT = "memory_limit"
    RUNTIME_ERROR = "runtime_error"
    COMPILE_ERROR = "compile_error"


class SubmissionStatus(Enum):
    QUEUED = "queued"
    COMPILING = "compiling"
    RUNNING = "running"
    DONE = "done"


@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str
    hidden: bool = False
    points: int = 1


@dataclass
class Problem:
    id: str
    title: str
    time_limit_ms: int
    memory_limit_mb: int
    test_cases: list[TestCase]


@dataclass
class RunResult:
    stdout: str
    stderr: str
    exit_code: int
    duration_ms: int
    memory_kb: int
    timed_out: bool


@dataclass
class TestRun:
    test_case_id: str
    verdict: Verdict
    time_ms: int
    memory_kb: int
    output: str


@dataclass
class Submission:
    id: str
    user_id: str
    problem_id: str
    language: Language
    code: str
    status: SubmissionStatus = SubmissionStatus.QUEUED
    verdict: Verdict = Verdict.PENDING
    runs: list[TestRun] = field(default_factory=list)
    submitted_at: datetime = field(default_factory=datetime.utcnow)
    score: int = 0


@dataclass
class CompileResult:
    success: bool
    errors: str = ""


class LanguageRunner(ABC):
    """Sandboxed execution of untrusted code. Separate compile step so static
    languages (Java/C++) can report compile errors once, not per test case."""

    def compile(self, code: str) -> CompileResult:
        """Default: interpreted languages compile trivially. Override for javac/g++."""
        return CompileResult(success=True)

    @abstractmethod
    def execute(self, code: str, stdin: str, time_limit_ms: int,
                memory_limit_mb: int) -> RunResult: ...

    @staticmethod
    def normalize(text: str) -> str:
        # Strip trailing whitespace per line; trim trailing blank lines.
        lines = [ln.rstrip() for ln in text.splitlines()]
        while lines and lines[-1] == "":
            lines.pop()
        return "\n".join(lines)


class PythonRunner(LanguageRunner):
    """Spawns a subprocess, wall-clock limit, stdin piped in. Production needs real isolation."""

    def execute(self, code: str, stdin: str, time_limit_ms: int,
                memory_limit_mb: int) -> RunResult:
        start = datetime.utcnow()
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".py", delete=False
        ) as tmp:
            tmp.write(code)
            path = tmp.name
        try:
            # WARNING: subprocess is NOT a secure sandbox — production needs gVisor, nsjail, or similar.
            # This demonstrates the interface; replace with a hardened runner in production.
            proc = subprocess.run(
                [sys.executable, path],
                input=stdin,
                capture_output=True,
                text=True,
                timeout=time_limit_ms / 1000.0,
            )
            duration_ms = int((datetime.utcnow() - start).total_seconds() * 1000)
            return RunResult(
                stdout=proc.stdout,
                stderr=proc.stderr,
                exit_code=proc.returncode,
                duration_ms=duration_ms,
                memory_kb=0,  # memory tracking would require rusage / cgroups
                timed_out=False,
            )
        except subprocess.TimeoutExpired:
            duration_ms = int((datetime.utcnow() - start).total_seconds() * 1000)
            return RunResult("", "Timeout", -1, duration_ms, 0, timed_out=True)
        finally:
            Path(path).unlink(missing_ok=True)


class Judge:
    def __init__(self, runners: dict[Language, LanguageRunner] | None = None):
        self._runners: dict[Language, LanguageRunner] = runners or {Language.PYTHON: PythonRunner()}
        self._problems: dict[str, Problem] = {}
        self._submissions: dict[str, Submission] = {}

    def add_problem(self, problem: Problem) -> None:
        self._problems[problem.id] = problem

    def submit(self, user_id: str, problem_id: str, language: Language, code: str) -> Submission:
        if problem_id not in self._problems:
            raise ValueError("unknown problem")
        if language not in self._runners:
            raise ValueError(f"language {language.value} not supported")
        submission = Submission(
            id=str(uuid.uuid4())[:8],
            user_id=user_id,
            problem_id=problem_id,
            language=language,
            code=code,
        )
        self._submissions[submission.id] = submission
        # Grading runs inline here. In production it's queued to a worker pool.
        self._grade(submission)
        return submission

    def _grade(self, submission: Submission) -> None:
        problem = self._problems[submission.problem_id]
        runner = self._runners[submission.language]

        # Compile once before any test runs. For static languages a failure here
        # is the canonical reason for COMPILE_ERROR; no point hitting tests.
        submission.status = SubmissionStatus.COMPILING
        compile_result = runner.compile(submission.code)
        if not compile_result.success:
            submission.verdict = Verdict.COMPILE_ERROR
            submission.status = SubmissionStatus.DONE
            # Store the compile error as a single pseudo-run for the UI.
            submission.runs.append(TestRun(
                test_case_id="compile",
                verdict=Verdict.COMPILE_ERROR,
                time_ms=0, memory_kb=0,
                output=compile_result.errors,
            ))
            return

        submission.status = SubmissionStatus.RUNNING
        overall = Verdict.ACCEPTED
        for tc in problem.test_cases:
            result = runner.execute(
                submission.code, tc.input,
                problem.time_limit_ms, problem.memory_limit_mb,
            )

            if result.timed_out:
                verdict = Verdict.TIME_LIMIT
            elif result.exit_code != 0:
                verdict = Verdict.RUNTIME_ERROR
            elif LanguageRunner.normalize(result.stdout) == LanguageRunner.normalize(tc.expected_output):
                verdict = Verdict.ACCEPTED
            else:
                verdict = Verdict.WRONG_ANSWER

            submission.runs.append(TestRun(
                test_case_id=tc.id,
                verdict=verdict,
                time_ms=result.duration_ms,
                memory_kb=result.memory_kb,
                output=result.stdout if not tc.hidden else "",
            ))

            if verdict == Verdict.ACCEPTED:
                submission.score += tc.points
            else:
                overall = verdict
                break  # fail-fast: remaining test cases don't change the verdict

        submission.verdict = overall
        submission.status = SubmissionStatus.DONE

    def get_submission(self, submission_id: str) -> Submission | None:
        return self._submissions.get(submission_id)


if __name__ == "__main__":
    problem = Problem(
        id="sum-two",
        title="Sum Two Numbers",
        time_limit_ms=2000,
        memory_limit_mb=64,
        test_cases=[
            TestCase("t1", input="2 3\n", expected_output="5\n"),
            TestCase("t2", input="10 15\n", expected_output="25\n"),
            TestCase("t3", input="-1 1\n", expected_output="0\n", hidden=True),
        ],
    )
    judge = Judge()
    judge.add_problem(problem)

    # Correct solution.
    correct = "a, b = map(int, input().split())\nprint(a + b)"
    sub1 = judge.submit("alice", "sum-two", Language.PYTHON, correct)
    print(f"Correct solution -> {sub1.verdict.value}, score {sub1.score}")
    for run in sub1.runs:
        print(f"  {run.test_case_id}: {run.verdict.value}, {run.time_ms} ms")

    # Wrong answer.
    wrong = "a, b = map(int, input().split())\nprint(a - b)"
    sub2 = judge.submit("bob", "sum-two", Language.PYTHON, wrong)
    print(f"\nWrong solution -> {sub2.verdict.value}, score {sub2.score}")
    print(f"Failed on {sub2.runs[-1].test_case_id}")

    # Runtime error.
    runtime_err = "a, b = map(int, input().split())\nprint(1 / 0)"
    sub3 = judge.submit("carol", "sum-two", Language.PYTHON, runtime_err)
    print(f"\nRuntime error -> {sub3.verdict.value}")

    # TLE (infinite loop).
    tle = "while True: pass"
    sub4 = judge.submit("dan", "sum-two", Language.PYTHON, tle)
    print(f"\nInfinite loop -> {sub4.verdict.value}")

    # Compile error path — register a custom runner that rejects any code
    # containing the literal "BROKEN". Static-language runners (javac/g++)
    # plug in the same way.
    class CompileCheckingRunner(PythonRunner):
        def compile(self, code: str) -> CompileResult:
            if "BROKEN" in code:
                return CompileResult(success=False, errors="line 1: BROKEN token rejected")
            return CompileResult(success=True)

    judge._runners[Language.PYTHON] = CompileCheckingRunner()
    sub5 = judge.submit("eve", "sum-two", Language.PYTHON, "BROKEN")
    print(f"\nCompile-error solution -> {sub5.verdict.value}, runs: {len(sub5.runs)}")
    assert sub5.verdict == Verdict.COMPILE_ERROR
    assert sub5.runs[0].output.startswith("line 1")

Common Mistakes
- ✗ Running untrusted code in the same process as the judge. One infinite loop or fork bomb kills the platform.
- ✗ Sharing stdin/stdout with the runner. Use pipes with strict byte limits to prevent output flooding (see the sketch after this list).
- ✗ Running all test cases even after the first failure. Wastes CPU when the verdict is already decided.
- ✗ Comparing output with ==. Whitespace and line-ending differences cause spurious WA; normalize first.
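A hedged sketch of the byte-limit idea: read at most a fixed number of bytes from the child's stdout pipe, then kill it. MAX_OUTPUT_BYTES is an illustrative cap, and time-limit handling is omitted here (combine with the timeout logic in PythonRunner.execute above).

import subprocess

MAX_OUTPUT_BYTES = 1 << 20  # 1 MiB cap -- illustrative

def run_with_output_cap(argv: list[str], stdin: bytes) -> tuple[bytes, bool]:
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    proc.stdin.write(stdin)
    proc.stdin.close()
    # Read one byte past the cap; once we stop reading, the pipe buffer
    # fills and a flooding child blocks, which is the backpressure we want.
    out = proc.stdout.read(MAX_OUTPUT_BYTES + 1)
    flooded = len(out) > MAX_OUTPUT_BYTES
    proc.kill()
    proc.wait()
    return out[:MAX_OUTPUT_BYTES], flooded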
Key Points
- ✓ Sandbox isolation is non-negotiable — untrusted code can't touch the filesystem, network, or other submissions.
- ✓ Language as a Strategy lets Python, Java, and Go fulfill the same submission contract with different toolchains.
- ✓ Test cases run serially per submission (fail-fast on first WA/TLE), but submissions run in parallel across workers.
- ✓ One verdict per submission — compile error preempts all test runs; after a clean compile, the first failing case decides.