Platform ScaleProblem 21 of 29

Platform ScaleAdvanced

Online Coding Judge

Strategy PatternState PatternCommand Pattern

LeetCode-style judge. Problems, submissions, and a sandboxed runner that executes untrusted code against test cases without the judge catching fire.

Key Abstractions

Problem

Statement, input/output schema, limits (time, memory), and test cases

TestCase

Input plus expected output, with optional hidden flag for final grading

Submission

Code + language + user + problem. Has state (queued, running, verdict).

LanguageRunner

Strategy per language — compile(code) plus execute(code, stdin, limits) in a sandbox

CompileResult

Value object — success flag plus compile errors. Default success for interpreted languages.

Verdict

Accepted, WrongAnswer, TLE, MLE, RuntimeError, CompileError — one per submission

Judge

Facade. Compiles once, then grades test cases fail-fast; produces the verdict.

Class Diagram

The Key Insight

An online judge has two hard problems: running untrusted code safely and producing deterministic verdicts from noisy output. Everything else (problems, users, submissions) is CRUD.

Safe execution demands isolation. A subprocess with a timeout is a starting point — it's not enough for production. Real judges use gVisor, nsjail, or Firecracker microVMs to box untrusted code away from the filesystem, network, other submissions, and kernel-level escapes. The LanguageRunner interface hides the mechanism; whatever hardening a platform chooses lives behind it.

Output comparison is where WA vs AC gets murky. Trailing newlines, Windows line endings, extra spaces — all need normalization before comparison. The normalize helper on LanguageRunner is a one-place fix. Without it, correct solutions fail on rendering artifacts and beginners lose faith in the platform.

Fail-fast on first wrong test case is a throughput lever. A contest with 10k submissions can't afford to run all 30 test cases when case 2 already failed. The verdict can't improve; stop there. Subtle but it halves average grading time.

Requirements

Functional

Register problems with test cases, time/memory limits, per-case points
Accept submissions with source code and language
Grade against visible and hidden test cases
Return verdict plus per-case timing and memory stats
Support multiple languages with different toolchains

Non-Functional

Untrusted code cannot affect judge, filesystem, or other submissions
Grading is parallelizable — N submissions, N workers
Wall-clock enforcement of time limits
Deterministic verdict — same code + same test case = same outcome

Design Decisions

Why Language as a Strategy?

Each language has a different compile/run story: Python interprets, Java compiles to bytecode then runs on the JVM, C++ compiles to a binary. Hiding this behind LanguageRunner.execute(code, stdin, limits) lets the judge treat them uniformly. Adding Rust is a new Runner class and a Language enum value.

Why run test cases serially?

Running in parallel sounds faster but the verdict on case 2 is pointless if case 1 already decided WA. Serial execution with fail-fast bounds total work. Parallelism comes from running different submissions in parallel, not different cases of one submission.

Why hash-then-compare instead of byte equality?

For output comparison, normalization is the right move (strip trailing whitespace, unify line endings). For code deduplication (detecting repeat submissions), hashing the code does catch equivalent texts. Separate concerns.

Why separate Submission from TestRun?

A submission has one verdict but many runs. Flattening them would drop per-case detail, and per-case detail is the user-visible feedback ("your code timed out on case 3"). Separating preserves the drill-down view.

Why memory limits alongside time?

A submission that passes time limits by allocating gigabytes per test case still abuses the platform. Memory limits enforced via cgroups or resource limits on the sandbox prevent one abusive submission from starving others.

Interview Follow-ups

"How would you scale to 10k submissions per second during a contest?" Submission queue (Kafka/SQS) with N worker pods pulling. Each pod runs a single submission in a fresh sandbox container, then terminates. Stateless workers scale horizontally.
"How do you prevent cheating?" Rate-limit submissions per user per problem, diff-check submissions for near-duplicates (MOSS or minhash), monitor timing patterns for cross-user plagiarism.
"What about interactive problems (judge communicates with submitted code)?" Runner gives the judge a pipe in both directions. The judge process speaks a protocol with the submission — useful for interactive proofs or adaptive test cases.
"How do you handle custom checkers (e.g., multiple valid outputs)?" A per-problem ValidatorFunction replaces the default string-compare. The checker gets input, submission output, and reference output; returns accepted/rejected.
"How do you support partial credit?" Test cases carry points; a submission's score is the sum of accepted cases. Verdict stays the "worst outcome" for display but scoring is additive.

Code Implementation

  1  from __future__ import annotations
  2  from abc import ABC, abstractmethod
  3  from dataclasses import dataclass, field
  4  from datetime import datetime
  5  from enum import Enum
  6  import subprocess
  7  import sys
  8  import tempfile
  9  import uuid
 10  from pathlib import Path
 11  
 12  
 13  class Language(Enum):
 14      PYTHON = "python"
 15      JAVA = "java"
 16      GO = "go"
 17  
 18  
 19  class Verdict(Enum):
 20      PENDING = "pending"
 21      ACCEPTED = "accepted"
 22      WRONG_ANSWER = "wrong_answer"
 23      TIME_LIMIT = "time_limit"
 24      MEMORY_LIMIT = "memory_limit"
 25      RUNTIME_ERROR = "runtime_error"
 26      COMPILE_ERROR = "compile_error"
 27  
 28  
 29  class SubmissionStatus(Enum):
 30      QUEUED = "queued"
 31      COMPILING = "compiling"
 32      RUNNING = "running"
 33      DONE = "done"
 34  
 35  
 36  @dataclass
 37  class TestCase:
 38      id: str
 39      input: str
 40      expected_output: str
 41      hidden: bool = False
 42      points: int = 1
 43  
 44  
 45  @dataclass
 46  class Problem:
 47      id: str
 48      title: str
 49      time_limit_ms: int
 50      memory_limit_mb: int
 51      test_cases: list[TestCase]
 52  
 53  
 54  @dataclass
 55  class RunResult:
 56      stdout: str
 57      stderr: str
 58      exit_code: int
 59      duration_ms: int
 60      memory_kb: int
 61      timed_out: bool
 62  
 63  
 64  @dataclass
 65  class TestRun:
 66      test_case_id: str
 67      verdict: Verdict
 68      time_ms: int
 69      memory_kb: int
 70      output: str
 71  
 72  
 73  @dataclass
 74  class Submission:
 75      id: str
 76      user_id: str
 77      problem_id: str
 78      language: Language
 79      code: str
 80      status: SubmissionStatus = SubmissionStatus.QUEUED
 81      verdict: Verdict = Verdict.PENDING
 82      runs: list[TestRun] = field(default_factory=list)
 83      submitted_at: datetime = field(default_factory=datetime.utcnow)
 84      score: int = 0
 85  
 86  
 87  @dataclass
 88  class CompileResult:
 89      success: bool
 90      errors: str = ""
 91  
 92  
 93  class LanguageRunner(ABC):
 94      """Sandboxed execution of untrusted code. Separate compile step so static
 95      languages (Java/C++) can report compile errors once, not per test case."""
 96  
 97      def compile(self, code: str) -> CompileResult:
 98          """Default: interpreted languages compile trivially. Override for javac/g++."""
 99          return CompileResult(success=True)
100  
101      @abstractmethod
102      def execute(self, code: str, stdin: str, time_limit_ms: int,
103                  memory_limit_mb: int) -> RunResult: ...
104  
105      @staticmethod
106      def normalize(text: str) -> str:
107          # Strip trailing whitespace per line; trim trailing blank lines.
108          lines = [ln.rstrip() for ln in text.splitlines()]
109          while lines and lines[-1] == "":
110              lines.pop()
111          return "\n".join(lines)
112  
113  
114  class PythonRunner(LanguageRunner):
115      """Spawns a subprocess, wall-clock limit, stdin piped in. Production needs real isolation."""
116  
117      def execute(self, code: str, stdin: str, time_limit_ms: int,
118                  memory_limit_mb: int) -> RunResult:
119          start = datetime.utcnow()
120          with tempfile.NamedTemporaryFile(
121              mode="w", suffix=".py", delete=False
122          ) as tmp:
123              tmp.write(code)
124              path = tmp.name
125          try:
126              # WARNING: subprocess is NOT a secure sandbox — production needs gVisor, nsjail, or similar.
127              # This demonstrates the interface; replace with a hardened runner in production.
128              proc = subprocess.run(
129                  [sys.executable, path],
130                  input=stdin,
131                  capture_output=True,
132                  text=True,
133                  timeout=time_limit_ms / 1000.0,
134              )
135              duration_ms = int((datetime.utcnow() - start).total_seconds() * 1000)
136              return RunResult(
137                  stdout=proc.stdout,
138                  stderr=proc.stderr,
139                  exit_code=proc.returncode,
140                  duration_ms=duration_ms,
141                  memory_kb=0,                  # memory tracking would require rusage / cgroups
142                  timed_out=False,
143              )
144          except subprocess.TimeoutExpired:
145              duration_ms = int((datetime.utcnow() - start).total_seconds() * 1000)
146              return RunResult("", "Timeout", -1, duration_ms, 0, timed_out=True)
147          finally:
148              Path(path).unlink(missing_ok=True)
149  
150  
151  class Judge:
152      def __init__(self, runners: dict[Language, LanguageRunner] | None = None):
153          self._runners: dict[Language, LanguageRunner] = runners or {Language.PYTHON: PythonRunner()}
154          self._problems: dict[str, Problem] = {}
155          self._submissions: dict[str, Submission] = {}
156  
157      def add_problem(self, problem: Problem) -> None:
158          self._problems[problem.id] = problem
159  
160      def submit(self, user_id: str, problem_id: str, language: Language, code: str) -> Submission:
161          if problem_id not in self._problems:
162              raise ValueError("unknown problem")
163          if language not in self._runners:
164              raise ValueError(f"language {language.value} not supported")
165          submission = Submission(
166              id=str(uuid.uuid4())[:8],
167              user_id=user_id,
168              problem_id=problem_id,
169              language=language,
170              code=code,
171          )
172          self._submissions[submission.id] = submission
173          # Grading runs inline here. In production it's queued to a worker pool.
174          self._grade(submission)
175          return submission
176  
177      def _grade(self, submission: Submission) -> None:
178          problem = self._problems[submission.problem_id]
179          runner = self._runners[submission.language]
180  
181          # Compile once before any test runs. For static languages a failure here
182          # is the canonical reason for COMPILE_ERROR; no point hitting tests.
183          submission.status = SubmissionStatus.COMPILING
184          compile_result = runner.compile(submission.code)
185          if not compile_result.success:
186              submission.verdict = Verdict.COMPILE_ERROR
187              submission.status = SubmissionStatus.DONE
188              # Store the compile error as a single pseudo-run for the UI.
189              submission.runs.append(TestRun(
190                  test_case_id="compile",
191                  verdict=Verdict.COMPILE_ERROR,
192                  time_ms=0, memory_kb=0,
193                  output=compile_result.errors,
194              ))
195              return
196  
197          submission.status = SubmissionStatus.RUNNING
198          overall = Verdict.ACCEPTED
199          for tc in problem.test_cases:
200              result = runner.execute(
201                  submission.code, tc.input,
202                  problem.time_limit_ms, problem.memory_limit_mb,
203              )
204  
205              if result.timed_out:
206                  verdict = Verdict.TIME_LIMIT
207              elif result.exit_code != 0:
208                  verdict = Verdict.RUNTIME_ERROR
209              elif LanguageRunner.normalize(result.stdout) == LanguageRunner.normalize(tc.expected_output):
210                  verdict = Verdict.ACCEPTED
211              else:
212                  verdict = Verdict.WRONG_ANSWER
213  
214              submission.runs.append(TestRun(
215                  test_case_id=tc.id,
216                  verdict=verdict,
217                  time_ms=result.duration_ms,
218                  memory_kb=result.memory_kb,
219                  output=result.stdout if not tc.hidden else "",
220              ))
221  
222              if verdict == Verdict.ACCEPTED:
223                  submission.score += tc.points
224              else:
225                  overall = verdict
226                  break      # fail-fast: remaining test cases don't change the verdict
227  
228          submission.verdict = overall
229          submission.status = SubmissionStatus.DONE
230  
231      def get_submission(self, submission_id: str) -> Submission | None:
232          return self._submissions.get(submission_id)
233  
234  
235  if __name__ == "__main__":
236      problem = Problem(
237          id="sum-two",
238          title="Sum Two Numbers",
239          time_limit_ms=2000,
240          memory_limit_mb=64,
241          test_cases=[
242              TestCase("t1", input="2 3\n", expected_output="5\n"),
243              TestCase("t2", input="10 15\n", expected_output="25\n"),
244              TestCase("t3", input="-1 1\n", expected_output="0\n", hidden=True),
245          ],
246      )
247      judge = Judge()
248      judge.add_problem(problem)
249  
250      # Correct solution.
251      correct = "a, b = map(int, input().split())\nprint(a + b)"
252      sub1 = judge.submit("alice", "sum-two", Language.PYTHON, correct)
253      print(f"Correct solution -> {sub1.verdict.value}, score {sub1.score}")
254      for run in sub1.runs:
255          print(f"  {run.test_case_id}: {run.verdict.value}, {run.time_ms} ms")
256  
257      # Wrong answer.
258      wrong = "a, b = map(int, input().split())\nprint(a - b)"
259      sub2 = judge.submit("bob", "sum-two", Language.PYTHON, wrong)
260      print(f"\nWrong solution -> {sub2.verdict.value}, score {sub2.score}")
261      print(f"Failed on {sub2.runs[-1].test_case_id}")
262  
263      # Runtime error.
264      runtime_err = "a, b = map(int, input().split())\nprint(1 / 0)"
265      sub3 = judge.submit("carol", "sum-two", Language.PYTHON, runtime_err)
266      print(f"\nRuntime error -> {sub3.verdict.value}")
267  
268      # TLE (infinite loop).
269      tle = "while True: pass"
270      sub4 = judge.submit("dan", "sum-two", Language.PYTHON, tle)
271      print(f"\nInfinite loop -> {sub4.verdict.value}")
272  
273      # Compile error path — register a custom runner that rejects any code
274      # containing the literal "BROKEN". Static-language runners (javac/g++)
275      # plug in the same way.
276      class CompileCheckingRunner(PythonRunner):
277          def compile(self, code: str) -> CompileResult:
278              if "BROKEN" in code:
279                  return CompileResult(success=False, errors="line 1: BROKEN token rejected")
280              return CompileResult(success=True)
281      judge._runners[Language.PYTHON] = CompileCheckingRunner()
282      sub5 = judge.submit("eve", "sum-two", Language.PYTHON, "BROKEN")
283      print(f"\nCompile-error solution -> {sub5.verdict.value}, runs: {len(sub5.runs)}")
284      assert sub5.verdict == Verdict.COMPILE_ERROR
285      assert sub5.runs[0].output.startswith("line 1")

Common Mistakes

✗Running untrusted code in the same process as the judge. One infinite loop or fork bomb kills the platform.
✗Sharing stdin/stdout with the runner. Need pipes with strict byte limits to prevent output flooding.
✗Running all test cases even after the first failure. Wastes CPU; verdict is decided.
✗Comparing output with ==. Whitespace and line-ending differences cause spurious WA. Normalize first.

Key Points

✓Sandbox isolation is non-negotiable — untrusted code can't touch filesystem, network, or other submissions.
✓Language as a Strategy lets Python, Java, Go run the same submission contract with different toolchains.
✓Test cases run serially per submission (fail-fast on first WA/TLE) but submissions run in parallel across workers.
✓Verdicts are ordered by severity — compile error dominates WA dominates TLE dominates anything else.

Related Designs