DAG Task Runner with Backpressure
Run tasks with dependencies, where each task waits for its prerequisites. Topological-order scheduling + worker pool + bounded concurrency. Pattern: per-task CountDownLatch / Future / WaitGroup, with the dependency graph driving submission.
The problem
Run a graph of tasks where each task has prerequisites. Goal: every task runs as soon as its dependencies finish; independent tasks run concurrently; total runtime ≈ critical-path length.
Examples in the wild: build systems (Make, Bazel), data pipelines (Airflow, Argo Workflows), ML pipelines (Kubeflow), CI/CD (any system with needs:).
What this tests
Composing topological ordering + concurrent execution + failure propagation. Most candidates know each piece individually; senior candidates compose them and discuss failure semantics.
The pattern: per-task Future + dependency awaits
Why Future-based composition wins
Each task is a Future. Dependent tasks .thenCompose (Java) / await asyncio.gather(*deps) (Python) / <-dep.done (Go) on their parents.
The runtime handles scheduling order automatically: a Future's continuation doesn't run until the future it depends on completes. No topological sort is written by hand; the dependency graph IS the sort.
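A minimal Java sketch of this composition (class and variable names are illustrative): each dependent future is declared against its parents, and the runtime orders execution, including the diamond-shaped merge.

```java
import java.util.concurrent.CompletableFuture;

public class ComposeDemo {
    public static void main(String[] args) {
        // Parent task: no dependencies, starts immediately.
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> 2);

        // b and c each depend on a; the runtime runs them once a completes.
        CompletableFuture<Integer> b = a.thenApply(x -> x + 1);
        CompletableFuture<Integer> c = a.thenApply(x -> x * 10);

        // d depends on both b and c -- the diamond in the DAG.
        CompletableFuture<Integer> d = b.thenCombine(c, Integer::sum);

        System.out.println(d.join()); // prints 23 (3 + 20)
    }
}
```

No scheduler code appears anywhere: declaring `d` in terms of `b` and `c` is the topological sort.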
Failure propagation strategies
The most-debated design choice: what happens when task A (a dependency of B and C) fails?
- Fail-fast: cancel B and C immediately. Airflow default.
- Skip-and-continue: mark B and C as skipped; run their non-failed siblings.
- Retry-then-fail: retry A N times with backoff before declaring failure.
- Per-task: each task declares its policy ('run if all upstream succeed' vs 'run if any upstream succeeded' vs 'always run').
Most production systems support all four. Picking the right one is part of system design.
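Two of these policies can be sketched directly with CompletableFuture (names are illustrative): fail-fast lets the parent's exception propagate, so the dependent never runs; skip-and-continue maps the failure to a sentinel value so the dependent's siblings still complete.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class FailurePolicyDemo {
    public static void main(String[] args) {
        // Task A fails.
        CompletableFuture<String> a = CompletableFuture.supplyAsync(() -> {
            throw new RuntimeException("task A failed");
        });

        // Fail-fast: B inherits A's exception; B's body never executes.
        CompletableFuture<String> b = a.thenApply(v -> "B ran");

        // Skip-and-continue: convert A's failure into a "skipped" marker,
        // so C completes normally instead of failing.
        CompletableFuture<String> c = a
                .thenApply(v -> "C ran")
                .exceptionally(ex -> "C skipped");

        try {
            b.join();
        } catch (CompletionException e) {
            System.out.println("B failed: " + e.getCause().getMessage());
        }
        System.out.println(c.join()); // prints "C skipped"
    }
}
```

Retry-then-fail and per-task policies layer on the same primitives: wrap the supplier in a retry loop, or choose between `thenApply` and `exceptionally` per task.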
Backpressure
Submitting 10K tasks at once turns all of them into Futures simultaneously, even though only N workers actually run. Memory cost is fine for thousands; for millions, paginate submissions OR use a coordinator that streams tasks into the pool as workers finish.
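One way to stream tasks into the pool is a max-in-flight gate on the submitter (a sketch using `java.util.concurrent.Semaphore`; the names and the limit of 4 are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BackpressureDemo {
    public static void main(String[] args) throws InterruptedException {
        int maxInFlight = 4; // the backpressure knob
        Semaphore permits = new Semaphore(maxInFlight);
        ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);

        for (int i = 0; i < 100; i++) {
            permits.acquire(); // blocks the submitter until a slot frees up
            final int id = i;
            pool.submit(() -> {
                try {
                    // ... run task 'id' here ...
                } finally {
                    permits.release(); // frees a slot for the next submission
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

The submitter itself blocks instead of materializing a million pending Futures, so memory stays proportional to `maxInFlight`, not to total task count.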
When a DAG runner isn't needed
- All tasks independent: just use a worker pool, no graph.
- Linear pipeline: just chain Futures, don't build a generic DAG.
- Existing tool fits: Airflow / Argo / Bazel exist for a reason; don't build a custom DAG runner without very specific needs.
The interview signal
Naming the existing tools (Airflow, Argo, Make) and noting "I'd build my own only if X, Y, Z" is the senior signal. Junior candidates jump straight to implementation; senior ones compare against existing tools.
Implementations
Each task is a CompletableFuture. thenApplyAsync runs the task body after its parents complete; a parent's failure propagates down the chain automatically (and can be handled with exceptionally). JVM scheduling handles ordering, so no explicit graph traversal is needed.
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Supplier;

class DagRunner {
    private final ExecutorService executor;
    private final ConcurrentHashMap<String, CompletableFuture<Object>> futures = new ConcurrentHashMap<>();

    public DagRunner(int workers) {
        // Bounded queue + CallerRunsPolicy: when the pool is saturated,
        // the submitting thread runs tasks itself -- the backpressure knob.
        this.executor = new ThreadPoolExecutor(
                workers, workers, 0, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Dependencies must be submitted before their dependents.
    public CompletableFuture<Object> submit(String id, List<String> deps, Supplier<Object> task) {
        CompletableFuture<Void> readyToRun = deps.isEmpty()
                ? CompletableFuture.completedFuture(null)
                : CompletableFuture.allOf(deps.stream()
                        .map(dep -> Objects.requireNonNull(futures.get(dep),
                                "unknown dependency: " + dep))
                        .toArray(CompletableFuture[]::new));

        // Runs on the pool only after every dependency has completed;
        // if any dependency failed, the failure propagates and 'task' never runs.
        CompletableFuture<Object> result = readyToRun.thenApplyAsync(v -> task.get(), executor);
        futures.put(id, result);
        return result;
    }

    // Blocks until every submitted task finishes; rethrows the first failure.
    public void awaitAll() {
        CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0])).join();
    }
}

// Usage:
// dag.submit("a", List.of(), () -> "result-a");
// dag.submit("b", List.of("a"), () -> "result-b");
// dag.submit("c", List.of("a"), () -> "result-c");
// dag.submit("d", List.of("b", "c"), () -> "merge-result");
// dag.awaitAll();

Key points
- Topological order: tasks run as soon as their dependencies complete
- Per-task synchronization: Future, CompletableFuture, CountDownLatch, errgroup
- Worker pool bounds concurrency; max-in-flight is the backpressure knob
- Failure propagation: if a task fails, downstream tasks are cancelled (or retried)
- Used by Apache Airflow, Argo Workflows, Make, build systems