ForkJoinPool & Parallel Streams
ForkJoinPool is Java's work-stealing thread pool optimised for divide-and-conquer recursion. RecursiveTask / RecursiveAction split work, fork() schedules, join() waits. Parallel streams run on the common pool. Sharp tool: useful for CPU-bound recursive work, terrible for blocking I/O.
The pattern in plain English
ForkJoinPool is a thread pool tuned for problems that split themselves into smaller pieces. Hand it one big task. The task splits into two smaller subtasks. Each of those splits again. The splitting stops when each piece is small enough to compute directly. Then the answers bubble back up, getting combined as they go.
(Diagram: sorting 1,000,000 items as a tree of subtasks. Solid arrows fork, splitting on the way down; dashed arrows join, merging on the way back up.)
That's the fork-join shape. Fork to split, join to combine.
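For concreteness, here is a minimal sketch of that shape applied to sorting. It uses RecursiveAction since nothing needs to be returned; the class name MergeSortTask, the 4,096-element cutoff, and the sample data are illustrative choices, not anything prescribed above.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class MergeSortTask extends RecursiveAction {
    private static final int CUTOFF = 4_096;   // small enough: sort directly
    private final long[] a, buf;
    private final int lo, hi;                  // sorts a[lo, hi)

    MergeSortTask(long[] a, long[] buf, int lo, int hi) {
        this.a = a; this.buf = buf; this.lo = lo; this.hi = hi;
    }

    @Override protected void compute() {
        if (hi - lo <= CUTOFF) {               // leaf: compute directly
            Arrays.sort(a, lo, hi);
            return;
        }
        int mid = (lo + hi) >>> 1;
        // fork: split into two halves that run in parallel, wait for both
        invokeAll(new MergeSortTask(a, buf, lo, mid),
                  new MergeSortTask(a, buf, mid, hi));
        merge(mid);                            // join: combine the sorted halves
    }

    private void merge(int mid) {
        System.arraycopy(a, lo, buf, lo, hi - lo);
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) a[k++] = buf[i] <= buf[j] ? buf[i++] : buf[j++];
        while (i < mid) a[k++] = buf[i++];
        while (j < hi)  a[k++] = buf[j++];
    }
}

// Usage:
long[] data = new Random(42).longs(1_000_000).toArray();
ForkJoinPool.commonPool().invoke(new MergeSortTask(data, new long[data.length], 0, data.length));
```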
Work stealing in plain English
A naive thread pool has one shared queue of pending tasks. Every worker thread reaches into the same queue to grab the next task. When there are many workers and many small tasks, the queue itself becomes the bottleneck. Every grab is a write on the same data structure, the cache line bounces between cores, and the workers spend more time fighting over the queue than doing the work.
ForkJoinPool solves this by giving every worker its own local double-ended queue, called a deque. Each worker pushes and pops tasks from one end of its own deque, so there is no contention with anyone else as long as the worker has its own work to do. When a worker runs out of tasks on its own deque, it goes looking for a busy worker and steals a task from the other end of that worker's deque.
The two ends are what make it work. The owner pushes and pops at one end, its own end, with fast operations that need almost no synchronization. Other workers steal from the opposite end, which only sees activity when somebody is actually idle. Most of the time the two ends are far apart, and the CAS a thief performs on the far end rarely contends with the owner.
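As a toy model only (the real ForkJoinPool deque is a lock-free, array-based structure, not a synchronized ArrayDeque), the two ends look roughly like this:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch of one worker's deque: not how ForkJoinPool implements it,
// just the shape of the idea.
class ToyWorkerQueue {
    private final Deque<Runnable> deque = new ArrayDeque<>();

    // Owner side: push and pop at the same end (LIFO keeps recently split, cache-hot work local).
    synchronized void push(Runnable task) { deque.addLast(task); }
    synchronized Runnable pop()           { return deque.pollLast(); }

    // Thief side: take from the opposite end (the oldest, usually largest, chunk of work).
    synchronized Runnable steal()         { return deque.pollFirst(); }
}
```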
A worked example with three workers, one of which runs out of work and steals.
Initial state. Workers 1 and 3 have tasks on their own deques. Worker 2 has nothing.
| Worker | Deque (own end on the right) | Status |
|---|---|---|
| Worker 1 | [T1.1, T1.2, T1.3, T1.4] | busy, pops T1.4 next |
| Worker 2 | [ ] | idle, looking for work to steal |
| Worker 3 | [T3.1, T3.2] | busy, pops T3.2 next |
Worker 2 steals from Worker 1. Worker 2 scans for a victim with work available (the real pool picks victims pseudo-randomly rather than hunting for the busiest) and takes one task from the far end, the opposite end from where Worker 1 is pushing and popping.
| Worker | Deque after the steal | What just happened |
|---|---|---|
| Worker 1 | [T1.2, T1.3, T1.4] | unchanged on the own end, T1.1 was taken from the far end |
| Worker 2 | running T1.1 | stole one task and is executing it |
| Worker 3 | [T3.1, T3.2] | not touched at all |
Worker 1's own-end work continues uninterrupted; T1.4 is still next on Worker 1's pop. Worker 2 is busy again. Worker 3 was never involved. No global queue, no central lock, just one CAS at the moment of theft.
The result is load balancing for free, with almost no coordination overhead. The structure was introduced in Java 7 and made famous by Java 8 parallel streams, which use ForkJoinPool.commonPool() by default.
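One way to watch this happen is the common pool's own counters. The workload below is an arbitrary example, and the steal count is an estimate that varies from run to run:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

ForkJoinPool common = ForkJoinPool.commonPool();
System.out.println("parallelism = " + common.getParallelism());

// Some splittable CPU-bound work that runs on the common pool.
long sum = LongStream.rangeClosed(1, 50_000_000).parallel().sum();

// getStealCount() is an estimate of tasks taken from another worker's deque.
System.out.println("sum = " + sum + ", steals so far = " + common.getStealCount());
```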
When it shines
Pure CPU-bound divide-and-conquer with no blocking. Sorting a large array. Summing or reducing over a large collection. Tree traversal with independent branch work. Image processing. Numerical algorithms.
The fork-join model assumes tasks make progress without waiting on external events. When tasks block on I/O, the pool's thread count becomes the effective I/O concurrency limit, and most threads sit idle while blocked. With 8 workers and calls that block for 200 ms, for example, throughput caps at roughly 40 calls per second no matter how much work is queued.
When to avoid
Anything I/O-bound. Anything with significant lock contention between tasks. Anything where the per-task overhead is comparable to or larger than the actual work. Anything with strict ordering requirements where parallelism does not help.
For I/O-bound work, use a dedicated ExecutorService sized for concurrency (often 50-100 threads or more), or use virtual threads in Java 21+, or use CompletableFuture chaining on a non-blocking I/O library.
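A sketch of the dedicated-executor route, assuming the same ids, callApi, and Result placeholders used in the snippets further down; the 100-thread size is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sized for I/O concurrency, not core count: these threads mostly sit in blocked waits.
ExecutorService ioPool = Executors.newFixedThreadPool(100);
// Java 21+ alternative: ExecutorService ioPool = Executors.newVirtualThreadPerTaskExecutor();
try {
    List<Future<Result>> futures = new ArrayList<>();
    for (var id : ids) {
        futures.add(ioPool.submit(() -> callApi(id)));   // each blocking call gets its own thread
    }
    List<Result> results = new ArrayList<>();
    for (Future<Result> f : futures) {
        results.add(f.get());   // get() throws checked InterruptedException/ExecutionException
    }
} finally {
    ioPool.shutdown();
}
```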
The common pool trap
Stream.parallel() uses the common pool. So does any code that calls ForkJoinPool.commonPool(). So does CompletableFuture.runAsync without an explicit executor.
If one of those callers blocks (sleeps, waits on a lock, calls a slow API), the common pool's threads are tied up. Every other parallel stream in the JVM slows down or stops. This is one of the most common production surprises with parallel streams.
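A small, self-contained demonstration of the effect (the sleep durations and stream sizes are arbitrary, and exact timings vary by machine): while the sleeping tasks hold the common pool's workers, the unrelated CPU-bound stream is computed essentially single-threaded by the calling thread.

```java
import java.util.stream.IntStream;
import java.util.stream.LongStream;

public class CommonPoolStarvationDemo {
    public static void main(String[] args) throws InterruptedException {
        // 64 tasks that each block for 2 seconds, run as a parallel stream on the common pool.
        Thread blocker = new Thread(() ->
                IntStream.range(0, 64).parallel().forEach(i -> sleepMillis(2_000)));
        blocker.start();
        Thread.sleep(100);   // give the blocking tasks time to occupy the common pool workers

        // An unrelated CPU-bound parallel stream now finds no free workers and is
        // effectively executed single-threaded by the calling (main) thread.
        long t0 = System.nanoTime();
        long sum = LongStream.rangeClosed(1, 200_000_000).parallel().sum();
        System.out.printf("sum=%d in %d ms%n", sum, (System.nanoTime() - t0) / 1_000_000);

        blocker.join();
    }

    private static void sleepMillis(long ms) {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```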
The fix: never block in code that runs on the common pool. When unavoidable, submit to a dedicated ForkJoinPool, or wrap the block in ManagedBlocker so the pool can compensate.
Choosing a threshold
Splitting has a cost: fork, queue, join. Splitting too aggressively makes overhead exceed the speedup. Splitting too little leaves cores idle. The sweet spot depends on the per-element cost.
For a cheap operation (adding longs), 10,000 elements per leaf is reasonable. For an expensive one (parsing a JSON document per element), even 50 elements per leaf may be too many. Aim for a total number of leaves in the range of 10x-100x the core count, then profile.
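One rough starting point, treating the 32-leaves-per-core factor as an arbitrary pick inside that 10x-100x band (array stands in for whatever data is being split):

```java
import java.util.concurrent.ForkJoinPool;

// Target roughly 32 leaves per core, and never let a leaf shrink below one element.
int cores = ForkJoinPool.getCommonPoolParallelism();
int threshold = Math.max(1, array.length / (cores * 32));
```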
A note on parallel streams
Parallel streams are the easy mode for fork-join. Same stream pipeline, add .parallel(), and the runtime splits and combines. Most of the time it works.
When it does not: blocking inside the pipeline (don't), unbalanced split (some sources split worse than others, ArrayList splits well, LinkedList does not, generators split badly), or pipelines where ordering matters and the parallel order machinery dominates.
For most application code, sequential streams are fine. Reach for parallel only after measuring a real speedup on a real workload.
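For a quick smoke test before reaching for a proper harness such as JMH, something along these lines gives a first impression (the workload here is made up; beware of JIT warm-up skewing a single run):

```java
import java.util.stream.LongStream;

long t0 = System.nanoTime();
long seq = LongStream.rangeClosed(1, 100_000_000).map(i -> i * i % 7).sum();
long t1 = System.nanoTime();
long par = LongStream.rangeClosed(1, 100_000_000).parallel().map(i -> i * i % 7).sum();
long t2 = System.nanoTime();

System.out.printf("sequential %d ms, parallel %d ms (same result: %b)%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, seq == par);
```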
Primitives by language
- ForkJoinPool / ForkJoinPool.commonPool()
- RecursiveTask<V> (returns a value)
- RecursiveAction (no return value)
- Stream.parallel() / Collection.parallelStream()
- ManagedBlocker (escape hatch for blocking inside fork-join tasks)
Implementation
The split threshold matters. Too small: scheduling overhead dominates. Too large: parallelism is wasted. A few thousand elements per leaf is typical for cheap-per-element work; lower for expensive work.
```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000;
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();               // schedule left
        long r = right.compute();  // run right inline
        long l = left.join();      // wait for left
        return l + r;
    }
}

// Usage:
long total = ForkJoinPool.commonPool().invoke(new SumTask(arr, 0, arr.length));
```

Most fork-join use in practice is hidden behind parallel streams. Splitting and combining is automatic. The catch: it always uses the common pool, so blocking work in a parallel stream starves every other parallel stream and any caller using commonPool.
```java
List<Order> orders = loadOrders();

// Parallel reduction
BigDecimal total = orders.parallelStream()
        .map(Order::amount)
        .reduce(BigDecimal.ZERO, BigDecimal::add);

// Parallel filter + collect (THRESHOLD here is a BigDecimal cutoff for "large" orders,
// not the split threshold from SumTask above)
List<Order> large = orders.parallelStream()
        .filter(o -> o.amount().compareTo(THRESHOLD) > 0)
        .toList();

// BAD: blocking I/O in a parallel stream uses commonPool and can starve others
// List<Result> results = ids.parallelStream().map(this::callApi).toList();
```

To keep a parallel stream off the common pool, submit it to a dedicated ForkJoinPool. Each pool has its own threads and deques. Important when one workload has different latency characteristics than the rest of the JVM.
```java
ForkJoinPool pool = new ForkJoinPool(8);

try {
    // A parallel stream started inside a task submitted to this pool runs on this pool's
    // workers instead of the common pool. Note: get() throws the checked
    // InterruptedException and ExecutionException, which the caller must handle.
    List<Result> results = pool.submit(() ->
        ids.parallelStream().map(this::callApi).toList()
    ).get();
} finally {
    pool.shutdown();
}
```

When blocking inside a fork-join task is unavoidable (e.g., waiting on a CompletableFuture), wrap the block in ManagedBlocker. The pool spawns a compensation thread so other tasks keep making progress. This is the escape hatch, and also a sign that fork-join may be the wrong tool.
```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinPool.ManagedBlocker;

class FutureBlocker<T> implements ManagedBlocker {
    private final CompletableFuture<T> future;
    private T result;
    private boolean done = false;

    FutureBlocker(CompletableFuture<T> future) { this.future = future; }

    @Override public boolean block() throws InterruptedException {
        try { result = future.get(); }                         // blocks until complete
        catch (ExecutionException e) { throw new RuntimeException(e.getCause()); }
        done = true;
        return true;
    }

    @Override public boolean isReleasable() {
        if (done) return true;
        if (!future.isDone()) return false;
        // The future completed before we had to block: capture the value here, because
        // managedBlock may skip block() entirely once this returns true.
        try { result = future.join(); }
        catch (CompletionException e) { throw new RuntimeException(e.getCause()); }
        done = true;
        return true;
    }

    public T result() { return result; }
}

// Usage inside a fork-join task:
FutureBlocker<String> b = new FutureBlocker<>(slowFuture);
ForkJoinPool.managedBlock(b);   // the pool may add a compensation thread while we wait
String s = b.result();
```

Key points
- Work stealing: each worker has its own deque; when idle, it steals from another worker's deque tail.
- Designed for divide-and-conquer with no blocking. Tasks split until small, compute, combine.
- The common pool is shared across the JVM. Parallel streams use it. Blocking inside it starves everyone.
- Default parallelism = available cores - 1. Override with -Djava.util.concurrent.ForkJoinPool.common.parallelism.
- For CPU-bound recursive work, fork-join can outperform a fixed thread pool. For I/O, use a dedicated executor.
Follow-up questions
- Why is the common pool shared across the JVM?
- When to avoid parallel streams?
- What does work stealing buy?
- How big should THRESHOLD be?
Gotchas
- Blocking inside the common pool starves every other parallel stream caller.
- Recursive tasks that share mutable state defeat the model. Compose pure functions.
- Parallel stream parallelism = common pool size, NOT the number of stream sources.
- Calling fork() and then immediately join() on the same task adds queueing overhead for no gain over plain compute(); fork-join shines when one half is forked while the other runs inline.
- Setting common.parallelism too high (> cores) wastes context switches; too low (= 1) defeats the point.