ForkJoinPool & Parallel Streams
ForkJoinPool is Java's work-stealing thread pool optimised for divide-and-conquer recursion. RecursiveTask / RecursiveAction split work, fork() schedules, join() waits. Parallel streams run on the common pool. Sharp tool: useful for CPU-bound recursive work, terrible for blocking I/O.
The pattern in plain English
ForkJoinPool is a thread pool tuned for problems that split themselves into smaller pieces. Hand it one big task. The task splits into two smaller subtasks. Each of those splits again. The splitting stops when each piece is small enough to compute directly. Then the answers bubble back up, getting combined as they go.
(Diagram: sorting 1,000,000 items as a tree of subtasks. Solid arrows fork, splitting on the way down; dashed arrows join, merging on the way back up.)
That's the fork-join shape. Fork to split, join to combine.
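For concreteness, here is a minimal sketch of that shape applied to sorting. It uses RecursiveAction since nothing needs to be returned; the class name MergeSortTask, the 4,096-element cutoff, and the sample data are illustrative choices, not anything prescribed above.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class MergeSortTask extends RecursiveAction {
    private static final int CUTOFF = 4_096;   // small enough: sort directly
    private final long[] a, buf;
    private final int lo, hi;                  // sorts a[lo, hi)

    MergeSortTask(long[] a, long[] buf, int lo, int hi) {
        this.a = a; this.buf = buf; this.lo = lo; this.hi = hi;
    }

    @Override protected void compute() {
        if (hi - lo <= CUTOFF) {               // leaf: compute directly
            Arrays.sort(a, lo, hi);
            return;
        }
        int mid = (lo + hi) >>> 1;
        // fork: split into two halves that run in parallel, wait for both
        invokeAll(new MergeSortTask(a, buf, lo, mid),
                  new MergeSortTask(a, buf, mid, hi));
        merge(mid);                            // join: combine the sorted halves
    }

    private void merge(int mid) {
        System.arraycopy(a, lo, buf, lo, hi - lo);
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) a[k++] = buf[i] <= buf[j] ? buf[i++] : buf[j++];
        while (i < mid) a[k++] = buf[i++];
        while (j < hi)  a[k++] = buf[j++];
    }
}

// Usage:
long[] data = new Random(42).longs(1_000_000).toArray();
ForkJoinPool.commonPool().invoke(new MergeSortTask(data, new long[data.length], 0, data.length));
```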
Work stealing in plain English
A naive thread pool has one shared queue of pending tasks. Every worker thread reaches into the same queue to grab the next task. When there are many workers and many small tasks, the queue itself becomes the bottleneck. Every grab is a write on the same data structure, the cache line bounces between cores, and the workers spend more time fighting over the queue than doing the work.
ForkJoinPool solves this by giving every worker its own local double-ended queue, called a deque. Each worker pushes and pops tasks from one end of its own deque, so there is no contention with anyone else as long as the worker has its own work to do. When a worker runs out of tasks on its own deque, it goes looking for a busy worker and steals a task from the other end of that worker's deque.
The two ends are what make it work. The owner pushes and pops at one end, its own end, with fast operations that need almost no synchronization. Other workers steal from the opposite end, which only sees activity when somebody is actually idle. Most of the time the two ends are far apart, and the CAS a thief performs on the far end rarely contends with the owner.
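As a toy model only (the real ForkJoinPool deque is a lock-free, array-based structure, not a synchronized ArrayDeque), the two ends look roughly like this:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch of one worker's deque: not how ForkJoinPool implements it,
// just the shape of the idea.
class ToyWorkerQueue {
    private final Deque<Runnable> deque = new ArrayDeque<>();

    // Owner side: push and pop at the same end (LIFO keeps recently split, cache-hot work local).
    synchronized void push(Runnable task) { deque.addLast(task); }
    synchronized Runnable pop()           { return deque.pollLast(); }

    // Thief side: take from the opposite end (the oldest, usually largest, chunk of work).
    synchronized Runnable steal()         { return deque.pollFirst(); }
}
```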
A worked example with three workers, one of which runs out of work and steals.
Initial state. Workers 1 and 3 have tasks on their own deques. Worker 2 has nothing.
| Worker | Deque (own end on the right) | Status |
|---|---|---|
| Worker 1 | [T1.1, T1.2, T1.3, T1.4] | busy, pops T1.4 next |
| Worker 2 | [ ] | idle, looking for work to steal |
| Worker 3 | [T3.1, T3.2] | busy, pops T3.2 next |
Worker 2 steals from Worker 1. Worker 2 scans for a victim with work available (the real pool picks victims pseudo-randomly rather than hunting for the busiest) and takes one task from the far end, the opposite end from where Worker 1 is pushing and popping.
| Worker | Deque after the steal | What just happened |
|---|---|---|
| Worker 1 | [T1.2, T1.3, T1.4] | unchanged on the own end, T1.1 was taken from the far end |
| Worker 2 | running T1.1 | stole one task and is executing it |
| Worker 3 | [T3.1, T3.2] | not touched at all |
Worker 1's own-end work continues uninterrupted; T1.4 is still next on Worker 1's pop. Worker 2 is busy again. Worker 3 was never involved. No global queue, no central lock, just one CAS at the moment of theft.
The result is load balancing for free, with almost no coordination overhead. The structure was introduced in Java 7 and made famous by Java 8 parallel streams, which use ForkJoinPool.commonPool() by default.
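One way to watch this happen is the common pool's own counters. The workload below is an arbitrary example, and the steal count is an estimate that varies from run to run:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

ForkJoinPool common = ForkJoinPool.commonPool();
System.out.println("parallelism = " + common.getParallelism());

// Some splittable CPU-bound work that runs on the common pool.
long sum = LongStream.rangeClosed(1, 50_000_000).parallel().sum();

// getStealCount() is an estimate of tasks taken from another worker's deque.
System.out.println("sum = " + sum + ", steals so far = " + common.getStealCount());
```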
When it shines
Pure CPU-bound divide-and-conquer with no blocking. Sorting a large array. Summing or reducing over a large collection. Tree traversal with independent branch work. Image processing. Numerical algorithms.
The fork-join model assumes tasks make progress without waiting on external events. When tasks block on I/O, the pool's thread count becomes the effective I/O concurrency limit, and most threads sit idle while blocked. With 8 workers and calls that block for 200 ms, for example, throughput caps at roughly 40 calls per second no matter how much work is queued.
When to avoid
Anything I/O-bound. Anything with significant lock contention between tasks. Anything where the per-task overhead is comparable to or larger than the actual work. Anything with strict ordering requirements where parallelism does not help.
For I/O-bound work, use a dedicated ExecutorService sized for concurrency (often 50-100 threads or more), or use virtual threads in Java 21+, or use CompletableFuture chaining on a non-blocking I/O library.
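A sketch of the dedicated-executor route, assuming the same ids, callApi, and Result placeholders used in the snippets further down; the 100-thread size is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sized for I/O concurrency, not core count: these threads mostly sit in blocked waits.
ExecutorService ioPool = Executors.newFixedThreadPool(100);
// Java 21+ alternative: ExecutorService ioPool = Executors.newVirtualThreadPerTaskExecutor();
try {
    List<Future<Result>> futures = new ArrayList<>();
    for (var id : ids) {
        futures.add(ioPool.submit(() -> callApi(id)));   // each blocking call gets its own thread
    }
    List<Result> results = new ArrayList<>();
    for (Future<Result> f : futures) {
        results.add(f.get());   // get() throws checked InterruptedException/ExecutionException
    }
} finally {
    ioPool.shutdown();
}
```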
The common pool trap
Stream.parallel() uses the common pool. So does any code that calls ForkJoinPool.commonPool(). So does CompletableFuture.runAsync without an explicit executor.
If one of those callers blocks (sleeps, waits on a lock, calls a slow API), the common pool's threads are tied up. Every other parallel stream in the JVM slows down or stops. This is one of the most common production surprises with parallel streams.
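A small, self-contained demonstration of the effect (the sleep durations and stream sizes are arbitrary, and exact timings vary by machine): while the sleeping tasks hold the common pool's workers, the unrelated CPU-bound stream is computed essentially single-threaded by the calling thread.

```java
import java.util.stream.IntStream;
import java.util.stream.LongStream;

public class CommonPoolStarvationDemo {
    public static void main(String[] args) throws InterruptedException {
        // 64 tasks that each block for 2 seconds, run as a parallel stream on the common pool.
        Thread blocker = new Thread(() ->
                IntStream.range(0, 64).parallel().forEach(i -> sleepMillis(2_000)));
        blocker.start();
        Thread.sleep(100);   // give the blocking tasks time to occupy the common pool workers

        // An unrelated CPU-bound parallel stream now finds no free workers and is
        // effectively executed single-threaded by the calling (main) thread.
        long t0 = System.nanoTime();
        long sum = LongStream.rangeClosed(1, 200_000_000).parallel().sum();
        System.out.printf("sum=%d in %d ms%n", sum, (System.nanoTime() - t0) / 1_000_000);

        blocker.join();
    }

    private static void sleepMillis(long ms) {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```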
The fix: never block in code that runs on the common pool. When unavoidable, submit to a dedicated ForkJoinPool, or wrap the block in ManagedBlocker so the pool can compensate.
Choosing a threshold
Splitting has a cost: fork, queue, join. Splitting too aggressively makes overhead exceed the speedup. Splitting too little leaves cores idle. The sweet spot depends on the per-element cost.
For a cheap operation (adding longs), 10,000 elements per leaf is reasonable. For an expensive one (parsing a JSON document per element), even 50 elements per leaf may be too many. Aim for a total number of leaves in the range of 10x-100x the core count, then profile.
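One rough starting point, treating the 32-leaves-per-core factor as an arbitrary pick inside that 10x-100x band (array stands in for whatever data is being split):

```java
import java.util.concurrent.ForkJoinPool;

// Target roughly 32 leaves per core, and never let a leaf shrink below one element.
int cores = ForkJoinPool.getCommonPoolParallelism();
int threshold = Math.max(1, array.length / (cores * 32));
```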
A note on parallel streams
Parallel streams are the easy mode for fork-join. Same stream pipeline, add .parallel(), and the runtime splits and combines. Most of the time it works.
When it does not: blocking inside the pipeline (don't), unbalanced split (some sources split worse than others, ArrayList splits well, LinkedList does not, generators split badly), or pipelines where ordering matters and the parallel order machinery dominates.
For most application code, sequential streams are fine. Reach for parallel only after measuring a real speedup on a real workload.
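For a quick smoke test before reaching for a proper harness such as JMH, something along these lines gives a first impression (the workload here is made up; beware of JIT warm-up skewing a single run):

```java
import java.util.stream.LongStream;

long t0 = System.nanoTime();
long seq = LongStream.rangeClosed(1, 100_000_000).map(i -> i * i % 7).sum();
long t1 = System.nanoTime();
long par = LongStream.rangeClosed(1, 100_000_000).parallel().map(i -> i * i % 7).sum();
long t2 = System.nanoTime();

System.out.printf("sequential %d ms, parallel %d ms (same result: %b)%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, seq == par);
```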
Primitives by language
- ForkJoinPool / ForkJoinPool.commonPool()
- RecursiveTask<V> (returns a value)
- RecursiveAction (no return value)
- Stream.parallel() / Collection.parallelStream()
- ManagedBlocker (escape hatch for blocking inside fork-join tasks)
Implementation
The split threshold matters. Too small: scheduling overhead dominates. Too large: parallelism is wasted. A few thousand elements per leaf is typical for cheap-per-element work; lower for expensive work.
```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000;
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();               // schedule left
        long r = right.compute();  // run right inline
        long l = left.join();      // wait for left
        return l + r;
    }
}

// Usage:
long total = ForkJoinPool.commonPool().invoke(new SumTask(arr, 0, arr.length));
```

Most fork-join use in practice is hidden behind parallel streams. Splitting and combining is automatic. The catch: it always uses the common pool, so blocking work in a parallel stream starves every other parallel stream and any caller using commonPool.
```java
List<Order> orders = loadOrders();

// Parallel reduction
BigDecimal total = orders.parallelStream()
        .map(Order::amount)
        .reduce(BigDecimal.ZERO, BigDecimal::add);

// Parallel filter + collect (THRESHOLD here is a BigDecimal cutoff for "large" orders,
// not the split threshold from SumTask above)
List<Order> large = orders.parallelStream()
        .filter(o -> o.amount().compareTo(THRESHOLD) > 0)
        .toList();

// BAD: blocking I/O in a parallel stream uses commonPool and can starve others
// List<Result> results = ids.parallelStream().map(this::callApi).toList();
```

To keep a parallel stream off the common pool, submit it to a dedicated ForkJoinPool. Each pool has its own threads and deques. Important when one workload has different latency characteristics than the rest of the JVM.
```java
ForkJoinPool pool = new ForkJoinPool(8);

try {
    // A parallel stream started inside a task submitted to this pool runs on this pool's
    // workers instead of the common pool. Note: get() throws the checked
    // InterruptedException and ExecutionException, which the caller must handle.
    List<Result> results = pool.submit(() ->
        ids.parallelStream().map(this::callApi).toList()
    ).get();
} finally {
    pool.shutdown();
}
```

When blocking inside a fork-join task is unavoidable (e.g., waiting on a CompletableFuture), wrap the block in ManagedBlocker. The pool spawns a compensation thread so other tasks keep making progress. This is the escape hatch, and also a sign that fork-join may be the wrong tool.
```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinPool.ManagedBlocker;

class FutureBlocker<T> implements ManagedBlocker {
    private final CompletableFuture<T> future;
    private T result;
    private boolean done = false;

    FutureBlocker(CompletableFuture<T> future) { this.future = future; }

    @Override public boolean block() throws InterruptedException {
        try { result = future.get(); }                         // blocks until complete
        catch (ExecutionException e) { throw new RuntimeException(e.getCause()); }
        done = true;
        return true;
    }

    @Override public boolean isReleasable() {
        if (done) return true;
        if (!future.isDone()) return false;
        // The future completed before we had to block: capture the value here, because
        // managedBlock may skip block() entirely once this returns true.
        try { result = future.join(); }
        catch (CompletionException e) { throw new RuntimeException(e.getCause()); }
        done = true;
        return true;
    }

    public T result() { return result; }
}

// Usage inside a fork-join task:
FutureBlocker<String> b = new FutureBlocker<>(slowFuture);
ForkJoinPool.managedBlock(b);   // the pool may add a compensation thread while we wait
String s = b.result();
```

Key points
- Work stealing: each worker has its own deque; when idle, it steals from another worker's deque tail.
- Designed for divide-and-conquer with no blocking. Tasks split until small, compute, combine.
- The common pool is shared across the JVM. Parallel streams use it. Blocking inside it starves everyone.
- Default parallelism = available cores - 1. Override with -Djava.util.concurrent.ForkJoinPool.common.parallelism.
- For CPU-bound recursive work, fork-join can outperform a fixed thread pool. For I/O, use a dedicated executor.
Follow-up questions
- Why is the common pool shared across the JVM?
- When to avoid parallel streams?
- What does work stealing buy?
- How big should THRESHOLD be?
Gotchas
- Blocking inside the common pool starves every other parallel stream caller.
- Recursive tasks that share mutable state defeat the model. Compose pure functions.
- Parallel stream parallelism = common pool size, NOT the number of stream sources.
- Calling fork() and then immediately join() on the same task adds queueing overhead for no gain over plain compute(); fork-join shines when one half is forked while the other runs inline.
- Setting common.parallelism too high (> cores) wastes context switches; too low (= 1) defeats the point.