Parallel Merge Sort & ForkJoin
Divide-and-conquer naturally parallelizes: the recursive halves can run on different cores. Java's ForkJoinPool is purpose-built for this; Go uses goroutines plus WaitGroups; Python uses concurrent.futures. A threshold check stops parallelism for small subarrays, where fork overhead exceeds the parallelism win.
Why merge sort parallelizes well
The recursive structure: sort(arr) = merge(sort(left), sort(right)). The two recursive calls are independent, so they are perfect candidates for parallel execution. Only the merge step is sequential.
Amdahl's Law in action: the sequential merge limits speedup. Even with infinite cores, the sequential merge sets a floor. For a 1M-element array, the top-level merge is ~1M comparisons done by ONE core. That dominates.
Speedup formula: 1 / (s + (1 - s)/N), where s is the sequential fraction and N is the core count. For merge sort, s ≈ 1/log2(n). At n = 1M, log2(n) ≈ 20, so s ≈ 5%. Theoretical max speedup ≈ 20x, not unbounded.
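The bound above can be checked numerically. A minimal sketch (the class and method names are illustrative, not from any library):

```java
public class AmdahlBound {
    // speedup = 1 / (s + (1 - s) / cores), where s = sequential fraction
    static double speedup(double s, double cores) {
        return 1.0 / (s + (1.0 - s) / cores);
    }

    public static void main(String[] args) {
        double s = 0.05; // sequential fraction at n = 1M (~1/log2 n)
        System.out.printf("8 cores:  %.2fx%n", speedup(s, 8));
        System.out.printf("infinite: %.2fx%n", speedup(s, Double.POSITIVE_INFINITY));
    }
}
```

Note that even 8 cores buy only about a 5.9x speedup at s = 5%, and no number of cores exceeds 20x.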
Why ForkJoinPool over ExecutorService
ForkJoin's work-stealing matches D&C perfectly:
- Each worker has its own deque.
- Idle workers steal tasks from busy workers (LIFO from owner, FIFO from stealer).
- Recursive splits create many small tasks; work-stealing balances them.
A regular ThreadPoolExecutor uses one central queue, which becomes a serialization bottleneck for fine-grained tasks.
When parallel sort is NOT worth it
- Small arrays: fork overhead wastes time. Stick to Arrays.sort (Java); Arrays.parallelSort already falls back to sequential below ~8K elements anyway.
- Already cache-friendly: sequential merge sort gets great cache locality; parallel execution can thrash caches across cores.
- Memory-bound: when the array is too big for L3 cache, the workload is memory-bound, not CPU-bound, so parallelism helps less.
Java 8+ shortcut
Arrays.parallelSort(array) handles the parallelization automatically. It switches to parallel for arrays larger than 8192 elements and runs on ForkJoinPool.commonPool(). For most production code, just call this and skip the hand-rolled version.
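A minimal usage sketch (the class name and the random test data are illustrative):

```java
import java.util.Arrays;
import java.util.Random;

public class ParallelSortDemo {
    // Generate n random ints and sort them with the JDK's parallel sort.
    static int[] sortedRandom(int n, long seed) {
        int[] a = new Random(seed).ints(n).toArray();
        Arrays.parallelSort(a); // runs on ForkJoinPool.commonPool() above ~8K elements
        return a;
    }

    public static void main(String[] args) {
        int[] a = sortedRandom(100_000, 42);
        System.out.println("sorted: " + (a[0] <= a[a.length - 1]));
    }
}
```

There are also range overloads, e.g. Arrays.parallelSort(a, fromIndex, toIndex), mirroring Arrays.sort.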
Implementations
RecursiveAction for void; RecursiveTask for results. compute() decides: small enough → solve directly; else split + fork + join. ForkJoinPool's work-stealing keeps cores busy.
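Before the merge-sort listing (which uses RecursiveAction), here is a minimal RecursiveTask sketch, a value-returning parallel sum; the class name and threshold are illustrative. It shows the fork/compute/join pattern that work-stealing balances:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // tune like the sort threshold
    private final long[] arr;
    private final int lo, hi;

    ParallelSum(long[] arr, int lo, int hi) {
        this.arr = arr; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo < THRESHOLD) {       // small enough: solve directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;    // split
        ParallelSum left = new ParallelSum(arr, lo, mid);
        ParallelSum right = new ParallelSum(arr, mid, hi);
        left.fork();                     // push left half onto this worker's deque (stealable)
        long rightSum = right.compute(); // work on the right half in the current thread
        return left.join() + rightSum;   // join the (possibly stolen) left half
    }

    static long sum(long[] arr) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(arr, 0, arr.length));
    }
}
```

Computing one half directly instead of forking both halves is the idiomatic pattern: it keeps the current worker busy rather than parking it on a join.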
```java
import java.util.Arrays;
import java.util.concurrent.*;

class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 1000;
    private final int[] arr;
    private final int lo, hi;

    public ParallelMergeSort(int[] arr, int lo, int hi) {
        this.arr = arr; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < THRESHOLD) {
            Arrays.sort(arr, lo, hi); // sequential for small subarrays
            return;
        }
        int mid = lo + (hi - lo) / 2; // avoids int overflow vs (lo + hi) / 2
        invokeAll(
            new ParallelMergeSort(arr, lo, mid),
            new ParallelMergeSort(arr, mid, hi));
        merge(arr, lo, mid, hi); // sequential merge
    }

    private void merge(int[] a, int lo, int mid, int hi) {
        int[] tmp = new int[hi - lo];
        int i = lo, j = mid, k = 0;
        while (i < mid && j < hi) {
            tmp[k++] = a[i] <= a[j] ? a[i++] : a[j++];
        }
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi) tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, lo, tmp.length);
    }
}

// Usage:
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new ParallelMergeSort(array, 0, array.length));
```
Key points
- Divide-and-conquer = natural parallelism (each half is independent)
- Threshold: stop forking when subarray < N (≈ 1000 elements)
- Java ForkJoinPool: work-stealing scheduler optimized for D&C
- Sequential merge step is the unavoidable bottleneck (Amdahl's law)
- Speedup is logarithmic, not linear: 8 cores ≠ 8x speedup
Follow-up questions
▸ Why a threshold? Why not parallelize all the way down?
▸ What's the theoretical speedup of parallel merge sort?
▸ Why ForkJoinPool over regular ExecutorService?
▸ Can the merge step itself be parallelized?
Gotchas
- ! ForkJoinPool.commonPool() is shared; heavy use starves parallel streams
- ! ProcessPoolExecutor (Python) pickles arguments; large arrays have IPC overhead
- ! Threshold too low → fork overhead dominates
- ! Threshold too high → underutilization on big arrays
- !Parallel merge sort has O(n) extra memory for the merge buffer