Parallel Merge Sort & ForkJoin
Divide-and-conquer naturally parallelizes: the recursive halves can run on different cores. Java's ForkJoinPool is purpose-built for this; Go uses goroutines plus WaitGroups; Python uses concurrent.futures. A threshold check stops parallelism for small subarrays, where fork overhead exceeds the parallelism win.
Why merge sort parallelizes well
The recursive structure: sort(arr) = merge(sort(left), sort(right)). The two recursive calls are independent, so they are perfect candidates for parallel execution. Only the merge step is sequential.
Amdahl's Law in action: the sequential merge limits speedup. Even with infinite cores, the sequential merge sets a floor. For a 1M-element array, the top-level merge is ~1M comparisons done by ONE core. That dominates.
Speedup formula: 1 / (s + (1 - s)/N), where s is the sequential fraction and N is the core count. For merge sort, s ≈ 1/log2(n). At n = 1M, log2(n) ≈ 20, so s ≈ 5%. Theoretical max speedup ≈ 20x, not unbounded.
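The bound above can be checked numerically. A minimal sketch (the class and method names are illustrative, not from any library):

```java
public class AmdahlBound {
    // speedup = 1 / (s + (1 - s) / cores), where s = sequential fraction
    static double speedup(double s, double cores) {
        return 1.0 / (s + (1.0 - s) / cores);
    }

    public static void main(String[] args) {
        double s = 0.05; // sequential fraction at n = 1M (~1/log2 n)
        System.out.printf("8 cores:  %.2fx%n", speedup(s, 8));
        System.out.printf("infinite: %.2fx%n", speedup(s, Double.POSITIVE_INFINITY));
    }
}
```

Note that even 8 cores buy only about a 5.9x speedup at s = 5%, and no number of cores exceeds 20x.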
Why ForkJoinPool over ExecutorService
ForkJoin's work-stealing matches D&C perfectly:
- Each worker has its own deque.
- Idle workers steal tasks from busy workers (LIFO from owner, FIFO from stealer).
- Recursive splits create many small tasks; work-stealing balances them.
A regular ThreadPoolExecutor uses one central queue, which becomes a serialization bottleneck for fine-grained tasks.
When parallel sort is NOT worth it
- Small arrays: fork overhead wastes time. Stick to Arrays.sort (Java); Arrays.parallelSort already falls back to sequential below ~8K elements anyway.
- Already cache-friendly: sequential merge sort gets great cache locality; parallel execution can thrash caches across cores.
- Memory-bound: when the array is too big for L3 cache, the workload is memory-bound, not CPU-bound, so parallelism helps less.
Java 8+ shortcut
Arrays.parallelSort(array) handles the parallelization automatically. It switches to parallel for arrays larger than 8192 elements and runs on ForkJoinPool.commonPool(). For most production code, just call this and skip the hand-rolled version.
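A minimal usage sketch (the class name and the random test data are illustrative):

```java
import java.util.Arrays;
import java.util.Random;

public class ParallelSortDemo {
    // Generate n random ints and sort them with the JDK's parallel sort.
    static int[] sortedRandom(int n, long seed) {
        int[] a = new Random(seed).ints(n).toArray();
        Arrays.parallelSort(a); // runs on ForkJoinPool.commonPool() above ~8K elements
        return a;
    }

    public static void main(String[] args) {
        int[] a = sortedRandom(100_000, 42);
        System.out.println("sorted: " + (a[0] <= a[a.length - 1]));
    }
}
```

There are also range overloads, e.g. Arrays.parallelSort(a, fromIndex, toIndex), mirroring Arrays.sort.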
Implementations
RecursiveAction for void; RecursiveTask for results. compute() decides: small enough → solve directly; else split + fork + join. ForkJoinPool's work-stealing keeps cores busy.
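Before the merge-sort listing (which uses RecursiveAction), here is a minimal RecursiveTask sketch, a value-returning parallel sum; the class name and threshold are illustrative. It shows the fork/compute/join pattern that work-stealing balances:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // tune like the sort threshold
    private final long[] arr;
    private final int lo, hi;

    ParallelSum(long[] arr, int lo, int hi) {
        this.arr = arr; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo < THRESHOLD) {       // small enough: solve directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;    // split
        ParallelSum left = new ParallelSum(arr, lo, mid);
        ParallelSum right = new ParallelSum(arr, mid, hi);
        left.fork();                     // push left half onto this worker's deque (stealable)
        long rightSum = right.compute(); // work on the right half in the current thread
        return left.join() + rightSum;   // join the (possibly stolen) left half
    }

    static long sum(long[] arr) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(arr, 0, arr.length));
    }
}
```

Computing one half directly instead of forking both halves is the idiomatic pattern: it keeps the current worker busy rather than parking it on a join.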
```java
import java.util.Arrays;
import java.util.concurrent.*;

class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 1000;
    private final int[] arr;
    private final int lo, hi;

    public ParallelMergeSort(int[] arr, int lo, int hi) {
        this.arr = arr; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < THRESHOLD) {
            Arrays.sort(arr, lo, hi); // sequential for small subarrays
            return;
        }
        int mid = lo + (hi - lo) / 2; // avoids int overflow vs (lo + hi) / 2
        invokeAll(
            new ParallelMergeSort(arr, lo, mid),
            new ParallelMergeSort(arr, mid, hi));
        merge(arr, lo, mid, hi); // sequential merge
    }

    private void merge(int[] a, int lo, int mid, int hi) {
        int[] tmp = new int[hi - lo];
        int i = lo, j = mid, k = 0;
        while (i < mid && j < hi) {
            tmp[k++] = a[i] <= a[j] ? a[i++] : a[j++];
        }
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi) tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, lo, tmp.length);
    }
}

// Usage:
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new ParallelMergeSort(array, 0, array.length));
```
Key points
- Divide-and-conquer = natural parallelism (each half is independent)
- Threshold: stop forking when subarray < N (≈ 1000 elements)
- Java ForkJoinPool: work-stealing scheduler optimized for D&C
- Sequential merge step is the unavoidable bottleneck (Amdahl's law)
- Speedup is logarithmic, not linear: 8 cores ≠ 8x speedup
Follow-up questions
▸ Why a threshold? Why not parallelize all the way down?
▸ What's the theoretical speedup of parallel merge sort?
▸ Why ForkJoinPool over regular ExecutorService?
▸ Can the merge step itself be parallelized?
Gotchas
- ! ForkJoinPool.commonPool() is shared; heavy use starves parallel streams
- ! ProcessPoolExecutor (Python) pickles arguments; large arrays have IPC overhead
- ! Threshold too low → fork overhead dominates
- ! Threshold too high → underutilization on big arrays
- !Parallel merge sort has O(n) extra memory for the merge buffer