Parallel MapReduce: Word Frequency Counter
Map: each worker processes a chunk of input → partial counts. Reduce: merge the partial counts into the final result. The map step is embarrassingly parallel; the reduce step is the synchronization point. Java's parallelStream, Python's multiprocessing.Pool.map, and Go's errgroup all express this.
The MapReduce template
1. Map: split input into chunks; each worker processes its chunk → partial result
2. Reduce: merge partial results into final answer
Embarrassingly parallel because the map step has no inter-worker communication. The reduce step is the synchronization point.
When the problem reads "count Y across X", think MapReduce: word counts, log entries, distinct visitors, sums by category. The pattern: per-chunk count, then merge. It generalizes from in-process (parallelStream) to distributed (Hadoop, Spark, BigQuery).
Per-worker accumulator vs shared state
Don't share the result map. Tempting: have all workers write to one ConcurrentHashMap. It looks simple, but lock contention dominates and the parallelism is wasted.
Right pattern: each worker has its OWN local accumulator. No locking. The final reduce merges, as sketched below. This is THE pattern for "concurrent computation that builds up state."
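A minimal sketch of that shape with a plain ExecutorService, assuming the input is already split into chunks (ChunkedWordCount and count are illustrative names, not a fixed API):

```java
import java.util.*;
import java.util.concurrent.*;

class ChunkedWordCount {
    // Each task builds its OWN HashMap (no locks); the caller does the reduce.
    static Map<String, Long> count(List<List<String>> chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<Map<String, Long>>> tasks = new ArrayList<>();
            for (List<String> chunk : chunks) {
                tasks.add(() -> {
                    Map<String, Long> local = new HashMap<>(); // per-worker accumulator
                    for (String line : chunk)
                        for (String w : line.split("\\s+"))
                            if (!w.isBlank()) local.merge(w.toLowerCase(), 1L, Long::sum);
                    return local; // partial result, never shared while being built
                });
            }
            Map<String, Long> total = new HashMap<>();
            for (Future<Map<String, Long>> f : pool.invokeAll(tasks))
                f.get().forEach((k, v) -> total.merge(k, v, Long::sum)); // sequential reduce
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```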
Same idea as LongAdder (per-thread shards + sum on read).
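For a single counter rather than a map, java.util.concurrent.atomic.LongAdder bakes the same idea in:

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

LongAdder hits = new LongAdder();
IntStream.range(0, 1_000_000).parallel()
    .forEach(i -> hits.increment()); // threads hit separate cells, no CAS storm
System.out.println(hits.sum());     // 1000000; cells are summed on read
```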
Reduce strategies, ranked by efficiency
| Strategy | When |
|---|---|
| Sequential merge | Few workers, small partials |
| Tree-reduce | Many workers, expensive merge, log(N) depth |
| Concurrent merge target | Last resort, high contention |
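A sketch of the tree-reduce strategy over a list of partial maps, worthwhile when merges are expensive (TreeReduce is an illustrative name; each round merges disjoint pairs concurrently on the common pool):

```java
import java.util.*;
import java.util.concurrent.*;

class TreeReduce {
    // Merge partials pairwise until one remains: ~log2(N) rounds
    // instead of N-1 strictly sequential merges.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        List<Map<String, Long>> round = partials;
        while (round.size() > 1) {
            List<CompletableFuture<Map<String, Long>>> futures = new ArrayList<>();
            for (int i = 0; i + 1 < round.size(); i += 2) {
                Map<String, Long> a = round.get(i), b = round.get(i + 1);
                futures.add(CompletableFuture.supplyAsync(() -> {
                    b.forEach((k, v) -> a.merge(k, v, Long::sum)); // pair merges run in parallel
                    return a;
                }));
            }
            List<Map<String, Long>> next = new ArrayList<>();
            for (CompletableFuture<Map<String, Long>> f : futures) next.add(f.join());
            if (round.size() % 2 == 1) next.add(round.get(round.size() - 1)); // odd one carries over
            round = next;
        }
        return round.isEmpty() ? new HashMap<>() : round.get(0);
    }
}
```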
When parallel MapReduce isn't worth it
- Tiny inputs: the overhead exceeds the work.
- Reduce dominates: if 90% of the time is in the merge, parallelism in the map doesn't help (see the arithmetic below).
- Memory-bound: partial maps may be huge, and per-worker memory × N workers can OOM.
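To put a number on the second point: by Amdahl's law, speedup = 1 / ((1 - p) + p/N) for parallel fraction p and N workers. With 90% of the time in a serial merge, p = 0.1, so the speedup tops out at 1/0.9 ≈ 1.11×, no matter how many workers the map step gets.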
The interview signal
Junior: "I'd use parallelStream." (Fine, but doesn't show depth.)
Senior: "Map step parallelizes embarrassingly; reduce is the synchronization point. Per-worker accumulators avoid contention. Reduce can tree-reduce for log(N) depth. Generalizes to Hadoop/Spark, same pattern, just at machine scale instead of thread scale."
Implementations
parallelStream().collect(groupingBy(...)) runs the map step in parallel using ForkJoinPool.commonPool(). For most cases this is the right answer; don't hand-roll a replacement.
```java
import java.util.*;
import java.util.stream.Collectors;

// Map: split and normalize words in parallel; groupingBy's collector
// merges the per-thread maps during the reduce. groupingByConcurrent
// would collect into one ConcurrentHashMap instead; measure before preferring it.
Map<String, Long> wordCount = lines.parallelStream()
    .flatMap(line -> Arrays.stream(line.split("\\s+")))
    .filter(w -> !w.isBlank())
    .map(String::toLowerCase)
    .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
```

For more control: split the input, fork map tasks, merge results. Useful when a custom reducer or non-stream input is needed.
```java
import java.util.*;
import java.util.concurrent.*;

class WordCountTask extends RecursiveTask<Map<String, Long>> {
    private final List<String> lines;
    private final int lo, hi;
    private static final int THRESHOLD = 1000; // below this, just count sequentially

    public WordCountTask(List<String> lines, int lo, int hi) {
        this.lines = lines; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Map<String, Long> compute() {
        if (hi - lo < THRESHOLD) {
            return countSequential(lines.subList(lo, hi));
        }
        int mid = (lo + hi) / 2;
        WordCountTask leftTask = new WordCountTask(lines, lo, mid);
        WordCountTask rightTask = new WordCountTask(lines, mid, hi);
        leftTask.fork();                               // left half runs asynchronously
        Map<String, Long> right = rightTask.compute(); // right half runs in this thread
        Map<String, Long> left = leftTask.join();
        return mergeMaps(left, right);                 // merges form a tree-reduce
    }

    private Map<String, Long> countSequential(List<String> chunk) {
        Map<String, Long> m = new HashMap<>(); // per-task accumulator, no locking
        for (String line : chunk) {
            for (String word : line.split("\\s+")) {
                m.merge(word.toLowerCase(), 1L, Long::sum);
            }
        }
        return m;
    }

    private Map<String, Long> mergeMaps(Map<String, Long> a, Map<String, Long> b) {
        b.forEach((k, v) -> a.merge(k, v, Long::sum));
        return a;
    }
}
// Usage: ForkJoinPool.commonPool().invoke(new WordCountTask(lines, 0, lines.size()))
```

Key points
- Map: chunk input; each worker computes its partial result independently
- Reduce: merge partial results; the merge must be associative (and ideally commutative)
- Embarrassingly parallel: no inter-worker communication during map
- Reduce strategies: tree-reduce (parallel), sequential merge, ConcurrentHashMap target
- Foundation of distributed computing: Hadoop, Spark, and BigQuery all extend this
Follow-up questions
- Why a per-worker map instead of a shared ConcurrentHashMap?
- How does this connect to Hadoop / Spark?
- What's a tree-reduce?
- Why must reduce be associative?
Gotchas
- Mutable accumulators (like StringBuilder) need synchronization or per-worker copies
- parallelStream uses the common ForkJoinPool; heavy use starves other parallelStream callers
- The reduce step can be the bottleneck if partial maps are huge; consider a tree-reduce
- Floating-point addition is not strictly associative, so a parallel reduce can give slightly different results than a sequential one (demonstrated below)
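A quick demonstration of the last point:

```java
double a = 0.1, b = 0.2, c = 0.3;
System.out.println((a + b) + c); // 0.6000000000000001
System.out.println(a + (b + c)); // 0.6
// The grouping changes the rounding, so a parallel sum over doubles
// can differ from the sequential sum depending on how chunks merge.
```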