Parallel Graph Traversal (BFS/DFS)
Parallel BFS: process each level in parallel; frontier-to-frontier transitions stay serial (one barrier per level). Parallel DFS: harder due to deep recursion; usually done with a task pool over subgraphs. Either way a concurrent visited set is required: ConcurrentHashMap or a Bloom filter.
When to parallelize graph traversal
- Web crawling: each page is a node; edges are links. Per-level parallelism.
- Social-network queries: "all friends-of-friends" is a two-level BFS, parallelizable.
- Dependency resolution: package managers and build systems do DAG traversal.
- PageRank, GraphX, Pregel: distributed graph-processing frameworks built on parallel BFS-like primitives.
Why BFS, not DFS
BFS has a natural barrier at each level: process all nodes at depth k, then proceed to k+1. The unit of parallelism is "all nodes at the current level." Within a level, each node's neighbor expansion is independent.
DFS has no such barrier: traversal goes as deep as possible, then backtracks. Naive parallelization breaks DFS's ordering invariants. Workaround: spawn a task per subtree and dedupe through a shared visited set.
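The task-per-subtree workaround can be sketched in Java with a ForkJoinPool. This is an illustrative assumption, not a canonical implementation: the `ParallelDFS` class and the adjacency-map representation are hypothetical, and visit order is not serial DFS order; only reachability and dedup are preserved.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: parallel "DFS" as one ForkJoinTask per subtree, deduplicated
// through a shared concurrent visited set. Names here are illustrative.
class ParallelDFS extends RecursiveAction {
    private final Map<Integer, List<Integer>> adj; // assumed adjacency-list form
    private final int node;
    private final Set<Integer> visited;

    ParallelDFS(Map<Integer, List<Integer>> adj, int node, Set<Integer> visited) {
        this.adj = adj;
        this.node = node;
        this.visited = visited;
    }

    @Override
    protected void compute() {
        List<ParallelDFS> subtasks = new ArrayList<>();
        for (int next : adj.getOrDefault(node, List.of())) {
            if (visited.add(next)) {       // atomic claim: exactly one task descends
                subtasks.add(new ParallelDFS(adj, next, visited));
            }
        }
        invokeAll(subtasks);               // fork one task per unclaimed subtree
    }

    static Set<Integer> reachable(Map<Integer, List<Integer>> adj, int source) {
        Set<Integer> visited = ConcurrentHashMap.newKeySet();
        visited.add(source);
        ForkJoinPool.commonPool().invoke(new ParallelDFS(adj, source, visited));
        return visited;
    }
}
```

The atomic `visited.add` is what makes this safe: two tasks may discover the same node concurrently, but only the one whose `add` returns true recurses into it.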
The synchronization rule
Between levels, ALL workers must finish before the next level starts. Otherwise levels mix: a node at distance 3 might be processed before all distance-2 nodes finish, breaking distance correctness.
The visited-set bottleneck
Discovery rate >> uniqueness rate: most edges lead to already-visited nodes, so the visited set absorbs heavy contention. Choices:
| Set type | Pro | Con |
|---|---|---|
| Synchronized HashSet | Simple | Single lock = bottleneck |
| ConcurrentHashMap.newKeySet() | Atomic add | Memory-heavy |
| sync.Map (Go) | Lock-free reads | Slow if many writes |
| Bloom filter | Constant memory | False positives |
| Sharded sync.Mutex + map | Reduced contention | More complex |
Most production systems use ConcurrentHashMap.newKeySet() (Java) or a sharded sync.Mutex + map (Go).
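The sharded-lock option from the table translates to Java as lock striping; this is a hedged sketch (the `ShardedVisitedSet` class, shard count, and hash-to-shard mapping are assumptions, not a library API):

```java
import java.util.*;
import java.util.concurrent.locks.*;

// Sketch of a sharded visited set: hash each key to one of N independently
// locked HashSets, so threads touching different shards never contend.
// Analogous to the Go "sharded sync.Mutex + map" pattern in the table.
class ShardedVisitedSet {
    private final Set<Integer>[] shards;
    private final ReentrantLock[] locks;

    @SuppressWarnings("unchecked")
    ShardedVisitedSet(int shardCount) {
        shards = new Set[shardCount];
        locks = new ReentrantLock[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = new HashSet<>();
            locks[i] = new ReentrantLock();
        }
    }

    /** Returns true if the node was newly added (first visit). */
    boolean add(int node) {
        int i = Math.floorMod(Integer.hashCode(node), shards.length);
        locks[i].lock();
        try {
            return shards[i].add(node);    // same "true iff new" contract as newKeySet()
        } finally {
            locks[i].unlock();
        }
    }
}
```

The trade-off from the table holds: contention drops roughly by the shard count, at the cost of more moving parts than a single ConcurrentHashMap.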
When NOT to parallelize
- Small graphs: overhead exceeds the work.
- Sparse, high-diameter graphs: the frontier per level is small; parallelism is wasted on synchronization.
- Tree shape: recursion is the natural structure; per-level parallelism adds little.
The interview answer "Parallel BFS: process each level concurrently, synchronize between levels, and dedup with a concurrent visited set. Parallel DFS: harder; use a task pool. The visited set is the hot spot: choose a concurrent map for correctness, a Bloom filter for memory at extreme scale."
Implementations
Each level: process all nodes in parallel, collect next-level nodes, dedupe via the concurrent visited set, then swap in the next frontier. Use ForkJoinPool or parallelStream.
```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Collectors;

class ParallelBFS {
    public Map<Integer, Integer> traverse(Graph graph, int source) {
        Set<Integer> visited = ConcurrentHashMap.newKeySet();
        Map<Integer, Integer> distance = new ConcurrentHashMap<>();
        List<Integer> frontier = new ArrayList<>(List.of(source));
        visited.add(source);
        distance.put(source, 0);
        int level = 0;

        while (!frontier.isEmpty()) {
            final int currentLevel = level;
            // Process current level in parallel; collect next frontier.
            // visited.add returns true only for the first thread to claim a
            // node, so duplicates across the level are dropped atomically.
            List<Integer> nextFrontier = frontier.parallelStream()
                .flatMap(node -> graph.neighbors(node).stream()
                    .filter(neighbor -> visited.add(neighbor))) // atomic add
                .peek(neighbor -> distance.put(neighbor, currentLevel + 1))
                .collect(Collectors.toList());
            frontier = nextFrontier;
            level++;
        }
        return distance;
    }
}
```
Key points
- BFS parallelizes naturally per level: all nodes at level k processed concurrently
- DFS is harder: its recursive nature serializes; use a bounded task pool of subgraphs
- Visited set must be concurrent: CHM, sync.Map, or an atomic bitset
- Frontier transitions are barriers: synchronize all workers between levels
- Used in PageRank, social-graph queries, dependency resolution at scale
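To sanity-check the level-synchronous loop, here is a self-contained variant run on a small diamond graph. A plain adjacency `Map` stands in for the `Graph` type the snippet above assumes; `ParallelBFSDemo` is a hypothetical name for this sketch.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

class ParallelBFSDemo {
    // Same per-level parallel expansion as above, inlined over a plain
    // adjacency map so the example is runnable on its own.
    static Map<Integer, Integer> distances(Map<Integer, List<Integer>> adj, int source) {
        Set<Integer> visited = ConcurrentHashMap.newKeySet();
        Map<Integer, Integer> distance = new ConcurrentHashMap<>();
        visited.add(source);
        distance.put(source, 0);
        List<Integer> frontier = List.of(source);
        int level = 0;
        while (!frontier.isEmpty()) {
            final int d = level + 1;       // distance for everything found this pass
            frontier = frontier.parallelStream()
                    .flatMap(n -> adj.getOrDefault(n, List.of()).stream()
                            .filter(visited::add))      // atomic dedup
                    .peek(n -> distance.put(n, d))
                    .collect(Collectors.toList());
            level++;
        }
        return distance;
    }
}
```

On the diamond 0→{1,2}, 1→{3}, 2→{3}, node 3 is discovered twice in the same level, but only one thread's `visited.add` wins, so it gets distance 2 exactly once.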
Follow-up questions
▸Why is BFS easier to parallelize than DFS?
▸Why does visited set need to be concurrent?
▸When does parallel BFS NOT win?
▸What about Bloom filter for visited set?
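On the Bloom-filter question: one concrete shape it can take is a lock-free bitset with multiple derived hashes. Everything here is an assumption for illustration (the `BloomVisitedSet` class, the two hash mixers, the fixed bit count); real deployments tune the bit count and hash count for a target false-positive rate.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of a lock-free Bloom-filter visited set. add() can return false
// for a never-seen node (false positive), silently pruning that subtree --
// acceptable only when approximate traversal is tolerable.
class BloomVisitedSet {
    private final AtomicLongArray bits;
    private final int mask;                     // bit count is a power of two

    BloomVisitedSet(int log2Bits) {
        bits = new AtomicLongArray(1 << (log2Bits - 6)); // 64 bits per long
        mask = (1 << log2Bits) - 1;
    }

    private boolean setBit(int bit) {
        int idx = bit >>> 6;
        long m = 1L << (bit & 63);
        long prev = bits.getAndAccumulate(idx, m, (a, b) -> a | b); // atomic OR
        return (prev & m) == 0;                 // true iff the bit was newly set
    }

    /** Returns true only if the node was definitely not seen before. */
    boolean add(int node) {
        int h1 = node * 0x9E3779B9;             // two cheap derived hash mixers
        int h2 = Integer.reverse(node) * 0x85EBCA6B;
        boolean fresh1 = setBit(h1 & mask);     // evaluate both so both bits get set
        boolean fresh2 = setBit(h2 & mask);
        return fresh1 || fresh2;                // new iff any bit was previously unset
    }
}
```

Memory is constant (here 64 KiB of bits regardless of graph size), which is the draw at extreme scale; the cost is that a false positive drops a genuinely unvisited node.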
Gotchas
- !If workers don't synchronize between levels, processing crosses levels and BFS distances become wrong
- !The concurrent set's `add` semantics matter: `newKeySet().add()` returns true only if newly added, which is the dedup primitive
- !The level barrier requires ALL workers from level k to finish before level k+1 starts (wg.Wait() in Go)
- !DFS isn't level-based; naive parallelization breaks its ordering
- !Memory-bound when the frontier is huge; bound concurrency to avoid OOM
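One way to bound concurrency for a huge frontier, sketched under an assumption: submitting a parallel stream from inside a dedicated ForkJoinPool makes the stream use that pool's workers. This is widely relied on but not a documented guarantee of the Stream API, so treat it as a pragmatic trick; the `BoundedLevel` name and the toy `n * 2` expansion are placeholders.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

class BoundedLevel {
    // Expand one frontier with a capped worker count instead of the
    // shared common pool, so one traversal can't swamp the JVM.
    static List<Integer> expand(List<Integer> frontier, int workers) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(workers);  // hard cap on parallelism
        try {
            return pool.submit(() ->
                frontier.parallelStream()
                        .map(n -> n * 2)                // stand-in for neighbor expansion
                        .collect(Collectors.toList())
            ).get();                                    // blocks until the level finishes
        } finally {
            pool.shutdown();
        }
    }
}
```

The `get()` doubles as the level barrier, and `collect(toList())` preserves encounter order even on a parallel stream.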