Polite Concurrent Web Crawler
Crawl 1M URLs concurrently while (1) bounding total in-flight requests, (2) deduplicating visited URLs, and (3) respecting per-domain politeness (at most N concurrent requests per host). Building blocks: a bounded worker pool, a concurrent set, and per-domain semaphores.
The problem
Crawl a website (or set of websites): start from one URL, follow links, and fetch every reachable page within scope. Production constraints:
- Bounded total concurrency: don't open 10,000 sockets at once.
- Deduplication: fetch each URL exactly once, even if it's discovered from many pages.
- Politeness: limit concurrent requests per host to avoid effectively DDoSing it.
- Termination: stop cleanly when there are no more URLs to crawl.
What's actually being tested
Combining four concurrent primitives in one design: worker pool + concurrent set + per-key semaphores + termination detection. Each is a separate problem; getting them to compose cleanly is the senior signal.
The decomposition
Four concerns, four primitives
- Discovery → fetch: producer–consumer. Discovered links go into a queue; workers consume it.
- Bounded concurrency: a fixed-size worker pool or a counting semaphore.
- Deduplication: a concurrent set (ConcurrentHashMap.newKeySet() in Java, sync.Map in Go, a plain set under asyncio's single thread).
- Politeness: a per-host semaphore, lazily created when a host is first encountered (see the field sketch below).
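In Java these map directly onto standard-library types. A minimal sketch of the state such a crawler might hold (class and field names, and the pool size, are illustrative):

```java
import java.util.Set;
import java.util.concurrent.*;

// One field per concern; sizes and names are illustrative, not prescriptive.
class CrawlerState {
    final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();  // discovery -> fetch queue
    final ExecutorService workers = Executors.newFixedThreadPool(64);    // bounded total concurrency
    final Set<String> visited = ConcurrentHashMap.newKeySet();           // dedup via atomic add()
    final ConcurrentHashMap<String, Semaphore> perHost =
            new ConcurrentHashMap<>();                                   // politeness, lazily populated
}
```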
Real-world considerations
What production crawlers actually do
- Distributed crawling: shard URLs by host across machines (consistent hashing). Avoids cross-machine politeness coordination.
- Frontier: priority queue, not FIFO. Higher-priority URLs (newly seeded, recently changed) crawled first.
- Backoff: 429 / 503 → exponential backoff per host.
- Persistent visited set: a Bloom filter (memory-efficient) plus Redis (for exactness).
- Polite delays: enforce a minimum time between requests to the same host, not just a concurrency cap (see the sketch after this list).
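A minimal sketch of such a per-host delay gate (the class, its fields, and awaitTurn are illustrative names; a worker would call awaitTurn(host) before each fetch). A 429/503 handler could push the host's next-allowed time further out to get exponential backoff:

```java
import java.util.concurrent.ConcurrentHashMap;

// Blocks a caller until at least minDelayMillis have passed since the
// previous request to the same host. Sketch only; no jitter or backoff.
class HostDelayGate {
    private final long minDelayMillis;
    private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    HostDelayGate(long minDelayMillis) { this.minDelayMillis = minDelayMillis; }

    void awaitTurn(String host) throws InterruptedException {
        while (true) {
            long now = System.currentTimeMillis();
            Long prev = nextAllowed.putIfAbsent(host, now + minDelayMillis);
            if (prev == null) return;                          // first request to this host
            if (prev <= now) {
                // Slot is free: claim it by atomically advancing the host's next-allowed time.
                if (nextAllowed.replace(host, prev, now + minDelayMillis)) return;
            } else {
                Thread.sleep(prev - now);                      // wait until the host's slot opens
            }
        }
    }
}
```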
Where the LeetCode version diverges from production
LC 1242 (Web Crawler Multithreaded) is the simplified version: crawl every URL under the start URL's hostname, with no per-host limits and no robots.txt. The senior interview version asks about politeness, distribution, and frontier prioritization. Lead with the simple version, then expand with "in production I'd add..."
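For reference, a minimal sketch of that simplified version, assuming LC 1242's HtmlParser callback (getUrls fetches a page and returns its links); it stays on the start URL's hostname and has no politeness controls:

```java
import java.net.URI;
import java.util.*;
import java.util.concurrent.*;

class SimpleCrawler {
    // Callback in the style of LC 1242: fetches a page and returns the links on it.
    interface HtmlParser { List<String> getUrls(String url); }

    public List<String> crawl(String startUrl, HtmlParser parser) throws Exception {
        String host = URI.create(startUrl).getHost();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        visited.add(startUrl);
        ExecutorService pool = Executors.newFixedThreadPool(8);   // pool size is illustrative
        Deque<Future<List<String>>> pending = new ArrayDeque<>();
        pending.add(pool.submit(() -> parser.getUrls(startUrl)));
        while (!pending.isEmpty()) {
            // Each future yields one page's links; unseen same-host links get submitted in turn.
            for (String link : pending.poll().get()) {
                if (host.equals(URI.create(link).getHost()) && visited.add(link)) {
                    pending.add(pool.submit(() -> parser.getUrls(link)));
                }
            }
        }
        pool.shutdown();
        return new ArrayList<>(visited);
    }
}
```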
Implementations
The visited set is a concurrent hash set, a worker pool processes URLs from a shared queue, and per-host semaphores enforce politeness. A counter tracks in-flight workers; combined with an empty queue, it detects termination.
```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class WebCrawler {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, Semaphore> hostLimits = new ConcurrentHashMap<>();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger inFlight = new AtomicInteger();   // workers currently fetching
    private final ExecutorService pool;
    private final int perHostMax;

    public WebCrawler(int workers, int perHostMax) {
        this.pool = Executors.newFixedThreadPool(workers);
        this.perHostMax = perHostMax;
    }

    public void crawl(String startUrl) throws InterruptedException {
        visited.add(startUrl);
        queue.offer(startUrl);
        // Terminate only when the queue is empty AND no worker is in flight:
        // an in-flight worker may still discover and enqueue new links.
        while (inFlight.get() > 0 || !queue.isEmpty()) {
            String url = queue.poll(100, TimeUnit.MILLISECONDS);
            if (url == null) continue;
            inFlight.incrementAndGet();
            pool.submit(() -> fetchAndDiscover(url));
        }
        pool.shutdown();
    }

    private void fetchAndDiscover(String url) {
        try {
            String host = URI.create(url).getHost();
            Semaphore hostLim = hostLimits.computeIfAbsent(host, h -> new Semaphore(perHostMax));
            hostLim.acquire();                 // politeness: at most perHostMax concurrent per host
            try {
                String html = fetch(url);
                for (String link : extractLinks(html)) {
                    if (visited.add(link)) {   // atomic check-and-add: only the first discoverer enqueues
                        queue.offer(link);
                    }
                }
            } finally {
                hostLim.release();
            }
        } catch (Exception e) {
            System.err.println("fetch failed: " + url + ": " + e);   // real code: log and back off
        } finally {
            inFlight.decrementAndGet();        // must always run, or crawl() never terminates
        }
    }

    // Placeholders: a real crawler plugs in an HTTP client and an HTML parser here.
    private String fetch(String url) throws Exception { return ""; }
    private Iterable<String> extractLinks(String html) { return java.util.List.of(); }
}
```

Key points
- Worker pool with N workers fetches from a shared work queue
- Concurrent set (visited) prevents fetching the same URL twice
- Per-domain semaphore caps requests per host (politeness)
- BFS-style traversal: each fetched page enqueues newly discovered links
- Termination: queue empty AND no in-flight workers
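Usage is a one-liner once fetch and extractLinks are wired to a real HTTP client and HTML parser; the sizing and URL below are illustrative:

```java
public class Main {
    public static void main(String[] args) throws Exception {
        // 64 total workers, at most 2 concurrent requests per host (illustrative numbers).
        WebCrawler crawler = new WebCrawler(64, 2);
        crawler.crawl("https://example.com");
    }
}
```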