Polite Concurrent Web Crawler
Crawl 1M URLs concurrently while (1) bounding total in-flight requests, (2) deduplicating visited URLs, and (3) respecting per-domain politeness (at most N concurrent requests per host). Building blocks: a bounded worker pool, a concurrent set, and per-domain semaphores.
The problem
Crawl a website (or set of websites): start from one URL, follow links, and fetch every reachable page within scope. Production constraints:
- Bounded total concurrency: don't open 10,000 sockets at once.
- Deduplication: fetch each URL exactly once, even if it's discovered from many pages.
- Politeness: limit concurrent requests per host to avoid effectively DDoSing it.
- Termination: stop cleanly when there are no more URLs to crawl.
What's actually being tested
Combining four concurrent primitives in one design: worker pool + concurrent set + per-key semaphores + termination detection. Each is a separate problem; getting them to compose cleanly is the senior signal.
The decomposition
Four concerns, four primitives
- Discovery → fetch: producer–consumer. Discovered links go into a queue; workers consume it.
- Bounded concurrency: a fixed-size worker pool or a counting semaphore.
- Deduplication: a concurrent set (ConcurrentHashMap.newKeySet() in Java, sync.Map in Go, a plain set under asyncio's single thread).
- Politeness: a per-host semaphore, lazily created when a host is first encountered (see the field sketch below).
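In Java these map directly onto standard-library types. A minimal sketch of the state such a crawler might hold (class and field names, and the pool size, are illustrative):

```java
import java.util.Set;
import java.util.concurrent.*;

// One field per concern; sizes and names are illustrative, not prescriptive.
class CrawlerState {
    final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();  // discovery -> fetch queue
    final ExecutorService workers = Executors.newFixedThreadPool(64);    // bounded total concurrency
    final Set<String> visited = ConcurrentHashMap.newKeySet();           // dedup via atomic add()
    final ConcurrentHashMap<String, Semaphore> perHost =
            new ConcurrentHashMap<>();                                   // politeness, lazily populated
}
```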
Real-world considerations
What production crawlers actually do
- Distributed crawling: shard URLs by host across machines (consistent hashing). Avoids cross-machine politeness coordination.
- Frontier: priority queue, not FIFO. Higher-priority URLs (newly seeded, recently changed) crawled first.
- Backoff: 429 / 503 → exponential backoff per host.
- Persistent visited set: a Bloom filter (memory-efficient) plus Redis (for exactness).
- Polite delays: enforce a minimum time between requests to the same host, not just a concurrency cap (see the sketch after this list).
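A minimal sketch of such a per-host delay gate (the class, its fields, and awaitTurn are illustrative names; a worker would call awaitTurn(host) before each fetch). A 429/503 handler could push the host's next-allowed time further out to get exponential backoff:

```java
import java.util.concurrent.ConcurrentHashMap;

// Blocks a caller until at least minDelayMillis have passed since the
// previous request to the same host. Sketch only; no jitter or backoff.
class HostDelayGate {
    private final long minDelayMillis;
    private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    HostDelayGate(long minDelayMillis) { this.minDelayMillis = minDelayMillis; }

    void awaitTurn(String host) throws InterruptedException {
        while (true) {
            long now = System.currentTimeMillis();
            Long prev = nextAllowed.putIfAbsent(host, now + minDelayMillis);
            if (prev == null) return;                          // first request to this host
            if (prev <= now) {
                // Slot is free: claim it by atomically advancing the host's next-allowed time.
                if (nextAllowed.replace(host, prev, now + minDelayMillis)) return;
            } else {
                Thread.sleep(prev - now);                      // wait until the host's slot opens
            }
        }
    }
}
```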
Where the LeetCode version diverges from production
LC 1242 (Web Crawler Multithreaded) is the simplified version: crawl every URL under the start URL's hostname, with no per-host limits and no robots.txt. The senior interview version asks about politeness, distribution, and frontier prioritization. Lead with the simple version, then expand with "in production I'd add..."
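For reference, a minimal sketch of that simplified version, assuming LC 1242's HtmlParser callback (getUrls fetches a page and returns its links); it stays on the start URL's hostname and has no politeness controls:

```java
import java.net.URI;
import java.util.*;
import java.util.concurrent.*;

class SimpleCrawler {
    // Callback in the style of LC 1242: fetches a page and returns the links on it.
    interface HtmlParser { List<String> getUrls(String url); }

    public List<String> crawl(String startUrl, HtmlParser parser) throws Exception {
        String host = URI.create(startUrl).getHost();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        visited.add(startUrl);
        ExecutorService pool = Executors.newFixedThreadPool(8);   // pool size is illustrative
        Deque<Future<List<String>>> pending = new ArrayDeque<>();
        pending.add(pool.submit(() -> parser.getUrls(startUrl)));
        while (!pending.isEmpty()) {
            // Each future yields one page's links; unseen same-host links get submitted in turn.
            for (String link : pending.poll().get()) {
                if (host.equals(URI.create(link).getHost()) && visited.add(link)) {
                    pending.add(pool.submit(() -> parser.getUrls(link)));
                }
            }
        }
        pool.shutdown();
        return new ArrayList<>(visited);
    }
}
```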
Implementations
The visited set is a concurrent hash set, a worker pool processes URLs from a shared queue, and per-host semaphores enforce politeness. A counter tracks in-flight workers; combined with an empty queue, it detects termination.
```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class WebCrawler {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, Semaphore> hostLimits = new ConcurrentHashMap<>();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger inFlight = new AtomicInteger();   // workers currently fetching
    private final ExecutorService pool;
    private final int perHostMax;

    public WebCrawler(int workers, int perHostMax) {
        this.pool = Executors.newFixedThreadPool(workers);
        this.perHostMax = perHostMax;
    }

    public void crawl(String startUrl) throws InterruptedException {
        visited.add(startUrl);
        queue.offer(startUrl);
        // Terminate only when the queue is empty AND no worker is in flight:
        // an in-flight worker may still discover and enqueue new links.
        while (inFlight.get() > 0 || !queue.isEmpty()) {
            String url = queue.poll(100, TimeUnit.MILLISECONDS);
            if (url == null) continue;
            inFlight.incrementAndGet();
            pool.submit(() -> fetchAndDiscover(url));
        }
        pool.shutdown();
    }

    private void fetchAndDiscover(String url) {
        try {
            String host = URI.create(url).getHost();
            Semaphore hostLim = hostLimits.computeIfAbsent(host, h -> new Semaphore(perHostMax));
            hostLim.acquire();                 // politeness: at most perHostMax concurrent per host
            try {
                String html = fetch(url);
                for (String link : extractLinks(html)) {
                    if (visited.add(link)) {   // atomic check-and-add: only the first discoverer enqueues
                        queue.offer(link);
                    }
                }
            } finally {
                hostLim.release();
            }
        } catch (Exception e) {
            System.err.println("fetch failed: " + url + ": " + e);   // real code: log and back off
        } finally {
            inFlight.decrementAndGet();        // must always run, or crawl() never terminates
        }
    }

    // Placeholders: a real crawler plugs in an HTTP client and an HTML parser here.
    private String fetch(String url) throws Exception { return ""; }
    private Iterable<String> extractLinks(String html) { return java.util.List.of(); }
}
```

Key points
- Worker pool with N workers fetches from a shared work queue
- Concurrent set (visited) prevents fetching the same URL twice
- Per-domain semaphore caps requests per host (politeness)
- BFS-style traversal: each fetched page enqueues newly discovered links
- Termination: queue empty AND no in-flight workers
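Usage is a one-liner once fetch and extractLinks are wired to a real HTTP client and HTML parser; the sizing and URL below are illustrative:

```java
public class Main {
    public static void main(String[] args) throws Exception {
        // 64 total workers, at most 2 concurrent requests per host (illustrative numbers).
        WebCrawler crawler = new WebCrawler(64, 2);
        crawler.crawl("https://example.com");
    }
}
```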