Fan-Out / Fan-In
Fan-out: distribute work to multiple workers concurrently. Fan-in: merge their results back into a single stream. The two together parallelise the middle of a pipeline. Standard shape: one input, N workers each doing the same transformation, one output combining results.
[Diagram: one input queue fans out to N workers, whose outputs fan in to one stream]
What it is
Fan-out: hand work out to N concurrent workers. Fan-in: merge their results back into a single stream (see diagram above).
The key property: workers are interchangeable. Any worker can take the next item; nobody owns specific data. Fast workers grab more items than slow ones (automatic load balancing).
This is the parallelisation pattern for the middle of a pipeline. It comes up everywhere: web crawlers (one URL queue, N fetchers), batch processing (one input list, N processors), HTTP fan-out to backends (one request, N parallel calls, one merged response).
Why fan-out works
Three properties combine. First, the worker is the thing being parallelised, not the data. Adding more workers raises throughput up to whatever bottleneck is downstream. Second, the work is independent: workers don't talk to each other. Third, work-stealing happens automatically when workers share an input queue: fast workers grab more items.
The result: parallelism scales by changing one number (the worker count) without changing the structure of the code.
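As a minimal sketch of that shape (one shared input queue, N interchangeable workers, one merged output), assuming a hypothetical `doWork` transformation:

```java
import java.util.*;
import java.util.concurrent.*;

public class FanOutFanIn {
    // Hypothetical per-item transformation, standing in for real work.
    static int doWork(int job) { return job * job; }

    public static void main(String[] args) throws Exception {
        // Fan-out: one shared input queue that N workers all poll.
        BlockingQueue<Integer> input =
            new LinkedBlockingQueue<>(List.of(1, 2, 3, 4, 5, 6, 7, 8));
        // Fan-in: one shared output queue that all workers write to.
        BlockingQueue<Integer> output = new LinkedBlockingQueue<>();

        int workers = 3; // the one number that sets the degree of parallelism
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch done = new CountDownLatch(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                // Whoever polls next gets the next item: fast workers
                // naturally take more (automatic load balancing).
                Integer job;
                while ((job = input.poll()) != null) {
                    output.add(doWork(job));
                }
                done.countDown();
            });
        }
        done.await(); // wait-for-all before treating the output as complete
        pool.shutdown();

        List<Integer> results = new ArrayList<>(output);
        Collections.sort(results); // completion order is arbitrary; sort for display
        System.out.println(results);
    }
}
```

Changing `workers` from 3 to 8 changes the parallelism without touching anything else, which is the point of the pattern.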
Why bound it
The naive impulse: "more workers is always better". Wrong. Each worker uses memory, threads/goroutines, file descriptors. Each worker hits the downstream. Unbounded workers can:
- Exhaust file descriptors and trigger "too many open files".
- Overwhelm the downstream, turning its slowness into the caller's slowness.
- Burn CPU on context switching with no real throughput gain.
Pick a worker count and stick to it. For CPU-bound work, that is roughly the core count. For I/O-bound work, somewhere between 10 and 1000, depending on what the downstream can take.
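One way to enforce that cap even on an elastic executor is a counting semaphore around submission; a hedged sketch (the limit and the sleep are illustrative):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedFanOut {
    public static void main(String[] args) throws Exception {
        int limit = 4;                       // pick one number and stick to it
        Semaphore inFlight = new Semaphore(limit);
        AtomicInteger current = new AtomicInteger(); // tasks running right now
        AtomicInteger peak = new AtomicInteger();    // highest observed concurrency

        ExecutorService pool = Executors.newCachedThreadPool(); // elastic, "unbounded"
        for (int i = 0; i < 20; i++) {
            inFlight.acquire();              // blocks submission once `limit` tasks are busy
            pool.submit(() -> {
                try {
                    int now = current.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(10);        // stand-in for real I/O-bound work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    current.decrementAndGet();
                    inFlight.release();      // frees a slot for the next submission
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(peak.get() <= limit); // concurrency never exceeded the cap
    }
}
```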
Order preservation
Fan-out loses order. Workers run concurrently; whichever finishes first writes its result first. If the consumer needs results in input order, add explicit sequence numbers and re-sort downstream.
There are alternatives:
- Per-key partitioning. Hash each input by some key; route to one of N workers based on the hash. Within a key, order is preserved. Used for partitioned event streams (Kafka).
- Min-heap consumer. Consumer holds a min-heap, only emits the next sequence number when it arrives. Streaming preservation, but the heap can grow if one worker is slow.
- Drop the requirement. Often the downstream does not actually care about order. Confirm before adding complexity.
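A small sketch of the sequence-number approach, using hypothetical inputs: each worker tags its result with the input index, and the consumer re-sorts once everything has arrived:

```java
import java.util.*;
import java.util.concurrent.*;

public class OrderedFanIn {
    public static void main(String[] args) throws Exception {
        List<String> inputs = List.of("a", "b", "c", "d");
        // Each worker emits (sequence number, result); completion order is arbitrary.
        BlockingQueue<Map.Entry<Integer, String>> merged = new LinkedBlockingQueue<>();

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < inputs.size(); i++) {
            final int seq = i; // the sequence number travels with the work item
            pool.submit(() -> merged.add(Map.entry(seq, inputs.get(seq).toUpperCase())));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        // Re-sort downstream by sequence number to restore input order.
        List<Map.Entry<Integer, String>> all = new ArrayList<>(merged);
        all.sort(Map.Entry.comparingByKey());
        all.forEach(e -> System.out.print(e.getValue()));
        System.out.println();
    }
}
```

This batch version waits for everything; the min-heap variant above is the streaming equivalent of the same idea.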
Errors
Two strategies.
Cancel on first error. Workers share a context. The first worker to hit an error cancels the rest, and fan-in surfaces that error to the caller. The right choice when any one failure invalidates the whole result (transactional fan-out: load all parts of an order page; if any part fails, fail the page).
Collect errors and continue. Each worker reports its outcome (success or error). Fan-in returns a list of results, some good, some bad, and the caller decides. The right choice for batch processing where partial success is acceptable.
Pick at the call site. Don't mix; that confuses everyone reading the code.
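A sketch of the collect-and-continue strategy with `invokeAll`, assuming a hypothetical `parse` step that fails on some inputs:

```java
import java.util.*;
import java.util.concurrent.*;

public class CollectErrors {
    // Hypothetical per-item work that throws on bad input.
    static int parse(String s) { return Integer.parseInt(s); }

    public static void main(String[] args) throws Exception {
        List<String> jobs = List.of("1", "2", "oops", "4");

        try (ExecutorService pool = Executors.newFixedThreadPool(4)) {
            List<Future<Integer>> futures = pool.invokeAll(
                jobs.stream()
                    .map(j -> (Callable<Integer>) () -> parse(j))
                    .toList());

            int ok = 0, failed = 0;
            for (Future<Integer> f : futures) {
                try {
                    f.get();      // a task's error surfaces here as ExecutionException
                    ok++;
                } catch (ExecutionException e) {
                    failed++;     // record it and keep going; the caller decides
                }
            }
            System.out.println(ok + " ok, " + failed + " failed");
        }
    }
}
```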
A note on libraries
The standard libraries make this easy:
- Go: errgroup with SetLimit(n). Cancellation built in.
- Java: StructuredTaskScope (Java 21+) or ExecutorService.invokeAll.
- Python (asyncio): TaskGroup with a semaphore for the limit.
- Python (threads/processes): concurrent.futures.Executor.map.
Hand-rolling fan-out with raw channels/threads is fine for learning. For production code, use the helpers; they handle the cancellation and the wait-for-all that everyone gets wrong.
Implementations
Submit all tasks to the pool; collect futures; await results. Pool size caps concurrency. invokeAll is a one-shot fan-out helper that returns a list of futures, all of which are already done.
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

try (ExecutorService pool = Executors.newFixedThreadPool(8)) {
    List<Callable<Result>> tasks = jobs.stream()
        .map(j -> (Callable<Result>) () -> doWork(j))
        .toList();

    List<Future<Result>> futures = pool.invokeAll(tasks);

    List<Result> results = new ArrayList<>();
    for (Future<Result> f : futures) {
        results.add(f.get()); // blocks if this future is not yet done
    }
    return results;
}
```
Key points
- Fan-out: one input channel/queue feeds N workers. Each worker reads independently.
- Fan-in: N output channels merge into one. With channels, this is N goroutines copying to a shared output channel.
- Order is not preserved by default; preserving it requires a sequence number and a re-sort downstream.
- Worker count = degree of parallelism. Cap it (errgroup.SetLimit, a semaphore, ThreadPoolExecutor's max_workers).
- Combine with retry, circuit breaker, and timeout for production-grade fan-out.
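For the timeout part, one option is the timed overload of invokeAll, which cancels whatever has not finished when the deadline passes; a sketch with illustrative tasks and deadline:

```java
import java.util.List;
import java.util.concurrent.*;

public class FanOutTimeout {
    public static void main(String[] args) throws Exception {
        try (ExecutorService pool = Executors.newFixedThreadPool(3)) {
            List<Callable<String>> tasks = List.of(
                () -> "fast-1",
                () -> "fast-2",
                () -> { Thread.sleep(10_000); return "slow"; });

            // Timed overload: waits at most 500 ms, then cancels unfinished tasks.
            List<Future<String>> futures =
                pool.invokeAll(tasks, 500, TimeUnit.MILLISECONDS);

            long cancelled = futures.stream().filter(Future::isCancelled).count();
            System.out.println(cancelled); // only the slow task missed the deadline
        }
    }
}
```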
Follow-up questions
- How many workers should run?
- How is order preserved?
- What if some workers are much slower than others?
- How are errors propagated?
Gotchas
- Forgetting to close the output channel leaves the consumer blocked forever.
- Unbounded worker count exhausts file descriptors and overwhelms the downstream.
- Order is lost unless explicitly preserved with sequence numbers.
- One slow worker doesn't slow the others when input is shared, but it does delay overall completion.
- Errors in one worker can leak resources if not propagated through cancellation.