threading vs multiprocessing: Picking the Right Tool
threading shares memory and suits I/O-bound work because the GIL is released around blocking calls. multiprocessing spawns separate processes, each with its own GIL, which is the only way to get real CPU parallelism in stock CPython.
What it is
Python offers two genuinely different ways to run code in parallel: threads (which share memory but are bottlenecked by the GIL) and processes (which have separate memory and bypass the GIL). Picking wrong is the #1 cause of "I added concurrency and it got slower" bugs in Python.
The decision tree is short, but worth being precise about because the answer is the opposite of what most other languages would suggest.
Why it matters
Most Python optimisation stories end at "multiprocessing for CPU, threading/asyncio for I/O." Interviewers want to see that recommendation derived from first principles: the GIL, the cost of process startup, when IPC overhead matters.
The decision in one sentence
If the bottleneck is waiting on something (network, disk, DB), threads work. If it is crunching numbers, processes work. Measure first; optimise second.
How they differ
| Aspect | threading | multiprocessing |
|---|---|---|
| Memory | Shared | Separate (copy or shared_memory) |
| Startup cost | ~a few μs | ~50 ms per process |
| GIL impact | Bottleneck for CPU work | Each process has its own GIL; no bottleneck |
| Communication | Shared variables + locks | Pickle + pipe/queue |
| CPU parallelism | No (GIL) | Yes |
| I/O parallelism | Yes (GIL released on blocking calls) | Yes, but overkill |
| Crash blast radius | Whole process | Just the worker |
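To make the CPU-parallelism row concrete, here is a minimal benchmark sketch (the busy function, worker count, and loop size are illustrative): the same pure-Python CPU task barely speeds up in a thread pool because the GIL serialises it, while a process pool scales with cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # Pure-Python CPU work: the GIL is held the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(busy, [5_000_000] * 4))
    print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads")     # roughly serial time: the GIL serialises the loops
    timed(ProcessPoolExecutor, "processes")  # roughly serial / cores: real parallelism
```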
When to reach for which
Use threading when:
- The bottleneck is waiting (HTTP, DB, file I/O, sleep).
- Workers need to share large in-memory data structures.
- You need fast inter-worker communication.
Use multiprocessing when:
- The bottleneck is CPU (numerical compute, parsing, image work).
- Tasks are large enough to amortize ~50ms startup.
- You need crash isolation.
Use neither; reach for asyncio (sketched below) when:
- You have 10K+ concurrent I/O operations.
- The libraries in use are async-aware.
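A minimal sketch of that third case, assuming aiohttp is installed and using a hypothetical example URL: thousands of in-flight requests on a single thread, with a semaphore capping concurrency.

```python
import asyncio
import aiohttp  # third-party; assumed installed

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    sem = asyncio.Semaphore(200)  # cap the number of in-flight requests

    async with aiohttp.ClientSession() as session:

        async def bounded(url):
            async with sem:
                return await fetch(session, url)

        # one thread, one event loop, thousands of concurrent requests
        return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://api.example.com/items/{i}" for i in range(10_000)]
pages = asyncio.run(main(urls))
```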
The modern API: concurrent.futures
For 99% of cases, prefer concurrent.futures over raw threading.Thread and multiprocessing.Process. It provides:
- A unified Executor API: submit, map, shutdown
- Future objects with .result(), .exception(), .cancel(), .add_done_callback() (sketched below)
- Clean shutdown via context manager (with ... as ex:)
- as_completed for streaming results
- Easy switching between thread pool and process pool
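A small sketch of the Future API from the list above (the job function and its deliberate failure are made up for illustration): submit returns a Future, add_done_callback fires when it completes, and exception()/result() report the outcome.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def job(x):
    if x == 3:
        raise ValueError("bad input")  # deliberate failure to show .exception()
    return x * x

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(job, x) for x in range(5)]               # submit -> Future
    futures[0].add_done_callback(lambda f: print("first job done"))

    for fut in as_completed(futures):                             # streaming results
        if fut.exception() is not None:
            print("failed:", fut.exception())
        else:
            print("ok:", fut.result())
```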
The pattern that works for most Python concurrency
```python
with ThreadPoolExecutor(max_workers=N) as ex:
    results = list(ex.map(work, items))
```
Swap ThreadPoolExecutor for ProcessPoolExecutor when CPU-bound. That's the entire pattern.
The platform trap: fork vs spawn
On Linux, multiprocessing has traditionally defaulted to fork: the child inherits a copy-on-write view of the parent's memory, module-level globals are visible, and startup is cheap. (Python 3.14 changes the Linux default to forkserver.)
On macOS (3.8+) and Windows, the default is spawn: the child starts a fresh Python interpreter and re-imports the module, so module-level state set by the parent is gone. Code that worked in a Linux dev environment breaks on macOS or Windows in production.
Always guard with if __name__ == "__main__":
Without it, spawn-mode children re-execute the module top-to-bottom, including the Pool() creation, leading to infinite recursion of process spawning. The guard is mandatory on Windows; harmless on Linux.
Picking pool size
Two different formulas
CPU-bound (process pool): os.cpu_count(), sometimes cpu_count() + 1. More processes than cores means context-switching cost without throughput gain.
I/O-bound (thread pool): cores × (1 + wait_time / compute_time). For workloads that are 90% wait, that's roughly 10 × cores. Tune by ramping up until throughput plateaus.
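A tiny helper applying both formulas; the function name, the cap of 256, and the example timings are illustrative choices, not a standard API.

```python
import os

def suggested_workers(kind, wait_ms=0.0, compute_ms=1.0):
    cores = os.cpu_count() or 1
    if kind == "cpu":
        return cores  # one process per core; more just adds context switches
    # I/O: cores * (1 + wait/compute), capped to keep memory and FD usage sane
    return min(int(cores * (1 + wait_ms / compute_ms)), 256)

print(suggested_workers("cpu"))                           # e.g. 8 on an 8-core box
print(suggested_workers("io", wait_ms=50, compute_ms=5))  # 8 * (1 + 10) = 88 on 8 cores
```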
When to mix them
A common high-performance pattern: process pool at the top, thread pool inside each worker. Each process gets a CPU core for compute; threads inside each handle blocking I/O without pinning a whole process to wait. Useful for ML inference servers (CPU-heavy compute + S3 reads), data pipelines (CPU transform + DB writes).
```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# Each worker process spins up its own thread pool
def worker(batch):
    with ThreadPoolExecutor(max_workers=10) as io_pool:
        pages = list(io_pool.map(fetch, batch))   # I/O in threads
    return cpu_heavy_transform(pages)             # CPU in this process

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as proc_pool:
        results = list(proc_pool.map(worker, batches))
```
This is the pattern Celery, Dask, and Ray use under the hood.
Primitives to know
- threading.Thread
- threading.Lock / Event / Condition / Semaphore
- multiprocessing.Process
- multiprocessing.Pool / Queue / Pipe / Manager / shared_memory
- concurrent.futures.{ThreadPoolExecutor, ProcessPoolExecutor}
Implementation
concurrent.futures.ThreadPoolExecutor is the modern Pythonic way to manage a thread pool. map preserves input order; as_completed yields futures as they finish. Prefer this 99% of the time over raw threading.Thread.
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = [f"https://api.example.com/items/{i}" for i in range(100)]

with ThreadPoolExecutor(max_workers=20) as ex:
    # In-order results
    for resp in ex.map(requests.get, urls):
        process(resp)

    # Or first-finished-first
    futures = {ex.submit(requests.get, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            process(fut.result())
        except Exception as e:
            print(f"Failed {url}: {e}")
```

Same API, different executor, but each task runs in a separate process. The arguments and return values are pickled and sent via a pipe. Best for medium-sized CPU work where the IPC overhead is amortized.
```python
from concurrent.futures import ProcessPoolExecutor

def transform(image_path):
    # CPU-heavy: resize, encode, compress
    return process_image(image_path)

paths = [f"img_{i}.jpg" for i in range(1000)]

if __name__ == "__main__":  # guard required under spawn (macOS/Windows)
    with ProcessPoolExecutor(max_workers=8) as ex:
        for original, result in zip(paths, ex.map(transform, paths)):
            save(original, result)
```

Each process has its own memory by default; pickling arguments and return values is the easy path. For large shared state, use multiprocessing.shared_memory (3.8+) or Manager (slower but flexible).
```python
from multiprocessing import Pool, Manager
from multiprocessing.shared_memory import SharedMemory
import numpy as np

# Approach 1: pickle args/results (simplest, slowest for big data)
with Pool(4) as pool:
    results = pool.map(work, big_list)

# Approach 2: Manager for shared dict/list (slow but cross-process)
with Manager() as mgr:
    counter = mgr.Value("i", 0)
    shared_list = mgr.list()
    # workers get proxy objects

# Approach 3: shared_memory for numpy arrays (fastest for numeric)
arr = np.ones(10_000_000, dtype=np.float64)
shm = SharedMemory(create=True, size=arr.nbytes)
buffer = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
buffer[:] = arr
# children attach by name: SharedMemory(name=shm.name)
```

On Linux, fork copies the parent's memory (cheap, copy-on-write). On macOS (3.8+) and Windows, the default is spawn: a fresh interpreter that re-imports modules, so module-level globals from the parent aren't visible. This breaks code that worked on Linux.
```python
import multiprocessing as mp

# COMMON BUG: module-level mutation only works under fork
_global_state = {}

def worker(x):
    # Under spawn, _global_state is empty in the child
    return _global_state[x]

# SAFE on every platform: pick the start method explicitly
if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # or "fork" on Linux
    with ctx.Pool(4) as pool:
        pool.map(worker, args)
```

For threads (I/O-bound), the right size depends on the wait/compute ratio, often dozens to hundreds. For processes (CPU-bound), don't exceed the core count; one or two extra is fine, but more thrashes the scheduler.
```python
import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

cores = os.cpu_count()

# CPU-bound: ~core count
cpu_pool = ProcessPoolExecutor(max_workers=cores)

# I/O-bound: depends on workload, often much higher
# Goetz's formula: cores * target_util * (1 + wait_time / compute_time)
# 8 cores * 1.0 * (1 + 50ms / 5ms) ≈ 88
io_pool = ThreadPoolExecutor(max_workers=64)
```

Key points
- threading: shared memory, lightweight, GIL-bound; pick for I/O concurrency
- multiprocessing: separate memory, ~50 ms startup each, no shared GIL; pick for CPU parallelism
- concurrent.futures gives a unified API over both: Executor, submit(), map(), Future
- Process start method differs by OS: fork on Linux (fast, copy-on-write), spawn on macOS/Windows (slower, isolated)
- multiprocessing requires picklable arguments; lambdas, locks, and sockets won't cross the boundary (see the sketch after this list)
- Process pools are best for medium-to-large CPU tasks; small tasks lose to IPC overhead
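A quick sketch of the pickling constraint: a lambda can't be sent to a worker process, but a module-level def wrapped in functools.partial can (scale and the factor are made-up names for illustration).

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def scale(factor, x):
    return factor * x

if __name__ == "__main__":
    with ProcessPoolExecutor() as ex:
        # list(ex.map(lambda x: 2 * x, range(5)))              # fails: lambdas can't be pickled
        doubled = list(ex.map(partial(scale, 2), range(5)))    # works: def + partial pickle fine
        print(doubled)  # [0, 2, 4, 6, 8]
```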
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| threading.Thread / ThreadPoolExecutor | Shared memory, microsecond startup, cheap communication | GIL blocks CPU parallelism; a crash takes down the whole process | I/O-bound: HTTP, DB queries, file I/O |
| multiprocessing.Process / ProcessPoolExecutor | True CPU parallelism, crash isolation per worker | ~50 ms startup per process, pickle/IPC overhead, separate memory | CPU-bound: numerical compute, image processing, ML preprocessing |
| asyncio (different lesson) | Handles 10K+ concurrent I/O operations on a single thread | Requires async-aware libraries | Massive I/O concurrency: web scrapers, message brokers |
Follow-up questions
- When does multiprocessing pool overhead outweigh the parallelism win?
- Why are sockets and locks not picklable?
- How can a numpy array be shared across processes without pickling it?
- concurrent.futures or raw threading/multiprocessing?
- Is a thread pool inside each worker process viable?
Gotchas
- Forgetting the `if __name__ == '__main__':` guard on Windows re-imports the module recursively: infinite spawn
- Module-level state set by the parent isn't visible to spawn-mode children; pass it explicitly
- Pickle errors on lambdas: use def or functools.partial
- ProcessPoolExecutor swallows worker crashes: wrap in try/except and check fut.exception()
- fork on macOS is officially discouraged (3.8+) due to subtle bugs with system libraries
- Closing a Pool inside a with block is automatic; outside it, pool.close() + pool.join() are required or workers leak (see the sketch after this list)
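A short sketch of that last gotcha, using a made-up work function: the context manager tears the pool down automatically, while manual management needs close() and join().

```python
from multiprocessing import Pool

def work(x):
    return x * x

if __name__ == "__main__":
    # Preferred: the context manager tears the pool down on exit
    with Pool(4) as pool:
        print(pool.map(work, range(8)))

    # Manual management: close() stops new submissions, join() waits for workers
    pool = Pool(4)
    try:
        print(pool.map(work, range(8)))
    finally:
        pool.close()
        pool.join()
```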
Common pitfalls
- Reaching for multiprocessing for tiny tasks: IPC dwarfs the work
- Sharing mutable Python objects via Manager and being surprised by the slowness
- Using threading for CPU work and 'optimizing' by adding more threads
- Mixing threads and asyncio without asyncio.to_thread() bridges (sketched below)
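A minimal sketch of the asyncio.to_thread() bridge, with a made-up blocking function standing in for a library call that can't be awaited.

```python
import asyncio
import time

def blocking_io(path):
    # stands in for a library call that blocks on disk or network
    time.sleep(1)
    return f"contents of {path}"

async def main():
    # run the blocking call in the default thread pool so the event loop stays responsive
    text = await asyncio.to_thread(blocking_io, "report.csv")
    print(text)

asyncio.run(main())
```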
Practice problems
- CPU-bound (e.g. image resizing is pure compute) → ProcessPoolExecutor with max_workers=os.cpu_count()
- I/O-bound → ThreadPoolExecutor with max_workers=50-100, or asyncio + aiohttp
APIs worth memorising
- concurrent.futures: ThreadPoolExecutor, ProcessPoolExecutor, Future, as_completed, wait
- threading: Thread, Lock, Event, Condition, Semaphore, BoundedSemaphore
- multiprocessing: Process, Pool, Queue, Pipe, Manager, shared_memory, get_context
Celery uses multiprocessing for CPU-heavy workers. gunicorn runs N processes, each handling requests synchronously. Dask and Ray build distributed computing on top of multiprocessing. SciPy and scikit-learn use joblib (multiprocessing) for parallel ML. NumPy releases the GIL inside C code so threads work for it.