threading vs multiprocessing: Picking the Right Tool
threading shares memory and suits I/O-bound work because the GIL is released around blocking calls. multiprocessing spawns separate processes, each with its own GIL, which is the only way to get real CPU parallelism in stock CPython.
What it is
Python offers two genuinely different ways to run code in parallel: threads (which share memory but are bottlenecked by the GIL) and processes (which have separate memory and bypass the GIL). Picking wrong is the #1 cause of "I added concurrency and it got slower" bugs in Python.
The decision tree is short, but worth being precise about because the answer is the opposite of what most other languages would suggest.
Why it matters
Most Python optimisation stories end at "multiprocessing for CPU, threading/asyncio for I/O." Interviewers want to see that recommendation derived from first principles: the GIL, the cost of process startup, when IPC overhead matters.
The decision in one sentence
If the bottleneck is waiting on something (network, disk, DB), threads work. If it is crunching numbers, processes work. Measure first; optimise second.
How they differ
| Aspect | threading | multiprocessing |
|---|---|---|
| Memory | Shared | Separate (copy or shared_memory) |
| Startup cost | ~a few μs | ~50 ms per process |
| GIL impact | Bottleneck for CPU work | Each process has its own GIL; no bottleneck |
| Communication | Shared variables + locks | Pickle + pipe/queue |
| CPU parallelism | No (GIL) | Yes |
| I/O parallelism | Yes (GIL released on blocking calls) | Yes, but overkill |
| Crash blast radius | Whole process | Just the worker |
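To make the CPU-parallelism row concrete, here is a minimal benchmark sketch (the busy function, worker count, and loop size are illustrative): the same pure-Python CPU task barely speeds up in a thread pool because the GIL serialises it, while a process pool scales with cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # Pure-Python CPU work: the GIL is held the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(busy, [5_000_000] * 4))
    print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads")     # roughly serial time: the GIL serialises the loops
    timed(ProcessPoolExecutor, "processes")  # roughly serial / cores: real parallelism
```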
When to reach for which
Use threading when:
- The bottleneck is waiting (HTTP, DB, file I/O, sleep).
- Workers need to share large in-memory data structures.
- You need fast inter-worker communication.
Use multiprocessing when:
- The bottleneck is CPU (numerical compute, parsing, image work).
- Tasks are large enough to amortize ~50ms startup.
- You need crash isolation.
Use neither; reach for asyncio (sketched below) when:
- You have 10K+ concurrent I/O operations.
- The libraries in use are async-aware.
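A minimal sketch of that third case, assuming aiohttp is installed and using a hypothetical example URL: thousands of in-flight requests on a single thread, with a semaphore capping concurrency.

```python
import asyncio
import aiohttp  # third-party; assumed installed

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    sem = asyncio.Semaphore(200)  # cap the number of in-flight requests

    async with aiohttp.ClientSession() as session:

        async def bounded(url):
            async with sem:
                return await fetch(session, url)

        # one thread, one event loop, thousands of concurrent requests
        return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://api.example.com/items/{i}" for i in range(10_000)]
pages = asyncio.run(main(urls))
```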
The modern API: concurrent.futures
For 99% of cases, prefer concurrent.futures over raw threading.Thread and multiprocessing.Process. It provides:
- A unified Executor API: submit, map, shutdown
- Future objects with .result(), .exception(), .cancel(), .add_done_callback() (sketched below)
- Clean shutdown via context manager (with ... as ex:)
- as_completed for streaming results
- Easy switching between thread pool and process pool
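A small sketch of the Future API from the list above (the job function and its deliberate failure are made up for illustration): submit returns a Future, add_done_callback fires when it completes, and exception()/result() report the outcome.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def job(x):
    if x == 3:
        raise ValueError("bad input")  # deliberate failure to show .exception()
    return x * x

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(job, x) for x in range(5)]               # submit -> Future
    futures[0].add_done_callback(lambda f: print("first job done"))

    for fut in as_completed(futures):                             # streaming results
        if fut.exception() is not None:
            print("failed:", fut.exception())
        else:
            print("ok:", fut.result())
```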
The pattern that works for most Python concurrency
```python
with ThreadPoolExecutor(max_workers=N) as ex:
    results = list(ex.map(work, items))
```
Swap ThreadPoolExecutor for ProcessPoolExecutor when CPU-bound. That's the entire pattern.
The platform trap: fork vs spawn
On Linux, multiprocessing has traditionally defaulted to fork: the child inherits a copy-on-write view of the parent's memory, module-level globals are visible, and startup is cheap. (Python 3.14 changes the Linux default to forkserver.)
On macOS (3.8+) and Windows, the default is spawn: the child starts a fresh Python interpreter and re-imports the module, so module-level state set by the parent is gone. Code that worked in a Linux dev environment breaks on macOS or Windows in production.
Always guard with if __name__ == "__main__":
Without it, spawn-mode children re-execute the module top-to-bottom, including the Pool() creation, leading to infinite recursion of process spawning. The guard is mandatory on Windows; harmless on Linux.
Picking pool size
Two different formulas
CPU-bound (process pool): os.cpu_count(), sometimes cpu_count() + 1. More processes than cores means context-switching cost without throughput gain.
I/O-bound (thread pool): cores × (1 + wait_time / compute_time). For workloads that are 90% wait, that's roughly 10 × cores. Tune by ramping up until throughput plateaus.
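A tiny helper applying both formulas; the function name, the cap of 256, and the example timings are illustrative choices, not a standard API.

```python
import os

def suggested_workers(kind, wait_ms=0.0, compute_ms=1.0):
    cores = os.cpu_count() or 1
    if kind == "cpu":
        return cores  # one process per core; more just adds context switches
    # I/O: cores * (1 + wait/compute), capped to keep memory and FD usage sane
    return min(int(cores * (1 + wait_ms / compute_ms)), 256)

print(suggested_workers("cpu"))                           # e.g. 8 on an 8-core box
print(suggested_workers("io", wait_ms=50, compute_ms=5))  # 8 * (1 + 10) = 88 on 8 cores
```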
When to mix them
A common high-performance pattern: process pool at the top, thread pool inside each worker. Each process gets a CPU core for compute; threads inside each handle blocking I/O without pinning a whole process to wait. Useful for ML inference servers (CPU-heavy compute + S3 reads), data pipelines (CPU transform + DB writes).
```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# Each worker process spins up its own thread pool
def worker(batch):
    with ThreadPoolExecutor(max_workers=10) as io_pool:
        pages = list(io_pool.map(fetch, batch))   # I/O in threads
    return cpu_heavy_transform(pages)             # CPU in this process

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as proc_pool:
        results = list(proc_pool.map(worker, batches))
```
This is the pattern Celery, Dask, and Ray use under the hood.
Primitives to know
- threading.Thread
- threading.Lock / Event / Condition / Semaphore
- multiprocessing.Process
- multiprocessing.Pool / Queue / Pipe / Manager / shared_memory
- concurrent.futures.{ThreadPoolExecutor, ProcessPoolExecutor}
Implementation
concurrent.futures.ThreadPoolExecutor is the modern Pythonic way to manage a thread pool. map preserves input order; as_completed yields futures as they finish. Prefer this 99% of the time over raw threading.Thread.
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = [f"https://api.example.com/items/{i}" for i in range(100)]

with ThreadPoolExecutor(max_workers=20) as ex:
    # In-order results
    for resp in ex.map(requests.get, urls):
        process(resp)

    # Or first-finished-first
    futures = {ex.submit(requests.get, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            process(fut.result())
        except Exception as e:
            print(f"Failed {url}: {e}")
```

Same API, different executor, but each task runs in a separate process. The arguments and return values are pickled and sent via a pipe. Best for medium-sized CPU work where the IPC overhead is amortized.
```python
from concurrent.futures import ProcessPoolExecutor

def transform(image_path):
    # CPU-heavy: resize, encode, compress
    return process_image(image_path)

paths = [f"img_{i}.jpg" for i in range(1000)]

if __name__ == "__main__":  # guard required under spawn (macOS/Windows)
    with ProcessPoolExecutor(max_workers=8) as ex:
        for original, result in zip(paths, ex.map(transform, paths)):
            save(original, result)
```

Each process has its own memory by default; pickling arguments and return values is the easy path. For large shared state, use multiprocessing.shared_memory (3.8+) or Manager (slower but flexible).
```python
from multiprocessing import Pool, Manager
from multiprocessing.shared_memory import SharedMemory
import numpy as np

# Approach 1: pickle args/results (simplest, slowest for big data)
with Pool(4) as pool:
    results = pool.map(work, big_list)

# Approach 2: Manager for shared dict/list (slow but cross-process)
with Manager() as mgr:
    counter = mgr.Value("i", 0)
    shared_list = mgr.list()
    # workers get proxy objects

# Approach 3: shared_memory for numpy arrays (fastest for numeric)
arr = np.ones(10_000_000, dtype=np.float64)
shm = SharedMemory(create=True, size=arr.nbytes)
buffer = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
buffer[:] = arr
# children attach by name: SharedMemory(name=shm.name)
```

On Linux, fork copies the parent's memory (cheap, copy-on-write). On macOS (3.8+) and Windows, the default is spawn: a fresh interpreter that re-imports modules, so module-level globals from the parent aren't visible. This breaks code that worked on Linux.
```python
import multiprocessing as mp

# COMMON BUG: module-level mutation only works under fork
_global_state = {}

def worker(x):
    # Under spawn, _global_state is empty in the child
    return _global_state[x]

# SAFE on every platform: pick the start method explicitly
if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # or "fork" on Linux
    with ctx.Pool(4) as pool:
        pool.map(worker, args)
```

For threads (I/O-bound), the right size depends on the wait/compute ratio, often dozens to hundreds. For processes (CPU-bound), don't exceed the core count; one or two extra is fine, but more thrashes the scheduler.
```python
import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

cores = os.cpu_count()

# CPU-bound: ~core count
cpu_pool = ProcessPoolExecutor(max_workers=cores)

# I/O-bound: depends on workload, often much higher
# Goetz's formula: cores * target_util * (1 + wait_time / compute_time)
# 8 cores * 1.0 * (1 + 50ms / 5ms) ≈ 88
io_pool = ThreadPoolExecutor(max_workers=64)
```

Key points
- threading: shared memory, lightweight, GIL-bound; pick for I/O concurrency
- multiprocessing: separate memory, ~50 ms startup each, no shared GIL; pick for CPU parallelism
- concurrent.futures gives a unified API over both: Executor, submit(), map(), Future
- Process start method differs by OS: fork on Linux (fast, copy-on-write), spawn on macOS/Windows (slower, isolated)
- multiprocessing requires picklable arguments; lambdas, locks, and sockets won't cross the boundary (see the sketch after this list)
- Process pools are best for medium-to-large CPU tasks; small tasks lose to IPC overhead
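A quick sketch of the pickling constraint: a lambda can't be sent to a worker process, but a module-level def wrapped in functools.partial can (scale and the factor are made-up names for illustration).

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def scale(factor, x):
    return factor * x

if __name__ == "__main__":
    with ProcessPoolExecutor() as ex:
        # list(ex.map(lambda x: 2 * x, range(5)))              # fails: lambdas can't be pickled
        doubled = list(ex.map(partial(scale, 2), range(5)))    # works: def + partial pickle fine
        print(doubled)  # [0, 2, 4, 6, 8]
```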
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| threading.Thread / ThreadPoolExecutor | Shared memory, microsecond startup, cheap communication | GIL blocks CPU parallelism; a crash takes down the whole process | I/O-bound: HTTP, DB queries, file I/O |
| multiprocessing.Process / ProcessPoolExecutor | True CPU parallelism, crash isolation per worker | ~50 ms startup per process, pickle/IPC overhead, separate memory | CPU-bound: numerical compute, image processing, ML preprocessing |
| asyncio (different lesson) | Handles 10K+ concurrent I/O operations on a single thread | Requires async-aware libraries | Massive I/O concurrency: web scrapers, message brokers |
Follow-up questions
- When does multiprocessing pool overhead outweigh the parallelism win?
- Why are sockets and locks not picklable?
- How can a numpy array be shared across processes without pickling it?
- concurrent.futures or raw threading/multiprocessing?
- Is a thread pool inside each worker process viable?
Gotchas
- Forgetting the `if __name__ == '__main__':` guard on Windows re-imports the module recursively: infinite spawn
- Module-level state set by the parent isn't visible to spawn-mode children; pass it explicitly
- Pickle errors on lambdas: use def or functools.partial
- ProcessPoolExecutor swallows worker crashes: wrap in try/except and check fut.exception()
- fork on macOS is officially discouraged (3.8+) due to subtle bugs with system libraries
- Closing a Pool inside a with block is automatic; outside it, pool.close() + pool.join() are required or workers leak (see the sketch after this list)
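A short sketch of that last gotcha, using a made-up work function: the context manager tears the pool down automatically, while manual management needs close() and join().

```python
from multiprocessing import Pool

def work(x):
    return x * x

if __name__ == "__main__":
    # Preferred: the context manager tears the pool down on exit
    with Pool(4) as pool:
        print(pool.map(work, range(8)))

    # Manual management: close() stops new submissions, join() waits for workers
    pool = Pool(4)
    try:
        print(pool.map(work, range(8)))
    finally:
        pool.close()
        pool.join()
```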
Common pitfalls
- Reaching for multiprocessing for tiny tasks: IPC dwarfs the work
- Sharing mutable Python objects via Manager and being surprised by the slowness
- Using threading for CPU work and 'optimizing' by adding more threads
- Mixing threads and asyncio without asyncio.to_thread() bridges (sketched below)
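A minimal sketch of the asyncio.to_thread() bridge, with a made-up blocking function standing in for a library call that can't be awaited.

```python
import asyncio
import time

def blocking_io(path):
    # stands in for a library call that blocks on disk or network
    time.sleep(1)
    return f"contents of {path}"

async def main():
    # run the blocking call in the default thread pool so the event loop stays responsive
    text = await asyncio.to_thread(blocking_io, "report.csv")
    print(text)

asyncio.run(main())
```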
Practice problems
- CPU-bound (e.g. image resizing is pure compute) → ProcessPoolExecutor with max_workers=os.cpu_count()
- I/O-bound → ThreadPoolExecutor with max_workers=50-100, or asyncio + aiohttp
APIs worth memorising
- concurrent.futures: ThreadPoolExecutor, ProcessPoolExecutor, Future, as_completed, wait
- threading: Thread, Lock, Event, Condition, Semaphore, BoundedSemaphore
- multiprocessing: Process, Pool, Queue, Pipe, Manager, shared_memory, get_context
Celery uses multiprocessing for CPU-heavy workers. gunicorn runs N processes, each handling requests synchronously. Dask and Ray build distributed computing on top of multiprocessing. SciPy and scikit-learn use joblib (multiprocessing) for parallel ML. NumPy releases the GIL inside C code so threads work for it.