Multiprocessing Deep Dive
multiprocessing spawns separate Python processes to escape the GIL. Pool, Process, Queue, Pipe, Manager, shared memory. The catch: every argument and return value is pickled across the boundary, start methods (fork/spawn/forkserver) behave very differently, and shared state requires explicit machinery.
What it is
multiprocessing is the standard library's tool for running Python code in separate processes. It exists for one main reason: to escape the GIL.
Each process has its own Python interpreter, its own GIL, and its own memory. CPU-bound Python code that threads cannot speed up scales roughly linearly across cores with processes, minus the per-process overhead of creation, IPC, and pickling.
ProcessPoolExecutor (in concurrent.futures) is built on top of multiprocessing. Reach for the executor first; reach for raw multiprocessing only for capabilities the executor doesn't expose (custom IPC via Queue/Pipe, Manager, shared memory, streaming results with imap/imap_unordered).
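A minimal sketch of that executor-first default (the square function and worker count are illustrative, not taken from the text above):

```python
from concurrent.futures import ProcessPoolExecutor

def square(n):
    # Must be a module-level function so it can be pickled for the workers.
    return n * n

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(square, range(10)))
    print(results)  # [0, 1, 4, ..., 81]
```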
The pickling tax
Because processes do not share memory, every argument to a worker and every return value is pickled (serialised), sent over a pipe, and unpickled in the worker. The cost is real:
- Picklable types only. Lambdas fail. Closures over local variables fail. File handles, sockets, threading.Lock fail. Anything with a broken __reduce__ fails.
- Big arguments (a 100 MB numpy array) take real time to serialise and copy.
- For small operations, pickling overhead can exceed the work, making multiprocessing slower than sequential.
The escape hatches: shared memory for big buffers (no copy), Manager for shared mutable state (every access is IPC, slow), or simply refactor to do more work per task.
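One way to see the constraint before paying for a Pool is to pickle candidate arguments directly; a small sketch (the names here are illustrative):

```python
import pickle
import threading

def module_level(x):
    # Picklable: functions pickle by reference (module path + name).
    return x + 1

def make_closure():
    y = 10
    return lambda x: x + y  # closes over a local variable

if __name__ == "__main__":
    pickle.dumps(module_level)   # fine
    pickle.dumps([1, 2, 3])      # fine: plain data pickles cheaply

    for bad in (lambda x: x + 1,     # no importable name
                make_closure(),      # closure over a local
                threading.Lock()):   # OS-level handle
        try:
            pickle.dumps(bad)
        except Exception as exc:
            print(type(bad).__name__, "->", exc)
```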
Start methods
multiprocessing has three ways to create a worker process: fork, spawn, forkserver. Choosing matters more than people realise.
fork (Unix default historically; no longer the default from Python 3.14): the child gets a copy-on-write copy of the parent's memory. Fast (no re-imports, no fresh interpreter). But it inherits everything: open file descriptors, library state, and whatever locks other threads held at fork time (only the forking thread survives in the child, so those locks can stay locked forever). Many libraries (boto3, urllib3, opencv) explicitly warn against fork after threading.
spawn (Windows always; macOS default in 3.8+): the child starts a fresh interpreter and re-imports the module. Clean, no inherited state, no fork-after-threading bugs. Slow (hundreds of ms per worker). Requires if __name__ == "__main__": guards on top-level code.
forkserver: a small server process is forked once at startup. Subsequent workers fork from it. Fast like fork, clean like spawn (the server has no library state). This becomes the default on Linux from 3.14 onward.
For new code, prefer spawn or forkserver. For legacy code on Linux that already works, fork is fine for now, but be aware of the threading rule and the upcoming default change.
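If one pool needs a specific start method without changing the process-wide default, get_context offers a per-context choice; a sketch (the task function is a placeholder, and forkserver is Unix-only):

```python
import multiprocessing as mp

def task(x):
    return x * x

if __name__ == "__main__":
    # Per-pool start method, without touching the global default.
    ctx = mp.get_context("forkserver")  # or "spawn" / "fork"
    with ctx.Pool(4) as pool:
        print(pool.map(task, range(8)))
```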
Sharing state
The default model is "no shared state". Workers communicate via Queue, Pipe, or pickled return values. This avoids most concurrency bugs: there is nothing to race on.
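A sketch of that no-shared-state style, assuming a trivial squaring task: the parent feeds a Queue, workers push results onto another, and nothing is shared.

```python
from multiprocessing import Process, Queue

def worker(tasks, results):
    # Pull until the parent sends a sentinel; everything crossing the
    # queues is pickled, so only picklable values may pass through.
    for x in iter(tasks.get, None):
        results.put(x * x)

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    procs = [Process(target=worker, args=(tasks, results)) for _ in range(2)]
    for p in procs: p.start()

    for x in range(10):
        tasks.put(x)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker

    out = [results.get() for _ in range(10)]
    for p in procs: p.join()
    print(sorted(out))
```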
When shared state is genuinely required, the options are:
Manager: a server process owns the shared object. Other processes get proxies. Every access (read or write) is an IPC call. Convenient (Manager.dict, Manager.list, Manager.Lock all just work) but slow.
shared_memory: an OS-level memory region mapped into every process. Zero copy. The right tool for big numpy arrays, image buffers, or any fixed-size byte structure. Synchronisation is the caller's responsibility.
Value / Array: small fixed-size shared variables (one int, one float, an array of doubles), guarded by a Lock for synchronisation. Useful for counters and flags.
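A minimal counter sketch using Value (the increment counts are arbitrary; the lock prevents lost updates):

```python
from multiprocessing import Process, Value

def bump(counter, n):
    for _ in range(n):
        with counter.get_lock():  # Value carries its own lock
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)  # shared C int, initialised to 0
    procs = [Process(target=bump, args=(counter, 10_000)) for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(counter.value)  # 40000, thanks to the lock
```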
For most workloads, shared state can be avoided entirely by passing data through Queue or by chunking work and combining results in the parent. Reach for shared state only after measurement shows IPC cost dominates.
When the GIL is gone
PEP 703 (free-threaded, no-GIL CPython) ships as an opt-in build in 3.13+ and is on track to become the default later this decade. On such a build, threads achieve true CPU parallelism in Python without processes.
multiprocessing does not become obsolete. It still provides fault isolation (one crashed worker does not take down the parent), memory isolation (no shared mutable state by default), and a path to multiple machines via process-based libraries. But for "just use the cores", threads become a viable option without the multiprocessing tax.
Primitives
- multiprocessing.Pool (worker pool, like ProcessPoolExecutor)
- multiprocessing.Process (single subprocess)
- multiprocessing.Queue / Pipe (IPC channels)
- multiprocessing.Manager (shared dict/list/Lock proxied across processes)
- multiprocessing.shared_memory (zero-copy bytes/numpy)
- Start methods: fork, spawn, forkserver
Implementation
Pool's API mirrors the built-in map family. map preserves order and waits for all results. imap_unordered streams results in completion order, which makes it possible to process the first finished result without waiting for slowpokes. For long-running batches, imap_unordered is often the right default.
```python
from multiprocessing import Pool

def heavy(x):
    # CPU-bound; releasing the GIL would not help
    return sum(i * i for i in range(x))

if __name__ == "__main__":  # required under spawn (Windows/macOS)
    with Pool(processes=4) as pool:
        # Order-preserving
        for r in pool.map(heavy, range(100)):
            print(r)

        # Completion order, lower latency to first result
        for r in pool.imap_unordered(heavy, range(100), chunksize=10):
            print(r)
```

Manager runs a server process that owns the shared object. Other processes get proxies. Every access is an IPC call (pickle, send, unpickle, return). Convenient but slow. For high-throughput shared state, prefer shared_memory.
```python
from multiprocessing import Manager, Process

def worker(shared_dict, key, value):
    shared_dict[key] = value  # proxied: pickled, sent, applied

if __name__ == "__main__":
    with Manager() as mgr:
        shared = mgr.dict()
        procs = [
            Process(target=worker, args=(shared, i, i * i))
            for i in range(10)
        ]
        for p in procs: p.start()
        for p in procs: p.join()
        print(dict(shared))  # snapshot
```

shared_memory (Python 3.8+) gives multiple processes a view onto the same OS-level memory region. No pickling, no copying. The right tool for big numpy arrays shared across workers.
```python
from multiprocessing import shared_memory, Process
import numpy as np

def worker(name, shape, dtype):
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[:] = arr * 2  # in-place, visible to all
    shm.close()       # close this handle, do not unlink

if __name__ == "__main__":
    base = np.arange(1_000_000, dtype=np.int64)
    shm = shared_memory.SharedMemory(create=True, size=base.nbytes)
    arr = np.ndarray(base.shape, dtype=base.dtype, buffer=shm.buf)
    arr[:] = base

    p = Process(target=worker, args=(shm.name, base.shape, base.dtype))
    p.start()
    p.join()

    print(arr[:5])             # doubled
    shm.close(); shm.unlink()  # the creator unlinks
```

fork: the child gets a copy-on-write copy of the parent's memory. Fast. Default on Linux until 3.14. But if the parent had threads or held locks, the child inherits half-initialised state; many libraries (boto3, urllib3) caution against fork after threading.

spawn: the child starts a fresh interpreter and re-imports the module. Slow but clean. Default on Windows, and on macOS since 3.8.

forkserver: a small server process is forked early; subsequent workers fork from it, inheriting no parent state.
```python
import multiprocessing as mp

def work(x):
    # Module-level function: importable by name, so it pickles under spawn.
    return x * 2

if __name__ == "__main__":
    # Force a specific start method (must be done once, at program start)
    mp.set_start_method("spawn")  # or "fork", or "forkserver"

    items = range(10)

    # A pool created here uses the chosen method
    with mp.Pool(4) as pool:
        print(pool.map(work, items))

        # Pickling failure: a lambda has no importable name, so it cannot
        # be pickled for the workers.
        # pool.map(lambda x: x * 2, items)  # BAD: unpicklable
```

Key points
- Each process has its own Python interpreter and its own GIL. True CPU parallelism.
- Arguments and return values must be picklable. Lambdas, closures, file handles, locks all fail.
- Start method matters: fork (historical Unix default, fast, copy-on-write) vs spawn (Windows/macOS, slower, fresh interpreter) vs forkserver.
- Shared state needs Manager (proxied, slow) or shared_memory (fast, raw bytes).
- multiprocessing.Pool covers the same ground as concurrent.futures' ProcessPoolExecutor (map plus the imap/apply variants); the executor is usually preferred.
Follow-up questions
- When is multiprocessing preferable over ProcessPoolExecutor?
- Why does code work on Linux but fail on macOS / Windows?
- Manager.dict vs shared_memory: when to pick which?
- How does PEP 703 (no-GIL Python) change this?
Gotchas
- Forgetting `if __name__ == '__main__':` under spawn (Windows/macOS) makes workers re-run the spawning code; modern Python aborts with a RuntimeError
- Lambdas, local functions, and closures cannot be pickled; use module-level callables
- Manager proxies are SLOW; every read or write is an IPC round trip
- fork after threading is undefined behaviour in many libraries; prefer spawn or forkserver
- Forgetting shm.unlink() leaks shared memory until reboot on some platforms