The Python GIL, What It Is and Why It Matters
The Global Interpreter Lock is a mutex that serialises execution of Python bytecode in CPython. It makes single-threaded code fast and simple, but means threads cannot run Python code in parallel, even on a 64-core machine. Reach for multiprocessing for CPU-bound work, threading or asyncio for I/O.
[Diagram: threads queueing on the GIL under CPU-bound work vs. releasing it during blocking I/O]
The GIL in plain English
The diagram above shows the two regimes side by side: under CPU work, threads queue on the GIL and only one core makes progress; under blocking I/O, the GIL is released, other threads run, and wall-clock time collapses.
Inside CPython (the Python interpreter most people use) there's a single lock called the Global Interpreter Lock. The rule is simple: only one thread can be running Python code at any moment. Even on a 32-core machine with 32 threads, exactly one of them is executing Python at any given instant.
Two consequences fall out of this:
- Python threads cannot speed up CPU work. Four threads doing math on four cores end up taking turns on one core. No throughput win.
- Python threads can speed up I/O work. When a thread waits on the network, a file read, or time.sleep, the GIL gets released. Other threads run during the wait. 10-20x speedups on I/O-heavy code are typical.
That's it. Almost every Python performance question is downstream of this one rule.
The GIL only blocks Python bytecode. It does not block a thread that is waiting on something outside Python (sockets, files, sleep, subprocess.wait, NumPy operations, etc.).
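A quick sanity check of that rule (a sketch; exact timings vary by machine): four threads that each sleep for one second finish in about one second of wall-clock time, not four, because the C-level sleep releases the GIL.

```python
import time
from threading import Thread

def wait(_):
    time.sleep(1)  # C-level sleep: releases the GIL while blocked

t0 = time.perf_counter()
threads = [Thread(target=wait, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - t0
print(f"4 one-second sleeps in {elapsed:.2f}s")  # ~1s, not ~4s
```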
When the GIL is released
There are two situations where a Python thread holding the GIL hands it back so another thread can run.
| Trigger | Detail |
|---|---|
| Around a blocking I/O call | Functions like socket.recv, time.sleep, file.read, database drivers, and subprocess.wait are implemented in C. The C code explicitly releases the GIL, makes the syscall, and reacquires the GIL once the syscall returns. |
| Periodically, every ~5 milliseconds | The interpreter checks whether it should hand the GIL to another waiting thread. The interval is configurable via sys.setswitchinterval. (Pre-3.2 CPython did this every 100 bytecodes; modern CPython is time-based.) |
The first case is the one that buys parallelism. While one thread is waiting on the network, another thread is free to run Python code. The second case rarely buys parallelism in pure-Python CPU loops, because the same thread usually reacquires the GIL immediately at the tick. To anyone watching, it looks as if the GIL is held continuously. That is why CPU-threaded Python code shows almost no parallelism even on a many-core machine.
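Both the default interval and the knob for changing it are visible from Python (a sketch; 0.005 is the documented default in current CPython):

```python
import sys

# A thread holding the GIL is asked to yield roughly this often (seconds)
print(sys.getswitchinterval())  # 0.005 by default

# Tunable, though almost never worth touching
sys.setswitchinterval(0.001)
print(sys.getswitchinterval())
sys.setswitchinterval(0.005)  # restore the default
```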
Why += is NOT atomic even with the GIL
A persistent myth says "the GIL serialises Python code, so shared variables between threads are safe". They are not. The GIL serialises one bytecode at a time, not one source-code statement at a time. A statement like counter += 1 compiles to four separate bytecodes, and the GIL can be released between any two of them.
Each individual bytecode (LOAD_FAST, BINARY_OP, STORE_FAST) is atomic on its own; the GIL guarantees that. What the GIL does not guarantee is that the four bytecodes of counter += 1 run as one indivisible group. The compiled sequence looks like this:

```
LOAD_FAST  counter   ← GIL switch can happen after this
LOAD_CONST 1
BINARY_OP  add       ← or here
STORE_FAST counter
```
The fix is the same as in any other language: protect the read-modify-write with a threading.Lock, or use a primitive that bundles the operation atomically (a Queue, an Event, an itertools.count counter handed out monotonically).
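A minimal sketch of the Lock fix: with the read-modify-write inside the lock, the result is deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def inc():
    global counter
    for _ in range(100_000):
        with lock:          # read-modify-write is now one indivisible unit
            counter += 1

threads = [threading.Thread(target=inc) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 400000, every run
```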
The GIL is not a synchronisation primitive
For mutable state shared across threads, use threading.Lock, threading.RLock, or queue.Queue. Code that relies on the GIL for correctness breaks on PyPy, on free-threaded CPython 3.13+, and even on regular CPython for compound operations.
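For example, a queue.Queue hand-off keeps all the locking inside the queue itself, so threads never touch shared mutable state directly (a sketch with an illustrative worker and sentinel protocol):

```python
import queue
import threading

q = queue.Queue()   # thread-safe: put/get do their own locking internally
results = []

def worker():
    while True:
        item = q.get()
        if item is None:    # sentinel: time to exit
            break
        results.append(item * item)
        q.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(10):
    q.put(i)
q.join()        # block until every queued item has been processed
q.put(None)     # tell the worker to shut down
t.join()
print(sorted(results))
```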
When to reach for what
| Workload | Right tool |
|---|---|
| Many concurrent HTTP requests | threading or asyncio |
| CPU-heavy numerical work | multiprocessing (or NumPy/PyTorch which release the GIL) |
| 10K+ concurrent connections | asyncio |
| Mixed I/O and CPU | asyncio + to_thread for blocking calls, or process pool for CPU |
| Bypass GIL entirely | Free-threaded CPython 3.13+ (experimental) |
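The "Mixed I/O and CPU" row can be sketched like this: asyncio.to_thread pushes the blocking CPU call onto a worker thread so the event loop stays responsive while it runs (cpu_task and io_task are illustrative names, not from any library):

```python
import asyncio

def cpu_task(n):                  # blocking, CPU-bound
    return sum(i * i for i in range(n))

async def io_task():              # cooperative I/O stand-in
    await asyncio.sleep(0.1)
    return "io done"

async def main():
    # to_thread wraps the blocking call so the loop keeps servicing I/O
    return await asyncio.gather(
        asyncio.to_thread(cpu_task, 1_000_000),
        io_task(),
    )

results = asyncio.run(main())
print(results)
```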
What's changing, PEP 703 and free-threaded CPython
In 2023, the Python Steering Council accepted PEP 703: a no-GIL build of CPython. Starting with 3.13, opting in to a free-threaded interpreter (python3.13t) lets threads run Python code in parallel. The single-thread performance cost is currently ~10–15%, and many C extensions need updates to work safely. Adoption will be gradual; the GIL will not disappear from default CPython for years.
The interview answer for 2026: "Today, Python threads can't parallelise CPU work because of the GIL. PEP 703 is changing that: free-threaded CPython is an opt-in build in 3.13, supported (still non-default) in 3.14, and gradually moving toward default. For now, multiprocessing is the answer for CPU and threading/asyncio for I/O. The fundamentals will outlast the GIL."
Python's concurrency primitives
- threading (limited by GIL for CPU work)
- multiprocessing (separate processes, no GIL)
- asyncio (single-threaded, GIL irrelevant)
- concurrent.futures.{ThreadPoolExecutor, ProcessPoolExecutor}
Implementation
Pure Python CPU work doesn't benefit from threads: the GIL serialises execution and adds context-switch overhead on top. The threaded version is reliably slower than serial. This catches every Python developer at least once.
```python
import time
from threading import Thread

def cpu_work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 50_000_000

# Serial, baseline
t0 = time.perf_counter()
cpu_work(N); cpu_work(N); cpu_work(N); cpu_work(N)
print(f"Serial: {time.perf_counter() - t0:.2f}s")

# 4 threads, slower despite 4 cores
t0 = time.perf_counter()
threads = [Thread(target=cpu_work, args=(N,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.perf_counter() - t0:.2f}s")

# Sample output on 8-core machine:
# Serial: 8.4s
# Threaded: 9.1s ← worse, not better
```

Blocking I/O calls (sockets, file I/O, sleep) release the GIL while waiting. Other Python threads run during the wait. Going from serial to 20 threads on 100 HTTP requests gives a 15-20× speedup.
```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://httpbin.org/delay/1?id={i}" for i in range(20)]

# Serial, 20 seconds
t0 = time.perf_counter()
[requests.get(u) for u in urls]
print(f"Serial: {time.perf_counter() - t0:.1f}s")

# 20 threads, ~1 second
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as ex:
    list(ex.map(requests.get, urls))
print(f"Threaded: {time.perf_counter() - t0:.1f}s")
```

Each process has its own Python interpreter and its own GIL. Four processes really do run in parallel on four cores. Cost: ~50ms process startup, IPC overhead for arguments/results, no shared memory by default.
```python
from multiprocessing import Pool
import time

def cpu_work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 50_000_000

if __name__ == "__main__":   # required under the 'spawn' start method (Windows/macOS default)
    t0 = time.perf_counter()
    with Pool(4) as pool:
        pool.map(cpu_work, [N] * 4)
    print(f"4 processes: {time.perf_counter() - t0:.2f}s")

# On 8-core: ~2.2s, true parallel speedup
```

counter += 1 compiles to LOAD_FAST, LOAD_CONST, BINARY_ADD, STORE_FAST. The GIL can be released between bytecodes. Two threads racing on the same counter can lose updates. Always use threading.Lock for shared mutable state.
Note: under the modern compiler, BINARY_ADD was unified into BINARY_OP; the principle (multiple bytecodes, interleaving) is unchanged.
```python
import threading
import dis

counter = 0

def inc():
    global counter
    for _ in range(1_000_000):
        counter += 1  # NOT atomic

ts = [threading.Thread(target=inc) for _ in range(4)]
for t in ts: t.start()
for t in ts: t.join()
print(counter)  # almost always < 4_000_000

# Disassemble to see the multiple bytecodes
def increment(x):
    x += 1
dis.dis(increment)
# Emits LOAD_FAST, LOAD_CONST, BINARY_OP, STORE_FAST — interleavable
```

Key points
- The GIL is a CPython implementation detail, not part of the language spec
- Threads release the GIL around blocking I/O calls, so threading IS useful for I/O work
- Threads do NOT help CPU-bound code in CPython, they serialise via the GIL
- The GIL switches between threads roughly every 5ms (sys.setswitchinterval)
- Jython and IronPython have no GIL (PyPy keeps one); CPython 3.13+ has experimental free-threaded mode (PEP 703)
- Even with the GIL, compound operations like += are NOT atomic: bytecode boundaries can interleave
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| threading | Simple API, shared memory, cheap to start | GIL blocks CPU parallelism | I/O-bound work, HTTP, DB, file system |
| multiprocessing | True parallelism, one GIL per process | ~50ms startup, IPC/pickling overhead, no shared memory by default | CPU-bound work, numerical compute, image processing, data transformation |
| asyncio | Handles 10K+ connections on one thread | One blocking call stalls the loop; needs async-aware libraries | High-concurrency I/O, web servers, scrapers, message brokers |
| Free-threaded CPython (3.13+) | Threads run Python code in parallel | ~10–15% single-thread cost, experimental, C-extension compatibility | When mature, currently for adventurous CPU-heavy workloads |
Follow-up questions
- Why does the GIL exist?
- If the GIL serialises Python bytecode, why is += not atomic?
- Does the GIL apply to NumPy / Pandas / TensorFlow operations?
- Will the GIL be removed?
- Why does asyncio not have GIL problems?
Gotchas
- !Adding threading to CPU-bound code makes it slower, measure before adding
- !multiprocessing on Windows/macOS uses 'spawn' by default, globals from main aren't copied
- !Pickle errors when passing un-picklable objects (lambdas, locks) to ProcessPoolExecutor
- !asyncio + a CPU-heavy synchronous call → entire event loop stalls until it finishes
- !Mixing threads and asyncio is a footgun, prefer asyncio.to_thread() for blocking calls
- !The GIL is released around blocking I/O, but NOT around CPU loops in pure Python
Common pitfalls
- Reaching for threading first when the bottleneck is CPU
- Using multiprocessing for tiny tasks, IPC overhead dwarfs the work
- Assuming free-threaded CPython will solve all CPU-Python problems, it'll have its own quirks
- Using sys.setswitchinterval() to 'tune' the GIL, almost never the right answer
APIs worth memorising
- threading: Thread, Lock, Event, Condition, Semaphore
- multiprocessing: Process, Pool, Queue, Pipe, shared_memory
- asyncio: run, create_task, gather, Queue, to_thread, Lock
- concurrent.futures: ThreadPoolExecutor, ProcessPoolExecutor, Future, as_completed
Every Python web service navigates this. Django/Flask sync workers run as multiple processes (gunicorn -w N). FastAPI uses asyncio. Celery uses multiprocessing for CPU tasks. NumPy/PyTorch release the GIL internally. The GIL shapes the entire Python ecosystem.