The Python GIL, What It Is and Why It Matters
The Global Interpreter Lock is a mutex that serialises execution of Python bytecode in CPython. It makes single-threaded code fast and simple, but means threads cannot run Python code in parallel, even on a 64-core machine. Reach for multiprocessing for CPU-bound work, threading or asyncio for I/O.
[Diagram: threads queueing on the GIL under CPU-bound work vs. releasing it during blocking I/O]
The GIL in plain English
The diagram above shows the two regimes side by side: under CPU work, threads queue on the GIL and only one core makes progress; under blocking I/O, the GIL is released, other threads run, and wall-clock time collapses.
Inside CPython (the Python interpreter most people use) there's a single lock called the Global Interpreter Lock. The rule is simple: only one thread can be running Python code at any moment. Even on a 32-core machine with 32 threads, exactly one of them is executing Python at any given instant.
Two consequences fall out of this:
- Python threads cannot speed up CPU work. Four threads doing math on four cores end up taking turns on one core. No throughput win.
- Python threads can speed up I/O work. When a thread waits on the network, a file read, or time.sleep, the GIL gets released. Other threads run during the wait. 10-20x speedups on I/O-heavy code are typical.
That's it. Almost every Python performance question is downstream of this one rule.
The GIL only blocks Python bytecode. It does not block a thread that is waiting on something outside Python (sockets, files, sleep, subprocess.wait, NumPy operations, etc.).
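A quick sanity check of that rule (a sketch; exact timings vary by machine): four threads that each sleep for one second finish in about one second of wall-clock time, not four, because the C-level sleep releases the GIL.

```python
import time
from threading import Thread

def wait(_):
    time.sleep(1)  # C-level sleep: releases the GIL while blocked

t0 = time.perf_counter()
threads = [Thread(target=wait, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - t0
print(f"4 one-second sleeps in {elapsed:.2f}s")  # ~1s, not ~4s
```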
When the GIL is released
There are two situations where a Python thread holding the GIL hands it back so another thread can run.
| Trigger | Detail |
|---|---|
| Around a blocking I/O call | Functions like socket.recv, time.sleep, file.read, database drivers, and subprocess.wait are implemented in C. The C code explicitly releases the GIL, makes the syscall, and reacquires the GIL once the syscall returns. |
| Periodically, every ~5 milliseconds | The interpreter checks whether it should hand the GIL to another waiting thread. The interval is configurable via sys.setswitchinterval. (Pre-3.2 CPython did this every 100 bytecodes; modern CPython is time-based.) |
The first case is the one that buys parallelism. While one thread is waiting on the network, another thread is free to run Python code. The second case rarely buys parallelism in pure-Python CPU loops, because the same thread usually reacquires the GIL immediately at the tick. To anyone watching, it looks as if the GIL is held continuously. That is why CPU-threaded Python code shows almost no parallelism even on a many-core machine.
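Both the default interval and the knob for changing it are visible from Python (a sketch; 0.005 is the documented default in current CPython):

```python
import sys

# A thread holding the GIL is asked to yield roughly this often (seconds)
print(sys.getswitchinterval())  # 0.005 by default

# Tunable, though almost never worth touching
sys.setswitchinterval(0.001)
print(sys.getswitchinterval())
sys.setswitchinterval(0.005)  # restore the default
```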
Why += is NOT atomic even with the GIL
A persistent myth says "the GIL serialises Python code, so shared variables between threads are safe". They are not. The GIL serialises one bytecode at a time, not one source-code statement at a time. A statement like counter += 1 compiles to four separate bytecodes, and the GIL can be released between any two of them.
Each individual bytecode (LOAD_FAST, BINARY_OP, STORE_FAST) is atomic on its own; the GIL guarantees that. What the GIL does not guarantee is that the four bytecodes of counter += 1 run as one indivisible group. The compiled sequence looks like this:

```
LOAD_FAST  counter   ← GIL switch can happen after this
LOAD_CONST 1
BINARY_OP  add       ← or here
STORE_FAST counter
```
The fix is the same as in any other language: protect the read-modify-write with a threading.Lock, or use a primitive that bundles the operation atomically (a Queue, an Event, an itertools.count counter handed out monotonically).
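A minimal sketch of the Lock fix: with the read-modify-write inside the lock, the result is deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def inc():
    global counter
    for _ in range(100_000):
        with lock:          # read-modify-write is now one indivisible unit
            counter += 1

threads = [threading.Thread(target=inc) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 400000, every run
```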
The GIL is not a synchronisation primitive
For mutable state shared across threads, use threading.Lock, threading.RLock, or queue.Queue. Code that relies on the GIL for correctness breaks on PyPy, on free-threaded CPython 3.13+, and even on regular CPython for compound operations.
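For example, a queue.Queue hand-off keeps all the locking inside the queue itself, so threads never touch shared mutable state directly (a sketch with an illustrative worker and sentinel protocol):

```python
import queue
import threading

q = queue.Queue()   # thread-safe: put/get do their own locking internally
results = []

def worker():
    while True:
        item = q.get()
        if item is None:    # sentinel: time to exit
            break
        results.append(item * item)
        q.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(10):
    q.put(i)
q.join()        # block until every queued item has been processed
q.put(None)     # tell the worker to shut down
t.join()
print(sorted(results))
```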
When to reach for what
| Workload | Right tool |
|---|---|
| Many concurrent HTTP requests | threading or asyncio |
| CPU-heavy numerical work | multiprocessing (or NumPy/PyTorch which release the GIL) |
| 10K+ concurrent connections | asyncio |
| Mixed I/O and CPU | asyncio + to_thread for blocking calls, or process pool for CPU |
| Bypass GIL entirely | Free-threaded CPython 3.13+ (experimental) |
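The "Mixed I/O and CPU" row can be sketched like this: asyncio.to_thread pushes the blocking CPU call onto a worker thread so the event loop stays responsive while it runs (cpu_task and io_task are illustrative names, not from any library):

```python
import asyncio

def cpu_task(n):                  # blocking, CPU-bound
    return sum(i * i for i in range(n))

async def io_task():              # cooperative I/O stand-in
    await asyncio.sleep(0.1)
    return "io done"

async def main():
    # to_thread wraps the blocking call so the loop keeps servicing I/O
    return await asyncio.gather(
        asyncio.to_thread(cpu_task, 1_000_000),
        io_task(),
    )

results = asyncio.run(main())
print(results)
```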
What's changing, PEP 703 and free-threaded CPython
In 2023, the Python Steering Council accepted PEP 703: a no-GIL build of CPython. Starting with 3.13, opting in to a free-threaded interpreter (python3.13t) lets threads run Python code in parallel. The single-thread performance cost is currently ~10–15%, and many C extensions need updates to work safely. Adoption will be gradual; the GIL will not disappear from default CPython for years.
The interview answer for 2026: "Today, Python threads can't parallelise CPU work because of the GIL. PEP 703 is changing that: free-threaded CPython is an opt-in build in 3.13, supported (still non-default) in 3.14, and gradually moving toward default. For now, multiprocessing is the answer for CPU and threading/asyncio for I/O. The fundamentals will outlast the GIL."
Python's concurrency primitives
- threading (limited by GIL for CPU work)
- multiprocessing (separate processes, no GIL)
- asyncio (single-threaded, GIL irrelevant)
- concurrent.futures.{ThreadPoolExecutor, ProcessPoolExecutor}
Implementation
Pure Python CPU work doesn't benefit from threads: the GIL serialises execution and adds context-switch overhead on top. The threaded version is reliably slower than serial. This catches every Python developer at least once.
```python
import time
from threading import Thread

def cpu_work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 50_000_000

# Serial, baseline
t0 = time.perf_counter()
cpu_work(N); cpu_work(N); cpu_work(N); cpu_work(N)
print(f"Serial: {time.perf_counter() - t0:.2f}s")

# 4 threads, slower despite 4 cores
t0 = time.perf_counter()
threads = [Thread(target=cpu_work, args=(N,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded: {time.perf_counter() - t0:.2f}s")

# Sample output on 8-core machine:
# Serial: 8.4s
# Threaded: 9.1s ← worse, not better
```

Blocking I/O calls (sockets, file I/O, sleep) release the GIL while waiting. Other Python threads run during the wait. Going from serial to 20 threads on 100 HTTP requests gives a 15-20× speedup.
```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://httpbin.org/delay/1?id={i}" for i in range(20)]

# Serial, 20 seconds
t0 = time.perf_counter()
[requests.get(u) for u in urls]
print(f"Serial: {time.perf_counter() - t0:.1f}s")

# 20 threads, ~1 second
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as ex:
    list(ex.map(requests.get, urls))
print(f"Threaded: {time.perf_counter() - t0:.1f}s")
```

Each process has its own Python interpreter and its own GIL. Four processes really do run in parallel on four cores. Cost: ~50ms process startup, IPC overhead for arguments/results, no shared memory by default.
```python
from multiprocessing import Pool
import time

def cpu_work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 50_000_000

if __name__ == "__main__":   # required under the 'spawn' start method (Windows/macOS default)
    t0 = time.perf_counter()
    with Pool(4) as pool:
        pool.map(cpu_work, [N] * 4)
    print(f"4 processes: {time.perf_counter() - t0:.2f}s")

# On 8-core: ~2.2s, true parallel speedup
```

counter += 1 compiles to LOAD_FAST, LOAD_CONST, BINARY_ADD, STORE_FAST. The GIL can be released between bytecodes. Two threads racing on the same counter can lose updates. Always use threading.Lock for shared mutable state.
Note: under the modern compiler, BINARY_ADD was unified into BINARY_OP; the principle (multiple bytecodes, interleaving) is unchanged.
```python
import threading
import dis

counter = 0

def inc():
    global counter
    for _ in range(1_000_000):
        counter += 1  # NOT atomic

ts = [threading.Thread(target=inc) for _ in range(4)]
for t in ts: t.start()
for t in ts: t.join()
print(counter)  # almost always < 4_000_000

# Disassemble to see the multiple bytecodes
def increment(x):
    x += 1
dis.dis(increment)
# Emits LOAD_FAST, LOAD_CONST, BINARY_OP, STORE_FAST — interleavable
```

Key points
- The GIL is a CPython implementation detail, not part of the language spec
- Threads release the GIL around blocking I/O calls, so threading IS useful for I/O work
- Threads do NOT help CPU-bound code in CPython, they serialise via the GIL
- The GIL switches between threads roughly every 5ms (sys.setswitchinterval)
- Jython and IronPython have no GIL (PyPy keeps one); CPython 3.13+ has experimental free-threaded mode (PEP 703)
- Even with the GIL, compound operations like += are NOT atomic: bytecode boundaries can interleave
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| threading | Simple API, shared memory, cheap to start | GIL blocks CPU parallelism | I/O-bound work, HTTP, DB, file system |
| multiprocessing | True parallelism, one GIL per process | ~50ms startup, IPC/pickling overhead, no shared memory by default | CPU-bound work, numerical compute, image processing, data transformation |
| asyncio | Handles 10K+ connections on one thread | One blocking call stalls the loop; needs async-aware libraries | High-concurrency I/O, web servers, scrapers, message brokers |
| Free-threaded CPython (3.13+) | Threads run Python code in parallel | ~10–15% single-thread cost, experimental, C-extension compatibility | When mature, currently for adventurous CPU-heavy workloads |
Follow-up questions
- Why does the GIL exist?
- If the GIL serialises Python bytecode, why is += not atomic?
- Does the GIL apply to NumPy / Pandas / TensorFlow operations?
- Will the GIL be removed?
- Why does asyncio not have GIL problems?
Gotchas
- !Adding threading to CPU-bound code makes it slower, measure before adding
- !multiprocessing on Windows/macOS uses 'spawn' by default, globals from main aren't copied
- !Pickle errors when passing un-picklable objects (lambdas, locks) to ProcessPoolExecutor
- !asyncio + a CPU-heavy synchronous call → entire event loop stalls until it finishes
- !Mixing threads and asyncio is a footgun, prefer asyncio.to_thread() for blocking calls
- !The GIL is released around blocking I/O, but NOT around CPU loops in pure Python
Common pitfalls
- Reaching for threading first when the bottleneck is CPU
- Using multiprocessing for tiny tasks, IPC overhead dwarfs the work
- Assuming free-threaded CPython will solve all CPU-Python problems, it'll have its own quirks
- Using sys.setswitchinterval() to 'tune' the GIL, almost never the right answer
APIs worth memorising
- threading: Thread, Lock, Event, Condition, Semaphore
- multiprocessing: Process, Pool, Queue, Pipe, shared_memory
- asyncio: run, create_task, gather, Queue, to_thread, Lock
- concurrent.futures: ThreadPoolExecutor, ProcessPoolExecutor, Future, as_completed
Every Python web service navigates this. Django/Flask sync workers run as multiple processes (gunicorn -w N). FastAPI uses asyncio. Celery uses multiprocessing for CPU tasks. NumPy/PyTorch release the GIL internally. The GIL shapes the entire Python ecosystem.