CPU-bound vs I/O-bound
A CPU-bound task spends most of its time doing computation; an I/O-bound task spends most of its time waiting. The two call for very different concurrency strategies, and picking the wrong one makes the code slower, not faster.
What it is
Every task lives somewhere on a spectrum:
- CPU-bound: most of the time is spent doing math, comparisons, parsing, encoding. The CPU is pegged.
- I/O-bound: most of the time is spent waiting: for the network, the database, the disk, another service.
The bottleneck dictates the concurrency strategy. Pick wrong and the "optimization" makes the code slower.
Why it matters
This is the single most useful classification when reasoning about performance:
- A web service hitting 100% CPU at 1K req/sec needs different help than one stuck at 5% CPU but slow on tail latency.
- The right thread pool size for image processing is very different from the right size for an HTTP fan-out.
- Reaching for async/await on a CPU-bound workload is wasted work; it stalls the event loop.
Most production code is I/O-bound. Web servers, microservices, ETL pipelines, message brokers: almost all of it spends more than 80% of wall-clock time waiting on something. CPU-bound work is the special case: image/video processing, ML inference, cryptography, search indexing.
How to tell which one applies
The one-minute test
time python my_script.py
If user + sys is close to real, the program was busy → CPU-bound. If user + sys is much less than real, the program was waiting → I/O-bound.
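The same comparison can be scripted from inside Python: time.process_time() counts this process's CPU time (roughly user + sys), while time.perf_counter() counts wall-clock time. A minimal sketch; the 0.5 cut-off is an arbitrary assumption, not a standard:

```python
import time


def classify(task, threshold=0.5):
    """Run task once and compare CPU time to wall time (the same test
    as user + sys vs real). Ratio near 1 -> CPU-bound; near 0 -> I/O-bound."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return "CPU-bound" if cpu_used / wall_used > threshold else "I/O-bound"


# A busy loop pegs the CPU; a sleep just waits.
print(classify(lambda: sum(i * i for i in range(2_000_000))))  # CPU-bound
print(classify(lambda: time.sleep(0.2)))                       # I/O-bound
```

One run is enough to classify a script; for a long-lived service, sample the same two clocks over an interval instead.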
For a deeper look, profile:
- Python: py-spy top (sampling profiler, no instrumentation needed).
- Java: async-profiler, JFR, or attach a profiler in IntelliJ.
- Go: go tool pprof http://server/debug/pprof/profile.
The profile shows where the time goes. CPU-bound profiles show hot functions burning cycles. I/O-bound profiles show calls to socket reads, DB drivers, file I/O.
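When installing a sampling profiler isn't an option, Python's standard-library cProfile gives the same where-does-the-time-go view (it is a deterministic profiler, unlike py-spy). A sketch with stand-in functions for the two profile shapes:

```python
import cProfile
import io
import pstats
import time


def parse(data):
    """Stands in for a CPU-heavy hot function."""
    return sorted(data) and sum(x * x for x in data)


def fetch():
    """Stands in for a blocking I/O call."""
    time.sleep(0.1)


def handler():
    fetch()
    parse(list(range(100_000)))


profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# Sort by cumulative time: an I/O-bound profile is dominated by
# sleep/socket/driver calls, a CPU-bound one by hot application code.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```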
Why concurrency strategies diverge
For CPU-bound work, the cores cap how many useful instructions per second can run. Adding threads beyond the core count doesn't help; it only adds context-switching overhead. The tools that help:
- More cores (vertical scale).
- Faster algorithms (algorithmic improvement).
- SIMD/vector instructions, GPU offload.
- In Python: multiprocessing (the GIL means threads can't help here).
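A minimal sketch of the Python case, using a deliberately CPU-heavy stand-in function; the chunk sizes are arbitrary:

```python
from multiprocessing import Pool


def count_primes(limit):
    """Deliberately CPU-heavy: trial-division prime counting."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


if __name__ == "__main__":
    chunks = [30_000] * 8
    # Threads would serialize on the GIL here; each worker process gets
    # its own interpreter (and its own GIL), so the chunks run in parallel.
    with Pool() as pool:
        results = pool.map(count_primes, chunks)
    print(sum(results))
```

The `if __name__ == "__main__"` guard matters: on platforms that spawn workers by re-importing the module, omitting it causes an infinite process storm.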
For I/O-bound work, the bottleneck is wall-clock time spent waiting. More concurrent in-flight tasks overlap the waits, which is a pure win up to the point where the upstream service or the file-descriptor table can't keep up. The tools that help:
- More threads (each one waits during the I/O).
- Async/await (cheap concurrency without thread overhead).
- Connection pooling and keep-alive.
- In Python: threading works fine because the GIL is released around blocking calls.
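A minimal Python sketch of overlapping the waits with a thread pool; fake_request stands in for a blocking network call, and the pool size of 20 is arbitrary:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    time.sleep(0.1)  # stands in for a blocking network call
    return i


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    # All 20 sleeps overlap: the GIL is released while each thread waits.
    results = list(pool.map(fake_request, range(20)))
elapsed = time.perf_counter() - start

# 20 sequential calls would take ~2s; overlapped, they finish in ~0.1s.
print(f"{elapsed:.2f}s")
```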
The GIL trap that catches everyone
A Python developer adds threads to a CPU-heavy script. Wall-clock time gets worse. Why? The GIL forces serial execution; the threads compete for it instead of running in parallel. Symptom: CPU stays at ~100% of one core (not N cores). Fix: multiprocessing.Pool.
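The trap and the fix can be sketched as one measurement with concurrent.futures; burn is an illustrative stand-in and the job/iteration counts are arbitrary:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def burn(n):
    """Pure-Python compute: holds the GIL the whole time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(executor_cls, jobs=4, n=2_000_000):
    """Wall-clock time to run `jobs` copies of burn on the given executor."""
    start = time.perf_counter()
    with executor_cls(max_workers=jobs) as ex:
        list(ex.map(burn, [n] * jobs))
    return time.perf_counter() - start


if __name__ == "__main__":
    # Threads take roughly as long as serial execution (GIL contention);
    # processes finish in roughly 1/N of that on an N-core machine.
    print(f"threads:   {timed(ThreadPoolExecutor):.2f}s")
    print(f"processes: {timed(ProcessPoolExecutor):.2f}s")
```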
Sizing thread pools: Brian Goetz's formula
For mixed workloads (some compute, some wait), the optimal thread count is roughly:
N_threads = N_cores × target_utilization × (1 + wait_time / compute_time)
Worked example: 8-core machine, target 100% utilization, each task is 50ms of network wait + 5ms of compute → ratio 10 → optimal pool ≈ 8 × 1.0 × 11 = 88 threads.
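The arithmetic as a small helper; goetz_pool_size is a hypothetical name, the formula and the worked numbers are from the text:

```python
def goetz_pool_size(cores, utilization, wait_ms, compute_ms):
    """N_threads = N_cores * target_utilization * (1 + wait / compute)."""
    return round(cores * utilization * (1 + wait_ms / compute_ms))


# Worked example from the text: 8 cores, 100% target utilization,
# 50 ms of waiting per 5 ms of compute -> 88 threads.
print(goetz_pool_size(8, 1.0, 50, 5))  # 88

# Pure compute (no waiting) collapses to the core count.
print(goetz_pool_size(8, 1.0, 0, 5))   # 8
```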
Modern shortcut: With Java 21+ virtual threads or Go goroutines, the formula collapses to "spawn one per task." The runtime handles the multiplexing onto cores. You still need to bound the in-flight count to avoid hammering upstream services, but the thread-vs-task arithmetic stops mattering.
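In Python's asyncio, the analogous pattern is one coroutine per task with a semaphore bounding the in-flight count. A sketch; the limit of 10 and the sleep standing in for an HTTP call are arbitrary assumptions:

```python
import asyncio


async def call_upstream(i, limit):
    async with limit:              # caps simultaneous calls
        await asyncio.sleep(0.05)  # stands in for an HTTP request
        return i


async def main():
    limit = asyncio.Semaphore(10)
    # Spawn one coroutine per request, but the semaphore ensures the
    # upstream service never sees more than 10 calls in flight.
    return await asyncio.gather(*(call_upstream(i, limit) for i in range(100)))


results = asyncio.run(main())
print(len(results))  # 100
```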
When the categories blur
- High-throughput network code can become CPU-bound at the parser/serializer.
- Async services with CPU-heavy hot paths stall the event loop, common in Python/Node services that grew CPU work over time.
- GPU-accelerated compute is "I/O" from the CPU's perspective: dispatch → wait → result.
The interview answer that wins: When asked "is X CPU or I/O bound?", the right move is "let me check," followed by describing how to measure it. The answer that loses is guessing from intuition. Profilers exist for a reason.
Implementations
For CPU-bound work, exceeding the core count doesn't help; extra threads just compete. For I/O-bound work, the ideal pool size is cores × (1 + wait_time / compute_time). A worker that's 90% I/O can be sized at 10× the core count.
// CPU-bound: pool size = N cores
int cores = Runtime.getRuntime().availableProcessors();
ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

// I/O-bound: pool size much larger
// Goetz's formula: N_threads = N_cpu * U_cpu * (1 + W/C)
// where U_cpu = target utilization, W = wait time, C = compute time per task
// Example: 8 cores, 1.0 utilization, ratio 50ms wait / 5ms compute = 10
// → 8 * 1.0 * 11 = 88 threads
ExecutorService ioPool = Executors.newFixedThreadPool(88);

// Or use virtual threads (Java 21+); sizing becomes irrelevant for I/O
ExecutorService virtualPool = Executors.newVirtualThreadPerTaskExecutor();
Key points
- CPU-bound: bottleneck is compute. Speedup requires more cores or faster code.
- I/O-bound: bottleneck is waiting (network, disk, DB). Speedup requires more concurrency to overlap waits.
- Threads help I/O-bound work even with the GIL; Python releases the GIL around blocking calls.
- Threads do NOT help CPU-bound work in Python (the GIL serializes execution); use multiprocessing.
- Async/await scales I/O-bound work to 100K+ tasks but does nothing for CPU-bound work.
- Right pool size: CPU-bound ≈ core count; I/O-bound ≈ much higher (often hundreds).
Follow-up questions
- How does one tell if code is CPU or I/O bound?
- Why doesn't async/await help CPU-bound code?
- What's the right thread pool size?
- When does CPU-bound matter in practice?
- Why does Python's GIL release on I/O?
Gotchas
- A function can be CPU-bound in one workload and I/O-bound in another; measure on actual data.
- Hashing, JSON parsing, and compression in "I/O" code paths are often the hidden CPU bottleneck.
- GPU-accelerated compute is technically I/O from the CPU's perspective; that's a different concurrency model again.
- Network I/O can become CPU-bound at very high throughput (parsing dominates).
- asyncio + a CPU-heavy task = a stalled event loop and timeouts on every concurrent task.
Common pitfalls
- Adding threads/goroutines to fix slowness without checking what's bottlenecked
- Sizing all thread pools to core count, even for I/O work
- Mixing CPU-heavy work into an async function without offloading to a worker
- Assuming SSD I/O is "fast enough to not matter" when it's still roughly 1000× slower than RAM
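The third pitfall has a standard fix: hand the CPU-heavy call to a worker pool from the async path. A Python sketch, assuming the work and its arguments are picklable; render_thumbnail is an illustrative stand-in, and a thread executor would not help here because of the GIL:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def render_thumbnail(n):
    """Stands in for CPU-heavy work that would stall the event loop."""
    return sum(i * i for i in range(n))


async def handle_request(pool, n):
    loop = asyncio.get_running_loop()
    # Offload to a worker process; the event loop keeps serving other
    # coroutines while the compute runs elsewhere.
    return await loop.run_in_executor(pool, render_thumbnail, n)


async def main():
    with ProcessPoolExecutor() as pool:
        return await asyncio.gather(
            *(handle_request(pool, 100_000) for _ in range(4))
        )


if __name__ == "__main__":
    print(asyncio.run(main()))
```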
APIs worth memorising
- Python: concurrent.futures.ThreadPoolExecutor (I/O), ProcessPoolExecutor (CPU), asyncio (I/O)
- Java: Executors.newFixedThreadPool, ForkJoinPool (CPU divide-and-conquer), virtual threads (I/O)
- Go: GOMAXPROCS (defaults to cores), runtime.NumCPU(), bounded channels for I/O fan-out
Every performance-tuning conversation starts here. Sizing pools wrong is the most common scaling bug at growth-stage companies. Netflix, Uber, and Stripe have public engineering posts about specific instances of this exact issue.