Memory Overhead Comparison
OS thread: ~1MB stack reserved. Goroutine: ~2KB initial, grows on demand. Virtual thread (Java 21): ~few KB. asyncio task: ~1KB. The difference dictates whether the same hardware can run 10K vs 1M concurrent units.
The numbers, ranked
| Unit | Per-unit memory | 1M units cost |
|---|---|---|
| OS thread | ~1MB stack + KBs of TCB | ~1TB (impossible) |
| Goroutine | ~2KB-8KB typical | ~2-8GB |
| Virtual thread | ~3-5KB | ~3-5GB |
| asyncio task | ~1KB | ~1GB |
| HTTP connection | ~few KB (kernel buffers) | ~few GB |
The factor between "OS thread" and "lightweight" is 100-1000x. That's the difference between 10K and 1M concurrent units on the same hardware.
Where memory overhead bites
The "10K problem" reborn The C10K problem (10K concurrent connections per server) was about scaling beyond what one-thread-per-connection allowed. We solved it with async (Node.js, asyncio) and event loops (epoll, kqueue).
Now we face C10M and beyond: 10 million concurrent connections per server. That is possible only with virtual threads, goroutines, or async tasks, where per-connection memory is KB-sized, not MB-sized.
The hidden cost: ThreadLocal
50 ThreadLocals registered by frameworks × 1M virtual threads × ~100B per entry ≈ 5GB
Java's virtual-thread era exposed this. ThreadLocal was fine at thousands of threads; at millions, the per-thread state dwarfs the threads themselves. Hence ScopedValue, previewed in Java 21 (JEP 446).
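The retained-state arithmetic behind that estimate can be sanity-checked directly. The constants (50 ThreadLocals, ~100 bytes each) are the assumed averages from the estimate above, not measured values:

```java
// Hedged sketch: the back-of-envelope behind the ThreadLocal estimate.
// Each thread that touches a ThreadLocal gets its own entry in its
// ThreadLocalMap, so retained state scales as locals * threads * entrySize.
public class ThreadLocalCost {
    public static void main(String[] args) {
        long locals = 50;          // assumed: ThreadLocals registered by frameworks
        long threads = 1_000_000;  // virtual threads
        long avgBytes = 100;       // assumed average payload per entry
        long total = locals * threads * avgBytes;
        System.out.println("~" + total / 1_000_000_000 + " GB of per-thread state");
        // prints: ~5 GB of per-thread state
    }
}
```

The point of the multiplication: none of the three factors looks alarming on its own; the product is what kills you at virtual-thread scale.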
How to measure
- Linux: `ps -o rss,vsz <pid>` shows physical (RSS) and virtual (VSZ) memory.
- Java: `jcmd <pid> VM.native_memory summary` breaks usage down by thread, GC, and code cache (requires starting the JVM with `-XX:NativeMemoryTracking=summary`).
- Go: `runtime.MemStats` reports heap, stack, and GC overhead.
- Per-mapping: `pmap -x <pid>` lists mapped regions, including thread stacks, with sizes.
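A JVM can also inspect itself from inside via the standard `java.lang.management` beans, which is the programmatic counterpart of the `jcmd`/`ps` commands above (a minimal sketch; the class name is chosen here):

```java
// Hedged sketch: self-inspection via MXBeans. Heap usage covers virtual
// thread stacks (they live on the heap); platform thread stacks are
// native and only show up in NMT or RSS, not here.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class SelfInspect {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("heap used:     " + mem.getHeapMemoryUsage().getUsed());
        System.out.println("non-heap used: " + mem.getNonHeapMemoryUsage().getUsed());
        System.out.println("live threads:  " + threads.getThreadCount());
    }
}
```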
Reproduce the numbers locally
Spawn N concurrent units that all park, snapshot RSS, divide. Useful for sanity-checking the table on local hardware.
```go
// Goroutine memory cost. Run: go run thread_mem.go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	const N = 100_000
	var wg sync.WaitGroup
	block := make(chan struct{})

	var m1, m2 runtime.MemStats
	runtime.ReadMemStats(&m1)

	wg.Add(N)
	for i := 0; i < N; i++ {
		go func() { defer wg.Done(); <-block }()
	}
	runtime.ReadMemStats(&m2)

	// StackInuse counts allocated goroutine stacks, so the delta per
	// goroutine approximates the initial stack size.
	delta := float64(m2.StackInuse-m1.StackInuse) / float64(N)
	fmt.Printf("%d goroutines parked, ~%.0f bytes stack each\n", N, delta)

	close(block)
	wg.Wait()
	// Typical output: 100000 goroutines parked, ~2200 bytes stack each
}
```
```java
// Virtual thread memory cost. Run on Java 21+.
import java.util.concurrent.CountDownLatch;

public class VThreadMem {
    public static void main(String[] args) throws Exception {
        int n = 100_000;
        CountDownLatch park = new CountDownLatch(1);
        Runtime.getRuntime().gc();
        long before = used();
        for (int i = 0; i < n; i++) {
            Thread.startVirtualThread(() -> {
                try { park.await(); } catch (InterruptedException e) {}
            });
        }
        Thread.sleep(500); // let the threads start and park
        long after = used();
        System.out.printf("%d virtual threads, ~%d bytes each%n",
                n, (after - before) / n);
        park.countDown();
    }

    // Virtual thread stacks live on the Java heap, so heap usage
    // captures them (unlike platform threads' native stacks).
    static long used() {
        Runtime r = Runtime.getRuntime();
        return r.totalMemory() - r.freeMemory();
    }
}
// Typical: ~3000-5000 bytes per parked virtual thread
```
```python
# Asyncio task memory cost
import asyncio, tracemalloc

async def park(ev):
    await ev.wait()

async def main():
    n = 100_000
    ev = asyncio.Event()
    tracemalloc.start()
    snap1 = tracemalloc.take_snapshot()
    tasks = [asyncio.create_task(park(ev)) for _ in range(n)]
    await asyncio.sleep(0.1)  # let every task run once and park on the event
    snap2 = tracemalloc.take_snapshot()
    diff = sum(s.size for s in snap2.statistics("filename")) - \
           sum(s.size for s in snap1.statistics("filename"))
    print(f"{n} tasks, ~{diff // n} bytes each")
    ev.set()
    await asyncio.gather(*tasks)

asyncio.run(main())
# Typical: ~1500-2500 bytes per parked task
```
For OS threads, the cheapest check is the same Java program with Thread.ofPlatform().start(...), but watch RSS (`ps -o rss`) or NMT output instead of heap usage: platform-thread stacks are native memory, invisible to Runtime.totalMemory(). Each thread reserves a ~1MB stack up front (visible in VSZ; RSS grows only as stack pages are touched), and spawning 100K platform threads will exhaust memory or hit OS thread limits on most laptops.
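A minimal sketch of that platform-thread variant, assuming Linux (it reads `/proc/self/status`, the same figure `ps -o rss` reports; the class name and the small N are chosen here, since 100K platform threads would OOM):

```java
// Hedged sketch: platform-thread memory measured via RSS, not heap.
// RSS shows only touched stack pages; the full ~1MB-per-thread
// reservation appears in VSZ instead.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;

public class PThreadMem {
    static long rssKb() throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/self/status")))
            if (line.startsWith("VmRSS:"))
                return Long.parseLong(line.replaceAll("\\D+", ""));
        return -1; // not Linux, or procfs unavailable
    }

    public static void main(String[] args) throws Exception {
        int n = 1_000; // deliberately small: each thread reserves ~1MB of stack
        CountDownLatch park = new CountDownLatch(1);
        long before = rssKb();
        for (int i = 0; i < n; i++) {
            Thread.ofPlatform().start(() -> {
                try { park.await(); } catch (InterruptedException e) {}
            });
        }
        Thread.sleep(500); // let the threads start and park
        long after = rssKb();
        System.out.printf("%d platform threads, ~%d KB RSS each%n",
                n, (after - before) / n);
        park.countDown();
    }
}
```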
The right concurrency primitive at each scale
- <1K concurrent: any. OS threads are simplest.
- 1K-100K concurrent I/O: virtual threads, goroutines, or thread pool with async.
- 100K-10M concurrent connections: must be lightweight (goroutines, asyncio, virtual threads).
- Compute-bound: thread count = core count regardless of unit memory.
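For the compute-bound case in the list above, "thread count = core count" is a one-liner with a fixed pool; this sketch is illustrative (the workload and class name are made up here):

```java
// Hedged sketch: size the pool to the core count for compute-bound work.
// Extra threads beyond the cores add memory and context-switch overhead
// without adding throughput.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CpuPool {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        Future<Long> f = pool.submit(() -> {
            long s = 0;
            for (long i = 0; i < 10_000_000L; i++) s += i;
            return s; // sum of 0..9_999_999 = 49_999_995_000_000
        });
        System.out.println("cores=" + cores + " sum=" + f.get());
        pool.shutdown();
    }
}
```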
Key points
- OS thread: 1MB default stack reserved (most unused) → 10K threads = 10GB reserved
- Goroutine: 2KB initial, grows on demand (1GB default limit) → 10K goroutines ≈ 20MB
- Virtual thread (Java 21): ~few KB stack → 1M virtual threads ≈ few GB
- asyncio task: ~1KB per task (Python) → 1M tasks ≈ 1GB
- Per-thread/task overhead matters at millions of concurrent connections
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| OS thread (1MB) | Simple model, preemptive, no special runtime | ~1MB reserved per stack; caps out near 10K | Compute-bound, small thread counts |
| Goroutine (2KB → grows) | Tiny initial stack, grows and shrinks on demand | Stack growth copies cost CPU; Go only | Default for Go I/O-bound work |
| Virtual thread (few KB) | Familiar Thread API, blocking code scales | ThreadLocal state multiplies per thread; Java 21+ | Java 21+ I/O-bound services |
| Async task (~1KB) | Smallest per-unit footprint | Needs async-aware libraries; frames persist across awaits | Python/Node/JS high-concurrency I/O |
Follow-up questions
▸ Why do OS threads reserve 1MB even if unused?
▸ How does a goroutine stack "grow on demand"?
▸ What's the memory cost of 1M asyncio tasks?
▸ Where does ThreadLocal cost matter?
Gotchas
- 1M threads on Linux ≈ 1MB × 1M = 1TB of reserved stack; it won't fit on most servers
- Coroutine frames live across awaits, so long-lived async tasks accumulate frame memory
- ThreadLocal × virtual threads = memory explosion (prefer ScopedValue, previewed in Java 21)
- Goroutine stack growth copies the stack, which costs CPU; usually invisible, but profiles can show it