Memory Overhead Comparison
OS thread: ~1MB stack reserved. Goroutine: ~2KB initial, grows on demand. Virtual thread (Java 21): ~few KB. asyncio task: ~1KB. The difference dictates whether the same hardware can run 10K vs 1M concurrent units.
The numbers, ranked
| Unit | Per-unit memory | 1M units cost |
|---|---|---|
| OS thread | ~1MB stack + KBs of TCB | ~1TB (impossible) |
| Goroutine | ~2KB-8KB typical | ~2-8GB |
| Virtual thread | ~3-5KB | ~3-5GB |
| asyncio task | ~1KB | ~1GB |
| HTTP connection | ~few KB (kernel buffers) | ~few GB |
The factor between "OS thread" and "lightweight" is 100-1000x. That's the difference between 10K and 1M concurrent units on the same hardware.
Where memory overhead bites
The "10K problem" reborn The C10K problem (10K concurrent connections per server) was about scaling beyond what one-thread-per-connection allowed. We solved it with async (Node.js, asyncio) and event loops (epoll, kqueue).
Now we face C10M and beyond: 10 million concurrent connections per server. That is possible only with virtual threads, goroutines, or async tasks, where per-connection memory is KB-sized, not MB-sized.
The hidden cost: ThreadLocal
50 ThreadLocals registered by frameworks × 1M virtual threads × ~100B per entry ≈ 5GB
Java's virtual-thread era exposed this. ThreadLocal was fine at thousands of threads; at millions, the per-thread state dwarfs the threads themselves. Hence ScopedValue, previewed in Java 21 (JEP 446).
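The retained-state arithmetic behind that estimate can be sanity-checked directly. The constants (50 ThreadLocals, ~100 bytes each) are the assumed averages from the estimate above, not measured values:

```java
// Hedged sketch: the back-of-envelope behind the ThreadLocal estimate.
// Each thread that touches a ThreadLocal gets its own entry in its
// ThreadLocalMap, so retained state scales as locals * threads * entrySize.
public class ThreadLocalCost {
    public static void main(String[] args) {
        long locals = 50;          // assumed: ThreadLocals registered by frameworks
        long threads = 1_000_000;  // virtual threads
        long avgBytes = 100;       // assumed average payload per entry
        long total = locals * threads * avgBytes;
        System.out.println("~" + total / 1_000_000_000 + " GB of per-thread state");
        // prints: ~5 GB of per-thread state
    }
}
```

The point of the multiplication: none of the three factors looks alarming on its own; the product is what kills you at virtual-thread scale.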
How to measure
- Linux: `ps -o rss,vsz <pid>` shows physical (RSS) and virtual (VSZ) memory.
- Java: `jcmd <pid> VM.native_memory summary` breaks usage down by thread, GC, and code cache (requires starting the JVM with `-XX:NativeMemoryTracking=summary`).
- Go: `runtime.MemStats` reports heap, stack, and GC overhead.
- Per-mapping: `pmap -x <pid>` lists mapped regions, including thread stacks, with sizes.
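A JVM can also inspect itself from inside via the standard `java.lang.management` beans, which is the programmatic counterpart of the `jcmd`/`ps` commands above (a minimal sketch; the class name is chosen here):

```java
// Hedged sketch: self-inspection via MXBeans. Heap usage covers virtual
// thread stacks (they live on the heap); platform thread stacks are
// native and only show up in NMT or RSS, not here.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class SelfInspect {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("heap used:     " + mem.getHeapMemoryUsage().getUsed());
        System.out.println("non-heap used: " + mem.getNonHeapMemoryUsage().getUsed());
        System.out.println("live threads:  " + threads.getThreadCount());
    }
}
```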
Reproduce the numbers locally
Spawn N concurrent units that all park, snapshot RSS, divide. Useful for sanity-checking the table on local hardware.
```go
// Goroutine memory cost. Run: go run thread_mem.go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	const N = 100_000
	var wg sync.WaitGroup
	block := make(chan struct{})

	var m1, m2 runtime.MemStats
	runtime.ReadMemStats(&m1)

	wg.Add(N)
	for i := 0; i < N; i++ {
		go func() { defer wg.Done(); <-block }()
	}
	runtime.ReadMemStats(&m2)

	// StackInuse counts allocated goroutine stacks, so the delta per
	// goroutine approximates the initial stack size.
	delta := float64(m2.StackInuse-m1.StackInuse) / float64(N)
	fmt.Printf("%d goroutines parked, ~%.0f bytes stack each\n", N, delta)

	close(block)
	wg.Wait()
	// Typical output: 100000 goroutines parked, ~2200 bytes stack each
}
```
```java
// Virtual thread memory cost. Run on Java 21+.
import java.util.concurrent.CountDownLatch;

public class VThreadMem {
    public static void main(String[] args) throws Exception {
        int n = 100_000;
        CountDownLatch park = new CountDownLatch(1);
        Runtime.getRuntime().gc();
        long before = used();
        for (int i = 0; i < n; i++) {
            Thread.startVirtualThread(() -> {
                try { park.await(); } catch (InterruptedException e) {}
            });
        }
        Thread.sleep(500); // let the threads start and park
        long after = used();
        System.out.printf("%d virtual threads, ~%d bytes each%n",
                n, (after - before) / n);
        park.countDown();
    }

    // Virtual thread stacks live on the Java heap, so heap usage
    // captures them (unlike platform threads' native stacks).
    static long used() {
        Runtime r = Runtime.getRuntime();
        return r.totalMemory() - r.freeMemory();
    }
}
// Typical: ~3000-5000 bytes per parked virtual thread
```
```python
# Asyncio task memory cost
import asyncio, tracemalloc

async def park(ev):
    await ev.wait()

async def main():
    n = 100_000
    ev = asyncio.Event()
    tracemalloc.start()
    snap1 = tracemalloc.take_snapshot()
    tasks = [asyncio.create_task(park(ev)) for _ in range(n)]
    await asyncio.sleep(0.1)  # let every task run once and park on the event
    snap2 = tracemalloc.take_snapshot()
    diff = sum(s.size for s in snap2.statistics("filename")) - \
           sum(s.size for s in snap1.statistics("filename"))
    print(f"{n} tasks, ~{diff // n} bytes each")
    ev.set()
    await asyncio.gather(*tasks)

asyncio.run(main())
# Typical: ~1500-2500 bytes per parked task
```
For OS threads, the cheapest check is the same Java program with Thread.ofPlatform().start(...), but watch RSS (`ps -o rss`) or NMT output instead of heap usage: platform-thread stacks are native memory, invisible to Runtime.totalMemory(). Each thread reserves a ~1MB stack up front (visible in VSZ; RSS grows only as stack pages are touched), and spawning 100K platform threads will exhaust memory or hit OS thread limits on most laptops.
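A minimal sketch of that platform-thread variant, assuming Linux (it reads `/proc/self/status`, the same figure `ps -o rss` reports; the class name and the small N are chosen here, since 100K platform threads would OOM):

```java
// Hedged sketch: platform-thread memory measured via RSS, not heap.
// RSS shows only touched stack pages; the full ~1MB-per-thread
// reservation appears in VSZ instead.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;

public class PThreadMem {
    static long rssKb() throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/self/status")))
            if (line.startsWith("VmRSS:"))
                return Long.parseLong(line.replaceAll("\\D+", ""));
        return -1; // not Linux, or procfs unavailable
    }

    public static void main(String[] args) throws Exception {
        int n = 1_000; // deliberately small: each thread reserves ~1MB of stack
        CountDownLatch park = new CountDownLatch(1);
        long before = rssKb();
        for (int i = 0; i < n; i++) {
            Thread.ofPlatform().start(() -> {
                try { park.await(); } catch (InterruptedException e) {}
            });
        }
        Thread.sleep(500); // let the threads start and park
        long after = rssKb();
        System.out.printf("%d platform threads, ~%d KB RSS each%n",
                n, (after - before) / n);
        park.countDown();
    }
}
```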
The right concurrency primitive at each scale
- <1K concurrent: any. OS threads are simplest.
- 1K-100K concurrent I/O: virtual threads, goroutines, or thread pool with async.
- 100K-10M concurrent connections: must be lightweight (goroutines, asyncio, virtual threads).
- Compute-bound: thread count = core count regardless of unit memory.
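For the compute-bound case in the list above, "thread count = core count" is a one-liner with a fixed pool; this sketch is illustrative (the workload and class name are made up here):

```java
// Hedged sketch: size the pool to the core count for compute-bound work.
// Extra threads beyond the cores add memory and context-switch overhead
// without adding throughput.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CpuPool {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        Future<Long> f = pool.submit(() -> {
            long s = 0;
            for (long i = 0; i < 10_000_000L; i++) s += i;
            return s; // sum of 0..9_999_999 = 49_999_995_000_000
        });
        System.out.println("cores=" + cores + " sum=" + f.get());
        pool.shutdown();
    }
}
```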
Key points
- OS thread: 1MB default stack reserved (most unused) → 10K threads = 10GB reserved
- Goroutine: 2KB initial, grows on demand (1GB default limit) → 10K goroutines ≈ 20MB
- Virtual thread (Java 21): ~few KB stack → 1M virtual threads ≈ few GB
- asyncio task: ~1KB per task (Python) → 1M tasks ≈ 1GB
- Per-thread/task overhead matters at millions of concurrent connections
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| OS thread (1MB) | Simple model, preemptive, no special runtime | ~1MB reserved per stack; caps out near 10K | Compute-bound, small thread counts |
| Goroutine (2KB → grows) | Tiny initial stack, grows and shrinks on demand | Stack growth copies cost CPU; Go only | Default for Go I/O-bound work |
| Virtual thread (few KB) | Familiar Thread API, blocking code scales | ThreadLocal state multiplies per thread; Java 21+ | Java 21+ I/O-bound services |
| Async task (~1KB) | Smallest per-unit footprint | Needs async-aware libraries; frames persist across awaits | Python/Node/JS high-concurrency I/O |
Follow-up questions
▸ Why do OS threads reserve 1MB even if unused?
▸ How does a goroutine stack "grow on demand"?
▸ What's the memory cost of 1M asyncio tasks?
▸ Where does ThreadLocal cost matter?
Gotchas
- 1M threads on Linux ≈ 1MB × 1M = 1TB of reserved stack; it won't fit on most servers
- Coroutine frames live across awaits, so long-lived async tasks accumulate frame memory
- ThreadLocal × virtual threads = memory explosion (prefer ScopedValue, previewed in Java 21)
- Goroutine stack growth copies the stack, which costs CPU; usually invisible, but profiles can show it