Context Switch Cost
OS thread context switch: ~5μs (kernel-mediated, save/restore registers, TLB flush). Goroutine / virtual thread context switch: ~hundreds of ns (user-space, no kernel transition). Async task switch: ~tens of ns (function call). Choosing the right unit affects throughput at scale.
Why context switch cost matters
Every concurrency primitive has a cost per switch. At small scale (hundreds of switches/sec) it's invisible. At scale (10K-1M switches/sec) it becomes a measurable percentage of CPU time. Choosing a primitive whose switch cost is 10x cheaper can mean the difference between maxing out a server at 100K rps vs 1M rps.
Order-of-magnitude numbers
| Switch type | Typical cost | What's involved |
|---|---|---|
| Function call | ~1-5ns | push/pop, no kernel |
| Async task switch | ~50-100ns | coroutine resume, single thread |
| Goroutine / virtual thread | ~200-500ns | user-space scheduler, no kernel mode |
| OS thread (same process) | ~1-5μs | kernel mode, register save/restore, partial TLB |
| Process | ~5-20μs | full TLB flush, address space switch |
The numbers vary by ~3x with hardware/OS, but the orders-of-magnitude relationships hold.
When this dominates
The math worth being able to do in interviews
A service handles 1M requests per second. Each request involves 5 context switches (network read, processing, network write, etc.). Per-switch cost = 1μs.
Total switching CPU = 1M × 5 × 1μs = 5 seconds of CPU per second = 5 cores just for switching.
With virtual threads (~300ns/switch): 1M × 5 × 300ns = 1.5 seconds = 1.5 cores. Saved 3.5 cores per server.
Multiply by hundreds of servers = real money.
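The arithmetic above fits in a one-line function. A minimal sketch (function name and structure are illustrative, not from the text) that reproduces both numbers:

```go
package main

import "fmt"

// coresForSwitching returns CPU cores consumed purely by context switching:
// requests/sec × switches per request × cost per switch (in ns), converted
// to seconds of CPU per wall-clock second, i.e. cores.
func coresForSwitching(reqPerSec, switchesPerReq, switchNs float64) float64 {
	return reqPerSec * switchesPerReq * switchNs / 1e9
}

func main() {
	fmt.Println(coresForSwitching(1e6, 5, 1000)) // OS threads @ 1μs/switch: 5
	fmt.Println(coresForSwitching(1e6, 5, 300))  // virtual threads @ 300ns/switch: 1.5
}
```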
How to measure
- Linux system-wide: `vmstat 1` shows cs (context switches per second).
- Per-process: `pidstat -w 1` shows voluntary and involuntary context switches.
- Go: `runtime/pprof` `goroutine` and `mutex` profiles show coordination overhead.
- Java: the JFR `JavaThreadStatistics` event tracks thread states.
- Async (Python/JS): harder; track the task-resume rate via custom instrumentation.
Reproduce the numbers locally
A pipe ping-pong is the standard microbenchmark for OS thread switch cost. Two threads (or two processes) each write one byte and read one byte, forever; each round trip costs two context switches. Divide the total elapsed time by 2 × ROUNDS to get the per-switch cost.
```c
// pipe_pong.c -- one process, two threads ping-ponging via a pipe
// Build: cc -O2 -pthread pipe_pong.c -o pipe_pong
// Run: taskset -c 0,1 ./pipe_pong   # pin to two cores to avoid noise
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 1000000

int p1[2], p2[2];
char b;

void *pong(void *arg) {
    (void)arg;
    for (long i = 0; i < ROUNDS; i++) {
        read(p1[0], &b, 1);
        write(p2[1], &b, 1);
    }
    return NULL;
}

int main() {
    pipe(p1); pipe(p2);
    pthread_t t;
    pthread_create(&t, NULL, pong, NULL);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ROUNDS; i++) {
        write(p1[1], &b, 1);
        read(p2[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f ns per round trip (%.0f ns per switch)\n",
           sec * 1e9 / ROUNDS, sec * 1e9 / ROUNDS / 2);
    return 0;
}
// Typical output on a modern Linux box:
// ~3000-5000 ns per round trip = ~1500-2500 ns per OS thread switch
```
For goroutines, the equivalent benchmark is two goroutines exchanging on a channel:
```go
// chan_pong_test.go -- run with: go test -bench=Pong -benchtime=1000000x
package main

import "testing"

func BenchmarkChanPong(b *testing.B) {
	a := make(chan struct{}, 1)
	z := make(chan struct{}, 1)
	go func() {
		for i := 0; i < b.N; i++ {
			<-a
			z <- struct{}{}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		a <- struct{}{}
		<-z
	}
}

// Typical: 200-400 ns per round trip = 100-200 ns per goroutine switch.
// Compare with the pipe_pong number above to see the order-of-magnitude gap.
```
Why the gap is so big
An OS thread switch is the kernel taking the CPU away from one thread and giving it to another. The work happens in two distinct buckets, and the second one is usually larger than the first.
Bucket 1: the kernel work. Five steps, each with a real cost.
| Step | What happens | Approximate cost |
|---|---|---|
| 1 | Trap from user mode into the kernel | 100 to 200 ns |
| 2 | Save thread A's registers (general-purpose, floating-point, SSE/AVX state, around 50 register slots) | tens of ns |
| 3 | Scheduler bookkeeping: mark A no longer current, update accounting, pick the next runnable thread | tens of ns |
| 4 | Restore thread B's registers from its saved state | tens of ns |
| 5 | Trap back to user mode and start running thread B | 100 to 200 ns |
That alone is roughly 500 ns to 1 μs of pure kernel overhead.
Bucket 2: the cold caches and cold TLB. This is the part most people miss, and it is usually bigger than the kernel work.
After the switch, thread B is running on a core whose caches are full of thread A's data. B's stack, B's hot variables, and B's recently-touched globals are not in L1. Every memory access B makes for the next hundred instructions or so misses L1 and has to refill from L2, L3, or main memory. The TLB (the CPU's cache of virtual-to-physical address mappings) also has stale entries. Even with PCID / ASID tagging, which lets the kernel avoid a full TLB flush, many entries get invalidated.
The "first hundred accesses are slow" effect is often the largest cost of a context switch, larger than the kernel work itself. Two threads that fit comfortably in cache individually do not fit together; switching evicts each one's data on the way out.
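The warm-vs-cold effect is easy to observe directly. This sketch (buffer size and helper name are illustrative) sums a buffer twice on one thread: the first pass faults pages in and fills the caches, the second runs warm — warm being exactly the state a switched-in thread has lost:

```go
package main

import (
	"fmt"
	"time"
)

// sumPass walks the buffer once and reports how long the pass took.
func sumPass(data []int64) (int64, time.Duration) {
	start := time.Now()
	var sum int64
	for _, v := range data {
		sum += v
	}
	return sum, time.Since(start)
}

func main() {
	data := make([]int64, 1<<19) // 4 MB: fits in L3, far bigger than L1/L2

	_, cold := sumPass(data) // cold: page faults + cache fills from memory
	_, warm := sumPass(data) // warm: mostly served from cache

	fmt.Printf("cold pass: %v\nwarm pass: %v\n", cold, warm)
	// The cold pass is typically several times slower -- the same penalty a
	// thread pays after being switched onto a core full of another thread's data.
}
```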
A goroutine switch or a virtual-thread switch avoids most of this. Both stay in user space, so steps 1 and 5 (the user-kernel transitions) do not happen. The "save and restore registers" work is a small struct copy in user space rather than a kernel call. The runtime tries to keep the same goroutine on the same OS thread when possible, which keeps the cache warm. The total is roughly 100 to 200 ns, an order of magnitude faster than the OS thread switch and two orders of magnitude faster than crossing the network. That gap is why services that fan out lots of small concurrent units (web servers, RPC clients, ingestion pipelines) benefit so much from goroutines or virtual threads over OS-thread-per-request.
The interview answer
"Context switch cost matters at scale. OS threads: ~1-5μs; goroutines/virtual threads: ~hundreds of ns; async tasks: ~50-100ns. For services handling >100K req/sec, picking the right primitive saves measurable CPU. The corollary: don't optimise switches in services doing <1K req/sec; the cost is invisible."
Key points
- OS thread switch: ~1-5μs typical; includes the kernel mode switch plus TLB pollution
- Goroutine / virtual thread switch: ~200-500ns, user-space only
- asyncio task switch: ~50-100ns, just a coroutine resumption
- At 1M switches/sec, 1μs per switch costs 1 second of CPU per second, i.e. 100% of one core
- Mass goroutine spawning is fine; mass thread spawning hits scheduler limits
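The last point is cheap to verify. A sketch that spawns 100,000 goroutines (the count and the per-goroutine work are illustrative); the same experiment with one OS thread each would run into scheduler and memory limits long before finishing:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spawn launches n goroutines, each writing one result, and waits for all.
func spawn(n int) []int {
	results := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = i * 2
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	start := time.Now()
	r := spawn(100_000)
	fmt.Printf("spawned 100k goroutines in %v (last result: %d)\n",
		time.Since(start), r[len(r)-1])
}
```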
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| OS thread | Preemptive; true parallelism; no special runtime needed | ~1-5μs per switch; practical limits around ~10K threads | CPU-bound work; small thread counts |
| Virtual thread / goroutine | ~200-500ns per switch; cheap to spawn in huge numbers; familiar blocking style | Needs runtime support; pinning (e.g. Java synchronized) cancels the benefit | I/O-bound work; high concurrency |
| Async task | ~50-100ns per switch; minimal per-task overhead | Cooperative: a long-running task stalls the loop; switches invisible to OS tools | Massive I/O concurrency; event-driven services |
Follow-up questions
- Why do OS thread switches cost so much more?
- What's TLB pollution?
- How are context switches measured in production?
- When does context-switch cost become the bottleneck?
Gotchas
- Mass-spawning OS threads (e.g., one per request) hits OS scheduler limits at ~10K threads
- vmstat's cs column only counts OS-level switches; goroutine and async switches are invisible to it
- TLB pollution makes the first few accesses after a switch much slower than steady state
- Hyper-threading hides some of the switch latency by running another thread on the sibling logical core
- synchronized blocks in Java pin virtual threads to their carrier threads, losing the cheap-switch benefit