Context Switch Cost
OS thread context switch: ~5μs (kernel-mediated, save/restore registers, TLB flush). Goroutine / virtual thread context switch: ~hundreds of ns (user-space, no kernel transition). Async task switch: ~tens of ns (function call). Choosing the right unit affects throughput at scale.
Why context switch cost matters
Every concurrency primitive has a cost per switch. At small scale (hundreds of switches/sec) it's invisible. At scale (10K-1M switches/sec) it becomes a measurable percentage of CPU time. Choosing a primitive whose switch cost is 10x cheaper can mean the difference between maxing out a server at 100K rps vs 1M rps.
Order-of-magnitude numbers
| Switch type | Typical cost | What's involved |
|---|---|---|
| Function call | ~1-5ns | push/pop, no kernel |
| Async task switch | ~50-100ns | coroutine resume, single thread |
| Goroutine / virtual thread | ~200-500ns | user-space scheduler, no kernel mode |
| OS thread (same process) | ~1-5μs | kernel mode, register save/restore, partial TLB |
| Process | ~5-20μs | full TLB flush, address space switch |
The numbers vary by ~3x with hardware/OS, but the orders-of-magnitude relationships hold.
When this dominates
The math worth being able to do in interviews
A service handles 1M requests per second. Each request involves 5 context switches (network read, processing, network write, etc.). Per-switch cost = 1μs.
Total switching CPU = 1M × 5 × 1μs = 5 seconds of CPU per second = 5 cores just for switching.
With virtual threads (~300ns/switch): 1M × 5 × 300ns = 1.5 seconds = 1.5 cores. Saved 3.5 cores per server.
Multiply by hundreds of servers = real money.
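The arithmetic above fits in a one-line function. A minimal sketch (function name and structure are illustrative, not from the text) that reproduces both numbers:

```go
package main

import "fmt"

// coresForSwitching returns CPU cores consumed purely by context switching:
// requests/sec × switches per request × cost per switch (in ns), converted
// to seconds of CPU per wall-clock second, i.e. cores.
func coresForSwitching(reqPerSec, switchesPerReq, switchNs float64) float64 {
	return reqPerSec * switchesPerReq * switchNs / 1e9
}

func main() {
	fmt.Println(coresForSwitching(1e6, 5, 1000)) // OS threads @ 1μs/switch: 5
	fmt.Println(coresForSwitching(1e6, 5, 300))  // virtual threads @ 300ns/switch: 1.5
}
```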
How to measure
- Linux system-wide: `vmstat 1` shows cs (context switches per second).
- Per-process: `pidstat -w 1` shows voluntary and involuntary context switches.
- Go: `runtime/pprof` `goroutine` and `mutex` profiles show coordination overhead.
- Java: the JFR `JavaThreadStatistics` event tracks thread states.
- Async (Python/JS): harder; track the task-resume rate via custom instrumentation.
Reproduce the numbers locally
A pipe ping-pong is the standard microbenchmark for OS thread switch cost. Two threads (or two processes) each write one byte and read one byte, forever; each round trip costs two context switches. Divide the total elapsed time by 2 × ROUNDS to get the per-switch cost.
```c
// pipe_pong.c -- one process, two threads ping-ponging via a pipe
// Build: cc -O2 -pthread pipe_pong.c -o pipe_pong
// Run: taskset -c 0,1 ./pipe_pong   # pin to two cores to avoid noise
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 1000000

int p1[2], p2[2];
char b;

void *pong(void *arg) {
    (void)arg;
    for (long i = 0; i < ROUNDS; i++) {
        read(p1[0], &b, 1);
        write(p2[1], &b, 1);
    }
    return NULL;
}

int main() {
    pipe(p1); pipe(p2);
    pthread_t t;
    pthread_create(&t, NULL, pong, NULL);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ROUNDS; i++) {
        write(p1[1], &b, 1);
        read(p2[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f ns per round trip (%.0f ns per switch)\n",
           sec * 1e9 / ROUNDS, sec * 1e9 / ROUNDS / 2);
    return 0;
}
// Typical output on a modern Linux box:
// ~3000-5000 ns per round trip = ~1500-2500 ns per OS thread switch
```
For goroutines, the equivalent benchmark is two goroutines exchanging on a channel:
```go
// chan_pong_test.go -- run with: go test -bench=Pong -benchtime=1000000x
package main

import "testing"

func BenchmarkChanPong(b *testing.B) {
	a := make(chan struct{}, 1)
	z := make(chan struct{}, 1)
	go func() {
		for i := 0; i < b.N; i++ {
			<-a
			z <- struct{}{}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		a <- struct{}{}
		<-z
	}
}

// Typical: 200-400 ns per round trip = 100-200 ns per goroutine switch.
// Compare with the pipe_pong number above to see the order-of-magnitude gap.
```
Why the gap is so big
An OS thread switch is the kernel taking the CPU away from one thread and giving it to another. The work happens in two distinct buckets, and the second one is usually larger than the first.
Bucket 1: the kernel work. Five steps, each with a real cost.
| Step | What happens | Approximate cost |
|---|---|---|
| 1 | Trap from user mode into the kernel | 100 to 200 ns |
| 2 | Save thread A's registers (general-purpose, floating-point, SSE/AVX state, around 50 register slots) | tens of ns |
| 3 | Scheduler bookkeeping: mark A no longer current, update accounting, pick the next runnable thread | tens of ns |
| 4 | Restore thread B's registers from its saved state | tens of ns |
| 5 | Trap back to user mode and start running thread B | 100 to 200 ns |
That alone is roughly 500 ns to 1 μs of pure kernel overhead.
Bucket 2: the cold caches and cold TLB. This is the part most people miss, and it is usually bigger than the kernel work.
After the switch, thread B is running on a core whose caches are full of thread A's data. B's stack, B's hot variables, and B's recently-touched globals are not in L1. Every memory access B makes for the next hundred instructions or so misses L1 and has to refill from L2, L3, or main memory. The TLB (the CPU's cache of virtual-to-physical address mappings) also has stale entries. Even with PCID / ASID tagging, which lets the kernel avoid a full TLB flush, many entries get invalidated.
The "first hundred accesses are slow" effect is often the largest cost of a context switch, larger than the kernel work itself. Two threads that fit comfortably in cache individually do not fit together; switching evicts each one's data on the way out.
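The warm-vs-cold effect is easy to observe directly. This sketch (buffer size and helper name are illustrative) sums a buffer twice on one thread: the first pass faults pages in and fills the caches, the second runs warm — warm being exactly the state a switched-in thread has lost:

```go
package main

import (
	"fmt"
	"time"
)

// sumPass walks the buffer once and reports how long the pass took.
func sumPass(data []int64) (int64, time.Duration) {
	start := time.Now()
	var sum int64
	for _, v := range data {
		sum += v
	}
	return sum, time.Since(start)
}

func main() {
	data := make([]int64, 1<<19) // 4 MB: fits in L3, far bigger than L1/L2

	_, cold := sumPass(data) // cold: page faults + cache fills from memory
	_, warm := sumPass(data) // warm: mostly served from cache

	fmt.Printf("cold pass: %v\nwarm pass: %v\n", cold, warm)
	// The cold pass is typically several times slower -- the same penalty a
	// thread pays after being switched onto a core full of another thread's data.
}
```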
A goroutine switch or a virtual-thread switch avoids most of this. Both stay in user space, so steps 1 and 5 (the user-kernel transitions) do not happen. The "save and restore registers" work is a small struct copy in user space rather than a kernel call. The runtime tries to keep the same goroutine on the same OS thread when possible, which keeps the cache warm. The total is roughly 100 to 200 ns, an order of magnitude faster than the OS thread switch and two orders of magnitude faster than crossing the network. That gap is why services that fan out lots of small concurrent units (web servers, RPC clients, ingestion pipelines) benefit so much from goroutines or virtual threads over OS-thread-per-request.
The interview answer
"Context switch cost matters at scale. OS threads: ~1-5μs; goroutines/virtual threads: ~hundreds of ns; async tasks: ~50-100ns. For services handling >100K req/sec, picking the right primitive saves measurable CPU. The corollary: don't optimise switches in services doing <1K req/sec; the cost is invisible."
Key points
- OS thread switch: ~1-5μs typical; includes the kernel mode switch plus TLB pollution
- Goroutine / virtual thread switch: ~200-500ns, user-space only
- asyncio task switch: ~50-100ns, just a coroutine resumption
- At 1M switches/sec, 1μs per switch costs 1 second of CPU per second, i.e. 100% of one core
- Mass goroutine spawning is fine; mass thread spawning hits scheduler limits
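The last point is cheap to verify. A sketch that spawns 100,000 goroutines (the count and the per-goroutine work are illustrative); the same experiment with one OS thread each would run into scheduler and memory limits long before finishing:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spawn launches n goroutines, each writing one result, and waits for all.
func spawn(n int) []int {
	results := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = i * 2
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	start := time.Now()
	r := spawn(100_000)
	fmt.Printf("spawned 100k goroutines in %v (last result: %d)\n",
		time.Since(start), r[len(r)-1])
}
```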
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| OS thread | Preemptive; true parallelism; no special runtime needed | ~1-5μs per switch; practical limits around ~10K threads | CPU-bound work; small thread counts |
| Virtual thread / goroutine | ~200-500ns per switch; cheap to spawn in huge numbers; familiar blocking style | Needs runtime support; pinning (e.g. Java synchronized) cancels the benefit | I/O-bound work; high concurrency |
| Async task | ~50-100ns per switch; minimal per-task overhead | Cooperative: a long-running task stalls the loop; switches invisible to OS tools | Massive I/O concurrency; event-driven services |
Follow-up questions
- Why do OS thread switches cost so much more?
- What's TLB pollution?
- How are context switches measured in production?
- When does context-switch cost become the bottleneck?
Gotchas
- Mass-spawning OS threads (e.g., one per request) hits OS scheduler limits at ~10K threads
- vmstat's cs column only counts OS-level switches; goroutine and async switches are invisible to it
- TLB pollution makes the first few accesses after a switch much slower than steady state
- Hyper-threading hides some of the switch latency by running another thread on the sibling logical core
- synchronized blocks in Java pin virtual threads to their carrier threads, losing the cheap-switch benefit