GMP Scheduler: How Goroutines Actually Run
G = goroutine, M = OS thread, P = logical processor. The runtime keeps GOMAXPROCS P's; each P holds a local run queue of runnable G's. M's must hold a P to run G's. Blocking syscalls hand the P off to another M so other goroutines keep running. Idle P's steal work from busy ones. The network poller and sysmon thread keep things moving. The whole machine runs in user space and is what makes 100K goroutines cheap.
A coffee shop analogy
The Go runtime is built around three things, named with single letters in the source. The names matter because every Go scheduler discussion uses them. The mapping to a coffee shop:
| Letter | What it stands for | Coffee shop analogy |
|---|---|---|
| G | Goroutine | An order ticket. Cheap to create. Says "make this drink." |
| P | Processor (logical, not a CPU) | A barista station with a small stack of tickets next to it. There are exactly GOMAXPROCS stations; each can work on at most one ticket at a time. |
| M | Machine (an OS thread) | A barista. To make drinks, a barista has to stand at a station. No station, no drinks. |
Tickets get created and stacked on stations. Baristas grab the next ticket and make the drink. When a barista has to wait for something slow (an espresso machine that takes 30 seconds), the runtime walks another barista over to take over that station so the rest of the tickets keep flowing. That handoff is the difference between Go's scheduler and a naive "one OS thread per task" design.
With GOMAXPROCS = 3, there are three stations and therefore at most three tickets in progress at any moment.
Two things go wrong if the runtime is naive about this, and the scheduler has to handle both:
- A barista gets tied up on a slow drink. The station could go idle while everyone waits for one syscall. The fix is to detach the station from the stuck barista and let a different barista work it. The slow drink finishes whenever the kernel says it does.
- A station's stack runs out while another station is overloaded. The fix is work stealing: an idle barista walks to the busiest station and takes half its tickets. Half is the equilibrium amount; one would mean repeated raids, all would tilt the imbalance the other way.
What the runtime does in plain steps
Every time a barista finishes a drink and looks for the next ticket, the runtime checks five places in this order. The order matters: cache locality is best when work stays on the same P, so the local sources are checked first.
| Order | Source | What it really is |
|---|---|---|
| 1 | The "express slot" on their own station | runnext: the most recently created goroutine, optimised for "the goroutine I just spawned should run next" |
| 2 | The local ticket stack on their own station | The P's local FIFO run queue, up to 256 entries |
| 3 | The shared overflow line at the front counter | The global run queue (shared across all P's) |
| 4 | Tickets ready from the network | The netpoller's ready queue (goroutines woken up by completed I/O) |
| 5 | Steal half the tickets from a busy station | Work stealing from another P's local queue |
If none of those have work, the barista takes a break (the M parks). Some external event wakes them up: a new goroutine spawned, a timer firing, or the netpoller reporting that a network operation has completed.
Why steal half? Stealing one ticket at a time means the busy station keeps getting raided. Stealing all the tickets means the stealing station now has too much. Half is the equilibrium point: both stations end up with similar amounts of work and the system rebalances naturally over a few steals.
Syscall handoff is the magic trick
A blocking syscall in classic threading models stops everything: the OS thread is parked, no other "lightweight tasks" can run on it. Go cheats. When sysmon notices a P has been sitting in a syscall for longer than roughly one sysmon tick (~20µs at the minimum), it detaches the P from the syscalling M and finds a different M to host it. The detached M stays in the kernel waiting for the syscall to return; meanwhile the other goroutines keep running.
This is why a Go program with 10000 goroutines doing slow disk reads doesn't grind to a halt: the runtime keeps spawning M's to absorb the syscall delays, but the scheduler never runs out of P's to keep useful work going.
The network poller is the other magic trick
Network I/O does NOT block an M. On conn.Read, Go submits the FD to its netpoller (epoll on Linux, kqueue on BSD/macOS, IOCP on Windows), parks the G on the netpoller's wait list, and the M moves on. When the kernel reports the FD is ready, the netpoller pushes the G back onto a P's run queue.
A Go server can hold a million idle TCP connections with eight M's. One M is enough to handle the netpoller's wakeups; the others run G's that are doing actual work.
sysmon: the babysitter thread
sysmon is one OS thread (no P) that runs forever and does:
- Preempt G's that have been running more than 10ms (signal-based, since 1.14).
- Retake P's from M's stuck in syscalls.
- Trigger a forced GC if none has run for two minutes (heap-growth-triggered GC is the pacer's job, not sysmon's).
It runs at low frequency (every 20us up to 10ms, depending on idleness) and is invisible to user code. The "sysmon" entry in a stack trace is this thread.
schedtrace and scheddetail
Running with GODEBUG=schedtrace=1000 prints one line per second showing per-P queue lengths, M counts, and idle counts. Add scheddetail=1 for everything. This is the easiest way to see whether a program is creating too many M's, has unbalanced queues, or is starving the scheduler.
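A sketch of what an invocation looks like; the binary name is a placeholder, and the exact field layout varies across Go versions:

```shell
# Print one scheduler summary line per second while the program runs.
GODEBUG=schedtrace=1000 ./myserver

# A typical line looks roughly like:
#   SCHED 2008ms: gomaxprocs=8 idleprocs=6 threads=12 spinningthreads=0 idlethreads=4 runqueue=0 [1 0 2 0 0 0 0 0]
# runqueue is the global queue length; the bracketed list is the
# per-P local queue lengths.

# Add verbose per-G, per-M, per-P state:
GODEBUG=schedtrace=1000,scheddetail=1 ./myserver
```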
When this actually matters
For 95% of Go code, the right approach is to write goroutines and channels and let the runtime handle the rest. GMP knowledge becomes essential in the remaining 5% of cases:
- The M count climbs under load (look for blocking syscalls or cgo).
- Goroutines pile up in pprof (look for stuck channel operations or leaks).
- Benchmark numbers don't make sense (check GOMAXPROCS).
- A service holds millions of connections (sizing the netpoller workload, picking GOMAXPROCS).
- Code interacts directly with the runtime (LockOSThread, cgo, signal handling).
For the other 95%: write goroutines, send on channels, and let the scheduler do its job.
Runtime primitives
- GOMAXPROCS (number of P's, default = NumCPU)
- runtime.GOMAXPROCS(n)
- runtime.Gosched (yield current G)
- runtime.NumGoroutine
- runtime/debug.SetGCPercent (interacts with sysmon-driven GC)
Implementation
When a goroutine spawns another with go, the new G goes into the parent P's runnext slot. If the parent then blocks (channel, lock), the runtime immediately runs the runnext G on the same P. This avoids a queue insert and keeps the cache hot. It's what makes patterns like "spawn worker, send to its channel, wait for reply" cheap.
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	ch := make(chan int)
	go func() {
		// Almost certainly runs on the same P that spawned us,
		// because we landed in runnext and main blocks on <-ch.
		ch <- 42
	}()
	fmt.Println(<-ch, "GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```
When a G calls a blocking syscall (a read on a regular file, for example), the M holding the P calls into the kernel. If the syscall lasts longer than about one sysmon tick (~20µs), sysmon detaches the P from this M. The P picks up a fresh M (or wakes a parked one) and runs other G's. The blocked G stays with its original M; when the syscall returns, the M tries to reacquire its old P, and if that fails, the G goes back on the global queue and the M parks.
```go
package main

import (
	"io"
	"os"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Open and read a file: blocking syscalls on Linux.
			// The runtime detaches the P from this M during the read,
			// so the other 9999 goroutines keep making progress.
			f, err := os.Open("/etc/hostname")
			if err == nil {
				io.Copy(io.Discard, f)
				f.Close()
			}
		}()
	}
	wg.Wait()
	// 10K parallel blocking reads with GOMAXPROCS=8 still feel fine
	// because the runtime spawns extra M's to absorb stuck syscalls.
}
```
Network reads and writes go through the netpoller, not blocking syscalls. The G is parked on the netpoller's wait list (no M involved); when the kernel reports the FD is ready, the netpoller pushes the G back onto a P's run queue. This is why a Go server can hold a million idle TCP connections cheaply: a handful of M's can service enormous numbers of waiting G's, with no kernel thread tied up per connection.
```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(io.Discard, c) // parks the G in the netpoller, releases the M
		}(conn)
	}
	// 1M idle connections: 1M parked G's at ~2 KB of stack each (a few GB
	// total), and only around GOMAXPROCS M's. A thread-per-connection model
	// would need 1M kernel threads.
}
```
Before Go 1.14, preemption only happened at function-call boundaries. A tight CPU-bound loop with no function calls and no select/channel/lock operations would never yield. With GOMAXPROCS=1, such a loop starved every other goroutine forever. Go 1.14 added signal-based async preemption: sysmon sends SIGURG to a long-running G's M, which interrupts the G at a safe point and lets the scheduler reschedule. This means tight pure-CPU loops are now preemptible, and runtime.Gosched() is almost never needed anymore.
```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Pre-1.14: this exact loop, with no calls and no select, was unpreemptible.
// 1.14+: sysmon sends SIGURG; the runtime preempts at the next safe point.
func tightLoop() {
	x := 0
	for {
		x++ // no function calls, no channel ops, no select
	}
}

func main() {
	runtime.GOMAXPROCS(1)
	go tightLoop()

	// On Go 1.14+ this prints. On pre-1.14 it never would, because
	// tightLoop would never yield the only available P.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("main got the P back")
}
```
Key points
- G = goroutine (small struct + stack), M = OS thread (machine), P = logical processor (GOMAXPROCS of these). An M needs a P to run G's.
- Each P has a local run queue of up to 256 runnable G's plus a single runnext slot for the most recently spawned goroutine. Overflow goes to the global queue.
- Work stealing: when a P's local queue is empty, it steals half of another P's queue. Keeps load balanced without explicit coordination.
- Blocking syscalls: the M holding a P releases the P (handoff), another M (or a fresh one) picks the P up and continues running other G's. The blocked G stays attached to its M.
- Channel and lock waits don't block an M. The G is parked in user space (a wait list inside the channel/mutex); the M moves on to other work.
- The network poller (epoll/kqueue/IOCP) is one shared component. When network I/O completes, it puts the waiting G back on a P's run queue.
- sysmon is a dedicated OS thread (no P) that preempts long-running G's, retakes P's stuck in syscalls too long, and triggers GC.
- Async preemption via signals (since Go 1.14): even a tight loop with no function calls can be preempted. Before 1.14, such a loop could starve the scheduler.
Follow-up questions
- Why have P at all? Why not just M and G?
- What is 'spinning' and why does the runtime do it?
- How does GOMAXPROCS interact with cgroups and CPU limits?
- Can the scheduler be starved?
- What's the cost of goroutine creation?
Gotchas
- !Runaway G creation under load: spawning a goroutine per inbound message without bounds will OOM the process when the queue can't drain fast enough. Always have a worker pool or semaphore.
- !cgo calls block an M for the entire C call. Millions of cgo calls per second will exhaust the M pool and require raising the soft cap (default 10000).
- !GOMAXPROCS too low (default 1 in Go pre-1.5, or in containerized environments without automaxprocs) silently serializes everything.
- !GOMAXPROCS too high on a small cgroup cuts performance because the OS scheduler thrashes between threads that compete for limited CPU.
- !Long syscalls (slow disks) can spike the M count; suddenly seeing hundreds of M's is a sign of blocking syscalls under load.
Common pitfalls
- Calling runtime.Gosched() to 'help the scheduler.' Almost always unnecessary since Go 1.14; reaching for it usually signals a hot loop where the real fix is to break it up or call into a blocking primitive.
- Using runtime.LockOSThread for performance reasons. It pins the G to its M and disables many scheduler optimizations. Use only when required (window-system bindings, OpenGL, certain crypto libraries).
- Assuming GOMAXPROCS == NumCPU. On Kubernetes pods with CPU limits, the default is wrong; use uber-go/automaxprocs.
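A sketch of the container fix; `./server` is a placeholder binary. Note that recent Go releases have started consulting cgroup CPU limits when choosing the default, so check your toolchain version before reaching for a workaround:

```shell
# Inside a container limited to 2 CPUs on a 64-core node, older Go
# defaults GOMAXPROCS to 64 (it reads the machine, not the cgroup).
# Pin it explicitly via the environment:
GOMAXPROCS=2 ./server

# Or set it from the cgroup quota at startup with uber-go/automaxprocs,
# via a blank import in main:
#     import _ "go.uber.org/automaxprocs"
```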
APIs worth memorising
- runtime.GOMAXPROCS
- runtime.Gosched
- runtime.NumGoroutine
- runtime.LockOSThread / UnlockOSThread
- GODEBUG=schedtrace=1000 (prints scheduler state every second)
- GODEBUG=scheddetail=1 (prints detailed per-P, per-M state)
Every Go server uses this. The reason a Go HTTP server can handle 100K concurrent connections without breaking a sweat is the GMP design: 100K goroutines, GOMAXPROCS M's, the netpoller doing the heavy lifting. Compare to a thread-per-request server (Java pre-virtual-threads, classic C with pthreads), where 100K connections meant 100K kernel threads or an explicit event loop in application code.