GMP Scheduler: How Goroutines Actually Run
G = goroutine, M = OS thread, P = logical processor. The runtime keeps GOMAXPROCS P's; each P holds a local run queue of runnable G's. M's must hold a P to run G's. Blocking syscalls hand the P off to another M so other goroutines keep running. Idle P's steal work from busy ones. The network poller and sysmon thread keep things moving. The whole machine runs in user space and is what makes 100K goroutines cheap.
A coffee shop analogy
The Go runtime is built around three things, named with single letters in the source. The names matter because every Go scheduler discussion uses them. The mapping to a coffee shop:
| Letter | What it stands for | Coffee shop analogy |
|---|---|---|
| G | Goroutine | An order ticket. Cheap to create. Says "make this drink." |
| P | Processor (logical, not a CPU) | A barista station with a small stack of tickets next to it. There are exactly GOMAXPROCS stations; each can work on at most one ticket at a time. |
| M | Machine (an OS thread) | A barista. To make drinks, a barista has to stand at a station. No station, no drinks. |
Tickets get created and stacked on stations. Baristas grab the next ticket and make the drink. When a barista has to wait for something slow (an espresso machine that takes 30 seconds), the runtime walks another barista over to take over that station so the rest of the tickets keep flowing. That handoff is the difference between Go's scheduler and a naive "one OS thread per task" design.
With GOMAXPROCS = 3, there are three stations and therefore at most three tickets in progress at any moment.
Two things go wrong if the runtime is naive about this, and the scheduler has to handle both:
- A barista gets tied up on a slow drink. The station could go idle while everyone waits for one syscall. The fix is to detach the station from the stuck barista and let a different barista work it. The slow drink finishes whenever the kernel says it does.
- A station's stack runs out while another station is overloaded. The fix is work stealing: an idle barista walks to the busiest station and takes half its tickets. Half is the equilibrium amount; one would mean repeated raids, all would tilt the imbalance the other way.
What the runtime does in plain steps
Every time a barista finishes a drink and looks for the next ticket, the runtime checks five places in this order. The order matters: cache locality is best when work stays on the same P, so the local sources are checked first.
| Order | Source | What it really is |
|---|---|---|
| 1 | The "express slot" on their own station | runnext: the most recently created goroutine, optimised for "the goroutine I just spawned should run next" |
| 2 | The local ticket stack on their own station | The P's local FIFO run queue, up to 256 entries |
| 3 | The shared overflow line at the front counter | The global run queue (shared across all P's) |
| 4 | Tickets ready from the network | The netpoller's ready queue (goroutines woken up by completed I/O) |
| 5 | Steal half the tickets from a busy station | Work stealing from another P's local queue |
If none of those have work, the barista takes a break (the M parks). Some external event wakes them up: a new goroutine spawned, a timer firing, or the netpoller reporting that a network operation has completed.
Why steal half? Stealing one ticket at a time means the busy station keeps getting raided. Stealing all the tickets means the stealing station now has too much. Half is the equilibrium point: both stations end up with similar amounts of work and the system rebalances naturally over a few steals.
Syscall handoff is the magic trick
A blocking syscall in classic threading models stops everything: the OS thread is parked, no other "lightweight tasks" can run on it. Go cheats. When sysmon notices a P has been sitting in a syscall for longer than roughly one sysmon tick (~20µs at the minimum), it detaches the P from the syscalling M and finds a different M to host it. The detached M stays in the kernel waiting for the syscall to return; meanwhile the other goroutines keep running.
This is why a Go program with 10000 goroutines doing slow disk reads doesn't grind to a halt: the runtime keeps spawning M's to absorb the syscall delays, but the scheduler never runs out of P's to keep useful work going.
The network poller is the other magic trick
Network I/O does NOT block an M. On conn.Read, Go submits the FD to its netpoller (epoll on Linux, kqueue on BSD/macOS, IOCP on Windows), parks the G on the netpoller's wait list, and the M moves on. When the kernel reports the FD is ready, the netpoller pushes the G back onto a P's run queue.
A Go server can hold a million idle TCP connections with eight M's. One M is enough to handle the netpoller's wakeups; the others run G's that are doing actual work.
sysmon: the babysitter thread
sysmon is one OS thread (no P) that runs forever and does:
- Preempt G's that have been running more than 10ms (signal-based, since 1.14).
- Retake P's from M's stuck in syscalls.
- Trigger a forced GC if none has run for two minutes (heap-growth-triggered GC is the pacer's job, not sysmon's).
It runs at low frequency (every 20us up to 10ms, depending on idleness) and is invisible to user code. The "sysmon" entry in a stack trace is this thread.
schedtrace and scheddetail
Running with GODEBUG=schedtrace=1000 prints one line per second showing per-P queue lengths, M counts, and idle counts. Add scheddetail=1 for everything. This is the easiest way to see whether a program is creating too many M's, has unbalanced queues, or is starving the scheduler.
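A sketch of what an invocation looks like; the binary name is a placeholder, and the exact field layout varies across Go versions:

```shell
# Print one scheduler summary line per second while the program runs.
GODEBUG=schedtrace=1000 ./myserver

# A typical line looks roughly like:
#   SCHED 2008ms: gomaxprocs=8 idleprocs=6 threads=12 spinningthreads=0 idlethreads=4 runqueue=0 [1 0 2 0 0 0 0 0]
# runqueue is the global queue length; the bracketed list is the
# per-P local queue lengths.

# Add verbose per-G, per-M, per-P state:
GODEBUG=schedtrace=1000,scheddetail=1 ./myserver
```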
When this actually matters
For 95% of Go code, the right approach is to write goroutines and channels and let the runtime handle the rest. GMP knowledge becomes essential in the remaining 5% of cases:
- The M count climbs under load (look for blocking syscalls or cgo).
- Goroutines pile up in pprof (look for stuck channel operations or leaks).
- Benchmark numbers don't make sense (check GOMAXPROCS).
- A service holds millions of connections (sizing the netpoller workload, picking GOMAXPROCS).
- Code interacts directly with the runtime (LockOSThread, cgo, signal handling).
For the other 95%: write goroutines, send on channels, and let the scheduler do its job.
Runtime primitives
- GOMAXPROCS (number of P's, default = NumCPU)
- runtime.GOMAXPROCS(n)
- runtime.Gosched (yield current G)
- runtime.NumGoroutine
- runtime/debug.SetGCPercent (interacts with sysmon-driven GC)
Implementation
When a goroutine spawns another with go, the new G goes into the parent P's runnext slot. If the parent then blocks (channel, lock), the runtime immediately runs the runnext G on the same P. This avoids a queue insert and keeps the cache hot. It's what makes patterns like "spawn worker, send to its channel, wait for reply" cheap.
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	ch := make(chan int)
	go func() {
		// Almost certainly runs on the same P that spawned us,
		// because we landed in runnext and main blocks on <-ch.
		ch <- 42
	}()
	fmt.Println(<-ch, "GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```
When a G calls a blocking syscall (a read on a regular file, for example), the M holding the P calls into the kernel. If the syscall lasts longer than about one sysmon tick (~20µs), sysmon detaches the P from this M. The P picks up a fresh M (or wakes a parked one) and runs other G's. The blocked G stays with its original M; when the syscall returns, the M tries to reacquire its old P, and if that fails, the G goes back on the global queue and the M parks.
```go
package main

import (
	"io"
	"os"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Open and read a file: blocking syscalls on Linux.
			// The runtime detaches the P from this M during the read,
			// so the other 9999 goroutines keep making progress.
			f, err := os.Open("/etc/hostname")
			if err == nil {
				io.Copy(io.Discard, f)
				f.Close()
			}
		}()
	}
	wg.Wait()
	// 10K parallel blocking reads with GOMAXPROCS=8 still feel fine
	// because the runtime spawns extra M's to absorb stuck syscalls.
}
```
Network reads and writes go through the netpoller, not blocking syscalls. The G is parked on the netpoller's wait list (no M involved); when the kernel reports the FD is ready, the netpoller pushes the G back onto a P's run queue. This is why a Go server can hold a million idle TCP connections cheaply: a handful of M's can service enormous numbers of waiting G's, with no kernel thread tied up per connection.
```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(io.Discard, c) // parks the G in the netpoller, releases the M
		}(conn)
	}
	// 1M idle connections: 1M parked G's at ~2 KB of stack each (a few GB
	// total), and only around GOMAXPROCS M's. A thread-per-connection model
	// would need 1M kernel threads.
}
```
Before Go 1.14, preemption only happened at function-call boundaries. A tight CPU-bound loop with no function calls and no select/channel/lock operations would never yield. With GOMAXPROCS=1, such a loop starved every other goroutine forever. Go 1.14 added signal-based async preemption: sysmon sends SIGURG to a long-running G's M, which interrupts the G at a safe point and lets the scheduler reschedule. This means tight pure-CPU loops are now preemptible, and runtime.Gosched() is almost never needed anymore.
```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Pre-1.14: this exact loop, with no calls and no select, was unpreemptible.
// 1.14+: sysmon sends SIGURG; the runtime preempts at the next safe point.
func tightLoop() {
	x := 0
	for {
		x++ // no function calls, no channel ops, no select
	}
}

func main() {
	runtime.GOMAXPROCS(1)
	go tightLoop()

	// On Go 1.14+ this prints. On pre-1.14 it never would, because
	// tightLoop would never yield the only available P.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("main got the P back")
}
```
Key points
- G = goroutine (small struct + stack), M = OS thread (machine), P = logical processor (GOMAXPROCS of these). An M needs a P to run G's.
- Each P has a local run queue of up to 256 runnable G's plus a single runnext slot for the most recently spawned goroutine. Overflow goes to the global queue.
- Work stealing: when a P's local queue is empty, it steals half of another P's queue. Keeps load balanced without explicit coordination.
- Blocking syscalls: the M holding a P releases the P (handoff), another M (or a fresh one) picks the P up and continues running other G's. The blocked G stays attached to its M.
- Channel and lock waits don't block an M. The G is parked in user space (a wait list inside the channel/mutex); the M moves on to other work.
- The network poller (epoll/kqueue/IOCP) is one shared component. When network I/O completes, it puts the waiting G back on a P's run queue.
- sysmon is a dedicated OS thread (no P) that preempts long-running G's, retakes P's stuck in syscalls too long, and triggers GC.
- Async preemption via signals (since Go 1.14): even a tight loop with no function calls can be preempted. Before 1.14, such a loop could starve the scheduler.
Follow-up questions
- Why have P at all? Why not just M and G?
- What is 'spinning' and why does the runtime do it?
- How does GOMAXPROCS interact with cgroups and CPU limits?
- Can the scheduler be starved?
- What's the cost of goroutine creation?
Gotchas
- !Runaway G creation under load: spawning a goroutine per inbound message without bounds will OOM the process when the queue can't drain fast enough. Always have a worker pool or semaphore.
- !cgo calls block an M for the entire C call. Millions of cgo calls per second will exhaust the M pool and require raising the soft cap (default 10000).
- !GOMAXPROCS too low (default 1 in Go pre-1.5, or in containerized environments without automaxprocs) silently serializes everything.
- !GOMAXPROCS too high on a small cgroup cuts performance because the OS scheduler thrashes between threads that compete for limited CPU.
- !Long syscalls (slow disks) can spike the M count; suddenly seeing hundreds of M's is a sign of blocking syscalls under load.
Common pitfalls
- Calling runtime.Gosched() to 'help the scheduler.' Almost always unnecessary since Go 1.14; reaching for it usually signals a hot loop where the real fix is to break it up or call into a blocking primitive.
- Using runtime.LockOSThread for performance reasons. It pins the G to its M and disables many scheduler optimizations. Use only when required (window-system bindings, OpenGL, certain crypto libraries).
- Assuming GOMAXPROCS == NumCPU. On Kubernetes pods with CPU limits, the default is wrong; use uber-go/automaxprocs.
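A sketch of the container fix; `./server` is a placeholder binary. Note that recent Go releases have started consulting cgroup CPU limits when choosing the default, so check your toolchain version before reaching for a workaround:

```shell
# Inside a container limited to 2 CPUs on a 64-core node, older Go
# defaults GOMAXPROCS to 64 (it reads the machine, not the cgroup).
# Pin it explicitly via the environment:
GOMAXPROCS=2 ./server

# Or set it from the cgroup quota at startup with uber-go/automaxprocs,
# via a blank import in main:
#     import _ "go.uber.org/automaxprocs"
```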
APIs worth memorising
- runtime.GOMAXPROCS
- runtime.Gosched
- runtime.NumGoroutine
- runtime.LockOSThread / UnlockOSThread
- GODEBUG=schedtrace=1000 (prints scheduler state every second)
- GODEBUG=scheddetail=1 (prints detailed per-P, per-M state)
Every Go server uses this. The reason a Go HTTP server can handle 100K concurrent connections without breaking a sweat is the GMP design: 100K goroutines, GOMAXPROCS M's, the netpoller doing the heavy lifting. Compare to a thread-per-request server (Java pre-virtual-threads, classic C with pthreads), where 100K connections meant 100K kernel threads or an explicit event loop in application code.