Goroutine Leak Prevention
Every spawned goroutine must have a known exit story: context cancellation, channel close, natural completion, or timeout. Goroutines that block forever are the #1 production scaling bug in Go services. Detect them via runtime.NumGoroutine and the pprof goroutine profile.
The single most common Go production bug
Goroutine leaks. Not memory leaks, not deadlocks. Goroutines that block forever, accumulating until OOM.
The signature
Plot runtime.NumGoroutine() over time. Healthy services show a roughly bounded count proportional to in-flight requests. Unhealthy services show linear growth with uptime. The graph is unmistakable.
The "every goroutine has a sentence" rule
Every go func() deserves a sentence
"This goroutine exits when ___."
The blank should be one of:
- The function returns naturally.
- ctx.Done() fires (parent cancelled or deadline reached).
- The channel it ranges over is closed.
- A select case it watches becomes ready.
If the answer is "not sure" or "well, eventually...", that's a potential leak.
The patterns that prevent leaks
// 1. Goroutine that always exits (natural completion)
wg.Add(1) // must happen before the goroutine starts
go func() {
	defer wg.Done()
	process(input) // returns when done
}()

// 2. Goroutine with cancellation
go func() {
	select {
	case <-ctx.Done(): // exits on cancel
	case <-doneSignal:
	}
}()

// 3. Worker pool
go func() {
	for work := range workCh { // exits when workCh is closed
		process(work)
	}
}()

// 4. Bounded send with timeout
go func() {
	select {
	case ch <- value:
	case <-ctx.Done():
	case <-time.After(d): // give up after d
	}
}()
The first goroutine is implicit (function returns). The other three are explicit (select + ctx.Done or close + range).
Test for leaks and make it CI-gating
go.uber.org/goleak is the standard tool. Two integration patterns:
// Whole test binary
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Per test
func TestHandler(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ...
}
If a test leaves a goroutine, the test fails with the goroutine's stack trace. Add this to test setup. It catches leaks before production does.
Diagnose with pprof
When NumGoroutine grows in production:
# Dump all goroutines (private port only!)
# The URL is quoted so the shell does not treat "?" as a glob character.
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt
Inspect the output. Thousands of goroutines blocked at the same line are the leak, usually in a chan send or chan receive state. The line points exactly at where a select with cancellation belongs.
Pprof must be on a private port
Exposing /debug/pprof publicly is an information-disclosure vulnerability: it leaks request data and lets attackers DoS the service. Standard practice: bind to localhost:6060 (or a VPC-internal port). Access via SSH tunnel or admin VPN.
Primitives by language
- context.WithTimeout / WithCancel
- select with ctx.Done()
- close(ch) for signaling
- /debug/pprof/goroutine
- runtime.NumGoroutine
Implementation
The anti-pattern: spawning a goroutine to do background work without a cancellation path. If nothing ever receives from the channel, every request leaks one goroutine. This is common in HTTP handlers.
var logCh = make(chan string) // unbuffered

func handler(w http.ResponseWriter, r *http.Request) {
	go func() {
		logCh <- summarize(r) // blocks forever if no consumer
	}()
	w.WriteHeader(200)
}
// Every request that finds the consumer slow → leaks a goroutine

Pick based on use case: (1) buffered channel with bounded capacity for burst absorption; (2) ctx.Done() for cancellation-aware exit; (3) drop-on-full for graceful overflow.
// The three fixes are alternatives: keep only one definition of handler.

// Fix #1: buffered channel absorbs bursts; senders still block once it fills
var logCh = make(chan string, 10_000)

// Fix #2: context-aware
func handler(w http.ResponseWriter, r *http.Request) {
	go func() {
		select {
		case logCh <- summarize(r):
		case <-r.Context().Done(): // request cancelled or handler returned → exit
		}
	}()
	w.WriteHeader(200)
}

// Fix #3: drop on overflow (telemetry-style)
func handler(w http.ResponseWriter, r *http.Request) {
	go func() {
		select {
		case logCh <- summarize(r):
		default:
			metrics.IncDropped() // visible counter
		}
	}()
	w.WriteHeader(200)
}

Add a leak check to the test setup. goleak.VerifyTestMain(m) fails the test run if any goroutines outlive the tests. This catches leaks at CI time.
package mypackage

import (
	"net/http/httptest"
	"testing"

	"go.uber.org/goleak"
)

func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Or per-test:
func TestHandler(t *testing.T) {
	defer goleak.VerifyNone(t)
	handler(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
}

Expose pprof on a private port. When NumGoroutine keeps growing, dump the goroutine profile and look for many goroutines stuck at the same line.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on http.DefaultServeMux
)

func main() {
	// Private port, never expose pprof publicly
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of service ...
}

// In production:
// $ curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
// → full stack of every goroutine
//
// $ go tool pprof http://localhost:6060/debug/pprof/goroutine
// → interactive profiler

Key points
- Symptom: NumGoroutine grows unbounded under load
- Diagnosis: pprof/goroutine shows what each leaked goroutine is blocked on
- Pattern: every send/receive needs a select-with-ctx.Done() escape
- Bounded buffered channels absorb bursts; default-on-full handles overflow
- Test for leaks: run the handler 10K times in a test and compare NumGoroutine before and after
Follow-up questions
- What's the canonical leak in production Go services?
- How is a leak reproduced locally?
- Why don't unbuffered channels usually leak in tests?
- When is unbuffered fine?
Gotchas
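- context.WithCancel, WithTimeout, or WithDeadline without `defer cancel()` keeps the child context (and, for the timeout variants, its timer) alive until the parent is cancelled or the deadline passes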
- for {} loops without a select on ctx.Done() never exit
- A channel send with no receiver blocks forever; wrap it in a select with a default case or a ctx.Done() case
- time.After in tight loops creates one timer per call, and before Go 1.23 each one stays allocated until it fires; use NewTimer + Stop
- Goroutines can stay blocked waiting for an HTTP body to be fully drained; close the body explicitly (on the client side, always close resp.Body)