Goroutine Leak Prevention
Every spawned goroutine must have a known exit story: context cancellation, channel close, natural completion, or timeout. Goroutines that block forever are the #1 production scaling bug in Go services. Detect them with runtime.NumGoroutine and the pprof goroutine profile.
The single most common Go production bug
Goroutine leaks. Not memory leaks, not deadlocks. Goroutines that block forever, accumulating until OOM.
The signature
runtime.NumGoroutine() over time. Healthy services: roughly bounded count proportional to in-flight requests. Unhealthy: linear growth with uptime. The graph is unmistakable.
The "every goroutine has a sentence" rule
Every go func() deserves a sentence:
"This goroutine exits when ___."
The blank should be one of:
- The function returns naturally.
- ctx.Done() fires (parent cancelled or deadline reached).
- The channel it ranges over is closed.
- A select case it watches becomes ready.
If the answer is "not sure" or "well, eventually...", that's a potential leak.
The patterns that prevent leaks
// 1. Goroutine that always exits (natural completion)
go func() {
	defer wg.Done()
	process(input) // returns when done
}()

// 2. Goroutine with cancellation
go func() {
	select {
	case <-ctx.Done(): // exits on cancel
	case <-doneSignal:
	}
}()

// 3. Worker pool
go func() {
	for work := range workCh { // exits when workCh is closed
		process(work)
	}
}()

// 4. Bounded send with timeout
go func() {
	select {
	case ch <- value:
	case <-ctx.Done():
	case <-time.After(d): // give up after d
	}
}()

The first goroutine exits implicitly (the function returns). The other three exit explicitly (a select on ctx.Done(), or close plus range).
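As a runnable illustration, patterns 1 and 3 can be combined into a small worker pool in which every goroutine has an obvious exit sentence. This is a sketch under assumptions: runPool, the squaring work, and the pool size are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool starts n workers that square each input and returns the sum of
// the squares. Every worker exits when the jobs channel is closed.
func runPool(n int, inputs []int) int {
	jobs := make(chan int)
	// Buffered to len(inputs) so a worker can never block on its send.
	results := make(chan int, len(inputs))

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs { // this goroutine exits when jobs is closed
				results <- j * j
			}
		}()
	}

	for _, in := range inputs {
		jobs <- in
	}
	close(jobs) // the exit signal for every worker
	wg.Wait()   // all workers have returned past this point
	close(results)

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(runPool(3, []int{1, 2, 3, 4})) // prints 30
}
```

Closing jobs is the single exit path, and wg.Wait() proves that every worker took it before the function returns.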
Test for leaks and make it CI-gating
go.uber.org/goleak is the standard tool. Two integration patterns:
// Whole test binary
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Per test
func TestHandler(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ...
}

If a test leaves a goroutine behind, the test fails with that goroutine's stack trace. Add this to test setup. It catches leaks before production does.
Diagnose with pprof
When NumGoroutine grows in production:
# Dump all goroutines (private port only!)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt

Inspect the output. Thousands of goroutines blocked at the same line are the leak, usually in a chan send or chan receive state. That line is exactly where a select with cancellation belongs.
Pprof must be on a private port
Exposing /debug/pprof publicly is an information-disclosure vulnerability: it can leak request data and lets attackers DoS the service. Standard practice is to bind it to localhost:6060 (or a VPC-internal address) and reach it through an SSH tunnel or admin VPN.
Primitives by language
- context.WithTimeout / WithCancel
- select with ctx.Done()
- close(ch) for signaling
- /debug/pprof/goroutine
- runtime.NumGoroutine
Implementation
Spawning a goroutine to do background work without a cancellation path is the classic leak. If nothing ever receives from the channel, every request leaks one goroutine. This is common in HTTP handlers.
var logCh = make(chan string) // unbuffered

func handler(w http.ResponseWriter, r *http.Request) {
	go func() {
		logCh <- summarize(r) // blocks forever if no consumer
	}()
	w.WriteHeader(200)
}
// Every request that arrives while the consumer is slow or absent leaks a goroutine

Pick a fix based on the use case: (1) a buffered channel with bounded capacity for burst absorption; (2) ctx.Done() for cancellation-aware exit; (3) drop-on-full for graceful overflow.
// Fix #1: buffered channel absorbs bursts (senders still block once it is full)
var logCh = make(chan string, 10_000)

// Fix #2: context-aware
func handler(w http.ResponseWriter, r *http.Request) {
	go func() {
		select {
		case logCh <- summarize(r):
		// The request context is also cancelled when the handler returns,
		// so a send that cannot proceed promptly is abandoned.
		case <-r.Context().Done(): // request cancelled → exit
		}
	}()
	w.WriteHeader(200)
}

// Fix #3: drop on overflow (telemetry-style)
func handler(w http.ResponseWriter, r *http.Request) {
	go func() {
		select {
		case logCh <- summarize(r):
		default:
			metrics.IncDropped() // visible counter
		}
	}()
	w.WriteHeader(200)
}

Add a leak check to the test setup. goleak.VerifyTestMain(m) fails the test run if any goroutines outlive the tests, so leaks are caught at CI time.
package mypackage

import (
	"net/http/httptest" // needed by the per-test example below
	"testing"

	"go.uber.org/goleak"
)

func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Or per-test:
func TestHandler(t *testing.T) {
	defer goleak.VerifyNone(t)
	handler(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
}

Expose pprof on a private port. When NumGoroutine keeps climbing, dump the goroutine profile and look for many goroutines stuck at the same line.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers
)

func main() {
	// Private port, never expose pprof publicly
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of service ...
}

// In production:
// $ curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
// → full stack of every goroutine
//
// $ go tool pprof http://localhost:6060/debug/pprof/goroutine
// → interactive profiler

Key points
- Symptom: NumGoroutine grows without bound under load
- Diagnosis: pprof/goroutine shows what each leaked goroutine is blocked on
- Pattern: every send/receive that can block needs a select with a ctx.Done() escape
- Bounded buffered channels absorb bursts; default-on-full handles overflow
- Test for leaks: run the handler 10K times in a test and check NumGoroutine before and after
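The before/after check from the last key point might look like the sketch below. Everything here is an assumption for illustration: goroutineDelta, the two handlers, the stuck channel, and the 500 ms settle window are made up, and the code is written as a standalone program rather than a _test.go file so it can be run directly.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime"
	"time"
)

// stuck is a hypothetical channel that nothing ever receives from.
var stuck = make(chan string)

// leakyHandler spawns a goroutine whose send can never complete.
func leakyHandler(w http.ResponseWriter, r *http.Request) {
	go func() { stuck <- r.URL.Path }() // blocks forever: no receiver
	w.WriteHeader(http.StatusOK)
}

// safeHandler drops the message instead of blocking when nobody receives.
func safeHandler(w http.ResponseWriter, r *http.Request) {
	path := r.URL.Path // read the request before the handler returns
	go func() {
		select {
		case stuck <- path:
		default: // drop on overflow
		}
	}()
	w.WriteHeader(http.StatusOK)
}

// goroutineDelta calls h n times and reports how many more goroutines
// exist afterwards than before, after a short settle period.
func goroutineDelta(h http.HandlerFunc, n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		h(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	}
	// Give well-behaved goroutines a moment to finish before counting.
	deadline := time.Now().Add(500 * time.Millisecond)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("safe handler delta:", goroutineDelta(safeHandler, 10_000))
	fmt.Println("leaky handler delta:", goroutineDelta(leakyHandler, 10_000))
}
```

In a real test suite the same comparison would live in a test function and fail when the delta is positive; goleak does this more thoroughly by also reporting the stacks.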
Follow-up questions
- What's the canonical leak in production Go services?
- How is a leak reproduced locally?
- Why don't unbuffered channels usually leak in tests?
- When is unbuffered fine?
Gotchas
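- context.WithCancel or WithTimeout without `defer cancel()` keeps the child context (and, for WithTimeout, its timer) alive until the parent is cancelled or the deadline passes
- for {} loops without a select on ctx.Done() never exit
- A channel send with no receiver blocks forever; wrap it in a select with a default case or a ctx.Done() case
- time.After in tight loops creates one timer per call (before Go 1.23 each one lingers until it fires); use NewTimer + Stop
- Goroutines can block waiting for an HTTP body to be fully drained; close the body explicitly (on the client side, always close resp.Body)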