Bug Hunt: Goroutine Count Growing 10/sec
A goroutine that blocks on a channel send or receive without a cancellation path lives forever. The fix is almost always a select on ctx.Done() or a buffered channel sized to absorb the sends. The same bug appears in Java (futures never cancelled) and Python (asyncio tasks orphaned).
The puzzle
A Go service ships. For a week, things look fine. Slowly, the dashboards show RSS climbing. By day 14, the pod is OOMKilled. Restart; the cycle repeats.
runtime.NumGoroutine() over time:
hour 0: ~120
hour 6: ~3500
hour 12: ~7100
hour 24: pod restarted by OOMKill
A roughly linear growth in goroutine count is the unmistakable signature of a leak. Every request adds a goroutine that never exits.
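One way to collect a series like this is to sample runtime.NumGoroutine() on a timer; a minimal sketch (the 30-second interval and log format are arbitrary choices, not what the original service used):

package main

import (
	"context"
	"log"
	"runtime"
	"time"
)

// watchGoroutines logs the goroutine count at a fixed interval until ctx is
// cancelled, so a dashboard (or grep) can graph the trend over time.
func watchGoroutines(ctx context.Context, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			log.Printf("goroutines=%d", runtime.NumGoroutine())
		case <-ctx.Done():
			return // the sampler itself has a clear exit
		}
	}
}

func main() {
	go watchGoroutines(context.Background(), 30*time.Second)
	select {} // stand-in for the real service's main loop
}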
The diagnostic, in 30 seconds
Hit /debug/pprof/goroutine?debug=2 (private port). The output shows every goroutine's current stack. The leak signature: thousands of goroutines all stuck at the same line, usually chan send or chan receive. That's where to look.
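If the service doesn't already expose pprof, wiring it up on a private port takes a few lines; a sketch (localhost:6060 is a convention, not a requirement):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a loopback-only port, separate from public traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the real service
}

Then curl localhost:6060/debug/pprof/goroutine?debug=2 and count how many stacks share the same chan send or chan receive line.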
What to look for in the broken code
Read the broken Go example (the leak is most idiomatic in Go, but the pattern applies to every async runtime); a sketch follows the list below. Trace what happens to the goroutine on a single request:
- Handler fires, spawns a goroutine that does logCh <- summarize(r).
- Handler returns 200, request done.
- The spawned goroutine is now blocked on the channel send. Forever, unless something reads.
- If the consumer is slow or down, every request adds one orphan.
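A minimal sketch of the leaky shape described above; the handler body is reconstructed for illustration (logCh and summarize come from the description, the rest is filled in):

package main

import (
	"fmt"
	"net/http"
)

// BROKEN: every request spawns a goroutine that blocks on an unbuffered send
// forever if the consumer is slow or gone.
var logCh = make(chan string) // unbuffered: a send waits until someone reads

func summarize(r *http.Request) string {
	return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
}

func handle(w http.ResponseWriter, r *http.Request) {
	go func() {
		logCh <- summarize(r) // ← no select, no ctx.Done(), no timeout
	}()
	w.WriteHeader(http.StatusOK) // request finishes; the goroutine may never exit
}

func main() {
	http.HandleFunc("/", handle)
	http.ListenAndServe(":8080", nil)
}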
The same bug pattern in three languages
- Go: goroutine blocked on channel send with no select { case <-ctx.Done() }.
- Java: CompletableFuture with no orTimeout waiting on a network call.
- Python: asyncio.create_task with no reference, no timeout, no error handler.
Different runtimes, same root cause: an async unit of work was started without a story for how it terminates on failure.
The fix pattern
Every async unit of work needs answers to three questions (a sketch that covers all three follows the list):
- What's the success exit? (Normal completion, channel send received, future resolved.)
- What's the failure exit? (Timeout, parent context cancelled, channel closed.)
- What's the bound on memory? (Bounded buffer, max queue size, max concurrent tasks.)
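One shape that answers all three at once, sketched with illustrative numbers (a 256-item buffer, 4 workers) and hypothetical names:

package worker

import (
	"context"
	"sync"
)

// run answers all three questions: workers exit on channel close (success) or
// ctx cancellation (failure), and the buffered channel bounds queued memory.
func run(ctx context.Context, jobs <-chan string, process func(string)) {
	work := make(chan string, 256) // memory bound: at most 256 queued items

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // concurrency bound: 4 workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case item, ok := <-work:
					if !ok {
						return // success exit: channel closed and drained
					}
					process(item)
				case <-ctx.Done():
					return // failure exit: parent cancelled
				}
			}
		}()
	}

feed:
	for j := range jobs {
		select {
		case work <- j: // producer has the same two exits as the workers
		case <-ctx.Done():
			break feed
		}
	}
	close(work)
	wg.Wait()
}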
The "every goroutine has a sentence" rule
For every go func() (or create_task, or supplyAsync), be able to write a one-sentence answer to: "this goroutine exits when ___." If no clear answer exists, the code has a potential leak. The answer is usually "the context is cancelled" or "the channel is closed", both via select.
Production-grade pattern
go func() {
	defer wg.Done()                     // tracked exit
	select {
	case work <- payload:               // success exit
	case <-ctx.Done():                  // failure exit (parent cancelled)
	case <-time.After(5 * time.Second): // failure exit (timeout)
	}
}()
Almost every leak diagnosed in production is a missing select somewhere. Build the habit of writing the select first, the channel-send second.
What "fire-and-forget" actually means
When someone writes go log(req) and calls it "fire-and-forget," they're saying: "I don't care if this fails or hangs." But goroutines that hang accumulate. They hold captured variables, channels, mutexes. They consume memory and file descriptors. Fire-and-forget should mean "I don't await the result", not "I have no termination story."
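A fire-and-forget shape that keeps a termination story, using a bounded buffer and dropping on overflow; a sketch with illustrative names and sizes:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// Bounded, non-blocking logging: the handler never blocks on the channel, and
// the buffer caps how much memory a slow consumer can cost.
var logCh = make(chan string, 1024)

func logAsync(msg string) {
	select {
	case logCh <- msg:
	default:
		// Buffer full: drop rather than pile up blocked goroutines.
		log.Println("log buffer full, dropping entry")
	}
}

func handle(w http.ResponseWriter, r *http.Request) {
	logAsync(fmt.Sprintf("%s %s", r.Method, r.URL.Path))
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Single consumer drains the buffer for the life of the process.
	go func() {
		for msg := range logCh {
			log.Println(msg)
		}
	}()
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

The handler no longer spawns a goroutine per request at all: the buffered channel absorbs bursts, a single consumer drains it, and the worst case under a slow downstream is dropped log lines rather than unbounded goroutines.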
Implementations
A CompletableFuture chain. Looks idiomatic. The executor is a fixed-size pool. Under load, the pool's queue grows unbounded, futures never complete, and threads stay parked forever. What's missing?
// BROKEN, futures pile up, never time out
ExecutorService executor = Executors.newFixedThreadPool(8);

CompletableFuture<String> fetch(String url) {
    return CompletableFuture.supplyAsync(() -> {
        return httpClient.get(url); // ← can hang forever
    }, executor);
}

void handle(Request req) {
    fetch(req.url).thenAccept(this::process); // ← no timeout, no cancel
    respond(req, 200);
}

The bug: a future that never completes blocks the executor thread forever. With a fixed-size pool, that thread is gone for good. Eventually all 8 threads are stuck and the pool dies. The fix: every async chain needs a timeout (orTimeout) and explicit error handling (exceptionally). Cancelling the future signals downstream stages to stop. Bounded queues on the executor prevent unbounded backlog.
ExecutorService executor = new ThreadPoolExecutor(
    8, 8, 0L, TimeUnit.MILLISECONDS,
    new ArrayBlockingQueue<>(1000),            // BOUNDED queue
    new ThreadPoolExecutor.CallerRunsPolicy()  // backpressure
);

CompletableFuture<String> fetch(String url) {
    return CompletableFuture
        .supplyAsync(() -> httpClient.get(url), executor)
        .orTimeout(5, TimeUnit.SECONDS)        // timeout
        .exceptionally(ex -> {                 // handle failures
            log.warn("fetch failed", ex);
            return null;
        });
}

Key points
- A goroutine that blocks forever holds memory + captured variables + channels
- Symptom: runtime.NumGoroutine() growing unbounded under load
- Diagnosis: /debug/pprof/goroutine?debug=2 shows what each goroutine is waiting on
- Fix pattern: every goroutine that sends/receives must have a cancellation path
Follow-up questions
- How is a goroutine leak detected in production?
- Why is the unbuffered-channel send pattern so prone to leaks?
- Is asyncio.create_task without await ever correct?
- How is 'goroutine count growing' measured?
Gotchas
- Fire-and-forget goroutines without ctx.Done() are the #1 source of Go production leaks
- Java's CompletableFuture without orTimeout silently inherits 'wait forever' semantics
- Python: asyncio.create_task without keeping a reference can be GC'd mid-execution
- context.WithTimeout (or WithCancel) without calling cancel() leaks the child context and its timer until the parent context ends
- Logging goroutines that block on a slow downstream are the biggest leak risk in HTTP handlers
- Bounded queues + drop-on-full is often better than infinite buffering
Cloudflare, Stripe, and Datadog have all published postmortems about goroutine leaks at scale. The tooling (pprof, async-profiler, py-spy) exists because every long-running concurrent service eventually hits this. Goroutine count over time is the single best 'is my service healthy' metric.