Bug Hunt: Goroutine Count Growing 10/sec
A goroutine that blocks on a channel send or receive without a cancellation path lives forever. The fix is almost always a select on ctx.Done() or a buffered channel sized to absorb the sends. The same bug appears in Java (futures never cancelled) and Python (asyncio tasks orphaned).
The puzzle
A Go service ships. For a week, things look fine. Slowly, the dashboards show RSS climbing. By day 14, the pod is OOMKilled. Restart; the cycle repeats.
runtime.NumGoroutine() over time:
hour 0: ~120
hour 6: ~3500
hour 12: ~7100
hour 24: pod restarted by OOMKill
A roughly linear growth in goroutine count is the unmistakable signature of a leak. Every request adds a goroutine that never exits.
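One way to collect a series like this is to sample runtime.NumGoroutine() on a timer; a minimal sketch (the 30-second interval and log format are arbitrary choices, not what the original service used):

package main

import (
	"context"
	"log"
	"runtime"
	"time"
)

// watchGoroutines logs the goroutine count at a fixed interval until ctx is
// cancelled, so a dashboard (or grep) can graph the trend over time.
func watchGoroutines(ctx context.Context, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			log.Printf("goroutines=%d", runtime.NumGoroutine())
		case <-ctx.Done():
			return // the sampler itself has a clear exit
		}
	}
}

func main() {
	go watchGoroutines(context.Background(), 30*time.Second)
	select {} // stand-in for the real service's main loop
}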
The diagnostic, in 30 seconds
Hit /debug/pprof/goroutine?debug=2 (private port). The output shows every goroutine's current stack. The leak signature: thousands of goroutines all stuck at the same line, usually chan send or chan receive. That's where to look.
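If the service doesn't already expose pprof, wiring it up on a private port takes a few lines; a sketch (localhost:6060 is a convention, not a requirement):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a loopback-only port, separate from public traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the real service
}

Then curl localhost:6060/debug/pprof/goroutine?debug=2 and count how many stacks share the same chan send or chan receive line.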
What to look for in the broken code
Read the broken Go example (the leak is most idiomatic in Go, but the pattern applies to every async runtime); a sketch follows the list below. Trace what happens to the goroutine on a single request:
- Handler fires, spawns a goroutine that does logCh <- summarize(r).
- Handler returns 200, request done.
- The spawned goroutine is now blocked on the channel send. Forever, unless something reads.
- If the consumer is slow or down, every request adds one orphan.
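A minimal sketch of the leaky shape described above; the handler body is reconstructed for illustration (logCh and summarize come from the description, the rest is filled in):

package main

import (
	"fmt"
	"net/http"
)

// BROKEN: every request spawns a goroutine that blocks on an unbuffered send
// forever if the consumer is slow or gone.
var logCh = make(chan string) // unbuffered: a send waits until someone reads

func summarize(r *http.Request) string {
	return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
}

func handle(w http.ResponseWriter, r *http.Request) {
	go func() {
		logCh <- summarize(r) // ← no select, no ctx.Done(), no timeout
	}()
	w.WriteHeader(http.StatusOK) // request finishes; the goroutine may never exit
}

func main() {
	http.HandleFunc("/", handle)
	http.ListenAndServe(":8080", nil)
}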
The same bug pattern in three languages
- Go: goroutine blocked on channel send with no select { case <-ctx.Done() }.
- Java: CompletableFuture with no orTimeout waiting on a network call.
- Python: asyncio.create_task with no reference, no timeout, no error handler.
Different runtimes, same root cause: an async unit of work was started without a story for how it terminates on failure.
The fix pattern
Every async unit of work needs answers to three questions (a sketch that covers all three follows the list):
- What's the success exit? (Normal completion, channel send received, future resolved.)
- What's the failure exit? (Timeout, parent context cancelled, channel closed.)
- What's the bound on memory? (Bounded buffer, max queue size, max concurrent tasks.)
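One shape that answers all three at once, sketched with illustrative numbers (a 256-item buffer, 4 workers) and hypothetical names:

package worker

import (
	"context"
	"sync"
)

// run answers all three questions: workers exit on channel close (success) or
// ctx cancellation (failure), and the buffered channel bounds queued memory.
func run(ctx context.Context, jobs <-chan string, process func(string)) {
	work := make(chan string, 256) // memory bound: at most 256 queued items

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // concurrency bound: 4 workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case item, ok := <-work:
					if !ok {
						return // success exit: channel closed and drained
					}
					process(item)
				case <-ctx.Done():
					return // failure exit: parent cancelled
				}
			}
		}()
	}

feed:
	for j := range jobs {
		select {
		case work <- j: // producer has the same two exits as the workers
		case <-ctx.Done():
			break feed
		}
	}
	close(work)
	wg.Wait()
}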
The "every goroutine has a sentence" rule
For every go func() (or create_task, or supplyAsync), be able to write a one-sentence answer to: "this goroutine exits when ___." If no clear answer exists, the code has a potential leak. The answer is usually "the context is cancelled" or "the channel is closed", both via select.
Production-grade pattern
go func() {
	defer wg.Done()                     // tracked exit
	select {
	case work <- payload:               // success exit
	case <-ctx.Done():                  // failure exit (parent cancelled)
	case <-time.After(5 * time.Second): // failure exit (timeout)
	}
}()
Almost every leak diagnosed in production is a missing select somewhere. Build the habit of writing the select first, the channel-send second.
What "fire-and-forget" actually means
When someone writes go log(req) and calls it "fire-and-forget," they're saying: "I don't care if this fails or hangs." But goroutines that hang accumulate. They hold captured variables, channels, mutexes. They consume memory and file descriptors. Fire-and-forget should mean "I don't await the result", not "I have no termination story."
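A fire-and-forget shape that keeps a termination story, using a bounded buffer and dropping on overflow; a sketch with illustrative names and sizes:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// Bounded, non-blocking logging: the handler never blocks on the channel, and
// the buffer caps how much memory a slow consumer can cost.
var logCh = make(chan string, 1024)

func logAsync(msg string) {
	select {
	case logCh <- msg:
	default:
		// Buffer full: drop rather than pile up blocked goroutines.
		log.Println("log buffer full, dropping entry")
	}
}

func handle(w http.ResponseWriter, r *http.Request) {
	logAsync(fmt.Sprintf("%s %s", r.Method, r.URL.Path))
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Single consumer drains the buffer for the life of the process.
	go func() {
		for msg := range logCh {
			log.Println(msg)
		}
	}()
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

The handler no longer spawns a goroutine per request at all: the buffered channel absorbs bursts, a single consumer drains it, and the worst case under a slow downstream is dropped log lines rather than unbounded goroutines.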
Implementations
A CompletableFuture chain. Looks idiomatic. The executor is a fixed-size pool. Under load, the pool's queue grows unbounded, futures never complete, and threads stay parked forever. What's missing?
// BROKEN, futures pile up, never time out
ExecutorService executor = Executors.newFixedThreadPool(8);

CompletableFuture<String> fetch(String url) {
    return CompletableFuture.supplyAsync(() -> {
        return httpClient.get(url); // ← can hang forever
    }, executor);
}

void handle(Request req) {
    fetch(req.url).thenAccept(this::process); // ← no timeout, no cancel
    respond(req, 200);
}

The bug: a future that never completes blocks the executor thread forever. With a fixed-size pool, that thread is gone for good. Eventually all 8 threads are stuck and the pool dies. The fix: every async chain needs a timeout (orTimeout) and explicit error handling (exceptionally). Cancelling the future signals downstream stages to stop. Bounded queues on the executor prevent unbounded backlog.
ExecutorService executor = new ThreadPoolExecutor(
    8, 8, 0L, TimeUnit.MILLISECONDS,
    new ArrayBlockingQueue<>(1000),            // BOUNDED queue
    new ThreadPoolExecutor.CallerRunsPolicy()  // backpressure
);

CompletableFuture<String> fetch(String url) {
    return CompletableFuture
        .supplyAsync(() -> httpClient.get(url), executor)
        .orTimeout(5, TimeUnit.SECONDS)        // timeout
        .exceptionally(ex -> {                 // handle failures
            log.warn("fetch failed", ex);
            return null;
        });
}

Key points
- A goroutine that blocks forever holds memory + captured variables + channels
- Symptom: runtime.NumGoroutine() growing unbounded under load
- Diagnosis: /debug/pprof/goroutine?debug=2 shows what each goroutine is waiting on
- Fix pattern: every goroutine that sends/receives must have a cancellation path
Follow-up questions
- How is a goroutine leak detected in production?
- Why is the unbuffered-channel send pattern so prone to leaks?
- Is asyncio.create_task without await ever correct?
- How is 'goroutine count growing' measured?
Gotchas
- Fire-and-forget goroutines without ctx.Done() are the #1 source of Go production leaks
- Java's CompletableFuture without orTimeout silently inherits 'wait forever' semantics
- Python: asyncio.create_task without keeping a reference can be GC'd mid-execution
- context.WithTimeout (or WithCancel) without calling cancel() leaks the child context and its timer until the parent context ends
- Logging goroutines that block on a slow downstream are the biggest leak risk in HTTP handlers
- Bounded queues + drop-on-full is often better than infinite buffering
Cloudflare, Stripe, and Datadog have all published postmortems about goroutine leaks at scale. The tooling (pprof, async-profiler, py-spy) exists because every long-running concurrent service eventually hits this. Goroutine count over time is the single best 'is my service healthy' metric.