Memory Leak Patterns
How Memory Leaks Kill Services
Memory leaks don't page you at deployment time. They page you at 3am the next day when the pods that have been running for 18 hours finally hit their memory limits and get OOM-killed simultaneously. The gradual nature of memory leaks makes them one of the hardest failure patterns to catch in testing.
The typical trajectory: deploy at 9am, memory looks fine at 10am, still fine at 6pm when everyone goes home, pods start dying at 3am. By the time the on-call wakes up, all pods have restarted and the evidence is gone.
Common Leak Sources
Connection pool leaks. Every HTTP client, database driver, and gRPC channel maintains a pool of connections. When error handling doesn't close connections properly, the pool grows until it hits the file descriptor limit or fills available memory. This is the single most common leak pattern across all languages.
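A minimal Go sketch of how this happens in practice (function names are illustrative, not from any real codebase): the happy path closes the response body, but an early return on a bad status skips the close, so the connection is never released back to the pool.

```go
package fetch

import (
	"fmt"
	"io"
	"net/http"
)

// fetchLeaky works fine on the happy path but leaks on errors.
func fetchLeaky(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		// Leak: this return path never closes resp.Body, so the connection
		// is never released back to the transport's pool.
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// fetchSafe defers the close immediately after the error check,
// so every path (including error returns) releases the connection.
func fetchSafe(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}
```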
Event listener accumulation. In Node.js, adding event listeners without removing them is trivial. A React component that subscribes to a WebSocket in useEffect but never unsubscribes leaks memory proportional to how many times the component mounts. In the browser this is annoying; in a long-lived server process, it's a ticking time bomb.
Unbounded caches. In-memory caches without eviction policies grow forever. A cache that stores the result of every unique query will eventually contain every query your system has ever seen. Always set a max size and a TTL.
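A rough sketch of a bounded cache in Go, assuming a simple map-backed store; a real service would likely reach for an LRU library, but the point is the explicit max size and TTL.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

// Cache holds at most maxSize entries, each expiring after ttl.
type Cache struct {
	mu      sync.Mutex
	items   map[string]entry
	maxSize int
	ttl     time.Duration
}

func New(maxSize int, ttl time.Duration) *Cache {
	return &Cache{items: make(map[string]entry), maxSize: maxSize, ttl: ttl}
}

func (c *Cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expires) {
		delete(c.items, key) // lazily drop expired entries
		return "", false
	}
	return e.value, true
}

func (c *Cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.items) >= c.maxSize {
		// Crude eviction: drop expired entries first, then arbitrary ones
		// until we're back under the cap.
		now := time.Now()
		for k, e := range c.items {
			if now.After(e.expires) {
				delete(c.items, k)
			}
		}
		for k := range c.items {
			if len(c.items) < c.maxSize {
				break
			}
			delete(c.items, k)
		}
	}
	c.items[key] = entry{value: value, expires: time.Now().Add(c.ttl)}
}
```

The crude eviction policy is intentional: what matters for leak prevention is that an upper bound exists at all, not which entries get evicted first.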
Goroutine leaks (Go). A goroutine waiting on a channel that nobody writes to never gets garbage collected. You can have 100,000 leaked goroutines each holding a small amount of memory, and the total adds up to gigabytes. Monitor runtime.NumGoroutine() and alert on growth.
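A sketch of both halves of that advice: a goroutine that can leak, the context-based fix, and a watcher that samples runtime.NumGoroutine() (the one-minute interval and log output are illustrative).

```go
package goroutines

import (
	"context"
	"log"
	"runtime"
	"time"
)

// leakyWait blocks forever if nobody ever writes to ch; the goroutine
// (and everything it references) can never be garbage collected.
func leakyWait(ch <-chan int) {
	go func() {
		v := <-ch // blocks forever if the producer bails out early
		log.Println("got", v)
	}()
}

// boundedWait ties the goroutine's lifetime to a context so it exits
// when the caller gives up or the request is cancelled.
func boundedWait(ctx context.Context, ch <-chan int) {
	go func() {
		select {
		case v := <-ch:
			log.Println("got", v)
		case <-ctx.Done():
			return // no leak: goroutine exits on cancellation
		}
	}()
}

// watchGoroutines logs the goroutine count once a minute; feed this into
// your metrics system and alert on sustained growth.
func watchGoroutines(ctx context.Context) {
	t := time.NewTicker(time.Minute)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			log.Printf("goroutines=%d", runtime.NumGoroutine())
		case <-ctx.Done():
			return
		}
	}
}
```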
Detection Strategies
The key metric isn't current memory usage. It's the rate of memory growth. Plot memory over 24 hours and look for a positive slope that doesn't flatten. A healthy service's memory graph looks like a sawtooth wave (grow, GC, drop, grow, GC, drop). A leaking service's graph is a sawtooth with an upward trend.
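A minimal sketch of watching the slope from inside a Go process; in practice you would export container RSS to your metrics system and alert on its rate, but the idea is the same: sample periodically and look for deltas that stay positive.

```go
package memwatch

import (
	"log"
	"runtime"
	"time"
)

// logHeapGrowth samples the live heap every interval and logs the change.
// An upward trend that survives GC shows up as consistently positive deltas.
func logHeapGrowth(interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	var last uint64
	for range t.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		if last != 0 {
			delta := int64(m.HeapAlloc) - int64(last)
			log.Printf("heap_alloc_mb=%d delta_kb=%+d", m.HeapAlloc/(1024*1024), delta/1024)
		}
		last = m.HeapAlloc
	}
}
```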
Use go tool pprof for Go services. Take a heap profile at startup and another 6 hours later. Diff them. The allocations that grew are your suspects. For Java, use async-profiler in allocation mode to track where memory is being allocated without the overhead of a full heap dump.
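For Go, a sketch of the setup and the diff workflow (the port and file names are illustrative):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a private port; never expose this publicly.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

// Then, from a shell (file names are illustrative):
//   curl -o heap-start.pb.gz http://localhost:6060/debug/pprof/heap
//   ... wait ~6 hours ...
//   curl -o heap-later.pb.gz http://localhost:6060/debug/pprof/heap
//   go tool pprof -base heap-start.pb.gz heap-later.pb.gz
// The allocations with the largest positive growth are your suspects.
```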
Mitigation Without a Fix
Sometimes you find the leak but can't deploy a fix immediately. Your options: reduce the container memory limit and increase the replica count so each OOM kill has a smaller blast radius, or set up a cron job that gracefully restarts pods on a rolling basis every 6 hours. This is a band-aid, not a fix, but it keeps you pager-free until the real fix ships.
Configure Kubernetes preStop hooks to drain connections before the restart. Without this, the rolling restart causes connection errors during each pod's termination.
Incident Timeline
- Deployment at 09:00. Memory usage baseline is 512MB per pod. Everything looks normal. No alerts fire.
- Memory grows to 520MB. Within noise. Garbage collection running normally. No performance impact visible.
- Memory at 600MB after 2 hours in production. Growth rate is 1MB/minute. At this rate, the 2GB container limit will be hit in 24 hours.
- Next morning: the first pod hits its memory limit and is OOM-killed by the kernel. Kubernetes restarts it. Traffic shifts to the remaining pods, which are also close to OOM.
- Cascading OOM kills. All 5 pods restart within a 10-minute window. Service is down for 3 minutes during the restart storm.
- On-call identifies the leak source: a new HTTP client that creates a transport per request instead of reusing connections. Hotfix deployed.
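For reference, the root-cause pattern described in that last step looks roughly like this in Go; this is a sketch of the general anti-pattern, not the actual incident code. A zero-value http.Transport per request gives every call its own idle-connection pool that is never reused and never times out.

```go
package httpclient

import "net/http"

// Leaky: a fresh Transport (and its own idle-connection pool) per request.
func doLeaky(req *http.Request) (*http.Response, error) {
	client := &http.Client{Transport: &http.Transport{}}
	return client.Do(req)
}

// Fixed: one shared client whose transport reuses connections across requests.
var shared = &http.Client{}

func doShared(req *http.Request) (*http.Response, error) {
	return shared.Do(req)
}
```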
Detection Signals
- Monotonically increasing memory usage that doesn't decrease after garbage collection cycles
- Increasing GC pause times or GC frequency without corresponding traffic increases
- Container OOM kill events in Kubernetes (reason: OOMKilled in pod status)
- Growing number of goroutines, threads, or file descriptors over time without traffic correlation
Prevention
- Set container memory limits and monitor RSS memory trends over 24-hour windows, not just current usage
- Run load tests with sustained traffic for at least 1 hour to catch slow leaks that don't appear in 5-minute test runs
- Use pprof (Go), async-profiler (Java), or tracemalloc (Python) in staging environments with production-like traffic patterns
- Implement connection pool metrics: track active connections, idle connections, and total created connections over time (see the sketch after this list)
- Deploy canary instances that run for 48 hours before full rollout to catch slow leaks early
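A sketch of the connection pool metrics mentioned in the list above, using the counters that database/sql already exposes via DB.Stats(); the logging interval and format are illustrative, and in practice these would be exported as gauges.

```go
package poolmetrics

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// reportPoolStats periodically logs the pool counters that reveal a leak:
// a steadily climbing OpenConnections or WaitCount with flat traffic is a red flag.
func reportPoolStats(ctx context.Context, db *sql.DB, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s := db.Stats()
			log.Printf("open=%d in_use=%d idle=%d wait_count=%d",
				s.OpenConnections, s.InUse, s.Idle, s.WaitCount)
		case <-ctx.Done():
			return
		}
	}
}
```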
Key Points
- Memory leaks rarely cause immediate outages. They're slow killers that show up 12-48 hours after deployment, often during off-hours when nobody is watching
- The most common leak in Go services is goroutine leaks from unbuffered channels or missing context cancellation
- In Java, the leak is usually in off-heap memory (direct byte buffers, JNI allocations) that doesn't show up in heap dumps
- Container memory limits are a safety net, not a solution. An OOM kill is still an outage, just a shorter one
- Connection pool leaks happen when error paths don't close connections. The happy path works fine; the leak is in the error handling
Common Mistakes
- Setting memory alerts on absolute thresholds instead of growth rate. A service using 1.5GB steadily is fine; a service growing by 50MB/hour is not
- Running heap dumps in production under load, causing additional memory pressure and making the OOM happen sooner
- Fixing the leak by increasing the memory limit instead of finding the root cause, turning a 24-hour time bomb into a 72-hour time bomb