Memory Leak Patterns
How Memory Leaks Kill Services
Memory leaks don't page you at deployment time. They page you at 3am the next day when the pods that have been running for 18 hours finally hit their memory limits and get OOM-killed simultaneously. The gradual nature of memory leaks makes them one of the hardest failure patterns to catch in testing.
The typical trajectory: deploy at 9am, memory looks fine at 10am, still fine at 6pm when everyone goes home, pods start dying at 3am. By the time the on-call wakes up, all pods have restarted and the evidence is gone.
Common Leak Sources
Connection pool leaks. Every HTTP client, database driver, and gRPC channel maintains a pool of connections. When error handling doesn't close connections properly, the pool grows until it hits the file descriptor limit or fills available memory. This is the single most common leak pattern across all languages.
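A minimal Go sketch of how this happens in practice (function names are illustrative, not from any real codebase): the happy path closes the response body, but an early return on a bad status skips the close, so the connection is never released back to the pool.

```go
package fetch

import (
	"fmt"
	"io"
	"net/http"
)

// fetchLeaky works fine on the happy path but leaks on errors.
func fetchLeaky(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		// Leak: this return path never closes resp.Body, so the connection
		// is never released back to the transport's pool.
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// fetchSafe defers the close immediately after the error check,
// so every path (including error returns) releases the connection.
func fetchSafe(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}
```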
Event listener accumulation. In Node.js, adding event listeners without removing them is trivial. A React component that subscribes to a WebSocket in useEffect but never unsubscribes leaks memory proportional to how many times the component mounts. In the browser this is annoying; in a long-lived server process, it's a ticking time bomb.
Unbounded caches. In-memory caches without eviction policies grow forever. A cache that stores the result of every unique query will eventually contain every query your system has ever seen. Always set a max size and a TTL.
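A rough sketch of a bounded cache in Go, assuming a simple map-backed store; a real service would likely reach for an LRU library, but the point is the explicit max size and TTL.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

// Cache holds at most maxSize entries, each expiring after ttl.
type Cache struct {
	mu      sync.Mutex
	items   map[string]entry
	maxSize int
	ttl     time.Duration
}

func New(maxSize int, ttl time.Duration) *Cache {
	return &Cache{items: make(map[string]entry), maxSize: maxSize, ttl: ttl}
}

func (c *Cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expires) {
		delete(c.items, key) // lazily drop expired entries
		return "", false
	}
	return e.value, true
}

func (c *Cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.items) >= c.maxSize {
		// Crude eviction: drop expired entries first, then arbitrary ones
		// until we're back under the cap.
		now := time.Now()
		for k, e := range c.items {
			if now.After(e.expires) {
				delete(c.items, k)
			}
		}
		for k := range c.items {
			if len(c.items) < c.maxSize {
				break
			}
			delete(c.items, k)
		}
	}
	c.items[key] = entry{value: value, expires: time.Now().Add(c.ttl)}
}
```

The crude eviction policy is intentional: what matters for leak prevention is that an upper bound exists at all, not which entries get evicted first.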
Goroutine leaks (Go). A goroutine waiting on a channel that nobody writes to never gets garbage collected. You can have 100,000 leaked goroutines each holding a small amount of memory, and the total adds up to gigabytes. Monitor runtime.NumGoroutine() and alert on growth.
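A sketch of both halves of that advice: a goroutine that can leak, the context-based fix, and a watcher that samples runtime.NumGoroutine() (the one-minute interval and log output are illustrative).

```go
package goroutines

import (
	"context"
	"log"
	"runtime"
	"time"
)

// leakyWait blocks forever if nobody ever writes to ch; the goroutine
// (and everything it references) can never be garbage collected.
func leakyWait(ch <-chan int) {
	go func() {
		v := <-ch // blocks forever if the producer bails out early
		log.Println("got", v)
	}()
}

// boundedWait ties the goroutine's lifetime to a context so it exits
// when the caller gives up or the request is cancelled.
func boundedWait(ctx context.Context, ch <-chan int) {
	go func() {
		select {
		case v := <-ch:
			log.Println("got", v)
		case <-ctx.Done():
			return // no leak: goroutine exits on cancellation
		}
	}()
}

// watchGoroutines logs the goroutine count once a minute; feed this into
// your metrics system and alert on sustained growth.
func watchGoroutines(ctx context.Context) {
	t := time.NewTicker(time.Minute)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			log.Printf("goroutines=%d", runtime.NumGoroutine())
		case <-ctx.Done():
			return
		}
	}
}
```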
Detection Strategies
The key metric isn't current memory usage. It's the rate of memory growth. Plot memory over 24 hours and look for a positive slope that doesn't flatten. A healthy service's memory graph looks like a sawtooth wave (grow, GC, drop, grow, GC, drop). A leaking service's graph is a sawtooth with an upward trend.
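A minimal sketch of watching the slope from inside a Go process; in practice you would export container RSS to your metrics system and alert on its rate, but the idea is the same: sample periodically and look for deltas that stay positive.

```go
package memwatch

import (
	"log"
	"runtime"
	"time"
)

// logHeapGrowth samples the live heap every interval and logs the change.
// An upward trend that survives GC shows up as consistently positive deltas.
func logHeapGrowth(interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	var last uint64
	for range t.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		if last != 0 {
			delta := int64(m.HeapAlloc) - int64(last)
			log.Printf("heap_alloc_mb=%d delta_kb=%+d", m.HeapAlloc/(1024*1024), delta/1024)
		}
		last = m.HeapAlloc
	}
}
```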
Use go tool pprof for Go services. Take a heap profile at startup and another 6 hours later. Diff them. The allocations that grew are your suspects. For Java, use async-profiler in allocation mode to track where memory is being allocated without the overhead of a full heap dump.
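For Go, a sketch of the setup and the diff workflow (the port and file names are illustrative):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a private port; never expose this publicly.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

// Then, from a shell (file names are illustrative):
//   curl -o heap-start.pb.gz http://localhost:6060/debug/pprof/heap
//   ... wait ~6 hours ...
//   curl -o heap-later.pb.gz http://localhost:6060/debug/pprof/heap
//   go tool pprof -base heap-start.pb.gz heap-later.pb.gz
// The allocations with the largest positive growth are your suspects.
```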
Mitigation Without a Fix
Sometimes you find the leak but can't deploy a fix immediately. Your options: reduce the container memory limit and increase the replica count so each OOM kill has a smaller blast radius, or set up a cron job that gracefully restarts pods on a rolling basis every 6 hours. This is a band-aid, not a fix, but it keeps you pager-free until the real fix ships.
Configure Kubernetes preStop hooks to drain connections before the restart. Without this, the rolling restart causes connection errors during each pod's termination.
Incident Timeline
- Deployment at 09:00. Memory usage baseline is 512MB per pod. Everything looks normal. No alerts fire.
- Memory grows to 520MB. Within noise. Garbage collection running normally. No performance impact visible.
- Memory at 600MB after 2 hours in production. Growth rate is 1MB/minute. At this rate, the 2GB container limit will be hit in 24 hours.
- Next morning: the first pod hits its memory limit and is OOM-killed by the kernel. Kubernetes restarts it. Traffic shifts to the remaining pods, which are also close to OOM.
- Cascading OOM kills. All 5 pods restart within a 10-minute window. Service is down for 3 minutes during the restart storm.
- On-call identifies the leak source: a new HTTP client that creates a transport per request instead of reusing connections. Hotfix deployed.
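For reference, the root-cause pattern described in that last step looks roughly like this in Go; this is a sketch of the general anti-pattern, not the actual incident code. A zero-value http.Transport per request gives every call its own idle-connection pool that is never reused and never times out.

```go
package httpclient

import "net/http"

// Leaky: a fresh Transport (and its own idle-connection pool) per request.
func doLeaky(req *http.Request) (*http.Response, error) {
	client := &http.Client{Transport: &http.Transport{}}
	return client.Do(req)
}

// Fixed: one shared client whose transport reuses connections across requests.
var shared = &http.Client{}

func doShared(req *http.Request) (*http.Response, error) {
	return shared.Do(req)
}
```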
Detection Signals
- Monotonically increasing memory usage that doesn't decrease after garbage collection cycles
- Increasing GC pause times or GC frequency without corresponding traffic increases
- Container OOM kill events in Kubernetes (reason: OOMKilled in pod status)
- Growing number of goroutines, threads, or file descriptors over time without traffic correlation
Prevention
- Set container memory limits and monitor RSS memory trends over 24-hour windows, not just current usage
- Run load tests with sustained traffic for at least 1 hour to catch slow leaks that don't appear in 5-minute test runs
- Use pprof (Go), async-profiler (Java), or tracemalloc (Python) in staging environments with production-like traffic patterns
- Implement connection pool metrics: track active connections, idle connections, and total created connections over time (see the sketch after this list)
- Deploy canary instances that run for 48 hours before full rollout to catch slow leaks early
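A sketch of the connection pool metrics mentioned in the list above, using the counters that database/sql already exposes via DB.Stats(); the logging interval and format are illustrative, and in practice these would be exported as gauges.

```go
package poolmetrics

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// reportPoolStats periodically logs the pool counters that reveal a leak:
// a steadily climbing OpenConnections or WaitCount with flat traffic is a red flag.
func reportPoolStats(ctx context.Context, db *sql.DB, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s := db.Stats()
			log.Printf("open=%d in_use=%d idle=%d wait_count=%d",
				s.OpenConnections, s.InUse, s.Idle, s.WaitCount)
		case <-ctx.Done():
			return
		}
	}
}
```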
Key Points
- Memory leaks rarely cause immediate outages. They're slow killers that show up 12-48 hours after deployment, often during off-hours when nobody is watching
- The most common leak in Go services is goroutine leaks from unbuffered channels or missing context cancellation
- In Java, the leak is usually in off-heap memory (direct byte buffers, JNI allocations) that doesn't show up in heap dumps
- Container memory limits are a safety net, not a solution. An OOM kill is still an outage, just a shorter one
- Connection pool leaks happen when error paths don't close connections. The happy path works fine; the leak is in the error handling
Common Mistakes
- Setting memory alerts on absolute thresholds instead of growth rate. A service using 1.5GB steadily is fine; a service growing by 50MB/hour is not
- Running heap dumps in production under load, causing additional memory pressure and making the OOM happen sooner
- Fixing the leak by increasing the memory limit instead of finding the root cause, turning a 24-hour time bomb into a 72-hour time bomb