OOM Killer & Memory Pressure
Mental Model
An overbooked flight. The airline sold more seats than exist. When every passenger shows up, someone has to be bumped. The gate agent scans boarding passes -- no frequent-flyer status, cheapest ticket, booked last -- and escorts that person off. No negotiation. That gate agent is the OOM killer, and the boarding passes are oom_score_adj values.
The Problem
3 AM page. The database is dead -- logs just say "Killed." No stack trace, no error message, nothing. The host still has 30% memory free, so what happened? A cgroup limit was breached, or a JVM full GC touched enough pages to spike RSS over the ceiling. The kernel's OOM scorer tagged the biggest resident process, fired SIGKILL, and moved on. Exit code 137.
Architecture
The scenario above -- a process shot dead with no warning, no graceful shutdown, and 30% of host memory still free -- is the OOM killer at work. Understanding why it chose that process is the first step to making sure it never happens again.
What Actually Happens
When a process tries to allocate memory and the kernel cannot satisfy the request, here is the cascade:
First, the kernel wakes kswapd (the background reclaim daemon) and enters direct reclaim in the allocating thread itself. It evicts clean page cache pages, writes back dirty pages, swaps anonymous pages to disk, and compacts memory to create contiguous free regions. The allocating thread is blocked while direct reclaim runs.
Second, if one pass cannot free enough, the kernel retries reclaim and compaction, potentially several times, before giving up.
Third, if all of that fails, out_of_memory() is called. This is the last resort.
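The first two stages are visible from user space before anything dies. Reclaim activity shows up in the counters in /proc/vmstat (a minimal check; these counter names are stable on modern kernels):

# Rising pgscan_direct means allocating threads are stalling in direct reclaim,
# not just kswapd working in the background
grep -E '^(pgscan|pgsteal)_(kswapd|direct) ' /proc/vmstat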
The OOM killer iterates over every process and calls oom_badness() for each. The scoring is simple: the base score is the process's RSS plus its swap usage plus its page-table pages, shifted up or down by oom_score_adj (range -1000 to +1000, scaled against total memory). The process with the highest score gets SIGKILL. No signal handler can catch it. No cleanup code runs. The process is dead.
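Every input to that decision is readable from user space. A quick sketch for a single process (PID 1234 is a placeholder):

pid=1234
grep -E 'VmRSS|VmSwap' /proc/$pid/status   # resident and swapped memory
cat /proc/$pid/oom_score_adj               # operator bias, -1000 to +1000
cat /proc/$pid/oom_score                   # the kernel's final verdict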
The kernel prefers to kill one large process over many small ones because a single kill that frees 4 GB is better than killing 100 processes that each free 40 MB.
Under the Hood
Overcommit is the root cause. Linux's default mode (vm.overcommit_memory=0) uses heuristic overcommit. It lets processes malloc() more than physically available because most processes allocate far more than they touch. A JVM with -Xmx8g that actually uses 2 GB is the norm, not the exception.
This means malloc() almost never fails. The allocation succeeds, the kernel records the virtual mapping, and physical pages are committed later when the process touches them. If everyone touches their pages at once and there is not enough RAM, someone dies.
With vm.overcommit_memory=2 (strict accounting), the kernel tracks total committed memory against CommitLimit = swap_size + (overcommit_ratio/100 * physical_RAM). Allocations that exceed the limit get ENOMEM immediately. No surprises, but also no flexibility -- and without enough swap, only half the RAM may be usable.
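Worked example: a swap-less 64 GB host with the default overcommit_ratio of 50 gets CommitLimit = 0 + (50/100 * 64 GB) = 32 GB, so allocations start failing with half the RAM untouched. Both sides of that comparison live in /proc/meminfo:

# Committed_AS approaching CommitLimit means mode 2 will start returning ENOMEM
grep -E 'CommitLimit|Committed_AS' /proc/meminfo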
Cgroup OOM is the container trap. In Docker and Kubernetes, the OOM killer can trigger at the cgroup level when memory.max is exceeded -- even if the host has plenty of free memory. The cgroup OOM killer only considers processes within that cgroup. This is why a container can be OOM-killed at 512 MB on a 128 GB host.
The event is logged to memory.events (oom and oom_kill counters), not always to dmesg. Many teams miss these entirely because they only monitor system logs.
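Checking the counters is a single read (a cgroup v2 sketch; the cgroup path is an example -- substitute your service or pod):

cat /sys/fs/cgroup/system.slice/myapp.service/memory.events
# oom 3        <- the limit was hit and reclaim inside the cgroup failed
# oom_kill 3   <- processes were actually killed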
The OOM reaper handles stuck victims. Introduced in Linux 4.6, the OOM reaper runs in a kernel thread. If the OOM-killed process is stuck in D-state (uninterruptible sleep) and cannot exit, the reaper unmaps its anonymous memory areas proactively. Without this, the victim could hold a lock needed by other processes trying to free memory, creating a deadlock where the system needs memory to free memory.
PSI is the crystal ball. Pressure Stall Information (/proc/pressure/memory) reports the percentage of time tasks are stalled waiting for memory. It is a gradient, not a binary. By the time the OOM killer fires, PSI has been screaming for minutes. User-space daemons like systemd-oomd watch these numbers and kill processes proactively -- with SIGTERM first, proper logging, and respect for service priorities. This is strictly better than the kernel's blunt SIGKILL.
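Reading it takes one command; the numbers below are illustrative:

cat /proc/pressure/memory
# some avg10=2.04 avg60=0.85 avg300=0.21 total=123456789
# full avg10=0.32 avg60=0.10 avg300=0.03 total=4567890
# "some": at least one task was stalled on memory
# "full": all non-idle tasks were stalled at once -- the danger signal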
Common Questions
A production process was OOM-killed but the host had 30% free memory. What happened?
Almost certainly a cgroup memory limit (memory.max), not a global OOM. Check dmesg for "memory cgroup out of memory" vs plain "Out of memory." In Kubernetes, kubectl describe pod shows the OOMKilled reason and the container's memory limit. This is the single most common OOM mystery in containerized environments.
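The distinction is one grep away (exact log wording varies slightly across kernel versions):

dmesg -T | grep -iE 'out of memory|oom-kill'
# "Memory cgroup out of memory: Killed process ..." -> a cgroup limit was breached
# "Out of memory: Killed process ..."               -> the host itself ran out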
How is a critical process protected from being OOM-killed?
Set oom_score_adj=-1000 via /proc/<pid>/oom_score_adj or systemd's OOMScoreAdjust=-1000. But use this sparingly. If all processes are immune, the kernel panics. The better approach is proper capacity planning: set appropriate cgroup memory limits, use PSI-based proactive killing of non-critical workloads, and leave headroom.
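For a service, the durable way to set this is a systemd drop-in rather than poking /proc after every restart (a sketch; the unit name is an example):

sudo systemctl edit myapp.service
# then add to the drop-in:
#   [Service]
#   OOMScoreAdjust=-1000
sudo systemctl restart myapp.service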
What is the difference between overcommit modes 0, 1, and 2?
Mode 0 (heuristic): rejects obviously excessive allocations but allows moderate overcommit. This is the default and works for most workloads. Mode 1 (always): never rejects anything, relying entirely on OOM if memory runs out. Redis deployments sometimes use this. Mode 2 (strict): tracks committed memory and returns ENOMEM before exhaustion. Prevents OOM kills but can underutilize memory.
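The mode can be switched at runtime for testing (a sketch; persist via a file in /etc/sysctl.d/ once you settle on values):

sudo sysctl vm.overcommit_memory=2 vm.overcommit_ratio=80
# revert to the heuristic default
sudo sysctl vm.overcommit_memory=0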
How does swap affect OOM?
Swap extends the kernel's ability to reclaim memory by moving anonymous pages to disk, delaying OOM. But excessive swapping (thrashing) can make the system unusable long before OOM triggers. PSI captures this perfectly: high some pressure with low full means the system is swapping but progressing. High full pressure means tasks are completely stalled -- OOM is imminent.
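One hedged way to watch for thrashing, assuming vmstat from procps is available:

# Sustained nonzero si/so (swap-in/swap-out) alongside rising PSI "full"
# means the system is thrashing and OOM is close
vmstat 1 5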
How Technologies Use This
A container exits with code 137, but the host still has 100 GB of free RAM. The application logs show nothing useful, and the team spends hours chasing phantom application crashes before realizing the container was killed externally.
Docker maps --memory to cgroup memory.max, creating an absolute ceiling enforced by the kernel at the page allocator level. When the container exceeds its 512 MB limit, the cgroup-scoped OOM killer fires and kills only processes inside that container. The kill increments the cgroup's memory.events counters and shows up as a "Memory cgroup out of memory" line in the kernel log -- easy to miss for teams that only watch application logs.
Run docker inspect to check the OOMKilled flag and exact timestamp, then correlate it with the memory spike that caused it. Setting appropriate --memory limits confines the OOM blast radius to a single container instead of letting the kernel hunt globally and potentially kill the database.
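A minimal check, assuming a container named my-container:

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.FinishedAt}}' my-container
# true 137 <timestamp>  -> killed by the cgroup OOM killer, not an app crash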
A node running 40 pods hits memory pressure, and the kernel kills the production API server instead of the log collector running next to it. The API server was the largest process, so the kernel scored it highest and chose it as the victim, causing a user-facing outage.
Kubernetes rigs the OOM scoring game using oom_score_adj. BestEffort pods get oom_score_adj set to 1000, making them first to die. Guaranteed pods get -997, making them effectively immune. The kubelet also watches node memory against its eviction thresholds and evicts lower-priority pods proactively under memory pressure, sending a graceful SIGTERM with a termination grace period instead of waiting for the kernel's instant SIGKILL.
Always set resource requests and limits on production pods to get Guaranteed QoS class. This ensures the kernel sacrifices BestEffort log collectors and batch jobs first, keeping critical workloads alive during node-level memory pressure on a 64 GB node.
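Verifying which QoS class a pod actually landed in takes one command (the pod name is an example):

kubectl get pod my-api -o jsonpath='{.status.qosClass}'   # expect: Guaranteed
kubectl describe pod my-api | grep -iA3 'last state'      # look for Reason: OOMKilled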
A runaway analytical query allocates 40 GB on a 64 GB host, and the OOM killer shoots the postmaster process because it accounts for all shared_buffers RSS. Every connection dies simultaneously, and the full database restart takes 3 minutes, affecting every user.
The kernel scores the postmaster highest because its RSS includes all of shared_buffers, even though the runaway backend is the actual culprit. Without OOM score tuning, the kernel cannot distinguish between the parent postmaster and the child backend that caused the problem, so it kills the biggest target.
Set oom_score_adj to -1000 on the postmaster and have each child backend reset itself to 0 -- children inherit the parent's value, so PostgreSQL provides the PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE environment variables for exactly this. The kernel then kills only the offending query backend. The postmaster survives and automatically reinitializes, so the outage is a quick crash-recovery cycle with a retryable disconnection error for clients, not a 3-minute full restart.
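A sketch for a systemd-managed PostgreSQL, wiring up the PG_OOM_ADJUST_* variables described above (the unit name and paths may differ per distribution):

sudo systemctl edit postgresql.service
# then add:
#   [Service]
#   OOMScoreAdjust=-1000
#   Environment=PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
#   Environment=PG_OOM_ADJUST_VALUE=0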
A JVM with only 2 GB of live objects keeps getting OOM-killed in a 4 GB container. The heap looks half-empty, and the pod restarts every few hours during garbage collection cycles with no apparent memory leak.
A full GC marks every reachable object, touching heap pages and spiking RSS toward the full -Xmx value even though most of that memory is garbage about to be freed. The container's cgroup limit sees this sudden wall of physical memory usage and fires the OOM killer mid-collection. On top of the heap, thread stacks, metaspace, and other native memory add hundreds of megabytes of RSS -- so a 4 GB limit paired with a 4 GB max heap is guaranteed to be too small.
Set -XX:MaxRAMPercentage to 70% to cap the heap at 2.8 GB in a 4 GB container, leaving 1.2 GB of headroom for non-heap memory. Never set the container memory limit equal to the max heap size, because GC-induced RSS spikes will guarantee periodic OOM kills.
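A hedged sketch (the flags exist since JDK 10; the jar name is a placeholder):

java -XX:MaxRAMPercentage=70.0 -jar app.jar
# verify the heap ceiling the JVM actually derived from the cgroup limit
java -XX:MaxRAMPercentage=70.0 -XX:+PrintFlagsFinal -version | grep -i ' maxheapsize'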
Same Concept Across Tech
| Technology | How OOM manifests | Key signal |
|---|---|---|
| Docker | Container killed with exit code 137 | docker inspect shows OOMKilled: true |
| Kubernetes | Pod status OOMKilled, restarts increment | kubectl describe pod, look for OOMKilled reason |
| JVM | Process killed mid-GC or after heap expansion hits cgroup limit | Not a Java OutOfMemoryError. Process just disappears. dmesg confirms |
| Node.js | Process killed during V8 heap expansion or Buffer allocation | Exit code 137, no uncaughtException handler fires |
| Go | Process killed during large allocation or mmap | Exit code 137, no panic/recover |
| PostgreSQL | Shared memory allocation exceeds cgroup limit, all connections drop | pg_log shows nothing. dmesg shows oom-kill |
Stack layer mapping (process mysteriously dying):
| Layer | What to check | Tool |
|---|---|---|
| Application | Did the app log an error before dying? | Application logs |
| Runtime | Was the runtime expanding heap or allocating native memory? | JVM GC logs, Node --max-old-space-size |
| Cgroup | Did memory usage hit the cgroup limit? | memory.current vs memory.max (v2); memory.usage_in_bytes vs memory.limit_in_bytes (v1) |
| Kernel | Did the OOM killer fire? Which process was selected? | dmesg, /var/log/kern.log |
| Hardware | Is physical RAM actually exhausted? Or just the cgroup limit? | free -h, /proc/meminfo |
Design Rationale
Overcommit is the default because processes routinely reserve far more memory than they touch -- a JVM with an 8 GB max heap that actually uses 2 GB is normal, not exceptional. Strict accounting (mode 2) would reject those allocations and waste half the RAM on a swap-less machine. The tradeoff: if everyone touches their pages at the same time, someone dies. The kernel targets the largest resident process because one big kill reclaims more than dozens of small ones. And since the kernel has no idea which process matters to the business, oom_score_adj hands that decision to operators.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Exit code 137, no error in app logs | OOM killer sent SIGKILL | dmesg |
| Container restarts repeatedly in K8s | Memory limit too low for the workload | kubectl describe pod, check OOMKilled |
| JVM killed but no OutOfMemoryError | Native memory or metaspace pushed RSS over cgroup limit | Check RSS vs cgroup limit, not just heap |
| Host has free memory but container OOM'd | Cgroup limit is per-container, not host-level | Compare memory.current to memory.max (v2) or usage_in_bytes to limit_in_bytes (v1) |
| Random process killed, not the biggest one | oom_score_adj was set, or the biggest process had OOM protection | Check oom_score_adj for all processes |
| OOM kills happen during GC pauses | GC touches pages that were lazy-allocated, suddenly materializing RSS | Reduce heap or increase cgroup limit to cover GC peak |
When to Use / Avoid
Relevant when:
- Containers run with memory limits -- cgroup OOM is the most common production trigger by far
- JVM heap + metaspace + native allocations can blow past a cgroup ceiling during GC
- Overcommit is on (the Linux default), meaning total virtual memory quietly exceeds physical RAM
- Processes die with no application-level error and the only clue is exit code 137
Protect against it by:
- Pinning oom_score_adj to -1000 on processes that absolutely cannot die
- Setting memory.oom.group in cgroup v2 so the whole cgroup goes down together instead of one random victim
- Using memory.low as a soft reservation that shields a workload from reclaim under pressure, keeping it out of the OOM path in the first place (both this and memory.oom.group are shown in the sketch below)
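Both knobs are single writes on cgroup v2 (paths and sizes are examples):

echo 1 | sudo tee /sys/fs/cgroup/myapp/memory.oom.group   # kill the whole group, not one victim
echo 512M | sudo tee /sys/fs/cgroup/myapp/memory.low      # soft protection from reclaim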
Try It Yourself
# Check OOM score for all processes (sorted)
for pid in /proc/[0-9]*/; do echo "$(cat $pid/oom_score 2>/dev/null) $(cat $pid/comm 2>/dev/null) $(basename $pid)"; done | sort -rn | head -20

# Set OOM score adjustment for a process
echo -1000 | sudo tee /proc/$(pidof postgres)/oom_score_adj

# Check overcommit settings
sysctl vm.overcommit_memory vm.overcommit_ratio

# View committed memory vs limit
grep -E 'Committed_AS|CommitLimit' /proc/meminfo

# Monitor memory pressure in real time
watch -n1 cat /proc/pressure/memory

# Search for OOM events in kernel log
dmesg | grep -A5 'Out of memory\|oom-kill\|Killed process'

Debug Checklist
1. Check if OOM killed a process: dmesg | grep -i 'oom\|killed process'
2. Check OOM scores: for p in /proc/[0-9]*/oom_score; do echo $(cat $p) $(cat ${p%/*}/cmdline | tr '\0' ' '); done | sort -rn | head
3. Check cgroup memory usage: cat /sys/fs/cgroup/memory/.../memory.usage_in_bytes
4. Check cgroup memory limit: cat /sys/fs/cgroup/memory/.../memory.limit_in_bytes
5. Check memory pressure: cat /proc/pressure/memory
6. Check overcommit setting: cat /proc/sys/vm/overcommit_memory
Key Takeaways
- ✓The OOM killer is the absolute last resort -- before it fires, the kernel has already tried reclaiming page cache, writing back dirty pages, swapping anonymous pages, and compacting memory; if you are seeing OOM kills, the system was drowning for a while before that
- ✓oom_score_adj is how you rig the game: -1000 makes a process immortal (but if everything is immortal, the kernel panics), +1000 volunteers it as tribute; Kubernetes uses this to protect Guaranteed pods and sacrifice BestEffort ones
- ✓Overcommit mode 0 is the kernel making a bet that not everyone will cash their checks at once -- malloc succeeds now, but if everyone touches their pages later, someone gets killed; this is why "malloc succeeded but the process died later" confuses so many developers
- ✓PSI metrics are your early warning system -- user-space daemons like systemd-oomd watch these numbers and kill processes BEFORE the kernel OOM killer fires, giving you cleaner shutdowns and actual log messages instead of a bare SIGKILL
- ✓The OOM reaper is the kernel's backup plan -- if the victim is stuck in D-state and cannot exit, the reaper strips its anonymous memory anyway, because a dead process that cannot release its pages is worse than useless
Common Pitfalls
- ✗Making everything immune with oom_score_adj=-1000 -- if the kernel cannot find anyone to kill, it panics or hangs; at least one non-essential process must be killable, always
- ✗Seeing "Killed" in logs and assuming it is an application bug -- search dmesg for "oom-kill" or "Out of memory" first; OOM kills look identical to crashes unless you check the kernel log
- ✗Using strict overcommit (mode 2) without enough swap -- the commit limit is swap + (ratio * RAM); with no swap and the default 50% ratio, only half your RAM is allocatable, causing ENOMEM with gigabytes still free
- ✗Ignoring PSI until it is too late -- by the time the OOM killer fires, the system has been thrashing for seconds or minutes; monitoring /proc/pressure/memory lets you act before things get that bad
Reference
In One Line
In production, cgroup limits -- not host exhaustion -- trigger most OOM kills; tune oom_score_adj and monitor PSI before the kernel has to choose a victim.