OOM Killer & Memory Pressure
Mental Model
An overbooked flight. The airline sold more seats than exist. When every passenger shows up, someone has to be bumped. The gate agent scans boarding passes -- no frequent-flyer status, cheapest ticket, booked last -- and escorts that person off. No negotiation. That gate agent is the OOM killer, and the boarding passes are oom_score_adj values.
The Problem
3 AM page. The database is dead -- logs just say "Killed." No stack trace, no error message, nothing. The host still has 30% memory free, so what happened? A cgroup limit was breached, or a JVM full GC touched enough pages to spike RSS over the ceiling. The kernel's OOM scorer tagged the biggest resident process, fired SIGKILL, and moved on. Exit code 137.
Architecture
The scenario above -- a process shot dead with no warning, no graceful shutdown, and 30% of host memory still free -- is the OOM killer at work. Understanding why it chose that process is the first step to making sure it never happens again.
What Actually Happens
When a process tries to allocate memory and the kernel cannot satisfy the request, here is the cascade:
First, the kernel wakes kswapd (the background reclaim daemon) and enters direct reclaim in the allocating thread itself. It evicts clean page cache pages, writes back dirty pages, swaps anonymous pages to disk, and compacts memory to create contiguous free regions. The allocating thread is blocked while direct reclaim runs.
Second, if one pass cannot free enough, the kernel retries reclaim and compaction, potentially several times, before giving up.
Third, if all of that fails, out_of_memory() is called. This is the last resort.
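The first two stages are visible from user space before anything dies. Reclaim activity shows up in the counters in /proc/vmstat (a minimal check; these counter names are stable on modern kernels):

# Rising pgscan_direct means allocating threads are stalling in direct reclaim,
# not just kswapd working in the background
grep -E '^(pgscan|pgsteal)_(kswapd|direct) ' /proc/vmstat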
The OOM killer iterates over every process and calls oom_badness() for each. The scoring is simple: the base score is the process's RSS plus its swap usage plus its page-table pages, shifted up or down by oom_score_adj (range -1000 to +1000, scaled against total memory). The process with the highest score gets SIGKILL. No signal handler can catch it. No cleanup code runs. The process is dead.
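Every input to that decision is readable from user space. A quick sketch for a single process (PID 1234 is a placeholder):

pid=1234
grep -E 'VmRSS|VmSwap' /proc/$pid/status   # resident and swapped memory
cat /proc/$pid/oom_score_adj               # operator bias, -1000 to +1000
cat /proc/$pid/oom_score                   # the kernel's final verdict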
The kernel prefers to kill one large process over many small ones because a single kill that frees 4 GB is better than killing 100 processes that each free 40 MB.
Under the Hood
Overcommit is the root cause. Linux's default mode (vm.overcommit_memory=0) uses heuristic overcommit. It lets processes malloc() more than physically available because most processes allocate far more than they touch. A JVM with -Xmx8g that actually uses 2 GB is the norm, not the exception.
This means malloc() almost never fails. The allocation succeeds, the kernel records the virtual mapping, and physical pages are committed later when the process touches them. If everyone touches their pages at once and there is not enough RAM, someone dies.
With vm.overcommit_memory=2 (strict accounting), the kernel tracks total committed memory against CommitLimit = swap_size + (overcommit_ratio/100 * physical_RAM). Allocations that exceed the limit get ENOMEM immediately. No surprises, but also no flexibility -- and without enough swap, only half the RAM may be usable.
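Worked example: a swap-less 64 GB host with the default overcommit_ratio of 50 gets CommitLimit = 0 + (50/100 * 64 GB) = 32 GB, so allocations start failing with half the RAM untouched. Both sides of that comparison live in /proc/meminfo:

# Committed_AS approaching CommitLimit means mode 2 will start returning ENOMEM
grep -E 'CommitLimit|Committed_AS' /proc/meminfo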
Cgroup OOM is the container trap. In Docker and Kubernetes, the OOM killer can trigger at the cgroup level when memory.max is exceeded -- even if the host has plenty of free memory. The cgroup OOM killer only considers processes within that cgroup. This is why a container can be OOM-killed at 512 MB on a 128 GB host.
The event is logged to memory.events (oom and oom_kill counters), not always to dmesg. Many teams miss these entirely because they only monitor system logs.
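Checking the counters is a single read (a cgroup v2 sketch; the cgroup path is an example -- substitute your service or pod):

cat /sys/fs/cgroup/system.slice/myapp.service/memory.events
# oom 3        <- the limit was hit and reclaim inside the cgroup failed
# oom_kill 3   <- processes were actually killed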
The OOM reaper handles stuck victims. Introduced in Linux 4.6, the OOM reaper runs in a kernel thread. If the OOM-killed process is stuck in D-state (uninterruptible sleep) and cannot exit, the reaper unmaps its anonymous memory areas proactively. Without this, the victim could hold a lock needed by other processes trying to free memory, creating a deadlock where the system needs memory to free memory.
PSI is the crystal ball. Pressure Stall Information (/proc/pressure/memory) reports the percentage of time tasks are stalled waiting for memory. It is a gradient, not a binary. By the time the OOM killer fires, PSI has been screaming for minutes. User-space daemons like systemd-oomd watch these numbers and kill processes proactively -- with SIGTERM first, proper logging, and respect for service priorities. This is strictly better than the kernel's blunt SIGKILL.
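Reading it takes one command; the numbers below are illustrative:

cat /proc/pressure/memory
# some avg10=2.04 avg60=0.85 avg300=0.21 total=123456789
# full avg10=0.32 avg60=0.10 avg300=0.03 total=4567890
# "some": at least one task was stalled on memory
# "full": all non-idle tasks were stalled at once -- the danger signal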
Common Questions
A production process was OOM-killed but the host had 30% free memory. What happened?
Almost certainly a cgroup memory limit (memory.max), not a global OOM. Check dmesg for "memory cgroup out of memory" vs plain "Out of memory." In Kubernetes, kubectl describe pod shows the OOMKilled reason and the container's memory limit. This is the single most common OOM mystery in containerized environments.
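The distinction is one grep away (exact log wording varies slightly across kernel versions):

dmesg -T | grep -iE 'out of memory|oom-kill'
# "Memory cgroup out of memory: Killed process ..." -> a cgroup limit was breached
# "Out of memory: Killed process ..."               -> the host itself ran out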
How is a critical process protected from being OOM-killed?
Set oom_score_adj=-1000 via /proc/<pid>/oom_score_adj or systemd's OOMScoreAdjust=-1000. But use this sparingly. If all processes are immune, the kernel panics. The better approach is proper capacity planning: set appropriate cgroup memory limits, use PSI-based proactive killing of non-critical workloads, and leave headroom.
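For a service, the durable way to set this is a systemd drop-in rather than poking /proc after every restart (a sketch; the unit name is an example):

sudo systemctl edit myapp.service
# then add to the drop-in:
#   [Service]
#   OOMScoreAdjust=-1000
sudo systemctl restart myapp.service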
What is the difference between overcommit modes 0, 1, and 2?
Mode 0 (heuristic): rejects obviously excessive allocations but allows moderate overcommit. This is the default and works for most workloads. Mode 1 (always): never rejects anything, relying entirely on OOM if memory runs out. Redis deployments sometimes use this. Mode 2 (strict): tracks committed memory and returns ENOMEM before exhaustion. Prevents OOM kills but can underutilize memory.
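The mode can be switched at runtime for testing (a sketch; persist via a file in /etc/sysctl.d/ once you settle on values):

sudo sysctl vm.overcommit_memory=2 vm.overcommit_ratio=80
# revert to the heuristic default
sudo sysctl vm.overcommit_memory=0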
How does swap affect OOM?
Swap extends the kernel's ability to reclaim memory by moving anonymous pages to disk, delaying OOM. But excessive swapping (thrashing) can make the system unusable long before OOM triggers. PSI captures this perfectly: high some pressure with low full means the system is swapping but progressing. High full pressure means tasks are completely stalled -- OOM is imminent.
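One hedged way to watch for thrashing, assuming vmstat from procps is available:

# Sustained nonzero si/so (swap-in/swap-out) alongside rising PSI "full"
# means the system is thrashing and OOM is close
vmstat 1 5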
How Technologies Use This
A container exits with code 137, but the host still has 100 GB of free RAM. The application logs show nothing useful, and the team spends hours chasing phantom application crashes before realizing the container was killed externally.
Docker maps --memory to cgroup memory.max, creating an absolute ceiling enforced by the kernel at the page allocator level. When the container exceeds its 512 MB limit, the cgroup-scoped OOM killer fires and kills only processes inside that container. The kill increments the cgroup's memory.events counters and shows up as a "Memory cgroup out of memory" line in the kernel log -- easy to miss for teams that only watch application logs.
Run docker inspect to check the OOMKilled flag and exact timestamp, then correlate it with the memory spike that caused it. Setting appropriate --memory limits confines the OOM blast radius to a single container instead of letting the kernel hunt globally and potentially kill the database.
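A minimal check, assuming a container named my-container:

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.FinishedAt}}' my-container
# true 137 <timestamp>  -> killed by the cgroup OOM killer, not an app crash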
A node running 40 pods hits memory pressure, and the kernel kills the production API server instead of the log collector running next to it. The API server was the largest process, so the kernel scored it highest and chose it as the victim, causing a user-facing outage.
Kubernetes rigs the OOM scoring game using oom_score_adj. BestEffort pods get oom_score_adj set to 1000, making them first to die. Guaranteed pods get -997, making them effectively immune. The kubelet also watches node memory against its eviction thresholds and evicts lower-priority pods proactively under memory pressure, sending a graceful SIGTERM with a termination grace period instead of waiting for the kernel's instant SIGKILL.
Always set resource requests and limits on production pods to get Guaranteed QoS class. This ensures the kernel sacrifices BestEffort log collectors and batch jobs first, keeping critical workloads alive during node-level memory pressure on a 64 GB node.
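Verifying which QoS class a pod actually landed in takes one command (the pod name is an example):

kubectl get pod my-api -o jsonpath='{.status.qosClass}'   # expect: Guaranteed
kubectl describe pod my-api | grep -iA3 'last state'      # look for Reason: OOMKilled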
A runaway analytical query allocates 40 GB on a 64 GB host, and the OOM killer shoots the postmaster process because it accounts for all shared_buffers RSS. Every connection dies simultaneously, and the full database restart takes 3 minutes, affecting every user.
The kernel scores the postmaster highest because its RSS includes all of shared_buffers, even though the runaway backend is the actual culprit. Without OOM score tuning, the kernel cannot distinguish between the parent postmaster and the child backend that caused the problem, so it kills the biggest target.
Set oom_score_adj to -1000 on the postmaster and have each child backend reset itself to 0 -- children inherit the parent's value, so PostgreSQL provides the PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE environment variables for exactly this. The kernel then kills only the offending query backend. The postmaster survives and automatically reinitializes, so the outage is a quick crash-recovery cycle with a retryable disconnection error for clients, not a 3-minute full restart.
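A sketch for a systemd-managed PostgreSQL, wiring up the PG_OOM_ADJUST_* variables described above (the unit name and paths may differ per distribution):

sudo systemctl edit postgresql.service
# then add:
#   [Service]
#   OOMScoreAdjust=-1000
#   Environment=PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
#   Environment=PG_OOM_ADJUST_VALUE=0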
A JVM with only 2 GB of live objects keeps getting OOM-killed in a 4 GB container. The heap looks half-empty, and the pod restarts every few hours during garbage collection cycles with no apparent memory leak.
A full GC marks every reachable object, touching heap pages and spiking RSS toward the full -Xmx value even though most of that memory is garbage about to be freed. The container's cgroup limit sees this sudden wall of physical memory usage and fires the OOM killer mid-collection. On top of the heap, thread stacks, metaspace, and other native memory add hundreds of megabytes of RSS -- so a 4 GB limit paired with a 4 GB max heap is guaranteed to be too small.
Set -XX:MaxRAMPercentage to 70% to cap the heap at 2.8 GB in a 4 GB container, leaving 1.2 GB of headroom for non-heap memory. Never set the container memory limit equal to the max heap size, because GC-induced RSS spikes will guarantee periodic OOM kills.
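A hedged sketch (the flags exist since JDK 10; the jar name is a placeholder):

java -XX:MaxRAMPercentage=70.0 -jar app.jar
# verify the heap ceiling the JVM actually derived from the cgroup limit
java -XX:MaxRAMPercentage=70.0 -XX:+PrintFlagsFinal -version | grep -i ' maxheapsize'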
Same Concept Across Tech
| Technology | How OOM manifests | Key signal |
|---|---|---|
| Docker | Container killed with exit code 137 | docker inspect shows OOMKilled: true |
| Kubernetes | Pod status OOMKilled, restarts increment | kubectl describe pod, look for OOMKilled reason |
| JVM | Process killed mid-GC or after heap expansion hits cgroup limit | Not a Java OutOfMemoryError. Process just disappears. dmesg confirms |
| Node.js | Process killed during V8 heap expansion or Buffer allocation | Exit code 137, no uncaughtException handler fires |
| Go | Process killed during large allocation or mmap | Exit code 137, no panic/recover |
| PostgreSQL | Shared memory allocation exceeds cgroup limit, all connections drop | pg_log shows nothing. dmesg shows oom-kill |
Stack layer mapping (process mysteriously dying):
| Layer | What to check | Tool |
|---|---|---|
| Application | Did the app log an error before dying? | Application logs |
| Runtime | Was the runtime expanding heap or allocating native memory? | JVM GC logs, Node --max-old-space-size |
| Cgroup | Did memory usage hit the cgroup limit? | memory.current vs memory.max (v2); memory.usage_in_bytes vs memory.limit_in_bytes (v1) |
| Kernel | Did the OOM killer fire? Which process was selected? | dmesg, /var/log/kern.log |
| Hardware | Is physical RAM actually exhausted? Or just the cgroup limit? | free -h, /proc/meminfo |
Design Rationale
Overcommit is the default because processes routinely reserve far more memory than they touch -- a JVM with an 8 GB max heap that actually uses 2 GB is normal, not exceptional. Strict accounting (mode 2) would reject those allocations and waste half the RAM on a swap-less machine. The tradeoff: if everyone touches their pages at the same time, someone dies. The kernel targets the largest resident process because one big kill reclaims more than dozens of small ones. And since the kernel has no idea which process matters to the business, oom_score_adj hands that decision to operators.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Exit code 137, no error in app logs | OOM killer sent SIGKILL | dmesg |
| Container restarts repeatedly in K8s | Memory limit too low for the workload | kubectl describe pod, check OOMKilled |
| JVM killed but no OutOfMemoryError | Native memory or metaspace pushed RSS over cgroup limit | Check RSS vs cgroup limit, not just heap |
| Host has free memory but container OOM'd | Cgroup limit is per-container, not host-level | Compare memory.current to memory.max (v2) or usage_in_bytes to limit_in_bytes (v1) |
| Random process killed, not the biggest one | oom_score_adj was set, or the biggest process had OOM protection | Check oom_score_adj for all processes |
| OOM kills happen during GC pauses | GC touches pages that were lazy-allocated, suddenly materializing RSS | Reduce heap or increase cgroup limit to cover GC peak |
When to Use / Avoid
Relevant when:
- Containers run with memory limits -- cgroup OOM is the most common production trigger by far
- JVM heap + metaspace + native allocations can blow past a cgroup ceiling during GC
- Overcommit is on (the Linux default), meaning total virtual memory quietly exceeds physical RAM
- Processes die with no application-level error and the only clue is exit code 137
Protect against it by:
- Pinning oom_score_adj to -1000 on processes that absolutely cannot die
- Setting memory.oom.group in cgroup v2 so the whole cgroup goes down together instead of one random victim
- Using memory.low as a soft reservation that shields a workload from reclaim under pressure, keeping it out of the OOM path in the first place (both this and memory.oom.group are shown in the sketch below)
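Both knobs are single writes on cgroup v2 (paths and sizes are examples):

echo 1 | sudo tee /sys/fs/cgroup/myapp/memory.oom.group   # kill the whole group, not one victim
echo 512M | sudo tee /sys/fs/cgroup/myapp/memory.low      # soft protection from reclaim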
Try It Yourself
# Check OOM score for all processes (sorted)
for pid in /proc/[0-9]*/; do echo "$(cat $pid/oom_score 2>/dev/null) $(cat $pid/comm 2>/dev/null) $(basename $pid)"; done | sort -rn | head -20

# Set OOM score adjustment for a process
echo -1000 | sudo tee /proc/$(pidof postgres)/oom_score_adj

# Check overcommit settings
sysctl vm.overcommit_memory vm.overcommit_ratio

# View committed memory vs limit
grep -E 'Committed_AS|CommitLimit' /proc/meminfo

# Monitor memory pressure in real time
watch -n1 cat /proc/pressure/memory

# Search for OOM events in kernel log
dmesg | grep -A5 'Out of memory\|oom-kill\|Killed process'

Debug Checklist
1. Check if OOM killed a process: dmesg | grep -i 'oom\|killed process'
2. Check OOM scores: for p in /proc/[0-9]*/oom_score; do echo $(cat $p) $(cat ${p%/*}/cmdline | tr '\0' ' '); done | sort -rn | head
3. Check cgroup memory usage: cat /sys/fs/cgroup/memory/.../memory.usage_in_bytes
4. Check cgroup memory limit: cat /sys/fs/cgroup/memory/.../memory.limit_in_bytes
5. Check memory pressure: cat /proc/pressure/memory
6. Check overcommit setting: cat /proc/sys/vm/overcommit_memory
Key Takeaways
- ✓The OOM killer is the absolute last resort -- before it fires, the kernel has already tried reclaiming page cache, writing back dirty pages, swapping anonymous pages, and compacting memory; if you are seeing OOM kills, the system was drowning for a while before that
- ✓oom_score_adj is how you rig the game: -1000 makes a process immortal (but if everything is immortal, the kernel panics), +1000 volunteers it as tribute; Kubernetes uses this to protect Guaranteed pods and sacrifice BestEffort ones
- ✓Overcommit mode 0 is the kernel making a bet that not everyone will cash their checks at once -- malloc succeeds now, but if everyone touches their pages later, someone gets killed; this is why "malloc succeeded but the process died later" confuses so many developers
- ✓PSI metrics are your early warning system -- user-space daemons like systemd-oomd watch these numbers and kill processes BEFORE the kernel OOM killer fires, giving you cleaner shutdowns and actual log messages instead of a bare SIGKILL
- ✓The OOM reaper is the kernel's backup plan -- if the victim is stuck in D-state and cannot exit, the reaper strips its anonymous memory anyway, because a dead process that cannot release its pages is worse than useless
Common Pitfalls
- ✗Making everything immune with oom_score_adj=-1000 -- if the kernel cannot find anyone to kill, it panics or hangs; at least one non-essential process must be killable, always
- ✗Seeing "Killed" in logs and assuming it is an application bug -- search dmesg for "oom-kill" or "Out of memory" first; OOM kills look identical to crashes unless you check the kernel log
- ✗Using strict overcommit (mode 2) without enough swap -- the commit limit is swap + (ratio * RAM); with no swap and the default 50% ratio, only half your RAM is allocatable, causing ENOMEM with gigabytes still free
- ✗Ignoring PSI until it is too late -- by the time the OOM killer fires, the system has been thrashing for seconds or minutes; monitoring /proc/pressure/memory lets you act before things get that bad
Reference
In One Line
In production, cgroup limits -- not host exhaustion -- trigger most OOM kills; tune oom_score_adj and monitor PSI before the kernel has to choose a victim.