Memory Cgroups & Resource Limits
Mental Model
Shared office building, each team with its own electricity meter. The meter counts everything: desktop computers, overhead lights, the shared printer when the team uses it, HVAC for the team's section. A team thinks it is at 60% because it only watches the computers. The meter says 95%. When the kilowatt limit is hit, that team's breaker trips -- even if the building has power to spare. A smart mode exists too: approaching the cap, the building dims non-essential outlets before the breaker ever fires.
The Problem
Twenty containers on a 128 GB server. One leaks memory over 48 hours, silently draining host RAM until the OOM killer fires and takes out something unrelated. A Kubernetes pod shows 60% heap usage yet keeps getting OOMKilled -- the missing 40% is page cache from log writes, 16 KB kernel stacks per thread, and socket buffers from 10,000 TCP connections. A JVM in a 4 GB container reads /proc/meminfo, sees 128 GB, auto-sizes its heap to 32 GB, and dies on the first major GC.
Architecture
A Kubernetes pod keeps getting OOMKilled. The application logs show it is using 60% of its memory limit. Everything looks fine. But the pod keeps dying.
Here is the missing piece: the memory limit does not just count the heap. It counts page cache from every file read. Kernel memory from every TCP connection. Slab caches, page tables, socket buffers. Things the application never explicitly allocated and cannot directly control.
This is the world of memory cgroups, and it is how containers actually work.
What Actually Happens
When a page is allocated inside a cgroup -- via page fault, file read, or kernel allocation -- the kernel charges it to that cgroup's mem_cgroup accounting structure.
If usage stays below memory.high: everything proceeds normally.
If usage exceeds memory.high: the kernel invokes direct reclaim synchronously. The allocating thread is put to sleep while the kernel tries to free memory within the cgroup -- evicting clean page cache, writing back dirty pages, swapping anonymous pages. This creates back-pressure. The process slows down but keeps running.
If usage hits memory.max: the kernel first attempts reclaim within the cgroup. If that fails, the cgroup OOM killer selects and kills a process within the cgroup. This is completely separate from the global OOM killer. The host can have 100 GB free -- if this cgroup is over its limit, something inside it dies.
The hierarchy matters. A child cgroup cannot exceed its parent's limit. If a parent is set to 8 GB and contains two children each set to 6 GB, the children compete for the parent's 8 GB. When total usage approaches 8 GB, the kernel reclaims from both children proportionally. memory.low on a child protects it from being reclaimed in favor of its siblings.
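To make the hierarchy concrete, here is a minimal sketch using raw cgroup v2 writes; the names are illustrative, and it assumes the unified hierarchy at /sys/fs/cgroup, root privileges, and the memory controller enabled at the root.

```bash
# Parent with an 8G ceiling; delegate the memory controller to its children
mkdir /sys/fs/cgroup/parent
echo "+memory" > /sys/fs/cgroup/parent/cgroup.subtree_control
echo 8G > /sys/fs/cgroup/parent/memory.max

# Two children that nominally get 6G each but compete for the parent's 8G
mkdir /sys/fs/cgroup/parent/child-a /sys/fs/cgroup/parent/child-b
echo 6G > /sys/fs/cgroup/parent/child-a/memory.max
echo 6G > /sys/fs/cgroup/parent/child-b/memory.max

# Protect child-a's working set when the siblings fight over the parent's 8G
echo 2G > /sys/fs/cgroup/parent/child-a/memory.low

# Place the current shell (and its children) into child-a
echo $$ > /sys/fs/cgroup/parent/child-a/cgroup.procs
```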
Under the Hood
Page cache is where most teams get burned. memory.current includes RSS (anonymous pages) plus page cache (file-backed pages) plus kernel memory (slab, page tables, socket buffers). A process that writes temp files or reads large log files generates page cache that counts against the cgroup limit. The application's RSS looks fine. The cgroup says it is at 95%.
This is the number one source of unexpected OOM kills in Kubernetes. Teams size limits for heap usage and forget about file I/O. Solutions: use O_DIRECT for large file reads, size the limit to include expected page cache usage, or accept that the kernel will evict page cache pages before OOM-killing (it will try).
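The breakdown is easy to inspect. A sketch, with a hypothetical container cgroup path; the memory.reclaim knob requires kernel 5.19 or newer:

```bash
CG=/sys/fs/cgroup/system.slice/docker-abc123.scope  # hypothetical path
cat "$CG/memory.current"   # total charge: anon + file + kernel
grep -E '^(anon|file|slab|sock|kernel_stack) ' "$CG/memory.stat"

# Ask the kernel to reclaim up to 1G from this cgroup; clean page cache goes first
echo 1G > "$CG/memory.reclaim"
```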
Kernel memory accounting is always-on in cgroup v2. In cgroup v1, opting out was possible. Not anymore. Slab caches, page tables, socket buffers, kernel stacks (16 KB each) -- all of it counts. A process handling 10,000 concurrent TCP connections consumes significant kernel memory for socket buffers and connection tracking. This kernel overhead can push a cgroup over its limit even when user-space heap usage is modest.
The recommended pattern for production:
Set memory.high at the expected working set size. This is the soft limit -- it throttles the process and creates back-pressure without killing anything. Set memory.max 10-20% above memory.high. This is the hard limit -- the emergency stop.
Set memory.low on critical services to protect their working set from being reclaimed by noisy neighbors. Without this, a batch job's page cache growth can cause the kernel to reclaim pages from a latency-sensitive database.
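With systemd as the front end, the whole pattern might look like this sketch (unit names hypothetical):

```bash
# Critical service: protect 2G of working set, throttle at 4G, kill at ~20% above
systemctl set-property api.service MemoryLow=2G MemoryHigh=4G MemoryMax=4800M

# Noisy batch job: throttle early so its page cache cannot squeeze the neighbors
systemctl set-property batch.service MemoryHigh=8G MemoryMax=9600M
```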
Swap and cgroups have a complicated relationship. memory.swap.max controls how much swap each cgroup can use. Setting it to 0 disables swap for that cgroup -- useful for latency-sensitive workloads where swapping would be worse than dying. Kubernetes historically set this to 0 (node swap was unsupported until the NodeSwap feature arrived, alpha in v1.22 and beta in v1.28).
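Disabling swap per cgroup is one write (or one systemd property); names below are hypothetical:

```bash
echo 0 > /sys/fs/cgroup/latency.slice/memory.swap.max   # raw cgroup v2 knob
systemctl set-property cache.service MemorySwapMax=0    # systemd equivalent
```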
Common Questions
A pod keeps getting OOMKilled at 60% of its limit. What is happening?
The application logs show heap usage at 60%, but memory.current includes everything. Check memory.stat for the full breakdown. Common culprits: page cache from log file writes or temp file I/O, kernel memory from many threads (16 KB kernel stacks each) or socket buffers from many connections, and memory-mapped files. The fix is to size the limit for total memory, not just heap.
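A sketch of the check from outside, assuming cgroup v2 inside the container and a hypothetical pod name:

```bash
# Total cgroup charge vs the limit the pod actually runs under
kubectl exec mypod -- cat /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.max

# Break the charge down: anon is roughly heap, file is page cache
kubectl exec mypod -- grep -E '^(anon|file|kernel_stack|sock) ' /sys/fs/cgroup/memory.stat
```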
What is the difference between ulimit and cgroup limits?
ulimit -v (RLIMIT_AS) limits virtual address space per process. It causes mmap/brk to return ENOMEM, but it measures virtual reservations, not physical pages. Cgroup memory.max limits physical memory consumption for a group of processes, accounting for actual page usage and triggering an OOM kill. In containers, cgroup limits are the effective constraint; ulimit is per-process within the container.
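A sketch of the contrast in failure modes; ./allocator stands in for any hypothetical program that allocates without bound:

```bash
# Per-process virtual cap: allocations fail politely with ENOMEM past 1 GiB
ulimit -v 1048576
./allocator            # malloc() starts returning NULL; the process can react

# Same program under a cgroup cap: no ENOMEM, just SIGKILL at the limit
systemd-run --scope -p MemoryMax=1G ./allocator
```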
How does memory.high differ from memory.max?
memory.high slows things down. It invokes direct reclaim synchronously -- the allocating thread sleeps until memory is freed. The process gets sluggish but stays alive. memory.max kills: the kernel makes one last reclaim attempt, and if that fails, SIGKILL -- no throttling, no second chances. memory.high gives the application a chance to respond to pressure. memory.max is the emergency stop.
How do you protect a critical service from noisy neighbors?
Set memory.low on the critical service's cgroup to its expected working set. The kernel will prefer to reclaim from other cgroups before touching memory below that threshold. For guaranteed protection, use memory.min (hard floor). And set memory.high on the noisy neighbor to throttle it before it consumes all available memory.
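The raw cgroup v2 equivalents of that advice, with hypothetical paths:

```bash
echo 8G  > /sys/fs/cgroup/critical.slice/memory.low   # reclaim siblings first
echo 4G  > /sys/fs/cgroup/critical.slice/memory.min   # hard floor, never reclaimed
echo 16G > /sys/fs/cgroup/batch.slice/memory.high     # throttle the neighbor early
```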
How Technologies Use This
A memory leak in a logging sidecar silently consumes all 128 GB of host RAM over 48 hours. Every other container on the machine starves for pages, and the entire host goes down, taking 20 services with it.
Docker --memory maps directly to cgroup memory.max, creating an invisible ceiling that the kernel enforces at the page allocator level. When a container hits its 4 GB limit, only processes inside that cgroup are OOM-killed, and the other 19 containers never notice. Without this boundary, a single leaking container's page allocations drain the global free pool until the host-level OOM killer fires indiscriminately.
Set --memory on every container to confine the blast radius of leaks. Docker has no flag that sets memory.high: --memory-reservation is only a soft reclaim target (mapped to memory.low under cgroup v2). For true throttling before the hard limit -- direct reclaim slowing the leaking container down instead of killing it outright -- set MemoryHigh= on the container's cgroup via systemd.
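A sketch with an illustrative image name, combining the hard ceiling with a systemd-applied throttle point; the scope name and cgroup path vary with the cgroup driver:

```bash
docker run -d --name sidecar --memory=4g example/log-sidecar:latest

# Throttle point below the hard cap, applied to the container's scope
CID=$(docker inspect -f '{{.Id}}' sidecar)
systemctl set-property docker-$CID.scope MemoryHigh=3500M

# Verify what the runtime actually wrote
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.max
```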
A 64 GB node running 40 pods starts thrashing after a single batch job reads large files and fills the page cache. Every other pod's working set gets pushed to swap, and latency-sensitive services see response times spike 10x even though their own heap usage is well below limits.
Kubernetes maps resources.limits.memory to cgroup memory.max, and with the MemoryQoS feature gate enabled the kubelet also sets memory.min to shield Guaranteed pods' resident pages from reclaim. The accounting is strict -- page cache from log writes, kernel memory from 5,000 TCP connections, and slab caches for inotify watches all count against the limit. Teams that size limits for heap alone see OOMKilled restarts at 60% reported usage because the other 40% is page cache and kernel overhead they never measured.
Size memory limits for total memory consumption including page cache and kernel overhead, not just heap. Set memory.min on critical pods to guarantee their working set survives noisy-neighbor pressure, and always account for the invisible 40% of non-heap memory when setting limits.
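As a sketch, a pod spec sized the way this section recommends, for a service whose heap alone is about 4 Gi (names and numbers illustrative):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: api                  # hypothetical
spec:
  containers:
  - name: app
    image: example/app:latest
    resources:
      requests:
        memory: "5Gi"        # scheduling, and memory.min with MemoryQoS enabled
      limits:
        memory: "6Gi"        # heap + page cache + kernel overhead, not heap alone
EOF
```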
A JVM starts inside a 4 GB container on a 128 GB host and immediately gets OOM-killed on the first major GC. The team sees the heap is configured well below the container limit and cannot explain why the pod keeps restarting.
Before JDK 10, the JVM sized its heap from host RAM: seeing 128 GB, it defaulted the maximum heap to a quarter of that -- 32 GB, eight times the container's 4 GB cgroup limit. The first major GC would touch all heap pages, spike RSS past memory.max, and trigger an instant OOM kill. Since JDK 10 (UseContainerSupport, enabled by default), the JVM reads the cgroup memory limit and sizes accordingly, but the default MaxRAMPercentage of 25% wastes most of the available budget.
Set -XX:MaxRAMPercentage to 70% to cap the heap at 2.8 GB in a 4 GB container, leaving 1.2 GB for metaspace, thread stacks at 1 MB each, JIT code cache, and the page cache that file-heavy workloads silently consume. Always verify the JVM is container-aware by checking the JDK version is 10 or later.
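A sketch of the flags -- all real JVM options, with illustrative values for a 4 GB limit:

```bash
java -XX:MaxRAMPercentage=70.0 -XX:MaxMetaspaceSize=256m -jar app.jar

# Confirm what the JVM detected from the cgroup (prints container metrics on modern JDKs)
java -XX:MaxRAMPercentage=70.0 -XshowSettings:system -version
```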
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Hard memory limit | --memory (sets memory.max) | -Xmx (heap only, not total) | --max-old-space-size (V8 heap) | GOMEMLIMIT (soft, Go 1.19+) | resources.limits.memory (sets memory.max) |
| Soft throttle | --memory-reservation (soft reclaim target; memory.low on cgroup v2) | N/A | N/A | N/A | memory.high set only by the alpha MemoryQoS feature |
| Reclaim protection | N/A | N/A | N/A | N/A | resources.requests.memory (affects scheduling; sets memory.min with MemoryQoS) |
| Container awareness | N/A (host tool) | UseContainerSupport (JDK 10+) | Reads cgroup limits since Node 12 | Not automatic; set GOMEMLIMIT explicitly (Go 1.19+) | kubelet reads cgroup for eviction decisions |
| OOM behavior | Container killed, restart policy applies | JVM killed, pod restarted | Process killed | Process killed | Pod restarted with OOMKilled status |
Stack Layer Mapping
| Layer | Memory Control Mechanism |
|---|---|
| Hardware | Physical RAM, NUMA nodes |
| Kernel page allocator | Charges pages to mem_cgroup on allocation |
| Cgroup controller | memory.max / memory.high / memory.low enforcement |
| Container runtime | Maps --memory flag to cgroup memory.max |
| Orchestrator | K8s maps resources.limits.memory, kubelet monitors memory.events |
| Application | Sees cgroup limit via /sys/fs/cgroup or container-aware runtime |
Design Rationale
ulimit caps per-process virtual address space and knows nothing about shared page cache or kernel memory -- completely inadequate for container isolation. cgroup v2 split enforcement into memory.high (throttle) and memory.max (kill) because v1's all-or-nothing OOM kill gave applications zero chance to shed load under pressure. memory.low and memory.min round out the model: without workload protection, a batch job filling page cache silently evicts a latency-sensitive service's working set, and nobody sees it until p99 latency spikes.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Pod OOMKilled but heap usage shows 60% | Page cache + kernel memory consuming the remaining 40% | cat memory.stat and check file, slab, sock fields |
| JVM dies on first major GC in container | JVM not container-aware, sized heap to host RAM | Verify JDK 10+ and check -XX:MaxRAMPercentage |
| Container slow but not killed | memory.high throttling via direct reclaim | cat memory.events and check high counter |
| All containers on host degraded | One container consuming all memory, no limits set | systemd-cgtop -m to find the consumer |
| /proc/meminfo shows wrong values in container | /proc/meminfo shows host memory, not cgroup | Use memory.current / memory.stat instead, or deploy lxcfs |
| Critical service reclaimed during batch job spike | No memory.low protection on critical service | Set memory.low to expected working set size |
When to Use / Avoid
Use when:
- Running multiple containers or services on shared hosts that need memory isolation
- Protecting critical services from noisy-neighbor memory pressure (memory.low/memory.min)
- Debugging unexpected OOMKilled events in Kubernetes pods
- Setting JVM heap sizes relative to container limits (MaxRAMPercentage)
- Creating back-pressure via memory.high to throttle leaking services before hard kill
Avoid when:
- Running a single application on a dedicated host (there is nothing to isolate it from)
- Memory limits would mask underlying leaks that need fixing (use for containment, not a cure)
- The workload is entirely CPU-bound with negligible memory variation
Try It Yourself
```bash
# Check cgroup v2 memory usage and limits
cat /sys/fs/cgroup/system.slice/docker-*/memory.current
cat /sys/fs/cgroup/system.slice/docker-*/memory.max

# View detailed memory breakdown
cat /sys/fs/cgroup/system.slice/docker-*/memory.stat

# Check OOM events for a cgroup
cat /sys/fs/cgroup/system.slice/docker-*/memory.events

# Set memory.high (soft limit) for a systemd service
systemctl set-property myservice.service MemoryHigh=2G

# Monitor cgroup memory usage in real time
systemd-cgtop -m

# Check process resource limits (ulimit values)
cat /proc/$(pidof nginx)/limits
```
Debug Checklist
1. cat /sys/fs/cgroup/<path>/memory.current -- current total memory usage
2. cat /sys/fs/cgroup/<path>/memory.stat -- breakdown of anon, file, kernel, slab, sock
3. cat /sys/fs/cgroup/<path>/memory.events -- OOM kill and reclaim event counts
4. cat /sys/fs/cgroup/<path>/memory.max -- hard limit in bytes or 'max'
5. systemd-cgtop -m -- real-time per-cgroup memory usage
6. cat /proc/<pid>/cgroup -- find which cgroup a process belongs to
Key Takeaways
- ✓ Page cache is the silent cgroup killer -- reading a temp file or writing logs generates page cache that counts against your memory.max, even though RSS looks low; this is the #1 source of unexpected container OOM kills
- ✓ memory.high is the seatbelt, memory.max is the brick wall -- memory.high throttles your process (sleeps it during allocation), giving it time to recover; memory.max just kills it; set memory.max 10-20% above memory.high as a safety net
- ✓ The cgroup OOM killer is completely separate from the global one -- it only kills processes inside the over-limit cgroup, and the events may only appear in memory.events counters, not dmesg; many teams miss these entirely
- ✓ In Kubernetes, container memory limits map directly to cgroup memory.max -- a pod generating page cache via file I/O will be OOM-killed even if its heap is well within limits; you must size for total memory, not just heap
- ✓ Kernel memory accounting is always-on in cgroup v2 -- slab caches, page tables, socket buffers, and kernel stacks all count against your limit; a process with 10,000 TCP connections can OOM from kernel memory alone
Common Pitfalls
- ✗ Setting container limits equal to heap size -- a Java app with -Xmx4g needs at least a 5-6 GB container limit to cover page cache, kernel memory, native allocations, and thread stacks; setting it to 4g guarantees OOM
- ✗ Only monitoring dmesg for OOM events -- cgroup OOM kills may only appear in memory.events counters (oom, oom_kill fields); if you are not reading those, you are flying blind in containers
- ✗ Trusting /proc/meminfo inside a container -- without overrides, it shows HOST memory, not container memory; use memory.current and memory.stat for accurate cgroup-level data; tools like lxcfs expose container-aware /proc/meminfo
- ✗ Forgetting memory.low for critical services -- without it, a batch job's page cache growth can cause the kernel to reclaim your latency-sensitive service's working set; memory.low marks it as "reclaim from others first"
Reference
In One Line
Set memory limits for total consumption -- heap plus page cache plus kernel overhead -- or the invisible 40% will OOM-kill pods that look fine.