cgroups v2 (Control Groups)
Mental Model
An apartment building with utility meters and circuit breakers for each unit. Electricity, water, gas -- every tenant uses the same infrastructure, but each unit has its own meter and its own breaker. One tenant cranks every appliance to max? Their breaker trips. The rest of the building does not flicker.
The Problem
One process on a 64 GB shared host gobbles 58 GB. Everything else starts getting OOM-killed. Meanwhile, a different container with no CPU limit pins all 8 cores during a batch job, spiking API latency for co-located services from 5ms to 800ms. No per-group resource boundaries means one runaway workload takes down the entire machine.
Architecture
One runaway container eats all the RAM. Every other service on the machine grinds to a halt. Customers start seeing errors.
This is what happens without resource limits. Docker and Kubernetes do prevent it -- but only by writing a number to a file. The kernel does the actual enforcement.
That file lives in a cgroup.
What Actually Happens
Cgroups v2 presents a single hierarchy mounted at /sys/fs/cgroup. Each node is a directory containing pseudo-files that control resource allocation.
Here is the basic flow:
1. Enable controllers for a cgroup's children: echo '+memory +cpu' > cgroup.subtree_control
2. Create a new cgroup: mkdir /sys/fs/cgroup/myapp
3. Set a limit (512 MB): echo 536870912 > /sys/fs/cgroup/myapp/memory.max
4. Move a process in: echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
That is it. The process is now memory-limited.
The memory controller tracks every page the cgroup allocates -- anonymous pages (heap, stack), page cache, and kernel memory (slab, page tables). When usage crosses memory.high, the kernel throttles the cgroup's allocations and reclaims pages aggressively: the workload slows down instead of dying. If usage hits memory.max, the cgroup's OOM killer activates and kills a process within the cgroup -- not system-wide.
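A minimal sketch of those knobs, assuming root, a v2 hierarchy at /sys/fs/cgroup, and the memory controller enabled for children of the root; the cgroup name demo is made up:

```bash
# Assumes root and a v2 hierarchy; 'demo' is an example name.
echo '+memory' > /sys/fs/cgroup/cgroup.subtree_control   # usually already enabled
mkdir /sys/fs/cgroup/demo

# Soft threshold: throttle and reclaim above ~450 MB
echo $((450 * 1024 * 1024)) > /sys/fs/cgroup/demo/memory.high
# Hard ceiling: OOM-kill inside the cgroup at 512 MB
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/demo/memory.max

# After running a workload in the cgroup, inspect what happened:
cat /sys/fs/cgroup/demo/memory.current                    # live usage in bytes
grep -E 'anon|file|slab' /sys/fs/cgroup/demo/memory.stat  # where the bytes went
cat /sys/fs/cgroup/demo/memory.events                     # 'high', 'max', 'oom_kill' counters
```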
The cpu controller uses CFS bandwidth control. cpu.max takes a quota/period pair in microseconds. 50000 100000 means "50ms of CPU time every 100ms" -- 50% of one core. cpu.weight (1-10000, default 100) determines proportional share when multiple cgroups compete.
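Continuing the hypothetical demo cgroup, with the cpu controller enabled:

```bash
# Hard cap: 50 ms of CPU time per 100 ms period = 50% of one core
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max

# Proportional share under contention: twice the default weight of 100
echo 200 > /sys/fs/cgroup/demo/cpu.weight

# cpu.stat shows whether the quota is actually biting
grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/demo/cpu.stat
```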
systemd maps its hierarchy directly onto the cgroup tree. nginx.service becomes /sys/fs/cgroup/system.slice/nginx.service/. MemoryMax=1G in a unit file writes 1073741824 to memory.max.
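A quick way to see that mapping, assuming an nginx.service unit on a systemd host using the unified hierarchy:

```bash
# Set limits at runtime; systemd writes the values into the service's cgroup
sudo systemctl set-property nginx.service MemoryMax=1G CPUQuota=50%

# Where the unit lives in the tree, and what the kernel actually sees
systemctl show nginx.service -p ControlGroup
cat /sys/fs/cgroup/system.slice/nginx.service/memory.max   # 1073741824
cat /sys/fs/cgroup/system.slice/nginx.service/cpu.max      # 50000 100000
```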
Under the Hood
The no-internal-processes rule. In v2, a cgroup with children cannot contain processes (except the root). The v1 ambiguity -- where a parent could have both processes and children with different limits -- is gone. Processes must go in leaf nodes.
Delegation for unprivileged users. Non-root users can manage a cgroup subtree if they own the directory and key files (cgroup.procs, cgroup.subtree_control, cgroup.threads). systemd provides this via Delegate=yes, which is how rootless Podman and user-level services manage their own cgroup trees.
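What delegation amounts to, done by hand -- on a real systemd host you would request it via Delegate=yes rather than carving up the root yourself; alice and delegated are example names:

```bash
# Manual delegation sketch -- systemd's Delegate=yes does the equivalent.
sudo mkdir /sys/fs/cgroup/delegated
sudo chown alice /sys/fs/cgroup/delegated \
                 /sys/fs/cgroup/delegated/cgroup.procs \
                 /sys/fs/cgroup/delegated/cgroup.subtree_control \
                 /sys/fs/cgroup/delegated/cgroup.threads

# alice can now carve up her subtree without root
sudo -u alice mkdir /sys/fs/cgroup/delegated/worker
```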
Memory pressure notifications. The memory.pressure file provides PSI (Pressure Stall Information) metrics -- avg10, avg60, avg300 -- showing what percentage of time tasks are stalled waiting for memory. Proactive scaling can kick in before the OOM killer does.
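The file has the same shape for memory, cpu, and io pressure; the numbers below are illustrative:

```bash
cat /sys/fs/cgroup/system.slice/memory.pressure
# some avg10=0.00 avg60=0.12 avg300=0.05 total=4721056
# full avg10=0.00 avg60=0.03 avg300=0.01 total=1130123
#
# 'some' = share of time at least one task was stalled on memory;
# 'full' = share of time all tasks were stalled.
```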
IO writeback attribution. In v2 with cgroup-aware writeback, dirty pages are charged to the cgroup that dirtied them, and writeback I/O is throttled by that cgroup's io.max limits. Without this, buffered writes bypass I/O limits because the actual disk writes happen asynchronously in a kernel thread.
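A hedged sketch against the earlier demo cgroup, assuming the io controller is enabled for it; 8:0 is a typical major:minor for the first SCSI disk, so check your own device first:

```bash
lsblk -o NAME,MAJ:MIN   # find the device's major:minor

# Cap writes at 10 MiB/s and 1000 write IOPS on that device
echo "8:0 wbps=10485760 wiops=1000" > /sys/fs/cgroup/demo/io.max

# io.stat reports per-device bytes and ops actually charged to the cgroup
cat /sys/fs/cgroup/demo/io.stat
```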
Common Questions
How does the OOM killer decide which process to kill?
It scores processes by oom_score_adj (settable via /proc/PID/oom_score_adj, range -1000 to 1000) combined with memory usage. Higher scores mean more likely to die. Kubernetes sets this based on QoS class: Guaranteed=-997, BestEffort=1000, Burstable=2-999.
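To inspect or bias the score for one process ($PID is a placeholder):

```bash
cat /proc/$PID/oom_score       # the computed badness the OOM killer compares
cat /proc/$PID/oom_score_adj   # the bias, -1000 (never kill) to 1000 (kill first)

# Make this process much less likely to be chosen; lowering it requires privilege
echo -500 | sudo tee /proc/$PID/oom_score_adj
```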
What changed from v1 to v2?
In v1, each controller had its own independent mount point -- a process could be in different cgroups for cpu and memory. v2 uses a single unified hierarchy. It also adds the no-internal-processes rule, proper delegation, PSI pressure metrics, and thread-granularity mode. The cpuacct controller was merged into cpu, and net_cls/net_prio were replaced by eBPF.
How does Kubernetes use cgroups?
Three levels: (1) kubelet creates kubepods.slice with system-reserved resources excluded, (2) QoS classes (Guaranteed, Burstable, BestEffort) are sub-cgroups with different cpu.weight values, (3) each container gets a leaf cgroup. Resource requests set cpu.weight; limits set cpu.max and memory.max.
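On a node, that hierarchy is visible directly; these paths assume the systemd cgroup driver and vary with kubelet version and configuration:

```bash
ls /sys/fs/cgroup/kubepods.slice/                            # QoS-class sub-slices
ls /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/   # per-pod slices

# For each pod slice, requests and limits surface as plain cgroup values
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/*/cpu.weight
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/*/memory.max
```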
Can cgroups limit network bandwidth?
Not directly in v2. The v1 net_cls and net_prio controllers were removed because network scheduling works better with TC and eBPF. The modern approach: attach eBPF programs to the cgroup (via BPF_CGROUP_INET_INGRESS/EGRESS) or mark packets with a classid and apply TC filters. Cilium uses eBPF-based bandwidth management.
How Technologies Use This
A memory-leaking Node.js container on a shared host consumes all 64GB of RAM. The Linux OOM killer starts terminating random processes across the entire host, including the production database. Thirty other containers are collateral damage from one misbehaving service.
The root cause is that without cgroup limits, the kernel has no per-container memory boundary. Every container shares the same pool of physical memory, and when it runs out, the OOM killer cannot distinguish between critical and expendable processes. It kills based on a score that has nothing to do with container boundaries.
Docker prevents this by writing hard limits to cgroup files: --memory=512m sets memory.max to 536870912 bytes, --cpus=1.5 writes 150000 100000 to cpu.max, and --pids-limit=256 caps fork bombs via pids.max. The kernel enforces these boundaries, not Docker. When a container hits memory.max, only its own processes are OOM-killed, keeping the other 30 containers on the host completely unaffected.
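Roughly what that looks like end to end -- the flags are standard docker run options, while the image, container name, and cgroup path are examples (the path depends on the cgroup driver in use):

```bash
docker run -d --name bounded \
  --memory=512m --cpus=1.5 --pids-limit=256 \
  nginx:alpine

docker stats --no-stream bounded   # live usage, read from the cgroup

# The same numbers straight from the kernel (systemd cgroup driver layout)
CID=$(docker inspect -f '{{.Id}}' bounded)
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.max 2>/dev/null
```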
A CPU-hungry ML training pod on a 40-pod node starves a latency-sensitive API server running beside it. The API server's p99 latency jumps from 5ms to 500ms, and there is no way to guarantee it gets the CPU time it needs regardless of what other pods are doing.
The mechanism behind the fix is the cgroup cpu.weight and cpu.max system. Kubernetes maps resource requests to cgroup cpu.weight for proportional sharing and limits to cpu.max and memory.max for hard caps. The kubelet organizes pods into a three-tier cgroup hierarchy by QoS class: Guaranteed pods get oom_score_adj=-997, BestEffort gets 1000. When node memory runs low, the kubelet evicts pods in QoS order -- BestEffort first, then Burstable pods exceeding their requests -- before the kernel OOM killer has to step in.
Setting resource requests and limits in the pod spec ensures the kernel enforces proportional CPU sharing and hard memory caps. BestEffort pods are evicted first under pressure, keeping Guaranteed workloads stable even under 95% memory utilization. The lesson is that Kubernetes resource guarantees are only as strong as the cgroup configuration they translate into.
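A minimal sketch of the pod spec that drives this translation; the pod and container names are made up:

```bash
# Requests become cpu.weight, limits become cpu.max and memory.max on the node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: bounded-api
spec:
  containers:
  - name: api
    image: nginx:alpine
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
EOF

# Requests below limits on both resources => Burstable QoS class
kubectl get pod bounded-api -o jsonpath='{.status.qosClass}'
```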
One of 40 services on a box is leaking memory, and the entire system is slowly degrading. Running top shows individual processes, but there is no way to see which service is responsible because a service can spawn dozens of child processes under different names.
The insight is that systemd places every service in its own cgroup automatically, creating per-service resource accounting that top cannot provide. systemd-cgtop shows real-time CPU, memory, and I/O usage per service, instantly revealing the leaking culprit. MemoryMax=1G in a unit file writes 1073741824 to memory.max, and CPUQuota=200% sets cpu.max to 200000 100000.
Adding MemoryMax and CPUQuota to the leaking service's unit file caps it before it can affect the rest of the system. When a service hits its memory ceiling, only that service's processes are killed, keeping the remaining 39 services running without impact. The lesson is that cgroup-based accounting through systemd is the only reliable way to attribute resource usage to services rather than individual processes.
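A sketch of that workflow, with leaky.service standing in for the offending unit:

```bash
systemd-cgtop --order=memory -n 1        # which unit is eating the memory?

sudo systemctl edit leaky.service        # add a drop-in with:
#   [Service]
#   MemoryMax=1G
#   CPUQuota=200%
sudo systemctl restart leaky.service     # restart so the new limits apply
```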
Same Concept Across Tech
| Technology | How it uses cgroups | Key controls |
|---|---|---|
| Docker | Each container runs in its own cgroup. --memory and --cpus flags set limits | docker stats shows live cgroup metrics |
| Kubernetes | Pod resource requests = proportional weight (soft), limits = cgroup max (hard) | resources.requests maps to cpu.weight (cpu.shares in v1), limits map to cpu.max and memory.max |
| systemd | Every service runs in a cgroup slice. MemoryMax, CPUQuota in unit files | systemctl show service --property=MemoryMax |
| JVM | Since JDK 10, JVM reads cgroup limits to auto-configure heap and GC threads | -XX:+UseContainerSupport (default on) |
| Node.js | No automatic cgroup awareness. Must manually set --max-old-space-size to match limit | Common source of container OOM kills |
Stack layer mapping (container resource issue):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app using more resources than expected? | Application metrics, profiler |
| Runtime | Is the runtime aware of cgroup limits? (JVM yes, Node no) | Check runtime config |
| Cgroup | What is the actual usage vs limit? Is throttling happening? | cgroup files: memory.current, cpu.stat |
| Kernel | Is the cgroup controller enabled? v1 or v2? | mount |
| Host | How much total resource is available on the node? | free -h, nproc, lscpu |
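Two quick checks for the kernel row:

```bash
stat -fc %T /sys/fs/cgroup   # 'cgroup2fs' = unified v2; 'tmpfs' = v1 or hybrid
mount | grep cgroup          # lists every cgroup mount and its type
```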
Design Rationale
Per-process limits via setrlimit could not express group-level constraints -- a service spawning 200 workers could collectively exhaust a host while each individual process stayed within bounds. A hierarchical, filesystem-based model was the answer. The unified v2 hierarchy replaced v1's independent per-controller mount points because putting a process in one cpu cgroup and a different memory cgroup made accounting incoherent and delegation impossible to reason about. The "no internal processes" rule closed a further gap: in v1, a parent cgroup could compete with its own children for resources, producing scheduling behavior that container runtimes simply could not predict.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Container OOM-killed but host has free memory | Cgroup memory limit reached, not host limit | cat memory.current vs memory.max |
| API latency spikes in multi-tenant env | CPU throttling from CFS quota, or memory pressure | cat cpu.stat, check nr_throttled |
| JVM heap set to 4G, container limit 4G, OOM-killed | JVM native memory + metaspace + thread stacks exceed cgroup limit | Set -Xmx to ~75% of container limit |
| Node.js container OOM despite small heap | V8 does not read cgroup limits, allocates based on host memory | Set --max-old-space-size explicitly |
| Container using 100% CPU but host is not busy | No CPU limit set, container has access to all cores | Set cpu.max or docker --cpus |
| I/O-intensive container slowing down host | No I/O limit set | Set io.max in cgroup v2 |
When to Use / Avoid
Use cgroups when:
- Running multiple workloads on the same machine (containers, VMs, services)
- Enforcing CPU and memory limits on containers or pods
- Preventing noisy-neighbor problems in multi-tenant environments
- Attributing resource usage to individual services for billing or capacity planning
Watch out for:
- Setting CPU limits too low relative to thread count (causes CFS throttling)
- Setting memory limits without accounting for kernel caches and page cache
- Mixing cgroup v1 and v2 on the same system (causes confusion)
- Not monitoring cgroup metrics (throttling and OOM happen silently)
Try It Yourself
# View the current cgroup hierarchy
systemd-cgls --no-pager | head -40

# Check which controllers are available and enabled
cat /sys/fs/cgroup/cgroup.controllers && cat /sys/fs/cgroup/cgroup.subtree_control

# Create a memory-limited cgroup and run a process in it
sudo mkdir -p /sys/fs/cgroup/test && echo '+memory +pids' | sudo tee /sys/fs/cgroup/cgroup.subtree_control && echo $((256*1024*1024)) | sudo tee /sys/fs/cgroup/test/memory.max && echo $$ | sudo tee /sys/fs/cgroup/test/cgroup.procs

# Monitor real-time cgroup resource usage
systemd-cgtop -d 1 --depth=3

# Check a service's cgroup limits
systemctl show nginx.service -p MemoryMax -p CPUQuota -p TasksMax 2>/dev/null || echo 'nginx not installed'

# Read detailed memory statistics for a cgroup
cat /sys/fs/cgroup/system.slice/memory.stat 2>/dev/null | head -15

Debug Checklist
1. Find cgroup of a process: cat /proc/<pid>/cgroup
2. Check memory usage vs limit: cat /sys/fs/cgroup/.../memory.current and memory.max
3. Check CPU throttling: cat /sys/fs/cgroup/.../cpu.stat (look for nr_throttled)
4. List all cgroups: systemd-cgls or find /sys/fs/cgroup -name cgroup.procs
5. Check I/O limits: cat /sys/fs/cgroup/.../io.max
6. Monitor pressure: cat /sys/fs/cgroup/.../memory.pressure
Key Takeaways
- ✓The 'no internal processes' rule eliminates ambiguity: a cgroup with children cannot itself contain processes (except root). In v1, a parent could have both processes and children with different limits. v2 forces you to leaf-node your processes.
- ✓The memory controller tracks everything -- anonymous pages, page cache, kernel memory (slab, page tables). memory.current shows live usage, memory.stat breaks it down. In v2, kernel memory is charged to the main counter by default instead of through v1's separate, opt-in kmem accounting.
- ✓cpu.weight (default 100) replaced v1's CFS shares. For hard limits, cpu.max takes 'quota period' in microseconds -- '50000 100000' means 50% of one CPU. Simple math, direct control.
- ✓Buffered writes bypass io.max unless cgroup-aware writeback is enabled. Only direct I/O is immediately throttled. This trips people up constantly when they set I/O limits and wonder why writes are not being capped.
- ✓systemd maps its unit hierarchy directly to the cgroup tree. system.slice/nginx.service becomes /sys/fs/cgroup/system.slice/nginx.service/. MemoryMax= in a unit file writes to memory.max. No cgroup API needed.
Common Pitfalls
- ✗Mistake: Mixing cgroups v1 and v2 controllers. Reality: A controller can only be used in v1 OR v2, not both. Hybrid mode creates confusion. Modern systems should use unified v2 (systemd defaults to it since v248).
- ✗Mistake: Setting memory.max without memory.high. Reality: The process gets OOM-killed instantly with no warning. Set memory.high to ~90% of memory.max to trigger throttling first, giving the app time to respond.
- ✗Mistake: Expecting io.max to limit buffered writes. Reality: Buffered writes go through the page cache and are attributed at writeback time. Only direct I/O is immediately throttled. Enable cgroup writeback for correct accounting.
- ✗Mistake: Not understanding cgroup delegation. Reality: A non-root user can manage a subtree only if they own the directory AND cgroup.procs, cgroup.subtree_control, and cgroup.threads files. systemd handles this via Delegate=yes.
Reference
In One Line
Most container resource problems trace back to misconfigured cgroup limits, not application bugs -- cgroups are where isolation actually happens.