/proc and /sys Filesystems
Mental Model
A building with a live information kiosk on every floor. No kiosk stores anything. Each has a phone line to the building manager. "How many people on floor 3?" -- the kiosk dials, the manager counts heads right now, and scrawls the answer on a slip. Ask a second later and the number might differ. One wing of kiosks (/proc) hands back messy multi-line reports. The newer wing (/sys) gives exactly one fact per slip -- cleaner, same building.
The Problem
A JVM inside a 4 GB container reads /proc/meminfo, sees the host's 256 GB, sizes a 200 GB heap, and gets OOM-killed in seconds. Monitoring agents parsing /proc/net/tcp on a server with 50,000 connections see inconsistent data because seq_file reads are not atomic across page boundaries. An engineer tunes tcp_max_syn_backlog at runtime but skips the persistent sysctl.d config -- the next reboot silently reverts it, and the next SYN flood inflicts a 40% latency hit. On a Kubernetes node, a single ip_forward=0 in /proc/sys silently breaks pod-to-pod networking for 200+ pods with nothing in the kubelet logs.
Architecture
Run cat /proc/meminfo. That command just read a file that does not exist on any disk.
The kernel generated that text on the fly, right now, from its live data structures. No database query. No API call. Just a read() on a fake file.
This is how Linux exposes its internals. And once the pattern of reading these files directly becomes familiar, every monitoring tool becomes transparent -- because they are all doing exactly this behind the scenes.
What Actually Happens
/proc is backed by struct proc_dir_entry objects with function pointers that generate content on read. Running cat /proc/meminfo triggers the kernel to call meminfo_proc_show(), which walks memory structures and formats the output. Writing to /proc/sys/vm/dirty_ratio calls a sysctl handler that validates the input and modifies the kernel variable immediately -- no restart needed.
/sys was added in Linux 2.6 to model the device hierarchy. Every kernel object (device, driver, bus, network interface) appears as a directory in /sys. Each directory has attribute files -- one value per file -- that expose or modify object properties. /sys/block/sda/queue/scheduler shows and allows changing the I/O scheduler for a disk.
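That one-value-per-file convention makes sysfs trivially scriptable. As a minimal sketch (in Python rather than shell, and assuming the usual format of the scheduler file, where the active entry is wrapped in square brackets):

```python
def active_scheduler(raw: str) -> str:
    """Extract the active entry from a sysfs scheduler file.

    /sys/block/<dev>/queue/scheduler lists every available scheduler
    on one line; the kernel marks the active one with [brackets].
    """
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler in {raw!r}")

# Typical contents on a recent kernel (sample input, not a live read):
print(active_scheduler("none [mq-deadline] kyber bfq"))  # mq-deadline
```

Writing a different scheduler name back to the same file switches schedulers at runtime, following the same read/write symmetry the rest of /sys uses.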
The per-process /proc/PID/ directory is especially powerful. /proc/PID/status provides state, memory usage, capabilities, and signal masks. /proc/PID/maps shows the complete virtual memory layout. /proc/PID/fd/ is a directory of symlinks from fd numbers to their targets. /proc/PID/smaps_rollup gives aggregated memory stats (RSS, PSS, swap) in a single read.
These files are the data source for ps, top, htop, pmap, and every monitoring agent.
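A minimal sketch of what those tools do under the hood: parse /proc/PID/status by field name rather than position, which survives kernel upgrades that insert new fields (Python stand-in for illustration; the live read is guarded so the snippet also runs on non-Linux machines):

```python
from pathlib import Path

def parse_status(text: str) -> dict:
    """Parse /proc/PID/status text into a {field: value} dict.

    Each line is 'Name:\tvalue'; keying by name instead of line
    position is what keeps parsers working across kernel versions.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name] = value.strip()
    return fields

# On Linux, inspect the current process the way ps/top would:
status_path = Path("/proc/self/status")
if status_path.exists():
    status = parse_status(status_path.read_text())
    print(status["Name"], status["State"], status.get("VmRSS", "n/a"))
```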
Under the Hood
seq_file and the consistency trap. Most /proc files use the seq_file interface, generating output iteratively. A single read() returns a page-sized chunk, and the iterator advances on the next read(). This means reading a large /proc file (like /proc/net/tcp with thousands of connections) is NOT atomic. Connections can open or close between reads, producing inconsistent snapshots. For consistent network data, use netlink sockets instead.
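Each individual line of /proc/net/tcp is still easy to decode, even though the file as a whole is not an atomic snapshot. A hedged sketch of the endpoint decoding (assuming a little-endian host, where the kernel emits the IPv4 address as host-order hex and the port as big-endian hex):

```python
import socket
import struct

def decode_endpoint(hex_endpoint: str) -> tuple:
    """Decode one 'ADDR:PORT' field from a /proc/net/tcp line."""
    addr_hex, port_hex = hex_endpoint.split(":")
    # Repack the host-order integer into network byte order for inet_ntoa
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)

# Sample line fragment (fields: sl, local_address, rem_address, st, ...):
sample = "0: 0100007F:1F90 00000000:0000 0A"
local_field = sample.split()[1]
print(decode_endpoint(local_field))  # ('127.0.0.1', 8080)
```

The caveat from the paragraph above still applies: two lines decoded this way may come from different pages of the same read sequence and describe different moments in time.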
Sysctl categories map to real kernel subsystems. /proc/sys/vm/ controls virtual memory (dirty ratios, overcommit, swappiness). /proc/sys/net/ controls networking (TCP tunables, socket buffers, conntrack limits). /proc/sys/kernel/ handles general settings (pid_max, threads-max, panic). Each file is backed by a ctl_table with type checking, range validation, and immediate application. Persistent changes go in /etc/sysctl.conf or /etc/sysctl.d/*.conf.
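The dotted sysctl names map mechanically onto /proc/sys paths, which is all `sysctl(8)` itself does. A small illustration (with one caveat: a few sysctl keys contain literal dots, such as interface names under net.ipv4.conf, so this naive mapping is not universal):

```python
import os

def sysctl_path(name: str) -> str:
    """Map a dotted sysctl name to its /proc/sys file path."""
    return "/proc/sys/" + name.replace(".", "/")

def read_sysctl(name: str) -> str:
    """Read the current value of a sysctl via its /proc/sys file."""
    with open(sysctl_path(name)) as f:
        return f.read().strip()

print(sysctl_path("vm.dirty_ratio"))  # /proc/sys/vm/dirty_ratio
if os.path.exists("/proc/sys/vm/dirty_ratio"):  # Linux only
    print("current dirty_ratio:", read_sysctl("vm.dirty_ratio"))
```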
sysfs was designed to fix procfs. procfs grew organically and became a dumping ground -- inconsistent formatting, multi-value files, no clear hierarchy. sysfs was built with strict rules: one attribute per file, ASCII text, no binary data, consistent directory structure. New kernel subsystems use sysfs (or configfs, debugfs) rather than procfs.
Common Questions
What is the difference between /proc/PID/maps and /proc/PID/smaps?
maps shows each VMA with its address range, permissions, and backing file. smaps adds detailed per-region accounting: Rss, Pss (proportional share of shared pages), Shared/Private_Clean/Dirty, Referenced, Anonymous, Swap. smaps_rollup (Linux 4.14+) gives totals across all VMAs -- faster than parsing the full smaps.
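Because Pss charges each shared page 1/N to each of its N sharers, summing Pss across VMAs (or across processes) gives physically meaningful totals -- which is exactly the aggregation smaps_rollup performs in the kernel. A sketch of the same sum done in userspace:

```python
def total_pss_kb(smaps_text: str) -> int:
    """Sum the per-VMA Pss lines from /proc/PID/smaps output (kB)."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Pss:"):
            total += int(line.split()[1])  # second field is the kB value
    return total

# Two VMAs' worth of sample smaps output:
sample = "Pss:        128 kB\nRss:        256 kB\nPss:         64 kB\n"
print(total_pss_kb(sample))  # 192
```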
Why does /proc/meminfo show so little MemFree but the system seems fine?
MemFree counts only truly unused pages. Most "free" memory is used by page cache (Cached), reclaimable slabs (SReclaimable), and buffers. These are all reclaimable under pressure. MemAvailable (Linux 3.14+) estimates usable memory by accounting for reclaimable cache. A system with 200MB MemFree but 28GB MemAvailable is perfectly healthy -- the cache is doing its job.
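A small parser makes the MemFree-vs-MemAvailable distinction concrete (sample meminfo text, with values chosen to mirror the 200 MB / 28 GB example above):

```python
def meminfo_kb(text: str) -> dict:
    """Parse /proc/meminfo text into {field: kB}. Values are in kB."""
    out = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            out[name] = int(parts[0])
    return out

sample = """MemTotal:       32768000 kB
MemFree:          204800 kB
MemAvailable:   29360128 kB
Cached:         26214400 kB"""
info = meminfo_kb(sample)
# MemFree looks alarming; MemAvailable tells the real story
print(info["MemFree"] // 1024, "MB free,",
      info["MemAvailable"] // 1024, "MB available")  # 200 MB free, 28672 MB available
```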
Do containers get their own /proc?
With PID namespaces, a container gets its own /proc showing only its processes. /proc/1 is the container's init, not the host's. But /proc/meminfo and /proc/cpuinfo still show HOST values -- a known limitation that causes apps to misdetect resources. LXCFS fixes this by overlaying those files with cgroup-aware data.
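A hedged sketch of cgroup-aware detection -- conceptually what LXCFS and -XX:+UseContainerSupport do. It assumes the cgroup v2 unified hierarchy (memory.max directly under /sys/fs/cgroup); cgroup v1 uses memory/memory.limit_in_bytes instead, and "max" means no limit applies:

```python
from pathlib import Path
from typing import Optional

def parse_memory_max(raw: str) -> Optional[int]:
    """cgroup v2 memory.max: 'max' means unlimited, else bytes."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)

def container_memory_limit_bytes() -> Optional[int]:
    """Best-effort container limit, preferring cgroup data over /proc."""
    p = Path("/sys/fs/cgroup/memory.max")  # cgroup v2 path (assumption)
    return parse_memory_max(p.read_text()) if p.exists() else None

limit = container_memory_limit_bytes()
print("cgroup limit:", "none" if limit is None else f"{limit} bytes")
```

A runtime that sizes its heap from this value instead of /proc/meminfo avoids the 4 GB-container / 256 GB-host mismatch entirely.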
How do I find which tunable to change?
Start with sysctl -a | grep <keyword>. For documentation, check Documentation/admin-guide/sysctl/ in the kernel source. For sysfs attributes, udevadm info -a <device_path> walks the parent chain showing all attributes.
How Technologies Use This
A Java application inside a Docker container with a 4GB memory limit sets its heap to 200GB and immediately gets OOM-killed. The JVM auto-sizes its heap based on available memory, but the container keeps reporting 256GB -- the full host memory.
The hidden issue is that Docker's PID-namespaced /proc only isolates the process list, not system-wide files like /proc/meminfo and /proc/cpuinfo. The JVM reads /proc/meminfo, sees the host's 256GB total memory instead of the container's 4GB cgroup limit, and requests a heap that far exceeds its allowance. This mismatch between what /proc reports and what cgroups enforce is the root cause of most container OOM-kill mysteries.
LXCFS solves this by overlaying /proc/meminfo and /proc/cpuinfo with cgroup-aware data, reducing JVM OOM-kill incidents by over 90% in container deployments that rely on automatic memory detection. For modern JVMs, the -XX:+UseContainerSupport flag reads cgroup limits directly instead of trusting /proc.
Pod networking silently breaks across an entire node after a configuration change, with no errors in kubelet logs. Pods can reach external services but cannot communicate with each other, and the kube-proxy logs show nothing useful.
The cause is a misconfigured /proc/sys/net/ipv4/ip_forward sysctl that was reset to 0. kube-proxy writes to this file and to bridge-nf-call-iptables to enable pod-to-pod traffic routing. The kubelet also reads MemAvailable from /proc/meminfo every 10 seconds and compares it against per-pod usage from /sys/fs/cgroup memory.current files to make eviction decisions. All of this happens through plain file reads and writes against virtual filesystems.
The fix is ensuring ip_forward is set persistently via /etc/sysctl.d/ and verifying it after node reboots or kernel upgrades. Understanding that Kubernetes depends entirely on /proc and /sys for node-level operations explains why a single missing sysctl can silently break an entire node.
A runaway service leaks memory and approaches the host's physical limit. Without per-service resource controls, the OOM killer will eventually terminate random processes across the entire system, potentially taking down critical infrastructure.
The mechanism that makes instant enforcement possible is that /proc/sys and /sys/fs/cgroup are virtual files backed by live kernel data structures. Writing 1073741824 to /sys/fs/cgroup/system.slice/myservice.service/memory.max immediately caps that service at 1GB with no restart needed. At boot, sysctl.d configs write to /proc/sys tunables to set kernel parameters like vm.swappiness and net.core.somaxconn.
systemd's MemoryMax=1G in a unit file writes directly to these virtual files, and PrivateTmp creates a per-service mount namespace while ProtectSystem remounts /usr and /boot read-only. These features restrict which kernel state each service can observe, reducing the blast radius of a compromise by over 80% compared to unrestricted services.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Memory detection | /proc/meminfo (host leak) | -XX:+UseContainerSupport reads cgroup | process.memoryUsage() reads /proc/self | runtime.ReadMemStats (runtime-internal accounting) | kubelet reads /proc/meminfo + cgroup memory.current |
| Runtime tuning | sysctl in docker run --sysctl | -XX flags at startup | --max-old-space-size flag | GOGC, GOMEMLIMIT env vars | sysctl pod security context |
| Process inspection | docker top reads /proc/PID | jcmd, jstack read /proc/self | /proc/self/status from fs module | /proc/self/maps via os package | kubectl exec + cat /proc/* |
| FD monitoring | /proc/PID/fd inside container ns | /proc/self/fd + lsof | fs.readdir on /proc/self/fd | os.File count + /proc/self/fd | liveness probes checking fd count |

| Stack Layer | Mechanism |
|---|---|
| Application | Reads /proc/self/* for self-inspection, memory, FDs |
| Runtime / VM | JVM, Node, Go read /proc/meminfo or cgroup files for auto-sizing |
| Container | PID namespace isolates /proc process list but not meminfo/cpuinfo; LXCFS patches this |
| Kernel | proc_dir_entry callbacks generate text from task_struct, mm_struct, net namespace on each read() |
| Hardware | /sys/class/* and /sys/devices/* expose hardware topology, device attributes, power state |
Design rationale: Making kernel state look like files means any language, any tool, any one-liner script can inspect the system through open()/read()/write() with zero special APIs. The cost is text-parsing fragility and non-atomic reads for files that span multiple pages.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Container app sees host memory instead of limit | /proc/meminfo not namespace-aware | cat /proc/meminfo inside container vs cgroup memory.max |
| sysctl change lost after reboot | Runtime-only write to /proc/sys, no persistent config | grep setting in /etc/sysctl.d/*.conf |
| "Too many open files" error | fs.file-max or per-process nofile ulimit too low | cat /proc/sys/fs/file-max && cat /proc/$PID/limits |
| Pod-to-pod networking broken, no errors | ip_forward sysctl set to 0 | cat /proc/sys/net/ipv4/ip_forward |
| Monitoring agent shows stale connection counts | Non-atomic /proc/net/tcp reads under high churn | Compare wc -l /proc/net/tcp across rapid reads |
| Process RSS appears 0 in /proc/PID/status | Zombie or kernel thread with no mm_struct | cat /proc/$PID/status and check State field |
When to Use / Avoid
- Use when inspecting per-process memory, file descriptors, or thread state at runtime
- Use when tuning kernel parameters (TCP buffers, dirty ratios, pid_max) without a reboot
- Use when building monitoring agents that need raw kernel metrics (CPU, memory, disk, network)
- Use when debugging container resource misdetection (/proc/meminfo showing host values)
- Avoid when needing atomic snapshots of fast-changing data like connection tables -- use netlink sockets instead
- Avoid when binary data or structured output is required -- /proc text parsing is fragile across kernel versions
Try It Yourself
# Show current shell's process status: name, state, PID, UID, VmRSS (resident memory), threads, etc.
cat /proc/$$/status | head -20

# Display the virtual memory map of the shell: text segment, heap, stack, shared libraries, vDSO
cat /proc/$$/maps | head -10

# Key memory stats: total, free (unused), available (reclaimable), cached (page cache), dirty (unflushed)
cat /proc/meminfo | grep -E 'MemTotal|MemFree|MemAvailable|Cached|Dirty'

# Read current writeback thresholds: when write() starts blocking vs when background flush kicks in
sysctl vm.dirty_ratio vm.dirty_background_ratio

# Explore block device tunables: scheduler, read_ahead_kb, nr_requests, rotational
ls /sys/block/sda/queue/

# Aggregated memory stats for a process: RSS, PSS (proportional shared), swap, anonymous vs file-backed
cat /proc/$$/smaps_rollup

Debug Checklist
1. cat /proc/meminfo | grep MemAvailable -- check actual usable memory (not MemFree)
2. cat /proc/$PID/status | grep -E 'VmRSS|VmSize|Threads' -- per-process memory and thread count
3. sysctl -a | grep $KEYWORD -- find relevant kernel tunables by keyword
4. ls -la /proc/$PID/fd | wc -l -- count open file descriptors for a process
5. cat /proc/sys/net/ipv4/ip_forward -- verify IP forwarding is enabled (critical for containers)
6. diff <(sysctl -a) <(cat /etc/sysctl.conf) -- compare running tunables against persistent config
Key Takeaways
- ✓ /proc/PID files do not exist on disk. Reading /proc/PID/status triggers a live walk of the task's mm_struct, signal state, and scheduler data. The kernel generates the text on every read() call.
- ✓ /proc/sys is sysctl exposed as files. Writing to /proc/sys/vm/dirty_ratio calls a kernel handler that validates the value and updates the global variable immediately. No restart needed. No special API.
- ✓ sysfs enforces one-value-per-file. procfs dumps multi-line formatted data. That strict policy makes sysfs easy to script but less human-readable.
- ✓ /proc/PID/maps is a complete X-ray of a process's virtual memory -- every mapped region with address range, permissions, offset, device, inode, and pathname. Essential for debugging memory issues.
- ✓ Reading /proc files is NOT atomic. The kernel generates output page by page via seq_file. If data changes between pages (connections opening while you read /proc/net/tcp), you get an inconsistent snapshot.
Common Pitfalls
- ✗ Mistake: Parsing /proc files by field position. Reality: Kernel upgrades frequently add new fields to /proc/PID/status and /proc/stat. Always parse by field name, not position.
- ✗ Mistake: Using MemFree from /proc/meminfo to gauge available memory. Reality: Most 'free' memory is in page cache and reclaimable slabs. Use MemAvailable (Linux 3.14+) -- it accounts for reclaimable memory minus kernel reserves.
- ✗ Mistake: Writing to /proc/sys and assuming success. Reality: write() returns bytes written even if the kernel rejected the value. Always read back after writing to verify.
- ✗ Mistake: Using /proc/PID/stat utime/stime for precise CPU timing. Reality: These fields are in clock ticks (typically 100 Hz = 10ms resolution). For anything finer, use clock_gettime().
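The read-back pitfall suggests a habit worth encoding: never trust the write alone. A sketch of write-then-verify, demonstrated on a temp file standing in for a /proc/sys entry, since writing real sysctls needs root:

```python
import os
import tempfile

def set_sysctl(path: str, value: str) -> None:
    """Write a sysctl-style file and confirm by reading it back.

    write() can report success even when the kernel clamps or rejects
    the value, so the read-back is the only reliable confirmation.
    """
    with open(path, "w") as f:
        f.write(value)
    with open(path) as f:
        applied = f.read().strip()
    if applied != value:
        raise RuntimeError(f"{path}: wrote {value!r}, kernel kept {applied!r}")

# Stand-in demo; with root this would target e.g. /proc/sys/net/ipv4/ip_forward
with tempfile.NamedTemporaryFile("w", delete=False) as tf:
    stand_in = tf.name
set_sysctl(stand_in, "1")
print("verified:", open(stand_in).read())  # verified: 1
os.unlink(stand_in)
```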
Reference
In One Line
/proc for live state, /proc/sys for instant tuning, /sys for device attributes -- persist everything in sysctl.d or it disappears on reboot.