/proc and /sys Filesystems
Mental Model
A building with a live information kiosk on every floor. No kiosk stores anything. Each has a phone line to the building manager. "How many people on floor 3?" -- the kiosk dials, the manager counts heads right now, and scrawls the answer on a slip. Ask a second later and the number might differ. One wing of kiosks (/proc) hands back messy multi-line reports. The newer wing (/sys) gives exactly one fact per slip -- cleaner, same building.
The Problem
A JVM inside a 4 GB container reads /proc/meminfo, sees the host's 256 GB, sizes a 200 GB heap, and gets OOM-killed in seconds. Monitoring agents parsing /proc/net/tcp on a server with 50,000 connections see inconsistent data because seq_file reads are not atomic across page boundaries. An engineer tunes tcp_max_syn_backlog at runtime but skips the persistent sysctl.d config -- the next reboot silently reverts it, and the next SYN flood inflicts a 40% latency hit. On a Kubernetes node, a single ip_forward=0 in /proc/sys silently breaks pod-to-pod networking for 200+ pods with nothing in the kubelet logs.
Architecture
Run cat /proc/meminfo. That command just read a file that does not exist on any disk.
The kernel generated that text on the fly, right now, from its live data structures. No database query. No API call. Just a read() on a fake file.
This is how Linux exposes its internals. And once the pattern of reading these files directly becomes familiar, every monitoring tool becomes transparent -- because they are all doing exactly this behind the scenes.
What Actually Happens
/proc is backed by struct proc_dir_entry objects with function pointers that generate content on read. Running cat /proc/meminfo triggers the kernel to call meminfo_proc_show(), which walks memory structures and formats the output. Writing to /proc/sys/vm/dirty_ratio calls a sysctl handler that validates the input and modifies the kernel variable immediately -- no restart needed.
/sys was added in Linux 2.6 to model the device hierarchy. Every kernel object (device, driver, bus, network interface) appears as a directory in /sys. Each directory has attribute files -- one value per file -- that expose or modify object properties. /sys/block/sda/queue/scheduler shows and allows changing the I/O scheduler for a disk.
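That one-value-per-file convention makes sysfs trivially scriptable. As a minimal sketch (in Python rather than shell, and assuming the usual format of the scheduler file, where the active entry is wrapped in square brackets):

```python
def active_scheduler(raw: str) -> str:
    """Extract the active entry from a sysfs scheduler file.

    /sys/block/<dev>/queue/scheduler lists every available scheduler
    on one line; the kernel marks the active one with [brackets].
    """
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler in {raw!r}")

# Typical contents on a recent kernel (sample input, not a live read):
print(active_scheduler("none [mq-deadline] kyber bfq"))  # mq-deadline
```

Writing a different scheduler name back to the same file switches schedulers at runtime, following the same read/write symmetry the rest of /sys uses.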
The per-process /proc/PID/ directory is especially powerful. /proc/PID/status provides state, memory usage, capabilities, and signal masks. /proc/PID/maps shows the complete virtual memory layout. /proc/PID/fd/ is a directory of symlinks from fd numbers to their targets. /proc/PID/smaps_rollup gives aggregated memory stats (RSS, PSS, swap) in a single read.
These files are the data source for ps, top, htop, pmap, and every monitoring agent.
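A minimal sketch of what those tools do under the hood: parse /proc/PID/status by field name rather than position, which survives kernel upgrades that insert new fields (Python stand-in for illustration; the live read is guarded so the snippet also runs on non-Linux machines):

```python
from pathlib import Path

def parse_status(text: str) -> dict:
    """Parse /proc/PID/status text into a {field: value} dict.

    Each line is 'Name:\tvalue'; keying by name instead of line
    position is what keeps parsers working across kernel versions.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name] = value.strip()
    return fields

# On Linux, inspect the current process the way ps/top would:
status_path = Path("/proc/self/status")
if status_path.exists():
    status = parse_status(status_path.read_text())
    print(status["Name"], status["State"], status.get("VmRSS", "n/a"))
```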
Under the Hood
seq_file and the consistency trap. Most /proc files use the seq_file interface, generating output iteratively. A single read() returns a page-sized chunk, and the iterator advances on the next read(). This means reading a large /proc file (like /proc/net/tcp with thousands of connections) is NOT atomic. Connections can open or close between reads, producing inconsistent snapshots. For consistent network data, use netlink sockets instead.
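Each individual line of /proc/net/tcp is still easy to decode, even though the file as a whole is not an atomic snapshot. A hedged sketch of the endpoint decoding (assuming a little-endian host, where the kernel emits the IPv4 address as host-order hex and the port as big-endian hex):

```python
import socket
import struct

def decode_endpoint(hex_endpoint: str) -> tuple:
    """Decode one 'ADDR:PORT' field from a /proc/net/tcp line."""
    addr_hex, port_hex = hex_endpoint.split(":")
    # Repack the host-order integer into network byte order for inet_ntoa
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)

# Sample line fragment (fields: sl, local_address, rem_address, st, ...):
sample = "0: 0100007F:1F90 00000000:0000 0A"
local_field = sample.split()[1]
print(decode_endpoint(local_field))  # ('127.0.0.1', 8080)
```

The caveat from the paragraph above still applies: two lines decoded this way may come from different pages of the same read sequence and describe different moments in time.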
Sysctl categories map to real kernel subsystems. /proc/sys/vm/ controls virtual memory (dirty ratios, overcommit, swappiness). /proc/sys/net/ controls networking (TCP tunables, socket buffers, conntrack limits). /proc/sys/kernel/ handles general settings (pid_max, threads-max, panic). Each file is backed by a ctl_table with type checking, range validation, and immediate application. Persistent changes go in /etc/sysctl.conf or /etc/sysctl.d/*.conf.
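The dotted sysctl names map mechanically onto /proc/sys paths, which is all `sysctl(8)` itself does. A small illustration (with one caveat: a few sysctl keys contain literal dots, such as interface names under net.ipv4.conf, so this naive mapping is not universal):

```python
import os

def sysctl_path(name: str) -> str:
    """Map a dotted sysctl name to its /proc/sys file path."""
    return "/proc/sys/" + name.replace(".", "/")

def read_sysctl(name: str) -> str:
    """Read the current value of a sysctl via its /proc/sys file."""
    with open(sysctl_path(name)) as f:
        return f.read().strip()

print(sysctl_path("vm.dirty_ratio"))  # /proc/sys/vm/dirty_ratio
if os.path.exists("/proc/sys/vm/dirty_ratio"):  # Linux only
    print("current dirty_ratio:", read_sysctl("vm.dirty_ratio"))
```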
sysfs was designed to fix procfs. procfs grew organically and became a dumping ground -- inconsistent formatting, multi-value files, no clear hierarchy. sysfs was built with strict rules: one attribute per file, ASCII text, no binary data, consistent directory structure. New kernel subsystems use sysfs (or configfs, debugfs) rather than procfs.
Common Questions
What is the difference between /proc/PID/maps and /proc/PID/smaps?
maps shows each VMA with its address range, permissions, and backing file. smaps adds detailed per-region accounting: Rss, Pss (proportional share of shared pages), Shared/Private_Clean/Dirty, Referenced, Anonymous, Swap. smaps_rollup (Linux 4.14+) gives totals across all VMAs -- faster than parsing the full smaps.
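Because Pss charges each shared page 1/N to each of its N sharers, summing Pss across VMAs (or across processes) gives physically meaningful totals -- which is exactly the aggregation smaps_rollup performs in the kernel. A sketch of the same sum done in userspace:

```python
def total_pss_kb(smaps_text: str) -> int:
    """Sum the per-VMA Pss lines from /proc/PID/smaps output (kB)."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Pss:"):
            total += int(line.split()[1])  # second field is the kB value
    return total

# Two VMAs' worth of sample smaps output:
sample = "Pss:        128 kB\nRss:        256 kB\nPss:         64 kB\n"
print(total_pss_kb(sample))  # 192
```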
Why does /proc/meminfo show so little MemFree but the system seems fine?
MemFree counts only truly unused pages. Most "free" memory is used by page cache (Cached), reclaimable slabs (SReclaimable), and buffers. These are all reclaimable under pressure. MemAvailable (Linux 3.14+) estimates usable memory by accounting for reclaimable cache. A system with 200MB MemFree but 28GB MemAvailable is perfectly healthy -- the cache is doing its job.
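A small parser makes the MemFree-vs-MemAvailable distinction concrete (sample meminfo text, with values chosen to mirror the 200 MB / 28 GB example above):

```python
def meminfo_kb(text: str) -> dict:
    """Parse /proc/meminfo text into {field: kB}. Values are in kB."""
    out = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            out[name] = int(parts[0])
    return out

sample = """MemTotal:       32768000 kB
MemFree:          204800 kB
MemAvailable:   29360128 kB
Cached:         26214400 kB"""
info = meminfo_kb(sample)
# MemFree looks alarming; MemAvailable tells the real story
print(info["MemFree"] // 1024, "MB free,",
      info["MemAvailable"] // 1024, "MB available")  # 200 MB free, 28672 MB available
```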
Do containers get their own /proc?
With PID namespaces, a container gets its own /proc showing only its processes. /proc/1 is the container's init, not the host's. But /proc/meminfo and /proc/cpuinfo still show HOST values -- a known limitation that causes apps to misdetect resources. LXCFS fixes this by overlaying those files with cgroup-aware data.
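A hedged sketch of cgroup-aware detection -- conceptually what LXCFS and -XX:+UseContainerSupport do. It assumes the cgroup v2 unified hierarchy (memory.max directly under /sys/fs/cgroup); cgroup v1 uses memory/memory.limit_in_bytes instead, and "max" means no limit applies:

```python
from pathlib import Path
from typing import Optional

def parse_memory_max(raw: str) -> Optional[int]:
    """cgroup v2 memory.max: 'max' means unlimited, else bytes."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)

def container_memory_limit_bytes() -> Optional[int]:
    """Best-effort container limit, preferring cgroup data over /proc."""
    p = Path("/sys/fs/cgroup/memory.max")  # cgroup v2 path (assumption)
    return parse_memory_max(p.read_text()) if p.exists() else None

limit = container_memory_limit_bytes()
print("cgroup limit:", "none" if limit is None else f"{limit} bytes")
```

A runtime that sizes its heap from this value instead of /proc/meminfo avoids the 4 GB-container / 256 GB-host mismatch entirely.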
How do I find which tunable to change?
Start with sysctl -a | grep <keyword>. For documentation, check Documentation/admin-guide/sysctl/ in the kernel source. For sysfs attributes, udevadm info -a <device_path> walks the parent chain showing all attributes.
How Technologies Use This
A Java application inside a Docker container with a 4GB memory limit sets its heap to 200GB and immediately gets OOM-killed. The JVM auto-sizes its heap based on available memory, but the container keeps reporting 256GB -- the full host memory.
The hidden issue is that Docker's PID-namespaced /proc only isolates the process list, not system-wide files like /proc/meminfo and /proc/cpuinfo. The JVM reads /proc/meminfo, sees the host's 256GB total memory instead of the container's 4GB cgroup limit, and requests a heap that far exceeds its allowance. This mismatch between what /proc reports and what cgroups enforce is the root cause of most container OOM-kill mysteries.
LXCFS solves this by overlaying /proc/meminfo and /proc/cpuinfo with cgroup-aware data, reducing JVM OOM-kill incidents by over 90% in container deployments that rely on automatic memory detection. For modern JVMs, the -XX:+UseContainerSupport flag reads cgroup limits directly instead of trusting /proc.
Pod networking silently breaks across an entire node after a configuration change, with no errors in kubelet logs. Pods can reach external services but cannot communicate with each other, and the kube-proxy logs show nothing useful.
The cause is a misconfigured /proc/sys/net/ipv4/ip_forward sysctl that was reset to 0. kube-proxy writes to this file and to bridge-nf-call-iptables to enable pod-to-pod traffic routing. The kubelet also reads MemAvailable from /proc/meminfo every 10 seconds and compares it against per-pod usage from /sys/fs/cgroup memory.current files to make eviction decisions. All of this happens through plain file reads and writes against virtual filesystems.
The fix is ensuring ip_forward is set persistently via /etc/sysctl.d/ and verifying it after node reboots or kernel upgrades. Understanding that Kubernetes depends entirely on /proc and /sys for node-level operations explains why a single missing sysctl can silently break an entire node.
A runaway service leaks memory and approaches the host's physical limit. Without per-service resource controls, the OOM killer will eventually terminate random processes across the entire system, potentially taking down critical infrastructure.
The mechanism that makes instant enforcement possible is that /proc/sys and /sys/fs/cgroup are virtual files backed by live kernel data structures. Writing 1073741824 to /sys/fs/cgroup/system.slice/myservice.service/memory.max immediately caps that service at 1GB with no restart needed. At boot, sysctl.d configs write to /proc/sys tunables to set kernel parameters like vm.swappiness and net.core.somaxconn.
systemd's MemoryMax=1G in a unit file writes directly to these virtual files, and PrivateTmp creates a per-service mount namespace while ProtectSystem remounts /usr and /boot read-only. These features restrict which kernel state each service can observe, reducing the blast radius of a compromise by over 80% compared to unrestricted services.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Memory detection | /proc/meminfo (host leak) | -XX:+UseContainerSupport reads cgroup | process.memoryUsage() reads /proc/self | runtime.ReadMemStats (runtime-internal accounting) | kubelet reads /proc/meminfo + cgroup memory.current |
| Runtime tuning | sysctl in docker run --sysctl | -XX flags at startup | --max-old-space-size flag | GOGC, GOMEMLIMIT env vars | sysctl pod security context |
| Process inspection | docker top reads /proc/PID | jcmd, jstack read /proc/self | /proc/self/status from fs module | /proc/self/maps via os package | kubectl exec + cat /proc/* |
| FD monitoring | /proc/PID/fd inside container ns | /proc/self/fd + lsof | fs.readdir on /proc/self/fd | os.File count + /proc/self/fd | liveness probes checking fd count |

| Stack Layer | Mechanism |
|---|---|
| Application | Reads /proc/self/* for self-inspection, memory, FDs |
| Runtime / VM | JVM, Node, Go read /proc/meminfo or cgroup files for auto-sizing |
| Container | PID namespace isolates /proc process list but not meminfo/cpuinfo; LXCFS patches this |
| Kernel | proc_dir_entry callbacks generate text from task_struct, mm_struct, net namespace on each read() |
| Hardware | /sys/class/* and /sys/devices/* expose hardware topology, device attributes, power state |
Design rationale: Making kernel state look like files means any language, any tool, any one-liner script can inspect the system through open()/read()/write() with zero special APIs. The cost is text-parsing fragility and non-atomic reads for files that span multiple pages.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Container app sees host memory instead of limit | /proc/meminfo not namespace-aware | cat /proc/meminfo inside container vs cgroup memory.max |
| sysctl change lost after reboot | Runtime-only write to /proc/sys, no persistent config | grep setting in /etc/sysctl.d/*.conf |
| "Too many open files" error | fs.file-max or per-process nofile ulimit too low | cat /proc/sys/fs/file-max && cat /proc/$PID/limits |
| Pod-to-pod networking broken, no errors | ip_forward sysctl set to 0 | cat /proc/sys/net/ipv4/ip_forward |
| Monitoring agent shows stale connection counts | Non-atomic /proc/net/tcp reads under high churn | Compare wc -l /proc/net/tcp across rapid reads |
| Process RSS appears 0 in /proc/PID/status | Zombie or kernel thread with no mm_struct | cat /proc/$PID/status and check State field |
When to Use / Avoid
- Use when inspecting per-process memory, file descriptors, or thread state at runtime
- Use when tuning kernel parameters (TCP buffers, dirty ratios, pid_max) without a reboot
- Use when building monitoring agents that need raw kernel metrics (CPU, memory, disk, network)
- Use when debugging container resource misdetection (/proc/meminfo showing host values)
- Avoid when needing atomic snapshots of fast-changing data like connection tables -- use netlink sockets instead
- Avoid when binary data or structured output is required -- /proc text parsing is fragile across kernel versions
Try It Yourself
# Show current shell's process status: name, state, PID, UID, VmRSS (resident memory), threads, etc.
cat /proc/$$/status | head -20

# Display the virtual memory map of the shell: text segment, heap, stack, shared libraries, vDSO
cat /proc/$$/maps | head -10

# Key memory stats: total, free (unused), available (reclaimable), cached (page cache), dirty (unflushed)
cat /proc/meminfo | grep -E 'MemTotal|MemFree|MemAvailable|Cached|Dirty'

# Read current writeback thresholds: when write() starts blocking vs when background flush kicks in
sysctl vm.dirty_ratio vm.dirty_background_ratio

# Explore block device tunables: scheduler, read_ahead_kb, nr_requests, rotational
ls /sys/block/sda/queue/

# Aggregated memory stats for a process: RSS, PSS (proportional shared), swap, anonymous vs file-backed
cat /proc/$$/smaps_rollup

Debug Checklist
1. cat /proc/meminfo | grep MemAvailable -- check actual usable memory (not MemFree)
2. cat /proc/$PID/status | grep -E 'VmRSS|VmSize|Threads' -- per-process memory and thread count
3. sysctl -a | grep $KEYWORD -- find relevant kernel tunables by keyword
4. ls -la /proc/$PID/fd | wc -l -- count open file descriptors for a process
5. cat /proc/sys/net/ipv4/ip_forward -- verify IP forwarding is enabled (critical for containers)
6. diff <(sysctl -a) <(cat /etc/sysctl.conf) -- compare running tunables against persistent config
Key Takeaways
- ✓ /proc/PID files do not exist on disk. Reading /proc/PID/status triggers a live walk of the task's mm_struct, signal state, and scheduler data. The kernel generates the text on every read() call.
- ✓ /proc/sys is sysctl exposed as files. Writing to /proc/sys/vm/dirty_ratio calls a kernel handler that validates the value and updates the global variable immediately. No restart needed. No special API.
- ✓ sysfs enforces one-value-per-file. procfs dumps multi-line formatted data. That strict policy makes sysfs easy to script but less human-readable.
- ✓ /proc/PID/maps is a complete X-ray of a process's virtual memory -- every mapped region with address range, permissions, offset, device, inode, and pathname. Essential for debugging memory issues.
- ✓ Reading /proc files is NOT atomic. The kernel generates output page by page via seq_file. If data changes between pages (connections opening while you read /proc/net/tcp), you get an inconsistent snapshot.
Common Pitfalls
- ✗ Mistake: Parsing /proc files by field position. Reality: Kernel upgrades frequently add new fields to /proc/PID/status and /proc/stat. Always parse by field name, not position.
- ✗ Mistake: Using MemFree from /proc/meminfo to gauge available memory. Reality: Most 'free' memory is in page cache and reclaimable slabs. Use MemAvailable (Linux 3.14+) -- it accounts for reclaimable memory minus kernel reserves.
- ✗ Mistake: Writing to /proc/sys and assuming success. Reality: write() returns bytes written even if the kernel rejected the value. Always read back after writing to verify.
- ✗ Mistake: Using /proc/PID/stat utime/stime for precise CPU timing. Reality: These fields are in clock ticks (typically 100 Hz = 10ms resolution). For anything finer, use clock_gettime().
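The read-back pitfall suggests a habit worth encoding: never trust the write alone. A sketch of write-then-verify, demonstrated on a temp file standing in for a /proc/sys entry, since writing real sysctls needs root:

```python
import os
import tempfile

def set_sysctl(path: str, value: str) -> None:
    """Write a sysctl-style file and confirm by reading it back.

    write() can report success even when the kernel clamps or rejects
    the value, so the read-back is the only reliable confirmation.
    """
    with open(path, "w") as f:
        f.write(value)
    with open(path) as f:
        applied = f.read().strip()
    if applied != value:
        raise RuntimeError(f"{path}: wrote {value!r}, kernel kept {applied!r}")

# Stand-in demo; with root this would target e.g. /proc/sys/net/ipv4/ip_forward
with tempfile.NamedTemporaryFile("w", delete=False) as tf:
    stand_in = tf.name
set_sysctl(stand_in, "1")
print("verified:", open(stand_in).read())  # verified: 1
os.unlink(stand_in)
```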
Reference
In One Line
/proc for live state, /proc/sys for instant tuning, /sys for device attributes -- persist everything in sysctl.d or it disappears on reboot.