/proc and /sys Advanced Patterns
Mental Model
A library where no book is stored on shelves. Every time someone requests a book, a clerk writes it from scratch using the latest data. Some "books" are a single page (current temperature: just read a thermometer and write the number). Others are encyclopedias (full memory map of a process: the clerk has to walk through every room in a building and catalog its contents). Requesting the same book twice produces two independently generated copies. The library has no storage costs but the clerk's time varies wildly depending on the request.
The Problem
A monitoring system scrapes /proc for 500 containers, reading dozens of /proc files per container on each pass. Knowing which files are cheap and which are expensive to read matters when tuning the scrape interval. Scraping /proc/PID/smaps for 500 processes every 10 seconds consumes 15% of one CPU core because each read triggers a full page table walk. Switching to /proc/PID/smaps_rollup (an aggregated summary) drops the cost by 95%. Scraping /proc/PID/status (a few cached fields) costs almost nothing.
Architecture
A monitoring system scrapes 500 containers every 10 seconds. Each scrape reads a dozen /proc files per container: status, smaps, io, cgroup files, fd counts. The monitoring agent itself consumes 15% of a CPU core. Nothing else on the system is under load.
The cause: /proc/PID/smaps. Each read walks every page table entry for every VMA in the process. For a Java container with a 2 GB heap, that is 500,000+ pages examined per read. Multiply by 500 containers, every 10 seconds. Switching to /proc/PID/smaps_rollup (an aggregated summary instead of per-VMA detail) cuts the per-read cost by 95%. Switching to /proc/PID/status (VmRSS only, no page walk) cuts it by 99%.
This is the fundamental reality of /proc and /sys: they look like files, but every read triggers a kernel function. Some functions format a cached integer. Others walk data structures holding locks. Knowing which is which separates monitoring that scales from monitoring that becomes the bottleneck.
What Actually Happens
/proc and /sys are virtual filesystems. There is no storage backing. When a process calls open("/proc/1234/status", O_RDONLY) followed by read(), the kernel calls a function registered for that file (proc_pid_status() in fs/proc/array.c). That function reads fields from the process's task_struct, formats them into text, and copies the result to userspace.
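A quick way to see that there is no stored content: the inode reports a size of zero, yet every read produces freshly generated text.
stat -c 'on-disk size: %s bytes' /proc/meminfo   # prints 0: nothing is stored
wc -c < /proc/meminfo                            # yet a read returns a kilobyte or two of generated text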
The key mechanism is seq_file (fs/seq_file.c). Most /proc files use this framework:
- open() allocates a seq_file structure and calls a start callback.
- read() calls show callbacks that format data into a buffer using seq_printf().
- If the buffer fills before all data is formatted, subsequent read() calls continue from where the last one left off.
- close() calls stop and frees the structure.
This means each read() gets fresh data. There is no caching between reads. Two consecutive reads of /proc/meminfo can return different values if memory state changed between them.
The Cost Hierarchy
Not all /proc reads are equal. The cost depends entirely on what the kernel function does to generate the output.
Nanosecond reads (cached counters):
cat /proc/loadavg # Three cached floats + running/total processes
cat /proc/uptime # Two cached counters
cat /proc/sys/vm/swappiness # Single integer variable
These read pre-computed values from global variables. The kernel function does nothing but sprintf().
Microsecond reads (task_struct fields):
cat /proc/PID/status # ~40 fields from task_struct and mm_struct
cat /proc/PID/io # 7 counters from task_io_accounting
cat /proc/PID/cgroup # Walk short cgroup list
cat /proc/PID/cmdline # Copy saved argv from mm_struct
These access fields on a single task_struct. The lock hold time is minimal. Reading /proc/PID/status for 500 processes takes milliseconds total.
Millisecond reads (data structure walks):
cat /proc/PID/smaps # Walk every PTE in every VMA
cat /proc/PID/maps # Iterate all VMAs (cheaper than smaps)
cat /proc/PID/numa_maps # Walk PTEs + NUMA node lookup per page
cat /proc/net/tcp # Iterate TCP established hash table under lock
These iterate large kernel data structures. /proc/PID/smaps calls smaps_pte_entry() for every page table entry. A process mapping 2 GB of memory with 4 KB pages has 524,288 PTEs to walk. Each PTE access may trigger a TLB miss. The total cost: 5-50 ms depending on memory size and TLB pressure.
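For a rough gauge of how much work a smaps read implies for a given process (using the current shell as a stand-in), count what the kernel has to iterate:
wc -l < /proc/$$/maps                                         # one line per VMA to iterate
awk '/^VmSize/ {printf "%d pages\n", $2/4}' /proc/$$/status   # VmSize is in kB; /4 gives 4 KB pages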
The smaps vs smaps_rollup vs status comparison:
# Full per-VMA detail (walks all PTEs, produces kilobytes of output)
cat /proc/PID/smaps # 5-50 ms per read
# Aggregated summary (still walks PTEs, but emits one rollup record instead of per-VMA detail)
cat /proc/PID/smaps_rollup # 5-50 ms but much less output to parse
# Cached VmRSS from task_struct (no page walk)
grep VmRSS /proc/PID/status # <10 us per read
If only the total RSS matters, /proc/PID/status is three orders of magnitude cheaper. smaps_rollup is useful when the breakdown (shared vs private, clean vs dirty) is needed without per-VMA granularity.
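For example, the relevant breakdown fields can be pulled straight out of the rollup (shown here for the current shell):
grep -E '^(Rss|Pss|Shared_Clean|Shared_Dirty|Private_Clean|Private_Dirty|Swap):' /proc/$$/smaps_rollup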
/sys/fs/cgroup: The Container Control Plane
Cgroup v2 exposes its entire interface through /sys/fs/cgroup. Each cgroup is a directory. Resource controls are files within that directory.
Setting limits:
# Memory limit: 512 MB
echo 536870912 > /sys/fs/cgroup/my-container/memory.max
# CPU limit: 200% (200ms per 100ms period = 2 cores)
echo "200000 100000" > /sys/fs/cgroup/my-container/cpu.max
# PID limit: 1000 processes
echo 1000 > /sys/fs/cgroup/my-container/pids.max
# I/O limit: 50 MB/s read on device 8:0
echo "8:0 rbps=52428800" > /sys/fs/cgroup/my-container/io.max
Reading usage and events:
# Current memory (single number, cheap)
cat /sys/fs/cgroup/my-container/memory.current
# Detailed memory accounting (cheap, precomputed counters)
cat /sys/fs/cgroup/my-container/memory.stat
# Output:
# anon 134217728
# file 67108864
# kernel 8388608
# sock 1048576
# pgfault 15234567
# pgmajfault 42
# CPU usage and throttling
cat /sys/fs/cgroup/my-container/cpu.stat
# Output:
# usage_usec 1234567890
# user_usec 1000000000
# system_usec 234567890
# nr_periods 50000
# nr_throttled 125
# throttled_usec 6250000
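# Derived metric (a small sketch using the two counters above):
# fraction of scheduling periods in which the group was throttled
awk '/^nr_periods/{p=$2} /^nr_throttled/{t=$2} END{if (p) printf "throttled in %.2f%% of periods\n", 100*t/p}' /sys/fs/cgroup/my-container/cpu.stat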
# OOM events
cat /sys/fs/cgroup/my-container/memory.events
# Output:
# low 0
# high 45
# max 3
# oom 1
# oom_kill 1
The memory.events file is special: container runtimes use inotify on this file to detect OOM kills without polling. When the kernel kills a process in the cgroup, it increments the oom_kill counter and triggers an inotify event. The runtime reads the updated value and reports the OOM to the orchestrator.
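A minimal shell version of the same pattern, assuming the inotifywait tool (from inotify-tools) is installed and using the cgroup path from the earlier examples:
# Block until the kernel modifies memory.events, then report the oom_kill counter
inotifywait -m -e modify /sys/fs/cgroup/my-container/memory.events |
while read -r path events; do
    grep '^oom_kill' /sys/fs/cgroup/my-container/memory.events
done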
/proc/sys: Kernel Tunables
Every file under /proc/sys maps to a kernel variable. The sysctl command is a wrapper around reading and writing these files.
Network tuning (most common):
# TCP buffer sizes: min default max (in bytes)
sysctl net.ipv4.tcp_rmem
# Output: net.ipv4.tcp_rmem = 4096 131072 6291456
sysctl net.ipv4.tcp_wmem
# Output: net.ipv4.tcp_wmem = 4096 16384 4194304
# Increase max socket buffer to 16 MB
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# TCP backlog for high-connection servers
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
Virtual memory tuning:
# Swappiness: 0 = avoid swap, 100 = aggressively swap
sysctl -w vm.swappiness=10
# Dirty page writeback thresholds
sysctl -w vm.dirty_ratio=20
sysctl -w vm.dirty_background_ratio=5
# Overcommit: 0=heuristic, 1=always, 2=never
sysctl vm.overcommit_memory
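Both interfaces end in the same kernel handler; sysctl is only a wrapper around the file, with dots in the name mapping to slashes under /proc/sys:
# Equivalent ways to change the same kernel variable
sysctl -w vm.swappiness=10
echo 10 > /proc/sys/vm/swappiness
# Reading it back from either interface shows the same value
cat /proc/sys/vm/swappiness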
Persistence: Changes via sysctl -w are lost on reboot. For persistence:
# Create a drop-in file
cat > /etc/sysctl.d/99-tuning.conf << 'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
vm.swappiness = 10
EOF
# Apply immediately
sysctl --system
Under the Hood
seq_file internals. The /proc framework uses seq_file for files that produce multi-line output. The start, next, stop, show callbacks iterate over a data structure. For /proc/net/tcp, start grabs a spinlock on the TCP hash table, next advances to the next hash bucket entry, show formats one TCP socket, and stop releases the lock. This means the hash table is locked for the entire duration of the read. On a server with 100k connections, reading /proc/net/tcp holds the lock for tens of milliseconds, blocking new connection processing.
sysfs (kobject) model. Files in /sys are backed by kernel objects (kobjects). Each kobject has attributes, and each attribute has show and store callbacks. When a new network interface is created, the kernel creates a kobject in /sys/class/net/, registers attribute files (mtu, address, carrier, speed), and sends a uevent via netlink. udev receives the event, reads the attributes from /sys, and applies rules (rename interface, set permissions, trigger scripts).
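To poke at the kobject model from userspace (interface names vary per host; udevadm is part of systemd's udev):
ls /sys/class/net/                              # one directory per net_device kobject
cat /sys/class/net/*/mtu                        # each file is an attribute backed by a show callback
udevadm monitor --kernel --subsystem-match=net  # watch uevents as interfaces come and go (Ctrl-C to stop)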
cgroup v2 file operations. Cgroup knob files use cgroup_file_ops with seq_show for reads and write for writes. Writing to memory.max calls mem_cgroup_write() which validates the value, updates the cgroup's limit, and may trigger immediate reclaim if current usage exceeds the new limit. The write is synchronous: when it returns, the limit is in effect.
fdinfo and per-fd metadata. /proc/PID/fdinfo/N provides metadata about file descriptor N that goes beyond what /proc/PID/fd/N (a symlink to the path) shows. For an epoll fd, fdinfo lists all monitored fds and their events. For a timerfd, it shows the timer interval and next expiration. For an eventfd, it shows the current counter value. This is invaluable for debugging event-driven applications.
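To see it for a live process (the current shell as a stand-in):
# Dump per-fd metadata: position, flags, and type-specific fields (epoll targets, timerfd expirations, eventfd counters)
for fd in /proc/$$/fdinfo/*; do
    n=${fd##*/}
    echo "== fd $n -> $(readlink /proc/$$/fd/$n) =="
    cat "$fd"
done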
Common Questions
How does /proc handle processes that exit during a read?
If a process exits after one of its /proc/PID files has been opened, the open file descriptor stays valid, but subsequent reads typically return ESRCH or empty output once the task is gone; the kernel does not keep a snapshot of the old state. If the file has not been opened yet, open() returns ENOENT. There is an inherent race: listing /proc to find PIDs, then opening files for each PID, will inevitably encounter processes that exited in between. Robust monitoring code handles ENOENT and ESRCH gracefully.
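A shell-level version of the same defensive pattern:
# PIDs can vanish between the glob and the read; -s silences those errors, -H keeps the file name
grep -sH '^VmRSS' /proc/[0-9]*/status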
Is there overhead to having /proc mounted?
No. /proc (procfs) and /sys (sysfs) consume no disk I/O and minimal memory. The directory entries are generated dynamically when listed. Memory overhead is limited to the superblock, inode cache entries for recently accessed files, and the dentry cache. On a system with 1000 processes, the /proc overhead is measured in kilobytes.
Can /proc and /sys files be mmap'd?
Most /proc files cannot be mmap()'d because they have no backing pages. The content is generated on read. Some /sys files support mmap for device register access (e.g., PCI BAR regions exposed via sysfs). In general, treat these as read-once, process, discard. Repeated monitoring requires repeated reads.
How do containers see /proc?
A container typically gets /proc mounted from the host but filtered by its PID namespace. The container sees only its own processes in /proc. Additionally, paths like /proc/kcore, /proc/sysrq-trigger, and /proc/sys may be masked (bind-mounted to /dev/null) or made read-only to prevent the container from modifying host kernel state. The container runtime (runc) configures these masks based on the OCI runtime spec.
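Run inside a container, the masks show up as extra mounts layered over /proc (the exact set depends on the runtime configuration):
findmnt -R /proc    # /proc plus everything mounted on top of it: the masked and read-only paths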
How Technologies Use This
A Prometheus node_exporter instance on a 64-core host running 500 containers scrapes /proc and /sys every 10 seconds. Each scrape cycle opens and reads approximately 2,500 files: /proc/stat (CPU counters), /proc/meminfo (memory stats), /proc/diskstats (I/O counters), /proc/net/dev (network interface stats), and per-container cgroup files under /sys/fs/cgroup (cpu.stat, memory.current, memory.stat, io.stat for each of the 500 containers). The scrape takes 180ms on average, but CPU usage of node_exporter is 8% of one core.
The cost difference between /proc files is dramatic. Reading /proc/stat costs 2 microseconds because the kernel simply formats pre-maintained per-CPU counters. Reading /proc/meminfo costs 5 microseconds for the same reason. But reading /proc/PID/smaps costs 500 microseconds to 3 milliseconds per process because the kernel must walk every page table entry in the process's address space, check each page's flags, and sum the results. For 500 containers, scraping smaps at a 10-second interval consumes 15% of a CPU core just for page table walks. Switching to /proc/PID/smaps_rollup (introduced in Linux 4.14), which returns one aggregated summary instead of per-VMA detail, drops the cost by 95%.
Reading /proc/net/tcp is another expensive operation. The kernel iterates the entire TCP hash table under a spinlock to generate the output. On a host with 200,000 open TCP connections across 500 containers, a single read of /proc/net/tcp takes 50ms and holds the lock for that entire duration, stalling other network operations. node_exporter's tcpstat collector, which parses /proc/net/tcp, is disabled by default for exactly this reason; the team instead collects TCP metrics from /proc/net/snmp (the netstat collector), which reads pre-aggregated counters in under 10 microseconds.
A Docker host running 300 containers uses /sys/fs/cgroup as the primary interface for resource monitoring and enforcement. Each container gets a cgroup directory under /sys/fs/cgroup/system.slice/docker-<container_id>.scope/ containing files that the kernel generates on the fly. Reading memory.current returns the container's RSS in bytes (a single atomic counter, costs 1 microsecond). Reading memory.stat returns 25 fields including page cache, active/inactive anonymous pages, and slab usage. Reading cpu.stat returns usage_usec, user_usec, system_usec, and nr_throttled. Each of these files maps directly to an in-kernel counter that the cgroup subsystem maintains.
Docker's containerd runtime reads and writes these cgroup files at multiple lifecycle stages. At container creation, runc writes resource limits: "104857600" to memory.max (cap at 100 MB), "50000 100000" to cpu.max (50% of one core), "300" to pids.max. During runtime, containerd monitors memory.events for OOM kills by watching for the "oom_kill" counter incrementing. It reads io.stat per block device to track container I/O throughput. The kubelet on Kubernetes nodes reads these same files every 10 seconds to populate container metrics and make pod eviction decisions when memory.current approaches the node's allocatable limit.
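At the CLI level, the familiar docker run flags are just front ends for these writes (values illustrative):
docker run -d --memory=100m --cpus=0.5 --pids-limit=300 nginx
# --memory=100m     -> memory.max = 104857600
# --cpus=0.5        -> cpu.max    = "50000 100000"
# --pids-limit=300  -> pids.max   = 300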
Inside the container, /proc is mounted from the host's procfs but filtered through the PID namespace, so the container sees only its own processes in /proc/[0-9]*/. The /sys/fs/cgroup mount inside the container is a bind mount of the container's own cgroup subtree, meaning the container can read its own memory.current and cpu.stat but cannot see any other container's cgroup directory. This isolation is enforced by the mount namespace, not by cgroup permissions, which is why escaping the mount namespace would expose all containers' cgroup data.
Kubernetes cAdvisor (embedded in the kubelet) reads /proc/PID/cgroup for every container process on the node to discover which cgroup each container belongs to, then reads the corresponding /sys/fs/cgroup files to collect CPU, memory, and I/O metrics. On a node running 80 pods with an average of 3 containers each, cAdvisor reads approximately 720 cgroup files every 10 seconds (cpu.stat, memory.current, memory.stat for each of the 240 containers). These reads are cheap because cgroup counters are pre-aggregated by the kernel, costing 1 to 5 microseconds each.
cAdvisor also reads /proc/PID/status for each container's init process to collect VmRSS (resident memory), VmSwap (swapped pages), and Threads (thread count). It reads /proc/PID/io for per-process disk I/O counters (read_bytes, write_bytes). These /proc files cost 5 to 15 microseconds each because the kernel reads cached fields from the task_struct. The total scrape cost for 240 containers is under 50ms, well within the 10-second collection interval. The kubelet exposes these metrics at /metrics/cadvisor for Prometheus to scrape, forming the foundation for Horizontal Pod Autoscaler decisions and node-level eviction thresholds.
Reducing the cAdvisor housekeeping interval below 10 seconds is possible but increases CPU overhead linearly. At a 1-second interval on a 240-container node, cAdvisor consumes 3% of a CPU core. The expensive files to avoid at high frequency are /proc/PID/smaps (page table walk, 500 microseconds per process) and /proc/net/tcp (full hash table scan under lock). cAdvisor disables smaps collection by default for this reason. The /sys/fs/cgroup reads remain the cheapest and most information-dense source of container metrics available.
Same Concept Across Tech
| Technology | Key /proc and /sys files | Key gotcha |
|---|---|---|
| Prometheus node_exporter | /proc/stat, /proc/meminfo, /proc/diskstats, /proc/net/dev | Enabling per-process collectors (smaps) at high frequency causes CPU spikes |
| containerd/CRI-O | /sys/fs/cgroup/*/memory.events, cpu.stat, memory.current | OOM detection depends on inotify watch on memory.events, not polling |
| cAdvisor | /sys/fs/cgroup/*/memory.stat, cpu.stat, io.stat, /proc/PID/io | Reads dozens of cgroup files per container per scrape. Scale matters |
| sysctl tuning | /proc/sys/net/, /proc/sys/vm/, /proc/sys/fs/* | Changes are not persistent unless written to /etc/sysctl.d/ |
| perf/bpftrace | /proc/PID/maps (for symbol resolution), /sys/kernel/debug/tracing/ | perf reads maps once at profile start, stale if process loads new libraries |
Stack layer mapping (monitoring CPU spike from /proc scraping):
| Layer | What to check | Tool |
|---|---|---|
| Monitoring config | Which collectors are enabled? Is smaps/net_tcp collection on? | node_exporter startup flags (--collector.* / --no-collector.*) and logs |
| Scrape frequency | How often does the scraper read expensive files? | Prometheus scrape_interval in config |
| Per-file cost | Which /proc files dominate CPU time? | strace -c on the monitoring process |
| Kernel | How much time is spent in seq_file generation? | perf top during scrape, look for smaps_pte_entry or tcp4_seq_show |
| System | Is the monitoring overhead acceptable relative to total CPU? | top/htop during scrape window |
Design Rationale
Exposing kernel state as a filesystem was a deliberate Unix design choice. Files are the universal interface: cat, grep, awk, and every programming language can read them without special libraries or syscalls. Plan 9 took this further, exposing nearly every kernel interface as a file. Linux's /proc started as process information (hence the name) and grew to include kernel tunables (/proc/sys), network state (/proc/net), and hardware info. /sys was introduced in Linux 2.6 to separate the device model from /proc's increasingly crowded namespace. The split is not always clean: network statistics exist in both /proc/net/dev and /sys/class/net/*/statistics/. But the general rule holds: /proc for process and kernel state, /sys for device and subsystem state.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Monitoring process using 10-15% CPU | Scraping expensive /proc files (smaps, net/tcp) at high frequency | strace -c on monitoring process, look for open/read on smaps |
| Container OOM not detected promptly | Not watching memory.events via inotify, relying on polling memory.current | Check if runtime uses inotify on memory.events or polls |
| sysctl changes lost after reboot | Wrote to /proc/sys directly without persisting to /etc/sysctl.d/ | Check /etc/sysctl.d/*.conf for the setting |
| /proc/PID files return ESRCH | Process exited between directory listing and file read | Expected race condition. Handle ESRCH/ENOENT gracefully |
| Container shows wrong memory usage | Reading memory.current includes page cache. RSS may differ from container's active usage | Read memory.stat for anon vs file breakdown |
| ss returns data but /proc/net/tcp is empty | Namespace issue: reading from wrong namespace's /proc | nsenter -t PID -n cat /proc/net/tcp |
When to Use / Avoid
Relevant when:
- Building or configuring monitoring that scrapes per-process or per-container metrics
- Tuning kernel parameters for networking, memory management, or I/O scheduling
- Debugging process-level resource consumption (memory leaks, fd leaks, I/O patterns)
- Understanding container resource limits and cgroup v2 interface files
Watch out for:
- /proc/PID/smaps is expensive at scale. Use smaps_rollup or status instead
- /proc/net/tcp holds a spinlock during iteration. Use ss (netlink) for frequent socket enumeration
- /proc/PID files are not atomic across multiple reads. Values from different files may not be perfectly consistent
- Writing invalid formats to /sys/fs/cgroup files fails silently or returns EINVAL
Try It Yourself
# Quick process memory overview (cheap read)
grep -E 'VmRSS|VmSize|VmPeak|VmSwap|Threads' /proc/$$/status

# Detailed memory summary (much cheaper than full smaps)
cat /proc/$$/smaps_rollup

# Per-process I/O accounting
cat /proc/$$/io

# Count open file descriptors for a process
ls /proc/$$/fd | wc -l && cat /proc/$$/limits | grep "open files"

# System-wide pressure stall information
for f in /proc/pressure/*; do echo "=== $(basename $f) ===" && cat $f; done

# Cgroup v2: check container memory limit and current usage
CG=$(cat /proc/1/cgroup | cut -d: -f3); cat /sys/fs/cgroup${CG}/memory.max; cat /sys/fs/cgroup${CG}/memory.current

# Network tunables: TCP buffer sizes
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max

# Benchmark /proc read costs: status (cheap) vs smaps (expensive)
time for i in $(seq 1000); do cat /proc/$$/status > /dev/null; done; time for i in $(seq 1000); do cat /proc/$$/smaps > /dev/null; done

# Watch cgroup CPU throttling in real time
watch -n1 'cat /sys/fs/cgroup/system.slice/my-service.service/cpu.stat'

# Find all processes in a cgroup
cat /sys/fs/cgroup/system.slice/docker-*.scope/cgroup.procs

# System memory breakdown
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Dirty|Writeback|AnonPages|Mapped|Shmem)' /proc/meminfo
Debug Checklist
1. Quick memory check: cat /proc/<pid>/status | grep -E 'VmRSS|VmSize|VmPeak|Threads'
2. Detailed memory breakdown: cat /proc/<pid>/smaps_rollup
3. I/O accounting: cat /proc/<pid>/io
4. Open file descriptors: ls /proc/<pid>/fd | wc -l
5. Cgroup membership: cat /proc/<pid>/cgroup
6. Container memory limit: cat /sys/fs/cgroup/<cgroup>/memory.max
7. Container CPU throttling: cat /sys/fs/cgroup/<cgroup>/cpu.stat | grep throttled
8. System-wide pressure: cat /proc/pressure/memory
9. Network tunable: sysctl net.core.rmem_max
10. Check all kernel tunables: sysctl -a 2>/dev/null | grep <keyword>
Key Takeaways
- ✓/proc files have zero bytes on disk. The kernel generates content at read time via seq_file or simple_read callbacks. A read of /proc/meminfo calls meminfo_proc_show() which formats current values from global variables. Nothing is cached between reads. Every open+read gets a fresh snapshot.
- ✓Read cost varies by orders of magnitude. /proc/loadavg: read a few cached integers (nanoseconds). /proc/PID/smaps: walk every page table entry for every VMA in the process (milliseconds for large processes). /proc/net/tcp: iterate the entire TCP established hash table under a spinlock (tens of milliseconds on busy servers with 100k connections).
- ✓Writing to /proc/sys files calls a kernel handler that validates the input and updates an in-memory variable. There is no file I/O. Writing "1" to /proc/sys/net/ipv4/ip_forward calls devinet_sysctl_forward() which toggles a flag in the network stack. The sysctl command, echo redirect, and direct write() all do exactly the same thing.
- ✓/sys follows the kobject model. Each directory represents a kernel object (device, bus, driver). Attributes are files that map to show/store callbacks on the object. Creating a new network interface creates a new directory in /sys/class/net/ with attribute files. This is how udev discovers devices: it watches /sys via netlink for kobject events.
- ✓In cgroup v2, the interface files (memory.max, cpu.max) live in /sys/fs/cgroup. Each cgroup is a directory. The hierarchy is the filesystem hierarchy. Moving a process between cgroups is done by writing its PID to the target cgroup's cgroup.procs file. Listing members is reading cgroup.procs.
Common Pitfalls
- ✗Scraping /proc/PID/smaps for hundreds of processes at high frequency. Each read walks the entire page table. For a process with 10 GB of memory (2.5M pages), this takes 5-15ms. At 500 processes every 10 seconds, that is 2.5 to 7.5 seconds of CPU time per scrape cycle. Use /proc/PID/smaps_rollup for an aggregated summary, or /proc/PID/status for VmRSS if a detailed per-VMA breakdown is not needed.
- ✗Reading /proc/net/tcp on servers with 100k+ connections. The kernel iterates the TCP hash table under a lock, serializing all readers. On a busy load balancer, frequent reads of /proc/net/tcp cause lock contention that increases TCP latency. Use ss (which uses netlink SOCK_DIAG) instead of parsing /proc/net/tcp directly.
- ✗Assuming /proc/PID files are consistent across reads. A process can exit between opening /proc/PID/status and reading it. The read returns stale data or ESRCH. Multi-file reads from the same PID directory are not atomic: /proc/PID/stat and /proc/PID/status can reflect different moments. For monitoring, this is usually acceptable. For debugging, it means values might not add up perfectly.
- ✗Writing to cgroup files with echo and forgetting that some files require specific formats. cpu.max takes "quota period" (e.g., "100000 100000" for 100%). memory.max takes bytes or "max" for unlimited. Writing an invalid format silently fails or returns EINVAL. Always check the return value of the write.
Reference
In One Line
/proc and /sys are the kernel's live API. Every read generates fresh data, but /proc/PID/smaps costs milliseconds while /proc/PID/status costs microseconds, and that difference matters at 500 containers.