Audit Framework & Logging — Security & Access Control
Difficulty: Intermediate
Kernel-level event recording that hooks syscall entry/exit, LSM decisions, and filesystem watches through a dedicated kernel thread (kauditd). Every audited syscall produces a bundle of records -- SYSCALL, CWD, PATH, PROCTITLE -- tied by a shared event ID, buffered in a kernel backlog (default 8192 entries), and shipped to userspace auditd over NETLINK_AUDIT. The audit UID (auid), stamped once by PAM at login, follows a session through sudo, su, and setuid without changing. Immutable mode (-e 2) locks the rules down until reboot.
System Calls for Audit Framework & Logging
Key Components in Audit Framework & Logging
- kauditd (kernel audit daemon): Kernel thread that buffers audit events generated by LSM hooks, syscall entry/exit, and file watch triggers. Events are sent to userspace via a netlink socket (NETLINK_AUDIT). If the backlog exceeds audit_backlog_limit, the kernel can block syscalls or panic (configurable).
- auditd (userspace audit daemon): Receives events from kauditd via netlink, writes to /var/log/audit/audit.log, manages log rotation, and can forward events to remote aggregators. Configured via /etc/audit/auditd.conf. Runs as a privileged daemon that cannot be killed by regular signals.
- auditctl / audit.rules: Configures audit rules: file/directory watches (-w /etc/passwd -p wa), syscall auditing (-a always,exit -S execve), and filter conditions (-F uid=0). Rules in /etc/audit/rules.d/*.rules are loaded at boot. 'auditctl -l' lists active rules.
- audit_context (per-task): Each task_struct has an audit_context pointer that accumulates audit data for the current syscall: arguments, return value, file paths, cwd, and subject/object labels. At syscall exit, if a rule matches, the accumulated context is emitted as a multi-record audit event.
Key Points for Audit Framework & Logging
- The audit UID (auid / loginuid) is the framework's killer feature. It is set once by PAM at login, written to /proc/self/loginuid, and never changes -- not through sudo, su, setuid, or container entry. When someone runs a destructive command as root, the auid field tells you which human actually logged in. A sketch of reading it follows this list.
- Audit rules support precise filtering: syscall number (-S), architecture (-F arch=b64), UID/GID (-F auid=1000), success/failure (-F success=0), file path (-w /etc/shadow), permissions (-p rwxa), and SELinux context (-F subj_type=httpd_t). Combine multiple filters in one rule to avoid noise.
- Each audited syscall generates a multi-record event: SYSCALL (core data) + CWD (working directory) + PATH (each file touched) + PROCTITLE (command line) + optional EXECVE and SOCKADDR records. All share the same event ID. ausearch correlates them; aureport summarizes by category.
- File watches (-w /etc/passwd -p wa -k identity) trigger on write and attribute changes using kernel hooks. They capture who changed the file, when, and from what process. This is the foundation for detecting unauthorized modifications to critical system files.
- The audit backlog is a kernel buffer that can overflow under load. The default limit is 8192 events. When exceeded, the kernel either blocks syscalls (slowing the system), drops events (losing audit data), or panics (in high-security environments). Tuning the backlog limit and filtering rules is essential for production.
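The auid is readable at /proc/self/loginuid. A minimal C sketch of inspecting it -- the same value auditd stamps into every SYSCALL record; an unset auid reads back as 4294967295, i.e. (uid_t)-1:

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/self/loginuid", "r");
        if (!f) { perror("fopen"); return 1; }
        unsigned int auid;
        if (fscanf(f, "%u", &auid) == 1)
            printf("auid = %u%s\n", auid,
                   auid == 4294967295u ? " (unset)" : "");  /* value survives sudo/su */
        fclose(f);
        return 0;
    }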
Common Mistakes with Audit Framework & Logging
- Mistake: Adding overly broad rules like '-a always,exit -S all' that audit every syscall. Reality: This generates millions of events per minute, fills the log in seconds, and can overflow the kernel backlog buffer. Start with specific syscalls (execve, connect, openat) and targeted file paths.
- Mistake: Not setting the -k (key) field on audit rules. Reality: Without keys, searching millions of events requires parsing full record content. Keys act as tags -- 'ausearch -k identity' instantly finds all events from the /etc/passwd watch. Always tag your rules.
- Mistake: Forgetting that auditctl rules do not survive reboot. Reality: 'auditctl -w /etc/shadow -p wa' is temporary. Persistent rules go in /etc/audit/rules.d/ (e.g., 50-identity.rules). Run 'augenrules --load' to activate them.
- Mistake: Ignoring performance impact of syscall audit rules. Reality: Each audited syscall adds about 5-10 microseconds of overhead for record generation. On a server making 100K syscalls/sec, broad rules add significant latency. Use filters (-F auid>=1000 to skip system accounts) to reduce volume.
Related Topics
SELinux & AppArmor, File Permissions, Ownership & ACLs, Linux Capabilities, System Calls: User to Kernel Transition
BIO & Request Queues — Storage & Filesystems
Difficulty: Advanced
Every block I/O operation in Linux passes through struct bio and the multi-queue block layer. Understanding this path explains why NVMe drives need blk-mq to reach their IOPS potential, how the kernel merges small writes into large requests, and where device mapper intercepts BIOs to split them across LVM stripes.
System Calls for BIO & Request Queues
- read
- write
- preadv2
- pwritev2
- io_submit
- io_uring_enter
Key Components in BIO & Request Queues
- struct bio: The fundamental unit of block I/O in the kernel. Represents a single I/O operation targeting a contiguous range of disk sectors. Contains a bi_iter (sector offset, size, index into bio_vec), a pointer to the target block_device, I/O flags (REQ_OP_READ, REQ_OP_WRITE, REQ_SYNC, REQ_FUA), and the bi_end_io completion callback. BIOs are allocated from a mempool (bioset) to guarantee forward progress under memory pressure.
- struct bio_vec: A scatter-gather element within a bio. Each bio_vec is a (page, offset, length) tuple pointing to one segment of data in memory. A single bio can chain multiple bio_vecs to describe a non-contiguous memory layout mapped to contiguous disk sectors. The maximum number of bio_vecs per bio is BIO_MAX_VECS (256), limiting a single bio to roughly 1 MB with 4 KB pages.
- struct request: A merged unit of I/O submitted to the device driver. Contains a linked list of BIOs covering a contiguous sector range. The I/O scheduler and plugging mechanism merge adjacent BIOs into a single request to reduce per-I/O overhead. Each request carries a tag for the hardware queue, enabling tag-based completion without scanning.
- blk-mq (Multi-Queue Block Layer): Replaces the legacy single-queue block layer. Creates per-CPU software staging queues (ctx) and maps them to hardware dispatch queues (hctx). The mapping is configurable: 1:1 (one ctx per hctx, typical for NVMe), N:1 (multiple ctx sharing one hctx, typical for SATA), or custom. Each hctx has its own tag set for lock-free request allocation.
- Software staging queues (blk_mq_ctx): Per-CPU queues where BIOs are converted into requests and optionally scheduled. If an I/O scheduler is active (mq-deadline, bfq, kyber), requests pass through it for reordering or throttling. If no scheduler is configured (none), requests go directly from the ctx to the hctx. NVMe devices typically use "none" because the device handles ordering internally.
- Hardware dispatch queues (blk_mq_hw_ctx): Represent the actual submission queues exposed by the hardware. The device driver provides queue_rq() to accept requests from the hctx and push them to the device. For NVMe, each hctx maps to an NVMe submission queue and its paired completion queue. The hctx manages a tag bitmap for request tracking and completion routing.
Key Points for BIO & Request Queues
- A struct bio describes a single contiguous I/O operation on disk but can reference scattered pages in memory through bio_vec entries. A struct request aggregates multiple contiguous BIOs into a single unit for the driver. The bio is the filesystem-to-block interface; the request is the block-to-driver interface.
- The blk-mq layer eliminates the single-queue bottleneck that capped legacy block I/O at roughly 500K IOPS regardless of device capability. By mapping per-CPU software queues to per-device hardware queues, it scales linearly with core count. A 64-core server with an NVMe device goes from 500K IOPS (single queue, lock-bound) to 1M+ IOPS (multi-queue, lock-free).
- BIO merging happens at two levels. First, the plug list: within a single syscall, the kernel accumulates BIOs in a per-task plug and merges adjacent ones before releasing them. Second, the I/O scheduler: if enabled, it reorders and merges requests in the software staging queue. For random I/O workloads, merging provides no benefit, and the "none" scheduler avoids the overhead entirely.
- The bio split mechanism (bio_split()) is critical for device mapper and RAID. When a bio crosses a stripe boundary or chunk size limit, the block layer splits it into two BIOs at the boundary. The split bio shares the original's pages via bio_vec references -- no data copying occurs. The original bio's bi_iter is adjusted to cover only the remaining range.
- Tag-based completion in blk-mq assigns each in-flight request a unique integer tag from the hctx tag bitmap. When the device signals completion, it returns the tag, and the kernel looks up the request directly by tag index. No scanning of a completion list is needed. This is O(1) per completion, essential at 1M+ IOPS.
Common Mistakes with BIO & Request Queues
- Using synchronous I/O from too few threads against NVMe. Each synchronous read() or write() blocks the thread until the single I/O completes. With 16 threads and 200 µs device latency, the maximum throughput is 16 / 0.0002 = 80K IOPS, regardless of device capability. Either increase thread count to hundreds or switch to io_uring / libaio for asynchronous submission (see the liburing sketch after this list).
- Running an I/O scheduler on NVMe devices. mq-deadline or bfq add latency and CPU overhead for reordering that NVMe firmware handles internally. For NVMe, set the scheduler to "none" via echo none > /sys/block/nvme0n1/queue/scheduler. Reserve mq-deadline for rotational drives where seek optimization matters.
- Assuming a single submission thread can saturate a multi-queue device. blk-mq maps software queues to CPUs. If all I/O originates from one CPU, only one hardware queue receives work. The other 63 queues sit idle. Spread I/O across CPUs using multiple threads, io_uring with SQPOLL, or multiple file descriptors with separate aio contexts.
- Ignoring the max_sectors_kb and max_segments limits. If a bio exceeds the device's maximum transfer size, the block layer splits it. Frequent splitting adds overhead. Aligning application I/O size to /sys/block/<dev>/queue/max_sectors_kb avoids unnecessary splits.
- Disabling plug merging by calling blk_finish_plug() too early or issuing O_DIRECT writes one page at a time. The plug batches BIOs from a single syscall, giving the block layer a window to merge. Issuing tiny, unplugged writes defeats this optimization and inflates the number of device I/Os needed for the same data.
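As a sketch of the asynchronous alternative, the following uses liburing (an assumption: the library is installed; link with -luring). It submits a single read, but the point of the API is that many SQEs can be queued before one io_uring_submit(), keeping the device's hardware queues full from a single thread:

    #include <liburing.h>
    #include <unistd.h>

    /* One async read, heavily simplified; real code keeps dozens in flight. */
    int read_async(int fd, void *buf, unsigned len, off_t off) {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0) return -1;

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, len, off);  /* one SQE per I/O */
        io_uring_submit(&ring);                      /* one syscall can submit many SQEs */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);              /* completion arrives independently */
        int res = cqe->res;                          /* bytes read, or -errno */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return res;
    }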
Related Topics
Page Cache & Block I/O, I/O Models: Blocking, Non-Blocking, Async, io_uring: Modern Async I/O, Disk I/O Scheduling, Virtual File System (VFS), Zero-Copy Networking (sendfile, splice)
The Linux Boot Process from Power-On to Userspace — System Initialization
Difficulty: Intermediate
The path from pressing the power button to a login prompt crosses firmware, bootloader, kernel, and init system. Each stage hands off to the next in a chain of trust. Understanding where time is spent in this chain is the difference between a 45-second boot and a 3-second boot.
System Calls for The Linux Boot Process from Power-On to Userspace
- execve
- mount
- pivot_root
- clone
- reboot
Key Components in The Linux Boot Process from Power-On to Userspace
- UEFI/BIOS Firmware: The first code that runs after power-on. Initializes CPU, memory controller, and PCI bus. Runs POST (Power-On Self-Test). Locates the boot device by checking the EFI System Partition (UEFI) or the MBR (legacy BIOS). Loads the bootloader into memory and hands off execution.
- GRUB Bootloader: Stage 2 bootloader that reads its config (/boot/grub/grub.cfg), presents a menu if configured, loads the kernel image (vmlinuz) and initramfs into memory, passes the kernel command line parameters (root=, init=, quiet), and transfers control to the kernel entry point.
- Kernel (vmlinuz): Compressed kernel image that self-extracts into memory. Initializes the memory manager, scheduler, and interrupt handlers. Detects hardware via ACPI tables and device tree. Mounts the initramfs as a temporary root filesystem. Executes /init from the initramfs to begin early userspace.
- initramfs (Initial RAM Filesystem): A cpio archive loaded into memory containing the minimum set of tools and kernel modules needed to mount the real root filesystem. Loads storage drivers, assembles RAID arrays, unlocks LUKS encryption, activates LVM volumes. Once the real root is accessible, calls pivot_root() or switch_root to hand off.
- init / systemd (PID 1): The first userspace process. Mounts filesystems from /etc/fstab, starts services in dependency order, sets up networking, configures hostname and locale. As PID 1, it reaps orphaned child processes and handles system shutdown/reboot. systemd parallelizes service startup using socket and D-Bus activation.
Key Points for The Linux Boot Process from Power-On to Userspace
- The boot sequence is a chain of handoffs: firmware loads the bootloader, the bootloader loads the kernel and initramfs, the kernel mounts the initramfs and runs /init, and /init mounts the real root and execs the real init (systemd). Each stage trusts the output of the previous one.
- UEFI Secure Boot adds cryptographic verification to this chain. The firmware verifies the bootloader's signature, the bootloader verifies the kernel's signature, and the kernel can verify module signatures. A compromised bootloader cannot load a tampered kernel if Secure Boot is enforced.
- The initramfs exists because the kernel needs storage drivers to read the root filesystem, but those drivers live on the root filesystem. The initramfs breaks this chicken-and-egg problem by bundling the necessary drivers into a cpio archive that the bootloader loads alongside the kernel.
- systemd-analyze blame shows exactly which services are slow. systemd-analyze critical-chain shows the longest dependency chain. These two commands reveal whether boot time is spent waiting on a single slow service or blocked by a deep dependency graph.
- PID 1 has special kernel treatment. It cannot be killed by signals unless it explicitly installs handlers. If PID 1 exits, the kernel panics. This is why containers need a proper init process -- the application as PID 1 misses zombie reaping and signal forwarding.
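A minimal init of the kind tini and dumb-init provide can be sketched in a few lines of C -- install a handler (PID 1 gets no default signal behavior), forward the signal, reap everything. Error handling is omitted and the argv handling is illustrative:

    #include <signal.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static pid_t child;

    static void on_term(int sig) { if (child > 0) kill(child, sig); }

    int main(int argc, char **argv) {
        signal(SIGTERM, on_term);          /* explicit handler: required for PID 1 */
        child = fork();
        if (child == 0) {
            execvp(argv[1], &argv[1]);     /* the real application */
            _exit(127);
        }
        int status;
        pid_t pid;
        while ((pid = wait(&status)) > 0)  /* reap the child and any orphans */
            if (pid == child)
                break;
        return 0;
    }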
Common Mistakes with The Linux Boot Process from Power-On to Userspace
- Assuming a slow boot means the kernel is slow. In most cases, the kernel initializes in 1-3 seconds. The actual bottlenecks are UEFI firmware enumeration (especially USB and network option ROMs), oversized initramfs images loading unnecessary drivers, and systemd units with long startup times or deep dependency chains.
- Including every possible driver in the initramfs for "compatibility." A generic distro initramfs can reach 80-100 MB because it bundles drivers for every storage controller, filesystem, and encryption scheme. On a cloud VM that only needs virtio_blk, virtio_net, and ext4, a tailored initramfs is under 10 MB and loads in a fraction of the time.
- Running application processes as PID 1 inside containers without a proper init. The application will not reap zombie children, will not receive SIGTERM properly (PID 1 ignores signals by default unless a handler is installed), and cannot perform graceful shutdown. Use tini, dumb-init, or Docker's --init flag.
- Ignoring the GRUB timeout. A default GRUB_TIMEOUT of 5 seconds means every single boot waits 5 seconds for menu input that nobody provides on a headless server. Set GRUB_TIMEOUT=0 in /etc/default/grub for servers and cloud VMs.
Related Topics
Process Lifecycle (fork/exec/wait), Virtual Memory & Address Spaces, Virtual File System (VFS), Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
BPF Maps & Ring Buffer — Advanced Tracing
Difficulty: Advanced
Hash maps, arrays, ring buffers, and per-CPU variants -- the data structures that let eBPF programs communicate with userspace and with each other. The choice of map type determines whether a monitoring tool drops events, wastes CPU on lock contention, or scales linearly across cores.
System Calls for BPF Maps & Ring Buffer
Key Components in BPF Maps & Ring Buffer
- BPF_MAP_TYPE_HASH: General-purpose key-value store with arbitrary key sizes. Supports insertion, deletion, and lookup in O(1) average time using a hash table with per-bucket spin locks. Used for connection tracking, flow tables, and any scenario where entries are dynamically created and destroyed.
- BPF_MAP_TYPE_ARRAY: Fixed-size array indexed by integer key from 0 to max_entries-1. All entries are pre-allocated at map creation, so lookups never fail for valid indices. Ideal for configuration data, lookup tables, and global counters where the key space is known ahead of time.
- BPF_MAP_TYPE_PERCPU_HASH / BPF_MAP_TYPE_PERCPU_ARRAY: Per-CPU variants of hash and array maps. Each CPU core gets its own private copy of every value. Eliminates all lock contention and cache-line bouncing on writes. Userspace reads back NR_CPUS copies and aggregates them. The standard choice for high-frequency counters and statistics.
- BPF_MAP_TYPE_LRU_HASH: Hash map with built-in LRU eviction. When the map reaches max_entries, the least recently used entry is automatically evicted to make room. Used for caches, rate limiters, and connection tables where old entries should age out without explicit cleanup.
- BPF_MAP_TYPE_RINGBUF: Single shared ring buffer replacing the older perf event array. One contiguous memory region visible to all CPUs. A lock-free reserve-commit protocol allows multiple CPUs to write concurrently without per-CPU buffers. Supports both polling and callback-based consumption in userspace. Introduced in Linux 5.8.
- BPF_MAP_TYPE_PERF_EVENT_ARRAY: The older mechanism for streaming events to userspace. Creates one ring buffer per CPU. Each BPF program calls perf_event_output() to push data into the calling CPU's buffer. Userspace must poll all CPU buffers independently. Superseded by BPF_MAP_TYPE_RINGBUF for most use cases.
Key Points for BPF Maps & Ring Buffer
- BPF ring buffer (BPF_MAP_TYPE_RINGBUF) is strictly superior to perf_event_array for event streaming. It uses a single shared buffer instead of per-CPU buffers, which means better memory efficiency (one buffer sized to aggregate throughput, not N buffers each sized for peak per-CPU throughput) and simpler userspace consumption (one fd to poll instead of N).
- Per-CPU maps are not optional for high-frequency counters. A shared hash map with 10 million updates per second across 64 cores spends more time on spin lock contention than on actual work. Per-CPU variants eliminate all synchronization from the write path. The cost is NR_CPUS copies of each value in memory and a userspace aggregation step on read (sketched after this list).
- LRU hash maps solve the stale entry problem that plagues long-running BPF programs. A connection tracking map without eviction grows until it hits max_entries and then fails all inserts. LRU maps evict cold entries automatically, but the eviction is approximate -- under heavy churn, hot entries can be evicted if the LRU lists are not perfectly maintained. Size the map at 2-3x expected steady-state entries.
- Map pinning to bpffs (/sys/fs/bpf/) decouples map lifetime from program lifetime. A pinned map survives program restart, allowing a new version of a BPF program to attach to existing state without losing connection tracking entries or counters. Cilium relies on this for seamless datapath upgrades.
- The bpf() syscall is the single entry point for all map operations from userspace: BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM, BPF_MAP_DELETE_ELEM, BPF_MAP_GET_NEXT_KEY. From BPF program context, maps are accessed via helper functions like bpf_map_lookup_elem() that the verifier validates at load time.
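A sketch of the userspace aggregation step for a per-CPU map using libbpf's syscall wrappers; map_fd, the u32 key, and the u64 counter layout are assumptions for illustration:

    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* For per-CPU maps, one lookup returns one value per possible CPU. */
    uint64_t read_counter(int map_fd, uint32_t key) {
        int ncpus = libbpf_num_possible_cpus();
        uint64_t *vals = calloc(ncpus, sizeof(*vals));
        uint64_t sum = 0;
        if (vals && bpf_map_lookup_elem(map_fd, &key, vals) == 0)
            for (int i = 0; i < ncpus; i++)
                sum += vals[i];            /* aggregate the per-CPU copies */
        free(vals);
        return sum;
    }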
Common Mistakes with BPF Maps & Ring Buffer
- Using perf_event_output when BPF ring buffer is available. perf event arrays allocate one buffer per CPU, each sized for worst-case throughput. On a 128-core machine with 64 KB per-CPU buffers, that is 8 MB of ring buffer memory fragmented across 128 independent buffers. The BPF ring buffer achieves the same throughput with a single 256 KB buffer and is far less prone to drops under asymmetric load where some CPUs are hot and others are idle.
- Forgetting to use per-CPU maps for frequently updated counters. A regular BPF_MAP_TYPE_HASH protects each bucket with a spin lock. At 1 million updates per second on a 64-core machine, lock contention dominates. The fix is BPF_MAP_TYPE_PERCPU_HASH, which eliminates all locking. The tradeoff: reads require summing NR_CPUS values in userspace.
- Setting max_entries too low on LRU hash maps. When the map is full and churn is high, the LRU eviction runs on the hot path of every insert. If max_entries matches the expected steady state exactly, brief traffic spikes cause eviction storms that remove entries still in active use. Size LRU maps at 2-3x the expected working set.
- Not pinning maps that should survive program restarts. Without pinning, a BPF map is destroyed when the last program referencing it is unloaded. Restarting a Cilium agent without pinned maps drops all connection tracking state, causing thousands of connections to reset. Always pin maps that hold persistent state to /sys/fs/bpf/.
- Blocking in the userspace ring buffer consumer. The BPF ring buffer delivers events in order with a callback or epoll interface. If the consumer blocks on slow I/O (writing events to disk synchronously, making network calls), the ring buffer fills and events are lost. Consume into an in-memory queue first, then drain the queue asynchronously.
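A minimal libbpf consumer illustrating the non-blocking pattern; the enqueue call in the callback is a hypothetical placeholder for an in-memory queue drained by another thread:

    #include <bpf/libbpf.h>

    static int on_event(void *ctx, void *data, size_t len) {
        /* enqueue_for_async_processing(data, len);  hypothetical -- must not block */
        return 0;
    }

    void consume(int ringbuf_map_fd) {
        struct ring_buffer *rb =
            ring_buffer__new(ringbuf_map_fd, on_event, NULL, NULL);
        if (!rb) return;
        while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
            ;                              /* on_event fires once per record */
        ring_buffer__free(rb);
    }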
Related Topics
eBPF: Programmable Kernel, XDP & AF_XDP: Kernel-Bypass Networking, Perf Events & Performance Counters, System Calls: User to Kernel Transition, Kernel Modules & Device Drivers
Linux Capabilities — Security & Access Control
Difficulty: Intermediate
Root is not one privilege -- it is 40+ discrete permissions bundled under UID 0. Each privileged operation triggers a capable() check against the calling thread's effective set. Five per-thread sets (permitted, effective, inheritable, ambient, bounding) govern what a thread holds, what it can activate, and what it can hand to children. Drop a capability from the bounding set and it is gone for good -- no descendant can ever get it back.
System Calls for Linux Capabilities
Key Components in Linux Capabilities
- Permitted set (p): The maximum set of capabilities the thread CAN use. A capability must be in the permitted set before it can be raised into the effective set. The permitted set can only shrink: once a capability is dropped from permitted, it's gone forever for that thread.
- Effective set (e): The capabilities the kernel actually checks for privileged operations. A thread raises capabilities into the effective set from the permitted set when needed and drops them when not (capability-aware programs do this). For non-capability-aware legacy programs, effective = permitted.
- Inheritable set (i): Capabilities that can be inherited across execve(), but only if the executed file ALSO has the capability in its inheritable set. This two-key requirement makes inheritable sets rarely useful. Ambient capabilities (added in kernel 4.3) fix this limitation.
- Ambient set (a): Capabilities preserved across execve() for non-setuid, non-file-capability binaries. If a service manager launches a process with CAP_NET_BIND_SERVICE in the ambient set, the child retains it without needing file capabilities. systemd's AmbientCapabilities= uses this.
Key Points for Linux Capabilities
- There are 40+ capabilities in modern kernels, but a handful dominate real-world usage: CAP_NET_BIND_SERVICE (bind ports below 1024), CAP_NET_RAW (raw sockets for ping/tcpdump), CAP_SYS_ADMIN (the dangerous catch-all that is basically mini-root), CAP_DAC_OVERRIDE (bypass file permissions), and CAP_SETUID/CAP_SETGID (change identity).
- The bounding set is an irreversible ceiling. Drop CAP_SYS_ADMIN from it, and no child process can ever gain that capability again -- not through setuid binaries, not through file capabilities, not through anything. This is how container runtimes permanently lock the door on dangerous privileges (see the prctl sketch after this list).
- File capabilities replace setuid root for specific use cases. 'setcap cap_net_bind_service=ep /usr/bin/myserver' lets a binary bind port 80 without ever running as root. Much safer than chmod u+s, because the binary only gets the one permission it needs.
- CAP_SYS_ADMIN is the 'new root.' It controls mount, chroot, sethostname, BPF, quotas, namespaces, and dozens of other operations. A process with CAP_SYS_ADMIN can do almost anything root can. Container runtimes drop it by default for exactly this reason.
- When a setuid-root binary runs, the process gets ALL capabilities in its permitted and effective sets. When it drops to a non-root UID, it keeps the capabilities in its permitted set unless it explicitly drops them. That is how ping can run setuid-root, drop to your UID, and still hold CAP_NET_RAW for raw sockets.
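A minimal sketch of an irreversible bounding-set drop via prctl(); it assumes the caller holds CAP_SETPCAP:

    #include <sys/prctl.h>
    #include <linux/capability.h>
    #include <stdio.h>

    int main(void) {
        if (prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN, 0, 0, 0) == -1) {
            perror("PR_CAPBSET_DROP");     /* requires CAP_SETPCAP */
            return 1;
        }
        /* From here on, no execve -- setuid-root or file capability --
           can ever hand CAP_SYS_ADMIN to this thread or its descendants. */
        return 0;
    }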
Common Mistakes with Linux Capabilities
- Mistake: Granting CAP_SYS_ADMIN to a container 'because it needs to mount filesystems.' Reality: CAP_SYS_ADMIN is nearly equivalent to full root. Use bind mounts from the host, or run the specific operation in an init container with a narrow capability set.
- Mistake: Setting file capabilities without understanding version semantics. Reality: File capabilities have a version field. v2 (Linux 2.6.25+) supports only permitted/effective/inheritable. v3 (Linux 4.14+) adds namespace-aware root_id. Mismatched versions silently fail -- no error, just no capabilities.
- Mistake: Dropping from the effective set but not the permitted set, thinking the process is restricted. Reality: The process (or a compromised library) can raise the capability back into effective at any time. Drop from permitted for permanent restriction.
- Mistake: Forgetting that capabilities are per-thread, not per-process. Reality: A multithreaded program that drops capabilities in one thread still has them in all others. Each thread has its own effective/permitted/inheritable sets. Use prctl(PR_SET_KEEPCAPS) carefully across setuid transitions.
Related Topics
File Permissions, Ownership & ACLs, Seccomp: Sandboxing System Calls, SELinux & AppArmor, Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
cgroups v2 (Control Groups) — Kernel Internals
Difficulty: Intermediate
How the kernel puts CPU, memory, and I/O on a leash for groups of processes. Every Docker container, Kubernetes pod, and systemd service sits inside a cgroup. The limits are enforced by the kernel itself -- container runtimes just write numbers to files.
System Calls for cgroups v2 (Control Groups)
Key Components in cgroups v2 (Control Groups)
- cgroup_subsys (controllers): Each controller (cpu, memory, io, pids, cpuset, rdma, hugetlb) manages one resource type. In v2, controllers are attached to cgroup nodes and propagate constraints down the hierarchy. A controller can only be enabled in a child if it's enabled in the parent.
- cgroup.subtree_control: A file in each cgroup directory that determines which controllers are enabled for children. Writing '+memory +cpu' enables those controllers in child cgroups. The v2 mechanism that replaces v1's per-controller mount points.
- memory.max / memory.high: memory.max is a hard limit: the OOM killer activates when a cgroup hits it. memory.high is a soft limit: the kernel throttles allocations and increases reclaim pressure, giving the workload a chance to release memory before hitting the hard wall (both are set in the sketch after this list).
- css_set (cgroup subsystem state): Kernel structure linking each task to its cgroup membership. Every task_struct has a pointer to a css_set, which contains an array of cgroup_subsys_state pointers, one per active controller.
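A sketch of the "write numbers to files" interface, assuming cgroup v2 is mounted at /sys/fs/cgroup and the demo directory is creatable; names and limits are illustrative:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_file(const char *path, const char *val) {
        int fd = open(path, O_WRONLY);
        if (fd < 0) return -1;
        ssize_t n = write(fd, val, strlen(val));
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int main(void) {
        mkdir("/sys/fs/cgroup/demo", 0755);
        /* soft limit first (throttle + reclaim), hard limit above it */
        write_file("/sys/fs/cgroup/demo/memory.high", "450M");
        write_file("/sys/fs/cgroup/demo/memory.max",  "500M");
        /* writing a PID to cgroup.procs moves that process under the limits */
        char pid[32];
        snprintf(pid, sizeof(pid), "%d", getpid());
        return write_file("/sys/fs/cgroup/demo/cgroup.procs", pid);
    }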
Key Points for cgroups v2 (Control Groups)
- The 'no internal processes' rule eliminates ambiguity: a cgroup with children cannot itself contain processes (except root). In v1, a parent could have both processes and children with different limits. v2 forces you to leaf-node your processes.
- The memory controller tracks everything -- anonymous pages, page cache, kernel memory (slab, page tables). memory.current shows live usage, memory.stat breaks it down. In v2, kernel memory is charged to the unified counter by default.
- cpu.weight (default 100) replaced v1's CFS shares. For hard limits, cpu.max takes 'quota period' in microseconds -- '50000 100000' means 50% of one CPU. Simple math, direct control.
- Buffered writes bypass io.max unless cgroup-aware writeback is enabled. Only direct I/O is immediately throttled. This trips people up constantly when they set I/O limits and wonder why writes are not being capped.
- systemd maps its unit hierarchy directly to the cgroup tree. system.slice/nginx.service becomes /sys/fs/cgroup/system.slice/nginx.service/. MemoryMax= in a unit file writes to memory.max. No cgroup API needed.
Common Mistakes with cgroups v2 (Control Groups)
- Mistake: Mixing cgroups v1 and v2 controllers. Reality: A controller can only be used in v1 OR v2, not both. Hybrid mode creates confusion. Modern systems should use unified v2 (systemd defaults to it since v248).
- Mistake: Setting memory.max without memory.high. Reality: The process gets OOM-killed instantly with no warning. Set memory.high to ~90% of memory.max to trigger throttling first, giving the app time to respond.
- Mistake: Expecting io.max to limit buffered writes. Reality: Buffered writes go through the page cache and are attributed at writeback time. Only direct I/O is immediately throttled. Enable cgroup writeback for correct accounting.
- Mistake: Not understanding cgroup delegation. Reality: A non-root user can manage a subtree only if they own the directory AND cgroup.procs, cgroup.subtree_control, and cgroup.threads files. systemd handles this via Delegate=yes.
Related Topics
Linux Namespaces (PID, NET, MNT, UTS, IPC, USER), Process Scheduling (CFS), Virtual Memory & Address Spaces, OOM Killer & Memory Pressure
chroot & pivot_root — File Systems & I/O
Difficulty: Intermediate
chroot flips a single pointer in the process's fs_struct. The old root is still reachable through saved file descriptors, cwd tricks, or /proc -- a root process escapes in 4 lines of C. pivot_root operates at the mount namespace level, swapping the root mount and shoving the old one to a put_old directory. After umount2(put_old, MNT_DETACH), no path, fd, or /proc reference back to the host survives. Every production container runtime uses pivot_root, not chroot.
System Calls for chroot & pivot_root
- chroot
- pivot_root
- mount
- umount2
- unshare
Key Components in chroot & pivot_root
- chroot(const char *path): Changes the calling process's root directory to path; subsequent absolute path lookups start from this new root. Does NOT change the current working directory, which is the basis for classic chroot escape attacks.
- pivot_root(new_root, put_old): Moves the current root mount to put_old and makes new_root the new root mount. Requires both arguments to be mount points within the caller's mount namespace. The old root can then be fully unmounted, leaving no reference to the host filesystem.
- unshare(CLONE_NEWNS): Creates a new mount namespace for the calling process. Required before pivot_root so that mount operations (including the root swap) are invisible to other processes and do not affect the host mount tree.
- umount2(target, MNT_DETACH): Lazily unmounts a filesystem. Used after pivot_root to detach the old root; MNT_DETACH ensures the unmount succeeds even if processes still have references, cleaning up once all references are dropped.
Key Points for chroot & pivot_root
- chroot is a filesystem trick, not a security boundary. It does not create namespaces, does not restrict syscalls, and does not stop a root process from escaping. The classic escape: mkdir(d); chroot(d); chdir('../../..'); chroot('.') -- it works because chroot never changes the current working directory, so the process can keep walking '..' up and out of the jail.
- pivot_root operates at the mount namespace level and requires CLONE_NEWNS. After pivot_root, the old root lands at put_old and MUST be unmounted. Skip the umount and you have a path straight back to the host filesystem inside the container.
- The runc container init sequence is: clone(CLONE_NEWNS|CLONE_NEWPID|..), mount overlay on new root, pivot_root(new_root, old_root), umount2(old_root, MNT_DETACH), then exec the container entrypoint. After this, no file descriptor, no path, and no /proc reference to the host filesystem survives (the core of this sequence is sketched in C after this list).
- chroot has a simpler API but a weaker security model. pivot_root has a more complex API (mount namespace, mount point requirements) but provides actual isolation. In interviews, knowing WHY containers use pivot_root instead of chroot demonstrates deep understanding of Linux security.
- After pivot_root plus umount of old root, /proc/1/root inside the container points to the overlayfs mount. There is zero reference to the host's / in the container's mount table. Verify with cat /proc/self/mountinfo.
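A condensed C sketch of that root swap, assuming root privileges and a prepared filesystem tree at new_root; most error handling is omitted:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int enter_root(const char *new_root) {
        if (unshare(CLONE_NEWNS) < 0) return -1;           /* private mount namespace */
        mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL); /* stop propagation to host */
        mount(new_root, new_root, NULL, MS_BIND, NULL);    /* new_root must be a mount point */

        char put_old[256];
        snprintf(put_old, sizeof(put_old), "%s/old_root", new_root);
        mkdir(put_old, 0700);
        if (syscall(SYS_pivot_root, new_root, put_old) < 0) return -1;

        chdir("/");                                        /* cwd must not point at old root */
        umount2("/old_root", MNT_DETACH);                  /* cut the last path to the host */
        rmdir("/old_root");
        return 0;
    }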
Common Mistakes with chroot & pivot_root
- Mistake: Using chroot for security isolation in production. Reality: chroot was designed for build environments (like debootstrap), not security boundaries. Any process with CAP_SYS_CHROOT can escape with the open-dir-chroot-fchdir technique. It was never meant to contain adversaries.
- Mistake: Forgetting to umount the old root after pivot_root. Reality: if put_old remains mounted, any process in the container that navigates to that path has full access to the host filesystem. runc unmounts it immediately and recursively. Skip this step and your container is not isolated.
- Mistake: Calling pivot_root without creating a mount namespace first. Reality: pivot_root modifies the root mount of the current namespace. Without CLONE_NEWNS, you would change the root for ALL processes sharing the namespace -- including the host init process. This would be catastrophic.
- Mistake: Not bind-mounting the new root onto itself before pivot_root. Reality: the kernel requires new_root to be a mount point. A plain directory will not work. You must run mount --bind /new_root /new_root first to satisfy this requirement.
Related Topics
Linux Namespaces (PID, NET, MNT, UTS, IPC, USER), OverlayFS & Union File Systems, Virtual File System (VFS), Seccomp: Sandboxing System Calls
Container Runtime Internals (runc/containerd) — Kernel Internals
Difficulty: Advanced
A container is not a kernel primitive. It is assembled from clone() with namespace flags, pivot_root for filesystem isolation, cgroup limits, capability drops, and seccomp filters -- all stitched together by runc in a specific sequence. There is no "create container" syscall. The kernel has no concept of a container. What Docker calls a container is roughly 8 syscalls executed in the right order, creating the illusion of an isolated machine from ordinary Linux process primitives.
System Calls for Container Runtime Internals (runc/containerd)
- clone
- unshare
- setns
- pivot_root
- mount
- execve
Key Components in Container Runtime Internals (runc/containerd)
- runc: The OCI-compliant low-level container runtime. runc reads a config.json (OCI runtime spec), performs the actual clone() with namespace flags, sets up the mount tree with pivot_root, writes cgroup limits, drops capabilities, installs seccomp filters, and exec()s the container entrypoint. It exits after setup -- it does not babysit the running container. runc is roughly 10,000 lines of Go wrapping Linux syscalls.
- containerd-shim: A per-container process that sits between containerd and the actual container process. The shim survives containerd restarts, meaning containers keep running even if containerd is upgraded or crashes. It holds the container's stdio pipes, reports exit status back to containerd, and reaps zombie children. Each container has its own shim process.
- OCI Runtime Spec (config.json): A JSON document that defines everything about the container's execution environment: which namespaces to create, what filesystems to mount, cgroup resource limits, the seccomp filter profile, which capabilities to keep or drop, the entrypoint command, environment variables, and the root filesystem path. runc is driven entirely by this spec. 'runc spec' generates a default one.
- CRI (Container Runtime Interface): A gRPC API that kubelet uses to talk to container runtimes. CRI defines operations like RunPodSandbox, CreateContainer, StartContainer, StopContainer, and RemoveContainer. containerd implements CRI as a built-in plugin. CRI-O is an alternative implementation. CRI replaced the old dockershim in Kubernetes 1.24, removing Docker as a direct dependency.
Key Points for Container Runtime Internals (runc/containerd)
- There is no container syscall. A container is assembled from clone() (namespaces), pivot_root (filesystem), cgroup writes (resource limits), prctl/capset (capability drops), and seccomp() (syscall filtering). runc orchestrates these in a precise sequence, and if any step fails, the container does not start.
- containerd-shim is the unsung hero of container reliability. Because each container has its own shim process, containerd itself can be restarted or upgraded without killing running containers. The shim holds stdio, tracks the exit code, and reaps zombies. Without it, a containerd upgrade would kill every container on the node.
- The Kubernetes pause container is not overhead -- it is the pod's identity. It is created first via RunPodSandbox, holds the network namespace, and all app containers join it with setns() (sketched after this list). If the pause container dies, every container in the pod loses its network identity. It runs /pause (an infinite sleep), consuming about 1MB of memory.
- Image layers are content-addressable. Each layer is a tar.gz identified by its SHA256 hash. containerd stores them in a content store and assembles them using snapshots (overlayfs by default). Pulling an image that shares layers with an already-pulled image skips the shared layers entirely. A 500MB image that shares 450MB with an existing image only downloads 50MB.
- PID 1 in a container has special signal semantics. The kernel does not deliver signals to PID 1 unless PID 1 has explicitly registered a handler for that signal. If the entrypoint is bash (which does not handle SIGTERM by default), graceful shutdown is impossible. Tini or dumb-init solves this by acting as a proper init that forwards signals and reaps zombies.
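A sketch of the join step, assuming the pause container's PID is known and the caller has CAP_SYS_ADMIN:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int join_netns(pid_t pause_pid) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/ns/net", pause_pid);
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        int rc = setns(fd, CLONE_NEWNET);  /* caller now shares the pod's net identity */
        close(fd);
        return rc;
    }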
Common Mistakes with Container Runtime Internals (runc/containerd)
- Mistake: Using a shell as PID 1 (CMD ["bash", "-c", "my-app"]) and expecting SIGTERM to reach the application. Reality: bash becomes PID 1, does not forward SIGTERM, and the app never receives the shutdown signal. Kubernetes waits terminationGracePeriodSeconds (default 30s) then sends SIGKILL. Use exec form CMD ["my-app"] or tini as the entrypoint.
- Mistake: Assuming "container restart" means a fast operation. Reality: runc must redo the entire setup sequence -- clone namespaces, prepare overlay mount, pivot_root, apply cgroups, drop capabilities, install seccomp filter, exec. If the snapshot is stale or the image layers need re-extraction, startup takes seconds, not milliseconds.
- Mistake: Running containers with --privileged "because it is easier." Reality: --privileged disables ALL isolation -- all capabilities granted, all devices accessible, AppArmor/SELinux disabled, seccomp disabled, /proc and /sys writable. A process inside a privileged container can mount the host filesystem and modify the host kernel.
- Mistake: Not understanding that Kubernetes removed dockershim in 1.24 and thinking Docker no longer works. Reality: Docker images still work everywhere because they are OCI images. What was removed is kubelet talking to dockerd. Kubernetes now talks to containerd directly via CRI, which is what Docker itself uses internally.
Related Topics
Linux Namespaces (PID, NET, MNT, UTS, IPC, USER), cgroups v2 (Control Groups), chroot & pivot_root, OverlayFS & Union File Systems, Linux Capabilities, Seccomp: Sandboxing System Calls, Network Namespaces & veth Pairs, Signals & Signal Handling
Copy-on-Write & Process Creation Internals — Processes & Threads
Difficulty: Advanced
Why fork() finishes in milliseconds no matter how much memory a process holds. Parent and child share every page after fork -- the kernel only duplicates a page when someone writes to it. The real cost of copying is deferred until the moment a write lands.
System Calls for Copy-on-Write & Process Creation Internals
- fork
- vfork
- clone
- clone3
- unshare
Key Components in Copy-on-Write & Process Creation Internals
- Page Table Entry (PTE): After fork(), both parent and child PTEs point to the same physical pages with write-protection set. On a write fault, the kernel's COW handler copies the page, updates the faulting process's PTE, and makes the copy writable.
- struct page / folio: The physical page descriptor. COW pages have a reference count (mapcount) > 1. When a write fault triggers, the kernel checks mapcount: if 1 (only this process maps it), it just makes the page writable (no copy needed). If > 1, it allocates a new page and copies.
- mm_struct: Represents a process's entire virtual address space. fork() calls dup_mm() which duplicates the mm_struct, copies all vm_area_struct entries, and walks the page tables to set COW protection bits.
- clone_flags: Bitmask controlling what the new process shares with the parent. CLONE_VM shares the address space (threads), CLONE_FS shares cwd/root/umask, CLONE_FILES shares the fd table, CLONE_NEWNS/NEWPID/NEWNET create new namespaces (containers).
Key Points for Copy-on-Write & Process Creation Internals
- fork() cost is proportional to page table entries, not memory size. A 100GB process with 25M pages needs ~200MB of page tables copied. That takes 1-10ms. The actual data pages? Untouched.
- The COW fault handler (do_wp_page) is smarter than you think. If only one process maps the page (mapcount=1), it just flips the write bit -- no copy at all. Zero page? Allocate a fresh zeroed page. The expensive copy-and-update path is the last resort.
- vfork() is NOT copy-on-write. The child shares the parent's mm_struct directly, and the parent is frozen until the child calls exec or _exit. It is faster because there is no page table copy, but the child must not modify any data. posix_spawn() is the modern safe alternative.
- clone3() is the modern, extensible version of clone(). It uses a struct clone_args with explicit size field, supporting all clone flags plus CLONE_CLEAR_SIGHAND, CLONE_INTO_CGROUP, and CLONE_NEWTIME. New flags will only be added to clone3(), not clone(). A raw-syscall wrapper is sketched after this list.
- Here is the elegant part: Linux uses the same clone() syscall for processes (no sharing flags), threads (CLONE_VM|CLONE_FILES|..), and containers (CLONE_NEWNS|CLONE_NEWPID|..). The only difference is which flags you pass. There is no separate 'create thread' or 'create container' syscall.
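A raw-syscall sketch of clone3(), assuming kernel and libc headers recent enough to provide struct clone_args and SYS_clone3 (there is no glibc wrapper):

    #define _GNU_SOURCE
    #include <linux/sched.h>               /* struct clone_args, CLONE_* flags */
    #include <signal.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    pid_t clone3_simple(unsigned long long flags) {
        struct clone_args args;
        memset(&args, 0, sizeof(args));    /* explicit size makes the ABI extensible */
        args.flags = flags;                /* 0 = fork-like; namespace flags = container-like
                                              (namespace creation needs privilege) */
        args.exit_signal = SIGCHLD;
        return syscall(SYS_clone3, &args, sizeof(args));  /* 0 in child, pid in parent */
    }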
Common Mistakes with Copy-on-Write & Process Creation Internals
- Thinking fork() is slow for large processes. Reality: with COW, a 50GB process forks in milliseconds. The pages are not copied until written. But the page table copying still takes time, and the TLB flush across all CPUs can cause latency spikes.
- Forking Redis with 30GB of data and then writing extensively. Each written page triggers a COW fault, allocating physical memory. If you write to most pages during a bulk update, you temporarily need double the memory. Transparent Huge Pages make this worse -- a single byte write copies an entire 2MB page instead of 4KB.
- Using vfork() and modifying variables. Since the child shares the parent's address space directly, any modification corrupts the parent. The only safe operations after vfork() are exec() and _exit(). Even calling exit() is unsafe because atexit handlers may modify global state.
- Forgetting that file descriptors inherited across fork() share the same struct file, and therefore the same file offset. If parent and child both write to the same fd without coordination, output is interleaved. Each should close or dup the fds it does not need.
Related Topics
Process Lifecycle (fork/exec/wait), POSIX Threads, Process Scheduling (CFS), Shared Memory & Semaphores
Daemons & Service Management — Processes & Threads
Difficulty: Intermediate
A process that survives its launching terminal and runs in the background under init/systemd supervision. The old way: double-fork, setsid, close fds, redirect stdio to /dev/null -- six steps to cut every tie. The modern way: let systemd handle all of that through a unit file with Type=simple or Type=notify.
System Calls for Daemons & Service Management
- fork
- setsid
- dup2
- chdir
- umask
- open
Key Components in Daemons & Service Management
- setsid(): Creates a new session and process group. The calling process becomes the session leader with no controlling terminal. setsid() is the key step that detaches the daemon from the launching terminal.
- systemd unit file: Declarative service description: Type (simple, forking, notify, oneshot), ExecStart, Restart policy, resource limits, cgroup controls, security hardening (PrivateTmp, NoNewPrivileges, etc.).
- sd_notify(): For Type=notify services, the daemon signals readiness to systemd via sd_notify('READY=1') over a Unix socket ($NOTIFY_SOCKET). sd_notify lets systemd accurately track when the service is actually ready, not just when the process started.
- PID file: Traditional mechanism for daemon identification: the daemon writes its PID to /var/run/<name>.pid. systemd's Type=forking uses PIDFile= to track the main daemon process. With Type=notify/simple, PID files are unnecessary.
Key Points for Daemons & Service Management
- The classic double-fork: (1) fork, parent exits (returns the shell prompt), (2) setsid() (new session, no terminal), (3) fork again, first child exits (the grandchild is not a session leader, so it cannot accidentally acquire a terminal), (4) chdir('/'), (5) umask(0), (6) redirect stdio to /dev/null. Six steps to escape the terminal (sketched in C after this list).
- systemd's Type=simple makes all of this unnecessary. The daemon runs as a foreground process. systemd provides session isolation, stdio redirection to journald, working directory control, and cgroup tracking. No forking. No PID files. No ritual.
- Socket activation is the most underrated feature of systemd. It opens the listening socket and passes it to the daemon as fd 3. The daemon never calls bind/listen. Zero-downtime restarts are free because the socket stays open during restart. Clients queue in the kernel backlog.
- systemd tracks all daemon processes via cgroups, not PIDs or process groups. A daemon that double-forks, reparents children, or creates new sessions cannot escape its cgroup. This is why KillMode=control-group reliably kills everything.
- The second fork in the double-fork pattern has a specific purpose: it prevents the daemon from ever acquiring a controlling terminal. A session leader (created by setsid) can open a tty and get a controlling terminal. The grandchild (from the second fork) is not a session leader, so it cannot.
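The whole ritual fits in one short C function; this sketch follows the six steps in order, with error handling omitted:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void daemonize(void) {
        if (fork() > 0) exit(0);           /* 1. parent exits, shell gets its prompt back */
        setsid();                          /* 2. new session, no controlling terminal */
        if (fork() > 0) exit(0);           /* 3. grandchild is not a session leader */
        chdir("/");                        /* 4. do not pin any mount point */
        umask(0);                          /* 5. predictable file creation modes */
        int fd = open("/dev/null", O_RDWR);
        dup2(fd, STDIN_FILENO);            /* 6. detach stdio from the tty */
        dup2(fd, STDOUT_FILENO);
        dup2(fd, STDERR_FILENO);
        if (fd > STDERR_FILENO) close(fd);
    }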
Common Mistakes with Daemons & Service Management
- Double-forking under systemd with Type=simple. The daemon forks, the original process (which systemd tracks) exits, and systemd thinks the service crashed. If your service double-forks, use Type=forking. If you are writing new code, do not fork at all -- use Type=simple or Type=notify.
- Not redirecting stdin/stdout/stderr to /dev/null. A daemon that inherits open file descriptors from the terminal may write to a closed terminal (SIGPIPE) or block on reads from it. Always dup2 all three to /dev/null in the classic pattern.
- Using a PID file without flock() locking. Two daemon instances can overwrite each other's PID file. Always lock the PID file with flock() or fcntl(). Better yet: use systemd's native cgroup tracking and skip PID files entirely.
- Calling sd_notify(READY=1) before the daemon is actually ready. systemd trusts this signal. If you send it before opening sockets or loading config, dependent services start before yours can serve requests. Only notify after full initialization.
Related Topics
Process Groups, Sessions & Job Control, Signals & Signal Handling, Process Lifecycle (fork/exec/wait), Inter-Process Communication (Pipes & FIFOs)
Device Mapper & Block Layer Internals — Storage & Filesystems
Difficulty: Advanced
How Linux stacks virtual block devices on top of physical ones. Device mapper transforms, splits, encrypts, and thin-provisions block I/O by intercepting BIO structs between the filesystem and the disk driver. LVM, LUKS, and container storage all depend on this single subsystem.
System Calls for Device Mapper & Block Layer Internals
Key Components in Device Mapper & Block Layer Internals
- struct bio: The fundamental I/O unit in the block layer. A BIO describes a single I/O operation: the target block device, the starting sector, the direction (read or write), and a list of memory pages (bio_vec) holding the data. Device mapper targets receive BIOs from the layer above, transform them (remap sectors, split, encrypt), and submit new BIOs to the layer below.
- struct mapped_device: Represents one device mapper device (e.g., /dev/dm-0). Contains the device name, the mapping table (dm_table), the request queue, and suspended/active state. dmsetup ls shows all mapped_device instances on the system. Each LVM logical volume, each LUKS device, and each dm-thin volume is a mapped_device.
- struct dm_table / struct dm_target: A dm_table holds an ordered list of dm_target entries, each covering a range of sectors. Each dm_target has a target_type (linear, crypt, thin, striped, snapshot-origin, etc.) and target-specific private data. The table is loaded atomically and can be swapped live via dmsetup load + dmsetup resume.
- struct request_queue / struct blk_mq_tag_set: The multi-queue block layer (blk-mq) manages submission and completion of I/O requests. Each block device has a request_queue with per-CPU software queues that map to hardware dispatch queues. BIOs are merged into requests, and the I/O scheduler (mq-deadline, bfq, kyber, or none) reorders them before dispatch to the driver.
- dm-snapshot COW device: When an LVM snapshot exists, every write to the origin must first copy the old data to the snapshot's COW area (a separate block device or reserved space). This copy-before-write turns every origin write into a read of the old block, a write of the old block to the COW area, and then the original write. Three I/O operations instead of one. This is the direct cause of snapshot-induced latency.
Key Points for Device Mapper & Block Layer Internals
- Device mapper operates at the BIO level, not the filesystem level. It sees sector ranges and raw bytes, not files or directories. This is why dm-crypt can encrypt any filesystem (ext4, XFS, Btrfs) without knowing anything about the filesystem's internal structure.
- The block layer merges adjacent BIOs into larger requests before submitting them to the device driver. This merging is critical for spinning disks (fewer seeks) and still beneficial for SSDs (fewer NVMe commands). The I/O scheduler sits between BIO submission and driver dispatch, reordering for locality or fairness.
- LVM snapshots use a copy-on-write mechanism at the block level. Every write to the origin triggers a read-copy-write sequence to preserve the old data in the snapshot COW area. This write amplification is 3x at minimum. For write-heavy workloads, this overhead is severe enough that LVM thin snapshots (dm-thin) are the preferred alternative because they handle COW more efficiently with B-tree metadata.
- dm-thin maintains a B-tree mapping from virtual blocks to physical blocks. Snapshots in dm-thin share the B-tree and use reference counting on physical blocks. A write to a shared block allocates a new physical block and updates only the writing volume's mapping. No read-copy-write sequence is needed for the origin. This is fundamentally more efficient than classic dm-snapshot.
- Device mapper tables can be stacked. A typical encrypted LVM setup stacks dm-linear (LVM striping or concatenation) on top of dm-crypt on top of the physical device. Each layer adds a BIO transformation. dmsetup deps shows the dependency tree. Deep stacking adds latency per layer, typically 5-20 microseconds each.
Common Mistakes with Device Mapper & Block Layer Internals
- Running an LVM thin pool to 100% utilization. When the pool has no free blocks, all thin volumes backed by that pool freeze with I/O errors. Unlike a full filesystem that returns ENOSPC, a full thin pool causes every write BIO to hang or error. Set up monitoring with lvs -o+data_percent and configure autoextend in /etc/lvm/lvm.conf (thin_pool_autoextend_threshold and thin_pool_autoextend_percent).
- Using classic LVM snapshots (dm-snapshot) on write-heavy databases. Every origin write pays a 3x I/O penalty (read old block, write old block to COW, write new block). On a database doing 10,000 IOPS, the snapshot adds 20,000 extra IOPS to the underlying device. Migrate to LVM thin snapshots or use filesystem-level snapshots (Btrfs, ZFS) that handle COW more efficiently.
- Forgetting to size the snapshot COW area properly. If the COW area fills up, the snapshot is silently invalidated and dropped. Any backup process reading from the snapshot gets I/O errors. Always allocate 2-3x the expected write volume during the snapshot's lifetime and monitor usage with lvs -o+snap_percent.
- Assuming dm-crypt has negligible overhead on all hardware. Without AES-NI (hardware AES acceleration), dm-crypt throughput drops from 2+ GB/s to 200-400 MB/s. Check for AES-NI support with grep aes /proc/cpuinfo. On VMs, ensure the hypervisor exposes AES-NI to guests. Older ARM servers without crypto extensions also suffer significant dm-crypt overhead.
Related Topics
Disk I/O Scheduling, ext4 & XFS On-Disk Internals, Virtual File System (VFS), Copy-on-Write & Process Creation Internals
Direct I/O (O_DIRECT): Bypassing the Page Cache — I/O & Storage
Difficulty: Advanced
Why databases open files with O_DIRECT instead of letting the kernel cache pages. When an application already manages its own buffer pool, the page cache becomes a liability -- it doubles memory usage, adds writeback latency, and evicts pages the application knows are still hot.
System Calls for Direct I/O (O_DIRECT): Bypassing the Page Cache
- open (with O_DIRECT)
- read
- write
- posix_memalign
- pread
- pwrite
Key Components in Direct I/O (O_DIRECT): Bypassing the Page Cache
- O_DIRECT flag: Passed to open() or openat() to request direct I/O. Tells the kernel to bypass the page cache for reads and writes on this file descriptor. The kernel transfers data directly between the application buffer and the storage device. Does not guarantee ordering or durability on its own -- fsync() or O_SYNC is still needed for persistence guarantees.
- Alignment requirements: O_DIRECT imposes strict alignment constraints. The memory buffer must be aligned to the logical block size of the filesystem (typically 512 bytes or 4096 bytes). The file offset must also be aligned. The transfer size must be a multiple of the block size. Violating any of these causes read() or write() to fail with EINVAL. posix_memalign() or aligned_alloc() is used to obtain properly aligned buffers.
- Page cache bypass path: Normal I/O follows the path: application buffer -> page cache -> block layer -> device. Direct I/O skips the page cache step entirely: application buffer -> block layer -> device. The kernel sets up DMA (Direct Memory Access) from the application's buffer pages directly, pinning them in memory for the duration of the transfer.
- DMA pinning: For direct I/O, the kernel must pin the application's buffer pages in physical memory so the storage device can DMA to/from them. The kernel calls get_user_pages() to pin pages, sets up the scatter-gather list for the block device, and unpins pages after the I/O completes. This is why O_DIRECT buffers must be page-aligned -- the DMA hardware operates on physical page boundaries.
Key Points for Direct I/O (O_DIRECT): Bypassing the Page Cache
- O_DIRECT eliminates double buffering. Without it, a database with its own buffer pool stores every page twice: once in the application's managed cache and once in the kernel's page cache. On a 128 GB server with a 96 GB buffer pool, this can waste 30-40 GB of RAM holding redundant copies.
- Alignment is not optional. The buffer address must be aligned to the filesystem block size (usually 512 or 4096 bytes), the file offset must be aligned, and the transfer size must be a multiple of the block size. Use posix_memalign(&buf, 4096, size) or aligned_alloc(4096, size). A misaligned O_DIRECT call returns EINVAL rather than silently falling back to buffered I/O.
- O_DIRECT does not mean durable. The data bypasses the page cache but may still sit in the storage device's volatile write cache. Combine O_DIRECT with fsync() for durability, or use O_DIRECT|O_DSYNC to get both bypass and synchronous writes in one flag combination.
- Some filesystems handle O_DIRECT differently. ext4 falls back to buffered I/O for misaligned requests on older kernels. XFS has historically had the best O_DIRECT support and is the preferred filesystem for database workloads. btrfs supports O_DIRECT but may fall back to buffered I/O for compressed extents.
- io_uring with IORING_OP_READ/WRITE and O_DIRECT gives the best of both worlds: no page cache overhead and no syscall-per-I/O overhead. The submission queue batches multiple direct I/O operations, and the kernel processes them asynchronously. This is the path modern databases like TigerBeetle and ScyllaDB are moving toward.
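A sketch of the io_uring variant mentioned in the last point, assuming liburing is available (link with -luring); same hypothetical file and alignment assumptions as the previous example:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0) return 1;   /* 8-entry submission queue */

        int fd = open("/var/tmp/datafile", O_RDONLY | O_DIRECT);  /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  /* O_DIRECT alignment still applies */

        /* Queue one direct read; many could be batched before a single submit. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("direct read completed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }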
Common Mistakes with Direct I/O (O_DIRECT): Bypassing the Page Cache
- Using O_DIRECT without alignment. The most common failure mode: allocating a buffer with malloc() (which returns 8 or 16 byte aligned memory) and passing it to a direct I/O read or write. The call fails with EINVAL. Always use posix_memalign() or aligned_alloc() with the filesystem block size as the alignment.
- Assuming O_DIRECT means data is on disk. O_DIRECT bypasses the page cache, not the device write cache. Without fsync() or O_DSYNC, a power failure can lose data that O_DIRECT "wrote" successfully. This catches teams that switch from buffered+fsync to O_DIRECT and drop the fsync call.
- Using O_DIRECT for small random reads in a general-purpose application. The page cache exists for a reason -- it absorbs repeated reads, coalesces small writes, and handles readahead. O_DIRECT makes sense when the application manages its own cache (databases) or when data has near-zero reuse (streaming writes). For typical application workloads, the page cache is strictly better.
- Mixing O_DIRECT and buffered I/O on the same file from different processes. The page cache and direct I/O path can see stale data. One process writes via the page cache, another reads with O_DIRECT and gets old data because the dirty page has not been flushed yet. If multiple access patterns are required, use O_DIRECT consistently or add explicit synchronization.
- Forgetting that O_DIRECT disables kernel readahead. Sequential reads through O_DIRECT get no automatic prefetching. The application must implement its own readahead by issuing larger reads or submitting multiple asynchronous read requests. Without this, sequential direct I/O throughput can be 50-70% lower than buffered I/O.
Related Topics
Page Cache & Block I/O, io_uring: Modern Async I/O, Disk I/O Scheduling, Virtual Memory & Address Spaces, mmap & Memory-Mapped Files
Directory Entries & Path Resolution — File Systems & I/O
Difficulty: Intermediate
Each directory entry (dentry) maps a filename to an inode number. Opening a file means walking these mappings component by component -- left to right, checking permissions at every step. The dentry cache holds recent lookups in memory, so the same path walk that costs disk I/O on first access resolves in nanoseconds the second time.
System Calls for Directory Entries & Path Resolution
- opendir
- readdir
- getcwd
- chdir
- realpath
- openat
Key Components in Directory Entries & Path Resolution
- struct dentry: In-memory cache of a (parent, name) -> inode mapping; forms a tree structure mirroring the directory hierarchy with d_parent and d_subdirs pointers
- Dentry cache (dcache): Global hash table keyed by (parent dentry pointer, name hash) providing O(1) path component lookup; one of the most performance-critical kernel data structures
- struct nameidata: Internal state carried through path walk: current dentry, mount, remaining path, flags (LOOKUP_FOLLOW, LOOKUP_DIRECTORY), and symlink depth counter
- getdents64 (internal syscall): The actual syscall behind readdir(); reads directory entries in bulk from the filesystem into a userspace buffer, returning struct linux_dirent64 records
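Since getdents64 has long lacked a glibc wrapper, a sketch can invoke it via syscall(); the struct layout below mirrors the kernel ABI, and /tmp is just an example directory:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Layout of each record getdents64 writes into the buffer (kernel ABI). */
    struct linux_dirent64 {
        unsigned long long d_ino;     /* inode number */
        long long          d_off;     /* offset to next record */
        unsigned short     d_reclen;  /* total record length */
        unsigned char      d_type;    /* DT_REG, DT_DIR, ... */
        char               d_name[];  /* NUL-terminated filename */
    };

    int main(void)
    {
        int fd = open("/tmp", O_RDONLY | O_DIRECTORY);
        char buf[4096];
        long n;

        /* Each call fills buf with as many variable-length records as fit. */
        while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
            for (long off = 0; off < n; ) {
                struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
                printf("%llu %s\n", d->d_ino, d->d_name);  /* hash order on ext4, not sorted */
                off += d->d_reclen;
            }
        }
        close(fd);
        return 0;
    }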
Key Points for Directory Entries & Path Resolution
- Hot path lookups are nearly free. RCU-walk resolves paths without taking any locks or dentry reference counts — it only falls back to ref-walk on cache misses, sleeping permission checks, or concurrent renames
- openat(dirfd, "relative/path") eliminates an entire class of security bugs: it resolves paths from a pinned directory fd, so nobody can swap a directory out from under you between checking and using it (TOCTOU)
- The cache remembers "not found" too. Negative dentries prevent repeated disk reads for names that don't exist — critical every time your shell searches $PATH or a compiler checks include directories
- The dcache can quietly eat gigabytes of kernel memory (check /proc/sys/fs/dentry-state). The kernel's dcache shrinker reclaims unused entries under memory pressure via LRU eviction
- ext4 doesn't scan directories linearly — it uses an HTree (a hashed B-tree) for O(1) lookup by name. Without it, a directory with a million files would require scanning every entry for each lookup
Common Mistakes with Directory Entries & Path Resolution
- Using access() then open() as separate calls — this is a textbook TOCTOU vulnerability. Between the check and the use, an attacker can swap the file. Use openat() or just open() and check the return value (see the sketch after this list)
- Passing a too-small buffer to getcwd() — it returns NULL with ERANGE. Use NULL as the buffer (GNU extension to let glibc allocate) or always allocate PATH_MAX bytes
- Expecting readdir() to return files in alphabetical or creation order — it doesn't. ext4 HTree returns entries in hash order (looks random). If you need sorted output, sort it yourself
- Not handling concurrent modifications during readdir() — on large directories with hash rebalancing, readdir may skip or duplicate entries. Don't assume a single pass sees a perfect snapshot
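A sketch of the safe pattern from the points above -- pin the directory once, then resolve relative to it; /etc/myapp and the config filename are hypothetical:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Pin the directory once; O_PATH gives a handle usable for *at() lookups. */
        int dirfd = open("/etc/myapp", O_PATH | O_DIRECTORY);   /* hypothetical directory */
        if (dirfd < 0) { perror("open dir"); return 1; }

        /* Resolve relative to the pinned fd -- no swap of /etc/myapp can redirect us.
           Instead of access()+open(), just open and let errno report permission problems. */
        int fd = openat(dirfd, "config", O_RDONLY | O_NOFOLLOW);
        if (fd < 0) { perror("openat"); close(dirfd); return 1; }

        /* ... read config ... */
        close(fd);
        close(dirfd);
        return 0;
    }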
Related Topics
Inodes & File Metadata, Hard Links & Symbolic Links, Virtual File System (VFS), File Descriptors & File Tables
Disk I/O Scheduling — File Systems & I/O
Difficulty: Advanced
Sits between the filesystem and the disk, reordering and prioritizing I/O requests. Four options: none (straight pass-through, best for NVMe), mq-deadline (latency guarantees for databases), bfq (fair bandwidth sharing with ionice support), kyber (auto-tuned for fast devices). Pick wrong and latency jumps 10x.
System Calls for Disk I/O Scheduling
Key Components in Disk I/O Scheduling
- blk-mq (multi-queue block layer): Modern block layer architecture (default since Linux 5.0) that provides per-CPU software queues mapped to hardware dispatch queues. Eliminates the single-queue lock contention bottleneck, enabling millions of IOPS on NVMe devices with parallel submission across CPU cores.
- mq-deadline scheduler: Ensures every I/O request is serviced within a deadline (500ms reads, 5s writes by default). Maintains separate read and write deadline-sorted queues, preventing starvation of either type. The default and recommended scheduler for most server workloads.
- bfq (Budget Fair Queueing) scheduler: Assigns I/O bandwidth budgets to processes based on weight, guaranteeing fairness and low latency for interactive applications. Ideal for desktop/laptop workloads where responsiveness matters more than peak throughput. Higher CPU overhead than mq-deadline.
- I/O priority (ioprio): Per-process I/O scheduling priority set via ioprio_set(). Three classes: IOPRIO_CLASS_RT (real-time, 8 levels), IOPRIO_CLASS_BE (best-effort, 8 levels, default), and IOPRIO_CLASS_IDLE (only served when no other I/O pending). Only honored by bfq (and the legacy cfq, removed in Linux 5.0); see the sketch below.
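A sketch of ioprio_set() as referenced above; glibc ships no wrapper, so the constants from the kernel ABI are defined locally and the call goes through syscall():

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Values from the kernel ABI (linux/ioprio.h), defined here because glibc
       provides no ioprio_set() wrapper. */
    #define IOPRIO_CLASS_IDLE  3
    #define IOPRIO_WHO_PROCESS 1
    #define IOPRIO_CLASS_SHIFT 13
    #define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))

    int main(void)
    {
        /* Mark the calling process idle-class: served only when no other I/O is pending.
           Remember: this is honored by bfq, ignored by mq-deadline and none. */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) != 0) {
            perror("ioprio_set");
            return 1;
        }
        printf("now idle I/O class (pid %d)\n", getpid());
        return 0;
    }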
Key Points for Disk I/O Scheduling
- NVMe SSDs should use 'none' -- no scheduler at all. The device's internal controller with 64K+ queue entries handles ordering better than any software. Adding mq-deadline or bfq to NVMe wastes CPU cycles for zero I/O benefit.
- blk-mq uses per-CPU software queues mapped to hardware dispatch queues. This eliminated the single spinlock that serialized all I/O in the legacy block layer -- the bottleneck that prevented Linux from scaling beyond ~1M IOPS on fast storage.
- ionice is useless with mq-deadline and none schedulers. I/O priority classes only work with bfq (and the removed cfq). Running 'ionice -c3 pg_dump' does nothing on mq-deadline -- a common misconfiguration that gives a false sense of priority isolation.
- mq-deadline prevents read starvation with a 500ms deadline: if a read has waited longer than read_expire, it jumps the queue regardless of other optimizations. Writes get 5s. The writes_starved parameter (default 2) limits how many read batches run before writes get a turn.
- The legacy CFQ scheduler was removed in Linux 5.0 with the entire single-queue block layer. Any tuning guide referencing CFQ, noop, or the old elevator parameter is outdated. Modern tuning targets blk-mq schedulers exclusively.
Common Mistakes with Disk I/O Scheduling
- Mistake: Using bfq on high-throughput servers. Reality: bfq's per-process budget tracking and internal B-tree operations add significant CPU overhead at high IOPS. For databases and storage nodes, mq-deadline or none provides better throughput with less CPU burn.
- Mistake: Running ionice on processes when using mq-deadline or none. Reality: I/O priorities are completely ignored by these schedulers. You get zero priority isolation. If you need ionice to work, switch to bfq.
- Mistake: Applying HDD-era tuning (large nr_requests, high read_ahead_kb) to NVMe. Reality: NVMe has deep hardware queues and microsecond latencies. Large software queue depths add memory overhead and latency. Default values are usually optimal.
- Mistake: Assuming all block devices use the same scheduler. Reality: Linux allows per-device scheduler selection. Your NVMe boot drive can use 'none' while your HDD data drive uses mq-deadline. Check /sys/block/<dev>/queue/scheduler for each device.
Related Topics
io_uring: Modern Async I/O, I/O Models: Blocking, Non-Blocking, Async, cgroups v2 (Control Groups), Page Cache & Block I/O
DNS Resolution & /etc/resolv.conf — Networking & Sockets
Difficulty: Intermediate
getaddrinfo() is not a system call -- it is a glibc function that reads /etc/nsswitch.conf to decide where to look first (files, dns, mDNS, LDAP) and in what order. /etc/resolv.conf tells the stub resolver which nameserver to query, which search domains to append, and how many dots a name must have before it is treated as fully qualified. With ndots:5 and 5 search domains, a single lookup for "redis" generates up to 12 DNS packets before reaching the actual answer.
System Calls for DNS Resolution & /etc/resolv.conf
Key Components in DNS Resolution & /etc/resolv.conf
- getaddrinfo(): The standard POSIX function for hostname resolution. Not a syscall -- it is a glibc/musl library function that reads nsswitch.conf to determine resolution order, checks /etc/hosts, queries DNS, and returns a linked list of addrinfo structs with resolved addresses. Supports both IPv4 (A records) and IPv6 (AAAA records), service name to port mapping, and AI_ADDRCONFIG to filter results based on which protocols the system actually has configured. Every call from start to result involves at minimum: reading nsswitch.conf, reading /etc/resolv.conf, possibly reading /etc/hosts, constructing and sending UDP packets, and parsing DNS wire format responses. The sketch after this list shows a typical call.
- /etc/resolv.conf: Configuration file for the stub resolver. Contains: nameserver lines (up to 3, queried in order on timeout), search/domain lines (up to 6 domains appended to unqualified names), and options (ndots:N sets the threshold for treating a name as absolute -- default is 1; timeout:N sets per-query timeout in seconds -- default is 5; attempts:N sets retries per nameserver -- default is 2; rotate distributes queries across nameservers instead of always hitting the first). Since glibc 2.26 the resolver detects changes to this file and reloads it automatically; older glibc versions read it once and cached it for the life of the process (requiring res_init() to pick up changes).
- /etc/nsswitch.conf: Name Service Switch configuration. The "hosts:" line controls resolution order. The default "hosts: files dns myhostname" means: check /etc/hosts first, then query DNS, then resolve the machine's own hostname as a fallback. Other possible sources include mDNS (Avahi for .local names), LDAP, NIS, and systemd-resolved (resolve). Each source can have action items in brackets: [!UNAVAIL=return] means stop searching if the source was available but returned no result, versus continuing to the next source.
- Stub resolver: The minimal DNS client built into glibc (or musl, or whatever libc the system uses). It reads /etc/resolv.conf for nameserver addresses and options, constructs DNS query packets, sends them over UDP (falling back to TCP if the response is truncated or larger than 512 bytes), and parses the response. It does NOT cache results -- every call to getaddrinfo() that reaches the DNS path sends a new query over the network. For caching, the system needs a local caching resolver like systemd-resolved, nscd, dnsmasq, or unbound running on 127.0.0.53 or 127.0.0.1.
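A minimal getaddrinfo() sketch with AI_ADDRCONFIG, resolving an example hostname; every call repeats the full resolution path described above:

    #define _GNU_SOURCE
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res, *rp;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;      /* both A and AAAA */
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_flags    = AI_ADDRCONFIG;  /* skip families the host has no address for */

        /* Each call walks nsswitch.conf, /etc/hosts, then DNS -- nothing is cached. */
        int err = getaddrinfo("example.com", "443", &hints, &res);
        if (err != 0) { fprintf(stderr, "%s\n", gai_strerror(err)); return 1; }

        for (rp = res; rp != NULL; rp = rp->ai_next) {
            char host[NI_MAXHOST];
            getnameinfo(rp->ai_addr, rp->ai_addrlen, host, sizeof(host),
                        NULL, 0, NI_NUMERICHOST);
            printf("%s\n", host);
        }
        freeaddrinfo(res);
        return 0;
    }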
Key Points for DNS Resolution & /etc/resolv.conf
- glibc does NOT cache DNS results. Every getaddrinfo() call for a name that is not in /etc/hosts sends a UDP packet to the nameserver. A web server making 1000 requests/second to api.example.com generates 1000 DNS queries/second unless something external caches: systemd-resolved, nscd, dnsmasq, or application-level caching.
- The ndots option controls when search domains are appended. With ndots:5 (Kubernetes default), any name with fewer than 5 dots gets each search domain appended first. "redis.default.svc" has 2 dots, which is less than 5, so glibc appends all search domains before trying the name as-is. Append a trailing dot ("redis.default.svc.") to force absolute resolution and skip the search list entirely (an example resolv.conf follows this list).
- DNS queries go over UDP by default (port 53). If the response has the TC (truncated) flag set, the resolver retries over TCP. EDNS0 (RFC 6891) extends UDP payload size beyond 512 bytes -- modern resolvers negotiate up to 4096 bytes. DNS over TCP uses a persistent connection for multiple queries. DNS over TLS (port 853) and DNS over HTTPS (port 443) encrypt queries but require a resolver that supports them.
- /etc/hosts is checked before DNS (default nsswitch.conf order). Entries in /etc/hosts bypass DNS entirely -- no network query, no TTL, instant response. Container runtimes inject entries here for the container's own hostname and for linked containers. Kubernetes adds pod IP and hostAliases entries.
- IPv6 AAAA lookups happen alongside IPv4 A lookups. When AI_ADDRCONFIG is set (default in most resolvers), AAAA queries are skipped if the system has no IPv6 address configured. Happy Eyeballs (RFC 8305) races A and AAAA queries in parallel and connects to whichever responds first, with a 250ms head start for IPv6.
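An illustrative resolv.conf tuned for a host that mostly resolves external names -- the addresses and domain are placeholders, and ndots:1 reflects the advice above rather than a system default:

    # /etc/resolv.conf -- illustrative values, not defaults
    nameserver 10.0.0.2          # queried first
    nameserver 10.0.0.3          # tried after timeout (glibc reads at most 3)
    search corp.example.com
    options ndots:1 timeout:2 attempts:2 rotate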
Common Mistakes with DNS Resolution & /etc/resolv.conf
- Mistake: Assuming DNS results are cached by the operating system. Reality: glibc has zero DNS caching. Every getaddrinfo() call that reaches the DNS path sends a fresh UDP query. Without nscd, systemd-resolved, or application-level caching, an application making 100 requests/second to the same hostname generates 100 DNS queries/second. Musl libc (Alpine Linux) also does not cache.
- Mistake: Not understanding ndots in Kubernetes. Reality: The default ndots:5 means any name with fewer than 5 dots gets the entire search list appended. Resolving "api.example.com" (2 dots) generates 6 DNS queries before the actual name is tried. For pods that mostly resolve external names, set ndots:1 in the pod's dnsConfig or append a trailing dot to every external hostname.
- Mistake: Using dns.lookup() in Node.js for high-throughput workloads. Reality: dns.lookup() calls getaddrinfo() on libuv's thread pool, which defaults to 4 threads. Under load, DNS lookups queue behind each other, stalling the event loop. Use dns.resolve() with c-ares for async DNS, or increase UV_THREADPOOL_SIZE up to 1024.
- Mistake: Running Go services in Alpine containers without setting GODEBUG=netdns=go. Reality: Alpine uses musl libc and lacks /etc/nsswitch.conf by default. Go may fall back to the cgo resolver, requiring CGO and a C library, adding latency and complexity. Setting GODEBUG=netdns=go forces the pure-Go resolver, which reads /etc/resolv.conf directly and scales better in containers.
- Mistake: Adding more than 3 nameservers to /etc/resolv.conf. Reality: glibc ignores any nameserver directive beyond the third. The resolver tries them in order with the configured timeout (default 5s). With 3 nameservers and 2 attempts each, the worst-case total timeout is 30 seconds before getaddrinfo() returns an error.
Related Topics
Socket Programming (TCP/UDP), Network Namespaces & veth Pairs, Kernel Network Stack, Unix Domain Sockets
DPDK: User-Space Networking — Networking & Sockets
Difficulty: Advanced
Kernel bypass for high-performance packet processing. DPDK maps NIC registers and DMA buffers directly into userspace via UIO or VFIO drivers, then polls for packets in a tight loop instead of waiting for interrupts. Poll-mode drivers (PMDs) process packets entirely outside the kernel -- no sk_buff allocation, no netfilter, no context switches. Hugepages back mbuf pools to guarantee contiguous physical memory for DMA and eliminate TLB misses. The cost is dedicating CPU cores to spin at 100% and losing kernel visibility into network traffic.
System Calls for DPDK: User-Space Networking
- mmap
- munmap
- ioctl
- mlock
- mbind
- sched_setaffinity
Key Components in DPDK: User-Space Networking
- Environment Abstraction Layer (EAL): The initialization framework that bootstraps a DPDK application. EAL parses command-line arguments (-l for core list, -n for memory channels, --socket-mem for per-NUMA hugepage allocation), sets up hugepage memory, initializes PCI device access, and pins threads to CPU cores using sched_setaffinity. Every DPDK application starts with rte_eal_init(). The EAL also provides platform-independent abstractions for atomic operations, spinlocks, timers, and memory barriers.
- Poll-Mode Drivers (PMDs): Userspace NIC drivers that bypass the kernel completely. Instead of registering interrupt handlers in the kernel, PMDs map NIC registers and descriptor rings into userspace via UIO or VFIO. The application calls rte_eth_rx_burst() in a tight loop, which checks the NIC RX descriptor ring for completed DMA transfers. No interrupts, no context switches, no softirq processing. The tradeoff is that PMD threads must spin continuously at 100% CPU, even when no packets arrive (see the sketch after this list).
- Hugepage Memory (rte_mempool / mbuf pools): DPDK allocates all packet buffer memory from hugepage-backed mempools. Standard 4 KB pages would require about a million TLB entries for a 4 GB buffer pool; 2 MB hugepages reduce this to 2,048 entries, and 1 GB hugepages to just 4. The rte_mempool allocator provides fixed-size object pools with per-core caches to avoid cross-core contention. Each mbuf (message buffer) is a fixed-size structure containing packet data, metadata, and a pointer to the next mbuf for scatter-gather operations.
- Ring Queues (rte_ring): Lock-free FIFO queues used for inter-core communication. Built on compare-and-swap (CAS) atomic operations, rte_ring supports single-producer / single-consumer (fastest), multi-producer / multi-consumer, and mixed modes. Pipeline architectures use rings to pass mbufs between processing stages: one core handles RX, another does classification, another does forwarding. Cache-line alignment prevents false sharing between producer and consumer pointers.
- UIO / VFIO: Kernel modules that enable userspace device access. UIO (Userspace I/O) maps PCI BAR regions into userspace and provides basic interrupt forwarding. VFIO (Virtual Function I/O) adds IOMMU-based DMA remapping for security -- the NIC can only DMA to regions explicitly mapped by the application, not arbitrary physical memory. VFIO is preferred in production because UIO gives the application unrestricted DMA access to all physical memory.
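A heavily trimmed sketch tying the EAL, PMD, and mempool pieces together; it assumes port 0 is already bound to vfio-pci and omits the rte_eth_dev_configure()/queue-setup/start sequence a real application needs:

    #include <stdlib.h>
    #include <rte_debug.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    #define BURST 32

    int main(int argc, char **argv)
    {
        /* EAL consumes its own arguments (-l core list, hugepage setup, PCI scan). */
        if (rte_eal_init(argc, argv) < 0)
            rte_exit(EXIT_FAILURE, "EAL init failed\n");

        /* Hugepage-backed mbuf pool with a per-core cache of 256 objects. */
        struct rte_mempool *pool = rte_pktmbuf_pool_create("mbufs", 8191, 256, 0,
                RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        if (!pool)
            rte_exit(EXIT_FAILURE, "mbuf pool failed\n");

        /* Port 0 assumed bound to a PMD; real code must configure queues and
           call rte_eth_dev_start() before polling. */
        uint16_t port = 0;
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            /* Poll the RX descriptor ring -- no interrupt, no syscall, 100% CPU. */
            uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);
            for (uint16_t i = 0; i < n; i++) {
                /* ... classify/forward pkts[i] here ... */
                rte_pktmbuf_free(pkts[i]);
            }
        }
        return 0;
    }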
Key Points for DPDK: User-Space Networking
- The fundamental tradeoff: DPDK trades kernel integration for raw speed. Applications lose access to the entire kernel networking stack -- iptables, tc, conntrack, /proc/net, tcpdump, socket API. Everything the kernel provides must be reimplemented in userspace or abandoned.
- Poll-mode drivers spin at 100% CPU regardless of traffic load. A DPDK core handling zero packets per second and one handling 10 million packets per second both show 100% CPU utilization. This is by design -- the latency benefit comes from eliminating the interrupt-to-poll transition. Power-aware deployments can use rte_power_empty_poll_stat to detect idle periods and scale frequency, but this adds latency variance.
- Hugepages are not optional. DMA requires physically contiguous memory because NICs operate on physical addresses. Standard 4 KB pages can be swapped out or fragmented, breaking DMA. Hugepages are pinned, physically contiguous, and provide 512x fewer TLB entries for the same memory. A DPDK application that fails to allocate hugepages will not start.
- DPDK and XDP solve similar problems from opposite directions. DPDK pulls packets out of the kernel entirely -- the application owns the NIC. XDP pushes processing into the kernel, running eBPF programs at the earliest point in the NIC driver. XDP keeps kernel integration (tcpdump, iptables still work for non-XDP traffic), while DPDK maximizes raw throughput at the cost of kernel visibility. XDP is typically 2-5x slower than DPDK for pure forwarding but requires no application-level protocol stacks.
- NUMA awareness is critical. A DPDK application that allocates mbufs from NUMA node 0 but processes them on a core attached to NUMA node 1 pays a 40-60% latency penalty for cross-node memory access. The EAL's --socket-mem flag and rte_lcore_to_socket_id() exist specifically to prevent this. In production, each NIC should be handled by cores on the same NUMA node as the NIC's PCIe slot.
Common Mistakes with DPDK: User-Space Networking
- Running DPDK without isolating CPU cores from the kernel scheduler. If the kernel schedules other tasks on DPDK poll-mode cores, context switches destroy latency predictability. Use isolcpus= boot parameter or cgroups cpuset to dedicate cores exclusively to DPDK. Without isolation, P99 latency can spike by 100x during scheduler preemptions.
- Allocating hugepages after boot instead of reserving them at boot time. Late allocation depends on physically contiguous free memory, which fragments over uptime. A server running for weeks may fail to allocate 1 GB hugepages even with plenty of free memory. Reserve hugepages via the kernel boot command line: hugepagesz=1G hugepages=8 default_hugepagesz=1G.
- Ignoring NUMA topology when assigning cores and memory. A NIC on PCIe bus attached to NUMA node 1, with DPDK cores running on NUMA node 0, crosses the QPI/UPI interconnect for every packet buffer access. This adds 70-100ns per memory operation. Use lstopo or lspci -vvv to check NIC NUMA affinity, then match --socket-mem and -l core assignments accordingly.
- Expecting kernel tools to work with DPDK traffic. Once a NIC is bound to a DPDK driver (uio_pci_generic or vfio-pci), the kernel cannot see any traffic on that interface. tcpdump, ss, netstat, iptables -- none of them work. DPDK applications must implement their own monitoring: pdump library for packet capture, rte_eth_stats_get() for counters, telemetry library for runtime introspection.
- Using DPDK for workloads that do not need it. If the application processes fewer than 1 million packets per second, kernel networking with interrupt coalescing and SO_BUSY_POLL is usually sufficient. DPDK adds operational complexity: custom drivers, dedicated cores, loss of kernel tooling, and application-managed protocol stacks. The breakeven point is typically 2-5 million pps depending on per-packet processing cost.
Related Topics
XDP & AF_XDP: Kernel-Bypass Networking, Kernel Network Stack, Huge Pages & THP, NUMA Architecture & Memory Policy, Zero-Copy Networking (sendfile, splice)
eBPF: Programmable Kernel — Kernel Internals
Difficulty: Advanced
Run custom code inside the live kernel without recompiling or loading modules. Each eBPF program passes a verifier that proves it cannot crash the kernel or loop forever, then gets JIT-compiled to native machine code. Programs attach to hooks spanning networking, tracing, security, and scheduling.
System Calls for eBPF: Programmable Kernel
Key Components in eBPF: Programmable Kernel
- BPF Verifier: Static analyzer that validates every eBPF program before loading. Walks all possible execution paths, checks memory bounds, prevents out-of-bounds access, ensures no loops (or bounded loops in kernel 5.3+), verifies map access types, and enforces privilege requirements. Rejects unsafe programs before they run.
- BPF Maps: Key-value data structures shared between eBPF programs and userspace. Types include hash maps, arrays, LRU hash, ring buffers, per-CPU variants, and sockmaps. Maps survive program detach and can be pinned to /sys/fs/bpf/ for persistence across process restarts (see the sketch after this list).
- BPF Program Types: Determines where the program attaches and what context it receives. Key types: kprobe/kretprobe (function entry/exit), tracepoint (stable kernel events), XDP (packet processing at driver level), tc (traffic control), cgroup (per-cgroup hooks), LSM (security module), and struct_ops (replace kernel function pointers).
- BPF CO-RE (Compile Once, Run Everywhere): Uses BTF (BPF Type Format) type information embedded in the kernel to relocate field offsets at load time. Programs compiled on one kernel version run on another without recompilation, solving the kernel header dependency problem.
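A minimal sketch of a BPF program using a hash map and a stable tracepoint, compiled with clang -target bpf against libbpf headers; the map name and sizes are arbitrary:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);
        __type(value, __u64);
    } exec_count SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_execve")
    int count_execve(void *ctx)
    {
        __u32 pid = bpf_get_current_pid_tgid() >> 32;
        __u64 one = 1;

        /* Lookup may fail -- the verifier rejects the program without this NULL check. */
        __u64 *val = bpf_map_lookup_elem(&exec_count, &pid);
        if (val)
            __sync_fetch_and_add(val, 1);
        else
            bpf_map_update_elem(&exec_count, &pid, &one, BPF_ANY);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";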
Key Points for eBPF: Programmable Kernel
- The verifier is the gatekeeper. It tracks register states as a lattice (known value, bounded range, pointer type), walks every execution path, and prunes equivalent states to handle path explosion. Limit: ~1 million verified instructions.
- XDP processes packets before sk_buff allocation -- at the driver's DMA completion handler. Cloudflare uses XDP to drop DDoS traffic at line rate (~100M pps) before it even reaches the network stack.
- Tail calls let one BPF program chain into another (up to 33 deep). Cilium uses this for composable network policy: base program -> L3 filter -> L4 filter -> L7 proxy, each a separate BPF program.
- BPF ring buffer (kernel 5.8+) replaces per-CPU perf buffers for event streaming: a single ring shared across all CPUs, variable-length records, and a reserve/commit API that avoids an extra copy. This is how modern observability tools get events out of the kernel efficiently.
- libbpf is the canonical library. It handles ELF parsing, CO-RE relocation, map creation, program loading, and attachment. libbpf-bootstrap provides skeleton code for new projects.
Common Mistakes with eBPF: Programmable Kernel
- Mistake: Hardcoding kernel struct field offsets. Reality: This breaks on different kernel versions. Use CO-RE with BTF and __builtin_preserve_access_index for portable programs.
- Mistake: Using bpf_probe_read() for everything. Reality: Since kernel 5.5, use bpf_probe_read_kernel() or bpf_probe_read_user() explicitly. The generic version cannot distinguish pointer types on all architectures.
- Mistake: Trying to call arbitrary kernel functions. Reality: Only BPF helpers and kfuncs (marked with BTF_ID) are callable. The verifier rejects everything else.
- Mistake: Dereferencing bpf_map_lookup_elem() without NULL check. Reality: The lookup returns NULL if the key does not exist. Skipping the NULL check is the most common verifier rejection for beginners.
Related Topics
System Calls: User to Kernel Transition, Seccomp: Sandboxing System Calls, Interrupt Handling & Softirqs, Kernel Modules & Device Drivers
epoll & I/O Multiplexing — Networking & Sockets
Difficulty: Intermediate
Behind Nginx, Redis, Go, Node.js, Kafka, and Chrome on Linux sits epoll. Rather than scanning every socket to find which ones have data, it maintains a kernel-managed ready list -- only sockets with pending events appear. Cost is O(ready), not O(total). That is the entire C10K answer.
System Calls for epoll & I/O Multiplexing
- epoll_create1
- epoll_ctl
- epoll_wait
- poll
- select
Key Components in epoll & I/O Multiplexing
- struct eventpoll: The epoll instance; contains a red-black tree of monitored fds (rbr), a ready list of fds with pending events (rdllist), and a wait queue for threads blocked in epoll_wait()
- struct epitem: Per-fd entry in the epoll red-black tree; links the fd to its epoll instance, stores the event mask (EPOLLIN/EPOLLOUT/EPOLLET), and contains the ready-list node
- ep_poll_callback: Callback registered on each monitored fd's wait queue; when the fd becomes ready (e.g., data arrives on a socket), this callback adds the epitem to the ready list and wakes epoll_wait()
- rdllist (ready list): Doubly-linked list of epitems with pending events. epoll_wait() drains this list, returning only ready fds instead of scanning all monitored fds; this is why epoll is O(ready_fds) not O(total_fds)
Key Points for epoll & I/O Multiplexing
- select is O(n) where n = highest fd number. poll is O(n) where n = total monitored fds. epoll_wait is O(k) where k = READY fds only. At 50,000 connections with 20 active, that is 50,000 vs 20. The difference is not theoretical -- it is the reason servers scale.
- Edge-triggered (EPOLLET) fires only on state CHANGES -- new data arriving, not data existing. You MUST read until EAGAIN or the fd goes silent forever. Miss this and your connection hangs with data sitting unread in the buffer (see the drain loop after this list).
- Level-triggered (default) fires every epoll_wait() call as long as data remains. Safer and simpler. More wakeups, but you cannot lose data by reading too little. Start here unless you have a specific reason for ET.
- EPOLLONESHOT disarms the fd after one event delivery. Essential for multi-threaded event loops -- without it, two threads can receive events for the same fd simultaneously. The owning thread must re-arm via EPOLL_CTL_MOD after processing.
- epoll does NOT work with regular files. Disk files are always "ready" from epoll's perspective, so epoll_wait returns immediately. For async disk I/O, use io_uring. This trips up every developer who tries to use epoll for file watching.
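A sketch of the edge-triggered drain loop from the second point above, assuming the fds were registered with EPOLLIN | EPOLLET and set O_NONBLOCK:

    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    /* Drain one edge-triggered readiness event: read until EAGAIN or EOF. */
    static void on_readable(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0) {
                /* ... process n bytes of buf ... */
                continue;
            }
            if (n == 0 || errno != EAGAIN)
                close(fd);           /* peer closed, or a real error */
            break;                   /* EAGAIN: buffer fully drained until the next edge */
        }
    }

    int event_loop(int epfd)
    {
        struct epoll_event ev[64];
        for (;;) {
            int n = epoll_wait(epfd, ev, 64, -1);
            for (int i = 0; i < n; i++) {
                /* EPOLLHUP/EPOLLERR are always reported, requested or not. */
                if (ev[i].events & (EPOLLHUP | EPOLLERR)) { close(ev[i].data.fd); continue; }
                if (ev[i].events & EPOLLIN)
                    on_readable(ev[i].data.fd);
            }
        }
    }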
Common Mistakes with epoll & I/O Multiplexing
- Mistake: Using edge-triggered mode without reading until EAGAIN. Reality: If you read one buffer's worth on an ET event, remaining data stays in the socket buffer with no further notification. The connection appears to hang permanently.
- Mistake: Ignoring EPOLLHUP and EPOLLERR. Reality: These are always reported regardless of what you requested. Not checking for them causes busy-loops where epoll_wait returns immediately but your code does not process the error condition.
- Mistake: Sharing an epoll fd across fork() without understanding the consequences. Reality: The child inherits a reference to the same kernel eventpoll structure. Events go to whichever process calls epoll_wait() first, causing silent race conditions.
- Mistake: Monitoring millions of idle connections without considering memory. Reality: Each epitem is ~128 bytes of kernel slab memory. Ten million idle connections cost ~1.2 GB of kernel memory just for epoll metadata.
Related Topics
Socket Programming (TCP/UDP), TCP State Machine & Connection Lifecycle, Unix Domain Sockets, Zero-Copy Networking (sendfile, splice)
ext4 & XFS On-Disk Internals — File Systems & I/O
Difficulty: Advanced
ext4 and XFS are the two dominant Linux filesystems, handling over 90% of production workloads. ext4 organizes blocks into groups with a fixed inode table per group, uses extent trees to map logical offsets to physical blocks, and journals via JBD2 in three modes (journal, ordered, writeback). XFS splits the disk into allocation groups that operate as independent mini-filesystems, uses B+ trees for all metadata, and delays allocation until writeback to maximize contiguous extents. The choice between them comes down to workload: ext4 for general purpose with strong crash consistency, XFS for large files and parallel I/O.
System Calls for ext4 & XFS On-Disk Internals
Key Components in ext4 & XFS On-Disk Internals
- Superblock: The filesystem's birth certificate and configuration record. On ext4, it lives at byte offset 1024 and carries the magic number 0xEF53, total block count, inode count, block size, blocks per group, free counts, and feature flags. Backup copies exist at block group boundaries (groups 0, 1, 3, 5, 7 and powers of 3, 5, 7 via sparse_super). On XFS, the primary superblock sits at the start of allocation group 0, with a secondary copy in each additional AG. The superblock is read once at mount time and cached in memory. Corruption here means the filesystem cannot mount at all -- which is why redundant copies exist.
- Extent tree: ext4 replaced the classic Unix triple-indirect block mapping with extent trees starting in kernel 2.6.28. An extent is a (logical block, physical block, length) triple that maps a contiguous run of blocks in one entry instead of one entry per block. A 1GB contiguous file needs just one extent instead of 262,144 indirect block pointers. The extent tree is a B-tree rooted in the inode itself (60 bytes of the inode hold up to 4 extents directly). When more extents are needed, the tree grows deeper with internal index nodes. XFS uses a similar B+ tree structure but with allocation group-local addressing, allowing concurrent tree operations across different AGs.
- Journal (JBD2 / XFS log): A write-ahead log that records intended metadata changes before they hit the main filesystem structures. ext4 uses JBD2 (Journaling Block Device 2) with three modes: journal mode logs both metadata and data blocks (safest, slowest -- roughly 50% throughput penalty), ordered mode (default) forces data blocks to disk before journaling the metadata commit (good balance), and writeback mode journals metadata only with no data ordering guarantees (fastest, riskiest). XFS maintains a circular log that journals metadata transactions. Both use a commit-then-checkpoint pattern: write the intent to the log, then lazily apply changes to their final locations. Recovery replays uncommitted transactions from the log, typically completing in under 5 seconds regardless of filesystem size.
- Allocation groups (XFS): XFS splits the entire block device into equal-sized allocation groups (sized at mkfs time; mkfs.xfs defaults to 4 AGs on a typical single-disk volume, with AG size capped at 1TB). Each AG is effectively an independent mini-filesystem with its own free space B+ tree, inode allocation B+ tree, and AG header. Multiple threads allocating files in different AGs never contend on the same metadata locks. This is the primary reason XFS outperforms ext4 on parallel workloads -- each AG is an independent allocation context, and agcount can be raised at mkfs time for highly parallel workloads. ext4 block groups serve a similar partitioning role but do not support the same degree of concurrent metadata updates.
Key Points for ext4 & XFS On-Disk Internals
- ext4 block groups divide the disk into fixed-size chunks (typically 128MB with 4KB blocks). Each group has its own block bitmap (tracks free blocks), inode bitmap (tracks free inodes), inode table (256-byte inode entries), and data blocks. The group descriptor table at the start of the filesystem maps all groups. This structure means a 1TB ext4 filesystem has roughly 8,192 block groups.
- XFS delayed allocation does not assign physical blocks when write() is called. Blocks stay in the page cache as "delayed" until writeback, when XFS knows the full extent of the write and can allocate the largest possible contiguous run. This produces fewer, larger extents and dramatically reduces fragmentation for sequential write workloads like log files and database WAL.
- ext4 journaling mode determines crash survival. In ordered mode (default), the kernel flushes data blocks to their final locations, then commits the metadata journal entry. If power fails before the journal commit, the metadata change is abandoned and the data blocks are orphaned (harmless). If power fails after the journal commit, the data was already on disk. Writeback mode offers no such guarantee -- metadata may reference data blocks that contain stale content.
- XFS reflink and copy-on-write (COW) allow instant file copies that share physical blocks. cp --reflink=always on XFS completes in milliseconds regardless of file size because it copies only the extent map, not the data. Writes to either copy trigger COW for just the modified extents. This underpins file-level clone and snapshot workflows on XFS and is used by container runtimes for efficient layer deduplication (see the sketch after this list).
- The ext4 inode is 256 bytes by default (configurable at mkfs). The first 128 bytes match the classic ext2 layout (mode, size, timestamps, 60 bytes for block mapping). The extra 128 bytes hold nanosecond timestamps, extended attributes inline, and the extent tree root. Small files (under ~60 bytes) can store their entire content inline in the inode, avoiding any block allocation.
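A sketch of the reflink point above using the FICLONE ioctl (kernel 4.5+; what cp --reflink=always calls); the filenames are hypothetical:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fs.h>     /* FICLONE */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int src = open("big.db", O_RDONLY);                      /* hypothetical files */
        int dst = open("big.db.snap", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        /* Share extents instead of copying data -- milliseconds regardless of size.
           Fails with EOPNOTSUPP on filesystems without reflink (e.g., plain ext4). */
        if (ioctl(dst, FICLONE, src) != 0) { perror("FICLONE"); return 1; }

        close(src); close(dst);
        return 0;
    }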
Common Mistakes with ext4 & XFS On-Disk Internals
- Mistake: Running ext4 with data=writeback for database workloads because benchmarks show 20% better throughput. Reality: writeback mode does not order data writes before metadata commits. A crash can leave committed metadata pointing at blocks containing old or zero data. PostgreSQL, MySQL, and similar databases that assume write ordering will experience silent data corruption. Use data=ordered (the default) and let the database handle its own write ordering through WAL.
- Mistake: Formatting XFS without ftype=1 and then running Docker. Reality: overlay2 requires directory entry type information to function correctly. XFS must be formatted with -n ftype=1 (default since xfsprogs 4.2 / 2015, but older RHEL 7 systems may not have it). Docker checks this at startup and either refuses to use overlay2 or falls back to a slower driver. The fix requires reformatting -- ftype cannot be enabled on an existing filesystem.
- Mistake: Ignoring inode exhaustion on ext4 because disk space monitoring shows plenty of free blocks. Reality: ext4 fixes the inode count at mkfs time (default: 1 inode per 16KB of disk). A 100GB volume gets ~6.5 million inodes. Workloads creating millions of small files (log rotation, mail spools, container layers) exhaust inodes while showing 50%+ free disk space. Monitor both df -h (blocks) and df -i (inodes). XFS allocates inodes dynamically and does not have this problem.
- Mistake: Using nobarrier mount option on ext4/XFS for performance without understanding the write cache implications. Reality: nobarrier tells the filesystem not to issue FLUSH/FUA commands to the disk. If the disk has a volatile write cache (no battery backup), a power failure can lose writes that the journal believed were committed. Only use nobarrier when the storage controller has battery-backed or capacitor-backed write cache (enterprise RAID controllers, most cloud block storage).
Related Topics
Page Cache & Block I/O, Disk I/O Scheduling, Inodes & File Metadata, OverlayFS & Union File Systems, Virtual File System (VFS), Inotify & fanotify: File System Events
File Descriptors & File Tables — File Systems & I/O
Difficulty: Starter
Small integers (0, 1, 2, 3...) that stand in for open files, sockets, pipes, and devices. Behind each one sits a three-layer kernel structure: per-process fd table, system-wide open file table, and inode. The layering determines how offsets are tracked, how sharing works across fork(), and where resource limits bite.
System Calls for File Descriptors & File Tables
- open
- close
- read
- write
- dup2
- fcntl
Key Components in File Descriptors & File Tables
- Per-process fd table (struct fdtable): Maps integer fd numbers to struct file pointers; one per process (task_struct->files)
- Open file description (struct file): Tracks file offset, access mode, and status flags; shared across dup()/fork()
- Inode (struct inode): Represents the on-disk file identity; shared across all opens of the same file
- File operations (struct file_operations): VFS dispatch table of function pointers (read, write, mmap, ioctl) attached to each struct file
Key Points for File Descriptors & File Tables
- dup2() and fork() share the SAME struct file — meaning file offset changes in one process are visible in the other. Two separate open() calls create independent struct files with independent offsets. This distinction breaks people's mental models constantly (demonstrated in the sketch after this list)
- Two different limits, two different errors: RLIMIT_NOFILE (per-process, default 1024) produces EMFILE; /proc/sys/fs/file-max (system-wide) produces ENFILE. Production servers must tune both
- Without O_CLOEXEC, child processes inherit every open fd across exec() — including sockets, database connections, and files opened with elevated privileges. This is a security risk and a resource leak
- The kernel always assigns the lowest available fd number on open() — close stdin (fd 0) then open a file and it gets fd 0. Classic footgun in daemon code
- close_range() (Linux 5.11+) atomically closes a range of fds in one syscall — far more efficient than looping close() for fd sanitization before exec()
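A small demonstration of the first point above -- dup() shares the offset, a second open() does not:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int a = open("/etc/hostname", O_RDONLY);
        int b = dup(a);                          /* same struct file: shared offset */
        int c = open("/etc/hostname", O_RDONLY); /* new struct file: independent offset */
        char buf[4];

        read(a, buf, 4);                         /* advances the shared offset */
        printf("a=%lld b=%lld c=%lld\n",
               (long long)lseek(a, 0, SEEK_CUR), /* 4 */
               (long long)lseek(b, 0, SEEK_CUR), /* 4 -- dup shares the offset */
               (long long)lseek(c, 0, SEEK_CUR));/* 0 -- separate open, separate offset */
        return 0;
    }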
Common Mistakes with File Descriptors & File Tables
- Thinking two fds from separate open() calls share a file offset — they don't. Each open() creates a new struct file with its own position. Only dup() and fork() share offsets, because they share the struct file
- Forgetting O_CLOEXEC in multithreaded programs — between open() and a subsequent fcntl(FD_CLOEXEC), another thread can fork+exec and leak the fd. Set O_CLOEXEC atomically at open() time
- Ignoring close()'s return value — on NFS, close() can return EIO if a deferred write failed. Ignoring this means silently losing data and discovering it much, much later
- Leaking pipe() or socketpair() fds — each creates TWO fds, and forgetting to close the unused end in parent or child is one of the most common fd leaks in production systems
Related Topics
Inodes & File Metadata, I/O Models: Blocking, Non-Blocking, Async, Virtual File System (VFS), Process Lifecycle (fork/exec/wait)
File Locking (Advisory & Mandatory) — File Systems & I/O
Difficulty: Intermediate
Three locking mechanisms, three sets of rules. flock() locks the whole file and is tied to the fd. POSIX fcntl() does byte-range locking but is tied to the process -- close any fd to the same file and every lock vanishes silently. OFD locks (Linux 3.15+) combine byte-range granularity with fd ownership, fixing the worst of fcntl(). All three are advisory: a process that skips the check can still write freely.
System Calls for File Locking (Advisory & Mandatory)
Key Components in File Locking (Advisory & Mandatory)
- struct file_lock: Kernel representation of a file lock; holds lock type (shared/exclusive), byte range (start, len), owning pid/ofd, and blocked waiter list
- inode->i_flock (lock list): Linked list of all locks held on an inode; the kernel walks this list to check for conflicts when a new lock is requested
- struct flock (userspace): Userspace structure passed to fcntl(F_SETLK/F_GETLK): specifies l_type (F_RDLCK/F_WRLCK/F_UNLCK), l_whence, l_start, l_len
- Open File Description (struct file): For OFD locks (F_OFD_SETLK), the lock is owned by the struct file rather than the process, making it safe across threads and immune to close() on other fds
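A sketch of an OFD byte-range lock as described above; the file path is hypothetical, and note that l_pid must be zero for OFD operations:

    #define _GNU_SOURCE        /* F_OFD_SETLK */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/tmp/journal.dat", O_RDWR | O_CREAT, 0644);  /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        /* Exclusive lock on bytes 0-4095; owned by this open file description,
           so closing some *other* fd to the same file cannot drop it. */
        struct flock fl;
        memset(&fl, 0, sizeof(fl));
        fl.l_type   = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 4096;
        fl.l_pid    = 0;                 /* must be 0 for OFD locks */

        if (fcntl(fd, F_OFD_SETLK, &fl) == -1) { perror("F_OFD_SETLK"); return 1; }
        /* ... critical section ... lock released on last close of this description */
        close(fd);
        return 0;
    }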
Key Points for File Locking (Advisory & Mandatory)
- POSIX fcntl() locks have a devastating footgun: locks are owned by (pid, inode), not by fd. If you open the same file on a different fd and close it, ALL your locks on that inode vanish — silently. This has bitten every major database that uses them
- flock() locks are tied to the struct file (open file description), not the pid. dup() and fork() share the lock, but independent open() calls get independent locks. This is usually the saner default for whole-file locking
- OFD locks (F_OFD_SETLK, Linux 3.15+) are the modern fix — struct-file ownership like flock(), plus byte-range support like fcntl(). If you're writing new code that needs record locking, use OFD locks
- Mandatory locking is dead. Deprecated in Linux 4.5, removed in 5.15. It never covered mmap(), had race conditions, and was never reliable. All locking in production is cooperative advisory locking
- The kernel detects deadlocks for POSIX locks (returns EDEADLK) but NOT for flock() or OFD locks — those just hang forever if you create a cycle. Design your lock ordering carefully
Common Mistakes with File Locking (Advisory & Mandatory)
- Using POSIX fcntl() locks in library code — any other code in the same process that opens and closes the same file silently releases your locks. Libraries can't control what the rest of the process does with fds
- Assuming flock() works properly on NFS — Linux emulates flock() via fcntl() byte-range locks on NFS, which changes its ownership semantics from fd-owned to pid-owned. The behavior you tested locally won't match production
- Spinning with F_SETLK in a loop instead of using F_SETLKW — this wastes CPU for no reason. F_SETLKW blocks in the kernel with proper waitqueue semantics and wakes you when the lock is available
- Forgetting that ALL advisory locks are optional — they only work if every process accessing the file cooperates by checking locks. A rogue process that ignores locking can read and write freely
Related Topics
File Descriptors & File Tables, I/O Models: Blocking, Non-Blocking, Async, Inodes & File Metadata, Process Lifecycle (fork/exec/wait)
File Permissions, Ownership & ACLs — Security & Access Control
Difficulty: Starter
Twelve bits per file. Three for owner, three for group, three for others, plus setuid, setgid, and sticky. The kernel checks one class and stops -- owner first, then group, then others, no fallthrough. ACLs extend the model with per-user and per-group rules when three classes are not enough.
System Calls for File Permissions, Ownership & ACLs
- chmod
- chown
- setfacl (tool; sets ACLs via setxattr)
- getfacl (tool; reads ACLs via getxattr)
- umask
- access
Key Components in File Permissions, Ownership & ACLs
- struct inode (i_mode): The inode's i_mode field contains the 16-bit file type and permissions. Bits 0-8: rwx for other/group/owner. Bits 9-11: sticky, setgid, setuid. Bits 12-15: file type (regular, directory, symlink, etc.). The kernel reads i_mode on every open/access/exec call.
- struct cred (fsuid/fsgid): Each process has a credential structure with real, effective, saved, and filesystem UIDs/GIDs. Permission checks use fsuid/fsgid (filesystem UID/GID), which normally equal the effective UID/GID. NFS servers change fsuid without affecting the effective UID.
- POSIX ACL (access_acl / default_acl): Extended attributes (system.posix_acl_access, system.posix_acl_default) stored on the inode. Access ACLs define per-user/group permissions for a specific file. Default ACLs (directories only) set the template for ACLs on newly created files within.
- umask: A per-process bitmask (typically 022) that clears permission bits on file creation. When open() specifies mode 0666, the effective permissions are 0666 & ~022 = 0644. umask does NOT affect chmod or ACL operations.
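A small sketch of the umask arithmetic described above (and the ACL caveat that can override it):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        umask(022);                       /* clear group/other write bits at creation */

        int fd = open("/tmp/demo-perms", O_CREAT | O_WRONLY, 0666);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);
        /* 0666 & ~022 = 0644 -- unless the parent directory carries a default ACL,
           in which case the ACL, not umask, decides. */
        printf("mode: %04o\n", st.st_mode & 07777);

        close(fd);
        unlink("/tmp/demo-perms");
        return 0;
    }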
Key Points for File Permissions, Ownership & ACLs
- The permission check does NOT fall through. If your UID matches the file owner, ONLY owner bits are checked. Even if group or other bits grant full access, they're ignored. An owner with mode 0007 has zero access despite 'other' having rwx.
- setuid on a binary means the process runs as the file's owner, not as you. That's how /usr/bin/passwd (owned by root, setuid) can write to /etc/shadow. The kernel checks the setuid bit during execve(), flipping the effective UID before the program's first instruction runs.
- The sticky bit on /tmp (mode 1777) is the only thing stopping users from deleting each other's files. Without it, anyone with write permission on a directory can delete any file in it — regardless of the file's own permissions.
- ACL mask is the sneaky upper bound — if a named user ACL grants rwx but the mask is rw-, effective permission is rw-. When you chmod on an ACL-enabled file, you're actually modifying the mask, which silently restricts every non-owner ACL entry.
- access() checks the REAL UID, not the effective UID. That's deliberate — setuid programs use it to ask 'would the actual human who ran me be allowed to do this?' before performing the action with elevated privileges.
Common Mistakes with File Permissions, Ownership & ACLs
- Setting a directory to 0644 and wondering why nobody can cd into it — directories need the execute bit (x) for traversal. Without it, even the owner can't access files by name. Read without execute lets you list filenames but not stat or open anything inside.
- Reaching for chmod 777 as a fix — this removes all security, sets special bits unpredictably, and is never correct for production. Diagnose the actual mismatch with 'ls -la' and 'id', then fix the specific permission that's wrong.
- Setting umask carefully but seeing unexpected permissions — if the parent directory has a default ACL, umask is completely ignored for new files. The default ACL wins. This catches admins who rely on umask but have ACLs they forgot about.
- Trying to chmod a symlink — Linux doesn't implement lchmod(). The 0777 shown by ls for symlinks is cosmetic. The kernel follows the symlink and checks permissions on the target, always.
Related Topics
Linux Capabilities, SELinux & AppArmor, Audit Framework & Logging, Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
Kernel Tracing with ftrace, kprobes, and tracepoints — Debugging & Tracing
Difficulty: Advanced
The Linux kernel contains thousands of instrumentation points that can be activated at runtime without rebooting, recompiling, or installing debug packages. ftrace provides function-level tracing through a debugfs interface. kprobes allow inserting breakpoints at any kernel instruction address. Static tracepoints are compiled into the kernel source at carefully chosen locations with stable APIs.
System Calls for Kernel Tracing with ftrace, kprobes, and tracepoints
- perf_event_open
- ioctl
- mmap
- write
- read
Key Components in Kernel Tracing with ftrace, kprobes, and tracepoints
- /sys/kernel/debug/tracing: The debugfs mount point for ftrace. All configuration is done by writing to files in this directory: current_tracer selects the tracing backend, trace_pipe streams events, set_ftrace_filter restricts which functions are traced, and events/ contains toggles for every static tracepoint. No special tools required -- echo and cat are sufficient (the sketch after this list drives the same files from C).
- function tracer: The simplest ftrace tracer. When enabled, it records every kernel function entry with a timestamp and the calling CPU. Implemented via compiler-inserted mcount/fentry stubs at the start of every kernel function. When tracing is off, these stubs are patched to NOPs at runtime (dynamic ftrace), so the overhead of having ftrace compiled in but disabled is effectively zero.
- function_graph tracer: Extends the function tracer by also recording function returns, enabling call depth indentation and per-function duration measurement. The output resembles a call tree with timing. Implemented by replacing the return address on the kernel stack with a trampoline that records the exit timestamp before jumping back to the real caller.
- kprobes / kretprobes: Dynamic instrumentation that can be inserted at virtually any kernel instruction address at runtime. kprobes work by replacing the target instruction with a breakpoint (int3 on x86). When the breakpoint fires, a registered handler runs, then the original instruction is single-stepped. kretprobes hook function returns by trampolining through a stub. Unlike static tracepoints, kprobes require no kernel source changes but have no API stability guarantees.
- static tracepoints (tracepoints): Instrumentation points placed in the kernel source code by developers at semantically meaningful locations (e.g., sched:sched_switch, block:block_rq_issue). Implemented using static keys: when no consumer is attached, the tracepoint compiles down to a NOP. When enabled, the kernel patches the NOP into a jump to the tracing handler. The overhead of a disabled tracepoint is a single NOP instruction -- effectively zero.
- ring buffer (per-CPU): ftrace writes trace events into per-CPU ring buffers to avoid cross-CPU locking. Each CPU has its own buffer, and events are merged by timestamp when read. Buffer size is configurable via buffer_size_kb. When the buffer fills, oldest events are overwritten (flight recorder mode) unless tracing_on is set to 0.
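Because the interface is ordinary file I/O, the same steps echo performs can be driven from C; vfs_read is just an example target, root is required, and the tracefs path assumes debugfs is mounted at the usual location:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write a string into a tracefs control file. */
    static int tput(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0) return -1;
        ssize_t n = write(fd, val, strlen(val));
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int main(void)
    {
        const char *t = "/sys/kernel/debug/tracing";
        char path[256], buf[4096];

        /* Trace a single function only -- never enable the function tracer unfiltered. */
        snprintf(path, sizeof(path), "%s/set_ftrace_filter", t);
        tput(path, "vfs_read");                      /* example target function */
        snprintf(path, sizeof(path), "%s/current_tracer", t);
        tput(path, "function_graph");

        /* trace_pipe is a consuming, blocking read -- suitable for live monitoring. */
        snprintf(path, sizeof(path), "%s/trace_pipe", t);
        int fd = open(path, O_RDONLY);
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }
        close(fd);

        /* Always restore the nop tracer when done. */
        snprintf(path, sizeof(path), "%s/current_tracer", t);
        tput(path, "nop");
        return 0;
    }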
Key Points for Kernel Tracing with ftrace, kprobes, and tracepoints
- ftrace has near-zero overhead when disabled. Dynamic ftrace patches function entry points to NOPs at boot. Enabling tracing for a specific function patches just that NOP back to a call instruction. The rest of the kernel runs at full speed. This is why ftrace can be compiled into production kernels without fear.
- kprobes can instrument any kernel function, but the instrumented address is not part of any stable API. A kprobe attached to an internal function may break on the next kernel update if the function is renamed, inlined, or removed. Static tracepoints have stable interfaces across kernel versions. Prefer tracepoints when they exist; fall back to kprobes when they do not.
- function_graph tracing replaces return addresses on the kernel stack. If the traced function triggers an exception or oops, the stack trace may show ftrace trampoline addresses instead of the real callers. This is a known limitation. The ftrace infrastructure saves the real return addresses in a shadow stack, but crash dump tools may not decode them.
- Per-CPU ring buffers are the reason ftrace scales on multi-core systems. Each CPU writes to its own buffer without taking any locks. The only synchronization happens when reading the merged trace output. For high-frequency events (100k+ per second), increasing buffer_size_kb prevents data loss.
- trace-cmd and KernelShark are the standard tools for working with ftrace. trace-cmd handles the raw debugfs interface, manages per-CPU buffers, and produces trace.dat files. KernelShark provides a GUI timeline view. For scripted analysis, trace-cmd report produces text output that can be piped through standard Unix tools.
Common Mistakes with Kernel Tracing with ftrace, kprobes, and tracepoints
- Enabling the function tracer globally without set_ftrace_filter. Tracing every kernel function generates millions of events per second and can make the system unusable. Always filter to specific functions or use function_graph with a max_graph_depth limit. Start with a single function and widen the scope incrementally.
- Forgetting to disable tracing after a debug session. ftrace stays active until explicitly stopped. A forgotten function tracer with no filter can silently degrade system performance for days. Always run echo nop > /sys/kernel/debug/tracing/current_tracer when done.
- Attaching kprobes to functions that are called with interrupts disabled or while holding spinlocks. The kprobe handler itself must not sleep or take locks that could deadlock with the interrupted context. The handler runs in atomic context with preemption disabled. Allocating memory or calling printk from a kprobe handler can cause lockups.
- Assuming kprobe attachment points are stable across kernel versions. Internal function names change between releases. A kprobe on __blk_mq_run_hw_queue in kernel 5.15 may need to target blk_mq_run_hw_queue in 6.1. Always verify function availability in /proc/kallsyms or /sys/kernel/debug/tracing/available_filter_functions before deploying kprobe-based monitoring.
- Reading /sys/kernel/debug/tracing/trace instead of trace_pipe for live monitoring. The trace file is a snapshot that does not consume events; reading it repeatedly shows stale data. trace_pipe is a consuming read that blocks until new events arrive, making it suitable for real-time monitoring pipelines.
Related Topics
Perf Events & Performance Counters, Kernel Modules & Device Drivers, Interrupt Handling & Softirqs
Futexes: Fast User-Space Locking — Processes & Threads
Difficulty: Advanced
Under every lock in pthread, Go, Java, and Rust sits a futex. When nobody else wants the lock (99% of the time), acquisition is one atomic compare-and-swap in user memory -- 10 ns, zero syscalls. The kernel only gets involved when a thread actually needs to sleep.
System Calls for Futexes: Fast User-Space Locking
Key Components in Futexes: Fast User-Space Locking
- Futex word (u32 in user-space): A 32-bit integer in user-space memory that serves as the lock state. Threads atomically CAS this value to acquire/release the lock without any syscall. The kernel never touches this word directly -- it only uses its ADDRESS as a hash key for the wait queue. Typically 0 = unlocked, 1 = locked-no-waiters, 2 = locked-with-waiters.
- futex(FUTEX_WAIT, addr, expected): Atomically checks if *addr == expected and if so, sleeps the calling thread on a kernel wait queue keyed by addr. If *addr != expected (lock was released between user-space check and syscall), returns immediately with EAGAIN. This atomic check-and-sleep prevents the lost-wakeup race.
- futex(FUTEX_WAKE, addr, n): Wakes up to n threads sleeping on the wait queue keyed by addr. Typically n=1 for mutex unlock (wake one waiter) or n=INT_MAX for broadcast (pthread_cond_broadcast). The kernel hashes addr to find the correct wait queue bucket in the global futex hash table.
- Futex hash table (kernel): A global hash table (256 buckets by default, tunable) mapping futex addresses to wait queues. Each bucket has a spinlock protecting its chain of futex_q entries. Private futexes (FUTEX_PRIVATE_FLAG) hash by virtual address; shared futexes hash by (inode, page offset) for cross-process use.
Key Points for Futexes: Fast User-Space Locking
- The fast path is ZERO syscalls -- just an atomic CAS (lock cmpxchg on x86, ldxr/stxr on ARM64). 5-20 nanoseconds on modern CPUs. The kernel is only involved when a thread must sleep. Futexes are 100x faster than pure-kernel mutexes in the common case.
- The three-state protocol (0/1/2) eliminates unnecessary wakes. State 0 = unlocked. State 1 = locked, no waiters. State 2 = locked, has waiters. On unlock, if state was 1, just CAS to 0 -- no FUTEX_WAKE needed. Only state 2 triggers a wake syscall -- see the sketch after this list.
- FUTEX_LOCK_PI prevents priority inversion. When a high-priority thread blocks on a lock held by a low-priority thread, the kernel boosts the holder's priority. Used by real-time systems, Android Binder, and audio frameworks. Internally backed by rt_mutex.
- Robust futexes solve the crash-while-holding-lock problem. The kernel maintains a per-thread list of held futexes. On thread death, it sets FUTEX_OWNER_DIED and wakes a waiter. Used by glibc's PTHREAD_MUTEX_ROBUST.
- Private futexes (the default since glibc 2.10) hash by virtual address and only work within a single process. Shared futexes (for cross-process shared memory) hash by (inode, offset) and are ~30% slower.
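A user-space sketch of that three-state mutex, after Drepper's "Futexes Are Tricky"; the futex() wrapper below is our own, since glibc exposes no wrapper for the raw syscall:

    /* Three-state futex mutex: 0 = unlocked, 1 = locked, 2 = locked with
       (possible) waiters. Illustrative sketch, not production code. */
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    void mutex_lock(atomic_int *f)
    {
        int c = 0;
        if (atomic_compare_exchange_strong(f, &c, 1))
            return;                          /* fast path: one CAS, no syscall */
        /* slow path: mark contended, sleep until we win the exchange */
        while (atomic_exchange(f, 2) != 0)
            futex(f, FUTEX_WAIT_PRIVATE, 2); /* sleeps only while *f == 2 */
    }

    void mutex_unlock(atomic_int *f)
    {
        if (atomic_exchange(f, 0) == 2)      /* state 2: someone may sleep */
            futex(f, FUTEX_WAKE_PRIVATE, 1); /* wake exactly one waiter */
    }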
Common Mistakes with Futexes: Fast User-Space Locking
- Mistake: Calling FUTEX_WAIT without trying CAS first. Reality: The syscall costs ~200ns vs ~10ns for an uncontended CAS. Always try the atomic operation first. The kernel is the fallback, not the default path.
- Mistake: Always calling FUTEX_WAKE on unlock. Reality: If no threads are sleeping (state was 1, not 2), the wake is a wasted syscall. The three-state protocol exists to avoid this -- check the state before waking.
- Mistake: Using FUTEX_WAIT with a shared mapping but forgetting to clear FUTEX_PRIVATE_FLAG. Reality: Private futexes hash by virtual address. Two processes mapping the same shared memory at different addresses hash to different buckets -- the waiter and waker never find each other.
- Mistake: Not handling spurious wakeups. Reality: FUTEX_WAIT can return 0 without a corresponding FUTEX_WAKE (signal delivery, kernel requeue). Always re-check the futex word in a loop, just like pthread_cond_wait.
Related Topics
POSIX Threads, Process Scheduling (CFS), System Calls: User to Kernel Transition, Shared Memory & Semaphores
Hard Links & Symbolic Links — File Systems & I/O
Difficulty: Starter
A hard link adds another directory entry pointing to the same inode -- i_nlink goes up by one, and data blocks are freed only when it drops to 0 with no open fds remaining. A symlink is a separate inode (type S_IFLNK) holding a target path string; the VFS follows it during path resolution, up to 40 hops before returning ELOOP. Hard links cannot cross filesystem boundaries. Symlinks can point anywhere, including paths that do not exist.
System Calls for Hard Links & Symbolic Links
- link
- symlink
- unlink
- readlink
- rename
Key Components in Hard Links & Symbolic Links
- Directory entry (dentry): Maps a filename string to an inode number; hard links create additional dentries pointing to the same inode
- struct inode i_nlink: Hard link count on the inode; when decremented to 0 and no open fds remain, the kernel frees the inode and data blocks
- Symlink inode: A separate inode with file type S_IFLNK; for short targets (<60 bytes on ext4), the path is stored inline in the inode (fast symlink)
- VFS path walk (namei): Resolves pathnames component by component; transparently follows symlinks up to a limit (40 traversals per lookup on modern kernels; older kernels also capped nesting depth at 8) to prevent loops
Key Points for Hard Links & Symbolic Links
- Hard links can't cross filesystem boundaries — period. Inode numbers are only unique within a filesystem, so a directory entry on /dev/sda1 can't reference an inode on /dev/sdb1. Symlinks work across filesystems because they store a path, not an inode number
- You can't hard-link directories (link() returns EPERM) because it would create cycles in the filesystem tree, breaking find, fsck, and every tool that assumes directories form a tree. Only the kernel creates the '.' and '..' entries
- Symlink loops don't hang the kernel — it counts. ELOOP fires after 40 symlink traversals in a single path resolution (pre-4.2 kernels also capped nesting depth at 8). If you see ELOOP, you've got a circular chain
- rename() on the same filesystem is atomic because it's just a dentry operation — the inode, data blocks, and permissions are untouched. This is the foundation of every crash-safe config update: write to temp, rename over target (see the sketch after this list)
- Symlink permissions (lrwxrwxrwx) are cosmetic — Linux ignores them entirely. Access control is always determined by the target file's permissions, never the symlink's
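A minimal sketch of that write-temp-then-rename pattern; the "<path>.tmp" naming is an assumption made here so the temp file stays on the same filesystem as the target:

    /* Crash-safe replace: readers see either the old or the new file,
       never a partial write. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int atomic_replace(const char *path, const void *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.tmp", path); /* same fs as target */

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        return rename(tmp, path);   /* atomic dentry swap on one fs */
    }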
Common Mistakes with Hard Links & Symbolic Links
- Thinking unlink() deletes a file — it removes ONE name. The data survives as long as other hard links exist or any process holds an open fd. Check st_nlink and lsof to understand what's really happening
- Creating symlinks with relative paths, then moving the symlink — the relative target resolves from the symlink's new location, not the old one. The symlink dangles, and you stare at "No such file or directory"
- Trying rename() across filesystem boundaries — it fails with EXDEV. You must copy + unlink instead, which is NOT atomic. That's what mv does internally when crossing devices
- Ignoring ELOOP when following symlinks — symlink chains or cycles cause failures that look like "file not found" if you don't check the specific error. Set O_NOFOLLOW when you want to operate on the symlink itself
Related Topics
Inodes & File Metadata, Directory Entries & Path Resolution, File Descriptors & File Tables, Virtual File System (VFS)
Heap Allocators (malloc internals) — Memory Management
Difficulty: Advanced
malloc() and free() are not syscalls -- they manage a private memory pool in user space, built on top of brk() and mmap(). The allocator (ptmalloc2, jemalloc, tcmalloc) keeps freed memory on its own shelves for reuse rather than handing it back to the OS. That is why RSS refuses to shrink after free().
System Calls for Heap Allocators (malloc internals)
Key Components in Heap Allocators (malloc internals)
- malloc_state (arena): Per-arena metadata containing bin arrays, top chunk pointer, and mutex. ptmalloc2 creates multiple arenas -- by default up to 8x the core count on 64-bit -- so threads spread across them and contend less on arena locks
- malloc_chunk: Header prepended to every allocation; contains size (with low bits encoding flags: PREV_INUSE, IS_MMAPPED, NON_MAIN_ARENA) and prev_size for coalescing
- bins (fast, small, large, unsorted): Free lists organized by chunk size; fastbins (16-80 bytes, LIFO, no coalescing), smallbins (exact size, FIFO), largebins (sorted by size), unsorted bin (recently freed chunks awaiting sorting)
- tcache (thread cache): Per-thread singly-linked free list (glibc 2.26+) holding up to 7 chunks per size class; serves malloc/free without touching arena mutexes, dramatically reducing contention
Key Points for Heap Allocators (malloc internals)
- malloc(16) actually allocates 32 bytes -- 16 for the payload, 16 for the chunk header; the minimum chunk size is 32 bytes because freed chunks need space for the free-list pointers
- Anything above 128 KB bypasses the heap entirely and goes straight to mmap -- these chunks ARE returned to the OS on free() via munmap, but at the cost of a syscall per allocation
- The top chunk sits at the end of the heap, and it is the only chunk that can shrink the heap via brk() -- if a single small allocation sits at the very top, all the memory below it stays trapped
- Fastbins are the speed trap: they serve allocations in nanoseconds but never coalesce freed chunks, causing fragmentation when allocation sizes vary; consolidation only happens when a large allocation triggers it
- jemalloc and tcmalloc exist because ptmalloc2's per-arena fragmentation is a fundamental design flaw -- memory freed in one arena cannot be reused by another, and both alternatives solve this with cross-thread deallocation
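A small glibc-specific demo of that hoarding behavior -- malloc_stats() and malloc_trim() are real glibc calls, though the numbers they print vary by system:

    /* Free 64 MB of small chunks and watch RSS stay put until trim. */
    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
        void *blocks[1024];
        for (int i = 0; i < 1024; i++)
            blocks[i] = malloc(64 * 1024);  /* 64 KB: below mmap threshold */
        for (int i = 0; i < 1024; i++)
            free(blocks[i]);                /* chunks go to bins, not the OS */

        malloc_stats();                     /* "in use" drops, "system" does not */
        malloc_trim(0);                     /* release free top-of-heap pages */
        malloc_stats();                     /* now "system bytes" shrinks too */
        return 0;
    }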
Common Mistakes with Heap Allocators (malloc internals)
- Expecting free() to reduce RSS -- glibc hoards freed memory in bins for reuse; only mmap'd chunks (>128 KB) and brk-shrink of the top chunk actually reduce RSS; this is the #1 source of 'memory leak' false alarms
- Double-free bugs -- freeing a chunk twice corrupts the free list; glibc 2.26+ has tcache double-free detection, but older versions are exploitable for arbitrary code execution
- Heap buffer overflows -- writing past an allocation corrupts the next chunk's metadata; on the next malloc/free, glibc detects the damage and aborts with 'corrupted size vs. prev_size'
- Ignoring malloc overhead for small objects -- a million 16-byte allocations consume 32 MB (32 bytes each), not 16 MB; for small objects, a slab or arena allocator is 2-4x more memory-efficient
Related Topics
Virtual Memory & Address Spaces, mmap & Memory-Mapped Files, OOM Killer & Memory Pressure, Memory Cgroups & Resource Limits
Huge Pages & THP — Memory Management
Difficulty: Advanced
Instead of the default 4 KB, Linux can use 2 MB or 1 GB pages. Fewer, larger pages mean fewer TLB entries to manage and fewer page table walks -- less CPU time burned on address translation. The cost: coarser memory granularity and, with THP, potential latency spikes from compaction.
System Calls for Huge Pages & THP
Key Components in Huge Pages & THP
- hugetlb_page_region: Preallocated pool of huge pages managed by the hugetlbfs filesystem; pages are reserved at boot or runtime via /proc/sys/vm/nr_hugepages and never fragmented by the buddy allocator
- khugepaged: Kernel thread that scans process address spaces for 512 consecutive 4 KB pages that can be collapsed into a single 2 MB THP; runs in the background with configurable scan interval and CPU limits
- struct page (compound page): A huge page is represented as a compound page: a head page followed by 511 tail pages; the head page tracks compound_order=9 (2^9 * 4KB = 2MB) and all tail pages point back to the head
- PMD entry (Page Size bit): On x86-64, a 2 MB huge page uses a PMD entry with the PS (Page Size) bit set, eliminating the need for a PTE page table level; the PMD directly maps the 2 MB virtual range to the physical frame
Key Points for Huge Pages & THP
- The TLB coverage math is brutal -- 64 entries at 4 KB covers 256 KB, but at 2 MB covers 128 MB; for any working set above 256 KB (which is every database, every JVM, every real workload), this 512x difference dominates performance
- THP is the convenient trap -- it works transparently, but khugepaged compaction can stall your process for milliseconds while the kernel moves pages around to create contiguous 2 MB regions; this is why Redis says 'disable THP'
- Explicit hugetlbfs pages trade flexibility for reliability -- they are preallocated, pinned, never swapped, never fragmented, but waste memory at 2 MB granularity; a 2.1 MB allocation consumes two 2 MB pages, wasting nearly half of it
- 1 GB huge pages exist but must be reserved at boot -- the buddy allocator cannot produce 1 GB contiguous regions at runtime because memory is already fragmented; boot-time reservation carves them before anything else runs
- Both Redis and PostgreSQL warn about THP -- Redis because latency spikes from compaction are unacceptable, PostgreSQL because THP interacts poorly with its buffer management; but PostgreSQL strongly recommends explicit huge pages
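A sketch of the madvise opt-in route mentioned above; MADV_HUGEPAGE is the real flag, and the 256 MB region size is arbitrary:

    /* Opt one region in to THP instead of enabling 'always' system-wide. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 256UL << 20;           /* 256 MB, a multiple of 2 MB */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        if (madvise(p, len, MADV_HUGEPAGE) != 0) /* works in 'madvise' mode */
            perror("madvise");

        /* touching pages lets the fault path (or khugepaged) use 2 MB THPs */
        for (size_t i = 0; i < len; i += 4096)
            ((char *)p)[i] = 1;
        return 0;
    }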
Common Mistakes with Huge Pages & THP
- Enabling THP as 'always' on latency-sensitive workloads -- khugepaged compaction can cause multi-millisecond stalls; use 'madvise' mode and let applications opt in on specific regions
- Not reserving enough huge pages upfront -- if the hugetlb pool runs out, mmap with MAP_HUGETLB fails with ENOMEM; applications may fall back silently to 4 KB pages without telling you
- Assuming huge pages always help -- for sparse access patterns across huge address ranges, 2 MB pages waste physical memory (up to a full 2 MB mapped for a single accessed byte) and the TLB benefit does not compensate
- Forgetting that hugetlbfs pages are pinned -- they cannot be swapped or reclaimed; over-reserving starves the page cache and other processes of memory they actually need
Related Topics
Page Tables & TLB, Virtual Memory & Address Spaces, NUMA Architecture & Memory Policy, mmap & Memory-Mapped Files
Inodes & File Metadata — File Systems & I/O
Difficulty: Starter
Everything about a file except its name lives in the inode: owner, permissions, timestamps, size, and pointers to data blocks on disk. One inode per file, no exceptions. The count is fixed when the filesystem is created -- run out and no new files can be made, even with terabytes free.
System Calls for Inodes & File Metadata
- stat
- fstat
- lstat
- statx
- chmod
- chown
Key Components in Inodes & File Metadata
- struct inode (VFS): In-memory representation of file metadata: mode, uid, gid, size, timestamps, link count, and pointers to file operations
- struct stat (userspace): Userspace view of inode metadata returned by stat()/fstat(); contains st_ino, st_nlink, st_mode, st_size, st_blocks
- On-disk inode (e.g., ext4_inode): Persistent inode structure on disk; ext4 uses 256 bytes per inode, stores block map or extent tree
- Inode cache (inode_hashtable): Kernel hash table caching recently used VFS inodes to avoid repeated disk reads; evicted under memory pressure
Key Points for Inodes & File Metadata
- Inode numbers are only unique within a single filesystem — two files on different mounts can share the same number. This is exactly why hard links can't cross filesystem boundaries: there's no cross-device inode lookup
- Run out of inodes and you can't create files, period — even with terabytes free. The inode count is fixed at mkfs time (unless you're on XFS/Btrfs which allocate dynamically). Check with df -i before it's too late
- statx() is the modern stat() — it returns only the fields you ask for (AT_STATX_DONT_SYNC skips NFS roundtrips), adds birth time (btime), and is extensible without needing new syscalls (see the sketch after this list)
- Delete a file while a process has it open and the data survives — the inode's link count drops to 0 but the file persists until the last fd closes. This is how atomic file replacement and safe log rotation work
- ext4 replaced indirect block pointers with an extent tree — one (start, length) pair describes a contiguous run of blocks, collapsing massive metadata overhead for large files compared to ext3's triple-indirect scheme
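A minimal statx() sketch for the point above (the glibc wrapper exists since 2.28); note the stx_mask check before trusting btime:

    /* Ask only for what we need; check stx_mask for what we actually got. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        struct statx sx;
        if (statx(AT_FDCWD, argv[1], AT_SYMLINK_NOFOLLOW,
                  STATX_INO | STATX_SIZE | STATX_NLINK | STATX_BTIME,
                  &sx) != 0) {
            perror("statx");
            return 1;
        }
        printf("ino %llu  size %llu  nlink %u\n",
               (unsigned long long)sx.stx_ino,
               (unsigned long long)sx.stx_size, sx.stx_nlink);
        if (sx.stx_mask & STATX_BTIME)      /* not every fs records btime */
            printf("btime %lld\n", (long long)sx.stx_btime.tv_sec);
        return 0;
    }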
Common Mistakes with Inodes & File Metadata
- Seeing "No space left on device" and only checking df -h — the real culprit might be inode exhaustion. Always check df -i too. They're different failure modes with the same error message
- Using stat() on a symlink and getting the target's metadata — stat() follows symlinks by default. Use lstat() to inspect the symlink itself, or you'll think the symlink is a regular file
- Expecting birth time (file creation time) to be available everywhere — ext4 stores it, but stat() doesn't expose it. You need statx() with STATX_BTIME, and even then not all filesystems record it
- Relying on inode numbers for file identity across reboots on tmpfs/procfs — pseudo-filesystems allocate inodes dynamically and numbers are not stable between reboots
Related Topics
File Descriptors & File Tables, Hard Links & Symbolic Links, Directory Entries & Path Resolution, Virtual File System (VFS)
Inotify & fanotify: File System Events — File Systems & I/O
Difficulty: Intermediate
Rather than polling the filesystem in a loop, a process registers watches on files or directories and gets events -- create, modify, delete, rename, move -- delivered through a file descriptor. inotify covers individual directories; fanotify scales to entire mount points.
System Calls for Inotify & fanotify: File System Events
- inotify_init1
- inotify_add_watch
- inotify_rm_watch
- fanotify_init
- fanotify_mark
- read
Key Components in Inotify & fanotify: File System Events
- inotify instance (fd): A file descriptor created by inotify_init1() that represents a set of watches. Events from all watched files/directories are multiplexed onto this single fd, which can be polled with epoll/select for non-blocking event processing.
- struct inotify_event: Variable-length event structure delivered via read() on the inotify fd. Contains watch descriptor (wd), event mask (what happened), cookie (links rename FROM/TO pairs), and optional filename (for directory watches, identifies which child triggered the event).
- fanotify notification group: Created by fanotify_init() with class flags (FAN_CLASS_NOTIF, FAN_CLASS_CONTENT, FAN_CLASS_PRE_CONTENT). Unlike inotify, fanotify can monitor entire mount points or filesystems, delivers open file descriptors instead of names, and supports permission events where the kernel blocks until the handler allows/denies access.
- Watch descriptor (wd): Integer handle returned by inotify_add_watch() identifying a specific watch. Used in events to indicate which watch was triggered, and passed to inotify_rm_watch() for removal. Watch descriptors are recycled after removal.
Key Points for Inotify & fanotify: File System Events
- inotify has no recursive watching. Watching /etc only reports events in /etc itself, not /etc/nginx/conf.d/. You must walk the tree and add a watch to every subdirectory manually. This is why chokidar (webpack's watcher) can exhaust watch limits on large projects.
- fanotify can block file access until your handler says "allow" or "deny." This is how on-access antivirus works -- ClamAV's clamonacc intercepts file opens via FAN_CLASS_CONTENT, scans the file, and only then lets the process proceed.
- The default max_user_watches is 8192 on many systems. A single node_modules tree can have 50,000+ directories. Running out gives the confusing "ENOSPC: no space left on device" error even with plenty of disk space. Fix: sysctl fs.inotify.max_user_watches=524288.
- Rename detection uses IN_MOVED_FROM and IN_MOVED_TO paired by a cookie value. If you see IN_MOVED_FROM without a matching IN_MOVED_TO, the file was moved outside your watched tree. This cookie pairing is essential for file sync tools.
- The kernel coalesces rapid identical events -- 1000 writes to the same file may produce only one IN_MODIFY event. You cannot count exact modifications through inotify. You can only know that something changed.
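A sketch of the directory-watch loop with correct variable-length parsing -- the watch on /etc is just an example:

    /* Watch a directory; parse the variable-length event stream. */
    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = inotify_init1(IN_CLOEXEC);
        /* directory watch catches write-tmp-then-rename config updates */
        inotify_add_watch(fd, "/etc", IN_MODIFY | IN_MOVED_TO);

        char buf[4096]
            __attribute__((aligned(__alignof__(struct inotify_event))));
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            for (char *p = buf; p < buf + n; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                if (ev->mask & IN_Q_OVERFLOW)
                    fprintf(stderr, "queue overflow: rescan the tree\n");
                else if (ev->len)            /* name only on dir watches */
                    printf("0x%08x %s\n", ev->mask, ev->name);
                /* advance by header size plus the variable name length */
                p += sizeof(struct inotify_event) + ev->len;
            }
        }
        close(fd);
        return 0;
    }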
Common Mistakes with Inotify & fanotify: File System Events
- Mistake: Watching a config file directly for IN_MODIFY and missing atomic replacements. Reality: Tools like Kubernetes and vim replace files via write-to-tmp then rename. The watch follows the old inode, not the name. Watch the directory for IN_MOVED_TO instead.
- Mistake: Treating inotify events as fixed-size structs. Reality: Events are variable-length because the name field varies. You must advance the buffer pointer by sizeof(struct inotify_event) + event->len per event, or you will read garbage.
- Mistake: Using inotify for filesystem-wide auditing. Reality: inotify requires per-file/directory watches. fanotify can mark an entire mount point with a single call, using far fewer kernel resources.
- Mistake: Ignoring IN_Q_OVERFLOW. Reality: When the event queue fills up (default 16384 events), the kernel drops events and delivers IN_Q_OVERFLOW. Your application must handle this by re-scanning watched directories to reconcile state.
Related Topics
epoll & I/O Multiplexing, File Descriptors & File Tables, Virtual File System (VFS), Signals & Signal Handling
Interrupt Handling & Softirqs — Kernel Internals
Difficulty: Advanced
Hardware's way of yanking the CPU out of whatever it is doing. A packet lands, a disk transfer finishes, and the device fires an interrupt mid-instruction. Linux splits the response: a fast top half acknowledges the hardware in microseconds, then a deferred bottom half (softirq or tasklet) handles the actual processing with interrupts re-enabled.
System Calls for Interrupt Handling & Softirqs
Key Components in Interrupt Handling & Softirqs
- IDT (Interrupt Descriptor Table): x86 CPU table mapping interrupt vectors (0-255) to handler addresses. Vectors 0-31 are CPU exceptions (page fault=14, GPF=13), 32-255 are for device interrupts. The IDTR register points to the IDT, loaded during boot.
- irq_desc / irq_chip: Per-IRQ kernel structures. irq_desc holds the handler chain (shared IRQs have multiple handlers), flags, and statistics. irq_chip abstracts the interrupt controller (APIC, GIC) operations: ack, mask, unmask, set affinity.
- softirq: Fixed set of 10 deferred execution contexts (HI, TIMER, NET_TX, NET_RX, BLOCK, IRQ_POLL, TASKLET, SCHED, HRTIMER, RCU). Run immediately after hardirq with interrupts enabled. Cannot sleep. Processed by ksoftirqd if load is high.
- workqueue: Kernel threads that execute deferred work in process context; they CAN sleep, allocate memory with GFP_KERNEL, and take mutexes. Used for heavy processing (filesystem I/O, USB transfers). system_wq is the default; drivers can create dedicated workqueues.
Key Points for Interrupt Handling & Softirqs
- IRQ affinity is a major tuning knob. MSI/MSI-X interrupts are delivered to the local APIC of a chosen core, and /proc/irq/N/smp_affinity controls which CPUs handle each device. Pinning NIC interrupts to the same core running your application keeps packet data in L1 cache.
- Most device drivers since kernel 4.x use threaded IRQs by default. The hardirq handler is minimal -- it just wakes a kernel thread that runs the main handler in process context. This improves latency predictability dramatically.
- NAPI flips from interrupts to polling under load. After the first packet, the driver disables NIC interrupts and polls for batches of packets. This prevents interrupt livelock at high packet rates. It is why ksoftirqd often shows high CPU on network-heavy servers.
- Softirqs are re-entrant across CPUs (same type can run on different cores simultaneously) but non-preemptible on a single CPU. If they take too long, the kernel defers remaining work to ksoftirqd to prevent user-space starvation.
- /proc/interrupts shows per-CPU interrupt counts for every IRQ line. A sudden spike means a device is generating excessive interrupts. Unbalanced columns mean poor IRQ affinity. irqbalance tries to distribute load automatically.
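Setting affinity is a one-line write to procfs; a sketch, where IRQ number 45 and the CPU mask are made-up examples (root required):

    /* Route IRQ 45 to CPU 2 only (hex bitmask 0x4). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/irq/45/smp_affinity", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fputs("4\n", f);    /* hex CPU bitmask: bit 2 => CPU 2 */
        return fclose(f);
    }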
Common Mistakes with Interrupt Handling & Softirqs
- Mistake: Doing heavy processing in the hardirq handler. Reality: The hardirq runs with the interrupt line masked. Spending too long here delays other devices and causes packet drops. Move all non-essential work to softirqs or workqueues.
- Mistake: Calling sleeping functions from softirq or hardirq context. Reality: These contexts have no process to schedule away from. Use GFP_ATOMIC for allocation, spinlocks for synchronization. kmalloc with GFP_KERNEL or mutex_lock will deadlock or panic.
- Mistake: Ignoring ksoftirqd CPU usage. Reality: When ksoftirqd threads eat significant CPU, softirq processing is exceeding its inline budget (~2ms). This is common on network-heavy systems and signals the need for RSS, RPS, or interrupt affinity tuning.
- Mistake: Setting IRQ affinity without considering cache topology. Reality: Pinning a NIC IRQ and the application to the same CPU socket (same L3 cache) cuts memory latency. Cross-NUMA IRQ handling adds 100+ ns per packet in cache misses.
Related Topics
System Calls: User to Kernel Transition, Timers, Clocks & High-Resolution Timers, Process Scheduling (CFS), Kernel Data Structures
I/O Models: Blocking, Non-Blocking, Async — File Systems & I/O
Difficulty: Advanced
Five I/O models, each born from the previous one hitting a wall. Blocking ties a thread to every connection. Non-blocking avoids the sleep but burns CPU polling. select/poll let one thread wait on many fds but scan all of them every time. epoll notifies only the ready ones. io_uring hands the kernel a batch of operations and picks up results later -- no syscall per I/O.
System Calls for I/O Models: Blocking, Non-Blocking, Async
- fcntl
- poll
- select
- epoll_wait
- aio_read
Key Components in I/O Models: Blocking, Non-Blocking, Async
- Wait queue (wait_queue_head_t): Kernel structure where processes block waiting for I/O readiness; each socket/pipe/file has one, and epoll/poll/select use them for notification
- struct eventpoll (epoll): Kernel object backing an epoll fd; contains an RB-tree of monitored fds and a ready list (linked list) of fds with pending events
- struct poll_table_struct: Callback mechanism used by poll/select/epoll to register interest in wait queues during the f_op->poll() call on each monitored fd
- struct kiocb (kernel I/O control block): Tracks an in-flight async I/O operation for io_submit()/io_uring; contains completion callback and buffer pointers
Key Points for I/O Models: Blocking, Non-Blocking, Async
- epoll does not scan. When a packet arrives, the network stack calls ep_poll_callback(), which adds the fd to a ready list. epoll_wait() just drains that list. Cost: O(ready fds), not O(total fds). That is why it scales to millions of connections.
- Level-triggered epoll re-notifies you every call if data remains. Edge-triggered only notifies on state transitions -- you MUST drain the fd with non-blocking reads until EAGAIN, or the remaining data goes silent. Missing this is the most common epoll bug (see the sketch after this list).
- select() is capped at 1024 fds and copies the entire fd set to/from kernel on every call. poll() removes the fd limit but still scans linearly. Both are O(n) per call, even if only one fd is ready.
- EPOLLONESHOT prevents two threads from handling the same fd simultaneously in multithreaded epoll. It disables the fd after one event -- the owning thread must re-arm it with EPOLL_CTL_MOD before the next event fires.
- Blocking I/O is not always wrong. For low-connection, high-throughput workloads (file processing, video transcoding), blocking I/O with a thread pool is simpler and can outperform epoll by avoiding syscall overhead and state machine complexity.
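A sketch of the edge-triggered contract -- register with EPOLLET, then drain to EAGAIN on every wakeup; the fd is assumed to already be O_NONBLOCK:

    /* ET epoll: one notification per state change, so drain completely. */
    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    void watch(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    void on_readable(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0)
                continue;            /* ...process buf[0..n) here... */
            if (n == 0 || errno != EAGAIN)
                close(fd);           /* peer closed or hard error */
            return;                  /* EAGAIN: fully drained */
        }
    }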
Common Mistakes with I/O Models: Blocking, Non-Blocking, Async
- Mistake: Using edge-triggered epoll without non-blocking sockets. Reality: If you read only part of the available data, ET will not re-notify you. The remaining data stalls until new data arrives. The connection appears frozen.
- Mistake: Adding the same fd to multiple epoll instances. Reality: It works, but the kernel wakes ALL epoll instances when the fd becomes ready, causing thundering herd issues and wasted CPU.
- Mistake: Assuming closed fds auto-remove from epoll. Reality: The kernel auto-removes when the underlying file description is destroyed. But if another fd still references the same struct file (via dup), the epoll entry persists and causes spurious events.
- Mistake: Using select() with fds numbered above 1023. Reality: FD_SET uses the fd as an array index into a fixed-size bitfield. Fds above 1023 corrupt memory. This is undefined behavior and often a security vulnerability.
Related Topics
io_uring: Modern Async I/O, File Descriptors & File Tables, Page Cache & Block I/O, TCP State Machine & Connection Lifecycle
io_uring: Modern Async I/O — File Systems & I/O
Difficulty: Advanced
Shared ring buffers between user space and the kernel, introduced in Linux 5.1. Applications write I/O requests into a submission ring; the kernel writes results into a completion ring. Zero syscalls in the hot path with SQPOLL mode. Works for both disk and network I/O -- the first Linux interface that can actually saturate modern NVMe hardware.
System Calls for io_uring: Modern Async I/O
- io_uring_setup
- io_uring_enter
- io_uring_register
Key Components in io_uring: Modern Async I/O
- Submission Queue (SQ): Shared-memory ring buffer where userspace writes Submission Queue Entries (SQEs) describing I/O operations; the kernel consumes entries from this ring
- Completion Queue (CQ): Shared-memory ring buffer where the kernel writes Completion Queue Entries (CQEs) containing I/O results; userspace polls this ring for completed operations
- SQE (struct io_uring_sqe): 64-byte submission entry specifying: opcode (read/write/accept/etc.), fd, buffer address, length, offset, user_data tag, and flags (IOSQE_LINK, IOSQE_FIXED_FILE)
- io_uring_params / io_uring_cqe: Setup parameters (IORING_SETUP_SQPOLL for kernel-side polling) and 16-byte completion entry (result code + user_data for request correlation)
Key Points for io_uring: Modern Async I/O
- SQPOLL mode spawns a kernel thread that drains the submission queue without any syscall. Your app writes an SQE, bumps a pointer, and the kernel picks it up. True zero-syscall I/O in steady state. This is how you saturate NVMe.
- Fixed files and fixed buffers (via io_uring_register) skip per-operation fd lookup and page pinning. For high-IOPS workloads on NVMe, this eliminates the remaining kernel overhead that was not syscall-related.
- Linked SQEs (IOSQE_IO_LINK) chain operations: read then process then write, each starting only after the previous completes. Complex I/O workflows without returning to userspace between steps.
- Unlike epoll, which only tells you an fd is ready and then requires separate read/write syscalls, io_uring handles the entire operation -- accept, recv, send -- as a single submitted entry. One interface for both file and network I/O.
- Size the CQ larger than the SQ (IORING_SETUP_CQSIZE). If completions arrive faster than you drain them, the CQ overflows and the kernel silently drops completions. This is one of the hardest bugs to debug.
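A minimal read via liburing (assumed installed; link with -luring), with user_data carrying the correlation tag described above:

    /* One SQE in, one CQE out. */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) != 0)  /* 8-entry SQ/CQ */
            return 1;

        int fd = open("/etc/hostname", O_RDONLY);
        char buf[256];

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        sqe->user_data = 42;                 /* tag comes back in the CQE */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("tag=%llu res=%d\n",
               (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(&ring, cqe);       /* hand the CQE slot back */

        io_uring_queue_exit(&ring);
        return 0;
    }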
Common Mistakes with io_uring: Modern Async I/O
- Mistake: Not draining the CQ frequently enough. Reality: CQ overflow causes the kernel to drop completions and set IORING_SQ_CQ_OVERFLOW. You must call io_uring_enter(IORING_ENTER_GETEVENTS) to recover, and you may have already lost data.
- Mistake: Expecting true async behavior from buffered file I/O. Reality: Buffered I/O goes through io-wq worker threads internally. You get an async API, but workers consume CPU and memory behind the scenes. True async requires O_DIRECT.
- Mistake: Enabling SQPOLL mode without understanding the cost. Reality: The polling thread consumes a full CPU core even when idle unless you set sq_thread_idle to park it. Also requires root or CAP_SYS_NICE since Linux 5.12.
- Mistake: Skipping IORING_SETUP_COOP_TASKRUN in non-SQPOLL mode. Reality: Without it, the kernel uses inter-processor interrupts to signal completions, adding latency on multi-core systems. Available since Linux 5.19.
Related Topics
I/O Models: Blocking, Non-Blocking, Async, Page Cache & Block I/O, File Descriptors & File Tables, Virtual File System (VFS)
kdump & Crash Analysis — Debugging & Tracing
Difficulty: Advanced
When a kernel panics, everything in memory is about to vanish. kdump preserves the entire kernel state by booting a secondary kernel via kexec and writing the crashed kernel's memory to disk. The crash tool then opens the resulting vmcore for full post-mortem analysis -- backtraces, memory contents, task states, and the complete dmesg log.
System Calls for kdump & Crash Analysis
- kexec_load
- kexec_file_load
- reboot
Key Components in kdump & Crash Analysis
- kexec: The mechanism that makes kdump possible. kexec_load() or kexec_file_load() preloads a secondary kernel and initramfs into reserved memory. On panic, kexec bypasses BIOS/UEFI and boots directly into the secondary kernel. This avoids the firmware re-initialization that would overwrite the crashed kernel's memory.
- kdump (crash dump mechanism): The framework that ties kexec to panic handling. When the primary kernel panics, the kdump infrastructure triggers kexec to boot the secondary (capture) kernel. The capture kernel mounts /proc/vmcore, which exposes the crashed kernel's physical memory as an ELF file. The kdump service then copies this to persistent storage.
- makedumpfile: Compresses and filters the raw vmcore. A 64 GB system produces a 64 GB raw dump. makedumpfile strips free pages, zero pages, and cache pages, then compresses the remainder with zlib or lzo. A typical filtered dump is 2-5% of total RAM. The -d (dump level) flag controls which page types to exclude -- level 31 excludes all unnecessary pages.
- crash (analysis tool): The post-mortem debugger for vmcore files. Built on top of GDB, it understands kernel data structures natively. Commands like bt (backtrace), ps (process list), log (dmesg), struct (structure inspection), and rd (raw memory read) provide full visibility into the crashed kernel's state. Requires the matching vmlinux with debug symbols.
- /proc/vmcore: Exposed by the capture kernel after kexec boot. It presents the crashed kernel's physical memory as an ELF core file. The ELF program headers describe the physical memory layout (which regions are RAM, which are MMIO). makedumpfile reads from /proc/vmcore and writes the filtered dump to the target location.
- crashkernel= boot parameter: Reserves memory for the capture kernel at boot time. The primary kernel does not use this reserved region, so it survives the panic and kexec transition. Common values: crashkernel=256M for systems under 64 GB, crashkernel=512M for larger systems. The "auto" option lets the distribution choose based on total RAM.
Key Points for kdump & Crash Analysis
- kdump works because of a two-kernel design. The primary kernel reserves memory at boot for a secondary kernel. On panic, kexec boots the secondary kernel into that reserved memory. The secondary kernel can read the crashed kernel's memory via /proc/vmcore because that memory was never overwritten -- the secondary kernel runs entirely within the reserved region.
- The kexec_load() syscall preloads the capture kernel and initramfs into reserved memory. The kexec_file_load() variant is newer and supports signed kernels (required when Secure Boot is enabled). Both store the kernel image in reserved memory so that the panic path has zero disk I/O -- it just jumps to the preloaded kernel.
- makedumpfile's dump levels control what gets excluded from the vmcore. Level 1 excludes zero pages. Level 2 adds cache pages. Level 31 excludes zero, cache, private cache, user, and free pages -- typically reducing a 64 GB dump to 1-3 GB. The trade-off: higher dump levels lose more potentially useful data. For kernel debugging, level 31 is standard. For user-space memory forensics, level 1 preserves more.
- The crash tool is not just a memory viewer. It reconstructs kernel state by parsing the vmcore with knowledge of kernel data structures. It reads the task_struct list to show all processes, walks page tables, decodes lock states, and traces the exact code path that led to the panic. It even recovers the dmesg ring buffer from the crashed kernel's memory.
- Testing kdump before a real crash is essential. The command "echo c > /proc/sysrq-trigger" forces an immediate kernel panic. If kdump is properly configured, the system reboots into the capture kernel, writes the vmcore, and reboots again into the normal kernel. The vmcore appears in /var/crash/ with a timestamp directory. If this test fails, kdump will also fail during a real crash.
Common Mistakes with kdump & Crash Analysis
- Not reserving enough memory for the capture kernel. If the crashkernel= parameter is too small, the capture kernel fails to boot and the vmcore is lost. Systems with many kernel modules, network-based dump targets, or complex initramfs configurations need more than the minimum 256 MB. Run kdumpctl estimate to check the actual memory requirement.
- Forgetting to rebuild the kdump initramfs after configuration changes. Changing the dump target (e.g., from local disk to NFS) requires running kdumpctl rebuild to regenerate the capture kernel's initramfs. Without this, the capture kernel boots with the old configuration and may fail to write the vmcore.
- Analyzing a vmcore without the matching vmlinux. The crash tool needs the exact vmlinux binary (with CONFIG_DEBUG_INFO) that was running when the crash occurred. A vmlinux from a different kernel build, even the same version, has different symbol addresses. Install the kernel-debuginfo package matching the exact kernel version and release.
- Assuming kdump works on first boot without testing. UEFI Secure Boot can block kexec_load(). SELinux policies may prevent writing to the dump target. Network-based targets may fail if the kdump initramfs lacks the correct network driver. Always validate with a controlled panic via sysrq-trigger after initial setup.
- Running out of disk space for the vmcore. A 256 GB server with dump level 31 and lzo compression may still produce a 5-10 GB vmcore. If /var/crash/ is on a small root partition, the dump fails silently. Configure a dedicated dump target with sufficient space, or use makedumpfile's --message-level to log progress during the dump.
Related Topics
System Calls: User to Kernel Transition, Kernel Modules & Device Drivers, Interrupt Handling & Softirqs, /proc and /sys Filesystems, Perf Events & Performance Counters
Kernel Data Structures — Kernel Internals
Difficulty: Advanced
No libc, no STL -- the kernel builds its own. struct list_head is an intrusive doubly-linked list embedded inside data structures, with container_of() to navigate back. Red-black trees handle CFS scheduling and VMA lookups in O(log n). The xarray (radix tree) backs the page cache with RCU-safe lockless reads. Per-CPU variables give each core its own counter copy to kill cache-line bouncing. Hash tables use singly-linked hlist to save 8 bytes per bucket across millions of entries.
System Calls for Kernel Data Structures
Key Components in Kernel Data Structures
- struct list_head: An intrusive doubly-linked list; the list node is embedded inside the data structure rather than wrapping it. The container_of() macro recovers the enclosing struct from a list_head pointer. Used everywhere: process lists, timer queues, device lists, filesystem dirty pages.
- struct rb_root / struct rb_node: Red-black tree implementation for O(log n) ordered operations. CFS uses an rb_tree keyed by vruntime for scheduling, the VMA subsystem uses it to store virtual memory areas sorted by address, and the I/O scheduler uses it for deadline ordering.
- struct xarray (radix tree): Concurrent radix tree (replaced the older radix_tree API in kernel 4.20). The page cache maps (inode, offset) pairs to struct page pointers via xarray. Supports RCU-safe lockless reads, making page cache lookups fast for concurrent readers.
- per-CPU variables: Data allocated with DEFINE_PER_CPU() has a separate copy for each CPU, accessed via get_cpu_var() / this_cpu_ptr(). Eliminates cache-line bouncing for frequently-updated counters (packet counts, statistics). No locks needed for reads from the owning CPU.
Key Points for Kernel Data Structures
- container_of() is the kernel's most important macro. It uses offsetof-based pointer arithmetic to go from an embedded list_head to the containing struct. This lets one struct live on multiple lists at once -- task_struct has over 10 list_head members.
- The kernel's red-black tree deliberately has no insert() wrapper that hides rebalancing. Callers find the insertion point themselves, link with rb_link_node(), then call rb_insert_color() to rebalance. This avoids callback overhead and gives callers control over when rebalancing happens -- critical in interrupt and RCU contexts.
- Hash tables use singly-linked lists (hlist_head/hlist_node) instead of doubly-linked, halving memory per bucket. With millions of buckets in the PID hash, dentry cache, and inode hash, those 8 saved bytes add up fast.
- RCU makes reads free. Readers take no lock -- they just call rcu_read_lock(). Writers create a modified copy and atomically swap the pointer, waiting for all readers to finish before freeing the old version. On non-preempt kernels, rcu_read_lock() compiles to nothing.
- The bitmap API (include/linux/bitmap.h) manages sets of flags using unsigned long arrays with bit manipulation. cpumask is a thin wrapper used for CPU affinity, IRQ balancing, and NUMA topology.
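A user-space rendering of the container_of() idiom from the first point -- the same offsetof arithmetic the kernel macro performs, minus its type-checking extras:

    /* Recover the enclosing struct from a pointer to an embedded member. */
    #include <stddef.h>
    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct task {
        int pid;
        struct list_head run_list;   /* node lives inside the struct */
    };

    int main(void)
    {
        struct task t = { .pid = 42 };
        struct list_head *node = &t.run_list; /* what a list walk yields */

        struct task *owner = container_of(node, struct task, run_list);
        printf("pid=%d\n", owner->pid);       /* prints 42 */
        return 0;
    }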
Common Mistakes with Kernel Data Structures
- Mistake: Iterating a list with list_for_each_entry() while modifying it. Reality: This corrupts list pointers. Use list_for_each_entry_safe() which pre-fetches the next pointer, or RCU list iteration for lock-free reads.
- Mistake: Forgetting to initialize list_head with INIT_LIST_HEAD(). Reality: Uninitialized list_heads contain garbage pointers and cause an immediate oops on first list operation.
- Mistake: Using per-CPU variables without disabling preemption. Reality: If the thread migrates between get_cpu_var() and put_cpu_var(), it accesses the wrong CPU's data. this_cpu_read()/this_cpu_write() are preemption-safe for simple scalars.
- Mistake: Assuming O(1) hash table lookups regardless of load. Reality: The kernel uses chaining, so a bad hash or too few buckets degrades to O(n). The dentry and inode hashes use jhash (Jenkins hash) for good distribution.
Related Topics
Process Scheduling (CFS), Virtual Memory & Address Spaces, Kernel Modules & Device Drivers, eBPF: Programmable Kernel
Kernel Livepatching & Runtime Code Replacement — System Administration
Difficulty: Advanced
Patching a running kernel without rebooting. The livepatch framework uses ftrace to intercept function calls and redirect them to replacement implementations loaded as kernel modules. The kernel keeps running. No process restarts. No connection drops.
System Calls for Kernel Livepatching & Runtime Code Replacement
- init_module
- finit_module
- delete_module
Key Components in Kernel Livepatching & Runtime Code Replacement
- ftrace (function tracer): The underlying mechanism that makes livepatching possible. ftrace uses the compiler-inserted __fentry__ call at the beginning of every kernel function to hook function entry points. The livepatch framework registers an ftrace handler that rewrites the saved instruction pointer, causing execution to resume in the patched function instead of the original. This redirection happens at the instruction level with minimal overhead (a few nanoseconds per patched call site).
- klp_patch / klp_func / klp_object: The kernel data structures that describe a livepatch. klp_patch represents the entire patch. It contains klp_object entries (one per patched kernel module or vmlinux). Each klp_object contains klp_func entries mapping old function symbols to new function addresses. When the patch is enabled, the framework iterates these structures and registers ftrace hooks for each function replacement.
- Consistency Model (per-task switching): The mechanism ensuring no task executes a mix of old and new function versions. When a patch is applied, each task transitions from the old universe to the new universe individually. A task switches when it is in a safe state -- typically when it returns to userspace or is idle. The TIF_PATCH_PENDING flag marks tasks that have not yet switched. The patch is fully active only when all tasks have transitioned.
- /sys/kernel/livepatch/: The sysfs interface for livepatch management. Each loaded patch appears as a directory under /sys/kernel/livepatch/. The enabled file (0 or 1) controls activation. The transition file indicates whether the patch is still migrating tasks. Per-function entries show which functions are replaced. This is the primary interface for monitoring patch status.
- kpatch-build: The userspace tool that creates livepatch modules from source diffs. It compiles the original and patched kernel source, compares the resulting object files, extracts changed functions, and wraps them in a loadable kernel module that uses the livepatch API. The output is a standard .ko file that can be loaded with insmod or kpatch load.
Key Points for Kernel Livepatching & Runtime Code Replacement
- Livepatching replaces entire function bodies, not individual instructions. The granularity is one function at a time. If a CVE fix modifies three functions, the livepatch module contains three replacement functions. ftrace redirects each one independently.
- The consistency model is what separates modern livepatching from naive function replacement. Without it, one task could call the old version of function A, then the new version of function B that depends on the new behavior of A. The per-task universe switching prevents this by ensuring each task sees either all-old or all-new functions, never a mix.
- A livepatch module is a normal kernel module (.ko file) that calls klp_enable_patch() in its init function. It can be built, distributed, and loaded with standard module tools. The livepatch framework handles the ftrace registration and consistency transitions.
- Livepatches are cumulative. A second patch must account for the first. If patch-1 replaces function foo() and patch-2 also modifies foo(), then patch-2 must contain the combined fix. Atomic replace mode (since kernel 5.1) simplifies this by allowing a single patch to replace all previous patches at once.
- The compiler must generate functions with __fentry__ prologues for livepatching to work. This is controlled by CONFIG_FUNCTION_TRACER and the -pg or -mfentry compiler flag. Functions that are inlined, marked __always_inline, or compiled without fentry cannot be livepatched.
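A skeleton of such a module, modeled on samples/livepatch/livepatch-sample.c in the kernel tree (which patches cmdline_proc_show); treat it as a sketch, not a drop-in patch:

    /* Livepatch module skeleton: map old symbol -> new function, enable. */
    #include <linux/livepatch.h>
    #include <linux/module.h>
    #include <linux/seq_file.h>

    static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
    {
        seq_puts(m, "patched cmdline output\n");
        return 0;
    }

    static struct klp_func funcs[] = {
        { .old_name = "cmdline_proc_show",
          .new_func = livepatch_cmdline_proc_show, },
        {}
    };

    static struct klp_object objs[] = {
        { /* .name = NULL means the symbol lives in vmlinux */
          .funcs = funcs, },
        {}
    };

    static struct klp_patch patch = {
        .mod  = THIS_MODULE,
        .objs = objs,
    };

    static int __init patch_init(void)
    {
        return klp_enable_patch(&patch); /* starts the ftrace redirection */
    }

    module_init(patch_init);
    MODULE_LICENSE("GPL");
    MODULE_INFO(livepatch, "Y");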
Common Mistakes with Kernel Livepatching & Runtime Code Replacement
- Assuming any kernel bug can be livepatched. Data structure changes, new struct fields, modified function signatures, and changes to inline functions or assembly cannot be applied via livepatching. Roughly 60-70% of security fixes are livepatchable. The rest require a full reboot.
- Leaving livepatches as permanent fixes. Livepatches are emergency bandages, not long-term solutions. They accumulate, interact in subtle ways, and make debugging harder because the running code no longer matches the installed kernel package. The correct workflow is: livepatch immediately, then schedule a real kernel update within the next maintenance window.
- Ignoring the transition state. After loading a livepatch, the transition file in /sys/kernel/livepatch/ may stay at 1 for seconds or minutes if long-running kernel tasks have not reached a safe transition point. A patch is not fully active until transition reaches 0 and all tasks have switched to the new universe.
- Stacking multiple independent livepatches without atomic replace. Each patch hooks the same ftrace entry points. If two patches modify different call sites in the same function, the interactions become unpredictable. Atomic replace (replace flag in klp_patch) was introduced in kernel 5.1 specifically to solve this by treating each new patch as a complete replacement of all previous patches.
Related Topics
Kernel Modules & Device Drivers, eBPF: Programmable Kernel, System Calls: User to Kernel Transition, Process Scheduling (CFS)
Kernel Modules & Device Drivers — Kernel Internals
Difficulty: Advanced
ELF .ko files loaded into a running kernel via finit_module(). load_module() verifies the binary, checks signatures, resolves symbol CRCs, allocates vmalloc memory, and calls the module's init function. That init function registers with the right subsystem -- cdev_add() for character devices, register_netdev() for network devices. udev picks up the uevent and creates /dev nodes. Unloading requires a zero reference count before the exit function runs cleanup in reverse order.
System Calls for Kernel Modules & Device Drivers
- init_module
- finit_module
- delete_module
Key Components in Kernel Modules & Device Drivers
- struct module: The kernel's in-memory representation of a loaded module. Contains the module name, reference count, init/exit function pointers, symbol table, dependency list, and memory sections (text, data, BSS). Linked into a global list traversed by lsmod.
- struct file_operations: The vtable for character devices. Each function pointer (open, read, write, ioctl, mmap, release, poll) defines how userspace interacts with /dev/xyz. The kernel dispatches VFS operations to the driver via this structure.
- struct cdev: Represents a character device within the kernel. Associates a device number range (major:minor) with a file_operations structure. Registered via cdev_add() and removed via cdev_del().
- udev / devtmpfs: devtmpfs is a kernel-maintained tmpfs that auto-creates device nodes when drivers register. udev is the userspace daemon that listens for kernel uevents via netlink, applies rules (/etc/udev/rules.d/), sets permissions, creates symlinks, and runs trigger scripts.
Key Points for Kernel Modules & Device Drivers
- modprobe is smart, insmod is not. modprobe reads modules.dep (generated by depmod) and loads prerequisites first. insmod loads a single .ko file and fails on unresolved symbols. That dependency resolution is why modprobe fixes problems insmod cannot.
- The 'disagrees about version of symbol' error is not a bug -- it is ABI protection. The kernel checks that exported symbol CRC signatures match (MODVERSIONS). A module compiled against a different kernel config will be rejected.
- Every loaded module consumes non-swappable kernel memory. The .text section lives in vmalloc space, and per-CPU data uses the per-CPU allocator. Hundreds of unnecessary modules waste precious kernel address space.
- finit_module() (Linux 3.8+) loads from a file descriptor, enabling signature verification before loading. Modern insmod/modprobe use this syscall, not the older init_module().
- Device numbers: major identifies the driver (8 = sd, 1 = mem), minor identifies the device instance (sda=0, sda1=1). Modern kernels use dynamic major allocation via alloc_chrdev_region() to avoid conflicts.
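The smallest useful illustration is the init/exit pairing itself -- a minimal sketch of a loadable module:

    /* Minimal module skeleton showing the init/exit pairing. */
    #include <linux/init.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");

    static int __init hello_init(void)
    {
        pr_info("hello: loaded\n");
        return 0;                 /* nonzero would abort the load */
    }

    static void __exit hello_exit(void)
    {
        pr_info("hello: unloading\n"); /* undo registrations in reverse */
    }

    module_init(hello_init);
    module_exit(hello_exit);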
Common Mistakes with Kernel Modules & Device Drivers
- Mistake: Forgetting to unregister resources in module_exit. Reality: If you register a char device, create a class, and add a device, you must undo all three in reverse order. Missing any step leaks kernel resources or leaves stale /dev entries.
- Mistake: Using GFP_KERNEL allocations in interrupt context. Reality: Interrupt handlers and tasklets must use GFP_ATOMIC, which can fail. The driver must handle NULL returns gracefully.
- Mistake: Dereferencing userspace pointers directly in ioctl handlers. Reality: This bypasses SMAP protection and crashes on invalid pointers. Use copy_from_user/copy_to_user -- always.
- Mistake: Building modules against headers that do not match the running kernel. Reality: The .ko file either fails to load (version magic mismatch) or causes subtle memory corruption if MODVERSIONS is disabled.
Related Topics
System Calls: User to Kernel Transition, eBPF: Programmable Kernel, Virtual File System (VFS), Interrupt Handling & Softirqs
Kernel Network Stack — Networking & Sockets
Difficulty: Advanced
A packet traverses seven layers of kernel code between the NIC and the application. DMA fills the ring buffer, NAPI polls into sk_buffs, the IP and TCP stacks process headers, inet_hashtable matches the 4-tuple to a socket, and copy_to_user() crosses back to ring 3. The TX path reverses it. Understanding this journey explains every tuning knob that matters.
System Calls for Kernel Network Stack
- recv
- send
- read
- write
- sendfile
- splice
- sendmsg
- recvmsg
Key Components in Kernel Network Stack
- sk_buff: The fundamental packet structure in the kernel. Contains head/data/tail/end pointers delineating the packet buffer, protocol header pointers for L2/L3/L4, metadata (device, timestamp, mark), and skb_shared_info for scatter-gather pages. Every packet in flight is an sk_buff.
- NAPI (New API): Hybrid interrupt/polling mechanism. The NIC fires one hardware interrupt, then switches to polling mode where the softirq handler calls the driver's poll function to drain packets in batches. This amortizes interrupt overhead across dozens of packets and prevents interrupt storms at high packet rates.
- inet_hashtable: Hash table that maps the 4-tuple (src_ip, src_port, dst_ip, dst_port) to a struct sock for established TCP connections. A separate inet_listening_hashtable handles SYN packets destined for listening sockets. This is how the kernel routes each packet to the correct socket in O(1).
- struct sock (sk): The socket object in kernel space. Contains the receive queue (sk_receive_queue), send buffer, congestion control state, TCP sequence numbers, and the wait queue where epoll callbacks are registered. This is the meeting point between the network stack and the application.
- Qdisc (Queueing Discipline): TX-side packet scheduler sitting between the TCP/IP stack and the driver. Default is pfifo_fast (or fq_codel on modern systems). Controls packet ordering, rate limiting, and fairness before packets reach the driver transmit ring.
Key Points for Kernel Network Stack
- The RX path crosses two privilege boundaries. The NIC writes packets via DMA (no CPU involvement), then a hardware interrupt transitions to kernel mode. NAPI softirq processes packets in kernel context. Finally, copy_to_user() copies data to the application buffer and the CPU returns to ring 3. Each boundary has real cost -- the syscall transition alone is ~200 nanoseconds.
- sk_buff is not a simple buffer. It has four pointers (head, data, tail, end) that allow protocols to push/pull headers without copying. When the TCP stack needs to prepend a header, it calls skb_push() which moves the data pointer backward. The actual packet data might span multiple pages via skb_shared_info's frags array, supporting scatter-gather DMA.
- Socket lookup is the bridge between the network and the application. For each incoming TCP segment, inet_hashtable hashes the 4-tuple and walks a hash chain to find the matching struct sock. SO_REUSEPORT creates multiple sockets on the same port, and the kernel (or an attached BPF program) selects which one receives each connection. In containers, DNAT rewrites the destination IP before this lookup happens.
- The TX path mirrors the RX path in reverse. send() copies data from user space into sk_buffs, TCP adds headers and applies congestion control, IP performs route lookup and passes through Netfilter OUTPUT and POSTROUTING hooks, the qdisc schedules the packet, and the driver enqueues it on the NIC TX ring for DMA transmission. TSO offloads TCP segmentation to the NIC hardware, sending one large sk_buff instead of many small ones.
- Zero-copy techniques bypass the user/kernel data copy. sendfile() splices data from the page cache directly into sk_buffs using page references instead of memcpy. MSG_ZEROCOPY (kernel 4.14+) lets send() reference user-space pages directly. io_uring can batch network operations to amortize syscall overhead. Each technique trades complexity for throughput at high data rates.
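A sketch of that sendfile() path, with error handling trimmed to the essentials:

    /* File -> socket without bouncing data through user space. */
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    ssize_t send_whole_file(int sockfd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        struct stat st;
        fstat(fd, &st);

        off_t off = 0;
        while (off < st.st_size) {
            /* page-cache pages are attached to sk_buffs by reference */
            ssize_t n = sendfile(sockfd, fd, &off, st.st_size - off);
            if (n <= 0) {
                close(fd);
                return -1;
            }
        }
        close(fd);
        return (ssize_t)off;
    }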
Common Mistakes with Kernel Network Stack
- Mistake: Assuming packet drops are always a network problem. Reality: The most common drop point is the NIC RX ring buffer overflow, visible in ethtool -S as rx_missed_errors. The NIC filled the ring via DMA faster than NAPI could drain it. Fix with larger ring buffers (ethtool -G) and RSS to spread interrupt load.
- Mistake: Tuning TCP buffer sizes to maximum values for all workloads. Reality: Each socket can auto-tune up to tcp_rmem[2] (default 6 MB). With 50,000 connections, that is 300 GB of theoretical buffer allocation. The kernel enters tcp_memory_pressure and starts dropping segments. Set appropriate maximums for the workload.
- Mistake: Ignoring softirq processing time. Reality: NAPI processes packets in softirq context, which has a time budget (netdev_budget_usecs, default 2 ms). If the budget expires, remaining packets stay in the ring buffer until the next cycle. Under load, this adds latency that looks like application slowness but is entirely in the kernel RX path.
- Mistake: Expecting sendfile() to always be faster than read()+send(). Reality: sendfile() avoids one copy but only works from a file descriptor to a socket. If the data needs modification (compression, encryption in user space), the copy is unavoidable. TLS via kTLS can push encryption into the kernel to preserve the zero-copy path.
Related Topics
epoll & I/O Multiplexing, Socket Programming (TCP/UDP), TCP State Machine & Connection Lifecycle, Zero-Copy Networking (sendfile, splice), Netfilter & nftables/iptables, Network Namespaces & veth Pairs, XDP & AF_XDP: Kernel-Bypass Networking
Linux Security Modules (LSM) Framework — Security & Access Control
Difficulty: Advanced
The kernel's pluggable security framework. Over 200 hook points embedded in VFS, networking, process lifecycle, and IPC code paths call into registered security modules (SELinux, AppArmor, Smack, TOMOYO, BPF LSM) after DAC checks pass. Each hook invokes a function pointer in the security_hook_list, and every registered module gets a vote. A single deny from any module blocks the operation. The security_struct blob attached to kernel objects (inodes, tasks, superblocks) stores per-LSM state without requiring changes to core data structures.
System Calls for Linux Security Modules (LSM) Framework
- security_file_open
- security_inode_permission
- security_task_alloc
- security_socket_connect
- security_bpf
Key Components in Linux Security Modules (LSM) Framework
- security_hook_list: A linked list of function pointers for each LSM hook. When the kernel calls security_inode_permission(), it walks the list and invokes every registered module's implementation. If any module returns a non-zero (deny) value, the operation is blocked. The list is populated during kernel initialization and is read-only at runtime; a registration sketch follows this component list.
- security_struct (lsm_blob_sizes): Since kernel 5.4, each major kernel object (task_struct, inode, superblock, file, cred, msg_msg, ipc) has a blob of memory divided among registered LSMs. Each LSM gets a fixed offset into the blob for its private data. SELinux stores its security context pointer there. AppArmor stores its profile reference. This replaces the old single void* pointer that forced only one LSM to attach data per object.
- LSM hook points (200+): Inserted at security-critical code paths throughout the kernel. VFS hooks (security_inode_permission, security_file_open, security_inode_create) gate file operations. Network hooks (security_socket_connect, security_socket_bind, security_sk_clone_security) gate socket operations. Process hooks (security_task_alloc, security_task_kill, security_bprm_check) gate process creation and signaling. Each hook is a call to a function in include/linux/lsm_hooks.h.
- struct security_operations (pre-5.4) / security_hook_heads: Before kernel 5.4, a single struct security_operations held one function pointer per hook, allowing only one major LSM. After 5.4, security_hook_heads is a struct where each member is a list_head pointing to a chain of hook callbacks, enabling multiple LSMs to stack. The lsm= boot parameter controls the order of evaluation.
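As a rough illustration of how a built-in module attaches to that list, here is a minimal in-tree-style sketch. It assumes the post-5.4 API; the exact security_add_hooks() signature has shifted across kernel versions, so treat it as a shape, not a drop-in module:

    #include <linux/lsm_hooks.h>
    #include <linux/init.h>

    /* Hook body: return 0 to allow, -EPERM to veto. A deny from any
     * stacked LSM blocks the open regardless of what the others say. */
    static int demo_file_open(struct file *file)
    {
        return 0;
    }

    static struct security_hook_list demo_hooks[] = {
        LSM_HOOK_INIT(file_open, demo_file_open),
    };

    static int __init demo_lsm_init(void)
    {
        /* Appends our callbacks to security_hook_heads, evaluated
         * in lsm= boot-parameter order. */
        security_add_hooks(demo_hooks, ARRAY_SIZE(demo_hooks), "demo");
        return 0;
    }

    DEFINE_LSM(demo) = {
        .name = "demo",
        .init = demo_lsm_init,
    };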
Key Points for Linux Security Modules (LSM) Framework
- LSM hooks fire AFTER DAC checks pass. If standard Unix permissions deny access, the LSM hook is never reached. This means LSMs can only further restrict access, never grant access that DAC denied. The design is intentionally restrictive: LSMs are an additional gate, not a bypass.
- Since kernel 5.4, multiple major LSMs can stack. The lsm= boot parameter specifies the order: lsm=lockdown,capability,selinux,bpf. Every hook iterates through all registered modules. A single deny from any module blocks the operation. This enables layered security policies where SELinux provides baseline MAC and BPF LSM adds application-specific rules.
- BPF LSM (kernel 5.7+) allows attaching eBPF programs to any of the 200+ LSM hooks at runtime. No kernel recompilation, no reboot. The BPF verifier ensures the program is safe. This transforms LSM from a boot-time-only framework into a runtime-programmable security layer.
- The security_struct blob mechanism is what makes stacking possible. Before 5.4, each kernel object (inode, task) had a single void* security pointer, so only one LSM could store per-object data. The blob mechanism allocates a contiguous chunk partitioned among all active LSMs, with each LSM accessing its portion via a fixed offset.
- Hook placement is deliberate and follows the principle of complete mediation. Every path from a syscall to a security-sensitive kernel operation must pass through at least one LSM hook. The VFS layer alone has hooks at inode lookup, permission check, file open, read, write, mmap, and attribute changes. Missing a hook would create a bypass.
Common Mistakes with Linux Security Modules (LSM) Framework
- Assuming "Permission denied" always means DAC. When file permissions, ownership, and ACLs all check out but open() still returns EACCES, an LSM is the most likely cause. Check cat /sys/kernel/security/lsm to see which modules are active, then consult the appropriate audit log (ausearch -m AVC for SELinux, dmesg | grep APPARMOR for AppArmor).
- Believing LSMs can grant access. LSMs are restrictive-only hooks. They cannot override a DAC denial or grant permissions that the standard permission model rejects. If DAC denies, the LSM hook never runs. If DAC allows, the LSM gets a veto but cannot add further permissions.
- Disabling the entire LSM stack (setenforce 0 or removing the AppArmor profile) to debug one denial. This removes all mandatory access control, not just the offending rule. The correct approach: switch to permissive mode (SELinux) or complain mode (AppArmor), reproduce the issue, read the audit log, and fix the specific rule.
- Ignoring BPF LSM programs during debugging. On systems with BPF LSM enabled, eBPF programs attached to LSM hooks can deny operations without leaving traditional audit log entries. Use bpftool prog list to check for attached BPF LSM programs and bpftool prog dump to inspect their logic.
Related Topics
SELinux & AppArmor, Seccomp: Sandboxing System Calls, Linux Capabilities, eBPF: Programmable Kernel, Audit Framework & Logging
Memory Cgroups & Resource Limits — Memory Management
Difficulty: Intermediate
Per-group physical memory limits enforced at the kernel page allocator. Three categories count against the limit: anonymous pages (heap, stack), file-backed pages (page cache from reads and writes), and kernel memory (slab, page tables, socket buffers). memory.high throttles allocations through direct reclaim without killing. memory.max triggers a cgroup-scoped OOM kill -- entirely separate from the global OOM killer. memory.low and memory.min shield working sets from reclaim pressure caused by siblings.
System Calls for Memory Cgroups & Resource Limits
- setrlimit
- getrlimit
- prlimit
Key Components in Memory Cgroups & Resource Limits
- struct mem_cgroup: Per-cgroup memory accounting structure. tracks page counters (anon, file, kernel stack, slab, page tables), usage/limit values, and OOM control state
- memory.max: Hard limit. the cgroup is OOM-killed when usage exceeds this; equivalent to cgroup v1's memory.limit_in_bytes; setting to 'max' means unlimited
- memory.high: Soft throttle threshold. when usage exceeds memory.high, the kernel aggressively reclaims memory and throttles allocations (via direct reclaim), but does NOT OOM kill; creates back-pressure
- memory.low / memory.min: Protection thresholds. memory.low provides best-effort protection against reclaim (kernel prefers to reclaim from other cgroups); memory.min is hard protection (never reclaimed below this level)
Key Points for Memory Cgroups & Resource Limits
- Page cache is the silent cgroup killer -- reading a temp file or writing logs generates page cache that counts against your memory.max, even though RSS looks low; this is the #1 source of unexpected container OOM kills
- memory.high is the seatbelt, memory.max is the brick wall -- memory.high throttles your process (sleeps it during allocation) giving it time to recover; memory.max just kills it; set memory.max 10-20% above memory.high as a safety net
- The cgroup OOM killer is completely separate from the global one -- it only kills processes inside the over-limit cgroup, and the events may only appear in memory.events counters, not dmesg; many teams miss these entirely (see the sketch after this list)
- In Kubernetes, container memory limits map directly to cgroup memory.max -- a pod generating page cache via file I/O will be OOM-killed even if its heap is well within limits; you must size for total memory, not just heap
- Kernel memory accounting is always-on in cgroup v2 -- slab caches, page tables, socket buffers, and kernel stacks all count against your limit; a process with 10,000 TCP connections can OOM from kernel memory alone
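A small sketch of the seatbelt/brick-wall pattern above, assuming a delegated cgroup v2 directory at the hypothetical path /sys/fs/cgroup/myapp:

    #include <stdio.h>

    /* Write one value into a cgroup v2 control file. */
    static int cg_write(const char *file, const char *val)
    {
        char path[256];
        snprintf(path, sizeof(path), "/sys/fs/cgroup/myapp/%s", file);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int ok = (fputs(val, f) >= 0);
        fclose(f);
        return ok ? 0 : -1;
    }

    int main(void)
    {
        /* Throttle at 4.5 GB; hard-kill ceiling ~10% above it. */
        cg_write("memory.high", "4831838208");
        cg_write("memory.max",  "5368709120");

        /* memory.events is where cgroup OOM kills show up --
         * watch the oom_kill counter, not just dmesg. */
        char line[128];
        FILE *ev = fopen("/sys/fs/cgroup/myapp/memory.events", "r");
        if (ev) {
            while (fgets(line, sizeof(line), ev))
                fputs(line, stdout);
            fclose(ev);
        }
        return 0;
    }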
Common Mistakes with Memory Cgroups & Resource Limits
- Setting container limits equal to heap size -- a Java app with -Xmx4g needs at least 5-6 GB container limit to cover page cache, kernel memory, native allocations, and thread stacks; setting it to 4g guarantees OOM
- Only monitoring dmesg for OOM events -- cgroup OOM kills may only appear in memory.events counters (oom, oom_kill fields); if you are not reading those, you are flying blind in containers
- Trusting /proc/meminfo inside a container -- without overrides, it shows HOST memory, not container memory; use memory.current and memory.stat for accurate cgroup-level data; tools like lxcfs expose container-aware /proc/meminfo
- Forgetting memory.low for critical services -- without it, a batch job's page cache growth can cause the kernel to reclaim your latency-sensitive service's working set; memory.low marks it as "reclaim from others first"
Related Topics
OOM Killer & Memory Pressure, Virtual Memory & Address Spaces, Heap Allocators (malloc internals), NUMA Architecture & Memory Policy
System V & POSIX Message Queues — Processes & Threads
Difficulty: Intermediate
Kernel-managed message passing with built-in priority ordering. POSIX MQs store messages in an rb-tree keyed by priority, and mq_receive() always returns the most urgent one first. On Linux, queue descriptors are real file descriptors -- they plug into epoll. System V MQs use integer IDs that cannot. Both types persist in kernel memory after the process exits and stay there until explicitly removed.
System Calls for System V & POSIX Message Queues
- mq_open
- mq_send
- mq_receive
- msgget
- msgsnd
- msgrcv
Key Components in System V & POSIX Message Queues
- mqueue_inode_info: Kernel structure for a POSIX message queue. Contains an rb-tree of messages ordered by priority, current message count, max message count/size limits, and notification registration (mq_notify).
- msg_msg: Kernel structure representing a single message in a System V queue. Contains the message type (long), size, and the payload data. Messages larger than one page use linked msg_msgseg structures.
- mqueue filesystem: POSIX MQs live on a virtual filesystem (mounted at /dev/mqueue). Queue names appear as files. 'ls /dev/mqueue' shows active queues with their metadata. The filesystem provides a user-visible interface for cleanup.
- ipc_namespace: Both POSIX and SysV MQs are isolated by IPC namespaces. Containers (Docker/Kubernetes) get separate MQ namespaces, preventing cross-container message leakage.
Key Points for System V & POSIX Message Queues
- POSIX MQs deliver by priority: mq_receive() always returns the highest-priority message first (0 to MQ_PRIO_MAX-1, at least 32 levels). System V MQs have a 'type' field for selective retrieval but no strict priority ordering. A short round-trip sketch follows this list.
- mq_notify() fires once when a message arrives on an empty queue -- then you must re-register. It won't fire if the queue already has messages. One process per queue. This design prevents thundering herds but demands careful coding.
- On Linux, POSIX MQ descriptors are real file descriptors. They work with select(), poll(), and epoll(). You can multiplex MQ events with socket I/O in one event loop. System V MQs return integer IDs that don't work with epoll -- a dealbreaker for event-driven code.
- Default POSIX MQ limits are surprisingly low: msg_max=10, msgsize_max=8192, queues_max=256. Production systems almost always need to tune /proc/sys/fs/mqueue/ values.
- Both MQ types persist until explicitly removed or reboot. POSIX MQs survive mq_close() -- you must call mq_unlink(). System V MQs survive until msgctl(IPC_RMID). Crashed processes leak queues. This is a real operational hazard.
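A minimal priority round trip, assuming the queue name /demo_mq is unused and default limits apply (link with -lrt on older glibc):

    #include <mqueue.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct mq_attr attr = {
            .mq_maxmsg  = 10,    /* must fit within msg_max */
            .mq_msgsize = 128,
        };
        mqd_t mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0600, &attr);
        if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

        /* Send low priority first -- receive still returns the urgent one. */
        mq_send(mq, "routine", 8, 1);
        mq_send(mq, "urgent", 7, 31);

        char buf[128];
        unsigned prio;
        ssize_t n = mq_receive(mq, buf, sizeof(buf), &prio);
        printf("got %zd bytes, prio %u: %s\n", n, prio, buf);  /* "urgent" */

        mq_close(mq);
        mq_unlink("/demo_mq");  /* without this, the queue leaks in kernel memory */
        return 0;
    }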
Common Mistakes with System V & POSIX Message Queues
- Mistake: calling mq_close() and assuming the queue is gone. Reality: mq_close() closes the descriptor but the queue persists on /dev/mqueue. You must call mq_unlink() to actually remove it. Leaked queues eat kernel memory until reboot.
- Mistake: not tuning msg_max. Reality: with the default of 10, mq_send() blocks (or returns EAGAIN) almost immediately under any real load. Always check and raise /proc/sys/fs/mqueue/msg_max.
- Mistake: using System V MQs in new code. Reality: SysV IPC uses integer keys (not fds), doesn't work with epoll, has awkward permissions, and the API is inconsistent. POSIX MQs are superior in every way. SysV exists only for legacy compatibility.
- Mistake: expecting mq_notify to re-arm automatically. Reality: notification fires once. If messages arrive between the notification callback and your re-registration call, they're silently available but no new notification fires.
Related Topics
Inter-Process Communication (Pipes & FIFOs), Shared Memory & Semaphores, Signals & Signal Handling, POSIX Threads
mmap & Memory-Mapped Files — Memory Management
Difficulty: Intermediate
File contents mapped straight into a process's address space -- accessed as regular memory, no read() syscall, no kernel-to-user copy. The page cache backs the mapping, so multiple processes reading the same file share identical physical pages with zero duplication.
System Calls for mmap & Memory-Mapped Files
- mmap
- munmap
- msync
- mprotect
- mremap
Key Components in mmap & Memory-Mapped Files
- vm_area_struct (VMA): Kernel structure tracking each mmap region. start/end addresses, flags (MAP_SHARED/PRIVATE), file pointer, and page offset for file-backed mappings
- struct address_space: Per-inode structure managing the page cache. maps file offsets to page frames, shared by all processes mapping the same file
- struct page (page cache): Physical page frames cached in the page cache. for file-backed mappings, the page cache IS the memory; mmap provides a direct virtual address alias to these frames
- rmap (reverse mapping): Tracks which PTEs point to each physical page. needed for page reclaim and migration to update all mappings when a shared page is moved or evicted
Key Points for mmap & Memory-Mapped Files
- mmap does NOT read the file -- it sets up a virtual address range that points at the file's page cache; actual data loading happens on first access via page faults, which is why mmap'ing a 1 TB file costs zero RAM and microseconds of CPU
- MAP_SHARED means your writes go directly to the page cache and are instantly visible to every other process mapping the same file -- but they do NOT reach disk until msync() or fsync(); a power failure without sync loses your data (see the sketch after this list)
- MAP_PRIVATE gives you copy-on-write -- reads come from the shared page cache for free, but the first write to any page creates a private copy; the file on disk is never touched
- malloc uses mmap under the hood for allocations above 128 KB -- unlike brk, these regions can be independently returned to the kernel via munmap, which is why large allocations do not cause the heap fragmentation trap
- mremap() can resize an existing mapping without copying data (if virtual space permits), which is how dynamic arrays in Go and Rust can sometimes grow without memcpy
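A minimal MAP_SHARED write-and-sync sketch, assuming an existing file data.bin of at least one page:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* No data is read here -- pages fault in from the
         * page cache on first access. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = '!';                     /* visible to every other mapper now */
        msync(p, st.st_size, MS_SYNC);  /* ...but on disk only after this */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }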
Common Mistakes with mmap & Memory-Mapped Files
- Mapping a file MAP_SHARED then truncating it -- accessing pages beyond the new file size delivers SIGBUS (bus error), not SIGSEGV; this is a classic database crash pattern
- Assuming mmap is always faster than read() -- for sequential reads, read() with kernel readahead can match or beat mmap because it avoids page fault overhead and TLB pressure; mmap wins for random access
- Forgetting msync() before relying on durability -- mmap writes go to the page cache, not to disk; without msync/fsync, a power failure loses your data, exactly like write() without fsync()
- Mapping very large files on 32-bit systems -- the 3 GB user-space limit means you can map at most ~2 GB contiguously; on 64-bit with 128 TB address space, this is a non-issue
Related Topics
Virtual Memory & Address Spaces, Page Tables & TLB, Huge Pages & THP, Zero-Copy Networking (sendfile, splice)
Linux Namespaces (PID, NET, MNT, UTS, IPC, USER) — Kernel Internals
Difficulty: Intermediate
The kernel trick behind every Linux container. A namespace gives a process its own private view of PIDs, network, filesystem, or users -- all while sharing the same kernel. Inside, the process is PID 1 with a full network stack. On the host, it is just PID 47832.
System Calls for Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
- unshare
- clone
- setns
- pivot_root
Key Components in Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
- struct nsproxy: Each task_struct holds a pointer to an nsproxy containing pointers to all namespace objects the task belongs to. Multiple tasks can share an nsproxy (and thus share namespaces). clone() with namespace flags creates a new nsproxy.
- struct pid_namespace: Creates a new PID number space. Processes inside see PIDs starting from 1. PID namespaces are hierarchical. the parent namespace can see all PIDs in child namespaces, but not vice versa. PID 1 in a namespace has init semantics (reaps orphans).
- struct net (network namespace): Each network namespace has its own network devices, IP addresses, routing table, iptables rules, /proc/net, and socket structures. veth pairs and bridges connect network namespaces to each other and to the host.
- struct user_namespace: Maps UID/GID ranges between namespaces via /proc/PID/uid_map. Enables rootless containers: UID 0 inside the namespace maps to an unprivileged UID (e.g., 100000) on the host. Owning a user namespace grants capabilities within it.
Key Points for Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
- 8 namespace types as of kernel 5.6: mount, UTS (hostname), IPC, network, PID, user, cgroup, and time. The time namespace lets containers have different CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets -- useful for container migration.
- unshare() creates new namespaces for the calling process. clone() creates them for a child. setns() joins an existing namespace by fd from /proc/PID/ns/<type>. That last one is how 'docker exec' enters a running container.
- If PID 1 in a PID namespace exits, every other process in that namespace gets SIGKILL. The entire container is torn down. That is why containers need an init process like tini or dumb-init for signal forwarding and zombie reaping.
- Mount namespaces plus pivot_root() are what make the container's rootfs appear as /. Docker uses overlayfs for layered images, then pivot_root to swap the root. The host filesystem becomes invisible.
- User namespaces are the key to rootless containers. An unprivileged user creates a user namespace, gets full capabilities inside it, and can then create all other namespace types. No real root needed -- the sketch below shows exactly this sequence.
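A compact sketch of that rootless sequence, assuming unprivileged user namespaces are enabled (the default on most distros):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* New user namespace first (no privilege needed), then a private
         * UTS namespace we now have the capabilities to create. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) < 0) {
            perror("unshare");
            return 1;
        }

        /* Changes the hostname only inside this namespace;
         * the host's hostname is untouched. */
        const char *name = "sandbox";
        if (sethostname(name, strlen(name)) < 0)
            perror("sethostname");

        char buf[64];
        gethostname(buf, sizeof(buf));
        printf("hostname in namespace: %s\n", buf);
        return 0;
    }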
Common Mistakes with Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
- Mistake: Expecting /proc to be isolated after entering a PID namespace. Reality: You must mount a new procfs (mount -t proc proc /proc) or /proc still shows the host's process list.
- Mistake: Wondering why network is broken in a new namespace. Reality: Network namespaces start with only loopback. You must create veth pairs, assign IPs, and set up routing -- or use a CNI plugin.
- Mistake: Thinking PID namespace isolation is absolute. Reality: PID namespaces are hierarchical. The parent can see all child PIDs. kill() from the host can target container processes by their host PID. This is by design.
- Mistake: Using chroot for container filesystem isolation. Reality: chroot is trivially escapable (open fd to /, chroot to subdirectory, fchdir). pivot_root in a mount namespace has no such escape.
Related Topics
cgroups v2 (Control Groups), Seccomp: Sandboxing System Calls, Linux Capabilities, File Permissions, Ownership & ACLs
Netfilter & nftables/iptables — Networking & Sockets
Difficulty: Advanced
Every packet passes through five kernel hook points where registered callbacks -- iptables rules, nftables rules, or eBPF programs -- can accept, drop, modify, or redirect it. Connection tracking (nf_conntrack) maintains a hash table of flows for stateful firewalling and NAT. iptables walks rules linearly, O(n) per chain. nftables uses hash sets for O(1) lookups. The conntrack table caps at 262,144 entries by default; when it fills, new connections drop silently.
System Calls for Netfilter & nftables/iptables
Key Components in Netfilter & nftables/iptables
- nf_hook_ops / nf_hooks: Array of hook points in the kernel's network stack where registered callbacks inspect and potentially modify, accept, or drop each packet. five hooks for IPv4: NF_INET_PRE_ROUTING, NF_INET_LOCAL_IN, NF_INET_FORWARD, NF_INET_LOCAL_OUT, NF_INET_POST_ROUTING. A registration sketch follows this component list.
- nf_conntrack (connection tracking): Stateful packet inspection engine that tracks connections (TCP states, UDP pseudo-connections, ICMP queries) using a hash table of nf_conn entries. enables stateful firewalling (ESTABLISHED,RELATED) and NAT
- nf_tables (nft): Modern replacement for iptables. uses a virtual machine (nf_tables VM) with register-based operations instead of table/chain/match linear walks; supports sets, maps, concatenations, and atomic ruleset updates
- xt_table / ipt_table: Legacy iptables table structure. five tables (filter, nat, mangle, raw, security) with built-in chains at each hook point; each chain is a linear list of rules matched sequentially
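A skeletal kernel-module sketch of hook registration, assuming the per-netns registration API (kernel 4.13+); it accepts every packet and exists only to show the shape of nf_hook_ops:

    #include <linux/module.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <net/net_namespace.h>

    /* Called for every IPv4 packet arriving before routing. */
    static unsigned int demo_hook(void *priv, struct sk_buff *skb,
                                  const struct nf_hook_state *state)
    {
        /* NF_ACCEPT lets the packet continue; NF_DROP discards it. */
        return NF_ACCEPT;
    }

    static struct nf_hook_ops demo_ops = {
        .hook     = demo_hook,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_FIRST,
    };

    static int __init demo_init(void)
    {
        /* init_net = host namespace; hooks are per-netns since 4.13. */
        return nf_register_net_hook(&init_net, &demo_ops);
    }

    static void __exit demo_exit(void)
    {
        nf_unregister_net_hook(&init_net, &demo_ops);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");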
Key Points for Netfilter & nftables/iptables
- Order matters more than anything else. PREROUTING runs DNAT before the routing decision. INPUT catches locally-bound traffic. FORWARD handles transit. OUTPUT intercepts locally-generated packets. POSTROUTING does SNAT after routing. Put a rule in the wrong chain and it silently never matches.
- Connection tracking (conntrack) is the most expensive part of Netfilter -- a hash table lookup and update on every packet adds 5-10% overhead at high packet rates. The raw table's NOTRACK target bypasses conntrack for specific flows when you don't need stateful tracking.
- The conntrack table has a hard limit (nf_conntrack_max, default 262144). When it's full, new connections are silently dropped. Each entry costs ~350 bytes. You'll see 'nf_conntrack: table full' in dmesg -- one of the most common causes of mysterious connection failures in containerized environments.
- iptables checks rules linearly -- 10,000 rules means 10,000 checks per packet. nftables uses hash-based sets and maps for O(1) lookups, making large rulesets orders of magnitude faster. This is why Kubernetes at scale can't use iptables mode.
- Kubernetes kube-proxy in iptables mode creates O(n) rules per Service. At 5000+ services, rule evaluation adds measurable latency to every packet. IPVS mode and eBPF (Cilium) avoid this scaling wall entirely.
Common Mistakes with Netfilter & nftables/iptables
- Mistake: adding DNAT rules in the FORWARD chain. Reality: DNAT must happen in PREROUTING, before the routing decision. By the time a packet reaches FORWARD, the destination is already resolved and DNAT is ignored.
- Mistake: mixing iptables and nftables without understanding shared state. Reality: both use the same kernel conntrack subsystem. Mixed rules on the same system cause unexpected interactions.
- Mistake: ignoring conntrack entries from TIME_WAIT connections. Reality: conntrack entries persist for tcp_timeout_time_wait (default 120 seconds) -- double the TCP TIME_WAIT. This exacerbates table exhaustion under high connection rates.
- Mistake: flushing iptables rules without flushing conntrack. Reality: existing connections continue through old conntrack entries even after rules are removed. Use 'conntrack -F' alongside rule changes.
Related Topics
Network Namespaces & veth Pairs, TCP State Machine & Connection Lifecycle, XDP & AF_XDP: Kernel-Bypass Networking, Socket Programming (TCP/UDP)
Network Namespaces & veth Pairs — Networking & Sockets
Difficulty: Advanced
A complete, isolated copy of the Linux network stack -- interfaces, IPs, routes, ARP, iptables, conntrack, /proc/net -- all separate per namespace. New ones start empty with only loopback. veth pairs act as virtual Ethernet cables: packets in one end, out the other, ~1-2 us of latency from a memcpy between sk_buffs. Docker wires containers to a bridge via veth pairs with MASQUERADE for egress. Kubernetes shares one namespace per pod via the pause container and assigns routable IPs through CNI plugins.
System Calls for Network Namespaces & veth Pairs
Key Components in Network Namespaces & veth Pairs
- struct net (network namespace): Top-level container for an isolated network stack. holds references to all network devices (struct net_device), routing tables (struct fib_table), netfilter hooks, conntrack, and /proc/net entries
- veth pair (struct veth): Virtual ethernet device pair. two ends connected internally; packets sent on one end appear on the other; one end lives in the host namespace, the other in the container namespace
- Linux bridge (struct net_bridge): Software L2 switch connecting multiple veth endpoints and physical NICs. performs MAC learning, flooding, and forwarding; Docker's docker0 bridge connects all containers
- struct nsproxy: Per-task structure holding pointers to all namespaces (net, pid, mnt, uts, ipc, cgroup, user). clone/unshare/setns modify this to enter or create namespaces
Key Points for Network Namespaces & veth Pairs
- Every namespace starts empty -- just a loopback interface and nothing else. The init namespace (PID 1) owns the physical NICs. You must explicitly move or create interfaces in new namespaces via 'ip link set dev vethX netns <ns>'.
- A veth pair is a virtual Ethernet cable: packets in one end, out the other. No loss, no reordering, no MTU surprises. Throughput is limited by CPU (it's memcpy between sk_buffs), not by any physical link.
- Docker's bridge mode in five words: veth pair, docker0 bridge, MASQUERADE. Each container gets its own netns, a veth connects it to the bridge, and iptables rewrites source addresses for outbound traffic.
- Kubernetes pods share a network namespace. The pause container creates it; app containers join via setns(). The CNI plugin (Calico, Flannel, Cilium) handles the veth pair and assigns a cluster-routable pod IP.
- setns(fd, CLONE_NEWNET) switches a process into an existing namespace. This is how nsenter and 'docker exec' work. The fd comes from /proc/<pid>/ns/net or a bind-mounted namespace file. A minimal sketch follows below.
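A minimal sketch of that setns() pattern, assuming sufficient privilege (CAP_SYS_ADMIN) and a reachable /proc/<pid>/ns/net:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        char path[64];
        snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);

        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Switch this process into the target's network namespace --
         * the same mechanism nsenter and 'docker exec' rely on. */
        if (setns(fd, CLONE_NEWNET) < 0) { perror("setns"); return 1; }
        close(fd);

        /* Everything from here sees the target's interfaces and routes
         * (assumes the 'ip' binary is installed). */
        execlp("ip", "ip", "addr", NULL);
        perror("execlp");
        return 1;
    }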
Common Mistakes with Network Namespaces & veth Pairs
- Mistake: forgetting to bring up loopback in a new namespace. Reality: without 'ip link set lo up', localhost connections fail inside the namespace. Every namespace needs this.
- Mistake: assigning an IP but no route. Reality: even after giving the veth an IP inside the namespace, traffic can't reach external networks without a default route pointing to the bridge's IP.
- Mistake: confusing 'ip netns' with Docker namespaces. Reality: 'ip netns' creates bind mounts in /var/run/netns/. Docker doesn't use named namespaces -- it creates them via clone(CLONE_NEWNET) and identifies them through /proc/<pid>/ns/net.
- Mistake: assuming stale veth pairs are cleaned up. Reality: if the container's netns was bind-mounted and not unmounted, stale interfaces persist even after the container exits. Normal exit (all processes terminated, no bind mount) does clean up automatically.
Related Topics
Netfilter & nftables/iptables, Socket Programming (TCP/UDP), XDP & AF_XDP: Kernel-Bypass Networking, Unix Domain Sockets
NUMA Architecture & Memory Policy — Memory Management
Difficulty: Advanced
Multi-socket servers split memory across CPU sockets -- each socket owns a memory controller and attached DRAM, forming a NUMA node. Local hits land in 80-100 ns at 60-80 GB/s; crossing the interconnect (Intel UPI, AMD Infinity Fabric) costs 130-200 ns at roughly half the bandwidth. The kernel defaults to first-touch placement. Four policies via set_mempolicy()/mbind() control where pages land: MPOL_DEFAULT, MPOL_BIND, MPOL_INTERLEAVE, MPOL_PREFERRED. AutoNUMA watches for remote access patterns and migrates pages at 20-50 us each.
System Calls for NUMA Architecture & Memory Policy
- mbind
- set_mempolicy
- get_mempolicy
- migrate_pages
- move_pages
Key Components in NUMA Architecture & Memory Policy
- pg_data_t (NUMA node descriptor): Per-node structure containing the node's zone list (DMA, Normal, HighMem), free page counts, kswapd thread, and LRU lists. one pg_data_t per NUMA node
- struct mempolicy: Per-VMA or per-process memory policy. specifies allocation strategy (default, bind, interleave, preferred, local) and the nodemask of allowed NUMA nodes
- zonelist (fallback order): Ordered list of zones across NUMA nodes that the page allocator tries when the preferred node is exhausted. determines the fallback path from local to remote memory
- struct numa_stat: Per-node allocation counters (numa_hit, numa_miss, numa_foreign, local_node, other_node) exposed via /sys/devices/system/node/node*/numastat. key metric for detecting NUMA imbalance
Key Points for NUMA Architecture & Memory Policy
- Local memory access: 80-100 ns. Remote access via the interconnect (QPI/UPI): 130-200 ns. That is a 1.5-2x penalty on every single memory read -- and for bandwidth-bound workloads, remote access can cut throughput by 30-50%
- The default NUMA policy is 'first touch' -- pages land on whatever node the faulting CPU belongs to; if one thread initializes all the data, ALL of it ends up on one node, and every other node pays the remote penalty forever
- Interleave policy (MPOL_INTERLEAVE) distributes pages round-robin across all nodes -- this averages out latency and multiplies available bandwidth, making it ideal for shared data accessed from every socket (see the sketch after this list)
- AutoNUMA is the kernel's attempt to fix bad placement automatically -- it scans PTEs, detects remote access patterns, and migrates pages to the local node; but each migration costs ~20 us per page, so it only works for stable patterns
- NUMA is not just about memory -- PCIe devices are attached to specific nodes too; a NIC on node 1 doing DMA into node 0's memory crosses the interconnect on every packet
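A short sketch of the interleave policy, using the set_mempolicy() wrapper from libnuma's numaif.h (link with -lnuma); the two-node mask is an assumption about the machine:

    #include <numaif.h>   /* set_mempolicy(), MPOL_* */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Interleave all future allocations across nodes 0 and 1. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);
        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          8 * sizeof(nodemask)) < 0) {
            perror("set_mempolicy");
            return 1;
        }

        /* Pages land round-robin on first touch, spreading bandwidth. */
        size_t len = 64UL << 20;
        char *buf = malloc(len);
        memset(buf, 0, len);      /* first touch = placement */

        puts("allocated 64 MB interleaved across nodes 0-1");
        free(buf);
        return 0;
    }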
Common Mistakes with NUMA Architecture & Memory Policy
- Running a database on a multi-socket server without NUMA awareness -- if shared buffers are allocated on node 0 (where postmaster starts) but queries run on both nodes, half the buffer accesses pay remote latency; use numactl --interleave=all
- Using --cpunodebind without --membind -- constraining CPUs to node 0 does not guarantee memory lands there; under pressure, the allocator falls back to remote nodes silently
- MPOL_BIND without monitoring -- binding to a single node means OOM when that node is exhausted, even if other nodes have gigabytes free; always monitor per-node memory with numastat
- Ignoring NUMA in containers -- Docker and Kubernetes do not enforce NUMA by default; a container's threads may run on CPUs across all nodes while its memory sits on one, creating the worst possible access pattern
Related Topics
Virtual Memory & Address Spaces, Huge Pages & THP, Memory Cgroups & Resource Limits, Page Tables & TLB
OOM Killer & Memory Pressure — Memory Management
Difficulty: Intermediate
When physical memory runs out -- or a cgroup hits its ceiling -- the kernel picks a victim and sends SIGKILL. No warning, no cleanup, no chance for a graceful shutdown. Selection comes from an oom_score combining RSS, swap usage, and the oom_score_adj knob that operators set.
System Calls for OOM Killer & Memory Pressure
Key Components in OOM Killer & Memory Pressure
- oom_badness(): Kernel function scoring each process for OOM candidacy. returns points proportional to RSS + swap usage, multiplied by oom_score_adj adjustment; highest score gets killed
- vm.overcommit_memory: Sysctl controlling kernel memory allocation policy. 0 = heuristic overcommit (default; refuses only allocations that are obviously beyond what RAM plus swap could cover), 1 = always overcommit, 2 = strict accounting (commit limit = swap + ratio * RAM)
- PSI (Pressure Stall Information): /proc/pressure/memory. reports percentage of time tasks are stalled waiting for memory (some = at least one task stalled, full = all tasks stalled); enables proactive memory pressure response before OOM
- struct oom_control: Kernel structure passed through the OOM path. contains memcg pointer (NULL for global OOM), GFP flags, allocation order, and selected victim task
Key Points for OOM Killer & Memory Pressure
- The OOM killer is the absolute last resort -- before it fires, the kernel has already tried reclaiming page cache, writing back dirty pages, swapping anonymous pages, and compacting memory; if you are seeing OOM kills, the system was drowning for a while before that
- oom_score_adj is how you rig the game: -1000 makes a process immortal (but if everything is immortal, the kernel panics), +1000 volunteers it as tribute; Kubernetes uses this to protect Guaranteed pods and sacrifice BestEffort ones
- Overcommit mode 0 is the kernel making a bet that not everyone will cash their checks at once -- malloc succeeds now, but if everyone touches their pages later, someone gets killed; this is why "malloc succeeded but the process died later" confuses so many developers
- PSI metrics are your early warning system -- user-space daemons like systemd-oomd watch these numbers and kill processes BEFORE the kernel OOM killer fires, giving you cleaner shutdowns and actual log messages instead of a bare SIGKILL; a polling sketch follows this list
- The OOM reaper is the kernel's backup plan -- if the victim is stuck in D-state and cannot exit, the reaper strips its anonymous memory anyway, because a dead process that cannot release its pages is worse than useless
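A tiny polling sketch of that PSI early-warning idea; the 10% avg10 threshold and 5-second interval are arbitrary assumptions, not tuned values:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            FILE *f = fopen("/proc/pressure/memory", "r");
            if (!f) { perror("fopen"); return 1; }

            char line[256];
            float avg10 = 0.0f;
            while (fgets(line, sizeof(line), f)) {
                /* "some avg10=1.23 avg60=... avg300=... total=..." */
                if (!strncmp(line, "some", 4))
                    sscanf(line, "some avg10=%f", &avg10);
            }
            fclose(f);

            if (avg10 > 10.0f)
                fprintf(stderr, "memory pressure: tasks stalled %.1f%% "
                                "of the last 10s -- act now\n", avg10);
            sleep(5);
        }
    }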
Common Mistakes with OOM Killer & Memory Pressure
- Making everything immune with oom_score_adj=-1000 -- if the kernel cannot find anyone to kill, it panics or hangs; at least one non-essential process must be killable, always
- Seeing "Killed" in logs and assuming it is an application bug -- search dmesg for "oom-kill" or "Out of memory" first; OOM kills look identical to crashes unless you check the kernel log
- Using strict overcommit (mode 2) without enough swap -- the commit limit is swap + (ratio * RAM); with no swap and the default 50% ratio, only half your RAM is allocatable, causing ENOMEM with gigabytes still free
- Ignoring PSI until it is too late -- by the time the OOM killer fires, the system has been thrashing for seconds or minutes; monitoring /proc/pressure/memory lets you act before things get that bad
Related Topics
Virtual Memory & Address Spaces, Memory Cgroups & Resource Limits, Huge Pages & THP, Heap Allocators (malloc internals)
OverlayFS & Union File Systems — File Systems & I/O
Difficulty: Intermediate
OverlayFS (mainline since Linux 3.18) stacks read-only lowerdirs beneath a single read-write upperdir, presenting them as one merged view. Lookups check the upper layer first, then walk lowerdirs top-to-bottom. Writing to a lower-layer file copies the entire file to upperdir first (crash-safe via workdir). Deleting drops a whiteout (char device 0/0) in upperdir. Metacopy (4.19+) skips the data copy for metadata-only changes like chmod. Docker's overlay2 maps image layers to lowerdirs and gives each container its own thin upperdir -- 100 containers off one 800 MB image share the base layers with near-zero per-container overhead.
System Calls for OverlayFS & Union File Systems
Key Components in OverlayFS & Union File Systems
- lowerdir (lower layers): One or more read-only directory trees stacked bottom-to-top; these form the base image layers that are shared across containers and never modified by overlay operations
- upperdir (upper layer): Single read-write directory where all modifications (creates, writes, deletes) are recorded; each container gets its own upperdir to isolate writes from other containers and from the image layers
- workdir (work directory): Internal scratch space on the same filesystem as upperdir; used by the kernel for atomic copy-up operations and whiteout creation to ensure crash consistency
- merged (mount point): The unified view presented to userspace; path lookups search upperdir first, then lowerdirs top-to-bottom, presenting the first match found as the definitive file
Key Points for OverlayFS & Union File Systems
- First write to a lower-layer file is expensive -- the kernel copies the ENTIRE file to upperdir before applying your one-byte change. A 2 GB base image file means a 2 GB copy-up, even if you only appended a newline. Subsequent writes hit the upper copy directly.
- Deleting a file does not actually delete anything. The kernel drops a "whiteout" (character device 0/0) in upperdir that hides the lower-layer file. Opaque directories (xattr trusted.overlay.opaque=y) hide everything below when you rm -rf and recreate a directory.
- 100 containers from one image cost almost zero extra disk. All share read-only lowerdirs; only unique writes accumulate in each container's upperdir. This is why Docker images are small but containers feel full-size.
- Metacopy (Linux 4.19+) is the performance shortcut for chmod/chown -- it creates a tiny metadata node in upperdir instead of copying gigabytes of file data. If you are doing permission changes on large files, this is the difference between milliseconds and minutes.
- upperdir and workdir must live on the same filesystem (ext4 or xfs). tmpfs gives fast writes but no persistence. NFS is not supported because overlay needs POSIX rename atomicity that NFS cannot guarantee. The mount(2) sketch below shows the full option string.
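The same layer layout as a direct mount(2) call -- a root-only sketch assuming /lower, /upper, /work, and /merged already exist on a suitable filesystem:

    #include <sys/mount.h>
    #include <stdio.h>

    int main(void)
    {
        /* upperdir and workdir must be on the same filesystem;
         * lowerdir entries are listed top-most first. */
        const char *opts =
            "lowerdir=/lower,upperdir=/upper,workdir=/work";

        if (mount("overlay", "/merged", "overlay", 0, opts) < 0) {
            perror("mount");
            return 1;
        }
        puts("overlay mounted at /merged");
        return 0;
    }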
Common Mistakes with OverlayFS & Union File Systems
- Mistake: Containers mysteriously run out of inodes with plenty of disk space. Reality: Every whiteout file and copied-up file consumes an inode on the upper filesystem. High container churn with lots of deletions exhausts inodes before bytes.
- Mistake: Assuming copy-up is instant. Reality: Writing a single byte to a 2 GB lower-layer file triggers a full 2 GB copy to upperdir. Structure Dockerfiles to modify large files in early layers, not late ones.
- Mistake: Manually mounting overlayfs to debug Docker and getting confused by the options. Reality: Docker's overlay2 driver manages lowerdir stacking, link indirection, and layer metadata automatically. Debugging requires reconstructing the full lowerdir chain from /var/lib/docker/overlay2/*/diff.
- Mistake: Expecting hard links to survive across layers. Reality: A file hard-linked in a lower layer becomes two separate files if both names are written to in the upper layer. The hard-link relationship silently breaks.
Related Topics
Virtual File System (VFS), Inodes & File Metadata, chroot & pivot_root, Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
Page Cache & Block I/O — File Systems & I/O
Difficulty: Advanced
The kernel keeps recently accessed disk data in RAM. Every read() and write() passes through this cache. Reads hit RAM when the page is cached; writes land in RAM first and reach disk later through dirty page writeback. This explains two things that confuse people: write() returns in microseconds, and "free" memory on a busy server looks alarmingly low.
System Calls for Page Cache & Block I/O
- sync
- fsync
- fdatasync
- posix_fadvise
- readahead
Key Components in Page Cache & Block I/O
- struct address_space: Per-inode data structure managing all cached pages for a file; contains an xarray (radix tree) mapping file offsets to struct page pointers
- Page cache (unified): System-wide cache of file data in physical memory pages (4KB each); replaces the legacy buffer cache; managed by the address_space of each inode
- Writeback threads (kworker/flush): Kernel threads that flush dirty pages to disk; triggered by timer (dirty_writeback_centisecs), memory pressure (dirty_background_ratio), or explicit sync/fsync
- struct bio: Block I/O request representing a contiguous set of disk sectors; submitted to the block layer's I/O scheduler for merging and dispatch to the device driver
Key Points for Page Cache & Block I/O
- write() does NOT put data on disk. It copies to page cache, marks the page dirty, and returns. Without fsync(), you're trusting electricity. A durable-append sketch follows this list.
- The kernel detects sequential reads and prefetches pages before you ask. Readahead starts at 128KB and grows up to 2MB. That's why second reads are fast — data is already waiting in RAM.
- dirty_background_ratio (10%) triggers background writeback. dirty_ratio (20%) triggers BLOCKING writeback — your write() call hangs. If you hit dirty_ratio, your disk can't keep up with your writes.
- O_DIRECT bypasses the page cache entirely — used by databases that run their own buffer pool. But O_DIRECT alone doesn't guarantee durability. You still need fsync() to flush the disk's hardware write cache.
- drop_caches evicts clean pages from RAM. It does NOT flush dirty pages. It's for benchmarking only — never use it in production.
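A durable-append sketch tying these points together; the path /var/log/myapp is hypothetical, and note the directory fsync for newly created files:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/log/myapp/events.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *rec = "event=startup\n";
        if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

        /* Data only -- skips the inode metadata flush; enough for appends. */
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }
        close(fd);

        /* If the file was just created, persist its directory entry too,
         * or a crash can make the file itself vanish. */
        int dfd = open("/var/log/myapp", O_RDONLY | O_DIRECTORY);
        if (dfd >= 0) {
            fsync(dfd);
            close(dfd);
        }
        return 0;
    }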
Common Mistakes with Page Cache & Block I/O
- Thinking write() means data is safe. Reality: it's only in RAM. The default 30-second writeback delay means up to 30 seconds of data loss on power failure. For anything that matters, you need fsync().
- Using fsync() when fdatasync() is enough. fsync() flushes data AND metadata (inode), requiring an extra disk write. For append workloads like logs, fdatasync() is sufficient and can be 2x faster.
- Calling fsync() on the file but forgetting the parent directory. On ext4, a newly created file's directory entry may not be persisted until you fsync the directory itself. A crash can make your file vanish entirely.
- Using O_DIRECT without understanding alignment. Buffers must be aligned to the filesystem block size (usually 4096 bytes). Misaligned I/O silently falls back to buffered mode on some kernels or returns EINVAL on others.
Related Topics
Virtual File System (VFS), I/O Models: Blocking, Non-Blocking, Async, io_uring: Modern Async I/O, Virtual Memory & Address Spaces
Page Tables & TLB — Memory Management
Difficulty: Advanced
Four levels of lookup tables (PGD, PUD, PMD, PTE) stand between a virtual address and physical RAM. The TLB caches recent translations so most accesses skip the walk entirely. A miss costs 5-30 ns as the CPU chases pointers through four memory reads. On multi-core machines, changing any mapping forces a TLB shootdown -- an IPI to every core that might hold a stale entry.
System Calls for Page Tables & TLB
Key Components in Page Tables & TLB
- pgd_t / p4d_t / pud_t / pmd_t / pte_t: Page table entry types at each level. each is a 64-bit value containing the physical frame number, permission bits, accessed/dirty flags, and NX bit
- struct mm_struct->pgd: Pointer to the process's top-level page global directory, loaded into CR3 register on context switch to activate that process's address space
- TLB (Translation Lookaside Buffer): Hardware cache in the MMU holding recent virtual-to-physical translations. L1 dTLB (64 entries), L1 iTLB (128 entries), L2 sTLB (1536 entries) on modern Intel
- struct mmu_gather: Kernel structure for batching TLB invalidations during munmap/exit. defers shootdown IPIs until the entire batch is prepared, reducing cross-core interrupts
Key Points for Page Tables & TLB
- The TLB is tiny -- 64 entries in the L1 dTLB. With 4 KB pages, that covers just 256 KB of memory. Miss it, and the CPU walks four levels of page tables, burning 10-30 ns per access. For databases with GB-sized working sets, this is the bottleneck nobody talks about
- TLB shootdowns are the silent killer on multi-core systems -- when one CPU changes a mapping, it must IPI every other core running threads of that process, and everyone waits. On 128 cores, a single munmap() can stall the entire machine for 100+ microseconds
- Huge pages (2 MB) skip an entire page table level and give each TLB entry 512x more coverage -- this is why every serious database deployment uses them (see the sketch after this list)
- The PTE's accessed and dirty bits are set by hardware on every read/write, letting the kernel's page reclaim (kswapd) find cold pages to evict without any software overhead
- PCID tags TLB entries per process so context switches do not flush the entire TLB -- this became critical after Meltdown when KPTI turned every syscall into an effective context switch
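A sketch of that huge-page win using explicit 2 MB pages, assuming pages have been reserved via /proc/sys/vm/nr_hugepages beforehand:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 2UL << 20;   /* one 2 MB huge page */

        /* One TLB entry now covers 2 MB instead of 4 KB, and the
         * hardware walk skips the PTE level entirely. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");  /* likely: no hugepages reserved */
            return 1;
        }

        memset(p, 0, len);
        puts("got a 2 MB huge page");
        munmap(p, len);
        return 0;
    }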
Common Mistakes with Page Tables & TLB
- Ignoring page table memory overhead -- a process with a fragmented 1 TB virtual address space can consume several GB of page tables even if RSS is small, because every mapped region needs page table pages at each level
- Calling mprotect() in a tight loop on many small regions -- each call can trigger TLB shootdowns across all cores, creating O(n * num_cpus) IPI storms that tank latency
- Assuming TLB flushes are free on context switch -- without PCID, switching processes flushes the entire TLB, costing ~1000 cycles plus all the subsequent miss penalties; frequent context switches destroy performance
- Overlooking TLB miss cost in pointer-chasing workloads -- random access across a large address space means a TLB miss on nearly every access, each requiring 4 sequential memory reads
Related Topics
Virtual Memory & Address Spaces, Huge Pages & THP, NUMA Architecture & Memory Policy, mmap & Memory-Mapped Files
PAM: Pluggable Authentication Modules — Security & Access Control
Difficulty: Intermediate
PAM decouples authentication logic from applications entirely. Programs call pam_authenticate(), pam_acct_mgmt(), and pam_open_session() through libpam, and per-service configs in /etc/pam.d/ chain modules -- pam_unix.so, pam_ldap.so, pam_google_authenticator.so -- across four stacks: auth (identity), account (validity), password (credential changes), session (environment setup). Control flags (required, requisite, sufficient, optional) govern evaluation flow. Buried in the session stack, pam_limits.so quietly sets the ulimits that Docker containers and Kubernetes pods inherit.
System Calls for PAM: Pluggable Authentication Modules
- pam_start
- pam_authenticate
- pam_acct_mgmt
- pam_open_session
- pam_end
Key Components in PAM: Pluggable Authentication Modules
- PAM Stack (auth, account, password, session): Four types of modules that handle different phases of authentication. 'auth' verifies identity (password, token, biometric). 'account' checks account validity (expiration, time-of-day restrictions, host access). 'password' manages credential updates (password changes). 'session' sets up/tears down the user environment (ulimits, home directory, logging). Each application's /etc/pam.d/ file configures which modules run in each stack.
- Control Flags (required, requisite, sufficient, optional): 'required': module must succeed for the final result to be success, but remaining modules still run. 'requisite': module must succeed AND failure immediately returns to the application (short-circuit). 'sufficient': if this module succeeds AND no prior required module failed, immediately return success (skip remaining). 'optional': result only matters if it is the only module in the stack.
- PAM Modules (/lib/security/pam_*.so): Shared libraries that implement the actual authentication logic. pam_unix.so checks /etc/shadow passwords. pam_ldap.so authenticates against LDAP/Active Directory. pam_google_authenticator.so verifies TOTP codes. pam_limits.so sets ulimits from /etc/security/limits.conf. pam_env.so sets environment variables. Modules are loaded dynamically by libpam.
- /etc/pam.d/ Configuration: Per-application PAM configuration files. /etc/pam.d/sshd controls SSH authentication, /etc/pam.d/sudo controls sudo, /etc/pam.d/login controls console login. Each file lists module type, control flag, and module path. '@include common-auth' includes shared base configurations. Order matters: modules execute top-to-bottom within each stack type.
Key Points for PAM: Pluggable Authentication Modules
- PAM deliberately runs ALL required modules even after one fails. This is not a bug -- it is a timing-attack defense. If PAM stopped early on a bad username, an attacker could measure response time to distinguish "user doesn't exist" from "wrong password." Only 'requisite' short-circuits, and using it carelessly leaks exactly this information.
- The 'sufficient' flag is PAM's fast lane: if pam_unix succeeds (correct local password) and no prior 'required' module has failed, PAM skips everything else and returns success immediately. This is how systems implement "local password OR LDAP" fallback chains -- try local first, fall through to LDAP only if local fails.
- pam_limits.so is the reason your containers have the ulimits they have. It reads /etc/security/limits.conf during the session stack and sets nofile, nproc, memlock via setrlimit(). Docker containers inherit these from dockerd's PAM session unless you override with --ulimit. Kubernetes nodes often need limits.conf tuning for Elasticsearch and other file-hungry pods.
- SSH flows through PAM in a strict order: auth stack (password or TOTP check), account stack (is this account expired or locked?), session stack (set ulimits, register with systemd-logind, set the audit UID). Here is the catch -- key-based SSH auth bypasses the auth stack entirely (sshd handles keys itself) but still runs account and session. Your PAM session modules always fire. The sketch after this list shows the same auth -> account -> session order from the application side.
- A PAM misconfiguration can lock you out of a system completely. If both /etc/pam.d/login and /etc/pam.d/sshd are broken, there is no way in -- not console, not SSH. Always keep a root shell open when editing PAM configs, and test with 'pamtester' before deploying. Recovery means booting into single-user mode or mounting the disk from a rescue image.
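The application side of that flow as a minimal libpam client -- a sketch assuming the 'login' service config and linking with -lpam -lpam_misc:

    #include <security/pam_appl.h>
    #include <security/pam_misc.h>   /* misc_conv, from libpam_misc */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s user\n", argv[0]); return 1; }

        static struct pam_conv conv = { misc_conv, NULL };
        pam_handle_t *pamh = NULL;

        int rc = pam_start("login", argv[1], &conv, &pamh);
        if (rc != PAM_SUCCESS) { fprintf(stderr, "pam_start failed\n"); return 1; }

        /* auth -> account -> session: the same order sshd walks. */
        rc = pam_authenticate(pamh, 0);
        if (rc == PAM_SUCCESS) rc = pam_acct_mgmt(pamh, 0);
        if (rc == PAM_SUCCESS) rc = pam_open_session(pamh, 0);

        printf("result: %s\n", pam_strerror(pamh, rc));

        if (rc == PAM_SUCCESS) pam_close_session(pamh, 0);
        pam_end(pamh, rc);
        return rc == PAM_SUCCESS ? 0 : 1;
    }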
Common Mistakes with PAM: Pluggable Authentication Modules
- Mistake: Setting pam_google_authenticator as 'sufficient' for 2FA. Reality: if sufficient, a correct TOTP code alone grants access WITHOUT checking the password. Anyone who knows the TOTP seed walks right in. For real 2FA, both pam_unix and pam_google_authenticator must be 'required' in the auth stack.
- Mistake: Using 'requisite' instead of 'required' for pam_unix without understanding the tradeoff. Reality: requisite short-circuits on failure, which leaks timing information. An attacker can distinguish "invalid username" (fast return) from "wrong password" (slower /etc/shadow lookup). Use 'required' unless you have a specific reason not to.
- Mistake: Editing /etc/pam.d/common-auth without testing first. Reality: common-auth is @included by sshd, sudo, login, su, and virtually every PAM-aware application. A single syntax error here locks out every authentication path simultaneously. Always validate with 'pamtester sshd youruser authenticate' before saving.
- Mistake: Adding pam_limits.so to the auth stack. Reality: pam_limits is a session module -- putting it in the auth stack does nothing. It only processes limits when called as 'session required pam_limits.so'. Same principle applies to pam_env.so: stack placement determines when it runs.
Related Topics
File Permissions, Ownership & ACLs, Linux Capabilities, SELinux & AppArmor, Process Lifecycle (fork/exec/wait)
Perf Events & Performance Counters — Kernel Internals
Difficulty: Advanced
perf_event_open() taps the CPU's hardware performance counters -- dedicated PMU registers that tick with no software overhead. Counting mode (perf stat) reads totals at context switch; near-zero cost. Sampling mode (perf record) triggers an NMI every N events, grabs the instruction pointer and callchain, and dumps them into a shared ring buffer for offline analysis. Common events: cycles, instructions, cache-misses, branch-misses. Most CPUs expose 4-8 simultaneous counters; request more and the kernel multiplexes. Flame graphs come from aggregating callchain samples by stack trace.
System Calls for Perf Events & Performance Counters
- perf_event_open
- ioctl
- mmap
- read
Key Components in Perf Events & Performance Counters
- perf_event_open() syscall: Creates a performance counter file descriptor. Takes a struct perf_event_attr specifying the event type (hardware, software, tracepoint, kprobe, uprobe), event config (which specific counter), sampling period or frequency, and flags. Returns an fd that can be read() for counts, mmap()'d for the ring buffer (sampling), or controlled via ioctl (enable/disable/reset).
- Hardware Performance Counters (PMU): CPU-specific registers that count micro-architectural events without software overhead. Common counters: instructions retired, CPU cycles, cache-references, cache-misses (L1, L2, LLC), branch-instructions, branch-misses, bus-cycles. Limited to 4-8 simultaneous counters on most CPUs (Intel: 4 general + 3 fixed). The kernel multiplexes if more events are requested.
- Ring Buffer (perf_mmap): When sampling (not counting), perf_event_open returns an fd that is mmap()'d to a shared ring buffer between kernel and user-space. The kernel writes sample records (IP, timestamp, callchain, registers) into the ring buffer at each sampling interrupt; user-space reads them without syscalls. perf record reads this buffer and writes perf.data files.
- perf_event_attr struct: The configuration struct passed to perf_event_open. Key fields: type (PERF_TYPE_HARDWARE, _SOFTWARE, _TRACEPOINT, _KPROBE), config (specific event ID), sample_period (every N events) or sample_freq (target N samples/sec), sample_type (bitmask: IP, TID, TIME, CALLCHAIN, BRANCH_STACK), exclude_kernel/exclude_user flags for filtering.
Key Points for Perf Events & Performance Counters
- Counting mode (perf stat) has near-zero overhead. Hardware counters tick in dedicated CPU registers with no interrupts. The kernel reads them at context switch. 'perf stat -e cache-misses,instructions ./myprogram' costs less than 0.1%. A counting-mode sketch follows this list.
- Sampling mode (perf record) fires an NMI every N events and captures the instruction pointer plus call chain. At 99 Hz, overhead is ~1%. At 100K Hz, it is 10-20%. The key insight: sampling tells you WHERE events occur, not just HOW MANY.
- Flame graphs are built from callchain samples. The x-axis is alphabetical (NOT time). Width is proportional to sample count. The y-axis is stack depth. Wide bars at the top are your optimization targets.
- Hardware counter multiplexing kicks in when you request more events than PMU counters (typically 4-8). The kernel time-slices and extrapolates. You see a percentage indicator in perf stat output. For precise measurements, stay within the counter limit.
- perf traces kernel functions (kprobes) and user-space functions (uprobes) without recompilation. Combined with 'perf record -e probe:*', this gives function-level tracing with far less overhead than strace because it runs in-kernel.
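A counting-mode sketch in the spirit of the perf_event_open(2) man page example; glibc provides no wrapper, hence the raw syscall:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;   /* user-space instructions only */

        int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile unsigned long x = 0;
        for (int i = 0; i < 1000000; i++) x += i;   /* measured work */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }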
Common Mistakes with Perf Events & Performance Counters
- Mistake: Using perf record without -g (call graph). Reality: Without stack traces, you see which functions are hot but not WHY they are hot (which callers lead to them). Always use 'perf record -g'. Use --call-graph dwarf for user-space or --call-graph fp for kernel.
- Mistake: Comparing raw cache-miss counts across different CPUs. Reality: A 'cache miss' on Intel Skylake refers to a different cache level than on AMD Zen. Compare cache-misses/instructions (miss rate) instead, and verify which level the event maps to.
- Mistake: Profiling without debug symbols. Reality: perf captures instruction pointers and needs symbol tables to map them to function names. Without debuginfo, you see hex addresses. Install debuginfo packages or build with -g -O2.
- Mistake: Setting sample frequency too high. Reality: 'perf record -F 99999' generates ~100K NMIs/sec, consuming significant CPU and perturbing the workload. Use -F 99 as the default -- enough for statistical significance with minimal observer effect.
Related Topics
eBPF: Programmable Kernel, Process Scheduling (CFS), Page Cache & Block I/O, NUMA Architecture & Memory Policy
Inter-Process Communication (Pipes & FIFOs) — Processes & Threads
Difficulty: Starter
An in-memory kernel buffer linking one process's output to another's input. Anonymous pipes tie parent and child together after fork(). Named pipes (FIFOs) give unrelated processes a filesystem path to rendezvous at. Writes of PIPE_BUF bytes or fewer (4096 on Linux) are guaranteed atomic -- larger writes can be split and interleaved.
System Calls for Inter-Process Communication (Pipes & FIFOs)
- pipe
- pipe2
- mkfifo
- read
- write
- select
Key Components in Inter-Process Communication (Pipes & FIFOs)
- pipe_inode_info: Kernel structure representing a pipe. Contains a circular buffer of pipe_buffer entries (each pointing to a page), reader/writer counts, and wait queues for blocking I/O.
- pipe_buffer: Each slot in the pipe's circular buffer holds a reference to a page, an offset, and a length. Default pipe capacity is 16 pages (64KB on systems with 4KB pages). Adjustable per-pipe via fcntl(F_SETPIPE_SZ).
- pipefs: A pseudo-filesystem (mounted internally) that provides inodes for pipe file descriptors. Pipes appear as entries in /proc/[pid]/fd but have no path in the regular filesystem (anonymous pipes).
- FIFO inode: A named pipe created with mkfifo(). Unlike anonymous pipes, FIFOs have a filesystem path and persist until deleted. Multiple unrelated processes can communicate through a FIFO by opening the same path.
Key Points for Inter-Process Communication (Pipes & FIFOs)
- Writes of PIPE_BUF (4096 on Linux) bytes or fewer are guaranteed atomic -- they'll never be interleaved with other writers' data. This is a POSIX guarantee that every shell pipeline depends on. Writes larger than PIPE_BUF can be split and interleaved.
- When the last reader closes a pipe, a subsequent write delivers SIGPIPE to the writer (default: terminate). If SIGPIPE is ignored or handled, write() instead fails with errno EPIPE. That's how 'yes | head -5' works -- head exits after printing 5 lines, closing the read end, and yes gets killed by SIGPIPE.
- pipe2(fds, O_CLOEXEC | O_NONBLOCK) is the right way to create pipes. Between pipe() and a separate fcntl(), another thread can fork() and leak the fds to a child process. pipe2() is atomic.
- Default capacity is 64KB (16 pages), but you can grow it up to pipe-max-size (default 1MB) via fcntl(F_SETPIPE_SZ). Unprivileged users are capped by pipe-user-pages-soft (default 16384 pages, i.e. 64MB total across all of a user's pipes).
- splice() and tee() move data between pipes and fds without copying through userspace -- the kernel transfers page references directly. This is how efficient proxy servers achieve zero-copy I/O.
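A minimal C sketch tying these points together -- pipe2() with O_CLOEXEC, each side closing the end it does not use, and the reader looping on read() until EOF:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe2(fds, O_CLOEXEC) < 0) { perror("pipe2"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                        /* child: the reader */
        close(fds[1]);                     /* close the write end, or read() never sees EOF */
        char buf[256];
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof(buf))) > 0)  /* reads can be short: loop */
            write(STDOUT_FILENO, buf, n);
        close(fds[0]);
        _exit(0);                          /* _exit, not exit: avoid double-flushing stdio */
    }

    close(fds[0]);                         /* parent: close the unused read end */
    const char *msg = "hello through the pipe\n";
    write(fds[1], msg, strlen(msg));       /* under PIPE_BUF bytes, so atomic */
    close(fds[1]);                         /* this is what produces EOF for the child */
    waitpid(pid, NULL, 0);
    return 0;
}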
Common Mistakes with Inter-Process Communication (Pipes & FIFOs)
- Mistake: not closing unused pipe ends after fork(). Reality: if the child still has the write end open, the child's own read() will never see EOF because there's still an open writer -- itself. Deadlock. Always close the end you don't use.
- Mistake: assuming read() returns the full amount. Reality: pipe reads can be short, especially with O_NONBLOCK. Always loop until you get the expected bytes or hit EOF/error.
- Mistake: opening a FIFO without understanding the block. Reality: open() on a FIFO blocks until the other end is also opened. O_NONBLOCK on a reader succeeds immediately; O_NONBLOCK on a writer fails with ENXIO if no reader exists.
- Mistake: assuming pipes are bidirectional. Reality: Unix pipes are strictly one-way. For bidirectional communication, use two pipes or a socketpair().
Related Topics
Process Lifecycle (fork/exec/wait), System V & POSIX Message Queues, Shared Memory & Semaphores, Signals & Signal Handling
POSIX Threads — Processes & Threads
Difficulty: Intermediate
Under the hood, Linux threads are just task_struct instances created with clone() and flags that share the address space, file descriptors, and signal handlers. Each thread gets its own tid, an 8 MB stack (mmap'd with a guard page), and a Thread Control Block at the stack top -- addressed through the FS segment register for O(1) thread-local storage. pthread_create() wraps all of this. Mutexes fast-path through a single atomic CAS in userspace (no syscall); the kernel only gets involved via futex() when contention actually happens.
System Calls for POSIX Threads
- pthread_create
- pthread_join
- pthread_mutex_lock
- pthread_cond_wait
Key Components in POSIX Threads
- task_struct (per-thread): Each pthread is a kernel task_struct with its own PID (tid), stack, register state, and signal mask. Threads in the same process share the same mm_struct (address space), files_struct, and sighand_struct.
- futex: Fast Userspace Mutex -- the kernel primitive underlying pthread_mutex, pthread_cond, and pthread_rwlock. Uncontended operations (the fast path) stay entirely in userspace via atomic compare-and-swap. Only when contention occurs does futex(FUTEX_WAIT) put the thread to sleep.
- pthread_t / tcb (Thread Control Block): Glibc allocates a per-thread struct pthread on the thread's stack. Contains thread-local storage pointer, cleanup handlers, cancellation state, join state, and the futex used for pthread_join().
- robust_list: Kernel-maintained list of robust futexes owned by a thread. If the thread dies while holding a robust mutex, the kernel marks it with FUTEX_OWNER_DIED so the next locker gets EOWNERDEAD and can recover.
Key Points for POSIX Threads
- Linux makes no distinction between threads and processes at the kernel level. Thread creation is clone() with CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD. The CLONE_THREAD flag groups them under the same tgid (what getpid() returns). That is it.
- An uncontended mutex lock is a single atomic cmpxchg in userspace. No syscall. Only when another thread holds the lock does futex(FUTEX_WAIT) enter the kernel. Uncontended mutexes cost about 25 nanoseconds. This is why pthreads are fast.
- pthread_cond_wait() does two things atomically: releases the mutex and sleeps on the condvar's futex. This atomicity is critical. Without it, a signal could slip between unlock and sleep -- the classic 'lost wakeup' problem.
- Thread-local storage (__thread / thread_local) uses the FS segment register on x86-64. Each thread's FS base points to its TCB, and TLS variables are offsets from FS. Accessing a thread-local variable is a single mov instruction, not a function call.
- Thread cancellation is almost never what you want. Deferred cancellation only fires at specific 'cancellation points' (sleep, read, write, etc.). Asynchronous cancellation can fire at any instruction and will leave mutexes locked and heap corrupted. Prefer a shared 'shutdown' flag instead.
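The lost-wakeup discussion above maps to one small, standard pattern. A minimal sketch, assuming a single shared predicate: the waiter re-checks the predicate in a loop, and the signaler modifies it with the mutex held (compile with -lpthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                    /* the shared predicate */

static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!ready)                       /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock); /* atomically unlocks and sleeps */
    printf("woken with ready=%d\n", ready);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);

    pthread_mutex_lock(&lock);           /* signal while holding the mutex */
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}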
Common Mistakes with POSIX Threads
- Not checking pthread_mutex_lock() return value. With PTHREAD_MUTEX_ERRORCHECK or robust mutexes, it can return EDEADLK or EOWNERDEAD. Default mutexes silently deadlock on recursive locking -- no error, just a hang.
- Signaling a condition variable without holding the associated mutex. Technically allowed by POSIX, but it creates a race: the waiter may miss the signal if it has not entered pthread_cond_wait() yet. Always signal while holding the mutex, or immediately after unlocking.
- Using a stack-allocated mutex or condvar after the declaring function returns. The mutex memory gets reused, and other threads end up locking garbage. Always use heap-allocated or global synchronization primitives.
- Calling fork() in a multithreaded program without immediately calling exec(). Only the calling thread survives in the child. Mutexes held by vanished threads remain locked forever. Use pthread_atfork() as a band-aid, or avoid this pattern entirely.
Related Topics
Process Lifecycle (fork/exec/wait), Signals & Signal Handling, Shared Memory & Semaphores, Copy-on-Write & Process Creation Internals
/proc and /sys Advanced Patterns — System Tuning
Difficulty: Intermediate
The /proc and /sys filesystems are the kernel's API exposed as files. They contain no data on disk. Every read triggers a kernel function that generates the content on the fly. Some files are cheap (formatted counters), some are expensive (full page table walks). Knowing the difference matters when scraping at scale.
System Calls for /proc and /sys Advanced Patterns
- open
- read
- write
- openat
- readv
- writev
Key Components in /proc and /sys Advanced Patterns
- /proc/PID/ (per-process): Each process gets a directory under /proc named by its PID. The files inside expose process state: status (UID, VmRSS, threads), maps (virtual memory areas), smaps (detailed memory per VMA), fd/ (open file descriptors as symlinks), io (bytes read/written), cgroup (cgroup membership), cmdline (command line arguments), environ (environment variables), stat (scheduling counters), and dozens more.
- /proc/sys/ (kernel tunables): Writable files that control kernel behavior. Organized by subsystem: /proc/sys/net/ (networking), /proc/sys/vm/ (virtual memory), /proc/sys/fs/ (filesystem limits), /proc/sys/kernel/ (core kernel). The sysctl command reads/writes these files. Changes via /etc/sysctl.conf or /etc/sysctl.d/*.conf are applied at boot by systemd-sysctl.service.
- /sys/fs/cgroup/ (cgroup v2 interface): The control interface for cgroup v2. Each cgroup is a directory. Resource limits are set by writing to files: memory.max, cpu.max, io.max, pids.max. Current usage is read from memory.current, cpu.stat, io.stat. Events (OOM kills, throttling) are reported in memory.events, cpu.stat. Pressure stall information is in cpu.pressure, memory.pressure, io.pressure.
- /sys/class/ and /sys/devices/ (device model): Represents the kernel's device tree. /sys/class/ organizes devices by type (net, block, tty). /sys/devices/ organizes by bus topology (PCI, USB). Each device directory contains attribute files: carrier and speed for network interfaces, size and queue/ for block devices. These are the files that udev rules read to make device management decisions.
- /proc/net/ (network subsystem): Network state files. /proc/net/tcp lists all TCP sockets with state, local/remote addresses, queue sizes. /proc/net/dev shows per-interface byte and packet counters. /proc/net/snmp has protocol-level statistics (retransmits, errors). These files are generated by iterating kernel data structures, and some (/proc/net/tcp) hold locks while doing so.
Key Points for /proc and /sys Advanced Patterns
- /proc files have zero bytes on disk. The kernel generates content at read time via seq_file or simple_read callbacks. A read of /proc/meminfo calls meminfo_proc_show() which formats current values from global variables. Nothing is cached between reads. Every open+read gets a fresh snapshot.
- Read cost varies by orders of magnitude. /proc/loadavg: read a few cached integers (nanoseconds). /proc/PID/smaps: walk every page table entry for every VMA in the process (milliseconds for large processes). /proc/net/tcp: iterate the entire TCP established hash table under a spinlock (tens of milliseconds on busy servers with 100k connections).
- Writing to /proc/sys files calls a kernel handler that validates the input and updates an in-memory variable. There is no file I/O. Writing "1" to /proc/sys/net/ipv4/ip_forward calls devinet_sysctl_forward() which toggles a flag in the network stack. The sysctl command, echo redirect, and direct write() all do exactly the same thing.
- /sys follows the kobject model. Each directory represents a kernel object (device, bus, driver). Attributes are files that map to show/store callbacks on the object. Creating a new network interface creates a new directory in /sys/class/net/ with attribute files. This is how udev discovers devices: it watches /sys via netlink for kobject events.
- In cgroup v2, the interface files (memory.max, cpu.max) live in /sys/fs/cgroup. Each cgroup is a directory. The hierarchy is the filesystem hierarchy. Moving a process between cgroups is done by writing its PID to the target cgroup's cgroup.procs file. Listing members is reading cgroup.procs.
Common Mistakes with /proc and /sys Advanced Patterns
- Scraping /proc/PID/smaps for hundreds of processes at high frequency. Each read walks the entire page table. For a process with 10 GB of memory (2.5M pages), this takes 5-15ms. At 500 processes every 10 seconds, that is 2.5 to 7.5 seconds of CPU time per scrape cycle. Use /proc/PID/smaps_rollup for a single-line summary, or /proc/PID/status for VmRSS if detailed per-VMA breakdown is not needed.
- Reading /proc/net/tcp on servers with 100k+ connections. The kernel iterates the TCP hash table under a lock, serializing all readers. On a busy load balancer, frequent reads of /proc/net/tcp cause lock contention that increases TCP latency. Use ss (which uses netlink SOCK_DIAG) instead of parsing /proc/net/tcp directly.
- Assuming /proc/PID files are consistent across reads. A process can exit between opening /proc/PID/status and reading it. The read returns stale data or ESRCH. Multi-file reads from the same PID directory are not atomic: /proc/PID/stat and /proc/PID/status can reflect different moments. For monitoring, this is usually acceptable. For debugging, it means values might not add up perfectly.
- Writing to cgroup files with echo and forgetting that some files require specific formats. cpu.max takes "quota period" (e.g., "100000 100000" for 100%). memory.max takes bytes or "max" for unlimited. Writing an invalid format silently fails or returns EINVAL. Always check the return value of the write.
Related Topics
Virtual Memory & Address Spaces, cgroups v2 (Control Groups), Memory Cgroups & Resource Limits, Page Cache & Block I/O, Page Tables & TLB
/proc and /sys Filesystems — File Systems & I/O
Difficulty: Intermediate
Nothing in /proc or /sys lives on disk. Every read() fires a kernel callback that formats live state into text on the spot -- proc_dir_entry handlers for procfs, kobject show/store functions for sysfs. /proc surfaces per-process data (status, maps, fd) and kernel tunables (/proc/sys/*). /sys mirrors the device hierarchy with a strict one-value-per-file rule. Writing to /proc/sys calls a sysctl handler that validates the value and applies it immediately, no restart required.
System Calls for /proc and /sys Filesystems
Key Components in /proc and /sys Filesystems
- procfs (proc_fs_type): Pseudo-filesystem mounted at /proc; dynamically generates file content from kernel data structures on each read(); nothing is stored on disk
- sysfs (sysfs_fs_type): Pseudo-filesystem mounted at /sys; mirrors the kernel's kobject hierarchy (devices, drivers, buses, classes) as a directory tree
- struct proc_dir_entry: Kernel object defining a /proc entry: name, permissions, and read/write handlers that generate content from kernel state
- struct ctl_table (sysctl): Defines a /proc/sys tunable: name, data pointer, max length, permissions, and proc_handler function for validation and application
Key Points for /proc and /sys Filesystems
- /proc/PID files do not exist on disk. Reading /proc/PID/status triggers a live walk of the task's mm_struct, signal state, and scheduler data. The kernel generates the text on every read() call.
- /proc/sys is sysctl exposed as files. Writing to /proc/sys/vm/dirty_ratio calls a kernel handler that validates the value and updates the global variable immediately. No restart needed. No special API.
- sysfs enforces one-value-per-file. procfs dumps multi-line formatted data. That strict policy makes sysfs easy to script but less human-readable.
- /proc/PID/maps is a complete X-ray of a process's virtual memory -- every mapped region with address range, permissions, offset, device, inode, and pathname. Essential for debugging memory issues.
- Reading /proc files is NOT atomic. The kernel generates output page by page via seq_file. If data changes between pages (connections opening while you read /proc/net/tcp), you get an inconsistent snapshot.
Common Mistakes with /proc and /sys Filesystems
- Mistake: Parsing /proc files by field position. Reality: Kernel upgrades frequently add new fields to /proc/PID/status and /proc/stat. Always parse by field name, not position.
- Mistake: Using MemFree from /proc/meminfo to gauge available memory. Reality: Most 'free' memory is in page cache and reclaimable slabs. Use MemAvailable (Linux 3.14+) -- it accounts for reclaimable memory minus kernel reserves.
- Mistake: Writing to /proc/sys and assuming success. Reality: some handlers clamp or silently adjust out-of-range values, so a write() that reports success does not prove your exact value took effect. Always read the file back after writing to verify.
- Mistake: Using /proc/PID/stat utime/stime for precise CPU timing. Reality: These fields are in clock ticks (typically 100 Hz = 10ms resolution). For anything finer, use clock_gettime().
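A small sketch of name-based parsing, per the mistakes above: scan /proc/meminfo for the MemAvailable field by name rather than counting lines or columns:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    long kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "MemAvailable:", 13) == 0) { /* match by field name */
            sscanf(line + 13, "%ld", &kb);
            break;
        }
    }
    fclose(f);

    if (kb >= 0)
        printf("MemAvailable: %ld kB\n", kb);
    else
        fprintf(stderr, "field not found (kernel < 3.14?)\n");
    return 0;
}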
Related Topics
Virtual File System (VFS), Inodes & File Metadata, Page Cache & Block I/O, cgroups v2 (Control Groups)
Process Groups, Sessions & Job Control — Processes & Threads
Difficulty: Intermediate
Every process sits in exactly one process group (PGID) and one session (SID). Shells put all pipeline members into the same group via setpgid(). The terminal driver routes keyboard signals (SIGINT on Ctrl+C, SIGTSTP on Ctrl+Z) to whatever group tcsetpgrp() designated as foreground. Sessions collect process groups under a session leader that owns the controlling terminal. setsid() breaks away: new session, no terminal, no hangup cascade.
System Calls for Process Groups, Sessions & Job Control
- setpgid
- setsid
- tcsetpgrp
- getpgrp
- tcgetpgrp
Key Components in Process Groups, Sessions & Job Control
- pid_t pgid (Process Group ID): Every process belongs to exactly one process group. All processes in a shell pipeline share the same PGID (set to the PID of the first process in the pipeline). kill(-pgid, sig) sends a signal to all members of the group.
- pid_t sid (Session ID): A session is a collection of process groups. Created by setsid(), which makes the calling process the session leader (SID == PID). The session leader is the only process that can acquire a controlling terminal.
- tty_struct / controlling terminal: Each session can have at most one controlling terminal (the tty the user logged into). The terminal driver sends SIGINT (Ctrl+C), SIGTSTP (Ctrl+Z), and SIGQUIT (Ctrl+\) to the foreground process group. SIGHUP is sent to the session leader when the terminal hangs up.
- foreground process group: The one process group in the session that 'owns' the terminal for I/O. Set via tcsetpgrp(). Background process groups get SIGTTIN when they try to read from the terminal and SIGTTOU when they write (if tostop is set).
Key Points for Process Groups, Sessions & Job Control
- When a shell creates a pipeline (cmd1 | cmd2 | cmd3), all three processes go into the same process group. Ctrl+C sends SIGINT to the entire foreground group. That is why all three die at once, not one at a time.
- setsid() is the escape hatch. It creates a new session AND a new process group, with no controlling terminal. This is step one of daemonization, and it is why tmux sessions survive terminal disconnect.
- When a terminal hangs up (SSH disconnect, window closed), the kernel sends SIGHUP to the session leader. The shell then cascades SIGHUP to all its job process groups. That is why background jobs die when you log out -- unless they are in a different session.
- Background processes that try to read from the terminal get stopped with SIGTTIN. This prevents background jobs from stealing terminal input. Similarly, SIGTTOU stops background writers if the terminal's tostop flag is set.
- Orphaned process groups -- where no member has a parent in a different group within the same session -- get SIGHUP + SIGCONT if any member is stopped. This prevents stopped processes from being stuck forever when the shell exits.
Common Mistakes with Process Groups, Sessions & Job Control
- Thinking nohup makes a process a daemon. Reality: nohup only ignores SIGHUP and redirects output. The process still shares the session and may receive other signals. For a real daemon, use setsid() + double-fork or systemd.
- Not calling setpgid() in both the parent (shell) and child after fork(). There is a race window: if the shell signals the process group before the child has set its own PGID, the signal goes to the wrong group. bash calls setpgid() from both sides to eliminate this race.
- Expecting all processes to die when you close the terminal. Processes that have called setsid() or been reparented will not receive SIGHUP. tmux and screen work precisely by creating new sessions for their children.
- Confusing process group leader with session leader. The process group leader is the first process in a pipeline (PGID == its PID). The session leader is the login shell (SID == its PID). They serve different roles.
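A minimal sketch of the double-setpgid() handshake described above -- both sides make the same call, so whichever runs first wins and there is no window where a group signal lands before the child's PGID is set:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        setpgid(0, 0);             /* child: become leader of its own group */
        execlp("sleep", "sleep", "2", (char *)NULL);
        _exit(127);
    }
    setpgid(pid, pid);             /* parent: same call; EACCES after the child
                                      has already exec'd is expected and harmless */
    /* now safe to signal the whole group, e.g. kill(-pid, SIGINT) */
    waitpid(pid, NULL, 0);
    return 0;
}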
Related Topics
Signals & Signal Handling, Daemons & Service Management, Process Lifecycle (fork/exec/wait), POSIX Threads
Process Lifecycle (fork/exec/wait) — Processes & Threads
Difficulty: Starter
Every Linux process follows the same three-step dance: fork (clone the parent), exec (overwrite the clone with a new program), wait (parent collects the exit status). Skip wait() and the dead child lingers as a zombie. If the parent dies first, init adopts the orphan and reaps it.
System Calls for Process Lifecycle (fork/exec/wait)
- fork
- execve
- waitpid
- _exit
- clone
Key Components in Process Lifecycle (fork/exec/wait)
- task_struct: The ~8KB kernel structure representing a process. It holds the PID, state, memory mappings, file descriptors, signal handlers, scheduling info, and parent/child pointers.
- pid / tgid: The kernel distinguishes thread group ID (tgid, what userspace calls PID) from the per-thread pid. getpid() returns tgid; gettid() returns the thread's actual pid.
- wait_queue: When a parent calls waitpid(), it sleeps on a wait queue until a child changes state (exits, stops, or continues). SIGCHLD wakes the parent.
- exit_state: Tracks whether a process is EXIT_ZOMBIE (terminated but not yet reaped) or EXIT_DEAD (fully released). Zombies retain their task_struct until the parent calls wait().
Key Points for Process Lifecycle (fork/exec/wait)
- fork() does not copy memory. It sets up copy-on-write page table entries, making the cost proportional to page table size, not memory size. This is why forking a 2GB process takes microseconds, not seconds.
- execve() is a point of no return. It replaces the process image atomically -- if it succeeds, it never returns. All signal handlers with custom dispositions reset to SIG_DFL because the handler code no longer exists in memory.
- Zombies are not as scary as they sound. They consume a PID and a tiny task_struct (~2KB) but no memory pages, file descriptors, or CPU. The real danger is PID exhaustion in long-running daemons that never reap their children.
- When a parent dies without calling wait(), orphaned children are reparented to the nearest subreaper (set via prctl(PR_SET_CHILD_SUBREAPER)) or to PID 1 (init/systemd), which automatically reaps them. No process is truly abandoned.
- Here is a subtle trap: exit() vs _exit(). The C library exit() flushes stdio buffers and runs atexit handlers. The syscall _exit() goes straight to the kernel. After fork(), the child should call _exit() if it is not exec'ing, to avoid double-flushing the parent's buffers.
Common Mistakes with Process Lifecycle (fork/exec/wait)
- Forgetting to loop waitpid(). Mistake: calling it once and assuming you are done. Reality: waitpid() can be interrupted by signals (returns -1 with EINTR), and with WNOHANG you must loop until it returns 0 or -1. One call is not enough.
- Using exit() instead of _exit() in a forked child that does not exec. This flushes the parent's stdio buffers a second time, corrupting output. The fix is simple: always use _exit() in the child after fork unless you are about to exec.
- Setting SA_NOCLDWAIT or ignoring SIGCHLD without understanding the side effects. With SA_NOCLDWAIT, children are auto-reaped and wait() returns ECHILD. This can break code that relies on collecting exit status.
- Assuming fork() copies threads. It does not. In a multithreaded process, only the calling thread is duplicated. Mutexes held by other threads remain locked in the child -- deadlock city. This is why Go's os/exec does fork+exec atomically.
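Pulling these pieces together, a minimal sketch of fork/exec/wait that loops waitpid() on EINTR and uses _exit() on the failed-exec path:

#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);                    /* exec failed: _exit, never exit() */
    }

    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0); /* loop: a signal can interrupt the wait */
    } while (r < 0 && errno == EINTR);

    if (r == pid && WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}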
Related Topics
Copy-on-Write & Process Creation Internals, Signals & Signal Handling, Process Groups, Sessions & Job Control, Process Scheduling (CFS)
Process Scheduling (CFS) — Processes & Threads
Difficulty: Advanced
CFS (Completely Fair Scheduler) picks which task runs next on each core by maintaining a red-black tree sorted by virtual runtime. The task that has received the least weighted CPU time always wins. Nice values and cgroup CPU shares control how fast virtual runtime accumulates -- higher weight means slower accumulation, which means more CPU.
System Calls for Process Scheduling (CFS)
- sched_setscheduler
- nice
- getpriority
- sched_yield
Key Components in Process Scheduling (CFS)
- cfs_rq (CFS runqueue): Per-CPU data structure containing the red-black tree of runnable SCHED_NORMAL/SCHED_BATCH tasks, total load weight, and min_vruntime (monotonically increasing floor used to normalize new tasks' vruntime).
- sched_entity: Embedded in task_struct, represents a schedulable entity. Holds vruntime, load weight (derived from nice value), run statistics, and the rb_node for the CFS red-black tree.
- sched_class: Polymorphic scheduling class with function pointers (enqueue_task, dequeue_task, pick_next_task, etc.). Classes checked in priority order: stop > deadline > realtime > fair (CFS) > idle.
- load_weight: Maps nice values (-20 to +19) to multiplicative weights. Each nice level is a ~1.25x ratio. Nice 0 = weight 1024. Nice -20 = 88761. This weight determines CPU share proportionally.
Key Points for Process Scheduling (CFS)
- CFS has no fixed timeslice. It divides a 'target latency' (default 6ms for 8 or fewer CPUs) among runnable tasks, weighted by their nice values. Each task gets at least 0.75ms (sched_min_granularity). With 100 runnable tasks, the target latency scales up automatically.
- vruntime is the key insight. It advances slower for high-priority tasks. A nice -20 process (weight 88761) accumulates vruntime roughly 87x slower than a nice 0 process (weight 1024), so it stays at the left of the red-black tree and gets proportionally more CPU. No fixed timeslices needed.
- Nice values are multiplicative, not linear. Most people miss this. Nice 10 vs nice 0 is roughly a 10:1 CPU ratio, not 2:1. Each nice level is a ~1.25x multiplier, and that compounds across 10 levels.
- SCHED_DEADLINE is the most powerful scheduling policy most people have never heard of. Tasks declare runtime/period/deadline parameters and the kernel guarantees the CPU time. Hard real-time without root.
- pick_next_task is effectively O(1): the leftmost rb-tree node is cached, so the scheduler never searches the tree on a switch (insertion and removal are O(log n)). scheduler_tick() fires every 4ms (at 250Hz) and updates vruntime, checking if preemption is needed.
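A back-of-envelope sketch of the multiplicative scale described above: derive two weights from the ~1.25x-per-level rule and turn them into expected CPU shares (compile with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double w0  = 1024.0;                 /* nice 0: weight 1024 */
    double w10 = 1024.0 / pow(1.25, 10); /* nice 10: ~110 (kernel table: 110) */

    printf("nice 0 share:  %.1f%%\n", 100.0 * w0  / (w0 + w10));
    printf("nice 10 share: %.1f%%\n", 100.0 * w10 / (w0 + w10));
    /* prints roughly 90%% vs 10%% -- the ~10:1 ratio, not 2:1 */
    return 0;
}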
Common Mistakes with Process Scheduling (CFS)
- Calling sched_yield() in a busy loop thinking it helps other threads. Reality: CFS puts the yielding task at the rightmost position of the rb-tree, but it is immediately eligible to run again. sched_yield is only meaningful for real-time scheduling classes and spinlocks.
- Setting a process to SCHED_FIFO at max priority without a safety net. A CPU-bound SCHED_FIFO task starves ALL normal processes on that CPU. Always use SCHED_DEADLINE or cpu.rt_runtime_us cgroup limits to prevent this.
- Assuming nice values are linear. Thinking nice 10 gets 'half' the CPU of nice 0. The 1.25x-per-level multiplicative scheme means nice 10 vs nice 0 is about a 10:1 CPU ratio. The scale is exponential, not linear.
- Ignoring NUMA topology. CFS has per-CPU runqueues and periodically load-balances across them. Migrating a task to a remote NUMA node means slower memory access. Use numactl or cgroup's cpuset to pin latency-sensitive tasks.
Related Topics
Process Lifecycle (fork/exec/wait), POSIX Threads, Copy-on-Write & Process Creation Internals, Daemons & Service Management
Pressure Stall Information (PSI) — Resource Monitoring
Difficulty: Intermediate
The kernel tracks how long tasks stall waiting for CPU, memory, and I/O resources. PSI exposes this as percentages in /proc/pressure/{cpu,memory,io}. Unlike utilization metrics that show how much of a resource is used, PSI shows how much work is delayed because a resource is scarce.
System Calls for Pressure Stall Information (PSI)
Key Components in Pressure Stall Information (PSI)
- /proc/pressure/cpu: Reports the percentage of time that runnable tasks are waiting for CPU time. Only the "some" line exists (no "full" line) because at least one task is always running on each CPU. Format: some avg10=X.XX avg60=X.XX avg300=X.XX total=N. The avg10 value is the most responsive, updating every 2 seconds with a 10-second exponential moving average window.
- /proc/pressure/memory: Reports stall time caused by memory scarcity (direct reclaim, swap I/O, thrashing). Has both "some" (at least one task stalled) and "full" (all non-idle tasks stalled simultaneously). The "full" line is the critical metric -- when it is nonzero, the entire system is making zero progress on productive work while the kernel reclaims memory.
- /proc/pressure/io: Reports stall time caused by I/O waits. Has both "some" and "full" lines. High "some" pressure with low "full" pressure indicates individual tasks waiting on I/O but the system overall making progress. High "full" pressure means all tasks are blocked on I/O and nothing is getting done.
- PSI Triggers (poll/epoll interface): Userspace programs can register for notifications when pressure exceeds a threshold. Write a trigger string like "some 150000 1000000" to a pressure file fd (meaning 150ms of stall per 1-second window), then use poll() or epoll() on that fd. The kernel wakes the program when the threshold is breached. This avoids constant polling and provides event-driven pressure monitoring.
- Per-cgroup PSI (cgroup v2): With cgroup v2, each cgroup has its own cpu.pressure, memory.pressure, and io.pressure files. This allows monitoring pressure at the container, pod, or service level, not just system-wide. Kubernetes and systemd use per-cgroup PSI to make eviction and resource management decisions scoped to individual workloads.
Key Points for Pressure Stall Information (PSI)
- PSI measures stall time, not utilization. A system at 95% CPU utilization with 0% CPU pressure means all tasks are running and none are waiting. A system at 60% CPU utilization with 30% CPU pressure means tasks are frequently queued behind others. The distinction matters for capacity planning because utilization alone cannot tell whether adding more work will cause latency degradation.
- The "some" vs "full" distinction is critical. "some" means at least one task is stalled but others are making progress. "full" means every non-idle task is stalled simultaneously -- the machine is doing zero productive work during that time. For memory, "some" pressure triggers page reclaim. "full" pressure means the system is thrashing.
- PSI averages (avg10, avg60, avg300) are exponential moving averages, not simple averages. avg10 reacts to pressure spikes within seconds. avg300 smooths out transient bursts and shows sustained pressure. The "total" field is a cumulative microsecond counter that allows computing exact pressure over any arbitrary time window by taking two readings and dividing the delta.
- PSI triggers use the kernel's internal PSI tracking and deliver notifications through poll()/epoll(). The trigger format is "some|full STALL_US WINDOW_US", meaning "notify when stall time exceeds STALL_US microseconds within a WINDOW_US microsecond window." This is far more efficient than polling /proc/pressure files from userspace on a timer.
- PSI was added in Linux 4.20 (December 2018) by Facebook (now Meta) engineers. It was designed specifically because utilization-based metrics failed to predict OOM events and I/O stalls in Meta's fleet. The kernel already tracked scheduling delays and memory reclaim stalls internally -- PSI simply exposed these existing counters to userspace.
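A minimal sketch of a PSI trigger as described above: write the trigger string (including its terminating NUL, as the kernel documentation specifies) to the pressure file, then block in poll() waiting for POLLPRI:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    const char *trig = "some 150000 1000000";       /* 150ms stall per 1s window */
    if (write(fd, trig, strlen(trig) + 1) < 0) {    /* NUL included, per psi.rst */
        perror("write");
        return 1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLPRI };
    for (;;) {
        if (poll(&pfd, 1, -1) < 0) { perror("poll"); return 1; }
        if (pfd.revents & POLLERR) { fprintf(stderr, "trigger gone\n"); return 1; }
        if (pfd.revents & POLLPRI)
            printf("memory pressure threshold breached\n");
    }
}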
Common Mistakes with Pressure Stall Information (PSI)
- Using utilization thresholds (percent of RAM used, percent of CPU busy) to predict resource exhaustion. A node at 90% memory usage with most of that in reclaimable page cache is healthy. A node at 70% with most of that in anonymous pages under active use can be on the edge of thrashing. PSI captures the actual contention, not just the raw usage numbers.
- Reading PSI avg10 values too infrequently. avg10 is a 10-second exponential moving average that updates every 2 seconds. Polling once per minute misses short pressure spikes entirely. For responsive eviction or scaling decisions, use PSI triggers (poll/epoll) instead of periodic reads. The trigger mechanism delivers sub-second notifications without any polling overhead.
- Ignoring per-cgroup PSI and relying only on system-wide /proc/pressure files. System-wide PSI aggregates all tasks. A single misbehaving container can cause system-wide memory pressure while 49 other containers are fine. Per-cgroup memory.pressure pinpoints exactly which workload is causing stalls, enabling targeted eviction instead of random OOM kills.
- Setting PSI trigger thresholds too low, causing constant alerting. A healthy system under normal load will occasionally show brief memory pressure spikes during page reclaim. Start with moderate thresholds (e.g., some 150000 1000000 for memory) and tune based on observed baseline pressure. Zero pressure at all times is not a realistic goal on a system doing real work.
Related Topics
OOM Killer & Memory Pressure, Memory Cgroups & Resource Limits, Process Scheduling (CFS), Swap, kswapd & Memory Reclaim, Perf Events & Performance Counters
ptrace: Process Tracing & Debugging — Processes & Threads
Difficulty: Advanced
One syscall grants full control over another process. ptrace establishes a tracer-tracee relationship: the tracer can freeze the target, read and write its registers (GETREGS/SETREGS) and memory (PEEKDATA/POKETEXT), intercept every syscall at entry and exit (PTRACE_SYSCALL), or single-step instructions. strace, GDB, LLDB, ltrace -- all ptrace under the hood. Only one tracer can attach at a time.
System Calls for ptrace: Process Tracing & Debugging
Key Components in ptrace: Process Tracing & Debugging
- PTRACE_TRACEME: Called by the child process to indicate that it should be traced by its parent. After this call, any signal delivered to the child (except SIGKILL) causes it to stop and notify the parent via waitpid(). This is how GDB launches debugged programs: fork -> child calls PTRACE_TRACEME -> child calls execve -> child stops at first instruction -> parent (GDB) takes control.
- PTRACE_ATTACH / PTRACE_SEIZE: Attaches a tracer to an already-running process. PTRACE_ATTACH sends SIGSTOP to the tracee, while PTRACE_SEIZE (Linux 3.4+) attaches without stopping; the tracee continues until the tracer explicitly interrupts it. PTRACE_SEIZE also enables PTRACE_EVENT_STOP for group-stop detection, fixing race conditions in the older ATTACH API.
- PTRACE_PEEKDATA / PTRACE_POKETEXT: Read and write the tracee's memory one word at a time. PTRACE_PEEKDATA reads a word from the tracee's address space. PTRACE_POKETEXT writes a word into the tracee's text (code) segment -- this is how GDB sets software breakpoints, by replacing an instruction with INT3 (0xCC on x86). The original instruction byte is saved and restored when the breakpoint is removed.
- PTRACE_GETREGS / PTRACE_SETREGS: Read and write all general-purpose registers of the tracee. Returns a struct user_regs_struct containing rax, rip, rsp, etc. (x86-64). GDB uses GETREGS to display register state and SETREGS to modify registers (e.g., changing RIP to skip over a function call). On modern kernels, PTRACE_GETREGSET with NT_PRSTATUS is preferred.
Key Points for ptrace: Process Tracing & Debugging
- strace stops the tracee at every syscall entry AND exit -- two stops per syscall. At entry it reads the number from orig_rax and arguments from registers. At exit it reads the return value from rax. This two-stop pattern is why strace slows programs by 10-100x.
- GDB sets breakpoints by overwriting instruction bytes with 0xCC (INT3). When the CPU hits INT3, it generates SIGTRAP. GDB restores the original byte, single-steps past it, re-inserts the breakpoint, and continues. Hardware breakpoints use debug registers DR0-DR3.
- Yama LSM restricts ptrace via /proc/sys/kernel/yama/ptrace_scope: 0 = any process can trace any other, 1 = only parent can trace child (Ubuntu default), 2 = only CAP_SYS_PTRACE, 3 = no ptrace at all. This prevents malware from reading secrets out of other processes' memory.
- PTRACE_SEIZE (Linux 3.4+) is preferred over PTRACE_ATTACH. It does not send SIGSTOP (avoids race conditions), enables PTRACE_EVENT_STOP, and allows PTRACE_INTERRUPT for on-demand stopping.
- Only one tracer can attach to a process at a time. You cannot strace a process that GDB is already debugging. A process can also self-trace via PTRACE_TRACEME to block later debugger attachments -- a common anti-debugging technique.
Common Mistakes with ptrace: Process Tracing & Debugging
- Mistake: Not calling waitpid() after PTRACE_ATTACH. Reality: The tracee does not stop synchronously. PTRACE_ATTACH sends SIGSTOP, and you must waitpid() for the stop before issuing other ptrace commands. Reading registers before the tracee stops gives stale data.
- Mistake: Reading the syscall number from rax at syscall exit. Reality: The return value overwrites rax. The original syscall number is in orig_rax (offset 120 in user_regs_struct). Use ORIG_RAX, not RAX.
- Mistake: Forgetting to handle PTRACE_EVENT_* stops after setting PTRACE_O_TRACEFORK. Reality: Fork events produce a PTRACE_EVENT_FORK stop, not a signal stop. Use PTRACE_GETEVENTMSG to get the child PID and PTRACE_CONT to resume. Missing these events hangs the tracee indefinitely.
- Mistake: Using ptrace for production monitoring. Reality: ptrace stops the tracee for each operation, adding ~10-20us per syscall. For production, use eBPF or seccomp with SECCOMP_RET_LOG -- they run in-kernel without stopping the target.
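A minimal strace-style sketch (x86-64 only): the child volunteers with PTRACE_TRACEME, and the parent stops it at every syscall boundary and prints orig_rax. Each syscall produces two stops -- entry and exit -- so every number prints twice:

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* child volunteers to be traced */
        execlp("echo", "echo", "hi", (char *)NULL);
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);                  /* child stops at execve */
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL); /* run to next entry/exit stop */
        waitpid(pid, &status, 0);
        if (WIFSTOPPED(status)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, pid, NULL, &regs);
            printf("syscall %llu\n", (unsigned long long)regs.orig_rax);
        }
    }
    return 0;
}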
Related Topics
System Calls: User to Kernel Transition, Signals & Signal Handling, Seccomp: Sandboxing System Calls, Process Lifecycle (fork/exec/wait)
Read-Copy-Update (RCU) & Lock-Free Read Access — Synchronization
Difficulty: Advanced
Why the kernel can read shared data structures on every CPU simultaneously without locks. Readers pay nothing -- no atomic instructions, no cache-line bounces, no memory barriers. Writers do the heavy lifting by waiting for all existing readers to finish before reclaiming old data.
System Calls for Read-Copy-Update (RCU) & Lock-Free Read Access
- rcu_read_lock
- rcu_read_unlock
- synchronize_rcu
- call_rcu
- rcu_barrier
Key Components in Read-Copy-Update (RCU) & Lock-Free Read Access
- Read-Side Critical Section: Code between rcu_read_lock() and rcu_read_unlock(). On non-preemptible kernels, rcu_read_lock() simply disables preemption -- no atomic instructions, no memory barriers, no cache-line bouncing. Readers inside a critical section are guaranteed that any RCU-protected data structure they hold a pointer to will not be freed until they call rcu_read_unlock().
- Grace Period: The interval after a writer publishes a new version of a data structure and before the old version can be freed. A grace period ends when every CPU has passed through at least one quiescent state (context switch, idle, or user-mode execution). synchronize_rcu() blocks the caller until a grace period completes. call_rcu() queues a callback to run after the grace period without blocking.
- rcu_assign_pointer / rcu_dereference: The publish-subscribe primitives. rcu_assign_pointer() stores a new pointer with the necessary write barrier so that readers see a fully initialized structure. rcu_dereference() loads the pointer with the necessary read barrier (on weakly-ordered architectures like ARM) so that subsequent accesses use the correct pointer. On x86, rcu_dereference() compiles to a plain load because x86 does not reorder dependent loads.
- Quiescent State: A point where a CPU is guaranteed not to hold any RCU read-side references. Context switches, idle loops, and user-mode execution are all quiescent states. The RCU core tracks when every CPU has passed through a quiescent state to determine when a grace period has ended. This is what makes RCU work -- it leverages existing scheduling events rather than requiring explicit reader-side signaling.
Key Points for Read-Copy-Update (RCU) & Lock-Free Read Access
- rcu_read_lock() on a non-preemptible kernel compiles to preempt_disable(). No atomic instruction. No memory barrier. No cache-line bounce. This is why RCU readers scale perfectly -- each CPU touches only its own local data. Compare this to a rwlock where every reader atomically increments a shared counter, bouncing the cache line across every CPU.
- RCU does not protect data. It protects pointers to data. The pattern is always: publish a new pointer, wait for readers of the old pointer to finish, then free the old data. If code needs to read the contents of a structure that might be concurrently modified in place, RCU alone is not sufficient -- combine it with per-field atomic operations or seqlocks.
- synchronize_rcu() vs call_rcu() is a latency-vs-memory tradeoff. synchronize_rcu() blocks until the grace period ends (typically 1-10ms), keeping the call stack simple. call_rcu() returns immediately but the old data stays allocated until the callback fires. In a tight loop freeing thousands of objects, call_rcu() can build up a large backlog. Use rcu_barrier() to drain all pending callbacks when that matters.
- SRCU (Sleepable RCU) exists for cases where readers need to sleep. Classic RCU read-side critical sections cannot sleep because the grace period detection relies on context switches as quiescent states. SRCU uses per-CPU counters instead, allowing readers to block. The tradeoff is higher read-side overhead (atomic increment/decrement on the counter).
- RCU grace periods in a real kernel typically complete in 1-20ms. The kernel's RCU implementation (Tree RCU) uses a hierarchical structure to scale grace period detection across thousands of CPUs. Each CPU reports quiescent states to its leaf node, and the information propagates up the tree. This avoids a single global counter that all CPUs would contend on.
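A kernel-side sketch of the canonical pattern (this is in-kernel code, not a userspace program, and it assumes writers are serialized by some outer lock and that cur_config was initialized at module load): readers traverse lock-free; the writer publishes a copy and frees the old version after a grace period:

#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config {
    int threshold;
};

static struct config __rcu *cur_config;

/* Reader: no locks, no atomics -- just a read-side critical section. */
int read_threshold(void)
{
    int val;

    rcu_read_lock();
    val = rcu_dereference(cur_config)->threshold; /* valid only until unlock */
    rcu_read_unlock();
    return val;
}

/* Writer: copy, update, publish, wait for old readers, free. */
int update_threshold(int t)
{
    struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    struct config *oldc;

    if (!newc)
        return -ENOMEM;
    newc->threshold = t;

    oldc = rcu_dereference_protected(cur_config, 1); /* writers serialized externally */
    rcu_assign_pointer(cur_config, newc);            /* publish with write barrier */
    synchronize_rcu();                               /* wait out pre-existing readers */
    kfree(oldc);
    return 0;
}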
Common Mistakes with Read-Copy-Update (RCU) & Lock-Free Read Access
- Accessing RCU-protected data outside rcu_read_lock/rcu_read_unlock. The data can be freed at any moment after the read-side critical section ends. Saving a pointer obtained via rcu_dereference() and using it after rcu_read_unlock() is a use-after-free bug. The pointer is valid only within the critical section.
- Blocking or sleeping inside an RCU read-side critical section (classic RCU, not SRCU). Sleeping prevents the CPU from reaching a quiescent state, which stalls the grace period for all writers system-wide. In the worst case, this causes memory exhaustion as call_rcu() callbacks pile up waiting for the stalled grace period.
- Using synchronize_rcu() in a loop that frees many objects. Each call blocks for a full grace period (milliseconds). Freeing 10,000 objects one at a time with synchronize_rcu() takes seconds. Use call_rcu() to batch the frees, or collect objects into a list and call synchronize_rcu() once for the entire batch.
- Forgetting rcu_assign_pointer() when publishing a new pointer. A plain store can be reordered by the compiler or CPU (on weakly-ordered architectures), letting readers see a pointer to a partially initialized structure. rcu_assign_pointer() includes the necessary store barrier. On x86 this matters for compiler reordering; on ARM/POWER it matters for both compiler and hardware reordering.
Related Topics
Process Scheduling (CFS), Kernel Data Structures, Interrupt Handling & Softirqs, POSIX Threads
Seccomp: Sandboxing System Calls — Security & Access Control
Difficulty: Advanced
A classic BPF filter bolted onto a process's task_struct, running on every syscall entry before the kernel even looks at sys_call_table. It inspects a seccomp_data struct -- syscall number, architecture, all six args -- and hands back a verdict: ALLOW, KILL, ERRNO, TRAP, LOG, or USER_NOTIF. Once installed, a filter is permanent and inherited by every child. Stack multiple filters and all of them must say ALLOW for a syscall to proceed.
System Calls for Seccomp: Sandboxing System Calls
Key Components in Seccomp: Sandboxing System Calls
- seccomp_data: The struct passed to the BPF filter on each syscall: contains nr (syscall number), arch (architecture), instruction_pointer, and args[6] (syscall arguments). The filter uses BPF instructions to examine these fields and return an action.
- BPF filter program: An array of struct sock_filter instructions (classic BPF, not eBPF). Loaded via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) or seccomp(SECCOMP_SET_MODE_FILTER, ...). The program is JIT-compiled for performance on supported architectures.
- SECCOMP_RET_* actions: Filter return values: SECCOMP_RET_ALLOW (pass through), SECCOMP_RET_KILL_THREAD (kill the calling thread with SIGSYS), SECCOMP_RET_KILL_PROCESS (kill the entire process, all threads), SECCOMP_RET_ERRNO(val) (return errno to caller), SECCOMP_RET_TRAP (deliver SIGSYS to sigaction handler), SECCOMP_RET_LOG (allow but log).
- seccomp notifier (SECCOMP_RET_USER_NOTIF): Added in kernel 5.0. Instead of allowing/denying, the filter can forward the syscall decision to a supervisor process via a notification fd. The supervisor inspects the syscall and responds with allow/deny/inject-result. Used by container runtimes for controlled syscall emulation.
Key Points for Seccomp: Sandboxing System Calls
- Seccomp filters are permanent. They cannot be removed, only made more restrictive by stacking additional filters. ALL stacked filters must return ALLOW for a syscall to proceed. This guarantee holds even if the attacker gains code execution and root inside the sandbox.
- The filter runs BEFORE the syscall touches the kernel's dispatch table. A blocked syscall is killed before any work happens. This is faster and more reliable than LSM hooks (which run after initial setup) and ptrace-based sandboxes (which run in a separate process).
- Architecture checking is the most commonly missed security detail. On x86-64, a process can use int 0x80 to invoke 32-bit syscalls with completely different numbers. A filter that only checks x86-64 syscall numbers is trivially bypassed. Always check seccomp_data.arch first.
- Docker blocks about 44 of roughly 450 syscalls by default, including keyctl, reboot, mount, kexec_load, ptrace, and userfaultfd. The profile is a JSON file compiled to BPF instructions by the container runtime. Most applications never notice the missing syscalls.
- Seccomp uses classic BPF (cBPF), not eBPF. No maps, no loops, no function calls -- just linear load/store/compare/jump instructions on the seccomp_data struct. The kernel converts cBPF to eBPF internally for JIT compilation. The simplicity is deliberate: it guarantees termination and makes formal verification feasible.
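A minimal sketch of a filter that gets the architecture check right: verify arch before touching the syscall number, then -- arbitrarily, for illustration -- make ptrace(2) fatal while allowing everything else. SECCOMP_RET_KILL_PROCESS requires Linux 4.14+:

#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct sock_filter filter[] = {
        /* 1. check the arch field first, or the 32-bit table bypasses us */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* 2. load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* 3. deny ptrace, allow everything else */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* required so an unprivileged process may install a filter */
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) < 0) {
        perror("prctl(SECCOMP)");
        return 1;
    }
    printf("filter installed; ptrace() is now fatal\n");
    return 0;
}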
Common Mistakes with Seccomp: Sandboxing System Calls
- Mistake: Blocking open() but not openat(). Reality: modern glibc uses openat(AT_FDCWD, ...) for all file opens. A filter that only blocks the open syscall number is useless -- you must also block openat, openat2, and the 32-bit compat versions.
- Mistake: Not checking the architecture field in the filter. Reality: an attacker can use int 0x80 on x86-64 to invoke the 32-bit syscall table where numbers are completely different. The filter MUST verify arch == AUDIT_ARCH_X86_64 before checking the syscall number.
- Mistake: Testing only the target application and missing library syscalls. Reality: glibc, libpthread, and the dynamic linker make syscalls (futex, mprotect, mmap, brk) that the application never calls directly. Blocking these causes cryptic segfaults, not clean error messages.
- Mistake: Using SECCOMP_RET_TRAP without installing a SIGSYS handler. Reality: the default action for SIGSYS is process termination. If you want RET_TRAP for logging, you must install a sigaction handler that catches SIGSYS and decides what to do.
Related Topics
System Calls: User to Kernel Transition, eBPF: Programmable Kernel, Linux Capabilities, SELinux & AppArmor
SELinux & AppArmor — Security & Access Control
Difficulty: Advanced
Two Mandatory Access Control systems, both implemented as Linux Security Modules. SELinux stamps a label (user:role:type:level) on every kernel object and enforces Type Enforcement rules through the Access Vector Cache at roughly 100ns per lookup. AppArmor takes the simpler route: path-based profiles with glob patterns. Both kick in after DAC checks pass, so even root gets denied when no MAC rule exists. Policies load at boot and cover 200+ LSM hook points spanning files, network, IPC, and capabilities.
System Calls for SELinux & AppArmor
- getcon
- setcon
- security_compute_av
Key Components in SELinux & AppArmor
- LSM (Linux Security Module) framework: Hook points embedded throughout the kernel (200+ hooks in inode_ops, file_ops, task_ops, socket_ops, etc.). When a security-relevant operation occurs, the LSM framework calls all registered security modules (SELinux, AppArmor, Smack) to make allow/deny decisions. Multiple LSMs can stack since kernel 5.4.
- SELinux security context (label): Every process, file, socket, port, and IPC object is labeled with user:role:type:level. The type field is most important: type enforcement (TE) rules define which types can access which other types with which permissions. 'ls -Z' and 'ps -eZ' show labels.
- SELinux policy module: A compiled binary policy loaded by the kernel at boot. Built from .te (type enforcement), .fc (file context), and .if (interface) source files. semodule manages policy modules. The reference policy provides a comprehensive starting point.
- AppArmor profile: A text file defining allowed file paths, capabilities, network access, and mount rules for a specific binary. Located in /etc/apparmor.d/. Profiles operate in enforce mode (deny violations) or complain mode (log but allow). 'aa-status' shows loaded profiles.
Key Points for SELinux & AppArmor
- SELinux labels every object (inode-level); AppArmor matches on file paths. This means SELinux survives file renames and hard links (the label stays on the inode), while AppArmor rules break when paths change. The tradeoff: AppArmor profiles are dramatically simpler to write and understand.
- SELinux's type enforcement is default-deny. A rule like 'allow httpd_t httpd_sys_content_t:file { read open getattr }' explicitly permits Apache to read web content files. Without that rule, the access is silently blocked. Every allowed action must be declared.
- MCS (Multi-Category Security) is how container runtimes use SELinux for isolation. Each container gets a unique category pair like s0:c1,c2. Files written by that container are labeled with the same categories. Another container with s0:c3,c4 cannot read them, even with correct DAC permissions.
- AppArmor profiles support file globs (/var/www/** for recursive), owner conditionals, and capability lists. A profile for nginx: allow read /var/www/**, allow write /var/log/nginx/**, deny /etc/shadow, network inet tcp. Compilation happens at profile load time, not on every access.
- Setting SELinux to permissive mode (setenforce 0) logs violations without blocking them -- essential for debugging 'why does my app fail?' But permissive is NOT a security posture. It is a diagnostic tool. Production must run enforcing.
Common Mistakes with SELinux & AppArmor
- Mistake: Disabling SELinux entirely because an application fails. Reality: This removes a critical security layer. Use 'audit2allow' to generate policy rules from AVC denials, review them, and apply. The denial messages tell you exactly what rule is missing.
- Mistake: Assuming AppArmor path rules apply to hard links. Reality: If a confined process creates a hard link to /etc/shadow at /tmp/shadow_copy, the rule denying /etc/shadow does not apply to the new path. SELinux handles this correctly because the label is on the inode, not the path.
- Mistake: Moving files instead of copying them and wondering why SELinux breaks. Reality: 'mv' preserves the source label. 'cp' inherits the destination directory's default context. A config file moved from /tmp to /etc/httpd/ keeps its tmp_t label, and httpd cannot read it. Fix with 'restorecon -Rv /etc/httpd/'.
- Mistake: Writing overly broad AppArmor profiles (allowing /** rw) to avoid breakage. Reality: This defeats the purpose of MAC entirely. Start in complain mode ('aa-complain /path/to/profile'), exercise the application, use 'aa-logprof' to generate tight rules from the logs, then switch to enforce.
Related Topics
File Permissions, Ownership & ACLs, Linux Capabilities, Seccomp: Sandboxing System Calls, Audit Framework & Logging
SELinux Type Enforcement & Contexts — Security
Difficulty: Advanced
Why a file with 777 permissions still gets "Permission denied." SELinux type enforcement operates above Unix DAC. Every process has a domain, every file has a type, and the policy explicitly lists which domains can access which types. If the rule does not exist, access is denied regardless of Unix permissions.
System Calls for SELinux Type Enforcement & Contexts
- getxattr
- setxattr
- lgetxattr
- lsetxattr
- fgetxattr
- fsetxattr
Key Components in SELinux Type Enforcement & Contexts
- Security Context: A colon-separated string: user:role:type:level. Example: system_u:system_r:httpd_t:s0. The type field is what matters most in targeted policy. The user field (system_u, unconfined_u) maps SELinux users to Linux users. The role constrains which types a user can transition to. The level (s0, s0:c123,c456) enables MLS/MCS.
- Type Enforcement (TE): The core enforcement mechanism. Every access decision is: "Can domain X perform operation Y on type Z?" Rules are written as allow statements: "allow httpd_t httpd_sys_content_t:file { read open getattr };" If no allow rule exists, access is denied. There are roughly 100,000 allow rules in a typical RHEL targeted policy.
- Domain Transition: When a process in one domain executes a binary labeled with another type, a domain transition occurs. Example: init_t executes /usr/sbin/httpd (labeled httpd_exec_t) and the resulting process enters httpd_t. Three rules control this: a type_transition rule, an allow rule for the parent to execute the binary, and an allow rule for the parent to transition. Without all three, the child stays in the parent's domain.
- MCS (Multi-Category Security): An extension of MLS (Multi-Level Security) used in practice for container isolation. Each container gets a unique category pair (e.g., s0:c123,c456). Access is allowed only when the process's category set dominates the file's category set. Two containers with different categories cannot access each other's files even if both run as container_t.
- AVC (Access Vector Cache): Kernel cache of recent SELinux access decisions. On each access check, the kernel first looks in the AVC. Cache misses go to the security server (the policy engine in the kernel). AVC denial messages are logged to the audit subsystem. These "avc: denied" messages in /var/log/audit/audit.log are the primary debugging tool for SELinux issues.
Key Points for SELinux Type Enforcement & Contexts
- SELinux operates after Unix DAC (Discretionary Access Control). Even if Unix permissions allow access, SELinux can still deny it. Both checks must pass. This is mandatory access control (MAC): the policy is set by the administrator, not by file owners.
- The targeted policy confines specific daemons while leaving user sessions unconfined. This means most SELinux denials come from service processes (httpd_t, mysqld_t, named_t), not interactive shell sessions. The policy is conservative: services get the minimum access they need.
- File labels are stored in extended attributes (security.selinux). Moving a file preserves its label. Copying a file inherits the label of the destination directory. This distinction is the root cause of many SELinux issues: "mv" from /home to /var/www preserves user_home_t labels, but "cp" would inherit httpd_sys_content_t.
- Boolean switches (getsebool/setsebool) toggle optional policy rules without writing custom modules. Example: "setsebool -P httpd_can_network_connect on" allows httpd_t to make outbound TCP connections. There are hundreds of booleans; semanage boolean -l lists them all.
- Permissive mode (setenforce 0) logs denials without blocking access. It is invaluable for debugging but should never stay on in production. The audit log in permissive mode shows exactly which allow rules are needed, and audit2allow can generate a policy module from those denials.
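A small sketch of reading a label directly from the security.selinux extended attribute described above:

#include <stdio.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/etc/passwd";
    char label[256];

    ssize_t n = lgetxattr(path, "security.selinux", label, sizeof(label) - 1);
    if (n < 0) { perror("lgetxattr"); return 1; } /* ENODATA if unlabeled */

    label[n] = '\0';  /* the stored value may or may not be NUL-terminated */
    printf("%s -> %s\n", path, label);  /* e.g. system_u:object_r:passwd_file_t:s0 */
    return 0;
}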
Common Mistakes with SELinux Type Enforcement & Contexts
- Disabling SELinux instead of fixing the denial. Running "setenforce 0" or setting SELINUX=disabled in /etc/selinux/config removes an entire security layer. Most denials are fixed with a single restorecon, setsebool, or semanage command. Disabling SELinux to fix a permission error is like removing a lock because the key is in the wrong pocket.
- Using "mv" instead of "cp" when deploying web content. "mv /home/user/index.html /var/www/html/" preserves the user_home_t label. The fix: "restorecon -Rv /var/www/html/" to reset labels based on the file_contexts policy, or use "cp" which inherits the destination directory's label.
- Running audit2allow on the entire audit log without filtering. This generates an overly permissive policy module that allows everything that was denied. Instead, filter by the specific domain: "ausearch -m avc -ts recent -c httpd | audit2allow -M mypolicy" generates rules only for httpd denials.
- Ignoring MCS categories on container volumes. Mounting a host directory into a container without :Z or :z leaves the host's SELinux label. The container (running as container_t with category s0:c123,c456) cannot access files labeled with a different category or with no category at all. The result is "Permission denied" that only manifests on SELinux-enforcing systems.
Related Topics
Linux Capabilities, Seccomp: Sandboxing System Calls, SELinux & AppArmor, File Permissions, Ownership & ACLs, PAM: Pluggable Authentication Modules
Shared Memory & Semaphores — Processes & Threads
Difficulty: Advanced
Same physical pages mapped into multiple processes' address spaces via mmap(MAP_SHARED). The POSIX path: shm_open() creates a tmpfs-backed fd in /dev/shm, ftruncate() sizes it, mmap() wires it in. Writes from one process become visible to others through hardware cache coherence (MESI protocol, 10-100ns) -- but visibility is not consistency. Correct use demands POSIX semaphores, atomics, or explicit memory barriers. The older System V API (shmget/shmat) still exists but belongs in legacy codebases.
System Calls for Shared Memory & Semaphores
- shm_open
- mmap
- sem_open
- sem_wait
- sem_post
- ftruncate
Key Components in Shared Memory & Semaphores
- vm_area_struct: Kernel structure representing a mapped memory region in a process's address space. When two processes mmap the same shared memory object, they get separate vm_area_struct entries pointing to the same physical pages.
- shmem / tmpfs: POSIX shared memory objects (shm_open) live on a tmpfs instance mounted at /dev/shm. They are backed by page cache and swap, not a regular filesystem. So shared memory can be swapped out under memory pressure.
- struct sem (kernel semaphore): For SysV semaphores (semget/semop), the kernel maintains arrays of semaphore values with atomic adjustment and undo operations. POSIX named semaphores use a futex on a file in /dev/shm.
- futex (for POSIX unnamed semaphores): POSIX unnamed semaphores (sem_init) are implemented as a futex embedded in the shared memory region itself. The uncontended path is a single atomic operation in userspace; no syscall needed.
Key Points for Shared Memory & Semaphores
- shm_open() creates a file descriptor backed by tmpfs, but it starts at size zero -- if you forget ftruncate() and try to access the mapping, you get SIGBUS, not a helpful error; this catches everyone at least once (see the sketch after this list)
- Unnamed semaphores placed directly in shared memory (sem_init with pshared=1) are the fastest inter-process synchronization -- just a futex word, no file operations; named semaphores (sem_open) create files in /dev/shm and are slightly slower
- Without memory barriers, shared memory is a lie -- the CPU's store buffer and compiler reordering mean one process can write a flag then data, but the other process sees the data before the flag; you MUST use atomics, semaphores, or explicit fences
- System V shared memory (shmget/shmat) is a legacy API with a key-based namespace and confusing lifecycle -- segments persist until IPC_RMID is set AND all processes detach; use POSIX shm for new code
- Huge pages work with shared memory via MAP_HUGETLB -- for large shared regions like database buffer pools, this reduces TLB misses by 512x and can be the difference between acceptable and terrible performance
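A minimal sketch of the POSIX path described above -- the name "/demo_region" is an arbitrary example, and error handling is abbreviated:

    /* shm_demo.c -- POSIX shared memory + unnamed semaphore in the region */
    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct region {
        sem_t sem;       /* pshared semaphore lives inside the mapping */
        int   counter;
    };

    int main(void) {
        int fd = shm_open("/demo_region", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, sizeof(struct region)) < 0)  /* skip this: SIGBUS later */
            { perror("ftruncate"); return 1; }
        struct region *r = mmap(NULL, sizeof(*r), PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        if (r == MAP_FAILED) { perror("mmap"); return 1; }
        sem_init(&r->sem, 1, 1);         /* pshared=1: a futex word both sides see */
        sem_wait(&r->sem);
        r->counter++;                    /* critical section */
        sem_post(&r->sem);
        printf("counter = %d\n", r->counter);
        shm_unlink("/demo_region");      /* otherwise the object leaks in /dev/shm */
        return 0;
    }

Compile with -pthread (older glibc also needs -lrt). In a real two-process setup, only the creating process calls sem_init(); the peer just maps the same name and uses the semaphore.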
Common Mistakes with Shared Memory & Semaphores
- Forgetting ftruncate() after shm_open() -- the object starts at size zero; accessing it without setting the size gives you SIGBUS, which looks like a random crash and is miserable to debug
- Using memcpy to communicate via shared memory without synchronization -- even on x86's strong memory model, the COMPILER can reorder accesses; use atomic operations, semaphores, or explicit memory barriers
- Leaking shared memory objects -- they persist on /dev/shm until explicitly unlinked or the system reboots; check 'ls /dev/shm/' periodically and always call shm_unlink() in cleanup
- Using SysV semaphores in new code -- they have awkward 'semaphore array' semantics, complex undo handling, and per-operation permission checks; POSIX semaphores are simpler, faster, and saner
Related Topics
POSIX Threads, System V & POSIX Message Queues, Copy-on-Write & Process Creation Internals, Inter-Process Communication (Pipes & FIFOs)
Signals & Signal Handling — Processes & Threads
Difficulty: Intermediate
Asynchronous notifications from the kernel that hijack a thread's control flow. The kernel saves registers to a ucontext_t on the stack, redirects execution to the sigaction handler, and restores everything via sigreturn(). Only about 25 functions are safe to call inside a handler -- write, _exit, sem_post -- while printf and malloc are not. Standard signals (1-31) do not queue; real-time signals (32-64) do. signalfd() sidesteps the entire handler mess by turning signals into readable file descriptor events.
System Calls for Signals & Signal Handling
- sigaction
- kill
- sigprocmask
- sigsuspend
- signalfd
Key Components in Signals & Signal Handling
- sighand_struct: Per-process structure holding an array of 64 signal action entries (struct k_sigaction). Shared across threads via reference counting. Defines what happens when each signal is delivered.
- sigpending: Per-thread and per-process pending signal sets. Standard signals (1-31) are not queued: sending SIGUSR1 twice before it's handled results in only one delivery. Real-time signals (32-64) ARE queued.
- signal_struct: Per-thread-group structure containing shared signal state: group exit code, job control state, resource usage counters, and the shared pending signal queue.
- TIF_SIGPENDING: Thread flag checked on every return from kernel to userspace. When set, the kernel invokes do_signal() to deliver pending signals before resuming userspace execution.
Key Points for Signals & Signal Handling
- Signal handlers literally hijack your thread's control flow. The kernel saves registers to a ucontext_t on the stack, redirects execution to the handler, and on return, a hidden sigreturn() trampoline restores the original context. Your code resumes as if nothing happened.
- Only about 25 functions are async-signal-safe. printf(), malloc(), and mutex operations are NOT among them. In a handler, you should only set a volatile sig_atomic_t flag, call write() on a pipe, or use sem_post(). Everything else risks deadlock or corruption.
- SA_RESTART makes some syscalls auto-restart after signal delivery. read(), write(), wait() are restartable. But connect(), poll(), sem_wait(), and nanosleep() are never restarted -- they always return EINTR. Memorize which ones restart and which do not.
- signalfd() turns signals into file descriptor events. Block the signal with sigprocmask(), then read signalfd_siginfo structs from the fd in your event loop. No async handler needed. This is the modern Linux way to handle signals in event-driven servers; a minimal sketch follows this list.
- SIGKILL and SIGSTOP cannot be caught, blocked, or ignored. Period. But here is the catch: a process in uninterruptible sleep (D state) will not respond to even SIGKILL until it leaves that state. That is why hung NFS mounts create unkillable processes.
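A minimal sketch of the signalfd() pattern, assuming a single-threaded program that only cares about SIGTERM:

    /* sigfd_demo.c -- receive SIGTERM as a readable fd, no async handler */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/signalfd.h>
    #include <unistd.h>

    int main(void) {
        sigset_t mask;
        sigemptyset(&mask);
        sigaddset(&mask, SIGTERM);
        sigprocmask(SIG_BLOCK, &mask, NULL); /* must block, or default action wins */
        int sfd = signalfd(-1, &mask, SFD_CLOEXEC);
        if (sfd < 0) { perror("signalfd"); return 1; }
        struct signalfd_siginfo si;
        read(sfd, &si, sizeof(si));          /* blocks until SIGTERM arrives */
        printf("got signal %u from pid %u\n", si.ssi_signo, si.ssi_pid);
        close(sfd);
        return 0;
    }

In a real server the sfd would be registered with epoll alongside sockets and timers.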
Common Mistakes with Signals & Signal Handling
- Using signal() instead of sigaction(). Mistake: thinking they are equivalent. Reality: signal() has undefined behavior regarding handler reset (SA_RESETHAND semantics vary across systems) and does not let you control SA_RESTART. Always use sigaction().
- Calling printf, malloc, or syslog in a signal handler. This corrupts internal data structures when the handler interrupts those same functions mid-operation. The fix: set a flag in the handler, do the real work in the main loop.
- Not blocking signals during critical sections. If a handler fires between a check and an update of shared data, the handler sees inconsistent state. Use sigprocmask() to block signals around critical sections.
- Assuming signals are queued. They are not (for standard signals 1-31). If 5 children exit before the parent handles SIGCHLD, only one SIGCHLD may be delivered. The handler must call waitpid() in a WNOHANG loop to reap ALL exited children, not just one (sketched below).
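A handler-safe reaping loop -- waitpid() is async-signal-safe, and saving errno keeps the interrupted code honest:

    /* Reap every exited child on one SIGCHLD; install with sigaction(). */
    #include <errno.h>
    #include <sys/wait.h>

    static void sigchld_handler(int sig) {
        (void)sig;
        int saved_errno = errno;            /* handler must not clobber errno */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                               /* one signal may cover many exits */
        errno = saved_errno;
    }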
Related Topics
Process Lifecycle (fork/exec/wait), POSIX Threads, Process Groups, Sessions & Job Control, Daemons & Service Management
Kernel Memory Allocators (Slab, SLUB & kmalloc) — Memory Management
Difficulty: Advanced
How the kernel avoids calling the page allocator millions of times per second. Slab caches pre-slice pages into fixed-size objects, per-CPU freelists eliminate lock contention, and SLUB replaced the original SLAB allocator with a simpler design that scales better on modern hardware.
System Calls for Kernel Memory Allocators (Slab, SLUB & kmalloc)
- kmalloc
- kfree
- kmem_cache_create
- kmem_cache_alloc
- kmem_cache_free
Key Components in Kernel Memory Allocators (Slab, SLUB & kmalloc)
- kmem_cache (struct kmem_cache): The descriptor for a slab cache. Stores the object size, alignment, constructor function pointer, per-CPU freelist pointers, and partial slab lists per NUMA node. Created via kmem_cache_create() and destroyed via kmem_cache_destroy(). Each cache serves objects of exactly one size.
- SLUB page (struct slab): A compound page or set of pages carved into fixed-size slots for one cache. Contains a freelist pointer to the first free object, an inuse count, and a pointer back to the owning kmem_cache. When all objects are freed, the page returns to the page allocator.
- Per-CPU freelist (kmem_cache_cpu): Each CPU has a pointer to a slab page and a freelist within that page. Allocation from the per-CPU freelist is a single cmpxchg with no locks. This is the fast path. If the per-CPU freelist is empty, SLUB promotes a partial slab from the node partial list, which requires taking the node lock.
- Node partial list (kmem_cache_node): Per-NUMA-node list of partially-filled slab pages. When a CPU exhausts its local slab, it grabs a partial slab from the node list (slow path). If no partial slabs exist, SLUB calls the page allocator to get fresh pages (slowest path). The min_partial tunable controls how many partial slabs the node retains.
Key Points for Kernel Memory Allocators (Slab, SLUB & kmalloc)
- kmalloc is not a syscall exposed to userspace. It is the kernel's internal general-purpose allocator, backed by a set of size-bucketed slab caches (kmalloc-8, kmalloc-16, kmalloc-32, up to kmalloc-8192). Larger allocations fall through to the page allocator directly.
- SLUB replaced the original SLAB allocator as the default in Linux 2.6.23 (2007). SLAB had per-CPU arrays, shared arrays, and three list heads per node. SLUB simplified this to one freelist pointer per CPU and one partial list per node, cutting metadata overhead by 50-70% and removing the complex queue management entirely.
- The fast path in SLUB is a single cmpxchg instruction on the per-CPU freelist pointer. No spinlock, no disabling interrupts, no per-CPU array management. On x86-64 this takes 10-20 nanoseconds. The slow path (promoting a partial slab) takes 200-500 nanoseconds. Falling through to the page allocator takes 1-10 microseconds.
- Object constructors run when a slab page is first carved into objects, not on every allocation. If a constructor initializes a mutex inside each object, that initialization happens once when the slab is created. When the object is freed and reallocated, the mutex is already initialized. This amortizes expensive setup across many allocation cycles (see the sketch after this list).
- Slab merging in SLUB combines caches with the same object size and alignment into a single cache. Two modules each creating a 128-byte cache end up sharing one cache, reducing fragmentation. Disable merging with slub_nomerge boot parameter when debugging use-after-free bugs, since merged caches make it harder to identify which subsystem owns a corrupt object.
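A minimal in-kernel sketch (module code, not userspace) of a dedicated cache with a constructor -- the struct and names are hypothetical:

    /* conn_cache.c -- dedicated slab cache whose ctor amortizes mutex_init() */
    #include <linux/module.h>
    #include <linux/mutex.h>
    #include <linux/slab.h>

    struct conn { struct mutex lock; int state; };

    static struct kmem_cache *conn_cache;

    static void conn_ctor(void *obj)   /* runs once per slab carve, not per alloc */
    {
        struct conn *c = obj;
        mutex_init(&c->lock);
    }

    static int __init conn_init(void)
    {
        conn_cache = kmem_cache_create("conn_cache", sizeof(struct conn),
                                       0, SLAB_HWCACHE_ALIGN, conn_ctor);
        return conn_cache ? 0 : -ENOMEM;
    }

    static void __exit conn_exit(void)
    {
        kmem_cache_destroy(conn_cache);
    }

    module_init(conn_init);
    module_exit(conn_exit);
    MODULE_LICENSE("GPL");

Allocations then use kmem_cache_alloc(conn_cache, GFP_KERNEL) on the fast path and kmem_cache_free(conn_cache, c) to return objects. Per the mistake list below, without the constructor a plain kmalloc bucket would serve just as well.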
Common Mistakes with Kernel Memory Allocators (Slab, SLUB & kmalloc)
- Assuming slab memory is leaked. SReclaimable slab memory (dentries, inodes) is not a leak. The kernel intentionally caches filesystem metadata until something else needs the memory. Only SUnreclaim growth without a corresponding increase in active kernel objects indicates a real problem.
- Using GFP_KERNEL in interrupt context. kmalloc(size, GFP_KERNEL) can sleep to reclaim memory, which is illegal in interrupt handlers, softirqs, and any code holding a spinlock. Use GFP_ATOMIC in those contexts, but understand that GFP_ATOMIC allocations can fail more easily since they cannot invoke reclaim.
- Creating a dedicated slab cache for an object that is the same size as an existing kmalloc bucket. SLUB will merge them anyway unless slub_nomerge is set. The dedicated cache adds a kmem_cache descriptor (256 bytes per NUMA node) with no benefit. Use kmalloc unless a constructor or specific alignment is needed.
- Ignoring NUMA locality. kmalloc allocates from the slab cache associated with the CPU's NUMA node. If a structure allocated on node 0 is frequently accessed by CPUs on node 1, every access pays the cross-node latency penalty (50-100ns extra). Use kmalloc_node() to allocate on the node where the object will be consumed.
Related Topics
Virtual Memory & Address Spaces, NUMA Architecture & Memory Policy, OOM Killer & Memory Pressure
Socket Programming (TCP/UDP) — Networking & Sockets
Difficulty: Starter
A kernel object -- struct socket plus struct sock -- that wraps the TCP/IP stack behind a file descriptor. Server path: socket(), bind(), listen(), accept(). listen() creates two queues: the SYN queue for half-open connections and the accept queue for completed handshakes waiting to be picked up. Every accept() hands back a fresh fd tied to a unique 4-tuple. SO_REUSEADDR prevents TIME_WAIT bind failures on restart; SO_REUSEPORT lets multiple sockets share a port with kernel-level distribution via source hash.
System Calls for Socket Programming (TCP/UDP)
- socket
- bind
- listen
- accept
- connect
- send
- recv
Key Components in Socket Programming (TCP/UDP)
- struct socket / struct sock: struct socket is the user-facing socket object (VFS layer); struct sock (sk) is the network-layer socket with protocol state, send/receive buffers, and connection queues
- syn_queue (request_sock_queue): Half-open connection queue holding SYN_RECV sockets (received SYN, sent SYN-ACK, awaiting ACK); bounded by tcp_max_syn_backlog (default 128-2048 depending on memory)
- accept_queue: Fully-established connection queue holding completed three-way handshakes waiting for accept(); bounded by the backlog argument to listen(), capped by somaxconn (default 4096 since kernel 5.4)
- sk_buff (skb): The fundamental network packet buffer: holds packet data, protocol headers (via pointer arithmetic), and metadata. Every received or sent packet traverses the stack as an skb
Key Points for Socket Programming (TCP/UDP)
- The backlog in listen(fd, backlog) controls the accept queue, not the SYN queue. Most people get this backwards. On modern kernels, the SYN queue is dynamically sized, and the backlog is capped by net.core.somaxconn (default 4096 since kernel 5.4).
- SO_REUSEADDR lets you bind to a port stuck in TIME_WAIT -- it's why every TCP server sets it before bind(). SO_REUSEPORT goes further: multiple sockets on the same port, kernel distributes connections via hash. This is how Nginx eliminated the thundering herd. (A minimal server skeleton follows this list.)
- accept() returns a NEW file descriptor every time. The listening socket never changes. That's how a server on port 80 handles thousands of connections: each one has a unique 4-tuple (src_ip, src_port, dst_ip, dst_port).
- UDP sockets can call connect() too -- it sets a default destination so send() works instead of sendto(), and the kernel filters incoming packets to only deliver from the connected peer. It also enables the kernel to return ICMP errors as socket errors.
- send() and recv() may transfer fewer bytes than you asked for. Always loop. MSG_WAITALL on recv() blocks until the full length arrives (TCP only), but for send(), there's no shortcut -- you must handle short writes yourself.
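The canonical setup path, as a sketch -- port 8080 and the one-line response are arbitrary, and a real server would multiplex instead of serving serially:

    /* tcp_server.c -- socket/bind/listen/accept with SO_REUSEADDR */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)); /* beat TIME_WAIT */
        struct sockaddr_in addr = {
            .sin_family      = AF_INET,
            .sin_port        = htons(8080),
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            { perror("bind"); return 1; }
        listen(lfd, 1024);                     /* sizes the accept queue */
        for (;;) {
            int cfd = accept(lfd, NULL, NULL); /* fresh fd, unique 4-tuple */
            if (cfd < 0) continue;             /* e.g. EINTR: just retry */
            write(cfd, "hi\n", 3);
            close(cfd);
        }
    }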
Common Mistakes with Socket Programming (TCP/UDP)
- Mistake: not handling EINTR. Reality: system calls like accept(), recv(), and send() can be interrupted by signals and return -1 with errno=EINTR. Production code must retry the call.
- Mistake: ignoring send()'s return value. Reality: send() on a TCP socket may send fewer bytes than requested (short write). Always check the return value and retry with the remaining data; a helper that does this follows the list.
- Mistake: forgetting SO_REUSEADDR before bind(). Reality: without it, restarting a server within TIME_WAIT (up to 60 seconds) fails with EADDRINUSE. Every TCP server should set this option.
- Mistake: using a small backlog in listen(). Reality: on high-connection-rate servers, a small backlog causes the accept queue to fill, dropping new connections silently. Set backlog to at least 1024 or SOMAXCONN.
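A send_all() helper covering the first two mistakes above -- short writes and EINTR -- in one loop:

    /* send_all: keep calling send() until every byte is out or a real error hits */
    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    ssize_t send_all(int fd, const void *buf, size_t len) {
        const char *p = buf;
        size_t left = len;
        while (left > 0) {
            ssize_t n = send(fd, p, left, 0);
            if (n < 0) {
                if (errno == EINTR) continue;  /* interrupted by a signal: retry */
                return -1;                     /* genuine error */
            }
            p += n;                            /* short write: advance and loop */
            left -= (size_t)n;
        }
        return (ssize_t)len;
    }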
Related Topics
TCP State Machine & Connection Lifecycle, epoll & I/O Multiplexing, Unix Domain Sockets, TCP Tuning & Congestion Control
Swap, kswapd & Memory Reclaim — Memory Management
Difficulty: Advanced
When free memory drops below the low watermark, kswapd wakes and scans LRU lists to reclaim pages in the background. If allocation pressure outpaces kswapd, the allocating thread enters direct reclaim and does the work itself, stalling the process. Swap is the last resort for anonymous pages that cannot be dropped -- they must be written to a swap device before their frame can be reused. The kernel maintains four LRU lists per memory zone (active/inactive for both anonymous and file-backed pages), using a second-chance algorithm to avoid evicting pages that are still in use.
System Calls for Swap, kswapd & Memory Reclaim
- madvise
- mlock
- swapon
- swapoff
Key Components in Swap, kswapd & Memory Reclaim
- LRU Lists: Four per-zone linked lists that track page recency: active anonymous, inactive anonymous, active file, inactive file. Pages enter the inactive list on first access. A second access promotes them to active. kswapd demotes pages from active to inactive by clearing the referenced bit. Pages at the tail of the inactive list are candidates for reclaim. The split between anonymous and file pages lets the kernel tune reclaim ratio via vm.swappiness.
- kswapd: A per-NUMA-node kernel thread that performs background page reclaim. It sleeps until free memory in any zone drops below the low watermark, then scans LRU lists and reclaims pages until free memory reaches the high watermark. kswapd is the happy path -- it works asynchronously so application threads do not stall. When kswapd cannot keep up with allocation pressure, the allocating thread falls into direct reclaim, which is synchronous and causes latency spikes.
- Watermarks (min/low/high): Three thresholds per memory zone that control reclaim behavior. High watermark is the target -- kswapd stops when free pages reach this level. Low watermark triggers kswapd wakeup. Min watermark is the emergency reserve -- only the kernel and PF_MEMALLOC processes can allocate below it. If an allocation cannot be satisfied even at min, direct reclaim or OOM killing begins. Watermarks are computed as fractions of zone size, tunable via vm.watermark_scale_factor (default 10, meaning 0.1% of zone memory).
- PSI (Pressure Stall Information): A kernel subsystem (Linux 4.20+) that measures the percentage of wall-clock time tasks spend stalled waiting for memory. Exposed via /proc/pressure/memory with three windows: avg10, avg60, avg300 seconds. The "some" metric means at least one task is stalled; "full" means all tasks are stalled. Kubelet uses PSI to trigger pod eviction before the OOM killer fires. A "some avg10" above 20% indicates severe memory pressure that will visibly degrade application performance.
Key Points for Swap, kswapd & Memory Reclaim
- kswapd is the background janitor. It wakes when free memory drops below the low watermark and reclaims pages until the high watermark is restored. Direct reclaim is the penalty -- when kswapd cannot keep up, the thread that called malloc() does the reclaim work itself and blocks until a page is freed.
- The kernel maintains four LRU lists per zone: active/inactive for both anonymous and file-backed pages. The second-chance algorithm means a page must be unreferenced twice before eviction -- once to move from active to inactive, once to actually reclaim it from the tail of the inactive list.
- vm.swappiness controls the ratio of anonymous vs file page reclaim. At the default of 60 the kernel still leans toward dropping page cache; at 100 both are weighted equally; at 0 it avoids swapping anonymous pages unless the system is critically short on memory. Since kernel 5.8 the range extends to 200, which biases reclaim toward anonymous pages -- useful when swap sits on fast NVMe or zram.
- Swap is not inherently bad. It lets the kernel move genuinely cold anonymous pages (e.g., init-time data never touched again) to disk, freeing RAM for hot working sets. The disaster scenario is when actively used pages get swapped -- especially JVM heaps during GC, where the collector touches every page and triggers mass swap-in.
- PSI (Pressure Stall Information) replaced guesswork with measurement. Instead of inferring memory pressure from free memory counters (which are misleading because the kernel intentionally keeps free memory low), PSI directly measures how much time tasks spend waiting for memory. A reader sketch follows this list.
- Kubernetes disabled swap for years because the scheduler assumed all pod memory was resident in RAM. Since 1.28, LimitedSwap mode allows Burstable pods to use swap proportional to their memory request while Guaranteed pods remain swap-free. This gives the system a buffer against transient spikes without breaking QoS guarantees.
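A sketch that reads the "some" line of /proc/pressure/memory (Linux 4.20+ with PSI enabled) and applies the 20% avg10 rule of thumb from the Key Components section:

    /* psi_peek.c -- warn when memory pressure is visibly hurting tasks */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/pressure/memory", "r");
        if (!f) { perror("fopen"); return 1; }  /* kernel too old or PSI off */
        double avg10, avg60, avg300;
        unsigned long long total;
        if (fscanf(f, "some avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   &avg10, &avg60, &avg300, &total) == 4) {
            printf("some avg10=%.2f%% avg60=%.2f%%\n", avg10, avg60);
            if (avg10 > 20.0)
                printf("severe pressure: tasks are stalling on memory\n");
        }
        fclose(f);
        return 0;
    }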
Common Mistakes with Swap, kswapd & Memory Reclaim
- Mistake: Setting vm.swappiness=0 and assuming swap is disabled. Reality: swappiness=0 does not disable swap. It tells the kernel to strongly prefer reclaiming file pages over anonymous pages, but under severe memory pressure the kernel will still swap. To truly prevent swapping, use swapoff -a or set memory.swap.max=0 in the cgroup.
- Mistake: Monitoring free memory and panicking when it is low. Reality: Linux intentionally keeps free memory low by using it for page cache. A system showing 200 MB free out of 64 GB is probably healthy -- the rest is cache that can be reclaimed instantly. Check "available" memory from /proc/meminfo (MemAvailable), not "free" (MemFree).
- Mistake: Disabling swap entirely on all servers. Reality: A small amount of swap (1-2 GB) provides a safety valve for transient spikes. Without swap, the gap between "memory is tight" and "OOM killer fires" is zero. Swap gives kswapd somewhere to put genuinely cold pages and buys time for alerts to fire before processes die.
- Mistake: Running JVM or PostgreSQL with default vm.swappiness=60 and large swap. Reality: GC-heavy workloads and database buffer pools must not be swapped. Major GC touches every object in the heap, so swapped-out pages cause 10-100x pause time amplification. Use swappiness=10, mlock, or cgroup swap limits for these workloads.
Related Topics
OOM Killer & Memory Pressure, Memory Cgroups & Resource Limits, Virtual Memory & Address Spaces, Page Cache & Block I/O, Page Tables & TLB, cgroups v2 (Control Groups)
Sysctl Tuning Reference — System Tuning
Difficulty: Intermediate
The kernel ships with defaults tuned for modest hardware and general-purpose workloads. A production server handling 100K connections, terabytes of data, or hundreds of containers needs different numbers. sysctl is the interface for changing them at runtime.
System Calls for Sysctl Tuning Reference
- sysctl
- open
- socket
- listen
- mmap
Key Components in Sysctl Tuning Reference
- /proc/sys/: Virtual filesystem exposing every tunable kernel parameter as a readable and writable file. Each subdirectory maps to a subsystem: net/ for networking, vm/ for virtual memory, fs/ for filesystem limits, kernel/ for process and scheduling parameters. Reading a file returns the current value. Writing a new value takes effect immediately but does not survive reboot.
- /etc/sysctl.d/: Drop-in directory for persistent sysctl configuration. Files are read in lexicographic order at boot by systemd-sysctl.service. Convention is to prefix filenames with a two-digit priority (e.g., 99-custom.conf) so that site-specific overrides load after distribution defaults. Each line follows the format key = value.
- sysctl(2) system call: The legacy syscall for reading and writing kernel parameters by MIB-style integer name (e.g., CTL_NET, NET_IPV4). It was deprecated for years and finally removed in Linux 5.5. The sysctl command-line tool does not use it; like everything else on a modern system, it reads and writes the files under /proc/sys/.
- net.core.somaxconn: Upper bound on the listen backlog for any socket. When an application calls listen(fd, backlog), the kernel clamps backlog to somaxconn. The old default of 128 (raised to 4096 in kernel 5.4) meant no socket could hold more than 128 pending connections, regardless of what the application requested. Raising it to 65535 allows applications to specify larger backlogs.
Key Points for Sysctl Tuning Reference
- sysctl changes via the command line or /proc/sys writes take effect immediately but are lost on reboot. Persistent changes go in /etc/sysctl.d/*.conf files and are applied at boot by systemd-sysctl.service. Always do both: apply now with sysctl -w and persist in a conf file. (The runtime write is sketched after this list.)
- Network tuning has three layers: global limits (somaxconn, netdev_budget), protocol defaults (tcp_rmem, tcp_wmem, tcp_max_syn_backlog), and per-socket overrides (SO_RCVBUF, SO_SNDBUF via setsockopt). The kernel auto-tunes per-socket buffers within the min/default/max range set by tcp_rmem and tcp_wmem. Setting SO_RCVBUF explicitly disables auto-tuning for that socket.
- vm.swappiness does not control whether swapping happens. It controls the relative weight the kernel gives to reclaiming anonymous pages (swap) versus page cache. At swappiness=0, the kernel avoids swapping almost entirely and prefers dropping page cache. At swappiness=100, anonymous and page cache reclaim are weighted equally. Database hosts typically use swappiness=10 to protect heap pages from being swapped.
- vm.dirty_ratio and vm.dirty_background_ratio control when dirty page writeback happens. background_ratio triggers the background flusher threads (pdflush/writeback). dirty_ratio is the hard limit where a writing process is forced to do synchronous writeback and blocks. Setting background_ratio too high causes bursty I/O; setting dirty_ratio too low causes frequent process stalls.
- fs.file-max sets the system-wide limit on open file descriptors. fs.nr_open sets the ceiling for per-process limits (what ulimit -n can be raised to). The per-process soft/hard limits in /etc/security/limits.conf or systemd LimitNOFILE are capped by nr_open. A common mistake is raising only ulimit while leaving file-max or nr_open at defaults.
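What sysctl -w does under the hood is a plain file write, as this sketch shows (needs root; the somaxconn value is just an example):

    /* sysctl_write.c -- apply a tunable at runtime via /proc/sys */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int sysctl_write(const char *path, const char *val) {
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror(path); return -1; }
        ssize_t n = write(fd, val, strlen(val));  /* takes effect immediately */
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int main(void) {
        /* equivalent to: sysctl -w net.core.somaxconn=4096 */
        return sysctl_write("/proc/sys/net/core/somaxconn", "4096\n") ? 1 : 0;
    }

Remember the point above: this change evaporates at reboot unless it is also written to /etc/sysctl.d/.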
Common Mistakes with Sysctl Tuning Reference
- Raising the Nginx or HAProxy backlog directive without also raising net.core.somaxconn. The kernel clamps the listen backlog to somaxconn, so setting backlog=65535 in Nginx while somaxconn sits at the pre-5.4 default of 128 means the actual backlog is 128. Both must be raised together.
- Setting SO_RCVBUF or SO_SNDBUF explicitly in application code and wondering why tcp_rmem/tcp_wmem changes have no effect. Explicit setsockopt calls disable the kernel auto-tuning for that socket. Remove the explicit setsockopt and let the kernel auto-tune within the tcp_rmem/tcp_wmem range instead.
- Writing sysctl values only to /proc/sys and forgetting to persist them in /etc/sysctl.d/. After a reboot or kernel update, all values revert to defaults. The production incident repeats, typically at the worst possible time.
- Setting vm.swappiness=0 on a host that needs swap as a safety net. swappiness=0 tells the kernel to avoid swapping almost entirely, which means the OOM killer activates sooner when memory pressure hits. On database hosts, swappiness=10 is usually the right balance: swap is available as a last resort but the kernel strongly prefers dropping page cache.
- Increasing tcp_rmem and tcp_wmem max values without considering total memory impact. With 100K connections and a 32 MB max buffer, the theoretical worst case is 3.2 TB of buffer memory. The kernel auto-tuning prevents this in practice, but applications that set large SO_RCVBUF values explicitly bypass auto-tuning and can exhaust memory.
- Tuning sysctls in the host namespace and expecting them to apply inside containers. Many network sysctls are per-network-namespace. Containers with their own network namespace inherit the defaults, not the host values. Use sysctl settings in the container runtime configuration (docker run --sysctl, Kubernetes securityContext.sysctls) instead.
Related Topics
TCP Tuning & Congestion Control, OOM Killer & Memory Pressure, Page Cache & Block I/O, Virtual Memory & Address Spaces, Network Namespaces & veth Pairs
System Calls: User to Kernel Transition — Kernel Internals
Difficulty: Intermediate
The narrow gate between user-space code and kernel-controlled hardware. Every file write, network send, and process creation passes through it. The CPU flips from ring 3 to ring 0, runs the kernel function, and flips back. Round trip: 100-300ns. Some high-frequency calls like gettimeofday() skip the gate entirely via the vDSO, resolving in user space at roughly 20ns.
System Calls for System Calls: User to Kernel Transition
- syscall
- write
- getpid
- gettimeofday
Key Components in System Calls: User to Kernel Transition
- sys_call_table: A kernel array of function pointers indexed by syscall number. On x86-64, defined in arch/x86/entry/syscall_64.c. Each entry maps a number (e.g., 1 = write, 39 = getpid) to a kernel function (e.g., __x64_sys_write).
- MSR_LSTAR: Model-Specific Register (0xC0000082) that holds the kernel entry point address. On syscall instruction, the CPU loads RIP from LSTAR. Set during boot in syscall_init() to point to entry_SYSCALL_64.
- pt_regs: Stack frame structure that saves user-space registers (RIP, RSP, RFLAGS, etc.) on kernel entry. The kernel reads syscall arguments from specific registers (RDI, RSI, RDX, R10, R8, R9) stored here.
- vDSO (virtual Dynamic Shared Object): A kernel-provided shared library mapped into every process's address space. Implements fast syscalls (gettimeofday, clock_gettime, getcpu) entirely in user space by reading kernel-maintained shared memory pages, avoiding the ring transition entirely.
Key Points for System Calls: User to Kernel Transition
- The 4th argument uses R10 instead of RCX because the hardware clobbers RCX -- the syscall instruction saves RIP into RCX and RFLAGS into R11. This one hardware quirk shapes the entire x86-64 calling convention for syscalls.
- The kernel never trusts user pointers. copy_from_user()/copy_to_user() validate addresses and handle page faults gracefully. Dereferencing a user pointer directly from kernel code is a security hole -- SMEP/SMAP enforce this in hardware.
- gettimeofday() almost never enters the kernel. The vDSO reads a shared page updated on each timer tick, making it a pure user-space call (~20ns vs ~200ns). Most 'syscall benchmarks' are actually measuring vDSO speed; both paths appear in the sketch after this list.
- Linux has ~450 syscalls on x86-64. New ones are rare because each is a permanent ABI commitment. The kernel prefers extending existing syscalls with flags (openat2, clone3) or multiplexers (ioctl, prctl) over adding new entries.
- Negative return values between -1 and -4095 encode errors. The C library translates -errno to -1 and sets errno. This works even for pointer-returning syscalls like mmap() because the top 4095 bytes of the address space are never mapped, so a valid pointer can never land in the error range.
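A sketch contrasting a genuine ring transition with a vDSO-resolved call -- the raw syscall(2) wrapper is the portable way to cross the gate by number:

    /* gate_demo.c -- one real syscall, one call the vDSO usually absorbs */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void) {
        long pid = syscall(SYS_getpid);  /* ring 3 -> ring 0 -> ring 3, ~100-300ns */
        struct timeval tv;
        gettimeofday(&tv, NULL);         /* usually resolved in the vDSO, ~20ns */
        printf("pid=%ld sec=%lld\n", pid, (long long)tv.tv_sec);
        return 0;
    }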
Common Mistakes with System Calls: User to Kernel Transition
- Mistake: Assuming glibc wrappers are thin pass-throughs. Reality: glibc cached getpid() in user space for years, skipping the syscall after the first call (the cache was removed in glibc 2.25 because it could return stale values around clone()). Wrappers can add overhead, transform arguments, or skip the kernel entirely via the vDSO.
- Mistake: Using int 0x80 on a 64-bit system. Reality: This enters the 32-bit compatibility path, truncates arguments to 32 bits, and uses completely different syscall numbers. Programs mixing 64-bit code with int 0x80 get silent data corruption.
- Mistake: Assuming syscalls are atomic. Reality: Most blocking syscalls (read, write, sleep) can return early with EINTR when a signal arrives. Not retrying on EINTR is one of the most common Unix programming bugs.
- Mistake: Using inline assembly for raw syscalls without understanding the clobber list. Reality: The syscall instruction clobbers RCX and R11, and the kernel may clobber additional registers. Use the syscall() wrapper or explicit clobber declarations.
Related Topics
Process Lifecycle (fork/exec/wait), Seccomp: Sandboxing System Calls, eBPF: Programmable Kernel, Signals & Signal Handling
Systemd Internals — System Initialization
Difficulty: Intermediate
How Linux boots, manages services, and tracks every process through a dependency graph of units, cgroup trees, and socket-activated file descriptors. Systemd replaced SysVinit not by doing something different, but by doing everything in parallel and letting socket activation sort out the ordering.
System Calls for Systemd Internals
- clone
- execve
- socket
- epoll_ctl
- inotify_add_watch
- mount
Key Components in Systemd Internals
- Unit Files: Declarative configuration files describing a resource systemd manages. Each unit has a type (service, socket, timer, mount, slice, target, path, device, scope, swap). Unit files live in /usr/lib/systemd/system (vendor), /etc/systemd/system (admin overrides), and /run/systemd/system (runtime). The [Unit], [Install], and type-specific sections ([Service], [Socket], etc.) define dependencies, ordering, and behavior.
- Dependency Graph and Transaction Engine: Systemd builds a directed graph of units connected by Requires=, Wants=, After=, Before=, Conflicts=, and BindsTo= edges. When starting a unit, the transaction engine computes a job queue respecting ordering constraints. If the graph contains a cycle, systemd breaks it by deleting the weakest job (one pulled in via Wants= is dropped before one pulled in via Requires=). The graph is queryable at runtime via systemd-analyze dot or systemctl list-dependencies.
- Socket Activation: Socket units (.socket) create listening sockets before the corresponding service starts. The kernel queues incoming connections in the socket backlog. When the first connection arrives (or immediately, depending on configuration), systemd starts the service and passes the open file descriptors via the LISTEN_FDS/LISTEN_PID environment variables. Services call sd_listen_fds(3) to receive them. This enables on-demand service startup and zero-downtime restarts.
- Cgroup Integration: Systemd is the default cgroup manager on modern Linux. Every service gets its own cgroup under /sys/fs/cgroup/system.slice/. Resource limits (MemoryMax, CPUQuota, TasksMax, IOWeight) in unit files translate directly to cgroup v2 controller knobs. When a service is stopped, systemd kills every process in the cgroup, preventing orphaned children from surviving.
- Journal (journald): A structured, indexed binary log replacing traditional syslog. Captures stdout/stderr of all services, kernel messages, and audit logs. Indexed by unit name, PID, UID, boot ID, priority, and custom fields. Supports forward sealing (FSS) for tamper detection. Accessed via journalctl with rich filtering and output formats (short, json, verbose, cat).
- sd-bus (D-Bus Integration): Systemd communicates with services and tools via D-Bus. The systemctl command does not directly manipulate processes; it sends D-Bus method calls to PID 1. sd-bus is systemd's own D-Bus client library, replacing libdbus with a smaller, faster implementation. Bus activation allows starting services on first D-Bus message, similar to socket activation.
- Target Units: Targets are synchronization points in the boot process, replacing SysVinit runlevels. multi-user.target corresponds to runlevel 3, graphical.target to runlevel 5. A target unit has no process of its own; it simply groups other units via Wants= and Requires= dependencies. The default target (default.target symlink) defines what the system boots into.
Key Points for Systemd Internals
- Systemd parallelizes boot by starting all units simultaneously and letting socket dependencies serialize only what must be serialized. Service A does not wait for service B to finish starting; it waits for B's socket to exist. If B's socket is activated, A can start before B's process even launches.
- The dependency graph has two separate concepts that are often confused. Requires/Wants define what gets pulled in (activation dependencies). After/Before define startup ordering. They are independent. Requires=B without After=B means A and B start in parallel. Most configurations need both.
- Every service runs in its own cgroup. This is not optional. When systemctl stop is called, systemd sends SIGTERM to the main process, waits TimeoutStopSec, then sends SIGKILL to every process in the cgroup. No orphaned child process survives a service stop, unlike SysVinit where daemonized children could escape the PID file tracking.
- Socket activation decouples socket lifetime from service lifetime. The .socket unit creates and binds the listening socket. The .service unit inherits the file descriptor. Between service restarts, the kernel keeps the socket open and buffers connections. This is why systemd can restart dbus.service without breaking every D-Bus client. The fd handoff protocol is sketched after this list.
- The journal is append-only and structured. Each entry has implicit fields (_PID, _UID, _SYSTEMD_UNIT, _BOOT_ID) added by journald, plus explicit fields from the application. Binary format enables O(log n) seeks by timestamp, unlike grep on text log files.
- Target units replaced runlevels but are more flexible. A target can depend on other targets, creating a tree. emergency.target pulls in almost nothing. multi-user.target pulls in networking, logging, cron, and all enabled services. graphical.target adds the display manager on top of multi-user.target.
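A sketch of what sd_listen_fds(3) does internally, assuming the simplest case: systemd passes inherited sockets starting at fd 3 and identifies them via two environment variables:

    /* activated.c -- decode the socket-activation handshake by hand */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define SD_LISTEN_FDS_START 3   /* first inherited fd, by protocol */

    int main(void) {
        const char *pid = getenv("LISTEN_PID");
        const char *fds = getenv("LISTEN_FDS");
        if (!pid || !fds || atol(pid) != (long)getpid()) {
            fprintf(stderr, "not socket-activated\n");
            return 1;
        }
        int n = atoi(fds);          /* fds 3 .. 3+n-1 are already listening */
        for (int i = 0; i < n; i++) {
            int fd = SD_LISTEN_FDS_START + i;
            printf("inherited listening fd %d\n", fd);
            /* a real service would accept() on fd here */
        }
        return 0;
    }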
Common Mistakes with Systemd Internals
- Confusing Requires= with After=. Writing Requires=database.service without After=database.service means both services start in parallel. The application crashes because the database is not ready yet. The fix is to add both directives, or use socket activation so the application connects to the database socket, which exists before the database process is fully initialized.
- Creating dependency cycles with drop-in files. A drop-in in /etc/systemd/system/dbus.service.d/ that adds Wants=myapp.service, combined with myapp.service having After=dbus.service, creates a cycle. Systemd silently breaks the cycle and deletes the weakest job. The service fails to start with a cryptic "deleted to break ordering cycle" message. Use systemd-analyze verify to catch cycles before deploying.
- Using Type=simple for a service that forks. If the service binary daemonizes itself (double fork), systemd considers the main PID to have exited and marks the service as failed. Use Type=forking with PIDFile= for legacy daemons, or better, remove the daemonization code and let systemd manage the process lifecycle directly.
- Not setting resource limits and then wondering why a misbehaving service caused an OOM kill on unrelated processes. Without MemoryMax=, a service can consume all available memory. Systemd's default cgroup placement helps systemd-oomd identify the culprit, but explicit limits prevent the problem entirely.
- Ignoring the journal and relying solely on application log files. When a service crashes before writing to its log file, the journal still captures its stderr, the exit code, the signal that killed it, and any kernel messages from the same moment. Running journalctl -u myservice -p err --since "1 hour ago" surfaces failures that never made it to application logs.
Related Topics
cgroups v2 (Control Groups), Linux Namespaces (PID, NET, MNT, UTS, IPC, USER), Daemons & Service Management, Process Lifecycle (fork/exec/wait)
TCP State Machine & Connection Lifecycle — Networking & Sockets
Difficulty: Intermediate
Eleven kernel states, from CLOSED through ESTABLISHED to TIME_WAIT. The kernel keeps TIME_WAIT sockets cheap -- roughly 160 bytes each via tcp_timewait_sock -- and caps them at tcp_max_tw_buckets (default 262144). SYN_RECV is tracked through tcp_request_sock with optional SYN cookies for flood defense. CLOSE_WAIT means the remote side sent FIN but the application never called close(); it lingers until the process closes the socket or dies.
System Calls for TCP State Machine & Connection Lifecycle
- connect
- accept
- shutdown
- close
- getsockopt
Key Components in TCP State Machine & Connection Lifecycle
- TCP state machine (tcp_states.h): 11 states defined in the kernel: CLOSED, LISTEN, SYN_SENT, SYN_RECV, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT, each with specific valid transitions
- tcp_timewait_sock: Lightweight kernel structure for TIME_WAIT connections; consumes only ~160 bytes compared to ~2 KB for a full tcp_sock, holding just enough state to handle duplicate packets and respond with RST
- inet_timewait_death_row: Global hash table of TIME_WAIT sockets; bounded by tcp_max_tw_buckets (default 262144). If exceeded, new TIME_WAIT sockets are immediately destroyed and logged as 'TCP: time wait bucket table overflow'
- tcp_request_sock: Kernel structure for SYN_RECV connections (in the SYN queue); holds the SYN cookie or full state. SYN cookies allow the kernel to handle SYN floods without allocating any memory per connection
Key Points for TCP State Machine & Connection Lifecycle
- TIME_WAIT is 60 seconds on Linux. Hardcoded. Not tunable. Its job is to absorb delayed packets from old connections that could corrupt a new one reusing the same 4-tuple. Each TIME_WAIT socket costs only ~160 bytes -- the real danger is port exhaustion, not memory.
- CLOSE_WAIT means the remote peer sent FIN but your application hasn't called close(). This is always an application bug -- leaked socket, missing error handling in a read loop, broken connection pool. CLOSE_WAIT sockets pile up until you hit fd limits.
- tcp_tw_reuse=1 lets outbound connections reuse TIME_WAIT sockets when the TCP timestamp is strictly increasing. Safe and effective for clients behind proxies. tcp_tw_recycle was removed in kernel 4.12 because it broke NAT.
- SYN cookies (tcp_syncookies=1, default on) defend against SYN floods without any server-side memory. The kernel encodes connection state into the SYN-ACK's initial sequence number. Valid ACKs reconstruct the connection from thin air.
- shutdown(SHUT_WR) sends FIN but keeps the fd open for reading -- a half-close. close() sends FIN and releases the fd, but if another process shares the fd (via fork/dup), close() only decrements the refcount. A half-close sketch follows.
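A graceful-shutdown sketch using the half-close described above -- it assumes the peer eventually closes its side:

    /* finish_request: send our FIN, drain the peer, avoid CLOSE_WAIT leaks */
    #include <sys/socket.h>
    #include <unistd.h>

    void finish_request(int fd) {
        char buf[4096];
        ssize_t n;
        shutdown(fd, SHUT_WR);          /* our FIN goes out; fd still readable */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;                           /* consume whatever the peer still sends */
        /* read() == 0 means the peer's FIN arrived; the close is now clean */
        close(fd);
    }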
Common Mistakes with TCP State Machine & Connection Lifecycle
- Mistake: treating TIME_WAIT as a bug. Reality: TIME_WAIT is correct TCP behavior that prevents data corruption. The real fix is connection pooling, not hacks like tcp_tw_recycle (which was removed for breaking NAT).
- Mistake: ignoring CLOSE_WAIT accumulation. Reality: CLOSE_WAIT sockets mean your app received the peer's FIN but never closed its end. Common in HTTP clients that don't drain response bodies, or connection pools that can't detect dead connections.
- Mistake: setting SO_LINGER with timeout 0. Reality: this causes close() to send RST instead of FIN, destroying the connection without TIME_WAIT. It loses in-flight data and confuses the peer. Only acceptable for aborting known-bad connections.
- Mistake: not monitoring SYN_RECV queue. Reality: under SYN flood attacks, the SYN queue fills even with SYN cookies enabled. Monitor TcpExtListenDrops and TcpExtTCPReqQFullDrop via nstat to detect drops.
Related Topics
Socket Programming (TCP/UDP), epoll & I/O Multiplexing, TCP Tuning & Congestion Control, Netfilter & nftables/iptables
TCP Tuning & Congestion Control — Networking & Sockets
Difficulty: Advanced
Throughput ceiling for any TCP connection: min(cwnd, rwnd) / RTT. The kernel auto-tunes receive buffers up to tcp_rmem[2] based on measured bandwidth-delay product, but calling setsockopt with SO_RCVBUF disables autotuning for that socket. Congestion control is pluggable -- CUBIC reacts to loss, BBR models bandwidth and RTT, DCTCP reads ECN marks -- all loadable via tcp_congestion_ops. TCP_NODELAY kills Nagle buffering; TCP_CORK batches small writes into full segments. New connections start at initcwnd=10, capping the first RTT to 14.6 KB.
System Calls for TCP Tuning & Congestion Control
- setsockopt
- getsockopt
Key Components in TCP Tuning & Congestion Control
- tcp_congestion_ops (congestion control): Pluggable kernel module implementing the congestion control algorithm; determines how fast to send based on ACKs, loss, or latency. CUBIC (default), BBR, Reno, Vegas, and DCTCP are built-in options
- tcp_sock (per-connection state): Extended socket structure holding TCP-specific state: send/receive window sizes, congestion window (cwnd), slow-start threshold (ssthresh), RTT estimates, retransmission timer, SACK scoreboard
- sk_buff send/receive queues: Per-socket buffers sized by SO_SNDBUF/SO_RCVBUF (or autotuned by tcp_wmem/tcp_rmem); undersized buffers limit throughput on high-BDP (bandwidth-delay product) links, oversized buffers waste memory
- tcp_metrics (struct tcp_metrics_block): Cached per-destination TCP metrics (RTT, cwnd, ssthresh) reused across connections; lets new connections skip slow start by inheriting the previous connection's parameters
Key Points for TCP Tuning & Congestion Control
- The throughput ceiling for any TCP connection is: min(cwnd, rwnd) / RTT. If your buffer is 6 MB and your RTT is 20ms, you max out at 2.4 Gbps -- on a 10 Gbps link. This is the Bandwidth-Delay Product (BDP = bandwidth * RTT), and it determines how big your buffers need to be.
- BBR measures actual bandwidth and RTT instead of panicking at packet loss. On lossy links (WiFi, cellular, transcontinental), BBR delivers 2-10x more throughput than CUBIC because it doesn't mistake random loss for congestion.
- TCP_NODELAY disables Nagle's algorithm, which batches small writes until an ACK arrives. Without it, a tiny Redis response can sit in the buffer for up to 200ms waiting on delayed ACKs. Every interactive protocol (HTTP, Redis, gRPC) should set it. (See the sketch after this list.)
- Linux autotuning dynamically adjusts receive buffers up to tcp_rmem[2]. But here's the trap: setting SO_RCVBUF manually DISABLES autotuning for that socket. Only override buffer sizes if you've measured the optimal value.
- The initial congestion window (initcwnd=10 since Linux 3.0) means a new connection can only send 14.6 KB in the first RTT. A 100 KB web page needs 4 RTTs just for slow start. On datacenter links, increasing initcwnd to 32-64 cuts this dramatically.
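A per-socket tuning sketch -- TCP_CONGESTION selects the algorithm for one connection, and fails unless that algorithm is available on the host:

    /* tune_socket: disable Nagle, opt this one connection into BBR */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    int tune_socket(int fd) {
        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            return -1;                   /* small writes now go out immediately */
        const char *cc = "bbr";          /* must appear in tcp_available_congestion_control */
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            return -1;                   /* fails if the tcp_bbr module isn't loaded */
        return 0;
    }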
Common Mistakes with TCP Tuning & Congestion Control
- Mistake: small receive buffers on high-BDP links. Reality: a 64 KB buffer on a 100 Mbps / 100ms link limits throughput to 5 Mbps (64KB / 0.1s) regardless of bandwidth. Calculate BDP first, then size buffers.
- Mistake: setting SO_SNDBUF/SO_RCVBUF without knowing the kernel doubles them. Reality: the kernel doubles the value you pass to leave room for bookkeeping (sk_buff overhead), so getsockopt() reports twice what you set and roughly the requested amount is usable for payload. And any explicit value disables autotuning.
- Mistake: enabling BBR without setting net.core.default_qdisc=fq on older kernels. Reality: before kernel 4.13, BBR required the fq (Fair Queue) scheduler to pace packets; later kernels pace inside TCP itself, but fq remains the recommended pairing. With plain pfifo_fast on an old kernel, BBR can't control inter-packet timing and loses its advantage.
- Mistake: enabling TCP_NODELAY but forgetting TCP_QUICKACK. Reality: delayed ACKs (default 200ms wait) can still add latency on the receiver side, especially in bidirectional protocols where the ACK would piggyback on data that hasn't arrived yet.
Related Topics
TCP State Machine & Connection Lifecycle, Socket Programming (TCP/UDP), epoll & I/O Multiplexing, Zero-Copy Networking (sendfile, splice)
Timers, Clocks & High-Resolution Timers — Kernel Internals
Difficulty: Intermediate
Multiple clock IDs for different correctness needs: CLOCK_REALTIME tracks wall time and can jump backward on NTP sync, CLOCK_MONOTONIC only moves forward, CLOCK_BOOTTIME includes time spent in suspend. The vDSO maps a shared vvar page into every process, making clock_gettime() resolve in roughly 20ns without touching the kernel. Two timer subsystems coexist: the jiffies-based timer wheel (O(1) insert, CONFIG_HZ granularity) for coarse network timeouts, and hrtimers (per-CPU red-black tree, nanosecond resolution) for precision work. timerfd wraps hrtimers as file descriptors that plug straight into epoll.
System Calls for Timers, Clocks & High-Resolution Timers
- clock_gettime
- timer_create
- timer_settime
- timerfd_create
- nanosleep
Key Components in Timers, Clocks & High-Resolution Timers
- clocksource / struct clocksource: Abstraction for hardware time counters. The kernel selects the best available source at boot: TSC (Time Stamp Counter, ~1ns resolution on modern x86), HPET, ACPI PM timer, or jiffies. 'cat /sys/devices/system/clocksource/clocksource0/current_clocksource' shows the active source.
- hrtimer (struct hrtimer): High-resolution timer using a per-CPU red-black tree sorted by expiry time. Provides nanosecond granularity backed by hardware (local APIC timer). Used by nanosleep(), POSIX timers, and the scheduler's tick. Fires from hardirq context via HRTIMER softirq.
- timer_list (struct timer_list): The classic jiffies-based timer wheel. Uses a hierarchical bucket structure (the design from Varghese and Lauck's 'Hashed and Hierarchical Timing Wheels' paper) for O(1) insertion and amortized O(1) expiry. Used for coarse timeouts (TCP retransmit, device polling) where nanosecond precision isn't needed.
- timerfd (timerfd_create/timerfd_settime): File descriptor-based timer interface. Returns an fd that becomes readable when the timer expires. Integrates naturally with epoll/select/poll event loops, eliminating the need for signal-based POSIX timers in event-driven servers.
Key Points for Timers, Clocks & High-Resolution Timers
- CLOCK_MONOTONIC never goes backward -- not during NTP adjustments, not during manual time changes. Use it for elapsed time and timeouts. CLOCK_REALTIME tracks wall clock time and CAN jump backward. Using it for timeouts causes hangs or early wakes.
- CLOCK_BOOTTIME includes time spent in suspend. A 10-minute CLOCK_MONOTONIC timer will not fire if the laptop sleeps for an hour. CLOCK_BOOTTIME will fire immediately on wake because the 10 minutes elapsed during sleep.
- The kernel tick runs at CONFIG_HZ (typically 250 Hz = 4ms). With CONFIG_NO_HZ_FULL, the tick stops entirely when one task is running -- reducing jitter for latency-sensitive workloads at the cost of slightly higher overhead when the tick fires.
- clock_gettime via the vDSO reads a shared page and the TSC -- no syscall, ~20ns. Most benchmarks measuring 'syscall overhead' with clock_gettime are actually measuring vDSO speed, not syscall cost.
- Timer slack rounds timer expiries to align with other timers, reducing CPU wakeups. Default is 50us for non-RT tasks. That is why setTimeout(1ms) rarely fires at 1ms. prctl(PR_SET_TIMERSLACK, 0) disables it for real-time tasks.
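A sketch of the timerfd interface from the Key Components list -- a 100ms periodic timer whose expirations arrive as plain fd reads (the same fd can go straight into epoll):

    /* tfd_demo.c -- periodic timer as a file descriptor */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/timerfd.h>
    #include <unistd.h>

    int main(void) {
        int tfd = timerfd_create(CLOCK_MONOTONIC, 0); /* immune to NTP jumps */
        struct itimerspec its = {
            .it_value    = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 },
            .it_interval = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 },
        };
        timerfd_settime(tfd, 0, &its, NULL);
        for (int i = 0; i < 5; i++) {
            uint64_t expirations;       /* how many expiries since last read */
            read(tfd, &expirations, sizeof(expirations));
            printf("tick (%llu expirations)\n", (unsigned long long)expirations);
        }
        close(tfd);
        return 0;
    }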
Common Mistakes with Timers, Clocks & High-Resolution Timers
- Mistake: Using CLOCK_REALTIME for timeout calculations. Reality: NTP can adjust the clock backward, turning a 10-second timeout into a 15-second wait. Always use CLOCK_MONOTONIC for deadlines and elapsed time.
- Mistake: Expecting nanosleep() to wake at exactly the requested time. Reality: the kernel adds timer slack (50us by default for non-RT tasks), and on kernels without high-resolution timers it also rounds to the tick, so nanosleep(1ms) can sleep 4ms on a 250 HZ kernel. Use clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) for precise wakeups.
- Mistake: Creating thousands of POSIX timers per process. Reality: Each consumes kernel memory and a signal slot. Use a single timerfd with the nearest expiry, or a userspace timing wheel.
- Mistake: Assuming TSC is synchronized across CPU cores. Reality: Older or misconfigured systems (NUMA, VMs without TSC offsetting) have per-core TSC drift. Check for constant_tsc and nonstop_tsc CPU flags. Modern CPUs are safe.
Related Topics
Interrupt Handling & Softirqs, Process Scheduling (CFS), System Calls: User to Kernel Transition, epoll & I/O Multiplexing
tmpfs & ramfs -- In-Memory Filesystems — Storage & Filesystems
Difficulty: Intermediate
tmpfs stores files in RAM backed by swap. ramfs stores files in RAM with no size limit and no swap. Both vanish on reboot. tmpfs powers /dev/shm, /run, /tmp in containers, and POSIX shared memory -- anywhere data must be fast and ephemeral.
System Calls for tmpfs & ramfs -- In-Memory Filesystems
- mount
- umount2
- shm_open
- shm_unlink
- mmap
- ftruncate
- statfs
Key Components in tmpfs & ramfs -- In-Memory Filesystems
- tmpfs (mm/shmem.c): A virtual filesystem that stores file data in the kernel page cache and swaps pages to disk under memory pressure. It enforces a configurable size limit (default 50% of RAM). Pages are allocated on demand and freed immediately when files are deleted. tmpfs supports all standard POSIX operations including hard links, symlinks, permissions, and xattrs.
- ramfs (fs/ramfs/): The simplest possible Linux filesystem. It stores data in the page cache like tmpfs but has no size limit, no swap support, and no resource accounting. A process writing to ramfs can consume all available memory until the OOM killer fires. ramfs exists mainly as a reference implementation and internal mechanism. It is almost never used directly in production.
- /dev/shm: A tmpfs instance mounted at /dev/shm that backs POSIX shared memory. shm_open("/name", O_CREAT | O_RDWR, 0600) creates a file at /dev/shm/name. Processes mmap this file to share memory without kernel copies. The default size is 50% of RAM. In containers, Docker defaults to 64 MB, which is insufficient for databases and scientific workloads.
- /run (formerly /var/run): A tmpfs mount that holds runtime state: PID files, socket files, lock files, and systemd transient data. It is created early in boot before the root filesystem is writable. Size is typically 10-25% of RAM. Anything stored here disappears on reboot by design.
Key Points for tmpfs & ramfs -- In-Memory Filesystems
- tmpfs is not a RAM disk. A RAM disk (like /dev/ram0) allocates a fixed block of memory at creation. tmpfs allocates pages on demand and frees them when files are deleted. An empty tmpfs uses zero RAM. A 10 GB tmpfs mount with only 50 MB of files in it uses only 50 MB of physical memory.
- tmpfs pages can be swapped out. Under memory pressure, the kernel treats tmpfs pages like any other anonymous page and moves them to swap. This means tmpfs data survives memory pressure (it just gets slower), while ramfs data can never be evicted and will cause OOM conditions instead.
- The size= mount option limits total file data, not resident memory. Setting size=1G means up to 1 GB of file content can exist in the filesystem. If memory is tight, some of those pages live in swap. If nr_inodes= is not set, the default caps the inode count at roughly half the number of physical RAM pages.
- Container runtimes mount tmpfs for /tmp, /run, and /dev/shm inside each container. These are independent tmpfs instances in the container's mount namespace. The size= parameter on each mount is critical -- without it, a single container can consume half of host RAM through tmpfs writes alone. (A mount(2) sketch with explicit limits follows this list.)
- tmpfs supports huge pages via the huge= mount option. Setting huge=within_size or huge=always allows tmpfs to use 2 MB huge pages for large files, reducing TLB pressure during sequential access. This matters for shared memory segments used by databases.
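What mount -t tmpfs -o size=1G,nr_inodes=10k,mode=1777 does, as a C sketch -- /mnt/scratch is a hypothetical target, and this needs CAP_SYS_ADMIN:

    /* tmpfs_mount.c -- mount a bounded, hardened tmpfs */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void) {
        if (mount("tmpfs", "/mnt/scratch", "tmpfs",
                  MS_NOSUID | MS_NOEXEC,            /* per the mistake list below */
                  "size=1G,nr_inodes=10k,mode=1777") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }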
Common Mistakes with tmpfs & ramfs -- In-Memory Filesystems
- Assuming tmpfs data survives a reboot. tmpfs lives in volatile memory (RAM plus swap). Power loss or reboot destroys everything. Data that must survive restarts belongs on a persistent filesystem. This bites container deployments where tmpfs-backed volumes silently lose state during pod rescheduling.
- Using ramfs in production instead of tmpfs. ramfs has no size limit. A process that writes continuously to a ramfs mount will consume all system memory because ramfs pages cannot be evicted or reclaimed. The OOM killer is the only backstop. tmpfs with an explicit size= limit prevents this.
- Not setting --shm-size in Docker or a memory-backed emptyDir in Kubernetes for applications that use POSIX shared memory. The default /dev/shm in Docker is 64 MB. PostgreSQL with shared_buffers=256MB will fail to start. Oracle databases, MATLAB, and many MPI-based scientific tools also require larger /dev/shm.
- Confusing tmpfs size with memory reservation. A tmpfs mounted with size=4G does not reserve 4 GB of RAM. It sets a ceiling. Actual memory use depends on files written. But if processes fill it to 4 GB and memory is scarce, those pages compete with application memory for physical frames, and the result is swap thrashing or OOM kills.
- Mounting tmpfs without noexec,nosuid when used for temporary data. On security-sensitive systems, tmpfs mounts at /tmp should include noexec and nosuid options to prevent execution of uploaded binaries and privilege escalation through setuid files staged in /tmp.
Related Topics
Virtual File System (VFS), Page Cache & Block I/O, Shared Memory & Semaphores, mmap & Memory-Mapped Files, Swap, kswapd & Memory Reclaim, OverlayFS & Union File Systems
Unix Domain Sockets — Networking & Sockets
Difficulty: Intermediate
Sockets that skip the entire TCP/IP stack -- no checksums, no routing, no congestion control, no TIME_WAIT. Data moves by memcpy between kernel buffers. Three types: SOCK_STREAM for byte streams, SOCK_DGRAM for reliable datagrams, SOCK_SEQPACKET for reliable delivery with message boundaries preserved. The standout feature is SCM_RIGHTS: passing open file descriptors between unrelated processes by duplicating the struct file into the receiver's fd table. Addressing can be a filesystem path (permission-controlled), abstract namespace (\0-prefixed, memory-only), or unnamed via socketpair().
System Calls for Unix Domain Sockets
- socket
- bind
- connect
- sendmsg
- recvmsg
- socketpair
Key Components in Unix Domain Sockets
- struct unix_sock: AF_UNIX-specific socket structure; extends struct sock with a peer pointer, credential info, pathname, and the unix_peer direct pointer for connected sockets
- struct sockaddr_un: Address structure containing sun_family (AF_UNIX) and sun_path: either a filesystem path (/var/run/docker.sock), an abstract namespace path (\0name), or unnamed (socketpair)
- struct msghdr / struct cmsghdr: Message header for sendmsg/recvmsg; cmsghdr carries ancillary data: SCM_RIGHTS (fd passing), SCM_CREDENTIALS (PID/UID/GID of sender), SCM_SECURITY (SELinux label)
- Peer receive queue (sk_receive_queue): Kernel buffer holding sk_buffs queued for the peer; data is copied directly from the sender's user space into a kernel buffer, then from the buffer into the receiver's user space, with no network stack processing
Key Points for Unix Domain Sockets
- 2-3x faster than TCP loopback, and the reason is what it skips: no IP routing, no TCP checksums, no segmentation, no congestion control, no TIME_WAIT. Data is just memcpy'd between socket buffers in the kernel.
- SCM_RIGHTS lets you pass open file descriptors between unrelated processes. The kernel duplicates the fd in the receiver's table -- new number, same underlying struct file. This is how Nginx hands listen sockets to new workers during hot restart without dropping connections. A sketch of the sending side follows this list.
- Abstract namespace sockets (sun_path[0] = '\0') live only in memory, not the filesystem. They vanish automatically when the last fd closes -- no stale socket files to clean up. Docker's containerd uses them for shim communication.
- SOCK_SEQPACKET gives you the best of both worlds: reliable, ordered delivery like a stream, but with message boundaries preserved. Each send() becomes exactly one recv(). Ideal for control protocols where framing matters.
- Filesystem-path sockets use file permissions for access control -- only processes with write permission to the socket path can connect. Abstract sockets have no permissions: anything in the same network namespace can connect, so plan your authentication accordingly.
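A minimal sketch of the sending side of SCM_RIGHTS, assuming sock is an already-connected AF_UNIX socket (the helper name is ours; error handling trimmed):

```c
/* Pass an open file descriptor to the peer of a connected AF_UNIX socket.
 * The receiver gets a new fd number pointing at the same struct file. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

ssize_t send_fd(int sock, int fd_to_send) {
    char dummy = 'x';                          /* at least 1 byte of real data required */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {                                    /* correctly aligned ancillary buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;             /* ancillary payload is an fd */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));
    return sendmsg(sock, &msg, 0);
}
```

On the receiving side, recvmsg() with an msg_control buffer of at least CMSG_SPACE(sizeof(int)) bytes yields the duplicated descriptor via CMSG_DATA -- undersizing that buffer is the fd-leak mistake listed below.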
Common Mistakes with Unix Domain Sockets
- Mistake: not unlinking the socket file before bind(). Reality: if the file exists from a previous run, bind() fails with EADDRINUSE. Always unlink(path) before bind(), or use abstract sockets that self-clean; see the snippet after this list.
- Mistake: using long socket paths. Reality: sun_path is limited to 108 bytes including the null terminator. Long paths like /run/containers/storage/overlay-containers/../attach silently truncate.
- Mistake: undersizing the ancillary data buffer. Reality: if recvmsg()'s msg_control buffer is too small, the kernel silently closes excess file descriptors. You leak fds in the sender's process with no error.
- Mistake: using read()/write() for fd passing. Reality: only sendmsg()/recvmsg() support ancillary data (cmsg). read/write work for data but cannot pass file descriptors or credentials.
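A snippet for the unlink-before-bind idiom, which also guards against the sun_path truncation trap (the helper name is ours):

```c
/* Bind a filesystem-path AF_UNIX socket, removing any stale socket file
 * left by a previous run and rejecting paths that would not fit. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int bind_unix(int sock, const char *path) {
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    if (strlen(path) >= sizeof(addr.sun_path))
        return -1;                 /* would truncate: sun_path is only 108 bytes */
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);                  /* remove stale socket file; ENOENT is fine */
    return bind(sock, (struct sockaddr *)&addr, sizeof(addr));
}
```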
Related Topics
Socket Programming (TCP/UDP), epoll & I/O Multiplexing, Network Namespaces & veth Pairs, Zero-Copy Networking (sendfile, splice)
User Namespaces in Depth — Containers & Isolation
Difficulty: Advanced
How a container sees UID 0 while the host sees UID 1000. User namespaces remap UID/GID ranges so that "root" inside the container has zero privilege on the host. The mapping is explicit, written to /proc/PID/uid_map, and enforced by the kernel on every permission check.
System Calls for User Namespaces in Depth
- unshare
- clone
- clone3
- setns
Key Components in User Namespaces in Depth
- /proc/PID/uid_map: Three-column file defining the UID mapping for a user namespace. Format: "inside_uid host_uid count". Example: "0 1000 1" maps container UID 0 to host UID 1000 for a range of 1. "1 100000 65535" maps container UIDs 1-65535 to host UIDs 100000-165534. The kernel consults this mapping on every credential check, file creation, and signal delivery. A minimal sketch follows this list.
- /proc/PID/gid_map: Same format as uid_map but for GIDs. Must be written after writing "deny" to /proc/PID/setgroups (which prevents the process from calling setgroups(2) to manipulate supplementary groups it does not legitimately own). Without the setgroups deny, writing gid_map fails with EPERM.
- newuidmap / newgidmap: Setuid-root helper binaries that write UID/GID mappings on behalf of an unprivileged process. They validate the requested mappings against /etc/subuid and /etc/subgid. Without these helpers, an unprivileged process can only map its own UID inside the namespace (a single-UID mapping). The helpers enable multi-UID mappings within the allocated subordinate range.
- /etc/subuid and /etc/subgid: Delegation files that define which subordinate UID/GID ranges each user may use. Format: "username:start:count". Example: "alice:100000:65536" grants alice host UIDs 100000 through 165535 for use inside user namespaces. System administrators control namespace UID allocation by editing these files. Without an entry, a user cannot create multi-UID user namespaces.
- user_namespace struct (kernel): Kernel structure (include/linux/user_namespace.h) that holds the UID/GID maps (uid_map, gid_map as struct uid_gid_map), the owning user (creator_cred), parent namespace pointer, and capability flags. The kernel walks the namespace hierarchy to translate between namespace UIDs and host UIDs using map_id_range_down() and map_id_range_up().
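A minimal sketch tying these files together, assuming the caller runs as host UID/GID 1000; a single self-mapping like this needs neither the newuidmap helper nor an /etc/subuid entry:

```c
/* Create a user namespace and map UID/GID 0 inside to 1000 on the host. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_file(const char *path, const char *s) {
    int fd = open(path, O_WRONLY);
    if (fd < 0 || write(fd, s, strlen(s)) < 0)
        perror(path);
    if (fd >= 0)
        close(fd);
}

int main(void) {
    if (unshare(CLONE_NEWUSER) != 0) { perror("unshare"); return 1; }
    write_file("/proc/self/setgroups", "deny");      /* must precede gid_map */
    write_file("/proc/self/uid_map", "0 1000 1\n");  /* inside 0 -> host 1000 */
    write_file("/proc/self/gid_map", "0 1000 1\n");
    printf("uid in namespace: %d\n", (int)getuid()); /* prints 0 */
    return 0;
}
```

Mapping more than the caller's own UID (the "1 100000 65535" style above) is exactly what requires the setuid newuidmap helper and a matching /etc/subuid entry.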
Key Points for User Namespaces in Depth
- A process with CAP_SYS_ADMIN inside a user namespace has that capability only within the namespace. It can mount filesystems, create network namespaces, and manipulate cgroups scoped to its namespace. On the host, the same process runs as an unprivileged user with zero extra capabilities.
- Writing to uid_map is a one-shot operation. Once written, the mapping is immutable for the lifetime of the namespace. There is no way to change or extend the mapping after the fact. Getting the mapping wrong means destroying and recreating the namespace.
- Without a valid UID mapping, the kernel uses the overflow UID (65534, typically "nobody"). Any file owned by an unmapped host UID appears as owned by 65534 inside the namespace. This is why cross-container shared volumes show files as "nobody" when UID mappings do not overlap.
- Linux 5.12 introduced idmapped mounts (mount_setattr with MOUNT_ATTR_IDMAP). This allows a single filesystem to be mounted with different UID translations for different containers, solving the shared volume problem without changing on-disk ownership. The translation happens at the VFS layer during path resolution.
- The user namespace is the root of all other namespace capabilities. Creating a network namespace, PID namespace, or mount namespace as an unprivileged user requires first creating a user namespace. The user namespace grants the in-namespace CAP_SYS_ADMIN needed to create the others.
Common Mistakes with User Namespaces in Depth
- Assuming "root inside the container" means root on the host. With user namespaces, container UID 0 maps to an unprivileged host UID. But without user namespaces (Docker default mode until recently), container UID 0 is real host root. The difference is the presence or absence of the user namespace layer.
- Forgetting to write "deny" to /proc/PID/setgroups before writing gid_map. The kernel requires this to prevent an unprivileged process from dropping supplementary groups it does not control. Omitting this step results in EPERM when writing gid_map and a confusing error message.
- Sharing a volume between two containers with non-overlapping UID mappings. Container A maps host UID 100000 as its UID 0. Container B maps host UID 200000 as its UID 0. Files created by container A's root (host 100000) appear as nobody (65534) inside container B. Solutions: idmapped mounts, shared subordinate ranges, or running both containers with the same UID mapping.
- Not allocating enough subordinate UIDs in /etc/subuid. A container running real services (systemd, multiple users) needs a range of at least 65536. Allocating only 1000 subordinate UIDs causes the container to fail when any process tries to use a UID above 1000 inside.
Related Topics
Linux Namespaces (PID, NET, MNT, UTS, IPC, USER), Network Namespaces & veth Pairs, Linux Capabilities, Seccomp: Sandboxing System Calls, Container Runtime Internals
Virtual File System (VFS) — File Systems & I/O
Difficulty: Advanced
Four core objects -- super_block, inode, dentry, file -- each backed by a function pointer table that routes every I/O call to the right filesystem driver. read() follows file->f_op->read_iter() straight to ext4, NFS, procfs, or wherever the file lives. No switch statement, no runtime detection. Path resolution walks the dentry cache (dcache) with a lockless RCU fast path and a sleeping ref-walk fallback. New filesystems register with register_filesystem() and get grafted into the mount tree. The unified page cache operates per-inode regardless of backend.
System Calls for Virtual File System (VFS)
Key Components in Virtual File System (VFS)
- struct super_block: Represents a mounted filesystem instance; holds block size, root dentry, filesystem-specific operations (super_operations), and device reference
- struct inode: Generic in-memory inode; holds metadata (mode, uid, gid, size) and pointers to inode_operations and file_operations for dispatch
- struct dentry: Cached directory entry linking a name component to an inode; forms the dentry tree used for path resolution
- struct file: Represents an open file instance; holds offset, flags, and a pointer to file_operations for actual I/O dispatch
Key Points for Virtual File System (VFS)
- There's no giant switch/case on filesystem type. When you call read(), the kernel invokes file->f_op->read_iter(), which is ext4_file_read_iter() for ext4, nfs_file_read() for NFS, etc. Pure function-pointer indirection -- one level of dispatch, zero branching
- The dentry cache (dcache) caches "file not found" too. Negative dentries prevent repeated disk reads for names that don't exist -- critical when your shell searches $PATH or a build system checks dozens of include directories
- Plugging in a new filesystem is registering a struct. register_filesystem() adds a file_system_type to a global linked list; mount() walks the list, finds the right driver, and calls its mount() method to create a superblock (see the sketch after this list)
- 'Everything is a file' isn't a metaphor -- it's an architecture. Pseudo-filesystems (procfs, sysfs, tmpfs) implement VFS operations purely in kernel memory, never touching a block device
- One page cache rules them all. The VFS page cache is unified across ext4, NFS, and even FUSE -- so eviction, dirty writeback, and readahead work consistently regardless of backend
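A skeletal module sketch of the registration path, using the legacy .mount entry point (recent kernels prefer the fs_context API). Every myfs_* name is hypothetical; mount_nodev, d_make_root, and the simple_* libfs helpers are real kernel exports:

```c
#include <linux/fs.h>
#include <linux/module.h>

static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
    static const struct super_operations ops = {
        .statfs     = simple_statfs,
        .drop_inode = generic_delete_inode,
    };
    struct inode *root;

    sb->s_op    = &ops;
    sb->s_magic = 0x4d594653;                     /* arbitrary "MYFS" magic */
    root = new_inode(sb);
    if (!root)
        return -ENOMEM;
    root->i_ino  = 1;
    root->i_mode = S_IFDIR | 0755;
    root->i_op   = &simple_dir_inode_operations;  /* canned libfs directory ops */
    root->i_fop  = &simple_dir_operations;
    sb->s_root = d_make_root(root);               /* consumes root on failure */
    return sb->s_root ? 0 : -ENOMEM;
}

static struct dentry *myfs_mount(struct file_system_type *fs_type,
                                 int flags, const char *dev_name, void *data)
{
    return mount_nodev(fs_type, flags, data, myfs_fill_super); /* no block device */
}

static struct file_system_type myfs_type = {
    .owner   = THIS_MODULE,
    .name    = "myfs",              /* what 'mount -t myfs' matches */
    .mount   = myfs_mount,
    .kill_sb = kill_litter_super,   /* generic teardown for in-memory fs */
};

static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);  /* graft into the global list */
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```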
Common Mistakes with Virtual File System (VFS)
- Assuming all filesystems support the same features -- VFS operations can return -EOPNOTSUPP for unsupported ops (e.g., fallocate on NFS v3, xattrs on FAT32). The abstraction is uniform; the capabilities are not
- Confusing the VFS inode with the on-disk inode -- the VFS inode is an in-memory, filesystem-independent structure populated by the filesystem driver when the inode is read in. It may contain fields that don't exist on disk at all
- Ignoring mount propagation (shared, slave, private) -- this controls how mount events flow across namespaces. Get it wrong and mounts inside a container leak to the host, or vice versa
- Expecting FUSE to perform like an in-kernel filesystem -- every VFS operation on FUSE requires a context switch to userspace and back. That round-trip is the price of flexibility
Related Topics
Inodes & File Metadata, Page Cache & Block I/O, /proc and /sys Filesystems, File Descriptors & File Tables
Virtual Memory & Address Spaces — Memory Management
Difficulty: Intermediate
Every process gets a private 128 TB address space that is entirely fictional. Each address is virtual, translated to physical RAM only when touched. This single indirection layer makes process isolation, demand paging, copy-on-write fork, and memory-mapped files all possible.
System Calls for Virtual Memory & Address Spaces
- mmap
- munmap
- brk
- sbrk (libc wrapper around the brk syscall)
- mprotect
Key Components in Virtual Memory & Address Spaces
- mm_struct: Per-process memory descriptor holding VMA list, page table root (pgd), RSS counters, and mmap_base
- vm_area_struct (VMA): Describes one contiguous virtual region with start/end addresses, permissions (rwxp), and backing file/anon info
- page (struct page): Kernel descriptor for each physical page frame: reference count, mapping pointer, flags (dirty, locked, slab)
- pgd/p4d/pud/pmd/pte: Page table hierarchy translating virtual addresses to physical frame numbers on x86-64; four levels for 48-bit addresses (p4d is folded away) and five for 57-bit LA57 systems
Key Points for Virtual Memory & Address Spaces
- Every process gets a 128 TB private address space -- but it is all fake. On x86-64 the kernel splits the canonical 48-bit space: user space gets the lower 128 TB (up to 0x00007FFFFFFFFFFF) and the kernel keeps everything from 0xFFFF800000000000 up
- Page faults are features, not bugs. When you touch a page for the first time, the kernel traps, allocates a physical frame, and wires up the translation -- this is how a 1 GB mmap completes in microseconds but costs zero RAM until accessed
- ASLR randomizes where your stack, heap, and libraries land on every exec -- 28 bits of entropy means an attacker has a 1-in-268-million chance of guessing your mmap base
- brk() is the old way to grow the heap; modern malloc uses mmap() for anything above 128 KB because mmap'd regions can be freed independently, while brk can only shrink from the top
- The vDSO is a kernel page mapped into user space that lets gettimeofday() run without a syscall -- zero ring transitions, pure speed
Common Mistakes with Virtual Memory & Address Spaces
- Mistaking virtual size for real usage -- a process can map terabytes via mmap and show huge VIRT/VSZ numbers while consuming almost no physical RAM; RSS is what matters for memory pressure
- Checking mmap() returns against NULL instead of MAP_FAILED -- mmap returns (void*)-1 on failure, not NULL, and address 0 can theoretically be a valid mapping; see the snippet after this list
- Calling brk()/sbrk() directly in threaded code -- these modify a single program break pointer shared across all threads, so concurrent calls corrupt the heap
- Assuming the stack grows forever -- it is capped by ulimit (default 8 MB), and exceeding it gives you a silent SIGSEGV, not a helpful error message
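A short snippet combining the MAP_FAILED check with the demand-paging behavior described above:

```c
/* Reserve 1 GB of virtual address space; physical pages appear only on
 * first touch, so VSZ is ~1 GB while RSS stays a few KB. */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {       /* (void *)-1, not NULL */
        perror("mmap");
        return 1;
    }
    p[0] = 1;                    /* page fault wires up one 4 KB frame */
    p[len - 1] = 1;              /* a second frame; the rest stays untouched */
    munmap(p, len);
    return 0;
}
```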
Related Topics
Page Tables & TLB, mmap & Memory-Mapped Files, OOM Killer & Memory Pressure, Memory Cgroups & Resource Limits
VXLAN & Overlay Networking — Networking & Sockets
Difficulty: Advanced
VXLAN (Virtual Extensible LAN) encapsulates Layer 2 Ethernet frames inside Layer 3 UDP packets on port 4789. A 24-bit VNI (VXLAN Network Identifier) field supports up to 16,777,216 isolated network segments -- far beyond the 4,094 limit of 802.1Q VLANs. Each node runs a VTEP (Virtual Tunnel Endpoint) that performs encapsulation on egress and decapsulation on ingress. The FDB (Forwarding Database) on each VTEP maps inner destination MACs to outer destination IPs, directing encapsulated traffic to the correct remote node. Total encapsulation overhead is 50 bytes: 14 (outer Ethernet) + 20 (outer IP) + 8 (UDP) + 8 (VXLAN header).
System Calls for VXLAN & Overlay Networking
- socket
- sendmsg
- recvmsg
- ioctl
Key Components in VXLAN & Overlay Networking
- VTEP (Virtual Tunnel Endpoint): The VXLAN device on each node that performs encapsulation and decapsulation. On egress, the VTEP wraps an inner Ethernet frame with outer Ethernet, IP, UDP, and VXLAN headers, then sends the resulting UDP packet to the remote VTEP. On ingress, it strips the outer headers and delivers the inner frame to the local bridge. Each VTEP has its own MAC address and IP address. In Flannel, the VTEP is the flannel.1 device; in Docker Swarm, it is automatically created per overlay network.
- VNI (VXLAN Network Identifier): A 24-bit field in the VXLAN header that identifies which overlay network a frame belongs to. With 24 bits, VXLAN supports 16,777,216 unique network segments, solving the 4,094-segment limitation of 802.1Q VLANs. Each overlay network gets a unique VNI. A VTEP receiving an encapsulated packet uses the VNI to determine which bridge or network namespace to deliver the inner frame to. Flannel typically uses VNI 1; Docker Swarm assigns VNIs dynamically per overlay network. A byte-layout sketch of the header follows this list.
- FDB (Forwarding Database): The MAC-to-VTEP mapping table on each VXLAN device. When the local VTEP needs to send an encapsulated frame, it looks up the inner destination MAC in the FDB to find the outer destination IP (the remote VTEP address). Entries can be learned dynamically via multicast or programmed statically by the CNI plugin. Flannel and Docker Swarm populate FDB entries through their control planes. Stale or missing FDB entries cause frames to be dropped or flooded, which is a common source of cross-node connectivity failures.
- Outer UDP Header (port 4789): The transport layer wrapper that carries encapsulated VXLAN frames across the physical network. IANA assigned UDP port 4789 for VXLAN. The outer UDP header is 8 bytes and includes a source port derived from a hash of the inner frame (typically the inner 5-tuple), which enables ECMP (Equal-Cost Multi-Path) load balancing across multiple physical paths. Without this entropy in the source port, all VXLAN traffic between two nodes would follow a single physical path, creating hotspots.
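A small sketch of that 8-byte header as a C struct, matching the layout described above (the I flag marks the VNI as valid; all other bits are reserved):

```c
/* VXLAN header: 8 bytes = flags/reserved word + 24-bit VNI + 8 reserved
 * bits. The I flag is bit 3 of the first byte (0x08), i.e. 1<<27 of the
 * first 32-bit word in host order. Values shown are illustrative. */
#include <stdint.h>
#include <arpa/inet.h>

struct vxlan_hdr {
    uint32_t flags;   /* I flag set, everything else reserved as zero */
    uint32_t vni;     /* VNI in the top 24 bits, low 8 bits reserved */
};

static void vxlan_set(struct vxlan_hdr *h, uint32_t vni) {
    h->flags = htonl(1u << 27);   /* 0x08000000: valid-VNI flag */
    h->vni   = htonl(vni << 8);   /* shift the 24-bit VNI into place */
}

int main(void) {
    struct vxlan_hdr h;
    vxlan_set(&h, 42);            /* Flannel would typically use VNI 1 */
    return h.flags == htonl(0x08000000) ? 0 : 1;
}
```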
Key Points for VXLAN & Overlay Networking
- VXLAN encapsulation adds exactly 50 bytes: 14 (outer Ethernet header) + 20 (outer IPv4 header) + 8 (UDP header) + 8 (VXLAN header with 24-bit VNI). This means inner MTU must be reduced from 1500 to 1450 on standard networks. Failing to adjust the MTU is the single most common cause of mysterious connectivity failures in overlay networks.
- The FDB is the control plane of VXLAN. Without correct FDB entries, the VTEP does not know which remote node to send encapsulated frames to. In multicast-based VXLAN, the FDB learns entries by flooding BUM (Broadcast, Unknown unicast, Multicast) traffic to a multicast group. In controller-based VXLAN (Flannel, Docker Swarm), entries are programmed directly via netlink, eliminating the need for multicast on the physical network.
- The outer UDP source port is not random -- it is derived from a hash of the inner packet headers. This is critical for performance. Physical switches and routers use the outer 5-tuple for ECMP hashing, so varying the source port distributes VXLAN traffic across multiple physical links. Without this, all overlay traffic between two nodes follows a single path.
- VXLAN operates at Layer 2 inside Layer 3. This means the overlay provides L2 adjacency between pods on different nodes, even though the physical network between those nodes is purely L3 routed. Broadcast, ARP, and multicast all work inside the overlay as if every pod were on the same Ethernet segment.
- Hardware offload (Intel X710, Mellanox ConnectX-5+) handles VXLAN encap/decap entirely in the NIC, including outer checksums and TSO. Check with 'ethtool -k eth0 | grep vxlan' -- look for tx-udp_tnl-segmentation and rx-udp_tnl-segmentation.
Common Mistakes with VXLAN & Overlay Networking
- Mistake: Leaving inner MTU at 1500 when physical MTU is also 1500. Reality: VXLAN adds 50 bytes, so 1500-byte inner frames become 1550 outer frames exceeding physical MTU. DF-set packets are silently dropped. Set inner MTU to 1450, or physical MTU to 9000 and inner to 8950.
- Mistake: Blocking UDP 4789 in the host firewall while expecting overlay networking to work. Reality: all cross-node overlay traffic fails silently. Node-to-node ping still works, making it look like an application bug.
- Mistake: Assuming stale FDB entries clean up automatically. Reality: during restarts or upgrades, entries can point to non-existent VTEPs. Inspect with 'bridge fdb show dev flannel.1' and remove with 'bridge fdb del'.
- Mistake: Using VXLAN multicast mode in the cloud. Reality: AWS, GCP, Azure do not support IP multicast. BUM flooding fails silently. Use unicast mode or let the CNI plugin manage FDB entries directly.
Related Topics
Network Namespaces & veth Pairs, Netfilter & nftables/iptables, Kernel Network Stack, XDP & AF_XDP: Kernel-Bypass Networking, TCP Tuning & Congestion Control
Workqueues & Tasklets — Interrupts & Scheduling
Difficulty: Advanced
Hardirq handlers must finish fast and cannot sleep. Workqueues solve this by deferring heavy processing to kernel threads that run in process context with full sleeping capability. The kernel manages a pool of kworker threads per CPU, automatically scaling concurrency based on how many work items are blocked at any moment.
System Calls for Workqueues & Tasklets
Key Components in Workqueues & Tasklets
- struct work_struct: The basic unit of deferred work. Contains a function pointer and is embedded directly in driver or subsystem structures. Once queued via queue_work(), the function executes on a kworker thread. A work item that is still pending cannot be queued a second time; once execution begins it may be re-queued. A sketch using these structures follows this list.
- struct delayed_work: Wraps work_struct with a timer. queue_delayed_work(wq, dwork, delay) starts a timer that queues the work item after the specified jiffies elapse. Used for periodic housekeeping tasks like link state polling, watchdog timers, and garbage collection sweeps.
- struct workqueue_struct: The workqueue itself, created via alloc_workqueue(). Does not own threads directly in the CMWQ model. Instead it routes work items to per-CPU or unbound worker pools. Flags like WQ_UNBOUND, WQ_HIGHPRI, WQ_MEM_RECLAIM, and WQ_FREEZABLE control scheduling behavior.
- worker_pool / kworker threads: CMWQ maintains shared pools of kworker threads. Per-CPU pools (one normal priority, one high priority) handle bound workqueues. Unbound pools service WQ_UNBOUND workqueues and can run on any CPU. The pool manager spawns new kworker threads when all existing workers are blocked, ensuring forward progress.
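A skeletal module sketch wiring these structures together; my_work_fn stands in for any slow, sleep-capable operation:

```c
/* Private workqueue plus one work item: queue at init, flush and destroy
 * at exit. WQ_MEM_RECLAIM pre-allocates a rescue thread. */
#include <linux/delay.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_wq;

static void my_work_fn(struct work_struct *work)
{
    /* Runs on a kworker in process context: sleeping, mutexes, and
     * memory allocation are all allowed here. */
    msleep(100);   /* stand-in for real deferred work */
}
static DECLARE_WORK(my_work, my_work_fn);

static int __init my_init(void)
{
    /* Private queue so a slow item never stalls system_wq users. */
    my_wq = alloc_workqueue("my_wq", WQ_MEM_RECLAIM, 0);
    if (!my_wq)
        return -ENOMEM;
    queue_work(my_wq, &my_work);   /* returns false if already pending */
    return 0;
}

static void __exit my_exit(void)
{
    flush_workqueue(my_wq);        /* wait for in-flight items -- never call
                                      this from a work item on the same queue */
    destroy_workqueue(my_wq);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```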
Key Points for Workqueues & Tasklets
- Concurrency Managed Workqueues (CMWQ) replaced the old create_workqueue() API. Instead of each workqueue owning dedicated threads, all workqueues share per-CPU kworker pools. The pool manager monitors how many workers are sleeping. If all workers for a CPU are blocked (waiting on I/O, mutexes), it spawns a new kworker to keep other work items flowing. This keeps kworker thread count proportional to actual concurrency needs, not the number of registered workqueues.
- WQ_UNBOUND workqueues do not pin work to the CPU that queued it. The scheduler is free to run the kworker on any CPU, which helps latency-sensitive work avoid head-of-line blocking behind CPU-bound tasks on the originating core. The tradeoff is potential cache misses when the work item accesses data that was hot on the queuing CPU.
- WQ_MEM_RECLAIM guarantees forward progress under memory pressure. Without this flag, a workqueue that needs to allocate memory during low-memory conditions can deadlock if the memory reclaim path itself depends on workqueue execution. Filesystems and block drivers must set this flag. Internally, the kernel pre-allocates a rescue worker thread for each WQ_MEM_RECLAIM workqueue.
- Tasklets are the older bottom-half mechanism. They run in softirq context (TASKLET_SOFTIRQ / HI_SOFTIRQ), cannot sleep, and are serialized per tasklet instance (the same tasklet never runs on two CPUs simultaneously). Different tasklet instances can run in parallel on different CPUs. The kernel community is gradually converting tasklets to workqueues because workqueues provide better concurrency control and debugging.
- alloc_ordered_workqueue() creates a workqueue that processes items strictly one at a time, in FIFO order. This is useful when work items have ordering dependencies (journal commits, firmware command sequences), but creates a bottleneck if items are independent. Always verify whether ordering is actually required before choosing ordered execution.
Common Mistakes with Workqueues & Tasklets
- Using system_wq for long-running or potentially blocking work. The shared system workqueue has limited concurrency, and one slow work item blocks unrelated subsystems. A firmware reset taking 500ms on system_wq delays timer callbacks, RCU processing, and driver state machines across the entire kernel. Allocate a private workqueue for anything that may block for more than a few milliseconds.
- Calling flush_workqueue() or flush_work() from within a work item on the same workqueue. This deadlocks because the flushing work item is waiting for the target work item to complete, but the target is queued behind the flushing item (or the workqueue is ordered). Use separate workqueues for work items that need to wait on each other.
- Queuing a work_struct that is still pending. queue_work() returns false in this case and the new request is silently dropped. Code that needs to ensure a function runs again after the current execution must re-queue from within the work function itself, or use a flag to signal that re-execution is needed.
- Forgetting WQ_MEM_RECLAIM on block or filesystem workqueues. Under memory pressure, the kernel reclaims pages by flushing dirty data through the block layer. If the block driver's completion workqueue cannot make progress because kworker allocation fails, the system deadlocks. The rescue worker mechanism exists specifically to prevent this.
- Assuming tasklets provide parallelism. A single tasklet instance is strictly serialized. Scheduling the same tasklet on multiple CPUs does not make it run in parallel. The second CPU spins or reschedules until the first CPU finishes. For parallel execution of the same function, use per-CPU work items instead.
Related Topics
Interrupt Handling & Softirqs, Process Scheduling (CFS), Timers, Clocks & High-Resolution Timers, Kernel Modules & Device Drivers
XDP & AF_XDP: Kernel-Bypass Networking — Networking & Sockets
Difficulty: Advanced
eBPF programs attached at the NIC driver's NAPI poll -- before sk_buff allocation, before netfilter, before routing touches the packet. Each program gets an xdp_buff with raw packet pointers and returns one of five verdicts: DROP at the driver, PASS to the kernel stack, TX back out the same NIC, REDIRECT to another NIC/CPU/AF_XDP socket, or ABORTED on error. Native mode runs inside the driver for maximum speed; generic mode runs after sk_buff allocation (5-10x slower); offloaded mode runs on NIC hardware itself (zero CPU). AF_XDP brings DPDK-class zero-copy packet I/O through shared UMEM ring buffers.
System Calls for XDP & AF_XDP: Kernel-Bypass Networking
Key Components in XDP & AF_XDP: Kernel-Bypass Networking
- struct xdp_buff / xdp_md: The XDP packet metadata structure passed to the eBPF program; contains pointers to packet data start/end, the ingress interface index, and the rx_queue_index. The eBPF program reads/writes packet data through these pointers
- XDP actions (enum xdp_action): Return values from the XDP program: XDP_PASS (continue to kernel stack), XDP_DROP (discard at NIC driver), XDP_TX (transmit back on the same NIC), XDP_REDIRECT (send to another NIC, CPU, or AF_XDP socket), XDP_ABORTED (error, increment error counter). A minimal program follows this list
- AF_XDP socket (struct xdp_sock): User-space raw socket receiving XDP_REDIRECT'd packets via shared memory ring buffers (UMEM); eliminates the kernel-to-user copy and provides mmap'd completion/receive/transmit/fill rings for zero-copy packet I/O
- UMEM (struct xdp_umem): Contiguous shared memory region between kernel and user space for AF_XDP; divided into fixed-size frames (e.g., 2048 or 4096 bytes). The fill/completion rings manage frame ownership between user and kernel without copying
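A minimal XDP program showing the xdp_buff pointers and verdicts; the drop-IPv6 policy is an arbitrary example. Compile with clang -target bpf and load with something like 'ip link set dev eth0 xdp obj prog.o sec xdp':

```c
/* Parse only the Ethernet header, drop IPv6, pass everything else.
 * The bounds check is mandatory: the verifier rejects the program
 * without it. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int drop_ipv6(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
        return XDP_ABORTED;             /* malformed frame: flag the error */

    if (eth->h_proto == bpf_htons(ETH_P_IPV6))
        return XDP_DROP;                /* discarded in the driver, no sk_buff */

    return XDP_PASS;                    /* continue into the normal stack */
}

char LICENSE[] SEC("license") = "GPL";
```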
Key Points for XDP & AF_XDP: Kernel-Bypass Networking
- XDP runs at the earliest possible point in the stack: inside the NIC driver's NAPI poll, before sk_buff allocation, before netfilter, before routing. This skips 90% of per-packet kernel overhead, enabling 10M+ packets/sec DROP rates on a single core.
- XDP programs are eBPF -- verified by the kernel to be safe. They cannot crash the kernel, access arbitrary memory, or loop infinitely. This is what makes XDP production-ready, unlike kernel modules which can take down the entire system.
- Three modes: native (in the NIC driver, fastest, needs driver support), generic (after sk_buff allocation, works everywhere, 5-10x slower), and offloaded (on NIC hardware, only Netronome SmartNICs, zero CPU).
- AF_XDP delivers DPDK-like performance without leaving the kernel. The NIC DMAs packets into shared UMEM frames, the XDP program redirects them to the AF_XDP socket, and user space reads from the ring buffer. Zero copies. Line rate on 25+ Gbps NICs.
- XDP vs DPDK: XDP coexists with the kernel stack (XDP_PASS falls through), doesn't require dedicated NICs, and uses standard tools (ip, ethtool). DPDK has lower per-packet latency but takes over the NIC entirely. Choose XDP unless you need sub-microsecond per-packet latency.
Common Mistakes with XDP & AF_XDP: Kernel-Bypass Networking
- Mistake: expecting sk_buff fields in XDP programs. Reality: XDP operates on raw packet data (xdp_buff), not sk_buff. No socket info, no conntrack state, no pre-parsed headers. The BPF program must parse everything manually.
- Mistake: using jumbo frames without multi-buffer support. Reality: XDP originally only handled single-buffer packets. Multi-buffer support arrived in kernel 6.0. Without it, jumbo frames fall back to generic (slow) processing.
- Mistake: using generic XDP and expecting native performance. Reality: generic XDP runs after sk_buff allocation. It doesn't skip the expensive part. Performance is 5-10x worse than native. Always use native mode with a supported driver.
- Mistake: not pinning AF_XDP threads to the correct CPU/RX-queue. Reality: XDP programs run on the CPU handling the NIC's RX queue interrupt. The AF_XDP socket must be bound to the same queue. Without CPU pinning, cross-CPU access to shared maps causes cache bouncing.
Related Topics
Netfilter & nftables/iptables, Network Namespaces & veth Pairs, Zero-Copy Networking (sendfile, splice), TCP Tuning & Congestion Control
Zero-Copy Networking (sendfile, splice) — Networking & Sockets
Difficulty: Advanced
sendfile() moves data from a file fd to a socket fd entirely inside the kernel, skipping user-space buffers completely. With scatter-gather DMA, page cache pages go straight into the NIC's DMA descriptor ring -- zero CPU copies for the payload. splice() generalizes this to any two fds through a pipe buffer holding page references rather than data. tee() duplicates pipe data without consuming it, enabling fan-out. MSG_ZEROCOPY (kernel 4.14+) pins user-space pages directly into the NIC's DMA ring, with completions arriving on the socket error queue. vmsplice maps user pages into a pipe buffer with no copy at all.
System Calls for Zero-Copy Networking (sendfile, splice)
- sendfile
- splice
- tee
- vmsplice
Key Components in Zero-Copy Networking (sendfile, splice)
- sendfile (do_sendfile): Transfers data from a file fd to a socket fd entirely within the kernel; maps page cache pages into the socket buffer without copying data to user space. A single syscall replaces the read()+write() loop
- splice (do_splice): Transfers data between any two fds via a pipe buffer; one fd must be a pipe. The pipe buffer holds references to pages (not copies), enabling zero-copy between sockets, files, and pipes
- struct pipe_buffer: Ring buffer entry in a pipe; holds a reference to a struct page (page cache or anonymous), offset, and length. splice moves page references between the pipe and fds, avoiding data copying
- MSG_ZEROCOPY (SO_ZEROCOPY): Socket option enabling true zero-copy send from user-space buffers; the kernel maps user pages into the NIC's DMA descriptor ring, notifying via the error queue when the pages can be freed. Avoids even the user-to-kernel copy
Key Points for Zero-Copy Networking (sendfile, splice)
- Traditional read()+write() copies data four times (two DMA copies, two CPU copies) across two syscalls. sendfile() cuts this to a single syscall with no user-space copies at all; with scatter-gather DMA (next point) even the remaining kernel-internal copy disappears. A sketch follows this list.
- With scatter-gather DMA (most modern NICs), sendfile achieves true zero CPU copy: page cache pages go directly into the NIC's DMA descriptor ring. Only TCP/IP headers are copied, not the payload. Check support with ethtool -k <iface> | grep scatter.
- splice() is more powerful than sendfile -- it works between any two fds as long as one is a pipe. For socket-to-socket proxying (reverse proxy), splice from socket A into a pipe, then from the pipe to socket B. The actual data bytes never touch your process's address space.
- tee() duplicates pipe data without consuming it -- the source pipe's data remains available for another splice. This enables fan-out patterns where one input stream is forwarded to multiple destinations.
- MSG_ZEROCOPY (kernel 4.14+) eliminates even the user-to-kernel copy for send(). The kernel pins user-space pages and DMAs directly from them. But the overhead of page pinning makes it worthwhile only for sends above ~10 KB.
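A minimal sketch of the sendfile() loop from the points above; file_fd and sock_fd are assumed to be open, and short transfers are handled by looping (the helper name is ours):

```c
/* Stream an entire regular file to a socket without the payload ever
 * entering user space. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

ssize_t send_whole_file(int sock_fd, int file_fd)
{
    struct stat st;
    off_t off = 0;

    if (fstat(file_fd, &st) < 0)
        return -1;
    while (off < st.st_size) {
        /* The kernel advances off; like write(), this can be short. */
        ssize_t n = sendfile(sock_fd, file_fd, &off, st.st_size - off);
        if (n < 0)
            return -1;   /* production code would retry on EINTR/EAGAIN */
    }
    return off;
}
```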
Common Mistakes with Zero-Copy Networking (sendfile, splice)
- Mistake: Trying to use sendfile() for file-to-file copies. Reality: On Linux < 2.6.33, sendfile requires the output fd to be a socket. For file-to-file copies, use splice() with a pipe as intermediate, or copy_file_range() (Linux 4.5+).
- Mistake: Assuming sendfile works with all input types. Reality: sendfile requires a regular file (or mmap-able) input. It does not work with pipes or sockets as input. For socket-to-socket transfer, use splice().
- Mistake: Not looping on splice with SPLICE_F_NONBLOCK. Reality: Like read/write, splice may transfer fewer bytes than requested. Always loop until the desired count or EOF.
- Mistake: Using MSG_ZEROCOPY for small sends. Reality: Page pinning, reference counting, and error queue notification overhead exceeds the copy cost for payloads under ~10 KB. Only use it for bulk transfers.
Related Topics
mmap & Memory-Mapped Files, Socket Programming (TCP/UDP), epoll & I/O Multiplexing, TCP Tuning & Congestion Control