Kernel Data Structures
Mental Model
Hospital supply room, three systems in parallel. Every syringe, bandage, and IV bag has a built-in clip that hooks onto the cart rail -- no separate box needed (intrusive list). The medicine shelf is sorted by expiration date; the nurse grabs the soonest-to-expire item without searching (red-black tree). Each nurse station keeps its own tally sheet -- nobody walks to a central counter (per-CPU variable). The policy board outside can be read by anyone at any time, but only the administrator can swap it, and the old one stays up until every reader walks away (RCU).
The Problem
Four million objects on a busy server -- a task_struct per process, a struct page per 4 KB frame, a dentry per cached path, a socket per connection. malloc fragments after millions of alloc/free cycles. STL-style containers add per-node allocation overhead and cache misses. Global counters updated 10 million times per second across 64 CPUs lose 40-60 ns per atomic update to cache-line bouncing. And the kernel cannot link against userspace libraries, cannot afford lock contention on read paths, and must work inside interrupt context where sleeping is forbidden.
Architecture
The Linux kernel manages millions of objects -- processes, files, sockets, memory pages -- and it does it without libc, without malloc, without any standard C++ container.
It built its own. And these data structures are some of the most elegant engineering in systems programming.
The patterns covered here -- intrusive lists, red-black trees, lock-free reads, per-CPU counters -- are not just kernel curiosities. They appear in every high-performance codebase that takes cache lines and concurrency seriously.
What Actually Happens
The intrusive linked list. The kernel's most fundamental structure is struct list_head -- just two pointers (next, prev) embedded directly inside the data structure. No separate node allocation. No wrapper. The container_of() macro uses offsetof() to navigate from the embedded list_head back to the enclosing struct.
This has three huge advantages. First, adding to a list costs zero allocations. Second, a single struct can be on multiple lists simultaneously by embedding multiple list_heads -- task_struct has over 10 of them. Third, cache locality is excellent because the list node sits right next to the data.
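As a rough illustration, the sketch below embeds a list_head in a hypothetical struct request (the struct, function names, and queue are made up; list_add_tail(), list_del(), and list_for_each_entry_safe() are the real helpers from <linux/list.h>, and the iterator uses container_of() internally):

```c
#include <linux/list.h>
#include <linux/slab.h>

/* Hypothetical example object: the list node lives inside it. */
struct request {
	int id;
	struct list_head queue_node;	/* hooks this request onto a queue */
};

static LIST_HEAD(pending_queue);	/* an empty list head */

static void enqueue(struct request *req)
{
	/* No separate node allocation -- queue_node is part of *req. */
	list_add_tail(&req->queue_node, &pending_queue);
}

static void drain(void)
{
	struct request *req, *tmp;

	/* The _safe variant pre-fetches the next pointer, so entries
	 * can be deleted while iterating. The macro uses container_of()
	 * to go from the embedded queue_node back to struct request. */
	list_for_each_entry_safe(req, tmp, &pending_queue, queue_node) {
		list_del(&req->queue_node);
		kfree(req);
	}
}
```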
The red-black tree. CFS stores all runnable tasks in an rb_tree keyed by virtual runtime (vruntime). The leftmost node -- the task with the least vruntime -- is cached in rb_leftmost for O(1) access. That is the next task to run. The VMA subsystem uses an rb_tree to store memory mappings sorted by address, enabling efficient page fault resolution.
One unusual design choice: the kernel's rb_tree does not self-rebalance automatically. Callers must invoke rb_insert_color() after insertion. This gives callers control over when the rebalancing work happens, which matters in interrupt and RCU contexts.
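A minimal sketch of that caller-driven insert: the descent and comparison are written by the caller, and rebalancing only happens when rb_insert_color() is invoked. struct item and its key are hypothetical; rb_entry(), rb_link_node(), and rb_insert_color() are the real <linux/rbtree.h> interfaces.

```c
#include <linux/rbtree.h>
#include <linux/types.h>

struct item {
	u64 key;
	struct rb_node node;	/* embedded, like list_head */
};

static struct rb_root tree = RB_ROOT;

static void item_insert(struct item *new_item)
{
	struct rb_node **link = &tree.rb_node, *parent = NULL;

	/* The caller performs the ordered descent itself ... */
	while (*link) {
		struct item *cur = rb_entry(*link, struct item, node);

		parent = *link;
		link = (new_item->key < cur->key) ? &(*link)->rb_left
						  : &(*link)->rb_right;
	}

	/* ... then links the node and explicitly requests rebalancing. */
	rb_link_node(&new_item->node, parent, link);
	rb_insert_color(&new_item->node, &tree);
}
```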
The xarray (radix tree). The page cache maps (inode, page_offset) pairs to struct page pointers via xarray. Lookups are O(depth), typically 3-4 levels for a 64-bit key. The critical feature: RCU-safe lockless reads. Multiple CPUs can look up cached pages simultaneously without any lock contention. Kafka's sendfile() performance depends directly on this.
Per-CPU variables. DEFINE_PER_CPU() creates separate copies for each CPU, accessed via this_cpu_read() / this_cpu_write(). Each CPU reads from its own cache line without invalidating others. No locks. No cache-line bouncing. This is how the kernel maintains packet counts, scheduling statistics, and memory allocation counters at scale.
Under the Hood
RCU makes reads essentially free. Readers call rcu_read_lock() / rcu_read_unlock() to mark their critical section. On non-preempt kernels, these compile to literally nothing. Writers create a modified copy, atomically swap the pointer (rcu_assign_pointer), and wait for a grace period -- until all readers have finished -- before freeing the old version. This makes reads lock-free and wait-free at the cost of more complex writes.
The slab allocator replaces malloc. The kernel uses pre-allocated pools of fixed-size objects. kmem_cache_create("task_struct", ...) creates a cache, and kmem_cache_alloc() / kmem_cache_free() allocate from it. Pre-allocation eliminates fragmentation and keeps hot objects in CPU cache. All active slab caches are visible in /proc/slabinfo.
Hash tables save memory with singly-linked lists. hlist_head / hlist_node use singly-linked chains per bucket, saving 8 bytes per bucket compared to doubly-linked list_head. With millions of buckets in the PID hash, dentry cache, and inode hash, that savings is significant. Bucket counts are determined at boot based on available memory.
Common Questions
Why intrusive lists instead of wrapper containers?
Three reasons. (1) No separate allocation -- the list node lives inside the struct. (2) A struct can be on multiple lists by embedding multiple list_heads. (3) Cache locality -- the node is adjacent to its data, reducing misses during traversal. The tradeoff is type safety: container_of() uses void pointer casts, and the compiler cannot verify the correct member is being used.
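A simplified sketch of the macro shows the offsetof() arithmetic; the real version in include/linux/container_of.h additionally emits a compile-time type-mismatch check:

```c
/* Simplified sketch of container_of(): subtract the member's offset
 * from the member's address to recover the enclosing struct. */
#define container_of(ptr, type, member) ({			\
	void *__mptr = (void *)(ptr);				\
	((type *)(__mptr - offsetof(type, member))); })
```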
How does RCU achieve zero overhead reads on non-preempt kernels?
On non-preempt kernels, rcu_read_lock() is a no-op. Grace period detection relies on context switches: if a CPU has gone through a context switch or is idle, it must have exited any RCU critical section. The kernel tracks quiescent states per CPU. Once all CPUs pass one, callbacks (kfree, etc.) can run.
When to use a red-black tree vs a hash table?
An rb_tree is the right choice for ordered iteration, range queries, or min/max. CFS needs the leftmost node. VMA lookup needs the closest match to a faulting address. A hash table wins when only exact-key lookup is needed and ordering does not matter. PID lookup, dentry cache, and socket demux are all hash tables.
How Technologies Use This
Kafka is serving 800K messages per second to hundreds of consumers, but throughput plateaus and tail latency climbs. Profiling shows the bottleneck is not in Kafka's application code but in kernel page cache lookups during sendfile() -- the step where, without lock-free data structures, concurrent consumers would serialize behind shared locks.
The aha moment is that every sendfile() call must locate cached log segment pages, and without lock-free data structures this lookup serializes all consumers into a queue. The xarray (radix tree) maps file offsets to page pointers using RCU-safe lockless reads, meaning hundreds of concurrent consumer threads never wait on a shared lock. The hlist-based socket hash table resolves each of Kafka's thousands of connections in O(1) average time.
The fix is ensuring log segment files stay in the page cache by sizing the OS page cache appropriately and avoiding competing I/O workloads on the same host. Without these kernel-level data structures working efficiently, Kafka's zero-copy path bottlenecks on page cache lookups, adding 2-5us of latency per fetch request.
PostgreSQL is running 500+ connections and the system slows to a crawl. Each new connection forks a process, and at high connection counts the kernel overhead of allocating process structures and resolving page faults starts dominating query latency.
The key insight is that fork() allocates a task_struct from the slab cache in under 1us, avoiding the fragmentation that a general-purpose allocator would create over millions of forks. When a backend process touches a shared buffer page, the kernel walks the VMA red-black tree to resolve the page fault in O(log n) time -- roughly five or six comparisons for a process with 50 mapped regions. The dcache hash table gives O(1) lookups when PostgreSQL stat()s WAL segment files thousands of times per second.
Together these kernel data structures keep per-query kernel overhead under 10us even at high connection counts. The takeaway is that PostgreSQL's fork-per-connection model works not because forking is cheap by accident, but because the kernel's slab allocator, red-black trees, and hash tables were specifically designed to make it cheap.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Intrusive list pattern | N/A (kernel-internal) | N/A (GC manages nodes) | N/A | container/list (non-intrusive) | N/A (relies on kernel) |
| Balanced tree | N/A | TreeMap (red-black tree) | N/A (Map is hash-based) | btree package (B-tree) | N/A |
| Concurrent read-optimized map | N/A | ConcurrentHashMap | N/A | sync.Map (read-heavy) | Informer cache (read-optimized) |
| Per-CPU / per-thread counters | N/A | LongAdder / Striped64 | N/A (single-threaded) | atomic with sharding | Prometheus per-pod counters |
| Object pool / slab allocator | N/A | TLAB (Thread-Local Allocation Buffer) | Buffer pool (Buffer.allocUnsafe) | sync.Pool | N/A |
Stack Layer Mapping
| Layer | Data Structure Role |
|---|---|
| Hardware | Cache lines (64B) dictate struct layout and per-CPU variable alignment |
| Kernel core | list_head, rb_tree, xarray, hlist, per-CPU vars, RCU |
| Slab allocator | Pre-allocated fixed-size object pools (task_struct, inode, dentry) |
| Subsystems | CFS uses rb_tree, page cache uses xarray, networking uses hlist |
| /proc interface | /proc/slabinfo, /proc/sys/fs/* expose kernel DS statistics |
| Userspace | Interacts indirectly -- every syscall traverses these structures |
Design Rationale
Intrusive lists avoid per-node heap allocation entirely -- critical when the allocator itself might fragment or when code runs in interrupt context where sleeping is off limits. Red-black trees won over AVL trees because they need fewer rotations per insert, and inserts land in scheduler hot paths where every cycle counts. Per-CPU variables solve a scaling problem that locks cannot: atomic operations on a shared counter bounce the cache line to every core, and contention grows linearly with CPU count. Sixty-four cores means sixty-four times the traffic. Giving each core its own copy eliminates the bouncing entirely.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| High dentry cache memory usage (GBs) | Large working set of file paths, normal behavior | cat /proc/sys/fs/dentry-state and check nr_unused |
| Slab memory growing unbounded | Slab cache not shrinking under memory pressure | slabtop -o to identify largest caches, check shrinkers |
| Scheduler latency spikes | Rb-tree rebalancing under high task churn | perf sched latency to measure scheduling delay |
| Page cache lookup contention | xarray lock contention under heavy parallel I/O | perf lock or lockstat for xa_lock contention |
| Per-CPU counter drift / inaccuracy | Reading per-CPU sum without accounting for all CPUs | Verify summing across all CPUs with for_each_possible_cpu |
| Hash table performance degradation | Too few buckets or poor hash distribution | Check bucket chain lengths via crash tool or eBPF |
When to Use / Avoid
Use when:
- Debugging kernel performance issues where data structure choice drives latency
- Writing kernel modules or eBPF programs that interact with kernel lists and trees
- Understanding why CFS scheduling is O(1) for pick-next and O(log n) for insert
- Tuning slab allocator caches for workloads with millions of short-lived objects
- Analyzing page cache performance under concurrent I/O (xarray/RCU behavior)
Avoid when:
- Writing userspace code (use standard library containers unless profiling proves otherwise)
- The system has low concurrency and standard locking is sufficient
- Debugging application-level performance (start with strace/perf before diving into kernel internals)
Try It Yourself
# Show kernel slab allocator statistics (object caches)
sudo cat /proc/slabinfo | head -20

# List kernel symbols related to key data structures (addresses, not struct sizes)
sudo cat /proc/kallsyms | grep -E '(task_struct|list_head|rb_root)' | head -10

# View memory layout of kernel structures (requires dwarves/pahole)
pahole -C task_struct /usr/lib/debug/boot/vmlinux-$(uname -r) 2>/dev/null | head -30 || echo 'pahole/debug symbols not installed'

# Show dentry cache and inode cache statistics
cat /proc/sys/fs/dentry-state 2>/dev/null && cat /proc/sys/fs/inode-state 2>/dev/null

# Monitor slab usage sorted by size
sudo slabtop -o 2>/dev/null | head -20 || echo 'slabtop not available'

# Count CPUs to understand per-CPU variable layout
ls /sys/devices/system/cpu/ | grep -E '^cpu[0-9]' | wc -l && echo 'CPUs available'
Debug Checklist
1. sudo cat /proc/slabinfo | head -20 -- show kernel slab cache statistics
2. sudo slabtop -o | head -20 -- monitor slab usage sorted by size
3. cat /proc/sys/fs/dentry-state -- dentry cache usage (nr_dentry, nr_unused)
4. cat /proc/sys/fs/inode-state -- inode cache usage
5. perf stat -e cache-misses,cache-references -- measure cache-line efficiency
6. pahole -C task_struct vmlinux | head -30 -- show struct layout and padding
Key Takeaways
- ✓ container_of() is the kernel's most important macro. It uses offsetof-based pointer arithmetic to go from an embedded list_head to the containing struct. This lets one struct live on multiple lists at once -- task_struct has over 10 list_head members.
- ✓ The kernel's red-black tree is intentionally NOT self-rebalancing. Callers must call rb_insert_color() after insertion. This gives the kernel control over when rebalancing happens -- critical in interrupt and RCU contexts.
- ✓ Hash tables use singly-linked lists (hlist_head/hlist_node) instead of doubly-linked, halving memory per bucket. With millions of buckets in the PID hash, dentry cache, and inode hash, those 8 saved bytes add up fast.
- ✓ RCU makes reads free. Readers take no lock -- they just call rcu_read_lock(). Writers create a modified copy and atomically swap the pointer, waiting for all readers to finish before freeing the old version. On non-preempt kernels, rcu_read_lock() compiles to nothing.
- ✓ The bitmap API (include/linux/bitmap.h) manages sets of flags using unsigned long arrays with bit manipulation. cpumask is a thin wrapper used for CPU affinity, IRQ balancing, and NUMA topology.
Common Pitfalls
- ✗ Mistake: Iterating a list with list_for_each_entry() while modifying it. Reality: This corrupts list pointers. Use list_for_each_entry_safe() which pre-fetches the next pointer, or RCU list iteration for lock-free reads.
- ✗ Mistake: Forgetting to initialize list_head with INIT_LIST_HEAD(). Reality: Uninitialized list_heads contain garbage pointers and cause an immediate oops on first list operation.
- ✗ Mistake: Using per-CPU variables without disabling preemption. Reality: If the thread migrates between get_cpu_var() and put_cpu_var(), it accesses the wrong CPU's data. this_cpu_read()/this_cpu_write() are preemption-safe for simple scalars.
- ✗ Mistake: Assuming O(1) hash table lookups regardless of load. Reality: The kernel uses chaining, so a bad hash or too few buckets degrades to O(n). The dentry and inode hashes use jhash (Jenkins hash) for good distribution.
Reference
In One Line
Intrusive lists for zero-alloc membership, rb-trees for ordered access, per-CPU variables to kill cross-core contention -- learn these three and most kernel code stops looking alien.