Page Tables & TLB
Mental Model
A massive four-floor library. The card catalog says "Floor 3, Section B, Shelf 7, Slot 12" -- four lookups before reaching the book. But the front desk keeps a tiny pad of sticky notes with recent locations. If someone asked for the same book a moment ago, one glance at the sticky note is enough. The pad only holds about 64 notes, though. Once it fills up, old notes get tossed, and the next lookup means walking all four floors again.
The Problem
A database benchmark plateaus at 70% of expected throughput. CPU and memory look fine. Profiling reveals 15% of cycles burned on dTLB-load-misses -- the hardware is walking four levels of page tables because the TLB (64-1024 entries depending on the CPU) cannot cover a 32 GB working set. The application code is not the problem; address translation is. On a 128-core machine, a single mprotect() call fires a TLB shootdown IPI to every core, stalling the entire machine for microseconds.
Architecture
The CPU just accessed a memory address. Simple, right?
Not even close. Before that access can reach physical RAM, the hardware has to translate the virtual address into a physical one. It checks a tiny cache called the TLB. If the translation is there -- one cycle, done. If not, the CPU walks a four-level tree structure stored in memory, chasing pointers through four sequential reads, each potentially a cache miss.
This happens on every single memory access. And it is the reason that database is slower than it should be.
What Actually Happens
On x86-64, the MMU translates every virtual address through this sequence:
The fast path: The TLB holds the translation. One cycle. Cost is essentially zero. This is what happens 99%+ of the time for well-behaved workloads.
The slow path: TLB miss. The CPU reads the CR3 register to find the top-level page table (PGD), then walks four levels. Each level is a 4 KB page with 512 eight-byte entries. Nine bits of the virtual address index into each level:
- Bits 47-39 index the PGD (512 entries)
- Bits 38-30 index the PUD (512 entries)
- Bits 29-21 index the PMD (512 entries)
- Bits 20-12 index the PTE (512 entries)
- Bits 11-0 are the offset within the 4 KB page
The final PTE contains the physical frame number plus flags: present, read/write, user/supervisor, NX, accessed, dirty. Add the 12-bit offset to the frame number, and the result is the physical address.
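That bit-slicing is easy to sanity-check in a few lines. A minimal Python sketch (the example address is arbitrary; only the shift amounts and masks matter):

```python
# Minimal sketch: split an x86-64 virtual address into the four 9-bit table
# indices and the 12-bit page offset described above.
def split_virtual_address(vaddr: int) -> dict:
    return {
        "pgd_index": (vaddr >> 39) & 0x1FF,  # bits 47-39
        "pud_index": (vaddr >> 30) & 0x1FF,  # bits 38-30
        "pmd_index": (vaddr >> 21) & 0x1FF,  # bits 29-21
        "pte_index": (vaddr >> 12) & 0x1FF,  # bits 20-12
        "offset":    vaddr & 0xFFF,          # bits 11-0
    }

def physical_address(frame_number: int, vaddr: int) -> int:
    # Frame number from the final PTE, shifted up 12 bits, plus the offset.
    return (frame_number << 12) | (vaddr & 0xFFF)

print(split_virtual_address(0x7FF0_1234_5678))  # arbitrary example address
```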
With 200 ns DRAM latency, a worst-case TLB miss (all four levels in DRAM) costs 800 ns. CPU caches usually absorb intermediate page table pages, bringing typical cost to 10-30 ns. Still, that is 10-30x slower than a TLB hit.
Under the Hood
TLB coverage is the real constraint. The L1 data TLB on modern Intel holds 64 entries for 4 KB pages. That covers just 256 KB. If the working set is 4 GB -- a modest database -- 1 million TLB entries are needed but only 64 exist. Every random access to a cold page triggers a full walk.
With 2 MB huge pages, those same 64 entries cover 128 MB. That is 512x more coverage. This single change is why PostgreSQL with huge_pages=on runs measurably faster on large shared buffers.
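The coverage arithmetic, as a quick sketch (assuming the 64-entry L1 dTLB and the 4 GB working set from the paragraphs above):

```python
# How much memory 64 TLB entries cover, and how many entries a 4 GB working
# set would actually need, at 4 KB versus 2 MB page sizes.
ENTRIES = 64
working_set = 4 * 1024**3

for name, page in (("4 KB", 4 * 1024), ("2 MB", 2 * 1024**2)):
    coverage = ENTRIES * page
    needed = working_set // page
    print(f"{name} pages: {coverage // 1024:>8} KB covered, {needed:>9} entries needed")
```

Output: 256 KB of coverage versus roughly a million entries needed with 4 KB pages; 128 MB of coverage versus 2,048 entries with 2 MB pages.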
TLB shootdowns are the hidden cost of multi-core. When the kernel modifies a page table entry (via munmap, mprotect, page migration), it must invalidate the TLB entry on every CPU that might have cached it. This requires an IPI (inter-processor interrupt) to each target CPU. The target stops what it is doing, runs the invalidation handler, and acknowledges. The initiating CPU spins until all acks arrive.
On a 128-core machine, a single munmap() can stall for over 100 microseconds. Linux mitigates this with lazy TLB mode (idle CPUs skip the flush) and batched invalidation via struct mmu_gather.
PCID changed the context switch game. Before PCID (Process Context Identifier), every context switch flushed the entire TLB because CR3 was reloaded. With PCID (Linux 4.14+), TLB entries are tagged with a 12-bit address space ID, allowing entries from other processes to survive context switches.
This became critical after Meltdown. KPTI (Kernel Page Table Isolation) maintains separate page tables for user and kernel mode, switching CR3 on every syscall. Without PCID, that would flush the TLB twice per syscall. With PCID, the entries persist, limiting the overhead to about 5% on syscall-heavy workloads.
The PTE's dirty and accessed bits work for free. The CPU sets these bits in hardware on every read (accessed) and write (dirty). The kernel's page reclaim (kswapd) periodically clears the accessed bit, then checks later. Pages whose bit stayed clear have not been touched -- they are cold and can be evicted. The dirty bit tells the kernel which pages must be written back to disk before evicting. All of this happens without any software instrumentation.
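A toy model of that clear-and-recheck loop may make it concrete. This is illustration only, not kernel code; the page dictionary and the touch() helper are invented stand-ins for what the MMU and kswapd actually do:

```python
# Toy model: hardware sets "accessed" on reads and "dirty" on writes;
# reclaim clears "accessed", waits, and treats still-clear pages as cold.
pages = {addr: {"accessed": True, "dirty": False} for addr in range(10)}

def touch(addr, write=False):       # stand-in for what the MMU does for free
    pages[addr]["accessed"] = True
    if write:
        pages[addr]["dirty"] = True

for p in pages.values():            # reclaim pass 1: clear the accessed bits
    p["accessed"] = False

touch(3)                            # workload reads page 3...
touch(7, write=True)                # ...and writes page 7

cold = [a for a, p in pages.items() if not p["accessed"]]
dirty = [a for a, p in pages.items() if p["dirty"]]
print("cold, evictable:", cold)
print("must be written back first:", dirty)
```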
Common Questions
How much memory do page tables actually consume?
Each PTE page maps 512 x 4 KB = 2 MB. For a process with 1 GB RSS, the PTE level needs about 512 pages (2 MB). Add the upper levels and it is roughly 2 MB of page tables per GB mapped. But here is the catch: sparse mappings are brutal. A single 4 KB page at a random address still needs one page at each level -- 16 KB of page table memory for one 4 KB data page.
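A rough sketch of that arithmetic (assuming 4 KB pages, 8-byte entries, and 512-entry tables, as above):

```python
# Approximate page table cost for one contiguous, fully populated mapping.
PAGE, FANOUT = 4096, 512

def dense_overhead(mapped_bytes: int) -> int:
    pages = mapped_bytes // PAGE
    total = 0
    for _level in ("PTE", "PMD", "PUD", "PGD"):
        pages = -(-pages // FANOUT)     # ceiling division: tables needed at this level
        total += pages * PAGE
    return total

print(dense_overhead(1 * 1024**3) // 1024, "KB for a dense 1 GB mapping")
# The sparse worst case: one 4 KB page at an otherwise unused address still
# needs a table at every level, so 4 x 4 KB = 16 KB of metadata for 4 KB of data.
```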
What is a TLB shootdown and why should I care?
When one CPU modifies a PTE, all CPUs with cached translations for that address must flush them. This requires inter-processor interrupts, which force target CPUs to stop, invalidate, and acknowledge. The initiator waits for everyone. On NUMA systems, cross-node IPIs are even slower. A single poorly-timed munmap() on a many-core machine can create a latency spike visible to end users.
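The cheapest way to see shootdown volume is the TLB row in /proc/interrupts. A minimal sketch that sums it (run it before and after a suspect workload and compare the totals):

```python
# Sum the per-CPU TLB-shootdown interrupt counts from /proc/interrupts.
with open("/proc/interrupts") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == "TLB:":
            counts = [int(tok) for tok in fields[1:] if tok.isdigit()]
            print(f"{sum(counts)} TLB shootdown IPIs across {len(counts)} CPUs")
```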
How does KPTI work after Meltdown?
KPTI maintains two page table hierarchies per process: one with full kernel mappings (kernel mode) and one with only minimal kernel entry points (user mode). Every syscall entry switches to the full table; every return switches back. Without PCID, this flushes the TLB twice per syscall. With PCID, entries persist across the switch, but the CR3 writes still cost about 100 cycles each.
Why does sequential access not suffer from TLB misses but random access does?
Sequential access hits the same TLB entry for 4096/element_size consecutive accesses. Hardware TLB prefetchers can also predict the next page. Random access across a range larger than TLB coverage (256 KB with 64 entries and 4 KB pages) causes a miss on nearly every access, each requiring a full four-level walk.
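A rough way to see the gap yourself: the sketch below touches a 1 GB buffer either sequentially or in random page positions. Saved here as tlb_demo.py (the name is arbitrary), run each mode under perf stat -e dTLB-load-misses and compare; interpreter overhead adds noise, but the difference should still be stark. Assumes roughly 1 GB of free RAM:

```python
# Usage: perf stat -e dTLB-load-misses python3 tlb_demo.py [seq|rand]
import random
import sys

SIZE = 1 << 30            # 1 GB buffer, far beyond 64-entry TLB coverage
N = 10_000_000            # single-byte accesses per run
buf = bytearray(SIZE)

mode = sys.argv[1] if len(sys.argv) > 1 else "seq"
rng = random.Random(42)
total = 0
if mode == "seq":
    for i in range(N):
        total += buf[i]                    # consecutive bytes: ~1 TLB miss per 4096 accesses
else:
    for _ in range(N):
        total += buf[rng.randrange(SIZE)]  # random byte: a miss on nearly every access
print(mode, total)
```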
How Technologies Use This
A 32 GB JVM heap loses 12-15% of CPU time to address translation even though cache hit rates look healthy. Profiling with perf stat reveals millions of dTLB-load-misses per second, and GC pauses consistently overshoot their target by 2-3x.
The heap spans 8 million 4 KB pages, but the L1 dTLB holds only 64 entries covering 256 KB. Every object allocation, field access, or GC pointer chase outside that tiny window triggers a four-level page table walk costing 10-30 ns. The G1 collector compounds the problem because each mprotect call during a pause fires TLB shootdown IPIs to every core running the 128 application threads, adding 50-100 us of stall per pause.
Enable -XX:+UseLargePages to map the heap with 2 MB pages, collapsing 8 million entries to 16,384 and boosting GC throughput by 10-20%. For workloads where mprotect-based barriers are the bottleneck, ZGC sidesteps the issue entirely by remapping pages concurrently without mprotect.
Same Concept Across Tech
| Technology | How page tables affect it | Key insight |
|---|---|---|
| JVM | Large heaps (32+ GB) cause high TLB miss rates. UseCompressedOops shrinks the heap and with it the number of pages that need mapping | -XX:+UseLargePages gives each TLB entry 512x more coverage |
| PostgreSQL | shared_buffers as huge pages reduces page table overhead for 200+ backend processes | Each backend otherwise maintains its own page table entries for the same pages |
| Redis | fork() for BGSAVE copies page tables (not data). 30 GB process = ~60 MB page table copy | latest_fork_usec measures this cost directly |
| Docker | Container processes have their own page tables in their own mm_struct | Shared libraries across containers share physical pages but have separate PTEs |
| Go | The runtime returns idle heap pages to the OS with madvise, which forces TLB invalidation on those ranges | GOGC tuning affects how often the heap grows and is scavenged back |
Stack layer mapping (TLB miss debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Working set size vs TLB coverage? Random vs sequential access? | Application memory profiler |
| Virtual memory | Can huge pages reduce page table levels? | perf stat with/without huge pages |
| TLB | dTLB miss rate? TLB shootdown frequency? | perf stat -e dTLB-load-misses, tlb:tlb_flush |
| Kernel | Page table memory overhead? PCID enabled? | VmPTE in /proc/PID/status, /proc/cpuinfo |
| Hardware | TLB size and associativity? STLB? | cpuid tool, CPU architecture manual |
Design Rationale
A flat page table for a 48-bit address space would need 512 GB of metadata per process -- obviously impossible. The four-level radix tree trades slower lookups (four sequential memory reads) for space efficiency, since only populated address ranges actually allocate page table pages. Even the optimized walk is too slow to run on every memory access, which is where the TLB comes in: caching recent translations drops the amortized cost to near zero. Huge pages arrived because 64 TLB entries covering 256 KB is laughably small for a database with a multi-gigabyte working set. Those same 64 entries covering 128 MB with 2 MB pages change the math completely.
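The 512 GB figure is straightforward arithmetic:

```python
# A flat table needs one 8-byte entry per 4 KB page of the full 48-bit
# address space, whether or not the page is ever mapped.
flat_entries = 2**48 // 4096        # 2**36 entries
flat_bytes = flat_entries * 8       # 2**39 bytes
print(flat_bytes // 2**30, "GB")    # -> 512

# The radix tree only allocates tables for populated ranges: a dense 1 GB
# mapping costs roughly 2 MB (see the earlier overhead sketch).
```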
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| 10-15% CPU spent on dTLB-load-misses | Working set exceeds TLB coverage with 4 KB pages | perf stat -e dTLB-load-misses, consider huge pages |
| Latency spikes on mprotect/munmap calls | TLB shootdown: IPI sent to all cores to flush stale entries | perf stat -e tlb:tlb_flush, reduce mapping changes |
| VmPTE in /proc/PID/status is 200+ MB | Process has millions of page table entries (very large address space) | Consider huge pages to reduce PTE count by 512x |
| fork() takes 5-10 ms for a large process | Page table copying is proportional to PTE count, not data size | Check VmPTE size, use huge pages to reduce it |
| Performance degrades on NUMA machines | TLB entries may reference remote memory, adding latency | numactl --hardware, check memory placement |
| Context switch latency higher than expected | TLB flush on switch if PCID not available | Check PCID support: grep pcid /proc/cpuinfo |
When to Use / Avoid
Relevant when:
- Diagnosing high dTLB-load-misses in perf stat (hidden CPU overhead)
- Understanding why huge pages improve performance (fewer page table levels, fewer TLB entries needed)
- Debugging TLB shootdown storms on multi-core machines (mprotect, munmap, fork)
- Sizing page table memory overhead for large-memory processes
Watch out for:
- Page table memory itself can be significant: 25M pages (100 GB process) = ~200 MB of page tables
- TLB shootdowns from mprotect/munmap send IPIs to ALL cores, causing microsecond stalls
- PCID (process-context ID) reduces TLB flush cost on context switches but has limited entries
Try It Yourself
# Count TLB misses for a workload
perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses -p $(pidof postgres)

# Check page table size for a process
grep 'VmPTE' /proc/$(pidof redis-server)/status

# Read pagemap entry for a virtual address
python3 -c "import struct; f=open('/proc/self/pagemap','rb'); f.seek(0x7ff000000000//4096*8); print(hex(struct.unpack('Q',f.read(8))[0]))"

# Check if PCID is supported
grep pcid /proc/cpuinfo

# Monitor TLB shootdown IPIs system-wide
cat /proc/interrupts | grep TLB

# Check KPTI status (Meltdown mitigation)
dmesg | grep 'page tables isolation'

Debug Checklist
1. Count TLB misses: perf stat -e dTLB-load-misses,dTLB-store-misses -p <pid>
2. Check page table memory: grep VmPTE /proc/<pid>/status
3. Monitor TLB shootdowns: perf stat -e tlb:tlb_flush -a -- sleep 5
4. Check huge page TLB coverage: perf stat -e dTLB-load-misses ./app (compare with and without huge pages)
5. Check PCID support: grep pcid /proc/cpuinfo
6. View page table levels: cat /proc/<pid>/smaps | grep -E 'Size|KernelPageSize'
Key Takeaways
- ✓ The TLB is tiny -- 64 entries in the L1 dTLB. With 4 KB pages, that covers just 256 KB of memory. Miss it, and the CPU walks four levels of page tables, burning 10-30 ns per access. For databases with GB-sized working sets, this is the bottleneck nobody talks about
- ✓ TLB shootdowns are the silent killer on multi-core systems -- when one CPU changes a mapping, it must IPI every other core running threads of that process, and everyone waits. On 128 cores, a single munmap() can stall the entire machine for 100+ microseconds
- ✓ Huge pages (2 MB) skip an entire page table level and give each TLB entry 512x more coverage -- this is why every serious database deployment uses them
- ✓ The PTE's accessed and dirty bits are set by hardware on every read/write, letting the kernel's page reclaim (kswapd) find cold pages to evict without any software overhead
- ✓ PCID tags TLB entries per process so context switches do not flush the entire TLB -- this became critical after Meltdown when KPTI turned every syscall into an effective context switch
Common Pitfalls
- ✗ Ignoring page table memory overhead -- a process with a fragmented 1 TB virtual address space can consume several GB of page tables even if RSS is small, because every mapped region needs page table pages at each level
- ✗ Calling mprotect() in a tight loop on many small regions -- each call can trigger TLB shootdowns across all cores, creating O(n * num_cpus) IPI storms that tank latency
- ✗ Assuming TLB flushes are free on context switch -- without PCID, switching processes flushes the entire TLB, costing ~1000 cycles plus all the subsequent miss penalties; frequent context switches destroy performance
- ✗ Overlooking TLB miss cost in pointer-chasing workloads -- random access across a large address space means a TLB miss on nearly every access, each requiring 4 sequential memory reads
Reference
In One Line
If dTLB-load-misses are eating CPU, switch to huge pages; if TLB shootdowns are spiking latency, reduce mprotect/munmap frequency.