Huge Pages & THP
Mental Model
A library with a tiny index box that holds 64 cards. With small pages, each card points to one shelf -- 64 cards cover a single bookcase. The librarian spends most of the day flipping through the full catalog instead of retrieving books. Group entire bookcases under one card, and suddenly 64 cards cover the whole floor. Faster lookups. But try to rearrange a single shelf within a grouped bookcase, and the whole unit has to be torn apart.
The Problem
PostgreSQL with 32 GB of shared buffers: 8 million 4 KB pages, a TLB that holds 64 entries, and every buffer access outside that tiny window triggers a four-level page table walk -- four memory reads before the actual data is touched. Fifteen percent of CPU time goes to address translation instead of running queries.
Architecture
The CPU has a tiny address book. It holds 64 entries. Each entry covers one page of memory.
With 4 KB pages, those 64 entries cover 256 KB. A database working set of 32 GB creates a 131,000x mismatch. Every time the CPU accesses a page not in the address book, it stops and walks a four-level page table. Ten to thirty nanoseconds, wasted.
Huge pages make each entry cover 2 MB instead of 4 KB. Same 64 entries, but now they cover 128 MB. That is a 512x improvement from changing nothing about the code.
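The coverage arithmetic is easy to verify with plain shell arithmetic. A quick sketch (the 64-entry figure is the one used throughout this section; real L1 dTLB sizes vary by CPU):

```shell
# TLB reach = entries x page size (values in KB)
entries=64
small_page_kb=4
huge_page_kb=2048                             # 2 MB = 2048 KB

small_reach_kb=$((entries * small_page_kb))   # 256 KB
huge_reach_kb=$((entries * huge_page_kb))     # 131072 KB = 128 MB
factor=$((huge_reach_kb / small_reach_kb))    # 512x

echo "4 KB pages:  ${small_reach_kb} KB of reach"
echo "2 MB pages:  $((huge_reach_kb / 1024)) MB of reach"
echo "improvement: ${factor}x"
```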
What Actually Happens
Linux offers two ways to use huge pages, and they have very different tradeoffs:
Explicit huge pages (hugetlbfs) are preallocated. The administrator tells the kernel "reserve 1,024 huge pages" via /proc/sys/vm/nr_hugepages, and it carves 2 GB of contiguous memory from the buddy allocator right then. These pages are pinned -- they cannot be swapped, reclaimed, or fragmented. Applications use them via mmap(MAP_HUGETLB) or by mapping files from hugetlbfs.
The upside: guaranteed availability, zero runtime overhead, completely deterministic.
The downside: manual management. If the pool runs out, allocations fail. Over-reserving locks that memory away from everything else. And the 2 MB granularity wastes memory for non-aligned allocations.
Transparent Huge Pages (THP) are automatic. When the kernel handles a page fault, it tries to allocate a 2 MB page instead of a 4 KB one. If contiguous memory is not available, it falls back to 4 KB. A background kernel thread called khugepaged scans for groups of 512 contiguous 4 KB pages that can be promoted (collapsed) into a single 2 MB THP.
The upside: no configuration, no preallocated pools, works transparently.
The downside: promotion requires memory compaction -- the kernel physically moves pages to create contiguous free regions. This can stall a process for milliseconds. For latency-sensitive workloads, this is a dealbreaker.
Under the Hood
The page table savings are concrete. A 2 MB huge page uses a single PMD entry with the PS (Page Size) bit set, eliminating the entire PTE level. Each PTE page holds 512 entries at 8 bytes each -- that is 4 KB of kernel memory saved per huge page. For a 64 GB database shared buffer, this saves 128 MB of page table memory compared to 4 KB pages.
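The savings scale linearly with the mapping size and can be recomputed for any buffer. A sketch, using the 64 GB figure from the example above:

```shell
# Page table memory for a 64 GB mapping, 8-byte entries
region_bytes=$((64 * 1024 * 1024 * 1024))
entry_bytes=8

# 4 KB pages: one PTE per 4 KB of mapped memory
ptes=$((region_bytes / 4096))
pte_mem_mb=$((ptes * entry_bytes / 1024 / 1024))      # 128 MB

# 2 MB pages: one PMD entry per 2 MB, no PTE level at all
pmds=$((region_bytes / (2 * 1024 * 1024)))
pmd_mem_kb=$((pmds * entry_bytes / 1024))             # 256 KB

echo "4 KB pages: ${pte_mem_mb} MB of page tables"
echo "2 MB pages: ${pmd_mem_kb} KB of PMD entries"
```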
Compaction is where THP gets dangerous. When khugepaged finds 512 consecutive 4 KB pages that could be a huge page, but the physical frames are not contiguous, it triggers compaction. The kernel's compaction algorithm migrates pages to create free contiguous regions. This runs synchronously if defrag is set to "always" -- meaning a simple page fault can stall for milliseconds while pages are shuffled around.
Setting THP to madvise mode avoids this. Only regions explicitly marked with madvise(MADV_HUGEPAGE) get the transparent treatment. Everything else stays at 4 KB.
THP splitting is the reverse problem. When a 2 MB THP is partially unmapped (munmap of a 4 KB region within it) or gets mixed permissions (mprotect on a subset), the kernel splits it back into 512 individual 4 KB pages. This requires allocating a PTE page, populating all 512 entries, and shooting down TLB entries. It is expensive and creates a burst of overhead at the worst possible time.
NUMA makes huge pages trickier. On NUMA systems, a misplaced huge page is 512x worse than a misplaced regular page -- it is 2 MB of memory on the wrong node instead of 4 KB. Use numactl --membind to ensure huge pages land on the right node.
Common Questions
Why does Redis say to disable THP?
Redis makes many small, random allocations (keys, values, dict entries). THP's compaction causes unpredictable latency spikes that show up as p99 tail latency. And the 2 MB granularity wastes memory for Redis's small objects. Redis's init script runs echo never > /sys/kernel/mm/transparent_hugepage/enabled.
How to know if huge pages will help a given workload?
Measure dTLB-load-misses with perf stat. If TLB miss rate is above 1% of total loads and the working set exceeds 256 KB, huge pages will likely help. Also consider the access pattern: dense access (sequential or strided) benefits most because the CPU fully uses each 2 MB entry. Sparse random access across a huge range wastes the extra coverage.
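As a sketch of that decision rule: the perf event names below are the standard ones, but the two counter values are made-up illustration numbers, not measurements:

```shell
# Step 1: measure on the real workload:
#   perf stat -e dTLB-load-misses,dTLB-loads ./app
# Step 2: apply the 1% rule to the two counters it prints.
loads=2000000000          # hypothetical dTLB-loads
misses=60000000           # hypothetical dTLB-load-misses

# integer per-mille to avoid floating point in shell
permille=$((misses * 1000 / loads))
if [ "$permille" -gt 10 ]; then
    echo "miss rate ${permille}/1000 -- above 1%, huge pages likely help"
else
    echo "miss rate ${permille}/1000 -- below 1%, measure further"
fi
```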
What is the difference between hugetlbfs and THP from the kernel's perspective?
hugetlbfs pages come from a reserved pool, are not on the LRU list, are never swapped or reclaimed, and their allocation either succeeds immediately or fails. THP pages live on the normal LRU, can be split and reclaimed individually, and allocation is best-effort with fallback. Use hugetlbfs for deterministic performance. Use THP for transparent optimization of workloads that can tolerate occasional compaction latency.
Can 1 GB pages be allocated at runtime?
Practically no. The buddy allocator manages memory in power-of-2 blocks whose largest size (set by MAX_ORDER) is typically 4 MB. Allocating a contiguous 1 GB region at runtime would require compacting 262,144 consecutive 4 KB pages into one block -- virtually impossible on a running system. 1 GB pages must be reserved at boot time via the kernel command line (hugepagesz=1G hugepages=4) before fragmentation has a chance to scatter memory.
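In practice the boot-time reservation usually goes through the bootloader config. A sketch for GRUB (the file path and variable name follow the common Debian/RHEL convention -- adjust for your distro, and regenerate the config with update-grub or grub2-mkconfig afterwards):

```
# /etc/default/grub -- append to the kernel command line, then reboot
GRUB_CMDLINE_LINUX="hugepagesz=1G hugepages=4"

# Verify after reboot:
#   grep -i huge /proc/meminfo
```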
How Technologies Use This
A Kafka consumer reading 500 MB/s of log data shows unexpectedly low throughput. Profiling reveals millions of dTLB misses per second as sequential reads blow through 125,000 4 KB page boundaries per second, each triggering a 10-30 ns page table walk.
The TLB only holds 64 entries covering 256 KB of address space with 4 KB pages. Sequential log reads exhaust that coverage instantly, and the CPU spends more time translating addresses than processing data. With 2 MB huge pages, the same data crosses only 244 boundaries per second, giving each TLB entry 512x more coverage.
Set THP to madvise mode to let the kernel promote hot log segment pages to 2 MB transparent huge pages, improving consumer throughput by 8-12%. Monitor /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed and cap scan frequency on produce-heavy brokers, because khugepaged compaction can spike latency by 2-5 ms.
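The boundary math in this example is straight division by page size. A quick sketch (500 MB/s taken as 500,000 KB/s, matching the decimal figures above):

```shell
throughput_kb=500000                      # 500 MB/s expressed in KB/s

boundaries_4k=$((throughput_kb / 4))      # page boundaries per second at 4 KB
boundaries_2m=$((boundaries_4k / 512))    # at 2 MB: each page spans 512 x 4 KB

echo "4 KB pages: ${boundaries_4k} page boundaries/s"
echo "2 MB pages: ${boundaries_2m} page boundaries/s"
```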
A sequential scan of a 100 GB table spends 15% of CPU time on dTLB misses instead of processing rows. The 32 GB shared_buffers pool looks properly sized, but perf stat shows the CPU is burning cycles on address translation rather than query execution.
The pool with 4 KB pages requires 8 million page table entries, but the L1 dTLB only holds 64 entries covering 256 KB. Every buffer access outside that tiny window triggers a four-level page table walk costing 10-30 ns. The page table overhead alone consumes 64 MB of kernel memory for metadata.
Set huge_pages=on to map shared_buffers with 2 MB pages, collapsing 8 million entries to just 16,384. Page table memory overhead drops from 64 MB to 128 KB, and query throughput on scan-heavy workloads improves 10-15% because the CPU spends its cycles processing tuples instead of chasing page table pointers.
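Enabling this requires reserving enough explicit huge pages first. A sketch of the sizing arithmetic (the 5% headroom for other shared memory segments is a rough assumption, not a PostgreSQL recommendation):

```shell
# shared_buffers = 32 GB -> how many 2 MB huge pages to reserve?
shared_buffers_mb=$((32 * 1024))
pages_needed=$((shared_buffers_mb / 2))            # one huge page per 2 MB
reserve=$((pages_needed + pages_needed / 20))      # ~5% headroom (assumption)

echo "huge pages needed: ${pages_needed}"
echo "suggested nr_hugepages: ${reserve}"
# then: echo ${reserve} | sudo tee /proc/sys/vm/nr_hugepages
# and in postgresql.conf: huge_pages = on
```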
G1 GC pauses on a 16 GB heap consistently take 200 ms instead of the target 50 ms. GC logs show the marking phase dominates pause time, and perf stat reveals massive dTLB miss rates during collection cycles.
During marking, the collector walks every reachable object, chasing pointers across the entire heap. With 4 KB pages, the heap spans 4 million page table entries, and the TLB covers less than 0.01% of them. Every pointer chase to a cold region triggers a 10-30 ns page table walk, adding tens of milliseconds per GC cycle that have nothing to do with the application's object graph complexity.
Enable -XX:+UseLargePages to map the heap with 2 MB pages, cutting TLB entries to 8,192 and reducing marking-phase pauses by 20-40%. ZGC benefits even more because its concurrent relocation phase touches pages across the entire heap continuously, making TLB coverage the dominant factor in throughput.
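A sketch of the coverage fraction during marking -- pure arithmetic, reusing the 64-entry TLB figure from earlier:

```shell
heap_kb=$((16 * 1024 * 1024))             # 16 GB heap

pages_4k=$((heap_kb / 4))                 # pages at 4 KB
pages_2m=$((heap_kb / 2048))              # pages at 2 MB

# TLB coverage in hundredths of a percent (integer math)
cov_4k=$((64 * 10000 / pages_4k))         # rounds to 0 -> well under 0.01%
cov_2m=$((64 * 10000 / pages_2m))         # 78 -> roughly 0.78%

echo "4 KB pages: ${pages_4k} pages, TLB covers well under 0.01%"
echo "2 MB pages: ${pages_2m} pages, TLB covers ~${cov_2m}/100 percent"
```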
Same Concept Across Tech
| Technology | How it uses huge pages | Configuration |
|---|---|---|
| PostgreSQL | shared_buffers mapped with huge pages, reduces TLB misses on buffer pool access | huge_pages = on in postgresql.conf |
| JVM (Kafka, Elasticsearch) | Heap allocated with huge pages, reduces GC pause TLB overhead | -XX:+UseLargePages or -XX:+UseTransparentHugePages |
| Redis | Single-threaded, benefits from THP for large datasets, but THP can cause fork latency spikes during RDB saves | echo madvise > /sys/.../enabled |
| Docker/K8s | Containers can request huge pages as a resource (hugepages-2Mi), kernel reserves them at node level | resources.limits: hugepages-2Mi: 512Mi |
| DPDK | Network packet processing requires huge pages for DMA buffer pools, will not start without them | Mandatory, not optional |
Stack layer mapping (high TLB miss debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the working set much larger than available TLB coverage? | Application memory profiler |
| Runtime | Is the JVM/database configured to use huge pages? | JVM flags, DB config |
| Syscall | Is mmap called with MAP_HUGETLB? Is madvise(MADV_HUGEPAGE) applied? | strace -e mmap,madvise |
| Kernel | Are huge pages reserved? Is THP enabled? Compaction activity? | /proc/meminfo, /proc/vmstat |
| Hardware | TLB size and associativity? NUMA topology? | cpuid, numactl --hardware |
Design Rationale
The TLB is fixed hardware -- software cannot add entries. Huge pages are the only lever to close the gap between 64 TLB slots and working sets measured in gigabytes, providing 512x more coverage per entry. Two mechanisms exist because the tradeoff is genuinely irreconcilable: explicit hugetlbfs pages are deterministic but demand manual pool management, while THP is automatic but drags in compaction latency. After THP in "always" mode caused 2-5 ms stalls that wrecked Redis p99 numbers, the "madvise" mode was carved out as a middle ground -- applications opt in on specific regions. Pinning hugetlbfs pages and excluding them from reclaim was a deliberate choice: deterministic performance requires pages that the kernel will never swap out.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| High dTLB-load-misses in perf stat | Working set exceeds TLB coverage with 4 KB pages | perf stat -e dTLB-load-misses ./app |
| Periodic latency spikes (P99) with THP enabled | Kernel compacting memory to form huge pages | grep compact /proc/vmstat, check THP setting |
| Fork/RDB save extremely slow in Redis | THP causes copy-on-write to copy 2 MB pages instead of 4 KB | Set THP to madvise, not always |
| Application cannot allocate huge pages | Not enough contiguous memory, or nr_hugepages not reserved | grep -i huge /proc/meminfo |
| Container OOM despite free memory on host | Huge pages reserved at host level, not available to cgroup | Check hugepages resource limits in pod spec |
| CPU spending 10-20% on address translation | Page table walks dominating, no huge pages configured | perf stat -e dTLB-load-misses,cycles |
When to Use / Avoid
Use huge pages when:
- Large sequential memory access patterns (databases, Kafka brokers, JVM heaps)
- TLB miss rate is high (check with perf stat -e dTLB-load-misses)
- Working set is much larger than TLB coverage (64 entries x 4 KB = 256 KB)
- Stable memory allocation that does not change frequently
Avoid when:
- Sparse or random memory access patterns (small allocations scattered across address space)
- Memory-constrained systems where reserving huge pages wastes RAM
- Latency-sensitive workloads with THP set to always (compaction stalls cause spikes)
- Applications that frequently mmap/munmap small regions
Try It Yourself
```shell
# Check current huge page allocation
grep -i huge /proc/meminfo

# Reserve 1024 x 2 MB huge pages (2 GB total)
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

# Check THP mode and set to madvise
cat /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Check THP usage per process (smaps values are in kB)
grep AnonHugePages /proc/$(pidof postgres)/smaps | awk '{sum+=$2} END {print sum/1024 " MB"}'

# Monitor khugepaged activity
grep '' /sys/kernel/mm/transparent_hugepage/khugepaged/*_pages 2>/dev/null
```
Debug Checklist
1. Check current huge page config: grep -i huge /proc/meminfo
2. Monitor TLB misses: perf stat -e dTLB-load-misses,dTLB-store-misses ./app
3. Check THP setting: cat /sys/kernel/mm/transparent_hugepage/enabled
4. Look for compaction stalls: grep compact /proc/vmstat
5. Check per-process huge page usage: grep -i huge /proc/<pid>/smaps_rollup
6. Check if explicit huge pages are reserved: cat /proc/sys/vm/nr_hugepages
Key Takeaways
- ✓ The TLB coverage math is brutal -- 64 entries at 4 KB cover 256 KB, but at 2 MB cover 128 MB; for any working set above 256 KB (which is every database, every JVM, every real workload), this 512x difference dominates performance
- ✓ THP is the convenient trap -- it works transparently, but khugepaged compaction can stall a process for milliseconds while the kernel moves pages around to create contiguous 2 MB regions; this is why Redis says "disable THP"
- ✓ Explicit hugetlbfs pages trade flexibility for reliability -- they are preallocated, pinned, never swapped, never fragmented, but waste memory at 2 MB granularity; a 2.1 MB allocation consumes two full huge pages, wasting nearly half of the 4 MB
- ✓ 1 GB huge pages exist but must be reserved at boot -- the buddy allocator cannot produce 1 GB contiguous regions at runtime because memory is already fragmented; boot-time reservation carves them out before anything else runs
- ✓ Both Redis and PostgreSQL warn about THP -- Redis because latency spikes from compaction are unacceptable, PostgreSQL because THP interacts poorly with its buffer management -- yet PostgreSQL strongly recommends explicit huge pages
Common Pitfalls
- ✗ Enabling THP as "always" on latency-sensitive workloads -- khugepaged compaction can cause multi-millisecond stalls; use "madvise" mode and let applications opt in on specific regions
- ✗ Not reserving enough huge pages upfront -- if the hugetlb pool runs out, mmap with MAP_HUGETLB fails with ENOMEM; applications may fall back silently to 4 KB pages without telling you
- ✗ Assuming huge pages always help -- for sparse access patterns across huge address ranges, 2 MB pages waste physical memory (2 MB per accessed byte in the worst case) and the TLB benefit does not compensate
- ✗ Forgetting that hugetlbfs pages are pinned -- they cannot be swapped or reclaimed; over-reserving starves the page cache and other processes of memory they actually need
Reference
In One Line
Explicit huge pages for predictable latency, THP in madvise mode for convenience -- and always measure dTLB misses before assuming either one will help.