Heap Allocators (malloc internals)
Mental Model
A warehouse manager gets pallets from a supplier (the kernel). When a department returns goods, the manager never sends them back -- trucks are slow. Returned goods stay on the shelves, ready for whoever needs them next. The warehouse looks full even when every shelf holds nothing but returns. Floor space tells a lie. Only the inventory ledger shows what is actually in use.
The Problem
Three weeks of uptime, traffic at zero for the last hour, every cache flushed, every connection torn down, free() called on everything -- and the process still holds 8 GB of RSS. Valgrind says zero leaks. The memory is not leaked; the allocator is hoarding freed pages in its own free lists, waiting to recycle them. Fragmentation compounds the problem: one live allocation stranded in the middle of a freed region locks the entire region in place.
Architecture
Everything freed, Valgrind reporting zero leaks, and the process still holding 8 GB of RSS: this is not a bug. It is malloc working as designed, and it catches almost every developer off guard.
Here is the thing: malloc() is not a syscall. It is a user-space algorithm running a private memory pool. Calling free() sends memory back to the pool, not to the operating system. The allocator hoards it for future reuse. Process RSS does not budge.
Understanding how this pool works is the difference between filing a false "memory leak" bug and actually fixing production memory bloat.
What Actually Happens
When malloc(n) is called, here is the lookup order:
Step 1: tcache (glibc 2.26+). A per-thread cache holding up to 7 chunks in each of 64 size classes, covering requests up to about 1 KB on 64-bit. No locking, no contention. If a matching freed chunk exists here, it is returned in nanoseconds.
Step 2: Fastbins. For small allocations (up to 128 bytes by default on 64-bit, tunable to 160 via mallopt(M_MXFAST)). These are singly-linked LIFO lists -- fast, but they never coalesce adjacent free chunks. Speed over efficiency.
Step 3: Small/large/unsorted bins. Small bins hold exact-size chunks in FIFO order. Large bins sort chunks by size. The unsorted bin is a staging area where recently freed chunks wait before being sorted. This is where coalescing happens.
Step 4: Top chunk. The wilderness at the end of the heap. If no free chunk fits, the top chunk is carved. If the top chunk is exhausted, brk() extends the heap.
Step 5: mmap. Anything above MMAP_THRESHOLD (default 128 KB) skips all of the above and goes straight to mmap(MAP_ANONYMOUS). These chunks ARE returned to the OS on free() via munmap() -- but each allocation costs a full syscall.
Under the Hood
The chunk header is the hidden tax. Every allocation gets a malloc_chunk header prepended: 8 bytes for prev_size, 8 bytes for size (with 3 flag bits: PREV_INUSE, IS_MMAPPED, NON_MAIN_ARENA). The prev_size field is only meaningful while the previous chunk is free; while it is in use, that slot doubles as the last 8 bytes of the previous chunk's payload, which is why a 32-byte chunk can hold up to 24 usable bytes. On 64-bit, malloc(16) allocates a 32-byte chunk. The minimum chunk size is 32 bytes because a freed chunk needs room for forward and backward pointers. For a 1-byte allocation, the overhead is 31 bytes -- 97%.
Coalescing is selective. When a chunk is freed (not fastbin-sized), ptmalloc2 checks the PREV_INUSE bit and the next chunk's size to determine if neighbors are also free. If so, it merges them into a single larger free chunk. But fastbins skip this entirely -- they hold individual small chunks in a LIFO list for maximum speed. Consolidation only happens when a large allocation triggers malloc_consolidate().
The MMAP_THRESHOLD is dynamic. If a large mmap'd allocation is freed and immediately followed by a similar-sized request, glibc raises the threshold to keep future allocations on the heap for faster reuse. This is controllable with mallopt(M_MMAP_THRESHOLD, value).
Per-arena fragmentation is the real production killer. ptmalloc2 creates up to 8 * num_cpus arenas on 64-bit systems. Each arena has its own bins, heap segments, and top chunk. Memory freed in arena A cannot be reused by arena B. In a long-running server with many threads, this creates isolated pockets of freed memory across dozens of arenas, inflating RSS far beyond actual live data.
This is why jemalloc (Redis, FreeBSD, Firefox) and tcmalloc (Google, Envoy) exist. jemalloc uses extent-based allocation with cross-thread deallocation queues and actively purges dirty pages via madvise(MADV_DONTNEED). tcmalloc uses per-CPU caches with periodic rebalancing. Both dramatically outperform ptmalloc2 for multi-threaded, long-running workloads.
Common Questions
Why does RSS not decrease after freeing most of the allocated memory?
glibc keeps freed memory in bins for reuse. Only mmap'd chunks (above 128 KB) are returned immediately via munmap. The brk-based heap can only shrink from the top -- a single small allocation at the highest address prevents everything below from being returned. Use malloc_trim(0) to force a release attempt, or switch to jemalloc which proactively purges dirty pages.
How do heap overflow exploits work?
Writing past an allocated chunk corrupts the next chunk's size/prev_size metadata. An attacker crafts these values so that on the next malloc/free, the allocator's unlink operation writes to an arbitrary address. Modern glibc has extensive hardening (safe unlinking, tcache key validation, pointer encryption), but legacy code on older glibc remains vulnerable.
How does jemalloc reduce fragmentation vs ptmalloc2?
jemalloc uses size-class slabs where all allocations of the same size class come from the same 2 MB extent. When all objects in a slab are freed, the entire extent can be returned to the OS. It also actively purges dirty pages via madvise. ptmalloc2 holds arbitrary-sized chunks in a general-purpose heap, making whole-page returns impossible when even one live chunk sits on a page.
What is the real overhead of malloc per allocation?
16 bytes of metadata per chunk on 64-bit, plus alignment to 16 bytes. Minimum chunk is 32 bytes. For a 1-byte allocation, that is 31 bytes of overhead (97%). For workloads with millions of small allocations, a slab allocator or arena allocator that amortizes metadata across many objects is far more efficient.
How Technologies Use This
A JVM running Cassandra shows 12 GB RSS when the heap is only 8 GB. No native leak is detected by any profiling tool, and the team wastes weeks analyzing Java code that has nothing to do with the bloat.
The JVM manages its own heap via mmap, but JNI calls, compression codecs, and the C++ runtime still use glibc ptmalloc2 for native allocations. Over weeks of operation, ptmalloc2 creates dozens of arenas that fragment independently, and memory freed in one arena cannot serve requests from another. This invisible native fragmentation accounts for the 4 GB gap between heap size and total RSS.
Swap to jemalloc via LD_PRELOAD to drop native RSS by 25-40%. jemalloc's extent-based design actively purges dirty pages with madvise(MADV_DONTNEED), returning unused physical frames to the OS instead of hoarding them across isolated arena pools.
A Go service handles 100,000 requests per second at peak and consumes 4 GB RSS. Traffic drops to zero, but unlike most services, RSS falls back to 800 MB within minutes. Teams used to other runtimes are surprised that the memory actually returns.
Go's allocator uses a tcmalloc-inspired design with per-P mcaches holding 67 size classes, serving allocations without locks or syscalls. When the GC sweep frees objects, they return to the local mcache instantly. A background scavenger walks the heap (a periodic pass every few minutes in older runtimes, continuously paced since Go 1.13), calling madvise(MADV_DONTNEED) on spans that have stayed free, telling the kernel to reclaim the physical pages while keeping the virtual addresses valid.
Rely on Go's built-in scavenger for automatic RSS reduction after traffic drops. Unlike glibc ptmalloc2 which hoards freed memory indefinitely, Go's runtime actively returns unused pages to the OS, so RSS tracks actual live data within a few minutes of load reduction without any manual intervention.
A heap buffer overflow in one Chrome tab reaches across the shared heap and reads passwords stored in another tab's memory. With a single shared heap, an out-of-bounds write in a renderer process could touch any object in the same address space, turning a minor bug into a full credential leak.
PartitionAlloc solves this by segregating allocations into size-class buckets backed by separate virtual memory regions with guard pages between them. An overflow in the 64-byte bucket hits a guard page and crashes the renderer instead of corrupting the 256-byte bucket where DOM nodes live. The isolation boundary means a corruption exploit is contained to a single size class.
Chrome switched from tcmalloc to PartitionAlloc specifically for this security boundary, accepting a 1-2% allocation throughput cost in exchange for containing over 70% of heap corruption exploits. The lesson is that allocator design is a security decision, not just a performance one.
Same Concept Across Tech
| Technology | Which allocator it uses | Why |
|---|---|---|
| Redis | jemalloc (compiled in) | Better fragmentation resistance than ptmalloc2 for many small allocations |
| PostgreSQL | glibc ptmalloc2 (default) | Considering jemalloc integration for large shared buffer workloads |
| Go | Built-in allocator (not malloc) | Uses mmap directly, returns memory to OS via madvise(MADV_DONTNEED) |
| JVM | Does not use malloc for heap (uses mmap). malloc used for native/JNI code | Heap is managed by GC, not allocator. Native memory leaks are a different problem |
| Node.js (V8) | V8 manages its own heap. malloc used for Buffer and native addons | --max-old-space-size controls V8 heap, not malloc |
| Nginx | Custom pool allocator per connection, avoids malloc fragmentation | Pool freed entirely when connection closes |
Stack layer mapping (RSS not shrinking after free):
| Layer | What to check | Tool |
|---|---|---|
| Application | Was free() actually called on everything? Any hidden caches? | Code review, heap profiler |
| Allocator | Is the allocator hoarding freed pages? Which allocator? | malloc_stats(), ldd binary |
| Virtual memory | VmData size vs actual allocation? Fragmentation? | /proc/PID/status, /proc/PID/smaps |
| Kernel | Is memory marked MADV_DONTNEED? Can kernel reclaim it? | /proc/PID/smaps LazyFree field |
| Configuration | Is MALLOC_TRIM_THRESHOLD set? Is malloc_trim() called periodically? | Environment variables, application config |
Design Rationale
Hoarding freed memory makes sense when returning it costs hundreds of nanoseconds per syscall but recycling a cached chunk costs single digits. For any server that churns through alloc/free cycles, the syscall tax of returning every block would dwarf actual work. ptmalloc2 layered per-thread arenas on top to dodge lock contention, but that decision came after the original single-threaded design, and it is why memory freed in one arena stays trapped -- invisible to every other arena. jemalloc and tcmalloc exist precisely because they were built for multi-threaded, long-running servers from the start, with cross-thread deallocation and proactive OS return as core design goals rather than afterthoughts.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| RSS does not shrink after free() | Allocator hoarding freed pages for reuse (normal behavior) | Check VmData in /proc/PID/status |
| RSS grows slowly over weeks, no leak detected | Heap fragmentation preventing large freed regions from returning to OS | malloc_stats(), consider jemalloc |
| RSS jumps 2x after traffic spike, never returns | Peak allocation stretched the heap. Freed pages stay mapped | malloc_trim(0) or switch to jemalloc for better return policy |
| Different RSS behavior after switching allocators | jemalloc and tcmalloc return memory more aggressively than ptmalloc2 | Expected. Different allocator, different policy |
| Process using 4 GB more than expected on multi-core | Per-thread arenas in ptmalloc2 each hold freed memory | Set MALLOC_ARENA_MAX=2 to limit arena count |
| Go process returns memory but C extension does not | Go heap uses madvise(MADV_DONTNEED). C malloc does not | The C extension is using a different allocator |
When to Use / Avoid
Relevant when:
- RSS refuses to shrink after freeing memory -- the most common false-positive "leak"
- Picking an allocator (ptmalloc2 vs jemalloc vs tcmalloc) for a long-running service
- Heap fragmentation is growing RSS over weeks with no detectable leak
- Someone asks why malloc_trim() or MALLOC_TRIM_THRESHOLD exists
Watch out for:
- ptmalloc2 is conservative about returning memory; that is by design, not a bug
- jemalloc purges dirty pages via madvise; tcmalloc rebalances per-CPU caches -- different policies, different RSS behavior
- Small allocations use brk() and cannot be partially returned; large ones use mmap() and can
- Per-thread arenas in ptmalloc2 fragment independently -- memory freed in one is invisible to the others
Try It Yourself
# Check which malloc implementation is in use
ldd /usr/bin/redis-server | grep -E 'malloc|jemalloc'

# Enable malloc debug checks
MALLOC_CHECK_=3 MALLOC_PERTURB_=165 ./my_server

# Print glibc malloc stats from a running process
gdb -batch -ex 'call malloc_stats()' -p $(pidof myapp)

# Trace brk/mmap syscalls made by malloc
strace -e brk,mmap,munmap -f ./my_program 2>&1 | head -50

# Set malloc tuning parameters via environment (mallopt can set them; glibc has no call to read them back)
MALLOC_MMAP_THRESHOLD_=65536 MALLOC_TRIM_THRESHOLD_=131072 ./my_program

# Profile heap allocation with valgrind massif
valgrind --tool=massif --pages-as-heap=yes ./my_program && ms_print massif.out.*

Debug Checklist
1. Check which allocator is in use: ldd <binary> | grep -E 'malloc|jemalloc|tcmalloc'
2. Check heap size: cat /proc/<pid>/status | grep VmData
3. Check RSS vs virtual: cat /proc/<pid>/status | grep -E 'VmRSS|VmSize'
4. Force glibc to release: gdb -p <pid> -batch -ex 'call malloc_trim(0)' (careful in production)
5. Check malloc stats (glibc): call malloc_stats() or mallinfo2() in code
6. Check fragmentation: compare VmData growth vs actual allocation size
Key Takeaways
- ✓ malloc(16) actually allocates 32 bytes -- 16 for the payload, 16 for the chunk header; the minimum chunk size is 32 bytes because freed chunks need space for the free-list pointers
- ✓ Anything above 128 KB bypasses the heap entirely and goes straight to mmap -- these chunks ARE returned to the OS on free() via munmap, but at the cost of a syscall per allocation
- ✓ The top chunk sits at the end of the heap, and it is the only chunk that can shrink the heap via brk() -- if a single small allocation sits at the very top, all the memory below it stays trapped
- ✓ Fastbins are the speed trap: they serve allocations in nanoseconds but never coalesce freed chunks, causing fragmentation when allocation sizes vary; consolidation only happens when a large allocation triggers it
- ✓ jemalloc and tcmalloc exist because ptmalloc2's per-arena fragmentation is a fundamental design flaw -- memory freed in one arena cannot be reused by another, and both alternatives solve this with cross-thread deallocation
Common Pitfalls
- ✗ Expecting free() to reduce RSS -- glibc hoards freed memory in bins for reuse; only mmap'd chunks (>128 KB) and brk-shrink of the top chunk actually reduce RSS; this is the #1 source of 'memory leak' false alarms
- ✗ Double-free bugs -- freeing a chunk twice corrupts the free list; glibc 2.26+ has tcache double-free detection, but older versions are exploitable for arbitrary code execution
- ✗ Heap buffer overflows -- writing past an allocation corrupts the next chunk's metadata; on the next malloc/free, glibc detects the damage and aborts with 'corrupted size vs. prev_size'
- ✗ Ignoring malloc overhead for small objects -- a million 16-byte allocations consume 32 MB (32 bytes each), not 16 MB; for small objects, a slab or arena allocator is 2-4x more memory-efficient
Reference
In One Line
RSS staying high after free() is the allocator doing its job, not a leak -- switch to jemalloc or call malloc_trim() if the hoarding is actually a problem.