Kernel Memory Allocators (Slab, SLUB & kmalloc)
Mental Model
A restaurant kitchen with prep stations. The head chef (page allocator) orders whole sides of beef (pages). The line cooks (slab caches) pre-cut those into identical portions -- one station cuts 8 oz steaks, another cuts 4 oz filets, another portions soup bowls. When a waiter (kernel subsystem) needs a steak, the line cook hands one from the pre-cut tray instantly. No waiting for the head chef to carve a new side. Each cook keeps a personal tray at their station (per-CPU freelist) so multiple waiters get served simultaneously without crowding around one cutting board.
The Problem
A networking workload handling 500K packets per second on a 16-core machine shows intermittent latency spikes of 50-200 microseconds. CPU profiles point to time spent in kmalloc and page allocation paths. The per-CPU slab freelists are running dry under burst traffic, forcing fallback to the shared node partial lists and, in the worst case, the page allocator. Each fallback adds 2-10 microseconds of latency. The fix requires understanding slab cache tuning, per-CPU partial slab counts, and SLUB's allocation fast path vs slow path.
Architecture
A Linux system is processing 800,000 network packets per second. Each packet needs a 232-byte sk_buff structure allocated, processed, and freed. That is 1.6 million allocations and frees every second, on a structure smaller than a single 4KB page.
If the kernel called the page allocator for each one, it would waste 94% of every page and spend microseconds per allocation contending on zone locks. The system would top out at maybe 100,000 packets per second. Instead, it handles 800,000 without breaking a sweat.
The reason is the slab allocator. It pre-slices pages into fixed-size object slots, keeps per-CPU freelists so cores never contend with each other, and recycles freed objects without returning them to the page allocator. This is how the kernel allocates memory for its own internal data structures -- and it is one of the most performance-critical subsystems in the entire kernel.
What Actually Happens
When a kernel subsystem needs a small, fixed-size structure repeatedly, it creates a slab cache via kmem_cache_create():
- The kernel allocates a struct kmem_cache descriptor containing the object size, alignment, flags, an optional constructor, per-CPU freelist pointers, and per-NUMA-node partial slab lists.
- No pages are allocated yet. The cache is empty.
- On the first kmem_cache_alloc() call, SLUB requests one or more contiguous pages from the buddy allocator (the page order depends on object size and the slub_min_order / slub_min_objects tunables).
- SLUB carves those pages into fixed-size slots. If the cache has a constructor, it runs on every slot in the page now -- not on every future allocation.
- One object is returned to the caller. The rest sit on the per-CPU freelist, ready for the next allocation (a minimal usage sketch follows this list).
The fast path for subsequent allocations is stunningly simple:
```c
object = per_cpu_freelist;            /* take the first free slot on this CPU's freelist */
per_cpu_freelist = object->next_free; /* advance the head to the next free slot          */
return object;                        /* no spinlock taken, no interrupts disabled       */
```
On x86-64, this compiles to a single cmpxchg instruction on the per-CPU freelist pointer. No spinlock. No disabling interrupts. 10-20 nanoseconds.
When kmem_cache_free() is called, the object goes back onto the per-CPU freelist. If the slab page becomes completely empty and there are already enough partial slabs on the node list, SLUB returns the page to the buddy allocator.
For general-purpose allocations, kmalloc(size, flags) routes to a pre-created set of size-bucketed caches: kmalloc-8, kmalloc-16, kmalloc-32, all the way up to kmalloc-8192. The buckets are mostly powers of two, with kmalloc-96 and kmalloc-192 filling the largest gaps; a 200-byte allocation rounds up to the kmalloc-256 bucket. Allocations larger than the biggest bucket bypass slab entirely and go straight to the page allocator.
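A small sketch of that rounding, assuming a hypothetical debug helper; ksize() reports the size of the bucket actually backing an allocation.

```c
#include <linux/slab.h>
#include <linux/printk.h>

/* Hypothetical debug helper: ask for 200 bytes and see which bucket served it. */
static void kmalloc_bucket_demo(void)
{
	void *p = kmalloc(200, GFP_KERNEL);   /* rounds up to the kmalloc-256 cache */

	if (!p)
		return;
	pr_info("requested 200 bytes, bucket holds %zu\n", ksize(p));  /* typically 256 */
	kfree(p);
}
```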
Under the Hood
SLUB vs SLAB vs SLOB. Linux has had three slab implementations. The original SLAB (modeled on Bonwick's 1994 Solaris design) used per-CPU arrays, shared arrays, and three list heads (full, partial, empty) per node. SLUB (2007) replaced it as the default by eliminating the array queues and maintaining just one freelist pointer per CPU and one partial list per node. SLOB was a minimal allocator for embedded systems with less than 16 MB of RAM, removed in Linux 6.4. Modern kernels use SLUB exclusively.
The three allocation paths. Every kmem_cache_alloc() tries them in order:
- Fast path: pop from the per-CPU freelist. One cmpxchg, done. ~15 ns.
- Slow path: per-CPU freelist empty. Check the per-CPU partial list. If a partially-used slab is available, promote it to active and use its freelist. ~300 ns.
- Slowest path: no partials anywhere. Call alloc_pages() to get fresh pages from the buddy allocator. Carve into objects, run constructors, return one. ~3000 ns.
On a well-tuned system, 99%+ of allocations hit the fast path.
NUMA awareness. Each kmem_cache has a kmem_cache_node structure per NUMA node, holding that node's partial slab list. Allocations prefer the local node's memory. When code running on node 0 calls kmalloc(), it gets memory physically located on node 0. Cross-node slab access adds 50-100 nanoseconds of latency per access. For NUMA-critical paths, kmalloc_node(size, flags, node) forces allocation on a specific node.
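A sketch of node-pinned allocation; the per-queue structure and the caller choosing the node are assumptions, while kmalloc_node() and cpu_to_node() are the real interfaces.

```c
#include <linux/slab.h>
#include <linux/topology.h>
#include <linux/smp.h>

/* Hypothetical per-RX-queue state touched on every packet. */
struct rx_queue_priv {
	u64  packets;
	char scratch[512];
};

static struct rx_queue_priv *alloc_rx_queue(int node)
{
	/* Pin the allocation to the node whose CPUs will service this queue,
	 * avoiding the 50-100 ns remote-access penalty on the hot path. */
	return kmalloc_node(sizeof(struct rx_queue_priv), GFP_KERNEL, node);
}

/* Typical call site: allocate on the node of the CPU that will run the handler, e.g.
 *   q = alloc_rx_queue(cpu_to_node(smp_processor_id()));
 */
```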
Object poisoning and red zones. With CONFIG_SLUB_DEBUG (enabled by default in debug kernels), SLUB can:
- Poison freed objects with 0x6b bytes, catching use-after-free bugs (accessing freed memory returns recognizable garbage).
- Place red zones (guard bytes) before and after each object, catching buffer overflows.
- Track the allocation call site and last free call site via stack traces embedded in the object metadata.
Enable at boot with slub_debug=FPZU or per-cache with slub_debug=FPZ,skbuff_head_cache.
Slab merging. SLUB merges caches with identical object size, alignment, and flags into a single physical cache. Two different subsystems creating 128-byte caches end up sharing one. This reduces fragmentation but makes debugging harder -- /proc/slabinfo shows one merged entry instead of two separate ones. Boot with slub_nomerge to disable this when investigating memory corruption.
The shrinker interface. Filesystem slab caches (dentries, inodes) register shrinker callbacks with the kernel. When memory pressure rises, kswapd calls these shrinkers to free LRU entries from the caches. The shrinker returns objects to kmem_cache_free(), which may eventually free entire slab pages back to the buddy allocator. This is why inode and dentry caches grow to fill available memory -- they shrink on demand, not on a timer.
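The shape of a shrinker, sketched for a hypothetical private cache. The count/scan callbacks are the stable part of the interface; the registration call has changed across kernel versions (register_shrinker() in older kernels, shrinker_alloc() plus shrinker_register() in newer ones), so it is left as a comment.

```c
#include <linux/shrinker.h>

static unsigned long my_nr_cached;    /* objects currently parked on a private LRU */

static unsigned long my_cache_count(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	return my_nr_cached;          /* tell reclaim how much is freeable */
}

static unsigned long my_cache_scan(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	unsigned long freed = 0;

	/* Drop up to sc->nr_to_scan LRU entries, kmem_cache_free() each one. */
	while (freed < sc->nr_to_scan && my_nr_cached > 0) {
		/* ... unlink an entry and kmem_cache_free() it ... */
		my_nr_cached--;
		freed++;
	}
	return freed ? freed : SHRINK_STOP;
}

/* Registration (API differs by kernel version):
 *   .count_objects = my_cache_count, .scan_objects = my_cache_scan,
 *   .seeks = DEFAULT_SEEKS
 */
```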
Common Questions
How does kmalloc differ from vmalloc?
kmalloc() returns physically contiguous memory from slab caches. The returned pointer maps to a contiguous range of physical frames. This is necessary for DMA buffers and any hardware that needs contiguous physical addresses. vmalloc() returns virtually contiguous memory that may be physically scattered across many non-adjacent pages, mapped through the kernel's page tables. vmalloc is slower (requires page table setup) and cannot be used for DMA, but it can satisfy large allocations when physical memory is fragmented.
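A sketch of when each is appropriate; the buffer sizes and names are illustrative.

```c
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void contig_vs_virtual(void)
{
	/* Physically contiguous -- what DMA engines and descriptor rings require. */
	void *dma_safe = kmalloc(4096, GFP_KERNEL);

	/* Virtually contiguous only -- fine for a large in-kernel table, and it
	 * still succeeds when physical memory is too fragmented for kmalloc,
	 * but it is slower to set up and cannot be handed to a DMA engine. */
	void *big_table = vmalloc(16 * 1024 * 1024);

	kfree(dma_safe);    /* both free functions accept NULL */
	vfree(big_table);
}
```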
What happens when a slab cache leaks objects?
If kernel code allocates from a slab cache but never frees, the cache grows without bound. /proc/slabinfo shows active_objs climbing steadily while num_objs stays close (no free slots). The SUnreclaim value in /proc/meminfo grows in lockstep. Unlike userspace memory leaks caught by tools like Valgrind, slab leaks require kernel-specific tools: kmemleak (CONFIG_DEBUG_KMEMLEAK) scans kernel memory for unreferenced allocations and reports them via /sys/kernel/debug/kmemleak.
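The classic pattern kmemleak flags is an error path that drops the only reference to an allocation. Everything below is hypothetical except kmalloc() and the fact that kmemleak reports unreferenced objects with their allocation stack.

```c
#include <linux/slab.h>
#include <linux/errno.h>

struct session {
	int id;
	char name[32];
};

/* Stub standing in for a registration step that can fail. */
static int register_session(struct session *s)
{
	return s->id >= 0 ? 0 : -EINVAL;
}

static int setup_session(int id)
{
	struct session *s = kmalloc(sizeof(*s), GFP_KERNEL);

	if (!s)
		return -ENOMEM;
	s->id = id;

	if (register_session(s) < 0)
		return -EINVAL;   /* BUG: s is never freed -- kmemleak reports an
				   * unreferenced object with this call stack   */
	return 0;
}
```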
Why does /proc/slabinfo show far more num_objs than active_objs?
Internal fragmentation. A slab page is carved into N slots. If only one object is active, the entire page is pinned -- the remaining N-1 slots show as allocated-but-inactive in slabinfo. This is the slab equivalent of heap fragmentation. It happens when allocation and free patterns leave one long-lived object on many different slab pages. The slabinfo -v tool (from the kernel source tree) reports per-cache fragmentation ratios.
Can slab allocations fail?
With GFP_KERNEL, the allocator tries hard: it invokes direct reclaim, calls shrinkers, may even invoke the OOM killer. Failure is rare but possible on a truly exhausted system. With GFP_ATOMIC (used in interrupt context), there is no reclaim -- the allocator returns NULL if the per-CPU and node freelists are empty and the buddy allocator has no free pages in the atomic reserves. Code using GFP_ATOMIC must always check for NULL returns.
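A sketch of atomic-context allocation with the mandatory NULL check; the interrupt handler and event structure are hypothetical.

```c
#include <linux/interrupt.h>
#include <linux/slab.h>

struct rx_event {
	u32 status;
	u8  payload[128];
};

static irqreturn_t nic_irq_handler(int irq, void *dev_id)
{
	/* Interrupt context: sleeping is illegal, so GFP_KERNEL (which may
	 * reclaim) is out. GFP_ATOMIC dips into reserves but can still fail. */
	struct rx_event *ev = kmalloc(sizeof(*ev), GFP_ATOMIC);

	if (!ev)
		return IRQ_HANDLED;   /* allocation failed: drop the event, recover later */

	/* ... fill ev and hand it to a softirq or workqueue for processing ... */
	kfree(ev);                    /* freed here only to keep the sketch self-contained */
	return IRQ_HANDLED;
}
```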
How Technologies Use This
Launching 200 containers on a single host in a burst deployment causes the kernel to allocate roughly 800 to 1200 slab objects per container across 20+ caches. Each container needs a task_struct (6144 bytes on x86-64), an mm_struct for its memory descriptor, at least one cgroup structure per controller (cpu, memory, pids, io), nsproxy entries for each namespace, and mount-point structures for the overlayfs layers. At 200 containers, that is over 200,000 slab allocations in a few seconds.
The slab allocator handles this burst without degradation because each object type has a dedicated kmem_cache with per-CPU freelists. When a CPU allocates a task_struct, it pops one from its local freelist in a single cmpxchg instruction taking 15 nanoseconds. No global lock, no cross-CPU cache-line bouncing. The 200th container starts just as fast as the first because SLUB keeps warm partial slabs on each CPU, and the per-CPU design means 16 cores create containers in parallel with zero contention.
Monitoring slab growth during container bursts with `grep Slab /proc/meminfo` shows baseline slab memory jumping from roughly 200 MB to 600-800 MB. The SUnreclaim portion (task_struct, nsproxy, active cgroup structures) represents the hard floor of kernel memory that persists as long as those containers run.
An Nginx reverse proxy handling 50,000 concurrent connections allocates one sk_buff structure (232 bytes on x86-64) for every incoming and outgoing packet, plus a tcp_sock structure (roughly 2200 bytes) for every open TCP connection. At 50K connections generating 300K packets per second, the skbuff_head_cache processes 600K alloc+free cycles per second while the tcp_sock cache holds 50K persistent objects.
The slab allocator makes this sustainable by keeping both caches on per-CPU freelists. Each core handling Nginx worker traffic allocates sk_buffs without touching a shared lock. The fast path is a single cmpxchg on the per-CPU freelist pointer, completing in 15 nanoseconds. At 300K packets per second, falling through to the slow path (node partial list lock at 300 nanoseconds) even 1% of the time would add measurable tail latency. On a tuned system, 99%+ of allocations hit the fast path.
Running `grep skbuff /proc/slabinfo` on a loaded Nginx host shows 20K-50K active sk_buff objects at any moment. The tcp_sock slab holds one object per connection, so 50K concurrent connections means 50K active objects consuming roughly 110 MB of slab memory. This memory is SUnreclaim and persists until connections close.
Redis triggers a BGSAVE every 60 seconds on a 30 GB dataset by calling fork() to create a child process that writes the RDB snapshot. The fork itself is fast because of copy-on-write page sharing, but the kernel must still build the child's memory bookkeeping: on a 30 GB heap with 4 KB pages, roughly 7.8 million page table entries are copied into freshly allocated page-table pages.
The slab allocator absorbs the accompanying burst of kernel objects. During the fork, the kernel allocates the child's mm_struct, one vm_area_struct per memory mapping, and assorted task and file structures from their slab caches, while the page-table pages themselves come from the page allocator. On a system with 200+ memory mappings in the Redis process, the vm_area_struct slab cache sees a spike of 200+ allocations in microseconds. Per-CPU freelists ensure this burst completes without lock contention.
As the parent Redis process modifies pages while the child writes the snapshot, copy-on-write faults trigger new page allocations and further page-table growth. Monitoring `grep -E "Slab|SUnreclaim" /proc/meminfo` during BGSAVE shows slab memory climbing by 50-200 MB depending on the write rate during the snapshot window. This growth recedes once the child exits and its memory descriptors and page tables are freed.
Same Concept Across Tech
| Subsystem | Slab cache | Object size (x86-64) | Typical active count |
|---|---|---|---|
| Networking | skbuff_head_cache | 232 bytes | 20K-50K on busy NIC |
| Filesystem | dentry_cache | 192 bytes | 2-5M on file server |
| Filesystem | ext4_inode_cache | 1080 bytes | 1-3M on ext4 |
| Process mgmt | task_struct cache | 6144 bytes | 1 per thread |
| Networking | TCP (tcp_sock) | ~2200 bytes | 1 per TCP connection |
| Containers | nsproxy | 40 bytes | 1 per namespace set |
Stack layer mapping (unexpected slab memory growth):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the workload creating many short-lived kernel objects (connections, files, processes)? | Application metrics |
| Kernel caches | Which slab cache is growing? Is it reclaimable or unreclaimable? | slabtop -s c, /proc/meminfo |
| SLUB internals | Are per-CPU partial lists holding too many empty slabs? | /sys/kernel/slab/<cache>/cpu_partial |
| Page allocator | Is the buddy allocator fragmented, preventing slab pages from being freed? | /proc/buddyinfo |
| NUMA topology | Are slabs allocated on remote nodes, wasting memory and adding latency? | numastat -m |
Design Rationale
The page allocator works in units of 4 KB pages, but most kernel objects are far smaller -- an sk_buff is 232 bytes, a dentry is 192 bytes. Allocating a full page for each object wastes 90%+ of memory. The original SLAB allocator, modeled on the Solaris slab design, solved this by carving pages into fixed-size slots and caching freed objects for reuse. SLUB (2007) stripped away SLAB's complex per-CPU arrays and three-list management, replacing them with a single freelist pointer per CPU. The result is less metadata, fewer cache lines touched per allocation, and better scalability on machines with hundreds of cores.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Slab memory steadily growing, never shrinks | Filesystem caches filling (normal) or kernel object leak | grep SReclaimable /proc/meminfo. If SUnreclaim grows without load increase, investigate |
| Latency spikes under network load | Per-CPU slab freelist exhausted, falling to slow path | perf record -e kmem:kmalloc, check /sys/kernel/slab/skbuff_head_cache/cpu_partial |
| kmalloc returns NULL with GFP_ATOMIC | No free objects on per-CPU list and page allocator cannot reclaim (atomic context) | Check if interrupt handlers allocate large objects. Preallocate in process context |
| Container host memory pressure despite low app usage | Slab memory from namespace/cgroup structures of many containers | slabtop, focus on nsproxy, pid_namespace, cgroup entries |
| High %sys CPU time during object allocation | SLUB slow path contention on node partial list lock | perf lock record, check for node->list_lock contention |
| /proc/slabinfo shows cache with many objects but few active | Internal fragmentation. Objects freed but pages not returned | Partially-full slabs pinned by one active object per page. Check with slabinfo -v |
When to Use / Avoid
Relevant when:
- Investigating high slab memory usage shown in /proc/meminfo
- Diagnosing latency spikes in network-heavy workloads caused by slab allocation pressure
- Understanding why a container host uses more kernel memory than expected
- Writing kernel modules that need efficient allocation of fixed-size structures
Watch out for:
- SReclaimable slab memory is not a problem. It is filesystem cache the kernel releases under pressure
- GFP_ATOMIC allocations in the slow path fail more readily than GFP_KERNEL since they cannot trigger reclaim
- SLUB merging makes /proc/slabinfo harder to interpret because multiple logical caches share one entry
Try It Yourself
```bash
# Show the top slab caches by memory consumption
slabtop -o -s c | head -20

# Watch slab memory in /proc/meminfo
watch -n2 'grep -E "Slab|SReclaimable|SUnreclaim" /proc/meminfo'

# Examine a specific slab cache in detail
grep -E "^skbuff|^dentry|^inode" /proc/slabinfo

# Check SLUB tunables for a cache
ls /sys/kernel/slab/kmalloc-256/ && cat /sys/kernel/slab/kmalloc-256/object_size

# Trace kmalloc calls for 5 seconds and show call stacks
perf record -e kmem:kmalloc -ag -- sleep 5 && perf report

# Count slab allocations per cache over 10 seconds
perf stat -e kmem:kmalloc,kmem:kfree -a -- sleep 10

# Force inode/dentry cache reclaim
echo 2 > /proc/sys/vm/drop_caches

# Check NUMA distribution of slab memory
numastat -m | grep -i slab
```
Debug Checklist
1. Check total slab usage: grep Slab /proc/meminfo
2. Find top slab consumers: slabtop -o -s c | head -20
3. Check a specific cache: grep skbuff /proc/slabinfo
4. Monitor slab growth over time: watch -n5 'grep -E "Slab|SReclaim|SUnreclaim" /proc/meminfo'
5. Check SLUB tuning for a cache: cat /sys/kernel/slab/kmalloc-256/cpu_partial
6. Trace slab allocations: perf record -e kmem:kmalloc -a -- sleep 10
7. Check for slab fragmentation: cat /proc/buddyinfo alongside slabinfo
Key Takeaways
- ✓kmalloc is not a syscall exposed to userspace. It is the kernel's internal general-purpose allocator, backed by a set of size-bucketed slab caches (kmalloc-8, kmalloc-16, kmalloc-32, up to kmalloc-8192). Larger allocations fall through to the page allocator directly.
- ✓SLUB replaced the original SLAB allocator as the default in Linux 2.6.23 (2007). SLAB had per-CPU arrays, shared arrays, and three list heads per node. SLUB simplified this to one freelist pointer per CPU and one partial list per node, cutting metadata overhead by 50-70% and removing the complex queue management entirely.
- ✓The fast path in SLUB is a single cmpxchg instruction on the per-CPU freelist pointer. No spinlock, no disabling interrupts, no per-CPU array management. On x86-64 this takes 10-20 nanoseconds. The slow path (promoting a partial slab) takes 200-500 nanoseconds. Falling through to the page allocator takes 1-10 microseconds.
- ✓Object constructors run when a slab page is first carved into objects, not on every allocation. If a constructor initializes a mutex inside each object, that initialization happens once when the slab is created. When the object is freed and reallocated, the mutex is already initialized. This amortizes expensive setup across many allocation cycles.
- ✓Slab merging in SLUB combines caches with the same object size and alignment into a single cache. Two modules each creating a 128-byte cache end up sharing one cache, reducing fragmentation. Disable merging with slub_nomerge boot parameter when debugging use-after-free bugs, since merged caches make it harder to identify which subsystem owns a corrupt object.
Common Pitfalls
- ✗Assuming slab memory is leaked. SReclaimable slab memory (dentries, inodes) is not a leak. The kernel intentionally caches filesystem metadata until something else needs the memory. Only SUnreclaim growth without a corresponding increase in active kernel objects indicates a real problem.
- ✗Using GFP_KERNEL in interrupt context. kmalloc(size, GFP_KERNEL) can sleep to reclaim memory, which is illegal in interrupt handlers, softirqs, and any code holding a spinlock. Use GFP_ATOMIC in those contexts, but understand that GFP_ATOMIC allocations can fail more easily since they cannot invoke reclaim.
- ✗Creating a dedicated slab cache for an object that is the same size as an existing kmalloc bucket. SLUB will merge them anyway unless slub_nomerge is set. The dedicated cache adds a kmem_cache descriptor (256 bytes per NUMA node) with no benefit. Use kmalloc unless a constructor or specific alignment is needed.
- ✗Ignoring NUMA locality. kmalloc allocates from the slab cache associated with the CPU's NUMA node. If a structure allocated on node 0 is frequently accessed by CPUs on node 1, every access pays the cross-node latency penalty (50-100ns extra). Use kmalloc_node() to allocate on the node where the object will be consumed.
Reference
In One Line
The kernel pre-slices pages into fixed-size object slots so that allocating an sk_buff or an inode takes 15 nanoseconds instead of the microseconds a page allocator round-trip would cost.