Swap, kswapd & Memory Reclaim
Mental Model
Think of a swimming pool with a continuous inflow of water (memory allocations) and three drain levels. The pool has markings at three heights on the wall: high, low, and min (the marks are named for the free capacity remaining, so the min mark sits highest on the wall). A maintenance pump (kswapd) sits idle when the water is below the high mark -- everything is fine, drains are keeping up. When water rises past the low mark, the pump kicks on automatically and starts draining in the background. Swimmers (application threads) do not even notice. But if water rises past the min mark, the lifeguard blows the whistle and forces the swimmers themselves to grab buckets and bail water (direct reclaim) -- they cannot swim until the level drops. The pool also has two kinds of water: clear water (file pages) that can simply be drained onto the ground and reused, and dyed water (anonymous pages) that must be carefully pumped into a holding tank (swap) before the space is reclaimed. The holding tank has limited capacity, and pumping dyed water back in when a swimmer needs it is slow. If the pool overflows despite everyone bailing, the lifeguard picks the swimmer making the biggest splash and ejects them entirely (OOM kill).
The Problem
A 64 GB production server runs 50 containers. At 3 AM, a batch job allocates 12 GB in 30 seconds. Free memory drops from 8 GB to zero. kswapd cannot reclaim fast enough, so every allocation enters direct reclaim -- stalling application threads for 10-200 ms per page. Swap I/O saturates the disk at 400 MB/s. The JVM in container-7 enters a 14-second GC pause because 3 GB of heap was swapped out. PostgreSQL in container-12 sees query latencies jump from 2 ms to 900 ms. The OOM killer fires on container-3, which was innocent but had the highest oom_score. Five services are now degraded, two are dead, and the on-call engineer is staring at vmstat output at 3:17 AM wondering why swap-in is at 800 MB/s on an NVMe drive.
Architecture
Why does a 64 GB server with 90% memory used run perfectly fine, while the same server at 95% used suddenly has 500 ms latency spikes?
The answer is not "it ran out of memory." Linux intentionally uses nearly all available RAM for page cache. The difference between smooth operation and catastrophic latency is whether page reclaim happens in the background (kswapd) or in the foreground (direct reclaim). That transition is the single most important performance cliff in Linux memory management.
The Reclaim Pipeline
kswapd is a per-NUMA-node kernel thread that runs in the background. It wakes when free memory in any zone drops below the low watermark, scans LRU lists to free pages, and sleeps once free memory reaches the high watermark. Applications never notice kswapd -- their allocations proceed at full speed.
Direct reclaim is the disaster path. When a thread calls __alloc_pages() and free memory is below the min watermark, the allocating thread must stop and scan LRU lists itself, freeing pages synchronously before its allocation can proceed. Every direct reclaim event adds 1-200 ms of latency to whatever operation triggered it.
The gap between these two paths is the watermark distance. On a 64 GB server with default vm.watermark_scale_factor=10 (0.1%), the gap between low and min is roughly 65 MB. A burst of allocations at 1 GB/s burns through that in 65 ms -- not enough time for kswapd to react.
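The arithmetic above can be sketched in a few lines (an approximation: the kernel actually computes watermarks per zone from managed pages and min_free_kbytes, so real numbers differ slightly):

```python
# Approximate low-to-min gap for a given watermark_scale_factor.
# watermark_scale_factor is in units of 0.01% (10 = 0.1%, 100 = 1%).
def low_to_min_gap_mb(total_gb: float, watermark_scale_factor: int) -> float:
    return total_gb * 1024 * watermark_scale_factor / 10000

print(low_to_min_gap_mb(64, 10))    # default 0.1% -> ~65.5 MB of runway
print(low_to_min_gap_mb(64, 100))   # tuned 1%     -> ~655 MB of runway
```

At a 1 GB/s allocation burst, the default ~65 MB of runway is gone in ~65 ms; the tuned value buys kswapd roughly 650 ms to react.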
LRU Lists and the Second-Chance Algorithm
The kernel tracks page recency using four LRU lists (maintained per NUMA node on modern kernels, per zone before 4.8) and a referenced bit -- not timestamps (which would cost 8 bytes per page and constant overhead). A separate unevictable list holds mlocked pages.
- Active anonymous -- hot heap/stack pages accessed recently
- Inactive anonymous -- cold anonymous pages (swap candidates)
- Active file -- hot page cache pages
- Inactive file -- cold page cache pages (drop/writeback candidates)
The anonymous/file split is critical. Clean file pages can be dropped instantly (re-read from disk if needed). Anonymous pages have no backing store unless swap is configured, so they must be written to swap before the frame can be freed.
Page aging uses the hardware-set accessed bit in the PTE. The mark_page_accessed() path works as follows: (1) new pages start on the inactive list, (2) a second access promotes them to active (the "second chance"), (3) kswapd scans the active list clearing referenced bits and demoting unreferenced pages to inactive, (4) pages at the tail of inactive with no referenced bit are reclaimed. A page must go unreferenced for two full scan cycles before eviction, preventing one-time sequential reads from flushing the hot working set.
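A toy simulation makes the aging flow concrete -- this is NOT kernel code (the real logic lives in mm/vmscan.c and is far more involved); it only illustrates the promotion and second-chance behavior:

```python
# Toy model of the active/inactive second-chance scheme described above.
class SecondChanceLRU:
    def __init__(self):
        self.active = []        # head = most recently promoted
        self.inactive = []      # reclaim candidates; new pages start here
        self.referenced = set() # models the hardware-set accessed bit

    def fault_in(self, page):
        self.inactive.insert(0, page)       # (1) new pages start on inactive

    def access(self, page):
        if page in self.inactive and page in self.referenced:
            self.inactive.remove(page)      # (2) second access promotes
            self.active.insert(0, page)
        self.referenced.add(page)

    def scan(self):
        """One kswapd pass: reclaim the cold inactive tail, then age active."""
        reclaimed = []
        for page in list(reversed(self.inactive)):
            if page in self.referenced:
                self.referenced.discard(page)   # second chance: clear bit, keep
            else:
                self.inactive.remove(page)      # (4) cold and unreferenced: free
                reclaimed.append(page)
        for page in list(reversed(self.active)):
            if page in self.referenced:
                self.referenced.discard(page)   # (3) clear bit, stay active
            else:
                self.active.remove(page)        # (3) unreferenced: demote
                self.inactive.insert(0, page)
        return reclaimed
```

A page touched repeatedly stays active; a page touched once survives one scan with its bit set and is reclaimed on the next, so a one-time sequential read cannot flush the hot working set.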
Watermarks: min, low, high
Every memory zone has three watermarks; the gaps between them scale with /proc/sys/vm/watermark_scale_factor (the min level itself derives from vm.min_free_kbytes):
- High -- kswapd target; it sleeps when free pages reach this level (~195 MB on 64 GB)
- Low -- kswapd wakeup threshold (~130 MB on 64 GB)
- Min -- emergency reserve; only PF_MEMALLOC kernel allocations can dip below (~65 MB on 64 GB)
Widening these gaps is the single most impactful tuning for latency-sensitive workloads:
# Increase watermark gaps to 1% of memory (default 0.1%)
echo 100 > /proc/sys/vm/watermark_scale_factor
# On 64 GB: changes kswapd runway from ~65 MB to ~650 MB
Swap Mechanics
Swap provides a backing store for anonymous pages. Without it, the only option when memory is full is OOM-killing a process.
Swap-out path: kswapd selects a page from the inactive anonymous LRU, allocates a swap slot from swap_info_struct, writes the page to the swap device, adds it to the swap cache (a radix tree mapping slots to frames -- prevents double I/O), updates all PTEs to store the swap entry (present bit cleared), and frees the frame.
Swap-in path: a process accesses a PTE containing a swap entry, triggering a major page fault. The kernel checks the swap cache first (instant hit if the page was not yet freed), otherwise allocates a new frame, reads from the swap slot, updates the PTE, and resumes the process. Swap cache retention after swap-in saves one disk write per cycle for unmodified pages.
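The write-saving behavior of the swap cache can be sketched with a toy model (illustrative only -- in the kernel the swap cache is an address_space keyed by swap entry inside the page cache):

```python
# Toy swap-cache bookkeeping: a clean page swapped out again costs no write.
class SwapCache:
    def __init__(self):
        self.on_disk = set()   # pages whose swap slot holds a valid copy
        self.writes = 0        # writes to the swap device
        self.reads = 0         # reads (major faults served from disk)

    def swap_out(self, page, dirty):
        if page in self.on_disk and not dirty:
            return             # slot copy still valid: free the frame, no I/O
        self.writes += 1       # write page contents to its swap slot
        self.on_disk.add(page)

    def swap_in(self, page):
        self.reads += 1        # fault: read the slot back into a fresh frame
        # page is retained in the swap cache after swap-in (copy stays valid)

cache = SwapCache()
cache.swap_out("A", dirty=True)    # first eviction: one disk write
cache.swap_in("A")                 # major fault: one disk read
cache.swap_out("A", dirty=False)   # unmodified since swap-in: zero new writes
print(cache.writes, cache.reads)   # 1 1
```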
vm.swappiness controls the anonymous-vs-file reclaim ratio. With swappiness=60 (default), anonymous pages get weight 60 and file pages weight 140 (200 - swappiness). At 0, the kernel avoids anonymous reclaim unless critically low -- but does NOT disable swap. Since kernel 5.8 the value can be raised as high as 200, and in cgroup v2 each cgroup can have an independent swap limit via memory.swap.max.
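The balance arithmetic can be written down directly (this mirrors the classic scan-balance heuristic in pre-MGLRU kernels; actual scan pressure also factors in recent reclaim cost):

```python
# Scan weights for the anon and file LRUs as a function of vm.swappiness.
def reclaim_weights(swappiness: int):
    anon_prio = swappiness          # 0..200 since kernel 5.8
    file_prio = 200 - swappiness
    return anon_prio, file_prio

print(reclaim_weights(60))   # (60, 140): file pages scanned ~2.3x as hard
print(reclaim_weights(0))    # (0, 200): avoid anon reclaim unless forced
```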
PSI: Pressure Stall Information
PSI (kernel 4.20+) directly measures the percentage of wall-clock time tasks spend stalled on memory, replacing guesswork from indirect counters:
$ cat /proc/pressure/memory
some avg10=4.67 avg60=2.15 avg300=0.88 total=8842892
full avg10=1.03 avg60=0.47 avg300=0.19 total=1924810
"some" = at least one task stalled. "full" = all tasks stalled. Thresholds: some avg10 < 5% is healthy; 5-15% warrants investigation; > 25% means visible degradation; full > 5% is critical.
Kubelet is adopting PSI signals for pod eviction decisions, replacing crude MemAvailable thresholds that incorrectly treat reclaimable page cache as consumed memory.
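A minimal parser applying the thresholds above (the classify() helper and its cutoffs are the rules of thumb from this section, not kernel constants):

```python
# Parse /proc/pressure/memory output and apply rough health thresholds.
def parse_psi(text: str) -> dict:
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()   # e.g. "some", ["avg10=4.67", ...]
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

def classify(psi: dict) -> str:
    some10, full10 = psi["some"]["avg10"], psi["full"]["avg10"]
    if full10 > 5 or some10 > 25:
        return "critical"
    if some10 >= 5:
        return "investigate"
    return "healthy"

sample = """some avg10=4.67 avg60=2.15 avg300=0.88 total=8842892
full avg10=1.03 avg60=0.47 avg300=0.19 total=1924810"""
print(classify(parse_psi(sample)))   # healthy
```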
Kubernetes Swap Support (1.28+)
The NodeSwap feature gate (beta in 1.28) enables controlled swap with LimitedSwap behavior. Guaranteed QoS pods (requests == limits) never touch swap, and neither do BestEffort pods. Burstable pods get swap proportional to their memory request relative to node memory: request / node_memory * node_swap.
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap
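The proportioning can be sketched as follows (a hedged approximation based on the request-relative scheme; check the kubelet documentation for the exact upstream formula):

```python
# Hedged sketch of LimitedSwap: Burstable containers get swap in proportion
# to their memory request; requests == limits (Guaranteed) means no swap.
def limited_swap_bytes(request: int, limit: int,
                       node_memory: int, node_swap: int) -> int:
    if request == limit:        # Guaranteed QoS: no swap access
        return 0
    return request * node_swap // node_memory

GiB = 1 << 30
# Burstable pod requesting 2 GiB on a 64 GiB node with 8 GiB swap:
print(limited_swap_bytes(2 * GiB, 4 * GiB, 64 * GiB, 8 * GiB) // (1 << 20))  # 256 (MiB)
```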
The JVM Swap Disaster
During major GC, the collector touches every live object in the old generation. With an 8 GB heap at 70% occupancy and 20% swapped out (1.6 GB), GC triggers ~409,600 major page faults. At 100 us each on NVMe, that is ~41 seconds of cumulative I/O wait spread across parallel GC threads -- a GC pause that should take 200 ms balloons to 5+ seconds. The fix: swapoff -a, -XX:+AlwaysPreTouch + mlock, or memory.swap.max=0 in the cgroup.
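The back-of-envelope above, as a function (illustrative arithmetic only; real fault costs vary with device, queue depth, and readahead):

```python
# Cumulative swap-in cost when a major GC touches every swapped heap page.
def gc_swap_penalty_s(heap_gib: float, swapped_frac: float,
                      fault_cost_us: float, page_kib: int = 4) -> float:
    faults = heap_gib * (1 << 20) / page_kib * swapped_frac  # swapped pages
    return faults * fault_cost_us / 1e6                      # seconds of I/O wait

print(gc_swap_penalty_s(8, 0.20, 100))   # ~42 s cumulative on NVMe (100 us/fault)
```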
Docker and cgroup v2 Swap Control
Docker's --memory-swap flag specifies the total (RAM + swap), not just swap. --memory=4g --memory-swap=4g means zero swap. This is frequently misconfigured. In cgroup v2, the controls are explicit: memory.max for RAM, memory.swap.max for swap, memory.pressure for per-cgroup PSI.
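The flag semantics reduce to simple subtraction, which is worth writing down given how often it is misread:

```python
# --memory-swap is the TOTAL (RAM + swap); the swap allowance is the difference.
def swap_allowance(memory: int, memory_swap: int) -> int:
    return memory_swap - memory

GiB = 1 << 30
print(swap_allowance(4 * GiB, 4 * GiB) // GiB)   # 0 -> swapping disabled
print(swap_allowance(4 * GiB, 5 * GiB) // GiB)   # 1 -> 1 GiB of swap headroom
```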
Common Questions
Does swappiness=0 disable swap? No. It tells the kernel to strongly prefer file reclaim, but under severe pressure, pages will still be swapped. Use swapoff -a or memory.swap.max=0 to truly disable swap.
How much swap should a server have? The "2x RAM" rule is wrong for servers. 1-4 GB on SSD/NVMe provides a safety buffer for transient spikes without enabling sustained swapping. Database and JVM servers should consider zero swap or mlock.
Why does free memory show 200 MB on a 64 GB server? Normal. Linux uses free memory for page cache. Monitor MemAvailable (accounts for reclaimable cache), not MemFree. A system with 200 MB free but 30 GB available is healthy.
What is the swap cache? An in-memory mapping between swap slots and page frames. Pages stay cached after swap-in until modified. If swapped out again before modification, no write is needed -- the swap copy is still valid. This reduces swap I/O by 30-50% for read-heavy workloads.
How Technologies Use This
A container running a Python data pipeline steadily leaks memory until the cgroup hits its 4 GB limit. The kernel begins direct reclaim inside the container's allocation path, stalling every malloc() for 50-200 ms. Throughput drops 80% and upstream services start timing out within minutes.
The root cause is that Docker sets memory.max in cgroup v2 but leaves memory.swap.max at its default, which on many distributions means the container can swap indefinitely. Once the working set exceeds physical memory, the kernel swaps anonymous pages in and out on every access. Each page fault costs 2-10 ms on SSD, 20-50 ms on spinning disk, turning a fast pipeline into a crawl.
The fix is explicit: set --memory=4g --memory-swap=4g (which means zero swap allowed, since memory-swap is total = RAM + swap). For workloads that genuinely benefit from a small swap buffer, set --memory-swap=5g for 1 GB of swap headroom. Monitor container_memory_swap from cAdvisor. If swap usage exceeds 10% of the memory limit for more than 60 seconds, the workload needs more RAM or a memory leak fix -- not more swap.
A 200-node cluster running Kubernetes 1.26 has swap disabled on every node because kubelet refused to start with swap enabled. One node experiences a transient memory spike from a log aggregator pod, and the OOM killer terminates the node's kubelet process itself. The entire node goes NotReady, triggering rescheduling of 47 pods across the cluster.
The root cause is the historical Kubernetes requirement that swap be disabled. Without swap, there is zero buffer between memory exhaustion and OOM kills. The kernel has no fallback -- when free memory hits the min watermark and direct reclaim cannot free pages fast enough, the OOM killer fires immediately.
Starting in Kubernetes 1.28 (beta), the NodeSwap feature gate allows limited swap usage. Setting failSwapOn=false and memorySwap.swapBehavior=LimitedSwap lets pods in Burstable QoS use swap proportional to their memory request, while Guaranteed QoS pods never touch swap. Kubelet also integrates PSI (Pressure Stall Information) from /proc/pressure/memory to trigger pod eviction before the OOM killer fires. Enabling 2 GB of swap per node with PSI-based eviction at some=10% eliminated OOM kills on the cluster entirely.
A Java application with -Xmx8g runs on a 12 GB host. After a few hours, response times spike from 5 ms to 800 ms during major GC pauses. The GC log shows a full GC that should take 200 ms is taking 4-6 seconds. The host has 16 GB of swap, and sar shows 2 GB of heap pages have been swapped out.
The root cause is that major GC must touch every object in the old generation to determine reachability. When 25% of the old gen has been swapped out, the GC thread triggers thousands of swap-in page faults, each costing ~100 us on NVMe up to several ms on a SATA SSD. A full GC that scans 6 GB of heap with 1.5 GB swapped triggers roughly 384,000 page faults (1.5 GB / 4 KB); even at 100 us each on fast NVMe, that is over 38 seconds of cumulative I/O wait spread across GC threads. The GC pause balloons by 10-100x.
The fix is to either disable swap entirely for JVM hosts (swapoff -a), or pin the JVM heap with mlock via -XX:+AlwaysPreTouch combined with mlockall in a wrapper. The nuclear option is vm.swappiness=0, which tells kswapd to avoid swapping anonymous pages unless the system is critically low on memory. For containerized JVMs, set memory.swap.max=0 in the cgroup.
A PostgreSQL 15 instance with shared_buffers=4GB on a 16 GB server starts experiencing query latencies jumping from 2 ms to 500 ms. The server has 8 GB of swap configured, and vmstat shows si (swap-in) at 40,000 KB/s during peak query load. PostgreSQL shared buffers are backed by shared memory (anonymous pages), and the kernel has swapped out cold portions of the buffer pool.
The root cause is that PostgreSQL implements its own buffer replacement algorithm (clock sweep) on top of shared_buffers, but the kernel does not know that. The kernel sees anonymous pages that have not been accessed recently and swaps them out to make room for page cache. When a query needs a page from shared_buffers that has been swapped, it stalls on a major page fault. With vm.swappiness=60 (default), the kernel aggressively reclaims anonymous pages in favor of file cache.
The fix is a combination: set vm.swappiness=10 (prefer reclaiming file cache over anonymous pages), enable huge_pages=on in postgresql.conf (huge pages are not swappable), and allow mlock via the PostgreSQL systemd unit (LimitMEMLOCK=infinity). For cgroup v2 environments, set memory.swap.max=0 for the PostgreSQL cgroup. Monitoring should alert when pg_stat_bgwriter shows buffers_backend_fsync spikes coinciding with si/so activity in vmstat.
Same Concept Across Tech
| Concept | Docker | Kubernetes | JVM | PostgreSQL |
|---|---|---|---|---|
| Swap limit | --memory-swap flag (memory.swap.max) | memory.swap.max via LimitedSwap | N/A (OS-level) | N/A (OS-level) |
| Disable swap | --memory-swap=same as --memory | Guaranteed QoS + LimitedSwap | swapoff -a or cgroup | vm.swappiness=1 + huge_pages |
| Memory pressure signal | cAdvisor container_memory_working_set | PSI-based eviction in kubelet | GC log pause times | pg_stat_activity wait events |
| Pin memory | N/A | N/A | -XX:+AlwaysPreTouch + mlock | LimitMEMLOCK=infinity in systemd |
| Swappiness control | Per-cgroup memory.swap.max | Per-pod via cgroup v2 | Host-level vm.swappiness | Host-level vm.swappiness |
Stack Layer Mapping
| Layer | Reclaim Mechanism |
|---|---|
| Hardware | DRAM vs swap device (SSD/NVMe) I/O latency: 100ns vs 10-100us |
| Kernel | kswapd background reclaim, direct reclaim, LRU lists, watermarks |
| cgroup v2 | memory.max, memory.swap.max, memory.pressure (PSI) per group |
| Container runtime | Translates --memory/--memory-swap flags to cgroup v2 knobs |
| Orchestrator | Kubelet PSI-based eviction, NodeSwap feature gate, QoS classes |
| Application | GC behavior, mlock/mlockall, huge pages, buffer pool sizing |
Design Rationale
The kernel keeps free memory intentionally low because unused RAM is wasted RAM -- it should be used for page cache. The three-watermark system (min/low/high) creates a staged response: background reclaim first, then synchronous reclaim, then OOM kill as the last resort. Swap exists because anonymous pages (heap, stack) cannot simply be dropped like file cache -- they have no backing store on disk. The swap device provides that backing store. The LRU split into anonymous and file lists lets vm.swappiness control the tradeoff: reclaim cache pages (fast, no I/O needed for clean pages) versus reclaim anonymous pages (requires swap write). This is fundamentally a latency vs throughput tradeoff that must be tuned per workload.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| High si/so in vmstat | Active swap I/O, working set exceeds RAM | swapon --show and cat /proc/meminfo |
| Application latency spikes every few minutes | Direct reclaim stalling allocation paths | cat /proc/vmstat \| grep allocstall |
| JVM GC pauses 10-100x longer than expected | Heap pages swapped out, GC triggers mass swap-in | vmstat 1 during GC, check si column |
| OOM kills with swap available | vm.swappiness too low or cgroup memory.max hit | dmesg \| grep -i oom |
| MemAvailable dropping while MemFree is stable | Page cache being consumed, reclaim keeping up | cat /proc/meminfo -- this is normal |
| PSI "some" above 25% | Severe memory contention, tasks frequently stalled | cat /proc/pressure/memory |
| kswapd consuming high CPU | Continuous reclaim due to sustained allocation pressure | top or pidstat -u -p $(pgrep kswapd) |
When to Use / Avoid
Use when:
- Diagnosing latency spikes that correlate with memory pressure on the host
- Tuning vm.swappiness for database or JVM workloads that must not be swapped
- Configuring Kubernetes nodes with swap enabled (1.28+ NodeSwap feature gate)
- Setting cgroup v2 memory.swap.max limits for containerized workloads
- Investigating OOM kills that seem premature or target the wrong process
- Building monitoring dashboards that track PSI metrics for memory pressure
Avoid when:
- The system has abundant free memory and no swap usage (nothing to tune)
- Running in a serverless environment where the platform manages memory entirely
- The performance bottleneck is CPU or network, not memory (check PSI first)
Try It Yourself
# Show real-time swap and memory stats (1-second interval)
vmstat 1 5

# Check current swap usage and devices
swapon --show && free -h

# Show memory pressure (PSI) -- requires kernel 4.20+
cat /proc/pressure/memory 2>/dev/null || echo 'PSI not available (kernel < 4.20)'

# Show kswapd and direct reclaim scan counters
cat /proc/vmstat | grep -E 'pgscank|pgscand|pswpin|pswpout|pgsteal|pgrefill'

# Check current swappiness and watermark settings
echo "swappiness: $(cat /proc/sys/vm/swappiness)" && echo "watermark_scale: $(cat /proc/sys/vm/watermark_scale_factor)"

# Show per-zone watermark levels
cat /proc/zoneinfo | grep -E 'Node|zone|min|low|high|free ' | head -30

# Show LRU list sizes from /proc/meminfo
cat /proc/meminfo | grep -E 'Active|Inactive|Anon|File|Swap|Mlocked|Unevictable'

# Check cgroup v2 swap limit for a container
find /sys/fs/cgroup -name memory.swap.max 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "{}:" && cat {}'

# Monitor swap-in rate and correlate with process
pidstat -r 1 5 2>/dev/null || echo 'pidstat not available (install sysstat)'

# Show which processes are using the most swap
for f in /proc/[0-9]*/status; do awk '/VmSwap/{swap=$2} /Name/{name=$2} END{if(swap>0) print swap" kB "name}' "$f" 2>/dev/null; done | sort -rn | head -10

Debug Checklist
1. vmstat 1 -- watch si/so columns for active swap I/O
2. cat /proc/meminfo | grep -E 'MemFree|MemAvailable|SwapTotal|SwapFree|Active|Inactive' -- memory breakdown
3. cat /proc/pressure/memory -- PSI stall percentages
4. cat /proc/vmstat | grep -E 'pgscank|pgscand|pswpin|pswpout' -- reclaim and swap counters
5. swapon --show -- list swap devices with usage
6. cat /proc/sys/vm/swappiness -- current swappiness value
7. dmesg | grep -i 'oom\|killed\|reclaim' -- recent OOM or reclaim events
8. cat /sys/fs/cgroup/<path>/memory.swap.current -- cgroup swap usage
Key Takeaways
- ✓ kswapd is the background janitor. It wakes when free memory drops below the low watermark and reclaims pages until the high watermark is restored. Direct reclaim is the penalty -- when kswapd cannot keep up, the thread that called malloc() does the reclaim work itself and blocks until a page is freed.
- ✓ The kernel maintains four LRU lists (per NUMA node on modern kernels): active/inactive for both anonymous and file-backed pages. The second-chance algorithm means a page must be found unreferenced twice before eviction -- once to move from active to inactive, once to actually reclaim it from the tail of the inactive list.
- ✓ vm.swappiness controls the ratio of anonymous vs file page reclaim. At 60 (default), the kernel reclaims both roughly equally. At 0, the kernel avoids swapping anonymous pages unless the system is critically low on memory. Since kernel 5.8 the value can be raised to 200, and cgroup v2 gives each cgroup its own swap limit via memory.swap.max.
- ✓ Swap is not inherently bad. It lets the kernel move genuinely cold anonymous pages (e.g., init-time data never touched again) to disk, freeing RAM for hot working sets. The disaster scenario is when actively used pages get swapped -- especially JVM heaps during GC, where the collector touches every page and triggers mass swap-in.
- ✓ PSI (Pressure Stall Information) replaced guesswork with measurement. Instead of inferring memory pressure from free memory counters (which are misleading because the kernel intentionally keeps free memory low), PSI directly measures how much time tasks spend waiting for memory.
- ✓ Kubernetes disabled swap for years because the scheduler assumed all pod memory was resident in RAM. Since 1.28, LimitedSwap mode allows Burstable pods to use swap proportional to their memory request while Guaranteed pods remain swap-free. This gives the system a buffer against transient spikes without breaking QoS guarantees.
Common Pitfalls
- ✗ Mistake: Setting vm.swappiness=0 and assuming swap is disabled. Reality: swappiness=0 does not disable swap. It tells the kernel to strongly prefer reclaiming file pages over anonymous pages, but under severe memory pressure the kernel will still swap. To truly prevent swapping, use swapoff -a or set memory.swap.max=0 in the cgroup.
- ✗ Mistake: Monitoring free memory and panicking when it is low. Reality: Linux intentionally keeps free memory low by using it for page cache. A system showing 200 MB free out of 64 GB is probably healthy -- the rest is cache that can be reclaimed instantly. Check "available" memory from /proc/meminfo (MemAvailable), not "free" (MemFree).
- ✗ Mistake: Disabling swap entirely on all servers. Reality: A small amount of swap (1-2 GB) provides a safety valve for transient spikes. Without swap, the gap between "memory is tight" and "OOM killer fires" is zero. Swap gives kswapd somewhere to put genuinely cold pages and buys time for alerts to fire before processes die.
- ✗ Mistake: Running JVM or PostgreSQL with default vm.swappiness=60 and large swap. Reality: GC-heavy workloads and database buffer pools must not be swapped. Major GC touches every object in the heap, so swapped-out pages cause 10-100x pause time amplification. Use swappiness=10, mlock, or cgroup swap limits for these workloads.
Reference
In One Line
kswapd reclaims pages in the background when free memory drops below the low watermark; if it cannot keep up, the allocating thread stalls in direct reclaim until a page is freed or the OOM killer intervenes.