NUMA Architecture & Memory Policy
Mental Model
Two workshops sit on opposite sides of a corridor. Each has a tool rack right next to the workbench -- grabbing a tool takes 2 seconds. But reaching across the corridor to the other rack takes 4. Now suppose the supervisor stocked every tool on workshop A's rack, even though half the crew works in B. Those workers burn their day walking back and forth. Spread the tools across both racks, and each worker finds roughly half within arm's reach. Personal tools that only one person ever uses belong at that person's station exclusively.
The Problem
After migrating a database from one socket to two, throughput drops 30% even though CPU and memory metrics look fine. The postmaster initialized 16 GB of shared_buffers from a single thread on CPU 0, so first-touch placed every page on node 0. Now the 16 backends on node 1 pay 150 ns per buffer access instead of 90 ns -- a 67% penalty compounding across millions of reads per second. Half the server's memory bandwidth sits unused. Meanwhile, PCIe NICs on node 1 are DMA-ing into node 0's memory, dragging cross-interconnect latency into every network packet too.
Architecture
The team just upgraded to a dual-socket server. Twice the cores. Twice the RAM. Twice the performance was the expectation.
Instead, the database is 30% slower.
CPU is not maxed. Memory is plentiful. Disk I/O is unchanged. Every metric says the system is fine. But it is not fine, and the reason is invisible without knowing where to look.
What Actually Happens
Each CPU socket has its own memory controller and attached DRAM, forming a NUMA node. A CPU accessing its local node's memory gets full bandwidth (80 GB/s) at minimum latency (90 ns). Accessing memory on a remote node requires crossing the interconnect (Intel UPI or AMD Infinity Fabric) -- 150 ns latency, half the bandwidth. On a quad-socket server, some nodes are two hops away, making the penalty even worse.
The kernel's page allocator is NUMA-aware. The default policy is first-touch: when a page fault occurs, the physical page is allocated on the NUMA node of the CPU that triggered the fault.
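To make first-touch concrete, here is a small illustrative sketch (not from any particular codebase) that pins itself to node 0, faults in one page, and then asks the kernel where that page landed; move_pages(2) with a NULL target list queries placement without migrating anything. It assumes a multi-node machine and libnuma (build with -lnuma).

```c
/* Sketch: demonstrate first-touch by faulting a page from node 0 and asking
 * the kernel where it landed. Assumes a multi-node machine and libnuma. */
#include <numa.h>        /* numa_available, numa_run_on_node */
#include <numaif.h>      /* move_pages */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) return 1;

    numa_run_on_node(0);                 /* run (and fault) from a node-0 CPU */

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    memset(p, 1, 4096);                  /* the first touch happens here */

    /* move_pages with a NULL node list queries placement without migrating. */
    void *pages[1] = { p };
    int status[1];
    move_pages(0, 1, pages, NULL, status, 0);
    printf("page sits on node %d\n", status[0]);   /* expect: node 0 */
    return 0;
}
```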
This sounds reasonable. But here is where things break.
The initialization trap. PostgreSQL starts a single postmaster process on node 0. The postmaster allocates shared_buffers -- all 16 GB of it. First-touch puts every page on node 0. Backend processes then run on both sockets and hit those buffers, so every backend scheduled on node 1 pays the remote penalty on every buffer read. Throughput drops 30%.
The fix is one command: numactl --interleave=all postgres. This sets MPOL_INTERLEAVE, distributing pages round-robin across all nodes. Now half the buffer accesses from each node are local. Not perfect, but dramatically better than everything on one node.
Under the Hood
The kernel provides four memory policies via set_mempolicy() and mbind():
MPOL_DEFAULT (first-touch): pages go to the faulting CPU's node. Works great when threads are pinned to CPUs and access private data. Terrible for shared data structures initialized by one thread.
MPOL_BIND: restricts allocation to specific nodes. Useful for isolation, but dangerous -- if those nodes run out of memory, OOM triggers even if other nodes have plenty free.
MPOL_INTERLEAVE: distributes pages round-robin. Averages out latency and bandwidth across all nodes. Ideal for shared data structures (hash tables, buffer pools, caches) accessed from everywhere.
MPOL_PREFERRED: tries a specific node first, falls back to others if exhausted. A soft preference that will not cause OOM.
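As a sketch of how a program would request interleaving for a shared region itself, rather than via numactl, the example below applies MPOL_INTERLEAVE with mbind(2) before the pages are first touched. The two-node mask and 1 GB size are arbitrary assumptions for illustration; build with -lnuma for the <numaif.h> declarations.

```c
/* Sketch: interleave a shared region across nodes 0 and 1 before first touch. */
#include <numaif.h>      /* mbind, MPOL_INTERLEAVE */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 1UL << 30;                       /* e.g. a shared buffer pool */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bit i set => node i may be used. Apply the policy BEFORE touching the
     * pages; pages that have already faulted keep their first-touch node. */
    unsigned long nodemask = 0x3;                 /* nodes 0 and 1 */
    if (mbind(buf, len, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }

    memset(buf, 0, len);   /* first touch: physical pages now alternate nodes */
    return 0;
}
```

numactl --interleave=all achieves the same effect process-wide by setting the policy with set_mempolicy() before exec'ing the target program.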
AutoNUMA is the kernel's automatic fix. When enabled (/proc/sys/kernel/numa_balancing), the kernel periodically marks PTEs as inaccessible (PROT_NONE). When a thread accesses a marked page, the resulting fault tells the kernel which CPU/node triggered it. If the accessing node differs from the page's current node, AutoNUMA schedules a migration.
Each page migration copies 4 KB of data (~1 us), updates all reverse-mapped PTEs, and triggers TLB shootdowns. Total cost: 20-50 us per page. Migrating 1 GB takes 5-12 seconds. AutoNUMA works well for stable access patterns but can thrash pages between nodes when access patterns shift.
NUMA affects more than memory. PCIe devices are attached to specific NUMA nodes. A NIC on node 1 DMAs packets into node 1's memory. If the receiving thread runs on node 0, every packet access crosses the interconnect. The fix: pin NIC interrupt handlers and packet processing threads to the same node as the NIC. irqbalance helps, but for high-performance networking (DPDK), explicit CPU and memory pinning is essential.
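A minimal sketch of that pinning with libnuma (the PCI address is only an example; substitute the NIC's real address from lspci): it reads the device's numa_node attribute from sysfs, then restricts the calling thread's CPUs and memory allocations to that node.

```c
/* Sketch: pin the current thread and its allocations to the NIC's NUMA node. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) return 1;

    FILE *f = fopen("/sys/bus/pci/devices/0000:3b:00.0/numa_node", "r");
    if (!f) { perror("fopen"); return 1; }
    int node = -1;
    if (fscanf(f, "%d", &node) != 1) node = -1;
    fclose(f);
    if (node < 0) return 1;        /* -1 means the device reports no affinity */

    /* Run on the NIC's node and allocate memory only from it, so packet
     * buffers and the threads that touch them stay on the same socket. */
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, (unsigned int)node);
    numa_run_on_node(node);
    numa_set_membind(nodes);
    numa_free_nodemask(nodes);

    /* ... start packet-processing loop here ... */
    return 0;
}
```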
Memory bandwidth saturation is the other NUMA trap. Each memory controller handles 60-80 GB/s. If all threads access one node, that controller saturates while others sit idle. Interleave policy effectively multiplies bandwidth by distributing load across all controllers.
Common Questions
A database is 30% slower on 2 sockets than 1 socket with the same total resources. What is wrong?
Almost certainly NUMA misplacement. Run numastat -p <pid>. If one node holds most of the allocated memory but CPUs on both nodes are active, threads on the remote node pay 1.5-2x latency on every access. Fix: start with numactl --interleave=all for shared data. For thread-private data, pin threads to specific nodes and use local allocation.
When is interleave the right choice vs bind?
Interleave for shared data accessed from all nodes (hash tables, buffer pools, shared caches). Bind for thread-private data where guaranteed local access is needed. Never bind shared data to one node -- it concentrates all remote traffic on one interconnect link.
How does AutoNUMA detect remote accesses?
It periodically marks a subset of PTEs as PROT_NONE. When any thread accesses a marked page, a fault fires. The handler records which node triggered it. If the accessing node is different from the page's current node, the page is scheduled for migration. Scan rate is configurable via /proc/sys/kernel/numa_balancing_scan_delay_ms.
What is the cost of migrate_pages()?
Per page: allocate on destination, copy 4 KB (~1 us), update all PTEs (reverse mapping walk), TLB shootdown on all CPUs with cached translations. Total: 20-50 us per page. For 1 GB of data: 5-12 seconds. This is why AutoNUMA only migrates pages with clear access patterns -- speculative migration is too expensive.
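For explicit migration from user space, libnuma wraps the migrate_pages(2) syscall as numa_migrate_pages(). The sketch below (the pid comes from the command line; moving another process's pages requires CAP_SYS_NICE) moves everything a process has on node 0 over to node 1, which is what the migratepages(8) tool does:

```c
/* Sketch: move every page a process owns from node 0 to node 1.
 * Build with -lnuma. Expect roughly 20-50 us of kernel work per page. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0 || argc < 2) return 1;
    int pid = atoi(argv[1]);

    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    numa_bitmask_setbit(from, 0);          /* source: node 0 */
    numa_bitmask_setbit(to, 1);            /* destination: node 1 */

    /* Returns the number of pages that could NOT be moved, or -1 on error. */
    int left = numa_migrate_pages(pid, from, to);
    printf("pages not moved: %d\n", left);

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return left < 0;
}
```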
How Technologies Use This
A 32 GB JVM on a dual-socket server shows 40% higher GC pause times than the same workload on a single socket with identical total resources. The team expected a performance boost from doubling the cores, but GC latency got significantly worse.
Without -XX:+UseNUMA, the entire heap concentrates on the NUMA node where the JVM started. GC threads running on the remote node pay 150 ns per memory access instead of 90 ns -- a 67% penalty on every object scan, pointer update, and card table check. The remote access penalty applies to millions of operations per GC cycle, compounding into massive pause time inflation.
Enable -XX:+UseNUMA so G1 and ZGC allocate eden regions on the local node of each application thread, keeping the hot young-generation allocation path under 90 ns. Promotion to old-gen uses interleaved placement so long-lived objects spread across both nodes, delivering 25-35% lower GC pause times and more predictable p99 latency on dual-socket machines.
A PostgreSQL cluster loses 30% throughput after migrating from a single-socket to a dual-socket server with twice the cores. CPU utilization and memory pressure look fine, but query latency has increased across the board with no obvious explanation.
The postmaster initializes shared_buffers in a single thread on CPU 0, and first-touch policy places all 16 GB of buffer pages on node 0. The 16 backends running on node 1 pay 150 ns instead of 90 ns on every buffer access -- a 67% latency penalty applied millions of times per second. Half the server's memory bandwidth sits idle because only one memory controller is being used.
Run numactl --interleave=all postgres to distribute pages round-robin across both nodes, ensuring that on average 50% of accesses from any socket are local. This single command recovers 20-25% of the lost throughput and doubles available memory bandwidth from 80 GB/s to 160 GB/s by utilizing both memory controllers.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| NUMA awareness | No built-in NUMA support | -XX:+UseNUMA (G1/ZGC node-local eden) | N/A (single-threaded event loop) | runtime.GOMAXPROCS + OS-level pinning | Topology Manager (kubelet) |
| Memory placement | Inherits host NUMA policy | Heap interleaved with UseNUMA | N/A | Inherits host NUMA policy | --topology-manager-policy=single-numa-node |
| CPU pinning | --cpuset-cpus | taskset/numactl at launch (-XX:ActiveProcessorCount only sets the CPU count the JVM assumes) | N/A | runtime.LockOSThread + affinity | CPU Manager with static policy |
| Device affinity | N/A | N/A | N/A | N/A | Device plugins expose NUMA topology |
| Monitoring | numastat on host | GC logs show NUMA-aware eden allocation | N/A | N/A | Node-level numastat via monitoring agents |
Stack Layer Mapping
| Layer | NUMA Mechanism |
|---|---|
| Hardware | Memory controllers per socket, UPI/Infinity Fabric interconnect |
| Kernel page allocator | zonelist fallback order: local node first, remote nodes last |
| Memory policy | set_mempolicy()/mbind() set per-process or per-VMA allocation rules |
| AutoNUMA | PTE scanning + page migration for automatic rebalancing |
| Container runtime | No NUMA awareness by default; requires cpuset/topology manager |
| Application | numactl wrapper or explicit libnuma calls for placement control |
Design Rationale
First-touch works as the default because most programs access data from the same thread that allocated it -- local placement is correct without any configuration. Shared data structures like database buffers and caches break that assumption: they get hit from every socket, so MPOL_INTERLEAVE distributes pages round-robin to average out latency and spread bandwidth across controllers. AutoNUMA tries to fix bad placement automatically for general-purpose workloads, though the 20-50 us cost of migrating each page means it only pays off when access patterns hold steady.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| 30% throughput drop on dual-socket vs single-socket | All shared memory on one node, remote access penalty | numastat -p <pid> to check memory distribution |
| High numa_miss in /proc/vmstat | Pages allocated on remote node due to first-touch misplacement | cat /sys/devices/system/node/node*/numastat for miss counts |
| JVM GC pauses 40% higher on dual-socket | GC threads on remote node scanning heap on local node | Enable -XX:+UseNUMA and verify with GC logs |
| Memory bandwidth saturated on one node | All allocations concentrated on one node, other idle | numastat -m to compare node utilization |
| OOM on one node despite free memory on others | MPOL_BIND restricting allocations to exhausted node | Check cat /proc/<pid>/numa_maps for bind policies |
| NIC throughput lower than expected | NIC on different NUMA node than packet processing threads | cat /sys/bus/pci/devices/<nic>/numa_node |
When to Use / Avoid
Use when:
- Running databases (PostgreSQL, MongoDB) on multi-socket servers with large shared buffers
- Tuning JVM GC on dual-socket machines (-XX:+UseNUMA)
- High-performance networking where NIC NUMA affinity matters (DPDK, XDP)
- Diagnosing unexplained throughput drops after migrating to larger hardware
- Pinning latency-sensitive workloads to specific NUMA nodes
Avoid when:
- Running on single-socket servers (NUMA is irrelevant)
- The workload is I/O-bound, not memory-bandwidth-bound
- Cloud VMs that abstract NUMA topology (check with numactl --hardware first)
Try It Yourself
```bash
# Show NUMA hardware topology
numactl --hardware

# Run process with memory interleaved across all nodes
numactl --interleave=all -- postgres -D /data

# Show per-process NUMA memory allocation
numastat -p $(pidof postgres)

# Check NUMA statistics per node
cat /sys/devices/system/node/node0/numastat

# Migrate a running process's memory to node 1
migratepages $(pidof myapp) 0 1

# Check NUMA node of a PCIe device (NIC)
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
```
Debug Checklist
1. numactl --hardware -- show NUMA topology (nodes, CPUs, memory per node)
2. numastat -p <pid> -- show per-node memory allocation for a process
3. numastat -m -- system-wide per-node memory stats
4. perf stat -e node-load-misses,node-store-misses -- count remote NUMA accesses
5. cat /sys/devices/system/node/node*/numastat -- per-node hit/miss counters
6. cat /sys/bus/pci/devices/<addr>/numa_node -- check NIC NUMA affinity
Key Takeaways
- ✓ Local memory access: 80-100 ns. Remote access via the interconnect (QPI/UPI): 130-200 ns. That is a 1.5-2x penalty on every single memory read -- and for bandwidth-bound workloads, remote access can cut throughput by 30-50%
- ✓ The default NUMA policy is 'first touch' -- pages land on whatever node the faulting CPU belongs to; if one thread initializes all the data, ALL of it ends up on one node, and every other node pays the remote penalty forever
- ✓ Interleave policy (MPOL_INTERLEAVE) distributes pages round-robin across all nodes -- this averages out latency and multiplies available bandwidth, making it ideal for shared data accessed from every socket
- ✓ AutoNUMA is the kernel's attempt to fix bad placement automatically -- it scans PTEs, detects remote access patterns, and migrates pages to the local node; but each migration costs ~20 us per page, so it only works for stable patterns
- ✓ NUMA is not just about memory -- PCIe devices are attached to specific nodes too; a NIC on node 1 doing DMA into node 0's memory crosses the interconnect on every packet
Common Pitfalls
- ✗ Running a database on a multi-socket server without NUMA awareness -- if shared buffers are allocated on node 0 (where postmaster starts) but queries run on both nodes, half the buffer accesses pay remote latency; use numactl --interleave=all
- ✗ Using --cpunodebind without --membind -- constraining CPUs to node 0 does not guarantee memory lands there; under pressure, the allocator falls back to remote nodes silently
- ✗ MPOL_BIND without monitoring -- binding to a single node means OOM when that node is exhausted, even if other nodes have gigabytes free; always monitor per-node memory with numastat
- ✗ Ignoring NUMA in containers -- Docker and Kubernetes do not enforce NUMA by default; a container's threads may run on CPUs across all nodes while its memory sits on one, creating the worst possible access pattern
Reference
In One Line
Interleave shared data across nodes, bind thread-private data locally -- that one distinction eliminates most cross-socket latency problems.