NUMA Architecture & Memory Policy
Mental Model
Two workshops sit on opposite sides of a corridor. Each has a tool rack right next to the workbench -- grabbing a tool takes 2 seconds. But reaching across the corridor to the other rack takes 4. Now suppose the supervisor stocked every tool on workshop A's rack, even though half the crew works in B. Those workers burn their day walking back and forth. Spread the tools across both racks, and each worker finds roughly half within arm's reach. Personal tools that only one person ever uses belong at that person's station exclusively.
The Problem
After migrating a database from one socket to two, throughput drops 30% even though CPU and memory metrics look fine. The postmaster initialized 16 GB of shared_buffers from a single thread on CPU 0, so first-touch placed every page on node 0. Now the 16 backends on node 1 pay 150 ns per buffer access instead of 90 ns -- a 67% penalty compounding across millions of reads per second. Half the server's memory bandwidth sits unused. Meanwhile, PCIe NICs on node 1 are DMA-ing into node 0's memory, dragging cross-interconnect latency into every network packet too.
Architecture
The team just upgraded to a dual-socket server. Twice the cores. Twice the RAM. Twice the performance was the expectation.
Instead, the database is 30% slower.
CPU is not maxed. Memory is plentiful. Disk I/O is unchanged. Every metric says the system is fine. But it is not fine, and the reason is invisible without knowing where to look.
What Actually Happens
Each CPU socket has its own memory controller and attached DRAM, forming a NUMA node. A CPU accessing its local node's memory gets full bandwidth (80 GB/s) at minimum latency (90 ns). Accessing memory on a remote node requires crossing the interconnect (Intel UPI or AMD Infinity Fabric) -- 150 ns latency, half the bandwidth. On a quad-socket server, some nodes are two hops away, making the penalty even worse.
The kernel's page allocator is NUMA-aware. The default policy is first-touch: when a page fault occurs, the physical page is allocated on the NUMA node of the CPU that triggered the fault.
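To make first-touch concrete, here is a small illustrative sketch (not from any particular codebase) that pins itself to node 0, faults in one page, and then asks the kernel where that page landed; move_pages(2) with a NULL target list queries placement without migrating anything. It assumes a multi-node machine and libnuma (build with -lnuma).

```c
/* Sketch: demonstrate first-touch by faulting a page from node 0 and asking
 * the kernel where it landed. Assumes a multi-node machine and libnuma. */
#include <numa.h>        /* numa_available, numa_run_on_node */
#include <numaif.h>      /* move_pages */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) return 1;

    numa_run_on_node(0);                 /* run (and fault) from a node-0 CPU */

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    memset(p, 1, 4096);                  /* the first touch happens here */

    /* move_pages with a NULL node list queries placement without migrating. */
    void *pages[1] = { p };
    int status[1];
    move_pages(0, 1, pages, NULL, status, 0);
    printf("page sits on node %d\n", status[0]);   /* expect: node 0 */
    return 0;
}
```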
This sounds reasonable. But here is where things break.
The initialization trap. PostgreSQL starts a single postmaster process on node 0. The postmaster allocates shared_buffers -- all 16 GB of it. First-touch puts every page on node 0. Backend processes then run on both sockets and hit those buffers, so every backend scheduled on node 1 pays the remote penalty on every buffer read. Throughput drops 30%.
The fix is one command: numactl --interleave=all postgres. This sets MPOL_INTERLEAVE, distributing pages round-robin across all nodes. Now half the buffer accesses from each node are local. Not perfect, but dramatically better than everything on one node.
Under the Hood
The kernel provides four memory policies via set_mempolicy() and mbind():
MPOL_DEFAULT (first-touch): pages go to the faulting CPU's node. Works great when threads are pinned to CPUs and access private data. Terrible for shared data structures initialized by one thread.
MPOL_BIND: restricts allocation to specific nodes. Useful for isolation, but dangerous -- if those nodes run out of memory, OOM triggers even if other nodes have plenty free.
MPOL_INTERLEAVE: distributes pages round-robin. Averages out latency and bandwidth across all nodes. Ideal for shared data structures (hash tables, buffer pools, caches) accessed from everywhere.
MPOL_PREFERRED: tries a specific node first, falls back to others if exhausted. A soft preference that will not cause OOM.
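As a sketch of how a program would request interleaving for a shared region itself, rather than via numactl, the example below applies MPOL_INTERLEAVE with mbind(2) before the pages are first touched. The two-node mask and 1 GB size are arbitrary assumptions for illustration; build with -lnuma for the <numaif.h> declarations.

```c
/* Sketch: interleave a shared region across nodes 0 and 1 before first touch. */
#include <numaif.h>      /* mbind, MPOL_INTERLEAVE */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 1UL << 30;                       /* e.g. a shared buffer pool */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bit i set => node i may be used. Apply the policy BEFORE touching the
     * pages; pages that have already faulted keep their first-touch node. */
    unsigned long nodemask = 0x3;                 /* nodes 0 and 1 */
    if (mbind(buf, len, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }

    memset(buf, 0, len);   /* first touch: physical pages now alternate nodes */
    return 0;
}
```

numactl --interleave=all achieves the same effect process-wide by setting the policy with set_mempolicy() before exec'ing the target program.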
AutoNUMA is the kernel's automatic fix. When enabled (/proc/sys/kernel/numa_balancing), the kernel periodically marks PTEs as inaccessible (PROT_NONE). When a thread accesses a marked page, the resulting fault tells the kernel which CPU/node triggered it. If the accessing node differs from the page's current node, AutoNUMA schedules a migration.
Each page migration copies 4 KB of data (~1 us), updates all reverse-mapped PTEs, and triggers TLB shootdowns. Total cost: 20-50 us per page. Migrating 1 GB takes 5-12 seconds. AutoNUMA works well for stable access patterns but can thrash pages between nodes when access patterns shift.
NUMA affects more than memory. PCIe devices are attached to specific NUMA nodes. A NIC on node 1 DMAs packets into node 1's memory. If the receiving thread runs on node 0, every packet access crosses the interconnect. The fix: pin NIC interrupt handlers and packet processing threads to the same node as the NIC. irqbalance helps, but for high-performance networking (DPDK), explicit CPU and memory pinning is essential.
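A minimal sketch of that pinning with libnuma (the PCI address is only an example; substitute the NIC's real address from lspci): it reads the device's numa_node attribute from sysfs, then restricts the calling thread's CPUs and memory allocations to that node.

```c
/* Sketch: pin the current thread and its allocations to the NIC's NUMA node. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) return 1;

    FILE *f = fopen("/sys/bus/pci/devices/0000:3b:00.0/numa_node", "r");
    if (!f) { perror("fopen"); return 1; }
    int node = -1;
    if (fscanf(f, "%d", &node) != 1) node = -1;
    fclose(f);
    if (node < 0) return 1;        /* -1 means the device reports no affinity */

    /* Run on the NIC's node and allocate memory only from it, so packet
     * buffers and the threads that touch them stay on the same socket. */
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, (unsigned int)node);
    numa_run_on_node(node);
    numa_set_membind(nodes);
    numa_free_nodemask(nodes);

    /* ... start packet-processing loop here ... */
    return 0;
}
```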
Memory bandwidth saturation is the other NUMA trap. Each memory controller handles 60-80 GB/s. If all threads access one node, that controller saturates while others sit idle. Interleave policy effectively multiplies bandwidth by distributing load across all controllers.
Common Questions
A database is 30% slower on 2 sockets than 1 socket with the same total resources. What is wrong?
Almost certainly NUMA misplacement. Run numastat -p <pid>. If one node holds most of the allocated memory but CPUs on both nodes are active, threads on the remote node pay 1.5-2x latency on every access. Fix: start with numactl --interleave=all for shared data. For thread-private data, pin threads to specific nodes and use local allocation.
When is interleave the right choice vs bind?
Interleave for shared data accessed from all nodes (hash tables, buffer pools, shared caches). Bind for thread-private data where guaranteed local access is needed. Never bind shared data to one node -- it concentrates all remote traffic on one interconnect link.
How does AutoNUMA detect remote accesses?
It periodically marks a subset of PTEs as PROT_NONE. When any thread accesses a marked page, a fault fires. The handler records which node triggered it. If the accessing node is different from the page's current node, the page is scheduled for migration. Scan rate is configurable via /proc/sys/kernel/numa_balancing_scan_delay_ms.
What is the cost of migrate_pages()?
Per page: allocate on destination, copy 4 KB (~1 us), update all PTEs (reverse mapping walk), TLB shootdown on all CPUs with cached translations. Total: 20-50 us per page. For 1 GB of data: 5-12 seconds. This is why AutoNUMA only migrates pages with clear access patterns -- speculative migration is too expensive.
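For explicit migration from user space, libnuma wraps the migrate_pages(2) syscall as numa_migrate_pages(). The sketch below (the pid comes from the command line; moving another process's pages requires CAP_SYS_NICE) moves everything a process has on node 0 over to node 1, which is what the migratepages(8) tool does:

```c
/* Sketch: move every page a process owns from node 0 to node 1.
 * Build with -lnuma. Expect roughly 20-50 us of kernel work per page. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0 || argc < 2) return 1;
    int pid = atoi(argv[1]);

    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    numa_bitmask_setbit(from, 0);          /* source: node 0 */
    numa_bitmask_setbit(to, 1);            /* destination: node 1 */

    /* Returns the number of pages that could NOT be moved, or -1 on error. */
    int left = numa_migrate_pages(pid, from, to);
    printf("pages not moved: %d\n", left);

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return left < 0;
}
```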
How Technologies Use This
A 32 GB JVM on a dual-socket server shows 40% higher GC pause times than the same workload on a single socket with identical total resources. The team expected a performance boost from doubling the cores, but GC latency got significantly worse.
Without -XX:+UseNUMA, the entire heap concentrates on the NUMA node where the JVM started. GC threads running on the remote node pay 150 ns per memory access instead of 90 ns -- a 67% penalty on every object scan, pointer update, and card table check. The remote access penalty applies to millions of operations per GC cycle, compounding into massive pause time inflation.
Enable -XX:+UseNUMA so G1 and ZGC allocate eden regions on the local node of each application thread, keeping the hot young-generation allocation path under 90 ns. Promotion to old-gen uses interleaved placement so long-lived objects spread across both nodes, delivering 25-35% lower GC pause times and more predictable p99 latency on dual-socket machines.
A PostgreSQL cluster loses 30% throughput after migrating from a single-socket to a dual-socket server with twice the cores. CPU utilization and memory pressure look fine, but query latency has increased across the board with no obvious explanation.
The postmaster initializes shared_buffers in a single thread on CPU 0, and first-touch policy places all 16 GB of buffer pages on node 0. The 16 backends running on node 1 pay 150 ns instead of 90 ns on every buffer access -- a 67% latency penalty applied millions of times per second. Half the server's memory bandwidth sits idle because only one memory controller is being used.
Run numactl --interleave=all postgres to distribute pages round-robin across both nodes, ensuring that on average 50% of accesses from any socket are local. This single command recovers 20-25% of the lost throughput and doubles available memory bandwidth from 80 GB/s to 160 GB/s by utilizing both memory controllers.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| NUMA awareness | No built-in NUMA support | -XX:+UseNUMA (G1/ZGC node-local eden) | N/A (single-threaded event loop) | runtime.GOMAXPROCS + OS-level pinning | Topology Manager (kubelet) |
| Memory placement | Inherits host NUMA policy | Heap interleaved with UseNUMA | N/A | Inherits host NUMA policy | --topology-manager-policy=single-numa-node |
| CPU pinning | --cpuset-cpus | taskset/numactl at launch (-XX:ActiveProcessorCount only sets the CPU count the JVM assumes) | N/A | runtime.LockOSThread + affinity | CPU Manager with static policy |
| Device affinity | N/A | N/A | N/A | N/A | Device plugins expose NUMA topology |
| Monitoring | numastat on host | GC logs show NUMA-aware eden allocation | N/A | N/A | Node-level numastat via monitoring agents |
Stack Layer Mapping
| Layer | NUMA Mechanism |
|---|---|
| Hardware | Memory controllers per socket, UPI/Infinity Fabric interconnect |
| Kernel page allocator | zonelist fallback order: local node first, remote nodes last |
| Memory policy | set_mempolicy()/mbind() set per-process or per-VMA allocation rules |
| AutoNUMA | PTE scanning + page migration for automatic rebalancing |
| Container runtime | No NUMA awareness by default; requires cpuset/topology manager |
| Application | numactl wrapper or explicit libnuma calls for placement control |
Design Rationale
First-touch works as the default because most programs access data from the same thread that allocated it -- local placement is correct without any configuration. Shared data structures like database buffers and caches break that assumption: they get hit from every socket, so MPOL_INTERLEAVE distributes pages round-robin to average out latency and spread bandwidth across controllers. AutoNUMA tries to fix bad placement automatically for general-purpose workloads, though the 20-50 us cost of migrating each page means it only pays off when access patterns hold steady.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| 30% throughput drop on dual-socket vs single-socket | All shared memory on one node, remote access penalty | numastat -p <pid> to check memory distribution |
| High numa_miss in /proc/vmstat | Pages allocated on remote node due to first-touch misplacement | cat /sys/devices/system/node/node*/numastat for miss counts |
| JVM GC pauses 40% higher on dual-socket | GC threads on remote node scanning heap on local node | Enable -XX:+UseNUMA and verify with GC logs |
| Memory bandwidth saturated on one node | All allocations concentrated on one node, other idle | numastat -m to compare node utilization |
| OOM on one node despite free memory on others | MPOL_BIND restricting allocations to exhausted node | Check cat /proc/<pid>/numa_maps for bind policies |
| NIC throughput lower than expected | NIC on different NUMA node than packet processing threads | cat /sys/bus/pci/devices/<nic>/numa_node |
When to Use / Avoid
Use when:
- Running databases (PostgreSQL, MongoDB) on multi-socket servers with large shared buffers
- Tuning JVM GC on dual-socket machines (-XX:+UseNUMA)
- High-performance networking where NIC NUMA affinity matters (DPDK, XDP)
- Diagnosing unexplained throughput drops after migrating to larger hardware
- Pinning latency-sensitive workloads to specific NUMA nodes
Avoid when:
- Running on single-socket servers (NUMA is irrelevant)
- The workload is I/O-bound, not memory-bandwidth-bound
- Cloud VMs that abstract NUMA topology (check with numactl --hardware first)
Try It Yourself
```bash
# Show NUMA hardware topology
numactl --hardware

# Run process with memory interleaved across all nodes
numactl --interleave=all -- postgres -D /data

# Show per-process NUMA memory allocation
numastat -p $(pidof postgres)

# Check NUMA statistics per node
cat /sys/devices/system/node/node0/numastat

# Migrate a running process's memory to node 1
migratepages $(pidof myapp) 0 1

# Check NUMA node of a PCIe device (NIC)
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
```
Debug Checklist
1. numactl --hardware -- show NUMA topology (nodes, CPUs, memory per node)
2. numastat -p <pid> -- show per-node memory allocation for a process
3. numastat -m -- system-wide per-node memory stats
4. perf stat -e node-load-misses,node-store-misses -- count remote NUMA accesses
5. cat /sys/devices/system/node/node*/numastat -- per-node hit/miss counters
6. cat /sys/bus/pci/devices/<addr>/numa_node -- check NIC NUMA affinity
Key Takeaways
- ✓ Local memory access: 80-100 ns. Remote access via the interconnect (QPI/UPI): 130-200 ns. That is a 1.5-2x penalty on every single memory read -- and for bandwidth-bound workloads, remote access can cut throughput by 30-50%
- ✓ The default NUMA policy is 'first touch' -- pages land on whatever node the faulting CPU belongs to; if one thread initializes all the data, ALL of it ends up on one node, and every other node pays the remote penalty forever
- ✓ Interleave policy (MPOL_INTERLEAVE) distributes pages round-robin across all nodes -- this averages out latency and multiplies available bandwidth, making it ideal for shared data accessed from every socket
- ✓ AutoNUMA is the kernel's attempt to fix bad placement automatically -- it scans PTEs, detects remote access patterns, and migrates pages to the local node; but each migration costs ~20 us per page, so it only works for stable patterns
- ✓ NUMA is not just about memory -- PCIe devices are attached to specific nodes too; a NIC on node 1 doing DMA into node 0's memory crosses the interconnect on every packet
Common Pitfalls
- ✗ Running a database on a multi-socket server without NUMA awareness -- if shared buffers are allocated on node 0 (where postmaster starts) but queries run on both nodes, half the buffer accesses pay remote latency; use numactl --interleave=all
- ✗ Using --cpunodebind without --membind -- constraining CPUs to node 0 does not guarantee memory lands there; under pressure, the allocator falls back to remote nodes silently
- ✗ MPOL_BIND without monitoring -- binding to a single node means OOM when that node is exhausted, even if other nodes have gigabytes free; always monitor per-node memory with numastat
- ✗ Ignoring NUMA in containers -- Docker and Kubernetes do not enforce NUMA by default; a container's threads may run on CPUs across all nodes while its memory sits on one, creating the worst possible access pattern
Reference
In One Line
Interleave shared data across nodes, bind thread-private data locally -- that one distinction eliminates most cross-socket latency problems.