NUMA & Cache Effects
Multi-socket servers have NUMA: each CPU socket has its own memory; accessing the other socket's memory is slower (1.5-3x). Pinning threads to NUMA nodes and allocating memory locally can dramatically improve performance for memory-bound workloads. False sharing is the related cache-line problem within a socket.
What NUMA is
A big server with two CPU chips (called sockets) presents itself to the operating system and application as one computer, but the memory is actually physically split. Each chip has its own bank of RAM wired directly to it. A thread running on chip 0 can read chip 0's RAM very fast. To read chip 1's RAM, the request has to travel over a wire between the two chips and back, which takes roughly 1.5 to 3 times longer than a local read.
The acronym NUMA stands for non-uniform memory access. The "non-uniform" part is the entire point: not all memory addresses cost the same to read, and the cost depends on which chip the running thread is on relative to which RAM bank holds the data.
The two important numbers in that picture:
- Local read (CPU 0 reads its own RAM bank 0): around 80 nanoseconds.
- Remote read (CPU 0 reads RAM bank 1, going over the inter-CPU wire): around 150 to 200 nanoseconds.
Most cloud VMs hide this layout when the instance is small enough to fit on one socket. Bigger instances and bare-metal servers expose both nodes directly to the OS, and any program that does not think about which thread runs where will see a noticeable performance gap on memory-heavy workloads.
How to see it on Linux
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-15
node 0 size: 64000 MB
node 1 cpus: 16-31
node 1 size: 64000 MB
node distances:
node   0   1
  0:  10  21   <- node 0 reading node 0 = relative cost 10
  1:  21  10   <- node 1 reading node 0 = cost 21 (about 2x slower)
The "distances" matrix is the punchline. Cross-chip is roughly twice as expensive.
How to fix it
The standard move: pin the process to one NUMA node. It only runs on that node's cores and only allocates memory from that node's bank. Every memory access is local.
$ numactl --cpunodebind=0 --membind=0 ./myapp
For multi-process systems, run one process per node, each pinned. Put a load balancer in front. The result effectively turns one big NUMA machine into two smaller, faster non-NUMA machines.
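For example, on the two-node box above (the --port flag here is a stand-in for however your app is configured, not a real option of any particular program):

$ numactl --cpunodebind=0 --membind=0 ./myapp --port 8080 &
$ numactl --cpunodebind=1 --membind=1 ./myapp --port 8081 &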
This is what large databases, message brokers, and in-memory caches do in production. Go programs in particular rely on it because Go's runtime doesn't try to be NUMA-aware. Java has -XX:+UseNUMA (honoured by the Parallel collector and, since JDK 14, by G1), which helps somewhat, but pinning is still the bigger win.
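A program can also set the same policy from the inside through libnuma, without an external wrapper. A minimal C sketch, assuming Linux with libnuma installed (link with -lnuma); the node number and buffer size are illustrative:

// numa_local.c -- pin the calling thread and its memory to node 0 from
// inside the process, the same policy numactl applies externally.
// Build: cc -O2 numa_local.c -lnuma -o numa_local
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int node = 0;                       /* illustrative: bind to node 0 */
    if (numa_run_on_node(node) != 0) {  /* run only on node 0's cores */
        perror("numa_run_on_node");
        return 1;
    }
    size_t len = 1UL << 30;             /* 1 GB, illustrative size */
    char *buf = numa_alloc_onnode(len, node); /* backed by node 0's RAM */
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    buf[0] = 1;                         /* first touch commits the pages */
    numa_free(buf, len);
    return 0;
}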
False sharing: NUMA's smaller cousin
Even within a single chip, two cores can fight over the same chunk of memory without any logical sharing. CPUs do not cache memory one byte or one field at a time; they cache it in fixed 64-byte chunks called cache lines. If core A writes to one byte and core B writes to a different byte that happens to live on the same cache line, the hardware treats the line as a single unit. Every write by one core invalidates the other core's cached copy of that line, and the line bounces back and forth between the two cores' caches even though the two writes are touching independent variables.
The hardware has no way to know that the two writes are logically unrelated. It just enforces "one writer at a time per cache line". The result is a 5x to 10x throughput drop on what should have been an embarrassingly parallel workload.
The bad layout. Two counters in the same struct fit inside one 64-byte cache line.
struct Counters {
    long aCount; // core A writes here
    long bCount; // core B writes here
};
// both fields fit in one 64-byte cache line
// every write by A invalidates B's cached copy and vice versa
// the line ping-pongs; throughput drops 5-10x
The good layout. Pad each counter so it sits alone on its own 64-byte cache line.
struct PaddedCounters {
    long aCount;
    char pad_a[56]; // 56 bytes of nothing fills out the line
    long bCount;
    char pad_b[56];
};
// each counter sits alone on its own cache line, no ping-pong
The fix exists in every language, with slightly different syntax:
| Language | How to align a hot field to its own cache line |
|---|---|
| Java | @Contended annotation (with -XX:-RestrictContended for application code) |
| C++ | alignas(std::hardware_destructive_interference_size) |
| C | manual padding bytes |
| Go | a padding field (commonly _ [56]byte) |
Java's LongAdder does this internally for its per-thread cells, which is why it scales much better than AtomicLong under contention.
The cost of padding is real: each padded field uses 64 bytes instead of 8. The whole technique is only worth it for fields that are actually hot under contention. Measure first; do not pad on suspicion.
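Measuring is straightforward. A minimal harness, assuming Linux with pthreads; the iteration count is arbitrary, and the exact gap depends on the CPU and on which cores the scheduler picks:

// false_share.c -- time two threads hammering adjacent vs padded counters.
// Build: cc -O2 -pthread false_share.c -o false_share
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL   /* arbitrary; big enough to time reliably */

struct Counters {           /* both longs share one 64-byte cache line */
    volatile long aCount;
    volatile long bCount;
};

struct PaddedCounters {     /* each long sits alone on its own line */
    volatile long aCount;
    char pad_a[56];
    volatile long bCount;
    char pad_b[56];
};

static void *bump(void *p) {            /* increment one counter ITERS times */
    volatile long *c = p;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static double run(volatile long *a, volatile long *b) {
    pthread_t ta, tb;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, bump, (void *)a);
    pthread_create(&tb, NULL, bump, (void *)b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    static struct Counters shared;
    static struct PaddedCounters padded;
    printf("same cache line: %.2fs\n", run(&shared.aCount, &shared.bCount));
    printf("padded:          %.2fs\n", run(&padded.aCount, &padded.bCount));
    return 0;
}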
The cache hierarchy in numbers
Modern CPUs have several layers of cache between the cores and main memory. Each layer is bigger and slower than the one above it. Knowing the rough sizes and latencies is the difference between writing fast code on purpose and being fast by accident.
| Level | Where | Size | Access latency |
|---|---|---|---|
| L1 | per-core | 32 to 64 KB | ~1 ns |
| L2 | per-core | 256 KB to 1 MB | ~3 to 10 ns |
| L3 | shared per chip | 8 to 64 MB | ~10 to 30 ns |
| RAM (local) | per chip (NUMA) | 16 GB to 1 TB | ~80 to 200 ns |
| RAM (remote) | other chip's bank, over the inter-CPU wire | same | ~150 to 400 ns |
The implication is straightforward. If the hot data fits in L1 or L2, code is very fast. If it only fits in L3, code is reasonably fast. If it spills out of L3, every access hits RAM, and if those RAM accesses also cross sockets, the NUMA penalty stacks on top of the cache miss.
For high-performance code, fitting the working set into cache and arranging the access pattern to be sequential often matters more than the algorithm being O(log n) versus O(n). A linear scan over a contiguous array can outpace a tree lookup that does fewer total operations but scatters them across cache lines.
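A sketch of that claim, assuming the matrix below is bigger than your L3 (an assumption; adjust DIM for your machine). The two passes do identical arithmetic; only the access pattern differs:

// scan_order.c -- same work, sequential vs strided access pattern.
// Build: cc -O2 scan_order.c -o scan_order
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DIM 4096   /* 4096x4096 longs = 128 MB, past any L3 */

static double sum_matrix(const long *m, int by_rows) {
    struct timespec t0, t1;
    volatile long sum = 0;   /* volatile keeps the loop from folding away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            sum += by_rows ? m[i * (long)DIM + j]   /* next long: sequential */
                           : m[j * (long)DIM + i];  /* jump 32 KB each step */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    long *m = calloc((size_t)DIM * DIM, sizeof *m); /* values don't matter */
    if (!m) return 1;
    printf("row-major (sequential): %.2fs\n", sum_matrix(m, 1));
    printf("column-major (strided): %.2fs\n", sum_matrix(m, 0));
    free(m);
    return 0;
}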
When this matters
NUMA matters for memory-heavy workloads on multi-socket servers: in-memory databases, message brokers, analytics, caches. The fix is process pinning.
False sharing matters when multiple threads write hot atomic counters in the same struct. The fix is padding (or using LongAdder-style per-thread cells).
Most everyday code doesn't think about either. A typical web service spends most of its time waiting on the network or the database, so a 2x memory access penalty rounds to noise. For data-intensive systems it can be the whole game. Profile first; don't pad on suspicion.
Implementations
The @Contended annotation tells the JVM to pad the field so it occupies its own cache line. Without it, two threads incrementing aCount and bCount on the same object would cause cache-line ping-pong. With it, they're independent.
// @Contended lives in jdk.internal.vm.annotation; application code must run
// with -XX:-RestrictContended (and, on modular JDKs,
// --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED) for it
// to take effect.
import jdk.internal.vm.annotation.Contended;

public class StripedCounter {
    @Contended public volatile long aCount;
    @Contended public volatile long bCount;
    // Without @Contended, both fields share a cache line. Two threads
    // writing aCount and bCount cause cache-line bouncing between cores.
    // With @Contended, each gets its own cache line. No false sharing.
}

// LongAdder uses this internally for its per-cell counters.
Key points
- NUMA: each socket has its own memory bank. Cross-socket access is slower than local.
- Linux: numactl --hardware shows the NUMA topology; numactl --cpunodebind=0 --membind=0 pins to node 0.
- NUMA-awareness varies by runtime (the JVM has -XX:+UseNUMA; Go's runtime is NUMA-blind); check the docs for your runtime or database and tune.
- False sharing: two cores writing different fields on the same cache line cause cache-line bouncing.
- Fix false sharing with padding (@Contended in Java, _Alignas(64) in C) so independent fields land on different cache lines.
Follow-up questions
- How big is the NUMA penalty?
- How is false sharing identified?
- Should atomics always be padded?
- How does NUMA interact with virtualisation/cloud?
Gotchas
- Cross-socket memory access is 1.5-3x slower; bind threads and memory to the same node.
- False sharing on hot atomics can dominate even simple workloads.
- JVM and Go runtimes are mostly NUMA-blind by default; manual binding helps.
- Large heaps that span sockets thrash; either pin or use multiple smaller processes.
- Padding wastes memory; only apply it where false sharing has been measured.