NUMA & Cache Effects
Multi-socket servers have NUMA: each CPU socket has its own memory; accessing the other socket's memory is slower (1.5-3x). Pinning threads to NUMA nodes and allocating memory locally can dramatically improve performance for memory-bound workloads. False sharing is the related cache-line problem within a socket.
What NUMA is
A big server with two CPU chips (called sockets) presents itself to the operating system and application as one computer, but the memory is actually physically split. Each chip has its own bank of RAM wired directly to it. A thread running on chip 0 can read chip 0's RAM very fast. To read chip 1's RAM, the request has to travel over a wire between the two chips and back, which takes roughly 1.5 to 3 times longer than a local read.
The acronym NUMA stands for non-uniform memory access. The "non-uniform" part is the entire point: not all memory addresses cost the same to read, and the cost depends on which chip the running thread is on relative to which RAM bank holds the data.
The two important numbers in that picture:
- Local read (CPU 0 reads its own RAM bank 0): around 80 nanoseconds.
- Remote read (CPU 0 reads RAM bank 1, going over the inter-CPU wire): around 150 to 200 nanoseconds.
Most cloud VMs hide this layout when the instance is small enough to fit on one socket. Bigger instances and bare-metal servers expose both nodes directly to the OS, and any program that does not think about which thread runs where will see a noticeable performance gap on memory-heavy workloads.
How to see it on Linux
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-15
node 0 size: 64000 MB
node 1 cpus: 16-31
node 1 size: 64000 MB
node distances:
node   0   1
  0:  10  21   <- node 0 reading node 0 = relative cost 10
  1:  21  10   <- node 1 reading node 0 = cost 21 (about 2x slower)
The "distances" matrix is the punchline. Cross-chip is roughly twice as expensive.
How to fix it
The standard move: pin the process to one NUMA node. It only runs on that node's cores and only allocates memory from that node's bank. Every memory access is local.
$ numactl --cpunodebind=0 --membind=0 ./myapp
For multi-process systems, run one process per node, each pinned. Put a load balancer in front. The result effectively turns one big NUMA machine into two smaller, faster non-NUMA machines.
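For example, on the two-node box above (the --port flag here is a stand-in for however your app is configured, not a real option of any particular program):

$ numactl --cpunodebind=0 --membind=0 ./myapp --port 8080 &
$ numactl --cpunodebind=1 --membind=1 ./myapp --port 8081 &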
This is what large databases, message brokers, and in-memory caches do in production. Go programs in particular rely on it because Go's runtime doesn't try to be NUMA-aware. Java has -XX:+UseNUMA (honoured by the Parallel collector and, since JDK 14, by G1), which helps somewhat, but pinning is still the bigger win.
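A program can also set the same policy from the inside through libnuma, without an external wrapper. A minimal C sketch, assuming Linux with libnuma installed (link with -lnuma); the node number and buffer size are illustrative:

// numa_local.c -- pin the calling thread and its memory to node 0 from
// inside the process, the same policy numactl applies externally.
// Build: cc -O2 numa_local.c -lnuma -o numa_local
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int node = 0;                       /* illustrative: bind to node 0 */
    if (numa_run_on_node(node) != 0) {  /* run only on node 0's cores */
        perror("numa_run_on_node");
        return 1;
    }
    size_t len = 1UL << 30;             /* 1 GB, illustrative size */
    char *buf = numa_alloc_onnode(len, node); /* backed by node 0's RAM */
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    buf[0] = 1;                         /* first touch commits the pages */
    numa_free(buf, len);
    return 0;
}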
False sharing: NUMA's smaller cousin
Even within a single chip, two cores can fight over the same chunk of memory without any logical sharing. CPUs do not cache memory one byte or one field at a time; they cache it in fixed 64-byte chunks called cache lines. If core A writes to one byte and core B writes to a different byte that happens to live on the same cache line, the hardware treats the line as a single unit. Every write by one core invalidates the other core's cached copy of that line, and the line bounces back and forth between the two cores' caches even though the two writes are touching independent variables.
The hardware has no way to know that the two writes are logically unrelated. It just enforces "one writer at a time per cache line". The result is a 5x to 10x throughput drop on what should have been an embarrassingly parallel workload.
The bad layout. Two counters in the same struct fit inside one 64-byte cache line.
struct Counters {
    long aCount; // core A writes here
    long bCount; // core B writes here
};
// both fields fit in one 64-byte cache line
// every write by A invalidates B's cached copy and vice versa
// the line ping-pongs; throughput drops 5-10x
The good layout. Pad each counter so it sits alone on its own 64-byte cache line.
struct PaddedCounters {
    long aCount;
    char pad_a[56]; // 56 bytes of nothing fills out the line
    long bCount;
    char pad_b[56];
};
// each counter sits alone on its own cache line, no ping-pong
The fix exists in every language, with slightly different syntax:
| Language | How to align a hot field to its own cache line |
|---|---|
| Java | @Contended annotation (with -XX:-RestrictContended for application code) |
| C++ | alignas(std::hardware_destructive_interference_size) |
| C | manual padding bytes |
| Go | a padding field (commonly _ [56]byte) |
Java's LongAdder does this internally for its per-thread cells, which is why it scales much better than AtomicLong under contention.
The cost of padding is real: each padded field uses 64 bytes instead of 8. The whole technique is only worth it for fields that are actually hot under contention. Measure first; do not pad on suspicion.
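Measuring is straightforward. A minimal harness, assuming Linux with pthreads; the iteration count is arbitrary, and the exact gap depends on the CPU and on which cores the scheduler picks:

// false_share.c -- time two threads hammering adjacent vs padded counters.
// Build: cc -O2 -pthread false_share.c -o false_share
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL   /* arbitrary; big enough to time reliably */

struct Counters {           /* both longs share one 64-byte cache line */
    volatile long aCount;
    volatile long bCount;
};

struct PaddedCounters {     /* each long sits alone on its own line */
    volatile long aCount;
    char pad_a[56];
    volatile long bCount;
    char pad_b[56];
};

static void *bump(void *p) {            /* increment one counter ITERS times */
    volatile long *c = p;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static double run(volatile long *a, volatile long *b) {
    pthread_t ta, tb;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, bump, (void *)a);
    pthread_create(&tb, NULL, bump, (void *)b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    static struct Counters shared;
    static struct PaddedCounters padded;
    printf("same cache line: %.2fs\n", run(&shared.aCount, &shared.bCount));
    printf("padded:          %.2fs\n", run(&padded.aCount, &padded.bCount));
    return 0;
}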
The cache hierarchy in numbers
Modern CPUs have several layers of cache between the cores and main memory. Each layer is bigger and slower than the one above it. Knowing the rough sizes and latencies is the difference between writing fast code on purpose and being fast by accident.
| Level | Where | Size | Access latency |
|---|---|---|---|
| L1 | per-core | 32 to 64 KB | ~1 ns |
| L2 | per-core | 256 KB to 1 MB | ~3 to 10 ns |
| L3 | shared per chip | 8 to 64 MB | ~10 to 30 ns |
| RAM (local) | per chip (NUMA) | 16 GB to 1 TB | ~80 to 200 ns |
| RAM (remote) | other chip's bank, over the inter-CPU wire | same | ~150 to 400 ns |
The implication is straightforward. If the hot data fits in L1 or L2, code is very fast. If it only fits in L3, code is reasonably fast. If it spills out of L3, every access hits RAM, and if those RAM accesses also cross sockets, the NUMA penalty stacks on top of the cache miss.
For high-performance code, fitting the working set into cache and arranging the access pattern to be sequential often matters more than the algorithm being O(log n) versus O(n). A linear scan over a contiguous array can outpace a tree lookup that does fewer total operations but scatters them across cache lines.
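A sketch of that claim, assuming the matrix below is bigger than your L3 (an assumption; adjust DIM for your machine). The two passes do identical arithmetic; only the access pattern differs:

// scan_order.c -- same work, sequential vs strided access pattern.
// Build: cc -O2 scan_order.c -o scan_order
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DIM 4096   /* 4096x4096 longs = 128 MB, past any L3 */

static double sum_matrix(const long *m, int by_rows) {
    struct timespec t0, t1;
    volatile long sum = 0;   /* volatile keeps the loop from folding away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            sum += by_rows ? m[i * (long)DIM + j]   /* next long: sequential */
                           : m[j * (long)DIM + i];  /* jump 32 KB each step */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    long *m = calloc((size_t)DIM * DIM, sizeof *m); /* values don't matter */
    if (!m) return 1;
    printf("row-major (sequential): %.2fs\n", sum_matrix(m, 1));
    printf("column-major (strided): %.2fs\n", sum_matrix(m, 0));
    free(m);
    return 0;
}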
When this matters
NUMA matters for memory-heavy workloads on multi-socket servers: in-memory databases, message brokers, analytics, caches. The fix is process pinning.
False sharing matters when multiple threads write hot atomic counters in the same struct. The fix is padding (or using LongAdder-style per-thread cells).
Most everyday code doesn't think about either. A typical web service spends most of its time waiting on the network or the database, so a 2x memory access penalty rounds to noise. For data-intensive systems it can be the whole game. Profile first; don't pad on suspicion.
Implementations
The @Contended annotation tells the JVM to pad the field so it occupies its own cache line. Without it, two threads incrementing aCount and bCount on the same object would cause cache-line ping-pong. With it, they're independent.
// @Contended lives in jdk.internal.vm.annotation; application code must run
// with -XX:-RestrictContended (and, on modular JDKs,
// --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED) for it
// to take effect.
import jdk.internal.vm.annotation.Contended;

public class StripedCounter {
    @Contended public volatile long aCount;
    @Contended public volatile long bCount;
    // Without @Contended, both fields share a cache line. Two threads
    // writing aCount and bCount cause cache-line bouncing between cores.
    // With @Contended, each gets its own cache line. No false sharing.
}

// LongAdder uses this internally for its per-cell counters.
Key points
- NUMA: each socket has its own memory bank. Cross-socket access is slower than local.
- Linux: numactl --hardware shows the NUMA topology; numactl --cpunodebind=0 --membind=0 pins to node 0.
- NUMA-awareness varies by runtime (the JVM has -XX:+UseNUMA; Go's runtime is NUMA-blind); check the docs for your runtime or database and tune.
- False sharing: two cores writing different fields on the same cache line cause cache-line bouncing.
- Fix false sharing with padding (@Contended in Java, _Alignas(64) in C) so independent fields land on different cache lines.
Follow-up questions
- How big is the NUMA penalty?
- How is false sharing identified?
- Should atomics always be padded?
- How does NUMA interact with virtualisation/cloud?
Gotchas
- Cross-socket memory access is 1.5-3x slower; bind threads and memory to the same node.
- False sharing on hot atomics can dominate even simple workloads.
- JVM and Go runtimes are mostly NUMA-blind by default; manual binding helps.
- Large heaps that span sockets thrash; either pin or use multiple smaller processes.
- Padding wastes memory; only apply it where false sharing has been measured.