Virtual Memory & Address Spaces
Mental Model
Every process lives in its own private house with numbered rooms. It sees rooms 1 through a million, but most rooms are empty behind the door -- the kernel only puts furniture in a room when someone actually walks in. Two processes can both have a room 42, but those are different rooms in different houses, even if the furniture happens to look identical. The kernel quietly manages which room number maps to which physical storage closet out back.
The Problem
Fork a 10 GB process and it completes in milliseconds. Map a terabyte file and it costs zero RAM. Run 200 processes on 64 GB of physical memory and none of them know the others exist. The illusion holds beautifully -- until resident memory exceeds physical RAM. Then pages start swapping, latency spikes appear from nowhere, and processes get SIGSEGV or OOM-killed with nothing useful in the application logs to explain why.
Architecture
Every process is convinced it has 128 terabytes of memory all to itself.
It does not. That is the most important lie in all of operating systems. And everything -- from how fork() completes in milliseconds to why a production server survives running 200 processes on 64 GB of RAM -- depends on that lie holding up.
Let's pull back the curtain.
What Actually Happens
When a process accesses a memory address, here is the real sequence:
Step 1: The CPU takes the 48-bit virtual address and asks the MMU to translate it.
Step 2: The MMU checks the TLB (a small cache of recent translations). If it finds a match, translation completes in a single cycle. Done.
Step 3: On a TLB miss, the MMU walks the page table hierarchy (PGD -> PUD -> PMD -> PTE on standard four-level x86-64; a fifth P4D level only comes into play with 5-level paging) to find the physical frame number. This costs 10-30 ns.
Step 4: If the page table entry is missing -- because this is the first time the process is touching this address -- a page fault fires.
Step 5: The kernel handles the fault. For anonymous pages, it allocates a physical frame and zeros it. For file-backed pages, it reads from disk. Then it installs the page table entry.
This is why a 1 GB mmap() call completes in microseconds. No physical memory is consumed. The kernel just records "this process claims addresses X through Y." Physical frames appear only when the process actually touches those addresses.
Minor faults (page already in memory, just needs a PTE) cost about 1 microsecond. Major faults (requiring disk I/O) cost milliseconds. The difference is 1,000x.
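A minimal sketch in C (Linux, anonymous mapping; sizes are arbitrary) that makes demand paging visible: reserve 1 GB of address space, touch only a few pages, and watch VSZ jump while RSS barely moves.

```c
// demand_paging.c -- sketch: mmap reserves address space, not RAM.
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;                       // 1 GB of virtual address space
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // VSZ is now ~1 GB larger, but RSS has not moved: nothing has faulted yet.
    // Touch 16 pages; the kernel allocates exactly those physical frames.
    for (int i = 0; i < 16; i++)
        p[i * 4096] = 1;                          // each first write is a minor fault

    // Pause here and inspect from another terminal:
    //   grep -E 'VmSize|VmRSS' /proc/$(pidof demand_paging)/status
    getchar();
    munmap(p, len);
    return 0;
}
```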
Under the Hood
The virtual address space has a specific layout, and it is not random:
Text segment sits at the bottom -- the compiled code, marked read-only and executable. Data and BSS follow -- global variables, initialized and uninitialized. Then the heap grows upward from the program break. The mmap region sits in the middle, holding shared libraries, file mappings, and anonymous mmap allocations. The stack grows downward from near the top of user space.
The kernel maintains a mm_struct for each process containing its vm_area_struct (VMA) entries -- historically in a red-black tree, a maple tree since kernel 6.1. Each VMA describes a contiguous region: its address range, permissions, and what backs it (file, anonymous memory, or device). When adjacent VMAs have identical attributes, the kernel merges them to save metadata.
Here's where it gets interesting. The upper half of the address space (above 0xFFFF800000000000) belongs to the kernel. It contains a direct map of all physical RAM, the vmalloc region for discontiguous kernel allocations, and the kernel text. User-space code that touches a kernel address segfaults instantly because those pages are marked supervisor-only in the page tables; SMEP and SMAP guard the opposite direction, preventing the kernel from executing or inadvertently dereferencing user-space pages.
Copy-on-Write is the trick that makes fork() cheap. After fork, parent and child share the same physical pages, all marked read-only. The first write triggers a fault, and the kernel copies just that one page. A 10 GB process that forks and immediately execs a new program consumes almost zero additional memory.
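A rough sketch of why fork() stays cheap, assuming a Linux machine with a couple of spare gigabytes; the timing is illustrative, not a guarantee.

```c
// cow_fork.c -- fork() copies page tables, not pages, so it stays fast
// even when the parent has gigabytes of touched memory.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    size_t len = 2UL << 30;                 // ~2 GB heap
    char *buf = malloc(len);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xAB, len);                 // touch every page so frames exist

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t pid = fork();                     // shares all pages copy-on-write
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (pid == 0) {
        buf[0] = 1;                         // first write: kernel copies this one page
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("fork() of a ~2 GB process took %.2f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    free(buf);
    return 0;
}
```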
ASLR randomizes the base of the stack, mmap region, heap, and vDSO on every exec. With 28 bits of entropy on x86-64 for the mmap base, an attacker guessing addresses has a 1-in-268-million chance per attempt. This is what makes ROP attacks significantly harder.
And the vDSO (virtual dynamic shared object) is a kernel-mapped page in user space. It lets gettimeofday() and clock_gettime() execute without a real syscall. No ring transition. Pure speed.
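One way to see the effect, sketched for a glibc system where clock_gettime() is routed through the vDSO; the comparison forces the same call through syscall(2), and per-call timings will vary by hardware.

```c
// vdso_demo.c -- clock_gettime() via the vDSO vs. the same call forced
// through a real syscall (ring transition included).
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    const int N = 1000000;
    struct timespec ts, t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);              // vDSO: no kernel entry
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("vDSO path:    %.0f ns/call\n", elapsed_ns(t0, t1) / N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts); // forced real syscall
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("syscall path: %.0f ns/call\n", elapsed_ns(t0, t1) / N);
    return 0;
}
```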
Common Questions
What happens when malloc(16) is called?
glibc's ptmalloc2 serves it from a free-list bin. No syscall. If the bin is empty, it may call brk() to extend the heap, or mmap() for large allocations (above the 128 KB threshold by default). The 16-byte request actually occupies a 32-byte chunk on 64-bit -- an 8-byte size header plus alignment padding, leaving 24 usable bytes.
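You can check the chunk arithmetic with glibc's malloc_usable_size() extension; the numbers are typical for 64-bit glibc and will differ under other allocators.

```c
// malloc_chunk.c -- peek at the real chunk behind a small malloc.
// malloc_usable_size() is a glibc extension, not portable C.
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p = malloc(16);
    // Typically prints 24 on 64-bit glibc: a 32-byte chunk minus the
    // 8-byte size header -- more than requested, less than the full chunk.
    printf("requested 16, usable %zu bytes\n", malloc_usable_size(p));
    free(p);
    return 0;
}
```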
Why does fork() not double memory usage?
COW semantics. Parent and child share physical pages until one of them writes. A process using 4 GB RSS that forks and immediately execs consumes almost no additional memory. Only pages that are subsequently written get duplicated.
How does ASLR actually work?
By randomizing the base addresses of the stack, heap, mmap region, and executable (PIE). The randomization happens at exec time. With 28 bits of entropy for the mmap base, the probability of guessing correctly is 1 in 268 million per attempt.
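A quick sketch to watch this happen: print a few addresses and run the binary twice -- with randomize_va_space set to 2 they move every time (the code address only moves for PIE builds, the default on most distributions).

```c
// aslr_demo.c -- run twice and compare; ASLR shifts each region per exec.
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    int on_stack;
    void *on_heap = malloc(64);
    void *mapped  = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("stack %p  heap %p  mmap %p  code %p\n",
           (void *)&on_stack, on_heap, mapped, (void *)main);
    free(on_heap);
    munmap(mapped, 4096);
    return 0;
}
```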
What is the difference between a minor and major page fault?
A minor fault resolves without disk I/O -- COW fault, zero-fill for anonymous pages, or the page is already in the page cache. A major fault requires reading from disk -- either swap or the filesystem. Major faults are 1,000x slower and dominate process startup time for large binaries.
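A sketch that counts both kinds of fault with getrusage() while sweeping an mmap'd file; /tmp/bigfile is a placeholder for any large file. Run it once with a cold page cache (mostly major faults) and again warm (mostly minor faults).

```c
// fault_count.c -- minor vs. major faults around a first pass over a mapped file.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/bigfile", O_RDONLY);           // placeholder test file
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    volatile long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];                                    // fault in every page once
    getrusage(RUSAGE_SELF, &after);

    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```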
How Technologies Use This
A Kafka broker managing 50,000 partitions shows 3 TB of virtual memory mapped but only 64 GB of physical RAM available. Operators panic at the virtual size numbers and start planning a cluster expansion, assuming the broker is about to run out of memory.
The key insight is that Kafka mmaps each partition index file into a virtual address range that costs zero physical pages until a consumer actually seeks into that partition. The kernel demand-pages only the hot segments, typically keeping just 2-5% of total index data resident in physical memory at any time.
Monitor RSS instead of virtual size. Brokers routinely map 3 TB of virtual space while consuming under 15 GB RSS because demand paging ensures only actively accessed index pages occupy physical RAM, letting a single node handle partition counts that would otherwise require an entire cluster.
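The same residency question can be asked of any mapped file with mincore(2). This is not Kafka's code, just the mechanism underneath it: map a file read-only and count how many of its pages are actually in RAM.

```c
// residency.c -- how much of a mapped file is resident? (mincore reports
// one byte per page; bit 0 set means the page is in RAM.)
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    size_t pages = (st.st_size + 4095) / 4096;
    unsigned char *vec = malloc(pages);
    mincore(p, st.st_size, vec);

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        resident += vec[i] & 1;
    printf("%zu of %zu pages resident (%.1f%%)\n",
           resident, pages, 100.0 * resident / pages);
    free(vec);
    return 0;
}
```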
A PostgreSQL instance with 200 backends suddenly loses all connections after a single backend corrupts a pointer. Every connection crashes simultaneously, and the DBA finds no isolation between processes sharing the 16 GB buffer pool.
Virtual memory gives each backend an independent 128 TB address space while secretly mapping the same shared_buffers segment to identical physical pages via MAP_SHARED. One backend corrupting its own private address space cannot touch another backend's mappings, and COW semantics make forking a new backend off the postmaster nearly instant because no physical pages are copied until a write occurs.
Rely on virtual memory isolation as the safety net for multi-backend architectures. Forking a 50 GB process completes in under 2 ms, and 200 backends share one physical copy of the buffer pool instead of duplicating 3.2 TB of RAM.
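A toy sketch of that pattern (not PostgreSQL's actual code): one MAP_SHARED anonymous region plays the role of shared_buffers and stays the same physical pages across fork(), while private memory quietly goes copy-on-write.

```c
// shared_vs_private.c -- shared mapping vs. private COW memory across fork().
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // "shared_buffers": every forked backend sees the same physical page.
    int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    // Private memory: shared copy-on-write after fork, then diverges.
    int *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    *shared = 0;
    *priv = 0;

    if (fork() == 0) {           // child "backend"
        *shared = 42;            // lands in the single shared physical page
        *priv = 42;              // COW: the child gets its own copy of this page
        _exit(0);
    }
    wait(NULL);
    printf("parent sees shared=%d priv=%d\n", *shared, *priv);  // 42 and 0
    return 0;
}
```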
A Redis instance shows 6 GB RSS when only 4 GB of keys are stored. Operators spend hours hunting for memory leaks using every profiling tool available, but nothing turns up because there is no leak.
The hidden cause is virtual memory fragmentation. Redis relies on jemalloc to carve the virtual address space into many size classes, but freed objects leave invisible holes in the address space that cannot be returned to the OS because adjacent pages still hold live data. The INFO memory command exposes mem_fragmentation_ratio (roughly RSS divided by the bytes of live data), and anything above 1.5 signals that RSS is at least 50% larger than the data actually stored.
Check mem_fragmentation_ratio before chasing phantom leaks. If it exceeds 1.5, persist with BGSAVE and restart Redis; the fresh process rebuilds a compact heap and closes the gap between actual data size and RSS.
A JVM configured with -Xmx512g appears to consume 512 GB at startup, alarming monitoring systems and triggering capacity alerts. Teams assume they need a machine with 512 GB of physical RAM just to start the process.
The JVM calls mmap with MAP_NORESERVE, creating a contiguous 512 GB virtual reservation that costs zero physical pages. No memory is touched at startup. As the application allocates objects, page faults commit physical frames at roughly 1 microsecond each. ZGC exploits this further by multi-mapping the same physical pages at three virtual addresses simultaneously, using colored pointers to track GC state without long stop-the-world pauses.
Ignore VIRT/VSZ in monitoring dashboards for JVM processes. The 512 GB virtual reservation starts in under 100 ms and costs nothing until objects are actually allocated, so size capacity based on RSS growth under load, not the initial virtual reservation.
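The reserve-then-commit pattern itself is easy to sketch with the same mmap flags; this is not HotSpot's code, and the 256 GB figure is arbitrary -- the reservation costs nothing either way.

```c
// reserve_commit.c -- reserve a huge virtual range with no backing, then
// commit a slice with mprotect(); RSS grows only when pages are written.
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t reserve = 1UL << 38;               // 256 GB of address space
    char *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    printf("reserved %zu GB at %p -- RSS unchanged\n", reserve >> 30, (void *)base);

    // "Commit" the first 64 MB; still no physical frames until written.
    mprotect(base, 64UL << 20, PROT_READ | PROT_WRITE);
    for (size_t i = 0; i < (64UL << 20); i += 4096)
        base[i] = 1;                           // now RSS grows, one fault per page
    return 0;
}
```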
Same Concept Across Tech
| Technology | How virtual memory affects it | Key behavior |
|---|---|---|
| JVM | Heap is virtual memory. -Xmx reserves address space, RSS grows as pages are touched | RSS can be much less than Xmx until GC touches all pages |
| Docker | Memory limits come from cgroups, which account physical pages, not address space | memory.limit_in_bytes (v1) / memory.max (v2) caps RSS plus page cache, not virtual size |
| Node.js | V8 heap + Buffer allocations are virtual. --max-old-space-size controls V8 heap | Buffer.allocUnsafe uses virtual pages, materialized on write |
| Go | Runtime reserves large virtual address space upfront for GC arenas | High VSZ is normal and not a memory leak |
| PostgreSQL | shared_buffers is an mmap'd shared segment; the OS page cache additionally caches data files | Per-backend RSS includes shared pages, so summing RSS double-counts across processes |
Stack layer mapping (process using too much memory):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app holding references that prevent GC? Memory leak? | Heap dump, profiler |
| Runtime | Is the heap configured correctly? Native memory growing? | JVM NMT, Go pprof, Node heapdump |
| Virtual memory | VSZ vs RSS vs swap. Page faults? | /proc/PID/status, /proc/PID/smaps |
| Kernel | Overcommit policy, swap pressure, page reclaim | /proc/sys/vm/overcommit_memory, /proc/pressure/memory |
| Hardware | Total physical RAM, NUMA distribution | free -h, numactl --hardware |
Design Rationale
Without virtual memory, every process would need a contiguous block of physical RAM sized for its worst case. Fragmentation would make large allocations impossible, and two processes using the same address would corrupt each other -- the DOS experience, basically. The translation layer between what a process sees and where data physically lives buys four things at once: demand paging (physical frames allocated only when touched), copy-on-write (fork without copying gigabytes), memory-mapped files (unified file and memory access), and per-process isolation. All from one indirection. The cost is the page table walk on TLB misses, which hardware mitigates with TLB caches and huge page support so well that most workloads rarely notice.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| VSZ is 10x larger than RSS | Normal. Virtual pages are not resident until touched | Only worry if RSS is high, not VSZ |
| RSS growing slowly over time | Memory leak or unbounded cache | Monitor RSS with pidstat -r, check for growing allocations |
| Major page faults during request handling | Data being read from swap, not RAM | perf stat -e major-faults, check swap usage |
| Fork is fast despite large process | Copy-on-write. Pages are shared until written | Normal behavior, not a bug |
| SIGSEGV with valid-looking address | Accessing unmapped page, stack overflow, or use-after-munmap | cat /proc/PID/maps, check if address is in a mapped region |
| Container OOM but process RSS seems low | Kernel accounting includes page cache, tmpfs, slab caches | Check memory.stat in cgroup for cache vs rss breakdown |
When to Use / Avoid
Relevant when:
- Understanding why VSZ (virtual size) is much larger than RSS (resident size) and that is normal
- Debugging memory-related crashes (SIGSEGV, OOM kills, swap storms)
- Using mmap for memory-mapped files or shared memory between processes
- Understanding copy-on-write behavior after fork()
Watch out for:
- Assuming VSZ indicates actual memory usage (it does not, RSS does)
- Overcommitting without monitoring (default Linux behavior allows allocating more virtual memory than physical RAM)
- Memory-mapped files that are never accessed still reserve virtual address space
Try It Yourself
```bash
# View virtual memory layout of current shell
cat /proc/self/maps

# Show detailed memory map with RSS for a process
pmap -x $(pidof nginx)

# Check system-wide memory stats and page fault rates
vmstat 1 5

# View address space size limits
ulimit -v    # max virtual memory (KB, often unlimited)
ulimit -s    # max stack size (KB, default 8192)

# Check ASLR setting (2 = full randomization)
cat /proc/sys/kernel/randomize_va_space
```

Debug Checklist
1. Check virtual vs resident memory: cat /proc/<pid>/status | grep -E 'VmSize|VmRSS|VmSwap'
2. View memory map: cat /proc/<pid>/maps | head -30
3. Check detailed memory usage: cat /proc/<pid>/smaps_rollup
4. Monitor page faults: perf stat -e page-faults,minor-faults,major-faults -p <pid>
5. Check overcommit policy: cat /proc/sys/vm/overcommit_memory
6. Check swap usage: free -h and swapon --show
Key Takeaways
- ✓ Every process gets a 128 TB private address space -- but it is all fake. The kernel splits the 64-bit space at 0xFFFF800000000000, giving user space the lower half and keeping the upper half for itself
- ✓ Page faults are features, not bugs. When you touch a page for the first time, the kernel traps, allocates a physical frame, and wires up the translation -- this is how a 1 GB mmap completes in microseconds but costs zero RAM until accessed
- ✓ ASLR randomizes where your stack, heap, and libraries land on every exec -- 28 bits of entropy means an attacker has a 1-in-268-million chance of guessing your mmap base
- ✓ brk() is the old way to grow the heap; modern malloc uses mmap() for anything above 128 KB because mmap'd regions can be freed independently, while brk can only shrink from the top
- ✓ The vDSO is a kernel page mapped into user space that lets gettimeofday() run without a syscall -- zero ring transitions, pure speed
Common Pitfalls
- ✗ Mistaking virtual size for real usage -- a process can map terabytes via mmap and show huge VIRT/VSZ numbers while consuming almost no physical RAM; RSS is what matters for memory pressure
- ✗ Checking mmap() returns against NULL instead of MAP_FAILED -- mmap returns (void*)-1 on failure, not NULL, and address 0 can theoretically be a valid mapping (see the sketch after this list)
- ✗ Calling brk()/sbrk() directly in threaded code -- these modify a single program break pointer shared across all threads, so concurrent calls corrupt the heap
- ✗ Assuming the stack grows forever -- it is capped by ulimit (default 8 MB), and exceeding it gives you a silent SIGSEGV, not a helpful error message
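For the MAP_FAILED pitfall in the list above, a minimal correct check looks like this:

```c
// mmap_check.c -- test against MAP_FAILED ((void *)-1), never against NULL.
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    void *p = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {       // NOT: if (p == NULL)
        perror("mmap");
        return 1;
    }
    munmap(p, 1UL << 20);
    return 0;
}
```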
Reference
In One Line
Virtual addresses are a lie that makes everything work -- fast forks, cheap mmap, process isolation -- and when things break (OOM kills, swap storms, SIGSEGV), the answer almost always lives in that lie.