Virtual Memory & Address Spaces
Mental Model
Every process lives in its own private house with numbered rooms. It sees rooms 1 through a million, but most rooms are empty behind the door -- the kernel only puts furniture in a room when someone actually walks in. Two processes can both have a room 42, but those are different rooms in different houses, even if the furniture happens to look identical. The kernel quietly manages which room number maps to which physical storage closet out back.
The Problem
Fork a 10 GB process and it completes in milliseconds. Map a terabyte file and it costs zero RAM. Run 200 processes on 64 GB of physical memory and none of them know the others exist. The illusion holds beautifully -- until resident memory exceeds physical RAM. Then pages start swapping, latency spikes appear from nowhere, and processes get SIGSEGV or OOM-killed with nothing useful in the application logs to explain why.
Architecture
Every process is convinced it has 128 terabytes of memory all to itself.
It does not. That is the most important lie in all of operating systems. And everything -- from how fork() completes in milliseconds to why a production server survives running 200 processes on 64 GB of RAM -- depends on that lie holding up.
Let's pull back the curtain.
What Actually Happens
When a process accesses a memory address, here is the real sequence:
Step 1: The CPU takes the 48-bit virtual address and asks the MMU to translate it.
Step 2: The MMU checks the TLB (a small cache of recent translations). If it finds a match, translation completes in a single cycle. Done.
Step 3: On a TLB miss, the MMU walks the page table hierarchy (PGD -> PUD -> PMD -> PTE on standard four-level x86-64; a fifth P4D level only comes into play with 5-level paging) to find the physical frame number. This costs 10-30 ns.
Step 4: If the page table entry is missing -- because this is the first time the process is touching this address -- a page fault fires.
Step 5: The kernel handles the fault. For anonymous pages, it allocates a physical frame and zeros it. For file-backed pages, it reads from disk. Then it installs the page table entry.
This is why a 1 GB mmap() call completes in microseconds. No physical memory is consumed. The kernel just records "this process claims addresses X through Y." Physical frames appear only when the process actually touches those addresses.
Minor faults (page already in memory, just needs a PTE) cost about 1 microsecond. Major faults (requiring disk I/O) cost milliseconds. The difference is 1,000x.
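A minimal sketch in C (Linux, anonymous mapping; sizes are arbitrary) that makes demand paging visible: reserve 1 GB of address space, touch only a few pages, and watch VSZ jump while RSS barely moves.

```c
// demand_paging.c -- sketch: mmap reserves address space, not RAM.
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;                       // 1 GB of virtual address space
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // VSZ is now ~1 GB larger, but RSS has not moved: nothing has faulted yet.
    // Touch 16 pages; the kernel allocates exactly those physical frames.
    for (int i = 0; i < 16; i++)
        p[i * 4096] = 1;                          // each first write is a minor fault

    // Pause here and inspect from another terminal:
    //   grep -E 'VmSize|VmRSS' /proc/$(pidof demand_paging)/status
    getchar();
    munmap(p, len);
    return 0;
}
```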
Under the Hood
The virtual address space has a specific layout, and it is not random:
Text segment sits at the bottom -- the compiled code, marked read-only and executable. Data and BSS follow -- global variables, initialized and uninitialized. Then the heap grows upward from the program break. The mmap region sits in the middle, holding shared libraries, file mappings, and anonymous mmap allocations. The stack grows downward from near the top of user space.
The kernel maintains a mm_struct for each process containing its vm_area_struct (VMA) entries -- historically in a red-black tree, a maple tree since kernel 6.1. Each VMA describes a contiguous region: its address range, permissions, and what backs it (file, anonymous memory, or device). When adjacent VMAs have identical attributes, the kernel merges them to save metadata.
Here's where it gets interesting. The upper half of the address space (above 0xFFFF800000000000) belongs to the kernel. It contains a direct map of all physical RAM, the vmalloc region for discontiguous kernel allocations, and the kernel text. User-space code that touches a kernel address segfaults instantly because those pages are marked supervisor-only in the page tables; SMEP and SMAP guard the opposite direction, preventing the kernel from executing or inadvertently dereferencing user-space pages.
Copy-on-Write is the trick that makes fork() cheap. After fork, parent and child share the same physical pages, all marked read-only. The first write triggers a fault, and the kernel copies just that one page. A 10 GB process that forks and immediately execs a new program consumes almost zero additional memory.
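A rough sketch of why fork() stays cheap, assuming a Linux machine with a couple of spare gigabytes; the timing is illustrative, not a guarantee.

```c
// cow_fork.c -- fork() copies page tables, not pages, so it stays fast
// even when the parent has gigabytes of touched memory.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    size_t len = 2UL << 30;                 // ~2 GB heap
    char *buf = malloc(len);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xAB, len);                 // touch every page so frames exist

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t pid = fork();                     // shares all pages copy-on-write
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (pid == 0) {
        buf[0] = 1;                         // first write: kernel copies this one page
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("fork() of a ~2 GB process took %.2f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    free(buf);
    return 0;
}
```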
ASLR randomizes the base of the stack, mmap region, heap, and vDSO on every exec. With 28 bits of entropy on x86-64 for the mmap base, an attacker guessing addresses has a 1-in-268-million chance per attempt. This is what makes ROP attacks significantly harder.
And the vDSO (virtual dynamic shared object) is a kernel-mapped page in user space. It lets gettimeofday() and clock_gettime() execute without a real syscall. No ring transition. Pure speed.
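One way to see the effect, sketched for a glibc system where clock_gettime() is routed through the vDSO; the comparison forces the same call through syscall(2), and per-call timings will vary by hardware.

```c
// vdso_demo.c -- clock_gettime() via the vDSO vs. the same call forced
// through a real syscall (ring transition included).
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    const int N = 1000000;
    struct timespec ts, t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);              // vDSO: no kernel entry
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("vDSO path:    %.0f ns/call\n", elapsed_ns(t0, t1) / N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts); // forced real syscall
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("syscall path: %.0f ns/call\n", elapsed_ns(t0, t1) / N);
    return 0;
}
```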
Common Questions
What happens when malloc(16) is called?
glibc's ptmalloc2 serves it from a free-list bin. No syscall. If the bin is empty, it may call brk() to extend the heap, or mmap() for large allocations (above the 128 KB threshold by default). The 16-byte request actually occupies a 32-byte chunk on 64-bit -- an 8-byte size header plus alignment padding, leaving 24 usable bytes.
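You can check the chunk arithmetic with glibc's malloc_usable_size() extension; the numbers are typical for 64-bit glibc and will differ under other allocators.

```c
// malloc_chunk.c -- peek at the real chunk behind a small malloc.
// malloc_usable_size() is a glibc extension, not portable C.
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p = malloc(16);
    // Typically prints 24 on 64-bit glibc: a 32-byte chunk minus the
    // 8-byte size header -- more than requested, less than the full chunk.
    printf("requested 16, usable %zu bytes\n", malloc_usable_size(p));
    free(p);
    return 0;
}
```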
Why does fork() not double memory usage?
COW semantics. Parent and child share physical pages until one of them writes. A process using 4 GB RSS that forks and immediately execs consumes almost no additional memory. Only pages that are subsequently written get duplicated.
How does ASLR actually work?
By randomizing the base addresses of the stack, heap, mmap region, and executable (PIE). The randomization happens at exec time. With 28 bits of entropy for the mmap base, the probability of guessing correctly is 1 in 268 million per attempt.
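A quick sketch to watch this happen: print a few addresses and run the binary twice -- with randomize_va_space set to 2 they move every time (the code address only moves for PIE builds, the default on most distributions).

```c
// aslr_demo.c -- run twice and compare; ASLR shifts each region per exec.
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    int on_stack;
    void *on_heap = malloc(64);
    void *mapped  = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("stack %p  heap %p  mmap %p  code %p\n",
           (void *)&on_stack, on_heap, mapped, (void *)main);
    free(on_heap);
    munmap(mapped, 4096);
    return 0;
}
```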
What is the difference between a minor and major page fault?
A minor fault resolves without disk I/O -- COW fault, zero-fill for anonymous pages, or the page is already in the page cache. A major fault requires reading from disk -- either swap or the filesystem. Major faults are 1,000x slower and dominate process startup time for large binaries.
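A sketch that counts both kinds of fault with getrusage() while sweeping an mmap'd file; /tmp/bigfile is a placeholder for any large file. Run it once with a cold page cache (mostly major faults) and again warm (mostly minor faults).

```c
// fault_count.c -- minor vs. major faults around a first pass over a mapped file.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/bigfile", O_RDONLY);           // placeholder test file
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    volatile long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];                                    // fault in every page once
    getrusage(RUSAGE_SELF, &after);

    printf("minor faults: %ld, major faults: %ld\n",
           after.ru_minflt - before.ru_minflt,
           after.ru_majflt - before.ru_majflt);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```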
How Technologies Use This
A Kafka broker managing 50,000 partitions shows 3 TB of virtual memory mapped but only 64 GB of physical RAM available. Operators panic at the virtual size numbers and start planning a cluster expansion, assuming the broker is about to run out of memory.
The key insight is that Kafka mmaps each partition index file into a virtual address range that costs zero physical pages until a consumer actually seeks into that partition. The kernel demand-pages only the hot segments, typically keeping just 2-5% of total index data resident in physical memory at any time.
Monitor RSS instead of virtual size. Brokers routinely map 3 TB of virtual space while consuming under 15 GB RSS because demand paging ensures only actively accessed index pages occupy physical RAM, letting a single node handle partition counts that would otherwise require an entire cluster.
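The same residency question can be asked of any mapped file with mincore(2). This is not Kafka's code, just the mechanism underneath it: map a file read-only and count how many of its pages are actually in RAM.

```c
// residency.c -- how much of a mapped file is resident? (mincore reports
// one byte per page; bit 0 set means the page is in RAM.)
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    size_t pages = (st.st_size + 4095) / 4096;
    unsigned char *vec = malloc(pages);
    mincore(p, st.st_size, vec);

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        resident += vec[i] & 1;
    printf("%zu of %zu pages resident (%.1f%%)\n",
           resident, pages, 100.0 * resident / pages);
    free(vec);
    return 0;
}
```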
A PostgreSQL instance with 200 backends suddenly loses all connections after a single backend corrupts a pointer. Every connection crashes simultaneously, and the DBA finds no isolation between processes sharing the 16 GB buffer pool.
Virtual memory gives each backend an independent 128 TB address space while secretly mapping the same shared_buffers segment to identical physical pages via MAP_SHARED. One backend corrupting its own private address space cannot touch another backend's mappings, and COW semantics make forking a new backend off the postmaster nearly instant because no physical pages are copied until a write occurs.
Rely on virtual memory isolation as the safety net for multi-backend architectures. Forking a 50 GB process completes in under 2 ms, and 200 backends share one physical copy of the buffer pool instead of duplicating 3.2 TB of RAM.
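A toy sketch of that pattern (not PostgreSQL's actual code): one MAP_SHARED anonymous region plays the role of shared_buffers and stays the same physical pages across fork(), while private memory quietly goes copy-on-write.

```c
// shared_vs_private.c -- shared mapping vs. private COW memory across fork().
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // "shared_buffers": every forked backend sees the same physical page.
    int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    // Private memory: shared copy-on-write after fork, then diverges.
    int *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    *shared = 0;
    *priv = 0;

    if (fork() == 0) {           // child "backend"
        *shared = 42;            // lands in the single shared physical page
        *priv = 42;              // COW: the child gets its own copy of this page
        _exit(0);
    }
    wait(NULL);
    printf("parent sees shared=%d priv=%d\n", *shared, *priv);  // 42 and 0
    return 0;
}
```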
A Redis instance shows 6 GB RSS when only 4 GB of keys are stored. Operators spend hours hunting for memory leaks using every profiling tool available, but nothing turns up because there is no leak.
The hidden cause is virtual memory fragmentation. Redis relies on jemalloc to carve the virtual address space into many size classes, but freed objects leave invisible holes in the address space that cannot be returned to the OS because adjacent pages still hold live data. The INFO memory command exposes mem_fragmentation_ratio (roughly RSS divided by the bytes of live data), and anything above 1.5 signals that RSS is at least 50% larger than the data actually stored.
Check mem_fragmentation_ratio before chasing phantom leaks. If it exceeds 1.5, persist with BGSAVE and restart Redis; the fresh process rebuilds a compact heap and closes the gap between actual data size and RSS.
A JVM configured with -Xmx512g appears to consume 512 GB at startup, alarming monitoring systems and triggering capacity alerts. Teams assume they need a machine with 512 GB of physical RAM just to start the process.
The JVM calls mmap with MAP_NORESERVE, creating a contiguous 512 GB virtual reservation that costs zero physical pages. No memory is touched at startup. As the application allocates objects, page faults commit physical frames at roughly 1 microsecond each. ZGC exploits this further by multi-mapping the same physical pages at three virtual addresses simultaneously, using colored pointers to track GC state without long stop-the-world pauses.
Ignore VIRT/VSZ in monitoring dashboards for JVM processes. The 512 GB virtual reservation starts in under 100 ms and costs nothing until objects are actually allocated, so size capacity based on RSS growth under load, not the initial virtual reservation.
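The reserve-then-commit pattern itself is easy to sketch with the same mmap flags; this is not HotSpot's code, and the 256 GB figure is arbitrary -- the reservation costs nothing either way.

```c
// reserve_commit.c -- reserve a huge virtual range with no backing, then
// commit a slice with mprotect(); RSS grows only when pages are written.
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t reserve = 1UL << 38;               // 256 GB of address space
    char *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    printf("reserved %zu GB at %p -- RSS unchanged\n", reserve >> 30, (void *)base);

    // "Commit" the first 64 MB; still no physical frames until written.
    mprotect(base, 64UL << 20, PROT_READ | PROT_WRITE);
    for (size_t i = 0; i < (64UL << 20); i += 4096)
        base[i] = 1;                           // now RSS grows, one fault per page
    return 0;
}
```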
Same Concept Across Tech
| Technology | How virtual memory affects it | Key behavior |
|---|---|---|
| JVM | Heap is virtual memory. -Xmx reserves address space, RSS grows as pages are touched | RSS can be much less than Xmx until GC touches all pages |
| Docker | Memory limits come from cgroups, which account physical pages, not address space | memory.limit_in_bytes (v1) / memory.max (v2) caps RSS plus page cache, not virtual size |
| Node.js | V8 heap + Buffer allocations are virtual. --max-old-space-size controls V8 heap | Buffer.allocUnsafe uses virtual pages, materialized on write |
| Go | Runtime reserves large virtual address space upfront for GC arenas | High VSZ is normal and not a memory leak |
| PostgreSQL | shared_buffers is an mmap'd shared segment; the OS page cache additionally caches data files | Per-backend RSS includes shared pages, so summing RSS double-counts across processes |
Stack layer mapping (process using too much memory):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app holding references that prevent GC? Memory leak? | Heap dump, profiler |
| Runtime | Is the heap configured correctly? Native memory growing? | JVM NMT, Go pprof, Node heapdump |
| Virtual memory | VSZ vs RSS vs swap. Page faults? | /proc/PID/status, /proc/PID/smaps |
| Kernel | Overcommit policy, swap pressure, page reclaim | /proc/sys/vm/overcommit_memory, /proc/pressure/memory |
| Hardware | Total physical RAM, NUMA distribution | free -h, numactl --hardware |
Design Rationale
Without virtual memory, every process would need a contiguous block of physical RAM sized for its worst case. Fragmentation would make large allocations impossible, and two processes using the same address would corrupt each other -- the DOS experience, basically. The translation layer between what a process sees and where data physically lives buys four things at once: demand paging (physical frames allocated only when touched), copy-on-write (fork without copying gigabytes), memory-mapped files (unified file and memory access), and per-process isolation. All from one indirection. The cost is the page table walk on TLB misses, which hardware mitigates with TLB caches and huge page support so well that most workloads rarely notice.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| VSZ is 10x larger than RSS | Normal. Virtual pages are not resident until touched | Only worry if RSS is high, not VSZ |
| RSS growing slowly over time | Memory leak or unbounded cache | Monitor RSS with pidstat -r, check for growing allocations |
| Major page faults during request handling | Data being read from swap, not RAM | perf stat -e major-faults, check swap usage |
| Fork is fast despite large process | Copy-on-write. Pages are shared until written | Normal behavior, not a bug |
| SIGSEGV with valid-looking address | Accessing unmapped page, stack overflow, or use-after-munmap | cat /proc/PID/maps, check if address is in a mapped region |
| Container OOM but process RSS seems low | Kernel accounting includes page cache, tmpfs, slab caches | Check memory.stat in cgroup for cache vs rss breakdown |
When to Use / Avoid
Relevant when:
- Understanding why VSZ (virtual size) is much larger than RSS (resident size) and that is normal
- Debugging memory-related crashes (SIGSEGV, OOM kills, swap storms)
- Using mmap for memory-mapped files or shared memory between processes
- Understanding copy-on-write behavior after fork()
Watch out for:
- Assuming VSZ indicates actual memory usage (it does not, RSS does)
- Overcommitting without monitoring (default Linux behavior allows allocating more virtual memory than physical RAM)
- Memory-mapped files that are never accessed still reserve virtual address space
Try It Yourself
```bash
# View virtual memory layout of current shell
cat /proc/self/maps

# Show detailed memory map with RSS for a process
pmap -x $(pidof nginx)

# Check system-wide memory stats and page fault rates
vmstat 1 5

# View address space size limits
ulimit -v    # max virtual memory (KB, often unlimited)
ulimit -s    # max stack size (KB, default 8192)

# Check ASLR setting (2 = full randomization)
cat /proc/sys/kernel/randomize_va_space
```

Debug Checklist
1. Check virtual vs resident memory: cat /proc/<pid>/status | grep -E 'VmSize|VmRSS|VmSwap'
2. View memory map: cat /proc/<pid>/maps | head -30
3. Check detailed memory usage: cat /proc/<pid>/smaps_rollup
4. Monitor page faults: perf stat -e page-faults,minor-faults,major-faults -p <pid>
5. Check overcommit policy: cat /proc/sys/vm/overcommit_memory
6. Check swap usage: free -h and swapon --show
Key Takeaways
- ✓ Every process gets a 128 TB private address space -- but it is all fake. The kernel splits the 64-bit space at 0xFFFF800000000000, giving user space the lower half and keeping the upper half for itself
- ✓ Page faults are features, not bugs. When you touch a page for the first time, the kernel traps, allocates a physical frame, and wires up the translation -- this is how a 1 GB mmap completes in microseconds but costs zero RAM until accessed
- ✓ ASLR randomizes where your stack, heap, and libraries land on every exec -- 28 bits of entropy means an attacker has a 1-in-268-million chance of guessing your mmap base
- ✓ brk() is the old way to grow the heap; modern malloc uses mmap() for anything above 128 KB because mmap'd regions can be freed independently, while brk can only shrink from the top
- ✓ The vDSO is a kernel page mapped into user space that lets gettimeofday() run without a syscall -- zero ring transitions, pure speed
Common Pitfalls
- ✗ Mistaking virtual size for real usage -- a process can map terabytes via mmap and show huge VIRT/VSZ numbers while consuming almost no physical RAM; RSS is what matters for memory pressure
- ✗ Checking mmap() returns against NULL instead of MAP_FAILED -- mmap returns (void*)-1 on failure, not NULL, and address 0 can theoretically be a valid mapping (see the sketch after this list)
- ✗ Calling brk()/sbrk() directly in threaded code -- these modify a single program break pointer shared across all threads, so concurrent calls corrupt the heap
- ✗ Assuming the stack grows forever -- it is capped by ulimit (default 8 MB), and exceeding it gives you a silent SIGSEGV, not a helpful error message
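For the MAP_FAILED pitfall in the list above, a minimal correct check looks like this:

```c
// mmap_check.c -- test against MAP_FAILED ((void *)-1), never against NULL.
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    void *p = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {       // NOT: if (p == NULL)
        perror("mmap");
        return 1;
    }
    munmap(p, 1UL << 20);
    return 0;
}
```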
Reference
In One Line
Virtual addresses are a lie that makes everything work -- fast forks, cheap mmap, process isolation -- and when things break (OOM kills, swap storms, SIGSEGV), the answer almost always lives in that lie.