mmap & Memory-Mapped Files
Mental Model
No photocopying in this library. Patrons read the original book on the shelf. Ten readers looking at the same volume simultaneously, one physical copy. Someone wants to scribble notes? They get a personal copy of just that page. The original stays untouched. No duplication unless someone writes.
The Problem
Millions of random 8 KB reads per second in a database, and every read() copies data from the page cache into a user-space buffer -- 0.5 microseconds of pure copy overhead per call, 500 ms wasted every second at a million reads. A web server serving the same 10 MB file to thousands of concurrent clients allocates a separate buffer for each request, even though the bytes are identical.
Architecture
Every call to read() copies data through the kernel into a user-space buffer. One extra copy, every time.
For a database doing millions of random reads, or a web server hammering the same hot file at 50,000 requests per second, that copy is pure waste. mmap() eliminates it entirely. It maps the file directly into the process's address space. Reading it becomes a pointer dereference. No syscall. No copy.
This is how Kafka reads index files, how PostgreSQL shares buffer pools across backends, and how Git makes multi-gigabyte packfiles feel instant.
What Actually Happens
When mmap() is called, the kernel does not read the file. It creates a VMA (virtual memory area) that maps virtual addresses to file offsets. That is it. No data moves. No physical memory is consumed.
The first access triggers a page fault. The fault handler checks the page cache. If the page is already cached (from a previous read or another process's mapping), the PTE is updated to point at the cached page. No disk I/O. If it is not cached, the kernel reads it from disk into a new page frame and adds it to the page cache.
This is why mmap() works for a 1 TB file on a 64 GB machine. Only the pages actually touched consume RAM. The rest exist purely as virtual mappings.
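Here is a minimal C sketch of that flow (using /etc/hosts purely as a stand-in file, error handling mostly trimmed): mmap() only builds the mapping, and the first dereference is what actually faults a page in.

```c
/* Minimal sketch: map a file read-only and touch one page.
 * No data is read at mmap() time; the first dereference below
 * triggers a page fault that pulls the page from the page cache
 * (or disk) on demand. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hosts", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Reserve a virtual range backed by the file -- costs no RAM yet. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* First access: page fault, page cache lookup, PTE installed. */
    printf("first byte: %c (size %lld)\n", p[0], (long long)st.st_size);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```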
MAP_SHARED is true sharing. The process's page table entry points directly to the page cache page. Writes modify the page cache, and every other process mapping the same file sees the changes immediately -- they are reading the same physical pages. Modified pages are marked dirty and written to disk by the kernel's writeback threads, or on demand via msync().
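The sharing is easy to see with a small fork() experiment. This sketch uses an anonymous shared mapping to stay self-contained, but the visibility rule is the same for a file-backed MAP_SHARED mapping: both processes dereference the same physical page.

```c
/* Sketch: MAP_SHARED pages are the same physical pages in both
 * processes, so a write by the child is immediately visible to
 * the parent -- no pipes, no copies. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                        /* child */
        strcpy(shared, "written by child");
        _exit(0);
    }
    wait(NULL);                               /* parent waits, then reads */
    printf("parent sees: %s\n", shared);      /* same physical page */

    munmap(shared, 4096);
    return 0;
}
```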
MAP_PRIVATE is copy-on-write. Initially, the PTE points to the page cache page (marked read-only). The first write triggers a COW fault: the kernel allocates a new anonymous page, copies the contents, and remaps the PTE. Writes go to the private copy. The original file is never modified. This is exactly how the dynamic linker loads shared libraries -- everyone shares the code pages, but each process gets its own writable data section.
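A sketch of the COW behavior, assuming /etc/hosts as a convenient non-empty read-only file: the write lands in a private copy and the file on disk never changes.

```c
/* Sketch: MAP_PRIVATE is copy-on-write. The write below triggers a
 * COW fault, the kernel copies the page to a private anonymous page,
 * and the file on disk is never modified. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hosts", O_RDONLY);    /* assumed non-empty */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = '#';      /* COW fault: this page is now a private copy */
    printf("mapping sees: %c -- but /etc/hosts on disk is untouched\n", p[0]);

    munmap(p, 4096);
    close(fd);
    return 0;
}
```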
Anonymous mmap is how malloc handles large allocations. mmap(MAP_ANONYMOUS|MAP_PRIVATE) returns zero-filled memory backed by nothing. malloc() uses this for allocations above 128 KB. Unlike brk(), these regions can be independently returned to the kernel via munmap(), which is why large allocations do not cause the brk heap fragmentation trap.
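Roughly what the allocator does for a large request, sketched directly against mmap (the 1 MiB size and the 128 KB threshold mirror glibc defaults; treat this as an illustration, not glibc's actual code):

```c
/* Sketch: what malloc roughly does for a large allocation -- grab an
 * anonymous, private, zero-filled region, and hand it straight back to
 * the kernel with munmap when freed, independent of the brk heap. */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1 << 20;        /* 1 MiB, well above the ~128 KB threshold */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    buf[0] = 42;                 /* zero-filled pages, committed on first touch */
    printf("buf[0]=%d buf[4096]=%d\n", buf[0], buf[4096]);

    munmap(buf, len);            /* returned to the kernel immediately, no fragmentation */
    return 0;
}
```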
Under the Hood
Page cache integration is the key insight. For file-backed mappings, the page cache IS the backing store. There is no separate buffer. The VMA's page table entries point directly at page cache pages. read() and mmap() access the same physical pages. A file read via read() populates the page cache for a subsequent mmap() access, and vice versa.
msync semantics matter for durability. msync(MS_SYNC) forces dirty pages to disk synchronously -- it waits for I/O completion. msync(MS_ASYNC) marks pages for writeback but returns immediately. Neither is necessary for inter-process visibility. Shared mappings are visible to other processes immediately because they share the same physical pages. msync is only needed for durability -- making sure data survives a crash.
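A durability sketch, assuming a scratch file at /tmp/mmap-demo: the store instruction dirties a page cache page, and msync(MS_SYNC) is what pushes it to disk.

```c
/* Sketch: write through a MAP_SHARED file mapping and force the dirty
 * page to disk with msync(MS_SYNC). Other processes would see the write
 * immediately even without msync; the sync is only for durability. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/mmap-demo", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0) { perror("setup"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "durable after msync");      /* dirties the page cache page */
    if (msync(p, 4096, MS_SYNC) != 0)      /* blocks until the write hits disk */
        perror("msync");

    munmap(p, 4096);
    close(fd);
    return 0;
}
```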
madvise is the performance lever. MADV_SEQUENTIAL tells the kernel to aggressively readahead and free pages after access -- perfect for log processing. MADV_RANDOM disables readahead -- for hash table lookups. MADV_DONTNEED immediately releases pages (for anonymous mappings, the data is gone; for file-backed, it can be re-faulted from disk). MADV_HUGEPAGE requests transparent huge pages for the region.
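A sketch of the sequential-scan pattern: advise, scan, then tell the kernel the pages are disposable. The hints are advisory, and /etc/hosts stands in for a real log or data file.

```c
/* Sketch: madvise hints applied to a file mapping. MADV_SEQUENTIAL asks
 * for aggressive readahead during the scan; MADV_DONTNEED afterwards lets
 * the kernel drop the pages immediately. The kernel may ignore hints. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hosts", O_RDONLY);       /* stand-in for a large data file */
    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, st.st_size, MADV_SEQUENTIAL);     /* scan: aggressive readahead */
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++) sum += p[i];

    madvise(p, st.st_size, MADV_DONTNEED);       /* done: pages can be reclaimed */
    printf("checksum %ld\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```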
The SIGBUS trap. If another process truncates the file while it is mapped, accessing pages beyond the new end of file delivers SIGBUS. Not SIGSEGV. Not an error return. A signal that kills the process. This is one of the most common crash patterns in database code that uses mmap. The fix: use flock()/fcntl() locks to prevent concurrent truncation, or handle SIGBUS in a signal handler.
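One common defensive pattern is to catch SIGBUS around accesses into the mapping and recover with siglongjmp. The sketch below is simplified; real code must also verify which address faulted and keep the jump buffer per-thread.

```c
/* Sketch: survive the SIGBUS raised by touching a mapped page past a
 * concurrent truncation, instead of letting the default action kill
 * the process. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf recover;

static void on_sigbus(int sig) {
    (void)sig;
    siglongjmp(recover, 1);               /* unwind out of the faulting access */
}

/* Read one byte, returning -1 instead of dying if the page is gone. */
static int safe_read_byte(const volatile char *addr, char *out) {
    if (sigsetjmp(recover, 1) != 0)
        return -1;                        /* we got here via SIGBUS */
    *out = *addr;
    return 0;
}

int main(void) {
    struct sigaction sa;
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGBUS, &sa, NULL);

    char buf[1] = { 'A' }, out;
    /* In real code, the address would point into an mmap'd file region. */
    if (safe_read_byte(buf, &out) == 0)
        printf("read ok: %c\n", out);
    else
        printf("SIGBUS caught, mapping no longer valid\n");
    return 0;
}
```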
Common Questions
When is mmap faster than read/write?
mmap wins for random access to large files -- each access is a page fault (or TLB hit if cached), while read() requires a syscall per read. For sequential access, read() with kernel readahead is often faster because it avoids per-page fault overhead and TLB pressure. mmap also wins when multiple processes share the same file data, because they share physical pages via the page cache.
How does mmap enable zero-copy IPC?
Two processes mmap the same file with MAP_SHARED. Writes by one process are immediately visible to the other -- same physical pages, no kernel copies. POSIX shared memory (shm_open() + mmap()) uses this mechanism with files on tmpfs (/dev/shm). The two copies in pipe/socket IPC (user to kernel, kernel to user) are completely eliminated.
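A self-contained sketch of that path using shm_open() plus fork(); the segment name "/demo-region" is arbitrary, and older glibc needs -lrt at link time.

```c
/* Sketch: POSIX shared memory as zero-copy IPC. shm_open() creates a
 * file on tmpfs (/dev/shm), both sides mmap it MAP_SHARED, and writes
 * are visible to the peer with no kernel copies. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd = shm_open("/demo-region", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, 4096) != 0) { perror("shm_open"); return 1; }

    char *region = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                           /* "producer" process */
        strcpy(region, "hello over shared memory");
        _exit(0);
    }
    wait(NULL);
    printf("consumer read: %s\n", region);       /* no copy ever happened */

    munmap(region, 4096);
    shm_unlink("/demo-region");
    return 0;
}
```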
What happens when memory pressure evicts mmap'd pages?
For clean file-backed pages, the kernel simply unmaps them -- they can be re-read from disk on next access. For dirty pages, it writes them back to the file first. For anonymous MAP_PRIVATE pages, they must go to swap. MAP_SHARED dirty pages are written to the original file.
Can mmap exceed physical RAM?
Absolutely. A 1 TB file can be mmapped on a 64 GB machine. Only actively accessed pages consume physical memory (RSS). Untouched ranges exist only as mapping metadata, and previously touched pages live in the page cache, where they can be evicted and re-read on demand. This is how Git handles multi-gigabyte packfiles -- the entire file is mapped, but only the pages actually dereferenced occupy RAM.
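mincore() makes the virtual-vs-resident distinction observable from inside a process. The sketch below uses /etc/hosts as a stand-in; with a genuinely large file, only the pages you touch would show up as resident.

```c
/* Sketch: check which pages of a mapping are resident with mincore().
 * Maps a file, touches only the first page, then reports residency. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hosts", O_RDONLY);       /* stand-in for a large file */
    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    volatile char c = p[0];                      /* fault in just the first page */
    (void)c;

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;
    unsigned char vec[npages];
    if (mincore(p, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++) resident += vec[i] & 1;
        printf("%zu of %zu pages resident\n", resident, npages);
    }

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```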
How Technologies Use This
A Kafka broker handling 200,000 consumer fetch requests per second struggles with offset lookup latency. Each fetch needs to resolve a byte offset in a log segment, and using read() syscalls for every lookup means 400,000 syscalls per second plus kernel-to-user copies on every request.
Kafka mmaps each partition index file (typically 10 MB) so the offset-to-position translation becomes a direct pointer dereference into the page cache. No syscall is needed. Because index files are small and accessed repeatedly, their pages stay hot in the TLB, making each lookup cost under 100 ns compared to 2-5 us for a read() path.
Use mmap for small, frequently accessed index files where random lookups dominate. At 200,000 fetches per second, eliminating the seek-plus-read path saves 400,000 syscalls per second and removes the kernel-to-user copy overhead entirely.
Nginx needs to serve the same 50 KB CSS file to 30,000 concurrent clients. Without page cache integration, each request triggers a disk read and a kernel-to-user copy, meaning 30,000 redundant disk reads and 1.5 GB of wasted memory bandwidth per second.
Nginx uses sendfile(), which transfers data directly from the page cache to the socket, eliminating the user-space copy entirely. The file loads into the page cache once and stays there. Combined with open_file_cache holding file descriptors and metadata for 1,000 hot assets, Nginx avoids repeated open() and stat() syscalls on every request.
Enable sendfile and open_file_cache for static asset serving. A single 4 GB server can saturate a 10 Gbps link serving static files because the hot working set lives entirely in page cache RAM, and zero-copy transfer from page cache to socket eliminates all user-space memory bandwidth overhead.
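A bare-bones sketch of the sendfile() call behind that zero-copy path. In a real server the destination would be the client socket; a plain file is used here so the example runs standalone (Linux-specific, paths are just examples).

```c
/* Sketch: zero-copy transfer with sendfile(). Data moves from the page
 * cache to the destination without ever entering user space. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int in = open("/etc/hosts", O_RDONLY);
    int out = open("/tmp/hosts-copy", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(in, &st);

    off_t offset = 0;
    ssize_t sent = sendfile(out, in, &offset, st.st_size);  /* no user-space buffer */
    printf("sent %zd of %lld bytes\n", sent, (long long)st.st_size);

    close(in);
    close(out);
    return 0;
}
```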
A PostgreSQL instance with 200 backend connections needs each process to access the same 16 GB buffer pool. Without shared memory, each backend would need its own copy, requiring 3.2 TB of RAM for 200 connections, which is obviously impossible.
MAP_SHARED maps the shared_buffers segment so every backend's page table entries point to the same physical pages. A dirty buffer written by one backend is instantly visible to all others with zero copying. The 200 processes share a single 16 GB physical copy of the buffer pool, and coherence is automatic because every backend references the same physical pages.
Combine MAP_SHARED with huge_pages=on to reduce the 16 GB region from 4 million TLB entries to just 8,192. With buffer cache hit rates above 99%, most queries never touch disk at all, and the shared mapping ensures zero memory duplication across all connections.
Loading a 10 GB Redis RDB snapshot via sequential read() takes 3 minutes because the entire file must be copied through kernel buffers into user space before parsing can begin. During a restart after a crash, this 3-minute reload window means extended downtime for every client.
When RDB loading uses mmap, the snapshot file is mapped directly into the address space and parsed as if it were an in-memory array. Demand paging fetches only the pages the parser touches, and kernel readahead prefetches sequential pages in 128 KB batches. During BGSAVE, the forked child shares the parent's entire dataset through COW pages, allocating near-zero additional memory until the parent modifies keys.
Use mmap-based RDB loading to cut restore time from 3 minutes to 45 seconds for a 10 GB snapshot. The combination of demand paging and kernel readahead eliminates the upfront copy, and COW sharing during BGSAVE means the snapshot process runs without doubling memory consumption.
A Go program running 500,000 goroutines appears to need terabytes of stack memory because each goroutine's stack can grow to 1 GB. Monitoring tools report massive virtual memory usage, and operators worry the process is about to exhaust the machine.
The Go runtime calls mmap with MAP_NORESERVE to create virtual address reservations that cost zero physical pages. Each goroutine starts with a 2 KB stack, and physical frames are committed on demand at roughly 1 us per page fault as stacks grow. BoltDB uses the same technique, mmapping its entire database file and accessing B+ tree nodes as direct pointer dereferences at under 500 ns per random read.
Trust the virtual-to-physical distinction when monitoring Go services. The terabytes of virtual reservations cost nothing until goroutines actually grow their stacks, and demand paging ensures physical RAM consumption matches actual usage rather than worst-case potential.
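The reservation trick itself is one mmap flag. This sketch reserves 64 GiB of address space the way a runtime might; the exact size is illustrative, and strict overcommit settings can still reject it.

```c
/* Sketch: reserve a huge virtual range with MAP_NORESERVE. Virtual size
 * balloons; RSS grows only for the pages actually touched. */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t reserve = 64UL << 30;                 /* 64 GiB of address space */
    char *arena = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (arena == MAP_FAILED) { perror("mmap"); return 1; }

    arena[0] = 1;                                /* commit exactly one physical page */
    printf("reserved 64 GiB virtually; RSS grew by roughly one page\n");

    munmap(arena, reserve);
    return 0;
}
```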
Elasticsearch needs to search across 500 GB of Lucene index segments on a 64 GB node. Loading them into the JVM heap would create massive GC pressure, constant stop-the-world pauses, and the heap cannot even hold that much data.
MappedByteBuffer maps each segment file directly into off-heap virtual memory, turning index lookups into pointer dereferences resolved by page faults. The kernel manages eviction through the page cache LRU, so hot segments stay resident while cold ones are reclaimed automatically. The GC never sees this data because it lives outside the managed heap entirely.
Use MappedByteBuffer for large read-heavy datasets that exceed heap capacity. On a 64 GB node indexing 500 GB, only 40-50 GB of the hottest segments remain in RAM, and query p99 latency stays under 50 ms because the kernel's page cache handles eviction more efficiently than any application-level cache could.
A user opens 40 Chrome tabs and each process loads 200 MB of shared libraries, fonts, and ICU data. Without memory sharing, this would consume 8 GB of physical RAM just for duplicated read-only resources, leaving little room for actual page content.
MAP_PRIVATE maps these read-only assets so all 40 processes share identical physical pages through copy-on-write. The 40 × 200 MB collapses to roughly 200 MB of physical RAM. For inter-process rendering, Chrome uses MAP_SHARED on /dev/shm so the renderer paints directly into a buffer the compositor reads at memory speed, eliminating kernel copies entirely.
Leverage MAP_PRIVATE for shared read-only resources and MAP_SHARED for zero-copy IPC between processes. At 60 fps and 8 MB per frame, the shared memory approach saves 480 MB/s of memory bandwidth that would otherwise go to pipe-based IPC between the renderer and compositor.
Running git log on a repository with a 4 GB packfile would take seconds if Git had to read() and seek through gigabytes of packed objects, issuing a syscall for each object lookup. Users would experience painful delays browsing even recent history.
Git mmaps the entire packfile into virtual memory, making every object lookup a pointer offset calculation followed by a direct memory dereference. Demand paging ensures only the pages Git actually touches consume physical RAM -- browsing recent commits might fault in just 20 MB of the 4 GB file. The pack index is also mmapped, so resolving a SHA to a byte offset is a binary search over an in-memory array.
Use mmap for large read-only data files that are accessed via random lookups. Git completes the log operation in milliseconds because demand paging loads only the 20 MB of relevant pages out of 4 GB, and each SHA-to-offset resolution costs roughly 200 ns as a direct memory dereference instead of a syscall.
Same Concept Across Tech
| Technology | How it uses mmap | Key detail |
|---|---|---|
| Kafka | Log segments are mmap'd for consumer reads. Zero-copy path: page cache to socket via sendfile | Index files also mmap'd for fast offset lookup |
| PostgreSQL | Does NOT mmap data files into shared_buffers (allocates shared memory and runs its own buffer manager) | Avoids file-backed mmap because it needs fine-grained control over dirty pages and write ordering |
| MongoDB (WiredTiger) | Block manager uses mmap for reading data files | Moved away from full mmap in later versions for better control |
| Git | Pack files are mmap'd for fast object lookup across large repos | Enables random access into multi-GB pack files |
| JVM | MappedByteBuffer wraps mmap. Used for memory-mapped file I/O in NIO | DirectByteBuffer is different (anonymous off-heap memory, not file-backed) |
| Node.js | No built-in mmap support. Use npm packages or fs.readFile | libuv uses thread pool for file I/O instead of mmap |
Stack layer mapping (mmap performance issue):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is access pattern random or sequential? mmap wins for random | Access pattern analysis |
| Virtual memory | How many VMAs exist? Hitting max_map_count? | wc -l /proc/PID/maps |
| Page cache | Are mapped pages in cache or being faulted from disk? | /proc/PID/smaps Rss field |
| Kernel | Major faults (disk reads) vs minor faults (already cached)? | perf stat -e major-faults |
| Storage | Is the disk I/O pattern random (HDD slow) or sequential? | iostat -x |
Design Rationale
Maintaining separate caches for read/write and mmap would waste physical memory and create coherence nightmares between the two layers. Forcing applications to manage their own buffer pools and call read() on every access imposes a mandatory kernel-to-user copy -- pure overhead for random-access workloads like databases and search indexes. Making the page cache the single source of truth and letting mmap alias those pages into process address space eliminates the copy, enables transparent cross-process sharing, and makes demand paging the default: mapping a terabyte file costs zero RAM until pages are actually touched.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| SIGBUS crash when accessing mapped region | File was truncated after mmap, mapped pages no longer exist | Check if another process truncated the file |
| High major page faults | Mapped data not in page cache, being read from disk on each access | perf stat -e major-faults, check available RAM |
| max_map_count exceeded error | Too many mmap calls creating too many VMAs | wc -l /proc/PID/maps, increase vm.max_map_count |
| mmap slower than read() for sequential scan | read() benefits from kernel readahead. mmap faults page by page | Use madvise(MADV_SEQUENTIAL) or switch to read() |
| Dirty mmap pages not on disk after crash | msync() or fsync() not called. Dirty pages only flushed on writeback timer | Always msync(MS_SYNC) before crash-critical points |
| Memory usage appears high but mostly cached | mmap'd file pages count as RSS but are backed by page cache and can be reclaimed | Check Shared_Clean in smaps, these are reclaimable |
When to Use / Avoid
Use mmap when:
- Random access to large files (databases, search indexes) where read() would copy too much
- Multiple processes need read-only access to the same file (shared libraries, config)
- Implementing shared memory between processes (MAP_SHARED | MAP_ANONYMOUS)
- Working with files larger than available RAM (the kernel pages in/out as needed)
Avoid mmap when:
- Sequential reads of entire files (read() with kernel readahead is often faster)
- The file can be truncated by another process while mapped (causes SIGBUS, not EINVAL)
- Precise error handling is needed (mmap errors are signals, not return codes)
- Small files where the overhead of VMA creation and page table setup exceeds the copy savings
Try It Yourself
# Map a file and examine the mapping
python3 -c "import mmap,os; f=open('/etc/hosts','rb'); m=mmap.mmap(f.fileno(),0,access=mmap.ACCESS_READ); print(m[:100]); m.close()"

# Check page cache residency for a file
fincore /var/lib/postgresql/14/main/base/16384/16385

# Trace mmap syscalls of a running process
strace -e mmap,munmap,mremap -p $(pidof python3) -f

# Show shared vs private memory in mappings
cat /proc/$(pidof nginx)/smaps | grep -E '(^[0-9a-f]|Shared|Private|Rss)'

# Advise kernel about mmap access pattern (Python 3.8+)
python3 -c "import mmap; m=mmap.mmap(-1, 4096); m.madvise(mmap.MADV_SEQUENTIAL); m.close()"

# Drop page cache for a specific file
dd of=/path/to/file oflag=nocache conv=notrunc,fdatasync count=0

Debug Checklist
1. Check mapped regions: cat /proc/<pid>/maps | grep <filename>
2. Check detailed mapping info: cat /proc/<pid>/smaps | grep -A20 <filename>
3. Count total mappings: wc -l /proc/<pid>/maps
4. Check if pages are resident: mincore() in code, or /proc/<pid>/smaps Rss vs Size
5. Monitor page faults from mmap access: perf stat -e major-faults,minor-faults -p <pid>
6. Check VMA limit: cat /proc/sys/vm/max_map_count (default 65530)
Key Takeaways
- ✓ mmap does NOT read the file -- it sets up a virtual address range that points at the file's page cache; actual data loading happens on first access via page faults, which is why mmap'ing a 1 TB file costs zero RAM and microseconds of CPU
- ✓ MAP_SHARED means your writes go directly to the page cache and are instantly visible to every other process mapping the same file -- but they do NOT reach disk until msync() or fsync(); a power failure without sync loses your data
- ✓ MAP_PRIVATE gives you copy-on-write -- reads come from the shared page cache for free, but the first write to any page creates a private copy; the file on disk is never touched
- ✓ malloc uses mmap under the hood for allocations above 128 KB -- unlike brk, these regions can be independently returned to the kernel via munmap, which is why large allocations do not cause the heap fragmentation trap
- ✓ mremap() can resize an existing mapping without copying data (if virtual space permits), which is how dynamic arrays in Go and Rust can sometimes grow without memcpy
Common Pitfalls
- ✗ Mapping a file MAP_SHARED then truncating it -- accessing pages beyond the new file size delivers SIGBUS (bus error), not SIGSEGV; this is a classic database crash pattern
- ✗ Assuming mmap is always faster than read() -- for sequential reads, read() with kernel readahead can match or beat mmap because it avoids page fault overhead and TLB pressure; mmap wins for random access
- ✗ Forgetting msync() before relying on durability -- mmap writes go to the page cache, not to disk; without msync/fsync, a power failure loses your data, exactly like write() without fsync()
- ✗ Mapping very large files on 32-bit systems -- the 3 GB user-space limit means you can map at most ~2 GB contiguously; on 64-bit with 128 TB address space, this is a non-issue
Reference
In One Line
mmap for random access and shared reads; read() with readahead for sequential scans -- and never forget msync() if the data needs to survive a crash.