Shared Memory & Semaphores
Mental Model
Two chefs at the same long kitchen counter. No passing dishes through a window, no runner in between -- both reach for the same cutting board, the same bowls, the same ingredients. Raw speed. But both reaching for the same knife at the same time means someone gets cut. So there is a red kitchen timer on the counter: pick it up and the workspace is exclusively held until it goes back down. No timer in hand, no touching the counter.
The Problem
Chrome needs to push 8 MB frames from renderer to compositor 60 times a second, and piping each one means two kernel copies per frame -- 960 MB/s of bandwidth burned on bookkeeping. Three hundred PostgreSQL backends sharing a 16 GB buffer pool would require 4.8 TB of RAM if each got its own copy. Sixteen Nginx workers hitting Redis for rate-limit counters on every request add 0.5-1ms of network latency per check, and at 200,000 req/s that turns the rate limiter into the bottleneck. Then there is the crash scenario: one process dies while holding an unnamed semaphore in shared memory, and the lock stays held forever.
Architecture
Pipes copy data twice. Once from the sending process to the kernel. Once from the kernel to the receiving process.
For most IPC, that is fine. But Chrome needs to push megabytes of pixel data from its renderer to its compositor 60 times per second. PostgreSQL needs every backend process to read the same cached database pages without duplicating them. Nginx needs all its workers to share rate-limit counters at memory speed.
Shared memory eliminates both copies. Two processes, same physical RAM, memory-speed access. But it comes with a price: two processes writing to the same memory without coordination means corrupted data.
What Actually Happens
POSIX shared memory works in three steps:
Step 1: shm_open(). Creates a named file descriptor on tmpfs (mounted at /dev/shm). This is just a file, but it lives in RAM-backed storage -- no disk I/O. The name (like /my_buffer) is how multiple processes find the same shared region.
Step 2: ftruncate(). Sets the size of the shared memory object. This step is critical. The object starts at size zero. Skipping this step and trying to access the mapping results in SIGBUS -- one of the most confusing crashes in systems programming.
Step 3: mmap(MAP_SHARED). Maps the object into the process's virtual address space. The kernel creates a VMA pointing to the tmpfs-backed pages. When another process opens the same named object and maps it, both VMAs point to the same physical pages.
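A minimal sketch of those three steps in C, on the writer side (the name /demo_buf and the 4 KB size are illustrative, not from any particular codebase):

```c
// Create, size, and map a POSIX shared memory object.
// Older glibc versions need -lrt at link time for shm_open.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    // Step 1: create (or open) the named object; it appears as /dev/shm/demo_buf
    int fd = shm_open("/demo_buf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    // Step 2: the object starts at size 0 -- skipping this line means SIGBUS later
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    // Step 3: map it; another process that opens "/demo_buf" sees the same pages
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from the writer");   // immediately visible to any other mapper
    munmap(p, 4096);
    close(fd);
    // call shm_unlink("/demo_buf") once the region is no longer needed
    return 0;
}
```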
Writes by one process are immediately visible to others. On x86, stores become visible to other cores within 10-100 ns via the MESI/MOESI cache coherence protocol. But "visible" does not mean "consistent" -- without synchronization, the compiler can reorder accesses, the CPU can buffer writes in its store buffer, and a reader can see a partially updated structure.
Under the Hood
Memory ordering is the hard part. On weakly-ordered architectures (ARM, POWER), stores can be reordered and delayed. Even on x86 (which has a strong memory model), the compiler can reorder accesses. When process A writes data then sets a flag, process B might see the flag before the data.
The solutions, from simplest to lowest-level:
- POSIX semaphores provide implicit acquire/release barriers. sem_post() on the writer side acts as a release barrier; sem_wait() on the reader side acts as an acquire barrier. This is the easiest correct approach.
- C11/C++11 atomics with memory_order_release for stores and memory_order_acquire for loads provide explicit control (sketched below).
- Compiler and hardware barriers (__asm__ __volatile__("" ::: "memory") for the compiler, mfence for x86 hardware) are the lowest-level option.
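To make the atomics option concrete, here is a hedged sketch of a release/acquire handoff through a shared mapping. The struct layout and field names are illustrative; it assumes `shared` points into a MAP_SHARED region visible to both processes and that atomic_int is lock-free on the target (it is on x86-64 and AArch64):

```c
// Release/acquire flag handoff between two processes sharing 'struct frame'.
#include <stdatomic.h>
#include <stdint.h>

struct frame {
    uint8_t pixels[64];       // illustrative payload
    atomic_int ready;         // 0 = not published, 1 = published
};

// Writer process
void publish(struct frame *shared) {
    for (int i = 0; i < 64; i++) shared->pixels[i] = (uint8_t)i;   // fill payload
    // Release store: all writes above become visible before 'ready' flips to 1
    atomic_store_explicit(&shared->ready, 1, memory_order_release);
}

// Reader process
int consume(struct frame *shared, uint8_t out[64]) {
    // Acquire load: if we observe ready == 1, the payload writes are also visible
    if (atomic_load_explicit(&shared->ready, memory_order_acquire) != 1)
        return 0;                          // nothing published yet
    for (int i = 0; i < 64; i++) out[i] = shared->pixels[i];
    return 1;
}
```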
Unnamed vs named semaphores have different tradeoffs. Unnamed semaphores (sem_init with pshared=1) are placed directly in the shared memory region. They are just a futex word -- one atomic operation for the uncontended path, no file I/O. Named semaphores (sem_open) create a file in /dev/shm and are useful when processes do not already share memory. Named semaphores must be sem_unlink()'d for cleanup.
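A hedged sketch of embedding an unnamed semaphore directly in the shared region (the struct and names are illustrative):

```c
// Unnamed process-shared semaphore living inside the shared memory region itself.
#include <semaphore.h>
#include <stdio.h>

struct shared_region {
    sem_t lock;        // lives in the MAP_SHARED pages, so every mapper sees it
    long  counter;     // the data it protects
};

// Run once, by the process that created the region, after ftruncate + mmap.
int region_init(struct shared_region *r) {
    // pshared = 1: the semaphore synchronizes processes, not just threads
    if (sem_init(&r->lock, 1, 1) < 0) { perror("sem_init"); return -1; }
    r->counter = 0;
    return 0;
}

// Any process that maps the region can then do:
void region_increment(struct shared_region *r) {
    sem_wait(&r->lock);    // acquire barrier: see other processes' prior writes
    r->counter++;
    sem_post(&r->lock);    // release barrier: publish our write before unlocking
}
```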
The crash recovery problem is real. If a process crashes while holding an unnamed semaphore in shared memory, the semaphore stays locked forever. Other processes deadlock on sem_wait(). There is no equivalent of PTHREAD_MUTEX_ROBUST for POSIX semaphores. Solutions: use sem_timedwait() with timeout and recovery logic, use a pthread_mutex_t with PTHREAD_MUTEX_ROBUST attribute instead (the next locker gets EOWNERDEAD and can recover), or use the kernel's robust futex list (set_robust_list syscall).
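A hedged sketch of the robust-mutex alternative: the attribute calls and EOWNERDEAD handling are the standard pthreads API, while the surrounding struct and function names are illustrative.

```c
// Process-shared robust mutex in shared memory: survives the holder crashing.
#include <errno.h>
#include <pthread.h>

struct shared_state {
    pthread_mutex_t lock;
    long counter;
};

// Run once by the creating process.
int state_init(struct shared_state *s) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);  // cross-process
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);     // crash recovery
    int rc = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    s->counter = 0;
    return rc;
}

int state_lock(struct shared_state *s) {
    int rc = pthread_mutex_lock(&s->lock);
    if (rc == EOWNERDEAD) {
        // The previous owner died while holding the lock: repair the protected
        // data if needed, then mark the mutex consistent and continue.
        pthread_mutex_consistent(&s->lock);
        rc = 0;
    }
    return rc;
}
```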
Shared memory does not cross container boundaries by default. Docker containers get their own IPC namespace (CLONE_NEWIPC), isolating System V shared memory. /dev/shm is typically a per-container tmpfs mount. To share memory between containers, use --ipc=host or --ipc=container:<name> in Docker. In Kubernetes, use emptyDir with medium: Memory.
Common Questions
How does POSIX shared memory differ from mmap'ing a regular file with MAP_SHARED?
Functionally, they are similar -- both create shared mappings backed by page cache. The difference is the backing store. shm_open() creates a tmpfs file (RAM/swap-backed, no disk I/O). mmap() on a regular file is filesystem-backed (reads and writes may hit disk). For pure IPC with no persistence need, shared memory is faster. For data that must survive reboots, use a regular file.
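The point is easiest to see side by side: the mmap() call is identical, only the file descriptor's backing store differs. A minimal sketch (paths are illustrative, error handling omitted):

```c
// Same mapping call, different backing object.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_4k(int fd) {
    ftruncate(fd, 4096);
    return mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(void) {
    // tmpfs-backed: pure IPC, gone after shm_unlink() or reboot
    int shm_fd  = shm_open("/scratch", O_CREAT | O_RDWR, 0600);
    // filesystem-backed: writes eventually reach disk and survive reboot
    int file_fd = open("/var/tmp/scratch.bin", O_CREAT | O_RDWR, 0600);

    void *ipc_region  = map_4k(shm_fd);
    void *file_region = map_4k(file_fd);
    (void)ipc_region; (void)file_region;
    return 0;
}
```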
Can shared memory be used across Docker containers?
Not by default. Containers get isolated IPC namespaces and separate /dev/shm mounts. Use --ipc=host to share with the host, or --ipc=container:<name> to share between specific containers. In Kubernetes, use an emptyDir volume with medium: Memory.
What is the maximum shared memory size?
For POSIX shared memory, it is limited by the tmpfs mount size (typically 50% of RAM by default). For System V, it is /proc/sys/kernel/shmmax (default varies, often 50% of RAM) and shmall (total pages). The practical limit is physical RAM plus swap. PostgreSQL administrators often need to increase shmmax for large shared_buffers.
Why use POSIX shared memory instead of System V?
System V shared memory (shmget/shmat) uses numeric keys instead of names, has confusing lifecycle semantics (segments persist until IPC_RMID AND all processes detach), and requires ipcs/ipcrm for administration. POSIX shared memory uses clean file-descriptor semantics, is administered via /dev/shm/ filesystem, and has simpler cleanup with shm_unlink(). For new code, there is no reason to use System V.
How Technologies Use This
Sixteen Nginx worker processes need to enforce a global rate limit of 10,000 requests per second. Without shared state, each worker would query Redis on every request, adding 0.5-1 ms of network latency and turning the rate limiter into a throughput bottleneck.
Nginx allocates shared memory zones via mmap(MAP_SHARED) where all 16 workers read and atomically increment the same rate-limit counters. Access takes 10-50 ns using embedded spinlocks and CAS operations -- three orders of magnitude faster than a network round trip. No external coordinator is needed because the counters live in physical pages shared across all worker processes.
Use shared memory zones for any cross-worker state that must be checked on every request. The same mechanism backs session caches, proxy cache metadata, and upstream health checks, letting a single Nginx instance handle 200,000 requests per second while maintaining globally consistent state without network overhead.
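At its core the mechanism is just an atomic counter in a region mapped before fork(). A much-simplified, hedged sketch of the idea -- not Nginx's actual ngx_shm code; the struct, limit, and worker count are illustrative:

```c
// Cross-worker rate-limit counter in anonymous shared memory, in the spirit
// of an Nginx shared zone (not the real implementation).
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <unistd.h>

struct rate_zone {
    atomic_long count;        // requests seen in the current window
    long        limit;        // e.g. 10000 requests per second
};

int main(void) {
    // MAP_ANONYMOUS | MAP_SHARED before fork(): the master maps the zone once
    // and every forked worker inherits the same physical pages.
    struct rate_zone *zone = mmap(NULL, sizeof *zone, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (zone == MAP_FAILED) return 1;
    atomic_init(&zone->count, 0);
    zone->limit = 10000;

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            // Worker: one atomic increment per request -- tens of ns, no network hop
            bool allowed = atomic_fetch_add(&zone->count, 1) < zone->limit;
            _exit(allowed ? 0 : 1);
        }
    }
    return 0;
}
```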
Three hundred PostgreSQL backend processes need to access 8 million cached buffer pages simultaneously. Without shared memory, each backend would need its own 16 GB buffer pool copy, requiring 4.8 TB of RAM -- an impossible amount for a single server.
shared_buffers lives in a single mmap(MAP_SHARED) segment that every backend maps into its own address space. All 300 processes read and write the same physical pages through lightweight buffer locks, with zero kernel copies between them. A sequential scan on a 50 GB table that is 80% cached serves 4 million buffer reads per second because every backend accesses the same physical RAM.
Combine shared memory with huge_pages=on to reduce the 4 million TLB entries to 8,192, cutting address translation overhead by 512x during scan-heavy workloads. Shared memory turns 4.8 TB of impossible duplication into a single 16 GB physical allocation shared across all connections.
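A hedged sketch of requesting huge pages for a shared mapping; it assumes the kernel's huge page pool has been reserved (e.g. via vm.nr_hugepages), and the 1 GB size is illustrative:

```c
// Map a shared region backed by 2 MB huge pages instead of 4 KB pages.
#include <stdio.h>
#include <sys/mman.h>

#define REGION_SIZE (1UL << 30)   // 1 GB: 512 huge pages instead of 262,144 small ones

int main(void) {
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap MAP_HUGETLB");   // typically ENOMEM if no huge pages are reserved
        return 1;
    }
    // ... place the shared buffer pool here, then fork worker/backend processes ...
    munmap(p, REGION_SIZE);
    return 0;
}
```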
Chrome needs to push a 1920x1080 frame (8 MB at 32-bit color) from its renderer to the browser compositor 60 times per second. Piping each frame through the kernel means two copies per frame -- 960 MB/s of pure memory bandwidth waste just for pixel data transfer.
The renderer calls shm_open and mmaps a buffer that the compositor also maps into its address space. The renderer paints directly into this buffer at memory speed, and the compositor reads the same physical pages with zero kernel involvement. A double-buffering scheme using two shared memory segments and a futex-based signal ensures the compositor never reads a half-finished frame.
Use shared memory with double buffering for high-bandwidth inter-process data transfer. This approach eliminates both kernel copies entirely, achieving consistent 16.6 ms frame delivery on a 60 Hz display while saving 960 MB/s of memory bandwidth compared to pipe-based IPC.
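A hedged sketch of the double-buffering handoff: one shared struct holding two frame buffers, with a pair of process-shared semaphores standing in for the futex signal. Chrome's real transport is more elaborate; the names and sizes here are illustrative.

```c
// Double-buffered frame handoff through shared memory: the renderer paints the
// next frame while the compositor is still reading the previous one.
#include <semaphore.h>
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES (8u * 1024 * 1024)    // ~8 MB: one 1920x1080 frame at 32-bit color

struct frame_channel {
    sem_t   slots_free;                   // buffers the renderer may paint into
    sem_t   frames_ready;                 // completed frames awaiting the compositor
    uint8_t buf[2][FRAME_BYTES];
};

// Run once by the creating process after shm_open/ftruncate/mmap.
void channel_init(struct frame_channel *ch) {
    sem_init(&ch->slots_free, 1, 2);      // pshared = 1; both buffers start free
    sem_init(&ch->frames_ready, 1, 0);
}

// Renderer: call with i = 0, 1, 0, 1, ... (the compositor consumes in the same order).
void submit_frame(struct frame_channel *ch, const uint8_t *pixels, int i) {
    sem_wait(&ch->slots_free);                    // block only if both buffers are pending
    memcpy(ch->buf[i], pixels, FRAME_BYTES);      // paint at memory speed, no kernel copy
    sem_post(&ch->frames_ready);                  // sem_post acts as the release barrier
}

// Compositor: call with i = 0, 1, 0, 1, ... matching the renderer.
void composite_frame(struct frame_channel *ch, uint8_t *out, int i) {
    sem_wait(&ch->frames_ready);                  // sem_wait acts as the acquire barrier
    memcpy(out, ch->buf[i], FRAME_BYTES);         // never sees a half-painted frame
    sem_post(&ch->slots_free);                    // hand the slot back to the renderer
}
```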
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Shared memory access | --ipc=host or --ipc=container:name; default isolates /dev/shm per container | MappedByteBuffer via FileChannel on /dev/shm for inter-process sharing | SharedArrayBuffer for threads; /dev/shm via fs module for IPC | syscall.Mmap with MAP_SHARED; golang.org/x/sys/unix for shm_open | emptyDir with medium: Memory mounts tmpfs shared across containers in a pod |
| Buffer pool sharing | PostgreSQL shared_buffers across forked backends in same container | DirectByteBuffer for off-heap pools; shared across JVM threads, not processes | Buffer.allocUnsafe for Node internal pools; no cross-process sharing | sync.Pool for goroutine-local; mmap for cross-process | StatefulSet pods cannot share memory across nodes |
| Synchronization | Futex-based semaphores in shared region; file locks as alternative | java.util.concurrent locks; ReentrantLock does not cross processes | Atomics.wait/notify for SharedArrayBuffer threads | sync.Mutex for goroutines; flock for cross-process | No built-in cross-pod synchronization; use distributed locks |
| Cleanup | Leaked /dev/shm entries visible with ls; container restart clears them | JVM does not auto-unlink shm; explicit cleanup in shutdown hook | process.on('exit') handler to shm_unlink | defer munmap + shm_unlink in Go | Pod deletion clears emptyDir volumes |
| Stack Layer | Mechanism |
|---|---|
| Application | shm_open/mmap (POSIX) or shmget/shmat (SysV) to create and map shared regions |
| Synchronization | POSIX semaphores (futex-based), pthread mutexes with PTHREAD_PROCESS_SHARED, atomics |
| Virtual memory | vm_area_struct entries in each process point to same physical page frames |
| Page cache / tmpfs | /dev/shm is a tmpfs mount; shared memory objects are page-cache-backed, swappable |
| Hardware | CPU cache coherence (MESI/MOESI) ensures stores on one core become visible to others in 10-100ns |
Design rationale: The kernel already has the physical pages sitting in RAM, so why copy them through a pipe just to hand them to another process? Shared memory skips the middleman entirely. The price is that the kernel washes its hands of ordering -- hardware cache coherence makes writes visible across cores, but "visible" and "consistent" are very different things. Semaphores or atomics are not optional; they are the contract that keeps shared memory from turning into shared corruption.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| SIGBUS when accessing shared memory region | ftruncate() not called after shm_open(); object has size 0 | ls -la /dev/shm/$NAME to check size |
| Data corruption in shared region | Missing synchronization; concurrent reads/writes without semaphore or atomics | Add sem_wait/sem_post around critical sections |
| /dev/shm filling up over time | Leaked shared memory objects not unlinked on process exit | ls -la /dev/shm/ and shm_unlink() stale entries |
| Deadlock on sem_wait | Process holding semaphore crashed; unnamed semaphore stuck at 0 | Use sem_timedwait or PTHREAD_MUTEX_ROBUST for crash recovery |
| Shared memory not visible between Docker containers | IPC namespace isolation; each container has separate /dev/shm | Use --ipc=host or --ipc=container:name; in K8s use emptyDir Memory |
| PostgreSQL fails with "could not create shared memory segment" | shmmax too low for shared_buffers setting | cat /proc/sys/kernel/shmmax; increase via sysctl |
When to Use / Avoid
- Use when multiple processes need zero-copy access to the same large data (buffer pools, frame buffers, caches)
- Use when IPC latency must be nanoseconds, not microseconds -- shared memory avoids kernel copy overhead
- Use when implementing cross-worker counters, rate limiters, or session caches (Nginx shared zones)
- Use when database backends need shared buffer pools (PostgreSQL shared_buffers)
- Avoid when data must persist across reboots -- shared memory lives in tmpfs (RAM/swap)
- Avoid when processes are across container boundaries by default -- IPC namespaces isolate /dev/shm
Try It Yourself
```bash
# List POSIX shared memory objects
ls -la /dev/shm/

# List System V shared memory segments
ipcs -m

# View memory mappings of a process (find shared regions)
pmap -x $$ | grep -E 'shm|shared'

# Create a shared memory object from the command line (requires the posix_ipc package)
python3 -c "import posix_ipc; shm = posix_ipc.SharedMemory('/test', posix_ipc.O_CREAT, size=4096); print(f'Created /dev/shm/test, size={shm.size}')"

# Check /dev/shm filesystem usage
df -h /dev/shm

# Remove leaked System V shared memory segments with zero attached processes
ipcs -m | awk 'NR>3 && $6==0 {print $2}' | xargs -I{} ipcrm -m {} 2>/dev/null; echo 'Cleaned orphaned SysV shm'
```

Debug Checklist
1. ls -la /dev/shm/ -- list POSIX shared memory objects and named semaphores
2. ipcs -m -- list System V shared memory segments with size and attach counts
3. pmap -x $PID | grep -E 'shm|shared|zero' -- find shared memory mappings in a process
4. df -h /dev/shm -- check available tmpfs space for shared memory
5. ipcs -m -p -- show creator and last-attached PIDs for SysV segments
6. cat /proc/sys/kernel/shmmax -- check maximum SysV shared memory segment size
Key Takeaways
- ✓ shm_open() creates a file descriptor backed by tmpfs, but it starts at size zero -- if you forget ftruncate() and try to access the mapping, you get SIGBUS, not a helpful error; this catches everyone at least once
- ✓ Unnamed semaphores placed directly in shared memory (sem_init with pshared=1) are the fastest inter-process synchronization -- just a futex word, no file operations; named semaphores (sem_open) create files in /dev/shm and are slightly slower
- ✓ Without memory barriers, shared memory is a lie -- the CPU's store buffer and compiler reordering mean one process can write a flag then data, but the other process sees the data before the flag; you MUST use atomics, semaphores, or explicit fences
- ✓ System V shared memory (shmget/shmat) is a legacy API with a key-based namespace and confusing lifecycle -- segments persist until IPC_RMID is set AND all processes detach; use POSIX shm for new code
- ✓ Huge pages work with shared memory via MAP_HUGETLB -- for large shared regions like database buffer pools, this reduces TLB misses by 512x and can be the difference between acceptable and terrible performance
Common Pitfalls
- ✗ Forgetting ftruncate() after shm_open() -- the object starts at size zero; accessing it without setting the size gives you SIGBUS, which looks like a random crash and is miserable to debug
- ✗ Using memcpy to communicate via shared memory without synchronization -- even on x86's strong memory model, the COMPILER can reorder accesses; use atomic operations, semaphores, or explicit memory barriers
- ✗ Leaking shared memory objects -- they persist in /dev/shm until explicitly unlinked or the system reboots; check 'ls /dev/shm/' periodically and always call shm_unlink() in cleanup
- ✗ Using SysV semaphores in new code -- they have awkward 'semaphore array' semantics, complex undo handling, and per-operation permission checks; POSIX semaphores are simpler, faster, and saner
Reference
In One Line
Shared memory gives zero-copy IPC at the cost of doing synchronization manually -- skip the semaphores or atomics, and corruption is a matter of when, not if.