Shared Memory & Semaphores
Mental Model
Two chefs at the same long kitchen counter. No passing dishes through a window, no runner in between -- both reach for the same cutting board, the same bowls, the same ingredients. Raw speed. But both reaching for the same knife at the same time means someone gets cut. So there is a red kitchen timer on the counter: pick it up and the workspace is exclusively held until it goes back down. No timer in hand, no touching the counter.
The Problem
Chrome needs to push 8 MB frames from renderer to compositor 60 times a second, and piping each one means two kernel copies per frame -- 960 MB/s of bandwidth burned on bookkeeping. Three hundred PostgreSQL backends sharing a 16 GB buffer pool would require 4.8 TB of RAM if each got its own copy. Sixteen Nginx workers hitting Redis for rate-limit counters on every request add 0.5-1ms of network latency per check, and at 200,000 req/s that turns the rate limiter into the bottleneck. Then there is the crash scenario: one process dies while holding an unnamed semaphore in shared memory, and the lock stays held forever.
Architecture
Pipes copy data twice. Once from the sending process to the kernel. Once from the kernel to the receiving process.
For most IPC, that is fine. But Chrome needs to push megabytes of pixel data from its renderer to its compositor 60 times per second. PostgreSQL needs every backend process to read the same cached database pages without duplicating them. Nginx needs all its workers to share rate-limit counters at memory speed.
Shared memory eliminates both copies. Two processes, same physical RAM, memory-speed access. But it comes with a price: two processes writing to the same memory without coordination means corrupted data.
What Actually Happens
POSIX shared memory works in three steps:
Step 1: shm_open(). Creates a named file descriptor on tmpfs (mounted at /dev/shm). This is just a file, but it lives in RAM-backed storage -- no disk I/O. The name (like /my_buffer) is how multiple processes find the same shared region.
Step 2: ftruncate(). Sets the size of the shared memory object. This step is critical. The object starts at size zero. Skipping this step and trying to access the mapping results in SIGBUS -- one of the most confusing crashes in systems programming.
Step 3: mmap(MAP_SHARED). Maps the object into the process's virtual address space. The kernel creates a VMA pointing to the tmpfs-backed pages. When another process opens the same named object and maps it, both VMAs point to the same physical pages.
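A minimal sketch of those three steps in C, on the writer side (the name /demo_buf and the 4 KB size are illustrative, not from any particular codebase):

```c
// Create, size, and map a POSIX shared memory object.
// Older glibc versions need -lrt at link time for shm_open.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    // Step 1: create (or open) the named object; it appears as /dev/shm/demo_buf
    int fd = shm_open("/demo_buf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    // Step 2: the object starts at size 0 -- skipping this line means SIGBUS later
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    // Step 3: map it; another process that opens "/demo_buf" sees the same pages
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from the writer");   // immediately visible to any other mapper
    munmap(p, 4096);
    close(fd);
    // call shm_unlink("/demo_buf") once the region is no longer needed
    return 0;
}
```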
Writes by one process are immediately visible to others. On x86, stores become visible to other cores within 10-100 ns via the MESI/MOESI cache coherence protocol. But "visible" does not mean "consistent" -- without synchronization, the compiler can reorder accesses, the CPU can buffer writes in its store buffer, and a reader can see a partially updated structure.
Under the Hood
Memory ordering is the hard part. On weakly-ordered architectures (ARM, POWER), stores can be reordered and delayed. Even on x86 (which has a strong memory model), the compiler can reorder accesses. When process A writes data then sets a flag, process B might see the flag before the data.
The solutions, from simplest to lowest-level:
- POSIX semaphores provide implicit acquire/release barriers. sem_post() on the writer side acts as a release barrier; sem_wait() on the reader side acts as an acquire barrier. This is the easiest correct approach.
- C11/C++11 atomics with memory_order_release for stores and memory_order_acquire for loads provide explicit control (sketched below).
- Compiler and hardware barriers (__asm__ __volatile__("" ::: "memory") for the compiler, mfence for x86 hardware) are the lowest-level option.
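To make the atomics option concrete, here is a hedged sketch of a release/acquire handoff through a shared mapping. The struct layout and field names are illustrative; it assumes `shared` points into a MAP_SHARED region visible to both processes and that atomic_int is lock-free on the target (it is on x86-64 and AArch64):

```c
// Release/acquire flag handoff between two processes sharing 'struct frame'.
#include <stdatomic.h>
#include <stdint.h>

struct frame {
    uint8_t pixels[64];       // illustrative payload
    atomic_int ready;         // 0 = not published, 1 = published
};

// Writer process
void publish(struct frame *shared) {
    for (int i = 0; i < 64; i++) shared->pixels[i] = (uint8_t)i;   // fill payload
    // Release store: all writes above become visible before 'ready' flips to 1
    atomic_store_explicit(&shared->ready, 1, memory_order_release);
}

// Reader process
int consume(struct frame *shared, uint8_t out[64]) {
    // Acquire load: if we observe ready == 1, the payload writes are also visible
    if (atomic_load_explicit(&shared->ready, memory_order_acquire) != 1)
        return 0;                          // nothing published yet
    for (int i = 0; i < 64; i++) out[i] = shared->pixels[i];
    return 1;
}
```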
Unnamed vs named semaphores have different tradeoffs. Unnamed semaphores (sem_init with pshared=1) are placed directly in the shared memory region. They are just a futex word -- one atomic operation for the uncontended path, no file I/O. Named semaphores (sem_open) create a file in /dev/shm and are useful when processes do not already share memory. Named semaphores must be sem_unlink()'d for cleanup.
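A hedged sketch of embedding an unnamed semaphore directly in the shared region (the struct and names are illustrative):

```c
// Unnamed process-shared semaphore living inside the shared memory region itself.
#include <semaphore.h>
#include <stdio.h>

struct shared_region {
    sem_t lock;        // lives in the MAP_SHARED pages, so every mapper sees it
    long  counter;     // the data it protects
};

// Run once, by the process that created the region, after ftruncate + mmap.
int region_init(struct shared_region *r) {
    // pshared = 1: the semaphore synchronizes processes, not just threads
    if (sem_init(&r->lock, 1, 1) < 0) { perror("sem_init"); return -1; }
    r->counter = 0;
    return 0;
}

// Any process that maps the region can then do:
void region_increment(struct shared_region *r) {
    sem_wait(&r->lock);    // acquire barrier: see other processes' prior writes
    r->counter++;
    sem_post(&r->lock);    // release barrier: publish our write before unlocking
}
```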
The crash recovery problem is real. If a process crashes while holding an unnamed semaphore in shared memory, the semaphore stays locked forever. Other processes deadlock on sem_wait(). There is no equivalent of PTHREAD_MUTEX_ROBUST for POSIX semaphores. Solutions: use sem_timedwait() with timeout and recovery logic, use a pthread_mutex_t with PTHREAD_MUTEX_ROBUST attribute instead (the next locker gets EOWNERDEAD and can recover), or use the kernel's robust futex list (set_robust_list syscall).
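A hedged sketch of the robust-mutex alternative: the attribute calls and EOWNERDEAD handling are the standard pthreads API, while the surrounding struct and function names are illustrative.

```c
// Process-shared robust mutex in shared memory: survives the holder crashing.
#include <errno.h>
#include <pthread.h>

struct shared_state {
    pthread_mutex_t lock;
    long counter;
};

// Run once by the creating process.
int state_init(struct shared_state *s) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);  // cross-process
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);     // crash recovery
    int rc = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    s->counter = 0;
    return rc;
}

int state_lock(struct shared_state *s) {
    int rc = pthread_mutex_lock(&s->lock);
    if (rc == EOWNERDEAD) {
        // The previous owner died while holding the lock: repair the protected
        // data if needed, then mark the mutex consistent and continue.
        pthread_mutex_consistent(&s->lock);
        rc = 0;
    }
    return rc;
}
```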
Shared memory does not cross container boundaries by default. Docker containers get their own IPC namespace (CLONE_NEWIPC), isolating System V shared memory. /dev/shm is typically a per-container tmpfs mount. To share memory between containers, use --ipc=host or --ipc=container:<name> in Docker. In Kubernetes, use emptyDir with medium: Memory.
Common Questions
How does POSIX shared memory differ from mmap'ing a regular file with MAP_SHARED?
Functionally, they are similar -- both create shared mappings backed by page cache. The difference is the backing store. shm_open() creates a tmpfs file (RAM/swap-backed, no disk I/O). mmap() on a regular file is filesystem-backed (reads and writes may hit disk). For pure IPC with no persistence need, shared memory is faster. For data that must survive reboots, use a regular file.
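The point is easiest to see side by side: the mmap() call is identical, only the file descriptor's backing store differs. A minimal sketch (paths are illustrative, error handling omitted):

```c
// Same mapping call, different backing object.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_4k(int fd) {
    ftruncate(fd, 4096);
    return mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(void) {
    // tmpfs-backed: pure IPC, gone after shm_unlink() or reboot
    int shm_fd  = shm_open("/scratch", O_CREAT | O_RDWR, 0600);
    // filesystem-backed: writes eventually reach disk and survive reboot
    int file_fd = open("/var/tmp/scratch.bin", O_CREAT | O_RDWR, 0600);

    void *ipc_region  = map_4k(shm_fd);
    void *file_region = map_4k(file_fd);
    (void)ipc_region; (void)file_region;
    return 0;
}
```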
Can shared memory be used across Docker containers?
Not by default. Containers get isolated IPC namespaces and separate /dev/shm mounts. Use --ipc=host to share with the host, or --ipc=container:<name> to share between specific containers. In Kubernetes, use an emptyDir volume with medium: Memory.
What is the maximum shared memory size?
For POSIX shared memory, it is limited by the tmpfs mount size (typically 50% of RAM by default). For System V, it is /proc/sys/kernel/shmmax (default varies, often 50% of RAM) and shmall (total pages). The practical limit is physical RAM plus swap. PostgreSQL administrators often need to increase shmmax for large shared_buffers.
Why use POSIX shared memory instead of System V?
System V shared memory (shmget/shmat) uses numeric keys instead of names, has confusing lifecycle semantics (segments persist until IPC_RMID AND all processes detach), and requires ipcs/ipcrm for administration. POSIX shared memory uses clean file-descriptor semantics, is administered via /dev/shm/ filesystem, and has simpler cleanup with shm_unlink(). For new code, there is no reason to use System V.
How Technologies Use This
Sixteen Nginx worker processes need to enforce a global rate limit of 10,000 requests per second. Without shared state, each worker would query Redis on every request, adding 0.5-1 ms of network latency and turning the rate limiter into a throughput bottleneck.
Nginx allocates shared memory zones via mmap(MAP_SHARED) where all 16 workers read and atomically increment the same rate-limit counters. Access takes 10-50 ns using embedded spinlocks and CAS operations -- three orders of magnitude faster than a network round trip. No external coordinator is needed because the counters live in physical pages shared across all worker processes.
Use shared memory zones for any cross-worker state that must be checked on every request. The same mechanism backs session caches, proxy cache metadata, and upstream health checks, letting a single Nginx instance handle 200,000 requests per second while maintaining globally consistent state without network overhead.
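At its core the mechanism is just an atomic counter in a region mapped before fork(). A much-simplified, hedged sketch of the idea -- not Nginx's actual ngx_shm code; the struct, limit, and worker count are illustrative:

```c
// Cross-worker rate-limit counter in anonymous shared memory, in the spirit
// of an Nginx shared zone (not the real implementation).
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <unistd.h>

struct rate_zone {
    atomic_long count;        // requests seen in the current window
    long        limit;        // e.g. 10000 requests per second
};

int main(void) {
    // MAP_ANONYMOUS | MAP_SHARED before fork(): the master maps the zone once
    // and every forked worker inherits the same physical pages.
    struct rate_zone *zone = mmap(NULL, sizeof *zone, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (zone == MAP_FAILED) return 1;
    atomic_init(&zone->count, 0);
    zone->limit = 10000;

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            // Worker: one atomic increment per request -- tens of ns, no network hop
            bool allowed = atomic_fetch_add(&zone->count, 1) < zone->limit;
            _exit(allowed ? 0 : 1);
        }
    }
    return 0;
}
```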
Three hundred PostgreSQL backend processes need to access 8 million cached buffer pages simultaneously. Without shared memory, each backend would need its own 16 GB buffer pool copy, requiring 4.8 TB of RAM -- an impossible amount for a single server.
shared_buffers lives in a single mmap(MAP_SHARED) segment that every backend maps into its own address space. All 300 processes read and write the same physical pages through lightweight buffer locks, with zero kernel copies between them. A sequential scan on a 50 GB table that is 80% cached serves 4 million buffer reads per second because every backend accesses the same physical RAM.
Combine shared memory with huge_pages=on to reduce the 4 million TLB entries to 8,192, cutting address translation overhead by 512x during scan-heavy workloads. Shared memory turns 4.8 TB of impossible duplication into a single 16 GB physical allocation shared across all connections.
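A hedged sketch of requesting huge pages for a shared mapping; it assumes the kernel's huge page pool has been reserved (e.g. via vm.nr_hugepages), and the 1 GB size is illustrative:

```c
// Map a shared region backed by 2 MB huge pages instead of 4 KB pages.
#include <stdio.h>
#include <sys/mman.h>

#define REGION_SIZE (1UL << 30)   // 1 GB: 512 huge pages instead of 262,144 small ones

int main(void) {
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap MAP_HUGETLB");   // typically ENOMEM if no huge pages are reserved
        return 1;
    }
    // ... place the shared buffer pool here, then fork worker/backend processes ...
    munmap(p, REGION_SIZE);
    return 0;
}
```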
Chrome needs to push a 1920x1080 frame (8 MB at 32-bit color) from its renderer to the browser compositor 60 times per second. Piping each frame through the kernel means two copies per frame -- 960 MB/s of pure memory bandwidth waste just for pixel data transfer.
The renderer calls shm_open and mmaps a buffer that the compositor also maps into its address space. The renderer paints directly into this buffer at memory speed, and the compositor reads the same physical pages with zero kernel involvement. A double-buffering scheme using two shared memory segments and a futex-based signal ensures the compositor never reads a half-finished frame.
Use shared memory with double buffering for high-bandwidth inter-process data transfer. This approach eliminates both kernel copies entirely, achieving consistent 16.6 ms frame delivery on a 60 Hz display while saving 960 MB/s of memory bandwidth compared to pipe-based IPC.
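A hedged sketch of the double-buffering handoff: one shared struct holding two frame buffers, with a pair of process-shared semaphores standing in for the futex signal. Chrome's real transport is more elaborate; the names and sizes here are illustrative.

```c
// Double-buffered frame handoff through shared memory: the renderer paints the
// next frame while the compositor is still reading the previous one.
#include <semaphore.h>
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES (8u * 1024 * 1024)    // ~8 MB: one 1920x1080 frame at 32-bit color

struct frame_channel {
    sem_t   slots_free;                   // buffers the renderer may paint into
    sem_t   frames_ready;                 // completed frames awaiting the compositor
    uint8_t buf[2][FRAME_BYTES];
};

// Run once by the creating process after shm_open/ftruncate/mmap.
void channel_init(struct frame_channel *ch) {
    sem_init(&ch->slots_free, 1, 2);      // pshared = 1; both buffers start free
    sem_init(&ch->frames_ready, 1, 0);
}

// Renderer: call with i = 0, 1, 0, 1, ... (the compositor consumes in the same order).
void submit_frame(struct frame_channel *ch, const uint8_t *pixels, int i) {
    sem_wait(&ch->slots_free);                    // block only if both buffers are pending
    memcpy(ch->buf[i], pixels, FRAME_BYTES);      // paint at memory speed, no kernel copy
    sem_post(&ch->frames_ready);                  // sem_post acts as the release barrier
}

// Compositor: call with i = 0, 1, 0, 1, ... matching the renderer.
void composite_frame(struct frame_channel *ch, uint8_t *out, int i) {
    sem_wait(&ch->frames_ready);                  // sem_wait acts as the acquire barrier
    memcpy(out, ch->buf[i], FRAME_BYTES);         // never sees a half-painted frame
    sem_post(&ch->slots_free);                    // hand the slot back to the renderer
}
```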
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Shared memory access | --ipc=host or --ipc=container:name; default isolates /dev/shm per container | MappedByteBuffer via FileChannel on /dev/shm for inter-process sharing | SharedArrayBuffer for threads; /dev/shm via fs module for IPC | syscall.Mmap with MAP_SHARED; golang.org/x/sys/unix for shm_open | emptyDir with medium: Memory mounts tmpfs shared across containers in a pod |
| Buffer pool sharing | PostgreSQL shared_buffers across forked backends in same container | DirectByteBuffer for off-heap pools; shared across JVM threads, not processes | Buffer.allocUnsafe for Node internal pools; no cross-process sharing | sync.Pool for goroutine-local; mmap for cross-process | StatefulSet pods cannot share memory across nodes |
| Synchronization | Futex-based semaphores in shared region; file locks as alternative | java.util.concurrent locks; ReentrantLock does not cross processes | Atomics.wait/notify for SharedArrayBuffer threads | sync.Mutex for goroutines; flock for cross-process | No built-in cross-pod synchronization; use distributed locks |
| Cleanup | Leaked /dev/shm entries visible with ls; container restart clears them | JVM does not auto-unlink shm; explicit cleanup in shutdown hook | process.on('exit') handler to shm_unlink | defer munmap + shm_unlink in Go | Pod deletion clears emptyDir volumes |
| Stack Layer | Mechanism |
|---|---|
| Application | shm_open/mmap (POSIX) or shmget/shmat (SysV) to create and map shared regions |
| Synchronization | POSIX semaphores (futex-based), pthread mutexes with PTHREAD_PROCESS_SHARED, atomics |
| Virtual memory | vm_area_struct entries in each process point to same physical page frames |
| Page cache / tmpfs | /dev/shm is a tmpfs mount; shared memory objects are page-cache-backed, swappable |
| Hardware | CPU cache coherence (MESI/MOESI) ensures stores on one core become visible to others in 10-100ns |
Design rationale: The kernel already has the physical pages sitting in RAM, so why copy them through a pipe just to hand them to another process? Shared memory skips the middleman entirely. The price is that the kernel washes its hands of ordering -- hardware cache coherence makes writes visible across cores, but "visible" and "consistent" are very different things. Semaphores or atomics are not optional; they are the contract that keeps shared memory from turning into shared corruption.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| SIGBUS when accessing shared memory region | ftruncate() not called after shm_open(); object has size 0 | ls -la /dev/shm/$NAME to check size |
| Data corruption in shared region | Missing synchronization; concurrent reads/writes without semaphore or atomics | Add sem_wait/sem_post around critical sections |
| /dev/shm filling up over time | Leaked shared memory objects not unlinked on process exit | ls -la /dev/shm/ and shm_unlink() stale entries |
| Deadlock on sem_wait | Process holding semaphore crashed; unnamed semaphore stuck at 0 | Use sem_timedwait or PTHREAD_MUTEX_ROBUST for crash recovery |
| Shared memory not visible between Docker containers | IPC namespace isolation; each container has separate /dev/shm | Use --ipc=host or --ipc=container:name; in K8s use emptyDir Memory |
| PostgreSQL fails with "could not create shared memory segment" | shmmax too low for shared_buffers setting | cat /proc/sys/kernel/shmmax; increase via sysctl |
When to Use / Avoid
- Use when multiple processes need zero-copy access to the same large data (buffer pools, frame buffers, caches)
- Use when IPC latency must be nanoseconds, not microseconds -- shared memory avoids kernel copy overhead
- Use when implementing cross-worker counters, rate limiters, or session caches (Nginx shared zones)
- Use when database backends need shared buffer pools (PostgreSQL shared_buffers)
- Avoid when data must persist across reboots -- shared memory lives in tmpfs (RAM/swap)
- Avoid when processes are across container boundaries by default -- IPC namespaces isolate /dev/shm
Try It Yourself
```bash
# List POSIX shared memory objects
ls -la /dev/shm/

# List System V shared memory segments
ipcs -m

# View memory mappings of a process (find shared regions)
pmap -x $$ | grep -E 'shm|shared'

# Create a shared memory object from the command line (requires the posix_ipc package)
python3 -c "import posix_ipc; shm = posix_ipc.SharedMemory('/test', posix_ipc.O_CREAT, size=4096); print(f'Created /dev/shm/test, size={shm.size}')"

# Check /dev/shm filesystem usage
df -h /dev/shm

# Remove leaked System V shared memory segments with zero attached processes
ipcs -m | awk 'NR>3 && $6==0 {print $2}' | xargs -I{} ipcrm -m {} 2>/dev/null; echo 'Cleaned orphaned SysV shm'
```

Debug Checklist
1. ls -la /dev/shm/ -- list POSIX shared memory objects and named semaphores
2. ipcs -m -- list System V shared memory segments with size and attach counts
3. pmap -x $PID | grep -E 'shm|shared|zero' -- find shared memory mappings in a process
4. df -h /dev/shm -- check available tmpfs space for shared memory
5. ipcs -m -p -- show creator and last-attached PIDs for SysV segments
6. cat /proc/sys/kernel/shmmax -- check maximum SysV shared memory segment size
Key Takeaways
- ✓ shm_open() creates a file descriptor backed by tmpfs, but it starts at size zero -- if you forget ftruncate() and try to access the mapping, you get SIGBUS, not a helpful error; this catches everyone at least once
- ✓ Unnamed semaphores placed directly in shared memory (sem_init with pshared=1) are the fastest inter-process synchronization -- just a futex word, no file operations; named semaphores (sem_open) create files in /dev/shm and are slightly slower
- ✓ Without memory barriers, shared memory is a lie -- the CPU's store buffer and compiler reordering mean one process can write a flag then data, but the other process sees the data before the flag; you MUST use atomics, semaphores, or explicit fences
- ✓ System V shared memory (shmget/shmat) is a legacy API with a key-based namespace and confusing lifecycle -- segments persist until IPC_RMID is set AND all processes detach; use POSIX shm for new code
- ✓ Huge pages work with shared memory via MAP_HUGETLB -- for large shared regions like database buffer pools, this reduces TLB misses by 512x and can be the difference between acceptable and terrible performance
Common Pitfalls
- ✗ Forgetting ftruncate() after shm_open() -- the object starts at size zero; accessing it without setting the size gives you SIGBUS, which looks like a random crash and is miserable to debug
- ✗ Using memcpy to communicate via shared memory without synchronization -- even on x86's strong memory model, the COMPILER can reorder accesses; use atomic operations, semaphores, or explicit memory barriers
- ✗ Leaking shared memory objects -- they persist in /dev/shm until explicitly unlinked or the system reboots; check 'ls /dev/shm/' periodically and always call shm_unlink() in cleanup
- ✗ Using SysV semaphores in new code -- they have awkward 'semaphore array' semantics, complex undo handling, and per-operation permission checks; POSIX semaphores are simpler, faster, and saner
Reference
In One Line
Shared memory gives zero-copy IPC at the cost of doing synchronization manually -- skip the semaphores or atomics, and corruption is a matter of when, not if.