POSIX Threads
Mental Model
Several craftspeople at one large table. Same tools, same raw materials bin, same blueprint on the wall. Each person has their own stool and a personal notepad (thread-local storage). Sharing the table is efficient -- nobody duplicates tools or materials. But when two people grab the same drill at once, one waits. A sign-out sheet next to each tool (the mutex) ensures one person at a time. Forget to sign a tool back in, and the entire workshop grinds to a halt.
The Problem
Fork-per-request for 10,000 concurrent connections means duplicating a 500 MB address space 10,000 times -- 5 TB of virtual memory overhead. Threads share memory and avoid the copies, but sharing introduces data races: two threads incrementing the same counter without a lock lose 5-15% of updates under load. A forgotten mutex unlock creates deadlocks that surface only under specific scheduling patterns, taking months to reproduce. And a thread stack overflow (8 MB default) quietly corrupts the heap -- no segfault, just wrong values in unrelated structures.
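A minimal sketch of that lost-update race (assumes Linux, compile with cc -pthread; the counts are arbitrary): two threads each increment a shared counter one million times with no lock, and the final total usually comes up short.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared, deliberately unprotected */

static void *bump(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;                       /* load + add + store: not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("expected 2000000, got %ld\n", counter);   /* typically far less */
    return 0;
}
```

Building the same program with -fsanitize=thread makes the race visible immediately instead of statistically.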
Architecture
Here is something that surprises most people: Linux does not have threads. Not really. What it has are processes that share things.
When a pthread is created, the kernel creates a new task_struct -- the exact same data structure it uses for a process. The only difference is the flags passed to clone(). A process shares nothing. A thread shares everything -- address space, file descriptors, signal handlers. The kernel does not distinguish between the two. It is the same mechanism with different knobs turned.
This design choice has profound consequences. It means threads are cheap (they are just processes under the hood). It means the scheduler treats threads and processes identically. And it means understanding how threads really work on Linux requires understanding clone(), futex(), and the FS segment register.
What Actually Happens
When pthread_create() is called, here is what happens under the hood:
- Glibc allocates a new stack via mmap() (default 8 MB + guard page).
- At the top of the stack, it places the Thread Control Block (TCB) -- a struct pthread holding TLS pointers, cleanup handlers, cancellation state, and a futex for pthread_join().
- The FS segment register is set to point to the TCB. This is how thread-local storage works -- each __thread variable is accessed as an offset from FS, compiling down to a single mov instruction.
- Glibc calls clone() with flags: CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM.
CLONE_VM shares the address space. CLONE_FILES shares file descriptors. CLONE_THREAD groups threads under the same tgid (thread group ID), which is what getpid() returns. The new thread gets its own task_struct, its own stack, its own register state, and its own signal mask.
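To make "a thread is just clone() with sharing flags" concrete, here is a sketch that calls clone() directly with the same flag set, instead of pthread_create(). It assumes Linux/x86-64 and deliberately skips what a real pthread would add (CLONE_SETTLS, CLONE_CHILD_CLEARTID, a proper TCB and guard page), so the child is kept trivial.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int child_fn(void *arg) {
    /* Same getpid() as the parent (shared tgid), but a different gettid(). */
    printf("child:  pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
    return 0;   /* glibc's clone wrapper exits only this task, not the group */
}

int main(void) {
    printf("parent: pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));

    /* Allocate the child's stack ourselves, as glibc does with mmap(). */
    size_t sz = 1024 * 1024;
    char *stack = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) { perror("mmap"); return 1; }

    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM;
    /* Stack grows down on x86-64, so pass the top of the mapping.
       clone() returns the new task's tid to the caller. */
    if (clone(child_fn, stack + sz, flags, NULL) == -1) {
        perror("clone");
        return 1;
    }
    sleep(1);   /* CLONE_THREAD children cannot be waitpid()ed; just wait */
    return 0;
}
```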
The futex trick. This is where things get clever. Synchronization primitives -- mutexes, condvars, rwlocks, barriers -- are all built on the Linux futex() syscall. The key design is a two-level system:
- Fast path (userspace): An uncontended pthread_mutex_lock() is an atomic compare-and-swap on a 32-bit integer in userspace. No syscall. About 25 nanoseconds.
- Slow path (kernel): When the CAS fails (another thread holds the lock), glibc calls futex(FUTEX_WAIT) to put the thread to sleep in the kernel. pthread_mutex_unlock() does an atomic store and, if there are waiters, calls futex(FUTEX_WAKE).
Most mutex operations in a well-designed program are uncontended. That means most of them never enter the kernel. This is why pthreads are fast.
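To make the two-level design concrete, here is a stripped-down futex-backed lock in the spirit of Ulrich Drepper's "Futexes Are Tricky" -- a sketch, not glibc's actual pthread_mutex code: no spinning, no priority inheritance, no error handling. The lock word uses 0 = free, 1 = locked, 2 = locked with possible waiters.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void lock(atomic_int *m) {
    int c = 0;
    /* Fast path: one CAS in userspace; no syscall when the lock is free. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;
    /* Slow path: mark the lock contended (2), then sleep in the kernel.
       FUTEX_WAIT returns immediately if *m is no longer 2. */
    if (c != 2)
        c = atomic_exchange(m, 2);
    while (c != 0) {
        futex(m, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(m, 2);
    }
}

void unlock(atomic_int *m) {
    /* Old value 2 means a waiter may be asleep in the kernel: wake one. */
    if (atomic_exchange(m, 0) == 2)
        futex(m, FUTEX_WAKE_PRIVATE, 1);
}
```

The 25 ns uncontended figure is essentially the cost of that first CAS; everything below it only runs under contention.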
Under the Hood
Mutex types matter more than most people think. PTHREAD_MUTEX_NORMAL (default) silently deadlocks on recursive locking and has undefined behavior if unlocked by the wrong thread. PTHREAD_MUTEX_ERRORCHECK returns EDEADLK on recursive lock and EPERM on wrong-owner unlock -- use this during development. PTHREAD_MUTEX_RECURSIVE allows the same thread to lock multiple times (maintains a count). PTHREAD_MUTEX_ROBUST detects owner death via the kernel's robust_list mechanism -- if a thread dies while holding the lock, the next locker gets EOWNERDEAD and can recover.
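A small example of the development-time advice above (hypothetical usage; compile with -pthread): an ERRORCHECK mutex turns a recursive lock into a visible EDEADLK instead of a silent hang.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

int main(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    pthread_mutexattr_init(&attr);
    /* ERRORCHECK: recursive lock and wrong-owner unlock become errors
       instead of silent deadlock / undefined behavior. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);
    int rc = pthread_mutex_lock(&m);      /* second lock by the same thread */
    if (rc == EDEADLK)
        printf("recursive lock detected (EDEADLK)\n");

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}
```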
Condition variable semantics. pthread_cond_wait() atomically unlocks the mutex and blocks on the condvar's futex. The word "atomically" is doing heavy lifting here. Without atomic unlock-and-sleep, a signal could arrive in the gap between unlock and sleep -- the waiter would miss it and sleep forever. This is the "lost wakeup" problem, and it is why condvars exist (instead of just using a flag + sleep).
Spurious wakeups are allowed by POSIX and do occur. Always wait in a while loop that re-checks the predicate. Using if instead of while is a classic bug.
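The canonical shape of that rule, as a sketch (ready, wait_until_ready, and mark_ready are placeholder names): the waiter re-checks its predicate in a while loop around pthread_cond_wait(), and the signaler changes the predicate under the same mutex.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

/* Waiter: while, never if -- spurious and stolen wakeups are both possible. */
void wait_until_ready(void) {
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);   /* atomically unlocks and sleeps */
    pthread_mutex_unlock(&lock);
}

/* Signaler: set the predicate under the mutex, then signal. */
void mark_ready(void) {
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}
```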
Thread cancellation. Deferred cancellation (default) only fires at specific "cancellation points" -- functions like sleep(), read(), write(), pthread_cond_wait(). The cancelled thread runs cleanup handlers in LIFO order. Asynchronous cancellation can fire at any instruction and is almost never safe -- it can leave mutexes locked and heap corrupted. In practice, use a shared "shutdown" flag that threads check periodically.
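A minimal sketch of that shutdown-flag pattern (worker and request_shutdown are illustrative names, the 10 ms pause is arbitrary): the worker polls an atomic flag between bounded units of work and exits cleanly, with no cancellation points or cleanup handlers needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_bool shutting_down = false;

/* The worker checks the flag between units of work instead of relying on
   pthread_cancel(), so it always releases its own locks and memory. */
static void *worker(void *arg) {
    while (!atomic_load(&shutting_down)) {
        /* ... one bounded unit of work ... */
        nanosleep(&(struct timespec){ .tv_nsec = 10 * 1000 * 1000 }, NULL);
    }
    return NULL;   /* clean exit path, joinable via pthread_join() */
}

void request_shutdown(void) {
    atomic_store(&shutting_down, true);
}
```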
NPTL vs LinuxThreads. The old LinuxThreads implementation (pre-glibc 2.3.2) used a manager thread for thread creation and gave each thread its own PID, so getpid() returned different values in different threads of the same process. NPTL (Native POSIX Threads Library) provides true 1:1 threading with correct PID/TID semantics, faster creation, and proper POSIX signal delivery. Every modern Linux system uses NPTL.
Common Questions
What is the performance difference between a mutex and a spinlock?
An uncontended mutex is about 25ns (one CAS, no syscall). A contended mutex puts the thread to sleep (context switch cost ~1-5 microseconds). A spinlock burns CPU but avoids the context switch -- it is faster for very short critical sections (under 1 microsecond) where the lock holder is running on another core. For anything longer, the mutex wins because it does not waste CPU. PTHREAD_MUTEX_ADAPTIVE_NP offers the best of both: it spins briefly, then sleeps.
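For reference, the POSIX spinlock API the comparison refers to -- a sketch with placeholder names, sensible only when the critical section is a handful of instructions:

```c
#include <pthread.h>

static pthread_spinlock_t slock;
static long hits;

/* Spinlocks busy-wait instead of sleeping: no futex, no context switch,
   but 100% CPU while waiting. Keep the critical section tiny. */
void counters_init(void) {
    pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE);
}

void counters_bump(void) {
    pthread_spin_lock(&slock);
    hits++;                       /* a few instructions, nothing blocking */
    pthread_spin_unlock(&slock);
}
```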
How does priority inversion happen with pthreads?
A high-priority thread blocks on a mutex held by a low-priority thread. A medium-priority thread preempts the low-priority thread. Now the high-priority thread is effectively stuck behind the medium-priority thread. The fix: PTHREAD_PRIO_INHERIT (priority inheritance) temporarily boosts the lock holder to the highest waiter's priority. Linux implements this via futex(FUTEX_LOCK_PI) and rt_mutex internally. This is what saved the Mars Pathfinder mission.
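Turning on priority inheritance is a one-attribute change at mutex init time. A sketch, with shared_lock and init_pi_mutex as placeholder names:

```c
#include <pthread.h>

static pthread_mutex_t shared_lock;

/* When a high-priority thread blocks on shared_lock, the current holder is
   temporarily boosted to the waiter's priority (futex FUTEX_LOCK_PI and the
   kernel's rt_mutex do the work). */
void init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
```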
Why is fork() so dangerous in a multithreaded process?
fork() clones only the calling thread. Every other thread vanishes in the child, but their mutexes stay locked. If the child calls malloc() -- which uses an internal arena mutex -- and one of the vanished threads was holding that mutex, the child deadlocks. The only safe things to do after fork in a threaded program are calling exec() or async-signal-safe functions. pthread_atfork() can register cleanup handlers, but it cannot cover every third-party library's internal locks.
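A sketch of the pthread_atfork() band-aid for locks you own (big_lock is a placeholder name): acquire them in the prepare handler so fork() happens at a quiescent point, then release them in both parent and child. It does nothing for locks inside glibc or third-party libraries.

```c
#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock before fork() and release it on both sides, so the child
   never starts life with big_lock held by a thread that no longer exists. */
static void before_fork(void)  { pthread_mutex_lock(&big_lock); }
static void after_parent(void) { pthread_mutex_unlock(&big_lock); }
static void after_child(void)  { pthread_mutex_unlock(&big_lock); }

void install_fork_handlers(void) {
    pthread_atfork(before_fork, after_parent, after_child);
}
```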
How Technologies Use This
A sequential scan on a 10 GB table takes 40 seconds on one core. Eight cores are available. But PostgreSQL uses one process per connection, and a single backend cannot simply fork mid-query to use them: new workers need the parent's snapshot, plan, and execution state, plus coordinated access to the shared buffer pool.
PostgreSQL solves this with parallel background workers -- separate processes, not threads, that attach to the same shared-memory buffer pool and coordinate through LWLocks. The synchronization pattern is the same one pthreads uses: the uncontended path is a single atomic operation in shared memory, no syscall, and only under contention does a worker sleep in the kernel, where on Linux the wait ultimately lands on a futex. Workers share the buffer pool instead of copying it.
With 4 parallel workers, that 40-second sequential scan drops to about 10 seconds, with synchronization overhead under 2% of total query time. The futex-based design means the coordination cost is nearly invisible compared to the I/O savings.
A Java application has 500 threads and synchronized blocks are dominating CPU profiles. Every lock acquisition seems to involve a kernel call. Thread scheduling feels sluggish on a 32-core machine, and the overhead of thread infrastructure alone is eating into throughput.
Every Thread.start() call maps to a real pthread via pthread_create, giving each Java thread its own kernel task_struct and scheduling entity. The synchronized keyword uses a tiered locking scheme built on futex: uncontended locks are a single CAS in userspace (about 25 ns, no syscall). Only when another thread holds the lock does futex(FUTEX_WAIT) enter the kernel. Thread-local storage via the FS segment register gives each thread O(1) access to its JNI environment.
With well-designed locks, a JVM running 500 threads spends less than 1% of CPU on thread infrastructure. If profiles show heavy futex contention, the problem is lock design, not the threading model -- reduce critical section size or switch to concurrent data structures.
A Go server needs a million concurrent goroutines. Creating a pthread per goroutine would require 8 TB of stack space (8 MB each) and the kernel scheduler would collapse under a million runnable threads. The OS simply cannot handle this.
Go's runtime multiplexes goroutines M:N onto a small pool of pthreads, typically one per CPU core. The key trick: Go builds its own mutex and semaphore primitives directly on the futex syscall, bypassing pthread_mutex entirely. This lets the Go scheduler hand off a lock to a specific goroutine and immediately schedule it on the current thread, avoiding a context switch through the OS scheduler.
The result: a Go server sustains 100,000+ concurrent goroutines on 8 OS threads, using roughly 200 MB of memory instead of the 800 GB that one-thread-per-goroutine would demand. The M:N model plus direct futex usage is what makes Go's concurrency story actually work at scale.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Thread model | Host pthreads, visible in container /proc | OS threads mapped 1:1 (JVM threads = pthreads) | Single-threaded event loop + worker_threads | Goroutines multiplexed on M OS threads (M:N) | Pod processes use host pthreads |
| Synchronization | Futex-based (kernel) | synchronized/ReentrantLock (JVM monitors) | SharedArrayBuffer + Atomics | Channels + sync.Mutex | N/A (pod-level, use distributed locks) |
| Thread-local storage | __thread / pthread_key | ThreadLocal&lt;T&gt; | N/A (single-threaded per context) | No goroutine-local storage; values passed via context.Context | N/A |
| Thread pool | Application-managed | ForkJoinPool / ThreadPoolExecutor | worker_threads pool | goroutine pool (ants, tunny) | N/A |
| Stack size | 8 MB default (mmap) | -Xss (default 512 KB - 1 MB) | --stack-size (default 4 MB) | Goroutine stacks start at 8 KB, grow dynamically | N/A |
Stack Layer Mapping
| Layer | Threading Mechanism |
|---|---|
| CPU | Hardware threads (SMT/hyperthreading), context switch via task_struct |
| Kernel | clone() with CLONE_VM creates thread; scheduler treats as normal task |
| glibc / NPTL | pthread_create() wraps clone(), manages TCB and TLS via FS register |
| Synchronization | Futex: atomic CAS for fast path, kernel wait queue for contention |
| Application | Mutexes, condition variables, barriers, read-write locks |
| Debugging | helgrind (race detection), pstack (stack dumps), /proc/<pid>/task/ |
Design Rationale
Linux did not create a separate thread abstraction because the scheduler already handles task_struct efficiently -- a second scheduling entity would just duplicate logic. Threads are clone() with sharing flags, nothing more. The 8 MB default stack is virtual-only via mmap; physical pages fault in on demand, so the real cost scales with actual stack depth. Futexes keep the uncontended path in userspace because over 90% of mutex acquisitions succeed without contention, and even a fast syscall costs 200-500 ns -- too much to pay every time.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Sporadic wrong values in shared data | Data race (missing mutex or atomic) | valgrind --tool=helgrind or gcc -fsanitize=thread |
| Process hangs with all threads blocked | Deadlock (lock ordering violation) | pstack <pid> to see where each thread is waiting |
| Thread creation fails with EAGAIN | Thread count exceeds /proc/sys/kernel/threads-max or ulimit -u | grep Threads /proc/<pid>/status and compare against the limits |
| High CPU with low throughput | Lock contention (threads spinning on mutex) | perf record -e syscalls:sys_enter_futex -p <pid> or strace -e futex |
| Stack overflow corruption | Thread stack exceeded 8 MB default | Check guard page with cat /proc/<pid>/maps |
| pthread_join hangs forever | Joined thread never called pthread_exit or returned | Verify thread function returns or calls pthread_exit() |
When to Use / Avoid
Use when:
- Multiple tasks need shared access to the same address space (heap, file descriptors)
- Parallelizing CPU-bound work across cores (compression, image processing, matrix ops)
- Implementing thread pools for request handling (web servers, database backends)
- Low-latency inter-task communication where IPC overhead is unacceptable
Avoid when:
- Tasks are I/O-bound and async/event-driven models (epoll, io_uring) suffice
- Fault isolation is critical (a crash in one thread kills all threads in the process)
- The language runtime provides goroutines/green threads (Go, Erlang) with better scheduling
- Shared state is minimal and fork + IPC is simpler
Try It Yourself
# Show threads of a process
ps -T -p $(pidof mysqld) | head -20

# Count threads per process
ls /proc/$(pidof nginx)/task/ | wc -l

# View thread-level stats
cat /proc/$(pidof nginx)/task/*/status | grep -E '^(Name|Pid|Tgid)' | head -20

# Trace futex syscalls (mutex/condvar contention)
strace -f -e trace=futex -p $(pidof myapp) 2>&1 | head -50

# Find threads blocked in futex (potential deadlock)
cat /proc/$(pidof myapp)/task/*/wchan 2>/dev/null | sort | uniq -c | sort -rn

# Check thread limits
ulimit -u && cat /proc/sys/kernel/threads-max

Debug Checklist
1. ps -eLf | grep <process> -- list all threads with LWP (tid) column
2. cat /proc/<pid>/status | grep Threads -- thread count for a process
3. cat /proc/<pid>/task/<tid>/status -- per-thread state and stack info
4. pstack <pid> -- dump stack traces of all threads (requires gdb)
5. strace -f -e clone,futex -p <pid> -- trace thread creation and mutex operations
6. valgrind --tool=helgrind ./program -- detect data races and lock ordering issues
Key Takeaways
- ✓ Linux makes no distinction between threads and processes at the kernel level. Thread creation is clone() with CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD. The CLONE_THREAD flag groups them under the same tgid (what getpid() returns). That is it.
- ✓ An uncontended mutex lock is a single atomic cmpxchg in userspace. No syscall. Only when another thread holds the lock does futex(FUTEX_WAIT) enter the kernel. Uncontended mutexes cost about 25 nanoseconds. This is why pthreads are fast.
- ✓ pthread_cond_wait() does two things atomically: releases the mutex and sleeps on the condvar's futex. This atomicity is critical. Without it, a signal could slip between unlock and sleep -- the classic 'lost wakeup' problem.
- ✓ Thread-local storage (__thread / thread_local) uses the FS segment register on x86-64. Each thread's FS base points to its TCB, and TLS variables are offsets from FS. Accessing a thread-local variable is a single mov instruction, not a function call.
- ✓ Thread cancellation is almost never what you want. Deferred cancellation only fires at specific 'cancellation points' (sleep, read, write, etc.). Asynchronous cancellation can fire at any instruction and will leave mutexes locked and heap corrupted. Prefer a shared 'shutdown' flag instead.
Common Pitfalls
- ✗ Not checking the pthread_mutex_lock() return value. With PTHREAD_MUTEX_ERRORCHECK or robust mutexes, it can return EDEADLK or EOWNERDEAD. Default mutexes silently deadlock on recursive locking -- no error, just a hang.
- ✗ Signaling a condition variable without holding the associated mutex. Technically allowed by POSIX, but it creates a race: the waiter may miss the signal if it has not entered pthread_cond_wait() yet. Always signal while holding the mutex, or immediately after unlocking.
- ✗ Using a stack-allocated mutex or condvar after the declaring function returns. The mutex memory gets reused, and other threads end up locking garbage. Always use heap-allocated or global synchronization primitives.
- ✗ Calling fork() in a multithreaded program without immediately calling exec(). Only the calling thread survives in the child. Mutexes held by vanished threads remain locked forever. Use pthread_atfork() as a band-aid, or avoid this pattern entirely.
Reference
In One Line
Threads share everything, which is the point -- but every piece of shared data needs a mutex or an atomic, no exceptions.