Read-Copy-Update (RCU) & Lock-Free Read Access
Mental Model
Picture a museum gallery. Visitors (readers) walk through and look at paintings freely, no tickets or turnstiles required. When the curator (writer) wants to replace a painting, they hang the new one next to the old one and swap the sign. Visitors already looking at the old painting finish at their own pace. The curator waits until every visitor who saw the old painting has moved on to the next room. Only then is the old painting removed from the wall. No visitor is ever interrupted. No visitor ever sees a half-hung painting. The cost of admission is zero for visitors; the curator bears all the waiting.
The Problem
A routing update on a busy router must modify the forwarding table while 10 million packets per second are flowing through it. With a read-write lock, every packet lookup acquires a read lock. On a 48-core system, the cache line holding the rwlock counter bounces between all 48 L1 caches on every packet. Throughput collapses from 10 Mpps to under 2 Mpps. The lock is never contended in the traditional sense -- readers never block -- but the cache-line bouncing alone destroys performance. The routing update itself takes microseconds, but the read-side overhead is paid on every single packet.
Architecture
A 48-core router forwards 10 million packets per second. Each packet needs a routing table lookup. A BGP peer sends an update that changes one route. How does the kernel modify the routing table without making any of those 10 million lookups wait?
The answer is not a read-write lock. A read-write lock forces every reader to atomically increment a shared counter, and on 48 cores that counter's cache line bounces across 48 L1 caches on every single packet. Uncontended in the traditional sense, but devastatingly slow at scale.
RCU -- Read-Copy-Update -- solves this by eliminating reader-side synchronization entirely. Readers do not lock anything. Readers do not increment anything. Readers do not even execute a memory barrier. They just read. On a non-preemptible x86 kernel, the cost of an RCU read-side critical section is literally zero instructions beyond the normal read.
What Actually Happens
Here is the sequence for an RCU-protected data structure update:
1. Read side. Any CPU that needs to access the shared data calls rcu_read_lock(), reads the pointer with rcu_dereference(), uses the data, and calls rcu_read_unlock(). On non-preemptible kernels, rcu_read_lock() compiles down to preempt_disable(). No atomic operation. No shared cache line touched.
2. Copy. The writer allocates a new version of the data structure and initializes it completely. The old version is not modified.
3. Update. The writer calls rcu_assign_pointer() to atomically swing the global pointer from the old version to the new version. This includes a write barrier so that readers on other CPUs see the fully initialized new structure, not a half-written one.
4. Wait. The writer calls synchronize_rcu() (blocking) or call_rcu() (non-blocking callback). This waits until every CPU that might have been reading the old version has finished its read-side critical section. The kernel tracks this by watching for quiescent states -- context switches, idle periods, user-mode execution.
5. Reclaim. After the grace period, no reader can possibly hold a reference to the old version. The old memory is freed.
The critical insight: the writer never touches the old data. Readers of the old version keep reading it without interruption. Readers that start after step 3 see the new version. There is no moment where any reader sees inconsistent data, and no moment where any reader waits.
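In code, the pattern looks roughly like this -- a minimal kernel-style sketch, where struct config, global_cfg, cfg_mutex, and both functions are illustrative names rather than real kernel API:

```c
/*
 * Minimal sketch of the read/copy/update pattern. The struct, the
 * global pointer, the mutex, and both functions are illustrative.
 */
#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config {
	int timeout_ms;
	int retries;
};

static struct config __rcu *global_cfg;
static DEFINE_MUTEX(cfg_mutex);		/* serializes writers only */

/* Read side: no locks, no atomics, no shared cache lines touched. */
static int cfg_get_timeout(void)
{
	struct config *cfg;
	int timeout = 0;

	rcu_read_lock();
	cfg = rcu_dereference(global_cfg);
	if (cfg)
		timeout = cfg->timeout_ms;
	rcu_read_unlock();
	return timeout;		/* cfg must not be used past rcu_read_unlock() */
}

/* Write side: copy, publish, wait for a grace period, reclaim. */
static int cfg_update(int timeout_ms, int retries)
{
	struct config *new_cfg, *old_cfg;

	new_cfg = kmalloc(sizeof(*new_cfg), GFP_KERNEL);
	if (!new_cfg)
		return -ENOMEM;
	new_cfg->timeout_ms = timeout_ms;	/* fully initialize before publishing */
	new_cfg->retries = retries;

	mutex_lock(&cfg_mutex);
	old_cfg = rcu_dereference_protected(global_cfg,
					    lockdep_is_held(&cfg_mutex));
	rcu_assign_pointer(global_cfg, new_cfg);	/* publish with store barrier */
	mutex_unlock(&cfg_mutex);

	synchronize_rcu();	/* wait until every pre-existing reader has finished */
	kfree(old_cfg);		/* no reader can still hold the old pointer */
	return 0;
}
```

Note that the mutex exists only to serialize concurrent writers against each other; readers never touch it.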
For the routing table, this means: fib_lookup() runs entirely inside rcu_read_lock(). Ten million lookups per second across 48 cores, each touching only its own CPU-local data. When a route changes, the new entry is published via rcu_assign_pointer(), and the old entry is freed via call_rcu() after the grace period. Packet forwarding is never interrupted.
Under the Hood
Grace period detection. The kernel must determine when every CPU has passed through a quiescent state. The production implementation (Tree RCU, in kernel/rcu/tree.c) organizes CPUs into a hierarchical tree. Leaf nodes group 16-64 CPUs. Each CPU reports its quiescent state to its leaf node via a bitmask. When all CPUs in a leaf have reported, the leaf propagates up to its parent. When the root node sees all children reported, the grace period ends. This hierarchical approach scales to thousands of CPUs without a single global lock.
What counts as a quiescent state. For RCU-sched (the variant used by most kernel code): a context switch, entering the idle loop, or executing in user mode. The scheduler calls rcu_note_context_switch() on every context switch. The idle loop calls rcu_idle_enter(). These are events that already happen constantly -- RCU piggybacks on them rather than adding new synchronization.
rcu_read_lock() internals. On CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY kernels:
static inline void rcu_read_lock(void)
{
	preempt_disable(); /* compiler barrier only -- no instructions emitted unless CONFIG_PREEMPT_COUNT is set */
}
That is it. No atomic. No barrier. No shared memory access. On CONFIG_PREEMPT_RT (real-time) kernels, RCU read-side critical sections are preemptible, and rcu_read_lock() becomes a per-CPU reference count -- still cheap, but not free.
call_rcu() vs synchronize_rcu(). synchronize_rcu() puts the caller to sleep until the grace period ends. Simple to reason about, but it blocks. call_rcu() registers a callback function and a pointer; the callback runs after the grace period in softirq context. This is how the networking code works -- call_rcu(&old_route->rcu_head, route_free_rcu) queues the free and returns immediately. The writer continues without waiting.
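A minimal sketch of the non-blocking path, assuming a hypothetical struct route_entry with an embedded rcu_head (the names are illustrative, not the kernel's actual FIB code):

```c
/*
 * Sketch: embed an rcu_head in the object, publish the replacement,
 * and let the callback free the old copy after the grace period.
 */
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/types.h>

struct route_entry {
	u32 dst;
	u32 gateway;
	struct rcu_head rcu;	/* storage the RCU core uses to queue the callback */
};

/* Runs in softirq context after the grace period has elapsed. */
static void route_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct route_entry, rcu));
}

/* Caller is assumed to hold whatever lock serializes updates to *slot. */
static void route_replace(struct route_entry __rcu **slot,
			  struct route_entry *new_rt)
{
	struct route_entry *old_rt;

	old_rt = rcu_dereference_protected(*slot, 1);
	rcu_assign_pointer(*slot, new_rt);		/* readers now see the new entry */
	if (old_rt)
		call_rcu(&old_rt->rcu, route_free_rcu);	/* returns immediately */
}
```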
rcu_barrier() drains callbacks. When a kernel module is unloaded, any pending call_rcu() callbacks referencing module code must complete before the module's code pages are freed. rcu_barrier() blocks until all previously queued callbacks on all CPUs have executed. Without it, unloading a module with pending callbacks causes a kernel oops when the callback tries to execute freed code.
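The shape of such an unload path might look like this sketch, where remove_all_entries() and entry_cache are illustrative placeholders for a module that frees its objects via call_rcu() with a callback defined in the module itself:

```c
/* Sketch of a module exit path; names are illustrative. */
static void __exit example_exit(void)
{
	remove_all_entries();	/* unlinks remaining objects, queues call_rcu() frees */

	/*
	 * Every queued callback points at code in this module. Wait for
	 * all of them to run before the module text can be unmapped.
	 */
	rcu_barrier();

	kmem_cache_destroy(entry_cache);
}
module_exit(example_exit);
```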
Memory ordering. On x86, rcu_dereference() is a plain pointer load -- x86 hardware does not reorder dependent loads, so if the pointer load returns the new pointer, all subsequent dependent loads see the data the writer initialized before publishing. On ARM and POWER, rcu_dereference() includes a data dependency barrier (smp_read_barrier_depends() or the C11 memory_order_consume equivalent) because these architectures can speculatively satisfy dependent loads from stale cache lines.
Common Questions
How does RCU handle data structures larger than a single pointer?
RCU protects pointer-mediated access, not individual fields. The pattern is: allocate a complete new structure, populate it fully, then swing one pointer. For linked lists, list_add_rcu() and list_del_rcu() manipulate next/prev pointers with the necessary barriers so that readers traversing the list under rcu_read_lock() always see a consistent chain. A reader might traverse the old path or the new path depending on timing, but it never sees a corrupted pointer or a dangling reference.
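A minimal sketch of the list pattern, assuming a hypothetical device list whose writers serialize on a spinlock (all names are illustrative):

```c
/*
 * Sketch of an RCU-protected list: lock-free readers, spinlock-serialized
 * writers, deferred frees.
 */
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct dev_entry {
	int id;
	struct list_head node;
	struct rcu_head rcu;
};

static LIST_HEAD(devs);
static DEFINE_SPINLOCK(devs_lock);	/* serializes writers only */

/* Reader: sees the list either before or after a concurrent update,
 * but never a torn pointer or a freed node. */
static bool dev_exists(int id)
{
	struct dev_entry *d;
	bool found = false;

	rcu_read_lock();
	list_for_each_entry_rcu(d, &devs, node) {
		if (d->id == id) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}

static void dev_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct dev_entry, rcu));
}

/* Writer: unlink under the lock, defer the free past the grace period. */
static void dev_remove(struct dev_entry *d)
{
	spin_lock(&devs_lock);
	list_del_rcu(&d->node);		/* readers mid-traversal can keep following ->next */
	spin_unlock(&devs_lock);
	call_rcu(&d->rcu, dev_free_rcu);
}
```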
What happens if a reader is preempted in an RCU critical section on a CONFIG_PREEMPT kernel?
On preemptible kernels, rcu_read_lock() does not disable preemption. Instead, the kernel tracks which tasks are inside RCU read-side critical sections using a per-task counter. The grace period cannot end until all such tasks have exited their critical sections. This means a preempted reader extends the grace period -- writers wait longer, but correctness is maintained. This is why preemptible RCU has slightly higher overhead and longer grace periods than non-preemptible RCU.
Can RCU completely replace read-write locks?
Not always. RCU works when reads vastly outnumber writes and when the data can be updated by replacing an entire structure via pointer swap. It does not work for in-place field updates (use per-field atomics or seqlocks), does not work when reads must block (use SRCU), and adds complexity that is not justified when a simple mutex has acceptable performance. The rule of thumb: if read-side scalability across many CPUs is the bottleneck, RCU is the answer. If the workload is write-heavy or runs on fewer than 8 cores, a rwlock is simpler and sufficient.
How much memory overhead does deferred freeing add?
Grace periods typically complete in 1-20ms. In that window, old versions of updated structures remain allocated. For a routing table that changes 100 times per second, that is at most 1-2 old route entries alive at any time -- negligible. For a workload that deletes 100,000 objects per second via call_rcu(), the backlog can reach thousands of pending callbacks per CPU. The kernel monitors this and can invoke callbacks more aggressively (via rcu_nocbs CPUs or by forcing a grace period) to prevent memory pressure. In extreme cases, rcu_barrier() provides a hard synchronization point.
How Technologies Use This
A container host tears down 150 containers in rapid succession during a rolling deployment. Each container has its own network namespace containing routes, iptables rules, active sockets, and netfilter conntrack entries. Removing a network namespace must guarantee that no packet currently in flight on another CPU still references the namespace's data structures before those structures are freed.
The kernel uses call_rcu() to defer freeing network namespace structures until a grace period elapses. When a namespace is dismantled, its routing tables and socket structures are unlinked from the global lists immediately, but the actual kfree calls are queued via call_rcu(). Any packet being processed on another CPU that still holds an rcu_read_lock() reference to the old namespace's routing entries completes safely. Only after every CPU has passed through a quiescent state (context switch, idle, or user-mode execution) does the RCU callback fire and free the memory.
Without call_rcu(), tearing down 150 namespaces would require either a blocking synchronize_rcu() call for each one (adding 1-20ms of latency per namespace, totaling seconds of delay) or risk use-after-free crashes from packets referencing freed structures. call_rcu() lets the teardown proceed at full speed while the RCU subsystem batches deferred frees in the background.
An Nginx server handling 50,000 requests per second serves static files from paths like /var/www/assets/css/main.css, where each open() call requires five directory entry (dentry) lookups: var, www, assets, css, and main.css. At 50K req/sec, that is 250,000 dentry cache lookups per second across all worker processes.
The kernel's VFS uses RCU-walk mode to traverse the dentry hash chain entirely under rcu_read_lock(), which on non-preemptible kernels compiles to a single preempt_disable() instruction. Each path component is resolved by following pointers via rcu_dereference() with zero lock acquisition and zero atomic counter increments. If no concurrent rename or deletion is detected (checked via a per-dentry sequence counter), the entire five-component lookup completes in pure RCU mode without touching a single shared lock.
On a 16-core host running Nginx workers pinned to each core, this means all 16 cores resolve file paths in parallel with no cache-line bouncing between them. Without RCU-walk, the old dcache global spinlock would serialize all 250K lookups per second through a single lock, capping throughput regardless of core count. In practice, 99%+ of lookups complete in RCU-walk mode, and the fallback ref-walk path (which takes proper reference counts) fires only during active directory renames.
A Go application performing 10,000 HTTP requests per second relies on the kernel's routing table (FIB) for every outbound DNS and TCP connection. Each call to connect() triggers a fib_lookup() in the kernel to determine the next-hop gateway and output interface. On a host with 500 routes from multiple network interfaces and VPN tunnels, these lookups happen entirely under rcu_read_lock().
The kernel's FIB uses RCU to allow route lookups and route updates to proceed concurrently. When a new route is added or an existing one changes (for example, a VPN tunnel renegotiates and the default route shifts), the routing subsystem allocates a new fib_info structure, populates it, and swaps the pointer using rcu_assign_pointer(). Any in-flight fib_lookup() on another CPU that started before the swap continues reading the old route safely. The old fib_info is freed via call_rcu() after all CPUs have passed through a quiescent state.
For the Go runtime, this is invisible but critical. The goroutine scheduler multiplexes thousands of goroutines across a handful of OS threads, each making connect() calls that hit fib_lookup(). If route lookups required a read-write lock, the cache-line bouncing from atomic counter increments on every lookup would serialize all outbound connections. RCU ensures that routing table reads scale linearly with core count, keeping DNS resolution and connection setup fast even during route table churn.
Same Concept Across Tech
| Technology | How it uses RCU | Key gotcha |
|---|---|---|
| Linux routing (FIB) | All route lookups run under rcu_read_lock(). Route updates swap pointers and defer frees via call_rcu() | Route table memory is not freed instantly. Under heavy churn, deferred frees can accumulate |
| VFS dcache | RCU-walk mode traverses dentry hash chains without locks. Falls back to ref-walk only on concurrent rename | Rename-heavy workloads force fallback to the slower ref-walk path |
| Network namespaces | Namespace teardown defers structure freeing via call_rcu() to avoid blocking on in-flight packets | Tearing down hundreds of namespaces rapidly can build a large callback backlog |
| SELinux/AppArmor | Policy lookups on every syscall use RCU to avoid lock overhead on the hot path | Policy reloads must wait for a grace period before freeing old policy structures |
| epoll | epoll's ready list traversal uses RCU in parts of the wakeup path | Complex interaction with the epoll spinlock for modifications |
Stack layer mapping (RCU stall or memory growth):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is a kernel thread stuck in a long RCU read-side critical section? | dmesg for "rcu_sched detected stalls" |
| Scheduler | Is a CPU not scheduling (stuck in a tight loop with preemption disabled)? | /proc/sched_debug, check runqueue |
| RCU core | Are grace periods completing? Is the gap between current and completed growing? | /sys/kernel/debug/rcu/rcu_preempt/rcugp |
| Memory | Are deferred frees accumulating faster than grace periods complete? | /proc/softirqs RCU line, slabtop for growing caches |
| Hardware | Is an interrupt storm preventing a CPU from reaching a quiescent state? | /proc/interrupts, check for runaway interrupt counts |
Design Rationale
Traditional read-write locks have a hidden scalability wall. Even though readers do not block each other, every reader must atomically increment a shared counter to acquire the lock. On a 64-core machine, that counter's cache line bounces between 64 L1 caches on every read operation, turning an "uncontended" lock into a serialization point. RCU eliminates reader-side shared state entirely. Readers touch only CPU-local data (preemption count), so read-side throughput scales linearly. The tradeoff is that writers must wait for a grace period before reclaiming memory, and the programming model is more restrictive -- data must be accessed through pointer indirection, and updates must follow the publish-subscribe pattern. For the kernel's dominant workload pattern (frequent reads, rare writes), this tradeoff is overwhelmingly favorable.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| "rcu_sched detected stalls on CPUs" in dmesg | A CPU is stuck in a non-preemptible section, preventing quiescent state | dmesg -T, check which CPU and backtrace |
| Memory usage grows steadily under RCU-heavy workload | call_rcu() callbacks accumulating faster than grace periods complete | cat /sys/kernel/debug/rcu/rcu_preempt/rcudata, check cblist len |
| Soft lockup warnings alongside RCU stall warnings | A CPU spinning with interrupts disabled or preemption disabled | Check NMI watchdog output, look for the stuck backtrace |
| Network performance degrades after routing table churn | Excessive synchronize_rcu() calls serializing route updates | Convert synchronize_rcu() to call_rcu() for batch updates |
| High softirq CPU time on RCU_SOFTIRQ | Large number of pending RCU callbacks being processed | /proc/softirqs, check if RCU callbacks are batching properly |
| System hangs after loading a kernel module that holds rcu_read_lock too long | Grace period cannot complete, blocking all RCU writers system-wide | dmesg for stall backtrace, fix the module to shorten read-side sections |
When to Use / Avoid
Relevant when:
- Read-heavy workloads where read-side performance must scale linearly with CPU count
- Data structures updated rarely but read on every packet, syscall, or lookup (routing tables, dcache, process list)
- Lock contention analysis shows cache-line bouncing on rwlock reader counts as the bottleneck
- Existing rwlock-protected code needs to scale beyond 16-32 cores
Watch out for:
- Write-heavy workloads where grace period delays cause memory pressure from deferred frees
- Code paths that need to sleep during reads (use SRCU instead of classic RCU)
- Structures modified in place rather than replaced via pointer swap (RCU protects pointer traversal, not field mutation)
Try It Yourself
# Check for RCU stall warnings in kernel log
dmesg -T | grep -i 'rcu.*stall'

# View RCU grace period state
cat /sys/kernel/debug/rcu/rcu_preempt/rcugp 2>/dev/null || cat /sys/kernel/debug/rcu/rcu_sched/rcugp

# Count softirq invocations including RCU callbacks
grep -E 'RCU|TASKLET' /proc/softirqs

# Trace RCU grace period events in real time
trace-cmd record -e rcu:rcu_grace_period -e rcu:rcu_utilization -- sleep 5 && trace-cmd report | head -30

# Monitor RCU callback processing rate
watch -n1 'grep RCU /proc/softirqs'

# Check kernel config for RCU flavor
zcat /proc/config.gz 2>/dev/null | grep CONFIG_RCU || grep CONFIG_RCU /boot/config-$(uname -r)

Debug Checklist
1. Check for stalled grace periods: dmesg | grep 'rcu.*stall'
2. View per-CPU RCU state: cat /sys/kernel/debug/rcu/rcu_preempt/rcudata
3. Check grace period progress: cat /sys/kernel/debug/rcu/rcu_preempt/rcugp
4. Count pending RCU callbacks: grep rcu /proc/softirqs
5. Trace RCU events: trace-cmd record -e rcu -e rcu:rcu_grace_period
6. Check for long read-side critical sections: echo 1 > /sys/kernel/debug/rcu/rcu_preempt/rcu_expedited_stall
Key Takeaways
- ✓ rcu_read_lock() on a non-preemptible kernel compiles to preempt_disable(). No atomic instruction. No memory barrier. No cache-line bounce. This is why RCU readers scale perfectly -- each CPU touches only its own local data. Compare this to a rwlock where every reader atomically increments a shared counter, bouncing the cache line across every CPU.
- ✓ RCU does not protect data. It protects pointers to data. The pattern is always: publish a new pointer, wait for readers of the old pointer to finish, then free the old data. If code needs to read the contents of a structure that might be concurrently modified in place, RCU alone is not sufficient -- combine it with per-field atomic operations or seqlocks.
- ✓ synchronize_rcu() vs call_rcu() is a latency-vs-memory tradeoff. synchronize_rcu() blocks until the grace period ends (typically 1-20ms), keeping the call stack simple. call_rcu() returns immediately but the old data stays allocated until the callback fires. In a tight loop freeing thousands of objects, call_rcu() can build up a large backlog. Use rcu_barrier() to drain all pending callbacks when that matters.
- ✓ SRCU (Sleepable RCU) exists for cases where readers need to sleep. Classic RCU read-side critical sections cannot sleep because grace period detection relies on context switches as quiescent states. SRCU uses per-CPU counters instead, allowing readers to block (a short sketch follows this list). The tradeoff is higher read-side overhead (atomic increment/decrement on the counter).
- ✓ RCU grace periods in a real kernel typically complete in 1-20ms. The kernel's RCU implementation (Tree RCU) uses a hierarchical structure to scale grace period detection across thousands of CPUs. Each CPU reports quiescent states to its leaf node, and the information propagates up the tree. This avoids a single global counter that all CPUs would contend on.
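A minimal sketch of the SRCU variant mentioned above, assuming a hypothetical device_ops table whose readers may sleep (all names are illustrative):

```c
/*
 * Sketch of SRCU usage: readers take a per-CPU counter and get back an
 * index that must be passed to srcu_read_unlock(); sleeping is allowed
 * inside the critical section.
 */
#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/srcu.h>

struct device_ops {
	int (*query)(void);		/* may sleep */
};

DEFINE_STATIC_SRCU(my_srcu);
static struct device_ops __rcu *active_ops;

static int do_slow_read(void)
{
	struct device_ops *ops;
	int idx, ret = -ENODEV;

	idx = srcu_read_lock(&my_srcu);		/* per-CPU counter, not preempt_disable() */
	ops = srcu_dereference(active_ops, &my_srcu);
	if (ops)
		ret = ops->query();		/* sleeping here is legal under SRCU */
	srcu_read_unlock(&my_srcu, idx);
	return ret;
}

static void ops_replace(struct device_ops *new_ops)
{
	struct device_ops *old;

	old = rcu_dereference_protected(active_ops, 1);	/* caller serializes writers */
	rcu_assign_pointer(active_ops, new_ops);
	synchronize_srcu(&my_srcu);	/* waits only for this srcu_struct's readers */
	kfree(old);
}
```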
Common Pitfalls
- ✗ Accessing RCU-protected data outside rcu_read_lock/rcu_read_unlock. The data can be freed at any moment after the read-side critical section ends. Saving a pointer obtained via rcu_dereference() and using it after rcu_read_unlock() is a use-after-free bug. The pointer is valid only within the critical section.
- ✗ Blocking or sleeping inside an RCU read-side critical section (classic RCU, not SRCU). Sleeping prevents the CPU from reaching a quiescent state, which stalls the grace period for all writers system-wide. In the worst case, this causes memory exhaustion as call_rcu() callbacks pile up waiting for the stalled grace period.
- ✗ Using synchronize_rcu() in a loop that frees many objects. Each call blocks for a full grace period (milliseconds). Freeing 10,000 objects one at a time with synchronize_rcu() takes seconds. Use call_rcu() to batch the frees, or collect objects into a list and call synchronize_rcu() once for the entire batch, as in the sketch after this list.
- ✗ Forgetting rcu_assign_pointer() when publishing a new pointer. A plain store can be reordered by the compiler or CPU (on weakly-ordered architectures), letting readers see a pointer to a partially initialized structure. rcu_assign_pointer() includes the necessary store barrier. On x86 this matters for compiler reordering; on ARM/POWER it matters for both compiler and hardware reordering.
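A minimal sketch of that batched approach, assuming a hypothetical table whose entries carry a second list_head reserved for teardown (all names are illustrative):

```c
/*
 * Sketch: unlink everything under the lock, wait for one grace period,
 * then free the whole batch.
 */
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct entry {
	int key;
	struct list_head node;	/* linkage in the RCU-protected table */
	struct list_head purge;	/* private linkage used only during teardown */
};

static DEFINE_SPINLOCK(table_lock);

static void flush_table(struct list_head *table)
{
	struct entry *e, *tmp;
	LIST_HEAD(batch);

	spin_lock(&table_lock);
	list_for_each_entry_safe(e, tmp, table, node) {
		/*
		 * list_del_rcu() leaves e->node.next intact so in-flight
		 * readers can finish their traversal; collect the entry on
		 * a separate list_head instead of reusing e->node.
		 */
		list_del_rcu(&e->node);
		list_add(&e->purge, &batch);
	}
	spin_unlock(&table_lock);

	synchronize_rcu();	/* one grace period covers the entire batch */

	list_for_each_entry_safe(e, tmp, &batch, purge)
		kfree(e);
}
```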
Reference
In One Line
Readers pay nothing -- not even a single atomic instruction -- while writers do all the work by waiting for every existing reader to finish before freeing old data.