Process vs Thread vs Goroutine
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A process owns memory and resources; threads share the process's address space and are scheduled by the OS; goroutines are user-space tasks multiplexed onto a small pool of OS threads by the Go runtime.
Key Points
- Process: isolated address space, expensive context switch (~tens of μs), IPC required to share data
- OS thread: ~1 MB stack, context switch ~1–10 μs, kernel-scheduled
- Goroutine: ~2 KB initial stack (grows), context switch ~hundreds of ns, runtime-scheduled
- Java virtual threads (21+): JVM-managed, M:N model, millions per JVM
- Python threads share the GIL, only one executes Python bytecode at a time
- Prefer processes for CPU-bound parallelism in Python; threads for I/O
Interview Follow-up Questions
How many goroutines can a Go program run?
Practically millions on modern hardware. Each starts with ~2 KB stack that grows as needed. Limited by available memory, not OS thread count.
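A minimal Go sketch of how cheap goroutines are (the 100,000 count is illustrative; assumes a recent Go toolchain):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// trivial work; each goroutine costs kilobytes of stack, not megabytes
		}()
	}
	fmt.Println("goroutines alive:", runtime.NumGoroutine())
	wg.Wait()
}
```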
What's the GIL and why does it matter?
Global Interpreter Lock, CPython serializes execution of Python bytecode across threads. Threads help for I/O (GIL released around system calls) but not CPU. Use multiprocessing for CPU parallelism.
Difference between virtual threads and goroutines?
Both use M:N scheduling onto OS threads. Goroutines are runtime-native and integrate with channels. Virtual threads are JVM-managed and work with existing blocking JDK APIs without rewrites.
Why does Go limit OS threads but not goroutines?
GOMAXPROCS sets the number of P's (logical processors), which caps how many OS threads can execute Go code simultaneously; threads blocked in syscalls don't count toward the limit. Goroutines are multiplexed onto these. Default = number of CPU cores.
Can a goroutine outlive the function that started it?
Yes, a goroutine runs until its function returns. If main() exits, all goroutines are killed. Otherwise it can run forever, which is how leaks happen.
Gotchas
- Goroutines started in a loop without context cancellation are the #1 source of leaks in production Go services
- Virtual threads pin to their carrier inside synchronized blocks, prefer ReentrantLock for I/O-heavy code
- Python threads do NOT give CPU parallelism, measure and switch to multiprocessing if compute-bound
- Mixing asyncio and threading is a footgun, use asyncio.to_thread() for blocking calls
- runtime.NumGoroutine() growing unbounded over time = leak; alert on it
Concurrency vs Parallelism
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
Concurrency is structuring a program to deal with many things at once; parallelism is actually doing many things at the same time. One can exist without the other.
Key Points
- Concurrency = composition of independently executing tasks (a structure)
- Parallelism = simultaneous execution of multiple tasks (an outcome)
- Single-core CPU + threads = concurrency without parallelism (interleaved)
- GPU compute or multi-core SIMD = parallelism without classical concurrency
- Web server handling 10K connections = concurrency; matrix multiplication on 8 cores = parallelism
- Async/event loops give concurrency without threads, single-threaded but interleaved
Interview Follow-up Questions
If Python's GIL prevents parallel threads, why use threading at all?
Threading provides concurrency for I/O-bound work, the GIL is released around blocking system calls. The result is overlapping waits, not parallel compute.
Is async/await concurrent, parallel, or both?
Concurrent only. asyncio runs on one thread; tasks interleave at await points. To add parallelism, run multiple asyncio event loops (one per process) or use a thread pool for blocking work.
Can a single-threaded program be concurrent?
Yes, Node.js and Python asyncio are single-threaded but concurrent. They interleave I/O-bound tasks via cooperative scheduling on an event loop.
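A small Go sketch of concurrency without parallelism: with GOMAXPROCS(1) only one goroutine executes at any instant, yet the two tasks still interleave (the task names and loop counts are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(1) // one OS thread executing Go code at a time

	var wg sync.WaitGroup
	for _, name := range []string{"A", "B"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < 3; i++ {
				fmt.Println(name, i)
				runtime.Gosched() // yield so the other goroutine can run
			}
		}(name)
	}
	wg.Wait()
}
```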
Why does Rob Pike say 'concurrency is not parallelism'?
His point: concurrency is a way to structure problems with independent pieces; parallelism is one possible execution. A well-designed concurrent program runs efficiently on 1 or 100 cores, the structure is independent of the hardware.
Gotchas
- More threads != more parallelism, once thread count exceeds core count, the cost is context-switching without throughput gain
- asyncio is single-threaded, a CPU-bound task in an async function blocks the entire event loop
- Thread pools sized for I/O are wrong for CPU-bound work and vice versa
Thread & Goroutine Lifecycle
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A Java thread moves through NEW → RUNNABLE → BLOCKED/WAITING/TIMED_WAITING → TERMINATED. A goroutine has no public state, it runs until its function returns; the runtime parks/unparks it on channel and lock operations. Python threads expose start/join/is_alive but no fine-grained state.
Key Points
- Java thread states: NEW, RUNNABLE, BLOCKED (waiting for monitor), WAITING, TIMED_WAITING, TERMINATED
- RUNNABLE in Java covers both 'running' and 'ready to run', the JVM doesn't expose CPU-on/off
- BLOCKED = waiting for a synchronized monitor; WAITING = parked via wait/join/park
- Goroutines have no public state, they're either running, runnable, or parked
- Cancellation in Go is via context.Context, propagated explicitly through call chains
- Cancellation in Java is via Thread.interrupt(), checked cooperatively
- Cancellation in Python is via threading.Event polling or asyncio.Task.cancel()
Interview Follow-up Questions
What's the difference between BLOCKED and WAITING in Java?
BLOCKED = waiting to acquire a synchronized monitor (no timeout, no signal). WAITING = parked via wait/join/LockSupport.park, woken by notify/unpark/interrupt.
Why is Thread.stop() deprecated?
It releases all locks abruptly, leaving objects in inconsistent states. Use interrupt + cooperative shutdown.
How do you cancel a goroutine?
Not from outside, signal it via context.Done() or by closing a channel it selects on. The goroutine must check the signal and return.
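A minimal sketch of cooperative cancellation via context (the tick interval and sleeps are illustrative, for demonstration only):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func worker(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			fmt.Println("worker: stopping:", ctx.Err())
			return // the goroutine is not killed; it checks the signal and returns
		case <-time.After(100 * time.Millisecond):
			fmt.Println("worker: tick")
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go worker(ctx)

	time.Sleep(300 * time.Millisecond)
	cancel()                          // signal the worker to stop
	time.Sleep(50 * time.Millisecond) // give it a moment to exit (demo only)
}
```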
What happens to goroutines when main() exits?
They're killed immediately, no cleanup runs. Use sync.WaitGroup or context to wait for them before main returns.
Can a Thread be restarted after it terminates?
No. Thread.start() throws IllegalThreadStateException if called twice. Create a new Thread or use an ExecutorService.
How does Python cancel a thread?
Not directly. The convention is a shared threading.Event the thread polls. Or for asyncio, task.cancel() raises CancelledError at the next await.
Gotchas
- Thread.sleep() and Object.wait() CLEAR the interrupt flag when they throw InterruptedException; re-set it with Thread.currentThread().interrupt()
- Forgetting wg.Add() before 'go' creates a race with wg.Wait(), Add must be called from the parent goroutine
- context.WithCancel without calling cancel() leaks the child context (it stays reachable from its parent until the parent is cancelled)
- Daemon threads (Java setDaemon, Python daemon=True) are killed at shutdown, avoid for cleanup work
- notify() wakes ONE waiter, notifyAll() wakes them all; spurious wakeups mean the wait must be in a loop either way
- Swallowing asyncio.CancelledError prevents shutdown, re-raise after cleanup
Critical Sections & Shared State
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A critical section is the part of code that touches shared mutable state and must run without interleaving from other threads. Identify it first, then pick a primitive (mutex, atomic, channel) to protect it. Most concurrency bugs come from missing or oversized critical sections.
Key Points
- Shared mutable state is the source of every race. Remove sharing or remove mutation, and the problem disappears.
- A critical section is any code path that reads-then-writes (or writes-then-reads) shared state and depends on the data not changing in between.
- Two failures to avoid: forgetting to protect a critical section, and protecting too much (locking around I/O).
- The right size of a critical section is the smallest set of statements that preserve the invariant in question.
Interview Follow-up Questions
How does one find critical sections in unfamiliar code?
Look for every read or write to a field that is also written by another thread. Then trace what invariant the code relies on. If the invariant spans more than one statement, those statements are the critical section.
What is wrong with making the lock cover the whole method?
Two things. Throughput drops because no other thread can do anything while one is in the method. And if the method calls into other code (a callback, a network call), it can deadlock or block on something that has no business being inside a lock.
Is a single statement always atomic?
No. `counter++` reads the value, increments, and writes back. Three operations. Other threads can interleave between them. A single statement is not the same as a single operation at the hardware level.
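A sketch of the smallest useful critical section in Go: the read-increment-write on the counter is the invariant-preserving unit, so that is exactly what the mutex wraps (counter name and goroutine count are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)

	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			counter++ // read, add, write: three steps, one critical section
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter) // always 1000 with the lock; unpredictable without it
}
```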
Race Conditions, Why Concurrent Bugs Happen
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A race condition is when the program's correctness depends on the unpredictable interleaving of two or more threads' operations on shared state. Fix it by removing the sharing, or by making the operations atomic with a lock or atomic primitive.
Key Points
- A race condition is a CORRECTNESS bug, not a 'maybe slow' bug, the wrong answer can be observed
- Three ingredients: shared mutable state, multiple threads, unsynchronized access
- Remove ANY one of the three and the race goes away
- Read-modify-write operations like counter++ are NEVER atomic, they're three instructions
- Compilers and CPUs reorder reads/writes, what looks sequential in source isn't
- Race conditions hide on x86 (strong memory model) and explode on ARM
Interview Follow-up Questions
What's the difference between a data race and a race condition?
Subtle but important. A data race is unsynchronized access where at least one is a write, a memory model violation, undefined behavior. A race condition is broader, any logic bug whose outcome depends on timing. All data races are race conditions; not all race conditions are data races (e.g., check-then-act with atomics).
Why does counter++ fail even with volatile?
volatile gives visibility (other threads see the latest value) and ordering, but ++ is still three operations: read, add, write. Two threads can both read the same value, both add 1, both write, losing one update. volatile makes each individual read and write atomic and visible; it's the COMPOUND read-modify-write that also needs to be atomic, and volatile can't provide that. Use AtomicInteger.incrementAndGet.
If code 'works' on x86, is there still a race?
Yes. x86's strong memory model masks many races at runtime, but the compiler is free to reorder. Move to ARM (Apple Silicon, AWS Graviton, mobile) and the same code may break. Use the race detector on every platform.
What's the simplest way to remove a race?
Remove the sharing. Confine the state to one thread (Go's idiom: 'share by communicating'), or make it immutable. Locks and atomics are the next-best fixes, they manage the race; removing the sharing eliminates it.
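A Go sketch of removing the race by removing the sharing: the total is confined to a single goroutine, and everyone else communicates over a channel instead of touching the variable (the counter and goroutine count are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	increments := make(chan int)
	done := make(chan int)

	// Only this goroutine ever touches total, so no lock is needed.
	go func() {
		total := 0
		for n := range increments {
			total += n
		}
		done <- total
	}()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			increments <- 1
		}()
	}
	wg.Wait()
	close(increments)
	fmt.Println(<-done) // 100
}
```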
Are reads-only races okay?
Yes, concurrent reads of immutable state are always safe. A race requires at least one write.
Gotchas
- Tests almost always pass, races need real load and timing to manifest
- x86 hides races that ARM exposes, test on both
- Even atomic.Add doesn't help with check-then-act: if (atomic.Load() == 0) atomic.Store(1) is racy
- The Go race detector slows code 5-10x, run it in CI, not production
- Java's volatile is for visibility, not atomicity of compound ops
Thread Safety Levels
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
Thread safety is not a single yes/no bucket. There are levels: immutable, thread-confined, conditionally thread-safe, fully thread-safe, and lock-free. Knowing the level a class promises is the difference between safe code and a postmortem.
Key Points
- Immutable: no writes after construction. Always safe to share.
- Thread-confined: only ever touched by one thread. Safe by isolation, not by locking.
- Conditionally thread-safe: individual methods are safe, but compound operations need external synchronisation.
- Fully thread-safe: every method and every reasonable sequence of calls is safe under concurrent use.
- Lock-free: thread-safe AND no thread can be permanently blocked by another. Stronger than fully thread-safe.
Interview Follow-up Questions
What is the difference between thread-safe and lock-free?
Thread-safe means concurrent use does not corrupt state or produce wrong results. Lock-free is stronger: at least one thread always makes progress. A locked queue is thread-safe but not lock-free, because the lock holder can be preempted and block all callers.
Why is conditionally thread-safe so easy to misuse?
Because each method call looks safe, so engineers assume sequences of calls are safe too. They are not. The lesson is to always think about the compound operation, not the individual calls.
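A Go sketch of the conditionally-thread-safe trap using sync.Map: each method call is safe on its own, but a Load-then-Store sequence is a racy compound operation (the cache name and putIfAbsent helpers are illustrative, not a library API):

```go
package main

import (
	"fmt"
	"sync"
)

var cache sync.Map

// Racy: two goroutines can both see the key as missing and both Store,
// one overwriting the other. Each call is thread-safe; the sequence is not.
func putIfAbsentRacy(key, value string) {
	if _, ok := cache.Load(key); !ok {
		cache.Store(key, value)
	}
}

// Safe: LoadOrStore performs the whole check-then-act as one atomic call.
func putIfAbsent(key, value string) (string, bool) {
	actual, loaded := cache.LoadOrStore(key, value)
	return actual.(string), loaded
}

func main() {
	putIfAbsentRacy("a", "first")
	v, loaded := putIfAbsent("a", "second")
	fmt.Println(v, loaded) // "first" true: the existing value wins
}
```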
Is final / readonly enough for immutability?
Almost. The fields must be final, the references must point to immutable objects too (or defensive copies), and the constructor must not leak `this` before completion. Java's record types and Kotlin's data classes get this right by default.
How should the thread-safety level of a class be documented?
In the Javadoc / docstring, in one line: 'This class is immutable', 'fully thread-safe', 'not thread-safe', 'thread-confined to the EDT'. Library authors should be specific. Callers should not have to guess.
Mutex, Semaphore, Condition Variable
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
A mutex enforces 'one at a time'. A semaphore counts permits, N threads at a time. A condition variable lets threads wait for a state change. They're the three primitives every higher-level concurrency tool is built on.
Key Points
- Mutex = binary lock, mutual exclusion, ONE owner
- Semaphore = counted permits, allows N concurrent holders
- Condition variable = wait until a predicate becomes true, used WITH a mutex
- Mutex 'protects' invariants; semaphore 'limits' concurrency; condition variable 'signals' state changes
- Always wait on a condition variable in a while-loop, never if (spurious wakeups + stolen wakeups)
- Counting semaphore (N>1) for resource pools; binary semaphore (N=1) is 'almost a mutex' but lacks ownership
Interview Follow-up Questions
Mutex vs binary semaphore, what's the difference?
A mutex has ownership: only the thread that acquired can release. A binary semaphore is a counter capped at 1, any thread can release, even one that didn't acquire. Use mutex for mutual exclusion (most cases). Use semaphore for signaling (rare) or for resource pools (counting > 1).
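A sketch of a counting semaphore in Go, where a buffered channel holds the permits: a send acquires, a receive releases, and the capacity caps concurrency at N (the limit of 3 and the fake work are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const maxConcurrent = 3
	sem := make(chan struct{}, maxConcurrent)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a permit (blocks while N are out)
			defer func() { <-sem }() // release the permit

			fmt.Println("working:", id)
			time.Sleep(100 * time.Millisecond) // stand-in for real work
		}(i)
	}
	wg.Wait()
}
```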
Why pair a mutex AND a condition variable?
The condition variable's wait operation atomically: (a) releases the mutex and (b) parks the thread, so it can't miss a signal sent between the predicate check and the wait. Without the mutex, a check-then-wait race appears. They're a pair.
What's a spurious wakeup?
An OS or JVM may wake a parked thread without anyone calling signal/notify. The wakeup spec is 'at least once when notified', not 'exactly when notified.' That's why every wait MUST be in a while-loop that re-checks the predicate.
Difference between signal() and signalAll()?
signal/notify wakes ONE waiting thread (impl-defined which). signalAll/notifyAll wakes ALL of them. Use signal when only one thread can make progress (e.g., one slot freed in a buffer); use signalAll when state changed in a way that all waiters might benefit from.
Why does Go discourage condition variables?
Go's design philosophy is 'communicate, don't share.' Channels carry both data and synchronization in one operation, eliminating the mutex+condition pair for most cases. sync.Cond exists for the rare case (broadcast to many) but should be a last resort.
Gotchas
- Forgetting to release a lock on the error path → permanent deadlock. Always use defer/finally/with.
- Using if instead of while around await/wait → spurious + stolen wakeups corrupt invariants
- Locking and waiting on different lock instances, the code looks locked but isn't
- Calling signal() without holding the lock, the wait/check race is back
- Counting semaphore release without prior acquire → permit leak (raises in BoundedSemaphore)
- Mutex copy in Go (passing a struct by value), go vet catches it
Synchronisation vs Coordination
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Sometimes
Synchronisation protects shared state. Coordination orchestrates the order or timing of work. Mutexes synchronise. Channels, latches, barriers, futures coordinate. Mixing them up leads to overcomplicated code that locks where it should signal, and signals where it should lock.
Key Points
- Synchronisation: 'no two threads in this section at once'. Mutex, RWLock, atomic, synchronized.
- Coordination: 'do this after that'. Channel, latch, barrier, future, semaphore, condition variable.
- Misapplication: using a mutex to wait for an event (busy-loop with mutex). Or using a channel to protect a variable (overkill).
- Channels are coordination AND synchronisation in one move because the send happens-before the receive.
- When in doubt, ask what's actually needed. Mutual exclusion of access? Or sequencing of work?
Interview Follow-up Questions
Are these two things really different, or just different words for the same?
Different. Synchronisation prevents simultaneous access to data. Coordination orders events in time. A mutex synchronises but does not coordinate (it has no 'wait for X to finish' semantics). A latch coordinates but does not synchronise (it does not protect any data). Some primitives do both.
Is a semaphore synchronisation or coordination?
Coordination, mostly. With permits = 1 it can act like a mutex (binary semaphore), but the typical usage is rate-limiting or signal-style: 'release a permit when X happens, acquire one to wait for it'. The N-permit case is purely coordination of how many things can run.
Why are channels a popular choice in Go?
Because they bundle synchronisation (the send happens-before the receive) with coordination (the receiver waits until something is sent). One primitive, two jobs. Most Go programs end up needing both at the same point in the code, so the bundle is a clean fit.
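A minimal sketch of one channel operation doing both jobs: the send happens-before the receive (synchronisation), and the receiver waits for the result to exist (coordination). The report struct is illustrative:

```go
package main

import "fmt"

type report struct {
	rows int
	ok   bool
}

func main() {
	results := make(chan *report)

	go func() {
		r := &report{}
		r.rows = 42  // plain writes, no lock...
		r.ok = true
		results <- r // ...published safely: the send happens-before the receive
	}()

	r := <-results // waits until the worker is done AND sees its writes
	fmt.Println(r.rows, r.ok)
}
```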
When should one reach for a condition variable?
When shared state guarded by a lock requires threads to wait until that state matches a predicate. Producer-consumer with a bounded buffer is the canonical example. For 'wait for one event', a latch or Event is simpler. For 'protect this state' alone, a lock by itself is enough.
Optimistic vs Pessimistic Concurrency Control
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
Pessimistic: lock the data while working on it; nobody else can touch it. It pays a lock acquire on every operation but stays predictable when contention is high. Optimistic: read freely, do the work, commit only if nothing else changed (CAS, version number, MVCC). Nearly free when contention is low, requires retries under contention. Pick pessimistic when conflicts are common, optimistic when they're rare.
Key Points
- Pessimistic = 'assume conflict, prevent it' (locks). Optimistic = 'assume no conflict, detect it' (CAS, version checks).
- Pessimistic cost is constant per operation regardless of contention; optimistic cost is near-zero under low contention but grows (retries) under high contention.
- Optimistic concurrency uses one of three mechanisms: CAS on a memory word, a version column / ETag in a row, or MVCC snapshots in a database.
- The break-even point is the conflict rate. Roughly: under 10% conflict, optimistic wins. Above 50% conflict, pessimistic wins. In between, measure.
- Optimistic doesn't avoid contention, it relocates it: instead of waiting for the lock, the caller waits for a successful retry. The work done before the failed commit is wasted.
- Hybrid systems pick per-operation: PostgreSQL's MVCC is optimistic for reads (snapshots) and pessimistic for writes (row locks). HTTP If-Match uses optimistic versioning at the protocol layer.
Interview Follow-up Questions
How does one decide which to use?
Estimate the conflict rate. (1) Multiple users editing the same row: usually low (most rows are touched by one user at a time). Use optimistic. (2) Hot counter incremented by every request: very high. Use sharding (LongAdder, per-CPU counters), not either lock. (3) A connection pool's free-list: low when pool is sized right, high when pool is undersized. Use a small pessimistic lock around the list, but mostly fix the pool sizing. (4) Two cashiers on the same drawer: high. Use pessimistic. The break-even is roughly 10-20% conflict; below that optimistic wins on throughput, above that the retries hurt.
What about livelock with optimistic?
Real risk. Two threads can keep beating each other to the commit, each retrying, neither making progress. Defenses: backoff with jitter between retries, a bounded retry count with a fallback to pessimistic locking, or an atomic helper (incrementAndGet, addAndGet) that typically compiles down to a single hardware fetch-and-add and needs no retry loop at all. For SQL: detect the failed UPDATE, surface a 409 Conflict to the user, let them resolve.
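A Go sketch of the optimistic pattern with CAS: read, compute, commit only if nothing changed, and back off with jitter between retries to avoid livelock (assumes Go 1.19+ for atomic.Int64; the balance/withdraw names, retry cap, and amounts are illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

var balance atomic.Int64

// withdraw retries optimistically; it never holds a lock.
func withdraw(amount int64) bool {
	for attempt := 0; attempt < 10; attempt++ {
		current := balance.Load()
		if current < amount {
			return false // insufficient funds, no retry needed
		}
		if balance.CompareAndSwap(current, current-amount) {
			return true // nothing changed between Load and CAS: commit succeeded
		}
		// Someone else committed first: back off with jitter, then retry.
		time.Sleep(time.Duration(rand.Intn(100)) * time.Microsecond)
	}
	return false // bounded retries; a caller could fall back to a lock here
}

func main() {
	balance.Store(1000)
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); withdraw(10) }()
	}
	wg.Wait()
	fmt.Println("balance:", balance.Load()) // 0 if every withdrawal succeeded
}
```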
Why does MVCC fit here?
MVCC (PostgreSQL, Oracle, InnoDB) is the database scaling story for optimistic reads: every transaction reads a consistent snapshot at its start time, no locks needed, no contention with writers. Writes still take row locks (pessimistic), but reads scale linearly with cores. It's the optimistic answer to 'readers shouldn't block writers, writers shouldn't block readers.' The same pattern shows up in user-space: copy-on-write trees, RCU in the kernel, persistent data structures in functional languages.
How does this map to distributed systems?
etcd's compare-and-swap on a key (compare against expected version, set if matches) is optimistic. Distributed locks (Redlock, Zookeeper ephemeral nodes) are pessimistic. Sagas with compensating transactions are optimistic for the cross-service case (assume each step succeeds, compensate if not). Two-phase commit is pessimistic (lock everything, agree, commit). The cost calculus is the same: optimistic cheaper when conflicts are rare, pessimistic safer when they're common.
Are these always exclusive?
No. Most real systems combine both. PostgreSQL: MVCC reads + row locks for writes. Java StampedLock: optimistic read with fallback to read lock. JPA: @Version optimistic locking inside transactions that may also use row locks for explicit pessimistic operations. The right design often has both, applied per operation based on the expected conflict rate of that operation.
Gotchas
- Optimistic without retry: a single CAS failure becomes a user-visible error. Always handle the false return, either by retrying or surfacing a conflict.
- Pessimistic + user think time: holding a row lock across a UI session is a deadlock factory. Move locks to commit time (optimistic) or shrink the lock window.
- Mixing optimistic and pessimistic on the same data without coordination: an optimistic writer can win the race against a pessimistic reader's expected ordering.
- Versioning a row but never bumping the version on every write path: a code path that forgets to increment version silently breaks optimistic checks.
- Counting CAS failure as a 'lock' in profiles: it isn't; millions of failed CAS in an atomic counter means contention to fix, not lock-elimination work to do.
Deadlock, Livelock, Starvation
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
Deadlock, threads block each other forever. Livelock, threads are running but making no progress. Starvation, one thread never gets the resource it needs because others keep grabbing it first. All three look similar from the outside; the fix is different for each.
Key Points
- Deadlock = circular wait, Thread A holds X waits for Y, Thread B holds Y waits for X
- Coffman conditions: mutual exclusion + hold-and-wait + no preemption + circular wait
- Break ANY ONE of the four → no deadlock
- Lock ordering (always acquire locks in a fixed global order) is the most common deadlock prevention
- Livelock = threads keep retrying and stepping on each other, no blocking, no progress
- Starvation = unfair scheduling, a thread is permanently outrun by others
Interview Follow-up Questions
What are the four Coffman conditions for deadlock?
1) Mutual exclusion (resource can't be shared); 2) Hold-and-wait (thread holds one resource and waits for another); 3) No preemption (resources can't be forcibly taken); 4) Circular wait (a cycle in the wait-for graph). All four must hold simultaneously, break ANY one and deadlock is impossible.
What's the difference between deadlock and livelock?
In deadlock, threads are blocked, doing nothing, waiting forever. In livelock, threads are running and consuming CPU, but their actions cancel out and no progress is made. Livelock is harder to detect because the program looks 'busy.'
How is deadlock prevented in practice?
Most common: enforce a global lock acquisition order. Other tactics: tryLock with timeout (break no-preemption), open call (release the lock before calling out), structured concurrency (locks scoped to a function). The cheapest fix is usually 'don't hold two locks at once.'
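A Go sketch of breaking the circular-wait condition with a global lock order: transfers always lock the lower-ID account first, regardless of direction (the account struct and amounts are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type account struct {
	id      int
	mu      sync.Mutex
	balance int
}

func transfer(from, to *account, amount int) {
	// Fixed global order: lower ID first. A->B and B->A can no longer
	// each hold one lock while waiting for the other.
	first, second := from, to
	if second.id < first.id {
		first, second = second, first
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()

	from.balance -= amount
	to.balance += amount
}

func main() {
	a := &account{id: 1, balance: 100}
	b := &account{id: 2, balance: 100}

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); transfer(a, b, 1) }()
		wg.Add(1)
		go func() { defer wg.Done(); transfer(b, a, 1) }()
	}
	wg.Wait()
	fmt.Println(a.balance, b.balance) // 100 100, and no deadlock
}
```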
Why does Java's ReentrantLock have tryLock and Go's sync.Mutex doesn't (until recently)?
Go's design philosophy is that lock contention should be rare; reaching for tryLock often signals a design problem. Go 1.18 added TryLock anyway for cases that need a fallback code path. Use it sparingly, most uses indicate the lock is in the wrong place.
What's starvation?
When a thread perpetually fails to acquire the resource it needs because other threads keep getting there first. Common with priority schedulers (high-priority threads keep preempting low-priority ones) or unfair locks (newcomers grab the lock before queued waiters).
Gotchas
- Reentrant locks (Java ReentrantLock, Python RLock) prevent self-deadlock but NOT mutual deadlock between threads
- Holding a lock while calling user-supplied code (a callback, a remote service) is a deadlock waiting to happen, that code might try to take a held lock
- Go's runtime detects only 'all goroutines asleep' deadlocks, partial deadlocks (some goroutines stuck, others running) are silent
- Java's jstack and -XX:+PrintConcurrentLocks show deadlocked threads, use them in production hangs
- Database transactions can deadlock too; two transactions taking row locks (SELECT ... FOR UPDATE) in different orders is the SQL version
Spurious and Stolen Wakeups
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
A waiting thread can wake up without anyone calling notify (spurious), or be beaten to the lock by another thread that consumed the condition (stolen). The defence is to always check the condition in a while loop after wait returns, never if.
Key Points
- Spurious wakeup: the OS or runtime wakes a waiting thread for no reason. Allowed by every POSIX-shaped wait API.
- Stolen wakeup: someone signalled, the thread woke up, but another thread grabbed the resource before the first could re-acquire the lock.
- Both are why every pthread / Java / Python tutorial says 'wait inside while, never if.'
- wait/await releases the lock atomically, parks the thread, re-acquires the lock on wake. The re-check happens after re-acquire.
- Channels and semaphores avoid the issue because they have an integer count, not a free-form predicate.
Interview Follow-up Questions
Why are spurious wakeups even allowed?
POSIX condition variables document them as legal because real implementations sometimes find it cheaper or simpler to wake everyone than to track exactly who needs to wake. The rule of 'check in a loop' is part of the contract, so the runtime is free to wake conservatively.
Is signal vs signalAll about spurious wakeups?
No, that is a separate question. signal wakes one waiter, signalAll wakes all. Both can also produce stolen wakeups (another waiter beats the original to the lock). The while loop handles both.
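A Go sketch of the wait-in-a-loop rule with sync.Cond: the predicate is re-checked after every wakeup, which covers both spurious and stolen wakeups (the ready flag and sleep are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var (
		mu    sync.Mutex
		cond  = sync.NewCond(&mu)
		ready = false
	)

	go func() {
		time.Sleep(100 * time.Millisecond)
		mu.Lock()
		ready = true
		mu.Unlock()
		cond.Broadcast() // wake all waiters; each re-checks the predicate
	}()

	mu.Lock()
	for !ready { // for, never if: a wakeup is a hint, not a guarantee
		cond.Wait() // atomically releases mu, parks, re-acquires mu on wake
	}
	mu.Unlock()
	fmt.Println("ready")
}
```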
Do semaphores have the same issue?
No. A semaphore's acquire blocks until count > 0 and decrements atomically. There is no predicate to re-check, no race window between wake and acquire. Same for channel sends and receives.
What about asyncio.Condition.wait?
Same rule: wrap in `while not predicate: await cond.wait()`. asyncio's coroutines do not have the OS spurious-wakeup issue, but the stolen-wakeup case is still real. The while is correct in both.
Priority Inversion
Section: Core Mental Models | Difficulty: Advanced | Frequency: Sometimes
A high-priority thread waits on a lock held by a low-priority thread. A medium-priority thread runs, preempting the low-priority holder, which means the high-priority thread also waits indirectly. The fix is priority inheritance: bump the holder's priority to match the highest waiter.
Key Points
- Three threads, one lock. High waits on Low for the lock. Medium runs, preempting Low. High is blocked indefinitely.
- Famous case: Mars Pathfinder, 1997. The spacecraft kept resetting due to a priority inversion on a VxWorks mutex.
- Priority inheritance: while Low holds the lock, raise its priority to High's so Medium cannot preempt.
- Common in real-time systems, audio drivers, embedded code, kernel synchronisation. Less common in Java/Go/Python.
- JVM and most modern OS schedulers do not implement priority inheritance by default; problem solved by avoiding strict thread priorities in the first place.
Interview Follow-up Questions
What is priority inheritance?
A scheduling rule: while a thread holds a lock that a higher-priority thread is waiting for, the holder is temporarily promoted to that priority. The medium-priority thread can no longer preempt it. Used in real-time OS kernels (VxWorks, RT Linux, FreeRTOS).
Does Java do priority inheritance?
No. The JVM's thread priorities are advisory and map onto OS scheduler priorities, which usually do not implement inheritance. The practical advice for Java code is to avoid relying on thread priority for correctness.
How was the Mars Pathfinder bug fixed?
The team enabled priority inheritance on the VxWorks mutex used by the bus management task. NASA pushed a configuration change to the spacecraft remotely. After that, the high-priority task could no longer be indirectly blocked by medium-priority work.
When does this matter for normal application code?
Rarely. It comes up when thread priorities are set, the OS respects them strictly (real-time configurations), and locks are shared across priority levels. Most servers ignore priorities and the problem does not occur. The lesson generalises though: any time a fast operation can wait behind a slow one, the shape is inversion-like.
Memory Visibility & CPU Reordering
Section: Core Mental Models | Difficulty: Advanced | Frequency: Asked Often
Modern CPUs and compilers reorder reads and writes for performance. Without explicit synchronization, one thread's writes may NEVER become visible to another thread, or may appear in a different order. This is the deepest reason concurrent code breaks, and skipping it means writing code that "works on a laptop, fails in production."
Key Points
- Without a happens-before relationship between two threads, there is NO guarantee one thread's write is ever seen by the other
- CPUs have per-core caches, a write on Core 1 may sit in Core 1's cache forever, invisible to Core 2
- Compilers reorder instructions for optimization; CPUs reorder them for pipelining
- The 'obvious sequential order' of source code is not what runs on the hardware
- x86 has a strong memory model and hides most reordering bugs; ARM and POWER expose them
- ALL modern languages have a memory model, Java JMM, Go, C++11, even Python (informally)
- Synchronization primitives (volatile, atomic, mutex, channel) insert MEMORY BARRIERS that flush caches and prevent reordering
Interview Follow-up Questions
What's a memory barrier?
A CPU instruction (or compiler directive) that prevents reordering across it and flushes caches. Different barriers do different things: a 'release' barrier ensures prior writes are visible; an 'acquire' barrier ensures subsequent reads see those writes; a 'full' barrier does both. Java's volatile, Go's sync/atomic, and C++11 atomics all emit appropriate barriers.
Why does x86 hide so many reordering bugs?
x86 has a TSO (total store order) memory model, most reads and writes can't be reordered with each other (only store-load can). ARM and POWER have weaker models that allow more reordering. Code that 'works' on Intel laptops can fail on Apple Silicon, AWS Graviton, mobile devices, or any non-x86 production target.
What's the difference between visibility and ordering?
Visibility: when does a write become observable to another thread? Without synchronization, possibly never. Ordering: when multiple writes happen, in what order do other threads see them? Without synchronization, any order. Both are required for correct concurrent code; both are provided together by proper synchronization primitives.
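A Go sketch of the publication problem: without synchronization the reader might never see the write, or might see the flag before the data. The atomic flag supplies both visibility and ordering (assumes Go 1.19+ for atomic.Bool; the answer/done names are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	answer int
	done   atomic.Bool // a plain 'var done bool' here would be a data race
)

func producer() {
	answer = 42      // ordinary write...
	done.Store(true) // ...published by the atomic store
}

func main() {
	go producer()
	for !done.Load() { // atomic load pairs with the store above (demo spin loop)
	}
	fmt.Println(answer) // guaranteed to print 42, not 0
}
```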
Are 64-bit reads/writes atomic in Java?
Not always. The JLS guarantees atomicity of 32-bit ops; 64-bit long/double reads/writes can be torn on 32-bit JVMs unless declared volatile. On 64-bit JVMs and HotSpot in practice they're atomic, but the spec doesn't require it. Use volatile or AtomicLong to be safe.
If a Mutex is used everywhere, is reordering still a concern?
Inside the critical section, no, Mutex.acquire/release are full memory barriers. Between critical sections, yes, the lock only orders accesses that happen WHILE it is held. Common bug: reading shared state outside the lock 'because it's just a check.'
Gotchas
- Tests pass on x86 dev laptops, fail on ARM CI runners, TEST ON BOTH
- Compiler optimizations can hoist a volatile-less read out of a loop entirely → 'why is my thread spinning forever?'
- long/double on 32-bit JVMs aren't atomic without volatile, torn reads possible
- Java: writing to a field after publishing a reference to its containing object can be invisible to readers who don't synchronize
- Go: copying a sync.Mutex breaks it (go vet catches); copying a struct with atomic fields breaks them
- Python's GIL makes individual bytecode operations atomic, but compound operations can still interleave between bytecodes, so check-then-act races remain
Linearizability and Sequential Consistency
Section: Core Mental Models | Difficulty: Advanced | Frequency: Sometimes
Linearizability says every operation appears to take effect at one instant between its call and its return, and that instant respects real time across threads. Sequential consistency is weaker: each thread's operations stay in program order and there is some single global order, but it doesn't have to match wall-clock time. Linearizability composes across objects; sequential consistency does not.
Key Points
- Linearizability: each operation has a single 'effect point' that lies between its invocation and its response, and the order of effect points is consistent with real time.
- Sequential consistency: there exists some interleaving of all operations that respects each thread's program order, but the interleaving does not have to match real time.
- Linearizability is composable: if every object is linearizable, the system is linearizable. Sequential consistency is not.
- Java volatile reads and writes on a single variable are linearizable; combining multiple volatiles or doing read-modify-write is not. Use java.util.concurrent.atomic for linearizable RMW.
- Redis is not linearizable by default. etcd, ZooKeeper, and DynamoDB conditional writes are.
- Linearizability has a real cost: in an async system with failures, you can't have linearizable reads with sub-RTT latency (Attiya and Welch, 1994).
Interview Follow-up Questions
If linearizability is just sequential consistency plus a real-time constraint, why does the real-time part matter so much?
Two reasons. First, it is what client-visible 'happens after' means: if write A returned before read B started, B should see A. Without that, a client cannot trust its own ordering. Second, linearizability composes. If every object in a system is linearizable, the whole system is. Sequential consistency does not compose: a system built from two SC objects can produce histories that are not SC overall.
Is volatile in Java linearizable?
A single volatile read or write on a single variable is linearizable on its own. The write has one effect point (the store), the read has one (the load), and the JMM guarantees the global total order matches real time for that variable. The catch: combining multiple volatile fields or doing read-modify-write loses linearizability on the combination. That is what AtomicLong, AtomicReference, and CAS are for.
Is Redis linearizable?
Out of the box, no. Standalone Redis with a single primary is linearizable for reads from the primary, but the moment replication is added, replicas can return stale values. Redis Sentinel and Cluster do not give linearizable reads. Use WAIT for stronger guarantees, or pick a system designed for linearizability (etcd, Zookeeper, DynamoDB strong reads) when correctness depends on it.
Why do people say linearizability is expensive?
Two costs. Latency: linearizable reads in a replicated system require talking to a quorum (or the leader) to confirm freshness, so they cannot be served from a local replica. Availability: the CAP triangle says under partition the choice is consistency or availability. A linearizable system gives up availability during a partition. For systems where stale reads are OK, that price isn't worth paying.
Gotchas
- Composing two sequentially consistent objects can give a non-SC system. Composing two linearizable objects always gives a linearizable system.
- A linearizable register is what most papers mean by 'atomic register.' The terms are interchangeable in the literature.
- Don't conflate linearizability with serializability. Linearizability is about single-object real-time order. Serializability is about multi-operation transactions.
- A snapshot read across two unrelated linearizable objects is not itself linearizable. A transaction or a globally-ordered log is required for that.
Atomic Operations & CAS
Section: Core Mental Models | Difficulty: Advanced | Frequency: Asked Often
Atomic operations are indivisible CPU instructions that read-modify-write a single memory location safely, without locks. Compare-and-swap (CAS) is the foundation, "if this value is still X, replace it with Y", and it's how every lock-free data structure is built.
Key Points
- An atomic operation is a single CPU instruction that can't be interrupted partway through
- CAS = compare-and-swap: 'if memory[addr] == expected, write new and return success; else fail and return current value'
- Lock-free programming uses CAS in retry loops, never blocks, but may spin under contention
- Atomics give you single-variable safety; for multi-variable invariants you still need a lock
- ABA problem: a value goes A → B → A while you're not looking; CAS thinks nothing changed
- LongAdder/striped counters trade memory for throughput under high contention
Interview Follow-up Questions
What's the ABA problem?
A thread reads value A, plans to CAS A → C. Before the CAS, another thread changes A → B → A. Your CAS succeeds because the value is back to A, but the world changed in between. Common in lock-free linked lists where a node is removed and re-added. Solutions: tagged pointers (include a version counter), hazard pointers, or use a higher-level primitive.
When does a CAS retry loop become a busy-wait?
Under heavy contention, every CAS may fail and retry. CPU goes to 100%, no useful work happens. Mitigations: exponential backoff (sleep between retries), give up after N tries and use a lock fallback, or shard the contention (LongAdder). If you see a CAS loop dominating CPU profile, that's the symptom.
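A Go sketch of a classic CAS retry loop: atomically record a new maximum. A failed CAS means another goroutine won the race between the Load and the swap; re-read and try again (assumes Go 1.19+ for atomic.Int64; the highWater name is illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var highWater atomic.Int64

func recordMax(v int64) {
	for {
		current := highWater.Load()
		if v <= current {
			return // nothing to do
		}
		if highWater.CompareAndSwap(current, v) {
			return // we won the race
		}
		// CAS failed: another goroutine updated the value in between; retry.
	}
}

func main() {
	var wg sync.WaitGroup
	for i := int64(1); i <= 1000; i++ {
		wg.Add(1)
		go func(v int64) { defer wg.Done(); recordMax(v) }(i)
	}
	wg.Wait()
	fmt.Println(highWater.Load()) // 1000
}
```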
Are AtomicReference assignments atomic for compound operations?
AtomicReference.set() is atomic. compareAndSet is atomic. But '`if (ref.get() == X) { ref.set(Y) }`' is NOT atomic, between the get and set, another thread can race in. Use compareAndSet, or wrap in a lock.
Why is volatile not enough for counter++?
volatile makes the read and write each atomic and visible. But ++ is read-modify-write, three operations. Two threads can both read 5, both compute 6, both write 6, losing one update. AtomicInteger.incrementAndGet uses CAS to make the whole thing atomic.
How does CAS work at the CPU level?
x86 has the LOCK CMPXCHG instruction; ARM has LDREX/STREX or CAS (since ARMv8.1). The CPU asserts exclusive cache line ownership for the duration, performs the compare and conditional write, releases. Cost is roughly an L2 cache miss, fast but not free.
Gotchas
- ABA problem on linked-list CAS, value comes back to original after intermediate change
- Reading an atomic field non-atomically (`int x = atomic.Load(); other code; use x`), value can change between read and use
- AtomicInteger.incrementAndGet() is atomic; AtomicInteger.get() + 1 followed by set() is NOT
- Hot CAS loop on a single AtomicLong from many threads = CPU bottleneck, switch to LongAdder
- Java: AtomicInteger.lazySet vs set, lazySet doesn't establish happens-before, can lose visibility
- Go: the sync/atomic functions take a pointer to the variable (&counter); mixing atomic and plain access to the same variable, or copying a struct that contains atomic values, is silently wrong
False Sharing
Section: Core Mental Models | Difficulty: Advanced | Frequency: Sometimes
Two unrelated variables can sit on the same CPU cache line. When two threads write to them concurrently, the cache line ping-pongs between cores even though the variables are not actually shared. The code is correct but slow. Fix by padding to one variable per cache line.
Key Points
- CPU cache lines are typically 64 bytes on x86 and ARM (128 on Apple Silicon). The smallest unit of cache coherence is the line, not the variable.
- If two threads write variables in the same line, every write invalidates the other core's cache copy.
- Symptom: linear-looking code that does not scale with cores. Adding threads makes it slower.
- Fix: pad each hot variable to a full cache line, or use frameworks (LongAdder, padded atomics) that already do it.
- Most common in counters, queues, lock state, per-thread statistics.
Interview Follow-up Questions
Why does the cache work in 64-byte chunks even when only 8 bytes get written?
DRAM is slow. Caches amortise the cost by fetching whole lines when any byte in them is touched. The line is also the unit of the coherence protocol (MESI on x86). When one core writes, the protocol has to invalidate the line on every other core, regardless of which byte changed.
How is a false-sharing-affected program identified?
Two clues. First, throughput drops or even reverses as cores are added. Second, perf c2c on Linux shows cache-line modify traffic. To test a suspected hot field, run a quick benchmark with padding and see if it speeds up.
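A Go sketch of the padding fix, assuming 64-byte cache lines (the struct layout and the [56]byte filler are illustrative, not a library API):

```go
package main

import "sync/atomic"

// Unpadded: a and b likely share a cache line; two writer goroutines
// keep invalidating each other's copy even though the data is unrelated.
type counters struct {
	a atomic.Int64
	b atomic.Int64
}

// Padded: each counter is pushed onto its own 64-byte line
// (8-byte value + 56 bytes of padding).
type paddedCounters struct {
	a atomic.Int64
	_ [56]byte
	b atomic.Int64
	_ [56]byte
}

func main() {
	var c paddedCounters
	c.a.Add(1) // one goroutine hammering this field...
	c.b.Add(1) // ...and another hammering this one no longer share a line
}
```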
Does false sharing happen with reads only?
No. Reads can be served from each core's local cache copy, so concurrent reads are fine. The problem is writes: a write invalidates copies on every other core, and they have to refetch.
Why does LongAdder beat AtomicLong under contention?
Two reasons. It splits the counter into per-thread cells, so threads usually update different cells. And each cell is padded to its own cache line, so the cells do not false-share with each other. Together: less contention AND less coherence traffic.
Backpressure: Strategies and Signals
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
When the producer is faster than the consumer, something has to give. Backpressure is how the system tells the producer to slow down (or what happens when it can't). Four strategies: block the producer, drop new work, shed load with errors, or propagate the slowdown back through the call chain. Picking the wrong one is how you turn a slow consumer into a memory blowup or a thundering retry storm.
Key Points
- Without backpressure, a slow consumer forces unbounded memory growth or work loss
- Block the producer (bounded queue blocks on full): natural backpressure, default for most pipelines
- Drop incoming (drop newest or drop oldest): for telemetry where stale data is worse than missing data
- Shed load (return 503 / fail fast): for request paths where the client can retry or degrade
- Propagate (the slow downstream signals the upstream caller): for end-to-end flow control across services
Interview Follow-up Questions
What happens without backpressure?
Two failure modes. Memory grows unbounded as the queue fills with work the consumer can't process; eventually OOM kills the process. Or work piles up and gets dropped silently when the queue overflows, which looks like a service that just loses messages with no error. Both are the kind of bug that's invisible until production.
Block, drop, or shed: how to choose?
Block when the producer has nothing better to do (batch jobs, ETL pipelines). Drop when stale data is worse than missing data (telemetry, sampled metrics, real-time logs). Shed when the client can retry or degrade (HTTP requests, RPC calls). Propagate when the upstream caller has a budget that can be respected (request handlers, gRPC chains).
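A Go sketch of two of these strategies on one bounded channel: a plain blocking send throttles the producer, while a select-with-default drops (and counts) instead. The buffer size, item counts, and sleep are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	queue := make(chan int, 8) // bounded: the buffer IS the backpressure
	dropped := 0

	// Slow consumer.
	go func() {
		for item := range queue {
			time.Sleep(10 * time.Millisecond)
			_ = item
		}
	}()

	for i := 0; i < 100; i++ {
		// Strategy 1 (block): queue <- i  // producer slows to the consumer's pace

		// Strategy 2 (drop newest): never block, but count what was lost.
		select {
		case queue <- i:
		default:
			dropped++
		}
	}
	close(queue)
	fmt.Println("dropped:", dropped) // alert on this number in production
}
```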
Why is propagation special?
It's the only strategy that handles cross-service overload gracefully. When a service is slow because its downstream is slow, propagating the deadline tells callers to give up too. They redirect, cache, or fail fast. The alternative (everyone blocks waiting on slow downstream) cascades the failure upward.
How does backpressure relate to the bulkhead pattern?
Bulkhead is one mechanism that produces backpressure. A bulkhead caps how many concurrent calls can be in flight to a downstream; when it's full, new callers fail fast (load shed). The bulkhead is the implementation, backpressure is the broader concept of 'how the system signals overload'.
Gotchas
- Unbounded queue = no backpressure = memory blowup under sustained overload
- Drop without metrics = silent data loss (count and alert on drops)
- Shed without Retry-After = client retries immediately, makes things worse
- Block in a request handler = thread starvation; combine with timeouts
- Propagate without timeouts = nothing actually propagates; in Go, ctx.Done() is the signal each layer must watch
The Colored Function Problem
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Sometimes
In some languages (Python, JavaScript, Rust), async functions and sync functions are different colours. Sync code can't call async without an event loop; async code can't call sync blocking calls without freezing the loop. The colour propagates upward through every caller. Go and Java 21+ virtual threads avoid the problem by making the runtime handle async transparently.
Key Points
- Python / JavaScript / Rust: async is a function colour. Async can call sync; sync cannot directly call async (needs a loop).
- Once async is added at any layer, every caller above must also be async, or pay an awkward bridge cost.
- Go avoids it: every function looks blocking; the runtime parks goroutines on I/O. No async/await syntax.
- Java 21+ virtual threads do the same: write blocking-style code; the JVM provides async-like efficiency.
- Cross-colour bridging exists (asyncio.to_thread, asyncio.run, asgiref.sync) but it's friction, not a clean solution.
Interview Follow-up Questions
Why did Python and JavaScript end up with colored functions?
Both have single-threaded runtimes (well, JS does; Python's GIL makes it effectively single-threaded for CPU work). Async/await was added later as a way to multiplex I/O on one thread without callbacks. The syntax made the colour explicit. In contrast, Go and Java were designed (or redesigned) with native multithreading; they could provide blocking-style code with multiplexing under the hood.
Is the colored function problem actually a problem?
It's friction, not a blocker. Working systems exist in all of these languages. The issue is that adopting async at one layer forces it on all callers above. Half-async codebases (some sync, some async) spend energy on bridges that don't add user value. Go and Java 21+ skip this entire category of work.
What about Kotlin coroutines?
Kotlin coroutines are colored too: a `suspend fun` can only be called from another suspend function or a coroutine builder. The colour is `suspend` instead of `async` but the propagation rule is the same. Kotlin's win is structured concurrency primitives (CoroutineScope, supervisorJob) that make the propagation manageable.
When would the explicit-colour approach be preferred?
When explicit visibility into what blocks vs what doesn't is the priority. The `async` keyword in Python/JS/Rust forces a call-site decision. Go and Java 21+ trade this visibility for ergonomics; the call site doesn't reveal whether the call will park the virtual thread. Both choices are defensible; pick based on what the team optimises for.
Gotchas
- Calling an async function in sync code without a loop returns a coroutine/Promise, not a value
- Calling sync blocking I/O in async code freezes the entire event loop
- Mixing thread-based code (threading.Lock) with asyncio is the worst of both worlds
- Adopting async at one layer forces async on every caller above (the 'colouring' propagates)
- Even Go and Java virtual threads can be pinned or starved by blocking native calls (cgo, JNI) that the runtime cannot park
When NOT to Use Concurrency
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Sometimes
The best concurrent code is usually no concurrent code. Every added thread or task costs memory, debugging difficulty, and a chance for race conditions. Reach for concurrency only after measuring a real bottleneck where the workload's nature (I/O wait, CPU parallelism, latency hiding) actually benefits.
Key Points
- Concurrency is a cost, pay it only when the benefit is measured, not assumed
- Premature concurrency is worse than premature optimization, it can't be taken back without rewriting
- If the bottleneck is the database, adding threads in the app server doesn't help
- Under 1000 simultaneous users, async/await is rarely needed
- Single-threaded event-loop services (Redis, Node.js) prove single-threaded scales further than people think
- The ONE-CORE rule: if it can be solved on one core, do it on one core
Interview Follow-up Questions
When should code become concurrent?
When measurement shows it's I/O-bound with lots of concurrent operations (use threads or async), OR it's CPU-bound and the work parallelizes (use processes/goroutines). Don't add concurrency to make a single user request faster, that almost never works.
If concurrency is so risky, why does any production code use it?
Because at scale the single-threaded ceiling is real, Redis hits ~100K ops/sec on one core, then needs sharding. Most apps run a process per CPU core (gunicorn, Puma) and let the OS schedule them. The concurrency boundary is process-level, kept simple. Within each process: as little concurrency as possible.
What's the cheapest 'concurrency' that's actually free?
Running multiple processes (one per CPU). No shared state, no locks, OS handles scheduling. This is what most modern web stacks do, it's why Python web servers scale despite the GIL. The 'concurrency' is between requests/processes, not within them.
Is async worth using even for a small service?
Probably not, unless the framework is async-first (FastAPI, Sanic). The migration cost from sync to async is high (every library, every test, mental model change) and the benefit only kicks in at high concurrent connection counts.
When is concurrency NOT a choice, i.e., truly required?
GUI threads (must not block the UI), real-time systems (must meet deadlines), high-frequency trading (latency-bound), and serving > 10K concurrent connections (memory-bound on threads). For everything else, it's a choice, and often the wrong one.
Gotchas
- Threading.Thread() in a script that runs <1 minute → complexity cost paid for nothing
- asyncio.gather on 10 items → not enough concurrency to matter; just loop
- Adding sharding before measuring single-shard limits → premature, costs cohesion
- Concurrent code reads as 'sophisticated' in code reviews, bias toward simplicity
- 'It might scale' is not a reason to add concurrency, measure first
Producer-Consumer Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
One or more producers add work items to a bounded buffer; one or more consumers remove and process them. The buffer decouples their rates; the synchronization makes the buffer safe and prevents busy-waiting.
Key Points
- Bounded buffer: producers block when full, consumers block when empty
- Two condition variables (notFull, notEmpty), never one, or you risk lost wakeups
- Always wait in a while-loop, never if, to handle spurious and stolen wakeups
- In Go, a buffered channel IS the entire pattern, no extra synchronization needed
- In Python, queue.Queue is the canonical impl, don't roll your own
- Graceful shutdown: producer closes channel / sends sentinel; consumer detects and exits
Interview Follow-up Questions
Why use TWO conditions (notFull, notEmpty) instead of one?
With a single condition, signal() might wake the wrong type of waiter, e.g. a producer when the buffer is empty. notifyAll() would be required for correctness, which is wasteful.
Why must wait() be in a while loop, not if?
Spurious wakeups exist (JVM/OS can wake without signal) and stolen wakeups (another thread races in and changes state). Re-checking the condition is the only safe pattern.
In Go, who should close the channel?
The producer side. Never close from the consumer (sender doesn't know it's safe to send) and never close twice. Multi-producer: use sync.WaitGroup + a single closer goroutine.
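A Go sketch of the whole pattern: the buffered channel is the bounded buffer, and closing it from the producer side signals "no more items" (buffer size, item count, and two consumers are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int, 4) // bounded buffer: producer blocks when full

	// Producer: owns the channel, so it is the one that closes it.
	go func() {
		for i := 0; i < 10; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	// Two consumers: range exits cleanly once the channel is closed and drained.
	var wg sync.WaitGroup
	for c := 0; c < 2; c++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for job := range jobs {
				fmt.Printf("consumer %d got %d\n", id, job)
			}
		}(c)
	}
	wg.Wait()
}
```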
How is 'no more items' signalled in Java BlockingQueue?
Use a sentinel value (poison pill) or a separate completion flag; BlockingQueue has no built-in 'closed' state like Go channels.
How does Python's queue.Queue handle shutdown?
Same sentinel pattern as Java; put None on the queue, consumers check for it. With multiple consumers, re-put the sentinel after seeing it so the next consumer also sees it.
Gotchas
- Closing a Go channel twice → panic. Sending on a closed channel → panic.
- wg.Add() must run BEFORE 'go ...' or it races with wg.Wait()
- Spurious wakeups: never use 'if' instead of 'while' around await()
- BlockingQueue.put vs offer vs add: put blocks, offer returns false, add throws, pick the right one
- Forgetting to handle InterruptedException = thread silently lives on
- Python: forgetting to put the sentinel = consumer hangs forever
Worker Pool Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
A fixed pool of N worker threads/goroutines pulls jobs off a shared queue and processes them in parallel. It bounds resource usage (you can't run out of memory by spawning a thread per request) and keeps overhead constant under load.
Key Points
- Fixed N workers > thread-per-task: bounded memory, predictable overhead
- Workers loop: pull from queue, process, repeat, until queue closes
- Result delivery: separate result channel, Future objects, or callback
- Sizing: ~core count for CPU work, much higher for I/O, use Goetz's formula
- Always have a shutdown story: close queue → workers drain → join all
- Error propagation matters: one worker failing should signal the rest
Interview Follow-up Questions
How is a worker pool sized?
CPU-bound: pool ≈ core count. I/O-bound: Goetz's formula → cores × (1 + wait/compute). For 90% I/O work, that's ~10× cores. Empirically tune: ramp up workers, watch throughput plateau, back off slightly.
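A Go sketch of a fixed pool sized for CPU-bound work: N = core count workers pull jobs from one channel and push results to another, and shutdown is close the jobs channel, drain, wait (the squaring job is a stand-in for real work):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	jobs := make(chan int)
	results := make(chan int)

	numWorkers := runtime.NumCPU() // ~core count for CPU-bound work
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs { // loop until the jobs channel is closed
				results <- n * n // stand-in for real CPU work
			}
		}()
	}

	// Close results only after every worker has drained and exited.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Submit work, then signal "no more jobs".
	go func() {
		for i := 0; i < 20; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println("sum of squares:", sum)
}
```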
What's the right rejection policy when the queue is full?
Default in Java's ThreadPoolExecutor is to throw (RejectedExecutionException). Production-favorite: CallerRunsPolicy, the submitter does the work, providing natural backpressure. DiscardPolicy silently drops tasks (use only if dropping is acceptable, e.g. metrics). Never use unbounded queues in production.
Why is newCachedThreadPool() dangerous?
It uses a SynchronousQueue with no upper bound on threads. Under sustained load, every submission spawns a new thread, until the JVM hits the OS thread limit (typically a few thousand) and crashes. Stick to fixed-size pools.
Should each request get its own thread, or share a pool?
Share a pool. Thread-per-request is fine at small scale (a few hundred req/sec) and is what Java pre-21 servlet servers do, but it doesn't bound resource usage. Modern advice: virtual threads (Java 21+) for the request-per-thread model with effectively no upper bound; fixed pool for CPU-heavy work.
How are errors propagated from a worker pool?
Java: Future.get() throws ExecutionException wrapping the worker's exception. Python: future.exception() or .result() raises. Go: errgroup.Group.Wait() returns the first non-nil error. Custom pools need an error channel + a 'first failure cancels all' protocol.
Gotchas
- Java's newFixedThreadPool uses an unbounded LinkedBlockingQueue by default, bursts → OOM. Use ThreadPoolExecutor explicitly with ArrayBlockingQueue.
- Forgetting to call shutdown() on an ExecutorService → JVM doesn't exit
- Forgetting to close the work channel in Go → workers block forever waiting for more
- Calling .get() on a Future inside a worker → can deadlock if the pool is full and the worker is waiting for another task to complete
- Task throws an exception, future swallows it, always log future.exception() or wrap in try/catch
- ProcessPoolExecutor with closures or lambdas → pickle errors; use top-level functions
Pipeline Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
A pipeline chains stages where each stage transforms data and passes it to the next. Each stage runs concurrently; bounded queues between stages give backpressure for free. Dominant pattern for stream processing, ETL, and any 'producer → transform → transform → consumer' shape.
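A two-stage Go sketch of that shape (stage names and buffer sizes are illustrative): each stage owns its output channel, closes it when its input is exhausted, and the small bounded buffers are what give backpressure.

```go
package main

import "fmt"

// generate emits 1..n and closes its output when done.
func generate(n int) <-chan int {
	out := make(chan int, 4) // small bound: a slow downstream throttles us
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i
		}
	}()
	return out
}

// square is a middle stage: read from in, write to out, close out on exit.
func square(in <-chan int) <-chan int {
	out := make(chan int, 4)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

func main() {
	for v := range square(generate(5)) {
		fmt.Println(v) // 1 4 9 16 25
	}
}
```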
Key Points
- Each stage is a concurrent worker (or pool) reading from one queue, writing to the next.
- Bounded queues between stages give automatic backpressure: a slow stage throttles upstream.
- The slowest stage is the bottleneck. Throughput equals throughput of the slowest stage.
- Shutdown propagates downstream: close the input, the stage drains and closes the output.
- Errors are tricky: cancel the whole pipeline, or skip-and-log, or dead-letter the bad item. Pick a policy.
Interview Follow-up Questions
What if one stage is much slower than the others?
Two options. Scale that stage horizontally: run multiple workers reading from the same input queue. Or rebalance: do less work in that stage, more in faster ones. The bounded queue makes the slowness visible (it fills up, upstream blocks); tune from there.
How are errors propagated through a pipeline?
Three patterns. Cancel-on-error: every stage shares a context; first error cancels everything (errgroup-style). Skip-on-error: log the error, continue with the next item. Dead-letter: send the bad item to a separate error channel for later inspection. The right choice depends on whether each item is independent.
When to use a pipeline vs fan-out?
Pipeline: each item flows through stages in sequence (A → B → C → D). Fan-out: multiple workers process items independently. They compose: a pipeline stage can fan out internally (10 workers all reading from the input queue, all writing to the output). Use pipeline when there is a sequence; fan-out when there is parallelism within a stage.
Gotchas
- Forgetting to close the output channel leaves downstream goroutines blocked forever
- Unbounded queues between stages defeat backpressure; bound everything
- Errors in the middle of the pipeline can leave stages running with no input/output
- Slow consumer at the end backs up everything; monitor queue depths
Fan-Out / Fan-In
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Fan-out: distribute work to multiple workers concurrently. Fan-in: merge their results back into a single stream. The two together parallelise the middle of a pipeline. Standard shape: one input, N workers each doing the same transformation, one output combining results.
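A hedged Go sketch of that shape (names are illustrative): three workers read from one shared input channel (fan-out); merge starts one copier goroutine per worker output and closes the combined channel when all of them finish (fan-in).

```go
package main

import (
	"fmt"
	"sync"
)

// worker fans out: reads from the shared input, writes to its own output.
func worker(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

// merge fans in: one copier goroutine per input, close output when all are done.
func merge(chans ...<-chan int) <-chan int {
	out := make(chan int)
	var wg sync.WaitGroup
	for _, c := range chans {
		wg.Add(1)
		go func(c <-chan int) {
			defer wg.Done()
			for v := range c {
				out <- v
			}
		}(c)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	in := make(chan int)
	go func() {
		defer close(in)
		for i := 1; i <= 9; i++ {
			in <- i
		}
	}()
	// Three workers share the input; result order is not preserved.
	for v := range merge(worker(in), worker(in), worker(in)) {
		fmt.Println(v)
	}
}
```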
Key Points
- Fan-out: one input channel/queue feeds N workers. Each worker reads independently.
- Fan-in: N output channels merge into one. With channels, this is N goroutines copying to a shared output channel.
- Order is not preserved by default; preserving it requires a sequence number and a re-sort downstream.
- Worker count = degree of parallelism. Cap it (errgroup.SetLimit, semaphore, ThreadPoolExecutor's max_workers).
- Combine with retry, circuit breaker, timeout for production-grade fan-out.
Interview Follow-up Questions
How many workers should run?
For CPU-bound work, equal to or slightly less than core count. For I/O-bound work, much higher (50, 100, 500), bounded by the downstream's concurrency limit. The wrong default is 'as many as possible'; that exhausts file descriptors, overwhelms the downstream, and yields nothing. Measure throughput at different worker counts and find the knee.
How is order preserved?
Tag each input with a sequence number. Fan out. In the consumer, re-sort by sequence. Or use a single-threaded consumer with a min-heap to stream in order. Or use a partitioned approach where each partition is single-threaded but partitions are parallel.
What if some workers are much slower than others?
With a shared input channel, work-stealing happens automatically: fast workers grab more items. With per-worker input queues, balancing has to be manual. Default to shared input unless there is a specific reason for per-worker queues (cache locality, stateful workers).
How are errors propagated?
Two patterns. Cancel-on-first-error: shared context, first error cancels the rest (errgroup, asyncio.TaskGroup). Or collect errors and continue: each worker logs/returns its error, the consumer decides. Cancel-on-first is the default for transactional fan-out; collect-and-continue for batch processing where partial success is fine.
Gotchas
- Forgetting to close the output channel leaves the consumer blocked forever
- Unbounded worker count exhausts FDs and overwhelms downstream
- Order is lost unless explicitly preserved with sequence numbers
- One slow worker doesn't slow others if input is shared, but slows the overall completion
- Errors in one worker can leak resources if not propagated through cancellation
Futures and Promises
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
A future is a placeholder for a value that will be available later. A promise is the producer side that fulfils the future. The pair allows starting work asynchronously and consuming the result with backpressure (the consumer waits if not ready). Foundational primitive in async programming.
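Go has no built-in future type, but a one-shot channel plays both roles: the goroutine holding the send side is the promise, the returned receive-only channel is the future. A minimal sketch under that framing (the names and sleep are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

type result struct {
	value string
	err   error
}

// fetch starts work and immediately returns a "future": a channel that will
// carry exactly one result. The spawned goroutine is the "promise" side.
func fetch(url string) <-chan result {
	out := make(chan result, 1) // buffered so the producer never blocks
	go func() {
		time.Sleep(100 * time.Millisecond) // stand-in for real I/O
		out <- result{value: "response for " + url}
	}()
	return out
}

func main() {
	f := fetch("https://example.com") // work starts now
	// ... do other things concurrently ...
	r := <-f // consume the future: blocks only if the result isn't ready yet
	fmt.Println(r.value, r.err)
}
```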
Key Points
- Future = consumer-side handle (read result, wait, attach callbacks). Promise = producer-side handle (set value or error).
- Many languages have only one type that does both jobs: CompletableFuture (Java), Future (Python), Promise (JS), Future (Rust).
- Composition: chain futures (then/map/flatMap), wait for all (allOf/gather), wait for any (anyOf).
- Cancellation is the hard part. Many implementations support cancelling the future, but stopping the actual work depends on the producer cooperating.
- Don't block in a future-returning function. The whole point is async; calling .get() inside an async function defeats the purpose.
Interview Follow-up Questions
What is the difference between Future and Promise?
Future is the read side: the consumer reads the value when ready. Promise is the write side: the producer fulfils the future. JavaScript and Scala keep them as one Promise type that does both. Java keeps them separate (Future read-only, CompletableFuture both). C++ has std::future and std::promise as a clear pair. Conceptually they are two ends of the same thing.
How is a future cancelled?
Depends on the implementation. CompletableFuture.cancel(true) marks it cancelled and any caller of .get() throws CancellationException, but the underlying work may keep running unless it cooperates (checks isCancelled, responds to interrupt). asyncio.Task.cancel raises CancelledError at the next await. Channels in Go don't have built-in cancellation; pass a context.
Why can chaining futures with then/map be cleaner than awaiting each?
For a sequence of dependent operations (A → B → C), the chained version reads as a flat pipeline and can run continuations on a different executor than the source. The await/blocking version is also fine; it depends on whether dataflow style or imperative style is preferred. Both end up with the same semantics.
What is a 'completed' future?
A future that already has a value or an error. Useful for testing, for known-static results, or as the start of a chain. CompletableFuture.completedFuture(x) in Java, or an asyncio.Future whose set_result(x) has already been called, gives you one immediately.
Gotchas
- Blocking on .get() inside async code defeats the purpose; chain or await instead
- Cancellation marks the future cancelled but may not stop the underlying work
- Java CompletableFuture's default executor (ForkJoinPool.commonPool) is shared; specify an executor for I/O-bound chains
- Composing futures with shared mutable state still needs synchronisation
- exceptionally / catch resets the error chain; subsequent stages run on the recovered value, not the error
Reactor Pattern and the Event Loop
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
One thread, one loop. The loop asks the kernel which file descriptors are ready, fires the registered handlers for each, and goes back to the kernel. No threads block. This single design powers Node.js, Python asyncio, Netty, Tokio, libuv, nginx, and the Go netpoller. It is the answer to 'how do I serve 100K connections without 100K threads?'
Key Points
- Three pieces: a demultiplexer (epoll, kqueue, IOCP, io_uring), a queue of handlers keyed by FD, and a loop that drives them.
- Loop body: block in epoll_wait, get a list of ready FDs, run each FD's handler to completion, then loop again.
- Handlers must be non-blocking. Any blocking call (sync I/O, sleep, CPU-heavy work) freezes the entire loop and stalls every other connection.
- CPU-heavy work goes to a worker pool (Node.js worker_threads, Python's run_in_executor); the result comes back as a callback or future on the loop.
- Single-threaded by design. Removes data races inside the handler. Costs multi-core scaling unless multiple loops run (one per core, share-nothing).
- The 'colored functions' problem comes from this pattern: handlers and the things they call must agree on the same async ABI (callback, promise, async/await).
Interview Follow-up Questions
Why is the reactor single-threaded?
Two reasons. Simplicity: handlers run sequentially in one thread, so locks are never needed for state inside the loop. Cache: one thread on one core keeps L1/L2 hot for the loop and its scheduler state. Cost: a single loop cannot use multiple cores. The standard solution is to run one loop per core (share-nothing); each loop owns a slice of the work, and they coordinate over message queues if they have to.
What's a proactor and how is it different?
Reactor (epoll, kqueue): the kernel signals 'this FD is ready'; the caller then does the I/O. Proactor (IOCP on Windows, io_uring on Linux): the caller submits the I/O to the kernel and the kernel notifies when it's done, with the data already in the buffer. Proactor saves a syscall per operation and scales better at very high rates, which is why io_uring is widely seen as the future on Linux. asyncio's ProactorEventLoop on Windows uses IOCP; libuv added io_uring as an opt-in backend (epoll is still the default).
Why don't goroutines have this problem?
They do have the same machinery, just hidden. The Go netpoller is a reactor: every blocking network call parks the goroutine on the netpoller, the kernel's epoll readiness wakes it. The difference is that Go schedules goroutines preemptively and runs them on multiple OS threads (M's), so the 'CPU-heavy handler stalls everything' bug doesn't happen at the language level; one slow goroutine just keeps its M busy while others run elsewhere. Conceptually it's a multi-loop reactor with the loop hidden in the runtime.
How is a reactor scaled across cores?
Three patterns. (1) One loop per core, share-nothing: nginx, Tokio in 'multi-thread' mode, Vert.x verticles. Each loop owns its own connections; cross-loop communication is via channels. (2) Single loop + worker pool: Node.js. The loop dispatches CPU work to worker_threads and gets results as callbacks. (3) Acceptor + worker loops: Netty's EventLoopGroup. One loop accepts, multiple loops handle. Pick (1) for max throughput, (2) for ergonomic single-threaded code, (3) for fine-grained control.
Where does this break down?
When most of the work is CPU-bound. A reactor is optimal for many connections that are mostly idle (chat, dashboards, gateways). It's overkill or actively bad with few connections doing heavy compute (machine learning serving, video transcoding); a thread-per-request model is simpler and uses cores naturally. Match the model to the workload.
Gotchas
- A blocking call inside a handler freezes the loop. This includes sleep(), input(), requests.get(), psycopg2 queries, and most non-async libraries.
- Forgetting to await an async function: it returns a coroutine object that does nothing. Linters catch this; reviews don't always.
- The loop runs callbacks in the order they're scheduled, not the order their I/O completed. Never assume FIFO across handlers if work is queued via call_soon.
- Exceptions in handlers can crash the loop in some libraries; in asyncio they get logged and lost unless the task is awaited. Always check for unawaited tasks in production.
Async vs Blocking I/O
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Blocking I/O: one thread per concurrent operation, thread sleeps during I/O wait. Async I/O: one thread handles many operations via an event loop, syscalls are non-blocking with callbacks/coroutines. Async wins for many concurrent connections; blocking wins for few but heavy ones, or when the language doesn't have great async support.
Key Points
- Blocking model: thread per request. Each thread has memory cost (pthreads default ~8 MB virtual / a few hundred KB committed; JVM default 512KB-1MB), thread-create cost, context switch cost.
- Async model: event loop reads ready I/O events, dispatches to coroutines. One thread can multiplex thousands of connections.
- Async needs the entire stack to be async: blocking call inside an async function freezes the loop.
- Choose based on workload: many connections + small per-connection work = async. Few connections + heavy work = threads.
- Virtual threads (Java 21+, Go goroutines) blur the line: write blocking-style code, runtime provides async-like efficiency.
Interview Follow-up Questions
Why is async usually faster for I/O?
Each blocking thread has fixed memory cost (pthreads default 8 MB virtual on Linux; JVM defaults to 512KB-1MB; only the touched stack pages are physically committed) and OS scheduling cost. 10,000 concurrent connections need 10,000 threads, even at JVM defaults that's ~5-10 GB of address space. Async uses one thread (or a small pool) and an event loop; the event loop costs are amortised. For pure I/O workloads, async services routinely handle 10x to 100x more concurrent connections per server.
When is blocking actually fine?
When concurrency is low (a CLI tool, a CRON job, a low-traffic admin endpoint). When the per-request work is heavy and CPU-bound (the I/O is small relative to compute). When the language ecosystem doesn't have good async support (synchronous languages without coroutines). For most internal tooling, blocking is simpler and the cost doesn't matter.
Are virtual threads (Java) and goroutines (Go) async or blocking?
Both, in a sense. The programmer writes blocking-style code (call, wait for result, return). The runtime provides async-like efficiency by parking the virtual thread/goroutine during I/O and scheduling another on the same OS thread. The result is async throughput with blocking syntax. Mostly a clear win; the trade-off is some loss of explicit control.
What is the 'colored function' problem in async languages?
In Python and JavaScript, async functions can only be called from async contexts (with await). Sync functions cannot directly call async ones (they get a coroutine object back). This 'colors' the codebase: once async is adopted at one layer, it propagates upward through every caller. Go and Java's virtual threads avoid this by making the runtime handle the async-ness without language-level coloring.
Gotchas
- Calling blocking code from async freezes the event loop
- Mixing thread-based libraries (threading.Lock) with asyncio races horribly
- Per-thread memory cost adds up: 10K threads = ~10 GB just for stack address space (mostly virtual)
- Virtual threads inherit thread-locals, but stack traces look different from platform threads
- Async stack traces are harder to read (the event loop is in the middle)
Retry with Exponential Backoff and Jitter
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Retry transient failures with delays that double each attempt (exponential backoff). Add jitter (randomness) so a thousand clients don't synchronise their retries and DDoS the recovering server. Cap the total attempts and the total elapsed time. Only retry idempotent operations.
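A hedged Go sketch of that policy: exponential delay with full jitter, a max-attempts cap, and a context deadline bounding total elapsed time. Error classification (transient vs permanent) is left to the caller; all names and values are illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op with full-jitter exponential backoff: the delay is a random
// value in [0, base*2^attempt), capped by maxAttempts and by ctx's deadline.
func retry(ctx context.Context, maxAttempts int, base time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << attempt                          // base * 2^attempt
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done(): // total-time bound
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	calls := 0
	err := retry(ctx, 5, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure") // stand-in for a 5xx/timeout
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```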
Key Points
- Exponential backoff: delay = base * 2^attempt. Avoids hammering the failing service.
- Jitter is required, not optional. Without it, retries synchronise into a thundering herd.
- Bound retries: max attempts AND max total elapsed time. Both, not either.
- Only retry transient errors (5xx, timeouts, connection refused). Never retry 4xx (client errors).
- Only retry idempotent operations. POST is dangerous; GET, PUT, DELETE are usually safe.
Interview Follow-up Questions
Why is jitter important?
Without jitter, all clients that started failing at roughly the same time retry at exactly the same intervals. When the downstream recovers, every retry hits at once and overwhelms it again. Jitter (random delay) spreads the retries over the interval, letting the downstream absorb them gradually. Standard practice in distributed systems.
Full jitter vs decorrelated jitter?
Full jitter: delay = random(0, base * 2^attempt). Simple, works well. Decorrelated jitter: delay = random(base, prev_delay * 3), where prev_delay is the previous attempt's delay. Better at avoiding correlation across clients but more complex. For most workloads, full jitter is enough.
When NOT to retry?
Three cases. Permanent errors (4xx HTTP, validation failures, auth failures): retrying never helps. Non-idempotent operations (POST without an idempotency key): retry can charge the customer twice. Operations approaching a hard deadline: a retry that takes another 5s when the request budget is 1s is wasted work. Always check the operation type and the remaining budget.
How is a transient error identified?
5xx HTTP responses, network timeouts, connection refused, DNS failures are usually transient. 4xx HTTP, 'invalid input', 'permission denied' are permanent. For databases, deadlocks are transient (retry); constraint violations are not. The library should support per-error classification; avoid blanket-retrying all exceptions.
Gotchas
- No jitter = thundering herd when downstream recovers
- Retrying non-idempotent operations (POST without idempotency key) double-charges users
- No total-time bound = retries last longer than the original request budget
- Retrying 4xx errors wastes everyone's time
- Retry inside a retry (caller retry plus the HTTP client's) creates exponential blowup
Circuit Breaker Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Stop calling a downstream that is failing. After N failures in a window, the breaker opens and immediately fails subsequent calls without hitting the downstream. After a cooldown, it allows a probe; if successful, close again. Prevents one slow downstream from cascading into the calling service.
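A minimal Go sketch of the three states, assuming consecutive-failure counting for brevity (production breakers use a sliding window and admit exactly one half-open probe; this simplified version may let concurrent probes through). Names and thresholds are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open")

type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
	open        bool
}

func (b *Breaker) Call(op func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // fail fast, don't touch the downstream
		}
		// Cooldown elapsed: half-open, let a probe call through.
	}
	b.mu.Unlock()

	err := op()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.open = true // trip (or re-open after a failed probe)
			b.openedAt = time.Now()
		}
		return err
	}
	// Success (including a half-open probe): close the breaker.
	b.failures = 0
	b.open = false
	return nil
}

func main() {
	b := &Breaker{maxFailures: 3, cooldown: time.Second}
	for i := 0; i < 5; i++ {
		err := b.Call(func() error { return errors.New("downstream down") })
		fmt.Println(i, err) // calls 3 and 4 fail fast with "circuit open"
	}
}
```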
Key Points
- Three states: closed (normal), open (fail fast), half-open (probe single call).
- Trip on failure rate or absolute count over a sliding window. 'X failures per minute' is more useful than 'X consecutive failures'.
- Open state has a cooldown (typically 30s-5min). After cooldown, breaker is half-open: allows ONE probe call.
- Without a circuit breaker, a slow downstream ties up threads/goroutines waiting on timeouts; the caller slows down trying to call something that won't respond.
- Per-downstream breakers, not global. One slow service shouldn't block calls to a healthy one.
Interview Follow-up Questions
Circuit breaker vs retry: are they competing?
Complementary. Retry handles transient blips: 'this one call might work next time'. Circuit breaker handles sustained failure: 'this downstream is down, stop calling it'. Combine: retry within a single call, circuit breaker across calls. The retry hits the breaker first; if the breaker is open, the call fails fast without the retry attempts running.
How is the failure threshold picked?
Two parameters: failure rate (e.g., 50%) and minimum calls (e.g., 10). The minimum-calls floor prevents tripping on a tiny sample (one failure out of one is 100% failure rate but means nothing). Sliding-window size controls how reactive: smaller window trips faster, more false positives; larger window is slower to trip and slower to recover.
Why per-downstream and not global?
Different downstreams have different reliability. Service A being slow shouldn't stop calls to service B. Per-downstream breakers also allow tuning thresholds per downstream (a payment gateway gets a stricter breaker than a logging service).
What does 'fail fast' actually save?
Two things. Threads/goroutines/connections that would have been tied up waiting for the downstream's timeout. And the downstream's load: not piling on more requests while it is trying to recover. The first is critical: with 100 threads and a downstream that takes 30s to time out, 100 calls per minute is enough to exhaust all threads.
Gotchas
- Single global breaker for all downstreams: one bad service blocks everything
- No probe in half-open state: breaker stays open forever
- Counting client errors (4xx) as breaker failures: trips on user mistakes, not service health
- Threshold too low: trips on benign blips. Too high: trips too late, after damage is done
- Forgetting fallback when breaker opens: callers see CircuitOpenError instead of degraded response
Bulkhead Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Sometimes
Isolate resources so a failure in one part of the system doesn't drain everything. Separate thread pool, connection pool, or rate limit per downstream so one slow downstream can't tie up resources needed by healthy ones. Named after ship hulls divided into compartments to prevent total flooding.
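A hedged Go sketch of a semaphore-style bulkhead per downstream: a buffered channel caps in-flight calls, and callers that can't acquire a slot immediately fail fast instead of queueing. Sizes and names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead caps concurrent calls to one downstream with a buffered channel.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(maxInFlight int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxInFlight)}
}

func (b *Bulkhead) Do(op func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
	default:
		return ErrBulkheadFull // fail fast instead of piling up
	}
	defer func() { <-b.slots }() // deferred release runs even if op panics
	return op()
}

func main() {
	payments := NewBulkhead(2) // per-downstream budget
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			err := payments.Do(func() error {
				time.Sleep(100 * time.Millisecond) // slow downstream call
				return nil
			})
			fmt.Println("call", i, "->", err)
		}(i)
	}
	wg.Wait()
}
```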
Key Points
- Each downstream gets its own resource budget (threads, connections, semaphore permits).
- When one downstream is slow, only its budget exhausts. Calls to other downstreams keep flowing.
- Without bulkheads, a single thread pool serving all downstreams gets fully tied up by the slow one. Latency cascades.
- Pair with circuit breaker (fail-fast) and timeout (don't wait forever) for full resilience.
- Cost: more threads/connections to manage. The win is bounded blast radius.
Interview Follow-up Questions
Bulkhead vs circuit breaker?
Different problems. Circuit breaker: stop calling a downstream that is failing. Bulkhead: limit concurrent calls so a slow downstream can't drain shared resources. They pair: bulkhead caps in-flight calls; if those calls are slow and the bulkhead fills up, callers fail fast (similar to circuit-breaker open). For full protection, use both.
How big should each bulkhead be?
Sized to the downstream's actual capacity, with headroom. If the downstream can handle 100 req/sec at 200ms latency, in-flight is 100 * 0.2 = 20 calls. Set the bulkhead a bit higher (30) to absorb bursts, but not so high that the calling service ties up resources when the downstream slows.
Why not just use timeouts?
Timeouts release the resource (thread, connection) only after the timeout fires. With a 30-second timeout and a slow downstream, threads stay busy for 30 seconds each. With a bulkhead, in-flight is capped from the start; new callers either wait briefly or fail fast. Timeouts AND bulkhead together: bulkhead caps in-flight, timeout caps per-call duration.
What about thread pool isolation specifically?
ThreadPoolBulkhead provides a dedicated pool per downstream. Stronger isolation than semaphore (failure can't even reach the shared pool's threads), but more memory per downstream. Use semaphore bulkheads as default; reach for thread-pool bulkheads when stronger isolation is required (the downstream call has potentially leaky resource usage).
Gotchas
- One shared pool with no per-downstream cap = slow downstream blocks all calls
- Bulkhead too small = false rejections under normal traffic
- Bulkhead too big = downstream still gets overloaded; bulkhead provided no protection
- Forgetting to release the semaphore on exception leaks permits permanently
- Bulkhead without timeout = each in-flight call can still hang forever; combine them
Idempotency
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
An idempotent operation produces the same result whether called once or many times. Critical for retries: if a charge call is idempotent, a network blip that triggers a retry doesn't double-charge the customer. Achieved via idempotency keys (client-supplied unique IDs) or natural idempotence (PUT, DELETE, set operations).
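A hedged in-memory sketch of server-side deduplication on an idempotency key: the first request executes and caches its response; a retry with the same key replays the cached response instead of re-executing. Persistence, TTL, per-key locking, and request fingerprinting are omitted; names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu   sync.Mutex
	seen map[string]string // idempotency key -> cached response
}

// Handle executes op at most once per key and replays the cached response on
// retries. Holding one lock across op is a simplification; real systems lock
// per key and persist the (key, response) pair with a retention window.
func (s *store) Handle(key string, op func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if resp, ok := s.seen[key]; ok {
		return resp // duplicate: replay, don't re-execute
	}
	resp := op()
	s.seen[key] = resp
	return resp
}

func main() {
	s := &store{seen: make(map[string]string)}
	charge := func() string { return "charged $10, receipt #1" }

	fmt.Println(s.Handle("key-abc", charge)) // executes
	fmt.Println(s.Handle("key-abc", charge)) // replayed, no second charge
}
```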
Key Points
- Idempotent operations: GET, PUT (replace), DELETE, set-x-to-y. Calling twice == calling once.
- NON-idempotent: POST that creates, INSERT, charge. Calling twice produces two records or two charges.
- Idempotency key: client generates a unique ID per logical operation, server deduplicates on it.
- Server stores 'this key was seen, here is the response' for some retention window.
- Without idempotency, retries can double-charge, double-create, double-send. With it, retries are safe.
Interview Follow-up Questions
Why is idempotency a concurrency topic?
Because retries are concurrency. In any network or distributed system, requests can be received twice (client retried, network duplicated, load balancer retried, the response was lost on the way back). Without idempotency, those duplicate receives have duplicate effects. Idempotency makes the system tolerant of these everyday concurrent events.
How long should idempotency keys be stored?
At least as long as the maximum retry window. For client-driven retries, hours to a day. For batch/queue-driven workflows that may retry over days, keep them longer. Common choice: 24 hours for sync APIs, 7 days for async/queue workflows. Storage cost is small (one key + cached response per request).
What happens if two clients send the same idempotency key for different operations?
Tricky. The server has to decide: is the second request actually the same as the first (fingerprint matches), or is it a key collision (different request with the same key)? Stripe-style policy: if the request body differs from the cached one, return 409 Conflict. If they match, return the cached response. Always include enough of the request in the dedup check to detect collisions.
Are GET requests idempotent?
By the HTTP spec, yes. GET should not modify server state; calling twice is the same as calling once. Servers that implement GET correctly are idempotent for free. The idempotency-key pattern is for the operations that aren't naturally idempotent (POST that creates, charge, send).
Gotchas
- Generating a new key per retry defeats deduplication; one logical operation = one key
- Storing the key without storing the response means retries fail to replay the original outcome
- Not handling concurrent requests with the same key (need a lock) leads to double-execution
- Treating different requests with the same key as duplicates leads to wrong responses
- POST/INSERT without idempotency keys is a bug waiting to happen under network failures
Strict Alternating Output
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Two threads must alternate strictly, A then B then A then B, n times. The canonical solution uses a pair of semaphores: thread A starts with the 'A's turn' permit and releases B's permit after each step (and vice versa). Same pattern with wait/notify, asyncio.Event, or Go channels.
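A hedged Go version of the token-passing idea, using two one-slot channels in place of semaphores (n and the labels are illustrative): each goroutine waits for its own token, prints, and hands the other goroutine its token.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 5
	turnA := make(chan struct{}, 1) // A's turn-token (starts with the permit)
	turnB := make(chan struct{}, 1) // B's turn-token (starts empty)
	turnA <- struct{}{}

	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // thread A
		defer wg.Done()
		for i := 0; i < n; i++ {
			<-turnA // wait for A's turn
			fmt.Print("A")
			turnB <- struct{}{} // hand the token to B
		}
	}()

	go func() { // thread B
		defer wg.Done()
		for i := 0; i < n; i++ {
			<-turnB
			fmt.Print("B")
			turnA <- struct{}{}
		}
	}()

	wg.Wait()
	fmt.Println() // prints ABABABABAB
}
```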
Key Points
- Two semaphores, started with permits 1 and 0 respectively → enforces order
- Each thread acquires its semaphore, prints, then releases the OTHER's semaphore
- Equivalent patterns: wait/notify pair, two condition variables, two channels
- Generalizes to N-thread round-robin (next lesson)
Interview Follow-up Questions
Why two semaphores, not one?
One semaphore can't enforce *which* thread runs next, it only enforces 'one at a time.' Two semaphores form a token-passing protocol: each thread waits for its specific turn-token before printing and hands the other token forward.
Will signal() vs signalAll() matter here?
Either works for two threads, since there's only ever one waiter. With more threads, signalAll is safer (avoids spurious-wakeup edge cases) but wastes one wakeup. The semaphore pattern sidesteps the question entirely, semaphores are 1:1 by construction.
How does this generalize to N threads round-robin?
An array of N semaphores: thread[i] acquires sem[i], runs, releases sem[(i+1) % N]. The next lesson covers Print Zero Even Odd, which is N=3 with conditional turns.
Three-Way Round-Robin Output
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Three threads must produce 0102030405...: Zero prints a zero before every number, and the Odd or Even thread prints the number itself based on parity. Three semaphores form a token-passing relay: Zero hands to Odd or Even based on parity; the parity thread hands back to Zero.
Key Points
- Three semaphores: zeroSem (starts 1), oddSem (0), evenSem (0)
- Zero runs before every number; it decides 'who's next' from the counter's parity
- Generalizes the strict-alternation pattern to conditional next-turn
- wait/notify version: shared int + condition variables, harder to get right
Interview Follow-up Questions
Why does Zero always run between every Odd and Even?
Because Zero releases the next parity thread, which then releases Zero. So the order is forced: Zero, parity, Zero, parity. No way for two parity threads to run consecutively without Zero in between.
What if the parity choice were random instead of based on i?
Same pattern, Zero just picks based on whatever the rule is. The semaphore relay enforces 'one runs at a time' regardless of the choice. The choice logic is independent of the synchronization.
How to solve this with a single mutex?
Single mutex + shared counter + condition variables, with each thread waiting on a predicate ('counter % 2 == 1' for odd, '== 0 and counter > 0' for even). Works but verbose; semaphores express the intent more cleanly.
Coordinated Conditional Output
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Four threads, Number, Fizz, Buzz, FizzBuzz, coordinate to print FizzBuzz for 1..n. The right thread is whichever matches the current number's divisibility. Pattern: shared counter + per-thread Semaphore, dispatcher wakes only the right thread.
Key Points
- Four threads, one shared counter, four semaphores
- A 'dispatcher' increments the counter and releases exactly the matching thread's semaphore
- Or: each thread waits on a condition that matches its rule (i % 3, i % 5, i % 15)
- Same token-passing pattern as FooBar but routed by predicate
Interview Follow-up Questions
Predicate-on-condition vs semaphore dispatcher, which is cleaner?
Both are valid. Semaphore dispatcher is faster (no spurious wakeups, no broadcast) but requires a coordinator thread. Predicate version is shorter to write but uses notifyAll/broadcast, less efficient at scale. For interview, either is acceptable; the senior signal is being able to discuss the tradeoff.
Why does the predicate version need a while loop, not if?
Spurious wakeups + the broadcast wakes ALL waiters every time. Each thread re-checks its predicate on wake; only the one that matches the new state proceeds, the others go back to sleep. `if` would mistakenly print on a wake even when the predicate has become false.
How to handle 100 different output rules, not just 4?
Single condition variable + a predicate function per thread, all subscribed to broadcasts. Or a dispatcher that maps rules → channels in a routing table. The semaphore-per-thread approach is cleanest for small N (2-4); broadcast/predicate scales better for large N.
Ordered Three-Thread Execution (Print in Order)
Section: Core Patterns | Difficulty: Easy | Frequency: Asked Often
Three threads start in arbitrary order; ensure first() runs before second() runs before third(). The minimal solution: two binary semaphores starting at 0, where each method releases the next's semaphore on completion. Same pattern with CountDownLatch.
Key Points
- Two semaphores both start at 0 → first method must signal before second can proceed
- first() prints, then releases s1; second() acquires s1, prints, releases s2; third() acquires s2
- CountDownLatch (Java) is a clean alternative, semantically more obvious
- Generalizes to N-stage pipelines via N-1 semaphores or N-1 latches
Interview Follow-up Questions
Semaphore vs CountDownLatch, which to pick?
CountDownLatch when the semantics are 'wait for one event' (one-shot). Semaphore when the synchronization point may be reused. For Print in Order, latch is conceptually clearer; semaphore works fine.
What if first() throws after printing but before releasing?
second() and third() will block forever. Production-safe version uses try/finally to guarantee release on error path. Or: rethink whether 'first failed' means second should also fail (probably yes, propagate the error).
How does this generalize to a 5-stage pipeline?
4 latches/semaphores, each chained: stage1 releases s1; stage2 awaits s1, releases s2; etc. Or, single CountDownLatch with count=N for join-all semantics. Choose based on whether stages are sequential or fan-in.
N-of-M Barrier Synchronization
Section: Core Patterns | Difficulty: Hard | Frequency: Asked Often
Hydrogen and Oxygen threads must pair into water, exactly 2 H + 1 O before any can release. The pattern: two semaphores controlling H/O entry + a CyclicBarrier of 3 ensuring all three threads of a molecule are present before any prints.
Key Points
- Pattern: rate-limiting semaphore + barrier
- Hydrogen semaphore = 2 permits; Oxygen semaphore = 1 permit; both refill after barrier
- CyclicBarrier(3) ensures the 2 H + 1 O appear together before any prints
- After barrier, semaphores reset, next molecule's H and O begin
Interview Follow-up Questions
What's a CyclicBarrier exactly?
A barrier where N threads must all call `await()` before any can proceed. Once all arrive, they're all released simultaneously. *Cyclic* because the barrier resets to wait for the next batch of N. Java has it natively; Python's `threading.Barrier` is equivalent; Go has no builtin (build with sync.Cond).
Why do we need both the semaphores AND the barrier?
Semaphores cap *how many* of each type can enter. Barrier ensures the *correct mix* (2 H + 1 O) all reach the print step together. Without semaphores, four hydrogens could pass the barrier and starve oxygens. Without the barrier, individual threads print whenever they enter, no coordination.
How does this generalize to N-of-M?
Same pattern: M semaphores rate-limiting each kind, barrier of size N. Want a fixed mix of 3 A's + 1 B? Set aSem=3, bSem=1, barrier(4). Pattern scales to any 'fixed mix' synchronization.
Dining Philosophers
Section: Core Patterns | Difficulty: Hard | Frequency: Asked Often
Five philosophers, five forks, alternating thinking and eating. Each needs both adjacent forks to eat, so naively-acquired forks deadlock. Classic solutions: enforce odd/even acquisition order, limit to 4 eaters at once, or use a waiter (arbitrator).
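A hedged Go sketch of Solution 2: a buffered channel of capacity N-1 acts as the 'at most four eaters' semaphore, which breaks the circular wait while forks stay plain mutexes. Meal count is illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const philosophers = 5
	var forks [philosophers]sync.Mutex

	// At most N-1 philosophers may try to eat at once: no full cycle can form.
	seats := make(chan struct{}, philosophers-1)

	var wg sync.WaitGroup
	for i := 0; i < philosophers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			left, right := i, (i+1)%philosophers
			for meal := 0; meal < 3; meal++ {
				seats <- struct{}{} // acquire a seat (semaphore down)
				forks[left].Lock()
				forks[right].Lock()
				fmt.Printf("philosopher %d eats\n", i)
				forks[right].Unlock()
				forks[left].Unlock()
				<-seats // release the seat (semaphore up)
			}
		}(i)
	}
	wg.Wait()
}
```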
Key Points
- Naive solution (each grabs left then right) deadlocks, circular wait
- Solution 1: odd philosophers grab left first, even grab right first → asymmetry breaks cycle
- Solution 2: a Semaphore(N-1) limits concurrent eaters → no full cycle possible
- Solution 3: arbitrator (waiter) serializes fork pickups
- Modern Go solution: channel-based fork ownership eliminates locks entirely
Interview Follow-up Questions
Why does naive (everyone grabs left, then right) deadlock?
Five philosophers each grab their left fork simultaneously. Now every philosopher holds one fork and waits for the right one, held by their neighbor. Circular wait, all four Coffman conditions hold, deadlock.
Asymmetry vs Semaphore(4), which is preferred?
Semaphore(4) is simpler to explain and provably deadlock-free. Asymmetry is slightly faster (no semaphore overhead) but harder to verify. Most senior interviewers accept either; the win is articulating *why* each works.
Can a philosopher starve under these solutions?
Yes, none of these solutions prevent starvation, only deadlock. A philosopher could be perpetually beaten to forks by faster neighbors. Fixing requires fairness, e.g., a queue or arbitrator (Solution 3, the 'waiter').
What's the arbitrator solution?
A central 'waiter' grants permission to pick up forks. Each philosopher requests both forks atomically through the waiter; only granted if both are free. Eliminates deadlock and provides fairness if the waiter uses a queue.
Readers-Writers Problem
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Many readers can access shared data concurrently, but writers need exclusive access. The classic problem: which side gets priority? Reader-pref starves writers; writer-pref starves readers; fair locks alternate. Java and Go ship reader-writer locks (ReentrantReadWriteLock, sync.RWMutex); Python's stdlib does not, third-party packages fill the gap.
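A minimal Go sketch of the reads-share / writes-exclude split with sync.RWMutex (the cache type and keys are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type Cache struct {
	mu   sync.RWMutex
	data map[string]string
}

// Get takes the read lock: many readers may hold it at once.
func (c *Cache) Get(k string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[k]
	return v, ok
}

// Set takes the write lock: exclusive, waits for readers to drain.
func (c *Cache) Set(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[k] = v
}

func main() {
	c := &Cache{data: make(map[string]string)}
	c.Set("a", "1")
	fmt.Println(c.Get("a"))
}
```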
Key Points
- Readers can run concurrently with each other; writers need exclusive access
- Reader-preference: writer can starve under sustained reader load
- Writer-preference: reader can starve when writers always queue
- Fair / FIFO: alternates by arrival order, slightly slower, no starvation
- Go's sync.RWMutex is writer-preferring (a waiting writer blocks new readers); Java's ReentrantReadWriteLock leaves ordering unspecified in non-fair mode and is FIFO in fair mode
Interview Follow-up Questions
When does an RWMutex actually win over a plain Mutex?
When reads dominate writes by 5x or more AND each read does meaningful work (so the lock-acquire cost is amortized). For tiny critical sections, a Mutex is faster, RWMutex bookkeeping overhead exceeds the parallelism benefit.
Why is the default writer-preferring?
Without writer-pref, a steady stream of readers can starve writers indefinitely. Writers usually represent state changes that need to happen; reads can wait an extra millisecond. Most production lock libraries default to writer-pref for this reason.
What's the danger of writer-pref?
The re-entry deadlock, a goroutine/thread holding RLock that calls a method which also tries to RLock can deadlock if a writer arrives between. Covered in Bug Hunt: RWMutex Reentrant. Solution: don't re-enter (use 'Locked' helpers) or use a reentrant RW lock.
How does Linux's RCU compare?
Read-Copy-Update is a different beast, readers don't lock at all (they read a stale snapshot); writers create new copies and atomically install them. Used heavily in the Linux kernel. Best read-throughput in existence; complex update semantics; user-space equivalents exist (liburcu) but aren't general-purpose drop-ins.
Sleeping Barber
Section: Core Patterns | Difficulty: Medium | Frequency: Sometimes
A barber sleeps when no customers; a customer wakes the barber if asleep, otherwise sits in a finite waiting room (or leaves if full). Classic producer-consumer with a bounded buffer + signal-on-empty pattern. Two semaphores + a counted slot map onto it cleanly.
Key Points
- Three roles: barber, customer, finite waiting room
- Two semaphores: customers (producer signals barber) + barber (signals next haircut starts)
- Mutex protects the seat counter; bounded chairs reject if full
- Producer-consumer with a bounded buffer in different clothes
Interview Follow-up Questions
How is this different from producer-consumer?
It's a producer-consumer with: (1) a finite buffer that *rejects* on full instead of blocking the producer, and (2) the consumer (barber) blocks on the buffer being empty. Same primitives, different overflow policy.
What if the shop has multiple barbers?
N barbers reading from the same channel/queue. Each barber is independent; customers don't care which barber serves them. The channel buffered at CHAIRS still bounds the waiting room. This is a worker pool.
Why does the customer use semaphore.release() to wake the barber?
It's the signal that work is available. The barber blocks on customerSem.acquire(), semantically 'wait for a customer.' The release/acquire pair is the synchronization. The shared waiting count is just bookkeeping for the rejection check.
Cigarette Smokers Problem
Section: Core Patterns | Difficulty: Hard | Frequency: Rare
Three smokers each have one of three ingredients (tobacco, paper, matches). An agent randomly drops two of the three on the table; the smoker with the missing ingredient must roll and smoke. Tests whether the candidate can work within Patil's restriction (no conditionals in the agent) by introducing 'pusher' helper threads and coordinating with counting semaphores.
Key Points
- Three smokers, each with infinite supply of one ingredient
- Agent puts two ingredients on the table; smoker with the third must take and smoke
- Patil's restriction: NO conditional statements in the agent code (just put + signal)
- Trick: use 'pusher' threads as helpers, they observe what's available and signal the matching smoker
- Modern view: this 'difficulty' is contrived and the helpers pattern is fine
Interview Follow-up Questions
What's Patil's restriction and why does it matter?
Patil (1971) added the rule: the agent thread cannot have conditional logic, it just puts ingredients down and signals. This forces semaphores-only coordination without checking inter-thread state, requiring 'pusher' helper threads. Modern view: the restriction is contrived; in production, the agent would just be smart.
Without pushers, can the problem be solved?
Yes, let the agent inspect what's on the table and directly signal the right smoker. But this 'cheats' Patil's restriction. The pushers exist to honor the academic constraint.
What real problems does this map to?
Pattern matching on event streams, multiple subscribers each waiting for a specific combination of recent events. CSP literature uses this as a canonical example of selective receive with multiple conditions.
Is this still asked in interviews?
Rarely as the original. More often as a variant: 'three event sources, three matching subscribers, dispatch by combination.' Once the pusher pattern clicks, the variants are straightforward.
Santa Claus Problem
Section: Core Patterns | Difficulty: Hard | Frequency: Rare
Santa sleeps until either 9 reindeer return (deliver toys) or 3 elves arrive (consult). Reindeer take priority over elves. Tests selective wakeup + group barrier + priority, three classic primitives composed into one problem.
Key Points
- Santa sleeps until: 9 reindeer ready OR 3 elves waiting
- Reindeer have priority, Santa serves them first if both are queued
- Each group needs barrier-style synchronization (all 9 / all 3 must be present)
- Composition: priority + group-of-N barrier + producer-consumer signaling
Interview Follow-up Questions
What if a 4th elf arrives while 3 are waiting?
Per the canonical rules, they wait, the next group of 3 starts only after the current 3 finish consulting. Modeled by the barrier: it accepts only N waiters at a time. The 4th elf blocks at the barrier until the previous batch is released.
How is reindeer priority enforced?
Santa checks reindeer count first when waking. If both groups signal him simultaneously, he handles reindeer; the elves' signal stays pending in the semaphore for next loop iteration. This is the standard 'priority by check order' pattern.
Why is this problem interesting?
It composes three primitives that are usually tested separately: barrier (group of N), priority (which kind of wake-up to handle first), and producer-consumer signaling. Combining them tests whether the candidate can decompose a multi-faceted problem into the right primitive set.
Categorical Mutual Exclusion
Section: Core Patterns | Difficulty: Medium | Frequency: Sometimes
A bathroom can hold N people of the same category but never mix categories. People of category A enter while bathroom is empty or only A's are inside; same for B's. Generalizes Readers-Writers, readers and writers are categories, with N=1 for writers.
Key Points
- Two states: 'A only', 'B only', or empty
- Same-category entry is unrestricted (up to capacity if any)
- Cross-category entry must wait until current category empties
- Generalization of Readers-Writers, same problem, more symmetric
- Risk: starvation, if one category keeps arriving, the other never enters
Interview Follow-up Questions
How does this differ from Readers-Writers?
It's the symmetric version. RW lock has 'readers' (many concurrent) vs 'writers' (one exclusive), asymmetric. Categorical mutex has two symmetric categories, both allowing many concurrent. Same primitive structure (lock + condition), generalized.
Can starvation happen?
Yes. If category A keeps arriving while B's are waiting, B may never enter, A's never reach count 0. Fixes: (1) once any B is waiting, block new A's from entering; (2) round-robin between categories with a max consecutive entries. Adds fairness, costs throughput.
What if there are 3 categories, not 2?
Same pattern. The predicate becomes 'current is null or current == my category.' Generalizes to N categories with the same lock + condition + state-string approach.
What about a capacity limit per category?
Add a counter check in the wait predicate: 'current is null or (current == my category and count < capacity).' Same lock, additional condition. Combines with categorical exclusion cleanly.
Goroutines, Spawning, Scheduling, Lifecycle
Section: Go Concurrency | Difficulty: Basic | Frequency: Asked Often
Goroutines are user-space tasks multiplexed onto a small pool of OS threads by the Go runtime's M:P:G scheduler. Spawn with `go func() {...}()`, ~2KB initial stack, cheap to create, millions are practical. No public state machine; lifetime is 'until the function returns.'
Key Points
- Goroutines are NOT OS threads; they're multiplexed onto a small pool of OS threads, with at most GOMAXPROCS running Go code at once
- Stack starts at ~2KB, grows on demand up to ~1GB
- M:P:G model: M = OS thread, P = logical processor, G = goroutine
- Work stealing: idle P's steal goroutines from busy P's run queues
- No stop/kill, goroutines exit only when their function returns
Interview Follow-up Questions
What's M:P:G scheduling?
M = machine (OS thread). P = processor (logical, count = GOMAXPROCS). G = goroutine. Each P has a run queue of G's. M's pick a G from their P's queue and execute it. Idle M's steal from other P's. The runtime parks G's on blocking calls (channel, lock, syscall) without blocking the underlying M.
How are goroutines cheaper than OS threads?
(1) Stack: 2KB initial vs 1MB; (2) Context switch: user-space, no kernel transition, ~hundreds of ns vs ~5μs; (3) Scheduling: Go scheduler optimized for many goroutines; OS scheduler optimized for few threads. Net result: 100K goroutines is fine; 100K threads would crash.
Can a goroutine outlive main?
No. Once main returns, the runtime exits and kills all goroutines (no cleanup, no deferred functions run). Use sync.WaitGroup or context.Context to wait for them before main returns.
How is a goroutine stopped?
Not from outside. The goroutine itself must return. Pass a `context.Context` and have the goroutine select on `ctx.Done()`, when the context is cancelled, the goroutine exits voluntarily. There's no `goroutine.Stop()`.
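A minimal sketch of that cooperative-exit pattern (channel contents and timings are illustrative): the goroutine selects on ctx.Done() alongside its work channel and returns when cancelled.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func worker(ctx context.Context, jobs <-chan int) {
	for {
		select {
		case <-ctx.Done(): // cancellation: the only way to "stop" a goroutine
			fmt.Println("worker exiting:", ctx.Err())
			return
		case j := <-jobs:
			fmt.Println("processing", j)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	jobs := make(chan int)

	go worker(ctx, jobs)

	jobs <- 1
	jobs <- 2
	cancel() // ask the goroutine to exit; it complies at its next select
	time.Sleep(50 * time.Millisecond)
}
```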
Gotchas
- Closure capture in loops (pre-Go 1.22): always pass loop vars as args
- wg.Add() must be BEFORE 'go ...' or it races with wg.Wait()
- Spawning unbounded goroutines = potential OOM under load
- tight pure-CPU loops without yields could starve the scheduler before Go 1.14
GMP Scheduler: How Goroutines Actually Run
Section: Go Concurrency | Difficulty: Advanced | Frequency: Asked Often
G = goroutine, M = OS thread, P = logical processor. The runtime keeps GOMAXPROCS P's; each P holds a local run queue of runnable G's. M's must hold a P to run G's. Blocking syscalls hand the P off to another M so other goroutines keep running. Idle P's steal work from busy ones. The network poller and sysmon thread keep things moving. The whole machine runs in user space and is what makes 100K goroutines cheap.
Key Points
- G = goroutine (small struct + stack), M = OS thread (machine), P = logical processor (GOMAXPROCS of these). M needs a P to run G's.
- Each P has a local run queue holding up to 256 runnable G's plus a 'runnext' LIFO slot for newly created goroutines. Overflow goes to the global queue.
- Work stealing: when a P's local queue is empty, it steals half of another P's queue. Keeps load balanced without explicit coordination.
- Blocking syscalls: the M holding a P releases the P (handoff), another M (or a fresh one) picks the P up and continues running other G's. The blocked G stays attached to its M.
- Channel and lock waits don't block an M. The G is parked in user space (a wait list inside the channel/mutex), the M moves on to other work.
- Network poller (epoll/kqueue/IOCP) is one shared component. When network I/O completes, it puts the waiting G back on a P's run queue.
- sysmon is a dedicated OS thread (no P) that preempts long-running G's, retakes P's stuck in syscalls too long, and triggers GC.
- Async preemption via signals (since Go 1.14): even a tight loop with no function calls can be preempted. Before 1.14, such a loop could starve the scheduler.
Interview Follow-up Questions
Why have P at all? Why not just M and G?
P is the unit of scheduling capacity. GOMAXPROCS P's means at most GOMAXPROCS goroutines run in parallel, regardless of how many M's exist. This decouples 'how many cores are allowed to be used' from 'how many OS threads exist.' Stuck syscalls can spawn extra M's without stealing CPU from useful work, because those M's don't have a P. The local run queue per P is also a cache-locality win: G's recently created on a P tend to share data with other G's on the same P.
What is 'spinning' and why does the runtime do it?
After releasing a G, an M with no work to do can spin briefly (a few microseconds) checking whether new work appeared, instead of parking immediately. Parking and waking M's costs an OS round trip; if work is about to appear, spinning is cheaper. The runtime keeps the count of spinning M's low (one or two) to avoid burning CPU. It shows up in pprof as 'runtime.findrunnable' time.
How does GOMAXPROCS interact with cgroups and CPU limits?
Before Go 1.25, GOMAXPROCS defaulted to the number of host CPUs, ignoring cgroup CPU limits. On a 64-core box with a 2-core cgroup, GOMAXPROCS=64 caused massive scheduling overhead. The standard workaround was uber-go/automaxprocs. Go 1.25 made the runtime cgroup-aware by default, so most modern containers no longer need the workaround. Older Go versions still do.
Can the scheduler be starved?
Hard, but yes. (1) Calls to C code through cgo block the M without releasing the P; if all M's go into cgo at once, no goroutines run. (2) Tight loops in Go assembly that don't include preemption points still aren't preemptible. (3) Holding a runtime mutex (very rare in user code) blocks scheduling. For normal Go code, sysmon's signal-based preemption catches almost everything.
What's the cost of a goroutine creation?
About 200ns and 2KB of stack on first creation. Reuse from a pool helps amortize the stack allocation. Compared to an OS thread (~5us creation, ~1MB stack), goroutines are nearly free, which is why patterns like 'spawn one goroutine per request' are idiomatic in Go.
Gotchas
- Runaway G creation under load: spawning a goroutine per inbound message without bounds will OOM the process when the queue can't drain fast enough. Always have a worker pool or semaphore.
- cgo calls block an M for the entire C call. Millions of cgo calls per second will exhaust the M pool and require raising the soft cap (default 10000).
- GOMAXPROCS too low (default 1 in Go pre-1.5, or set too low by hand) silently serializes everything.
- GOMAXPROCS too high on a small cgroup cuts performance because the OS scheduler thrashes between threads that compete for limited CPU.
- Long syscalls (slow disks) can spike the M count; suddenly seeing hundreds of M's is a sign of blocking syscalls under load.
Channels, Buffered, Unbuffered, Patterns
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Channels are typed conduits between goroutines. Unbuffered = synchronous handoff (sender blocks until receiver ready). Buffered = bounded queue. Closing signals 'no more values' to receivers. The cornerstone of Go's 'share by communicating' philosophy.
Key Points
- Unbuffered: synchronous, send blocks until receive, vice versa
- Buffered: send blocks only when buffer full; receive blocks only when empty
- close(ch) = 'no more values'; receivers see ok=false; sending on closed = panic
- Nil channels block forever, useful in select to disable a case
- Channels are not free, for tight CPU loops, mutex is faster
Interview Follow-up Questions
Buffered or unbuffered, which to use?
Unbuffered for synchronization (signal that something happened). Buffered for decoupling (producer can run ahead of consumer up to buffer size). Buffered also tolerates burstiness. As a rule of thumb: start with unbuffered; add buffer only when measurement shows it helps.
What does closing a nil channel do?
Panics. Sending to a nil channel blocks forever. Reading from nil blocks forever. Nil channels are mostly useful in `select` to disable a case dynamically.
When are channels slower than mutex?
Tight critical sections, millions of ops/sec on shared state. Channel send/receive has overhead (~50-100ns); a mutex Lock/Unlock is ~20-30ns. For a hot counter, atomic.Int64 wins; for coordination, channels win.
What's the right channel buffer size?
Depends on the burst tolerance required. If producer and consumer rates are matched, 0 (unbuffered) is fine. If a producer bursts at 10x consumer rate for 100ms, buffer for that burst. Don't make it 'large' as a default; large buffers hide backpressure problems.
Gotchas
- Closing a channel twice → panic
- Sending on a closed channel → panic (so receiver should never close)
- Range loop on a never-closed channel → blocks forever (goroutine leak)
- Make sure only ONE goroutine closes the channel; coordinate with WaitGroup if multiple senders
- Sizing a buffer to 'exactly the number of sends' is fragile: one extra send (or a missed receive) blocks forever
select Deep Dive
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
select waits on multiple channel operations and proceeds with whichever is ready first. Add a default for non-blocking operations; combine with time.After for timeouts; with ctx.Done() for cancellation. The cornerstone of channel composition.
Key Points
- select picks ONE ready case; if multiple ready, randomly chosen
- default makes select non-blocking, if no case ready, default runs
- time.After(d) returns a channel that fires once after d → easy timeouts
- ctx.Done() returns a channel, select on it for cancellation
- select on nil channel = case never fires → useful for dynamic channel disabling
Interview Follow-up Questions
What if multiple cases are ready simultaneously?
Go picks one uniformly at random. This prevents starvation: no ready case can be perpetually passed over by the others. Critical for fairness.
How is priority between cases implemented?
Two-step select. Outer select picks higher-priority channel only; if no high-priority ready, inner select picks any. Or use a guard with a buffered channel: drain high-priority first, then check normal.
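A hedged sketch of that two-step select (channel names are illustrative): try the high-priority channel with a non-blocking select first, then fall back to a blocking select over both.

```go
package main

import "fmt"

// nextMessage prefers urgent over normal when both are ready, but never
// blocks normal traffic while urgent is quiet.
func nextMessage(urgent, normal <-chan string) string {
	select {
	case m := <-urgent: // step 1: non-blocking check of the priority channel
		return m
	default:
	}
	select { // step 2: block on either
	case m := <-urgent:
		return m
	case m := <-normal:
		return m
	}
}

func main() {
	urgent := make(chan string, 1)
	normal := make(chan string, 1)
	urgent <- "pager alert"
	normal <- "routine log"
	fmt.Println(nextMessage(urgent, normal)) // "pager alert"
	fmt.Println(nextMessage(urgent, normal)) // "routine log"
}
```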
Why does time.After leak under heavy use?
Each call allocates a runtime timer that isn't released until it fires (and before Go 1.23 it couldn't be collected early even if the select had already moved on). Calling it millions of times in a tight loop (e.g., a per-iteration timeout) lets the timers accumulate. Use `time.NewTimer` + explicit `Stop()` for hot paths.
When should select have a default?
When non-blocking semantics are required, 'try send, drop if full' or 'try receive, no value? skip.' Default has subtle implications: it makes the select consume CPU when called in a tight loop. Use sparingly outside try-X patterns.
Gotchas
- select with no cases blocks forever (used to park the main goroutine intentionally)
- select with only default is just an immediate non-blocking check
- time.After in tight loops leaks timers, use NewTimer + Stop
- select with default inside a tight loop busy-polls and burns CPU; reserve default for genuine try-send/try-receive
- Nil channel cases never fire, both feature and footgun
context.Context, Cancellation & Deadlines
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
context.Context propagates cancellation, deadlines, and request-scoped values through call chains. Every long-running operation should accept a Context and select on ctx.Done(). The universal cancellation primitive in modern Go.
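A minimal sketch of deadline propagation, assuming a plain net/http call (the URL and timeout are illustrative): the budget set here bounds everything downstream that honours the context.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Everything below this point inherits the 500ms budget.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel() // releases the timer even on early return

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req) // aborted if the deadline expires
	if err != nil {
		fmt.Println("request failed:", err) // wraps context.DeadlineExceeded
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```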
Key Points
- Context flows through the call graph, pass as first parameter to every function that may block
- WithCancel: parent function cancels manually
- WithTimeout / WithDeadline: auto-cancellation after duration
- ctx.Done() returns a channel that closes on cancellation
- Always defer cancel() to release resources, even when not strictly needed
Interview Follow-up Questions
Why is ctx the first parameter, not last?
Convention; aids readability and discoverability, and Go style guides and linters flag violations. Functions that don't take ctx are usually CPU-only / synchronous; those that do are network/DB-bound, so blocking call sites are easy to spot at a glance.
When should context NOT be used?
Pure-CPU synchronous functions (`Sum`, `Reverse`). Tight loops where the overhead of `select` matters. Constructor functions (use a Done method instead). The ctx is for orchestration of cancellable operations; don't sprinkle it on everything.
What's context.Background vs context.TODO?
Functionally identical (empty context, never cancels). Convention: Background for top-level (main, init); TODO when the right context is unknown (placeholder during refactoring). The TODO signals 'come back and fix this.'
Does context.WithCancel leak if cancel is never called?
Yes. The child context stays referenced by its parent (and WithTimeout/WithDeadline also hold a timer) until cancel runs or the parent is cancelled. Always `defer cancel()`: even when the operation finishes early, the explicit call releases the child's resources immediately. Linters (go vet's lostcancel check, staticcheck) catch the omission.
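A sketch of the standard shape, with `slowQuery` standing in for any cancellable call; the result channel is buffered so the worker goroutine cannot leak even when the timeout fires first:

```go
import (
	"context"
	"time"
)

func fetchWithTimeout(ctx context.Context, slowQuery func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel() // releases the timer even when the call finishes early

	type result struct {
		val string
		err error
	}
	ch := make(chan result, 1) // buffered: the goroutine can always complete its send
	go func() {
		v, err := slowQuery(ctx)
		ch <- result{v, err}
	}()

	select {
	case r := <-ch:
		return r.val, r.err
	case <-ctx.Done():
		return "", ctx.Err() // deadline exceeded or parent cancelled
	}
}
```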
Gotchas
- Forgetting `defer cancel()` → leak (the child context, and its timer for WithTimeout, stick around until the parent ends)
- Storing context as a struct field; convention says pass it explicitly
- Putting too much in context.Value, easy to abuse for global state
- Mixing cancellation: ctx is cancelled but inner code didn't check ctx.Done() → operation continues
- context.Background() in a function that should take a ctx, silently disables cancellation
Goroutine Leak Prevention
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Every spawned goroutine must have a known exit story: context cancellation, channel close, natural completion, or timeout. Goroutines that block forever are the #1 production scaling bug in Go services. Detect via runtime.NumGoroutine and pprof/goroutine.
Key Points
- Symptom: NumGoroutine grows unbounded under load
- Diagnosis: pprof/goroutine shows what each leaked goroutine is blocked on
- Pattern: every send/receive needs a select-with-ctx.Done() escape
- Bounded buffered channels absorb bursts; default-on-full handles overflow
- Test for leaks: run handler 10K times in test, check NumGoroutine before/after
Interview Follow-up Questions
What's the canonical leak in production Go services?
Fire-and-forget goroutine doing logging or metrics, sending to an unbuffered/under-capacity channel whose consumer is slow or down. Every request leaks one goroutine; unbounded growth under load.
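A sketch of the fix for that shape: wrap the send in a select so a finished request stops waiting instead of parking forever (the channel name and payload type are placeholders):

```go
import "context"

// emit never blocks past the life of the request: the consumer either takes
// the event, or the request's ctx ends and we give up.
func emit(ctx context.Context, events chan<- string, e string) {
	select {
	case events <- e: // consumer keeping up: normal path
	case <-ctx.Done(): // request finished or cancelled: stop waiting, don't leak
	}
}

// When dropping is acceptable, a non-blocking try-send bounds it even harder:
//
//	select {
//	case events <- e:
//	default: // buffer full: drop the event
//	}
```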
How is a leak reproduced locally?
Stress test the handler with go-test running N times. Check `runtime.NumGoroutine()` before and after. Growth indicates a leak. Use `goleak.VerifyNone(t)` to make this automatic.
Why don't unbuffered channels usually leak in tests?
Tests run linearly. Production has burst load: 1000 requests/sec with a slow consumer means 1000 leaked goroutines/sec. Tests usually don't reproduce because the consumer keeps up at low load.
When is unbuffered fine?
When sender and receiver are tightly paired with bounded lifetimes, e.g., 'wait for one response.' Both sides are explicit; the goroutine ends after a single exchange. The pattern that bites is *unbounded fire-and-forget*.
Gotchas
- context.WithCancel without `defer cancel()` leaks the timer goroutine
- for {} loops without select-on-ctx.Done leak forever
- Channel send with no receiver blocks forever, wrap in select + default OR + ctx.Done
- time.After in tight loops creates one timer per call; use NewTimer + Stop
- Goroutines waiting on http.Request body to fully drain, close the body explicitly
Go Memory Model & Sync Guarantees
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Go's memory model defines when one goroutine is guaranteed to observe writes made by another. The rule is short: synchronize using channels, sync primitives, or sync/atomic, or you have a data race.
Key Points
- Go's mantra: 'Don't communicate by sharing memory; share memory by communicating'
- A send on a channel happens-before the corresponding receive completes
- A receive from an unbuffered channel happens-before the send completes
- Mutex.Unlock() happens-before the next Mutex.Lock() returns
- sync/atomic loads/stores establish happens-before (since Go 1.19)
- go run -race detects data races at runtime; run it in CI
Interview Follow-up Questions
Why does Go say 'don't communicate by sharing memory'?
Channels make ownership and synchronization explicit in the type system. Shared-memory bugs hide; channel patterns surface coordination at the call site.
Is reading a bool atomic in Go?
No. Without atomic.Bool or a mutex, the race detector flags concurrent read+write. Word-size loads may be atomic on x86 but the Go memory model doesn't guarantee it.
What does the race detector actually do?
Instruments memory accesses at compile time and tracks happens-before relationships. Reports any pair of accesses where at least one is a write and no happens-before edge exists.
Difference between unbuffered and buffered channel send semantics?
Unbuffered: send blocks until receiver is ready, a sync point. Buffered: send blocks only when buffer is full, like a queue.
Can sync.Mutex be reentrant?
No. Re-locking the same mutex from the same goroutine deadlocks. This is intentional; Go discourages reentrant locks because they hide design issues.
Gotchas
- Copying a struct that contains a sync.Mutex breaks it; go vet will warn
- Closing a channel twice panics; sending to a closed channel panics
- select with no default + no ready case blocks forever, common cause of deadlock
- Goroutine leaked on channel send: writer blocks forever if no one reads and no buffer
- atomic operations take the value's address: atomic.AddInt64(&x, 1), not atomic.AddInt64(x, 1)
sync.Mutex and sync.RWMutex
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
sync.Mutex is Go's basic mutex (Lock/Unlock). sync.RWMutex allows many readers OR one writer. Always use defer Unlock right after Lock. Mutexes are not reentrant in Go: same goroutine locking twice deadlocks. Pass mutexes by pointer; copying a mutex copies its state (including held-ness).
Key Points
- Mutexes are NOT reentrant in Go. Same goroutine locking twice deadlocks.
- Always Lock then defer Unlock. Skipping defer is the source of most lock leaks.
- Mutex zero value is unlocked and ready to use. No initialisation needed.
- Copying a Mutex copies its state. Embed by value carefully; pass structs containing mutexes by pointer.
- RWMutex is slower than Mutex for write-heavy or short-critical-section workloads. Profile.
Interview Follow-up Questions
Why doesn't Go support reentrant mutexes?
Design choice. The Go authors argue that needing reentrancy is usually a signal of a design problem (the function does not know whether it already holds the lock, which means it does not know its own state). The alternative is to refactor: split the function into a public version that locks and a private version that assumes the lock is held. The public version locks and calls the private; nested calls go through the private.
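A sketch of that split, with illustrative names:

```go
import "sync"

type Store struct {
	mu    sync.Mutex
	items map[string]int
}

// Add is the public entry point: it takes the lock.
func (s *Store) Add(key string, n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.addLocked(key, n)
}

// addLocked assumes s.mu is already held; nested/internal calls use this one.
func (s *Store) addLocked(key string, n int) {
	s.items[key] += n
}
```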
Mutex or RWMutex by default?
Mutex. RWMutex looks like 'free parallelism for reads', but the bookkeeping (atomic counter for readers, separate write path) costs cycles even for the read path. For critical sections shorter than ~1µs, Mutex wins. Switch to RWMutex when reads vastly dominate (50:1 or more) AND each read holds the lock long enough that the read-path overhead is negligible.
Can a goroutine that holds a Mutex pass it to another goroutine?
Yes. The mutex tracks held-or-not, not which goroutine holds it. Goroutine A can Lock, send the protected struct over a channel, and goroutine B can Unlock. This is unusual; the conventional pattern is acquire-and-release in the same function. But the runtime does not enforce that.
What goes wrong when defer Unlock is forgotten?
Two failure modes. First, an early return between Lock and Unlock leaks the lock; the next caller deadlocks. Second, a panic between them leaks the lock; recovery in a parent goroutine sees the lock held forever. defer Unlock right after Lock makes both paths correct.
Gotchas
- Mutex is not reentrant; same goroutine locking twice deadlocks
- Copying a struct that contains a Mutex copies the lock state; pass by pointer
- RWMutex is slower than Mutex for short critical sections; profile before assuming
- Forgetting defer Unlock leaks the lock on panic or early return
- Holding a mutex across a channel send/receive can deadlock with the channel's other end
sync Package: Once, WaitGroup, Cond, Pool, Map
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Beyond mutexes, sync provides Once (one-time init), WaitGroup (wait for N goroutines), Cond (wait/signal on a predicate), Pool (reusable object cache), and Map (lock-free read-mostly map). Each solves a specific shape; reaching for the wrong one leads to ugly code.
Key Points
- sync.Once.Do guarantees the function runs exactly once across all goroutines.
- WaitGroup.Add(n) MUST happen before launching goroutines; Done in defer; Wait blocks until counter is 0.
- sync.Cond is rarely the right tool in Go; channels usually express the same intent more clearly.
- sync.Pool reduces GC pressure for short-lived objects but is NOT a connection pool: items can be reaped any time.
- sync.Map wins for read-mostly disjoint-key workloads. For write-heavy or read-write-mixed, plain map + RWMutex is faster.
Interview Follow-up Questions
When to use sync.Map vs map + RWMutex?
sync.Map is faster ONLY in two scenarios: write-once-read-many (entries added once, then mostly read) and disjoint-key access (different goroutines touching different keys). For mixed reads and writes on the same keys, or for write-heavy workloads, plain map with RWMutex is faster. Default to map+RWMutex; switch to sync.Map only after measuring the win.
Why is sync.Pool 'best-effort'?
The pool can drop items at any garbage collection. The Go authors made this trade-off so the pool does not pin memory. It cannot be used for objects with finite-resource semantics (file handles, DB connections, network sockets). Use it only for objects that are 'free to recreate but expensive to allocate' (buffers, slices, parser state).
What is the difference between Cond.Signal and Cond.Broadcast?
Signal wakes ONE waiting goroutine. Broadcast wakes ALL waiting goroutines. Use Signal when only one waiter can usefully proceed (e.g., one item enqueued, only one consumer can take it). Use Broadcast when state changed in a way many waiters care about (cache invalidation, shutdown signal). Wrong choice = lost wakeup or wasted wakeups.
WaitGroup.Add inside the goroutine: why is that wrong?
It races with Wait. The parent calls Wait; the counter is still 0 (because Add has not run yet); Wait returns immediately; goroutines then run after the parent has moved on. The rule: every Add must happen-before every Wait. The simplest way is to Add before launching.
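The correct ordering as a sketch (`urls` and `fetch` are placeholders):

```go
import "sync"

func fetchAll(urls []string, fetch func(string)) {
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1) // before the goroutine starts, so Wait can never observe 0 too early
		go func(u string) {
			defer wg.Done()
			fetch(u)
		}(u)
	}
	wg.Wait()
}
```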
Gotchas
- WaitGroup.Add inside a goroutine races with Wait; Add before launching
- sync.Pool items can be dropped at any GC; never use for finite resources
- sync.Cond.Wait must be inside a for-loop on the predicate (spurious wakeups)
- sync.Map cannot be ranged AND mutated safely; use Range carefully
- Calling Signal/Broadcast without holding the Cond's Locker can cause missed wakeups
sync/atomic: Lock-Free Single-Variable Updates
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
sync/atomic provides lock-free atomic operations on integers and pointers. Add, Load, Store, Swap, CompareAndSwap. Go 1.19+ added typed atomic.Int64, atomic.Pointer[T], atomic.Bool which are cleaner than the function-style API. Use for counters, hot flags, lock-free data structures.
Key Points
- Atomic ops are lock-free: single hardware instruction (LOCK XADD on x86, LDXR/STXR on ARM).
- Go 1.19+ typed API (atomic.Int64) is preferred over the function-style API.
- CompareAndSwap is the building block for lock-free algorithms; the retry loop pattern is identical to Java.
- atomic.Value supports any type but is replace-only (no compound update). Use atomic.Pointer[T] when possible.
- Plain reads/writes to int64 are NOT atomic on 32-bit platforms; use atomic.Int64 if portability matters.
Interview Follow-up Questions
atomic.Int64 vs sync.Mutex around an int64: which is faster?
Atomic by a wide margin. atomic.Add is one CPU instruction; mutex Lock is at least dozens of cycles even uncontended, hundreds under contention. Use atomic for counters, single-variable invariants. Mutex when the invariant spans more than one variable.
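A sketch using the Go 1.19+ typed API: a plain counter plus a CAS retry loop that tracks a running maximum (names are illustrative):

```go
import "sync/atomic"

var requests atomic.Int64
var maxLatencyNs atomic.Int64

func observe(latencyNs int64) {
	requests.Add(1) // hot counter: one instruction, no lock

	// CAS retry loop: the same shape as Java's compareAndSet loops.
	for {
		cur := maxLatencyNs.Load()
		if latencyNs <= cur {
			return
		}
		if maxLatencyNs.CompareAndSwap(cur, latencyNs) {
			return // we won the race; otherwise reload and retry
		}
	}
}
```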
When to use atomic.Value vs atomic.Pointer[T]?
atomic.Pointer[T] for new code on Go 1.19+. It is type-safe, no interface boxing. atomic.Value exists for backward compatibility and for cases where T is unknown at compile time. The two have one important difference: atomic.Value requires the same concrete type on every Store after the first.
How does Go's atomic interact with the memory model?
Atomic operations have sequentially consistent semantics in Go's memory model (since Go 1.19 made this explicit). A Load synchronises with the Store that wrote the value: writes that happened before the Store are visible to code after the Load. This matches what most engineers intuitively expect from atomics.
Can a lock-free data structure be built with sync/atomic?
Yes, and people have. Lock-free queues, stacks, hash maps. But: the standard library covers the common cases (channels for queues, sync.Map for some maps), and rolling a lock-free structure is where subtle bugs live (ABA problem, memory ordering, leaks of removed nodes). Reach for off-the-shelf or academic-tested designs first.
Gotchas
- Function-style API requires 8-byte alignment for int64; typed API handles this
- atomic.Value requires same concrete type on every Store after the first
- Plain int64 reads/writes are NOT atomic on 32-bit platforms
- Atomic operations are NOT enough for compound state (multiple fields); use a mutex or restructure to one immutable struct
- CAS retry loops can spin under high contention; benchmark before assuming lock-free is faster
sync.Map vs map+RWMutex
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Sometimes
sync.Map is the standard library's lock-free concurrent map. It wins for two specific shapes: write-once-read-many entries, and disjoint-key access from many goroutines. For mixed read/write on the same keys, plain map + sync.RWMutex is faster. The Go authors put a warning in the docs to this effect.
Key Points
- sync.Map is untyped (any/any). There is no generic version in the standard library; type assertions everywhere.
- Optimised for write-once-read-many or disjoint-key access. Other patterns: use map+RWMutex.
- Range is weakly consistent: it iterates a snapshot, but mutations during Range may or may not be reflected.
- LoadOrStore is the atomic 'get or insert' that map+RWMutex needs separate Lock/Unlock for.
- If you need iteration order, range, or generics, use plain map with RWMutex. sync.Map is a sharp tool for specific shapes.
Interview Follow-up Questions
When is sync.Map actually faster than map+RWMutex?
Two shapes per the docs: (1) write-once-read-many entries (write at insert, then read forever) and (2) many goroutines mostly accessing disjoint keys (each goroutine touches a different subset). For mixed writes/reads on contended keys, map+RWMutex wins. Benchmark the real workload before assuming.
Why is sync.Map untyped?
It predates generics. The Go team has not added a generic version to the standard library; the existing sync.Map remains untyped for backward compatibility. For typed concurrent maps in modern Go, either build a wrapper with generics + map + RWMutex, or use a third-party library.
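A minimal sketch of such a generics wrapper (Load/Store only; extend as needed):

```go
import "sync"

// TypedMap is a typed concurrent map built on map + RWMutex.
type TypedMap[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]V
}

func NewTypedMap[K comparable, V any]() *TypedMap[K, V] {
	return &TypedMap[K, V]{m: make(map[K]V)}
}

func (t *TypedMap[K, V]) Load(k K) (V, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	v, ok := t.m[k]
	return v, ok
}

func (t *TypedMap[K, V]) Store(k K, v V) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.m[k] = v
}
```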
Can sync.Map iteration miss entries?
Yes, if entries are added during the iteration. The Range docs say: 'Range may reflect any mapping for that key from any point during the Range call. Range does not block other methods on the receiver; even f itself may call any method on m.' In practice, when a consistent snapshot is needed, copy the entries first.
What about ordered iteration?
sync.Map does not preserve insertion order or any other order. For ordered iteration, use map+RWMutex (which also doesn't preserve order, so sort the keys), or maintain a separate slice of keys with the same lock.
Gotchas
- sync.Map is untyped; type assertions on every Load are easy to get wrong
- Range is weakly consistent; not safe for code that needs an exact snapshot
- Storing nil values is allowed but easily confused with 'not present'; check the bool
- For most workloads, map+RWMutex outperforms sync.Map; default to it unless you have measured otherwise
- Range can block forever if f never returns, including on writes that block on the map
errgroup and golang.org/x/sync/semaphore
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
errgroup.Group is the structured fan-out for Go: launch goroutines via Go(func), Wait for all, first error cancels siblings via Context. semaphore.Weighted bounds concurrency with weighted permits. Together they cover most real fan-out cases cleaner than raw WaitGroup + manual cancel plumbing.
Key Points
- errgroup.WithContext returns a Group AND a derived Context that is cancelled when any goroutine returns an error.
- First error wins; subsequent errors from siblings are silently dropped.
- g.SetLimit(n) (added to x/sync in 2022) caps in-flight goroutines without a separate semaphore.
- semaphore.Weighted permits >1 weight per acquisition; useful when tasks have different cost (memory, CPU).
- Both are in golang.org/x/sync (not stdlib) but are de facto standard for Go fan-out.
Interview Follow-up Questions
errgroup vs sync.WaitGroup: when to use which?
errgroup when error propagation or context cancellation on first failure is needed. WaitGroup when there are no errors to propagate (background fan-out where each goroutine handles its own errors), or when finer control is needed. For most fan-out-and-wait code, errgroup is the right default.
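A sketch of the default fan-out shape (`fetch` is a placeholder; requires golang.org/x/sync/errgroup):

```go
import (
	"context"

	"golang.org/x/sync/errgroup"
)

func crawl(ctx context.Context, urls []string, fetch func(context.Context, string) error) error {
	g, ctx := errgroup.WithContext(ctx) // ctx is cancelled on the first error
	g.SetLimit(8)                       // at most 8 in flight
	for _, u := range urls {
		u := u // not needed on Go 1.22+, required before
		g.Go(func() error {
			return fetch(ctx, u) // respects cancellation triggered by a sibling's error
		})
	}
	return g.Wait() // first non-nil error, after all goroutines finish
}
```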
What happens to slow goroutines after the first error?
The shared context is cancelled. Slow goroutines that respect ctx (check ctx.Done(), pass ctx to downstream calls) will see the cancellation and exit early. Goroutines that ignore ctx will run to completion. errgroup waits for all of them in either case.
How does SetLimit differ from a Semaphore?
SetLimit caps in-flight Go calls; the limit is per-Group. Semaphore caps a global resource (memory, connection slots) that may span multiple Groups. SetLimit is simpler for the common case of 'don't fan out more than N at once'. Semaphore is more flexible (weighted permits, shared across components).
Why are these in x/sync rather than stdlib?
Historical: x/sync was the staging area for sync extensions. Both are stable, widely used, and effectively standard. The Go team has stated no plans to move them, partly because evolving them in x/sync is easier than the stdlib's strict compatibility promise.
Gotchas
- Forgetting to capture the loop variable (item := item) before passing to g.Go creates a race in pre-Go-1.22 code
- errgroup only returns the FIRST error; subsequent errors are silently dropped
- A goroutine that ignores ctx runs to completion even after error cancellation
- semaphore.Weighted.Acquire blocks until ctx done or weight available; check ctx.Err() to distinguish
- errgroup zero value works but loses cancellation; always use WithContext for fan-out
Go Runtime Tuning: GOMAXPROCS, GC, Scheduler
Section: Go Concurrency | Difficulty: Advanced | Frequency: Sometimes
GOMAXPROCS sets how many OS threads execute Go code concurrently (default = cpu count). GOGC controls when GC runs (default 100 = collect when heap doubles). GOMEMLIMIT (Go 1.19+) caps RSS to keep within container limits. The Go scheduler is M:N; you almost never need to think about it, but knowing the levers helps when you do.
Key Points
- GOMAXPROCS defaults to cpu count. In containers without CFS-quota awareness, that may be the host's cores, not your limit. Use uber-go/automaxprocs.
- GOGC=100 means GC when heap doubles since last GC. Higher values trade memory for fewer collections.
- GOMEMLIMIT is a soft cap on total memory. Without it, Go can OOM in containers even with GOGC tuned low.
- The scheduler is M:N: many goroutines on a few OS threads. Goroutines park and resume cheaply (hundreds of nanoseconds).
- GODEBUG=schedtrace=1000 prints scheduler stats every 1s. Useful for diagnosing scheduling stalls.
Interview Follow-up Questions
Should GOMAXPROCS be tuned manually?
Almost never on bare metal or VMs. The default (cpu count) is right. In containers with CFS quota, use uber-go/automaxprocs to read the limit correctly. The only time to override manually is when cores are deliberately reserved for other processes (e.g., a sidecar).
What is the right GOGC?
Default 100 is fine for most services. Higher values (200, 500) reduce GC cost at the cost of higher memory; useful for batch jobs that prioritise throughput. Lower values (50, 25) reduce memory at the cost of more CPU on GC; useful when memory is tight. With GOMEMLIMIT set, GOGC can usually stay at default and the limit drives GC pressure.
How does the M:N scheduler differ from OS threads?
Goroutines (G) are scheduled onto P's (logical processors); P's are bound to M's (OS threads). Goroutines that block on syscalls or channels park; the M can run another goroutine without an OS context switch. Park/unpark costs hundreds of nanoseconds vs ~1–10 µs for an OS context switch. Result: Go programs can run hundreds of thousands of goroutines without scheduling overhead becoming dominant.
What is GOTRACEBACK and when is it needed?
Controls how much detail is in panic stack traces. Default 'single' shows the panicking goroutine; 'all' shows every goroutine. Set to 'all' in production for richer crash dumps. Doesn't affect performance, only crash output.
Gotchas
- GOMAXPROCS default is host CPU count, NOT container CPU limit; use automaxprocs
- Without GOMEMLIMIT, Go can OOM in tight containers even with GOGC tuned
- Running pprof with too many concurrent profiles slows the program; profile sparingly in prod
- GODEBUG=schedtrace adds overhead; not for prod, only for diagnostics
- runtime.NumGoroutine() shows total; doesn't tell you which are blocked vs runnable
Threads, Runnable, Virtual Threads
Section: Java Concurrency | Difficulty: Basic | Frequency: Asked Often
Java's concurrency unit is the thread. Platform threads wrap OS threads (1MB stack, ~10K max). Virtual threads (Java 21+) are JVM-managed M:N tasks (~few KB), millions per JVM. Use Runnable/Callable to define work; Thread/Executors to schedule.
Key Points
- Platform thread = OS thread; Virtual thread = JVM-managed M:N (Java 21+)
- Use Runnable for fire-and-forget; Callable for results; Thread.start to launch
- Virtual threads work with all blocking JDK APIs, no library rewrite needed
- Thread.join() waits for completion; thread.getState() gives runtime state
- Avoid creating threads directly in production code; use ExecutorService
Interview Follow-up Questions
Platform thread vs virtual thread, when to choose?
Platform: CPU-bound work, small thread counts (<1000). Virtual: I/O-bound, high concurrency (10K+ requests). Virtual threads work with existing blocking APIs, no rewrite to async required to scale.
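A minimal sketch (Java 21+): one virtual thread per submitted task; the blocking sleep stands in for any blocking JDK call:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                exec.submit(() -> {
                    Thread.sleep(100);   // blocking call: the carrier thread is released
                    return "done";
                });
            }
        } // close() waits for submitted tasks to finish (ExecutorService is AutoCloseable since Java 19)
    }
}
```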
Why is Thread.stop() deprecated?
It releases all locks abruptly, leaving objects in inconsistent states. Worse, it can corrupt internal data structures (a HashMap mid-resize, for example). Use interrupt + cooperative shutdown, let the thread finish what it's doing safely.
How are virtual threads scheduled?
M:N. JVM has a small pool of carrier (platform) threads. Virtual threads run on carriers; on blocking JDK calls, the virtual thread unmounts from its carrier so the carrier can run another. Re-mount when the I/O completes.
When does Runnable beat Callable?
When no return value or exception propagation is required. Fire-and-forget background tasks (logging, metrics flush). Callable's return type and checked exceptions add unnecessary ceremony for those cases.
Gotchas
- Thread.stop() is deprecated and unsafe, never use
- Thread.sleep() and Object.wait() clear the interrupt flag when they throw InterruptedException; re-set it with Thread.currentThread().interrupt()
- Don't reuse Thread instances, start() throws on a started/terminated thread
- Virtual threads pin to their carrier inside synchronized blocks, use ReentrantLock for I/O-heavy code on virtual threads
- Daemon threads are killed at JVM exit, don't use for cleanup work
synchronized vs ReentrantLock
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
synchronized is the language-level mutex, concise, JVM-optimized, automatic release on exception. ReentrantLock is the API-level mutex, adds tryLock with timeout, fairness, multiple Conditions, lock interruptibility. Pick based on which features the call site needs.
Key Points
- synchronized: cheap, JVM-optimized, auto-release. Default choice.
- ReentrantLock: tryLock(timeout), fairness, multiple Conditions, interruptible
- Both reentrant, same thread can re-acquire
- Object.wait()/notify()/notifyAll() inside synchronized vs ReentrantLock.newCondition().await()/signal()
- Java 21+ virtual threads + synchronized = pinning (use ReentrantLock instead for I/O-heavy code)
Interview Follow-up Questions
When is synchronized faster than ReentrantLock?
For uncontended fast path. JVM lock coarsening and escape analysis reduce uncontended sync to a few CPU instructions (biased locking did this aggressively pre-JDK 15, then was disabled by JEP 374). ReentrantLock has more bookkeeping. Under contention they're similar; in benchmarks, synchronized usually wins for short critical sections.
Why does virtual thread + synchronized 'pin' the carrier?
Virtual threads run on carrier threads. When a virtual thread enters a synchronized block, it can't unmount from its carrier (JVM internals). I/O inside the block holds the carrier. Workaround: use ReentrantLock for I/O-heavy critical sections on virtual threads.
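A sketch of the two ReentrantLock idioms worth knowing cold: unlock in finally, and tryLock with a timeout (class and field names are illustrative):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class Account {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;

    void deposit(long amount) {
        lock.lock();
        try {
            balance += amount;
        } finally {
            lock.unlock();        // runs even if the critical section throws
        }
    }

    boolean tryDeposit(long amount) throws InterruptedException {
        if (!lock.tryLock(50, TimeUnit.MILLISECONDS)) {
            return false;         // could not acquire in time: back off instead of blocking
        }
        try {
            balance += amount;
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```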
When to choose StampedLock?
Read-mostly workloads with very rare writes. The optimistic path skips ALL synchronization on read, fastest possible. But: not reentrant, manual stamp management, fall-back complexity. Generally niche; ReentrantReadWriteLock is more common.
Is synchronized(this) bad practice?
Sometimes. Any caller with a reference to the object can lock it externally: `synchronized(myObj) { ... }` from outside the class. For libraries, use a private final lock object instead. For internal classes, `synchronized(this)` is fine.
Gotchas
- ReentrantLock without try/finally → exception leaks the lock
- synchronized on String literals or autoboxed Integers → shared monitors, deadlocks across unrelated code
- Fairness in locks adds overhead, use only when measurement justifies it
- Reading a long/double without synchronization on 32-bit JVMs is NOT atomic
- Holding a lock during an external call (network, callback) → deadlock if callee tries to lock something held by caller
AbstractQueuedSynchronizer (AQS)
Section: Java Concurrency | Difficulty: Advanced | Frequency: Asked Often
AQS is the framework underneath every lock in java.util.concurrent. It provides a single int state variable, a CLH-style wait queue, and the park/unpark plumbing. Subclasses define what the state means and when acquire/release succeeds. ReentrantLock, Semaphore, CountDownLatch, ReadWriteLock, FutureTask, ThreadPoolExecutor's worker locks, and StampedLock are all AQS subclasses or close cousins.
Key Points
- One volatile int state field. Subclasses interpret it: 0/1 for ReentrantLock, hold count for reentrancy, available permits for Semaphore, count for CountDownLatch.
- FIFO wait queue (a CLH variant): each waiter is a Node with a Thread and a status. Spin briefly, then park.
- Subclasses override tryAcquire / tryRelease (exclusive) or tryAcquireShared / tryReleaseShared (shared). Everything else (queueing, parking, signalling) is provided.
- Supports interruptible, timed, and fair acquisition out of the box.
- ConditionObject (the Condition impl) lives on top of AQS too; await/signal use a separate condition queue that migrates nodes back to the main wait queue on signal.
- Java 21 added VirtualThread support: AQS parks via LockSupport, which knows how to unmount a virtual thread from its carrier.
Interview Follow-up Questions
Why one int instead of an object?
Atomic CAS on a single 32-bit word is hardware-cheap and fits in one cache line with the rest of the AQS header. A heavier representation (a HashMap, an Object reference) would cost an allocation per state change and a cache miss per access. The whole framework is designed around a single hot CAS path. Most synchronizers fit fine: 0/1 for a mutex, hold count for reentrancy, available permits for a semaphore, count for a latch. When 32 bits isn't enough, AbstractQueuedLongSynchronizer extends it to 64.
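A minimal non-reentrant mutex on AQS, roughly the shape the framework expects (illustrative; a production lock would also track the owner thread):

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

final class SimpleMutex {
    // state 0 = free, 1 = held. AQS supplies queueing, parking, and signalling;
    // the subclass only defines what acquire and release mean.
    private static final class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int unused) {
            return compareAndSetState(0, 1);   // succeed only if currently free
        }
        @Override
        protected boolean tryRelease(int unused) {
            setState(0);                       // volatile write: visible to the next acquirer
            return true;                       // a waiter (if any) will be unparked
        }
        @Override
        protected boolean isHeldExclusively() {
            return getState() == 1;
        }
    }

    private final Sync sync = new Sync();

    public void lock()   { sync.acquire(1); }   // tryAcquire, else enqueue and park
    public void unlock() { sync.release(1); }   // tryRelease, then unpark successor
}
```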
What is the CLH queue and why does AQS use a variant of it?
Craig, Landin, Hagersten 1993. A linked list of Nodes where each node spins on a flag in its predecessor. AQS modifies the original CLH: instead of pure spinning, nodes park after a brief spin attempt (the standard approach for non-real-time systems). The benefits: O(1) enqueue/dequeue, no allocation on the unlock path (the head moves, doesn't get freed), and it's easy to reason about FIFO fairness.
How does Condition.await fit into all this?
AQS exposes ConditionObject which has its own singly-linked condition queue. await() releases the AQS lock fully, adds the thread to the condition queue, and parks. signal() pulls a node from the condition queue and migrates it to the main AQS wait queue. When that node eventually reaches the head, it tries to re-acquire the lock with the original hold count. Spurious wakeups are possible because LockSupport.park can return early; the await loop checks the condition again.
Why don't ReentrantLock and synchronized share an implementation?
synchronized is built into the JVM; it's part of bytecode (monitorenter/monitorexit) and the runtime can apply lock coarsening and lock elision because it sees the bytecode. (Biased locking was disabled by default in JDK 15 and removed in JDK 18.) ReentrantLock is a library type built on AQS; the JVM has no special knowledge of it. Tradeoff: synchronized gets JVM optimizations but no tryLock with timeout, no lockInterruptibly, no fair mode, no Condition. ReentrantLock gets all of those at the cost of being just a normal class. Java 21-23 ReentrantLock works with virtual threads without pinning the carrier (synchronized used to pin); JEP 491 in Java 24 fixed the synchronized-pinning case too.
Has AQS changed for virtual threads?
Yes. The parking primitive is LockSupport.park(blocker), which the JVM intercepts for virtual threads: instead of parking the carrier, it unmounts the virtual thread and frees the carrier to run another. This means ReentrantLock and friends never pinned a carrier under virtual threads, whereas synchronized did until JEP 491 (Java 24) reworked monitor support for Loom.
Gotchas
- Forgetting to make state writes happen via setState/compareAndSetState. They are volatile-safe; direct writes to other fields are not.
- In tryRelease, only return true when the lock is fully released (state == 0 for a reentrant mutex). Returning true early wakes a waiter that will then fail tryAcquire and re-park, wasting a context switch.
- Mixing exclusive and shared modes in the same synchronizer is possible but tricky (see ReentrantReadWriteLock). Don't try without first reading the source.
- Signal-on-Condition without holding the lock throws IllegalMonitorStateException. ConditionObject checks the AQS owner.
volatile in Java
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
volatile gives a single field two guarantees: every read sees the latest write from any thread (visibility), and reads and writes cannot be reordered across the access (ordering). It does not make compound operations like value++ atomic. Use it for status flags, write-once references, and the published reference in double-checked locking.
Key Points
- Visibility: a write to a volatile field is immediately visible to any thread that reads it.
- Ordering: writes that happened before the volatile write are visible to a reader that observes it. Reads after the volatile read see those writes too.
- NOT atomic for compound operations. value++ is read-modify-write, not one operation.
- Long and double on 32-bit JVMs are atomic only when declared volatile.
- For atomicity, use AtomicInteger / AtomicLong / AtomicReference. When both are needed, those classes already provide volatile semantics.
Interview Follow-up Questions
Is volatile a substitute for synchronized?
Only for single-field invariants. Volatile provides visibility and ordering on one field. Synchronized provides visibility, ordering, mutual exclusion, AND allows reasoning about multiple fields together. When the invariant spans more than one variable, use a lock or atomic.
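The canonical single-field case as a sketch: a stop flag written by one thread and read by another.

```java
class Poller implements Runnable {
    private volatile boolean running = true;   // visibility guaranteed across threads

    @Override
    public void run() {
        while (running) {                      // always observes the latest write
            pollOnce();
        }
    }

    void shutdown() {                          // called from another thread
        running = false;
    }

    private void pollOnce() { /* ... */ }
}
```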
What is the cost of volatile?
Per-access cost: a memory barrier on each write and a load barrier on each read. On modern CPUs that is on the order of tens of nanoseconds. Cheap for status flags. Measurable at millions of reads per second on the same field. For hot counters, use LongAdder, which is internally sharded.
Does volatile help with long and double on 32-bit JVMs?
Yes. The JLS only guarantees atomicity of 32-bit reads and writes. 64-bit long and double can be torn (a read picks up half the high word and half the low word). Declaring them volatile forces atomicity. On 64-bit JVMs in practice, HotSpot makes them atomic regardless, but the spec only requires it under volatile.
When prefer VarHandle over volatile?
When the same visibility/ordering is needed but with finer control over which barrier (acquire-only, release-only, opaque). Used in performance-critical lock-free code in JDK internals. Application code almost never needs it.
Gotchas
- volatile does not lock; multiple writers can still race on the same field
- volatile array reference is volatile, but reading array[i] is NOT (the element access is plain). Use AtomicReferenceArray for per-element ordering
- Long and double need volatile to be atomic on 32-bit JVMs
- Compound updates (value++, value = compute(value)) are not atomic under volatile alone
- synchronized provides everything volatile does, plus mutual exclusion. Do not stack them on the same field
Java Memory Model & Happens-Before
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
The JMM defines what reads can observe what writes across threads. The happens-before relation is the rule set, if action A happens-before action B, then A's effects are visible to B.
Key Points
- Without happens-before, the JVM/CPU can reorder reads/writes for performance
- volatile guarantees: visibility (latest value) + ordering (no reordering across the access)
- synchronized guarantees: mutual exclusion + happens-before on lock release → next acquire
- final fields: safely visible after constructor completes, no synchronization needed
- Thread.start() happens-before any action in the started thread
- Action in a thread happens-before another thread's successful join() return
Interview Follow-up Questions
What does volatile guarantee that a regular field doesn't?
Two things: (1) visibility, every read sees the latest write from any thread; (2) ordering, reads/writes are not reordered across the volatile access (acquire/release semantics).
Why doesn't volatile make ++ atomic?
++ is read-modify-write. volatile makes the read and write each atomic, but two threads can interleave and lose updates. Use AtomicInteger.incrementAndGet().
What is happens-before?
A partial ordering on memory operations. If A happens-before B, everything visible to A is visible to B. Established by volatile, synchronized, Thread.start/join, final field freeze, etc.
Are final fields thread-safe without synchronization?
Yes, the JMM guarantees any thread that observes the constructed object sees correctly initialized final fields, even without synchronization.
Does Python have a memory model like Java?
No formal one. CPython piggybacks on the GIL for atomicity of basic ops, but this is implementation-specific. Always synchronize explicitly, Lock, Event, Queue.
Gotchas
- Reading a volatile array reference is volatile, but reading array[i] is NOT, use AtomicReferenceArray
- synchronized on a String literal or autoboxed Boolean can deadlock unrelated code (shared instances)
- Constructors that publish 'this' before completion break the final-field guarantee
- long and double are NOT atomic on 32-bit JVMs without volatile
- Python CPython's atomic-bytecode behavior is not portable to PyPy or free-threaded CPython
Java Atomics: AtomicInteger, AtomicReference, LongAdder
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
java.util.concurrent.atomic gives lock-free single-variable updates: AtomicInteger / AtomicLong / AtomicReference for counters and references, LongAdder for high-contention counters, AtomicReferenceArray for per-element ordering. All built on hardware compare-and-swap.
Key Points
- All atomic classes provide volatile-style visibility plus atomic compound operations like incrementAndGet, compareAndSet, accumulateAndGet.
- Use AtomicLong for counters with low contention. Use LongAdder when many threads write the same counter and reads are rare.
- compareAndSet is the building block. Most other methods are CAS retry loops.
- AtomicReference.compareAndSet makes the reference swap atomic, but says nothing about the object behind the reference: reading the reference and then reading or mutating that object's fields does not interlock with concurrent writers.
- AtomicStampedReference adds a version counter to defeat the ABA problem.
Interview Follow-up Questions
AtomicLong vs LongAdder, how to choose?
LongAdder wins for write-heavy counters that are read infrequently, like metrics scraped every 10 seconds. AtomicLong wins for counters read on every request (header injection, rate limits) because get() is O(1). Benchmark the workload, but the rule of thumb is: many writers + rare reads = LongAdder.
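A sketch of both shapes in one class (names illustrative): LongAdder for the write-heavy counter, a CAS retry loop on AtomicLong for a running maximum.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class RequestStats {
    private final LongAdder requests = new LongAdder();       // many writers, scraped rarely
    private final AtomicLong maxLatencyNanos = new AtomicLong();

    void record(long latencyNanos) {
        requests.increment();
        long cur;
        do {
            cur = maxLatencyNanos.get();
            if (latencyNanos <= cur) return;                   // nothing to update
        } while (!maxLatencyNanos.compareAndSet(cur, latencyNanos)); // retry on lost race
    }

    long totalRequests() { return requests.sum(); }
    long maxLatency()    { return maxLatencyNanos.get(); }
}
```

The explicit loop shows the building block; in practice accumulateAndGet(latencyNanos, Math::max) packages the same retry.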
Is compareAndSet really lock-free?
It compiles to a single hardware CAS instruction (lock cmpxchg on x86, CAS or LDREX/STREX on ARM). One CPU instruction, no thread parking, no context switch. The retry loop on failure is busy-wait, but each iteration is fast and bounded by other threads' progress.
What is VarHandle and when is it needed?
VarHandle (Java 9+) gives explicit acquire / release / opaque / volatile semantics on a field, instead of the fixed volatile semantics of Atomic classes. Used by JDK internals and high-performance lock-free code. Application code almost never needs it: AtomicInteger / AtomicReference cover 99% of cases.
Why is AtomicReference not enough for safe lazy init?
AtomicReference publication is safe (set and compareAndSet have volatile semantics), so the reordering problem of classic broken double-checked locking does not apply. The real issue is duplicate work: several threads can each construct the object and race to publish it, with all but one result thrown away, which is wasteful or wrong when construction is expensive or has side effects. Use a CAS-based double-check to tolerate that, or just use the static-holder idiom, which is simpler.
Gotchas
- AtomicLong.get() then .set() is two operations, not one. Use accumulateAndGet or compareAndSet.
- AtomicReferenceArray protects per-element ordering; a volatile array reference does NOT.
- Under heavy contention, CAS retry loops can spin a lot. Move to LongAdder or sharded counters.
- Compound operations across multiple atomics need a lock; atomics are single-variable only.
- lazySet on Atomic* is a release store without the full fence: cheaper, but with weaker ordering and visibility guarantees than set(). Use only when the implications are well understood.
ThreadLocal & Inheritable State
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Sometimes
ThreadLocal stores per-thread values, every thread sees its own copy. Used for: request context, SimpleDateFormat (not thread-safe), DB connections per thread. Pitfalls: leaks in thread pools, doesn't inherit by default, virtual threads explode memory.
Key Points
- Each thread has its own value, isolated, no synchronization needed
- Common uses: request-id for logging, per-thread DB connections, SimpleDateFormat
- InheritableThreadLocal copies parent's value to child threads
- Pitfall: thread pool reuses threads, ThreadLocal values persist across requests → leaks
- ScopedValue is the modern replacement (preview in 21, finalised in Java 25): bounded scope, no leaks
Interview Follow-up Questions
Why is ThreadLocal a memory leak risk?
In a thread pool, the thread isn't destroyed between tasks. ThreadLocal values stay attached. Without `remove()`, the value persists until the thread eventually dies (which may be never for long-lived pools). Worse: the value can hold onto large objects (entire user contexts), causing slow OOM.
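The pool-safe pattern as a sketch (names illustrative): set at task start, remove in finally.

```java
final class RequestContext {
    private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    static void handle(String requestId, Runnable work) {
        REQUEST_ID.set(requestId);
        try {
            work.run();                    // anything on this thread can read the id
        } finally {
            REQUEST_ID.remove();           // not set(null): remove drops the map entry
        }
    }

    static String currentRequestId() {
        return REQUEST_ID.get();
    }
}
```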
ThreadLocal vs InheritableThreadLocal vs ScopedValue?
ThreadLocal: per-thread, manual cleanup, no inheritance. InheritableThreadLocal: per-thread, copies on creation (but only at thread creation, not on async dispatch). ScopedValue (preview in Java 21, finalised in Java 25): lexically scoped, auto-cleared, works with virtual threads, no leak risk. Use ScopedValue when on Java 25+; ThreadLocal for legacy code.
Why are ThreadLocals problematic with virtual threads?
Each virtual thread gets its own ThreadLocal map. With 1M virtual threads + 50 ThreadLocals + each holds 100 bytes → 5GB of memory just for ThreadLocal storage. Java's ScopedValue is designed for the virtual-thread era, bounded, lexically scoped, much less memory.
Where do logging frameworks use ThreadLocal?
MDC (Mapped Diagnostic Context) in SLF4J/Logback/Log4j is a ThreadLocal map. Set request ID in middleware → log statements automatically include it. Same pattern; same leak risk if middleware doesn't clear.
Gotchas
- ThreadLocal in a pool without remove() in finally → leak
- InheritableThreadLocal copies AT CREATION, async dispatch doesn't propagate (need explicit handoff)
- Virtual threads + many ThreadLocals → memory explosion (use ScopedValue)
- ThreadLocal.set(null) is NOT the same as remove(), null entry stays in the map
- MDC in async chains often loses the value, middleware must propagate or wrap
ExecutorService & Thread Pools
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
ExecutorService manages a pool of threads + a work queue. Use ThreadPoolExecutor with bounded queue + named threads + sensible rejection policy in production. newFixedThreadPool/newCachedThreadPool have unbounded queues, production hazards.
Key Points
- ThreadPoolExecutor (the actual class), full control
- Executors.newXxx, convenience but with footguns (unbounded queues)
- Always: bounded queue, named threads, explicit rejection policy
- shutdown() vs shutdownNow(), graceful vs forceful
- Java 21+ virtual thread executor, request-per-thread reborn
Interview Follow-up Questions
Why is newFixedThreadPool dangerous in production?
It uses an UNBOUNDED LinkedBlockingQueue. Burst load → queue grows without bound → OOM. Always use ThreadPoolExecutor explicitly with ArrayBlockingQueue(N) or another bounded queue.
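A sketch of the production-shaped constructor call (sizes are placeholders to tune):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class Pools {
    static ThreadPoolExecutor newWorkerPool() {
        AtomicInteger n = new AtomicInteger();
        return new ThreadPoolExecutor(
                8, 8,                               // fixed size: core == max
                60, TimeUnit.SECONDS,               // idle keep-alive (moot when core == max)
                new ArrayBlockingQueue<>(1_000),    // bounded queue: bursts beyond this hit the policy
                r -> new Thread(r, "worker-" + n.incrementAndGet()),   // named threads
                new ThreadPoolExecutor.CallerRunsPolicy());            // submitter runs the task: backpressure
    }
}
```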
What's the right rejection policy?
Default (AbortPolicy) throws RejectedExecutionException. CallerRunsPolicy makes the submitter run the task, natural backpressure. DiscardPolicy silently drops. DiscardOldestPolicy drops the oldest queued. CallerRuns is the most-used in production.
How to size a thread pool?
CPU-bound: pool = NumCPU. I/O-bound: Goetz's formula, cores × (1 + waitTime/computeTime). For Java 21 virtual threads: don't size at all; use newVirtualThreadPerTaskExecutor with semaphores for downstream-protection.
Is it necessary to shutdown the executor?
Yes. Without shutdown, the JVM doesn't exit because non-daemon worker threads stay alive. Always call shutdown() (or use try-with-resources in Java 19+). For tests, shutdown in @AfterEach to avoid thread leaks.
Gotchas
- Executors.newFixedThreadPool: unbounded queue → potential OOM
- Executors.newCachedThreadPool: unbounded thread count → potential OS thread exhaustion
- Forgetting to call shutdown() → JVM doesn't exit
- Future.get() without timeout → can wait forever if task hangs
- Worker thread throwing uncaught exception → killed silently; pool spawns replacement; bug invisible without UncaughtExceptionHandler
CompletableFuture & Async Composition
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
CompletableFuture composes async operations into pipelines: thenApply (transform), thenCompose (chain), allOf/anyOf (combine), exceptionally (recover), orTimeout (deadline). The Java equivalent of JS Promise chains, with explicit executor binding.
Key Points
- supplyAsync: launch async computation; returns CompletableFuture<T>
- thenApply: transform result (sync). thenCompose: chain another async operation
- allOf: wait for all. anyOf: wait for first.
- Always specify the Executor, default ForkJoinPool.commonPool() is shared
- exceptionally / handle for error recovery; orTimeout for deadlines
Interview Follow-up Questions
thenApply vs thenCompose, when?
thenApply for SYNC transforms. thenCompose when the next step itself returns a CompletableFuture (async). thenCompose is JavaScript's Promise.then semantics; thenApply is map. If the function `T → U` then thenApply; if `T → CF<U>` then thenCompose.
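A self-contained sketch of the distinction (the two-step pipeline and the names are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PipelineDemo {
    public static void main(String[] args) {
        ExecutorService ioPool = Executors.newFixedThreadPool(4);   // never rely on commonPool

        CompletableFuture<String> report =
                CompletableFuture.supplyAsync(() -> "user-42", ioPool)       // CF<String>
                        .thenCompose(user -> loadOrders(user, ioPool))       // async step: flattened, not nested
                        .thenApply(orders -> "report: " + orders)            // sync transform
                        .orTimeout(2, TimeUnit.SECONDS)                      // deadline for the whole chain
                        .exceptionally(ex -> "fallback report");             // recovery

        System.out.println(report.join());
        ioPool.shutdown();
    }

    // The next step itself returns a CompletableFuture -> thenCompose, not thenApply.
    static CompletableFuture<String> loadOrders(String user, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> user + ": 3 orders", pool);
    }
}
```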
Does CompletableFuture have a stack trace?
By default, no, async work runs on a worker thread, original call stack is lost. Use whenComplete with logging, or capture stack at submit time. Or use Project Reactor / RxJava which preserve more context. Java's structured concurrency (Java 21+) helps too.
Why specify the Executor in every async method?
Without it, async tasks run on ForkJoinPool.commonPool(), shared across the whole JVM. Long-running tasks starve other commonPool users (parallel streams, etc.). Always pass a dedicated executor.
Can a CompletableFuture be cancelled?
Sort of. cancel(true) marks the future as cancelled with CancellationException. But it does NOT interrupt the running thread, that's a persistent gotcha. Use the CompletableFuture chain to model cancellation explicitly (a separate cancelled future + check).
Gotchas
- Default executor is ForkJoinPool.commonPool(), shared, easily exhausted
- thenApply where thenCompose was needed → CF<CF<T>> (nested future)
- cancel(true) doesn't interrupt the running thread
- exceptionally hides the exception unless logged; whenComplete is better for visibility
- join() can hang if no timeout, always pair with orTimeout for production
ConcurrentHashMap
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Java's general-purpose concurrent map. Lock-free reads, CAS-based and per-bin locking on writes. Use computeIfAbsent and merge for atomic compound operations. Far better than synchronized HashMap or Collections.synchronizedMap for everything except trivial workloads.
Key Points
- Reads are completely lock-free. Writes lock per-bin (the linked list at one hash bucket).
- Java 8+ replaced segment locks with per-bin synchronisation + CAS. Resize is incremental and cooperative: threads that hit a resizing map help transfer bins.
- containsKey + put is racy. Use putIfAbsent or computeIfAbsent.
- size() is approximate under concurrent mutation. mappingCount() returns long for huge maps.
- Iterators are weakly consistent: they reflect the state at some point during iteration, not a snapshot.
Interview Follow-up Questions
How is CHM different from Collections.synchronizedMap?
synchronizedMap wraps every method in a single object lock. Two threads calling get serialise. CHM has lock-free reads and per-bin write locking, so concurrent reads and writes to different keys do not interfere. CHM is dramatically faster under contention and the only choice for serious workloads.
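A sketch of the atomic compound-update idioms that replace racy check-then-act (names illustrative):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class WordStats {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, List<Integer>> positions = new ConcurrentHashMap<>();

    void record(String word, int position) {
        counts.merge(word, 1L, Long::sum);                        // atomic upsert of a counter
        positions.computeIfAbsent(word, k -> new CopyOnWriteArrayList<>())
                 .add(position);                                  // bin is locked only while creating the list
    }
}
```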
Why is size() approximate?
Under concurrent mutation, the count is maintained across multiple per-bin counters (so writes do not all contend on one). On a size() call, CHM sums them in a snapshot that may not reflect concurrent updates that happened during the sum. For most uses (capacity hints, metrics) the approximation is fine. For exact count, freeze the map first.
Can CHM be used as a concurrent set?
Yes. Either use ConcurrentHashMap.newKeySet() (returns Set view), or use the map with values that get ignored (Boolean.TRUE). The newKeySet variant is cleaner and uses ConcurrentHashMap.KeySetView under the hood.
What is wrong with iterating a CHM?
Iterators are weakly consistent: they will not throw ConcurrentModificationException, but they may or may not reflect mutations made during iteration. They reflect the state at some point during the iteration, not a snapshot. Acceptable for monitoring, not for transactional reads.
Gotchas
- computeIfAbsent function should be fast and side-effect-free. While it runs, the bin is locked.
- Recursive update inside computeIfAbsent (calling map.put on the same map) can deadlock or throw.
- size() sums striped counter cells and is approximate under concurrent mutation; cache it when called on a hot path.
- Default initial capacity is 16, default concurrency level was historically 16. In Java 8+ concurrencyLevel is just a hint.
- Null keys and null values are forbidden. Throws NullPointerException.
BlockingQueue Family
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
BlockingQueue is the Java interface for thread-safe producer-consumer queues. Pick by overflow behaviour and ordering: ArrayBlockingQueue (bounded), LinkedBlockingQueue (default unbounded, dangerous), PriorityBlockingQueue (ordered), SynchronousQueue (no buffer, hand-off), DelayQueue (scheduled), TransferQueue (sender knows when received).
Key Points
- put blocks if full, take blocks if empty. offer/poll have non-blocking and timed variants.
- ArrayBlockingQueue is the default safe choice for production: bounded, simple, predictable.
- LinkedBlockingQueue defaults to UNBOUNDED. Always pass a capacity for production.
- SynchronousQueue has zero buffer: producer waits for consumer to take directly. Used by Executors.newCachedThreadPool.
- All implementations are fully thread-safe: put/take/poll/offer can be called from any thread.
Interview Follow-up Questions
Why is LinkedBlockingQueue dangerous by default?
Its no-arg constructor uses Integer.MAX_VALUE as capacity, which is effectively unbounded. Under burst load, the queue grows without bound and the JVM OOMs. Always pass an explicit capacity. When in doubt about which bounded queue to pick, use ArrayBlockingQueue.
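A bounded producer-consumer sketch (sizes and the poison-pill shutdown are illustrative choices):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    private static final String POISON = "__STOP__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);   // always bounded

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String item = queue.take();        // blocks when empty
                    if (item.equals(POISON)) return;
                    System.out.println("processed " + item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // restore the flag, then exit
            }
        }, "consumer");
        consumer.start();

        for (int i = 0; i < 1_000; i++) {
            queue.put("job-" + i);                     // blocks when full: natural backpressure
        }
        queue.put(POISON);
        consumer.join();
    }
}
```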
ArrayBlockingQueue vs LinkedBlockingQueue: which is faster?
Depends on the workload. ArrayBlockingQueue uses one lock for put and take; LinkedBlockingQueue uses two (head and tail), so it can have one producer and one consumer not blocking each other. For low contention, ArrayBlockingQueue's simpler design wins. For very high contention with separate producer and consumer threads, LinkedBlockingQueue can win.
When to use SynchronousQueue?
When zero buffering is required: every produced item must be handed off to a waiting consumer. Used as the work queue in newCachedThreadPool to force thread creation per task. Also used in some hand-off algorithms where buffering would mask a problem.
What is TransferQueue and why does it matter?
TransferQueue extends BlockingQueue with a transfer method: the producer not only puts the value, it also waits for a consumer to take it. Useful for request-response patterns where the producer needs confirmation that the consumer received the work.
Gotchas
- LinkedBlockingQueue with no arg = unbounded = OOM risk under burst
- Mixing add/remove (throw) with put/take (block) leads to inconsistent error handling
- PriorityBlockingQueue is unbounded; use a separate semaphore for bounded priority work
- Iterator on a BlockingQueue is weakly consistent, not a snapshot
- Swallowing InterruptedException in put/take loses the cancellation request; restore the interrupt flag or propagate the exception
Sync Utilities: Latch, Barrier, Semaphore, Phaser, Exchanger
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Five coordination primitives in java.util.concurrent. CountDownLatch: one-shot start signal. CyclicBarrier: rendezvous N threads, reusable. Semaphore: limit concurrent access to N permits. Phaser: dynamic-party barrier. Exchanger: hand off a value between two threads.
Key Points
- CountDownLatch is one-shot. Once it hits zero, it stays zero. Cannot reset.
- CyclicBarrier is reusable. After N threads arrive, it triggers an optional action and resets.
- Semaphore.acquire() blocks for a permit. Use it for connection pools, rate limiters, bounded concurrent execution.
- Phaser is the flexible cousin of CyclicBarrier: parties can register and deregister mid-flight.
- Exchanger pairs exactly two threads. Each calls exchange(value) and gets the other's value.
Interview Follow-up Questions
When to use CountDownLatch vs CyclicBarrier?
CountDownLatch when the count and the waiters are different sets of threads. One thread (or N) counts down; M threads wait. Used for one-shot start gates and 'wait for all workers to finish'. CyclicBarrier when the same N threads need to rendezvous repeatedly. Used for iterative parallel algorithms.
What does 'fair' mean for Semaphore?
Fair = FIFO order. Threads acquire in the order they called acquire(). Unfair = any waiting thread can win. Fair is slower (more scheduling, more context switches) but prevents starvation. For low-contention pools the cost is invisible; for high-contention rate limiters fair is the right default.
Phaser vs CyclicBarrier?
Phaser is a superset. CyclicBarrier requires a fixed party count at construction. Phaser lets parties register and deregister mid-flight. Phaser also tracks phase number, so threads can ask 'which phase are we on'. Use CyclicBarrier when count is fixed; reach for Phaser when it is not.
Are these primitives ever the wrong tool?
Yes, often. To 'wait for N futures to complete', CompletableFuture.allOf is cleaner than a CountDownLatch. For a producer-consumer queue, BlockingQueue is cleaner than rolling one with Semaphore. These primitives are foundations; reach for higher-level utilities first when they fit.
Gotchas
- CountDownLatch cannot be reset. For a reusable variant, use CyclicBarrier or Phaser.
- Forgetting to release a Semaphore permit (no try/finally) leaks permits permanently
- CyclicBarrier with a barrier action: that action throws? All waiting threads get BrokenBarrierException.
- Unfair Semaphore can starve a thread under sustained contention.
- Phaser deregistration is one-shot; deregistering twice from one thread will desync the count.
ReadWriteLock & StampedLock
Section: Java Concurrency | Difficulty: Advanced | Frequency: Sometimes
ReentrantReadWriteLock allows many concurrent readers OR one writer. StampedLock adds an optimistic-read mode that does not block writers. Use ReadWriteLock when reads dominate writes by 10x or more. Use StampedLock when reads vastly dominate and a retry is tolerable. Both are sharper than they look.
Key Points
- ReentrantReadWriteLock: many readers OR one writer at a time. Readers do not block readers.
- Read locks are reentrant. Write locks are reentrant. Holding write lock and acquiring read = OK. Holding read lock and acquiring write = DEADLOCK on the same thread.
- StampedLock has optimistic read: no lock acquisition, validate after the read. Falls back to read lock if validation fails.
- StampedLock is NOT reentrant. Calling readLock twice on the same thread deadlocks.
- Both are heavier than synchronized for low-contention or write-dominant code. Measure.
Interview Follow-up Questions
When is ReadWriteLock actually faster than synchronized?
When reads dominate by 10x or more AND the read holds the lock long enough that contention matters. For very short reads, the cache-line bouncing from incrementing the reader count can make ReadWriteLock slower than just synchronized. Profile both.
When to use StampedLock?
Read-heavy code where reads are very cheap (a few field reads) and writes are rare. Geometry caches, configuration that updates seconds apart, lookup tables. Avoid for code that needs reentrancy or for reads with side effects (the optimistic read can run twice on validation failure).
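The optimistic-read pattern as a sketch, essentially the shape the JDK documentation recommends:

```java
import java.util.concurrent.locks.StampedLock;

class Point {
    private final StampedLock lock = new StampedLock();
    private double x, y;

    void move(double dx, double dy) {
        long stamp = lock.writeLock();
        try {
            x += dx;
            y += dy;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    double distanceFromOrigin() {
        long stamp = lock.tryOptimisticRead();   // no blocking, no lock acquisition on the fast path
        double curX = x, curY = y;               // reads must be cheap and side-effect-free
        if (!lock.validate(stamp)) {             // a writer slipped in: retry under a real read lock
            stamp = lock.readLock();
            try {
                curX = x;
                curY = y;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return Math.hypot(curX, curY);
    }
}
```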
Why is StampedLock not reentrant?
Reentrancy requires per-thread bookkeeping that adds cost. StampedLock's design pushed that cost out to keep the fast path tight. The trade is that calling readLock or writeLock on a lock the same thread already holds will self-deadlock.
Can readers starve writers?
With unfair ReadWriteLock, yes. New readers can keep arriving while a writer waits, and the writer never gets in. Use the fair constructor (true) to prevent that. StampedLock does not have a fairness option; in practice it is biased toward writer fairness.
Gotchas
- Read lock cannot be upgraded to write lock on ReentrantReadWriteLock. Drop and re-acquire.
- StampedLock readers must be side-effect-free and cheap. Optimistic read may execute and then re-execute.
- StampedLock is NOT reentrant. Holding read or write and re-acquiring the same hangs.
- Unfair ReadWriteLock can starve writers under sustained read load.
- For very short critical sections, plain synchronized often beats both.
ForkJoinPool & Parallel Streams
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Sometimes
ForkJoinPool is Java's work-stealing thread pool optimised for divide-and-conquer recursion. RecursiveTask / RecursiveAction split work, fork() schedules, join() waits. Parallel streams run on the common pool. Sharp tool: useful for CPU-bound recursive work, terrible for blocking I/O.
Key Points
- Work stealing: each worker has its own deque; when idle, it steals from another worker's deque tail.
- Designed for divide-and-conquer with no blocking. Tasks split until small, compute, combine.
- Common pool is shared across the JVM. Parallel streams use it. Blocking inside it starves everyone.
- Default parallelism = available cores - 1. Override with -Djava.util.concurrent.ForkJoinPool.common.parallelism.
- For CPU-bound recursive work, fork-join can outperform a fixed thread pool. For I/O, use a dedicated executor.
Interview Follow-up Questions
Why is the common pool shared across the JVM?
Performance and simplicity. Allocating a new ForkJoinPool per parallel stream call would create many pools and many sets of threads, all contending for the same cores. The common pool centralises that. The downside is that every user of parallel streams shares it, so one badly-behaved task affects everyone.
When to avoid parallel streams?
For pipelines where the per-element cost is small (the parallelisation overhead dominates), for I/O-bound work (the pool starves), and for ordered pipelines where preserving order matters (parallel streams have order-preservation modes but they cost). Default to sequential streams; reach for parallel only when the elements are expensive and CPU-bound.
What does work stealing buy?
Load balancing for free. With a fixed-size pool and a global queue, all workers contend on the queue head. With per-worker deques, contention only happens when a worker steals (rare on balanced workloads). Work stealing turns out to be the right scheduler for divide-and-conquer where subtask sizes are uneven and unpredictable.
How big should THRESHOLD be?
Big enough that the per-task overhead (fork, queue, join) is small relative to the leaf work, but small enough to leave many leaves to balance across cores. Rule of thumb: aim for 10x to 100x the number of cores in total tasks. Profile to tune.
Gotchas
- Blocking inside the common pool starves every other parallel stream caller
- Recursive tasks that share mutable state defeat the model. Compose pure functions.
- Parallel stream parallelism = commonPool size, NOT number of stream sources
- fork() then immediately join() on the same task is just compute(); fork-join shines when one side runs inline and the other forks
- Setting common.parallelism too high (>cores) wastes context switches; too low (=1) defeats the point
Structured Concurrency (Java 21+)
Section: Java Concurrency | Difficulty: Advanced | Frequency: Sometimes
StructuredTaskScope (preview since Java 21, redesigned around Joiners in Java 25's fifth preview, JEP 505) ties subtasks to a parent scope. The scope owns lifetimes: when it exits, all subtasks are cancelled or joined. Replaces the unstructured spawn-and-forget pattern of plain ExecutorService for fan-out work. Pairs naturally with virtual threads.
Key Points
- Subtasks are scoped to a try-with-resources block. Scope close = all subtasks cancelled or joined.
- Fork-join is structured: parent waits for children, children inherit deadlines and cancellation.
- anySuccessful joiner: first success cancels siblings. Used for hedged calls, fastest-replica reads.
- allSuccessfulOrThrow: all must succeed or all are cancelled. Used for fan-out where every result matters.
- Pairs naturally with virtual threads: cheap to spawn 1000 subtasks, scope cleans up reliably.
Interview Follow-up Questions
Why is this called 'structured' concurrency?
Concurrent control flow follows the same lexical structure as sequential control flow. Subtasks live within the lexical block of their scope; when the block exits, they are guaranteed done or cancelled. Like try-with-resources for concurrency. The opposite of unstructured ExecutorService.submit, where the future outlives the calling code and orphans are easy.
How does this interact with virtual threads?
Beautifully. Virtual threads make spawning a subtask cheap (microseconds, no OS thread). StructuredTaskScope by default spawns each subtask on its own virtual thread. The pair turns 'fan out 100 calls and wait for all' from a careful executor configuration into a 5-line block.
What about exceptions?
With allSuccessfulOrThrow, the first failure cancels remaining subtasks and join() throws the failure. With anySuccessfulOrThrow, all-failed throws an exception aggregating all of them. The scope handles both the propagation and the cancellation; no per-fork try/catch is needed.
Is this stable yet?
Not yet final. JEP 505 ships it as a fifth preview in Java 25 (2025), with the API redesigned around Joiners; earlier Java versions had it as a preview with a slightly different shape. For new code on Java 25+ (preview features enabled), prefer StructuredTaskScope over plain ExecutorService for any fan-out that waits for the results before returning.
Gotchas
- scope.join() must be called before reading any subtask result. Otherwise IllegalStateException.
- Subtask.get() before join() also throws. Order matters.
- Forgetting try-with-resources defeats the structuring. Always use the resource block.
- Joiners are typed by result type. Mixing forks with different return types complicates the joiner.
- Subtasks inherit the parent's interrupt status. Cancel propagates downward through nested scopes.
JMH: Benchmarking Concurrent Java
Section: Java Concurrency | Difficulty: Advanced | Frequency: Sometimes
JMH (Java Microbenchmark Harness) is the only Java microbenchmarking tool worth trusting. Hand-rolled timing loops give meaningless numbers because the JIT optimises away the work, dead-code-eliminates results, and warm-up effects dominate. JMH manages all of that and produces statistically sound measurements.
Key Points
- JIT warmup matters: cold code runs interpreted, warm code is fully JIT-compiled. Measure only warm.
- Dead code elimination silently deletes unconsumed results. Use Blackhole or return values.
- Loop unrolling and constant folding can vaporise simple benchmarks. Inputs must come from @State.
- @Threads(N) measures contention. Per-thread state vs shared state matters.
- Reported numbers are op/sec or ns/op with confidence intervals. Trust the CIs.
Interview Follow-up Questions
Why not just use System.nanoTime() in a loop?
Three reasons. JIT warmup: the first thousand iterations run interpreted or partially compiled, giving wildly different timings than the steady state. Dead code elimination: if the result is unused, the JIT may remove the work entirely. Loop unrolling and constant folding: simple inputs let the JIT precompute, so the measurement reflects a literal-load instead of compute. JMH addresses all three.
How many forks to use?
Start with 2-3. JVM-to-JVM variance comes from GC layout, code cache layout, JIT decisions. Multiple forks let the harness see that variance and report it in the confidence intervals. For publish-quality numbers, 5-10 forks. For day-to-day work, 2 is fine.
Does JMH measure contention correctly?
Yes, with @Threads(N). The N benchmark threads run the same @Benchmark method concurrently. Pair with @State(Scope.Benchmark) for shared state to measure contention. JMH also has @Group for asymmetric workloads (some threads do A, others do B).
What is the relationship between JMH and async-profiler?
JMH reports how fast something is. Async-profiler reports where the time goes. Both are essential. JMH-only gives a number with no path to improvement. Profiler-only gives a flame graph but no measurement of impact. Use them together: JMH to compare, profiler to explain the comparison.
Gotchas
- Hand-rolled timing loops are almost always wrong. Use JMH.
- Returning the result is the easy way to defeat DCE; use Blackhole when a return is not possible
- @State scope decides shared vs per-thread; getting it wrong measures the wrong thing
- Single-fork results are noisy; multiple forks expose JVM-to-JVM variance
- Running benchmarks from the IDE is misleading; build the benchmark JAR and run it from the CLI
The Python GIL, What It Is and Why It Matters
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
The Global Interpreter Lock is a mutex that serialises execution of Python bytecode in CPython. It makes single-threaded code fast and simple, but means threads cannot run Python code in parallel, even on a 64-core machine. Reach for multiprocessing for CPU-bound work, threading or asyncio for I/O.
Key Points
- The GIL is a CPython implementation detail, not part of the language spec
- Threads release the GIL around blocking I/O calls, so threading IS useful for I/O work
- Threads do NOT help CPU-bound code in CPython, they serialise via the GIL
- GIL switches between threads roughly every 5ms (sys.setswitchinterval)
- Jython and IronPython have no GIL (PyPy does have one); CPython 3.13+ has experimental free-threaded mode (PEP 703)
- Even with the GIL, compound operations like += are NOT atomic; the GIL can switch threads between bytecodes
Interview Follow-up Questions
Why does the GIL exist?
Two reasons: (1) it makes the CPython interpreter implementation simple, reference counting and other internals don't need fine-grained locks; (2) it makes single-threaded code fast and most Python code is single-threaded. Removing the GIL has historically slowed single-threaded performance, that's the tradeoff free-threaded CPython is now navigating.
If the GIL serialises Python bytecode, why is += not atomic?
The GIL is released between bytecode operations to let other threads run. += is multiple bytecodes (LOAD, ADD, STORE). Two threads can read the same value, both add 1, both write, losing one update. Use threading.Lock or queue.Queue to make compound operations atomic.
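A minimal sketch of the fix (all names hypothetical): guard the compound update with a threading.Lock. Removing the `with lock:` line makes the final count unreliable, depending on CPython version and timing.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n: int) -> None:
    global counter
    for _ in range(n):
        # Without the lock this read-modify-write spans several bytecodes and
        # can lose updates when the GIL switches threads mid-sequence.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; possibly less without it
```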
Does the GIL apply to NumPy / Pandas / TensorFlow operations?
No, these libraries release the GIL inside their C/Fortran code. NumPy operations on large arrays run in parallel across threads. The GIL only affects pure Python bytecode. This is why scientific Python is fast despite the GIL.
Will the GIL be removed?
PEP 703 (accepted 2023) introduces free-threaded CPython, opt-in starting 3.13. The rollout is gradual: experimental first, then officially supported, then eventually the default. Timeline is years. Until then, the GIL is here.
Why does asyncio not have GIL problems?
asyncio is single-threaded, only one piece of code runs at a time. The GIL becomes irrelevant because there's no parallelism to serialise. asyncio achieves concurrency by interleaving tasks at await points, not by running them simultaneously.
Gotchas
- Adding threading to CPU-bound code makes it slower, measure before adding
- multiprocessing on Windows/macOS uses 'spawn' by default, globals from main aren't copied
- Pickle errors when passing un-picklable objects (lambdas, locks) to ProcessPoolExecutor
- asyncio + a CPU-heavy synchronous call → entire event loop stalls until it finishes
- Mixing threads and asyncio is a footgun, prefer asyncio.to_thread() for blocking calls
- The GIL is released around blocking I/O, but NOT around CPU loops in pure Python
threading vs multiprocessing, Picking the Right Tool
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
threading shares memory and is great for I/O-bound work because the GIL releases around blocking calls. multiprocessing spawns separate processes that bypass the GIL, the only way to get real CPU parallelism in stock CPython.
Key Points
- threading: shared memory, light, GIL-bound, pick for I/O concurrency
- multiprocessing: separate memory, ~50ms startup each, no GIL, pick for CPU parallelism
- concurrent.futures gives a unified API over both, Executor, submit(), map(), Future
- Process startup method differs by OS: 'fork' on Linux (fast, copy-on-write), 'spawn' on macOS/Windows (slower, isolated)
- multiprocessing requires picklable arguments; lambdas, locks, and sockets won't cross the boundary
- Process pools are best for medium-large CPU tasks; small tasks lose to IPC overhead
Interview Follow-up Questions
When does multiprocessing pool overhead outweigh the parallelism win?
When tasks are small. Each task pickle + IPC + result unpickle costs ~100μs. If your task takes 50μs, processes will be slower than serial. Rule of thumb: process pool wins when each task is at least a few ms and there are enough tasks to amortize startup.
Why are sockets and locks not picklable?
Their state isn't meaningful in another process, a socket file descriptor or a lock identity in process A makes no sense in process B. multiprocessing rejects them rather than silently producing broken state. Use Manager.Lock() to cross processes.
How can a numpy array be shared across processes without pickling it?
Use multiprocessing.shared_memory.SharedMemory (3.8+), the array lives in OS shared memory, processes attach by name. Or use a memory-mapped file. Or use a multiprocessing.Array. Manager-backed arrays work but are slow because every access proxies through IPC.
concurrent.futures or raw threading/multiprocessing?
concurrent.futures, almost always. It's the modern API: Executor abstraction, Future objects, clean shutdown via context manager, exception propagation, as_completed iteration. Reach for raw threading.Thread only for long-lived workers with custom lifecycles.
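A rough sketch of that unified API (function names are placeholders): the only change between I/O concurrency and CPU parallelism is which Executor class is constructed.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n: int) -> int:
    # Stand-in for Python-level CPU work (parsing, math, ...).
    return sum(i * i for i in range(n))

def run(executor_cls) -> list[int]:
    # Identical API for threads and processes: swap the class, keep the code.
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_task, [200_000] * 8))

if __name__ == "__main__":          # guard required for spawn-based process pools
    run(ThreadPoolExecutor)         # GIL-bound: no speedup for this CPU task
    run(ProcessPoolExecutor)        # one interpreter per worker: real parallelism
```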
Is a thread pool inside each worker process viable?
Yes, common pattern for mixed CPU+I/O workloads. ProcessPoolExecutor at the top, ThreadPoolExecutor inside each worker. The processes get CPU parallelism; the threads inside each handle blocking I/O without pinning a whole core to wait.
Gotchas
- Forgetting `if __name__ == '__main__':` guard on Windows, re-imports module recursively, infinite spawn
- Module-level state set by the parent isn't visible to spawn-mode children, must pass explicitly
- Pickle errors on lambdas, use def or functools.partial
- ProcessPoolExecutor swallows worker crashes, wrap in try/except and check fut.exception()
- fork on macOS is officially discouraged (3.8+) due to subtle bugs with system libraries
- A `with` block closes the Pool automatically; otherwise pool.close() + pool.join() are required or workers leak
asyncio Fundamentals
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
asyncio runs many tasks on a single thread by interleaving them at await points. It scales to 100K+ concurrent I/O operations because each task is just a few KB instead of a thread's megabytes, but a single CPU-heavy synchronous call freezes the entire event loop.
Key Points
- asyncio is single-threaded, concurrency without parallelism
- await is a yield point: control returns to the event loop, other tasks run
- Tasks are cheap (~few KB); spawning 100K is realistic
- ANY blocking call (requests.get, time.sleep, DB driver) stalls the entire event loop
- asyncio.to_thread() bridges blocking code into the async world
- The library being called must be async-aware, sync libraries don't 'become' async
Interview Follow-up Questions
Why is asyncio single-threaded?
By design, cooperative scheduling on a single thread eliminates the need for locks and avoids GIL contention. Each await point is an explicit yield, so race conditions are dramatically rarer than with threads. The tradeoff is no CPU parallelism within one event loop.
What's the difference between gather and wait?
gather() waits for ALL tasks, returns results in input order, propagates exceptions. wait() returns (done, pending) sets and accepts return_when=FIRST_COMPLETED, FIRST_EXCEPTION, or ALL_COMPLETED. gather is the common case; wait is for first-finished semantics.
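A minimal gather sketch showing the interleaving: three tasks that each await complete in roughly the time of the slowest one, and the results come back in input order.

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # await is the yield point: while this task sleeps, the loop runs the others.
    await asyncio.sleep(delay)
    return name

async def main() -> None:
    results = await asyncio.gather(fetch("a", 0.3), fetch("b", 0.1), fetch("c", 0.2))
    print(results)  # ['a', 'b', 'c'] after ~0.3s total, not 0.6s

asyncio.run(main())
```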
How is an asyncio task cancelled?
task.cancel() schedules a CancelledError to be raised inside the task at the next await. The task should catch it, do brief cleanup, and re-raise. asyncio.wait_for and asyncio.timeout() are higher-level wrappers that cancel + raise TimeoutError.
Can asyncio run inside a thread?
Yes, each thread can run its own event loop with asyncio.run(). Useful when integrating async code into a thread-pool-based legacy service. Don't share async objects (Locks, Queues) across threads, they're tied to one event loop.
Why does requests not work with asyncio?
requests is synchronous, its socket I/O blocks the calling thread. In an event loop, that blocks every concurrent task. Use aiohttp or httpx (async mode) instead, or wrap with asyncio.to_thread() when stuck with requests.
Gotchas
- Forgetting `await` calls a coroutine but doesn't run it; emits a 'coroutine was never awaited' warning
- Calling sync DB drivers inside async, silent event loop stall, tail latency spikes
- Swallowing CancelledError, breaks asyncio.wait_for and graceful shutdown
- Creating tasks without awaiting/storing them, task may be garbage collected mid-run
- Calling asyncio.run() inside an already-running event loop raises RuntimeError; await the coroutine or use asyncio.create_task() instead
- Using time.sleep in async code; asyncio.sleep is the right tool (the sync one blocks the loop)
Python threading: Lock, RLock, Condition, Event, Barrier
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
threading module ships the usual suspects: Lock (mutex), RLock (reentrant), Condition (wait/notify), Event (one-shot signal), Semaphore (counted permits), Barrier (rendezvous N threads). The GIL makes Python thread interleaving look atomic for single bytecodes but not for multi-step operations, so these primitives are still needed.
Key Points
- Lock acquired twice on the same thread = deadlock. RLock allows recursive entry.
- Even with the GIL, operations like dict[k] += 1 are NOT atomic (load, add, store across multiple bytecodes).
- Condition.wait() releases the lock and re-acquires on wakeup. Always pair with a predicate loop to handle spurious wakeups.
- Event is the right tool for one-time signals (shutdown flag, ready signal). Don't roll a custom Lock+flag.
- queue.Queue is built on these and provides a thread-safe producer-consumer for free.
Interview Follow-up Questions
Doesn't the GIL make all operations atomic?
No. The GIL makes individual bytecodes atomic (a single LOAD or STORE happens entirely or not at all). But Python operations like x += 1 compile to multiple bytecodes (LOAD, ADD, STORE), and the GIL can be released between them. Two threads doing x += 1 can lose an update. dict[k] = v on the other hand IS atomic on CPython (single STORE_SUBSCR), though that is an implementation detail and not safe to rely on.
Lock vs RLock: which is the default?
Lock. It is faster and surfaces the 'same thread re-acquires' bug as an immediate deadlock rather than hiding it. Reach for RLock only when nested method calls genuinely all need the lock and refactoring to acquire once at the top is not feasible. RLock often masks a design issue.
When is queue.Queue preferable over Lock + Condition?
Almost always for producer-consumer. queue.Queue handles the locking, the bounded capacity, the wait on full/empty, the shutdown protocol (with sentinels). Hand-rolling with Lock+Condition is for cases where the predicate is more complex than queue size, or where a custom data structure beyond FIFO is needed.
Why is Event better than a Lock + boolean for shutdown?
Event is one-shot: once set, every wait() returns immediately, including future waits. With Lock + boolean, every loop iteration must recheck and cannot block efficiently. Event allows wait with a timeout, which is exactly what worker loops want ('do work for at most 1 second, then check shutdown').
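A sketch of that worker loop (structure is illustrative, not prescriptive): wait(timeout) doubles as both the pacing sleep and the shutdown check.

```python
import threading

shutdown = threading.Event()

def worker() -> None:
    while not shutdown.is_set():
        # ... do one unit of work ...
        # Sleep up to 1 second OR wake immediately if shutdown is set.
        shutdown.wait(timeout=1.0)

t = threading.Thread(target=worker)
t.start()
# later, from any thread:
shutdown.set()   # every current and future wait() returns immediately
t.join()
```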
Gotchas
- Holding a Lock across a yield from / await blocks the whole asyncio loop
- Forgetting `with` and using acquire/release manually leaks locks on exception
- Condition.wait() must be called inside a loop on the predicate; single 'if' is wrong
- Barrier with action: the action runs ONCE on whichever thread arrives last; exceptions break the barrier
- BoundedSemaphore raises if release() exceeds initial count; Semaphore silently allows it
queue Module: Queue, LifoQueue, PriorityQueue
Section: Python Concurrency | Difficulty: Basic | Frequency: Asked Often
queue.Queue is the standard thread-safe FIFO. LifoQueue is a stack. PriorityQueue is a heap. SimpleQueue is an unbounded FIFO with a reduced feature set and lower overhead. All handle the locking internally; pair with a sentinel (None) to signal shutdown to consumers.
Key Points
- queue.Queue is thread-safe by design. No external lock needed.
- maxsize > 0 makes the queue bounded; put() blocks when full. Backpressure for free.
- Standard shutdown: producer puts a sentinel (None); consumer breaks when it sees one.
- task_done() + join() let the producer wait for all submitted work to be processed.
- queue.Queue is for THREADS. asyncio has its own asyncio.Queue. multiprocessing has multiprocessing.Queue.
Interview Follow-up Questions
queue.Queue or asyncio.Queue or multiprocessing.Queue?
Different worlds, do not mix. queue.Queue is for threading, asyncio.Queue is for coroutines (async put/get), multiprocessing.Queue is for processes (pickled across the boundary). Pick by the producer/consumer concurrency model. Mixing them (calling queue.Queue.get from an async coroutine) blocks the event loop.
Why is queue.SimpleQueue faster than queue.Queue?
Queue has a configurable maxsize, supports task_done/join, and uses a lock + condition. SimpleQueue is FIFO-only, unbounded, and implemented in C with far less machinery. For high-throughput producer-consumer that doesn't need bounding or task_done, SimpleQueue can be measurably faster.
What goes in PriorityQueue tuples?
(priority, tiebreaker, payload). The tiebreaker (an itertools.count() value or a timestamp) prevents the heap from comparing payloads when priorities are equal. Without it, two equal-priority items try to compare payload, which fails for non-orderable types like dicts or custom objects without __lt__.
How are workers shut down cleanly?
Two patterns. Sentinel: producer puts N copies of None (one per worker), each worker breaks when it sees one. Event: set a threading.Event, workers check it on each iteration with a short timeout on get(). Sentinel is cleaner when N workers is known; Event is cleaner for dynamic worker pools.
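A compact sentinel-based sketch (queue size and worker count are arbitrary): one sentinel per worker, bounded queue for back-pressure.

```python
import queue
import threading

SENTINEL = None
q: queue.Queue = queue.Queue(maxsize=100)      # bounded: put() blocks when full

def consumer() -> None:
    while True:
        item = q.get()
        if item is SENTINEL:                   # shutdown signal
            break
        print("processed", item)

workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()

for i in range(10):                            # producer
    q.put(i)
for _ in workers:
    q.put(SENTINEL)                            # one sentinel per worker
for w in workers:
    w.join()
```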
Gotchas
- queue.Queue is for threads only; asyncio coroutines must use asyncio.Queue
- PriorityQueue tuples need a tiebreaker; otherwise equal priorities try to compare payloads
- task_done() called more times than put() raises ValueError
- maxsize=0 is the default and means UNBOUNDED. Always set explicitly for production.
- join() blocks the calling thread; if no consumers are running, it blocks forever
concurrent.futures: ThreadPoolExecutor & ProcessPoolExecutor
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
concurrent.futures is the high-level API for running callables on a thread or process pool. ThreadPoolExecutor for I/O-bound work (the GIL is released during blocking I/O). ProcessPoolExecutor for CPU-bound work (one Python interpreter per process, true parallelism). Submit returns a Future; map streams results in input order.
Key Points
- ThreadPoolExecutor: I/O-bound (HTTP, DB, file). The GIL releases during blocking syscalls so multiple threads make progress.
- ProcessPoolExecutor: CPU-bound (parsing, image processing, math). True parallelism, but pickling cost on every call.
- executor.map preserves input order. as_completed yields in completion order. Pick by what the downstream consumer needs.
- Always use a context manager (with). Forgetting to shut down the executor leaks worker threads or processes.
- Default max_workers: min(32, cpu_count+4) for ThreadPool, cpu_count for ProcessPool. Tune to the workload.
Interview Follow-up Questions
ThreadPoolExecutor or ProcessPoolExecutor: how to choose?
When work is dominated by waiting (HTTP, DB queries, disk I/O), use threads. The GIL releases during blocking I/O syscalls, so threads make real concurrent progress. When work is dominated by Python-level CPU (parsing, math, image processing without numpy releasing the GIL), use processes. For CPU work in a C extension that releases the GIL (numpy, opencv), threads work too.
What is the cost of ProcessPoolExecutor?
Process startup (fork or spawn), and pickling every argument and every return value. For tiny workloads, the overhead exceeds the speedup. Rule of thumb: each task should do at least 10ms of work to justify process overhead, more on Windows where spawn is slower than fork.
Why not just use threading.Thread directly?
concurrent.futures provides the pool (bounded concurrency), the Future abstraction (result, exception, callback), error handling, and timeouts in one API. Plain threading.Thread is one thread per call, no result mechanism, no error propagation back to the caller. concurrent.futures is what's wanted 95% of the time.
Can a Future be cancelled mid-execution?
No. cancel() only succeeds if the task has not started. Once running, the worker thread or process executes to completion. Python has no thread cancellation primitive. The standard workaround is cooperative cancellation: pass a threading.Event to the worker, check it periodically, return early if set.
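A sketch of that cooperative-cancellation pattern (download_all and the URL list are hypothetical): the Event is the cancellation signal, checked on each loop iteration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def download_all(urls: list[str], stop: threading.Event) -> int:
    done = 0
    for url in urls:
        if stop.is_set():          # cooperative cancellation point
            break
        # ... fetch url here ...
        done += 1
    return done

stop = threading.Event()
with ThreadPoolExecutor(max_workers=4) as pool:
    fut = pool.submit(download_all, ["u1", "u2", "u3"], stop)
    # fut.cancel() only works before the task starts; once it is running,
    # setting the event is how we ask it to stop early.
    stop.set()
    print(fut.result())            # however many items finished before the signal
```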
Gotchas
- Forgetting `with` context manager leaks workers (especially processes, which keep file handles open)
- ProcessPoolExecutor on Windows uses spawn, not fork; everything in __main__ guard
- executor.map raises only on iteration; exceptions in early items are deferred
- Default max_workers is often wrong for the workload; size based on bottleneck (cores for CPU, latency * QPS for I/O)
- cancel() only stops futures that haven't started; anything already running completes regardless
asyncio Patterns: gather, TaskGroup, Semaphore
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
asyncio.gather runs awaitables concurrently and returns results in input order. TaskGroup (Python 3.11+) is the structured-concurrency replacement: better cancellation, better error propagation. asyncio.Semaphore caps in-flight concurrency. asyncio.timeout adds a deadline. These four cover most async fan-out patterns.
Key Points
- gather runs N awaitables concurrently; result is a list in input order.
- TaskGroup replaces gather for new code: any task failure cancels siblings, the group raises ExceptionGroup.
- asyncio.Semaphore is the concurrency limit. Without it, gather over 10,000 URLs spawns 10,000 connections.
- asyncio.timeout(s) is a context manager. The whole block must finish in s seconds or it cancels.
- Bare `asyncio.create_task(coro)` without keeping a reference may be garbage-collected mid-flight.
Interview Follow-up Questions
TaskGroup vs gather: which to use?
TaskGroup for new code on Python 3.11+. It handles cancellation correctly (gather's cancel behaviour was historically buggy), provides ExceptionGroup for multiple failures, and aligns with structured concurrency. Use gather only for simple cases on older Python or when return_exceptions=True semantics are specifically needed.
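A sketch of the TaskGroup + Semaphore + timeout combination on Python 3.11+ (URLs and limits are made up): the semaphore caps in-flight work, the timeout bounds the whole fan-out, and leaving the TaskGroup block guarantees every task is finished or cancelled.

```python
import asyncio

async def fetch(url: str, limit: asyncio.Semaphore) -> str:
    async with limit:                      # at most N requests in flight
        await asyncio.sleep(0.1)           # stand-in for an async HTTP call
        return url

async def main() -> None:
    urls = [f"https://example.com/{i}" for i in range(100)]
    limit = asyncio.Semaphore(10)
    async with asyncio.timeout(5):         # the whole fan-out must finish in 5s
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(fetch(u, limit)) for u in urls]
    results = [t.result() for t in tasks]  # safe: the group has joined everything
    print(len(results))

asyncio.run(main())
```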
What is the difference between Task and Future?
Task is a Future that wraps a coroutine. asyncio.create_task(coro) schedules the coroutine and returns a Task. A Task can be awaited, cancelled, queried for state. Future is the lower-level primitive (also returned by run_in_executor for thread/process integration). In application code, almost always work with Tasks.
Why does 'coroutine was never awaited' sometimes appear?
An async function was called without being awaited. Calling foo() on an async def foo returns a coroutine object; it does not run it. Await it, schedule it with create_task, or pass it to gather/TaskGroup. Forgetting all of these means the coroutine is built and discarded.
How does asyncio interact with threads and processes?
asyncio.to_thread runs a sync function on the default executor (a ThreadPoolExecutor) and returns an awaitable. Useful for blocking calls (file I/O, subprocess, sync libraries) that cannot be awaited directly. For CPU-bound work, asyncio.get_running_loop().run_in_executor(ProcessPoolExecutor(), ...) bridges to processes.
Gotchas
- create_task without keeping a reference may be GC'd before completion
- gather with return_exceptions=False raises immediately; siblings keep running
- Forgetting `await` on an async function returns a coroutine, not a result
- Calling sync blocking code (time.sleep, requests.get) from async blocks the event loop
- asyncio.Lock is for asyncio coroutines; threading.Lock will block the event loop
ContextVar: Async-Safe Request Context
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Sometimes
ContextVar (Python 3.7+) is the async-safe replacement for threading.local. Each task gets its own isolated copy of the variable; values set in a parent task propagate to children but mutations don't leak back. The right tool for request IDs, trace IDs, user context, and any per-request state in async code.
Key Points
- Each asyncio Task gets a copy of the parent's context. Sets in the task don't affect siblings or parent.
- Replaces threading.local for async: thread-locals don't isolate across asyncio tasks on the same thread.
- Used for request-scoped data: trace ID, request ID, authenticated user, tenant ID.
- var.set() returns a Token; pass it to var.reset(token) to restore the previous value (rare in async code).
- Frameworks (FastAPI, Starlette, structlog) use ContextVar internally for request-scoped state.
Interview Follow-up Questions
ContextVar vs threading.local: when each?
threading.local for sync code that needs per-thread state. ContextVar for async code (per-task) AND for sync code that needs per-call isolation (Context.run scopes a value to a specific callable). For new code in an async codebase, always ContextVar.
Does ContextVar work with thread pools?
Yes, but with care. Each thread starts with an empty context unless the context is copied explicitly. asyncio.to_thread copies the current context; concurrent.futures.ThreadPoolExecutor.submit does NOT, so contextvars.copy_context().run(...) is required. Frameworks usually handle this automatically.
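A small sketch of the difference (request_id and blocking_work are hypothetical names): submit() runs in a fresh context unless copy_context() is used explicitly.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

request_id: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")

def blocking_work() -> None:
    print(f"[{request_id.get()}] inside the pool")

request_id.set("req-42")
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(blocking_work).result()              # prints [-]: context not copied
    ctx = contextvars.copy_context()
    pool.submit(ctx.run, blocking_work).result()     # prints [req-42]: context carried over
```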
Can a ContextVar have no default?
Yes. var.get() then raises LookupError if unset. Or var.get(default) returns default. The no-default form is useful for catching bugs ('this var must be set by middleware'); the default form is forgiving.
Why does var.set() return a Token?
For undo. var.reset(token) restores the previous value. Useful in synchronous code that wants to set a value temporarily. In async code, the per-task isolation usually means reset is unneeded; the value disappears when the task ends.
Gotchas
- threading.local in async code: tasks on the same thread share state, race city
- ThreadPoolExecutor.submit does NOT copy context; trace IDs disappear in pool work
- var.set() inside a finally block of a task does NOT affect the parent (per-task isolation)
- Forgetting default + var.get() raises LookupError; catch or pre-set in middleware
- Context is captured at create_task time, not at await time; mutations between create and await don't propagate
Multiprocessing Deep Dive
Section: Python Concurrency | Difficulty: Advanced | Frequency: Sometimes
multiprocessing spawns separate Python processes to escape the GIL. Pool, Process, Queue, Pipe, Manager, shared memory. The catch: every argument and return value is pickled across the boundary, start methods (fork/spawn/forkserver) behave very differently, and shared state requires explicit machinery.
Key Points
- Each process has its own Python interpreter and its own GIL. True CPU parallelism.
- Arguments and return values must be picklable. Lambdas, closures, file handles, locks all fail.
- Start method matters: fork (Unix default, fast, copy-on-write) vs spawn (Windows, slower, fresh interpreter) vs forkserver.
- Shared state needs Manager (proxied, slow) or shared_memory (fast, raw bytes).
- multiprocessing.Pool offers map/imap/apply_async; concurrent.futures' ProcessPoolExecutor covers the common cases with submit/map and is usually preferred.
Interview Follow-up Questions
When is multiprocessing preferable over ProcessPoolExecutor?
ProcessPoolExecutor is built on multiprocessing and provides a cleaner API for the common case (map, submit, futures). Reach for raw multiprocessing when needed: a Manager for shared state, shared_memory, custom IPC via Pipe, or fine control over the start method. For 'run this function on N cores', use ProcessPoolExecutor.
Why does code work on Linux but fail on macOS / Windows?
Default start method changed. macOS in Python 3.8+ defaults to spawn (was fork before). Windows has always been spawn. Spawn re-imports the module in each worker, which breaks lambdas, top-level code without an `if __name__ == '__main__':` guard, and code that depends on shared parent state. Add the guard, refactor lambdas to module-level functions, and pass shared state explicitly.
Manager.dict vs shared_memory: when to pick which?
Manager.dict for low-frequency access to small objects. Each access is an IPC round trip (pickle, send, unpickle), so it does not scale. shared_memory for large fixed-size buffers (numpy arrays, byte buffers) needing lots of reads and writes without copying. The trade-off is convenience vs speed.
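A sketch of the shared_memory path, assuming numpy is available (sizes and names are arbitrary): the child attaches to the same segment by name, so the array is never pickled or copied.

```python
from multiprocessing import Process, shared_memory
import numpy as np

def worker(shm_name: str, shape: tuple, dtype: str) -> None:
    shm = shared_memory.SharedMemory(name=shm_name)    # attach by name, zero copy
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr *= 2                                            # writes are visible to the parent
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8 * 1_000_000)  # 1M float64
    arr = np.ndarray((1_000_000,), dtype="float64", buffer=shm.buf)
    arr[:] = 1.0
    p = Process(target=worker, args=(shm.name, arr.shape, "float64"))
    p.start(); p.join()
    print(arr[:3])       # [2. 2. 2.]
    shm.close()
    shm.unlink()         # release the OS segment; forgetting this leaks it
```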
How does PEP 703 (no-GIL Python) change this?
Free-threaded CPython (3.13+ as a build option, supported but non-default in 3.14, default-on planned for later) removes the GIL. Threads achieve true parallelism, so multiprocessing becomes less necessary for CPU-bound work. multiprocessing remains useful for fault isolation (a crash in one process does not take down the parent) and for memory isolation (no shared mutable state by default).
Gotchas
- Forgetting `if __name__ == '__main__':` on Windows/macOS spawn causes infinite recursion
- Lambdas, local functions, and closures cannot be pickled; use module-level callables
- Manager proxies are SLOW; every read or write is an IPC round trip
- fork after threading is undefined behaviour in many libraries; prefer spawn or forkserver
- Forgetting shm.unlink() leaks shared memory until reboot on some platforms
Bridging Sync and Async
Section: Python Concurrency | Difficulty: Advanced | Frequency: Sometimes
asyncio.to_thread runs sync code on a thread without blocking the event loop. asyncio.run_coroutine_threadsafe runs async code from a thread. asgiref.sync provides sync_to_async / async_to_sync helpers. Mixing the two worlds is messy; isolate the boundary and pick one direction per layer.
Key Points
- Calling sync blocking code from async (without to_thread) freezes the event loop.
- Calling async code from sync (without proper bridging) raises RuntimeError or silently never runs.
- asyncio.to_thread offloads the sync call to the default thread pool. Returns awaitable.
- From a non-event-loop thread, asyncio.run_coroutine_threadsafe schedules a coroutine on the loop and returns a concurrent.futures.Future.
- When in doubt, isolate the boundary: a layer is fully sync OR fully async, never mixed.
Interview Follow-up Questions
Why not just call sync code from async?
It works, but if the sync code blocks (HTTP request, DB query, file I/O, time.sleep), it freezes the entire event loop. Every other coroutine, every other request, stops making progress. asyncio.to_thread offloads the blocking call to a thread, freeing the loop to handle other work.
What thread runs asyncio.to_thread tasks?
By default, the loop's default executor (a ThreadPoolExecutor with min(32, cpu_count + 4) workers). It can be replaced with loop.set_default_executor(custom_executor) for a larger pool or one with custom thread settings. For CPU-bound sync code, use ProcessPoolExecutor instead via run_in_executor.
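A sketch of the common direction (legacy_blocking_call is a stand-in for any sync library): each blocking call is offloaded to the default executor and awaited like a coroutine, so three calls overlap instead of stacking.

```python
import asyncio
import time

def legacy_blocking_call(seconds: float) -> str:
    time.sleep(seconds)        # stands in for requests.get / a sync DB driver
    return "done"

async def main() -> None:
    results = await asyncio.gather(
        asyncio.to_thread(legacy_blocking_call, 0.2),
        asyncio.to_thread(legacy_blocking_call, 0.2),
        asyncio.to_thread(legacy_blocking_call, 0.2),
    )
    print(results)             # finishes in ~0.2s, not 0.6s, and the loop never blocks

asyncio.run(main())
```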
When is anyio preferable over asyncio directly?
anyio is a higher-level wrapper that works on both asyncio and trio with the same API. Reach for it for runtime-agnostic libraries, for trio's structured concurrency, or for easier sync/async bridging via anyio.from_thread.run / anyio.to_thread.run_sync. For pure asyncio applications, the asyncio API is fine.
What is the boundary to aim for?
Pick a direction per layer. The web framework layer is async. The data access layer is async (using async drivers). The CPU-bound utility layer is sync, called from async via to_thread. Don't ping-pong: a sync function that calls an async function that calls a sync function is hard to reason about and slow.
Gotchas
- Calling time.sleep / requests.get / sync DB drivers from async blocks the event loop
- asyncio.run inside a running loop raises RuntimeError; only call from sync code
- asyncio.run_coroutine_threadsafe needs a loop already running on another thread
- to_thread workers come from a fixed pool; thousands of concurrent to_thread calls queue up
- asgiref helpers create per-call loops if needed; expensive in tight loops
PEP 703: No-GIL Python
Section: Python Concurrency | Difficulty: Advanced | Frequency: Sometimes
PEP 703 makes the GIL optional. Free-threaded CPython (3.13+ build option, supported but not default in 3.14) lets multiple Python threads run bytecode in parallel. Threads finally help CPU-bound work. Cost: per-object reference counting needs atomics, single-threaded performance regresses ~10-15%, C extensions need rebuilding.
Key Points
- Free-threaded Python (3.13+) builds without the GIL. Threads achieve true parallelism.
- Only available in a separate build (python3.13t). Default CPython still has the GIL.
- C extensions must opt in via Py_mod_gil. Old extensions trigger a re-enabling of the GIL.
- Single-threaded perf regresses (~10-15% in 3.13) due to atomic reference counting and biased locking on objects.
- Multi-threaded scaling on CPU-bound code finally works without multiprocessing.
Interview Follow-up Questions
Should one switch to free-threaded Python today?
Probably not for production yet. The free-threaded build is experimental in 3.13 and supported (but not default) in 3.14. Becoming default is on the roadmap but not yet officially scheduled. Most third-party C extensions (numpy, pandas, torch, etc.) are still adding no-GIL support. Test in a dev environment, not in production. Single-threaded perf also regresses ~10-15%, which matters for some workloads.
Does this make multiprocessing obsolete?
No. Multiprocessing still gives fault isolation (a crash in one worker does not take the parent), memory isolation (workers cannot corrupt each other's state), and the ability to scale beyond one machine. Free-threaded Python makes threads viable for CPU-bound work, but processes remain useful when isolation matters.
Will existing threaded code break?
Code that is correctly synchronised (uses Lock around shared mutable state) works the same. Code that relies on the GIL to make multi-bytecode operations atomic (e.g., assumes counter += 1 is safe) was always broken; on free-threaded Python the bugs are easier to trigger. Audit shared mutable state for missing locks.
What changes for library authors?
Pure-Python libraries should work unchanged. C extensions need to declare no-GIL safety via Py_mod_gil slot, audit for thread-unsafe global state, and add explicit synchronisation for any shared mutable C-level state (caches, free lists). The CPython docs have a porting guide.
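A hedged detection sketch (attribute names are those documented for 3.13; older interpreters simply report a GIL-enabled build):

```python
import sys
import sysconfig

# Build-time flag: truthy on a free-threaded (no-GIL) build.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Runtime check: even on a free-threaded build the GIL can be re-enabled,
# e.g. when a GIL-requiring C extension is loaded. Falls back to True pre-3.13.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(f"free-threaded build: {free_threaded_build}, GIL currently enabled: {gil_enabled}")
```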
Gotchas
- Single-threaded perf regresses ~10-15% on free-threaded build (atomic refcounts, biased locking overhead)
- Loading any GIL-required C extension re-enables the GIL process-wide
- Race conditions hidden by GIL serialisation become visible; correctness audits matter
- Free-threaded build is a separate binary (python3.13t); prebuilt wheels on PyPI are still limited
- Some libraries gate behaviour on sys.version; check for free-threaded build separately
Connection Pool
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
A pool of pre-established connections (database, HTTP, gRPC) reused across requests. Building a connection per request is too slow (TLS handshake, TCP setup, auth). Pool provides bounded resource usage, faster requests, and back-pressure when the limit is reached.
Key Points
- Connection establishment is expensive: TCP handshake, TLS handshake, auth. Pool amortises this.
- Pool size = max concurrent in-flight requests. Sized by downstream limit, not by app traffic.
- Acquire blocks (or fails) when pool is empty; this is the back-pressure mechanism.
- Health check: validate connection on borrow; broken connections must be discarded, not handed out.
- Idle timeout: connections that have been idle too long are closed (server may have closed them already).
Interview Follow-up Questions
How big should the connection pool be?
Use Little's Law: throughput × average query latency. At 100 queries/sec and 50ms average, 5 in-flight is the baseline. Add 2x headroom for bursts. So pool of 10. Don't size by number of app threads; that gives wildly oversized pools that overwhelm the database.
What is the relationship between pool size and database max_connections?
Pool size per app instance × number of app instances ≤ database max_connections × 0.8 (leave headroom). Postgres default max_connections is 100; with 10 app instances that's 8 connections each. Sounds low. Two responses: tune Postgres up to 200-500, or use PgBouncer in front to multiplex.
Should there be separate pools for read replicas and writers?
Yes. Read traffic is bursty; write traffic is steady. Separate pools allow each to be sized appropriately and route queries to the right backend. Most ORMs and DB libraries support multiple data sources.
Why recycle connections (maxLifetime)?
Three reasons. Server-side connection limits or memory leaks accumulate over time. DNS or load balancer changes need fresh connections to take effect. Server restarts may close idle connections without us noticing. Recycling every 30 minutes keeps the pool healthy without thinking about it.
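A minimal pool sketch (no health check, and the connection factory is a placeholder): the bounded queue makes acquire the back-pressure point, and the birth timestamp drives maxLifetime recycling.

```python
import queue
import time

class ConnectionPool:
    def __init__(self, create, size: int = 10, max_lifetime: float = 1800.0):
        self._create = create                  # factory returning a fresh connection
        self._max_lifetime = max_lifetime      # recycle after ~30 minutes
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put((create(), time.monotonic()))

    def acquire(self, timeout: float = 5.0):
        conn, born = self._idle.get(timeout=timeout)       # blocks when the pool is empty
        if time.monotonic() - born > self._max_lifetime:
            conn, born = self._create(), time.monotonic()  # too old: replace it
        return conn, born

    def release(self, conn, born) -> None:
        self._idle.put((conn, born))           # call from a finally block, always
```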
Gotchas
- Pool size larger than database can handle: every request fails when DB max_connections hit
- Forgetting to release connection (no try-with-resources): pool leak, eventual exhaustion
- No idle timeout: server-closed connections stay in the pool, fail on next borrow
- Long-running query holds a connection: pool exhausted by 1 slow query
- Same pool for read and write: write spike starves reads, or vice versa
Background Job Queue
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Persist jobs to durable storage (Postgres, Redis, SQS), workers pull and process them. Need: idempotent handlers (jobs can be delivered twice), visibility timeout (a worker that crashes mid-job must release the job for retry), exponential retry with dead-letter queue, scheduled/delayed jobs, priority queuing.
Key Points
- Durable storage so jobs survive restart. Memory-only queues lose work on crash.
- Visibility timeout: when a worker pulls a job, hide it for N seconds. If the worker doesn't ack, it reappears.
- Idempotency on the handler side: jobs can be delivered more than once (visibility timeout expiry, worker crashes after work but before ack).
- Dead-letter queue (DLQ) for jobs that fail repeatedly. Don't retry forever; surface to humans.
- Sidekiq/Resque (Redis), RQ (Redis), Celery (RabbitMQ or Redis broker), BullMQ (Redis), AWS SQS, and Postgres SKIP LOCKED are popular implementations. Sidekiq, Resque, RQ, and BullMQ are Redis-native; Celery's primary broker is RabbitMQ (Redis is supported but not the default).
Interview Follow-up Questions
Why not just use a list in Redis (LPUSH/RPOP)?
Works for the simplest case but lacks visibility timeout, retries, persistence guarantees on crash, scheduled jobs. All of that ends up being rebuilt. For anything beyond toy workloads, use a real job queue (Sidekiq, RQ, Celery, SQS) or Postgres SKIP LOCKED. Redis-only is fine with Redis Streams (which has consumer groups and acks) but not raw LPUSH/RPOP.
Why is Postgres SKIP LOCKED so popular?
Most teams already run Postgres. No new system to operate. Transactional with the rest of the data (a job can be enqueued and a row updated atomically). SKIP LOCKED prevents lock contention between workers. Throughput is real (10K+ jobs/sec on a single Postgres). The downside: not as fast as Redis or SQS at very high throughput, and database CPU goes to queue work.
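A worker-loop sketch on top of SKIP LOCKED, assuming a psycopg-style DB-API connection and a `jobs` table with id/payload/state/run_at/priority columns (all hypothetical):

```python
import time

CLAIM_SQL = """
UPDATE jobs
   SET state = 'running', locked_at = now()
 WHERE id = (SELECT id FROM jobs
              WHERE state = 'queued' AND run_at <= now()
              ORDER BY priority, run_at
              LIMIT 1
              FOR UPDATE SKIP LOCKED)
RETURNING id, payload;
"""

def worker_loop(conn, handle) -> None:
    while True:
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()
        conn.commit()                          # the claim is its own transaction
        if row is None:
            time.sleep(1.0)                    # queue empty: back off briefly
            continue
        job_id, payload = row
        handle(payload)                        # must be idempotent: at-least-once delivery
        with conn.cursor() as cur:
            cur.execute("UPDATE jobs SET state = 'done' WHERE id = %s", (job_id,))
        conn.commit()
```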
What goes in the dead-letter queue?
Jobs that exceeded max retries. Store the original payload, the attempt count, and the last error. Periodically (or via dashboard), humans inspect, decide whether the job is recoverable (fix and replay) or permanently broken (delete). DLQ size is a great alert: a sudden spike means something downstream broke.
How do I prioritise jobs?
Add a priority column (0-9, 0=highest). Worker selects ORDER BY priority, run_at. For more sophisticated needs (per-tenant fairness, SLA-aware scheduling), use multiple queues with different worker pools, or a dedicated scheduler. Most teams over-engineer this; a single priority field handles 95% of needs.
Gotchas
- No visibility timeout: worker crash mid-job means the job is lost OR re-runs immediately and floods retries
- Non-idempotent handler with at-least-once delivery: side effects double on retry
- Unbounded retries: a permanently failing job retries forever, drowning out healthy work
- No DLQ inspection: dead jobs accumulate, no one notices the silent failure
- Treating queue as a database: long-running jobs hold locks, block compaction, slow queries
In-Memory Cache with TTL Invalidation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Thread-safe cache where each entry expires after T seconds. Core decisions: lazy expiry vs background sweep; ConcurrentHashMap vs RWLock; cache-stampede prevention via single-flight. Tests understanding of expiration semantics + concurrency together.
Key Points
- Two expiry strategies: lazy (check on read) vs active sweep (background timer)
- ConcurrentHashMap (Java) / sync.Map (Go) / dict + RLock (Python) for storage
- Cache stampede: when key expires, many concurrent reads all try to recompute → use single-flight
- Hot-key contention: shard the cache by key hash for high throughput
Interview Follow-up Questions
Lazy vs active sweep, pick one?
Use both. Lazy expiry on reads gives correctness. Active sweep bounds memory for keys rarely read. Without active sweep, keys never read again sit forever, memory leak. Without lazy, stale data is served between sweep cycles.
What's cache stampede and how is it prevented?
When a hot key expires and 100 concurrent reads all see 'expired' and all try to recompute simultaneously, hammering the upstream. Single-flight: only one recomputation per key; others wait for it. Java's `computeIfAbsent`, Go's `singleflight`, hand-rolled per-key locks in Python.
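A compact sketch combining lazy expiry with per-key single-flight (loader is the upstream recompute, supplied by the caller):

```python
import threading
import time

class TTLCache:
    def __init__(self, ttl: float, loader):
        self._ttl = ttl
        self._loader = loader                 # upstream recompute function
        self._data: dict = {}                 # key -> (value, expires_at)
        self._lock = threading.Lock()
        self._inflight: dict = {}             # key -> per-key single-flight lock

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
            if entry and entry[1] > time.monotonic():   # lazy expiry check on read
                return entry[0]
            flight = self._inflight.setdefault(key, threading.Lock())
        with flight:                          # single-flight: one recompute per key
            with self._lock:
                entry = self._data.get(key)
                if entry and entry[1] > time.monotonic():
                    return entry[0]           # another thread already refreshed it
            value = self._loader(key)
            with self._lock:
                self._data[key] = (value, time.monotonic() + self._ttl)
                self._inflight.pop(key, None)
            return value
```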
How would this scale to 1M ops/sec?
Sharding. Split the cache into N partitions by key hash; each has its own lock and map. ConcurrentHashMap (Java) stripes locking internally (per-bin locks since Java 8); for Go/Python, do it explicitly. Separate single-flight group per shard.
What about negative caching?
Cache 'not found' with a shorter TTL. Otherwise every miss for a non-existent key hammers the loader. Production caches usually have separate hit/miss TTLs.
Token Bucket & Sliding Window Rate Limiters
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
A rate limiter caps request rate per key. Token bucket: refill N tokens/sec, requests consume one. Sliding window: count requests in the last second. Both need atomic counter updates and either lazy refill or scheduled refill, choice depends on traffic patterns and burst tolerance.
Key Points
- Token bucket: refill rate, max bucket capacity (allows bursts)
- Sliding window: log of timestamps, count those in [now-1s, now]
- Per-key state requires concurrent map; per-bucket update requires atomic
- Distributed rate limiting: Redis SETNX + INCR + EXPIRE, or atomic Lua script
- Real production: Stripe's algorithm uses sliding window + leaky bucket per dimension
Interview Follow-up Questions
Token bucket vs sliding window, pick one?
Token bucket allows bursts (use up tokens fast, refill slowly). Sliding window strictly caps requests-per-window. Stripe's API uses sliding-window because billing endpoints can't tolerate bursts. Most CDNs use token bucket for friendliness. Pick based on the contract being enforced.
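A token-bucket sketch with lazy refill (per-key lookup and the 429 response are left to the caller):

```python
import threading
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self._rate = rate                 # tokens added per second
        self._capacity = capacity         # burst ceiling
        self._tokens = capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        with self._lock:
            now = time.monotonic()
            # Lazy refill: credit tokens for the time elapsed since the last call.
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False                  # caller responds 429 + Retry-After
```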
How is this made distributed across N servers?
Per-server limiters drift; a shared store is required. Redis SETNX + INCR + EXPIRE is the standard (atomic Lua script for the check-and-update). Sketches like CountMinSketch reduce memory for huge key spaces. Trade exact accuracy for performance.
What about rate limit by IP vs by user vs by API key?
All three usually. Multi-dimensional limits, most strict wins. Stripe documents this: per-account, per-endpoint, per-action all have separate limits. Each is a TokenBucket (or sliding window) keyed by the identifier.
How are limit responses handled gracefully?
Return 429 with `Retry-After` header. Add `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers so clients can self-throttle. For internal services, exponential backoff with jitter on receiving 429.
Retry Library: Production Implementation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Build a retry library: classify errors (transient vs permanent), exponential backoff with jitter, max-attempts AND max-duration, per-call deadline budget, optional circuit breaker integration. Two ways to get it wrong: retry permanent errors, or retry without jitter.
Key Points
- Classify errors: only retry transient (5xx, timeout, connection refused). Never retry 4xx.
- Exponential backoff with FULL jitter: delay = random(0, base * 2^attempt).
- Bound both attempts AND total elapsed time. Either alone is not enough.
- Pass remaining deadline to each attempt; don't run a 5s call when 1s of budget remains.
- Compose with circuit breaker: retries hit the breaker first, fail fast if open.
Interview Follow-up Questions
What goes in the retryable-error classifier?
Network errors (timeout, connection refused, DNS fail), 5xx HTTP responses, gRPC UNAVAILABLE/DEADLINE_EXCEEDED/RESOURCE_EXHAUSTED, database deadlock errors, cloud throttling responses (429 with Retry-After). NOT in: 4xx (client errors), validation failures, auth errors, business-logic errors. The classifier is per-protocol; HTTP, gRPC, DB each have their own.
Why should I prefer a deadline over a fixed max-attempts?
Because attempts can grow latency unboundedly. With max-attempts=5 and exponential backoff, the worst case is base * (1+2+4+8+16) = 31 * base. If base is 1s, that's 31s of latency for one user request. A deadline (e.g., 5s total) caps the experience. Both together give the best behaviour: stop on attempts AND on time.
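A sketch combining all three bounds (the transient-error tuple stands in for the per-protocol classifier):

```python
import random
import time

TRANSIENT = (TimeoutError, ConnectionError)   # placeholder classifier

def retry(call, base: float = 0.1, max_attempts: int = 5, deadline: float = 5.0):
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return call()                               # non-transient errors propagate
        except TRANSIENT:
            elapsed = time.monotonic() - start
            if attempt == max_attempts - 1 or elapsed >= deadline:
                raise                                   # out of attempts OR out of time
            # Full jitter: uniform in [0, base * 2^attempt], capped by remaining budget.
            delay = random.uniform(0, base * 2 ** attempt)
            time.sleep(min(delay, deadline - elapsed))
```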
Should retry be in the client SDK or in the application code?
Client SDK is the cleanest place; the application calls one function and gets retry for free. The risk: stacked retries when the application also wraps the SDK in its own retry. Solution: SDK exposes a way to disable its retry, or document clearly. When the SDK isn't under the team's control, application-level retry is the right place.
Gotchas
- Retrying without a per-attempt timeout means slow downstream + slow retries = unbounded latency
- Retrying a non-idempotent POST without an idempotency key duplicates the side effect
- Stacked retries (application retry + the HTTP client's retry) multiply attempts exponentially
- Using elapsed time but not respecting context cancellation: retry continues after caller gave up
- Treating all exceptions as retryable: validation errors and OOM get retried too
Circuit Breaker: Production Implementation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Build a circuit breaker that wraps downstream calls. State machine: closed → (failures) → open → (cooldown) → half-open → (probe) → closed/open. Per-downstream instance. Trip on failure rate over a sliding window with a minimum-call floor. Half-open allows N probes; success rate decides if we close or reopen.
Key Points
- Three states: closed (normal), open (fail-fast), half-open (probing). Atomic state transitions.
- Trip policy: failure rate over a sliding window (last N calls or last T seconds), with a minimum-call floor.
- Cooldown in open state: 30s-5min typical. After cooldown, transition to half-open.
- Half-open allows N probes (typically 3-10). Track their outcomes; success rate decides next state.
- Per-downstream breaker. Never one global breaker. Different downstreams have different reliability.
Interview Follow-up Questions
How do I size the sliding window?
Tradeoff between reactivity and false positives. A 10-call window: trips fast but flaky on bursty traffic. A 100-call or 60-second window: more stable but slower to trip. For high-traffic services, 60-second time-based windows work well; for low-traffic, count-based with a minimum-call floor.
What if I get a flood of requests right when half-open opens?
Don't admit them all. Half-open should cap concurrent probes (e.g., 3 at a time, others fail-fast as if open). Without this cap, the still-recovering downstream gets a surge of probe traffic at the moment it can least handle it.
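A simplified breaker sketch with a probe cap (it trips on consecutive failures rather than a windowed rate, and closes after a single successful probe; a production breaker would refine both):

```python
import threading
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0, max_probes: int = 3):
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._max_probes = max_probes          # cap on concurrent half-open probes
        self._failures = 0
        self._probes = 0
        self._opened_at = 0.0
        self._state = "closed"
        self._lock = threading.Lock()

    def call(self, fn):
        with self._lock:
            if self._state == "open":
                if time.monotonic() - self._opened_at < self._cooldown:
                    raise CircuitOpenError()
                self._state, self._probes = "half-open", 0
            if self._state == "half-open":
                if self._probes >= self._max_probes:
                    raise CircuitOpenError()   # too many probes in flight: fail fast
                self._probes += 1
        try:
            result = fn()
        except Exception:
            with self._lock:
                self._failures += 1
                if self._state == "half-open" or self._failures >= self._threshold:
                    self._state, self._opened_at = "open", time.monotonic()
            raise
        with self._lock:
            self._failures = 0
            if self._state == "half-open":
                self._state = "closed"         # probe succeeded: close the breaker
        return result
```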
Should I share circuit breaker state across instances?
Usually not. Each service instance maintains its own breaker. The downstream's actual health is the same; each instance discovers it independently. Sharing state across instances (via Redis, say) sounds smart but adds latency and a coordination point. Per-instance breakers converge to the same decision quickly enough.
How does the breaker interact with metrics?
Emit gauge for state (0/1/2 for closed/open/half-open), counters for trips and resets, histograms for time-in-state. Alerts on 'breaker open for >5 minutes' or 'breaker tripping >10 times per hour'. The breaker is one of the most useful sources of downstream-health signal.
Gotchas
- One global breaker for all downstreams: one bad downstream stops everything
- Counting client errors (4xx) as breaker failures: trips on user mistakes, not service health
- Half-open with no probe cap: surge of probe traffic on a recovering downstream
- No fallback when breaker opens: caller sees CircuitOpenException with no graceful path
- Breaker without timeout: slow calls can pile up before the breaker even sees them as failures
Idempotency Key: Stripe-Style Implementation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Server-side dedup of retries on non-idempotent operations. Client sends Idempotency-Key header (unique per logical operation). Server: lock on key → check cache → if cached, replay response → else process and cache → release. TTL on cache (24h typical). Detect key collisions via request fingerprint.
Key Points
- Client generates ONE key per logical operation, reuses on retry. NOT one per HTTP attempt.
- Server-side: lock on key (so concurrent retries don't both run), check cache, process, store response, release.
- Cache stores: response body + status code + relevant headers. Replay must be byte-identical.
- TTL: typically 24h. Long enough for client retries, short enough that storage doesn't grow forever.
- Detect key collisions: store a fingerprint (hash of request body) and compare. If different, return 422 instead of cached response.
Interview Follow-up Questions
What happens with concurrent requests using the same idempotency key?
The lock serialises them. First request acquires the lock, processes, stores the cached response, releases. Concurrent requests either wait briefly (with blocking lock acquisition) or fail with 409 'in flight' (with fail-fast). For a server-side idempotency layer, the 409 response is usually preferred: the client retries with backoff and the second time finds the cached response.
Why detect key collisions?
Two clients might accidentally generate the same key (UUID collisions are rare but not zero in long-tail). Or a single client reuses a key for two different operations by mistake (using a session ID instead of an operation ID, say). Without collision detection, the server replays the first response for the second request, which is the wrong response. The fingerprint check catches this and returns 422.
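A sketch of the cache-and-fingerprint core (the in-memory dict stands in for a shared store like Redis, and the in-flight lock / 409 path is omitted for brevity):

```python
import hashlib
import json

_responses: dict = {}    # key -> (fingerprint, status, body); needs a TTL in production

def handle_with_idempotency(key: str, request_body: dict, process):
    fingerprint = hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode()
    ).hexdigest()
    cached = _responses.get(key)
    if cached:
        stored_fp, status, body = cached
        if stored_fp != fingerprint:
            return 422, {"error": "idempotency key reused for a different request"}
        return status, body                          # replay the stored response
    status, body = process(request_body)             # side effect runs exactly once
    _responses[key] = (fingerprint, status, body)    # cache for later retries
    return status, body
```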
Should the cache be local or shared?
Shared. With multiple instances behind a load balancer, the same key might land on different instances. A local cache means each instance dedupes only against its own history; the lock would also need to be shared. Use Redis or similar.
What about idempotency for fire-and-forget background jobs?
Use an idempotency key in the job payload. The handler checks 'have I processed key X?' before doing the side effect. Same idea, different transport. Critical for at-least-once queues (SQS, Kafka with at-least-once consumer) where a single logical operation can be delivered multiple times.
Gotchas
- Generating new key per HTTP attempt: server sees each retry as a fresh request
- Caching only response body, not status code: replay returns wrong status
- No collision detection: same key for different operations replays wrong response
- No TTL on cached response: storage grows without bound
- Lock without TTL: crashed processor blocks subsequent retries forever
API Aggregator (Backend-for-Frontend)
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Service that calls N downstream services concurrently and merges results into a single response. Common in mobile/web BFFs. Patterns: fan out with errgroup/TaskGroup/gather, per-call timeouts, partial-result tolerance (return what we have if some fail), per-downstream circuit breakers.
Key Points
- Fan out concurrently: total latency = max of downstreams, not sum.
- Per-call timeout: one slow downstream shouldn't drag the whole response.
- Partial-result tolerance: decide which downstreams are critical, which are optional.
- Per-downstream circuit breaker: when one is unhealthy, fail it fast and degrade gracefully.
- Cache aggressively: page-level cache and per-downstream cache reduce fan-out cost.
Interview Follow-up Questions
What's the total-latency calculation for an aggregator?
The latency of the slowest downstream call — the max, not the sum. The whole point of the aggregator is parallelism. Accidental serial calls (await user; then await cart; then await recs) make latency the sum, which defeats the purpose. Always fan out concurrently.
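A sketch of concurrent fan-out with per-call timeouts using errgroup, with one critical and one optional downstream. `fetchUser` and `fetchRecs` are placeholder clients, not a real API.

```go
package bff

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

type Page struct {
	User string
	Recs []string
}

// Placeholder downstream clients — stand-ins for real RPCs.
func fetchUser(ctx context.Context, id string) (string, error)   { return "user-" + id, nil }
func fetchRecs(ctx context.Context, id string) ([]string, error) { return []string{"a", "b"}, nil }

// assemblePage fans out concurrently; total latency ≈ slowest call, not the sum.
func assemblePage(ctx context.Context, userID string) (Page, error) {
	var page Page
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error { // critical downstream: the page fails if this fails
		cctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
		defer cancel()
		u, err := fetchUser(cctx, userID)
		page.User = u
		return err
	})
	g.Go(func() error { // optional downstream: swallow the error, render without recs
		cctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
		defer cancel()
		if recs, err := fetchRecs(cctx, userID); err == nil {
			page.Recs = recs
		}
		return nil
	})
	if err := g.Wait(); err != nil {
		return Page{}, err
	}
	return page, nil
}
```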
How to decide which downstreams are critical vs optional?
Product/UX decision. The header that shows the username and avatar needs the user-service; without it, the page can't render. The 'people you might know' section is optional; if it fails, hide that section but still show the rest. Talk to designers; over-tolerating failures hides real problems, but treating everything as critical makes the page brittle.
Should the aggregator be its own service?
Yes for non-trivial apps. Pattern: BFF (Backend for Frontend). One BFF per client (mobile, web, partner API), each shaped to that client's needs. The BFF does the fan-out, caching, and shaping. Backends stay focused on their domain. Mobile app talks to one BFF endpoint instead of orchestrating 5 calls itself.
How are aggregator responses cached?
Two layers. Per-downstream cache (the user-service result is the same for the same userId for some seconds). Page-level cache (the full assembled page when it's safe to cache). Per-downstream cache wins more because invalidation is per-domain; page-level cache wins on repeated identical requests.
Gotchas
- Sequential awaits instead of concurrent fan-out: latency becomes sum, not max
- No per-call timeout: one slow downstream blocks the whole page
- Treating all downstreams as critical: any single failure breaks the page
- No circuit breaker per downstream: calls keep timing out for an unhealthy service
- No caching: every request fans out to all backends; database hammered
Polite Concurrent Web Crawler
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Asked Often
Crawl 1M URLs concurrently while: (1) bounding total in-flight requests, (2) deduplicating visited URLs, (3) respecting per-domain politeness (max N concurrent per host). Bounded worker pool + concurrent set + per-domain semaphore.
Key Points
- Worker pool with N workers fetches from a shared work queue
- Concurrent set (visited) prevents fetching the same URL twice
- Per-domain semaphore caps requests-per-host (politeness)
- BFS-style traversal: each fetched page enqueues newly-discovered links
- Termination: when queue empty AND no in-flight workers
Interview Follow-up Questions
How is pool size decided?
Web crawling is I/O-bound, pool size much larger than core count. Goetz's formula: cores × (1 + wait/compute). For ~500ms HTTP fetches and millisecond parsing: 100-500× cores. Empirically: ramp up until upstream complains or the file-descriptor table maxes out.
Why per-host limits?
Politeness, don't DDOS individual hosts. Most sites tolerate 4-8 concurrent connections from one client. Public crawler etiquette and robots.txt expect this.
How does termination work when the queue drains?
Track pending work, not just queue length. A naive 'queue empty AND inFlight == 0' check is racy: a worker can have just popped the last URL and be about to enqueue newly discovered links between the two reads. Real solution: a single counter (WaitGroup / CountDownLatch) incremented when a URL is scheduled and decremented only after that URL has been processed and its links scheduled; done when it reaches zero (see the sketch below).
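A sketch of the whole pattern under those rules — visited set, global in-flight bound, per-host politeness, WaitGroup-based termination. `fetchAndParse` is a placeholder; robots.txt, error handling, and retries are omitted.

```go
package crawler

import (
	"net/url"
	"sync"
)

// fetchAndParse stands in for the real HTTP fetch + link extraction.
func fetchAndParse(u string) []string { return nil }

// Crawl: one goroutine per discovered URL, a global semaphore bounding total
// in-flight fetches, a per-host semaphore for politeness, a visited set for dedup,
// and a WaitGroup whose count is "URLs scheduled but not yet fully processed".
func Crawl(seeds []string, maxInFlight, perHost int) {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		visited = map[string]bool{}
		hostSem = map[string]chan struct{}{}
		global  = make(chan struct{}, maxInFlight)
	)

	var visit func(u string)
	schedule := func(u string) {
		mu.Lock()
		defer mu.Unlock()
		if visited[u] {
			return
		}
		visited[u] = true
		wg.Add(1) // counted BEFORE the goroutine starts, so Wait can't fire early
		go visit(u)
	}

	visit = func(u string) {
		defer wg.Done() // runs only after this URL's links have been scheduled

		host := ""
		if p, err := url.Parse(u); err == nil {
			host = p.Host
		}
		mu.Lock()
		sem, ok := hostSem[host]
		if !ok {
			sem = make(chan struct{}, perHost)
			hostSem[host] = sem
		}
		mu.Unlock()

		global <- struct{}{} // bound total in-flight requests
		sem <- struct{}{}    // politeness: at most perHost concurrent per host
		links := fetchAndParse(u)
		<-sem
		<-global

		for _, l := range links {
			schedule(l)
		}
	}

	for _, s := range seeds {
		schedule(s)
	}
	wg.Wait() // nothing queued and nothing in flight: termination
}
```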
How is robots.txt handled?
Per-host robots.txt cache. Fetch and parse once per host; cache the rules. Apply rules before queuing URLs for that host. Add to the per-host limit logic.
Lock Convoys
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
A lock convoy happens when threads contending for a lock end up serialized in the kernel wait queue. The first thread runs and releases. The next thread is woken from a kernel sleep, which costs a context switch and a cache miss. The lock is fair, but throughput drops because every acquire pays a kernel round trip even when the critical section is microseconds. Common in fair mutexes under steady contention.
Key Points
- All N threads end up in the kernel wait queue. Each acquire pays a wakeup, a context switch, and cold-cache restart.
- Critical section is short, but wall-clock time per acquire is dominated by the wakeup cost (microseconds), not the section itself (nanoseconds).
- Triggered by fair locks (FIFO wakeup), high contention, and short critical sections. The trio is the warning sign.
- Symptom: throughput plateaus or drops as you add threads, but no thread is CPU bound; perf shows time inside futex/wait, not inside the critical section.
- Fixes: reduce contention (sharding, RCU, lock-free), prefer adaptive/unfair locks (default in glibc, Java ReentrantLock without fair=true), or restructure so the section is so short that nobody ever blocks.
Interview Follow-up Questions
Why does fairness make convoys worse?
A fair lock parks the releasing thread instead of letting it grab the lock back. So even if the releasing thread is the one with the hottest cache and zero work to do before the next acquire, it has to go to the back of the queue. Every acquire pays a wakeup. An unfair lock lets a recently-running thread re-acquire (`barge`), which usually costs nothing because it never blocked in the first place. Throughput goes up, fairness goes down.
Glibc moved away from fair futexes years ago. Why is this still a problem?
Modern futex-based mutexes are unfair by default, which avoids the worst case. Convoys still appear when (1) fair locks are explicitly opted into (Java's ReentrantLock(true), Go's mutex in starvation mode after 1ms wait, pthread mutexes with PTHREAD_PRIO_INHERIT and FIFO scheduler), or (2) the critical section grows because of GC pauses, page faults, or scheduler preemption. Once parking starts, it tends to keep happening.
How is a convoy different from regular lock contention?
Regular contention: threads spin briefly, occasionally one parks, throughput scales sub-linearly with cores. Convoy: every acquire pays a wakeup, threads spend most of their wall-clock time in `park`, throughput goes flat or drops. The tell is the breakdown of time per acquire: critical section in nanoseconds, total time per op in microseconds. The gap is the convoy.
Can a convoy happen without a lock?
Yes. Channel-based serialization (Go), single-threaded actor (Erlang, Node.js), or a single-writer queue can all convoy under enough load. The mechanism is the same: producers stack up faster than the consumer can drain, and each producer pays a wakeup when finally serviced. The fix is also the same: shard the consumer, or batch.
Gotchas
- Adding a 'just to be safe' fair lock to fix a starvation complaint can tank throughput by 100x.
- Convoys are time-of-day sensitive: light load looks fine, peak load suddenly cliffs. Load-test under realistic concurrency.
- GC pauses and STW phases extend critical sections enough to push a healthy unfair mutex into convoy mode for a few seconds.
- Profilers that sample the running thread won't show the convoy: time is inside the kernel scheduler, not application code. Use mutex/contention profilers.
Scheduled Callback Executor (HashedWheelTimer)
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Sometimes
Schedule callbacks to run at future times. Naive heap-based scheduler scales to thousands; for millions, use a HashedWheelTimer, bucketize callbacks by absolute deadline, sweep one bucket per tick. Used by Netty, Linux kernel timer wheel, Kafka.
Key Points
- Heap-based: O(log N) insert, O(log N) extract, fine for thousands
- Hashed wheel: O(1) insert, O(N/buckets) per tick, scales to millions
- Trade precision for throughput: wheel has tick granularity (~10ms)
- Single tick thread + worker pool dispatches expired callbacks
Interview Follow-up Questions
When does a heap-based scheduler bottleneck?
With many concurrent timers being added/cancelled. Each operation is O(log N), and the heap mutex serializes them. Above ~50K timers, the lock becomes a hotspot. Wheels make insert O(1) at the cost of slightly imprecise expiration.
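A compact sketch of the wheel itself: tasks hash into a bucket by deadline, a 'rounds' counter handles deadlines beyond one revolution, and a single tick goroutine sweeps one bucket per tick. Real implementations dispatch to a worker pool and support cancellation; this sketch fires each callback on its own goroutine to stay short.

```go
package wheel

import (
	"sync"
	"time"
)

type task struct {
	rounds int // full wheel revolutions remaining before this task is due
	fn     func()
}

type Wheel struct {
	mu      sync.Mutex
	buckets [][]*task
	pos     int
	tick    time.Duration
}

func New(buckets int, tick time.Duration) *Wheel {
	w := &Wheel{buckets: make([][]*task, buckets), tick: tick}
	go w.run()
	return w
}

// Schedule is O(1): pick the slot, append.
func (w *Wheel) Schedule(d time.Duration, fn func()) {
	ticks := int(d / w.tick)
	if ticks < 1 {
		ticks = 1 // anything shorter than one tick fires on the next tick
	}
	w.mu.Lock()
	slot := (w.pos + ticks) % len(w.buckets)
	w.buckets[slot] = append(w.buckets[slot],
		&task{rounds: (ticks - 1) / len(w.buckets), fn: fn})
	w.mu.Unlock()
}

func (w *Wheel) run() {
	t := time.NewTicker(w.tick)
	defer t.Stop()
	for range t.C {
		w.mu.Lock()
		w.pos = (w.pos + 1) % len(w.buckets)
		var keep []*task
		for _, tk := range w.buckets[w.pos] {
			if tk.rounds > 0 { // due on a later revolution: keep, decrement
				tk.rounds--
				keep = append(keep, tk)
				continue
			}
			go tk.fn() // sketch: fire directly instead of handing to a worker pool
		}
		w.buckets[w.pos] = keep
		w.mu.Unlock()
	}
}
```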
How does Linux's timer wheel work?
Cascading (hierarchical) wheels: a fine-grained wheel covers the near future (the next few seconds), with coarser 'tens-of-seconds' and 'minutes' wheels above it. Tasks far in the future sit in a coarse wheel and cascade into finer wheels as their deadline approaches. Kafka's delayed-operation purgatory uses the same hierarchical-wheel idea.
What about cancellation?
Wheel-based: O(1), just mark the task as cancelled. The tick thread checks before dispatching. Heap-based: lazy delete (mark + skip) is O(1); eager delete is O(N) without cross-references; with parent pointers in the heap, O(log N).
Single tick thread vs multiple?
Almost always single. The tick thread does minimal work, it just dispatches to a worker pool. Multiple tick threads compete on the wheel locks; the contention is worse than the single-thread baseline.
In-Process Pub-Sub Bus
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Asked Often
An in-memory pub-sub: publishers send to topics, subscribers receive matching messages. Decisions: per-subscriber bounded queue (slow consumer doesn't block fast); ordered or fan-out; sync or async delivery. Trades latency for backpressure.
Key Points
- Topics → subscriber list; publish iterates and delivers (or queues per subscriber)
- Each subscriber has bounded queue → slow consumer doesn't backpressure publisher
- Drop-on-full vs block-on-full, slow-consumer policy is THE design decision
- Concurrent map for topic→subscribers (CHM, sync.Map, dict + RLock)
- Used by Spring Events, Akka Streams, in-process EventBus libraries
Interview Follow-up Questions
Why per-subscriber queues instead of one global queue?
Per-subscriber queues isolate slow consumers, one slow handler doesn't block others. Single global queue: every subscriber processes serially → throughput limited by slowest. Per-subscriber: each runs at its own pace, can be drop-on-full independently.
Drop-on-full vs block-on-full?
Drop-on-full preserves publisher throughput (publisher never blocks). Block-on-full applies backpressure, publisher slows when consumer can't keep up. Drop is right for telemetry/metrics; block is right for guaranteed delivery (orders, billing).
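A sketch of the drop-on-full variant: each subscriber gets a bounded channel and its own worker goroutine, and publish uses select-with-default so a full queue drops the message for that subscriber only while the publisher never blocks. Type and method names are illustrative.

```go
package bus

import "sync"

// Bus maps topics to per-subscriber bounded queues.
type Bus struct {
	mu   sync.RWMutex
	subs map[string][]chan []byte // topic -> subscriber queues
}

func New() *Bus { return &Bus{subs: map[string][]chan []byte{}} }

// Subscribe registers a handler with its own bounded queue and worker goroutine.
func (b *Bus) Subscribe(topic string, buf int, handle func([]byte)) {
	ch := make(chan []byte, buf)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	go func() {
		for msg := range ch {
			func() {
				defer func() { _ = recover() }() // isolate handler panics from the worker loop
				handle(msg)
			}()
		}
	}()
}

// Publish fans out to every subscriber; a slow subscriber only loses its own messages.
func (b *Bus) Publish(topic string, msg []byte) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[topic] {
		select {
		case ch <- msg: // queued
		default: // drop-on-full: publisher never blocks
		}
	}
}
```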
How are subscriber failures handled?
Each worker wraps the handler in try/catch, exceptions log and continue. Otherwise a thrown exception kills the worker thread, leaving the subscriber silently dead. Always isolate handler bugs from the worker loop.
How does this differ from Kafka?
Kafka has persistent log; messages survive restart; consumers track offsets. In-process bus is ephemeral, restart loses messages. Kafka is for cross-service durable pub-sub; in-process bus is for in-app event-driven decoupling.
DAG Task Runner with Backpressure
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Sometimes
Run tasks with dependencies, each task waits for its prerequisites. Topological-order scheduling + worker pool + bounded concurrency. Pattern: per-task CountDownLatch / Future / WaitGroup, with dependency graph driving submission.
Key Points
- Topological order: tasks run as soon as their dependencies complete
- Per-task synchronization: Future, CompletableFuture, CountDownLatch, errgroup
- Worker pool bounds concurrency; max-in-flight is the backpressure knob
- Failure propagation: if a task fails, downstream tasks are cancelled (or retried)
- Used by Apache Airflow, Argo Workflows, Make, build systems
Interview Follow-up Questions
How are cycles handled?
Detect them at submit time via DFS or topological-sort validation. If a task is added with a dependency that doesn't resolve, reject. Most DAG runners (Airflow, Argo) refuse cyclic graphs at definition time.
Failure semantics, what if task B depends on A and A fails?
Three strategies: (1) cancel all dependents (Airflow default, 'fail fast'), (2) retry A with backoff and continue, (3) skip B and proceed with whatever can still run. Airflow has a 'trigger_rule' that lets each task pick its policy.
How is resource usage bounded?
Worker pool with max parallelism (`g.SetLimit(N)` in Go's errgroup, ThreadPoolExecutor in Java). For finer control, per-resource semaphores, DB-bound tasks share one limit, network-bound tasks share another.
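A sketch of a DAG runner: every task gets a goroutine that waits on its dependencies' done channels, and a separate semaphore bounds how many tasks actually run at once. Bounding with a semaphore rather than errgroup's SetLimit means a task blocked on its dependencies doesn't occupy a run slot, which avoids a deadlock when dependents are submitted before their prerequisites. Cycle detection is assumed to happen at definition time.

```go
package dag

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type Task struct {
	Name string
	Deps []string
	Run  func(ctx context.Context) error
}

// RunDAG runs every task once its dependencies have completed, with at most
// maxParallel tasks executing simultaneously. A failure cancels the group,
// so blocked dependents unwind via ctx.Done() (fail-fast semantics).
func RunDAG(ctx context.Context, tasks []Task, maxParallel int) error {
	done := make(map[string]chan struct{}, len(tasks))
	for _, t := range tasks {
		done[t.Name] = make(chan struct{})
	}
	sem := make(chan struct{}, maxParallel) // the backpressure knob

	g, ctx := errgroup.WithContext(ctx)
	for _, t := range tasks {
		t := t
		g.Go(func() error {
			for _, dep := range t.Deps { // wait for prerequisites
				select {
				case <-done[dep]:
				case <-ctx.Done():
					return ctx.Err()
				}
			}
			select {
			case sem <- struct{}{}: // acquire a run slot
			case <-ctx.Done():
				return ctx.Err()
			}
			defer func() { <-sem }()

			if err := t.Run(ctx); err != nil {
				return err // group cancels; dependents never see done[t.Name] close
			}
			close(done[t.Name]) // release dependents
			return nil
		})
	}
	return g.Wait()
}
```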
How does this scale to distributed execution?
Same DAG, but tasks run on different machines. Dependency tracking moves to a coordinator (Airflow scheduler, Kubernetes Argo controller). Per-task results stored externally (DB, S3). Workers pull tasks whose deps are done.
Inventory Reservation with Concurrent Holds
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Asked Often
Booking system: 100 concurrent users grab the last seat. Only one wins. The losers see 'sold out' before completing checkout. Pattern: optimistic locking with version-stamped CAS, OR pessimistic per-item lock, OR atomic decrement on a Redis counter.
Key Points
- Optimistic: version + CAS, read inventory, attempt update, retry if version changed
- Pessimistic: lock per item, serializes high-contention items but cleaner semantics
- Soft holds: reserve for N minutes, expire if checkout doesn't complete
- Distributed: Redis DECR, or Postgres SELECT FOR UPDATE, or two-phase commit
Interview Follow-up Questions
Optimistic vs pessimistic, when?
Optimistic when contention is low (conflicts are rare, so the CAS rarely retries). Pessimistic when contention is high (a hot key with many users grabbing the last seat turns the CAS retry loop into a busy wait). For the 'last seat' interview question, pessimistic usually wins: it serializes contenders naturally.
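A minimal sketch of the optimistic path on an in-process counter; the same loop shape applies when the CAS is a versioned UPDATE against a database row or a Redis counter.

```go
package inventory

import "sync/atomic"

// Reserve attempts an optimistic reservation: read the remaining count, try a
// compare-and-swap, retry if another goroutine won the race. Losers see "sold out"
// before ever reaching checkout.
func Reserve(remaining *atomic.Int64) bool {
	for {
		cur := remaining.Load()
		if cur <= 0 {
			return false // sold out
		}
		if remaining.CompareAndSwap(cur, cur-1) {
			return true // this caller holds the seat
		}
		// CAS lost: someone reserved between Load and CAS; retry with a fresh value.
	}
}
```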
How is this made distributed?
Redis: `DECR inventory:item-id` is atomic; if result < 0, increment back and reject. Or Postgres: `SELECT ... FOR UPDATE` row lock + decrement in transaction. Both are pessimistic; both serialize the hot row.
What about concert ticket sales, million users, tiny inventory?
Standard pattern: virtual queue with cap (e.g., 50K users in queue, 10K seats). Behind the queue, soft holds with short TTL (~3 min). Avoids stampeding the inventory layer; queue throttles attempts. Ticketmaster, BookMyShow do this.
Race condition: user clicks 'reserve' twice, same hold or two holds?
Idempotency: pass a client-provided idempotency key. Server stores the first request's hold ID under that key; subsequent requests get the same ID. Stripe pioneered the API pattern.
Distributed Lock: Redis or ZooKeeper
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
Mutual exclusion across processes/machines. Three implementations: SETNX with TTL on Redis (cheap, mostly correct), Redlock (Redis cluster, controversial safety), ZooKeeper ephemeral nodes (strongly correct, more ops). All require a TTL/lease so a crashed holder doesn't block forever. Always design for: 'what if the holder thinks it has the lock but actually doesn't?'
Key Points
- Redis SET key value NX EX ttl is the simplest distributed lock. Atomic, TTL-based, fast.
- Lock holder must store a unique token (UUID) and check it on release; otherwise release-by-someone-else is possible.
- TTL is required: if the holder crashes, the lock auto-expires. Pick TTL > expected critical-section duration.
- Fencing tokens (monotonic ID) protect against zombies: 'I had the lock, then GC paused me, lock expired, someone else took it, then I came back and acted on stale belief'.
- ZooKeeper / etcd / Consul give stronger guarantees (linearisable, sequential ephemeral nodes) at higher operational cost.
Interview Follow-up Questions
Is Redis SETNX really safe?
For most use cases where the cost of double-execution is acceptable, yes. For workloads where double-execution corrupts data (writes to a database that doesn't enforce idempotency), no. The classic Redlock-vs-Martin-Kleppmann debate is exactly this. The pragmatic answer: use Redis SETNX for non-critical mutual exclusion (rate limiters, scheduling); use ZooKeeper/etcd or fencing tokens for safety-critical paths.
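A sketch of the SET NX EX acquire plus token-checked release, assuming the go-redis v9 client; the release must be a Lua script so the GET-and-DEL is atomic. This is the single-node variant, not Redlock, and the safety caveats discussed here still apply. Any unique value works as the token; a UUID is used for convenience.

```go
package dlock

import (
	"context"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// Delete the key only if it still holds our token, atomically.
const releaseScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`

// Acquire tries SET key token NX EX ttl; the token proves ownership on release.
func Acquire(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) (token string, ok bool, err error) {
	token = uuid.NewString()
	ok, err = rdb.SetNX(ctx, key, token, ttl).Result()
	return token, ok, err
}

// Release deletes the lock only if it still carries our token.
func Release(ctx context.Context, rdb *redis.Client, key, token string) error {
	return rdb.Eval(ctx, releaseScript, []string{key}, token).Err()
}
```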
What is a fencing token?
A monotonically increasing number associated with each lock acquisition. The lock holder includes it on every protected operation; the resource (database, file, etc.) checks it against the largest token it has seen. If the token is smaller, the operation is rejected. Prevents 'zombie holder' (a holder whose lock expired but who still thinks it has it) from corrupting state.
Why does TTL matter so much?
Because the holder might crash. Without TTL, the lock is held until manually released. A crashed process never releases. The lock blocks forever. With TTL, a crashed holder's lock auto-expires, and someone else can take over. Trade-off: TTL too short means the work might not finish before the lease expires. Either keep work short or extend the lease via heartbeat.
When should I reach for ZooKeeper instead of Redis?
When correctness matters more than simplicity. ZK is built for coordination: linearisable, ephemeral nodes auto-clean on disconnect, sequential nodes give natural fencing tokens. Operationally heavier than Redis (a ZK ensemble must be run). For high-stakes locks (leader election, primary-replica failover), ZK or etcd is the right choice. For lower-stakes (rate limit buckets, scheduled job dedup), Redis is fine.
Gotchas
- No TTL: crashed holder blocks the lock forever
- No token check on release: someone else's release deletes the lock
- Long-running work without lease extension: TTL expires mid-work, two holders concurrent
- Treating SETNX as linearisable: it is not under all failure modes
- No fencing token in critical paths: zombies can corrupt data after lock expiry
Heartbeat & Leader Election
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
Heartbeat: each instance periodically writes 'I'm alive' to shared storage. Leader election: instances compete to become the single coordinator; loser instances watch the leader's heartbeat. When the leader's heartbeat stops, a new election runs. Backed by ZooKeeper, etcd, Consul, or a simpler Redis lease.
Key Points
- Heartbeat: leader writes a key with TTL every N seconds; if it stops writing, the key expires.
- Election: candidates try to atomically claim the leader key. The winner is the leader. Losers watch.
- Lease > heartbeat interval × 3: tolerates transient network blips without flapping.
- Always design for split-brain: two instances that both think they are leader. Fencing tokens prevent corruption.
- ZooKeeper, etcd, Consul, Kubernetes leases give correct primitives. Redis with TTL is simpler but has known edge cases.
Interview Follow-up Questions
What is split-brain and how does it happen?
Two instances both believe they are the leader. Happens during network partitions: leader A is fine but the network between A and the coordination service is broken; A's lease expires; B claims leadership; network heals; A still thinks it's leader. Both write. Fencing tokens (monotonic counter on each leadership claim) let downstream resources reject the stale leader's writes.
ZooKeeper / etcd / Consul vs Redis for election?
ZK/etcd/Consul are designed for coordination: linearisable, ephemeral nodes that auto-clean on disconnect, sequence numbers as natural fencing tokens. Operationally heavier (the coordination cluster must be run). Redis is simpler but well-known to have edge cases (the Redlock debate). For high-stakes leadership (primary database failover), use ZK/etcd. For low-stakes (cron deduplication), Redis is fine.
How do I size the lease TTL?
Trade-off between failover speed and stability. Short TTL (e.g., 5s): fast failover when the leader dies, but risk of flapping during transient network blips. Long TTL (e.g., 60s): stable, but slow failover. Most production systems use 15-30s lease with renewal every 5-10s. The 'renew interval × 3 ≤ lease' rule of thumb keeps the lease healthy under one missed renewal.
Does the leader need to know it's still the leader?
Yes. Pattern: leader holds a 'leadership context' that is cancelled if the renewal fails. All work performed by the leader checks this context. If the context is cancelled (lease lost), the leader stops doing work immediately, even if it hasn't yet learned about the new leader. Without this, a leader running a slow operation can keep writing after losing leadership.
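A sketch of that pattern: a renewal loop that cancels a shared work context the moment a renewal fails. `renew` is a placeholder for the real lease call (etcd keepalive, Redis EXPIRE, a Kubernetes Lease update); initial acquisition is assumed to have already happened.

```go
package leader

import (
	"context"
	"time"
)

// RunAsLeader renews the lease on a fixed interval and cancels the returned work
// context the moment a renewal fails. All leader-only work must watch this context.
func RunAsLeader(parent context.Context, interval time.Duration,
	renew func(context.Context) error, work func(context.Context)) {

	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				if err := renew(ctx); err != nil {
					cancel() // lease lost: stop leader work immediately
					return
				}
			case <-ctx.Done():
				return
			}
		}
	}()

	work(ctx) // long-running leader duties; must check ctx.Done() regularly
}
```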
Gotchas
- TTL too short: harmless network blip causes leadership flap
- Renewing on a separate connection: connection drops independently, lease lost while leader is fine
- Doing work after losing leadership: the new leader is also writing; conflict
- No fencing token: stale leader corrupts data after lease expiry
- Single-region election with no fencing: cross-region partition can have two regional leaders
Saga Pattern: Distributed Transactions Without 2PC
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
A saga executes a multi-step business transaction across services as a sequence of local transactions, each with a compensating transaction that undoes it. If any step fails, run the compensations for the steps already done. Replaces two-phase commit (2PC) for the microservices era. Two flavours: orchestrated (a coordinator drives) or choreographed (each service reacts to events).
Key Points
- Local transactions per step. Each has a compensating transaction (cancel order, refund payment, restore inventory).
- On failure at step N, run compensations for steps N-1, N-2, ..., 1 in reverse order.
- Compensations must be idempotent: they may be replayed. Use idempotency keys.
- Orchestrated: one coordinator decides next step, calls each service, handles failures. Easier to reason about.
- Choreographed: each service publishes events, others react. Decouples services but harder to debug.
Interview Follow-up Questions
Saga vs 2PC?
2PC (two-phase commit) gives strong consistency: all participants either commit or all abort. Requires a transaction coordinator and locks across services for the full duration. Doesn't scale beyond a few services and breaks under network partitions. Saga gives eventual consistency: each step commits its local transaction; on failure, compensations undo. Scales to many services, tolerant of partitions, but readers may see partial state during execution.
Orchestrated vs choreographed?
Orchestrated: coordinator service knows the workflow, calls each service in order, handles failures centrally. Pros: visible flow, easy to add steps, easy to monitor. Cons: coordinator becomes a single point of complexity. Choreographed: each service reacts to events; no coordinator. Pros: services are independent. Cons: workflow is implicit (no single place shows it), debugging spans logs across services. Default to orchestrated unless there's a strong reason otherwise.
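A minimal orchestrator sketch: run the steps forward, and on the first failure run the completed steps' compensations in reverse. Persisting progress between steps, which a production coordinator must do, is omitted.

```go
package saga

import "context"

// Step pairs a forward action with its compensation. Both should be idempotent.
type Step struct {
	Name       string
	Do         func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

// Run executes steps in order; on the first failure it compensates the
// already-completed steps in reverse order.
func Run(ctx context.Context, steps []Step) error {
	completed := make([]Step, 0, len(steps))
	for _, s := range steps {
		if err := s.Do(ctx); err != nil {
			for i := len(completed) - 1; i >= 0; i-- { // undo in reverse order
				if cerr := completed[i].Compensate(ctx); cerr != nil {
					return cerr // compensation failed: persist state and page a human
				}
			}
			return err
		}
		completed = append(completed, s)
	}
	return nil
}
```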
Why must compensations be idempotent?
Because the saga can crash and restart. The coordinator may run a compensation, crash before recording success, restart, and run it again. Idempotent compensations mean the second run is a no-op. Use idempotency keys per compensation, the same way they're used for forward operations.
What if a compensation itself fails?
Now there's an unrecoverable inconsistency that needs human intervention. Persist the saga state, alert on-call, manually reconcile. Don't pretend it can be automated away; design for the case where compensation fails to converge and a human has to look.
Gotchas
- Forgetting idempotency keys: replays double-charge, double-refund
- Not persisting state between steps: coordinator crash loses progress
- Choreographed saga across many services: workflow is invisible, debugging is awful
- Treating compensations as 'just like forward, in reverse': they're not, semantics differ (refund != negative charge)
- No alert on compensation failure: silent inconsistency
Outbox Pattern: Reliable Event Publishing
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
Write the event to an 'outbox' table in the SAME transaction as the business write. A separate process polls the table and publishes events to Kafka/queue/webhook. Solves the dual-write problem: a database AND a message broker can't be written to atomically, so events would otherwise be lost or duplicated when one of the writes fails.
Key Points
- Dual write problem: 'INSERT order' + 'publish OrderCreated event' is two writes, no atomic guarantee. Either can fail.
- Outbox: write the event row in the SAME DB transaction as the business write. Either both happen or neither.
- A relay process polls the outbox, publishes each event to the broker, marks it published (or deletes it).
- At-least-once: the relay can publish, crash, restart, publish again. Subscribers must be idempotent.
- CDC alternative: tools like Debezium read the database WAL and publish change events without an outbox table.
Interview Follow-up Questions
Why not just use a distributed transaction (XA)?
XA between a database and a message broker exists in theory but is rarely used in practice. It locks both the DB and the broker until commit; performance is poor; not all brokers support it (Kafka doesn't). The outbox sidesteps the problem: only the DB is involved in the transaction.
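A sketch of the write side: the business row and the outbox row go into one transaction, using database/sql. Table and column names are illustrative.

```go
package outbox

import (
	"context"
	"database/sql"
)

// CreateOrder writes the business row and the outbox event in ONE transaction,
// so either both exist or neither.
func CreateOrder(ctx context.Context, db *sql.DB, id string, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if Commit succeeds

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO orders (id, payload) VALUES ($1, $2)`, id, payload); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', $1)`, payload); err != nil {
		return err
	}
	return tx.Commit()
	// A separate relay polls: SELECT ... FROM outbox FOR UPDATE SKIP LOCKED,
	// publishes to the broker, then deletes (or marks) the row.
}
```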
What about Debezium / CDC?
Debezium reads the database's write-ahead log (WAL) and publishes changes as events to Kafka. With CDC, no outbox table is needed — every INSERT/UPDATE/DELETE becomes an event. Trade-off: every change gets published including ones not meant as events; the event payload can't be shaped (it's the row); CDC requires platform-specific setup. Outbox gives cleaner event design at the cost of one extra table and the relay process.
Does the outbox pattern guarantee exactly-once delivery?
No. It gives at-least-once: the event will eventually be published, possibly more than once if the relay crashes between publishing and marking-sent. Consumers must be idempotent. Combine with idempotency keys at the consumer side to get exactly-once-effect.
How big should the outbox table grow?
After events are published, the row can either be deleted or kept as an audit log. For high-volume systems, delete to keep the table small (and fast for the relay query). For audit / replay needs, keep with a TTL (delete rows older than 30 days). Either way, partition the table or run regular VACUUM to keep it healthy.
Gotchas
- Forgetting to put the outbox INSERT in the same transaction defeats the pattern
- Relay marks events 'sent' before actually publishing: events lost on crash
- Single relay instance: bottleneck and SPOF; run multiple with SKIP LOCKED
- Letting the outbox table grow unbounded: relay queries slow down, table bloats
- Consumers not idempotent: at-least-once delivery causes duplicate side effects
Cancellation Propagation Across Services
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
When a request is cancelled (client disconnect, deadline exceeded, parent abort), the cancellation must propagate through every downstream call so wasted work stops everywhere. gRPC ships deadlines in headers; HTTP relies on connection close detection; Go does it with context; Java with CompletionStage / virtual thread interrupts. Done wrong, zombie work ties up resources after the user has gone.
Key Points
- Cancellation is end-to-end: a single layer that ignores it leaves zombie work running
- Client disconnect: the server must detect socket close and stop processing (r.Context() in Go's net/http, request.is_disconnected() in Starlette)
- gRPC propagates a deadline header (grpc-timeout); intermediate services subtract elapsed time and pass the remainder downstream
- HTTP has no standard deadline header; pass via X-Request-Timeout or rely on connection lifetime
- Forgotten cancellation = thread/goroutine pool exhaustion as zombies pile up
Interview Follow-up Questions
How is a client disconnect detected?
Depends on the framework. Go's net/http: r.Context().Done() fires. Node.js: req.on('close') fires. Starlette/FastAPI: await request.is_disconnected(). Java Servlet: AsyncListener.onTimeout. The detection isn't magic; the underlying TCP socket close is what triggers it. If the runtime doesn't expose this, the disconnect isn't detectable, and the only fallback is to enforce a deadline.
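A sketch in Go's net/http showing the propagation: derive the downstream call's context from r.Context(), so both the client disconnect and the deadline flow through to the outbound request. The downstream URL is a placeholder.

```go
package handler

import (
	"context"
	"net/http"
	"time"
)

// handle derives the downstream timeout from the request context, so the
// downstream call is aborted if either the client disconnects or the deadline fires.
func handle(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://downstream.internal/api", nil)
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req) // cancelled when ctx is done
	if err != nil {
		http.Error(w, "upstream unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(http.StatusOK)
}
```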
How does gRPC propagate deadlines?
On the wire, gRPC sends a grpc-timeout header (e.g., 'grpc-timeout: 500m' for 500ms). Server-side libraries set the request context deadline from this. When the server makes further gRPC calls and passes the ctx, the client library reads the remaining deadline from ctx and sends a smaller grpc-timeout. Each hop subtracts the elapsed time. This is deadline propagation across services, and it's the main reason gRPC is preferred over REST for inter-service calls.
What about plain HTTP? Is there a deadline header?
Not officially. Some teams use X-Request-Timeout or X-Deadline as a custom convention. Others rely on connection-level timeouts and the client disconnect signal. When both sides are under the team's control, gRPC or a custom header works. Otherwise, the only options are TCP and per-hop timeouts.
What's the difference between cancellation and timeout?
Timeout is a deadline; cancellation is the act of aborting work in progress. A timeout PRODUCES a cancellation when it fires. Cancellation can also come from other sources: parent task cancelled, user pressed cancel, system shutdown. The handler logic is the same: check ctx.Done() / is_cancelled(), unwind cleanly, release resources.
Gotchas
- Forgetting to pass ctx to a downstream call: that layer ignores the deadline, zombies pile up
- Long CPU loops without ctx checks: cancellation can't interrupt CPU work, only I/O
- Deadline too tight at the edge: real requests time out under normal load
- Deadline too loose at the edge: zombies pile up before deadline hits
- Treating cancellation as exceptional: it's a normal part of every request lifecycle
CPU-bound vs I/O-bound
Section: Performance & Tradeoffs | Difficulty: Basic | Frequency: Asked Often
A CPU-bound task spends most of its time doing computation; an I/O-bound task spends most of its time waiting. The two workloads need fundamentally different concurrency strategies, and picking the wrong one makes the code slower, not faster.
Key Points
- CPU-bound: bottleneck is compute. Speedup requires more cores OR faster code.
- I/O-bound: bottleneck is waiting (network, disk, DB). Speedup requires more concurrency to overlap waits.
- Threads help I/O-bound work even with the GIL, Python releases the GIL around blocking calls
- Threads do NOT help CPU-bound work in Python (GIL serializes execution), use multiprocessing
- Async/await scales I/O-bound to 100K+ tasks but does nothing for CPU-bound
- Right pool size: CPU-bound ≈ core count; I/O-bound ≈ much higher (often 100s)
Interview Follow-up Questions
How does one tell if code is CPU or I/O bound?
First step: `time` it. If user+sys ≈ real, CPU-bound; if user+sys << real, I/O-bound. Then profile (py-spy / pprof / async-profiler) to confirm where the time goes. Most production code is I/O-bound, DBs, RPCs, disk.
Why doesn't async/await help CPU-bound code?
Async/await is cooperative single-threaded concurrency. There's only one thread, so CPU work runs on it serially. A CPU-heavy task in an async function blocks the event loop and stalls every other task. Use a thread/process pool for CPU work, async for I/O.
What's the right thread pool size?
CPU-bound: cores (sometimes cores+1 to handle minor I/O). I/O-bound: cores × (1 + wait/compute). For 90% I/O work, that's 10× cores. With Java 21+ virtual threads or Go goroutines, the answer collapses to 'one per task.'
When does CPU-bound matter in practice?
Image/video encoding, ML inference (without GPU), regex on huge inputs, JSON parsing of massive payloads, cryptography, compression. Most web services are I/O-bound; data pipelines are often CPU-bound.
Why does Python's GIL release on I/O?
Blocking system calls (sockets, file I/O, sleep) release the GIL because the C implementation knows the thread will idle. Other Python threads can run during the wait. CPU work in pure Python keeps the GIL held until a bytecode boundary.
Gotchas
- A function can be CPU-bound in one workload and I/O-bound in another, measure on actual data
- Hashing/JSON-parsing/compression in 'I/O' code paths is often the hidden CPU bottleneck
- GPU-accelerated compute is technically I/O from the CPU's perspective, different concurrency model again
- Network I/O can become CPU-bound at very high throughput (parsing dominates)
- asyncio + a CPU-heavy task = stalled event loop and timeouts on every concurrent task
Throughput vs Latency Tuning
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Asked Often
Throughput = ops/sec the system can sustain. Latency = time per op. They trade off via batching, queuing, and concurrency. Higher concurrency → higher throughput but worse tail latency. Little's Law: average concurrency = throughput × average latency.
Key Points
- Throughput: system-level capacity (req/sec)
- Latency: per-op time, especially p50/p99/p99.9 percentiles
- Little's Law: L = λW (concurrent in-flight = throughput × avg latency)
- Batching trades latency for throughput (waits for batch, then processes many)
- Queue length grows without bound once arrival rate exceeds service rate; latency blows up as utilisation approaches 1
Interview Follow-up Questions
What's Little's Law and why does it matter?
L = λW. Average number of items in system = arrival rate × average time in system. Useful for prediction: at 1000 req/sec and average latency 100ms, there are 100 in-flight requests at any moment. Size connection pool, thread pool, and queue depth around this.
Why does p99 matter more than average?
Page-load times are bottlenecked by the slowest of N parallel requests. If 99% of requests are fast and 1% are slow, the 1% dominates user experience. Average masks tail. Always track p50 (median), p99 (tail), p99.9 (worst-case).
How does batching affect latency?
Batching adds wait-for-batch time to each operation. If batch size = 100 and ops arrive at 1ms intervals, last-in-batch waits 99ms before processing. But total processing for 100 ops drops dramatically. Net: higher throughput, individual latency up by batch wait.
What's USL (Universal Scalability Law)?
Neil Gunther's law: as concurrency N increases, throughput grows linearly (in idealized world), then sublinearly (contention), then DROPS (coordination cost). Real systems have an optimal N, beyond that, more threads = lower throughput. Find N empirically.
Gotchas
- Average latency hides tail, always look at p99/p99.9
- Adding threads beyond optimal N HURTS throughput (USL)
- Unbounded queues turn brief overload into permanent latency degradation
- Cache hit rate drops when concurrency increases (cache-line contention, eviction)
Context Switch Cost
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
OS thread context switch: ~5μs (kernel-mediated, save/restore registers, TLB flush). Goroutine / virtual thread context switch: ~hundreds of ns (user-space, no kernel transition). Async task switch: ~tens of ns (function call). Choosing the right unit affects throughput at scale.
Key Points
- OS thread switch: ~1-5μs typical, includes kernel mode switch + TLB pollution
- Goroutine / virtual thread switch: ~200-500ns, user-space only
- asyncio task switch: ~50-100ns, just a coroutine resumption
- At 1M req/sec, a 1μs context switch costs 1 second of CPU per second, 100% of one core
- Mass goroutine spawning is fine; mass thread spawning hits scheduler limits
Interview Follow-up Questions
Why do OS thread switches cost so much more?
Three reasons: (1) kernel mode transition (~hundreds of ns by itself); (2) full register save/restore including FP and SIMD state; (3) TLB pollution, virtual memory mappings get partially invalidated, causing subsequent memory accesses to refill from page tables. Goroutine switches do none of this, same address space, just swap stack pointers.
What's TLB pollution?
TLB = Translation Lookaside Buffer, the CPU's cache of virtual→physical address translations. Switching processes flushes it; switching threads in the same process partially flushes (TLB tagged by ASID on modern CPUs). After a thread switch, the next N memory accesses are slower because they refill the TLB.
How are context switches measured in production?
Linux: `vmstat 1` shows cs (context switches per second) globally. Per-process: `pidstat -w 1`. For perf-critical services, monitor cs/sec; sudden spikes correlate with throughput drops. Goroutine 'switches' don't show in OS metrics, use Go's pprof/sched.
When does context-switch cost become the bottleneck?
When tasks are very short. If each task does 1μs of work and incurs a 5μs context switch, 5x more time goes to switching than working. Either batch tasks (do many per switch) or use cheaper switching (goroutines, async).
Gotchas
- Mass-spawning OS threads (e.g., one per request) hits OS scheduler limits at ~10K threads
- vmstat cs only counts OS-level switches; goroutine and async switches invisible
- TLB pollution makes the FIRST few accesses after a switch much slower than steady-state
- Hyper-threading hides some of the stall by letting the sibling logical core run, but it doesn't remove the switch cost
- Synchronized blocks in Java pin virtual threads to carriers, losing the cheap-switch benefit
Memory Overhead Comparison
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
OS thread: ~1MB stack reserved. Goroutine: ~2KB initial, grows on demand. Virtual thread (Java 21): ~few KB. asyncio task: ~1KB. The difference dictates whether you can run 10K vs 1M concurrent units on the same hardware.
Key Points
- OS thread: 1MB default stack reserved (mostly untouched) → 10K threads = 10GB of reserved address space
- Goroutine: 2KB initial, grows up to 1GB → 10K goroutines = 20MB
- Virtual thread (Java 21): ~few KB stack → 1M virtual threads ~ few GB
- asyncio task: ~1KB per task (Python) → 1M tasks ~ 1GB
- Per-thread/task overhead matters when you serve millions of concurrent connections
Interview Follow-up Questions
Why do OS threads reserve 1MB even if unused?
Stack is virtually mapped at thread creation; physical pages allocated only when touched (most aren't). RSS may be smaller than 1MB × thread count, but the virtual address space cost is real; eventually 32-bit address space exhausts (less of an issue on 64-bit, but still). Some kernels actually allocate the stack eagerly.
How does goroutine stack 'grow on demand'?
Each goroutine starts with a tiny stack (2KB in Go 1.4+). When a function call would overflow it, the runtime allocates a larger stack, copies the contents, updates pointers, and retries the call. Cost is paid per growth (rare in practice). Stacks also shrink at GC if usage drops.
What's the memory cost of 1M asyncio tasks?
Each task is a coroutine state object, ~1KB in Python (the suspended frame, locals, await chain). 1M tasks ~ 1GB. Add aiohttp connections (~few KB each) and the total reaches several GB. Bound concurrency or risk OOM.
Where does ThreadLocal cost matter?
Each thread has a ThreadLocal map. Java apps with many ThreadLocals (Spring, MDC, request context, ORM) can use ~50KB per thread JUST for ThreadLocals. With 1M virtual threads, that's 50GB. Java 21 ScopedValue solves this for new code.
Gotchas
- 1M threads on Linux = ~1MB×1M = 1TB of reserved stack, won't fit on most servers
- Async stacks live across awaits, long-lived tasks accumulate stack memory
- ThreadLocal × virtual threads = memory explosion (use ScopedValue in Java 21+)
- Goroutine stack growth costs CPU during the copy, usually invisible but profiles can show it
Amdahl's Law
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
Maximum speedup from parallelisation = 1 / (s + p/N), where s is serial fraction, p is parallel fraction, N is processor count. Even small serial fractions cap maximum speedup. 5% serial = 20x maximum, no matter how many cores. The implication: optimising the serial part is often the highest-leverage work.
Key Points
- Speedup(N) = 1 / (s + (1-s)/N). As N → ∞, speedup → 1/s.
- 5% serial = 20x maximum speedup. 1% serial = 100x. 10% = 10x.
- Real systems have multiple serial parts: I/O, lock contention, sequential bottlenecks.
- Highest-leverage optimisation: reduce the serial fraction, not just add cores.
- Universal Scalability Law (USL) extends Amdahl with a contention term, often more realistic.
Interview Follow-up Questions
What's the practical lesson?
Throwing cores at a problem has diminishing returns. To break the limit, reduce the serial fraction. Profile to find what's serial: a single-threaded log writer, a global lock, a sequential I/O to a shared backend. Eliminate or parallelise it; only then does adding cores keep helping.
How does Amdahl differ from USL?
Amdahl assumes parallel portions scale perfectly. USL (Universal Scalability Law) adds a contention term and a coherency term, modeling cases where adding cores actually slows you down (cache-line bouncing, lock contention overhead). USL fits real systems better; Amdahl is the simpler upper bound.
Does Amdahl apply to distributed systems?
The same shape. The 'serial' part might be a coordinator, a leader, a global lock, a single-shard query. Most scaling problems in distributed systems are Amdahl-shaped: 'this part of the workflow can't be parallelised, and it caps the throughput'.
Why don't speedups in benchmarks match Amdahl predictions?
Several reasons. Benchmark workload may not match production (different serial fraction). Cache effects help on some sizes (more cores = more cache total) and hurt on others (false sharing). System overhead (scheduler, GC) is non-trivial. Amdahl is the upper bound; real speedup is at or below it.
Gotchas
- Adding cores past the Amdahl ceiling is wasted money
- 'Embarrassingly parallel' workloads still have a serial setup/teardown that limits speedup at scale
- Lock contention is a hidden serial fraction; profile to find it
- Per-core wins (better algorithm, less contention) sometimes beat throwing more cores
- Real systems are often USL-shaped (degradation past peak), not Amdahl-shaped (asymptote)
Universal Scalability Law (USL)
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
USL extends Amdahl's Law with a coherency term. Speedup = N / (1 + α(N-1) + βN(N-1)). The β term captures the case where adding cores actually slows you down (cache coherency, lock contention). Real systems peak at some N and degrade beyond it. USL fits this; Amdahl doesn't.
Key Points
- α (contention): the part that serialises, like Amdahl's serial fraction.
- β (coherency): the cost of coordinating between cores; quadratic in N.
- Real systems often peak at finite N, then degrade. USL predicts the peak; Amdahl doesn't.
- Fitting USL to your benchmark data tells you where to invest: reduce contention OR coordination.
- Practical use: capacity planning. Know the peak before you scale past it and waste money.
Interview Follow-up Questions
How does USL differ from Amdahl?
Amdahl predicts an asymptote; USL predicts a peak followed by decline. Real systems usually have the peak shape: adding cores helps up to some N, then cache coherency and contention overhead exceed the parallelism gain, and you go backward. Amdahl is a useful upper bound; USL is closer to what you'll actually measure.
What does the β term physically represent?
The cost of cores coordinating with each other. Cache coherency traffic (cache lines bouncing between L1 caches), atomic operations on shared memory, lock acquisition cost, NUMA cross-socket access. These costs grow as cores multiply because every core might need to talk to every other core.
How is USL used in practice?
Two ways. (1) Capacity planning: fit USL to benchmark data, predict the peak, don't scale past it. (2) Optimisation guidance: a high-α fit means 'reduce serial work'; a high-β fit means 'reduce inter-core coordination'. The shape of the curve points at the right kind of fix.
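The formula and its peak are small enough to compute directly: setting the derivative of the USL curve to zero gives a peak at N* = sqrt((1-α)/β). A tiny helper, assuming α and β have already been fitted to benchmark data:

```go
package usl

import "math"

// Speedup is the USL prediction for n workers given contention (alpha) and
// coherency (beta) coefficients fitted from benchmark data.
func Speedup(n, alpha, beta float64) float64 {
	return n / (1 + alpha*(n-1) + beta*n*(n-1))
}

// PeakN is the concurrency at which USL predicts throughput peaks: sqrt((1-alpha)/beta).
func PeakN(alpha, beta float64) float64 {
	return math.Sqrt((1 - alpha) / beta)
}
```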
Where can USL not be applied?
USL assumes a single bottleneck shape (contention + coherency). For systems with multiple distinct bottlenecks at different N (memory at N=4, network at N=16, disk at N=64), USL gives a single curve that doesn't fit any one bottleneck well. For those, profile each bottleneck separately.
Gotchas
- Fitting USL to too few data points (3-4) gives unreliable α, β estimates
- Treating the asymptote as the goal: real systems peak and decline; the peak is the goal
- Ignoring the curve shape: high α and high β need different fixes
- Adding cores past the USL peak makes you slower AND more expensive
- USL is per-workload; same hardware can have very different curves for different workloads
Tail Latency: Why p99 Matters More Than Average
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Asked Often
p99 (the 99th percentile latency) is the user experience; the average is a lie. Tail latency amplifies in fan-out: a request that fans out to 10 services has a p99 dominated by the slowest of 10 backends, not by their average. Mitigations: hedged requests, tied requests, micro-partitioning, smaller fan-out, better tail metrics.
Key Points
- Average hides outliers. The p99 IS the user experience for 1% of requests, which is millions when you serve millions.
- Tail amplification: a request fanning out to N backends has p99 ≈ max of N independent draws. With N=10, p99 ≈ p99.9 of one backend.
- Mitigations: hedged requests (send to two replicas, take first), tied requests (cancel the loser), micro-batching, fewer fan-outs.
- Coordinated omission: load testing tools that wait for slow responses miss tail latency. Use HDR Histogram, k6, or wrk2.
- Read 'The Tail at Scale' by Dean & Barroso (2013); foundational paper, still relevant.
Interview Follow-up Questions
Why does p99 amplify with fan-out?
Each backend independently has some chance of being slow. Fanning out to N backends and waiting for all means waiting for the slowest. The probability that ALL N are fast is much lower than the probability that ONE is fast. With N=10 and per-backend p99=100ms, ~10% of fan-out requests will hit at least one slow backend. The fan-out p99 lands near the original p99.9 of one backend.
What's the simplest way to reduce tail latency?
Hedged requests. Send to one replica; if not done after the p95 latency, send a duplicate to another replica; take the first response. ~5% extra load, dramatic p99 improvement (often 5-10x). Only safe with idempotent operations.
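A sketch of a hedged call: fire the primary, arm a timer at roughly the p95 latency, fire the duplicate only if the timer wins, and cancel whichever request loses. `call` is a placeholder for the real replica client; idempotent operations only.

```go
package hedge

import (
	"context"
	"time"
)

// Get sends to replica 0; if it hasn't answered within hedgeAfter (≈ p95 latency),
// it also sends to replica 1 and returns the first response to arrive.
func Get(ctx context.Context,
	call func(ctx context.Context, replica int) (string, error),
	hedgeAfter time.Duration) (string, error) {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever request loses

	type result struct {
		v   string
		err error
	}
	results := make(chan result, 2) // buffered so the loser's send never blocks

	launch := func(replica int) {
		v, err := call(ctx, replica)
		results <- result{v, err}
	}
	go launch(0)

	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	select {
	case r := <-results: // primary answered before the hedge delay
		return r.v, r.err
	case <-timer.C: // primary is slow: hedge to a second replica
		go launch(1)
	case <-ctx.Done():
		return "", ctx.Err()
	}

	select {
	case r := <-results: // first of the two to answer wins
		return r.v, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}
```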
What is coordinated omission?
A flaw in many load testers: they send the next request only after the previous one returns. If the server slows down, the tester slows down with it, sending fewer requests and missing the latency that users at constant rate would actually experience. Use open-loop testers (wrk2, k6, vegeta) that send at constant rate regardless of response time.
Optimise for p50 or p99?
Depends on the workload. For batch/throughput, p50 (or average). For user-facing requests, p99. For 'tail-sensitive' systems (search, ad serving), even p99.9. The question is: how many users have a bad experience? At millions of requests, even p99.99 is many users.
Gotchas
- Reporting average latency in dashboards: hides the tail entirely
- Coordinated omission in load tests: reported p99 understates real p99 by 10-100x
- Fan-out without thinking about amplification: p99 of the user request is much worse than p99 of any backend
- Hedging non-idempotent operations: double-execution under network failures
- Prometheus summary metrics with default config: percentiles drift over the reporting window
Benchmarking Methodology
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
Microbenchmarks lie unless they're done carefully. Warm up the JIT/runtime. Defeat dead-code elimination. Use realistic input. Measure with confidence intervals (multiple runs). Benchmark the right thing (the operation, not the noise around it). End-to-end load tests for system-level decisions, microbenchmarks for code-level decisions.
Key Points
- Warmup is critical: JIT, branch predictor, caches all need to stabilise.
- Dead-code elimination silently deletes your benchmark; consume the result.
- Multiple runs in fresh JVMs (forks) catch JVM-to-JVM variance.
- Report confidence intervals; if they overlap, you can't claim one is faster.
- Microbenchmark vs load test: microbenchmark for 'is op A faster than op B'; load test for 'does the system handle 1000 RPS'.
Interview Follow-up Questions
Why doesn't a simple time.time() loop work?
Three reasons. (1) Warmup: cold code runs differently from warm code. (2) Dead code elimination: if the result is unused, the compiler deletes the work. (3) Variance: GC pauses, scheduler quanta, cache effects produce huge run-to-run variance. JMH/pytest-benchmark/testing.B handle all three. Hand-rolled timing is almost always wrong unless the target is whole applications, not microbenchmarks.
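A Go `testing.B` sketch showing the two habits that matter most: reset the timer after setup and consume the result through a package-level sink so dead-code elimination can't delete the loop. `checksum` is a stand-in for the operation under test.

```go
package bench

import "testing"

// Package-level sink: assigning the result here stops the compiler from
// eliminating the benchmarked work as dead code.
var sink uint64

// checksum is a placeholder for the operation under test.
func checksum(data []byte) uint64 {
	var s uint64
	for _, b := range data {
		s = s*31 + uint64(b)
	}
	return s
}

func BenchmarkChecksum(b *testing.B) {
	data := make([]byte, 4096)
	b.ReportAllocs()
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = checksum(data) // consume the result
	}
}
```

Run it with `go test -bench=. -count=10` so there are enough samples to compare with a confidence interval rather than a single number.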
How many runs are needed?
Enough that confidence intervals are tight. JMH defaults: 5 warmup + 5 measurement iterations × 2 forks = 10 measurements. For tighter CIs, increase forks (more JVM-to-JVM samples). For tighter relative comparisons, increase iterations within each fork.
When to prefer load testing over microbenchmarks?
When the question is system-level: 'can we handle Black Friday', 'what's our p99 at 1000 RPS', 'does the cache help end-to-end'. Microbenchmarks answer 'is op A faster than op B'. Use load tests (k6, wrk, gatling) for the system question, microbenchmarks to drill into specific hot spots.
What's the relationship between benchmarking and profiling?
Benchmarking measures: 'this takes X ns'. Profiling explains: 'X ns is spent here'. Both essential. Benchmark-only gives a number with no path to improvement. Profile-only gives a flame graph but no impact estimate. Workflow: profile to find the hot spot → benchmark the hot spot → make a change → re-benchmark to confirm.
Gotchas
- Hand-rolled timing loops give numbers that are off by 10x or more
- Single run = no idea of variance; could be off by a factor of 2
- Discarding the benchmark result lets DCE delete your loop
- Reporting averages without intervals hides 'no significant difference'
- Benchmarking in dev/IDE = unreliable; always run benchmarks from CLI
Measuring Lock Contention
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
Lock contention is when threads spend significant time waiting for locks instead of running. Measure with: profiler block-time view, JFR/async-profiler lock events, perf-events for futex contention. The fix is usually one of: shrink the critical section, shard the lock, replace with lock-free, or remove shared state entirely.
Key Points
- Symptom: high CPU time spent in lock acquisition / blocked / parked states.
- Java: JFR events 'jdk.JavaMonitorEnter', async-profiler 'lock' mode, jstack thread dumps.
- Linux: perf trace -e 'sched:*' or perf record with off-cpu profiling.
- Fix priority: shrink critical section > shard lock > lock-free structure > redesign to avoid sharing.
- False sharing looks like contention; measure cache-miss rate to distinguish.
Interview Follow-up Questions
How is contention identified as the bottleneck?
Check CPU utilisation under load. If CPU is well below 100% but throughput won't go higher, threads are blocked on something. The 'something' is often a lock. Profile with off-CPU sampling (async-profiler lock mode, perf trace, Go mutex profiler) to find the specific lock.
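In Go, the mutex and block profiles record nothing until they're switched on; a sketch of enabling them and exposing pprof so the contended lock can be found:

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"runtime"
)

func main() {
	// Record every mutex contention event (use a larger divisor in production
	// to sample). Block profiling covers channel and WaitGroup waits too.
	runtime.SetMutexProfileFraction(1)
	runtime.SetBlockProfileRate(1)

	go http.ListenAndServe("localhost:6060", nil)

	// ... application work ...
	// Then inspect with: go tool pprof http://localhost:6060/debug/pprof/mutex
	select {}
}
```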
Shrink critical section vs shard lock: which first?
Shrink first. Often the critical section contains work that doesn't need the lock (compute outside, only update inside). Shrinking is cheap and doesn't change the API. Sharding is more invasive (queries that span partitions have to deal with N partitions). Try shrinking; if still contended, shard.
What's the difference between contention and false sharing?
Contention: threads explicitly waiting on the same lock. False sharing: threads writing to different fields that happen to be on the same cache line, causing cache-line ping-pong even though there's no logical contention. Contention shows up as block time; false sharing shows up as high cache-miss rate with high CPU. Different fixes (lock partitioning vs padding/separation).
Is GIL contention measurable in CPython?
Yes. py-spy --gil shows per-thread GIL hold time. ceval.c instrumentation (CPython internals) can be enabled. For most Python services, the GIL is the bottleneck on multi-threaded CPU work, which is why 'use multiprocessing' or 'use C extensions that release the GIL' are the standard advice.
Gotchas
- CPU at 100%: contention is NOT the bottleneck; the system is CPU-bound, optimise the code
- CPU low but throughput limited: lock or I/O is the bottleneck; profile to find which
- Confusing high CPU with high contention: profile both on-CPU AND off-CPU
- Profiler overhead: high sampling rates can themselves cause perceived contention
- Optimising a non-bottleneck: makes the system more complex without helping
NUMA & Cache Effects
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
Multi-socket servers have NUMA: each CPU socket has its own memory; accessing the other socket's memory is slower (1.5-3x). Pinning threads to NUMA nodes and allocating memory locally can dramatically improve performance for memory-bound workloads. False sharing is the related cache-line problem within a socket.
Key Points
- NUMA: each socket has its own memory bank. Cross-socket access is slower than local.
- Linux: numactl --hardware shows NUMA topology. numactl --cpunodebind=0 --membind=0 pins to node 0.
- JVM, Go runtime, Postgres can be NUMA-aware; check docs and tune.
- False sharing: two cores writing different fields on the same cache line cause cache-line bouncing.
- Fix false sharing with padding (@Contended in Java, _Alignas(64) in C) so independent fields land on different cache lines.
Interview Follow-up Questions
How big is the NUMA penalty?
Cross-socket memory access is typically 1.5x to 3x slower than local. For latency-sensitive workloads (databases, message brokers, in-memory caches), this is significant. For CPU-bound workloads with small working sets that fit in cache, NUMA effects are smaller. Measure with `perf stat -e LLC-loads,LLC-load-misses` to see how much cross-socket traffic is happening.
How is false sharing identified?
Symptoms: high CPU, low instructions-per-cycle (IPC), and a spike in cache-coherency (HITM) events; `perf c2c` is built to find exactly this. Profiles show time spent in trivial atomic operations. Fix: pad the contended fields. The improvement is often dramatic (2-10x for the affected operation).
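A sketch of the padding fix in Go: two counters owned by two different goroutines, pushed onto separate 64-byte cache lines so increments stop bouncing the line between cores. Only do this after measuring; as the gotchas note, padding wastes memory.

```go
package counters

import "sync/atomic"

// Without the padding, a and b share a cache line and every increment from the
// two writer goroutines bounces that line between cores (false sharing).
type PaddedCounters struct {
	a atomic.Int64
	_ [56]byte // 8-byte counter + 56 bytes of padding = one full 64-byte line
	b atomic.Int64
	_ [56]byte
}

func (c *PaddedCounters) IncA() { c.a.Add(1) }
func (c *PaddedCounters) IncB() { c.b.Add(1) }
```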
Should atomics always be padded?
No. Padding wastes memory (64 bytes per field instead of 8). Only pad after measuring false sharing. The Java LongAdder uses padding because the per-thread cells are explicitly designed to be hot; a one-off counter doesn't need it.
How does NUMA interact with virtualisation/cloud?
Cloud VMs may or may not expose NUMA. Smaller instance types typically don't (one socket worth of memory). Larger instances often do. Check `numactl --hardware` inside the VM. For NUMA-aware workloads on multi-socket VMs, pinning still helps.
Gotchas
- Cross-socket memory access is 1.5-3x slower; bind threads and memory to the same node
- False sharing on hot atomics can dominate even simple workloads
- JVM and Go runtimes are mostly NUMA-blind by default; manual binding helps
- Large heaps that span sockets thrash; either pin or use multiple smaller processes
- Padding wastes memory; only apply where you've measured false sharing
Parallel Merge Sort & ForkJoin
Section: Performance & Tradeoffs | Difficulty: Medium | Frequency: Sometimes
Divide-and-conquer naturally parallelizes, recursive halves run on different cores. Java's ForkJoinPool is purpose-built for this; Go uses goroutines + WaitGroups; Python uses concurrent.futures. Threshold check stops parallelism for small subarrays, fork overhead exceeds the parallelism win.
Key Points
- Divide-and-conquer = natural parallelism (each half is independent)
- Threshold: stop forking when subarray < N (~1000 elements)
- Java ForkJoinPool: work-stealing scheduler optimized for D&C
- Sequential merge step is unavoidable bottleneck, Amdahl's law
- Speedup is sublinear, not proportional to cores: 8 cores ≠ 8x speedup (merge bottleneck + overhead)
Interview Follow-up Questions
Why a threshold? Why not parallelize all the way down?
Fork overhead (creating a task, scheduling, joining) is a few μs. For a subarray of 10 elements, the fork cost dwarfs the work. Threshold ~= 1000 elements is empirical, fork cost amortized over enough work. Below threshold, sequential is faster.
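A hedged Go sketch of the threshold idea; the `threshold` constant is illustrative, and a production version would reuse merge buffers and bound goroutine fan-out.

```go
package psort

import (
	"sort"
	"sync"
)

const threshold = 1000 // illustrative: below this, fork overhead exceeds the win

// ParallelMergeSort sorts s in place, forking one goroutine per half
// until the subarray drops below the threshold.
func ParallelMergeSort(s []int) {
	if len(s) < threshold {
		sort.Ints(s) // sequential below the threshold
		return
	}
	mid := len(s) / 2
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ParallelMergeSort(s[:mid])
	}()
	ParallelMergeSort(s[mid:]) // reuse the current goroutine for the other half
	wg.Wait()
	merge(s, mid)
}

// merge combines the two sorted halves s[:mid] and s[mid:] via a buffer.
func merge(s []int, mid int) {
	buf := make([]int, 0, len(s))
	i, j := 0, mid
	for i < mid && j < len(s) {
		if s[i] <= s[j] {
			buf = append(buf, s[i])
			i++
		} else {
			buf = append(buf, s[j])
			j++
		}
	}
	buf = append(buf, s[i:mid]...)
	buf = append(buf, s[j:]...)
	copy(s, buf)
}
```

Note the merge itself stays sequential here; that is exactly the Amdahl bottleneck described above.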
What's the theoretical speedup of parallel merge sort?
O(n log n) sequential; O(log² n) on infinite cores. With P cores: O(n log n / P + log n). Practically: 4 cores ~= 3.2x speedup, 8 cores ~= 5.5x, due to merge step bottleneck and synchronization overhead. Amdahl's Law: sequential fraction caps speedup.
Why ForkJoinPool over regular ExecutorService?
FJP uses work-stealing, idle threads steal tasks from busy threads' queues. Optimized for recursive divide-and-conquer where subtasks are short and bursty. Regular ThreadPoolExecutor's central queue serializes scheduling, bottleneck for fine-grained tasks.
Can the merge step itself be parallelized?
Yes, with O(log n) parallelism using rank-based parallel merge (find each element's rank in the other array via binary search, write to position in parallel). Used in libraries like Intel TBB. Doubles complexity for marginal gains; usually not worth it.
Gotchas
- ForkJoinPool.commonPool() is shared, heavy use starves parallel streams
- ProcessPoolExecutor pickles arguments, large arrays have IPC overhead
- Threshold too low → fork overhead dominates
- Threshold too high → underutilization on big arrays
- Parallel merge sort has O(n) extra memory for the merge buffer
Parallel MapReduce: Word Frequency Counter
Section: Performance & Tradeoffs | Difficulty: Medium | Frequency: Sometimes
Map: each worker processes a chunk of input → partial counts. Reduce: merge partial counts into final result. Map step is embarrassingly parallel; reduce step is the synchronization point. Java's parallelStream, Python's multiprocessing.Pool.map, Go's errgroup, all express this.
Key Points
- Map: chunk input, each worker computes partial result independently
- Reduce: merge partial results, must be associative (and ideally commutative)
- Embarrassingly parallel: no inter-worker communication during map
- Reduce strategies: tree-reduce (parallel), sequential merge, ConcurrentHashMap
- Foundation of distributed computing, Hadoop, Spark, BigQuery all extend this
Interview Follow-up Questions
Why per-worker map instead of shared ConcurrentHashMap?
Avoid lock contention. Each worker writes to its OWN map (no synchronization needed). Final reduce merges. Concurrent updates to one shared map serialize at the lock, defeats parallelism. The shard-then-merge pattern is universal in MapReduce.
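A minimal Go sketch of the shard-then-merge pattern for word counting; package and function names are illustrative. Each worker fills its own map (map phase, no locks), then one sequential pass merges the partials (reduce phase).

```go
package wordcount

import (
	"strings"
	"sync"
)

// Count chunks the input, counts words per worker into a private map,
// then merges the partial maps.
func Count(lines []string, workers int) map[string]int {
	chunk := (len(lines) + workers - 1) / workers
	partials := make([]map[string]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(lines) {
			break
		}
		end := start + chunk
		if end > len(lines) {
			end = len(lines)
		}
		wg.Add(1)
		go func(w int, part []string) {
			defer wg.Done()
			counts := make(map[string]int) // worker-private: no synchronization
			for _, line := range part {
				for _, word := range strings.Fields(line) {
					counts[word]++
				}
			}
			partials[w] = counts // each worker writes a distinct index: no race
		}(w, lines[start:end])
	}
	wg.Wait()

	// Reduce: merging counts is associative and commutative, so order is irrelevant.
	total := make(map[string]int)
	for _, p := range partials {
		for word, n := range p {
			total[word] += n
		}
	}
	return total
}
```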
How does this connect to Hadoop / Spark?
Distributed MapReduce. Same pattern at machine scale: each machine processes a chunk of input from HDFS/S3, emits partial results; reduce stage shuffles by key and merges. Hadoop's per-mapper / per-reducer abstraction is exactly this.
What's a tree-reduce?
Pair up partial results and merge in pairs in parallel; repeat until one result. log(N) merge depth instead of linear. Useful when reduce is expensive and many partials exist. Java's parallelStream uses this internally.
Why must reduce be associative?
Parallel execution can merge in any order: ((A+B)+(C+D)) or (A+(B+(C+D))). Associative ⇒ same result regardless of grouping. For commutative reducers (like sum), order also doesn't matter. Common bug: reducers that look fine but aren't strictly associative under floating-point arithmetic.
Gotchas
- Mutable accumulators (like StringBuilder) need synchronization or per-worker copies
- parallelStream uses commonPool, heavy use starves other parallelStream callers
- Reduce step can be the bottleneck if maps are huge, consider tree-reduce
- Float addition is not strictly associative, parallel reduce gives slightly different results than sequential
Parallel Graph Traversal (BFS/DFS)
Section: Performance & Tradeoffs | Difficulty: Hard | Frequency: Sometimes
Parallel BFS: process each level in parallel; serial frontier-to-frontier transitions. Parallel DFS: harder due to depth-recursion; usually use task-pool of subgraphs. Concurrent visited-set required either way; ConcurrentHashMap or Bloom filter.
Key Points
- BFS parallelizes naturally per-level, all nodes at level k processed concurrently
- DFS harder, recursive nature serializes; use bounded task pool of subgraphs
- Visited set must be concurrent, CHM, sync.Map, atomic bitset
- Frontier transitions are barriers, synchronize all workers between levels
- Used in PageRank, social-graph queries, dependency resolution at scale
Interview Follow-up Questions
Why is BFS easier to parallelize than DFS?
BFS has natural levels, all nodes at distance k are processed before moving to k+1. Each level is embarrassingly parallel. DFS has recursive call structure, depth-first ordering must be preserved (somewhat), and recursion serializes naturally. Workarounds: parallel DFS uses task pool of subgraphs.
Why does visited set need to be concurrent?
Many workers may discover the same neighbor simultaneously. Without atomic add, two workers might both think they discovered it first → both enqueue → duplicate work. ConcurrentHashMap.newKeySet().add() returns true exactly once per key, natural deduplication.
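A hedged Go sketch of level-synchronous parallel BFS; names are illustrative, and a real implementation would bound the number of worker goroutines per level. `sync.Map.LoadOrStore` plays the role of `ConcurrentHashMap.newKeySet().add()`: it reports "newly added" exactly once per key, which is the deduplication.

```go
package pbfs

import "sync"

// Levels runs a level-synchronous parallel BFS from start over an
// adjacency list, returning each node's distance (-1 = unreachable).
func Levels(adj [][]int, start int) []int {
	dist := make([]int, len(adj))
	for i := range dist {
		dist[i] = -1
	}
	dist[start] = 0

	var visited sync.Map // concurrent visited set
	visited.Store(start, true)

	frontier := []int{start}
	for level := 1; len(frontier) > 0; level++ {
		var (
			mu   sync.Mutex
			next []int
			wg   sync.WaitGroup
		)
		for _, u := range frontier { // all nodes at level k processed in parallel
			wg.Add(1)
			go func(u int) {
				defer wg.Done()
				var local []int
				for _, v := range adj[u] {
					// loaded == false exactly once per key: this worker owns v's discovery.
					if _, loaded := visited.LoadOrStore(v, true); !loaded {
						local = append(local, v)
					}
				}
				mu.Lock()
				next = append(next, local...)
				mu.Unlock()
			}(u)
		}
		wg.Wait() // level barrier: everything at level k finishes before level k+1
		for _, v := range next {
			dist[v] = level
		}
		frontier = next
	}
	return dist
}
```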
When does parallel BFS NOT win?
Tree-shaped graphs (each node has one parent), frontier is small, parallelism is wasted on coordination. Sparse graphs in general. Win is biggest on dense, wide graphs (many neighbors per node, many nodes per level).
What about Bloom filter for visited set?
Useful at extreme scale (billions of nodes) where exact set is too big. Trade memory for false positives. Used by web-scale crawlers. Risk: Bloom false positive → drop a real new node. Usually pair with exact set as fallback for confirmation.
Gotchas
- If workers don't synchronize between levels, processing crosses levels and BFS becomes wrong
- Concurrent map's `add` semantics matter, use `newKeySet().add()` (returns true if newly added) for dedup
- Level barrier requires ALL workers from level k to finish before level k+1 starts, wg.Wait()
- DFS isn't level-based, naive parallelization breaks ordering
- Memory-bound when frontier is huge, bound concurrency to avoid OOM
Lock-Free Programming, CAS Loops
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Lock-free algorithms use atomic compare-and-swap (CAS) instead of mutexes. The pattern: read current value, compute new value, CAS-install, retry on failure. At least one thread always makes progress (no deadlocks, no waiting on a slow holder), but individual threads can starve under contention. Pays off when contention is high and locks would serialise.
Key Points
- Lock-free = at least one thread is always making progress (system-wide). Individual threads may retry many times.
- The CAS retry loop is the building block: load → compute → CAS → retry on failure.
- No deadlocks possible (no thread blocks on another). No priority inversion. No waiting on a slow holder.
- Costs: ABA problem in pointer-based structures, memory reclamation in non-GC languages, cache-line contention under hot CAS.
- Use when locks are measurably the bottleneck. For most code, mutexes are simpler AND faster.
Interview Follow-up Questions
Lock-free vs wait-free?
Lock-free: at least one thread makes progress at all times (system-wide). Individual threads may retry indefinitely under bad scheduling. Wait-free: every thread completes in bounded steps regardless of others. Wait-free is much harder to implement; lock-free is the practical bar for most production code.
When should I use lock-free instead of a mutex?
Once measurement shows lock contention as the bottleneck AND the critical section is small enough that CAS retry is cheaper than mutex parking. For counters, hot flags, simple atomic state, lock-free is a clear win. For more complex shared state (multi-field updates, long critical sections), locks are simpler and often faster.
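To make the retry loop concrete, here is a minimal Go sketch of the load → compute → CAS → retry pattern, as an atomic "store the maximum" helper (the name is illustrative):

```go
package casloop

import "sync/atomic"

// SetMax atomically raises cur to v if v is larger: the canonical CAS loop.
func SetMax(cur *atomic.Int64, v int64) {
	for {
		old := cur.Load() // 1. load the current value
		if v <= old {
			return // nothing to do
		}
		if cur.CompareAndSwap(old, v) { // 2. try to install the new value
			return // 3. success
		}
		// 4. CAS failed: another goroutine changed cur; loop and retry with a fresh value
	}
}
```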
What is the ABA problem?
In pointer-based lock-free structures, a CAS can succeed even though the value changed between the load and the CAS, because it was changed back to the same value. Without the original context (often a freed-and-reused node), the CAS installs garbage. Solutions: tagged pointers, hazard pointers, GC. See the ABA Problem & Tagged Pointers lesson.
Why doesn't lock-free help with multi-field updates?
CAS operates on one word at a time. If the invariant requires updating fields A and B together (head and tail of a linked list, balance and audit log), a single CAS can't do it atomically. Workarounds: pack both fields into one word (if they fit), use double-CAS (rare hardware support), or wrap them in an immutable struct and CAS the pointer. Or just use a lock.
Gotchas
- Treating 'lock-free' as 'always faster', measure first; mutexes often win for simple cases
- ABA in pointer structures, without GC, requires extra machinery
- Hot CAS contention causes cache-line ping-pong; lock-free counters can be slower than sharded
- Multi-field invariants need locks or pointer-swap; CAS only protects one word
- Lock-free does NOT mean wait-free; individual threads can starve
Memory Ordering: Acquire, Release, Relaxed, SeqCst
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
C++/Rust let the programmer specify how strict an atomic operation's memory ordering must be. SeqCst (sequentially consistent) is strongest and slowest. Acquire-release is the right default for publishing data. Relaxed gives no ordering, only atomicity (counters). Java and Go atomics are SeqCst by default; Go offers no weaker modes, and Java tucks them away behind VarHandles.
Key Points
- SeqCst: every thread sees the same global order of all SeqCst operations. Strongest, slowest.
- Acquire (on load) + Release (on store): publish-subscribe pattern. Writes before release are visible after acquire.
- Relaxed: atomic, but no ordering. Use for counters where order doesn't matter.
- Java atomics and Go's sync/atomic give SeqCst always. C++ and Rust allow per-operation choice.
- Wrong relaxation = wrong code. Acquire-release is the default if unsure; relax only after measuring AND reasoning carefully.
Interview Follow-up Questions
Why doesn't Java let me choose memory ordering?
Design choice. The Java team prioritised 'obviously correct' over 'maximally fast'. The everyday atomic APIs (AtomicInteger and friends, volatile) are sequentially consistent; weaker modes do exist, but only behind VarHandles (getAcquire, setRelease, getOpaque), not as the default API. The trade-off is some performance loss compared to hand-tuned C++; the upside is that ordinary code avoids an entire class of subtle bugs. Go made the same choice for sync/atomic, and the revised memory model published with Go 1.19 documents it as sequentially consistent.
When should I prefer acquire-release over SeqCst?
When the operation is a one-way publication or subscription. Producer releases, consumer acquires. Most lock-free patterns are this shape. SeqCst is needed when two independent atomics must agree on a global ordering (rare; mostly Dekker/Peterson-style algorithms).
Is relaxed ever wrong for counters?
Almost never wrong for the count itself. Wrong if other code reads the counter and decides something based on it relative to other state. Example: 'if hits > threshold, do X', under relaxed, hits can appear stale relative to other state. If the decision matters, use acquire on the load.
What is the cost difference between SeqCst and relaxed?
Hardware-dependent. On x86, SeqCst loads are basically free (loads are already strong); SeqCst stores require a fence (mfence or lock-prefixed). On ARM/POWER, SeqCst on either side requires fences. Relaxed is just the atomic instruction. The difference can be 2-10x for the operation, but the operation itself is nanoseconds; only matters in very hot paths.
Gotchas
- Mixing relaxed with code that depends on ordering = subtle bugs
- Acquire on a store / release on a load is invalid, and the compiler won't reliably save you: C++ treats it as undefined, Rust panics at runtime
- Java atomics are SeqCst; they can't be relaxed
- Memory ordering is ABOUT the compiler AND the CPU; both can reorder
- Reasoning by 'it works on my machine' fails on weakly-ordered ARM
The ABA Problem & Tagged Pointers
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Asked Often
ABA: thread A reads pointer P (= A); thread B pops A and pushes a new node at the same address; thread A's CAS succeeds even though the structure changed. Solutions: tagged pointers (pack version counter), hazard pointers (announce in-use), epoch-based reclamation, GC.
Key Points
- Pure CAS is unsafe when the same address can be reused (typical in C/C++ with manual memory)
- Java/Go avoid ABA via GC, freed memory not reused while references exist
- Tagged pointers: pack a version counter into unused pointer bits (low 3 bits typically)
- Hazard pointers: each thread declares pointers it's actively using; readers don't free those
- Epoch-based reclamation: defer free until all threads have moved past the freeing epoch
Interview Follow-up Questions
What's a 'hazard pointer'?
Each thread maintains a list of pointers it's currently using. Before freeing memory, the freer checks all hazard pointer lists; if any thread is using this pointer, defer the free. More memory overhead than CAS-only, but ABA-safe without tagged pointers.
What's epoch-based reclamation?
Track a global 'epoch' counter. Threads enter critical sections by reading the epoch; freed memory is tagged with its epoch. Don't actually free until ALL threads have advanced past the freeing epoch. Used by RCU in Linux kernel.
Tagged pointer, how does a counter fit in a pointer?
Most platforms use 64-bit pointers but only 48 bits of address. The high 16 bits are typically zero (or sign-extended). Pack a 16-bit version counter there. Or use the low bits, pointers to objects with alignment N have low log2(N) bits zero.
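To make the bit layout concrete, a hedged Go sketch of packing a 16-bit version tag above a 48-bit value. It uses indices into a pre-allocated node pool instead of raw pointers, which sidesteps platform pointer-layout questions while keeping the CAS-with-tag idea; all names are illustrative.

```go
package tagged

import "sync/atomic"

const addrBits = 48 // low 48 bits: node index; high 16 bits: version tag

func pack(idx uint64, tag uint16) uint64 { return uint64(tag)<<addrBits | idx }

func unpack(w uint64) (idx uint64, tag uint16) {
	return w & (1<<addrBits - 1), uint16(w >> addrBits)
}

// bumpHead swaps in a new index and increments the tag on every change,
// so a CAS against a stale (index, tag) snapshot fails even if the same
// index is reused: no ABA.
func bumpHead(head *atomic.Uint64, newIdx uint64) {
	for {
		old := head.Load()
		_, tag := unpack(old)
		if head.CompareAndSwap(old, pack(newIdx, tag+1)) {
			return
		}
	}
}
```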
Does Java really not have ABA?
ABA on object identity is impossible: the GC won't reclaim an object while any thread still references it, so an address can't silently come back as a different object. ABA on object STATE is still possible (same object reused, fields changed). AtomicStampedReference handles both, the stamp bumps on every change regardless of the reference.
Gotchas
- Reusing nodes from a pool reintroduces ABA even in GC languages
- AtomicStampedReference adds overhead, use only when ABA is a real risk
- Tagged pointers that borrow "unused" high bits break when the platform actually uses them (57-bit virtual addresses with 5-level paging, ARM pointer authentication/tagging)
- Hazard pointers add 1 cache line per thread per concurrent operation, memory cost
Treiber Stack: The Canonical Lock-Free Stack
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Lock-free stack using CAS on the head pointer. Push: build new node pointing at current head, CAS to install. Pop: read head, CAS-replace with head.next. The textbook lock-free data structure; suffers from the ABA problem and from contention on the head.
Key Points
- Push: allocate node, set node.next = head, CAS(&head, oldHead, node).
- Pop: load head, then CAS(&head, oldHead, oldHead.next). Retry on CAS failure.
- Vulnerable to ABA: between read and CAS, head can be popped, freed, and re-pushed at the same address.
- Memory reclamation is the hard part: when is it safe to free a popped node? Other threads might still be reading it.
- Single-cache-line contention on head: not great for many concurrent threads. Lock-free MPSC queues outperform for high contention.
Interview Follow-up Questions
Why is ABA a problem here?
Pop reads head=X, computes new_head = X.next, then CAS(head, X, X.next). If between the read and the CAS, another thread pops X (X is freed), pushes a new node Y, and pushes another node X' that happens to be allocated at address X, then head is X again and the CAS succeeds. But X' has a different X'.next than the X we read. We've installed garbage as the new head.
Why isn't ABA a problem in Java/Go?
GC. A node cannot be freed while any thread still holds a reference to it. The popping thread's local variable oldHead keeps it alive until that variable goes out of scope, so the address cannot be reused for a different node during the CAS and pointer-level ABA is impossible. Logical ABA is still possible if application code recycles node objects from a pool and re-pushes them with different contents.
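A minimal Treiber stack in Go (Go 1.19+ for atomic.Pointer, generics for the element type); an illustrative sketch, not a production container. The GC argument above is exactly why this version needs no tags or hazard pointers.

```go
package treiber

import "sync/atomic"

type node[T any] struct {
	val  T
	next *node[T]
}

// Stack is a Treiber stack: a lock-free LIFO with a single CAS'd head.
type Stack[T any] struct {
	head atomic.Pointer[node[T]]
}

func (s *Stack[T]) Push(v T) {
	n := &node[T]{val: v}
	for {
		old := s.head.Load()
		n.next = old
		if s.head.CompareAndSwap(old, n) {
			return
		}
	}
}

func (s *Stack[T]) Pop() (T, bool) {
	for {
		old := s.head.Load()
		if old == nil {
			var zero T
			return zero, false
		}
		// old cannot be reclaimed here: this goroutine still references it.
		if s.head.CompareAndSwap(old, old.next) {
			return old.val, true
		}
	}
}
```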
Is Treiber stack the best lock-free stack?
It's the simplest and the textbook one. For low contention it's fine. For high contention, all threads CAS the same head pointer, causing cache-line ping-pong; lock-free MPSC queues or elimination-backoff stacks scale better. For most production needs, a sync.Pool / object pool is simpler than a hand-rolled lock-free stack.
Gotchas
- ABA: in C/C++ without GC or hazard pointers, popped nodes can be reused, breaking CAS
- Memory reclamation: when is it safe to free a popped node?
- Cache-line contention on head: scales poorly with many writers
- Push allocates a node; allocator contention can dominate the lock-free win
- Lock-free does NOT mean wait-free; threads can starve indefinitely under bad scheduling
Hazard Pointers
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Memory reclamation scheme for lock-free structures in non-GC languages. Each thread publishes the pointers it's about to dereference (hazard pointers). Reclamation defers freeing a node until no thread's hazard pointer references it. Solves the 'when can I free this?' problem without a GC.
Key Points
- Each thread maintains a small fixed array of hazard pointers (typically 1-8).
- Before dereferencing a pointer obtained via shared atomic, publish it as a hazard pointer.
- On reclamation: collect all hazard pointers across threads, only free nodes not in the set.
- Bounded memory: at any moment, at most O(N × H) unlinked nodes can be stuck awaiting reclamation, where N = threads and H = hazard slots per thread.
- Alternative to epoch-based reclamation; lower latency for individual frees, more bookkeeping per access.
Interview Follow-up Questions
What problem do hazard pointers solve?
Memory reclamation in lock-free structures without GC. After a node is unlinked from a lock-free data structure, other threads might still hold pointers to it (they read it before the unlink). Freeing it immediately is unsafe (other threads will dereference garbage). Hazard pointers signal when it's finally safe.
Hazard pointers vs epoch-based reclamation?
Hazard pointers: per-thread per-access bookkeeping (slow per access, fast reclamation). Epoch-based (RCU): per-thread epoch counter (fast per access, slower reclamation). Hazard pointers free nodes sooner; epoch-based has lower per-access overhead. Both are correct; choice is a performance tuning.
How many hazard pointers do I need per thread?
Equal to the number of pointers a thread might be dereferencing simultaneously. For a Treiber stack pop: 1 (the head). For a Michael-Scott queue: 2 (head and head.next). For a more complex traversal: bounded by the depth of the access pattern. Typically 1-8 slots per thread are enough.
Why is C++ getting hazard pointers in the standard library?
Lock-free programming in C++ is rare but important (game engines, HFT, OS kernels). Memory reclamation has been the missing standard primitive. C++26 is expected to ship std::hazard_pointer based on years of work in folly, libcds, and the Concurrency TS. With it, application code can use lock-free structures without rolling its own reclamation.
Gotchas
- Publishing hazard pointer AFTER dereferencing = use-after-free
- Forgetting to clear hazard pointer = node never freed (memory leak)
- Too few hazard slots per thread = need to retire-and-restart on contention
- Hazard scan is O(threads × slots); doesn't scale to thousands of threads
- Mixing hazard pointers with epoch-based reclamation in the same structure: confusing
RCU: Read-Copy-Update
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Read-Copy-Update: writers create a new copy and atomically swap the pointer; readers see either the old or the new, never a partial state. Reads are nearly free (often just a pointer load). Writers are slower but rare. Used heavily in the Linux kernel for read-mostly data structures.
Key Points
- Reads are extremely cheap: dereference a pointer. No locks, no atomics on the read path.
- Writes: copy the data, modify, atomically swap the pointer to the new copy.
- Old copy must be reclaimed only after all readers that might be using it have moved on.
- Grace period: readers register a 'reader region'; reclamation waits for all reader regions to end.
- Best for read-mostly data: routing tables, configuration, dispatch tables. Not for hot writes.
Interview Follow-up Questions
What is the grace period?
The window between unlinking a node and freeing it. During the grace period, readers that started before the unlink might still be using the old node. The grace period ends when every reader that could have started before the unlink has finished. After that, free is safe. In the kernel, synchronize_rcu blocks until the grace period ends. In GC languages, the GC is essentially watching for the grace period continuously.
When should I use RCU vs sync.RWMutex (or RWMutex equivalent)?
RCU when reads vastly dominate writes (1000x or more) and the data is small enough to copy. Reads are essentially free (a pointer load). RWMutex has reader-counter overhead even on the read path. For small read-mostly data (config, dispatch tables, routing), RCU wins. For big data or write-heavy, RWMutex (or just Mutex) is fine.
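A minimal Go sketch of the RCU shape for read-mostly config: readers do one atomic pointer load; writers copy, mutate the copy, and CAS the pointer in. Type and function names are illustrative; the GC stands in for the grace period.

```go
package rcuconfig

import "sync/atomic"

// Config is treated as immutable once published.
type Config struct {
	RateLimit int
	Backends  []string
}

var current atomic.Pointer[Config] // readers: one pointer load, no locks

func init() { current.Store(&Config{}) }

// Get is the read path: effectively free, safe from any goroutine.
func Get() *Config { return current.Load() }

// Update is the write path: copy, modify the copy, atomically swap.
// Old snapshots stay valid for readers that already loaded them; the GC
// reclaims an old copy once no reader still holds a reference to it.
func Update(mutate func(c *Config)) {
	for {
		old := current.Load()
		copied := *old                                            // shallow copy of the struct
		copied.Backends = append([]string(nil), old.Backends...) // deep-copy the slice
		mutate(&copied)
		if current.CompareAndSwap(old, &copied) {
			return
		}
		// another writer won the race; retry against the new snapshot
	}
}
```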
Can I update a single field without copying?
Not safely. The whole point is that readers see a consistent snapshot. Mutating the existing copy lets a reader see the half-updated state. Always copy + atomic swap. For very large data structures where copying is expensive, look at sharded RCU (split into independent regions, RCU each separately).
Why is RCU so popular in the Linux kernel?
The kernel has many read-mostly data structures (process lists, routing tables, file descriptor tables, network namespaces) accessed by every system call. Lock-based access would serialise; RCU makes those reads free. The kernel team has invested years in RCU infrastructure (synchronize_rcu, call_rcu, srcu, sleepable RCU); the result is a major contributor to multi-core kernel scalability.
Gotchas
- Mutating in place defeats RCU; copy + swap is required
- Writers must wait for the grace period before freeing the old copy (in non-GC code)
- RCU is read-mostly; for write-heavy data, the copy cost dominates
- Multi-step updates (modify two related structures consistently) need extra coordination
- Long reader regions delay reclamation; in the kernel, this can pin memory
Wait-Free vs Lock-Free vs Obstruction-Free
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Three progress guarantees. Wait-free: every thread completes in bounded steps regardless of others. Lock-free: at least one thread makes progress at any time (system-wide progress). Obstruction-free: a thread makes progress if it runs alone for long enough. Strict ordering: wait-free > lock-free > obstruction-free > blocking.
Key Points
- Wait-free: hardest. Bounded per-thread steps. No starvation possible.
- Lock-free: at least one thread makes progress always. Individual threads may starve.
- Obstruction-free: a thread makes progress if no other thread interferes. Helped by exponential backoff.
- Blocking (lock-based): a thread holding a lock can pause everyone else indefinitely.
- Most production 'lock-free' code is actually lock-free, not wait-free. Wait-free is a research goal; lock-free is the practical bar.
Interview Follow-up Questions
Why is wait-free so much harder than lock-free?
Wait-free requires bounded per-thread steps under any scheduling. That means a failed CAS can't just retry; a bounded way to make progress is required. The standard trick is 'helping': a thread completes the operation that interfered with it, so on the next iteration the conflicting state is gone. The bookkeeping is significant: operation descriptors, per-thread state arrays, generation counters. Few production systems pay the cost.
Is lock-free always faster than locks?
No. Under low contention, a well-implemented mutex is often faster than CAS-retry loops (less branching, simpler instructions). Lock-free wins under high contention or when bounded progress is required (no priority inversion, no waiting on a slow lock holder). Always benchmark; don't assume lock-free = faster.
What is helping?
A wait-free protocol where threads that find another thread's incomplete operation finish it on the original thread's behalf. The original thread, when it next checks, sees its op complete. Helping prevents starvation: a slow thread can have its work completed by faster threads. Significant complexity; used in academic wait-free algorithms, rare in production.
When would I actually want wait-free?
Hard real-time systems where no thread can be allowed to wait indefinitely. Avionics, automotive, medical devices. Some signal-handler contexts where blocking is forbidden. Most application code does not need wait-free; lock-free is enough.
Gotchas
- 'Lock-free' marketing often means 'doesn't block on a mutex' rather than the formal definition
- Lock-free does NOT prevent individual thread starvation
- Wait-free algorithms have bad constant factors; rarely worth it in practice
- Obstruction-free without backoff = livelock under contention
- Blocking algorithms inside a lock-free data structure (e.g., malloc) defeat the guarantee
Spinlocks: When Busy-Waiting Wins
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
A spinlock is a mutex where the waiter spins in a tight CAS loop instead of going to sleep. Faster than mutex when the critical section is shorter than a context switch (~1µs). Awful when the critical section is long. Used inside OS kernels, sometimes in HFT systems; almost never appropriate in application code.
Key Points
- Spin in a CAS loop instead of parking the thread. No context switch cost.
- Wins when critical section < ~1µs (context switch cost) and there are spare cores.
- Loses badly when critical section is long: spinning threads burn 100% CPU doing nothing.
- Adaptive mutexes (Linux futex, Java synchronized): spin briefly, then park. Usually the right default.
- Spinlocks should NOT be used in user code without guaranteeing very short critical sections AND a spare core.
Interview Follow-up Questions
When do spinlocks actually win?
Three conditions must all hold: (1) critical section is shorter than a context switch (~1µs on Linux); (2) spare cores exist so spinning doesn't starve other work; (3) the lock holder won't be preempted (kernel context, RT thread, dedicated CPU). Most application code has none of these. Kernel code, lock-free data structure internals, and HFT systems sometimes do.
What's the difference between a spinlock and a busy-wait?
Mostly terminology. A spinlock is a busy-wait used to implement mutual exclusion. Busy-wait is the general technique of spinning until some condition is true. Both burn CPU; both win or lose under the same conditions.
Why does the PAUSE instruction matter?
Two things. It hints to the CPU that this is a spin loop, so the pipeline can deprioritise it (saves power, doesn't hurt a co-running hyperthread). And it adds a small delay (10-100 cycles), reducing the rate of CAS attempts and thus cache-line traffic. On x86, adding PAUSE to a spin loop costs essentially nothing and only helps; leaving it out is a small but real bug.
When does TTAS (test-test-and-set) help?
Under contention. With plain TAS, every spinning thread's CAS bounces the cache line; the lock holder can barely write back to release. With TTAS, spinners just read (which can keep the line in shared state across multiple cores); the actual CAS only happens when the flag flips. Drastic reduction in cache traffic, much faster under contention.
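A hedged Go sketch of a TTAS spinlock, for illustration only (application Go code should use sync.Mutex). Go exposes no PAUSE intrinsic, so runtime.Gosched stands in for the backoff; the test-then-CAS structure is the point.

```go
package spin

import (
	"runtime"
	"sync/atomic"
)

// TTASLock spins on a plain read until the flag looks free, and only
// then attempts the CAS. Spinning readers keep the cache line in shared
// state instead of bouncing it between cores.
type TTASLock struct {
	held atomic.Bool
}

func (l *TTASLock) Lock() {
	for {
		// Test: cheap read; no cache-line invalidation while spinning.
		for l.held.Load() {
			runtime.Gosched() // stand-in for PAUSE: yield instead of burning CPU
		}
		// Test-and-set: only now try to take the lock.
		if l.held.CompareAndSwap(false, true) {
			return
		}
	}
}

func (l *TTASLock) Unlock() {
	l.held.Store(false)
}
```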
Gotchas
- Spinlock with a long critical section = wasted CPU and bad latency
- Spinning waiter preventing the lock holder from being scheduled = livelock-like behaviour
- Plain TAS spinlock causes cache-line ping-pong; use TTAS
- Forgetting PAUSE on x86 wastes power and hurts co-running hyperthreads
- Userspace spinlocks where the OS can preempt the holder are almost always wrong
Lock-Free MPSC Queue
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Sometimes
A multi-producer, single-consumer queue using CAS on linked-list pointers. Producers append at tail; consumer pops from head. No locks. The Michael-Scott (1996) design is the textbook impl. Used in Go's runtime, Linux kernel, low-latency trading.
Key Points
- Linked list with atomic head and tail pointers
- enqueue: CAS to append; on failure, retry with new tail
- dequeue: CAS to advance head; on failure, retry
- ABA-prone, needs hazard pointers, GC (Java), or tagged pointers
- Note: the Michael-Scott algorithm shown below is actually MPMC; a true MPSC variant simplifies the consumer side (no CAS on head)
Interview Follow-up Questions
When is lock-free actually faster?
Under high contention with short critical sections. Lock-free avoids lock-acquire latency (~50ns) and immune to priority inversion. But: harder to write, ABA-prone in C/C++, often slower under low contention. Benchmark before assuming.
How does Java sidestep ABA?
GC. A popped node cannot be reclaimed, and its address cannot be reused, while any thread's CAS still references it; the GC keeps it alive until no references remain. In C/C++, hazard pointers, epoch-based reclamation, or tagged pointers are required.
MPSC vs MPMC, what's harder?
MPMC harder. MPSC has only one consumer, so head pointer contention is gone. MPMC needs CAS on both head and tail, plus more careful ABA handling. The Michael-Scott algorithm IS MPMC; the MPSC variant simplifies by trusting one consumer.
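Because the single consumer lets the pop path skip CAS entirely, here is a hedged Go sketch in the style of Dmitry Vyukov's intrusive MPSC queue rather than Michael-Scott (names illustrative; it relies on Go's GC for reclamation).

```go
package mpsc

import "sync/atomic"

type node struct {
	next atomic.Pointer[node]
	val  any
}

// Queue: producers atomically swap the head (the "in" end); the single
// consumer walks next pointers from tail without any CAS.
type Queue struct {
	head atomic.Pointer[node] // producers swap here
	tail *node                // owned by the single consumer
}

func New() *Queue {
	stub := &node{}
	q := &Queue{tail: stub}
	q.head.Store(stub)
	return q
}

// Push is safe from any number of goroutines.
func (q *Queue) Push(v any) {
	n := &node{val: v}
	prev := q.head.Swap(n) // claim our place with one atomic swap (no retry loop)
	prev.next.Store(n)     // publish: link the previous node to us
}

// Pop must be called from exactly one goroutine.
func (q *Queue) Pop() (any, bool) {
	next := q.tail.next.Load()
	if next == nil {
		// Empty, or a producer is between Swap and Store: treat as empty for now.
		return nil, false
	}
	q.tail = next
	v := next.val
	next.val = nil // let the GC reclaim the payload promptly
	return v, true
}
```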
Should I use this in production Java?
Almost never. `LinkedBlockingQueue`, `ConcurrentLinkedQueue`, or `Disruptor` (LMAX) are proven. Roll a custom one only after measurement shows them as the bottleneck, and even then, prefer Disruptor for HFT-grade work.
Gotchas
- ABA in C/C++: freed node reused at same address, CAS thinks unchanged
- Memory ordering: need acquire/release semantics or full fences
- Tail can lag, dequeue must help advance tail before returning empty
- Cache line contention: head and tail on same cache line → false sharing; pad to 64 bytes
LMAX Disruptor
Section: Lock-Free Programming | Difficulty: Expert | Frequency: Sometimes
A pre-allocated ring buffer plus per-consumer sequence cursors. Producers claim slots with CAS; consumers wait until the producer cursor advances past their position. Slots are reused (no per-message allocation), data lives in cache lines under explicit control (no false sharing), and the wait strategy is pluggable (busy spin, yield, block). Built by LMAX for trading, hits 25M+ messages per second per pipeline.
Key Points
- Ring buffer of fixed size (must be a power of two so index = sequence & (size - 1) is one AND instruction).
- Slots are pre-allocated objects that get reused; producers fill in fields, consumers read them. Zero per-message allocation.
- Producer cursor: an AtomicLong (cache-line padded) holding the highest published sequence. Consumers spin until their target sequence <= producer cursor.
- Consumer cursor: each consumer has its own cursor. Producers wait if the slowest consumer is too far behind (overwrites would lose data).
- Wait strategies are pluggable: BusySpinWaitStrategy (lowest latency, burns CPU), YieldingWaitStrategy (spin then yield), BlockingWaitStrategy (park on a condition).
- False sharing is the silent killer; every cursor is padded to occupy its own cache line.
- Single-producer mode skips the producer-side CAS and is much faster than multi-producer mode.
Interview Follow-up Questions
Why is a ring buffer so much faster than a BlockingQueue?
Three reasons. (1) Pre-allocation: slots are reused, no GC pressure. BlockingQueue allocates a node per put. (2) No locks: producers and consumers coordinate through atomic cursor advances; BlockingQueue takes a lock on every put and take. (3) Cache friendliness: the ring is contiguous memory; the next slot is nearly always in L1. A linked-list-backed queue scatters nodes across the heap. The trio together gives 10-100x throughput on hot paths.
Why power of two for the ring size?
Index = sequence & (size - 1) instead of sequence % size. AND is a single cycle on every CPU; modulo is several. At 25M messages per second this matters. The cost: ring sizes are restricted to 1024, 2048, 4096, etc. Easy to live with.
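Not the Disruptor itself, but a minimal single-producer/single-consumer ring in Go that shows two of its tricks: power-of-two masking for the index and padding to keep the producer and consumer cursors on separate cache lines. Names and padding sizes are illustrative.

```go
package ring

import "sync/atomic"

// SPSC is a single-producer single-consumer ring buffer.
type SPSC struct {
	buf  []any
	mask uint64
	_    [48]byte      // keep the cursors off the cache line holding buf/mask
	head atomic.Uint64 // next sequence the consumer will read
	_    [56]byte
	tail atomic.Uint64 // next sequence the producer will write
	_    [56]byte
}

// New panics unless size is a power of two.
func New(size uint64) *SPSC {
	if size == 0 || size&(size-1) != 0 {
		panic("size must be a power of two")
	}
	return &SPSC{buf: make([]any, size), mask: size - 1}
}

func (r *SPSC) Push(v any) bool {
	t := r.tail.Load()
	if t-r.head.Load() == uint64(len(r.buf)) {
		return false // full: backpressure on the producer
	}
	r.buf[t&r.mask] = v // one AND instruction instead of a modulo
	r.tail.Store(t + 1) // publish the slot
	return true
}

func (r *SPSC) Pop() (any, bool) {
	h := r.head.Load()
	if h == r.tail.Load() {
		return nil, false // empty
	}
	v := r.buf[h&r.mask]
	r.head.Store(h + 1) // release the slot back to the producer
	return v, true
}
```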
What happens when the slowest consumer falls too far behind?
Producers stall. A producer about to claim sequence N must check that the slowest consumer has acknowledged sequence (N - SIZE), or it would overwrite a slot still being read. Real systems either size the ring large enough that this never happens, or accept backpressure on the producer. There's no equivalent of 'drop oldest' built in; it has to be implemented as a separate consumer that drains.
Single-producer vs multi-producer, how big is the difference?
Big. Single-producer skips the CAS on claim and the availability tracking on publish. Numbers from the LMAX paper: single-producer hits 25M ops/sec on a single thread; multi-producer with 4 threads hits about 100M total but with much higher variance. If the workload can funnel through one producer (one network socket, one input thread), do it.
When is the Disruptor the wrong choice?
Whenever pre-allocation doesn't fit the message model (variable-size events that can't be pooled, references to large heap objects), whenever throughput isn't the bottleneck (most CRUD apps), or whenever richer queue semantics are required (priority, deadlines, conditional retrieval). For request/response with low latency, an unbounded BlockingQueue or a Java 21 virtual-thread-per-request model is simpler and fast enough.
Gotchas
- Forgetting padding. The single most common mistake: a 'simple ring buffer' loses 80% of its throughput to false sharing between cursors.
- Using Object[] of allocated objects, not pre-allocated slot fields. Defeats the no-allocation property.
- Multi-producer mode without availability tracking: consumers will read uninitialized slots.
- Wait strategies that burn CPU in production. BusySpinWaitStrategy is for dedicated cores; on shared boxes it starves other work.
- Treating the ring as a dynamically resizable queue. The size is fixed at construction; pick it based on max in-flight, not average.
Sharded HashMap
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Asked Often
Naive synchronized HashMap serializes every operation. The sharded version splits into N independent maps, each with its own lock: hash key → shard index → operate on that shard's map. Java's pre-8 ConcurrentHashMap did exactly this with 16 lock-striped segments; Java 8+ locks per bin, which is even finer. Reads proceed in parallel; writes serialize only within a shard.
Key Points
- N independent shards, each with own lock + map
- Hash key → shard index = key.hashCode() % N (better: use bit masking with power-of-2 N)
- Concurrent reads/writes on different shards proceed in parallel
- Resize: per-shard, or global (rare); ConcurrentHashMap uses tree-based bins instead
- Java 8+ ConcurrentHashMap uses CAS + per-bin synchronized, even finer than striping
Interview Follow-up Questions
How many shards is right?
Java's ConcurrentHashMap defaults to 16. Rough rule: at least N where N is the expected concurrent thread count. More shards = more parallelism but more memory and worse cache locality. Power-of-2 enables bit masking instead of modulo (faster).
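A hedged Go sketch of the same striping idea (Go has no ConcurrentHashMap, so a sharded map with one RWMutex per shard is the usual equivalent). The shard count and names are illustrative; maphash's random seed also addresses the collision-attack gotcha below.

```go
package shardedmap

import (
	"hash/maphash"
	"sync"
)

const numShards = 16 // power of two so the shard index is a mask, not a modulo

type shard struct {
	mu sync.RWMutex
	m  map[string]int
}

// Map is a sharded map[string]int: operations on keys that hash to
// different shards never contend on the same lock.
type Map struct {
	seed   maphash.Seed
	shards [numShards]shard
}

func New() *Map {
	sm := &Map{seed: maphash.MakeSeed()} // random seed resists collision attacks
	for i := range sm.shards {
		sm.shards[i].m = make(map[string]int)
	}
	return sm
}

func (sm *Map) shardFor(key string) *shard {
	h := maphash.String(sm.seed, key)
	return &sm.shards[h&(numShards-1)]
}

func (sm *Map) Get(key string) (int, bool) {
	s := sm.shardFor(key)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

func (sm *Map) Set(key string, v int) {
	s := sm.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = v
}
```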
Why is ConcurrentHashMap better than basic sharding?
Java 8+ uses CAS + per-bin synchronized. Each bin (linked list of entries) has its own lock; resize uses lock-free coordination. Effectively per-bin striping, way finer than 16 shards. Plus: tree-based bins for collision handling. State of the art for in-memory hash maps.
When does sharding NOT help?
Single hot key gets hammered, all writes go to same shard, that shard's lock is the bottleneck. Solution: don't share state; replicate per-thread/region. Or use an atomic counter or CAS-based primitive for that one key.
Gotchas
- Bad hash function → uneven shard distribution → some shards always hot
- Iterating across shards is non-atomic, snapshot is inconsistent
- size() across shards requires summing each, non-atomic without freeze
- Resize is per-shard, not global, capacity can drift unevenly
- Don't shard user-provided keys with a weak, predictable hash mod shardCount, hash-collision attacks can funnel every key into one shard
Concurrent Bloom Filter (CAS-Based)
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Sometimes
Bloom filter: probabilistic set membership with no false negatives, configurable false positives. Concurrent version: bits stored in atomic longs; insert sets bits via OR atomically; query reads atomically. Naturally lock-free because OR is monotonic.
Key Points
- Bloom filter: array of bits + N hash functions; insert sets N bits, query checks all N
- False positive: all N bits happen to be set by other entries (configurable rate)
- False negative: never (if all N bits are 0, definitely not in set)
- Concurrent: atomic OR, naturally lock-free, monotonic (bits only go 0→1)
- Cannot delete (would require resetting bits, but other entries may use them)
Interview Follow-up Questions
Why is Bloom filter naturally lock-free?
Inserts only flip bits 0→1, monotonic. CAS retry just merges OR operations; intermediate states are valid. No reordering hazards. Compare to hash table inserts which involve linked-list manipulation, much harder to make lock-free.
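A minimal concurrent Bloom filter sketch in Go: bits live in atomic words, Add ORs bits in via a CAS loop, Contains is plain atomic loads. The names, the fixed three hash functions, and the power-of-two bit count are illustrative choices.

```go
package cbloom

import (
	"hash/maphash"
	"sync/atomic"
)

// Filter is a concurrent Bloom filter. Bits only go 0→1, so any
// interleaving of concurrent inserts leaves the filter in a valid state.
type Filter struct {
	words []atomic.Uint64
	mask  uint64 // bit count - 1
	seeds [3]maphash.Seed
}

// New builds a filter with the given bit count; bits must be a power of
// two (and at least 64) so positions can be masked instead of reduced modulo.
func New(bits uint64) *Filter {
	f := &Filter{words: make([]atomic.Uint64, bits/64), mask: bits - 1}
	for i := range f.seeds {
		f.seeds[i] = maphash.MakeSeed()
	}
	return f
}

func (f *Filter) setBit(pos uint64) {
	w, bit := &f.words[pos/64], uint64(1)<<(pos%64)
	for {
		old := w.Load()
		if old&bit != 0 || w.CompareAndSwap(old, old|bit) {
			return // bit was already set, or we just set it
		}
	}
}

func (f *Filter) Add(key string) {
	for _, seed := range f.seeds {
		f.setBit(maphash.String(seed, key) & f.mask)
	}
}

// Contains may return false positives, never false negatives.
func (f *Filter) Contains(key string) bool {
	for _, seed := range f.seeds {
		pos := maphash.String(seed, key) & f.mask
		if f.words[pos/64].Load()&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}
```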
Counting Bloom filter for delete?
Replace each bit with a small counter (4 bits, so per-cell add/subtract). Add increments; delete decrements. Concurrency: atomic add/sub on counter, still lock-free but more memory.
Where's this used in production?
Cassandra, BigTable, RocksDB use Bloom filters to skip SST file lookups for non-existent keys. Web crawlers track visited URLs. Browsers use them for malicious-URL screening. CDNs for cache-miss hints.
What about false positive rate at scale?
Pre-size for the expected entries. Insertion beyond capacity inflates FP rate. Production systems either: (a) overprovision, (b) rotate filters, or (c) use Cuckoo filter (counting BF replacement that supports delete cleanly).
Gotchas
- False positive rate degrades as filter fills, pre-size for expected load
- Hash-function quality matters, bad hashes → uneven bit distribution → worse FP rate
- Cannot delete, once a bit is set it can't be safely cleared
- Multiple Bloom filters with different hash seeds aren't interoperable
- Use a counting Bloom filter or Cuckoo filter when delete is required
Concurrent Priority Queue (Skiplist-Based)
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Sometimes
Lock-free priority queue is hard, heap-based requires global rebalance. Skiplist-based works: probabilistic levels, ordered by priority, CAS-based insert/delete. Java's ConcurrentSkipListMap is the production-grade impl. Use it; don't roll a custom one.
Key Points
- Heap-based PQ: hard to make lock-free (global rebalancing on every op)
- Skiplist: probabilistic balanced structure, naturally lock-free
- Insert: choose random level, CAS pointers in at each level
- Java's ConcurrentSkipListMap is the canonical implementation
- PriorityBlockingQueue (Java) is heap-based with a single lock, fine for moderate scale
Interview Follow-up Questions
Why is heap-based PQ hard to make lock-free?
Heap operations percolate values up/down through the tree, modifying many nodes. CAS on individual nodes isn't enough, the heap invariant spans the structure. Lock-free heap papers exist but are complex; production systems use skiplist instead.
Skiplist vs B-tree for ordered concurrent map?
Skiplist easier to make lock-free (probabilistic levels = local CAS works). B-tree better cache locality but harder to make lock-free. Java picked skiplist for ConcurrentSkipListMap; many DB indexes use B-tree with locks.
When does PriorityBlockingQueue bottleneck?
At the single lock. Heap rebalancing is O(log n) under the lock; with many concurrent producers, lock contention dominates. Switching to ConcurrentSkipListMap-based design (or per-priority queues) helps.
How does this compare to Disruptor?
Disruptor is a single-producer/single-consumer (or MPMC) ring buffer with NO priority, strict FIFO. Priority semantics need a different structure. They solve different problems.
Gotchas
- PriorityBlockingQueue.iterator() is NOT in priority order, only poll() is
- ConcurrentSkipListMap.firstEntry() is a snapshot, value may have been polled by another thread by the time it's used
- Skiplist memory overhead: ~4x of equivalent array (multi-level pointers per node)
- size() on ConcurrentSkipListMap is O(n); don't call in hot paths
Concurrent Trie (RCU-Style Read Path)
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Rare
Concurrent trie: prefix-indexed tree where reads traverse a stable snapshot (no locks) and writes copy-then-swap nodes (RCU style). Lock-free reads at the cost of some write overhead. Used in Linux routing tables, CHAMP-based immutable collections (e.g. Scala 2.13's HashMap), and persistent data structures generally.
Key Points
- Trie = tree indexed by string prefix; common in routing, autocomplete, dictionaries
- RCU (read-copy-update): readers see consistent snapshots; writers copy + atomic swap
- Lock-free reads: traverse atomic-pointer fields; never block
- Writers serialize on root pointer CAS, but contention only on the changed branch
- Memory: temporary copies on update (free old when no readers reference)
Interview Follow-up Questions
Why path-copy instead of in-place mutation?
In-place would race with concurrent readers, they could traverse partially-modified state. Path-copy guarantees readers see the OLD tree until the root CAS commits the NEW tree atomically. Like Git's commit model.
How is memory freed?
Old nodes are reachable by any reader that started before the swap. Garbage collector cleans up when no reference remains (Java/Go automatic). C/C++ needs hazard pointers or epoch-based reclamation, much harder.
When is this preferred over a HashMap?
Prefix queries: 'all words starting with X', trie traverses to that prefix and enumerates. HashMap can't. Used heavily in autocomplete, IP routing tables (longest-prefix match), file systems.
Where does Linux use RCU-like patterns?
Routing tables, file system dentry cache, network device list. Reads are extremely common (every packet); writes are rare (config change). RCU-style copy-on-update gives lock-free reads at the cost of write overhead.
Detecting Race Conditions in Production
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
Race conditions almost never reproduce locally. Catch them with built-in race detectors (go run -race, ThreadSanitizer, jcstress) integrated into CI, plus production monitoring for symptoms (counter mismatches, sporadic test failures, "impossible" log states).
Key Points
- Race conditions hide in unit tests, they need real load and timing
- go run -race is built into Go, use it in CI; 5-10x runtime cost
- Java has no built-in race detector; use jcstress for stress tests, JFR for production tracing
- Python has the GIL but still has races, no native detector; rely on stress tests and review
- Symptoms in production: sporadic test failures, counter mismatches, 'impossible' log states
- Always test on multiple architectures, x86 hides races that ARM exposes
Interview Follow-up Questions
Why don't unit tests catch race conditions?
Most unit tests run sequentially with predictable timing. Races need: (a) enough concurrency to expose interleaving, (b) enough trials to hit the unlucky timing, and (c) memory ordering scenarios that depend on hardware. Use stress tests that run the suspect code N×100 times under load.
Why does the Go race detector slow code down?
It instruments every memory access, load, store, atomic, and tracks happens-before relationships in a vector clock. The 5-10× slowdown is the cost of building and querying that data structure. Negligible for tests; too much for production.
Can I run go test -race in production?
Generally no. The slowdown is too much for production traffic. But for canaries or load-test environments? Sometimes yes, race-detected production runs catch what the test suite missed. Just be aware of the perf hit.
What does jcstress catch that go run -race doesn't?
go run -race catches accesses with no happens-before edge (data races). jcstress goes further, it demonstrates that specific Java Memory Model outcomes are observable, like (0,0) in the reordering example. Useful for verifying lock-free designs and understanding what JMM allows.
How to find a race in Python production?
It's hard. The tools: (1) log invariants, when invariants fail, the log line is evidence. (2) py-spy dump on the running process to see thread states. (3) faulthandler for crashing the process with full traceback on hangs. (4) Stress tests that exercise concurrency in CI. (5) Code review focused on every shared mutable variable.
Gotchas
- go run -race only catches races that ACTUALLY OCCUR during the run, coverage matters
- Race detector doesn't catch logical races (check-then-act with atomics)
- Python's GIL makes single bytecodes atomic, which hides many races; thread switches between bytecodes still expose compound read-modify-write races
- Tests passing 1000 times in a row doesn't prove no race, the unlucky 1001st time may fail in production
- x86 strong memory model hides races that fire on ARM, test on both architectures
- Sporadic test failures dismissed as 'flaky' are often races, investigate, don't retry
Race Detection in CI
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
Run race detectors as part of CI. Go: -race. C++/Rust: ThreadSanitizer (TSan). Java: jcstress for memory model assumptions, plus quality concurrent tests. The detectors find concurrent unprotected access regardless of platform; they catch the bugs hidden by x86's strong ordering before they hit ARM in production.
Key Points
- Go: go test -race covers most data races. Run in CI on every PR.
- C++/Rust: ThreadSanitizer (TSan). Compile with -fsanitize=thread, run tests.
- Java: jcstress for memory-model bugs (lock-free code). Standard JUnit for higher-level concurrency.
- Race detectors slow execution 2-10x and use more memory; run in CI, not production.
- Stress-test concurrent code with high goroutine/thread counts; race detector + load surfaces real bugs.
Interview Follow-up Questions
Should the race detector be on for production?
No. It slows execution 2-10x and uses 5-10x more memory. CI is where it belongs: every PR, every test. Production runs the un-instrumented binary. The race detector's job is to catch bugs before they ship; once shipped, production should never see a race.
How does Go's race detector compare to TSan?
Same theory (happens-before vector clocks), different implementations. Go's is built into the toolchain, no extra setup. TSan needs Clang/GCC with -fsanitize=thread. Both have very low false positive rates: if they report a race, there's a race.
Does the race detector catch all bugs?
No. It catches data races (concurrent unprotected access). Doesn't catch deadlocks, livelocks, missing notifications, lost wakeups, atomicity violations on multi-step operations. Also doesn't catch bugs that depend on specific scheduling that didn't happen during the test run. Useful, not sufficient.
What about Rust?
Rust's type system catches most data races at compile time (Send/Sync traits). TSan still useful for unsafe code or for FFI. The Rust compiler + clippy + TSan in CI is a strong combination.
Gotchas
- Not running -race / TSan in CI = bugs hide until ARM production
- Tests that don't actually exercise concurrency: race detector reports nothing
- Long test suites with -race timeout because of 10x slowdown; tune timeouts
- Race-free does not mean correct; algorithmic concurrency bugs (deadlock, livelock) still possible
- Skipping race detection on 'performance' tests: those are exactly where races hide
Profiling Concurrent Code
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
On-CPU profilers (perf, async-profiler, py-spy) show where threads spend CPU. Off-CPU profilers show where threads wait (locks, I/O). Concurrent code needs both: 'CPU is fine but throughput won't go higher' means off-CPU work, not on-CPU. Lock profilers and goroutine analyses are the specific tools.
Key Points
- On-CPU profile: where threads spend CPU. Use for CPU-bound bottlenecks.
- Off-CPU profile: where threads wait (block, sleep, park). Use for concurrency bottlenecks.
- Lock profilers: per-lock contention time. Async-profiler 'lock' mode, Go's mutex profiler.
- Thread/goroutine dumps: snapshot of every thread's stack. Use for hangs and deadlocks.
- Tracing (OpenTelemetry, runtime/trace): per-request timeline; useful for tail latency.
Interview Follow-up Questions
How to tell if the problem is CPU or contention?
Look at CPU utilisation under load. If CPU is at 100% and throughput plateaus, the workload is CPU-bound; profile on-CPU. If CPU is well below 100% but throughput plateaus, threads are blocked on something; profile off-CPU (lock, mutex, block profile). Both can be true; profile both.
What's the difference between CPU and wall profiling?
CPU profile only counts time the thread was running on a CPU. Wall profile counts everything including waits. For pure CPU bottlenecks, CPU is more focused. For 'why is this request slow', wall is the right pick (it shows the I/O waits, lock waits, GC pauses).
When to use tracing instead of profiling?
Profiling is statistical: aggregates over time, shows what's typical. Tracing is per-request: shows exactly what happened for one specific request. For tail latency ('the p99 is bad'), tracing wins (it surfaces what's different about slow requests). For throughput ('the average is too slow'), profiling wins (it shows where time goes in steady state).
How much overhead does profiling add?
On-CPU sampling at 100Hz: roughly 1-5%. Lock profiling: depends on event rate, can be 10%+. Tracing every request: 5-20%. For production profiling, use sampling: profile a fraction of requests / a percentage of the time. async-profiler at default rate is generally safe in production.
Gotchas
- Profiling only on-CPU when the bottleneck is contention: misses the problem
- Single thread dump is a snapshot; take 3-4 separated by seconds for hang diagnosis
- Profiling overhead can itself cause symptoms; use sampling at production scale
- Reading flame graphs without normalising by sample count: large boxes might be common, not slow
- Conflating tracing with profiling: tracing reports on specific requests, profiling on aggregate behaviour
Diagnosing Deadlocks and Hangs
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
When a service hangs, dump every thread/goroutine's stack trace. The pattern of who's waiting on what reveals whether it's a deadlock (cyclic wait), a livelock (busy spinning), or starvation (one thread never scheduled). Tools, jstack, py-spy, go pprof, kill -3.
Key Points
- Step 1 of any hang: dump every thread/goroutine's stack, without it, the diagnosis is guesswork
- Java: `jstack <pid>` or `kill -3 <pid>`; output shows 'Found X deadlocked threads'
- Go: send SIGQUIT or use /debug/pprof/goroutine?debug=2, full stack of every goroutine
- Python: `py-spy dump --pid <pid>` for sync code; asyncio.all_tasks() for async
- Deadlock pattern: thread A blocked on lock held by B, B blocked on lock held by A
- Hang ≠ deadlock, could also be infinite loop, blocked syscall, network timeout, or starvation
Interview Follow-up Questions
What's the very first thing to do when a Java service hangs?
`jstack <pid>` (or `kill -3 <pid>` if the process is in the foreground). The dump tells you within 30 seconds whether you have a deadlock (explicitly labeled), lock contention (many BLOCKED threads on one monitor), or something else (waiting on I/O, infinite loop, GC pause).
Service is at 100% CPU but hung. Deadlock?
Probably not, deadlocked threads are blocked, not running. 100% CPU + no progress = livelock (busy retry loops) or infinite loop. Take a CPU profile (`go tool pprof`, `async-profiler`, `py-spy record`) to find what's hot.
How does one tell deadlock from starvation?
Thread dumps. Deadlock: a closed cycle of threads each waiting on a lock held by the next. Starvation: ONE thread is never scheduled while OTHERS make progress. Both look like 'something is stuck' from outside; the dump distinguishes them.
Should pprof be exposed in production?
Yes, but on a private port (localhost or VPC-internal), never on a public port. The /debug/pprof endpoints can leak request data and let an attacker DOS the service. Standard practice: `localhost:6060` and access via SSH or admin VPN.
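A minimal Go sketch of the private-port pattern: the blank import registers the /debug/pprof handlers on the default mux, and binding to localhost keeps them off the public interface.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve pprof on loopback only; reach it via an SSH tunnel or admin VPN.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... rest of the service ...
	select {}
}
```

From there, `curl localhost:6060/debug/pprof/goroutine?debug=2` (over the tunnel) returns the full goroutine dump described above.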
How to debug a deadlock that won't reproduce?
Add a watchdog: every 30s, check for deadlock (Java's findDeadlockedThreads, or in any language, time-out individual operations and dump on timeout). Log full thread states when triggered. Eventually the deadlock fires; the evidence is waiting the next morning.
Gotchas
- kill -3 in production sends SIGQUIT, the Go runtime PRINTS THE DUMP AND EXITS the process. Use SIGUSR1/2 or a pprof endpoint instead.
- Java's `jstack` requires the same JVM version as the target process, and same UID; usually requires sudo or being the service user
- py-spy needs ptrace permissions; on hardened Linux, set CAP_SYS_PTRACE or run as root
- asyncio hangs don't show up as 'thread blocked', only one thread, looks idle. Use asyncio.all_tasks() instead.
- Goroutine dump alone doesn't show channel state, the dump shows 'chan receive' but not 'who has the channel'. Combine with code review.
- Hung threads waiting on a network socket look like deadlock, verify with strace/tcpdump before assuming
Bug Hunt: Find the Race in This Counter
Section: Debugging Concurrency | Difficulty: Basic | Frequency: Asked Often
The classic compound-update race. The counter looks atomic but isn't, read, increment, write are three separate steps that can interleave. Fix with an atomic primitive or a lock around the read-modify-write.
Key Points
- Compound operations (++ , += , x = !x) are NEVER atomic without explicit synchronization
- Two threads can both read 5, both compute 6, both write 6, losing one increment
- Tests pass on dev laptops; the race fires under production load
- Three correct fixes, atomic primitive (best), lock, or per-thread accumulator with merge
Interview Follow-up Questions
Why isn't `volatile` (Java) / a separate atomic Load and Store (Go) enough?
Those make individual reads and writes atomic and visible. But ++ requires THREE operations, a read, a compute, and a write. Even with each operation atomic, two threads can interleave between them. The fix is an atomic compound operation (incrementAndGet, atomic.Add) or a lock.
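The same bug and two of the fixes as a minimal Go sketch (type names illustrative):

```go
package counter

import (
	"sync"
	"sync/atomic"
)

// Racy: counter++ compiles to load, add, store. Two goroutines can both
// load 5 and both store 6, losing an increment; the race detector flags
// it when Inc is called concurrently.
type Racy struct{ n int64 }

func (c *Racy) Inc() { c.n++ }

// Fixed: one atomic read-modify-write.
type Atomic struct{ n atomic.Int64 }

func (c *Atomic) Inc() { c.n.Add(1) }

// Also fixed: a lock around the whole read-modify-write.
type Locked struct {
	mu sync.Mutex
	n  int64
}

func (c *Locked) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}
```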
Why does this 'work' on a dev machine?
Light load + few threads + x86's strong memory model often hide races. Production load fires them. Move to ARM (Apple Silicon, Graviton) and the race fires immediately. Always test under load and on multiple architectures.
Tests pass 1000 times, is the code safe?
No. A race that fires 1 in 10 million increments doesn't show in unit tests but does at 1M req/sec in production. Use the race detector (Go) or stress tests with 10K+ iterations under concurrency.
When is LongAdder better than AtomicLong?
When many threads contend on the same counter. AtomicLong's CAS retries fail under heavy contention, burning CPU. LongAdder shards into per-thread cells; sum() aggregates on read. Trades memory for throughput. Used heavily for metrics counters.
Gotchas
- Tests pass on x86, fail on ARM, same code, different memory model
- Java: AtomicInteger.get() then .set() is NOT atomic (two ops); use accumulateAndGet/updateAndGet
- Python: `count += 1` on a shared global is a read-modify-write and needs a lock; the GIL only makes individual bytecodes atomic
- Go: counter++ on int doesn't flag without -race; always run with -race in CI
- Locking around the wrong thing, protecting the read but not the write
Bug Hunt: Why Does This Deadlock Under Load?
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Two threads each acquire two locks, in opposite order. Most of the time it works because the timing doesn't line up; under load it eventually does, and both threads block forever waiting for each other. Fix with global lock ordering, or tryLock with a timeout.
Key Points
- Under low load, the unlucky interleaving is rare, the bug looks like a flaky hang
- Under high load, it becomes nearly inevitable
- Four Coffman conditions must all hold; break any ONE → no deadlock
- The cheapest fix: enforce a global total ordering on lock acquisition
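A Go sketch of that lock-ordering fix; `account` and `transfer` are hypothetical names, and the ordering key is just a stable numeric ID.

```go
package main

import "sync"

// account (hypothetical): a stable ID gives every lock a global rank.
type account struct {
	id      int64
	mu      sync.Mutex
	balance int64
}

// transfer always locks the lower-ID account first, so two concurrent
// transfers between the same pair can never hold the locks in opposite
// order -- the circular-wait condition is broken.
func transfer(from, to *account, amount int64) {
	first, second := from, to
	if second.id < first.id {
		first, second = second, first
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()

	from.balance -= amount
	to.balance += amount
}

func main() {
	a := &account{id: 1, balance: 100}
	b := &account{id: 2, balance: 100}
	transfer(a, b, 25)
	transfer(b, a, 10) // opposite direction, same acquisition order
}
```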
Interview Follow-up Questions
Why does the deadlock only happen 'sometimes'?
The bug requires a specific interleaving: thread 1 must acquire its first lock AND get scheduled out before acquiring its second, AND thread 2 must run during that window. Low load = rare interleaving. High load = inevitable interleaving.
What are the four Coffman conditions?
(1) Mutual exclusion (locks are non-shareable), (2) hold-and-wait (thread holds one lock, requests another), (3) no preemption (locks can't be forcibly taken), (4) circular wait (cycle in the wait-for graph). Break ANY one → deadlock impossible. Lock ordering breaks #4.
How is this detected in production?
Java: jstack <pid> shows 'Found N deadlocked threads' with the cycle. Go: SIGQUIT or /debug/pprof/goroutine?debug=2. Python: py-spy dump --pid. All-blocked threads in a cycle is the smoking gun.
When is tryLock preferable to lock ordering?
When ordering can't be enforced, e.g., locks are dynamic, or third-party code holds locks. tryLock with timeout breaks the no-preemption condition: if all locks can't be acquired, release what's held and retry.
Gotchas
- Reentrant locks (Java ReentrantLock, Python RLock) prevent SELF-deadlock but NOT mutual deadlock between threads
- Go's runtime only catches 'all goroutines asleep', partial deadlocks go undetected
- Holding a lock while calling user-supplied callback code is a deadlock waiting to happen
- Sorting locks by hash is fine for objects, but pointers across processes (shared memory) need stable IDs
Bug Hunt: Why Does My RWMutex Deadlock When I Re-enter?
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
RWMutex is not reentrant. A goroutine holding the read lock that calls a method which also tries to acquire the read lock can deadlock if a writer is waiting in between. Java's ReentrantReadWriteLock is reentrant by design; sync.RWMutex is intentionally not. The fix is to restructure so locks aren't reentered.
Key Points
- sync.RWMutex (Go): NOT reentrant, RLock-then-Lock or RLock-then-RLock can deadlock
- ReentrantReadWriteLock (Java): IS reentrant, same thread can re-acquire its own lock
- Common trap: reader calls a method that also reads; a writer arrives between them; deadlock
- Fix: restructure to acquire the lock once at the outer scope, or pass an already-locked context
Interview Follow-up Questions
Why is sync.RWMutex (Go) intentionally non-reentrant?
Reentrant locks hide design issues. Go's authors argue: code that needs reentrant locks is calling itself unsafely. The right answer is to refactor: extract the locked work into internal helpers, or release and re-acquire. The non-reentrant default forces good design.
When does the RWMutex deadlock actually fire?
When a writer is queued between two RLocks from the same goroutine. Without writers, two consecutive RLocks 'work' (RWMutex allows multiple readers). With a writer waiting, the second RLock blocks because writer preference forbids new readers; the first RLock can't be released because the function is still inside it → deadlock.
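A Go sketch of that trap and the restructure fix; `store`, `Has`, and `Get` are made-up names, the point is the nested RLock versus locking once at the outer edge.

```go
package main

import "sync"

type store struct {
	mu   sync.RWMutex
	data map[string]string
}

// BROKEN: Get RLocks and then calls Has, which RLocks again. If a writer
// calls mu.Lock() between the two RLocks, the second RLock blocks and the
// first can never be released -> deadlock.
func (s *store) Has(k string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.data[k]
	return ok
}

func (s *store) Get(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if !s.Has(k) { // re-enters the read lock
		return "", false
	}
	return s.data[k], true
}

// FIX: take the lock once at the outer edge; inner work assumes it's held.
func (s *store) GetFixed(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[k] // no second RLock
	return v, ok
}

func main() {
	s := &store{data: map[string]string{"a": "1"}}
	_, _ = s.GetFixed("a")
}
```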
Should ReentrantLock be the default?
Reentrant locks are easier to compose but pay a small cost (tracking the holder thread + count). For tight, performance-critical code with no nested calls, the non-reentrant variant is fine. Default to ReentrantLock unless the overhead has been measured and matters.
How does this differ from a normal deadlock?
Normal deadlock: two threads, two locks, opposite order. Reentrant deadlock: ONE thread re-acquiring the same lock with a non-reentrant primitive, when a writer (other thread) is waiting. Less obvious because there are not two distinct lock instances.
Gotchas
- Go: sync.RWMutex is writer-preferring, once Lock() is called, new RLocks block
- Java: StampedLock and Semaphore are NOT reentrant; ReentrantLock and ReentrantReadWriteLock are
- Python: threading.Lock is NOT reentrant; threading.RLock IS
- Calling external code (callbacks, listeners) while holding a lock can introduce reentrancy
- ReentrantLock is fine for nested calls; using one to call into completely unknown code is still risky (might block on something else)
Bug Hunt: Lost Wakeup with Condition Variable
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Producer signals condition before consumer waits → consumer waits forever. Classic bug from forgetting to check the predicate inside the lock before waiting. Fix: always check predicate inside a while loop while holding the lock; the predicate guards both the wait AND the wakeup.
Key Points
- Lost wakeup: signal happens before wait, signal is dropped (condition variables don't queue signals).
- Fix: hold the lock, check predicate in a while loop, wait if predicate is false.
- Update the predicate and signal under the same lock, so the waiter is guaranteed to see the new predicate value.
- Spurious wakeups also require while loop (not if): wait can return without anyone signalling.
- In Go, channels have built-in 'queued' semantics that avoid this; sync.Cond can lose signals.
Interview Follow-up Questions
Why is a while loop required around wait, not just an if?
Two reasons. (1) The predicate may have been satisfied between the wakeup and the time the thread re-acquired the lock; another waiter might have consumed the condition. (2) Spurious wakeups: the spec allows wait to return without anyone calling notify. The while loop handles both correctly; an if statement is a bug.
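The same pattern as a Go sync.Cond sketch: predicate checked in a loop under the lock, updated and signalled under the same lock.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu    sync.Mutex
		cond  = sync.NewCond(&mu)
		ready bool // the predicate, protected by mu
	)

	done := make(chan struct{})
	go func() { // consumer
		mu.Lock()
		for !ready { // a loop, never `if`: handles spurious wakeups and the
			cond.Wait() // case where the signal fired before we got here
		}
		mu.Unlock()
		fmt.Println("consumer saw ready")
		close(done)
	}()

	mu.Lock()
	ready = true  // update the predicate under the lock...
	cond.Signal() // ...and signal while still holding it
	mu.Unlock()

	<-done
}
```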
Why must the signal happen under the lock?
Because the waiter must either see the updated predicate or already be waiting when the signal fires. If the predicate is checked or updated outside the lock, this interleaving is possible: consumer checks predicate (false) → producer sets predicate and signals (no one is waiting yet, signal dropped) → consumer waits forever. Holding the lock for the predicate update AND the signal, while the waiter checks the predicate and waits under the same lock, closes that window: wait releases the lock atomically, so the signal can't fall into the gap.
How does notify vs notifyAll affect this?
notify wakes one waiter (any one); notifyAll wakes all of them. notifyAll can cause a 'thundering herd' (all wake, all re-check, only one proceeds, the rest wait again), while notify risks waking a waiter that can't act on the condition when different waiters wait for different predicates. With the while-loop pattern both are correct for a single shared predicate; notifyAll is safer when in doubt.
Are channels strictly better than condition variables?
For most cases in Go, yes. Channels queue values; a send can't be lost. For complex predicates that don't fit a queue model (multiple waiters, multiple conditions), sync.Cond is still useful. But the cases where Cond is the right tool are rare.
Gotchas
- Using if instead of while around wait: spurious wakeups break correctness
- Signalling outside the lock: predicate update may not be visible to waiter in time
- Using notify when multiple unrelated waiters wait on the same condition
- Forgetting to update the predicate before signal: wakes waiter that finds predicate false
- Believing the bug is rare: lost-wakeup happens reliably under load testing
Bug Hunt: The Broken Double-Checked Locking
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Asked Often
Pre-Java-5, double-checked locking was famously broken because the JMM allowed the constructor's writes to be reordered with the publication of the reference. A reader could see a non-null reference whose object's fields were still uninitialized. The fix is volatile (Java) or a static-holder idiom, and similar gotchas apply in Go and Python.
Key Points
- DCL trick: 'check, lock, check again, construct', saves a lock on the fast path
- Bug: without volatile, the constructor's writes can be reordered with the assignment to the field
- A reader sees a non-null reference; reads object fields; sees garbage / nulls
- Static-holder idiom (Java) is simpler and uses class-init guarantees instead
- Go's sync.Once is the right answer; in Python, module-level construction works
Interview Follow-up Questions
Why was DCL famously broken in Java before 1.5?
Pre-Java-5 JMM didn't constrain construction order. The line `instance = new Singleton()` could be split: allocate, assign field reference, then run the constructor. A reader could see the non-null reference before the constructor finished. The Java 5 JMM (JSR-133) fixed this, but ONLY when the field is volatile.
Why is the static-holder idiom safer than DCL?
Class initialization in Java is thread-safe by spec. Holder is initialized exactly once when first referenced. No lock is needed in user code. It's lazier than eager initialization (the holder class only loads when get() is called) and simpler than DCL.
Does volatile have a performance cost?
Yes, every volatile read/write emits memory barriers that prevent compiler optimizations and force CPU cache coherence traffic. For a singleton accessed millions of times, the cost is real but small. Static-holder avoids it entirely.
What's wrong with Python's DCL even with the GIL?
GIL serializes bytecode, but `_instance = Singleton()` is multiple bytecodes. Two threads can both check None and both run the constructor (serially under the GIL, but both run!), silently discarding one instance. With free-threaded CPython, the missing memory barrier becomes an issue too. The fix is module-level init or holding the lock around both the check and the construction.
Should I use sync.Once for everything in Go?
Yes, for any 'initialize exactly once' pattern, sync.Once is the right tool. It's lightweight after the first call (a single atomic load on the fast path) and correct on every architecture. One caveat: if the init function panics, the Once still counts it as done and never retries.
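A minimal sync.Once sketch; the `config` type and address are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

type config struct{ addr string }

var (
	once     sync.Once
	instance *config
)

// Instance is Go's lazy singleton: no double-checked locking, no volatile.
// sync.Once guarantees the init function runs exactly once and that every
// caller sees the fully constructed value.
func Instance() *config {
	once.Do(func() {
		instance = &config{addr: "localhost:5432"} // placeholder init
	})
	return instance
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); _ = Instance() }()
	}
	wg.Wait()
	fmt.Println(Instance().addr)
}
```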
Gotchas
- Java: forgetting volatile on the singleton field is the classic broken-DCL bug
- Java: the static-holder idiom doesn't run init until get() is called, true laziness
- Python: DCL without holding the lock for both check and set has a race on free-threaded CPython
- Go: if the function passed to sync.Once.Do panics, the Once is still marked done and init is never retried; wrap critical init with explicit retry handling if needed
- Reading instance.someField on the fast path skips ALL synchronization, must guarantee instance was safely published
- Enum singletons in Java prevent reflection-based instantiation; regular ones don't
Bug Hunt: Goroutine Count Growing 10/sec
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
A goroutine that blocks on a channel send/receive without a cancellation path lives forever. The fix is always a select on ctx.Done() or a buffered channel large enough to absorb all sends. The same bug appears in Java (futures never cancelled) and Python (asyncio tasks orphaned).
Key Points
- A goroutine that blocks forever holds memory + captured variables + channels
- Symptom: runtime.NumGoroutine() growing unbounded under load
- Diagnosis: /debug/pprof/goroutine?debug=2 shows what each goroutine is waiting on
- Fix pattern: every goroutine that sends/receives must have a cancellation path
Interview Follow-up Questions
How is a goroutine leak detected in production?
Expose `/debug/pprof/goroutine` and watch the count. Alert on `runtime.NumGoroutine()` growing unbounded. To diagnose: `curl /debug/pprof/goroutine?debug=2` shows every goroutine's stack, find the ones stuck on `chan send` or `chan receive`.
Why is the unbuffered-channel send pattern so prone to leaks?
Unbuffered = synchronous handoff. The sender blocks until a receiver is ready. If the receiver is gone (panic'd, request cancelled, slow), the sender blocks forever. Buffered channels add slack; ctx.Done() adds an exit path.
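A Go sketch of the fix: every send gets an exit path via ctx.Done(). `fetch` and the 50ms timeout are hypothetical.

```go
package main

import (
	"context"
	"time"
)

// fetch is a hypothetical per-request worker. The select gives the blocking
// send an exit path, so a vanished receiver can't strand the goroutine.
func fetch(ctx context.Context, out chan<- string) {
	result := "data" // stand-in for real work
	select {
	case out <- result: // receiver still there
	case <-ctx.Done(): // request cancelled or timed out: give up instead of leaking
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	out := make(chan string) // unbuffered: a bare send would block forever if nobody reads
	go fetch(ctx, out)

	// Simulate a caller that gives up without reading; fetch still exits via ctx.Done().
	<-ctx.Done()
}
```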
Is asyncio.create_task without await ever correct?
Rarely. Only for truly fire-and-forget work where completion and errors don't matter. Even then, keep a reference (Python may garbage-collect mid-run) and add a done_callback for exception logging. Prefer asyncio.TaskGroup for structured cleanup.
How is 'goroutine count growing' measured?
Expose runtime.NumGoroutine() as a Prometheus metric. Alert when it grows for >5 minutes without bound. Healthy services have a roughly bounded number of goroutines proportional to in-flight requests; unhealthy ones grow linearly with uptime.
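A small sketch of exposing that number with only the standard library (expvar serves it as JSON at /debug/vars); a real deployment would publish the same value as a Prometheus gauge and alert when it only ever goes up.

```go
package main

import (
	"expvar"
	"net/http"
	"runtime"
)

func main() {
	// expvar.Func is re-evaluated on every scrape of /debug/vars.
	expvar.Publish("goroutines", expvar.Func(func() any {
		return runtime.NumGoroutine()
	}))
	http.ListenAndServe("localhost:8081", nil) // internal/admin port only
}
```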
Gotchas
- fire-and-forget goroutines without ctx.Done() are the #1 source of Go production leaks
- Java's CompletableFuture without orTimeout silently inherits 'wait forever' semantics
- Python: asyncio.create_task without keeping a reference can be GC'd mid-execution
- context.WithTimeout/WithCancel without calling cancel() keeps the timer and the child context registered in the parent until the parent itself is cancelled
- Logging goroutines that block on a slow downstream → biggest leak risk in HTTP handlers
- Bounded queues + drop-on-full is often better than infinite buffering
Bug Hunt: Why Is This LongAdder Slower Than AtomicLong?
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
LongAdder shards counters across threads to win under contention, but its sum() walks every shard, so reads are O(N). When read rate is high relative to write rate, AtomicLong (single CAS) is faster. The bug is using the right tool for the wrong workload.
Key Points
- AtomicLong: cheap reads (one load), CAS retries on contention (slow when many writers contend)
- LongAdder: cheap writes under contention (per-thread cells), expensive reads (sum walks all cells)
- Read-heavy + few writers → AtomicLong wins
- Write-heavy + many writers → LongAdder wins
- Always benchmark with the actual workload, intuition lies
Interview Follow-up Questions
When does LongAdder definitively beat AtomicLong?
Many writer threads, infrequent reads. Classic example: metrics counters scraped every 10-60 seconds while every request increments. Goetz's original benchmarks showed crossover at 4+ contending threads. Below that, AtomicLong wins.
Why is AtomicLong's read O(1) but LongAdder's O(N)?
AtomicLong stores a single value, read is one cache-line load. LongAdder stores N cells (typically one per CPU); sum() walks all of them and adds. False sharing protection means each cell is on its own cache line, so reading sum is N cache misses.
Are sharded counters ever wrong?
sum() can return slightly stale values because cells are read sequentially without locking the whole structure. For metrics, this is fine. For exact accounting (billing, money), use AtomicLong or a Lock.
Why does the JDK ship both AtomicLong and LongAdder if one is just better under contention?
They're complementary, not substitutes. AtomicLong fits the 'consistent read at any time, low contention' niche. LongAdder fits the 'high write throughput, periodic read' niche. The JDK author (Doug Lea) explicitly says: profile to choose.
Gotchas
- Many counters are read on every request (rate limiters, header injection) → AtomicLong wins
- Metrics counters scraped via Prometheus/StatsD → LongAdder may win
- Sharded counters are eventually consistent, sum() can return slightly old values
- False sharing across cache lines kills sharded perf if cells aren't padded, JDK pads automatically; rolling-your-own often forgets
- Profile-driven optimization beats intuition every time
Bug Hunt: False Sharing in a Striped Counter
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Striped counter built from an array of AtomicLong shows worse throughput at high thread count than a single AtomicLong. Bug: the array elements share cache lines; threads writing different stripes cause cache-line ping-pong. Fix: pad each element to its own cache line (LongAdder, @Contended, or manual padding).
Key Points
- Symptom: striped counter scales worse than expected; CPU at 100%, throughput plateaus or degrades.
- Root cause: array elements occupy adjacent memory; multiple elements share each cache line.
- Two cores writing to different elements on the same line: the line bounces between L1 caches.
- Fix: pad each element to 64 bytes (cache line size). Java's LongAdder does this internally.
- Diagnose: profile shows hot atomic-add op, cache miss rate is high, IPC (instructions per cycle) is low.
Interview Follow-up Questions
How do I detect false sharing in profiling?
Three signals together: (1) high CPU utilisation but low throughput, (2) hot atomic operations in the profile flame graph, (3) a high cache-miss rate (e.g., perf stat -e LLC-load-misses, or perf c2c, which is designed for exactly this). The combination strongly suggests false sharing. Pad and re-measure; if throughput jumps, false sharing is confirmed.
Should I always pad atomics?
No. Padding wastes memory (64 bytes per atomic instead of 8). Only pad after measuring false sharing. The general rule: pad atomics that multiple cores write to in hot paths. Don't pad atomics that are read-mostly or written by one thread.
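A Go sketch of a padded striped counter, the same idea LongAdder implements; the 64-byte line size and the shard count are assumptions to measure against, not constants to trust blindly.

```go
package main

import (
	"runtime"
	"sync/atomic"
)

const cacheLine = 64

// paddedShard pads each counter out to its own cache line so two cores
// incrementing different shards never invalidate each other's line.
// Assumes a 64-byte line and an 8-byte atomic.Int64.
type paddedShard struct {
	n atomic.Int64
	_ [cacheLine - 8]byte
}

type stripedCounter struct {
	shards []paddedShard
}

func newStripedCounter() *stripedCounter {
	return &stripedCounter{shards: make([]paddedShard, runtime.GOMAXPROCS(0))}
}

func (c *stripedCounter) Inc(shard int) {
	c.shards[shard%len(c.shards)].n.Add(1)
}

// Sum walks every shard, which is why reads are O(number of shards).
func (c *stripedCounter) Sum() int64 {
	var total int64
	for i := range c.shards {
		total += c.shards[i].n.Load()
	}
	return total
}

func main() {
	c := newStripedCounter()
	c.Inc(0)
	c.Inc(1)
	_ = c.Sum()
}
```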
Why does cache line bouncing matter so much?
Cache coherency protocol (MESI) requires that only one core have a cache line in 'modified' state. When core A writes to the line, core B's copy is invalidated; B must re-fetch on its next access. If A and B alternate writes, the line bounces between L1 caches; each access pays the cache-miss latency (~40-100 cycles) instead of L1 hit (~1-5 cycles).
How does LongAdder choose the number of cells?
Dynamically. Starts with one cell. On contention (CAS failures), grows the cell array up to a power-of-two ceiling (typically core count or larger). Each thread hashes to a cell. With enough cells and padding, contention disappears. Reads (sum) walk all cells, so reads are O(num cells).
Gotchas
- Striped counters without padding scale worse than non-striped
- Padding only some atomics: misses the bouncing on the unpadded ones
- @Contended without -XX:-RestrictContended is silently ignored on application classes
- False sharing in field layout (two atomic fields adjacent in a struct): same problem as arrays
- Padding waste: 64 bytes per atomic adds up; use sparingly
Bug Hunt: Memory Ordering on ARM
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Code passes tests on x86, fails on ARM. The bug: missing memory barrier (volatile in Java, std::atomic in C++, sync/atomic or a mutex in Go) on a flag that the writer sets and the reader checks. x86's strong ordering hides the bug; ARM's weaker ordering exposes it. Fix: use proper synchronization primitives, never plain reads of shared variables.
Key Points
- x86 has strong ordering: most reorderings forbidden by hardware. Bugs hidden.
- ARM/POWER have weak ordering: writes can be reordered, loads can speculate ahead.
- The bug usually looks like: writer thread sets flag, reader thread sees the flag set but old data.
- Fix: use volatile / atomic / proper synchronisation. Never plain reads of shared mutable state.
- Detect via stress testing on ARM hardware (Apple Silicon, AWS Graviton, Raspberry Pi).
Interview Follow-up Questions
Why doesn't this bug appear on an x86 laptop?
x86 has strong memory ordering: stores become visible in program order and loads aren't reordered with respect to each other, so the hardware delivers ordering the language only promises when atomic/volatile is used. Buggy code 'just works' on x86 because the hardware is generous; on ARM/POWER the hardware is more permissive and the missing synchronisation is exposed.
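A Go sketch of the flag-publication pattern: with atomic.Bool, the Go memory model orders the data write before the flag store, so a reader that sees the flag also sees the data; with a plain bool it's a data race that x86 usually forgives and ARM may not.

```go
package main

import (
	"sync/atomic"
	"time"
)

var (
	data  int
	ready atomic.Bool // atomic store/load pair gives the needed ordering
)

func writer() {
	data = 42         // plain write...
	ready.Store(true) // ...published by the atomic store: a reader that
	// observes ready==true is guaranteed to also observe data==42
}

func reader() int {
	for !ready.Load() { // with a plain bool here, nothing orders this load
		time.Sleep(time.Millisecond) // against the later read of data
	}
	return data
}

func main() {
	go writer()
	_ = reader()
}
```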
Are these bugs detectable in tests?
Stress testing on ARM hardware exposes most. Race detectors (Go's -race, ThreadSanitizer for C++/Rust) catch concurrent unprotected access regardless of platform. Run TSan in CI. For Java, jcstress is the gold standard for stress-testing memory model assumptions.
What if the code is just for x86?
Three reasons to still fix it: (1) future-proofing, the code may move to ARM (Apple Silicon, AWS Graviton), (2) compiler reordering, even on x86, the compiler can reorder reads/writes within a thread, breaking lock-free protocols, (3) correctness, using proper synchronisation makes the code's intent clear to readers.
How does this relate to volatile in Java?
Java's volatile gives release-on-write and acquire-on-read semantics on a single field. It's the minimum required to publish a value safely across threads. Plain (non-volatile) Java fields can be cached in registers, reordered, or have stale values across threads. Use volatile (or atomic, or synchronized) for any shared mutable state.
Gotchas
- Testing only on x86 = bugs hide until ARM deployment
- 'Works in dev, fails in prod' when prod is on ARM
- Race detector catches issues regardless of platform; run it in CI
- C++: std::atomic defaults to memory_order_seq_cst (safe but slower); explicit memory_order_relaxed needs justification
- Java: plain long/double reads and writes aren't guaranteed atomic (they can tear, notably on 32-bit JVMs); use volatile or AtomicLong
Bug Hunt: Works on x86, Fails on ARM, Why?
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
x86 has a strong memory model that hides reordering bugs at runtime. ARM and POWER have weaker memory models that expose them. Code missing proper synchronization can 'work' on Intel and break on Apple Silicon, AWS Graviton, or mobile. The fix is to add a memory barrier, volatile, atomic, or a lock.
Key Points
- x86 uses TSO (Total Store Order), most reads and writes can't be reordered with each other
- ARM, POWER, RISC-V allow more reordering; the same code 'works' less reliably
- Compilers reorder too, even on x86, missing synchronization can fire on a JIT optimization
- Memory barriers (volatile, atomic store/load, lock acquire/release) prevent reordering
Interview Follow-up Questions
Why does x86 hide so many reordering bugs?
x86 uses TSO (Total Store Order). Most reads and writes can't be reordered with each other, only store-load reordering is allowed (and even that is rare in practice). The hardware is doing a lot of work to maintain the appearance of sequential consistency. ARM and POWER have weaker models that expose more reordering.
Does this affect compiled languages only?
No. Compilers and runtimes also reorder. The JIT can hoist a non-volatile read out of a loop on x86, and the bug fires immediately. Go's race detector catches some of these even on x86. The memory model rules apply at every layer, compiler, runtime, hardware.
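A Go sketch of the hoisting hazard: busy-waiting on a plain variable is a data race the compiler is allowed to "optimize" into an infinite loop, whereas an atomic load must be re-performed every iteration.

```go
package main

import (
	"sync/atomic"
	"time"
)

var stopPlain bool         // plain flag: the compiler may keep the read in a register
var stopAtomic atomic.Bool // atomic flag: each iteration performs a real, ordered load

// spinPlain is the bug: a racy busy-wait that may legally be compiled as
// `if !stopPlain { for {} }`, spinning forever even after the flag flips.
func spinPlain() {
	for !stopPlain {
	}
}

func spinAtomic() {
	for !stopAtomic.Load() {
	}
}

func main() {
	go spinAtomic()
	time.Sleep(10 * time.Millisecond)
	stopAtomic.Store(true)            // observed by spinAtomic's next Load
	time.Sleep(10 * time.Millisecond) // give it a moment to exit before main returns
}
```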
Are 64-bit reads/writes atomic in Java?
Not always: long/double on 32-bit JVMs can be torn. Even on 64-bit, the JLS only guarantees atomic long/double access when the field is declared `volatile`. In practice HotSpot makes 64-bit reads and writes atomic, but the spec doesn't require it.
Should I just make everything volatile / atomic?
It's tempting but expensive: every volatile/atomic access emits memory barriers that restrict reordering and generate cache-coherence traffic, costing tens of cycles. Use synchronization where there's actual cross-thread sharing; leave thread-local state alone. The cost is real on hot paths.
Gotchas
- Tests pass on x86 dev laptops, fail on ARM CI runners, TEST ON BOTH
- JIT can hoist a non-volatile read out of a loop entirely → 'why is my thread spinning forever?'
- Java: long/double on 32-bit JVMs aren't atomic without volatile, torn reads possible
- Python: free-threaded CPython (3.13+) loosens visibility guarantees that GIL-based code accidentally relied on
- Go: copying a struct that contains atomic fields silently breaks them
- x86 forbids store-store reordering in hardware, so those bugs never fire there; ARM permits it
Code Review: This Lock-Free Counter
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
A candidate's 'lock-free counter' looks slick: atomic operations, no synchronized blocks. But the get-then-set pattern isn't atomic, the read isn't ordered with the write, and overflow handling is missing. Walk through what a reviewer is expected to find.
Key Points
- An atomic.get() followed by atomic.set() is NOT atomic, same race as ++ on volatile
- Compound operations need compareAndSet (CAS) or accumulateAndGet, not get+set
- Reading a value, then checking and acting on it later, is a read-then-act race unless the whole sequence is atomic
- Overflow on counters that monotonically increase is a real production bug
Interview Follow-up Questions
Why is 'check-then-act' such a common bug?
Because it reads naturally as English: 'if X then do Y.' Engineers write this naively without realizing the gap between 'check' and 'act' is exactly where another thread can race in. Every concurrent algorithm has to ask: is this a single operation, or two? Atomicity must be explicit.
When is atomic.get() followed by atomic.set() acceptable?
When losing an update is acceptable. Examples: setting a 'last-seen timestamp' (overwriting is fine), publishing a configuration version (only one writer at a time anyway). For anything where two writers can race and both updates matter, use CAS or a lock.
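A Go sketch contrasting the broken two-step update with a CAS retry loop (and the simpler Add); the counter itself is the only state.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var counter atomic.Int64

// brokenInc is the reviewed bug: two atomic operations with a gap between
// them, so concurrent increments can overwrite each other.
func brokenInc() {
	counter.Store(counter.Load() + 1)
}

// casInc is the genuinely lock-free version: retry until the swap wins.
func casInc() {
	for {
		old := counter.Load()
		if counter.CompareAndSwap(old, old+1) {
			return
		}
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100_000; j++ {
				casInc() // or simply counter.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter.Load()) // exactly 800000; a brokenInc version typically loses updates
}
```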
Why does the candidate's code call itself 'lock-free'?
They confused 'uses atomic operations' with 'lock-free.' True lock-free means: no thread can be permanently blocked by another thread's actions. CAS retry loops are lock-free in this sense. But check-then-set with two atomic ops is neither lock-free nor correct, it's just broken.
Should this code be lock-free at all?
Probably not. For a counter accessed by ~10 threads, a Lock or synchronized method is simpler and fast enough. Lock-free programming pays off in extreme contention scenarios (e.g., kernel data structures, HFT systems). For most application code, just use a lock.
Gotchas
- atomic.get() then atomic.set() is two operations, not atomic
- atomic.set(atomic.get() + 1) is the textbook racy increment
- Reading outside a lock then acting inside it, if the read informed the action, it's a race
- Naming a class 'LockFree' doesn't make it correct
- Java's accumulateAndGet/updateAndGet build CAS retry loops automatically, prefer them when applicable
Code Review: Go Context Misuse
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Common Go context bugs: not passing context downstream, not checking ctx.Done() in loops, storing context in struct fields, ignoring cancel cleanup, calling cancel from inside the cancelled goroutine. Each one creates a goroutine leak or unresponsive cancellation.
Key Points
- Always pass ctx as the FIRST parameter; don't store in struct fields.
- Always defer cancel() after creating one with WithTimeout/WithCancel; otherwise leak.
- Pass ctx down through every call that does I/O or might block.
- In loops, check ctx.Done() either via select or ctx.Err().
- Returning early from cancelled context is a goroutine's responsibility; runtime won't kill it.
Interview Follow-up Questions
Why is storing ctx in a struct an anti-pattern?
Because ctx is per-request: it carries deadlines, cancellation, request-scoped values that apply to one call chain. Storing it in a struct field means subsequent calls inherit a stale ctx (the original request long-since done). The Go team explicitly recommends 'Don't store Contexts inside a struct type' in the standard library docs.
Should every function take ctx?
Every function that does I/O, blocks, or might run long. Pure compute that returns in microseconds doesn't need it. The convention: ctx as first parameter, named ctx, type context.Context. CPU-only helpers (sort, map transforms) usually don't need ctx.
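A Go sketch of those conventions in one place: ctx first, cancel deferred by the creator, ctx.Done() checked via select inside the loop. The job channel and intervals are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// poll shows the conventions: ctx as the first parameter, honored by every
// blocking wait via select, returned as the exit reason.
func poll(ctx context.Context, jobs <-chan string) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // context.Canceled or context.DeadlineExceeded
		case job := <-jobs:
			fmt.Println("processing", job)
		case <-ticker.C:
			// periodic housekeeping
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel() // always release the context's resources

	jobs := make(chan string)
	if err := poll(ctx, jobs); err != nil {
		fmt.Println("stopped:", err)
	}
}
```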
What's the difference between ctx.Done() and ctx.Err()?
ctx.Done() returns a channel that closes when ctx is cancelled; useful in select. ctx.Err() returns nil while not cancelled, returns context.Canceled or context.DeadlineExceeded after cancellation; useful for non-blocking checks. They are equivalent expressions of the same signal.
Can a context be cancelled from inside the goroutine that received it?
The cancel function works (when in scope). But typically the cancel function lives with the parent, not the child. The child observes ctx.Done() or ctx.Err() to know it should exit. When a child needs to signal 'I'm done' to siblings, prefer errgroup or a shared structured-concurrency primitive.
Gotchas
- Forgetting defer cancel() keeps the child context (and any timer) alive until the parent is cancelled
- Ignoring ctx in HTTP calls or DB queries: cancellation has no effect
- Storing ctx in struct: subsequent calls inherit stale ctx
- Using context.Background() in handlers: no inherited cancellation; should be the request's ctx
- Putting business data in ctx values: anti-pattern; use explicit parameters
Code Review: This asyncio Service for Blocking Calls
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
An async FastAPI handler calling a sync DB driver, sync HTTP client, and time.sleep stalls the entire event loop. Tail latency goes from 50ms to 500ms across all concurrent requests. Walk through every blocking call and either replace with async-aware libraries or wrap with asyncio.to_thread.
Key Points
- async def + a synchronous call inside = silent disaster, the event loop stalls
- All blocking I/O must be either async-native or wrapped in asyncio.to_thread
- asyncio.create_task without a reference can be garbage-collected mid-run
- asyncio.Lock and threading.Lock are NOT interchangeable
Interview Follow-up Questions
Why does ONE sync call inside async stall ALL concurrent requests?
asyncio is single-threaded. The event loop runs one coroutine at a time, switching between them at await points. A synchronous call blocks the only thread, no other coroutine can run until it returns. If 100 requests are concurrent and one of them sleeps for 500ms, all 100 see 500ms of stall.
How to detect blocking calls in production async code?
Set asyncio's debug mode (`PYTHONASYNCIODEBUG=1` or `loop.set_debug(True)`), it logs slow callbacks. Or use `aiomonitor` to dump active tasks. Production: track per-task duration; alert if any task occupies the loop for >50ms.
Is asyncio.to_thread fast?
Faster than threading.Thread (uses a shared default executor). Slower than native async (the cost is a context switch + Python's GIL release/reacquire). Use it as a bridge for unavoidable sync calls, not as a default.
Should threads be used instead of asyncio?
Without a need for 1000+ concurrent connections, threads are simpler. asyncio's value is at high concurrency where thread-per-task memory becomes prohibitive. For a web app handling 100 concurrent users, threading is fine and avoids the async-everywhere-or-nothing pitfall.
What's the right way to run a synchronous CPU task from asyncio?
ProcessPoolExecutor via `loop.run_in_executor`. The CPU work runs in another process (bypasses GIL), the event loop stays responsive. Don't run CPU work on the default thread executor, it still holds the GIL and stalls the loop.
Gotchas
- requests.get inside async def → silent stall, no warning
- time.sleep inside async def → freezes everything
- threading.Lock in asyncio code → blocks the event loop instead of yielding
- asyncio.create_task without keeping reference → task can be GC'd mid-run
- Mixing sync DB drivers with FastAPI → tail latency surge under load
- asyncio.gather with N=1M tasks → memory blow-up; bound with Semaphore
Production Postmortem: Concurrency Outages
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Real concurrency outages share patterns: thread pool exhaustion from a slow downstream, missing timeouts that turn local slowness into global outage, retry storms after a brief failure, deadlock under load that's invisible at low traffic. The fixes are usually defensive (timeouts, bulkheads, circuit breakers) rather than algorithmic.
Key Points
- Most outages have multiple contributing causes; rarely one bug.
- Common pattern: 'X got slow → no timeout → Y got tied up → cascade'.
- Timeline analysis: when did metrics change? Tie to deploys, traffic, downstream events.
- Action items: not just 'fix the bug' but 'prevent the class of bug': timeouts, bulkheads, alerts.
- The blameless postmortem: focus on systemic gaps, not individual mistakes.
Interview Follow-up Questions
Why do most concurrency outages have multiple causes?
Single failures are usually contained by the system's defenses (timeouts, breakers, retries). Outages happen when multiple defenses fail simultaneously: 'downstream slow' AND 'no timeout' AND 'no bulkhead' AND 'cascading retries'. The postmortem usually finds 3-5 contributing factors. Each one alone wouldn't have caused the outage.
What does a good postmortem cover?
Timeline (what happened, when, in what order). Root causes (the bugs and the gaps that allowed the bugs to escalate). Detection (how we noticed; could we have noticed sooner?). Mitigation (what we did to stop the bleeding). Action items (concrete fixes, with owners and due dates). The blameless framing: 'how did the system allow this' rather than 'who screwed up'.
How do I prevent the next outage instead of just fixing this one?
Find the class of bug, not just the instance. 'No timeout on this call' → 'audit all calls for missing timeouts'. 'Race in this code path' → 'add race detection to CI'. 'Pool exhaustion' → 'add per-downstream pools and bulkheads'. The postmortem's action items should leave the system safer against similar future incidents, not just patched against this one.
What metrics should I have for early detection?
Per-downstream latency (p50, p99). Pool utilisation (in-use / max). Queue depth (job queues, retry queues, message queues). Error rate per downstream. Circuit breaker state. The aggregate 'service latency' is too lagging; per-component metrics catch problems before they cascade.
Gotchas
- Postmortem that blames an individual instead of fixing the system
- Action items without owners or due dates: never get done
- Fixing only the immediate bug, not the class of bug
- Not exercising the postmortem's findings: 'we'll add timeouts later'
- Skipping postmortems for 'small' incidents that turn out to be patterns