Process vs Thread vs Goroutine
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A process owns memory and resources; threads share the process's address space and are scheduled by the OS; goroutines are user-space tasks multiplexed onto a small pool of OS threads by the Go runtime.
Key Points
- Process: isolated address space, expensive context switch (~tens of μs), IPC required to share data
- OS thread: ~1 MB stack, context switch ~1–10 μs, kernel-scheduled
- Goroutine: ~2 KB initial stack (grows), context switch ~hundreds of ns, runtime-scheduled
- Java virtual threads (21+): JVM-managed, M:N model, millions per JVM
- Python threads share the GIL, only one executes Python bytecode at a time
- Prefer processes for CPU-bound parallelism in Python; threads for I/O
Interview Follow-up Questions
How many goroutines can a Go program run?
Practically millions on modern hardware. Each starts with ~2 KB stack that grows as needed. Limited by available memory, not OS thread count.
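A minimal Go sketch of how cheap goroutines are (the 100,000 count is illustrative; assumes a recent Go toolchain):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// trivial work; each goroutine costs kilobytes of stack, not megabytes
		}()
	}
	fmt.Println("goroutines alive:", runtime.NumGoroutine())
	wg.Wait()
}
```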
What's the GIL and why does it matter?
Global Interpreter Lock, CPython serializes execution of Python bytecode across threads. Threads help for I/O (GIL released around system calls) but not CPU. Use multiprocessing for CPU parallelism.
Difference between virtual threads and goroutines?
Both use M:N scheduling onto OS threads. Goroutines are runtime-native and integrate with channels. Virtual threads are JVM-managed and work with existing blocking JDK APIs without rewrites.
Why does Go limit OS threads but not goroutines?
GOMAXPROCS sets the number of P's (logical processors), which caps how many OS threads can execute Go code simultaneously; threads blocked in syscalls don't count toward the limit. Goroutines are multiplexed onto these. Default = number of CPU cores.
Can a goroutine outlive the function that started it?
Yes, a goroutine runs until its function returns. If main() exits, all goroutines are killed. Otherwise it can run forever, which is how leaks happen.
Gotchas
- Goroutines started in a loop without context cancellation are the #1 source of leaks in production Go services
- Virtual threads pin to their carrier inside synchronized blocks, prefer ReentrantLock for I/O-heavy code
- Python threads do NOT give CPU parallelism, measure and switch to multiprocessing if compute-bound
- Mixing asyncio and threading is a footgun, use asyncio.to_thread() for blocking calls
- runtime.NumGoroutine() growing unbounded over time = leak; alert on it
Concurrency vs Parallelism
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
Concurrency is structuring a program to deal with many things at once; parallelism is actually doing many things at the same time. One can exist without the other.
Key Points
- Concurrency = composition of independently executing tasks (a structure)
- Parallelism = simultaneous execution of multiple tasks (an outcome)
- Single-core CPU + threads = concurrency without parallelism (interleaved)
- GPU compute or multi-core SIMD = parallelism without classical concurrency
- Web server handling 10K connections = concurrency; matrix multiplication on 8 cores = parallelism
- Async/event loops give concurrency without threads, single-threaded but interleaved
Interview Follow-up Questions
If Python's GIL prevents parallel threads, why use threading at all?
Threading provides concurrency for I/O-bound work, the GIL is released around blocking system calls. The result is overlapping waits, not parallel compute.
Is async/await concurrent, parallel, or both?
Concurrent only. asyncio runs on one thread; tasks interleave at await points. To add parallelism, run multiple asyncio event loops (one per process) or use a thread pool for blocking work.
Can a single-threaded program be concurrent?
Yes, Node.js and Python asyncio are single-threaded but concurrent. They interleave I/O-bound tasks via cooperative scheduling on an event loop.
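A small Go sketch of concurrency without parallelism: with GOMAXPROCS(1) only one goroutine executes at any instant, yet the two tasks still interleave (the task names and loop counts are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(1) // one OS thread executing Go code at a time

	var wg sync.WaitGroup
	for _, name := range []string{"A", "B"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < 3; i++ {
				fmt.Println(name, i)
				runtime.Gosched() // yield so the other goroutine can run
			}
		}(name)
	}
	wg.Wait()
}
```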
Why does Rob Pike say 'concurrency is not parallelism'?
His point: concurrency is a way to structure problems with independent pieces; parallelism is one possible execution. A well-designed concurrent program runs efficiently on 1 or 100 cores, the structure is independent of the hardware.
Gotchas
- More threads != more parallelism, once thread count exceeds core count, the cost is context-switching without throughput gain
- asyncio is single-threaded, a CPU-bound task in an async function blocks the entire event loop
- Thread pools sized for I/O are wrong for CPU-bound work and vice versa
Thread & Goroutine Lifecycle
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A Java thread moves through NEW → RUNNABLE → BLOCKED/WAITING/TIMED_WAITING → TERMINATED. A goroutine has no public state, it runs until its function returns; the runtime parks/unparks it on channel and lock operations. Python threads expose start/join/is_alive but no fine-grained state.
Key Points
- Java thread states: NEW, RUNNABLE, BLOCKED (waiting for monitor), WAITING, TIMED_WAITING, TERMINATED
- RUNNABLE in Java covers both 'running' and 'ready to run', the JVM doesn't expose CPU-on/off
- BLOCKED = waiting for a synchronized monitor; WAITING = parked via wait/join/park
- Goroutines have no public state, they're either running, runnable, or parked
- Cancellation in Go is via context.Context, propagated explicitly through call chains
- Cancellation in Java is via Thread.interrupt(), checked cooperatively
- Cancellation in Python is via threading.Event polling or asyncio.Task.cancel()
Interview Follow-up Questions
What's the difference between BLOCKED and WAITING in Java?
BLOCKED = waiting to acquire a synchronized monitor (no timeout, no signal). WAITING = parked via wait/join/LockSupport.park, woken by notify/unpark/interrupt.
Why is Thread.stop() deprecated?
It releases all locks abruptly, leaving objects in inconsistent states. Use interrupt + cooperative shutdown.
How do you cancel a goroutine?
Not from outside, signal it via context.Done() or by closing a channel it selects on. The goroutine must check the signal and return.
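A minimal sketch of cooperative cancellation via context (the tick interval and sleeps are illustrative, for demonstration only):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func worker(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			fmt.Println("worker: stopping:", ctx.Err())
			return // the goroutine is not killed; it checks the signal and returns
		case <-time.After(100 * time.Millisecond):
			fmt.Println("worker: tick")
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go worker(ctx)

	time.Sleep(300 * time.Millisecond)
	cancel()                          // signal the worker to stop
	time.Sleep(50 * time.Millisecond) // give it a moment to exit (demo only)
}
```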
What happens to goroutines when main() exits?
They're killed immediately, no cleanup runs. Use sync.WaitGroup or context to wait for them before main returns.
Can a Thread be restarted after it terminates?
No. Thread.start() throws IllegalThreadStateException if called twice. Create a new Thread or use an ExecutorService.
How does Python cancel a thread?
Not directly. The convention is a shared threading.Event the thread polls. Or for asyncio, task.cancel() raises CancelledError at the next await.
Gotchas
- Thread.sleep() and Object.wait() CLEAR the interrupt flag when they throw InterruptedException; re-set it with Thread.currentThread().interrupt()
- Forgetting wg.Add() before 'go' creates a race with wg.Wait(), Add must be called from the parent goroutine
- context.WithCancel without calling cancel() leaks the child context (it stays reachable from its parent until the parent is cancelled)
- Daemon threads (Java setDaemon, Python daemon=True) are killed at shutdown, avoid for cleanup work
- notify() wakes ONE waiter, notifyAll() wakes them all; spurious wakeups mean the wait must be in a loop either way
- Swallowing asyncio.CancelledError prevents shutdown, re-raise after cleanup
Critical Sections & Shared State
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A critical section is the part of code that touches shared mutable state and must run without interleaving from other threads. Identify it first, then pick a primitive (mutex, atomic, channel) to protect it. Most concurrency bugs come from missing or oversized critical sections.
Key Points
- Shared mutable state is the source of every race. Remove sharing or remove mutation, and the problem disappears.
- A critical section is any code path that reads-then-writes (or writes-then-reads) shared state and depends on the data not changing in between.
- Two failures to avoid: forgetting to protect a critical section, and protecting too much (locking around I/O).
- The right size of a critical section is the smallest set of statements that preserve the invariant in question.
Interview Follow-up Questions
How does one find critical sections in unfamiliar code?
Look for every read or write to a field that is also written by another thread. Then trace what invariant the code relies on. If the invariant spans more than one statement, those statements are the critical section.
What is wrong with making the lock cover the whole method?
Two things. Throughput drops because no other thread can do anything while one is in the method. And if the method calls into other code (a callback, a network call), it can deadlock or block on something that has no business being inside a lock.
Is a single statement always atomic?
No. `counter++` reads the value, increments, and writes back. Three operations. Other threads can interleave between them. A single statement is not the same as a single operation at the hardware level.
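A sketch of the smallest useful critical section in Go: the read-increment-write on the counter is the invariant-preserving unit, so that is exactly what the mutex wraps (counter name and goroutine count are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)

	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			counter++ // read, add, write: three steps, one critical section
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter) // always 1000 with the lock; unpredictable without it
}
```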
Race Conditions, Why Concurrent Bugs Happen
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
A race condition is when the program's correctness depends on the unpredictable interleaving of two or more threads' operations on shared state. Fix it by removing the sharing, or by making the operations atomic with a lock or atomic primitive.
Key Points
- A race condition is a CORRECTNESS bug, not a 'maybe slow' bug, the wrong answer can be observed
- Three ingredients: shared mutable state, multiple threads, unsynchronized access
- Remove ANY one of the three and the race goes away
- Read-modify-write operations like counter++ are NEVER atomic, they're three instructions
- Compilers and CPUs reorder reads/writes, what looks sequential in source isn't
- Race conditions hide on x86 (strong memory model) and explode on ARM
Interview Follow-up Questions
What's the difference between a data race and a race condition?
Subtle but important. A data race is unsynchronized access where at least one is a write, a memory model violation, undefined behavior. A race condition is broader, any logic bug whose outcome depends on timing. All data races are race conditions; not all race conditions are data races (e.g., check-then-act with atomics).
Why does counter++ fail even with volatile?
volatile gives visibility (other threads see the latest value) and ordering, but ++ is still three operations: read, add, write. Two threads can both read the same value, both add 1, both write, losing one update. volatile makes each individual read and write atomic and visible; it's the COMPOUND read-modify-write that also needs to be atomic, and volatile can't provide that. Use AtomicInteger.incrementAndGet.
If code 'works' on x86, is there still a race?
Yes. x86's strong memory model masks many races at runtime, but the compiler is free to reorder. Move to ARM (Apple Silicon, AWS Graviton, mobile) and the same code may break. Use the race detector on every platform.
What's the simplest way to remove a race?
Remove the sharing. Confine the state to one thread (Go's idiom: 'share by communicating'), or make it immutable. Locks and atomics are the next-best fixes, they manage the race; removing the sharing eliminates it.
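A Go sketch of removing the race by removing the sharing: the total is confined to a single goroutine, and everyone else communicates over a channel instead of touching the variable (the counter and goroutine count are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	increments := make(chan int)
	done := make(chan int)

	// Only this goroutine ever touches total, so no lock is needed.
	go func() {
		total := 0
		for n := range increments {
			total += n
		}
		done <- total
	}()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			increments <- 1
		}()
	}
	wg.Wait()
	close(increments)
	fmt.Println(<-done) // 100
}
```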
Are reads-only races okay?
Yes, concurrent reads of immutable state are always safe. A race requires at least one write.
Gotchas
- Tests almost always pass, races need real load and timing to manifest
- x86 hides races that ARM exposes, test on both
- Even atomic.Add doesn't help with check-then-act: if (atomic.Load() == 0) atomic.Store(1) is racy
- The Go race detector slows code 5-10x, run it in CI, not production
- Java's volatile is for visibility, not atomicity of compound ops
Thread Safety Levels
Section: Core Mental Models | Difficulty: Basic | Frequency: Asked Often
Thread safety is not a single yes/no bucket. There are levels: immutable, thread-confined, conditionally thread-safe, fully thread-safe, and lock-free. Knowing the level a class promises is the difference between safe code and a postmortem.
Key Points
- Immutable: no writes after construction. Always safe to share.
- Thread-confined: only ever touched by one thread. Safe by isolation, not by locking.
- Conditionally thread-safe: individual methods are safe, but compound operations need external synchronisation.
- Fully thread-safe: every method and every reasonable sequence of calls is safe under concurrent use.
- Lock-free: thread-safe AND no thread can be permanently blocked by another. Stronger than fully thread-safe.
Interview Follow-up Questions
What is the difference between thread-safe and lock-free?
Thread-safe means concurrent use does not corrupt state or produce wrong results. Lock-free is stronger: at least one thread always makes progress. A locked queue is thread-safe but not lock-free, because the lock holder can be preempted and block all callers.
Why is conditionally thread-safe so easy to misuse?
Because each method call looks safe, so engineers assume sequences of calls are safe too. They are not. The lesson is to always think about the compound operation, not the individual calls.
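A Go sketch of the conditionally-thread-safe trap using sync.Map: each method call is safe on its own, but a Load-then-Store sequence is a racy compound operation (the cache name and putIfAbsent helpers are illustrative, not a library API):

```go
package main

import (
	"fmt"
	"sync"
)

var cache sync.Map

// Racy: two goroutines can both see the key as missing and both Store,
// one overwriting the other. Each call is thread-safe; the sequence is not.
func putIfAbsentRacy(key, value string) {
	if _, ok := cache.Load(key); !ok {
		cache.Store(key, value)
	}
}

// Safe: LoadOrStore performs the whole check-then-act as one atomic call.
func putIfAbsent(key, value string) (string, bool) {
	actual, loaded := cache.LoadOrStore(key, value)
	return actual.(string), loaded
}

func main() {
	putIfAbsentRacy("a", "first")
	v, loaded := putIfAbsent("a", "second")
	fmt.Println(v, loaded) // "first" true: the existing value wins
}
```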
Is final / readonly enough for immutability?
Almost. The fields must be final, the references must point to immutable objects too (or defensive copies), and the constructor must not leak `this` before completion. Java's record types and Kotlin's data classes get this right by default.
How should the thread-safety level of a class be documented?
In the Javadoc / docstring, in one line: 'This class is immutable', 'fully thread-safe', 'not thread-safe', 'thread-confined to the EDT'. Library authors should be specific. Callers should not have to guess.
Mutex, Semaphore, Condition Variable
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
A mutex enforces 'one at a time'. A semaphore counts permits, N threads at a time. A condition variable lets threads wait for a state change. They're the three primitives every higher-level concurrency tool is built on.
Key Points
- Mutex = binary lock, mutual exclusion, ONE owner
- Semaphore = counted permits, allows N concurrent holders
- Condition variable = wait until a predicate becomes true, used WITH a mutex
- Mutex 'protects' invariants; semaphore 'limits' concurrency; condition variable 'signals' state changes
- Always wait on a condition variable in a while-loop, never if (spurious wakeups + stolen wakeups)
- Counting semaphore (N>1) for resource pools; binary semaphore (N=1) is 'almost a mutex' but lacks ownership
Interview Follow-up Questions
Mutex vs binary semaphore, what's the difference?
A mutex has ownership: only the thread that acquired can release. A binary semaphore is a counter capped at 1, any thread can release, even one that didn't acquire. Use mutex for mutual exclusion (most cases). Use semaphore for signaling (rare) or for resource pools (counting > 1).
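A sketch of a counting semaphore in Go, where a buffered channel holds the permits: a send acquires, a receive releases, and the capacity caps concurrency at N (the limit of 3 and the fake work are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const maxConcurrent = 3
	sem := make(chan struct{}, maxConcurrent)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a permit (blocks while N are out)
			defer func() { <-sem }() // release the permit

			fmt.Println("working:", id)
			time.Sleep(100 * time.Millisecond) // stand-in for real work
		}(i)
	}
	wg.Wait()
}
```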
Why pair a mutex AND a condition variable?
The condition variable's wait operation atomically: (a) releases the mutex and (b) parks the thread, so it can't miss a signal sent between the predicate check and the wait. Without the mutex, a check-then-wait race appears. They're a pair.
What's a spurious wakeup?
An OS or JVM may wake a parked thread without anyone calling signal/notify. The wakeup spec is 'at least once when notified', not 'exactly when notified.' That's why every wait MUST be in a while-loop that re-checks the predicate.
Difference between signal() and signalAll()?
signal/notify wakes ONE waiting thread (impl-defined which). signalAll/notifyAll wakes ALL of them. Use signal when only one thread can make progress (e.g., one slot freed in a buffer); use signalAll when state changed in a way that all waiters might benefit from.
Why does Go discourage condition variables?
Go's design philosophy is 'communicate, don't share.' Channels carry both data and synchronization in one operation, eliminating the mutex+condition pair for most cases. sync.Cond exists for the rare case (broadcast to many) but should be a last resort.
Gotchas
- Forgetting to release a lock on the error path → permanent deadlock. Always use defer/finally/with.
- Using if instead of while around await/wait → spurious + stolen wakeups corrupt invariants
- Locking and waiting on different lock instances, the code looks locked but isn't
- Calling signal() without holding the lock, the wait/check race is back
- Counting semaphore release without prior acquire → permit leak (raises in BoundedSemaphore)
- Mutex copy in Go (passing a struct by value), go vet catches it
Synchronisation vs Coordination
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Sometimes
Synchronisation protects shared state. Coordination orchestrates the order or timing of work. Mutexes synchronise. Channels, latches, barriers, futures coordinate. Mixing them up leads to overcomplicated code that locks where it should signal, and signals where it should lock.
Key Points
- Synchronisation: 'no two threads in this section at once'. Mutex, RWLock, atomic, synchronized.
- Coordination: 'do this after that'. Channel, latch, barrier, future, semaphore, condition variable.
- Misapplication: using a mutex to wait for an event (busy-loop with mutex). Or using a channel to protect a variable (overkill).
- Channels are coordination AND synchronisation in one move because the send happens-before the receive.
- When in doubt, ask what's actually needed. Mutual exclusion of access? Or sequencing of work?
Interview Follow-up Questions
Are these two things really different, or just different words for the same?
Different. Synchronisation prevents simultaneous access to data. Coordination orders events in time. A mutex synchronises but does not coordinate (it has no 'wait for X to finish' semantics). A latch coordinates but does not synchronise (it does not protect any data). Some primitives do both.
Is a semaphore synchronisation or coordination?
Coordination, mostly. With permits = 1 it can act like a mutex (binary semaphore), but the typical usage is rate-limiting or signal-style: 'release a permit when X happens, acquire one to wait for it'. The N-permit case is purely coordination of how many things can run.
Why are channels a popular choice in Go?
Because they bundle synchronisation (the send happens-before the receive) with coordination (the receiver waits until something is sent). One primitive, two jobs. Most Go programs end up needing both at the same point in the code, so the bundle is a clean fit.
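A minimal sketch of one channel operation doing both jobs: the send happens-before the receive (synchronisation), and the receiver waits for the result to exist (coordination). The report struct is illustrative:

```go
package main

import "fmt"

type report struct {
	rows int
	ok   bool
}

func main() {
	results := make(chan *report)

	go func() {
		r := &report{}
		r.rows = 42  // plain writes, no lock...
		r.ok = true
		results <- r // ...published safely: the send happens-before the receive
	}()

	r := <-results // waits until the worker is done AND sees its writes
	fmt.Println(r.rows, r.ok)
}
```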
When should one reach for a condition variable?
When shared state guarded by a lock requires threads to wait until that state matches a predicate. Producer-consumer with a bounded buffer is the canonical example. For 'wait for one event', a latch or Event is simpler. For 'protect this state' alone, a lock by itself is enough.
Optimistic vs Pessimistic Concurrency Control
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
Pessimistic: lock the data while working on it; nobody else can touch it. It pays a lock acquire on every operation but stays predictable when contention is high. Optimistic: read freely, do the work, commit only if nothing else changed (CAS, version number, MVCC). Nearly free when contention is low, requires retries under contention. Pick pessimistic when conflicts are common, optimistic when they're rare.
Key Points
- Pessimistic = 'assume conflict, prevent it' (locks). Optimistic = 'assume no conflict, detect it' (CAS, version checks).
- Pessimistic cost is constant per operation regardless of contention; optimistic cost is near-zero under low contention but grows (retries) under high contention.
- Optimistic concurrency uses one of three mechanisms: CAS on a memory word, a version column / ETag in a row, or MVCC snapshots in a database.
- The break-even point is the conflict rate. Roughly: under 10% conflict, optimistic wins. Above 50% conflict, pessimistic wins. In between, measure.
- Optimistic doesn't avoid contention, it relocates it: instead of waiting for the lock, the caller waits for a successful retry. The work done before the failed commit is wasted.
- Hybrid systems pick per-operation: PostgreSQL's MVCC is optimistic for reads (snapshots) and pessimistic for writes (row locks). HTTP If-Match uses optimistic versioning at the protocol layer.
Interview Follow-up Questions
How does one decide which to use?
Estimate the conflict rate. (1) Multiple users editing the same row: usually low (most rows are touched by one user at a time). Use optimistic. (2) Hot counter incremented by every request: very high. Use sharding (LongAdder, per-CPU counters), not either lock. (3) A connection pool's free-list: low when pool is sized right, high when pool is undersized. Use a small pessimistic lock around the list, but mostly fix the pool sizing. (4) Two cashiers on the same drawer: high. Use pessimistic. The break-even is roughly 10-20% conflict; below that optimistic wins on throughput, above that the retries hurt.
What about livelock with optimistic?
Real risk. Two threads can keep beating each other to the commit, each retrying, neither making progress. Defenses: backoff with jitter between retries, a bounded retry count with a fallback to pessimistic locking, or an atomic helper (incrementAndGet, addAndGet) that typically compiles down to a single hardware fetch-and-add and needs no retry loop at all. For SQL: detect the failed UPDATE, surface a 409 Conflict to the user, let them resolve.
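A Go sketch of the optimistic pattern with CAS: read, compute, commit only if nothing changed, and back off with jitter between retries to avoid livelock (assumes Go 1.19+ for atomic.Int64; the balance/withdraw names, retry cap, and amounts are illustrative):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

var balance atomic.Int64

// withdraw retries optimistically; it never holds a lock.
func withdraw(amount int64) bool {
	for attempt := 0; attempt < 10; attempt++ {
		current := balance.Load()
		if current < amount {
			return false // insufficient funds, no retry needed
		}
		if balance.CompareAndSwap(current, current-amount) {
			return true // nothing changed between Load and CAS: commit succeeded
		}
		// Someone else committed first: back off with jitter, then retry.
		time.Sleep(time.Duration(rand.Intn(100)) * time.Microsecond)
	}
	return false // bounded retries; a caller could fall back to a lock here
}

func main() {
	balance.Store(1000)
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); withdraw(10) }()
	}
	wg.Wait()
	fmt.Println("balance:", balance.Load()) // 0 if every withdrawal succeeded
}
```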
Why does MVCC fit here?
MVCC (PostgreSQL, Oracle, InnoDB) is the database scaling story for optimistic reads: every transaction reads a consistent snapshot at its start time, no locks needed, no contention with writers. Writes still take row locks (pessimistic), but reads scale linearly with cores. It's the optimistic answer to 'readers shouldn't block writers, writers shouldn't block readers.' The same pattern shows up in user-space: copy-on-write trees, RCU in the kernel, persistent data structures in functional languages.
How does this map to distributed systems?
etcd's compare-and-swap on a key (compare against expected version, set if matches) is optimistic. Distributed locks (Redlock, Zookeeper ephemeral nodes) are pessimistic. Sagas with compensating transactions are optimistic for the cross-service case (assume each step succeeds, compensate if not). Two-phase commit is pessimistic (lock everything, agree, commit). The cost calculus is the same: optimistic cheaper when conflicts are rare, pessimistic safer when they're common.
Are these always exclusive?
No. Most real systems combine both. PostgreSQL: MVCC reads + row locks for writes. Java StampedLock: optimistic read with fallback to read lock. JPA: @Version optimistic locking inside transactions that may also use row locks for explicit pessimistic operations. The right design often has both, applied per operation based on the expected conflict rate of that operation.
Gotchas
- Optimistic without retry: a single CAS failure becomes a user-visible error. Always handle the false return, either by retrying or surfacing a conflict.
- Pessimistic + user think time: holding a row lock across a UI session is a deadlock factory. Move locks to commit time (optimistic) or shrink the lock window.
- Mixing optimistic and pessimistic on the same data without coordination: an optimistic writer can win the race against a pessimistic reader's expected ordering.
- Versioning a row but never bumping the version on every write path: a code path that forgets to increment version silently breaks optimistic checks.
- Counting CAS failure as a 'lock' in profiles: it isn't; millions of failed CAS in an atomic counter means contention to fix, not lock-elimination work to do.
Deadlock, Livelock, Starvation
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
Deadlock, threads block each other forever. Livelock, threads are running but making no progress. Starvation, one thread never gets the resource it needs because others keep grabbing it first. All three look similar from the outside; the fix is different for each.
Key Points
- Deadlock = circular wait, Thread A holds X waits for Y, Thread B holds Y waits for X
- Coffman conditions: mutual exclusion + hold-and-wait + no preemption + circular wait
- Break ANY ONE of the four → no deadlock
- Lock ordering (always acquire locks in a fixed global order) is the most common deadlock prevention
- Livelock = threads keep retrying and stepping on each other, no blocking, no progress
- Starvation = unfair scheduling, a thread is permanently outrun by others
Interview Follow-up Questions
What are the four Coffman conditions for deadlock?
1) Mutual exclusion (resource can't be shared); 2) Hold-and-wait (thread holds one resource and waits for another); 3) No preemption (resources can't be forcibly taken); 4) Circular wait (a cycle in the wait-for graph). All four must hold simultaneously, break ANY one and deadlock is impossible.
What's the difference between deadlock and livelock?
In deadlock, threads are blocked, doing nothing, waiting forever. In livelock, threads are running and consuming CPU, but their actions cancel out and no progress is made. Livelock is harder to detect because the program looks 'busy.'
How is deadlock prevented in practice?
Most common: enforce a global lock acquisition order. Other tactics: tryLock with timeout (break no-preemption), open call (release the lock before calling out), structured concurrency (locks scoped to a function). The cheapest fix is usually 'don't hold two locks at once.'
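A Go sketch of breaking the circular-wait condition with a global lock order: transfers always lock the lower-ID account first, regardless of direction (the account struct and amounts are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type account struct {
	id      int
	mu      sync.Mutex
	balance int
}

func transfer(from, to *account, amount int) {
	// Fixed global order: lower ID first. A->B and B->A can no longer
	// each hold one lock while waiting for the other.
	first, second := from, to
	if second.id < first.id {
		first, second = second, first
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()

	from.balance -= amount
	to.balance += amount
}

func main() {
	a := &account{id: 1, balance: 100}
	b := &account{id: 2, balance: 100}

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); transfer(a, b, 1) }()
		wg.Add(1)
		go func() { defer wg.Done(); transfer(b, a, 1) }()
	}
	wg.Wait()
	fmt.Println(a.balance, b.balance) // 100 100, and no deadlock
}
```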
Why does Java's ReentrantLock have tryLock and Go's sync.Mutex doesn't (until recently)?
Go's design philosophy is that lock contention should be rare; reaching for tryLock often signals a design problem. Go 1.18 added TryLock anyway for cases that need a fallback code path. Use it sparingly, most uses indicate the lock is in the wrong place.
What's starvation?
When a thread perpetually fails to acquire the resource it needs because other threads keep getting there first. Common with priority schedulers (high-priority threads keep preempting low-priority ones) or unfair locks (newcomers grab the lock before queued waiters).
Gotchas
- Reentrant locks (Java ReentrantLock, Python RLock) prevent self-deadlock but NOT mutual deadlock between threads
- Holding a lock while calling user-supplied code (a callback, a remote service) is a deadlock waiting to happen, that code might try to take a held lock
- Go's runtime detects only 'all goroutines asleep' deadlocks, partial deadlocks (some goroutines stuck, others running) are silent
- Java's jstack and -XX:+PrintConcurrentLocks show deadlocked threads, use them in production hangs
- Database transactions can deadlock too; two transactions taking row locks (SELECT ... FOR UPDATE) in different orders is the SQL version
Spurious and Stolen Wakeups
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
A waiting thread can wake up without anyone calling notify (spurious), or be beaten to the lock by another thread that consumed the condition (stolen). The defence is to always check the condition in a while loop after wait returns, never if.
Key Points
- Spurious wakeup: the OS or runtime wakes a waiting thread for no reason. Allowed by every POSIX-shaped wait API.
- Stolen wakeup: someone signalled, the thread woke up, but another thread grabbed the resource before the first could re-acquire the lock.
- Both are why every pthread / Java / Python tutorial says 'wait inside while, never if.'
- wait/await releases the lock atomically, parks the thread, re-acquires the lock on wake. The re-check happens after re-acquire.
- Channels and semaphores avoid the issue because they have an integer count, not a free-form predicate.
Interview Follow-up Questions
Why are spurious wakeups even allowed?
POSIX condition variables document them as legal because real implementations sometimes find it cheaper or simpler to wake everyone than to track exactly who needs to wake. The rule of 'check in a loop' is part of the contract, so the runtime is free to wake conservatively.
Is signal vs signalAll about spurious wakeups?
No, that is a separate question. signal wakes one waiter, signalAll wakes all. Both can also produce stolen wakeups (another waiter beats the original to the lock). The while loop handles both.
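A Go sketch of the wait-in-a-loop rule with sync.Cond: the predicate is re-checked after every wakeup, which covers both spurious and stolen wakeups (the ready flag and sleep are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var (
		mu    sync.Mutex
		cond  = sync.NewCond(&mu)
		ready = false
	)

	go func() {
		time.Sleep(100 * time.Millisecond)
		mu.Lock()
		ready = true
		mu.Unlock()
		cond.Broadcast() // wake all waiters; each re-checks the predicate
	}()

	mu.Lock()
	for !ready { // for, never if: a wakeup is a hint, not a guarantee
		cond.Wait() // atomically releases mu, parks, re-acquires mu on wake
	}
	mu.Unlock()
	fmt.Println("ready")
}
```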
Do semaphores have the same issue?
No. A semaphore's acquire blocks until count > 0 and decrements atomically. There is no predicate to re-check, no race window between wake and acquire. Same for channel sends and receives.
What about asyncio.Condition.wait?
Same rule: wrap in `while not predicate: await cond.wait()`. asyncio's coroutines do not have the OS spurious-wakeup issue, but the stolen-wakeup case is still real. The while is correct in both.
Priority Inversion
Section: Core Mental Models | Difficulty: Advanced | Frequency: Sometimes
A high-priority thread waits on a lock held by a low-priority thread. A medium-priority thread runs, preempting the low-priority holder, which means the high-priority thread also waits indirectly. The fix is priority inheritance: bump the holder's priority to match the highest waiter.
Key Points
- Three threads, one lock. High waits on Low for the lock. Medium runs, preempting Low. High is blocked indefinitely.
- Famous case: Mars Pathfinder, 1997. The spacecraft kept resetting due to a priority inversion on a VxWorks mutex.
- Priority inheritance: while Low holds the lock, raise its priority to High's so Medium cannot preempt.
- Common in real-time systems, audio drivers, embedded code, kernel synchronisation. Less common in Java/Go/Python.
- JVM and most modern OS schedulers do not implement priority inheritance by default; problem solved by avoiding strict thread priorities in the first place.
Interview Follow-up Questions
What is priority inheritance?
A scheduling rule: while a thread holds a lock that a higher-priority thread is waiting for, the holder is temporarily promoted to that priority. The medium-priority thread can no longer preempt it. Used in real-time OS kernels (VxWorks, RT Linux, FreeRTOS).
Does Java do priority inheritance?
No. The JVM's thread priorities are advisory and map onto OS scheduler priorities, which usually do not implement inheritance. The practical advice for Java code is to avoid relying on thread priority for correctness.
How was the Mars Pathfinder bug fixed?
The team enabled priority inheritance on the VxWorks mutex used by the bus management task. NASA pushed a configuration change to the spacecraft remotely. After that, the high-priority task could no longer be indirectly blocked by medium-priority work.
When does this matter for normal application code?
Rarely. It comes up when thread priorities are set, the OS respects them strictly (real-time configurations), and locks are shared across priority levels. Most servers ignore priorities and the problem does not occur. The lesson generalises though: any time a fast operation can wait behind a slow one, the shape is inversion-like.
Memory Visibility & CPU Reordering
Section: Core Mental Models | Difficulty: Advanced | Frequency: Asked Often
Modern CPUs and compilers reorder reads and writes for performance. Without explicit synchronization, one thread's writes may NEVER become visible to another thread, or may appear in a different order. This is the deepest reason concurrent code breaks, and skipping it means writing code that "works on a laptop, fails in production."
Key Points
- Without a happens-before relationship between two threads, there is NO guarantee one thread's write is ever seen by the other
- CPUs have per-core caches, a write on Core 1 may sit in Core 1's cache forever, invisible to Core 2
- Compilers reorder instructions for optimization; CPUs reorder them for pipelining
- The 'obvious sequential order' of source code is not what runs on the hardware
- x86 has a strong memory model and hides most reordering bugs; ARM and POWER expose them
- ALL modern languages have a memory model, Java JMM, Go, C++11, even Python (informally)
- Synchronization primitives (volatile, atomic, mutex, channel) insert MEMORY BARRIERS that flush caches and prevent reordering
Interview Follow-up Questions
What's a memory barrier?
A CPU instruction (or compiler directive) that prevents reordering across it and flushes caches. Different barriers do different things: a 'release' barrier ensures prior writes are visible; an 'acquire' barrier ensures subsequent reads see those writes; a 'full' barrier does both. Java's volatile, Go's sync/atomic, and C++11 atomics all emit appropriate barriers.
Why does x86 hide so many reordering bugs?
x86 has a TSO (total store order) memory model, most reads and writes can't be reordered with each other (only store-load can). ARM and POWER have weaker models that allow more reordering. Code that 'works' on Intel laptops can fail on Apple Silicon, AWS Graviton, mobile devices, or any non-x86 production target.
What's the difference between visibility and ordering?
Visibility: when does a write become observable to another thread? Without synchronization, possibly never. Ordering: when multiple writes happen, in what order do other threads see them? Without synchronization, any order. Both are required for correct concurrent code; both are provided together by proper synchronization primitives.
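A Go sketch of the publication problem: without synchronization the reader might never see the write, or might see the flag before the data. The atomic flag supplies both visibility and ordering (assumes Go 1.19+ for atomic.Bool; the answer/done names are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	answer int
	done   atomic.Bool // a plain 'var done bool' here would be a data race
)

func producer() {
	answer = 42      // ordinary write...
	done.Store(true) // ...published by the atomic store
}

func main() {
	go producer()
	for !done.Load() { // atomic load pairs with the store above (demo spin loop)
	}
	fmt.Println(answer) // guaranteed to print 42, not 0
}
```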
Are 64-bit reads/writes atomic in Java?
Not always. The JLS guarantees atomicity of 32-bit ops; 64-bit long/double reads/writes can be torn on 32-bit JVMs unless declared volatile. On 64-bit JVMs and HotSpot in practice they're atomic, but the spec doesn't require it. Use volatile or AtomicLong to be safe.
If a Mutex is used everywhere, is reordering still a concern?
Inside the critical section, no, Mutex.acquire/release are full memory barriers. Between critical sections, yes, the lock only orders accesses that happen WHILE it is held. Common bug: reading shared state outside the lock 'because it's just a check.'
Gotchas
- Tests pass on x86 dev laptops, fail on ARM CI runners, TEST ON BOTH
- Compiler optimizations can hoist a volatile-less read out of a loop entirely → 'why is my thread spinning forever?'
- long/double on 32-bit JVMs aren't atomic without volatile, torn reads possible
- Java: writing to a field after publishing a reference to its containing object can be invisible to readers who don't synchronize
- Go: copying a sync.Mutex breaks it (go vet catches); copying a struct with atomic fields breaks them
- Python's GIL makes individual bytecode operations atomic, but compound operations can still interleave between bytecodes, so check-then-act races remain
Linearizability and Sequential Consistency
Section: Core Mental Models | Difficulty: Advanced | Frequency: Sometimes
Linearizability says every operation appears to take effect at one instant between its call and its return, and that instant respects real time across threads. Sequential consistency is weaker: each thread's operations stay in program order and there is some single global order, but it doesn't have to match wall-clock time. Linearizability composes across objects; sequential consistency does not.
Key Points
- Linearizability: each operation has a single 'effect point' that lies between its invocation and its response, and the order of effect points is consistent with real time.
- Sequential consistency: there exists some interleaving of all operations that respects each thread's program order, but the interleaving does not have to match real time.
- Linearizability is composable: if every object is linearizable, the system is linearizable. Sequential consistency is not.
- Java volatile reads and writes on a single variable are linearizable; combining multiple volatiles or doing read-modify-write is not. Use java.util.concurrent.atomic for linearizable RMW.
- Redis is not linearizable by default. etcd, ZooKeeper, and DynamoDB conditional writes are.
- Linearizability has a real cost: in an async system with failures, you can't have linearizable reads with sub-RTT latency (Attiya and Welch, 1994).
Interview Follow-up Questions
If linearizability is just sequential consistency plus a real-time constraint, why does the real-time part matter so much?
Two reasons. First, it is what client-visible 'happens after' means: if write A returned before read B started, B should see A. Without that, a client cannot trust its own ordering. Second, linearizability composes. If every object in a system is linearizable, the whole system is. Sequential consistency does not compose: a system built from two SC objects can produce histories that are not SC overall.
Is volatile in Java linearizable?
A single volatile read or write on a single variable is linearizable on its own. The write has one effect point (the store), the read has one (the load), and the JMM guarantees the global total order matches real time for that variable. The catch: combining multiple volatile fields or doing read-modify-write loses linearizability on the combination. That is what AtomicLong, AtomicReference, and CAS are for.
Is Redis linearizable?
Out of the box, no. Standalone Redis with a single primary is linearizable for reads from the primary, but the moment replication is added, replicas can return stale values. Redis Sentinel and Cluster do not give linearizable reads. Use WAIT for stronger guarantees, or pick a system designed for linearizability (etcd, Zookeeper, DynamoDB strong reads) when correctness depends on it.
Why do people say linearizability is expensive?
Two costs. Latency: linearizable reads in a replicated system require talking to a quorum (or the leader) to confirm freshness, so they cannot be served from a local replica. Availability: the CAP triangle says under partition the choice is consistency or availability. A linearizable system gives up availability during a partition. For systems where stale reads are OK, that price isn't worth paying.
Gotchas
- Composing two sequentially consistent objects can give a non-SC system. Composing two linearizable objects always gives a linearizable system.
- A linearizable register is what most papers mean by 'atomic register.' The terms are interchangeable in the literature.
- Don't conflate linearizability with serializability. Linearizability is about single-object real-time order. Serializability is about multi-operation transactions.
- A snapshot read across two unrelated linearizable objects is not itself linearizable. A transaction or a globally-ordered log is required for that.
Atomic Operations & CAS
Section: Core Mental Models | Difficulty: Advanced | Frequency: Asked Often
Atomic operations are indivisible CPU instructions that read-modify-write a single memory location safely, without locks. Compare-and-swap (CAS) is the foundation, "if this value is still X, replace it with Y", and it's how every lock-free data structure is built.
Key Points
- An atomic operation is a single CPU instruction that can't be interrupted partway through
- CAS = compare-and-swap: 'if memory[addr] == expected, write new and return success; else fail and return current value'
- Lock-free programming uses CAS in retry loops, never blocks, but may spin under contention
- Atomics give you single-variable safety; for multi-variable invariants you still need a lock
- ABA problem: a value goes A → B → A while you're not looking; CAS thinks nothing changed
- LongAdder/striped counters trade memory for throughput under high contention
Interview Follow-up Questions
What's the ABA problem?
A thread reads value A, plans to CAS A → C. Before the CAS, another thread changes A → B → A. Your CAS succeeds because the value is back to A, but the world changed in between. Common in lock-free linked lists where a node is removed and re-added. Solutions: tagged pointers (include a version counter), hazard pointers, or use a higher-level primitive.
When does a CAS retry loop become a busy-wait?
Under heavy contention, every CAS may fail and retry. CPU goes to 100%, no useful work happens. Mitigations: exponential backoff (sleep between retries), give up after N tries and use a lock fallback, or shard the contention (LongAdder). If you see a CAS loop dominating CPU profile, that's the symptom.
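A Go sketch of a classic CAS retry loop: atomically record a new maximum. A failed CAS means another goroutine won the race between the Load and the swap; re-read and try again (assumes Go 1.19+ for atomic.Int64; the highWater name is illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var highWater atomic.Int64

func recordMax(v int64) {
	for {
		current := highWater.Load()
		if v <= current {
			return // nothing to do
		}
		if highWater.CompareAndSwap(current, v) {
			return // we won the race
		}
		// CAS failed: another goroutine updated the value in between; retry.
	}
}

func main() {
	var wg sync.WaitGroup
	for i := int64(1); i <= 1000; i++ {
		wg.Add(1)
		go func(v int64) { defer wg.Done(); recordMax(v) }(i)
	}
	wg.Wait()
	fmt.Println(highWater.Load()) // 1000
}
```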
Are AtomicReference assignments atomic for compound operations?
AtomicReference.set() is atomic. compareAndSet is atomic. But '`if (ref.get() == X) { ref.set(Y) }`' is NOT atomic, between the get and set, another thread can race in. Use compareAndSet, or wrap in a lock.
Why is volatile not enough for counter++?
volatile makes the read and write each atomic and visible. But ++ is read-modify-write, three operations. Two threads can both read 5, both compute 6, both write 6, losing one update. AtomicInteger.incrementAndGet uses CAS to make the whole thing atomic.
How does CAS work at the CPU level?
x86 has the LOCK CMPXCHG instruction; ARM has LDREX/STREX or CAS (since ARMv8.1). The CPU asserts exclusive cache line ownership for the duration, performs the compare and conditional write, releases. Cost is roughly an L2 cache miss, fast but not free.
Gotchas
- ABA problem on linked-list CAS, value comes back to original after intermediate change
- Reading an atomic field non-atomically (`int x = atomic.Load(); other code; use x`), value can change between read and use
- AtomicInteger.incrementAndGet() is atomic; AtomicInteger.get() + 1 followed by set() is NOT
- Hot CAS loop on a single AtomicLong from many threads = CPU bottleneck, switch to LongAdder
- Java: AtomicInteger.lazySet vs set, lazySet doesn't establish happens-before, can lose visibility
- Go: the sync/atomic functions take a pointer to the variable (&counter); mixing atomic and plain access to the same variable, or copying a struct that contains atomic values, is silently wrong
False Sharing
Section: Core Mental Models | Difficulty: Advanced | Frequency: Sometimes
Two unrelated variables can sit on the same CPU cache line. When two threads write to them concurrently, the cache line ping-pongs between cores even though the variables are not actually shared. The code is correct but slow. Fix by padding to one variable per cache line.
Key Points
- CPU cache lines are typically 64 bytes on x86 and ARM (128 on Apple Silicon). The smallest unit of cache coherence is the line, not the variable.
- If two threads write variables in the same line, every write invalidates the other core's cache copy.
- Symptom: linear-looking code that does not scale with cores. Adding threads makes it slower.
- Fix: pad each hot variable to a full cache line, or use frameworks (LongAdder, padded atomics) that already do it.
- Most common in counters, queues, lock state, per-thread statistics.
Interview Follow-up Questions
Why does the cache work in 64-byte chunks even when only 8 bytes get written?
DRAM is slow. Caches amortise the cost by fetching whole lines when any byte in them is touched. The line is also the unit of the coherence protocol (MESI on x86). When one core writes, the protocol has to invalidate the line on every other core, regardless of which byte changed.
How is a false-sharing-affected program identified?
Two clues. First, throughput drops or even reverses as cores are added. Second, perf c2c on Linux shows cache-line modify traffic. To test a suspected hot field, run a quick benchmark with padding and see if it speeds up.
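A Go sketch of the padding fix, assuming 64-byte cache lines (the struct layout and the [56]byte filler are illustrative, not a library API):

```go
package main

import "sync/atomic"

// Unpadded: a and b likely share a cache line; two writer goroutines
// keep invalidating each other's copy even though the data is unrelated.
type counters struct {
	a atomic.Int64
	b atomic.Int64
}

// Padded: each counter is pushed onto its own 64-byte line
// (8-byte value + 56 bytes of padding).
type paddedCounters struct {
	a atomic.Int64
	_ [56]byte
	b atomic.Int64
	_ [56]byte
}

func main() {
	var c paddedCounters
	c.a.Add(1) // one goroutine hammering this field...
	c.b.Add(1) // ...and another hammering this one no longer share a line
}
```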
Does false sharing happen with reads only?
No. Reads can be served from each core's local cache copy, so concurrent reads are fine. The problem is writes: a write invalidates copies on every other core, and they have to refetch.
Why does LongAdder beat AtomicLong under contention?
Two reasons. It splits the counter into per-thread cells, so threads usually update different cells. And each cell is padded to its own cache line, so the cells do not false-share with each other. Together: less contention AND less coherence traffic.
Backpressure: Strategies and Signals
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Asked Often
When the producer is faster than the consumer, something has to give. Backpressure is how the system tells the producer to slow down (or what happens when it can't). Four strategies: block the producer, drop new work, shed load with errors, or propagate the slowdown back through the call chain. Picking the wrong one is how you turn a slow consumer into a memory blowup or a thundering retry storm.
Key Points
- Without backpressure, a slow consumer forces unbounded memory growth or work loss
- Block the producer (bounded queue blocks on full): natural backpressure, default for most pipelines
- Drop incoming (drop newest or drop oldest): for telemetry where stale data is worse than missing data
- Shed load (return 503 / fail fast): for request paths where the client can retry or degrade
- Propagate (the slow downstream signals the upstream caller): for end-to-end flow control across services
Interview Follow-up Questions
What happens without backpressure?
Two failure modes. Memory grows unbounded as the queue fills with work the consumer can't process; eventually OOM kills the process. Or work piles up and gets dropped silently when the queue overflows, which looks like a service that just loses messages with no error. Both are the kind of bug that's invisible until production.
Block, drop, or shed: how to choose?
Block when the producer has nothing better to do (batch jobs, ETL pipelines). Drop when stale data is worse than missing data (telemetry, sampled metrics, real-time logs). Shed when the client can retry or degrade (HTTP requests, RPC calls). Propagate when the upstream caller has a budget that can be respected (request handlers, gRPC chains).
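A Go sketch of two of these strategies on one bounded channel: a plain blocking send throttles the producer, while a select-with-default drops (and counts) instead. The buffer size, item counts, and sleep are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	queue := make(chan int, 8) // bounded: the buffer IS the backpressure
	dropped := 0

	// Slow consumer.
	go func() {
		for item := range queue {
			time.Sleep(10 * time.Millisecond)
			_ = item
		}
	}()

	for i := 0; i < 100; i++ {
		// Strategy 1 (block): queue <- i  // producer slows to the consumer's pace

		// Strategy 2 (drop newest): never block, but count what was lost.
		select {
		case queue <- i:
		default:
			dropped++
		}
	}
	close(queue)
	fmt.Println("dropped:", dropped) // alert on this number in production
}
```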
Why is propagation special?
It's the only strategy that handles cross-service overload gracefully. When a service is slow because its downstream is slow, propagating the deadline tells callers to give up too. They redirect, cache, or fail fast. The alternative (everyone blocks waiting on slow downstream) cascades the failure upward.
How does backpressure relate to the bulkhead pattern?
Bulkhead is one mechanism that produces backpressure. A bulkhead caps how many concurrent calls can be in flight to a downstream; when it's full, new callers fail fast (load shed). The bulkhead is the implementation, backpressure is the broader concept of 'how the system signals overload'.
Gotchas
- Unbounded queue = no backpressure = memory blowup under sustained overload
- Drop without metrics = silent data loss (count and alert on drops)
- Shed without Retry-After = client retries immediately, makes things worse
- Block in a request handler = thread starvation; combine with timeouts
- Propagate without timeouts = nothing actually propagates; in Go, ctx.Done() is the signal each layer must watch
The Colored Function Problem
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Sometimes
In some languages (Python, JavaScript, Rust), async functions and sync functions are different colours. Sync code can't call async without an event loop; async code can't call sync blocking calls without freezing the loop. The colour propagates upward through every caller. Go and Java 21+ virtual threads avoid the problem by making the runtime handle async transparently.
Key Points
- Python / JavaScript / Rust: async is a function colour. Async can call sync; sync cannot directly call async (needs a loop).
- Once async is added at any layer, every caller above must also be async, or pay an awkward bridge cost.
- Go avoids it: every function looks blocking; the runtime parks goroutines on I/O. No async/await syntax.
- Java 21+ virtual threads do the same: write blocking-style code; the JVM provides async-like efficiency.
- Cross-colour bridging exists (asyncio.to_thread, asyncio.run, asgiref.sync) but it's friction, not a clean solution.
Interview Follow-up Questions
Why did Python and JavaScript end up with colored functions?
Both have single-threaded runtimes (well, JS does; Python's GIL makes it effectively single-threaded for CPU work). Async/await was added later as a way to multiplex I/O on one thread without callbacks. The syntax made the colour explicit. In contrast, Go and Java were designed (or redesigned) with native multithreading; they could provide blocking-style code with multiplexing under the hood.
Is the colored function problem actually a problem?
It's friction, not a blocker. Working systems exist in all of these languages. The issue is that adopting async at one layer forces it on all callers above. Half-async codebases (some sync, some async) spend energy on bridges that don't add user value. Go and Java 21+ skip this entire category of work.
What about Kotlin coroutines?
Kotlin coroutines are colored too: a `suspend fun` can only be called from another suspend function or a coroutine builder. The colour is `suspend` instead of `async` but the propagation rule is the same. Kotlin's win is structured concurrency primitives (CoroutineScope, supervisorJob) that make the propagation manageable.
When would the explicit-colour approach be preferred?
When explicit visibility into what blocks vs what doesn't is the priority. The `async` keyword in Python/JS/Rust forces a call-site decision. Go and Java 21+ trade this visibility for ergonomics; the call site doesn't reveal whether the call will park the virtual thread. Both choices are defensible; pick based on what the team optimises for.
Gotchas
- Calling an async function in sync code without a loop returns a coroutine/Promise, not a value
- Calling sync blocking I/O in async code freezes the entire event loop
- Mixing thread-based code (threading.Lock) with asyncio is the worst of both worlds
- Adopting async at one layer forces async on every caller above (the 'colouring' propagates)
- Even Go and Java virtual threads can be pinned or starved by blocking native calls (cgo, JNI) that the runtime cannot park
When NOT to Use Concurrency
Section: Core Mental Models | Difficulty: Intermediate | Frequency: Sometimes
The best concurrent code is usually no concurrent code. Every added thread or task costs memory, debugging difficulty, and a chance for race conditions. Reach for concurrency only after measuring a real bottleneck where the workload's nature (I/O wait, CPU parallelism, latency hiding) actually benefits.
Key Points
- Concurrency is a cost, pay it only when the benefit is measured, not assumed
- Premature concurrency is worse than premature optimization, it can't be taken back without rewriting
- If the bottleneck is the database, adding threads in the app server doesn't help
- Under 1000 simultaneous users, async/await is rarely needed
- Single-threaded event-loop services (Redis, Node.js) prove single-threaded scales further than people think
- The ONE-CORE rule: if it can be solved on one core, do it on one core
Interview Follow-up Questions
When should code become concurrent?
When measurement shows it's I/O-bound with lots of concurrent operations (use threads or async), OR it's CPU-bound and the work parallelizes (use processes/goroutines). Don't add concurrency to make a single user request faster, that almost never works.
If concurrency is so risky, why does any production code use it?
Because at scale the single-threaded ceiling is real, Redis hits ~100K ops/sec on one core, then needs sharding. Most apps run a process per CPU core (gunicorn, Puma) and let the OS schedule them. The concurrency boundary is process-level, kept simple. Within each process: as little concurrency as possible.
What's the cheapest 'concurrency' that's actually free?
Running multiple processes (one per CPU). No shared state, no locks, OS handles scheduling. This is what most modern web stacks do, it's why Python web servers scale despite the GIL. The 'concurrency' is between requests/processes, not within them.
Is async worth using even for a small service?
Probably not, unless the framework is async-first (FastAPI, Sanic). The migration cost from sync to async is high (every library, every test, mental model change) and the benefit only kicks in at high concurrent connection counts.
When is concurrency NOT a choice, i.e., truly required?
GUI threads (must not block the UI), real-time systems (must meet deadlines), high-frequency trading (latency-bound), and serving > 10K concurrent connections (memory-bound on threads). For everything else, it's a choice, and often the wrong one.
Gotchas
- Threading.Thread() in a script that runs <1 minute → complexity cost paid for nothing
- asyncio.gather on 10 items → not enough concurrency to matter; just loop
- Adding sharding before measuring single-shard limits → premature, costs cohesion
- Concurrent code reads as 'sophisticated' in code reviews, bias toward simplicity
- 'It might scale' is not a reason to add concurrency, measure first
Producer-Consumer Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
One or more producers add work items to a bounded buffer; one or more consumers remove and process them. The buffer decouples their rates; the synchronization makes the buffer safe and prevents busy-waiting.
Key Points
- Bounded buffer: producers block when full, consumers block when empty
- Two condition variables (notFull, notEmpty), never one, or you risk lost wakeups
- Always wait in a while-loop, never if, to handle spurious and stolen wakeups
- In Go, a buffered channel IS the entire pattern, no extra synchronization needed
- In Python, queue.Queue is the canonical impl, don't roll your own
- Graceful shutdown: producer closes channel / sends sentinel; consumer detects and exits
Interview Follow-up Questions
Why use TWO conditions (notFull, notEmpty) instead of one?
With a single condition, signal() might wake the wrong type of waiter, e.g. a producer when the buffer is empty. notifyAll() would be required for correctness, which is wasteful.
Why must wait() be in a while loop, not if?
Spurious wakeups exist (JVM/OS can wake without signal) and stolen wakeups (another thread races in and changes state). Re-checking the condition is the only safe pattern.
In Go, who should close the channel?
The producer side. Never close from the consumer (sender doesn't know it's safe to send) and never close twice. Multi-producer: use sync.WaitGroup + a single closer goroutine.
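A Go sketch of the whole pattern: the buffered channel is the bounded buffer, and closing it from the producer side signals "no more items" (buffer size, item count, and two consumers are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int, 4) // bounded buffer: producer blocks when full

	// Producer: owns the channel, so it is the one that closes it.
	go func() {
		for i := 0; i < 10; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	// Two consumers: range exits cleanly once the channel is closed and drained.
	var wg sync.WaitGroup
	for c := 0; c < 2; c++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for job := range jobs {
				fmt.Printf("consumer %d got %d\n", id, job)
			}
		}(c)
	}
	wg.Wait()
}
```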
How is 'no more items' signalled in Java BlockingQueue?
Use a sentinel value (poison pill) or a separate completion flag; BlockingQueue has no built-in 'closed' state like Go channels.
How does Python's queue.Queue handle shutdown?
Same sentinel pattern as Java; put None on the queue, consumers check for it. With multiple consumers, re-put the sentinel after seeing it so the next consumer also sees it.
Gotchas
- Closing a Go channel twice → panic. Sending on a closed channel → panic.
- wg.Add() must run BEFORE 'go ...' or it races with wg.Wait()
- Spurious wakeups: never use 'if' instead of 'while' around await()
- BlockingQueue.put vs offer vs add: put blocks, offer returns false, add throws, pick the right one
- Forgetting to handle InterruptedException = thread silently lives on
- Python: forgetting to put the sentinel = consumer hangs forever
Worker Pool Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
A fixed pool of N worker threads/goroutines pulls jobs off a shared queue and processes them in parallel. It bounds resource usage (you can't run out of memory by spawning a thread per request) and keeps overhead constant under load.
Key Points
- Fixed N workers > thread-per-task: bounded memory, predictable overhead
- Workers loop: pull from queue, process, repeat, until queue closes
- Result delivery: separate result channel, Future objects, or callback
- Sizing: ~core count for CPU work, much higher for I/O, use Goetz's formula
- Always have a shutdown story: close queue → workers drain → join all
- Error propagation matters: one worker failing should signal the rest
Interview Follow-up Questions
How is a worker pool sized?
CPU-bound: pool ≈ core count. I/O-bound: Goetz's formula → cores × (1 + wait/compute). For 90% I/O work, that's ~10× cores. Empirically tune: ramp up workers, watch throughput plateau, back off slightly.
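A Go sketch of a fixed pool sized for CPU-bound work: N = core count workers pull jobs from one channel and push results to another, and shutdown is close the jobs channel, drain, wait (the squaring job is a stand-in for real work):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	jobs := make(chan int)
	results := make(chan int)

	numWorkers := runtime.NumCPU() // ~core count for CPU-bound work
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs { // loop until the jobs channel is closed
				results <- n * n // stand-in for real CPU work
			}
		}()
	}

	// Close results only after every worker has drained and exited.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Submit work, then signal "no more jobs".
	go func() {
		for i := 0; i < 20; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println("sum of squares:", sum)
}
```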
What's the right rejection policy when the queue is full?
Default in Java's ThreadPoolExecutor is to throw (RejectedExecutionException). Production-favorite: CallerRunsPolicy, the submitter does the work, providing natural backpressure. DiscardPolicy silently drops tasks (use only if dropping is acceptable, e.g. metrics). Never use unbounded queues in production.
Why is newCachedThreadPool() dangerous?
It uses a SynchronousQueue with no upper bound on threads. Under sustained load, every submission spawns a new thread, until the JVM hits the OS thread limit (typically a few thousand) and crashes. Stick to fixed-size pools.
Should each request get its own thread, or share a pool?
Share a pool. Thread-per-request is fine at small scale (a few hundred req/sec) and is what Java pre-21 servlet servers do, but it doesn't bound resource usage. Modern advice: virtual threads (Java 21+) for the request-per-thread model with effectively no upper bound; fixed pool for CPU-heavy work.
How are errors propagated from a worker pool?
Java: Future.get() throws ExecutionException wrapping the worker's exception. Python: future.exception() or .result() raises. Go: errgroup.Group.Wait() returns the first non-nil error. Custom pools need an error channel + a 'first failure cancels all' protocol.
Gotchas
- Java's newFixedThreadPool uses an unbounded LinkedBlockingQueue by default, bursts → OOM. Use ThreadPoolExecutor explicitly with ArrayBlockingQueue.
- Forgetting to call shutdown() on an ExecutorService → JVM doesn't exit
- Forgetting to close the work channel in Go → workers block forever waiting for more
- Calling .get() on a Future inside a worker → can deadlock if the pool is full and the worker is waiting for another task to complete
- Task throws an exception, future swallows it, always log future.exception() or wrap in try/catch
- ProcessPoolExecutor with closures or lambdas → pickle errors; use top-level functions
Pipeline Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
A pipeline chains stages where each stage transforms data and passes it to the next. Each stage runs concurrently; bounded queues between stages give backpressure for free. Dominant pattern for stream processing, ETL, and any 'producer → transform → transform → consumer' shape.
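A two-stage Go sketch of that shape (stage names and buffer sizes are illustrative): each stage owns its output channel, closes it when its input is exhausted, and the small bounded buffers are what give backpressure.

```go
package main

import "fmt"

// generate emits 1..n and closes its output when done.
func generate(n int) <-chan int {
	out := make(chan int, 4) // small bound: a slow downstream throttles us
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i
		}
	}()
	return out
}

// square is a middle stage: read from in, write to out, close out on exit.
func square(in <-chan int) <-chan int {
	out := make(chan int, 4)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

func main() {
	for v := range square(generate(5)) {
		fmt.Println(v) // 1 4 9 16 25
	}
}
```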
Key Points
- Each stage is a concurrent worker (or pool) reading from one queue, writing to the next.
- Bounded queues between stages give automatic backpressure: a slow stage throttles upstream.
- The slowest stage is the bottleneck. Throughput equals throughput of the slowest stage.
- Shutdown propagates downstream: close the input, the stage drains and closes the output.
- Errors are tricky: cancel the whole pipeline, or skip-and-log, or dead-letter the bad item. Pick a policy.
Interview Follow-up Questions
What if one stage is much slower than the others?
Two options. Scale that stage horizontally: run multiple workers reading from the same input queue. Or rebalance: do less work in that stage, more in faster ones. The bounded queue makes the slowness visible (it fills up, upstream blocks); tune from there.
How are errors propagated through a pipeline?
Three patterns. Cancel-on-error: every stage shares a context; first error cancels everything (errgroup-style). Skip-on-error: log the error, continue with the next item. Dead-letter: send the bad item to a separate error channel for later inspection. The right choice depends on whether each item is independent.
When to use a pipeline vs fan-out?
Pipeline: each item flows through stages in sequence (A → B → C → D). Fan-out: multiple workers process items independently. They compose: a pipeline stage can fan out internally (10 workers all reading from the input queue, all writing to the output). Use pipeline when there is a sequence; fan-out when there is parallelism within a stage.
Gotchas
- Forgetting to close the output channel leaves downstream goroutines blocked forever
- Unbounded queues between stages defeat backpressure; bound everything
- Errors in the middle of the pipeline can leave stages running with no input/output
- Slow consumer at the end backs up everything; monitor queue depths
Fan-Out / Fan-In
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Fan-out: distribute work to multiple workers concurrently. Fan-in: merge their results back into a single stream. The two together parallelise the middle of a pipeline. Standard shape: one input, N workers each doing the same transformation, one output combining results.
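A hedged Go sketch of that shape (names are illustrative): three workers read from one shared input channel (fan-out); merge starts one copier goroutine per worker output and closes the combined channel when all of them finish (fan-in).

```go
package main

import (
	"fmt"
	"sync"
)

// worker fans out: reads from the shared input, writes to its own output.
func worker(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

// merge fans in: one copier goroutine per input, close output when all are done.
func merge(chans ...<-chan int) <-chan int {
	out := make(chan int)
	var wg sync.WaitGroup
	for _, c := range chans {
		wg.Add(1)
		go func(c <-chan int) {
			defer wg.Done()
			for v := range c {
				out <- v
			}
		}(c)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	in := make(chan int)
	go func() {
		defer close(in)
		for i := 1; i <= 9; i++ {
			in <- i
		}
	}()
	// Three workers share the input; result order is not preserved.
	for v := range merge(worker(in), worker(in), worker(in)) {
		fmt.Println(v)
	}
}
```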
Key Points
- Fan-out: one input channel/queue feeds N workers. Each worker reads independently.
- Fan-in: N output channels merge into one. With channels, this is N goroutines copying to a shared output channel.
- Order is not preserved by default; preserving it requires a sequence number and a re-sort downstream.
- Worker count = degree of parallelism. Cap it (errgroup.SetLimit, semaphore, ThreadPoolExecutor's max_workers).
- Combine with retry, circuit breaker, timeout for production-grade fan-out.
Interview Follow-up Questions
How many workers should run?
For CPU-bound work, equal to or slightly less than core count. For I/O-bound work, much higher (50, 100, 500), bounded by the downstream's concurrency limit. The wrong default is 'as many as possible'; that exhausts file descriptors, overwhelms the downstream, and yields nothing. Measure throughput at different worker counts and find the knee.
How is order preserved?
Tag each input with a sequence number. Fan out. In the consumer, re-sort by sequence. Or use a single-threaded consumer with a min-heap to stream in order. Or use a partitioned approach where each partition is single-threaded but partitions are parallel.
What if some workers are much slower than others?
With a shared input channel, work-stealing happens automatically: fast workers grab more items. With per-worker input queues, balancing has to be manual. Default to shared input unless there is a specific reason for per-worker queues (cache locality, stateful workers).
How are errors propagated?
Two patterns. Cancel-on-first-error: shared context, first error cancels the rest (errgroup, asyncio.TaskGroup). Or collect errors and continue: each worker logs/returns its error, the consumer decides. Cancel-on-first is the default for transactional fan-out; collect-and-continue for batch processing where partial success is fine.
Gotchas
- Forgetting to close the output channel leaves the consumer blocked forever
- Unbounded worker count exhausts FDs and overwhelms downstream
- Order is lost unless explicitly preserved with sequence numbers
- One slow worker doesn't slow others if input is shared, but slows the overall completion
- Errors in one worker can leak resources if not propagated through cancellation
Futures and Promises
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
A future is a placeholder for a value that will be available later. A promise is the producer side that fulfils the future. The pair allows starting work asynchronously and consuming the result with backpressure (the consumer waits if not ready). Foundational primitive in async programming.
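Go has no built-in future type, but a one-shot channel plays both roles: the goroutine holding the send side is the promise, the returned receive-only channel is the future. A minimal sketch under that framing (the names and sleep are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

type result struct {
	value string
	err   error
}

// fetch starts work and immediately returns a "future": a channel that will
// carry exactly one result. The spawned goroutine is the "promise" side.
func fetch(url string) <-chan result {
	out := make(chan result, 1) // buffered so the producer never blocks
	go func() {
		time.Sleep(100 * time.Millisecond) // stand-in for real I/O
		out <- result{value: "response for " + url}
	}()
	return out
}

func main() {
	f := fetch("https://example.com") // work starts now
	// ... do other things concurrently ...
	r := <-f // consume the future: blocks only if the result isn't ready yet
	fmt.Println(r.value, r.err)
}
```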
Key Points
- Future = consumer-side handle (read result, wait, attach callbacks). Promise = producer-side handle (set value or error).
- Many languages have only one type that does both jobs: CompletableFuture (Java), Future (Python), Promise (JS), Future (Rust).
- Composition: chain futures (then/map/flatMap), wait for all (allOf/gather), wait for any (anyOf).
- Cancellation is the hard part. Many implementations support cancelling the future, but stopping the actual work depends on the producer cooperating.
- Don't block in a future-returning function. The whole point is async; calling .get() inside an async function defeats the purpose.
Interview Follow-up Questions
What is the difference between Future and Promise?
Future is the read side: the consumer reads the value when ready. Promise is the write side: the producer fulfils the future. JavaScript and Scala keep them as one Promise type that does both. Java keeps them separate (Future read-only, CompletableFuture both). C++ has std::future and std::promise as a clear pair. Conceptually they are two ends of the same thing.
How is a future cancelled?
Depends on the implementation. CompletableFuture.cancel(true) marks it cancelled and any caller of .get() throws CancellationException, but the underlying work may keep running unless it cooperates (checks isCancelled, responds to interrupt). asyncio.Task.cancel raises CancelledError at the next await. Channels in Go don't have built-in cancellation; pass a context.
Why can chaining futures with then/map be cleaner than awaiting each?
For a sequence of dependent operations (A → B → C), the chained version reads as a flat pipeline and can run continuations on a different executor than the source. The await/blocking version is also fine; it depends on whether dataflow style or imperative style is preferred. Both end up with the same semantics.
What is a 'completed' future?
A future that already has a value or an error. Useful for testing, for known-static results, or as the start of a chain. CompletableFuture.completedFuture(x) in Java, or an asyncio.Future whose set_result(x) has already been called, gives you one immediately.
Gotchas
- Blocking on .get() inside async code defeats the purpose; chain or await instead
- Cancellation marks the future cancelled but may not stop the underlying work
- Java CompletableFuture's default executor (ForkJoinPool.commonPool) is shared; specify an executor for I/O-bound chains
- Composing futures with shared mutable state still needs synchronisation
- exceptionally / catch resets the error chain; subsequent stages run on the recovered value, not the error
Reactor Pattern and the Event Loop
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
One thread, one loop. The loop asks the kernel which file descriptors are ready, fires the registered handlers for each, and goes back to the kernel. No threads block. This single design powers Node.js, Python asyncio, Netty, Tokio, libuv, nginx, and the Go netpoller. It is the answer to 'how do I serve 100K connections without 100K threads?'
Key Points
- Three pieces: a demultiplexer (epoll, kqueue, IOCP, io_uring), a queue of handlers keyed by FD, and a loop that drives them.
- Loop body: block in epoll_wait, get a list of ready FDs, run each FD's handler to completion, then loop again.
- Handlers must be non-blocking. Any blocking call (sync I/O, sleep, CPU-heavy work) freezes the entire loop and stalls every other connection.
- CPU-heavy work goes to a worker pool (Node.js worker_threads, Python's run_in_executor); the result comes back as a callback or future on the loop.
- Single-threaded by design. Removes data races inside the handler. Costs multi-core scaling unless multiple loops run (one per core, share-nothing).
- The 'colored functions' problem comes from this pattern: handlers and the things they call must agree on the same async ABI (callback, promise, async/await).
Interview Follow-up Questions
Why is the reactor single-threaded?
Two reasons. Simplicity: handlers run sequentially in one thread, so locks are never needed for state inside the loop. Cache: one thread on one core keeps L1/L2 hot for the loop and its scheduler state. Cost: a single loop cannot use multiple cores. The standard solution is to run one loop per core (share-nothing); each loop owns a slice of the work, and they coordinate over message queues if they have to.
What's a proactor and how is it different?
Reactor (epoll, kqueue): the kernel signals 'this FD is ready'; the caller then does the I/O. Proactor (IOCP on Windows, io_uring on Linux): the caller submits the I/O to the kernel and the kernel notifies when it's done, with the data already in the buffer. Proactor saves a syscall per operation and scales better at very high rates, which is why io_uring is widely seen as the future on Linux. asyncio's ProactorEventLoop on Windows uses IOCP; libuv added io_uring as an opt-in backend (epoll is still the default).
Why don't goroutines have this problem?
They do have the same machinery, just hidden. The Go netpoller is a reactor: every blocking network call parks the goroutine on the netpoller, the kernel's epoll readiness wakes it. The difference is that Go schedules goroutines preemptively and runs them on multiple OS threads (M's), so the 'CPU-heavy handler stalls everything' bug doesn't happen at the language level; one slow goroutine just keeps its M busy while others run elsewhere. Conceptually it's a multi-loop reactor with the loop hidden in the runtime.
How is a reactor scaled across cores?
Three patterns. (1) One loop per core, share-nothing: nginx, Tokio in 'multi-thread' mode, Vert.x verticles. Each loop owns its own connections; cross-loop communication is via channels. (2) Single loop + worker pool: Node.js. The loop dispatches CPU work to worker_threads and gets results as callbacks. (3) Acceptor + worker loops: Netty's EventLoopGroup. One loop accepts, multiple loops handle. Pick (1) for max throughput, (2) for ergonomic single-threaded code, (3) for fine-grained control.
Where does this break down?
When most of the work is CPU-bound. A reactor is optimal for many connections that are mostly idle (chat, dashboards, gateways). It's overkill or actively bad with few connections doing heavy compute (machine learning serving, video transcoding); a thread-per-request model is simpler and uses cores naturally. Match the model to the workload.
Gotchas
- A blocking call inside a handler freezes the loop. This includes sleep(), input(), requests.get(), psycopg2 queries, and most non-async libraries.
- Forgetting to await an async function: it returns a coroutine object that does nothing. Linters catch this; reviews don't always.
- The loop runs callbacks in the order they're scheduled, not the order their I/O completed. Never assume FIFO across handlers if work is queued via call_soon.
- Exceptions in handlers can crash the loop in some libraries; in asyncio they get logged and lost unless the task is awaited. Always check for unawaited tasks in production.
Async vs Blocking I/O
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Blocking I/O: one thread per concurrent operation, thread sleeps during I/O wait. Async I/O: one thread handles many operations via an event loop, syscalls are non-blocking with callbacks/coroutines. Async wins for many concurrent connections; blocking wins for few but heavy ones, or when the language doesn't have great async support.
Key Points
- Blocking model: thread per request. Each thread has memory cost (pthreads default ~8 MB virtual / a few hundred KB committed; JVM default 512KB-1MB), thread-create cost, context switch cost.
- Async model: event loop reads ready I/O events, dispatches to coroutines. One thread can multiplex thousands of connections.
- Async needs the entire stack to be async: blocking call inside an async function freezes the loop.
- Choose based on workload: many connections + small per-connection work = async. Few connections + heavy work = threads.
- Virtual threads (Java 21+, Go goroutines) blur the line: write blocking-style code, runtime provides async-like efficiency.
Interview Follow-up Questions
Why is async usually faster for I/O?
Each blocking thread has fixed memory cost (pthreads default 8 MB virtual on Linux; JVM defaults to 512KB-1MB; only the touched stack pages are physically committed) and OS scheduling cost. 10,000 concurrent connections need 10,000 threads, even at JVM defaults that's ~5-10 GB of address space. Async uses one thread (or a small pool) and an event loop; the event loop costs are amortised. For pure I/O workloads, async services routinely handle 10x to 100x more concurrent connections per server.
When is blocking actually fine?
When concurrency is low (a CLI tool, a CRON job, a low-traffic admin endpoint). When the per-request work is heavy and CPU-bound (the I/O is small relative to compute). When the language ecosystem doesn't have good async support (synchronous languages without coroutines). For most internal tooling, blocking is simpler and the cost doesn't matter.
Are virtual threads (Java) and goroutines (Go) async or blocking?
Both, in a sense. The programmer writes blocking-style code (call, wait for result, return). The runtime provides async-like efficiency by parking the virtual thread/goroutine during I/O and scheduling another on the same OS thread. The result is async throughput with blocking syntax. Mostly a clear win; the trade-off is some loss of explicit control.
What is the 'colored function' problem in async languages?
In Python and JavaScript, async functions can only be called from async contexts (with await). Sync functions cannot directly call async ones (they get a coroutine object back). This 'colors' the codebase: once async is adopted at one layer, it propagates upward through every caller. Go and Java's virtual threads avoid this by making the runtime handle the async-ness without language-level coloring.
Gotchas
- Calling blocking code from async freezes the event loop
- Mixing thread-based libraries (threading.Lock) with asyncio races horribly
- Per-thread memory cost adds up: 10K threads = ~10 GB just for stack address space (mostly virtual)
- Virtual threads inherit thread-locals, but stack traces look different from platform threads
- Async stack traces are harder to read (the event loop is in the middle)
Retry with Exponential Backoff and Jitter
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Retry transient failures with delays that double each attempt (exponential backoff). Add jitter (randomness) so a thousand clients don't synchronise their retries and DDoS the recovering server. Cap the total attempts and the total elapsed time. Only retry idempotent operations.
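A hedged Go sketch of that policy: exponential delay with full jitter, a max-attempts cap, and a context deadline bounding total elapsed time. Error classification (transient vs permanent) is left to the caller; all names and values are illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op with full-jitter exponential backoff: the delay is a random
// value in [0, base*2^attempt), capped by maxAttempts and by ctx's deadline.
func retry(ctx context.Context, maxAttempts int, base time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << attempt                          // base * 2^attempt
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done(): // total-time bound
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	calls := 0
	err := retry(ctx, 5, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure") // stand-in for a 5xx/timeout
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```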
Key Points
- Exponential backoff: delay = base * 2^attempt. Avoids hammering the failing service.
- Jitter is required, not optional. Without it, retries synchronise into a thundering herd.
- Bound retries: max attempts AND max total elapsed time. Both, not either.
- Only retry transient errors (5xx, timeouts, connection refused). Never retry 4xx (client errors).
- Only retry idempotent operations. POST is dangerous; GET, PUT, DELETE are usually safe.
Interview Follow-up Questions
Why is jitter important?
Without jitter, all clients that started failing at roughly the same time retry at exactly the same intervals. When the downstream recovers, every retry hits at once and overwhelms it again. Jitter (random delay) spreads the retries over the interval, letting the downstream absorb them gradually. Standard practice in distributed systems.
Full jitter vs decorrelated jitter?
Full jitter: delay = random(0, base * 2^attempt). Simple, works well. Decorrelated jitter: delay = random(base, prev_delay * 3), where prev_delay is the previous attempt's delay. Better at avoiding correlation across clients but more complex. For most workloads, full jitter is enough.
When NOT to retry?
Three cases. Permanent errors (4xx HTTP, validation failures, auth failures): retrying never helps. Non-idempotent operations (POST without an idempotency key): retry can charge the customer twice. Operations approaching a hard deadline: a retry that takes another 5s when the request budget is 1s is wasted work. Always check the operation type and the remaining budget.
How is a transient error identified?
5xx HTTP responses, network timeouts, connection refused, DNS failures are usually transient. 4xx HTTP, 'invalid input', 'permission denied' are permanent. For databases, deadlocks are transient (retry); constraint violations are not. The library should support per-error classification; avoid blanket-retrying all exceptions.
Gotchas
- No jitter = thundering herd when downstream recovers
- Retrying non-idempotent operations (POST without idempotency key) double-charges users
- No total-time bound = retries last longer than the original request budget
- Retrying 4xx errors wastes everyone's time
- Retry inside a retry (caller retry plus the HTTP client's) creates exponential blowup
Circuit Breaker Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
Stop calling a downstream that is failing. After N failures in a window, the breaker opens and immediately fails subsequent calls without hitting the downstream. After a cooldown, it allows a probe; if successful, close again. Prevents one slow downstream from cascading into the calling service.
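A minimal Go sketch of the three states, assuming consecutive-failure counting for brevity (production breakers use a sliding window and admit exactly one half-open probe; this simplified version may let concurrent probes through). Names and thresholds are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open")

type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
	open        bool
}

func (b *Breaker) Call(op func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // fail fast, don't touch the downstream
		}
		// Cooldown elapsed: half-open, let a probe call through.
	}
	b.mu.Unlock()

	err := op()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.open = true // trip (or re-open after a failed probe)
			b.openedAt = time.Now()
		}
		return err
	}
	// Success (including a half-open probe): close the breaker.
	b.failures = 0
	b.open = false
	return nil
}

func main() {
	b := &Breaker{maxFailures: 3, cooldown: time.Second}
	for i := 0; i < 5; i++ {
		err := b.Call(func() error { return errors.New("downstream down") })
		fmt.Println(i, err) // calls 3 and 4 fail fast with "circuit open"
	}
}
```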
Key Points
- Three states: closed (normal), open (fail fast), half-open (probe single call).
- Trip on failure rate or absolute count over a sliding window. 'X failures per minute' is more useful than 'X consecutive failures'.
- Open state has a cooldown (typically 30s-5min). After cooldown, breaker is half-open: allows ONE probe call.
- Without a circuit breaker, a slow downstream ties up threads/goroutines waiting on timeouts; the caller slows down trying to call something that won't respond.
- Per-downstream breakers, not global. One slow service shouldn't block calls to a healthy one.
Interview Follow-up Questions
Circuit breaker vs retry: are they competing?
Complementary. Retry handles transient blips: 'this one call might work next time'. Circuit breaker handles sustained failure: 'this downstream is down, stop calling it'. Combine: retry within a single call, circuit breaker across calls. The retry hits the breaker first; if the breaker is open, the call fails fast without the retry attempts running.
How is the failure threshold picked?
Two parameters: failure rate (e.g., 50%) and minimum calls (e.g., 10). The minimum-calls floor prevents tripping on a tiny sample (one failure out of one is 100% failure rate but means nothing). Sliding-window size controls how reactive: smaller window trips faster, more false positives; larger window is slower to trip and slower to recover.
Why per-downstream and not global?
Different downstreams have different reliability. Service A being slow shouldn't stop calls to service B. Per-downstream breakers also allow tuning thresholds per downstream (a payment gateway gets a stricter breaker than a logging service).
What does 'fail fast' actually save?
Two things. Threads/goroutines/connections that would have been tied up waiting for the downstream's timeout. And the downstream's load: not piling on more requests while it is trying to recover. The first is critical: with 100 threads and a downstream that takes 30s to time out, 100 calls per minute is enough to exhaust all threads.
Gotchas
- Single global breaker for all downstreams: one bad service blocks everything
- No probe in half-open state: breaker stays open forever
- Counting client errors (4xx) as breaker failures: trips on user mistakes, not service health
- Threshold too low: trips on benign blips. Too high: trips too late, after damage is done
- Forgetting fallback when breaker opens: callers see CircuitOpenError instead of degraded response
Bulkhead Pattern
Section: Core Patterns | Difficulty: Intermediate | Frequency: Sometimes
Isolate resources so a failure in one part of the system doesn't drain everything. Separate thread pool, connection pool, or rate limit per downstream so one slow downstream can't tie up resources needed by healthy ones. Named after ship hulls divided into compartments to prevent total flooding.
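A hedged Go sketch of a semaphore-style bulkhead per downstream: a buffered channel caps in-flight calls, and callers that can't acquire a slot immediately fail fast instead of queueing. Sizes and names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead caps concurrent calls to one downstream with a buffered channel.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(maxInFlight int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxInFlight)}
}

func (b *Bulkhead) Do(op func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
	default:
		return ErrBulkheadFull // fail fast instead of piling up
	}
	defer func() { <-b.slots }() // deferred release runs even if op panics
	return op()
}

func main() {
	payments := NewBulkhead(2) // per-downstream budget
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			err := payments.Do(func() error {
				time.Sleep(100 * time.Millisecond) // slow downstream call
				return nil
			})
			fmt.Println("call", i, "->", err)
		}(i)
	}
	wg.Wait()
}
```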
Key Points
- Each downstream gets its own resource budget (threads, connections, semaphore permits).
- When one downstream is slow, only its budget exhausts. Calls to other downstreams keep flowing.
- Without bulkheads, a single thread pool serving all downstreams gets fully tied up by the slow one. Latency cascades.
- Pair with circuit breaker (fail-fast) and timeout (don't wait forever) for full resilience.
- Cost: more threads/connections to manage. The win is bounded blast radius.
Interview Follow-up Questions
Bulkhead vs circuit breaker?
Different problems. Circuit breaker: stop calling a downstream that is failing. Bulkhead: limit concurrent calls so a slow downstream can't drain shared resources. They pair: bulkhead caps in-flight calls; if those calls are slow and the bulkhead fills up, callers fail fast (similar to circuit-breaker open). For full protection, use both.
How big should each bulkhead be?
Sized to the downstream's actual capacity, with headroom. If the downstream can handle 100 req/sec at 200ms latency, in-flight is 100 * 0.2 = 20 calls. Set the bulkhead a bit higher (30) to absorb bursts, but not so high that the calling service ties up resources when the downstream slows.
Why not just use timeouts?
Timeouts release the resource (thread, connection) only after the timeout fires. With a 30-second timeout and a slow downstream, threads stay busy for 30 seconds each. With a bulkhead, in-flight is capped from the start; new callers either wait briefly or fail fast. Timeouts AND bulkhead together: bulkhead caps in-flight, timeout caps per-call duration.
What about thread pool isolation specifically?
ThreadPoolBulkhead provides a dedicated pool per downstream. Stronger isolation than semaphore (failure can't even reach the shared pool's threads), but more memory per downstream. Use semaphore bulkheads as default; reach for thread-pool bulkheads when stronger isolation is required (the downstream call has potentially leaky resource usage).
Gotchas
- One shared pool with no per-downstream cap = slow downstream blocks all calls
- Bulkhead too small = false rejections under normal traffic
- Bulkhead too big = downstream still gets overloaded; bulkhead provided no protection
- Forgetting to release the semaphore on exception leaks permits permanently
- Bulkhead without timeout = each in-flight call can still hang forever; combine them
Idempotency
Section: Core Patterns | Difficulty: Intermediate | Frequency: Asked Often
An idempotent operation produces the same result whether called once or many times. Critical for retries: if a charge call is idempotent, a network blip that triggers a retry doesn't double-charge the customer. Achieved via idempotency keys (client-supplied unique IDs) or natural idempotence (PUT, DELETE, set operations).
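A hedged in-memory sketch of server-side deduplication on an idempotency key: the first request executes and caches its response; a retry with the same key replays the cached response instead of re-executing. Persistence, TTL, per-key locking, and request fingerprinting are omitted; names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu   sync.Mutex
	seen map[string]string // idempotency key -> cached response
}

// Handle executes op at most once per key and replays the cached response on
// retries. Holding one lock across op is a simplification; real systems lock
// per key and persist the (key, response) pair with a retention window.
func (s *store) Handle(key string, op func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if resp, ok := s.seen[key]; ok {
		return resp // duplicate: replay, don't re-execute
	}
	resp := op()
	s.seen[key] = resp
	return resp
}

func main() {
	s := &store{seen: make(map[string]string)}
	charge := func() string { return "charged $10, receipt #1" }

	fmt.Println(s.Handle("key-abc", charge)) // executes
	fmt.Println(s.Handle("key-abc", charge)) // replayed, no second charge
}
```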
Key Points
- Idempotent operations: GET, PUT (replace), DELETE, set-x-to-y. Calling twice == calling once.
- NON-idempotent: POST that creates, INSERT, charge. Calling twice produces two records or two charges.
- Idempotency key: client generates a unique ID per logical operation, server deduplicates on it.
- Server stores 'this key was seen, here is the response' for some retention window.
- Without idempotency, retries can double-charge, double-create, double-send. With it, retries are safe.
Interview Follow-up Questions
Why is idempotency a concurrency topic?
Because retries are concurrency. In any network or distributed system, requests can be received twice (client retried, network duplicated, load balancer retried, the response was lost on the way back). Without idempotency, those duplicate receives have duplicate effects. Idempotency makes the system tolerant of these everyday concurrent events.
How long should idempotency keys be stored?
At least as long as the maximum retry window. For client-driven retries, hours to a day. For batch/queue-driven workflows that may retry over days, keep them longer. Common choice: 24 hours for sync APIs, 7 days for async/queue workflows. Storage cost is small (one key + cached response per request).
What happens if two clients send the same idempotency key for different operations?
Tricky. The server has to decide: is the second request actually the same as the first (fingerprint matches), or is it a key collision (different request with the same key)? Stripe-style policy: if the request body differs from the cached one, return 409 Conflict. If they match, return the cached response. Always include enough of the request in the dedup check to detect collisions.
Are GET requests idempotent?
By the HTTP spec, yes. GET should not modify server state; calling twice is the same as calling once. Servers that implement GET correctly are idempotent for free. The idempotency-key pattern is for the operations that aren't naturally idempotent (POST that creates, charge, send).
Gotchas
- Generating a new key per retry defeats deduplication; one logical operation = one key
- Storing the key without storing the response means retries fail to replay the original outcome
- Not handling concurrent requests with the same key (need a lock) leads to double-execution
- Treating different requests with the same key as duplicates leads to wrong responses
- POST/INSERT without idempotency keys is a bug waiting to happen under network failures
Strict Alternating Output
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Two threads must alternate strictly, A then B then A then B, n times. The canonical solution uses a pair of semaphores: thread A starts with the 'A's turn' permit and releases B's permit after each step (and vice versa). Same pattern with wait/notify, asyncio.Event, or Go channels.
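A hedged Go version of the token-passing idea, using two one-slot channels in place of semaphores (n and the labels are illustrative): each goroutine waits for its own token, prints, and hands the other goroutine its token.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 5
	turnA := make(chan struct{}, 1) // A's turn-token (starts with the permit)
	turnB := make(chan struct{}, 1) // B's turn-token (starts empty)
	turnA <- struct{}{}

	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // thread A
		defer wg.Done()
		for i := 0; i < n; i++ {
			<-turnA // wait for A's turn
			fmt.Print("A")
			turnB <- struct{}{} // hand the token to B
		}
	}()

	go func() { // thread B
		defer wg.Done()
		for i := 0; i < n; i++ {
			<-turnB
			fmt.Print("B")
			turnA <- struct{}{}
		}
	}()

	wg.Wait()
	fmt.Println() // prints ABABABABAB
}
```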
Key Points
- Two semaphores, started with permits 1 and 0 respectively → enforces order
- Each thread acquires its semaphore, prints, then releases the OTHER's semaphore
- Equivalent patterns: wait/notify pair, two condition variables, two channels
- Generalizes to N-thread round-robin (next lesson)
Interview Follow-up Questions
Why two semaphores, not one?
One semaphore can't enforce *which* thread runs next, it only enforces 'one at a time.' Two semaphores form a token-passing protocol: each thread waits for its specific turn-token before printing and hands the other token forward.
Will signal() vs signalAll() matter here?
Either works for two threads, since there's only ever one waiter. With more threads, signalAll is safer (avoids spurious-wakeup edge cases) but wastes one wakeup. The semaphore pattern sidesteps the question entirely, semaphores are 1:1 by construction.
How does this generalize to N threads round-robin?
An array of N semaphores: thread[i] acquires sem[i], runs, releases sem[(i+1) % N]. The next lesson covers Print Zero Even Odd, which is N=3 with conditional turns.
Three-Way Round-Robin Output
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Three threads must produce 0102030405...: Zero prints a zero before every number, and the Odd or Even thread prints the number itself based on parity. Three semaphores form a token-passing relay: Zero hands to Odd or Even based on parity; the parity thread hands back to Zero.
Key Points
- Three semaphores: zeroSem (starts 1), oddSem (0), evenSem (0)
- Zero runs before every number; it decides 'who's next' from the counter's parity
- Generalizes the strict-alternation pattern to conditional next-turn
- wait/notify version: shared int + condition variables, harder to get right
Interview Follow-up Questions
Why does Zero always run between every Odd and Even?
Because Zero releases the next parity thread, which then releases Zero. So the order is forced: Zero, parity, Zero, parity. No way for two parity threads to run consecutively without Zero in between.
What if the parity choice were random instead of based on i?
Same pattern, Zero just picks based on whatever the rule is. The semaphore relay enforces 'one runs at a time' regardless of the choice. The choice logic is independent of the synchronization.
How to solve this with a single mutex?
Single mutex + shared counter + condition variables, with each thread waiting on a predicate ('counter % 2 == 1' for odd, '== 0 and counter > 0' for even). Works but verbose; semaphores express the intent more cleanly.
Coordinated Conditional Output
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Four threads, Number, Fizz, Buzz, FizzBuzz, coordinate to print FizzBuzz for 1..n. The right thread is whichever matches the current number's divisibility. Pattern: shared counter + per-thread Semaphore, dispatcher wakes only the right thread.
Key Points
- Four threads, one shared counter, four semaphores
- A 'dispatcher' increments the counter and releases exactly the matching thread's semaphore
- Or: each thread waits on a condition that matches its rule (i % 3, i % 5, i % 15)
- Same token-passing pattern as FooBar but routed by predicate
Interview Follow-up Questions
Predicate-on-condition vs semaphore dispatcher, which is cleaner?
Both are valid. Semaphore dispatcher is faster (no spurious wakeups, no broadcast) but requires a coordinator thread. Predicate version is shorter to write but uses notifyAll/broadcast, less efficient at scale. For interview, either is acceptable; the senior signal is being able to discuss the tradeoff.
Why does the predicate version need a while loop, not if?
Spurious wakeups + the broadcast wakes ALL waiters every time. Each thread re-checks its predicate on wake; only the one that matches the new state proceeds, the others go back to sleep. `if` would mistakenly print on a wake even when the predicate has become false.
How to handle 100 different output rules, not just 4?
Single condition variable + a predicate function per thread, all subscribed to broadcasts. Or a dispatcher that maps rules → channels in a routing table. The semaphore-per-thread approach is cleanest for small N (2-4); broadcast/predicate scales better for large N.
Ordered Three-Thread Execution (Print in Order)
Section: Core Patterns | Difficulty: Easy | Frequency: Asked Often
Three threads start in arbitrary order; ensure first() runs before second() runs before third(). The minimal solution: two binary semaphores starting at 0, where each method releases the next's semaphore on completion. Same pattern with CountDownLatch.
Key Points
- Two semaphores both start at 0 → first method must signal before second can proceed
- first() prints, then releases s1; second() acquires s1, prints, releases s2; third() acquires s2
- CountDownLatch (Java) is a clean alternative, semantically more obvious
- Generalizes to N-stage pipelines via N-1 semaphores or N-1 latches
Interview Follow-up Questions
Semaphore vs CountDownLatch, which to pick?
CountDownLatch when the semantics are 'wait for one event' (one-shot). Semaphore when the synchronization point may be reused. For Print in Order, latch is conceptually clearer; semaphore works fine.
What if first() throws after printing but before releasing?
second() and third() will block forever. Production-safe version uses try/finally to guarantee release on error path. Or: rethink whether 'first failed' means second should also fail (probably yes, propagate the error).
How does this generalize to a 5-stage pipeline?
4 latches/semaphores, each chained: stage1 releases s1; stage2 awaits s1, releases s2; etc. Or, single CountDownLatch with count=N for join-all semantics. Choose based on whether stages are sequential or fan-in.
N-of-M Barrier Synchronization
Section: Core Patterns | Difficulty: Hard | Frequency: Asked Often
Hydrogen and Oxygen threads must pair into water, exactly 2 H + 1 O before any can release. The pattern: two semaphores controlling H/O entry + a CyclicBarrier of 3 ensuring all three threads of a molecule are present before any prints.
Key Points
- Pattern: rate-limiting semaphore + barrier
- Hydrogen semaphore = 2 permits; Oxygen semaphore = 1 permit; both refill after barrier
- CyclicBarrier(3) ensures the 2 H + 1 O appear together before any prints
- After barrier, semaphores reset, next molecule's H and O begin
Interview Follow-up Questions
What's a CyclicBarrier exactly?
A barrier where N threads must all call `await()` before any can proceed. Once all arrive, they're all released simultaneously. *Cyclic* because the barrier resets to wait for the next batch of N. Java has it natively; Python's `threading.Barrier` is equivalent; Go has no builtin (build with sync.Cond).
Why do we need both the semaphores AND the barrier?
Semaphores cap *how many* of each type can enter. Barrier ensures the *correct mix* (2 H + 1 O) all reach the print step together. Without semaphores, four hydrogens could pass the barrier and starve oxygens. Without the barrier, individual threads print whenever they enter, no coordination.
How does this generalize to N-of-M?
Same pattern: M semaphores rate-limiting each kind, barrier of size N. Want a fixed mix of 3 A's + 1 B? Set aSem=3, bSem=1, barrier(4). Pattern scales to any 'fixed mix' synchronization.
Dining Philosophers
Section: Core Patterns | Difficulty: Hard | Frequency: Asked Often
Five philosophers, five forks, alternating thinking and eating. Each needs both adjacent forks to eat, so naively-acquired forks deadlock. Classic solutions: enforce odd/even acquisition order, limit to 4 eaters at once, or use a waiter (arbitrator).
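A hedged Go sketch of Solution 2: a buffered channel of capacity N-1 acts as the 'at most four eaters' semaphore, which breaks the circular wait while forks stay plain mutexes. Meal count is illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const philosophers = 5
	var forks [philosophers]sync.Mutex

	// At most N-1 philosophers may try to eat at once: no full cycle can form.
	seats := make(chan struct{}, philosophers-1)

	var wg sync.WaitGroup
	for i := 0; i < philosophers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			left, right := i, (i+1)%philosophers
			for meal := 0; meal < 3; meal++ {
				seats <- struct{}{} // acquire a seat (semaphore down)
				forks[left].Lock()
				forks[right].Lock()
				fmt.Printf("philosopher %d eats\n", i)
				forks[right].Unlock()
				forks[left].Unlock()
				<-seats // release the seat (semaphore up)
			}
		}(i)
	}
	wg.Wait()
}
```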
Key Points
- Naive solution (each grabs left then right) deadlocks, circular wait
- Solution 1: odd philosophers grab left first, even grab right first → asymmetry breaks cycle
- Solution 2: a Semaphore(N-1) limits concurrent eaters → no full cycle possible
- Solution 3: arbitrator (waiter) serializes fork pickups
- Modern Go solution: channel-based fork ownership eliminates locks entirely
Interview Follow-up Questions
Why does naive (everyone grabs left, then right) deadlock?
Five philosophers each grab their left fork simultaneously. Now every philosopher holds one fork and waits for the right one, held by their neighbor. Circular wait, all four Coffman conditions hold, deadlock.
Asymmetry vs Semaphore(4), which is preferred?
Semaphore(4) is simpler to explain and provably deadlock-free. Asymmetry is slightly faster (no semaphore overhead) but harder to verify. Most senior interviewers accept either; the win is articulating *why* each works.
Can a philosopher starve under these solutions?
Yes, none of these solutions prevent starvation, only deadlock. A philosopher could be perpetually beaten to forks by faster neighbors. Fixing requires fairness, e.g., a queue or arbitrator (Solution 3, the 'waiter').
What's the arbitrator solution?
A central 'waiter' grants permission to pick up forks. Each philosopher requests both forks atomically through the waiter; only granted if both are free. Eliminates deadlock and provides fairness if the waiter uses a queue.
Readers-Writers Problem
Section: Core Patterns | Difficulty: Medium | Frequency: Asked Often
Many readers can access shared data concurrently, but writers need exclusive access. The classic problem: which side gets priority? Reader-pref starves writers; writer-pref starves readers; fair locks alternate. Java and Go ship reader-writer locks (ReentrantReadWriteLock, sync.RWMutex); Python's stdlib does not, third-party packages fill the gap.
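A minimal Go sketch of the reads-share / writes-exclude split with sync.RWMutex (the cache type and keys are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type Cache struct {
	mu   sync.RWMutex
	data map[string]string
}

// Get takes the read lock: many readers may hold it at once.
func (c *Cache) Get(k string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[k]
	return v, ok
}

// Set takes the write lock: exclusive, waits for readers to drain.
func (c *Cache) Set(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[k] = v
}

func main() {
	c := &Cache{data: make(map[string]string)}
	c.Set("a", "1")
	fmt.Println(c.Get("a"))
}
```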
Key Points
- Readers can run concurrently with each other; writers need exclusive access
- Reader-preference: writer can starve under sustained reader load
- Writer-preference: reader can starve when writers always queue
- Fair / FIFO: alternates by arrival order, slightly slower, no starvation
- Go's sync.RWMutex is writer-preferring (a waiting writer blocks new readers); Java's ReentrantReadWriteLock leaves ordering unspecified in non-fair mode and is FIFO in fair mode
Interview Follow-up Questions
When does an RWMutex actually win over a plain Mutex?
When reads dominate writes by 5x or more AND each read does meaningful work (so the lock-acquire cost is amortized). For tiny critical sections, a Mutex is faster, RWMutex bookkeeping overhead exceeds the parallelism benefit.
Why is the default writer-preferring?
Without writer-pref, a steady stream of readers can starve writers indefinitely. Writers usually represent state changes that need to happen; reads can wait an extra millisecond. Most production lock libraries default to writer-pref for this reason.
What's the danger of writer-pref?
The re-entry deadlock, a goroutine/thread holding RLock that calls a method which also tries to RLock can deadlock if a writer arrives between. Covered in Bug Hunt: RWMutex Reentrant. Solution: don't re-enter (use 'Locked' helpers) or use a reentrant RW lock.
How does Linux's RCU compare?
Read-Copy-Update is a different beast, readers don't lock at all (they read a stale snapshot); writers create new copies and atomically install them. Used heavily in the Linux kernel. Best read-throughput in existence; complex update semantics; user-space equivalents exist (liburcu) but aren't general-purpose drop-ins.
Sleeping Barber
Section: Core Patterns | Difficulty: Medium | Frequency: Sometimes
A barber sleeps when no customers; a customer wakes the barber if asleep, otherwise sits in a finite waiting room (or leaves if full). Classic producer-consumer with a bounded buffer + signal-on-empty pattern. Two semaphores + a counted slot map onto it cleanly.
Key Points
- Three roles: barber, customer, finite waiting room
- Two semaphores: customers (producer signals barber) + barber (signals next haircut starts)
- Mutex protects the seat counter; bounded chairs reject if full
- Producer-consumer with a bounded buffer in different clothes
Interview Follow-up Questions
How is this different from producer-consumer?
It's a producer-consumer with: (1) a finite buffer that *rejects* on full instead of blocking the producer, and (2) the consumer (barber) blocks on the buffer being empty. Same primitives, different overflow policy.
What if the shop has multiple barbers?
N barbers reading from the same channel/queue. Each barber is independent; customers don't care which barber serves them. The channel buffered at CHAIRS still bounds the waiting room. This is a worker pool.
Why does the customer use semaphore.release() to wake the barber?
It's the signal that work is available. The barber blocks on customerSem.acquire(), semantically 'wait for a customer.' The release/acquire pair is the synchronization. The shared waiting count is just bookkeeping for the rejection check.
Cigarette Smokers Problem
Section: Core Patterns | Difficulty: Hard | Frequency: Rare
Three smokers each have one of three ingredients (tobacco, paper, matches). An agent randomly drops two of the three on the table; the smoker with the missing ingredient must roll and smoke. Tests whether the candidate can work within Patil's restriction (no conditionals in the agent) by introducing 'pusher' helper threads and coordinating with counting semaphores.
Key Points
- Three smokers, each with infinite supply of one ingredient
- Agent puts two ingredients on the table; smoker with the third must take and smoke
- Patil's restriction: NO conditional statements in the agent code (just put + signal)
- Trick: use 'pusher' threads as helpers, they observe what's available and signal the matching smoker
- Modern view: this 'difficulty' is contrived and the helpers pattern is fine
Interview Follow-up Questions
What's Patil's restriction and why does it matter?
Patil (1971) added the rule: the agent thread cannot have conditional logic, it just puts ingredients down and signals. This forces semaphores-only coordination without checking inter-thread state, requiring 'pusher' helper threads. Modern view: the restriction is contrived; in production, the agent would just be smart.
Without pushers, can the problem be solved?
Yes, let the agent inspect what's on the table and directly signal the right smoker. But this 'cheats' Patil's restriction. The pushers exist to honor the academic constraint.
What real problems does this map to?
Pattern matching on event streams, multiple subscribers each waiting for a specific combination of recent events. CSP literature uses this as a canonical example of selective receive with multiple conditions.
Is this still asked in interviews?
Rarely as the original. More often as a variant: 'three event sources, three matching subscribers, dispatch by combination.' Once the pusher pattern clicks, the variants are straightforward.
Santa Claus Problem
Section: Core Patterns | Difficulty: Hard | Frequency: Rare
Santa sleeps until either 9 reindeer return (deliver toys) or 3 elves arrive (consult). Reindeer take priority over elves. Tests selective wakeup + group barrier + priority, three classic primitives composed into one problem.
Key Points
- Santa sleeps until: 9 reindeer ready OR 3 elves waiting
- Reindeer have priority, Santa serves them first if both are queued
- Each group needs barrier-style synchronization (all 9 / all 3 must be present)
- Composition: priority + group-of-N barrier + producer-consumer signaling
Interview Follow-up Questions
What if a 4th elf arrives while 3 are waiting?
Per the canonical rules, they wait, the next group of 3 starts only after the current 3 finish consulting. Modeled by the barrier: it accepts only N waiters at a time. The 4th elf blocks at the barrier until the previous batch is released.
How is reindeer priority enforced?
Santa checks reindeer count first when waking. If both groups signal him simultaneously, he handles reindeer; the elves' signal stays pending in the semaphore for next loop iteration. This is the standard 'priority by check order' pattern.
Why is this problem interesting?
It composes three primitives that are usually tested separately: barrier (group of N), priority (which kind of wake-up to handle first), and producer-consumer signaling. Combining them tests whether the candidate can decompose a multi-faceted problem into the right primitive set.
Categorical Mutual Exclusion
Section: Core Patterns | Difficulty: Medium | Frequency: Sometimes
A bathroom can hold N people of the same category but never mix categories. People of category A enter while bathroom is empty or only A's are inside; same for B's. Generalizes Readers-Writers, readers and writers are categories, with N=1 for writers.
Key Points
- Two states: 'A only', 'B only', or empty
- Same-category entry is unrestricted (up to capacity if any)
- Cross-category entry must wait until current category empties
- Generalization of Readers-Writers, same problem, more symmetric
- Risk: starvation, if one category keeps arriving, the other never enters
Interview Follow-up Questions
How does this differ from Readers-Writers?
It's the symmetric version. RW lock has 'readers' (many concurrent) vs 'writers' (one exclusive), asymmetric. Categorical mutex has two symmetric categories, both allowing many concurrent. Same primitive structure (lock + condition), generalized.
Can starvation happen?
Yes. If category A keeps arriving while B's are waiting, B may never enter, A's never reach count 0. Fixes: (1) once any B is waiting, block new A's from entering; (2) round-robin between categories with a max consecutive entries. Adds fairness, costs throughput.
What if there are 3 categories, not 2?
Same pattern. The predicate becomes 'current is null or current == my category.' Generalizes to N categories with the same lock + condition + state-string approach.
What about a capacity limit per category?
Add a counter check in the wait predicate: 'current is null or (current == my category and count < capacity).' Same lock, additional condition. Combines with categorical exclusion cleanly.
Goroutines, Spawning, Scheduling, Lifecycle
Section: Go Concurrency | Difficulty: Basic | Frequency: Asked Often
Goroutines are user-space tasks multiplexed onto a small pool of OS threads by the Go runtime's M:P:G scheduler. Spawn with `go func() {...}()`, ~2KB initial stack, cheap to create, millions are practical. No public state machine; lifetime is 'until the function returns.'
Key Points
- Goroutines are NOT OS threads; they're multiplexed onto a small pool of OS threads, with at most GOMAXPROCS running Go code at once
- Stack starts at ~2KB, grows on demand up to ~1GB
- M:P:G model: M = OS thread, P = logical processor, G = goroutine
- Work stealing: idle P's steal goroutines from busy P's run queues
- No stop/kill, goroutines exit only when their function returns
Interview Follow-up Questions
What's M:P:G scheduling?
M = machine (OS thread). P = processor (logical, count = GOMAXPROCS). G = goroutine. Each P has a run queue of G's. M's pick a G from their P's queue and execute it. Idle M's steal from other P's. The runtime parks G's on blocking calls (channel, lock, syscall) without blocking the underlying M.
How are goroutines cheaper than OS threads?
(1) Stack: 2KB initial vs 1MB; (2) Context switch: user-space, no kernel transition, ~hundreds of ns vs ~5μs; (3) Scheduling: Go scheduler optimized for many goroutines; OS scheduler optimized for few threads. Net result: 100K goroutines is fine; 100K threads would crash.
Can a goroutine outlive main?
No. Once main returns, the runtime exits and kills all goroutines (no cleanup, no deferred functions run). Use sync.WaitGroup or context.Context to wait for them before main returns.
How is a goroutine stopped?
Not from outside. The goroutine itself must return. Pass a `context.Context` and have the goroutine select on `ctx.Done()`, when the context is cancelled, the goroutine exits voluntarily. There's no `goroutine.Stop()`.
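A minimal sketch of that cooperative-exit pattern (channel contents and timings are illustrative): the goroutine selects on ctx.Done() alongside its work channel and returns when cancelled.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func worker(ctx context.Context, jobs <-chan int) {
	for {
		select {
		case <-ctx.Done(): // cancellation: the only way to "stop" a goroutine
			fmt.Println("worker exiting:", ctx.Err())
			return
		case j := <-jobs:
			fmt.Println("processing", j)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	jobs := make(chan int)

	go worker(ctx, jobs)

	jobs <- 1
	jobs <- 2
	cancel() // ask the goroutine to exit; it complies at its next select
	time.Sleep(50 * time.Millisecond)
}
```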
Gotchas
- Closure capture in loops (pre-Go 1.22): always pass loop vars as args
- wg.Add() must be BEFORE 'go ...' or it races with wg.Wait()
- Spawning unbounded goroutines = potential OOM under load
- tight pure-CPU loops without yields could starve the scheduler before Go 1.14
GMP Scheduler: How Goroutines Actually Run
Section: Go Concurrency | Difficulty: Advanced | Frequency: Asked Often
G = goroutine, M = OS thread, P = logical processor. The runtime keeps GOMAXPROCS P's; each P holds a local run queue of runnable G's. M's must hold a P to run G's. Blocking syscalls hand the P off to another M so other goroutines keep running. Idle P's steal work from busy ones. The network poller and sysmon thread keep things moving. The whole machine runs in user space and is what makes 100K goroutines cheap.
Key Points
- G = goroutine (small struct + stack), M = OS thread (machine), P = logical processor (GOMAXPROCS of these). M needs a P to run G's.
- Each P has a local run queue holding up to 256 runnable G's plus a 'runnext' LIFO slot for newly created goroutines. Overflow goes to the global queue.
- Work stealing: when a P's local queue is empty, it steals half of another P's queue. Keeps load balanced without explicit coordination.
- Blocking syscalls: the M holding a P releases the P (handoff), another M (or a fresh one) picks the P up and continues running other G's. The blocked G stays attached to its M.
- Channel and lock waits don't block an M. The G is parked in user space (a wait list inside the channel/mutex), the M moves on to other work.
- Network poller (epoll/kqueue/IOCP) is one shared component. When network I/O completes, it puts the waiting G back on a P's run queue.
- sysmon is a dedicated OS thread (no P) that preempts long-running G's, retakes P's stuck in syscalls too long, and triggers GC.
- Async preemption via signals (since Go 1.14): even a tight loop with no function calls can be preempted. Before 1.14, such a loop could starve the scheduler.
Interview Follow-up Questions
Why have P at all? Why not just M and G?
P is the unit of scheduling capacity. GOMAXPROCS P's means at most GOMAXPROCS goroutines run in parallel, regardless of how many M's exist. This decouples 'how many cores are allowed to be used' from 'how many OS threads exist.' Stuck syscalls can spawn extra M's without stealing CPU from useful work, because those M's don't have a P. The local run queue per P is also a cache-locality win: G's recently created on a P tend to share data with other G's on the same P.
What is 'spinning' and why does the runtime do it?
After releasing a G, an M with no work to do can spin briefly (a few microseconds) checking whether new work appeared, instead of parking immediately. Parking and waking M's costs an OS round trip; if work is about to appear, spinning is cheaper. The runtime keeps the count of spinning M's low (one or two) to avoid burning CPU. It shows up in pprof as 'runtime.findrunnable' time.
How does GOMAXPROCS interact with cgroups and CPU limits?
Before Go 1.25, GOMAXPROCS defaulted to the number of host CPUs, ignoring cgroup CPU limits. On a 64-core box with a 2-core cgroup, GOMAXPROCS=64 caused massive scheduling overhead. The standard workaround was uber-go/automaxprocs. Go 1.25 made the runtime cgroup-aware by default, so most modern containers no longer need the workaround. Older Go versions still do.
Can the scheduler be starved?
Hard, but yes. (1) Calls to C code through cgo block the M without releasing the P; if all M's go into cgo at once, no goroutines run. (2) Tight loops in Go assembly that don't include preemption points still aren't preemptible. (3) Holding a runtime mutex (very rare in user code) blocks scheduling. For normal Go code, sysmon's signal-based preemption catches almost everything.
What's the cost of a goroutine creation?
About 200ns and 2KB of stack on first creation. Reuse from a pool helps amortize the stack allocation. Compared to an OS thread (~5us creation, ~1MB stack), goroutines are nearly free, which is why patterns like 'spawn one goroutine per request' are idiomatic in Go.
Gotchas
- Runaway G creation under load: spawning a goroutine per inbound message without bounds will OOM the process when the queue can't drain fast enough. Always have a worker pool or semaphore.
- cgo calls block an M for the entire C call. Millions of cgo calls per second will exhaust the M pool and require raising the soft cap (default 10000).
- GOMAXPROCS too low (default 1 in Go pre-1.5, or set too low by hand) silently serializes everything.
- GOMAXPROCS too high on a small cgroup cuts performance because the OS scheduler thrashes between threads that compete for limited CPU.
- Long syscalls (slow disks) can spike the M count; suddenly seeing hundreds of M's is a sign of blocking syscalls under load.
Channels, Buffered, Unbuffered, Patterns
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Channels are typed conduits between goroutines. Unbuffered = synchronous handoff (sender blocks until receiver ready). Buffered = bounded queue. Closing signals 'no more values' to receivers. The cornerstone of Go's 'share by communicating' philosophy.
Key Points
- Unbuffered: synchronous, send blocks until receive, vice versa
- Buffered: send blocks only when buffer full; receive blocks only when empty
- close(ch) = 'no more values'; receivers see ok=false; sending on closed = panic
- Nil channels block forever, useful in select to disable a case
- Channels are not free, for tight CPU loops, mutex is faster
Interview Follow-up Questions
Buffered or unbuffered, which to use?
Unbuffered for synchronization (signal that something happened). Buffered for decoupling (producer can run ahead of consumer up to buffer size). Buffered also tolerates burstiness. As a rule of thumb: start with unbuffered; add buffer only when measurement shows it helps.
What does closing a nil channel do?
Panics. Sending to a nil channel blocks forever. Reading from nil blocks forever. Nil channels are mostly useful in `select` to disable a case dynamically.
When are channels slower than mutex?
Tight critical sections, millions of ops/sec on shared state. Channel send/receive has overhead (~50-100ns); a mutex Lock/Unlock is ~20-30ns. For a hot counter, atomic.Int64 wins; for coordination, channels win.
What's the right channel buffer size?
Depends on the burst tolerance required. If producer and consumer rates are matched, 0 (unbuffered) is fine. If a producer bursts at 10x consumer rate for 100ms, buffer for that burst. Don't make it 'large' as a default; large buffers hide backpressure problems.
Gotchas
- Closing a channel twice → panic
- Sending on a closed channel → panic (so receiver should never close)
- Range loop on a never-closed channel → blocks forever (goroutine leak)
- Make sure only ONE goroutine closes the channel; coordinate with WaitGroup if multiple senders
- Sizing a buffer to 'exactly the number of sends' is fragile: one extra send (or a missed receive) blocks forever
select Deep Dive
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
select waits on multiple channel operations and proceeds with whichever is ready first. Add a default for non-blocking operations; combine with time.After for timeouts; with ctx.Done() for cancellation. The cornerstone of channel composition.
Key Points
- select picks ONE ready case; if multiple ready, randomly chosen
- default makes select non-blocking, if no case ready, default runs
- time.After(d) returns a channel that fires once after d → easy timeouts
- ctx.Done() returns a channel, select on it for cancellation
- select on nil channel = case never fires → useful for dynamic channel disabling
Interview Follow-up Questions
What if multiple cases are ready simultaneously?
Go picks one uniformly at random. This prevents starvation: no ready case can be perpetually passed over by the others. Critical for fairness.
How is priority between cases implemented?
Two-step select. Outer select picks higher-priority channel only; if no high-priority ready, inner select picks any. Or use a guard with a buffered channel: drain high-priority first, then check normal.
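A hedged sketch of that two-step select (channel names are illustrative): try the high-priority channel with a non-blocking select first, then fall back to a blocking select over both.

```go
package main

import "fmt"

// nextMessage prefers urgent over normal when both are ready, but never
// blocks normal traffic while urgent is quiet.
func nextMessage(urgent, normal <-chan string) string {
	select {
	case m := <-urgent: // step 1: non-blocking check of the priority channel
		return m
	default:
	}
	select { // step 2: block on either
	case m := <-urgent:
		return m
	case m := <-normal:
		return m
	}
}

func main() {
	urgent := make(chan string, 1)
	normal := make(chan string, 1)
	urgent <- "pager alert"
	normal <- "routine log"
	fmt.Println(nextMessage(urgent, normal)) // "pager alert"
	fmt.Println(nextMessage(urgent, normal)) // "routine log"
}
```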
Why does time.After leak under heavy use?
Each call allocates a runtime timer that isn't released until it fires (and before Go 1.23 it couldn't be collected early even if the select had already moved on). Calling it millions of times in a tight loop (e.g., a per-iteration timeout) lets the timers accumulate. Use `time.NewTimer` + explicit `Stop()` for hot paths.
When should select have a default?
When non-blocking semantics are required, 'try send, drop if full' or 'try receive, no value? skip.' Default has subtle implications: it makes the select consume CPU when called in a tight loop. Use sparingly outside try-X patterns.
Gotchas
- select with no cases blocks forever (used to park the main goroutine intentionally)
- select with only default is just an immediate non-blocking check
- time.After in tight loops leaks timers, use NewTimer + Stop
- select with default inside a tight loop busy-polls and burns CPU; reserve default for genuine try-send/try-receive
- Nil channel cases never fire, both feature and footgun
context.Context, Cancellation & Deadlines
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
context.Context propagates cancellation, deadlines, and request-scoped values through call chains. Every long-running operation should accept a Context and select on ctx.Done(). The universal cancellation primitive in modern Go.
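A minimal sketch of deadline propagation, assuming a plain net/http call (the URL and timeout are illustrative): the budget set here bounds everything downstream that honours the context.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Everything below this point inherits the 500ms budget.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel() // releases the timer even on early return

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req) // aborted if the deadline expires
	if err != nil {
		fmt.Println("request failed:", err) // wraps context.DeadlineExceeded
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```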
Key Points
- Context flows through the call graph, pass as first parameter to every function that may block
- WithCancel: parent function cancels manually
- WithTimeout / WithDeadline: auto-cancellation after duration
- ctx.Done() returns a channel that closes on cancellation
- Always defer cancel() to release resources, even when not strictly needed
Interview Follow-up Questions
Why is ctx the first parameter, not last?
Convention; aids readability and discoverability, and Go style guides and linters flag violations. Functions that don't take ctx are usually CPU-only / synchronous; those that do are network/DB-bound, so blocking call sites are easy to spot at a glance.
When should context NOT be used?
Pure-CPU synchronous functions (`Sum`, `Reverse`). Tight loops where the overhead of `select` matters. Constructor functions (use a Done method instead). The ctx is for orchestration of cancellable operations; don't sprinkle it on everything.
What's context.Background vs context.TODO?
Functionally identical (empty context, never cancels). Convention: Background for top-level (main, init); TODO when the right context is unknown (placeholder during refactoring). The TODO signals 'come back and fix this.'
Does context.WithCancel leak if cancel is never called?
Yes. The child context stays referenced by its parent (and WithTimeout/WithDeadline also hold a timer) until cancel runs or the parent is cancelled. Always `defer cancel()`: even when the operation finishes early, the explicit call releases the child's resources immediately. Linters (go vet's lostcancel check, staticcheck) catch the omission.
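A sketch of the standard shape, with `slowQuery` standing in for any cancellable call; the result channel is buffered so the worker goroutine cannot leak even when the timeout fires first:

```go
import (
	"context"
	"time"
)

func fetchWithTimeout(ctx context.Context, slowQuery func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel() // releases the timer even when the call finishes early

	type result struct {
		val string
		err error
	}
	ch := make(chan result, 1) // buffered: the goroutine can always complete its send
	go func() {
		v, err := slowQuery(ctx)
		ch <- result{v, err}
	}()

	select {
	case r := <-ch:
		return r.val, r.err
	case <-ctx.Done():
		return "", ctx.Err() // deadline exceeded or parent cancelled
	}
}
```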
Gotchas
- Forgetting `defer cancel()` → leak (the child context, and its timer for WithTimeout, stick around until the parent ends)
- Storing context as a struct field; convention says pass it explicitly
- Putting too much in context.Value, easy to abuse for global state
- Mixing cancellation: ctx is cancelled but inner code didn't check ctx.Done() → operation continues
- context.Background() in a function that should take a ctx, silently disables cancellation
Goroutine Leak Prevention
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Every spawned goroutine must have a known exit story: context cancellation, channel close, natural completion, or timeout. Goroutines that block forever are the #1 production scaling bug in Go services. Detect via runtime.NumGoroutine and pprof/goroutine.
Key Points
- Symptom: NumGoroutine grows unbounded under load
- Diagnosis: pprof/goroutine shows what each leaked goroutine is blocked on
- Pattern: every send/receive needs a select-with-ctx.Done() escape
- Bounded buffered channels absorb bursts; default-on-full handles overflow
- Test for leaks: run handler 10K times in test, check NumGoroutine before/after
Interview Follow-up Questions
What's the canonical leak in production Go services?
Fire-and-forget goroutine doing logging or metrics, sending to an unbuffered/under-capacity channel whose consumer is slow or down. Every request leaks one goroutine; unbounded growth under load.
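A sketch of the fix for that shape: wrap the send in a select so a finished request stops waiting instead of parking forever (the channel name and payload type are placeholders):

```go
import "context"

// emit never blocks past the life of the request: the consumer either takes
// the event, or the request's ctx ends and we give up.
func emit(ctx context.Context, events chan<- string, e string) {
	select {
	case events <- e: // consumer keeping up: normal path
	case <-ctx.Done(): // request finished or cancelled: stop waiting, don't leak
	}
}

// When dropping is acceptable, a non-blocking try-send bounds it even harder:
//
//	select {
//	case events <- e:
//	default: // buffer full: drop the event
//	}
```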
How is a leak reproduced locally?
Stress test the handler with go-test running N times. Check `runtime.NumGoroutine()` before and after. Growth indicates a leak. Use `goleak.VerifyNone(t)` to make this automatic.
Why don't unbuffered channels usually leak in tests?
Tests run linearly. Production has burst load: 1000 requests/sec with a slow consumer means 1000 leaked goroutines/sec. Tests usually don't reproduce because the consumer keeps up at low load.
When is unbuffered fine?
When sender and receiver are tightly paired with bounded lifetimes, e.g., 'wait for one response.' Both sides are explicit; the goroutine ends after a single exchange. The pattern that bites is *unbounded fire-and-forget*.
Gotchas
- context.WithCancel without `defer cancel()` leaks the timer goroutine
- for {} loops without select-on-ctx.Done leak forever
- Channel send with no receiver blocks forever, wrap in select + default OR + ctx.Done
- time.After in tight loops creates one timer per call; use NewTimer + Stop
- Goroutines waiting on http.Request body to fully drain, close the body explicitly
Go Memory Model & Sync Guarantees
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Go's memory model defines when one goroutine is guaranteed to observe writes made by another. The rule is short: synchronize using channels, sync primitives, or sync/atomic, or you have a data race.
Key Points
- Go's mantra: 'Don't communicate by sharing memory; share memory by communicating'
- A send on a channel happens-before the corresponding receive completes
- A receive from an unbuffered channel happens-before the send completes
- Mutex.Unlock() happens-before the next Mutex.Lock() returns
- sync/atomic loads/stores establish happens-before (since Go 1.19)
- go run -race detects data races at runtime; run it in CI
Interview Follow-up Questions
Why does Go say 'don't communicate by sharing memory'?
Channels make ownership and synchronization explicit in the type system. Shared-memory bugs hide; channel patterns surface coordination at the call site.
Is reading a bool atomic in Go?
No. Without atomic.Bool or a mutex, the race detector flags concurrent read+write. Word-size loads may be atomic on x86 but the Go memory model doesn't guarantee it.
What does the race detector actually do?
Instruments memory accesses at compile time and tracks happens-before relationships. Reports any pair of accesses where at least one is a write and no happens-before edge exists.
Difference between unbuffered and buffered channel send semantics?
Unbuffered: send blocks until receiver is ready, a sync point. Buffered: send blocks only when buffer is full, like a queue.
Can sync.Mutex be reentrant?
No. Re-locking the same mutex from the same goroutine deadlocks. This is intentional; Go discourages reentrant locks because they hide design issues.
Gotchas
- Copying a struct that contains a sync.Mutex breaks it; go vet will warn
- Closing a channel twice panics; sending to a closed channel panics
- select with no default + no ready case blocks forever, common cause of deadlock
- Goroutine leaked on channel send: writer blocks forever if no one reads and no buffer
- atomic operations take the value's address: atomic.AddInt64(&x, 1), not atomic.AddInt64(x, 1)
sync.Mutex and sync.RWMutex
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
sync.Mutex is Go's basic mutex (Lock/Unlock). sync.RWMutex allows many readers OR one writer. Always use defer Unlock right after Lock. Mutexes are not reentrant in Go: same goroutine locking twice deadlocks. Pass mutexes by pointer; copying a mutex copies its state (including held-ness).
Key Points
- Mutexes are NOT reentrant in Go. Same goroutine locking twice deadlocks.
- Always Lock then defer Unlock. Skipping defer is the source of most lock leaks.
- Mutex zero value is unlocked and ready to use. No initialisation needed.
- Copying a Mutex copies its state. Embed by value carefully; pass structs containing mutexes by pointer.
- RWMutex is slower than Mutex for write-heavy or short-critical-section workloads. Profile.
Interview Follow-up Questions
Why doesn't Go support reentrant mutexes?
Design choice. The Go authors argue that needing reentrancy is usually a signal of a design problem (the function does not know whether it already holds the lock, which means it does not know its own state). The alternative is to refactor: split the function into a public version that locks and a private version that assumes the lock is held. The public version locks and calls the private; nested calls go through the private.
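A sketch of that split, with illustrative names:

```go
import "sync"

type Store struct {
	mu    sync.Mutex
	items map[string]int
}

// Add is the public entry point: it takes the lock.
func (s *Store) Add(key string, n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.addLocked(key, n)
}

// addLocked assumes s.mu is already held; nested/internal calls use this one.
func (s *Store) addLocked(key string, n int) {
	s.items[key] += n
}
```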
Mutex or RWMutex by default?
Mutex. RWMutex looks like 'free parallelism for reads', but the bookkeeping (atomic counter for readers, separate write path) costs cycles even for the read path. For critical sections shorter than ~1µs, Mutex wins. Switch to RWMutex when reads vastly dominate (50:1 or more) AND each read holds the lock long enough that the read-path overhead is negligible.
Can a goroutine that holds a Mutex pass it to another goroutine?
Yes. The mutex tracks held-or-not, not which goroutine holds it. Goroutine A can Lock, send the protected struct over a channel, and goroutine B can Unlock. This is unusual; the conventional pattern is acquire-and-release in the same function. But the runtime does not enforce that.
What goes wrong when defer Unlock is forgotten?
Two failure modes. First, an early return between Lock and Unlock leaks the lock; the next caller deadlocks. Second, a panic between them leaks the lock; recovery in a parent goroutine sees the lock held forever. defer Unlock right after Lock makes both paths correct.
Gotchas
- Mutex is not reentrant; same goroutine locking twice deadlocks
- Copying a struct that contains a Mutex copies the lock state; pass by pointer
- RWMutex is slower than Mutex for short critical sections; profile before assuming
- Forgetting defer Unlock leaks the lock on panic or early return
- Holding a mutex across a channel send/receive can deadlock with the channel's other end
sync Package: Once, WaitGroup, Cond, Pool, Map
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Beyond mutexes, sync provides Once (one-time init), WaitGroup (wait for N goroutines), Cond (wait/signal on a predicate), Pool (reusable object cache), and Map (lock-free read-mostly map). Each solves a specific shape; reaching for the wrong one leads to ugly code.
Key Points
- sync.Once.Do guarantees the function runs exactly once across all goroutines.
- WaitGroup.Add(n) MUST happen before launching goroutines; Done in defer; Wait blocks until counter is 0.
- sync.Cond is rarely the right tool in Go; channels usually express the same intent more clearly.
- sync.Pool reduces GC pressure for short-lived objects but is NOT a connection pool: items can be reaped any time.
- sync.Map wins for read-mostly disjoint-key workloads. For write-heavy or read-write-mixed, plain map + RWMutex is faster.
Interview Follow-up Questions
When to use sync.Map vs map + RWMutex?
sync.Map is faster ONLY in two scenarios: write-once-read-many (entries added once, then mostly read) and disjoint-key access (different goroutines touching different keys). For mixed reads and writes on the same keys, or for write-heavy workloads, plain map with RWMutex is faster. Default to map+RWMutex; switch to sync.Map only after measuring the win.
Why is sync.Pool 'best-effort'?
The pool can drop items at any garbage collection. The Go authors made this trade-off so the pool does not pin memory. It cannot be used for objects with finite-resource semantics (file handles, DB connections, network sockets). Use it only for objects that are 'free to recreate but expensive to allocate' (buffers, slices, parser state).
What is the difference between Cond.Signal and Cond.Broadcast?
Signal wakes ONE waiting goroutine. Broadcast wakes ALL waiting goroutines. Use Signal when only one waiter can usefully proceed (e.g., one item enqueued, only one consumer can take it). Use Broadcast when state changed in a way many waiters care about (cache invalidation, shutdown signal). Wrong choice = lost wakeup or wasted wakeups.
WaitGroup.Add inside the goroutine: why is that wrong?
It races with Wait. The parent calls Wait; the counter is still 0 (because Add has not run yet); Wait returns immediately; goroutines then run after the parent has moved on. The rule: every Add must happen-before every Wait. The simplest way is to Add before launching.
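The correct ordering as a sketch (`urls` and `fetch` are placeholders):

```go
import "sync"

func fetchAll(urls []string, fetch func(string)) {
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1) // before the goroutine starts, so Wait can never observe 0 too early
		go func(u string) {
			defer wg.Done()
			fetch(u)
		}(u)
	}
	wg.Wait()
}
```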
Gotchas
- WaitGroup.Add inside a goroutine races with Wait; Add before launching
- sync.Pool items can be dropped at any GC; never use for finite resources
- sync.Cond.Wait must be inside a for-loop on the predicate (spurious wakeups)
- sync.Map cannot be ranged AND mutated safely; use Range carefully
- Calling Signal/Broadcast without holding the Cond's Locker can cause missed wakeups
sync/atomic: Lock-Free Single-Variable Updates
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
sync/atomic provides lock-free atomic operations on integers and pointers. Add, Load, Store, Swap, CompareAndSwap. Go 1.19+ added typed atomic.Int64, atomic.Pointer[T], atomic.Bool which are cleaner than the function-style API. Use for counters, hot flags, lock-free data structures.
Key Points
- Atomic ops are lock-free: single hardware instruction (LOCK XADD on x86, LDXR/STXR on ARM).
- Go 1.19+ typed API (atomic.Int64) is preferred over the function-style API.
- CompareAndSwap is the building block for lock-free algorithms; the retry loop pattern is identical to Java.
- atomic.Value supports any type but is replace-only (no compound update). Use atomic.Pointer[T] when possible.
- Plain reads/writes to int64 are NOT atomic on 32-bit platforms; use atomic.Int64 if portability matters.
Interview Follow-up Questions
atomic.Int64 vs sync.Mutex around an int64: which is faster?
Atomic by a wide margin. atomic.Add is one CPU instruction; mutex Lock is at least dozens of cycles even uncontended, hundreds under contention. Use atomic for counters, single-variable invariants. Mutex when the invariant spans more than one variable.
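A sketch using the Go 1.19+ typed API: a plain counter plus a CAS retry loop that tracks a running maximum (names are illustrative):

```go
import "sync/atomic"

var requests atomic.Int64
var maxLatencyNs atomic.Int64

func observe(latencyNs int64) {
	requests.Add(1) // hot counter: one instruction, no lock

	// CAS retry loop: the same shape as Java's compareAndSet loops.
	for {
		cur := maxLatencyNs.Load()
		if latencyNs <= cur {
			return
		}
		if maxLatencyNs.CompareAndSwap(cur, latencyNs) {
			return // we won the race; otherwise reload and retry
		}
	}
}
```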
When to use atomic.Value vs atomic.Pointer[T]?
atomic.Pointer[T] for new code on Go 1.19+. It is type-safe, no interface boxing. atomic.Value exists for backward compatibility and for cases where T is unknown at compile time. The two have one important difference: atomic.Value requires the same concrete type on every Store after the first.
How does Go's atomic interact with the memory model?
Atomic operations have sequentially consistent semantics in Go's memory model (since Go 1.19 made this explicit). A Load synchronises with the Store that wrote the value: writes that happened before the Store are visible to code after the Load. This matches what most engineers intuitively expect from atomics.
Can a lock-free data structure be built with sync/atomic?
Yes, and people have. Lock-free queues, stacks, hash maps. But: the standard library covers the common cases (channels for queues, sync.Map for some maps), and rolling a lock-free structure is where subtle bugs live (ABA problem, memory ordering, leaks of removed nodes). Reach for off-the-shelf or academic-tested designs first.
Gotchas
- Function-style API requires 8-byte alignment for int64; typed API handles this
- atomic.Value requires same concrete type on every Store after the first
- Plain int64 reads/writes are NOT atomic on 32-bit platforms
- Atomic operations are NOT enough for compound state (multiple fields); use a mutex or restructure to one immutable struct
- CAS retry loops can spin under high contention; benchmark before assuming lock-free is faster
sync.Map vs map+RWMutex
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Sometimes
sync.Map is the standard library's lock-free concurrent map. It wins for two specific shapes: write-once-read-many entries, and disjoint-key access from many goroutines. For mixed read/write on the same keys, plain map + sync.RWMutex is faster. The Go authors put a warning in the docs to this effect.
Key Points
- sync.Map is untyped (any/any). There is no generic version in the standard library; type assertions everywhere.
- Optimised for write-once-read-many or disjoint-key access. Other patterns: use map+RWMutex.
- Range is weakly consistent: it iterates a snapshot, but mutations during Range may or may not be reflected.
- LoadOrStore is the atomic 'get or insert' that map+RWMutex needs separate Lock/Unlock for.
- If you need iteration order, range, or generics, use plain map with RWMutex. sync.Map is a sharp tool for specific shapes.
Interview Follow-up Questions
When is sync.Map actually faster than map+RWMutex?
Two shapes per the docs: (1) write-once-read-many entries (write at insert, then read forever) and (2) many goroutines mostly accessing disjoint keys (each goroutine touches a different subset). For mixed writes/reads on contended keys, map+RWMutex wins. Benchmark the real workload before assuming.
Why is sync.Map untyped?
It predates generics. The Go team has not added a generic version to the standard library; the existing sync.Map remains untyped for backward compatibility. For typed concurrent maps in modern Go, either build a wrapper with generics + map + RWMutex, or use a third-party library.
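A minimal sketch of such a generics wrapper (Load/Store only; extend as needed):

```go
import "sync"

// TypedMap is a typed concurrent map built on map + RWMutex.
type TypedMap[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]V
}

func NewTypedMap[K comparable, V any]() *TypedMap[K, V] {
	return &TypedMap[K, V]{m: make(map[K]V)}
}

func (t *TypedMap[K, V]) Load(k K) (V, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	v, ok := t.m[k]
	return v, ok
}

func (t *TypedMap[K, V]) Store(k K, v V) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.m[k] = v
}
```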
Can sync.Map iteration miss entries?
Yes, if entries are added during the iteration. The Range docs say: 'Range may reflect any mapping for that key from any point during the Range call. Range does not block other methods on the receiver; even f itself may call any method on m.' In practice, when a consistent snapshot is needed, copy the entries first.
What about ordered iteration?
sync.Map does not preserve insertion order or any other order. For ordered iteration, use map+RWMutex (which also doesn't preserve order, so sort the keys), or maintain a separate slice of keys with the same lock.
Gotchas
- sync.Map is untyped; type assertions on every Load are easy to get wrong
- Range is weakly consistent; not safe for code that needs an exact snapshot
- Storing nil values is allowed but easily confused with 'not present'; check the bool
- For most workloads, map+RWMutex outperforms sync.Map; default to it unless you have measured otherwise
- Range can block forever if f never returns, including on writes that block on the map
errgroup and golang.org/x/sync/semaphore
Section: Go Concurrency | Difficulty: Intermediate | Frequency: Asked Often
errgroup.Group is the structured fan-out for Go: launch goroutines via Go(func), Wait for all, first error cancels siblings via Context. semaphore.Weighted bounds concurrency with weighted permits. Together they cover most real fan-out cases cleaner than raw WaitGroup + manual cancel plumbing.
Key Points
- errgroup.WithContext returns a Group AND a derived Context that is cancelled when any goroutine returns an error.
- First error wins; subsequent errors from siblings are silently dropped.
- g.SetLimit(n) (added to x/sync in 2022) caps in-flight goroutines without a separate semaphore.
- semaphore.Weighted permits >1 weight per acquisition; useful when tasks have different cost (memory, CPU).
- Both are in golang.org/x/sync (not stdlib) but are de facto standard for Go fan-out.
Interview Follow-up Questions
errgroup vs sync.WaitGroup: when to use which?
errgroup when error propagation or context cancellation on first failure is needed. WaitGroup when there are no errors to propagate (background fan-out where each goroutine handles its own errors), or when finer control is needed. For most fan-out-and-wait code, errgroup is the right default.
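A sketch of the default fan-out shape (`fetch` is a placeholder; requires golang.org/x/sync/errgroup):

```go
import (
	"context"

	"golang.org/x/sync/errgroup"
)

func crawl(ctx context.Context, urls []string, fetch func(context.Context, string) error) error {
	g, ctx := errgroup.WithContext(ctx) // ctx is cancelled on the first error
	g.SetLimit(8)                       // at most 8 in flight
	for _, u := range urls {
		u := u // not needed on Go 1.22+, required before
		g.Go(func() error {
			return fetch(ctx, u) // respects cancellation triggered by a sibling's error
		})
	}
	return g.Wait() // first non-nil error, after all goroutines finish
}
```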
What happens to slow goroutines after the first error?
The shared context is cancelled. Slow goroutines that respect ctx (check ctx.Done(), pass ctx to downstream calls) will see the cancellation and exit early. Goroutines that ignore ctx will run to completion. errgroup waits for all of them in either case.
How does SetLimit differ from a Semaphore?
SetLimit caps in-flight Go calls; the limit is per-Group. Semaphore caps a global resource (memory, connection slots) that may span multiple Groups. SetLimit is simpler for the common case of 'don't fan out more than N at once'. Semaphore is more flexible (weighted permits, shared across components).
Why are these in x/sync rather than stdlib?
Historical: x/sync was the staging area for sync extensions. Both are stable, widely used, and effectively standard. The Go team has stated no plans to move them, partly because evolving them in x/sync is easier than the stdlib's strict compatibility promise.
Gotchas
- Forgetting to capture the loop variable (item := item) before passing to g.Go creates a race in pre-Go-1.22 code
- errgroup only returns the FIRST error; subsequent errors are silently dropped
- A goroutine that ignores ctx runs to completion even after error cancellation
- semaphore.Weighted.Acquire blocks until ctx done or weight available; check ctx.Err() to distinguish
- errgroup zero value works but loses cancellation; always use WithContext for fan-out
Go Runtime Tuning: GOMAXPROCS, GC, Scheduler
Section: Go Concurrency | Difficulty: Advanced | Frequency: Sometimes
GOMAXPROCS sets how many OS threads execute Go code concurrently (default = cpu count). GOGC controls when GC runs (default 100 = collect when heap doubles). GOMEMLIMIT (Go 1.19+) caps RSS to keep within container limits. The Go scheduler is M:N; you almost never need to think about it, but knowing the levers helps when you do.
Key Points
- GOMAXPROCS defaults to cpu count. In containers without CFS-quota awareness, that may be the host's cores, not your limit. Use uber-go/automaxprocs.
- GOGC=100 means GC when heap doubles since last GC. Higher values trade memory for fewer collections.
- GOMEMLIMIT is a soft cap on total memory. Without it, Go can OOM in containers even with GOGC tuned low.
- The scheduler is M:N: many goroutines on a few OS threads. Goroutines park and resume cheaply (hundreds of nanoseconds).
- GODEBUG=schedtrace=1000 prints scheduler stats every 1s. Useful for diagnosing scheduling stalls.
Interview Follow-up Questions
Should GOMAXPROCS be tuned manually?
Almost never on bare metal or VMs. The default (cpu count) is right. In containers with CFS quota, use uber-go/automaxprocs to read the limit correctly. The only time to override manually is when cores are deliberately reserved for other processes (e.g., a sidecar).
What is the right GOGC?
Default 100 is fine for most services. Higher values (200, 500) reduce GC cost at the cost of higher memory; useful for batch jobs that prioritise throughput. Lower values (50, 25) reduce memory at the cost of more CPU on GC; useful when memory is tight. With GOMEMLIMIT set, GOGC can usually stay at default and the limit drives GC pressure.
How does the M:N scheduler differ from OS threads?
Goroutines (G) are scheduled onto P's (logical processors); P's are bound to M's (OS threads). Goroutines that block on syscalls or channels park; the M can run another goroutine without an OS context switch. Park/unpark costs hundreds of nanoseconds vs ~1–10 µs for an OS context switch. Result: Go programs can run hundreds of thousands of goroutines without scheduling overhead becoming dominant.
What is GOTRACEBACK and when is it needed?
Controls how much detail is in panic stack traces. Default 'single' shows the panicking goroutine; 'all' shows every goroutine. Set to 'all' in production for richer crash dumps. Doesn't affect performance, only crash output.
Gotchas
- GOMAXPROCS default is host CPU count, NOT container CPU limit; use automaxprocs
- Without GOMEMLIMIT, Go can OOM in tight containers even with GOGC tuned
- Running pprof with too many concurrent profiles slows the program; profile sparingly in prod
- GODEBUG=schedtrace adds overhead; not for prod, only for diagnostics
- runtime.NumGoroutine() shows total; doesn't tell you which are blocked vs runnable
Threads, Runnable, Virtual Threads
Section: Java Concurrency | Difficulty: Basic | Frequency: Asked Often
Java's concurrency unit is the thread. Platform threads wrap OS threads (1MB stack, ~10K max). Virtual threads (Java 21+) are JVM-managed M:N tasks (~few KB), millions per JVM. Use Runnable/Callable to define work; Thread/Executors to schedule.
Key Points
- Platform thread = OS thread; Virtual thread = JVM-managed M:N (Java 21+)
- Use Runnable for fire-and-forget; Callable for results; Thread.start to launch
- Virtual threads work with all blocking JDK APIs, no library rewrite needed
- Thread.join() waits for completion; thread.getState() gives runtime state
- Avoid creating threads directly in production code; use ExecutorService
Interview Follow-up Questions
Platform thread vs virtual thread, when to choose?
Platform: CPU-bound work, small thread counts (<1000). Virtual: I/O-bound, high concurrency (10K+ requests). Virtual threads work with existing blocking APIs, no rewrite to async required to scale.
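A minimal sketch (Java 21+): one virtual thread per submitted task; the blocking sleep stands in for any blocking JDK call:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                exec.submit(() -> {
                    Thread.sleep(100);   // blocking call: the carrier thread is released
                    return "done";
                });
            }
        } // close() waits for submitted tasks to finish (ExecutorService is AutoCloseable since Java 19)
    }
}
```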
Why is Thread.stop() deprecated?
It releases all locks abruptly, leaving objects in inconsistent states. Worse, it can corrupt internal data structures (a HashMap mid-resize, for example). Use interrupt + cooperative shutdown, let the thread finish what it's doing safely.
How are virtual threads scheduled?
M:N. JVM has a small pool of carrier (platform) threads. Virtual threads run on carriers; on blocking JDK calls, the virtual thread unmounts from its carrier so the carrier can run another. Re-mount when the I/O completes.
When does Runnable beat Callable?
When no return value or exception propagation is required. Fire-and-forget background tasks (logging, metrics flush). Callable's return type and checked exceptions add unnecessary ceremony for those cases.
Gotchas
- Thread.stop() is deprecated and unsafe, never use
- Thread.sleep() and Object.wait() clear the interrupt flag when they throw InterruptedException; re-set it with Thread.currentThread().interrupt()
- Don't reuse Thread instances, start() throws on a started/terminated thread
- Virtual threads pin to their carrier inside synchronized blocks, use ReentrantLock for I/O-heavy code on virtual threads
- Daemon threads are killed at JVM exit, don't use for cleanup work
synchronized vs ReentrantLock
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
synchronized is the language-level mutex, concise, JVM-optimized, automatic release on exception. ReentrantLock is the API-level mutex, adds tryLock with timeout, fairness, multiple Conditions, lock interruptibility. Pick based on which features the call site needs.
Key Points
- synchronized: cheap, JVM-optimized, auto-release. Default choice.
- ReentrantLock: tryLock(timeout), fairness, multiple Conditions, interruptible
- Both reentrant, same thread can re-acquire
- Object.wait()/notify()/notifyAll() inside synchronized vs ReentrantLock.newCondition().await()/signal()
- Java 21+ virtual threads + synchronized = pinning (use ReentrantLock instead for I/O-heavy code)
Interview Follow-up Questions
When is synchronized faster than ReentrantLock?
For uncontended fast path. JVM lock coarsening and escape analysis reduce uncontended sync to a few CPU instructions (biased locking did this aggressively pre-JDK 15, then was disabled by JEP 374). ReentrantLock has more bookkeeping. Under contention they're similar; in benchmarks, synchronized usually wins for short critical sections.
Why does virtual thread + synchronized 'pin' the carrier?
Virtual threads run on carrier threads. When a virtual thread enters a synchronized block, it can't unmount from its carrier (JVM internals). I/O inside the block holds the carrier. Workaround: use ReentrantLock for I/O-heavy critical sections on virtual threads.
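A sketch of the two ReentrantLock idioms worth knowing cold: unlock in finally, and tryLock with a timeout (class and field names are illustrative):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class Account {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;

    void deposit(long amount) {
        lock.lock();
        try {
            balance += amount;
        } finally {
            lock.unlock();        // runs even if the critical section throws
        }
    }

    boolean tryDeposit(long amount) throws InterruptedException {
        if (!lock.tryLock(50, TimeUnit.MILLISECONDS)) {
            return false;         // could not acquire in time: back off instead of blocking
        }
        try {
            balance += amount;
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```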
When to choose StampedLock?
Read-mostly workloads with very rare writes. The optimistic path skips ALL synchronization on read, fastest possible. But: not reentrant, manual stamp management, fall-back complexity. Generally niche; ReentrantReadWriteLock is more common.
Is synchronized(this) bad practice?
Sometimes. Any caller with a reference to the object can lock it externally: `synchronized(myObj) { ... }` from outside the class. For libraries, use a private final lock object instead. For internal classes, `synchronized(this)` is fine.
Gotchas
- ReentrantLock without try/finally → exception leaks the lock
- synchronized on String literals or autoboxed Integers → shared monitors, deadlocks across unrelated code
- Fairness in locks adds overhead, use only when measurement justifies it
- Reading a long/double without synchronization on 32-bit JVMs is NOT atomic
- Holding a lock during an external call (network, callback) → deadlock if callee tries to lock something held by caller
AbstractQueuedSynchronizer (AQS)
Section: Java Concurrency | Difficulty: Advanced | Frequency: Asked Often
AQS is the framework underneath every lock in java.util.concurrent. It provides a single int state variable, a CLH-style wait queue, and the park/unpark plumbing. Subclasses define what the state means and when acquire/release succeeds. ReentrantLock, Semaphore, CountDownLatch, ReadWriteLock, FutureTask, ThreadPoolExecutor's worker locks, and StampedLock are all AQS subclasses or close cousins.
Key Points
- One volatile int state field. Subclasses interpret it: 0/1 for ReentrantLock, hold count for reentrancy, available permits for Semaphore, count for CountDownLatch.
- FIFO wait queue (a CLH variant): each waiter is a Node with a Thread and a status. Spin briefly, then park.
- Subclasses override tryAcquire / tryRelease (exclusive) or tryAcquireShared / tryReleaseShared (shared). Everything else (queueing, parking, signalling) is provided.
- Supports interruptible, timed, and fair acquisition out of the box.
- ConditionObject (the Condition impl) lives on top of AQS too; await/signal use a separate condition queue that migrates nodes back to the main wait queue on signal.
- Java 21 added VirtualThread support: AQS parks via LockSupport, which knows how to unmount a virtual thread from its carrier.
Interview Follow-up Questions
Why one int instead of an object?
Atomic CAS on a single 32-bit word is hardware-cheap and fits in one cache line with the rest of the AQS header. A heavier representation (a HashMap, an Object reference) would cost an allocation per state change and a cache miss per access. The whole framework is designed around a single hot CAS path. Most synchronizers fit fine: 0/1 for a mutex, hold count for reentrancy, available permits for a semaphore, count for a latch. When 32 bits isn't enough, AbstractQueuedLongSynchronizer extends it to 64.
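A minimal non-reentrant mutex on AQS, roughly the shape the framework expects (illustrative; a production lock would also track the owner thread):

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

final class SimpleMutex {
    // state 0 = free, 1 = held. AQS supplies queueing, parking, and signalling;
    // the subclass only defines what acquire and release mean.
    private static final class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int unused) {
            return compareAndSetState(0, 1);   // succeed only if currently free
        }
        @Override
        protected boolean tryRelease(int unused) {
            setState(0);                       // volatile write: visible to the next acquirer
            return true;                       // a waiter (if any) will be unparked
        }
        @Override
        protected boolean isHeldExclusively() {
            return getState() == 1;
        }
    }

    private final Sync sync = new Sync();

    public void lock()   { sync.acquire(1); }   // tryAcquire, else enqueue and park
    public void unlock() { sync.release(1); }   // tryRelease, then unpark successor
}
```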
What is the CLH queue and why does AQS use a variant of it?
Craig, Landin, Hagersten 1993. A linked list of Nodes where each node spins on a flag in its predecessor. AQS modifies the original CLH: instead of pure spinning, nodes park after a brief spin attempt (the standard approach for non-real-time systems). The benefits: O(1) enqueue/dequeue, no allocation on the unlock path (the head moves, doesn't get freed), and it's easy to reason about FIFO fairness.
How does Condition.await fit into all this?
AQS exposes ConditionObject which has its own singly-linked condition queue. await() releases the AQS lock fully, adds the thread to the condition queue, and parks. signal() pulls a node from the condition queue and migrates it to the main AQS wait queue. When that node eventually reaches the head, it tries to re-acquire the lock with the original hold count. Spurious wakeups are possible because LockSupport.park can return early; the await loop checks the condition again.
Why don't ReentrantLock and synchronized share an implementation?
synchronized is built into the JVM; it's part of bytecode (monitorenter/monitorexit) and the runtime can apply lock coarsening and lock elision because it sees the bytecode. (Biased locking was disabled by default in JDK 15 and removed in JDK 18.) ReentrantLock is a library type built on AQS; the JVM has no special knowledge of it. Tradeoff: synchronized gets JVM optimizations but no tryLock with timeout, no lockInterruptibly, no fair mode, no Condition. ReentrantLock gets all of those at the cost of being just a normal class. Java 21-23 ReentrantLock works with virtual threads without pinning the carrier (synchronized used to pin); JEP 491 in Java 24 fixed the synchronized-pinning case too.
Has AQS changed for virtual threads?
Yes. The parking primitive is LockSupport.park(blocker), which the JVM intercepts for virtual threads: instead of parking the carrier, it unmounts the virtual thread and frees the carrier to run another. This means ReentrantLock and friends never pinned a carrier under virtual threads, whereas synchronized did until JEP 491 (Java 24) reworked monitor support for Loom.
Gotchas
- Forgetting to make state writes happen via setState/compareAndSetState. They are volatile-safe; direct writes to other fields are not.
- In tryRelease, only return true when the lock is fully released (state == 0 for a reentrant mutex). Returning true early wakes a waiter that will then fail tryAcquire and re-park, wasting a context switch.
- Mixing exclusive and shared modes in the same synchronizer is possible but tricky (see ReentrantReadWriteLock). Don't try without first reading the source.
- Signal-on-Condition without holding the lock throws IllegalMonitorStateException. ConditionObject checks the AQS owner.
volatile in Java
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
volatile gives a single field two guarantees: every read sees the latest write from any thread (visibility), and reads and writes cannot be reordered across the access (ordering). It does not make compound operations like value++ atomic. Use it for status flags, write-once references, and the published reference in double-checked locking.
Key Points
- Visibility: a write to a volatile field is immediately visible to any thread that reads it.
- Ordering: writes that happened before the volatile write are visible to a reader that observes it. Reads after the volatile read see those writes too.
- NOT atomic for compound operations. value++ is read-modify-write, not one operation.
- Long and double on 32-bit JVMs are atomic only when declared volatile.
- For atomicity, use AtomicInteger / AtomicLong / AtomicReference. When both are needed, those classes already provide volatile semantics.
Interview Follow-up Questions
Is volatile a substitute for synchronized?
Only for single-field invariants. Volatile provides visibility and ordering on one field. Synchronized provides visibility, ordering, mutual exclusion, AND allows reasoning about multiple fields together. When the invariant spans more than one variable, use a lock or atomic.
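The canonical single-field case as a sketch: a stop flag written by one thread and read by another.

```java
class Poller implements Runnable {
    private volatile boolean running = true;   // visibility guaranteed across threads

    @Override
    public void run() {
        while (running) {                      // always observes the latest write
            pollOnce();
        }
    }

    void shutdown() {                          // called from another thread
        running = false;
    }

    private void pollOnce() { /* ... */ }
}
```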
What is the cost of volatile?
Per-access cost: a memory barrier on each write and a load barrier on each read. On modern CPUs that is on the order of tens of nanoseconds. Cheap for status flags. Measurable at millions of reads per second on the same field. For hot counters, use LongAdder, which is internally sharded.
Does volatile help with long and double on 32-bit JVMs?
Yes. The JLS only guarantees atomicity of 32-bit reads and writes. 64-bit long and double can be torn (a read picks up half the high word and half the low word). Declaring them volatile forces atomicity. On 64-bit JVMs in practice, HotSpot makes them atomic regardless, but the spec only requires it under volatile.
When prefer VarHandle over volatile?
When the same visibility/ordering is needed but with finer control over which barrier (acquire-only, release-only, opaque). Used in performance-critical lock-free code in JDK internals. Application code almost never needs it.
Gotchas
- volatile does not lock; multiple writers can still race on the same field
- volatile array reference is volatile, but reading array[i] is NOT (the element access is plain). Use AtomicReferenceArray for per-element ordering
- Long and double need volatile to be atomic on 32-bit JVMs
- Compound updates (value++, value = compute(value)) are not atomic under volatile alone
- synchronized provides everything volatile does, plus mutual exclusion. Do not stack them on the same field
Java Memory Model & Happens-Before
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
The JMM defines what reads can observe what writes across threads. The happens-before relation is the rule set, if action A happens-before action B, then A's effects are visible to B.
Key Points
- Without happens-before, the JVM/CPU can reorder reads/writes for performance
- volatile guarantees: visibility (latest value) + ordering (no reordering across the access)
- synchronized guarantees: mutual exclusion + happens-before on lock release → next acquire
- final fields: safely visible after constructor completes, no synchronization needed
- Thread.start() happens-before any action in the started thread
- Action in a thread happens-before another thread's successful join() return
Interview Follow-up Questions
What does volatile guarantee that a regular field doesn't?
Two things: (1) visibility, every read sees the latest write from any thread; (2) ordering, reads/writes are not reordered across the volatile access (acquire/release semantics).
Why doesn't volatile make ++ atomic?
++ is read-modify-write. volatile makes the read and write each atomic, but two threads can interleave and lose updates. Use AtomicInteger.incrementAndGet().
What is happens-before?
A partial ordering on memory operations. If A happens-before B, everything visible to A is visible to B. Established by volatile, synchronized, Thread.start/join, final field freeze, etc.
Are final fields thread-safe without synchronization?
Yes, the JMM guarantees any thread that observes the constructed object sees correctly initialized final fields, even without synchronization.
Does Python have a memory model like Java?
No formal one. CPython piggybacks on the GIL for atomicity of basic ops, but this is implementation-specific. Always synchronize explicitly, Lock, Event, Queue.
Gotchas
- Reading a volatile array reference is volatile, but reading array[i] is NOT, use AtomicReferenceArray
- synchronized on a String literal or autoboxed Boolean can deadlock unrelated code (shared instances)
- Constructors that publish 'this' before completion break the final-field guarantee
- long and double are NOT atomic on 32-bit JVMs without volatile
- Python CPython's atomic-bytecode behavior is not portable to PyPy or free-threaded CPython
Java Atomics: AtomicInteger, AtomicReference, LongAdder
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
java.util.concurrent.atomic gives lock-free single-variable updates: AtomicInteger / AtomicLong / AtomicReference for counters and references, LongAdder for high-contention counters, AtomicReferenceArray for per-element ordering. All built on hardware compare-and-swap.
Key Points
- All atomic classes provide volatile-style visibility plus atomic compound operations like incrementAndGet, compareAndSet, accumulateAndGet.
- Use AtomicLong for counters with low contention. Use LongAdder when many threads write the same counter and reads are rare.
- compareAndSet is the building block. Most other methods are CAS retry loops.
- AtomicReference.compareAndSet makes the reference swap atomic, but says nothing about the object behind the reference: reading the reference and then reading or mutating that object's fields does not interlock with concurrent writers.
- AtomicStampedReference adds a version counter to defeat the ABA problem.
Interview Follow-up Questions
AtomicLong vs LongAdder, how to choose?
LongAdder wins for write-heavy counters that are read infrequently, like metrics scraped every 10 seconds. AtomicLong wins for counters read on every request (header injection, rate limits) because get() is O(1). Benchmark the workload, but the rule of thumb is: many writers + rare reads = LongAdder.
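A sketch of both shapes in one class (names illustrative): LongAdder for the write-heavy counter, a CAS retry loop on AtomicLong for a running maximum.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class RequestStats {
    private final LongAdder requests = new LongAdder();       // many writers, scraped rarely
    private final AtomicLong maxLatencyNanos = new AtomicLong();

    void record(long latencyNanos) {
        requests.increment();
        long cur;
        do {
            cur = maxLatencyNanos.get();
            if (latencyNanos <= cur) return;                   // nothing to update
        } while (!maxLatencyNanos.compareAndSet(cur, latencyNanos)); // retry on lost race
    }

    long totalRequests() { return requests.sum(); }
    long maxLatency()    { return maxLatencyNanos.get(); }
}
```

The explicit loop shows the building block; in practice accumulateAndGet(latencyNanos, Math::max) packages the same retry.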
Is compareAndSet really lock-free?
It compiles to a single hardware CAS instruction (lock cmpxchg on x86, CAS or LDREX/STREX on ARM). One CPU instruction, no thread parking, no context switch. The retry loop on failure is busy-wait, but each iteration is fast and bounded by other threads' progress.
What is VarHandle and when is it needed?
VarHandle (Java 9+) gives explicit acquire / release / opaque / volatile semantics on a field, instead of the fixed volatile semantics of Atomic classes. Used by JDK internals and high-performance lock-free code. Application code almost never needs it: AtomicInteger / AtomicReference cover 99% of cases.
Why is AtomicReference not enough for safe lazy init?
AtomicReference publication is safe (set and compareAndSet have volatile semantics), so the reordering problem of classic broken double-checked locking does not apply. The real issue is duplicate work: several threads can each construct the object and race to publish it, with all but one result thrown away, which is wasteful or wrong when construction is expensive or has side effects. Use a CAS-based double-check to tolerate that, or just use the static-holder idiom, which is simpler.
Gotchas
- AtomicLong.get() then .set() is two operations, not one. Use accumulateAndGet or compareAndSet.
- AtomicReferenceArray protects per-element ordering; a volatile array reference does NOT.
- Under heavy contention, CAS retry loops can spin a lot. Move to LongAdder or sharded counters.
- Compound operations across multiple atomics need a lock; atomics are single-variable only.
- lazySet on Atomic* is a release store without the full fence: cheaper, but with weaker ordering and visibility guarantees than set(). Use only when the implications are well understood.
ThreadLocal & Inheritable State
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Sometimes
ThreadLocal stores per-thread values, every thread sees its own copy. Used for: request context, SimpleDateFormat (not thread-safe), DB connections per thread. Pitfalls: leaks in thread pools, doesn't inherit by default, virtual threads explode memory.
Key Points
- Each thread has its own value, isolated, no synchronization needed
- Common uses: request-id for logging, per-thread DB connections, SimpleDateFormat
- InheritableThreadLocal copies parent's value to child threads
- Pitfall: thread pool reuses threads, ThreadLocal values persist across requests → leaks
- ScopedValue is the modern replacement (preview in 21, finalised in Java 25): bounded scope, no leaks
Interview Follow-up Questions
Why is ThreadLocal a memory leak risk?
In a thread pool, the thread isn't destroyed between tasks. ThreadLocal values stay attached. Without `remove()`, the value persists until the thread eventually dies (which may be never for long-lived pools). Worse: the value can hold onto large objects (entire user contexts), causing slow OOM.
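The pool-safe pattern as a sketch (names illustrative): set at task start, remove in finally.

```java
final class RequestContext {
    private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    static void handle(String requestId, Runnable work) {
        REQUEST_ID.set(requestId);
        try {
            work.run();                    // anything on this thread can read the id
        } finally {
            REQUEST_ID.remove();           // not set(null): remove drops the map entry
        }
    }

    static String currentRequestId() {
        return REQUEST_ID.get();
    }
}
```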
ThreadLocal vs InheritableThreadLocal vs ScopedValue?
ThreadLocal: per-thread, manual cleanup, no inheritance. InheritableThreadLocal: per-thread, copies on creation (but only at thread creation, not on async dispatch). ScopedValue (preview in Java 21, finalised in Java 25): lexically scoped, auto-cleared, works with virtual threads, no leak risk. Use ScopedValue when on Java 25+; ThreadLocal for legacy code.
Why are ThreadLocals problematic with virtual threads?
Each virtual thread gets its own ThreadLocal map. With 1M virtual threads + 50 ThreadLocals + each holds 100 bytes → 5GB of memory just for ThreadLocal storage. Java's ScopedValue is designed for the virtual-thread era, bounded, lexically scoped, much less memory.
Where do logging frameworks use ThreadLocal?
MDC (Mapped Diagnostic Context) in SLF4J/Logback/Log4j is a ThreadLocal map. Set request ID in middleware → log statements automatically include it. Same pattern; same leak risk if middleware doesn't clear.
Gotchas
- ThreadLocal in a pool without remove() in finally → leak
- InheritableThreadLocal copies AT CREATION, async dispatch doesn't propagate (need explicit handoff)
- Virtual threads + many ThreadLocals → memory explosion (use ScopedValue)
- ThreadLocal.set(null) is NOT the same as remove(), null entry stays in the map
- MDC in async chains often loses the value, middleware must propagate or wrap
ExecutorService & Thread Pools
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
ExecutorService manages a pool of threads + a work queue. Use ThreadPoolExecutor with bounded queue + named threads + sensible rejection policy in production. newFixedThreadPool/newCachedThreadPool have unbounded queues, production hazards.
Key Points
- ThreadPoolExecutor (the actual class), full control
- Executors.newXxx, convenience but with footguns (unbounded queues)
- Always: bounded queue, named threads, explicit rejection policy
- shutdown() vs shutdownNow(), graceful vs forceful
- Java 21+ virtual thread executor, request-per-thread reborn
Interview Follow-up Questions
Why is newFixedThreadPool dangerous in production?
It uses an UNBOUNDED LinkedBlockingQueue. Burst load → queue grows without bound → OOM. Always use ThreadPoolExecutor explicitly with ArrayBlockingQueue(N) or another bounded queue.
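A sketch of the production-shaped constructor call (sizes are placeholders to tune):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class Pools {
    static ThreadPoolExecutor newWorkerPool() {
        AtomicInteger n = new AtomicInteger();
        return new ThreadPoolExecutor(
                8, 8,                               // fixed size: core == max
                60, TimeUnit.SECONDS,               // idle keep-alive (moot when core == max)
                new ArrayBlockingQueue<>(1_000),    // bounded queue: bursts beyond this hit the policy
                r -> new Thread(r, "worker-" + n.incrementAndGet()),   // named threads
                new ThreadPoolExecutor.CallerRunsPolicy());            // submitter runs the task: backpressure
    }
}
```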
What's the right rejection policy?
Default (AbortPolicy) throws RejectedExecutionException. CallerRunsPolicy makes the submitter run the task, natural backpressure. DiscardPolicy silently drops. DiscardOldestPolicy drops the oldest queued. CallerRuns is the most-used in production.
How to size a thread pool?
CPU-bound: pool = NumCPU. I/O-bound: Goetz's formula, cores × (1 + waitTime/computeTime). For Java 21 virtual threads: don't size at all; use newVirtualThreadPerTaskExecutor with semaphores for downstream-protection.
Is it necessary to shutdown the executor?
Yes. Without shutdown, the JVM doesn't exit because non-daemon worker threads stay alive. Always call shutdown() (or use try-with-resources in Java 19+). For tests, shutdown in @AfterEach to avoid thread leaks.
Gotchas
- Executors.newFixedThreadPool: unbounded queue → potential OOM
- Executors.newCachedThreadPool: unbounded thread count → potential OS thread exhaustion
- Forgetting to call shutdown() → JVM doesn't exit
- Future.get() without timeout → can wait forever if task hangs
- Worker thread throwing uncaught exception → killed silently; pool spawns replacement; bug invisible without UncaughtExceptionHandler
CompletableFuture & Async Composition
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
CompletableFuture composes async operations into pipelines: thenApply (transform), thenCompose (chain), allOf/anyOf (combine), exceptionally (recover), orTimeout (deadline). The Java equivalent of JS Promise chains, with explicit executor binding.
Key Points
- supplyAsync: launch async computation; returns CompletableFuture<T>
- thenApply: transform result (sync). thenCompose: chain another async operation
- allOf: wait for all. anyOf: wait for first.
- Always specify the Executor, default ForkJoinPool.commonPool() is shared
- exceptionally / handle for error recovery; orTimeout for deadlines
Interview Follow-up Questions
thenApply vs thenCompose, when?
thenApply for SYNC transforms. thenCompose when the next step itself returns a CompletableFuture (async). thenCompose is JavaScript's Promise.then semantics; thenApply is map. If the function `T → U` then thenApply; if `T → CF<U>` then thenCompose.
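A self-contained sketch of the distinction (the two-step pipeline and the names are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PipelineDemo {
    public static void main(String[] args) {
        ExecutorService ioPool = Executors.newFixedThreadPool(4);   // never rely on commonPool

        CompletableFuture<String> report =
                CompletableFuture.supplyAsync(() -> "user-42", ioPool)       // CF<String>
                        .thenCompose(user -> loadOrders(user, ioPool))       // async step: flattened, not nested
                        .thenApply(orders -> "report: " + orders)            // sync transform
                        .orTimeout(2, TimeUnit.SECONDS)                      // deadline for the whole chain
                        .exceptionally(ex -> "fallback report");             // recovery

        System.out.println(report.join());
        ioPool.shutdown();
    }

    // The next step itself returns a CompletableFuture -> thenCompose, not thenApply.
    static CompletableFuture<String> loadOrders(String user, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> user + ": 3 orders", pool);
    }
}
```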
Does CompletableFuture have a stack trace?
By default, no, async work runs on a worker thread, original call stack is lost. Use whenComplete with logging, or capture stack at submit time. Or use Project Reactor / RxJava which preserve more context. Java's structured concurrency (Java 21+) helps too.
Why specify the Executor in every async method?
Without it, async tasks run on ForkJoinPool.commonPool(), shared across the whole JVM. Long-running tasks starve other commonPool users (parallel streams, etc.). Always pass a dedicated executor.
Can a CompletableFuture be cancelled?
Sort of. cancel(true) marks the future as cancelled with CancellationException. But it does NOT interrupt the running thread, that's a persistent gotcha. Use the CompletableFuture chain to model cancellation explicitly (a separate cancelled future + check).
Gotchas
- Default executor is ForkJoinPool.commonPool(), shared, easily exhausted
- thenApply where thenCompose was needed → CF<CF<T>> (nested future)
- cancel(true) doesn't interrupt the running thread
- exceptionally hides the exception unless logged; whenComplete is better for visibility
- join() can hang if no timeout, always pair with orTimeout for production
ConcurrentHashMap
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Java's general-purpose concurrent map. Lock-free reads, CAS-based and per-bin locking on writes. Use computeIfAbsent and merge for atomic compound operations. Far better than synchronized HashMap or Collections.synchronizedMap for everything except trivial workloads.
Key Points
- Reads are completely lock-free. Writes lock per-bin (the linked list at one hash bucket).
- Java 8+ replaced segment locks with per-bin synchronisation + CAS. Resize is incremental and cooperative: threads that hit a resizing map help transfer bins.
- containsKey + put is racy. Use putIfAbsent or computeIfAbsent.
- size() is approximate under concurrent mutation. mappingCount() returns long for huge maps.
- Iterators are weakly consistent: they reflect the state at some point during iteration, not a snapshot.
Interview Follow-up Questions
How is CHM different from Collections.synchronizedMap?
synchronizedMap wraps every method in a single object lock. Two threads calling get serialise. CHM has lock-free reads and per-bin write locking, so concurrent reads and writes to different keys do not interfere. CHM is dramatically faster under contention and the only choice for serious workloads.
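A sketch of the atomic compound-update idioms that replace racy check-then-act (names illustrative):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class WordStats {
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, List<Integer>> positions = new ConcurrentHashMap<>();

    void record(String word, int position) {
        counts.merge(word, 1L, Long::sum);                        // atomic upsert of a counter
        positions.computeIfAbsent(word, k -> new CopyOnWriteArrayList<>())
                 .add(position);                                  // bin is locked only while creating the list
    }
}
```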
Why is size() approximate?
Under concurrent mutation, the count is maintained across multiple per-bin counters (so writes do not all contend on one). On a size() call, CHM sums them in a snapshot that may not reflect concurrent updates that happened during the sum. For most uses (capacity hints, metrics) the approximation is fine. For exact count, freeze the map first.
Can CHM be used as a concurrent set?
Yes. Either use ConcurrentHashMap.newKeySet() (returns Set view), or use the map with values that get ignored (Boolean.TRUE). The newKeySet variant is cleaner and uses ConcurrentHashMap.KeySetView under the hood.
What is wrong with iterating a CHM?
Iterators are weakly consistent: they will not throw ConcurrentModificationException, but they may or may not reflect mutations made during iteration. They reflect the state at some point during the iteration, not a snapshot. Acceptable for monitoring, not for transactional reads.
Gotchas
- computeIfAbsent function should be fast and side-effect-free. While it runs, the bin is locked.
- Recursive update inside computeIfAbsent (calling map.put on the same map) can deadlock or throw.
- size() sums striped counter cells and is approximate under concurrent mutation; cache it when called on a hot path.
- Default initial capacity is 16, default concurrency level was historically 16. In Java 8+ concurrencyLevel is just a hint.
- Null keys and null values are forbidden. Throws NullPointerException.
BlockingQueue Family
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
BlockingQueue is the Java interface for thread-safe producer-consumer queues. Pick by overflow behaviour and ordering: ArrayBlockingQueue (bounded), LinkedBlockingQueue (default unbounded, dangerous), PriorityBlockingQueue (ordered), SynchronousQueue (no buffer, hand-off), DelayQueue (scheduled), TransferQueue (sender knows when received).
Key Points
- put blocks if full, take blocks if empty. offer/poll have non-blocking and timed variants.
- ArrayBlockingQueue is the default safe choice for production: bounded, simple, predictable.
- LinkedBlockingQueue defaults to UNBOUNDED. Always pass a capacity for production.
- SynchronousQueue has zero buffer: producer waits for consumer to take directly. Used by Executors.newCachedThreadPool.
- All implementations are fully thread-safe: put/take/poll/offer can be called from any thread.
Interview Follow-up Questions
Why is LinkedBlockingQueue dangerous by default?
Its no-arg constructor uses Integer.MAX_VALUE as capacity, which is effectively unbounded. Under burst load, the queue grows without bound and the JVM OOMs. Always pass an explicit capacity. When in doubt about which bounded queue to pick, use ArrayBlockingQueue.
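A bounded producer-consumer sketch (sizes and the poison-pill shutdown are illustrative choices):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    private static final String POISON = "__STOP__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);   // always bounded

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String item = queue.take();        // blocks when empty
                    if (item.equals(POISON)) return;
                    System.out.println("processed " + item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // restore the flag, then exit
            }
        }, "consumer");
        consumer.start();

        for (int i = 0; i < 1_000; i++) {
            queue.put("job-" + i);                     // blocks when full: natural backpressure
        }
        queue.put(POISON);
        consumer.join();
    }
}
```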
ArrayBlockingQueue vs LinkedBlockingQueue: which is faster?
Depends on the workload. ArrayBlockingQueue uses one lock for put and take; LinkedBlockingQueue uses two (head and tail), so it can have one producer and one consumer not blocking each other. For low contention, ArrayBlockingQueue's simpler design wins. For very high contention with separate producer and consumer threads, LinkedBlockingQueue can win.
When to use SynchronousQueue?
When zero buffering is required: every produced item must be handed off to a waiting consumer. Used as the work queue in newCachedThreadPool to force thread creation per task. Also used in some hand-off algorithms where buffering would mask a problem.
What is TransferQueue and why does it matter?
TransferQueue extends BlockingQueue with a transfer method: the producer not only puts the value, it also waits for a consumer to take it. Useful for request-response patterns where the producer needs confirmation that the consumer received the work.
Gotchas
- LinkedBlockingQueue with no arg = unbounded = OOM risk under burst
- Mixing add/remove (throw) with put/take (block) leads to inconsistent error handling
- PriorityBlockingQueue is unbounded; use a separate semaphore for bounded priority work
- Iterator on a BlockingQueue is weakly consistent, not a snapshot
- Swallowing InterruptedException in put/take loses the cancellation request; restore the interrupt flag or propagate the exception
Sync Utilities: Latch, Barrier, Semaphore, Phaser, Exchanger
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Five coordination primitives in java.util.concurrent. CountDownLatch: one-shot start signal. CyclicBarrier: rendezvous N threads, reusable. Semaphore: limit concurrent access to N permits. Phaser: dynamic-party barrier. Exchanger: hand off a value between two threads.
Key Points
- CountDownLatch is one-shot. Once it hits zero, it stays zero. Cannot reset.
- CyclicBarrier is reusable. After N threads arrive, it triggers an optional action and resets.
- Semaphore.acquire() blocks for a permit. Use it for connection pools, rate limiters, bounded concurrent execution.
- Phaser is the flexible cousin of CyclicBarrier: parties can register and deregister mid-flight.
- Exchanger pairs exactly two threads. Each calls exchange(value) and gets the other's value.
Interview Follow-up Questions
When to use CountDownLatch vs CyclicBarrier?
CountDownLatch when the count and the waiters are different sets of threads. One thread (or N) counts down; M threads wait. Used for one-shot start gates and 'wait for all workers to finish'. CyclicBarrier when the same N threads need to rendezvous repeatedly. Used for iterative parallel algorithms.
What does 'fair' mean for Semaphore?
Fair = FIFO order. Threads acquire in the order they called acquire(). Unfair = any waiting thread can win. Fair is slower (more scheduling, more context switches) but prevents starvation. For low-contention pools the cost is invisible; for high-contention rate limiters fair is the right default.
Phaser vs CyclicBarrier?
Phaser is a superset. CyclicBarrier requires a fixed party count at construction. Phaser lets parties register and deregister mid-flight. Phaser also tracks phase number, so threads can ask 'which phase are we on'. Use CyclicBarrier when count is fixed; reach for Phaser when it is not.
Are these primitives ever the wrong tool?
Yes, often. To 'wait for N futures to complete', CompletableFuture.allOf is cleaner than a CountDownLatch. For a producer-consumer queue, BlockingQueue is cleaner than rolling one with Semaphore. These primitives are foundations; reach for higher-level utilities first when they fit.
Gotchas
- CountDownLatch cannot be reset. For a reusable variant, use CyclicBarrier or Phaser.
- Forgetting to release a Semaphore permit (no try/finally) leaks permits permanently
- CyclicBarrier with a barrier action: that action throws? All waiting threads get BrokenBarrierException.
- Unfair Semaphore can starve a thread under sustained contention.
- Phaser deregistration is one-shot; deregistering twice from one thread will desync the count.
ReadWriteLock & StampedLock
Section: Java Concurrency | Difficulty: Advanced | Frequency: Sometimes
ReentrantReadWriteLock allows many concurrent readers OR one writer. StampedLock adds an optimistic-read mode that does not block writers. Use ReadWriteLock when reads dominate writes by 10x or more. Use StampedLock when reads vastly dominate and a retry is tolerable. Both are sharper than they look.
Key Points
- ReentrantReadWriteLock: many readers OR one writer at a time. Readers do not block readers.
- Read locks are reentrant. Write locks are reentrant. Holding write lock and acquiring read = OK. Holding read lock and acquiring write = DEADLOCK on the same thread.
- StampedLock has optimistic read: no lock acquisition, validate after the read. Falls back to read lock if validation fails.
- StampedLock is NOT reentrant. Calling readLock twice on the same thread deadlocks.
- Both are heavier than synchronized for low-contention or write-dominant code. Measure.
Interview Follow-up Questions
When is ReadWriteLock actually faster than synchronized?
When reads dominate by 10x or more AND the read holds the lock long enough that contention matters. For very short reads, the cache-line bouncing from incrementing the reader count can make ReadWriteLock slower than just synchronized. Profile both.
When to use StampedLock?
Read-heavy code where reads are very cheap (a few field reads) and writes are rare. Geometry caches, configuration that updates seconds apart, lookup tables. Avoid for code that needs reentrancy or for reads with side effects (the optimistic read can run twice on validation failure).
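The optimistic-read pattern as a sketch, essentially the shape the JDK documentation recommends:

```java
import java.util.concurrent.locks.StampedLock;

class Point {
    private final StampedLock lock = new StampedLock();
    private double x, y;

    void move(double dx, double dy) {
        long stamp = lock.writeLock();
        try {
            x += dx;
            y += dy;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    double distanceFromOrigin() {
        long stamp = lock.tryOptimisticRead();   // no blocking, no lock acquisition on the fast path
        double curX = x, curY = y;               // reads must be cheap and side-effect-free
        if (!lock.validate(stamp)) {             // a writer slipped in: retry under a real read lock
            stamp = lock.readLock();
            try {
                curX = x;
                curY = y;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return Math.hypot(curX, curY);
    }
}
```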
Why is StampedLock not reentrant?
Reentrancy requires per-thread bookkeeping that adds cost. StampedLock's design pushed that cost out to keep the fast path tight. The trade is that calling readLock or writeLock on a lock the same thread already holds will self-deadlock.
Can readers starve writers?
With unfair ReadWriteLock, yes. New readers can keep arriving while a writer waits, and the writer never gets in. Use the fair constructor (true) to prevent that. StampedLock does not have a fairness option; in practice it is biased toward writer fairness.
Gotchas
- Read lock cannot be upgraded to write lock on ReentrantReadWriteLock. Drop and re-acquire.
- StampedLock readers must be side-effect-free and cheap. Optimistic read may execute and then re-execute.
- StampedLock is NOT reentrant. Holding read or write and re-acquiring the same hangs.
- Unfair ReadWriteLock can starve writers under sustained read load.
- For very short critical sections, plain synchronized often beats both.
ForkJoinPool & Parallel Streams
Section: Java Concurrency | Difficulty: Intermediate | Frequency: Sometimes
ForkJoinPool is Java's work-stealing thread pool optimised for divide-and-conquer recursion. RecursiveTask / RecursiveAction split work, fork() schedules, join() waits. Parallel streams run on the common pool. Sharp tool: useful for CPU-bound recursive work, terrible for blocking I/O.
Key Points
- Work stealing: each worker has its own deque; when idle, it steals from another worker's deque tail.
- Designed for divide-and-conquer with no blocking. Tasks split until small, compute, combine.
- Common pool is shared across the JVM. Parallel streams use it. Blocking inside it starves everyone.
- Default parallelism = available cores - 1. Override with -Djava.util.concurrent.ForkJoinPool.common.parallelism.
- For CPU-bound recursive work, fork-join can outperform a fixed thread pool. For I/O, use a dedicated executor.
Interview Follow-up Questions
Why is the common pool shared across the JVM?
Performance and simplicity. Allocating a new ForkJoinPool per parallel stream call would create many pools and many sets of threads, all contending for the same cores. The common pool centralises that. The downside is that every user of parallel streams shares it, so one badly-behaved task affects everyone.
When to avoid parallel streams?
For pipelines where the per-element cost is small (the parallelisation overhead dominates), for I/O-bound work (the pool starves), and for ordered pipelines where preserving order matters (parallel streams have order-preservation modes but they cost). Default to sequential streams; reach for parallel only when the elements are expensive and CPU-bound.
What does work stealing buy?
Load balancing for free. With a fixed-size pool and a global queue, all workers contend on the queue head. With per-worker deques, contention only happens when a worker steals (rare on balanced workloads). Work stealing turns out to be the right scheduler for divide-and-conquer where subtask sizes are uneven and unpredictable.
How big should THRESHOLD be?
Big enough that the per-task overhead (fork, queue, join) is small relative to the leaf work, but small enough to leave many leaves to balance across cores. Rule of thumb: aim for 10x to 100x the number of cores in total tasks. Profile to tune.
Gotchas
- Blocking inside the common pool starves every other parallel stream caller
- Recursive tasks that share mutable state defeat the model. Compose pure functions.
- Parallel stream parallelism = commonPool size, NOT number of stream sources
- fork() then immediately join() on the same task is just compute(); fork-join shines when one side runs inline and the other forks
- Setting common.parallelism too high (>cores) wastes context switches; too low (=1) defeats the point
Structured Concurrency (Java 21+)
Section: Java Concurrency | Difficulty: Advanced | Frequency: Sometimes
StructuredTaskScope (preview since Java 21, redesigned around Joiners in Java 25's fifth preview, JEP 505) ties subtasks to a parent scope. The scope owns lifetimes: when it exits, all subtasks are cancelled or joined. Replaces the unstructured spawn-and-forget pattern of plain ExecutorService for fan-out work. Pairs naturally with virtual threads.
Key Points
- Subtasks are scoped to a try-with-resources block. Scope close = all subtasks cancelled or joined.
- Fork-join is structured: parent waits for children, children inherit deadlines and cancellation.
- anySuccessful joiner: first success cancels siblings. Used for hedged calls, fastest-replica reads.
- allSuccessfulOrThrow: all must succeed or all are cancelled. Used for fan-out where every result matters.
- Pairs naturally with virtual threads: cheap to spawn 1000 subtasks, scope cleans up reliably.
Interview Follow-up Questions
Why is this called 'structured' concurrency?
Concurrent control flow follows the same lexical structure as sequential control flow. Subtasks live within the lexical block of their scope; when the block exits, they are guaranteed done or cancelled. Like try-with-resources for concurrency. The opposite of unstructured ExecutorService.submit, where the future outlives the calling code and orphans are easy.
How does this interact with virtual threads?
Beautifully. Virtual threads make spawning a subtask cheap (microseconds, no OS thread). StructuredTaskScope by default spawns each subtask on its own virtual thread. The pair turns 'fan out 100 calls and wait for all' from a careful executor configuration into a 5-line block.
What about exceptions?
With allSuccessfulOrThrow, the first failure cancels remaining subtasks and join() throws the failure. With anySuccessfulOrThrow, all-failed throws an exception aggregating all of them. The scope handles both the propagation and the cancellation; no per-fork try/catch is needed.
Is this stable yet?
Not yet final. JEP 505 ships it as a fifth preview in Java 25 (2025), with the API redesigned around Joiners; earlier Java versions had it as a preview with a slightly different shape. For new code on Java 25+ (preview features enabled), prefer StructuredTaskScope over plain ExecutorService for any fan-out that waits for the results before returning.
Gotchas
- scope.join() must be called before reading any subtask result. Otherwise IllegalStateException.
- Subtask.get() before join() also throws. Order matters.
- Forgetting try-with-resources defeats the structuring. Always use the resource block.
- Joiners are typed by result type. Mixing forks with different return types complicates the joiner.
- Subtasks inherit the parent's interrupt status. Cancel propagates downward through nested scopes.
JMH: Benchmarking Concurrent Java
Section: Java Concurrency | Difficulty: Advanced | Frequency: Sometimes
JMH (Java Microbenchmark Harness) is the only Java microbenchmarking tool worth trusting. Hand-rolled timing loops give meaningless numbers because the JIT optimises away the work, dead-code-eliminates results, and warm-up effects dominate. JMH manages all of that and produces statistically sound measurements.
Key Points
- JIT warmup matters: cold code runs interpreted, warm code is fully JIT-compiled. Measure only warm.
- Dead code elimination silently deletes unconsumed results. Use Blackhole or return values.
- Loop unrolling and constant folding can vaporise simple benchmarks. Inputs must come from @State.
- @Threads(N) measures contention. Per-thread state vs shared state matters.
- Reported numbers are op/sec or ns/op with confidence intervals. Trust the CIs.
Interview Follow-up Questions
Why not just use System.nanoTime() in a loop?
Three reasons. JIT warmup: the first thousand iterations run interpreted or partially compiled, giving wildly different timings than the steady state. Dead code elimination: if the result is unused, the JIT may remove the work entirely. Loop unrolling and constant folding: simple inputs let the JIT precompute, so the measurement reflects a literal-load instead of compute. JMH addresses all three.
How many forks to use?
Start with 2-3. JVM-to-JVM variance comes from GC layout, code cache layout, JIT decisions. Multiple forks let the harness see that variance and report it in the confidence intervals. For publish-quality numbers, 5-10 forks. For day-to-day work, 2 is fine.
Does JMH measure contention correctly?
Yes, with @Threads(N). The N benchmark threads run the same @Benchmark method concurrently. Pair with @State(Scope.Benchmark) for shared state to measure contention. JMH also has @Group for asymmetric workloads (some threads do A, others do B).
What is the relationship between JMH and async-profiler?
JMH reports how fast something is. Async-profiler reports where the time goes. Both are essential. JMH-only gives a number with no path to improvement. Profiler-only gives a flame graph but no measurement of impact. Use them together: JMH to compare, profiler to explain the comparison.
Gotchas
- Hand-rolled timing loops are almost always wrong. Use JMH.
- Returning the result is the easy way to defeat DCE; use Blackhole when a return is not possible
- @State scope decides shared vs per-thread; getting it wrong measures the wrong thing
- Single-fork results are noisy; multiple forks expose JVM-to-JVM variance
- Running benchmarks from the IDE is misleading; build the benchmark JAR and run it from the CLI
The Python GIL, What It Is and Why It Matters
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
The Global Interpreter Lock is a mutex that serialises execution of Python bytecode in CPython. It makes single-threaded code fast and simple, but means threads cannot run Python code in parallel, even on a 64-core machine. Reach for multiprocessing for CPU-bound work, threading or asyncio for I/O.
Key Points
- The GIL is a CPython implementation detail, not part of the language spec
- Threads release the GIL around blocking I/O calls, so threading IS useful for I/O work
- Threads do NOT help CPU-bound code in CPython, they serialise via the GIL
- GIL switches between threads roughly every 5ms (sys.setswitchinterval)
- Jython and IronPython have no GIL (PyPy does have one); CPython 3.13+ has experimental free-threaded mode (PEP 703)
- Even with the GIL, compound operations like += are NOT atomic; the GIL can switch threads between bytecodes
Interview Follow-up Questions
Why does the GIL exist?
Two reasons: (1) it makes the CPython interpreter implementation simple, reference counting and other internals don't need fine-grained locks; (2) it makes single-threaded code fast and most Python code is single-threaded. Removing the GIL has historically slowed single-threaded performance, that's the tradeoff free-threaded CPython is now navigating.
If the GIL serialises Python bytecode, why is += not atomic?
The GIL is released between bytecode operations to let other threads run. += is multiple bytecodes (LOAD, ADD, STORE). Two threads can read the same value, both add 1, both write, losing one update. Use threading.Lock or queue.Queue to make compound operations atomic.
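A minimal sketch of the fix (all names hypothetical): guard the compound update with a threading.Lock. Removing the `with lock:` line makes the final count unreliable, depending on CPython version and timing.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n: int) -> None:
    global counter
    for _ in range(n):
        # Without the lock this read-modify-write spans several bytecodes and
        # can lose updates when the GIL switches threads mid-sequence.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; possibly less without it
```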
Does the GIL apply to NumPy / Pandas / TensorFlow operations?
No, these libraries release the GIL inside their C/Fortran code. NumPy operations on large arrays run in parallel across threads. The GIL only affects pure Python bytecode. This is why scientific Python is fast despite the GIL.
Will the GIL be removed?
PEP 703 (accepted 2023) introduces free-threaded CPython, opt-in starting 3.13. The rollout is gradual: experimental first, then officially supported, then eventually the default. Timeline is years. Until then, the GIL is here.
Why does asyncio not have GIL problems?
asyncio is single-threaded, only one piece of code runs at a time. The GIL becomes irrelevant because there's no parallelism to serialise. asyncio achieves concurrency by interleaving tasks at await points, not by running them simultaneously.
Gotchas
- Adding threading to CPU-bound code makes it slower, measure before adding
- multiprocessing on Windows/macOS uses 'spawn' by default, globals from main aren't copied
- Pickle errors when passing un-picklable objects (lambdas, locks) to ProcessPoolExecutor
- asyncio + a CPU-heavy synchronous call → entire event loop stalls until it finishes
- Mixing threads and asyncio is a footgun, prefer asyncio.to_thread() for blocking calls
- The GIL is released around blocking I/O, but NOT around CPU loops in pure Python
threading vs multiprocessing, Picking the Right Tool
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
threading shares memory and is great for I/O-bound work because the GIL releases around blocking calls. multiprocessing spawns separate processes that bypass the GIL, the only way to get real CPU parallelism in stock CPython.
Key Points
- threading: shared memory, light, GIL-bound, pick for I/O concurrency
- multiprocessing: separate memory, ~50ms startup each, no GIL, pick for CPU parallelism
- concurrent.futures gives a unified API over both, Executor, submit(), map(), Future
- Process startup method differs by OS: 'fork' on Linux (fast, copy-on-write), 'spawn' on macOS/Windows (slower, isolated)
- multiprocessing requires picklable arguments; lambdas, locks, and sockets won't cross the boundary
- Process pools are best for medium-large CPU tasks; small tasks lose to IPC overhead
Interview Follow-up Questions
When does multiprocessing pool overhead outweigh the parallelism win?
When tasks are small. Each task pickle + IPC + result unpickle costs ~100μs. If your task takes 50μs, processes will be slower than serial. Rule of thumb: process pool wins when each task is at least a few ms and there are enough tasks to amortize startup.
Why are sockets and locks not picklable?
Their state isn't meaningful in another process, a socket file descriptor or a lock identity in process A makes no sense in process B. multiprocessing rejects them rather than silently producing broken state. Use Manager.Lock() to cross processes.
How can a numpy array be shared across processes without pickling it?
Use multiprocessing.shared_memory.SharedMemory (3.8+), the array lives in OS shared memory, processes attach by name. Or use a memory-mapped file. Or use a multiprocessing.Array. Manager-backed arrays work but are slow because every access proxies through IPC.
concurrent.futures or raw threading/multiprocessing?
concurrent.futures, almost always. It's the modern API: Executor abstraction, Future objects, clean shutdown via context manager, exception propagation, as_completed iteration. Reach for raw threading.Thread only for long-lived workers with custom lifecycles.
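A rough sketch of that unified API (function names are placeholders): the only change between I/O concurrency and CPU parallelism is which Executor class is constructed.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n: int) -> int:
    # Stand-in for Python-level CPU work (parsing, math, ...).
    return sum(i * i for i in range(n))

def run(executor_cls) -> list[int]:
    # Identical API for threads and processes: swap the class, keep the code.
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_task, [200_000] * 8))

if __name__ == "__main__":          # guard required for spawn-based process pools
    run(ThreadPoolExecutor)         # GIL-bound: no speedup for this CPU task
    run(ProcessPoolExecutor)        # one interpreter per worker: real parallelism
```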
Is a thread pool inside each worker process viable?
Yes, common pattern for mixed CPU+I/O workloads. ProcessPoolExecutor at the top, ThreadPoolExecutor inside each worker. The processes get CPU parallelism; the threads inside each handle blocking I/O without pinning a whole core to wait.
Gotchas
- Forgetting `if __name__ == '__main__':` guard on Windows, re-imports module recursively, infinite spawn
- Module-level state set by the parent isn't visible to spawn-mode children, must pass explicitly
- Pickle errors on lambdas, use def or functools.partial
- ProcessPoolExecutor swallows worker crashes, wrap in try/except and check fut.exception()
- fork on macOS is officially discouraged (3.8+) due to subtle bugs with system libraries
- A `with` block closes the Pool automatically; otherwise pool.close() + pool.join() are required or workers leak
asyncio Fundamentals
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
asyncio runs many tasks on a single thread by interleaving them at await points. It scales to 100K+ concurrent I/O operations because each task is just a few KB instead of a thread's megabytes, but a single CPU-heavy synchronous call freezes the entire event loop.
Key Points
- asyncio is single-threaded, concurrency without parallelism
- await is a yield point: control returns to the event loop, other tasks run
- Tasks are cheap (~few KB); spawning 100K is realistic
- ANY blocking call (requests.get, time.sleep, DB driver) stalls the entire event loop
- asyncio.to_thread() bridges blocking code into the async world
- The library being called must be async-aware, sync libraries don't 'become' async
Interview Follow-up Questions
Why is asyncio single-threaded?
By design, cooperative scheduling on a single thread eliminates the need for locks and avoids GIL contention. Each await point is an explicit yield, so race conditions are dramatically rarer than with threads. The tradeoff is no CPU parallelism within one event loop.
What's the difference between gather and wait?
gather() waits for ALL tasks, returns results in input order, propagates exceptions. wait() returns (done, pending) sets and accepts return_when=FIRST_COMPLETED, FIRST_EXCEPTION, or ALL_COMPLETED. gather is the common case; wait is for first-finished semantics.
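A minimal gather sketch showing the interleaving: three tasks that each await complete in roughly the time of the slowest one, and the results come back in input order.

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # await is the yield point: while this task sleeps, the loop runs the others.
    await asyncio.sleep(delay)
    return name

async def main() -> None:
    results = await asyncio.gather(fetch("a", 0.3), fetch("b", 0.1), fetch("c", 0.2))
    print(results)  # ['a', 'b', 'c'] after ~0.3s total, not 0.6s

asyncio.run(main())
```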
How is an asyncio task cancelled?
task.cancel() schedules a CancelledError to be raised inside the task at the next await. The task should catch it, do brief cleanup, and re-raise. asyncio.wait_for and asyncio.timeout() are higher-level wrappers that cancel + raise TimeoutError.
Can asyncio run inside a thread?
Yes, each thread can run its own event loop with asyncio.run(). Useful when integrating async code into a thread-pool-based legacy service. Don't share async objects (Locks, Queues) across threads, they're tied to one event loop.
Why does requests not work with asyncio?
requests is synchronous, its socket I/O blocks the calling thread. In an event loop, that blocks every concurrent task. Use aiohttp or httpx (async mode) instead, or wrap with asyncio.to_thread() when stuck with requests.
Gotchas
- Forgetting `await` calls a coroutine but doesn't run it; emits a 'coroutine was never awaited' warning
- Calling sync DB drivers inside async, silent event loop stall, tail latency spikes
- Swallowing CancelledError, breaks asyncio.wait_for and graceful shutdown
- Creating tasks without awaiting/storing them, task may be garbage collected mid-run
- Calling asyncio.run() inside an already-running event loop raises RuntimeError; await the coroutine or use asyncio.create_task() instead
- Using time.sleep in async code; asyncio.sleep is the right tool (the sync one blocks the loop)
Python threading: Lock, RLock, Condition, Event, Barrier
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
threading module ships the usual suspects: Lock (mutex), RLock (reentrant), Condition (wait/notify), Event (one-shot signal), Semaphore (counted permits), Barrier (rendezvous N threads). The GIL makes Python thread interleaving look atomic for single bytecodes but not for multi-step operations, so these primitives are still needed.
Key Points
- Lock acquired twice on the same thread = deadlock. RLock allows recursive entry.
- Even with the GIL, operations like dict[k] += 1 are NOT atomic (load, add, store across multiple bytecodes).
- Condition.wait() releases the lock and re-acquires on wakeup. Always pair with a predicate loop to handle spurious wakeups.
- Event is the right tool for one-time signals (shutdown flag, ready signal). Don't roll a custom Lock+flag.
- queue.Queue is built on these and provides a thread-safe producer-consumer for free.
Interview Follow-up Questions
Doesn't the GIL make all operations atomic?
No. The GIL makes individual bytecodes atomic (a single LOAD or STORE happens entirely or not at all). But Python operations like x += 1 compile to multiple bytecodes (LOAD, ADD, STORE), and the GIL can be released between them. Two threads doing x += 1 can lose an update. dict[k] = v on the other hand IS atomic on CPython (single STORE_SUBSCR), though that is an implementation detail and not safe to rely on.
Lock vs RLock: which is the default?
Lock. It is faster and surfaces the 'same thread re-acquires' bug as an immediate deadlock rather than hiding it. Reach for RLock only when nested method calls genuinely all need the lock and refactoring to acquire once at the top is not feasible. RLock often masks a design issue.
When is queue.Queue preferable over Lock + Condition?
Almost always for producer-consumer. queue.Queue handles the locking, the bounded capacity, the wait on full/empty, the shutdown protocol (with sentinels). Hand-rolling with Lock+Condition is for cases where the predicate is more complex than queue size, or where a custom data structure beyond FIFO is needed.
Why is Event better than a Lock + boolean for shutdown?
Event is one-shot: once set, every wait() returns immediately, including future waits. With Lock + boolean, every loop iteration must recheck and cannot block efficiently. Event allows wait with a timeout, which is exactly what worker loops want ('do work for at most 1 second, then check shutdown').
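A sketch of that worker loop (structure is illustrative, not prescriptive): wait(timeout) doubles as both the pacing sleep and the shutdown check.

```python
import threading

shutdown = threading.Event()

def worker() -> None:
    while not shutdown.is_set():
        # ... do one unit of work ...
        # Sleep up to 1 second OR wake immediately if shutdown is set.
        shutdown.wait(timeout=1.0)

t = threading.Thread(target=worker)
t.start()
# later, from any thread:
shutdown.set()   # every current and future wait() returns immediately
t.join()
```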
Gotchas
- Holding a Lock across a yield from / await blocks the whole asyncio loop
- Forgetting `with` and using acquire/release manually leaks locks on exception
- Condition.wait() must be called inside a loop on the predicate; single 'if' is wrong
- Barrier with action: the action runs ONCE on whichever thread arrives last; exceptions break the barrier
- BoundedSemaphore raises if release() exceeds initial count; Semaphore silently allows it
queue Module: Queue, LifoQueue, PriorityQueue
Section: Python Concurrency | Difficulty: Basic | Frequency: Asked Often
queue.Queue is the standard thread-safe FIFO. LifoQueue is a stack. PriorityQueue is a heap. SimpleQueue is an unbounded FIFO with a reduced feature set and lower overhead. All handle the locking internally; pair with a sentinel (None) to signal shutdown to consumers.
Key Points
- queue.Queue is thread-safe by design. No external lock needed.
- maxsize > 0 makes the queue bounded; put() blocks when full. Backpressure for free.
- Standard shutdown: producer puts a sentinel (None); consumer breaks when it sees one.
- task_done() + join() let the producer wait for all submitted work to be processed.
- queue.Queue is for THREADS. asyncio has its own asyncio.Queue. multiprocessing has multiprocessing.Queue.
Interview Follow-up Questions
queue.Queue or asyncio.Queue or multiprocessing.Queue?
Different worlds, do not mix. queue.Queue is for threading, asyncio.Queue is for coroutines (async put/get), multiprocessing.Queue is for processes (pickled across the boundary). Pick by the producer/consumer concurrency model. Mixing them (calling queue.Queue.get from an async coroutine) blocks the event loop.
Why is queue.SimpleQueue faster than queue.Queue?
Queue has a configurable maxsize, supports task_done/join, and uses a lock + condition. SimpleQueue is FIFO-only, unbounded, and implemented in C with far less machinery. For high-throughput producer-consumer that doesn't need bounding or task_done, SimpleQueue can be measurably faster.
What goes in PriorityQueue tuples?
(priority, tiebreaker, payload). The tiebreaker (an itertools.count() value or a timestamp) prevents the heap from comparing payloads when priorities are equal. Without it, two equal-priority items try to compare payload, which fails for non-orderable types like dicts or custom objects without __lt__.
How are workers shut down cleanly?
Two patterns. Sentinel: producer puts N copies of None (one per worker), each worker breaks when it sees one. Event: set a threading.Event, workers check it on each iteration with a short timeout on get(). Sentinel is cleaner when N workers is known; Event is cleaner for dynamic worker pools.
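A compact sentinel-based sketch (queue size and worker count are arbitrary): one sentinel per worker, bounded queue for back-pressure.

```python
import queue
import threading

SENTINEL = None
q: queue.Queue = queue.Queue(maxsize=100)      # bounded: put() blocks when full

def consumer() -> None:
    while True:
        item = q.get()
        if item is SENTINEL:                   # shutdown signal
            break
        print("processed", item)

workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()

for i in range(10):                            # producer
    q.put(i)
for _ in workers:
    q.put(SENTINEL)                            # one sentinel per worker
for w in workers:
    w.join()
```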
Gotchas
- queue.Queue is for threads only; asyncio coroutines must use asyncio.Queue
- PriorityQueue tuples need a tiebreaker; otherwise equal priorities try to compare payloads
- task_done() called more times than put() raises ValueError
- maxsize=0 is the default and means UNBOUNDED. Always set explicitly for production.
- join() blocks the calling thread; if no consumers are running, it blocks forever
concurrent.futures: ThreadPoolExecutor & ProcessPoolExecutor
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
concurrent.futures is the high-level API for running callables on a thread or process pool. ThreadPoolExecutor for I/O-bound work (the GIL is released during blocking I/O). ProcessPoolExecutor for CPU-bound work (one Python interpreter per process, true parallelism). Submit returns a Future; map streams results in input order.
Key Points
- ThreadPoolExecutor: I/O-bound (HTTP, DB, file). The GIL releases during blocking syscalls so multiple threads make progress.
- ProcessPoolExecutor: CPU-bound (parsing, image processing, math). True parallelism, but pickling cost on every call.
- executor.map preserves input order. as_completed yields in completion order. Pick by what the downstream consumer needs.
- Always use a context manager (with). Forgetting to shut down the executor leaks worker threads or processes.
- Default max_workers: min(32, cpu_count+4) for ThreadPool, cpu_count for ProcessPool. Tune to the workload.
Interview Follow-up Questions
ThreadPoolExecutor or ProcessPoolExecutor: how to choose?
When work is dominated by waiting (HTTP, DB queries, disk I/O), use threads. The GIL releases during blocking I/O syscalls, so threads make real concurrent progress. When work is dominated by Python-level CPU (parsing, math, image processing without numpy releasing the GIL), use processes. For CPU work in a C extension that releases the GIL (numpy, opencv), threads work too.
What is the cost of ProcessPoolExecutor?
Process startup (fork or spawn), and pickling every argument and every return value. For tiny workloads, the overhead exceeds the speedup. Rule of thumb: each task should do at least 10ms of work to justify process overhead, more on Windows where spawn is slower than fork.
Why not just use threading.Thread directly?
concurrent.futures provides the pool (bounded concurrency), the Future abstraction (result, exception, callback), error handling, and timeouts in one API. Plain threading.Thread is one thread per call, no result mechanism, no error propagation back to the caller. concurrent.futures is what's wanted 95% of the time.
Can a Future be cancelled mid-execution?
No. cancel() only succeeds if the task has not started. Once running, the worker thread or process executes to completion. Python has no thread cancellation primitive. The standard workaround is cooperative cancellation: pass a threading.Event to the worker, check it periodically, return early if set.
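A sketch of that cooperative-cancellation pattern (download_all and the URL list are hypothetical): the Event is the cancellation signal, checked on each loop iteration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def download_all(urls: list[str], stop: threading.Event) -> int:
    done = 0
    for url in urls:
        if stop.is_set():          # cooperative cancellation point
            break
        # ... fetch url here ...
        done += 1
    return done

stop = threading.Event()
with ThreadPoolExecutor(max_workers=4) as pool:
    fut = pool.submit(download_all, ["u1", "u2", "u3"], stop)
    # fut.cancel() only works before the task starts; once it is running,
    # setting the event is how we ask it to stop early.
    stop.set()
    print(fut.result())            # however many items finished before the signal
```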
Gotchas
- Forgetting `with` context manager leaks workers (especially processes, which keep file handles open)
- ProcessPoolExecutor on Windows uses spawn, not fork; everything in __main__ guard
- executor.map raises only on iteration; exceptions in early items are deferred
- Default max_workers is often wrong for the workload; size based on bottleneck (cores for CPU, latency * QPS for I/O)
- cancel() only stops futures that haven't started; anything already running completes regardless
asyncio Patterns: gather, TaskGroup, Semaphore
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Asked Often
asyncio.gather runs awaitables concurrently and returns results in input order. TaskGroup (Python 3.11+) is the structured-concurrency replacement: better cancellation, better error propagation. asyncio.Semaphore caps in-flight concurrency. asyncio.timeout adds a deadline. These four cover most async fan-out patterns.
Key Points
- gather runs N awaitables concurrently; result is a list in input order.
- TaskGroup replaces gather for new code: any task failure cancels siblings, the group raises ExceptionGroup.
- asyncio.Semaphore is the concurrency limit. Without it, gather over 10,000 URLs spawns 10,000 connections.
- asyncio.timeout(s) is a context manager. The whole block must finish in s seconds or it cancels.
- Bare `asyncio.create_task(coro)` without keeping a reference may be garbage-collected mid-flight.
Interview Follow-up Questions
TaskGroup vs gather: which to use?
TaskGroup for new code on Python 3.11+. It handles cancellation correctly (gather's cancel behaviour was historically buggy), provides ExceptionGroup for multiple failures, and aligns with structured concurrency. Use gather only for simple cases on older Python or when return_exceptions=True semantics are specifically needed.
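A sketch of the TaskGroup + Semaphore + timeout combination on Python 3.11+ (URLs and limits are made up): the semaphore caps in-flight work, the timeout bounds the whole fan-out, and leaving the TaskGroup block guarantees every task is finished or cancelled.

```python
import asyncio

async def fetch(url: str, limit: asyncio.Semaphore) -> str:
    async with limit:                      # at most N requests in flight
        await asyncio.sleep(0.1)           # stand-in for an async HTTP call
        return url

async def main() -> None:
    urls = [f"https://example.com/{i}" for i in range(100)]
    limit = asyncio.Semaphore(10)
    async with asyncio.timeout(5):         # the whole fan-out must finish in 5s
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(fetch(u, limit)) for u in urls]
    results = [t.result() for t in tasks]  # safe: the group has joined everything
    print(len(results))

asyncio.run(main())
```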
What is the difference between Task and Future?
Task is a Future that wraps a coroutine. asyncio.create_task(coro) schedules the coroutine and returns a Task. A Task can be awaited, cancelled, queried for state. Future is the lower-level primitive (also returned by run_in_executor for thread/process integration). In application code, almost always work with Tasks.
Why does 'coroutine was never awaited' sometimes appear?
An async function was called without being awaited. Calling foo() on an async def foo returns a coroutine object; it does not run it. Await it, schedule it with create_task, or pass it to gather/TaskGroup. Forgetting all of these means the coroutine is built and discarded.
How does asyncio interact with threads and processes?
asyncio.to_thread runs a sync function on the default executor (a ThreadPoolExecutor) and returns an awaitable. Useful for blocking calls (file I/O, subprocess, sync libraries) that cannot be awaited directly. For CPU-bound work, asyncio.get_running_loop().run_in_executor(ProcessPoolExecutor(), ...) bridges to processes.
Gotchas
- create_task without keeping a reference may be GC'd before completion
- gather with return_exceptions=False raises immediately; siblings keep running
- Forgetting `await` on an async function returns a coroutine, not a result
- Calling sync blocking code (time.sleep, requests.get) from async blocks the event loop
- asyncio.Lock is for asyncio coroutines; threading.Lock will block the event loop
ContextVar: Async-Safe Request Context
Section: Python Concurrency | Difficulty: Intermediate | Frequency: Sometimes
ContextVar (Python 3.7+) is the async-safe replacement for threading.local. Each task gets its own isolated copy of the variable; values set in a parent task propagate to children but mutations don't leak back. The right tool for request IDs, trace IDs, user context, and any per-request state in async code.
Key Points
- Each asyncio Task gets a copy of the parent's context. Sets in the task don't affect siblings or parent.
- Replaces threading.local for async: thread-locals don't isolate across asyncio tasks on the same thread.
- Used for request-scoped data: trace ID, request ID, authenticated user, tenant ID.
- var.set() returns a Token; pass it to var.reset(token) to restore the previous value (rare in async code).
- Frameworks (FastAPI, Starlette, structlog) use ContextVar internally for request-scoped state.
Interview Follow-up Questions
ContextVar vs threading.local: when each?
threading.local for sync code that needs per-thread state. ContextVar for async code (per-task) AND for sync code that needs per-call isolation (Context.run scopes a value to a specific callable). For new code in an async codebase, always ContextVar.
Does ContextVar work with thread pools?
Yes, but with care. Each thread starts with an empty context unless the context is copied explicitly. asyncio.to_thread copies the current context; concurrent.futures.ThreadPoolExecutor.submit does NOT, so contextvars.copy_context().run(...) is required. Frameworks usually handle this automatically.
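A small sketch of the difference (request_id and blocking_work are hypothetical names): submit() runs in a fresh context unless copy_context() is used explicitly.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

request_id: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")

def blocking_work() -> None:
    print(f"[{request_id.get()}] inside the pool")

request_id.set("req-42")
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(blocking_work).result()              # prints [-]: context not copied
    ctx = contextvars.copy_context()
    pool.submit(ctx.run, blocking_work).result()     # prints [req-42]: context carried over
```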
Can a ContextVar have no default?
Yes. var.get() then raises LookupError if unset. Or var.get(default) returns default. The no-default form is useful for catching bugs ('this var must be set by middleware'); the default form is forgiving.
Why does var.set() return a Token?
For undo. var.reset(token) restores the previous value. Useful in synchronous code that wants to set a value temporarily. In async code, the per-task isolation usually means reset is unneeded; the value disappears when the task ends.
Gotchas
- threading.local in async code: tasks on the same thread share state, race city
- ThreadPoolExecutor.submit does NOT copy context; trace IDs disappear in pool work
- var.set() inside a finally block of a task does NOT affect the parent (per-task isolation)
- Forgetting default + var.get() raises LookupError; catch or pre-set in middleware
- Context is captured at create_task time, not at await time; mutations between create and await don't propagate
Multiprocessing Deep Dive
Section: Python Concurrency | Difficulty: Advanced | Frequency: Sometimes
multiprocessing spawns separate Python processes to escape the GIL. Pool, Process, Queue, Pipe, Manager, shared memory. The catch: every argument and return value is pickled across the boundary, start methods (fork/spawn/forkserver) behave very differently, and shared state requires explicit machinery.
Key Points
- Each process has its own Python interpreter and its own GIL. True CPU parallelism.
- Arguments and return values must be picklable. Lambdas, closures, file handles, locks all fail.
- Start method matters: fork (Unix default, fast, copy-on-write) vs spawn (Windows, slower, fresh interpreter) vs forkserver.
- Shared state needs Manager (proxied, slow) or shared_memory (fast, raw bytes).
- multiprocessing.Pool offers map/imap/apply_async; concurrent.futures' ProcessPoolExecutor covers the common cases with submit/map and is usually preferred.
Interview Follow-up Questions
When is multiprocessing preferable over ProcessPoolExecutor?
ProcessPoolExecutor is built on multiprocessing and provides a cleaner API for the common case (map, submit, futures). Reach for raw multiprocessing when needed: a Manager for shared state, shared_memory, custom IPC via Pipe, or fine control over the start method. For 'run this function on N cores', use ProcessPoolExecutor.
Why does code work on Linux but fail on macOS / Windows?
Default start method changed. macOS in Python 3.8+ defaults to spawn (was fork before). Windows has always been spawn. Spawn re-imports the module in each worker, which breaks lambdas, top-level code without an `if __name__ == '__main__':` guard, and code that depends on shared parent state. Add the guard, refactor lambdas to module-level functions, and pass shared state explicitly.
Manager.dict vs shared_memory: when to pick which?
Manager.dict for low-frequency access to small objects. Each access is an IPC round trip (pickle, send, unpickle), so it does not scale. shared_memory for large fixed-size buffers (numpy arrays, byte buffers) needing lots of reads and writes without copying. The trade-off is convenience vs speed.
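A sketch of the shared_memory path, assuming numpy is available (sizes and names are arbitrary): the child attaches to the same segment by name, so the array is never pickled or copied.

```python
from multiprocessing import Process, shared_memory
import numpy as np

def worker(shm_name: str, shape: tuple, dtype: str) -> None:
    shm = shared_memory.SharedMemory(name=shm_name)    # attach by name, zero copy
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr *= 2                                            # writes are visible to the parent
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8 * 1_000_000)  # 1M float64
    arr = np.ndarray((1_000_000,), dtype="float64", buffer=shm.buf)
    arr[:] = 1.0
    p = Process(target=worker, args=(shm.name, arr.shape, "float64"))
    p.start(); p.join()
    print(arr[:3])       # [2. 2. 2.]
    shm.close()
    shm.unlink()         # release the OS segment; forgetting this leaks it
```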
How does PEP 703 (no-GIL Python) change this?
Free-threaded CPython (3.13+ as a build option, supported but non-default in 3.14, default-on planned for later) removes the GIL. Threads achieve true parallelism, so multiprocessing becomes less necessary for CPU-bound work. multiprocessing remains useful for fault isolation (a crash in one process does not take down the parent) and for memory isolation (no shared mutable state by default).
Gotchas
- Forgetting `if __name__ == '__main__':` on Windows/macOS spawn causes infinite recursion
- Lambdas, local functions, and closures cannot be pickled; use module-level callables
- Manager proxies are SLOW; every read or write is an IPC round trip
- fork after threading is undefined behaviour in many libraries; prefer spawn or forkserver
- Forgetting shm.unlink() leaks shared memory until reboot on some platforms
Bridging Sync and Async
Section: Python Concurrency | Difficulty: Advanced | Frequency: Sometimes
asyncio.to_thread runs sync code on a thread without blocking the event loop. asyncio.run_coroutine_threadsafe runs async code from a thread. asgiref.sync provides sync_to_async / async_to_sync helpers. Mixing the two worlds is messy; isolate the boundary and pick one direction per layer.
Key Points
- Calling sync blocking code from async (without to_thread) freezes the event loop.
- Calling async code from sync (without proper bridging) raises RuntimeError or silently never runs.
- asyncio.to_thread offloads the sync call to the default thread pool. Returns awaitable.
- From a non-event-loop thread, asyncio.run_coroutine_threadsafe schedules a coroutine on the loop and returns a concurrent.futures.Future.
- When in doubt, isolate the boundary: a layer is fully sync OR fully async, never mixed.
Interview Follow-up Questions
Why not just call sync code from async?
It works, but if the sync code blocks (HTTP request, DB query, file I/O, time.sleep), it freezes the entire event loop. Every other coroutine, every other request, stops making progress. asyncio.to_thread offloads the blocking call to a thread, freeing the loop to handle other work.
What thread runs asyncio.to_thread tasks?
By default, the loop's default executor (a ThreadPoolExecutor with min(32, cpu_count + 4) workers). It can be replaced with loop.set_default_executor(custom_executor) for a larger pool or one with custom thread settings. For CPU-bound sync code, use ProcessPoolExecutor instead via run_in_executor.
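A sketch of the common direction (legacy_blocking_call is a stand-in for any sync library): each blocking call is offloaded to the default executor and awaited like a coroutine, so three calls overlap instead of stacking.

```python
import asyncio
import time

def legacy_blocking_call(seconds: float) -> str:
    time.sleep(seconds)        # stands in for requests.get / a sync DB driver
    return "done"

async def main() -> None:
    results = await asyncio.gather(
        asyncio.to_thread(legacy_blocking_call, 0.2),
        asyncio.to_thread(legacy_blocking_call, 0.2),
        asyncio.to_thread(legacy_blocking_call, 0.2),
    )
    print(results)             # finishes in ~0.2s, not 0.6s, and the loop never blocks

asyncio.run(main())
```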
When is anyio preferable over asyncio directly?
anyio is a higher-level wrapper that works on both asyncio and trio with the same API. Reach for it for runtime-agnostic libraries, for trio's structured concurrency, or for easier sync/async bridging via anyio.from_thread.run / anyio.to_thread.run_sync. For pure asyncio applications, the asyncio API is fine.
What is the boundary to aim for?
Pick a direction per layer. The web framework layer is async. The data access layer is async (using async drivers). The CPU-bound utility layer is sync, called from async via to_thread. Don't ping-pong: a sync function that calls an async function that calls a sync function is hard to reason about and slow.
Gotchas
- Calling time.sleep / requests.get / sync DB drivers from async blocks the event loop
- asyncio.run inside a running loop raises RuntimeError; only call from sync code
- asyncio.run_coroutine_threadsafe needs a loop already running on another thread
- to_thread workers come from a fixed pool; thousands of concurrent to_thread calls queue up
- asgiref helpers create per-call loops if needed; expensive in tight loops
PEP 703: No-GIL Python
Section: Python Concurrency | Difficulty: Advanced | Frequency: Sometimes
PEP 703 makes the GIL optional. Free-threaded CPython (3.13+ build option, supported but not default in 3.14) lets multiple Python threads run bytecode in parallel. Threads finally help CPU-bound work. Cost: per-object reference counting needs atomics, single-threaded performance regresses ~10-15%, C extensions need rebuilding.
Key Points
- Free-threaded Python (3.13+) builds without the GIL. Threads achieve true parallelism.
- Only available in a separate build (python3.13t). Default CPython still has the GIL.
- C extensions must opt in via Py_mod_gil. Old extensions trigger a re-enabling of the GIL.
- Single-threaded perf regresses (~10-15% in 3.13) due to atomic reference counting and biased locking on objects.
- Multi-threaded scaling on CPU-bound code finally works without multiprocessing.
Interview Follow-up Questions
Should one switch to free-threaded Python today?
Probably not for production yet. The free-threaded build is experimental in 3.13 and supported (but not default) in 3.14. Becoming default is on the roadmap but not yet officially scheduled. Most third-party C extensions (numpy, pandas, torch, etc.) are still adding no-GIL support. Test in a dev environment, not in production. Single-threaded perf also regresses ~10-15%, which matters for some workloads.
Does this make multiprocessing obsolete?
No. Multiprocessing still gives fault isolation (a crash in one worker does not take the parent), memory isolation (workers cannot corrupt each other's state), and the ability to scale beyond one machine. Free-threaded Python makes threads viable for CPU-bound work, but processes remain useful when isolation matters.
Will existing threaded code break?
Code that is correctly synchronised (uses Lock around shared mutable state) works the same. Code that relies on the GIL to make multi-bytecode operations atomic (e.g., assumes counter += 1 is safe) was always broken; on free-threaded Python the bugs are easier to trigger. Audit shared mutable state for missing locks.
What changes for library authors?
Pure-Python libraries should work unchanged. C extensions need to declare no-GIL safety via Py_mod_gil slot, audit for thread-unsafe global state, and add explicit synchronisation for any shared mutable C-level state (caches, free lists). The CPython docs have a porting guide.
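A hedged detection sketch (attribute names are those documented for 3.13; older interpreters simply report a GIL-enabled build):

```python
import sys
import sysconfig

# Build-time flag: truthy on a free-threaded (no-GIL) build.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Runtime check: even on a free-threaded build the GIL can be re-enabled,
# e.g. when a GIL-requiring C extension is loaded. Falls back to True pre-3.13.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(f"free-threaded build: {free_threaded_build}, GIL currently enabled: {gil_enabled}")
```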
Gotchas
- Single-threaded perf regresses ~10-15% on free-threaded build (atomic refcounts, biased locking overhead)
- Loading any GIL-required C extension re-enables the GIL process-wide
- Race conditions hidden by GIL serialisation become visible; correctness audits matter
- Free-threaded build is a separate binary (python3.13t); prebuilt wheels on PyPI are still limited
- Some libraries gate behaviour on sys.version; check for free-threaded build separately
Connection Pool
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
A pool of pre-established connections (database, HTTP, gRPC) reused across requests. Building a connection per request is too slow (TLS handshake, TCP setup, auth). Pool provides bounded resource usage, faster requests, and back-pressure when the limit is reached.
Key Points
- Connection establishment is expensive: TCP handshake, TLS handshake, auth. Pool amortises this.
- Pool size = max concurrent in-flight requests. Sized by downstream limit, not by app traffic.
- Acquire blocks (or fails) when pool is empty; this is the back-pressure mechanism.
- Health check: validate connection on borrow; broken connections must be discarded, not handed out.
- Idle timeout: connections that have been idle too long are closed (server may have closed them already).
Interview Follow-up Questions
How big should the connection pool be?
Use Little's Law: throughput × average query latency. At 100 queries/sec and 50ms average, 5 in-flight is the baseline. Add 2x headroom for bursts. So pool of 10. Don't size by number of app threads; that gives wildly oversized pools that overwhelm the database.
What is the relationship between pool size and database max_connections?
Pool size per app instance × number of app instances ≤ database max_connections × 0.8 (leave headroom). Postgres default max_connections is 100; with 10 app instances that's 8 connections each. Sounds low. Two responses: tune Postgres up to 200-500, or use PgBouncer in front to multiplex.
Should there be separate pools for read replicas and writers?
Yes. Read traffic is bursty; write traffic is steady. Separate pools allow each to be sized appropriately and route queries to the right backend. Most ORMs and DB libraries support multiple data sources.
Why recycle connections (maxLifetime)?
Three reasons. Server-side connection limits or memory leaks accumulate over time. DNS or load balancer changes need fresh connections to take effect. Server restarts may close idle connections without us noticing. Recycling every 30 minutes keeps the pool healthy without thinking about it.
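A minimal pool sketch (no health check, and the connection factory is a placeholder): the bounded queue makes acquire the back-pressure point, and the birth timestamp drives maxLifetime recycling.

```python
import queue
import time

class ConnectionPool:
    def __init__(self, create, size: int = 10, max_lifetime: float = 1800.0):
        self._create = create                  # factory returning a fresh connection
        self._max_lifetime = max_lifetime      # recycle after ~30 minutes
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put((create(), time.monotonic()))

    def acquire(self, timeout: float = 5.0):
        conn, born = self._idle.get(timeout=timeout)       # blocks when the pool is empty
        if time.monotonic() - born > self._max_lifetime:
            conn, born = self._create(), time.monotonic()  # too old: replace it
        return conn, born

    def release(self, conn, born) -> None:
        self._idle.put((conn, born))           # call from a finally block, always
```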
Gotchas
- Pool size larger than database can handle: every request fails when DB max_connections hit
- Forgetting to release connection (no try-with-resources): pool leak, eventual exhaustion
- No idle timeout: server-closed connections stay in the pool, fail on next borrow
- Long-running query holds a connection: pool exhausted by 1 slow query
- Same pool for read and write: write spike starves reads, or vice versa
Background Job Queue
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Persist jobs to durable storage (Postgres, Redis, SQS), workers pull and process them. Need: idempotent handlers (jobs can be delivered twice), visibility timeout (a worker that crashes mid-job must release the job for retry), exponential retry with dead-letter queue, scheduled/delayed jobs, priority queuing.
Key Points
- Durable storage so jobs survive restart. Memory-only queues lose work on crash.
- Visibility timeout: when a worker pulls a job, hide it for N seconds. If the worker doesn't ack, it reappears.
- Idempotency on the handler side: jobs can be delivered more than once (visibility timeout expiry, worker crashes after work but before ack).
- Dead-letter queue (DLQ) for jobs that fail repeatedly. Don't retry forever; surface to humans.
- Sidekiq/Resque (Redis), RQ (Redis), Celery (RabbitMQ or Redis broker), BullMQ (Redis), AWS SQS, and Postgres SKIP LOCKED are popular implementations. Sidekiq, Resque, RQ, and BullMQ are Redis-native; Celery's primary broker is RabbitMQ (Redis is supported but not the default).
Interview Follow-up Questions
Why not just use a list in Redis (LPUSH/RPOP)?
Works for the simplest case but lacks visibility timeout, retries, persistence guarantees on crash, scheduled jobs. All of that ends up being rebuilt. For anything beyond toy workloads, use a real job queue (Sidekiq, RQ, Celery, SQS) or Postgres SKIP LOCKED. Redis-only is fine with Redis Streams (which has consumer groups and acks) but not raw LPUSH/RPOP.
Why is Postgres SKIP LOCKED so popular?
Most teams already run Postgres. No new system to operate. Transactional with the rest of the data (a job can be enqueued and a row updated atomically). SKIP LOCKED prevents lock contention between workers. Throughput is real (10K+ jobs/sec on a single Postgres). The downside: not as fast as Redis or SQS at very high throughput, and database CPU goes to queue work.
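A worker-loop sketch on top of SKIP LOCKED, assuming a psycopg-style DB-API connection and a `jobs` table with id/payload/state/run_at/priority columns (all hypothetical):

```python
import time

CLAIM_SQL = """
UPDATE jobs
   SET state = 'running', locked_at = now()
 WHERE id = (SELECT id FROM jobs
              WHERE state = 'queued' AND run_at <= now()
              ORDER BY priority, run_at
              LIMIT 1
              FOR UPDATE SKIP LOCKED)
RETURNING id, payload;
"""

def worker_loop(conn, handle) -> None:
    while True:
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL)
            row = cur.fetchone()
        conn.commit()                          # the claim is its own transaction
        if row is None:
            time.sleep(1.0)                    # queue empty: back off briefly
            continue
        job_id, payload = row
        handle(payload)                        # must be idempotent: at-least-once delivery
        with conn.cursor() as cur:
            cur.execute("UPDATE jobs SET state = 'done' WHERE id = %s", (job_id,))
        conn.commit()
```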
What goes in the dead-letter queue?
Jobs that exceeded max retries. Store the original payload, the attempt count, and the last error. Periodically (or via dashboard), humans inspect, decide whether the job is recoverable (fix and replay) or permanently broken (delete). DLQ size is a great alert: a sudden spike means something downstream broke.
How do I prioritise jobs?
Add a priority column (0-9, 0=highest). Worker selects ORDER BY priority, run_at. For more sophisticated needs (per-tenant fairness, SLA-aware scheduling), use multiple queues with different worker pools, or a dedicated scheduler. Most teams over-engineer this; a single priority field handles 95% of needs.
Gotchas
- No visibility timeout: worker crash mid-job means the job is lost OR re-runs immediately and floods retries
- Non-idempotent handler with at-least-once delivery: side effects double on retry
- Unbounded retries: a permanently failing job retries forever, drowning out healthy work
- No DLQ inspection: dead jobs accumulate, no one notices the silent failure
- Treating queue as a database: long-running jobs hold locks, block compaction, slow queries
In-Memory Cache with TTL Invalidation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Thread-safe cache where each entry expires after T seconds. Core decisions: lazy expiry vs background sweep; ConcurrentHashMap vs RWLock; cache-stampede prevention via single-flight. Tests understanding of expiration semantics + concurrency together.
Key Points
- Two expiry strategies: lazy (check on read) vs active sweep (background timer)
- ConcurrentHashMap (Java) / sync.Map (Go) / dict + RLock (Python) for storage
- Cache stampede: when key expires, many concurrent reads all try to recompute → use single-flight
- Hot-key contention: shard the cache by key hash for high throughput
Interview Follow-up Questions
Lazy vs active sweep, pick one?
Use both. Lazy expiry on reads gives correctness. Active sweep bounds memory for keys rarely read. Without active sweep, keys never read again sit forever, memory leak. Without lazy, stale data is served between sweep cycles.
What's cache stampede and how is it prevented?
When a hot key expires and 100 concurrent reads all see 'expired' and all try to recompute simultaneously, hammering the upstream. Single-flight: only one recomputation per key; others wait for it. Java's `computeIfAbsent`, Go's `singleflight`, hand-rolled per-key locks in Python.
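A compact sketch combining lazy expiry with per-key single-flight (loader is the upstream recompute, supplied by the caller):

```python
import threading
import time

class TTLCache:
    def __init__(self, ttl: float, loader):
        self._ttl = ttl
        self._loader = loader                 # upstream recompute function
        self._data: dict = {}                 # key -> (value, expires_at)
        self._lock = threading.Lock()
        self._inflight: dict = {}             # key -> per-key single-flight lock

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
            if entry and entry[1] > time.monotonic():   # lazy expiry check on read
                return entry[0]
            flight = self._inflight.setdefault(key, threading.Lock())
        with flight:                          # single-flight: one recompute per key
            with self._lock:
                entry = self._data.get(key)
                if entry and entry[1] > time.monotonic():
                    return entry[0]           # another thread already refreshed it
            value = self._loader(key)
            with self._lock:
                self._data[key] = (value, time.monotonic() + self._ttl)
                self._inflight.pop(key, None)
            return value
```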
How would this scale to 1M ops/sec?
Sharding. Split the cache into N partitions by key hash; each has its own lock and map. ConcurrentHashMap (Java) stripes locking internally (per-bin locks since Java 8); for Go/Python, do it explicitly. Separate single-flight group per shard.
What about negative caching?
Cache 'not found' with a shorter TTL. Otherwise every miss for a non-existent key hammers the loader. Production caches usually have separate hit/miss TTLs.
Token Bucket & Sliding Window Rate Limiters
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
A rate limiter caps request rate per key. Token bucket: refill N tokens/sec, requests consume one. Sliding window: count requests in the last second. Both need atomic counter updates and either lazy refill or scheduled refill, choice depends on traffic patterns and burst tolerance.
Key Points
- Token bucket: refill rate, max bucket capacity (allows bursts)
- Sliding window: log of timestamps, count those in [now-1s, now]
- Per-key state requires concurrent map; per-bucket update requires atomic
- Distributed rate limiting: Redis SETNX + INCR + EXPIRE, or atomic Lua script
- Real production: Stripe's algorithm uses sliding window + leaky bucket per dimension
Interview Follow-up Questions
Token bucket vs sliding window, pick one?
Token bucket allows bursts (use up tokens fast, refill slowly). Sliding window strictly caps requests-per-window. Stripe's API uses sliding-window because billing endpoints can't tolerate bursts. Most CDNs use token bucket for friendliness. Pick based on the contract being enforced.
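A token-bucket sketch with lazy refill (per-key lookup and the 429 response are left to the caller):

```python
import threading
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self._rate = rate                 # tokens added per second
        self._capacity = capacity         # burst ceiling
        self._tokens = capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        with self._lock:
            now = time.monotonic()
            # Lazy refill: credit tokens for the time elapsed since the last call.
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False                  # caller responds 429 + Retry-After
```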
How is this made distributed across N servers?
Per-server limiters drift; a shared store is required. Redis SETNX + INCR + EXPIRE is the standard (atomic Lua script for the check-and-update). Sketches like CountMinSketch reduce memory for huge key spaces. Trade exact accuracy for performance.
What about rate limit by IP vs by user vs by API key?
All three usually. Multi-dimensional limits, most strict wins. Stripe documents this: per-account, per-endpoint, per-action all have separate limits. Each is a TokenBucket (or sliding window) keyed by the identifier.
How are limit responses handled gracefully?
Return 429 with `Retry-After` header. Add `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers so clients can self-throttle. For internal services, exponential backoff with jitter on receiving 429.
Retry Library: Production Implementation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Build a retry library: classify errors (transient vs permanent), exponential backoff with jitter, max-attempts AND max-duration, per-call deadline budget, optional circuit breaker integration. Two ways to get it wrong: retry permanent errors, or retry without jitter.
Key Points
- Classify errors: only retry transient (5xx, timeout, connection refused). Never retry 4xx.
- Exponential backoff with FULL jitter: delay = random(0, base * 2^attempt).
- Bound both attempts AND total elapsed time. Either alone is not enough.
- Pass remaining deadline to each attempt; don't run a 5s call when 1s of budget remains.
- Compose with circuit breaker: retries hit the breaker first, fail fast if open.
Interview Follow-up Questions
What goes in the retryable-error classifier?
Network errors (timeout, connection refused, DNS fail), 5xx HTTP responses, gRPC UNAVAILABLE/DEADLINE_EXCEEDED/RESOURCE_EXHAUSTED, database deadlock errors, cloud throttling responses (429 with Retry-After). NOT in: 4xx (client errors), validation failures, auth errors, business-logic errors. The classifier is per-protocol; HTTP, gRPC, DB each have their own.
Why should I prefer a deadline over a fixed max-attempts?
Because attempts can grow latency unboundedly. With max-attempts=5 and exponential backoff, the worst case is base * (1+2+4+8+16) = 31 * base. If base is 1s, that's 31s of latency for one user request. A deadline (e.g., 5s total) caps the experience. Both together give the best behaviour: stop on attempts AND on time.
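A sketch combining all three bounds (the transient-error tuple stands in for the per-protocol classifier):

```python
import random
import time

TRANSIENT = (TimeoutError, ConnectionError)   # placeholder classifier

def retry(call, base: float = 0.1, max_attempts: int = 5, deadline: float = 5.0):
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return call()                               # non-transient errors propagate
        except TRANSIENT:
            elapsed = time.monotonic() - start
            if attempt == max_attempts - 1 or elapsed >= deadline:
                raise                                   # out of attempts OR out of time
            # Full jitter: uniform in [0, base * 2^attempt], capped by remaining budget.
            delay = random.uniform(0, base * 2 ** attempt)
            time.sleep(min(delay, deadline - elapsed))
```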
Should retry be in the client SDK or in the application code?
Client SDK is the cleanest place; the application calls one function and gets retry for free. The risk: stacked retries when the application also wraps the SDK in its own retry. Solution: SDK exposes a way to disable its retry, or document clearly. When the SDK isn't under the team's control, application-level retry is the right place.
Gotchas
- Retrying without a per-attempt timeout means slow downstream + slow retries = unbounded latency
- Retrying a non-idempotent POST without an idempotency key duplicates the side effect
- Stacked retries (application retry + the HTTP client's retry) multiply attempts exponentially
- Using elapsed time but not respecting context cancellation: retry continues after caller gave up
- Treating all exceptions as retryable: validation errors and OOM get retried too
Circuit Breaker: Production Implementation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Build a circuit breaker that wraps downstream calls. State machine: closed → (failures) → open → (cooldown) → half-open → (probe) → closed/open. Per-downstream instance. Trip on failure rate over a sliding window with a minimum-call floor. Half-open allows N probes; success rate decides if we close or reopen.
Key Points
- Three states: closed (normal), open (fail-fast), half-open (probing). Atomic state transitions.
- Trip policy: failure rate over a sliding window (last N calls or last T seconds), with a minimum-call floor.
- Cooldown in open state: 30s-5min typical. After cooldown, transition to half-open.
- Half-open allows N probes (typically 3-10). Track their outcomes; success rate decides next state.
- Per-downstream breaker. Never one global breaker. Different downstreams have different reliability.
Interview Follow-up Questions
How do I size the sliding window?
Tradeoff between reactivity and false positives. A 10-call window: trips fast but flaky on bursty traffic. A 100-call or 60-second window: more stable but slower to trip. For high-traffic services, 60-second time-based windows work well; for low-traffic, count-based with a minimum-call floor.
What if I get a flood of requests right when half-open opens?
Don't admit them all. Half-open should cap concurrent probes (e.g., 3 at a time, others fail-fast as if open). Without this cap, the still-recovering downstream gets a surge of probe traffic at the moment it can least handle it.
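A simplified breaker sketch with a probe cap (it trips on consecutive failures rather than a windowed rate, and closes after a single successful probe; a production breaker would refine both):

```python
import threading
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0, max_probes: int = 3):
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._max_probes = max_probes          # cap on concurrent half-open probes
        self._failures = 0
        self._probes = 0
        self._opened_at = 0.0
        self._state = "closed"
        self._lock = threading.Lock()

    def call(self, fn):
        with self._lock:
            if self._state == "open":
                if time.monotonic() - self._opened_at < self._cooldown:
                    raise CircuitOpenError()
                self._state, self._probes = "half-open", 0
            if self._state == "half-open":
                if self._probes >= self._max_probes:
                    raise CircuitOpenError()   # too many probes in flight: fail fast
                self._probes += 1
        try:
            result = fn()
        except Exception:
            with self._lock:
                self._failures += 1
                if self._state == "half-open" or self._failures >= self._threshold:
                    self._state, self._opened_at = "open", time.monotonic()
            raise
        with self._lock:
            self._failures = 0
            if self._state == "half-open":
                self._state = "closed"         # probe succeeded: close the breaker
        return result
```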
Should I share circuit breaker state across instances?
Usually not. Each service instance maintains its own breaker. The downstream's actual health is the same; each instance discovers it independently. Sharing state across instances (via Redis, say) sounds smart but adds latency and a coordination point. Per-instance breakers converge to the same decision quickly enough.
How does the breaker interact with metrics?
Emit gauge for state (0/1/2 for closed/open/half-open), counters for trips and resets, histograms for time-in-state. Alerts on 'breaker open for >5 minutes' or 'breaker tripping >10 times per hour'. The breaker is one of the most useful sources of downstream-health signal.
Gotchas
- One global breaker for all downstreams: one bad downstream stops everything
- Counting client errors (4xx) as breaker failures: trips on user mistakes, not service health
- Half-open with no probe cap: surge of probe traffic on a recovering downstream
- No fallback when breaker opens: caller sees CircuitOpenException with no graceful path
- Breaker without timeout: slow calls can pile up before the breaker even sees them as failures
Idempotency Key: Stripe-Style Implementation
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Server-side dedup of retries on non-idempotent operations. Client sends Idempotency-Key header (unique per logical operation). Server: lock on key → check cache → if cached, replay response → else process and cache → release. TTL on cache (24h typical). Detect key collisions via request fingerprint.
Key Points
- Client generates ONE key per logical operation, reuses on retry. NOT one per HTTP attempt.
- Server-side: lock on key (so concurrent retries don't both run), check cache, process, store response, release.
- Cache stores: response body + status code + relevant headers. Replay must be byte-identical.
- TTL: typically 24h. Long enough for client retries, short enough that storage doesn't grow forever.
- Detect key collisions: store a fingerprint (hash of request body) and compare. If different, return 422 instead of cached response.
Interview Follow-up Questions
What happens with concurrent requests using the same idempotency key?
The lock serialises them. First request acquires the lock, processes, stores the cached response, releases. Concurrent requests either wait briefly (with blocking lock acquisition) or fail with 409 'in flight' (with fail-fast). For a server-side idempotency layer, the 409 response is usually preferred: the client retries with backoff and the second time finds the cached response.
Why detect key collisions?
Two clients might accidentally generate the same key (UUID collisions are rare but not zero in long-tail). Or a single client reuses a key for two different operations by mistake (using a session ID instead of an operation ID, say). Without collision detection, the server replays the first response for the second request, which is the wrong response. The fingerprint check catches this and returns 422.
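A sketch of the cache-and-fingerprint core (the in-memory dict stands in for a shared store like Redis, and the in-flight lock / 409 path is omitted for brevity):

```python
import hashlib
import json

_responses: dict = {}    # key -> (fingerprint, status, body); needs a TTL in production

def handle_with_idempotency(key: str, request_body: dict, process):
    fingerprint = hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode()
    ).hexdigest()
    cached = _responses.get(key)
    if cached:
        stored_fp, status, body = cached
        if stored_fp != fingerprint:
            return 422, {"error": "idempotency key reused for a different request"}
        return status, body                          # replay the stored response
    status, body = process(request_body)             # side effect runs exactly once
    _responses[key] = (fingerprint, status, body)    # cache for later retries
    return status, body
```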
Should the cache be local or shared?
Shared. With multiple instances behind a load balancer, the same key might land on different instances. A local cache means each instance dedupes only against its own history; the lock would also need to be shared. Use Redis or similar.
What about idempotency for fire-and-forget background jobs?
Use an idempotency key in the job payload. The handler checks 'have I processed key X?' before doing the side effect. Same idea, different transport. Critical for at-least-once queues (SQS, Kafka with at-least-once consumer) where a single logical operation can be delivered multiple times.
Gotchas
- Generating new key per HTTP attempt: server sees each retry as a fresh request
- Caching only response body, not status code: replay returns wrong status
- No collision detection: same key for different operations replays wrong response
- No TTL on cached response: storage grows without bound
- Lock without TTL: crashed processor blocks subsequent retries forever
API Aggregator (Backend-for-Frontend)
Section: Real-World Scenarios | Difficulty: Intermediate | Frequency: Asked Often
Service that calls N downstream services concurrently and merges results into a single response. Common in mobile/web BFFs. Patterns: fan out with errgroup/TaskGroup/gather, per-call timeouts, partial-result tolerance (return what we have if some fail), per-downstream circuit breakers.
Key Points
- Fan out concurrently: total latency = max of downstreams, not sum.
- Per-call timeout: one slow downstream shouldn't drag the whole response.
- Partial-result tolerance: decide which downstreams are critical, which are optional.
- Per-downstream circuit breaker: when one is unhealthy, fail it fast and degrade gracefully.
- Cache aggressively: page-level cache and per-downstream cache reduce fan-out cost.
Interview Follow-up Questions
What's the total-latency calculation for an aggregator?
The latency of the slowest downstream call — the max, not the sum. The whole point of the aggregator is parallelism. Accidental serial calls (await user; then await cart; then await recs) make latency the sum, which defeats the purpose. Always fan out concurrently.
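A sketch of concurrent fan-out with per-call timeouts using errgroup, with one critical and one optional downstream. `fetchUser` and `fetchRecs` are placeholder clients, not a real API.

```go
package bff

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

type Page struct {
	User string
	Recs []string
}

// Placeholder downstream clients — stand-ins for real RPCs.
func fetchUser(ctx context.Context, id string) (string, error)   { return "user-" + id, nil }
func fetchRecs(ctx context.Context, id string) ([]string, error) { return []string{"a", "b"}, nil }

// assemblePage fans out concurrently; total latency ≈ slowest call, not the sum.
func assemblePage(ctx context.Context, userID string) (Page, error) {
	var page Page
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error { // critical downstream: the page fails if this fails
		cctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
		defer cancel()
		u, err := fetchUser(cctx, userID)
		page.User = u
		return err
	})
	g.Go(func() error { // optional downstream: swallow the error, render without recs
		cctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
		defer cancel()
		if recs, err := fetchRecs(cctx, userID); err == nil {
			page.Recs = recs
		}
		return nil
	})
	if err := g.Wait(); err != nil {
		return Page{}, err
	}
	return page, nil
}
```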
How to decide which downstreams are critical vs optional?
Product/UX decision. The header that shows the username and avatar needs the user-service; without it, the page can't render. The 'people you might know' section is optional; if it fails, hide that section but still show the rest. Talk to designers; over-tolerating failures hides real problems, but treating everything as critical makes the page brittle.
Should the aggregator be its own service?
Yes for non-trivial apps. Pattern: BFF (Backend for Frontend). One BFF per client (mobile, web, partner API), each shaped to that client's needs. The BFF does the fan-out, caching, and shaping. Backends stay focused on their domain. Mobile app talks to one BFF endpoint instead of orchestrating 5 calls itself.
How are aggregator responses cached?
Two layers. Per-downstream cache (the user-service result is the same for the same userId for some seconds). Page-level cache (the full assembled page when it's safe to cache). Per-downstream cache wins more because invalidation is per-domain; page-level cache wins on repeated identical requests.
Gotchas
- Sequential awaits instead of concurrent fan-out: latency becomes sum, not max
- No per-call timeout: one slow downstream blocks the whole page
- Treating all downstreams as critical: any single failure breaks the page
- No circuit breaker per downstream: calls keep timing out for an unhealthy service
- No caching: every request fans out to all backends; database hammered
Polite Concurrent Web Crawler
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Asked Often
Crawl 1M URLs concurrently while: (1) bounding total in-flight requests, (2) deduplicating visited URLs, (3) respecting per-domain politeness (max N concurrent per host). Bounded worker pool + concurrent set + per-domain semaphore.
Key Points
- Worker pool with N workers fetches from a shared work queue
- Concurrent set (visited) prevents fetching the same URL twice
- Per-domain semaphore caps requests-per-host (politeness)
- BFS-style traversal: each fetched page enqueues newly-discovered links
- Termination: when queue empty AND no in-flight workers
Interview Follow-up Questions
How is pool size decided?
Web crawling is I/O-bound, pool size much larger than core count. Goetz's formula: cores × (1 + wait/compute). For ~500ms HTTP fetches and millisecond parsing: 100-500× cores. Empirically: ramp up until upstream complains or the file-descriptor table maxes out.
Why per-host limits?
Politeness, don't DDOS individual hosts. Most sites tolerate 4-8 concurrent connections from one client. Public crawler etiquette and robots.txt expect this.
How does termination work when the queue drains?
Track pending work, not just queue length. A naive 'queue empty AND inFlight == 0' check is racy: a worker can have just popped the last URL and be about to enqueue newly discovered links between the two reads. Real solution: a single counter (WaitGroup / CountDownLatch) incremented when a URL is scheduled and decremented only after that URL has been processed and its links scheduled; done when it reaches zero (see the sketch below).
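A sketch of the whole pattern under those rules — visited set, global in-flight bound, per-host politeness, WaitGroup-based termination. `fetchAndParse` is a placeholder; robots.txt, error handling, and retries are omitted.

```go
package crawler

import (
	"net/url"
	"sync"
)

// fetchAndParse stands in for the real HTTP fetch + link extraction.
func fetchAndParse(u string) []string { return nil }

// Crawl: one goroutine per discovered URL, a global semaphore bounding total
// in-flight fetches, a per-host semaphore for politeness, a visited set for dedup,
// and a WaitGroup whose count is "URLs scheduled but not yet fully processed".
func Crawl(seeds []string, maxInFlight, perHost int) {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		visited = map[string]bool{}
		hostSem = map[string]chan struct{}{}
		global  = make(chan struct{}, maxInFlight)
	)

	var visit func(u string)
	schedule := func(u string) {
		mu.Lock()
		defer mu.Unlock()
		if visited[u] {
			return
		}
		visited[u] = true
		wg.Add(1) // counted BEFORE the goroutine starts, so Wait can't fire early
		go visit(u)
	}

	visit = func(u string) {
		defer wg.Done() // runs only after this URL's links have been scheduled

		host := ""
		if p, err := url.Parse(u); err == nil {
			host = p.Host
		}
		mu.Lock()
		sem, ok := hostSem[host]
		if !ok {
			sem = make(chan struct{}, perHost)
			hostSem[host] = sem
		}
		mu.Unlock()

		global <- struct{}{} // bound total in-flight requests
		sem <- struct{}{}    // politeness: at most perHost concurrent per host
		links := fetchAndParse(u)
		<-sem
		<-global

		for _, l := range links {
			schedule(l)
		}
	}

	for _, s := range seeds {
		schedule(s)
	}
	wg.Wait() // nothing queued and nothing in flight: termination
}
```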
How is robots.txt handled?
Per-host robots.txt cache. Fetch and parse once per host; cache the rules. Apply rules before queuing URLs for that host. Add to the per-host limit logic.
Lock Convoys
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
A lock convoy happens when threads contending for a lock end up serialized in the kernel wait queue. The first thread runs and releases. The next thread is woken from a kernel sleep, which costs a context switch and a cache miss. The lock is fair, but throughput drops because every acquire pays a kernel round trip even when the critical section is microseconds. Common in fair mutexes under steady contention.
Key Points
- All N threads end up in the kernel wait queue. Each acquire pays a wakeup, a context switch, and cold-cache restart.
- Critical section is short, but wall-clock time per acquire is dominated by the wakeup cost (microseconds), not the section itself (nanoseconds).
- Triggered by fair locks (FIFO wakeup), high contention, and short critical sections. The trio is the warning sign.
- Symptom: throughput plateaus or drops as you add threads, but no thread is CPU bound; perf shows time inside futex/wait, not inside the critical section.
- Fixes: reduce contention (sharding, RCU, lock-free), prefer adaptive/unfair locks (default in glibc, Java ReentrantLock without fair=true), or restructure so the section is so short that nobody ever blocks.
Interview Follow-up Questions
Why does fairness make convoys worse?
A fair lock parks the releasing thread instead of letting it grab the lock back. So even if the releasing thread is the one with the hottest cache and zero work to do before the next acquire, it has to go to the back of the queue. Every acquire pays a wakeup. An unfair lock lets a recently-running thread re-acquire (`barge`), which usually costs nothing because it never blocked in the first place. Throughput goes up, fairness goes down.
Glibc moved away from fair futexes years ago. Why is this still a problem?
Modern futex-based mutexes are unfair by default, which avoids the worst case. Convoys still appear when (1) fair locks are explicitly opted into (Java's ReentrantLock(true), Go's mutex in starvation mode after 1ms wait, pthread mutexes with PTHREAD_PRIO_INHERIT and FIFO scheduler), or (2) the critical section grows because of GC pauses, page faults, or scheduler preemption. Once parking starts, it tends to keep happening.
How is a convoy different from regular lock contention?
Regular contention: threads spin briefly, occasionally one parks, throughput scales sub-linearly with cores. Convoy: every acquire pays a wakeup, threads spend most of their wall-clock time in `park`, throughput goes flat or drops. The tell is the breakdown of time per acquire: critical section in nanoseconds, total time per op in microseconds. The gap is the convoy.
Can a convoy happen without a lock?
Yes. Channel-based serialization (Go), single-threaded actor (Erlang, Node.js), or a single-writer queue can all convoy under enough load. The mechanism is the same: producers stack up faster than the consumer can drain, and each producer pays a wakeup when finally serviced. The fix is also the same: shard the consumer, or batch.
Gotchas
- Adding a 'just to be safe' fair lock to fix a starvation complaint can tank throughput by 100x.
- Convoys are time-of-day sensitive: light load looks fine, peak load suddenly cliffs. Load-test under realistic concurrency.
- GC pauses and STW phases extend critical sections enough to push a healthy unfair mutex into convoy mode for a few seconds.
- Profilers that sample the running thread won't show the convoy: time is inside the kernel scheduler, not application code. Use mutex/contention profilers.
Scheduled Callback Executor (HashedWheelTimer)
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Sometimes
Schedule callbacks to run at future times. Naive heap-based scheduler scales to thousands; for millions, use a HashedWheelTimer, bucketize callbacks by absolute deadline, sweep one bucket per tick. Used by Netty, Linux kernel timer wheel, Kafka.
Key Points
- Heap-based: O(log N) insert, O(log N) extract, fine for thousands
- Hashed wheel: O(1) insert, O(N/buckets) per tick, scales to millions
- Trade precision for throughput: wheel has tick granularity (~10ms)
- Single tick thread + worker pool dispatches expired callbacks
Interview Follow-up Questions
When does a heap-based scheduler bottleneck?
With many concurrent timers being added/cancelled. Each operation is O(log N), and the heap mutex serializes them. Above ~50K timers, the lock becomes a hotspot. Wheels make insert O(1) at the cost of slightly imprecise expiration.
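A compact sketch of the wheel itself: tasks hash into a bucket by deadline, a 'rounds' counter handles deadlines beyond one revolution, and a single tick goroutine sweeps one bucket per tick. Real implementations dispatch to a worker pool and support cancellation; this sketch fires each callback on its own goroutine to stay short.

```go
package wheel

import (
	"sync"
	"time"
)

type task struct {
	rounds int // full wheel revolutions remaining before this task is due
	fn     func()
}

type Wheel struct {
	mu      sync.Mutex
	buckets [][]*task
	pos     int
	tick    time.Duration
}

func New(buckets int, tick time.Duration) *Wheel {
	w := &Wheel{buckets: make([][]*task, buckets), tick: tick}
	go w.run()
	return w
}

// Schedule is O(1): pick the slot, append.
func (w *Wheel) Schedule(d time.Duration, fn func()) {
	ticks := int(d / w.tick)
	if ticks < 1 {
		ticks = 1 // anything shorter than one tick fires on the next tick
	}
	w.mu.Lock()
	slot := (w.pos + ticks) % len(w.buckets)
	w.buckets[slot] = append(w.buckets[slot],
		&task{rounds: (ticks - 1) / len(w.buckets), fn: fn})
	w.mu.Unlock()
}

func (w *Wheel) run() {
	t := time.NewTicker(w.tick)
	defer t.Stop()
	for range t.C {
		w.mu.Lock()
		w.pos = (w.pos + 1) % len(w.buckets)
		var keep []*task
		for _, tk := range w.buckets[w.pos] {
			if tk.rounds > 0 { // due on a later revolution: keep, decrement
				tk.rounds--
				keep = append(keep, tk)
				continue
			}
			go tk.fn() // sketch: fire directly instead of handing to a worker pool
		}
		w.buckets[w.pos] = keep
		w.mu.Unlock()
	}
}
```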
How does Linux's timer wheel work?
Cascading (hierarchical) wheels: a fine-grained wheel covers the near future (the next few seconds), with coarser 'tens-of-seconds' and 'minutes' wheels above it. Tasks far in the future sit in a coarse wheel and cascade into finer wheels as their deadline approaches. Kafka's delayed-operation purgatory uses the same hierarchical-wheel idea.
What about cancellation?
Wheel-based: O(1), just mark the task as cancelled. The tick thread checks before dispatching. Heap-based: lazy delete (mark + skip) is O(1); eager delete is O(N) without cross-references; with parent pointers in the heap, O(log N).
Single tick thread vs multiple?
Almost always single. The tick thread does minimal work, it just dispatches to a worker pool. Multiple tick threads compete on the wheel locks; the contention is worse than the single-thread baseline.
In-Process Pub-Sub Bus
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Asked Often
An in-memory pub-sub: publishers send to topics, subscribers receive matching messages. Decisions: per-subscriber bounded queue (slow consumer doesn't block fast); ordered or fan-out; sync or async delivery. Trades latency for backpressure.
Key Points
- Topics → subscriber list; publish iterates and delivers (or queues per subscriber)
- Each subscriber has bounded queue → slow consumer doesn't backpressure publisher
- Drop-on-full vs block-on-full, slow-consumer policy is THE design decision
- Concurrent map for topic→subscribers (CHM, sync.Map, dict + RLock)
- Used by Spring Events, Akka Streams, in-process EventBus libraries
Interview Follow-up Questions
Why per-subscriber queues instead of one global queue?
Per-subscriber queues isolate slow consumers, one slow handler doesn't block others. Single global queue: every subscriber processes serially → throughput limited by slowest. Per-subscriber: each runs at its own pace, can be drop-on-full independently.
Drop-on-full vs block-on-full?
Drop-on-full preserves publisher throughput (publisher never blocks). Block-on-full applies backpressure, publisher slows when consumer can't keep up. Drop is right for telemetry/metrics; block is right for guaranteed delivery (orders, billing).
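A sketch of the drop-on-full variant: each subscriber gets a bounded channel and its own worker goroutine, and publish uses select-with-default so a full queue drops the message for that subscriber only while the publisher never blocks. Type and method names are illustrative.

```go
package bus

import "sync"

// Bus maps topics to per-subscriber bounded queues.
type Bus struct {
	mu   sync.RWMutex
	subs map[string][]chan []byte // topic -> subscriber queues
}

func New() *Bus { return &Bus{subs: map[string][]chan []byte{}} }

// Subscribe registers a handler with its own bounded queue and worker goroutine.
func (b *Bus) Subscribe(topic string, buf int, handle func([]byte)) {
	ch := make(chan []byte, buf)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	go func() {
		for msg := range ch {
			func() {
				defer func() { _ = recover() }() // isolate handler panics from the worker loop
				handle(msg)
			}()
		}
	}()
}

// Publish fans out to every subscriber; a slow subscriber only loses its own messages.
func (b *Bus) Publish(topic string, msg []byte) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[topic] {
		select {
		case ch <- msg: // queued
		default: // drop-on-full: publisher never blocks
		}
	}
}
```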
How are subscriber failures handled?
Each worker wraps the handler in try/catch, exceptions log and continue. Otherwise a thrown exception kills the worker thread, leaving the subscriber silently dead. Always isolate handler bugs from the worker loop.
How does this differ from Kafka?
Kafka has persistent log; messages survive restart; consumers track offsets. In-process bus is ephemeral, restart loses messages. Kafka is for cross-service durable pub-sub; in-process bus is for in-app event-driven decoupling.
DAG Task Runner with Backpressure
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Sometimes
Run tasks with dependencies, each task waits for its prerequisites. Topological-order scheduling + worker pool + bounded concurrency. Pattern: per-task CountDownLatch / Future / WaitGroup, with dependency graph driving submission.
Key Points
- Topological order: tasks run as soon as their dependencies complete
- Per-task synchronization: Future, CompletableFuture, CountDownLatch, errgroup
- Worker pool bounds concurrency; max-in-flight is the backpressure knob
- Failure propagation: if a task fails, downstream tasks are cancelled (or retried)
- Used by Apache Airflow, Argo Workflows, Make, build systems
Interview Follow-up Questions
How are cycles handled?
Detect them at submit time via DFS or topological-sort validation. If a task is added with a dependency that doesn't resolve, reject. Most DAG runners (Airflow, Argo) refuse cyclic graphs at definition time.
Failure semantics, what if task B depends on A and A fails?
Three strategies: (1) cancel all dependents (Airflow default, 'fail fast'), (2) retry A with backoff and continue, (3) skip B and proceed with whatever can still run. Airflow has a 'trigger_rule' that lets each task pick its policy.
How is resource usage bounded?
Worker pool with max parallelism (`g.SetLimit(N)` in Go's errgroup, ThreadPoolExecutor in Java). For finer control, per-resource semaphores, DB-bound tasks share one limit, network-bound tasks share another.
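A sketch of a DAG runner: every task gets a goroutine that waits on its dependencies' done channels, and a separate semaphore bounds how many tasks actually run at once. Bounding with a semaphore rather than errgroup's SetLimit means a task blocked on its dependencies doesn't occupy a run slot, which avoids a deadlock when dependents are submitted before their prerequisites. Cycle detection is assumed to happen at definition time.

```go
package dag

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type Task struct {
	Name string
	Deps []string
	Run  func(ctx context.Context) error
}

// RunDAG runs every task once its dependencies have completed, with at most
// maxParallel tasks executing simultaneously. A failure cancels the group,
// so blocked dependents unwind via ctx.Done() (fail-fast semantics).
func RunDAG(ctx context.Context, tasks []Task, maxParallel int) error {
	done := make(map[string]chan struct{}, len(tasks))
	for _, t := range tasks {
		done[t.Name] = make(chan struct{})
	}
	sem := make(chan struct{}, maxParallel) // the backpressure knob

	g, ctx := errgroup.WithContext(ctx)
	for _, t := range tasks {
		t := t
		g.Go(func() error {
			for _, dep := range t.Deps { // wait for prerequisites
				select {
				case <-done[dep]:
				case <-ctx.Done():
					return ctx.Err()
				}
			}
			select {
			case sem <- struct{}{}: // acquire a run slot
			case <-ctx.Done():
				return ctx.Err()
			}
			defer func() { <-sem }()

			if err := t.Run(ctx); err != nil {
				return err // group cancels; dependents never see done[t.Name] close
			}
			close(done[t.Name]) // release dependents
			return nil
		})
	}
	return g.Wait()
}
```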
How does this scale to distributed execution?
Same DAG, but tasks run on different machines. Dependency tracking moves to a coordinator (Airflow scheduler, Kubernetes Argo controller). Per-task results stored externally (DB, S3). Workers pull tasks whose deps are done.
Inventory Reservation with Concurrent Holds
Section: Real-World Scenarios | Difficulty: Hard | Frequency: Asked Often
Booking system: 100 concurrent users grab the last seat. Only one wins. The losers see 'sold out' before completing checkout. Pattern: optimistic locking with version-stamped CAS, OR pessimistic per-item lock, OR atomic decrement on a Redis counter.
Key Points
- Optimistic: version + CAS, read inventory, attempt update, retry if version changed
- Pessimistic: lock per item, serializes high-contention items but cleaner semantics
- Soft holds: reserve for N minutes, expire if checkout doesn't complete
- Distributed: Redis DECR, or Postgres SELECT FOR UPDATE, or two-phase commit
Interview Follow-up Questions
Optimistic vs pessimistic, when?
Optimistic when contention is low (conflicts are rare, so the CAS rarely retries). Pessimistic when contention is high (a hot key with many users grabbing the last seat turns the CAS retry loop into a busy wait). For the 'last seat' interview question, pessimistic usually wins: it serializes contenders naturally.
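A minimal sketch of the optimistic path on an in-process counter; the same loop shape applies when the CAS is a versioned UPDATE against a database row or a Redis counter.

```go
package inventory

import "sync/atomic"

// Reserve attempts an optimistic reservation: read the remaining count, try a
// compare-and-swap, retry if another goroutine won the race. Losers see "sold out"
// before ever reaching checkout.
func Reserve(remaining *atomic.Int64) bool {
	for {
		cur := remaining.Load()
		if cur <= 0 {
			return false // sold out
		}
		if remaining.CompareAndSwap(cur, cur-1) {
			return true // this caller holds the seat
		}
		// CAS lost: someone reserved between Load and CAS; retry with a fresh value.
	}
}
```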
How is this made distributed?
Redis: `DECR inventory:item-id` is atomic; if result < 0, increment back and reject. Or Postgres: `SELECT ... FOR UPDATE` row lock + decrement in transaction. Both are pessimistic; both serialize the hot row.
What about concert ticket sales, million users, tiny inventory?
Standard pattern: virtual queue with cap (e.g., 50K users in queue, 10K seats). Behind the queue, soft holds with short TTL (~3 min). Avoids stampeding the inventory layer; queue throttles attempts. Ticketmaster, BookMyShow do this.
Race condition: user clicks 'reserve' twice, same hold or two holds?
Idempotency: pass a client-provided idempotency key. Server stores the first request's hold ID under that key; subsequent requests get the same ID. Stripe pioneered the API pattern.
Distributed Lock: Redis or ZooKeeper
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
Mutual exclusion across processes/machines. Three implementations: SETNX with TTL on Redis (cheap, mostly correct), Redlock (Redis cluster, controversial safety), ZooKeeper ephemeral nodes (strongly correct, more ops). All require a TTL/lease so a crashed holder doesn't block forever. Always design for: 'what if the holder thinks it has the lock but actually doesn't?'
Key Points
- Redis SET key value NX EX ttl is the simplest distributed lock. Atomic, TTL-based, fast.
- Lock holder must store a unique token (UUID) and check it on release; otherwise release-by-someone-else is possible.
- TTL is required: if the holder crashes, the lock auto-expires. Pick TTL > expected critical-section duration.
- Fencing tokens (monotonic ID) protect against zombies: 'I had the lock, then GC paused me, lock expired, someone else took it, then I came back and acted on stale belief'.
- ZooKeeper / etcd / Consul give stronger guarantees (linearisable, sequential ephemeral nodes) at higher operational cost.
Interview Follow-up Questions
Is Redis SETNX really safe?
For most use cases where the cost of double-execution is acceptable, yes. For workloads where double-execution corrupts data (writes to a database that doesn't enforce idempotency), no. The classic Redlock-vs-Martin-Kleppmann debate is exactly this. The pragmatic answer: use Redis SETNX for non-critical mutual exclusion (rate limiters, scheduling); use ZooKeeper/etcd or fencing tokens for safety-critical paths.
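A sketch of the SET NX EX acquire plus token-checked release, assuming the go-redis v9 client; the release must be a Lua script so the GET-and-DEL is atomic. This is the single-node variant, not Redlock, and the safety caveats discussed here still apply. Any unique value works as the token; a UUID is used for convenience.

```go
package dlock

import (
	"context"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// Delete the key only if it still holds our token, atomically.
const releaseScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`

// Acquire tries SET key token NX EX ttl; the token proves ownership on release.
func Acquire(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) (token string, ok bool, err error) {
	token = uuid.NewString()
	ok, err = rdb.SetNX(ctx, key, token, ttl).Result()
	return token, ok, err
}

// Release deletes the lock only if it still carries our token.
func Release(ctx context.Context, rdb *redis.Client, key, token string) error {
	return rdb.Eval(ctx, releaseScript, []string{key}, token).Err()
}
```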
What is a fencing token?
A monotonically increasing number associated with each lock acquisition. The lock holder includes it on every protected operation; the resource (database, file, etc.) checks it against the largest token it has seen. If the token is smaller, the operation is rejected. Prevents 'zombie holder' (a holder whose lock expired but who still thinks it has it) from corrupting state.
Why does TTL matter so much?
Because the holder might crash. Without TTL, the lock is held until manually released. A crashed process never releases. The lock blocks forever. With TTL, a crashed holder's lock auto-expires, and someone else can take over. Trade-off: TTL too short means the work might not finish before the lease expires. Either keep work short or extend the lease via heartbeat.
When should I reach for ZooKeeper instead of Redis?
When correctness matters more than simplicity. ZK is built for coordination: linearisable, ephemeral nodes auto-clean on disconnect, sequential nodes give natural fencing tokens. Operationally heavier than Redis (a ZK ensemble must be run). For high-stakes locks (leader election, primary-replica failover), ZK or etcd is the right choice. For lower-stakes (rate limit buckets, scheduled job dedup), Redis is fine.
Gotchas
- No TTL: crashed holder blocks the lock forever
- No token check on release: someone else's release deletes the lock
- Long-running work without lease extension: TTL expires mid-work, two holders concurrent
- Treating SETNX as linearisable: it is not under all failure modes
- No fencing token in critical paths: zombies can corrupt data after lock expiry
Heartbeat & Leader Election
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
Heartbeat: each instance periodically writes 'I'm alive' to shared storage. Leader election: instances compete to become the single coordinator; loser instances watch the leader's heartbeat. When the leader's heartbeat stops, a new election runs. Backed by ZooKeeper, etcd, Consul, or a simpler Redis lease.
Key Points
- Heartbeat: leader writes a key with TTL every N seconds; if it stops writing, the key expires.
- Election: candidates try to atomically claim the leader key. The winner is the leader. Losers watch.
- Lease > heartbeat interval × 3: tolerates transient network blips without flapping.
- Always design for split-brain: two instances that both think they are leader. Fencing tokens prevent corruption.
- ZooKeeper, etcd, Consul, Kubernetes leases give correct primitives. Redis with TTL is simpler but has known edge cases.
Interview Follow-up Questions
What is split-brain and how does it happen?
Two instances both believe they are the leader. Happens during network partitions: leader A is fine but the network between A and the coordination service is broken; A's lease expires; B claims leadership; network heals; A still thinks it's leader. Both write. Fencing tokens (monotonic counter on each leadership claim) let downstream resources reject the stale leader's writes.
ZooKeeper / etcd / Consul vs Redis for election?
ZK/etcd/Consul are designed for coordination: linearisable, ephemeral nodes that auto-clean on disconnect, sequence numbers as natural fencing tokens. Operationally heavier (the coordination cluster must be run). Redis is simpler but well-known to have edge cases (the Redlock debate). For high-stakes leadership (primary database failover), use ZK/etcd. For low-stakes (cron deduplication), Redis is fine.
How do I size the lease TTL?
Trade-off between failover speed and stability. Short TTL (e.g., 5s): fast failover when the leader dies, but risk of flapping during transient network blips. Long TTL (e.g., 60s): stable, but slow failover. Most production systems use 15-30s lease with renewal every 5-10s. The 'renew interval × 3 ≤ lease' rule of thumb keeps the lease healthy under one missed renewal.
Does the leader need to know it's still the leader?
Yes. Pattern: leader holds a 'leadership context' that is cancelled if the renewal fails. All work performed by the leader checks this context. If the context is cancelled (lease lost), the leader stops doing work immediately, even if it hasn't yet learned about the new leader. Without this, a leader running a slow operation can keep writing after losing leadership.
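A sketch of that pattern: a renewal loop that cancels a shared work context the moment a renewal fails. `renew` is a placeholder for the real lease call (etcd keepalive, Redis EXPIRE, a Kubernetes Lease update); initial acquisition is assumed to have already happened.

```go
package leader

import (
	"context"
	"time"
)

// RunAsLeader renews the lease on a fixed interval and cancels the returned work
// context the moment a renewal fails. All leader-only work must watch this context.
func RunAsLeader(parent context.Context, interval time.Duration,
	renew func(context.Context) error, work func(context.Context)) {

	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				if err := renew(ctx); err != nil {
					cancel() // lease lost: stop leader work immediately
					return
				}
			case <-ctx.Done():
				return
			}
		}
	}()

	work(ctx) // long-running leader duties; must check ctx.Done() regularly
}
```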
Gotchas
- TTL too short: harmless network blip causes leadership flap
- Renewing on a separate connection: connection drops independently, lease lost while leader is fine
- Doing work after losing leadership: the new leader is also writing; conflict
- No fencing token: stale leader corrupts data after lease expiry
- Single-region election with no fencing: cross-region partition can have two regional leaders
Saga Pattern: Distributed Transactions Without 2PC
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
A saga executes a multi-step business transaction across services as a sequence of local transactions, each with a compensating transaction that undoes it. If any step fails, run the compensations for the steps already done. Replaces two-phase commit (2PC) for the microservices era. Two flavours: orchestrated (a coordinator drives) or choreographed (each service reacts to events).
Key Points
- Local transactions per step. Each has a compensating transaction (cancel order, refund payment, restore inventory).
- On failure at step N, run compensations for steps N-1, N-2, ..., 1 in reverse order.
- Compensations must be idempotent: they may be replayed. Use idempotency keys.
- Orchestrated: one coordinator decides next step, calls each service, handles failures. Easier to reason about.
- Choreographed: each service publishes events, others react. Decouples services but harder to debug.
Interview Follow-up Questions
Saga vs 2PC?
2PC (two-phase commit) gives strong consistency: all participants either commit or all abort. Requires a transaction coordinator and locks across services for the full duration. Doesn't scale beyond a few services and breaks under network partitions. Saga gives eventual consistency: each step commits its local transaction; on failure, compensations undo. Scales to many services, tolerant of partitions, but readers may see partial state during execution.
Orchestrated vs choreographed?
Orchestrated: coordinator service knows the workflow, calls each service in order, handles failures centrally. Pros: visible flow, easy to add steps, easy to monitor. Cons: coordinator becomes a single point of complexity. Choreographed: each service reacts to events; no coordinator. Pros: services are independent. Cons: workflow is implicit (no single place shows it), debugging spans logs across services. Default to orchestrated unless there's a strong reason otherwise.
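A minimal orchestrator sketch: run the steps forward, and on the first failure run the completed steps' compensations in reverse. Persisting progress between steps, which a production coordinator must do, is omitted.

```go
package saga

import "context"

// Step pairs a forward action with its compensation. Both should be idempotent.
type Step struct {
	Name       string
	Do         func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

// Run executes steps in order; on the first failure it compensates the
// already-completed steps in reverse order.
func Run(ctx context.Context, steps []Step) error {
	completed := make([]Step, 0, len(steps))
	for _, s := range steps {
		if err := s.Do(ctx); err != nil {
			for i := len(completed) - 1; i >= 0; i-- { // undo in reverse order
				if cerr := completed[i].Compensate(ctx); cerr != nil {
					return cerr // compensation failed: persist state and page a human
				}
			}
			return err
		}
		completed = append(completed, s)
	}
	return nil
}
```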
Why must compensations be idempotent?
Because the saga can crash and restart. The coordinator may run a compensation, crash before recording success, restart, and run it again. Idempotent compensations mean the second run is a no-op. Use idempotency keys per compensation, the same way they're used for forward operations.
What if a compensation itself fails?
Now there's an unrecoverable inconsistency that needs human intervention. Persist the saga state, alert on-call, manually reconcile. Don't pretend it can be automated away; design for the case where compensation fails to converge and a human has to look.
Gotchas
- Forgetting idempotency keys: replays double-charge, double-refund
- Not persisting state between steps: coordinator crash loses progress
- Choreographed saga across many services: workflow is invisible, debugging is awful
- Treating compensations as 'just like forward, in reverse': they're not, semantics differ (refund != negative charge)
- No alert on compensation failure: silent inconsistency
Outbox Pattern: Reliable Event Publishing
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
Write the event to an 'outbox' table in the SAME transaction as the business write. A separate process polls the table and publishes events to Kafka/queue/webhook. Solves the dual-write problem: a database AND a message broker can't be written to atomically, so events would otherwise be lost or duplicated when one of the writes fails.
Key Points
- Dual write problem: 'INSERT order' + 'publish OrderCreated event' is two writes, no atomic guarantee. Either can fail.
- Outbox: write the event row in the SAME DB transaction as the business write. Either both happen or neither.
- A relay process polls the outbox, publishes each event to the broker, marks it published (or deletes it).
- At-least-once: the relay can publish, crash, restart, publish again. Subscribers must be idempotent.
- CDC alternative: tools like Debezium read the database WAL and publish change events without an outbox table.
Interview Follow-up Questions
Why not just use a distributed transaction (XA)?
XA between a database and a message broker exists in theory but is rarely used in practice. It locks both the DB and the broker until commit; performance is poor; not all brokers support it (Kafka doesn't). The outbox sidesteps the problem: only the DB is involved in the transaction.
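A sketch of the write side: the business row and the outbox row go into one transaction, using database/sql. Table and column names are illustrative.

```go
package outbox

import (
	"context"
	"database/sql"
)

// CreateOrder writes the business row and the outbox event in ONE transaction,
// so either both exist or neither.
func CreateOrder(ctx context.Context, db *sql.DB, id string, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if Commit succeeds

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO orders (id, payload) VALUES ($1, $2)`, id, payload); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', $1)`, payload); err != nil {
		return err
	}
	return tx.Commit()
	// A separate relay polls: SELECT ... FROM outbox FOR UPDATE SKIP LOCKED,
	// publishes to the broker, then deletes (or marks) the row.
}
```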
What about Debezium / CDC?
Debezium reads the database's write-ahead log (WAL) and publishes changes as events to Kafka. With CDC, no outbox table is needed — every INSERT/UPDATE/DELETE becomes an event. Trade-off: every change gets published including ones not meant as events; the event payload can't be shaped (it's the row); CDC requires platform-specific setup. Outbox gives cleaner event design at the cost of one extra table and the relay process.
Does the outbox pattern guarantee exactly-once delivery?
No. It gives at-least-once: the event will eventually be published, possibly more than once if the relay crashes between publishing and marking-sent. Consumers must be idempotent. Combine with idempotency keys at the consumer side to get exactly-once-effect.
How big should the outbox table grow?
After events are published, the row can either be deleted or kept as an audit log. For high-volume systems, delete to keep the table small (and fast for the relay query). For audit / replay needs, keep with a TTL (delete rows older than 30 days). Either way, partition the table or run regular VACUUM to keep it healthy.
Gotchas
- Forgetting to put the outbox INSERT in the same transaction defeats the pattern
- Relay marks events 'sent' before actually publishing: events lost on crash
- Single relay instance: bottleneck and SPOF; run multiple with SKIP LOCKED
- Letting the outbox table grow unbounded: relay queries slow down, table bloats
- Consumers not idempotent: at-least-once delivery causes duplicate side effects
Cancellation Propagation Across Services
Section: Real-World Scenarios | Difficulty: Advanced | Frequency: Asked Often
When a request is cancelled (client disconnect, deadline exceeded, parent abort), the cancellation must propagate through every downstream call so wasted work stops everywhere. gRPC ships deadlines in headers; HTTP relies on connection close detection; Go does it with context; Java with CompletionStage / virtual thread interrupts. Done wrong, zombie work ties up resources after the user has gone.
Key Points
- Cancellation is end-to-end: a single layer that ignores it leaves zombie work running
- Client disconnect: the server must detect socket close and stop processing (r.Context() in Go's net/http, request.is_disconnected() in Starlette)
- gRPC propagates a deadline header (grpc-timeout); intermediate services subtract elapsed time and pass the remainder downstream
- HTTP has no standard deadline header; pass via X-Request-Timeout or rely on connection lifetime
- Forgotten cancellation = thread/goroutine pool exhaustion as zombies pile up
Interview Follow-up Questions
How is a client disconnect detected?
Depends on the framework. Go's net/http: r.Context().Done() fires. Node.js: req.on('close') fires. Starlette/FastAPI: await request.is_disconnected(). Java Servlet: AsyncListener.onTimeout. The detection isn't magic; the underlying TCP socket close is what triggers it. If the runtime doesn't expose this, the disconnect isn't detectable, and the only fallback is to enforce a deadline.
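A sketch in Go's net/http showing the propagation: derive the downstream call's context from r.Context(), so both the client disconnect and the deadline flow through to the outbound request. The downstream URL is a placeholder.

```go
package handler

import (
	"context"
	"net/http"
	"time"
)

// handle derives the downstream timeout from the request context, so the
// downstream call is aborted if either the client disconnects or the deadline fires.
func handle(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://downstream.internal/api", nil)
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req) // cancelled when ctx is done
	if err != nil {
		http.Error(w, "upstream unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(http.StatusOK)
}
```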
How does gRPC propagate deadlines?
On the wire, gRPC sends a grpc-timeout header (e.g., 'grpc-timeout: 500m' for 500ms). Server-side libraries set the request context deadline from this. When the server makes further gRPC calls and passes the ctx, the client library reads the remaining deadline from ctx and sends a smaller grpc-timeout. Each hop subtracts the elapsed time. This is deadline propagation across services, and it's the main reason gRPC is preferred over REST for inter-service calls.
What about plain HTTP? Is there a deadline header?
Not officially. Some teams use X-Request-Timeout or X-Deadline as a custom convention. Others rely on connection-level timeouts and the client disconnect signal. When both sides are under the team's control, gRPC or a custom header works. Otherwise, the only options are TCP and per-hop timeouts.
What's the difference between cancellation and timeout?
Timeout is a deadline; cancellation is the act of aborting work in progress. A timeout PRODUCES a cancellation when it fires. Cancellation can also come from other sources: parent task cancelled, user pressed cancel, system shutdown. The handler logic is the same: check ctx.Done() / is_cancelled(), unwind cleanly, release resources.
Gotchas
- Forgetting to pass ctx to a downstream call: that layer ignores the deadline, zombies pile up
- Long CPU loops without ctx checks: cancellation can't interrupt CPU work, only I/O
- Deadline too tight at the edge: real requests time out under normal load
- Deadline too loose at the edge: zombies pile up before deadline hits
- Treating cancellation as exceptional: it's a normal part of every request lifecycle
CPU-bound vs I/O-bound
Section: Performance & Tradeoffs | Difficulty: Basic | Frequency: Asked Often
A CPU-bound task spends most of its time doing computation; an I/O-bound task spends most of its time waiting. The two workloads need fundamentally different concurrency strategies, and picking the wrong one makes the code slower, not faster.
Key Points
- CPU-bound: bottleneck is compute. Speedup requires more cores OR faster code.
- I/O-bound: bottleneck is waiting (network, disk, DB). Speedup requires more concurrency to overlap waits.
- Threads help I/O-bound work even with the GIL, Python releases the GIL around blocking calls
- Threads do NOT help CPU-bound work in Python (GIL serializes execution), use multiprocessing
- Async/await scales I/O-bound to 100K+ tasks but does nothing for CPU-bound
- Right pool size: CPU-bound ≈ core count; I/O-bound ≈ much higher (often 100s)
Interview Follow-up Questions
How does one tell if code is CPU or I/O bound?
First step: `time` it. If user+sys ≈ real, CPU-bound; if user+sys << real, I/O-bound. Then profile (py-spy / pprof / async-profiler) to confirm where the time goes. Most production code is I/O-bound, DBs, RPCs, disk.
Why doesn't async/await help CPU-bound code?
Async/await is cooperative single-threaded concurrency. There's only one thread, so CPU work runs on it serially. A CPU-heavy task in an async function blocks the event loop and stalls every other task. Use a thread/process pool for CPU work, async for I/O.
What's the right thread pool size?
CPU-bound: cores (sometimes cores+1 to handle minor I/O). I/O-bound: cores × (1 + wait/compute). For 90% I/O work, that's 10× cores. With Java 21+ virtual threads or Go goroutines, the answer collapses to 'one per task.'
When does CPU-bound matter in practice?
Image/video encoding, ML inference (without GPU), regex on huge inputs, JSON parsing of massive payloads, cryptography, compression. Most web services are I/O-bound; data pipelines are often CPU-bound.
Why does Python's GIL release on I/O?
Blocking system calls (sockets, file I/O, sleep) release the GIL because the C implementation knows the thread will idle. Other Python threads can run during the wait. CPU work in pure Python keeps the GIL held until a bytecode boundary.
Gotchas
- A function can be CPU-bound in one workload and I/O-bound in another, measure on actual data
- Hashing/JSON-parsing/compression in 'I/O' code paths is often the hidden CPU bottleneck
- GPU-accelerated compute is technically I/O from the CPU's perspective, different concurrency model again
- Network I/O can become CPU-bound at very high throughput (parsing dominates)
- asyncio + a CPU-heavy task = stalled event loop and timeouts on every concurrent task
Throughput vs Latency Tuning
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Asked Often
Throughput = ops/sec the system can sustain. Latency = time per op. They trade off via batching, queuing, and concurrency. Higher concurrency → higher throughput but worse tail latency. Little's Law: average concurrency = throughput × average latency.
Key Points
- Throughput: system-level capacity (req/sec)
- Latency: per-op time, especially p50/p99/p99.9 percentiles
- Little's Law: L = λW (concurrent in-flight = throughput × avg latency)
- Batching trades latency for throughput (waits for batch, then processes many)
- Queue length grows without bound once arrival rate exceeds service rate; latency blows up as utilisation approaches 1
Interview Follow-up Questions
What's Little's Law and why does it matter?
L = λW. Average number of items in system = arrival rate × average time in system. Useful for prediction: at 1000 req/sec and average latency 100ms, there are 100 in-flight requests at any moment. Size connection pool, thread pool, and queue depth around this.
Why does p99 matter more than average?
Page-load times are bottlenecked by the slowest of N parallel requests. If 99% of requests are fast and 1% are slow, the 1% dominates user experience. Average masks tail. Always track p50 (median), p99 (tail), p99.9 (worst-case).
How does batching affect latency?
Batching adds wait-for-batch time to each operation. If batch size = 100 and ops arrive at 1ms intervals, last-in-batch waits 99ms before processing. But total processing for 100 ops drops dramatically. Net: higher throughput, individual latency up by batch wait.
What's USL (Universal Scalability Law)?
Neil Gunther's law: as concurrency N increases, throughput grows linearly (in idealized world), then sublinearly (contention), then DROPS (coordination cost). Real systems have an optimal N, beyond that, more threads = lower throughput. Find N empirically.
Gotchas
- Average latency hides tail, always look at p99/p99.9
- Adding threads beyond optimal N HURTS throughput (USL)
- Unbounded queues turn brief overload into permanent latency degradation
- Cache hit rate drops when concurrency increases (cache-line contention, eviction)
Context Switch Cost
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
OS thread context switch: ~5μs (kernel-mediated, save/restore registers, TLB flush). Goroutine / virtual thread context switch: ~hundreds of ns (user-space, no kernel transition). Async task switch: ~tens of ns (function call). Choosing the right unit affects throughput at scale.
Key Points
- OS thread switch: ~1-5μs typical, includes kernel mode switch + TLB pollution
- Goroutine / virtual thread switch: ~200-500ns, user-space only
- asyncio task switch: ~50-100ns, just a coroutine resumption
- At 1M req/sec, a 1μs context switch costs 1 second of CPU per second, 100% of one core
- Mass goroutine spawning is fine; mass thread spawning hits scheduler limits
Interview Follow-up Questions
Why do OS thread switches cost so much more?
Three reasons: (1) kernel mode transition (~hundreds of ns by itself); (2) full register save/restore including FP and SIMD state; (3) TLB pollution, virtual memory mappings get partially invalidated, causing subsequent memory accesses to refill from page tables. Goroutine switches do none of this, same address space, just swap stack pointers.
What's TLB pollution?
TLB = Translation Lookaside Buffer, the CPU's cache of virtual→physical address translations. Switching processes flushes it; switching threads in the same process partially flushes (TLB tagged by ASID on modern CPUs). After a thread switch, the next N memory accesses are slower because they refill the TLB.
How are context switches measured in production?
Linux: `vmstat 1` shows cs (context switches per second) globally. Per-process: `pidstat -w 1`. For perf-critical services, monitor cs/sec; sudden spikes correlate with throughput drops. Goroutine 'switches' don't show in OS metrics, use Go's pprof/sched.
When does context-switch cost become the bottleneck?
When tasks are very short. If each task does 1μs of work and incurs a 5μs context switch, 5x more time goes to switching than working. Either batch tasks (do many per switch) or use cheaper switching (goroutines, async).
Gotchas
- Mass-spawning OS threads (e.g., one per request) hits OS scheduler limits at ~10K threads
- vmstat cs only counts OS-level switches; goroutine and async switches invisible
- TLB pollution makes the FIRST few accesses after a switch much slower than steady-state
- Hyper-threading hides some of the stall by letting the sibling logical core run, but it doesn't remove the switch cost
- Synchronized blocks in Java pin virtual threads to carriers, losing the cheap-switch benefit
Memory Overhead Comparison
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
OS thread: ~1MB stack reserved. Goroutine: ~2KB initial, grows on demand. Virtual thread (Java 21): ~few KB. asyncio task: ~1KB. The difference dictates whether you can run 10K vs 1M concurrent units on the same hardware.
Key Points
- OS thread: 1MB default stack reserved (mostly untouched) → 10K threads = 10GB of reserved address space
- Goroutine: 2KB initial, grows up to 1GB → 10K goroutines = 20MB
- Virtual thread (Java 21): ~few KB stack → 1M virtual threads ~ few GB
- asyncio task: ~1KB per task (Python) → 1M tasks ~ 1GB
- Per-thread/task overhead matters when you serve millions of concurrent connections
Interview Follow-up Questions
Why do OS threads reserve 1MB even if unused?
Stack is virtually mapped at thread creation; physical pages allocated only when touched (most aren't). RSS may be smaller than 1MB × thread count, but the virtual address space cost is real; eventually 32-bit address space exhausts (less of an issue on 64-bit, but still). Some kernels actually allocate the stack eagerly.
How does goroutine stack 'grow on demand'?
Each goroutine starts with a tiny stack (2KB in Go 1.4+). When a function call would overflow it, the runtime allocates a larger stack, copies the contents, updates pointers, and retries the call. Cost is paid per growth (rare in practice). Stacks also shrink at GC if usage drops.
What's the memory cost of 1M asyncio tasks?
Each task is a coroutine state object, ~1KB in Python (the suspended frame, locals, await chain). 1M tasks ~ 1GB. Add aiohttp connections (~few KB each) and the total reaches several GB. Bound concurrency or risk OOM.
Where does ThreadLocal cost matter?
Each thread has a ThreadLocal map. Java apps with many ThreadLocals (Spring, MDC, request context, ORM) can use ~50KB per thread JUST for ThreadLocals. With 1M virtual threads, that's 50GB. Java 21 ScopedValue solves this for new code.
Gotchas
- 1M threads on Linux = ~1MB×1M = 1TB of reserved stack, won't fit on most servers
- Async stacks live across awaits, long-lived tasks accumulate stack memory
- ThreadLocal × virtual threads = memory explosion (use ScopedValue in Java 21+)
- Goroutine stack growth costs CPU during the copy, usually invisible but profiles can show it
Amdahl's Law
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
Maximum speedup from parallelisation = 1 / (s + p/N), where s is serial fraction, p is parallel fraction, N is processor count. Even small serial fractions cap maximum speedup. 5% serial = 20x maximum, no matter how many cores. The implication: optimising the serial part is often the highest-leverage work.
Key Points
- Speedup(N) = 1 / (s + (1-s)/N). As N → ∞, speedup → 1/s.
- 5% serial = 20x maximum speedup. 1% serial = 100x. 10% = 10x.
- Real systems have multiple serial parts: I/O, lock contention, sequential bottlenecks.
- Highest-leverage optimisation: reduce the serial fraction, not just add cores.
- Universal Scalability Law (USL) extends Amdahl with a contention term, often more realistic.
Interview Follow-up Questions
What's the practical lesson?
Throwing cores at a problem has diminishing returns. To break the limit, reduce the serial fraction. Profile to find what's serial: a single-threaded log writer, a global lock, a sequential I/O to a shared backend. Eliminate or parallelise it; only then does adding cores keep helping.
How does Amdahl differ from USL?
Amdahl assumes parallel portions scale perfectly. USL (Universal Scalability Law) adds a contention term and a coherency term, modeling cases where adding cores actually slows you down (cache-line bouncing, lock contention overhead). USL fits real systems better; Amdahl is the simpler upper bound.
Does Amdahl apply to distributed systems?
The same shape. The 'serial' part might be a coordinator, a leader, a global lock, a single-shard query. Most scaling problems in distributed systems are Amdahl-shaped: 'this part of the workflow can't be parallelised, and it caps the throughput'.
Why don't speedups in benchmarks match Amdahl predictions?
Several reasons. Benchmark workload may not match production (different serial fraction). Cache effects help on some sizes (more cores = more cache total) and hurt on others (false sharing). System overhead (scheduler, GC) is non-trivial. Amdahl is the upper bound; real speedup is at or below it.
Gotchas
- Adding cores past the Amdahl ceiling is wasted money
- 'Embarrassingly parallel' workloads still have a serial setup/teardown that limits speedup at scale
- Lock contention is a hidden serial fraction; profile to find it
- Per-core wins (better algorithm, less contention) sometimes beat throwing more cores
- Real systems are often USL-shaped (degradation past peak), not Amdahl-shaped (asymptote)
Universal Scalability Law (USL)
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
USL extends Amdahl's Law with a coherency term. Speedup = N / (1 + α(N-1) + βN(N-1)). The β term captures the case where adding cores actually slows you down (cache coherency, lock contention). Real systems peak at some N and degrade beyond it. USL fits this; Amdahl doesn't.
Key Points
- α (contention): the part that serialises, like Amdahl's serial fraction.
- β (coherency): the cost of coordinating between cores; quadratic in N.
- Real systems often peak at finite N, then degrade. USL predicts the peak; Amdahl doesn't.
- Fitting USL to your benchmark data tells you where to invest: reduce contention OR coordination.
- Practical use: capacity planning. Know the peak before you scale past it and waste money.
Interview Follow-up Questions
How does USL differ from Amdahl?
Amdahl predicts an asymptote; USL predicts a peak followed by decline. Real systems usually have the peak shape: adding cores helps up to some N, then cache coherency and contention overhead exceed the parallelism gain, and you go backward. Amdahl is a useful upper bound; USL is closer to what you'll actually measure.
What does the β term physically represent?
The cost of cores coordinating with each other. Cache coherency traffic (cache lines bouncing between L1 caches), atomic operations on shared memory, lock acquisition cost, NUMA cross-socket access. These costs grow as cores multiply because every core might need to talk to every other core.
How is USL used in practice?
Two ways. (1) Capacity planning: fit USL to benchmark data, predict the peak, don't scale past it. (2) Optimisation guidance: a high-α fit means 'reduce serial work'; a high-β fit means 'reduce inter-core coordination'. The shape of the curve points at the right kind of fix.
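The formula and its peak are small enough to compute directly: setting the derivative of the USL curve to zero gives a peak at N* = sqrt((1-α)/β). A tiny helper, assuming α and β have already been fitted to benchmark data:

```go
package usl

import "math"

// Speedup is the USL prediction for n workers given contention (alpha) and
// coherency (beta) coefficients fitted from benchmark data.
func Speedup(n, alpha, beta float64) float64 {
	return n / (1 + alpha*(n-1) + beta*n*(n-1))
}

// PeakN is the concurrency at which USL predicts throughput peaks: sqrt((1-alpha)/beta).
func PeakN(alpha, beta float64) float64 {
	return math.Sqrt((1 - alpha) / beta)
}
```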
Where can USL not be applied?
USL assumes a single bottleneck shape (contention + coherency). For systems with multiple distinct bottlenecks at different N (memory at N=4, network at N=16, disk at N=64), USL gives a single curve that doesn't fit any one bottleneck well. For those, profile each bottleneck separately.
Gotchas
- Fitting USL to too few data points (3-4) gives unreliable α, β estimates
- Treating the asymptote as the goal: real systems peak and decline; the peak is the goal
- Ignoring the curve shape: high α and high β need different fixes
- Adding cores past the USL peak makes you slower AND more expensive
- USL is per-workload; same hardware can have very different curves for different workloads
Tail Latency: Why p99 Matters More Than Average
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Asked Often
p99 (the 99th percentile latency) is the user experience; the average is a lie. Tail latency amplifies in fan-out: a request that fans out to 10 services has a p99 dominated by the slowest of 10 backends, not by their average. Mitigations: hedged requests, tied requests, micro-partitioning, smaller fan-out, better tail metrics.
Key Points
- Average hides outliers. The p99 IS the user experience for 1% of requests, which is millions when you serve millions.
- Tail amplification: a request fanning out to N backends has p99 ≈ max of N independent draws. With N=10, p99 ≈ p99.9 of one backend.
- Mitigations: hedged requests (send to two replicas, take first), tied requests (cancel the loser), micro-batching, fewer fan-outs.
- Coordinated omission: load testing tools that wait for slow responses miss tail latency. Use HDR Histogram, k6, or wrk2.
- Read 'The Tail at Scale' by Dean & Barroso (2013); foundational paper, still relevant.
Interview Follow-up Questions
Why does p99 amplify with fan-out?
Each backend independently has some chance of being slow. Fanning out to N backends and waiting for all means waiting for the slowest. The probability that ALL N are fast is much lower than the probability that ONE is fast. With N=10 and per-backend p99=100ms, ~10% of fan-out requests will hit at least one slow backend. The fan-out p99 lands near the original p99.9 of one backend.
What's the simplest way to reduce tail latency?
Hedged requests. Send to one replica; if not done after the p95 latency, send a duplicate to another replica; take the first response. ~5% extra load, dramatic p99 improvement (often 5-10x). Only safe with idempotent operations.
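A sketch of a hedged call: fire the primary, arm a timer at roughly the p95 latency, fire the duplicate only if the timer wins, and cancel whichever request loses. `call` is a placeholder for the real replica client; idempotent operations only.

```go
package hedge

import (
	"context"
	"time"
)

// Get sends to replica 0; if it hasn't answered within hedgeAfter (≈ p95 latency),
// it also sends to replica 1 and returns the first response to arrive.
func Get(ctx context.Context,
	call func(ctx context.Context, replica int) (string, error),
	hedgeAfter time.Duration) (string, error) {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever request loses

	type result struct {
		v   string
		err error
	}
	results := make(chan result, 2) // buffered so the loser's send never blocks

	launch := func(replica int) {
		v, err := call(ctx, replica)
		results <- result{v, err}
	}
	go launch(0)

	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	select {
	case r := <-results: // primary answered before the hedge delay
		return r.v, r.err
	case <-timer.C: // primary is slow: hedge to a second replica
		go launch(1)
	case <-ctx.Done():
		return "", ctx.Err()
	}

	select {
	case r := <-results: // first of the two to answer wins
		return r.v, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}
```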
What is coordinated omission?
A flaw in many load testers: they send the next request only after the previous one returns. If the server slows down, the tester slows down with it, sending fewer requests and missing the latency that users at constant rate would actually experience. Use open-loop testers (wrk2, k6, vegeta) that send at constant rate regardless of response time.
Optimise for p50 or p99?
Depends on the workload. For batch/throughput, p50 (or average). For user-facing requests, p99. For 'tail-sensitive' systems (search, ad serving), even p99.9. The question is: how many users have a bad experience? At millions of requests, even p99.99 is many users.
Gotchas
- Reporting average latency in dashboards: hides the tail entirely
- Coordinated omission in load tests: reported p99 understates real p99 by 10-100x
- Fan-out without thinking about amplification: p99 of the user request is much worse than p99 of any backend
- Hedging non-idempotent operations: double-execution under network failures
- Prometheus summary metrics with default config: percentiles drift over the reporting window
Benchmarking Methodology
Section: Performance & Tradeoffs | Difficulty: Intermediate | Frequency: Sometimes
Microbenchmarks lie unless they're done carefully. Warm up the JIT/runtime. Defeat dead-code elimination. Use realistic input. Measure with confidence intervals (multiple runs). Benchmark the right thing (the operation, not the noise around it). End-to-end load tests for system-level decisions, microbenchmarks for code-level decisions.
Key Points
- Warmup is critical: JIT, branch predictor, caches all need to stabilise.
- Dead-code elimination silently deletes your benchmark; consume the result.
- Multiple runs in fresh JVMs (forks) catch JVM-to-JVM variance.
- Report confidence intervals; if they overlap, you can't claim one is faster.
- Microbenchmark vs load test: microbenchmark for 'is op A faster than op B'; load test for 'does the system handle 1000 RPS'.
Interview Follow-up Questions
Why doesn't a simple time.time() loop work?
Three reasons. (1) Warmup: cold code runs differently from warm code. (2) Dead code elimination: if the result is unused, the compiler deletes the work. (3) Variance: GC pauses, scheduler quanta, cache effects produce huge run-to-run variance. JMH/pytest-benchmark/testing.B handle all three. Hand-rolled timing is almost always wrong unless the target is whole applications, not microbenchmarks.
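A Go `testing.B` sketch showing the two habits that matter most: reset the timer after setup and consume the result through a package-level sink so dead-code elimination can't delete the loop. `checksum` is a stand-in for the operation under test.

```go
package bench

import "testing"

// Package-level sink: assigning the result here stops the compiler from
// eliminating the benchmarked work as dead code.
var sink uint64

// checksum is a placeholder for the operation under test.
func checksum(data []byte) uint64 {
	var s uint64
	for _, b := range data {
		s = s*31 + uint64(b)
	}
	return s
}

func BenchmarkChecksum(b *testing.B) {
	data := make([]byte, 4096)
	b.ReportAllocs()
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		sink = checksum(data) // consume the result
	}
}
```

Run it with `go test -bench=. -count=10` so there are enough samples to compare with a confidence interval rather than a single number.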
How many runs are needed?
Enough that confidence intervals are tight. JMH defaults: 5 warmup + 5 measurement iterations × 2 forks = 10 measurements. For tighter CIs, increase forks (more JVM-to-JVM samples). For tighter relative comparisons, increase iterations within each fork.
When to prefer load testing over microbenchmarks?
When the question is system-level: 'can we handle Black Friday', 'what's our p99 at 1000 RPS', 'does the cache help end-to-end'. Microbenchmarks answer 'is op A faster than op B'. Use load tests (k6, wrk, gatling) for the system question, microbenchmarks to drill into specific hot spots.
What's the relationship between benchmarking and profiling?
Benchmarking measures: 'this takes X ns'. Profiling explains: 'X ns is spent here'. Both essential. Benchmark-only gives a number with no path to improvement. Profile-only gives a flame graph but no impact estimate. Workflow: profile to find the hot spot → benchmark the hot spot → make a change → re-benchmark to confirm.
Gotchas
- Hand-rolled timing loops give numbers that are off by 10x or more
- Single run = no idea of variance; could be off by a factor of 2
- Discarding the benchmark result lets DCE delete your loop
- Reporting averages without intervals hides 'no significant difference'
- Benchmarking in dev/IDE = unreliable; always run benchmarks from CLI
Measuring Lock Contention
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
Lock contention is when threads spend significant time waiting for locks instead of running. Measure with: profiler block-time view, JFR/async-profiler lock events, perf-events for futex contention. The fix is usually one of: shrink the critical section, shard the lock, replace with lock-free, or remove shared state entirely.
Key Points
- Symptom: high CPU time spent in lock acquisition / blocked / parked states.
- Java: JFR events 'jdk.JavaMonitorEnter', async-profiler 'lock' mode, jstack thread dumps.
- Linux: perf trace -e 'sched:*' or perf record with off-cpu profiling.
- Fix priority: shrink critical section > shard lock > lock-free structure > redesign to avoid sharing.
- False sharing looks like contention; measure cache-miss rate to distinguish.
Interview Follow-up Questions
How is contention identified as the bottleneck?
Check CPU utilisation under load. If CPU is well below 100% but throughput won't go higher, threads are blocked on something. The 'something' is often a lock. Profile with off-CPU sampling (async-profiler lock mode, perf trace, Go mutex profiler) to find the specific lock.
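In Go, the mutex and block profiles record nothing until they're switched on; a sketch of enabling them and exposing pprof so the contended lock can be found:

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"runtime"
)

func main() {
	// Record every mutex contention event (use a larger divisor in production
	// to sample). Block profiling covers channel and WaitGroup waits too.
	runtime.SetMutexProfileFraction(1)
	runtime.SetBlockProfileRate(1)

	go http.ListenAndServe("localhost:6060", nil)

	// ... application work ...
	// Then inspect with: go tool pprof http://localhost:6060/debug/pprof/mutex
	select {}
}
```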
Shrink critical section vs shard lock: which first?
Shrink first. Often the critical section contains work that doesn't need the lock (compute outside, only update inside). Shrinking is cheap and doesn't change the API. Sharding is more invasive (queries that span partitions have to deal with N partitions). Try shrinking; if still contended, shard.
What's the difference between contention and false sharing?
Contention: threads explicitly waiting on the same lock. False sharing: threads writing to different fields that happen to be on the same cache line, causing cache-line ping-pong even though there's no logical contention. Contention shows up as block time; false sharing shows up as high cache-miss rate with high CPU. Different fixes (lock partitioning vs padding/separation).
Is GIL contention measurable in CPython?
Yes. py-spy --gil shows per-thread GIL hold time. ceval.c instrumentation (CPython internals) can be enabled. For most Python services, the GIL is the bottleneck on multi-threaded CPU work, which is why 'use multiprocessing' or 'use C extensions that release the GIL' are the standard advice.
Gotchas
- CPU at 100%: contention is NOT the bottleneck; the system is CPU-bound, optimise the code
- CPU low but throughput limited: lock or I/O is the bottleneck; profile to find which
- Confusing high CPU with high contention: profile both on-CPU AND off-CPU
- Profiler overhead: high sampling rates can themselves cause perceived contention
- Optimising a non-bottleneck: makes the system more complex without helping
NUMA & Cache Effects
Section: Performance & Tradeoffs | Difficulty: Advanced | Frequency: Sometimes
Multi-socket servers have NUMA: each CPU socket has its own memory; accessing the other socket's memory is slower (1.5-3x). Pinning threads to NUMA nodes and allocating memory locally can dramatically improve performance for memory-bound workloads. False sharing is the related cache-line problem within a socket.
Key Points
- NUMA: each socket has its own memory bank. Cross-socket access is slower than local.
- Linux: numactl --hardware shows NUMA topology. numactl --cpunodebind=0 --membind=0 pins to node 0.
- JVM, Go runtime, Postgres can be NUMA-aware; check docs and tune.
- False sharing: two cores writing different fields on the same cache line cause cache-line bouncing.
- Fix false sharing with padding (@Contended in Java, _Alignas(64) in C) so independent fields land on different cache lines.
Interview Follow-up Questions
How big is the NUMA penalty?
Cross-socket memory access is typically 1.5x to 3x slower than local. For latency-sensitive workloads (databases, message brokers, in-memory caches), this is significant. For CPU-bound workloads with small working sets that fit in cache, NUMA effects are smaller. Measure with `perf stat -e LLC-loads,LLC-load-misses` to see how much cross-socket traffic is happening.
How is false sharing identified?
Symptoms: high CPU, low instructions-per-cycle (IPC), and a spike in cache-coherency (HITM) events; `perf c2c` is built to find exactly this. Profiles show time spent in trivial atomic operations. Fix: pad the contended fields. The improvement is often dramatic (2-10x for the affected operation).
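A sketch of the padding fix in Go: two counters owned by two different goroutines, pushed onto separate 64-byte cache lines so increments stop bouncing the line between cores. Only do this after measuring; as the gotchas note, padding wastes memory.

```go
package counters

import "sync/atomic"

// Without the padding, a and b share a cache line and every increment from the
// two writer goroutines bounces that line between cores (false sharing).
type PaddedCounters struct {
	a atomic.Int64
	_ [56]byte // 8-byte counter + 56 bytes of padding = one full 64-byte line
	b atomic.Int64
	_ [56]byte
}

func (c *PaddedCounters) IncA() { c.a.Add(1) }
func (c *PaddedCounters) IncB() { c.b.Add(1) }
```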
Should atomics always be padded?
No. Padding wastes memory (64 bytes per field instead of 8). Only pad after measuring false sharing. The Java LongAdder uses padding because the per-thread cells are explicitly designed to be hot; a one-off counter doesn't need it.
How does NUMA interact with virtualisation/cloud?
Cloud VMs may or may not expose NUMA. Smaller instance types typically don't (one socket worth of memory). Larger instances often do. Check `numactl --hardware` inside the VM. For NUMA-aware workloads on multi-socket VMs, pinning still helps.
Gotchas
- Cross-socket memory access is 1.5-3x slower; bind threads and memory to the same node
- False sharing on hot atomics can dominate even simple workloads
- JVM and Go runtimes are mostly NUMA-blind by default; manual binding helps
- Large heaps that span sockets thrash; either pin or use multiple smaller processes
- Padding wastes memory; only apply where you've measured false sharing
Parallel Merge Sort & ForkJoin
Section: Performance & Tradeoffs | Difficulty: Medium | Frequency: Sometimes
Divide-and-conquer naturally parallelizes, recursive halves run on different cores. Java's ForkJoinPool is purpose-built for this; Go uses goroutines + WaitGroups; Python uses concurrent.futures. Threshold check stops parallelism for small subarrays, fork overhead exceeds the parallelism win.
Key Points
- Divide-and-conquer = natural parallelism (each half is independent)
- Threshold: stop forking when subarray < N (~1000 elements)
- Java ForkJoinPool: work-stealing scheduler optimized for D&C
- Sequential merge step is unavoidable bottleneck, Amdahl's law
- Speedup is sublinear, not proportional to cores: 8 cores ≠ 8x speedup (merge bottleneck + overhead)
Interview Follow-up Questions
Why a threshold? Why not parallelize all the way down?
Fork overhead (creating a task, scheduling, joining) is a few μs. For a subarray of 10 elements, the fork cost dwarfs the work. Threshold ~= 1000 elements is empirical, fork cost amortized over enough work. Below threshold, sequential is faster.
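A hedged Go sketch of the threshold idea; the `threshold` constant is illustrative, and a production version would reuse merge buffers and bound goroutine fan-out.

```go
package psort

import (
	"sort"
	"sync"
)

const threshold = 1000 // illustrative: below this, fork overhead exceeds the win

// ParallelMergeSort sorts s in place, forking one goroutine per half
// until the subarray drops below the threshold.
func ParallelMergeSort(s []int) {
	if len(s) < threshold {
		sort.Ints(s) // sequential below the threshold
		return
	}
	mid := len(s) / 2
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ParallelMergeSort(s[:mid])
	}()
	ParallelMergeSort(s[mid:]) // reuse the current goroutine for the other half
	wg.Wait()
	merge(s, mid)
}

// merge combines the two sorted halves s[:mid] and s[mid:] via a buffer.
func merge(s []int, mid int) {
	buf := make([]int, 0, len(s))
	i, j := 0, mid
	for i < mid && j < len(s) {
		if s[i] <= s[j] {
			buf = append(buf, s[i])
			i++
		} else {
			buf = append(buf, s[j])
			j++
		}
	}
	buf = append(buf, s[i:mid]...)
	buf = append(buf, s[j:]...)
	copy(s, buf)
}
```

Note the merge itself stays sequential here; that is exactly the Amdahl bottleneck described above.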
What's the theoretical speedup of parallel merge sort?
O(n log n) sequential; O(log² n) on infinite cores. With P cores: O(n log n / P + log n). Practically: 4 cores ~= 3.2x speedup, 8 cores ~= 5.5x, due to merge step bottleneck and synchronization overhead. Amdahl's Law: sequential fraction caps speedup.
Why ForkJoinPool over regular ExecutorService?
FJP uses work-stealing, idle threads steal tasks from busy threads' queues. Optimized for recursive divide-and-conquer where subtasks are short and bursty. Regular ThreadPoolExecutor's central queue serializes scheduling, bottleneck for fine-grained tasks.
Can the merge step itself be parallelized?
Yes, with O(log n) parallelism using rank-based parallel merge (find each element's rank in the other array via binary search, write to position in parallel). Used in libraries like Intel TBB. Doubles complexity for marginal gains; usually not worth it.
Gotchas
- ForkJoinPool.commonPool() is shared, heavy use starves parallel streams
- ProcessPoolExecutor pickles arguments, large arrays have IPC overhead
- Threshold too low → fork overhead dominates
- Threshold too high → underutilization on big arrays
- Parallel merge sort has O(n) extra memory for the merge buffer
Parallel MapReduce: Word Frequency Counter
Section: Performance & Tradeoffs | Difficulty: Medium | Frequency: Sometimes
Map: each worker processes a chunk of input → partial counts. Reduce: merge partial counts into final result. Map step is embarrassingly parallel; reduce step is the synchronization point. Java's parallelStream, Python's multiprocessing.Pool.map, Go's errgroup, all express this.
Key Points
- Map: chunk input, each worker computes partial result independently
- Reduce: merge partial results, must be associative (and ideally commutative)
- Embarrassingly parallel: no inter-worker communication during map
- Reduce strategies: tree-reduce (parallel), sequential merge, ConcurrentHashMap
- Foundation of distributed computing, Hadoop, Spark, BigQuery all extend this
Interview Follow-up Questions
Why per-worker map instead of shared ConcurrentHashMap?
Avoid lock contention. Each worker writes to its OWN map (no synchronization needed). Final reduce merges. Concurrent updates to one shared map serialize at the lock, defeats parallelism. The shard-then-merge pattern is universal in MapReduce.
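A minimal Go sketch of the shard-then-merge pattern for word counting; package and function names are illustrative. Each worker fills its own map (map phase, no locks), then one sequential pass merges the partials (reduce phase).

```go
package wordcount

import (
	"strings"
	"sync"
)

// Count chunks the input, counts words per worker into a private map,
// then merges the partial maps.
func Count(lines []string, workers int) map[string]int {
	chunk := (len(lines) + workers - 1) / workers
	partials := make([]map[string]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(lines) {
			break
		}
		end := start + chunk
		if end > len(lines) {
			end = len(lines)
		}
		wg.Add(1)
		go func(w int, part []string) {
			defer wg.Done()
			counts := make(map[string]int) // worker-private: no synchronization
			for _, line := range part {
				for _, word := range strings.Fields(line) {
					counts[word]++
				}
			}
			partials[w] = counts // each worker writes a distinct index: no race
		}(w, lines[start:end])
	}
	wg.Wait()

	// Reduce: merging counts is associative and commutative, so order is irrelevant.
	total := make(map[string]int)
	for _, p := range partials {
		for word, n := range p {
			total[word] += n
		}
	}
	return total
}
```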
How does this connect to Hadoop / Spark?
Distributed MapReduce. Same pattern at machine scale: each machine processes a chunk of input from HDFS/S3, emits partial results; reduce stage shuffles by key and merges. Hadoop's per-mapper / per-reducer abstraction is exactly this.
What's a tree-reduce?
Pair up partial results and merge in pairs in parallel; repeat until one result. log(N) merge depth instead of linear. Useful when reduce is expensive and many partials exist. Java's parallelStream uses this internally.
Why must reduce be associative?
Parallel execution can merge in any order: ((A+B)+(C+D)) or (A+(B+(C+D))). Associative ⇒ same result regardless of grouping. For commutative reducers (like sum), order also doesn't matter. Common bug: reducers that look fine but aren't strictly associative under floating-point arithmetic.
Gotchas
- Mutable accumulators (like StringBuilder) need synchronization or per-worker copies
- parallelStream uses commonPool, heavy use starves other parallelStream callers
- Reduce step can be the bottleneck if maps are huge, consider tree-reduce
- Float addition is not strictly associative, parallel reduce gives slightly different results than sequential
Parallel Graph Traversal (BFS/DFS)
Section: Performance & Tradeoffs | Difficulty: Hard | Frequency: Sometimes
Parallel BFS: process each level in parallel; serial frontier-to-frontier transitions. Parallel DFS: harder due to depth-recursion; usually use task-pool of subgraphs. Concurrent visited-set required either way; ConcurrentHashMap or Bloom filter.
Key Points
- BFS parallelizes naturally per-level, all nodes at level k processed concurrently
- DFS harder, recursive nature serializes; use bounded task pool of subgraphs
- Visited set must be concurrent, CHM, sync.Map, atomic bitset
- Frontier transitions are barriers, synchronize all workers between levels
- Used in PageRank, social-graph queries, dependency resolution at scale
Interview Follow-up Questions
Why is BFS easier to parallelize than DFS?
BFS has natural levels, all nodes at distance k are processed before moving to k+1. Each level is embarrassingly parallel. DFS has recursive call structure, depth-first ordering must be preserved (somewhat), and recursion serializes naturally. Workarounds: parallel DFS uses task pool of subgraphs.
Why does visited set need to be concurrent?
Many workers may discover the same neighbor simultaneously. Without atomic add, two workers might both think they discovered it first → both enqueue → duplicate work. ConcurrentHashMap.newKeySet().add() returns true exactly once per key, natural deduplication.
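A hedged Go sketch of level-synchronous parallel BFS; names are illustrative, and a real implementation would bound the number of worker goroutines per level. `sync.Map.LoadOrStore` plays the role of `ConcurrentHashMap.newKeySet().add()`: it reports "newly added" exactly once per key, which is the deduplication.

```go
package pbfs

import "sync"

// Levels runs a level-synchronous parallel BFS from start over an
// adjacency list, returning each node's distance (-1 = unreachable).
func Levels(adj [][]int, start int) []int {
	dist := make([]int, len(adj))
	for i := range dist {
		dist[i] = -1
	}
	dist[start] = 0

	var visited sync.Map // concurrent visited set
	visited.Store(start, true)

	frontier := []int{start}
	for level := 1; len(frontier) > 0; level++ {
		var (
			mu   sync.Mutex
			next []int
			wg   sync.WaitGroup
		)
		for _, u := range frontier { // all nodes at level k processed in parallel
			wg.Add(1)
			go func(u int) {
				defer wg.Done()
				var local []int
				for _, v := range adj[u] {
					// loaded == false exactly once per key: this worker owns v's discovery.
					if _, loaded := visited.LoadOrStore(v, true); !loaded {
						local = append(local, v)
					}
				}
				mu.Lock()
				next = append(next, local...)
				mu.Unlock()
			}(u)
		}
		wg.Wait() // level barrier: everything at level k finishes before level k+1
		for _, v := range next {
			dist[v] = level
		}
		frontier = next
	}
	return dist
}
```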
When does parallel BFS NOT win?
Tree-shaped graphs (each node has one parent), frontier is small, parallelism is wasted on coordination. Sparse graphs in general. Win is biggest on dense, wide graphs (many neighbors per node, many nodes per level).
What about Bloom filter for visited set?
Useful at extreme scale (billions of nodes) where exact set is too big. Trade memory for false positives. Used by web-scale crawlers. Risk: Bloom false positive → drop a real new node. Usually pair with exact set as fallback for confirmation.
Gotchas
- If workers don't synchronize between levels, processing crosses levels and BFS becomes wrong
- Concurrent map's `add` semantics matter, use `newKeySet().add()` (returns true if newly added) for dedup
- Level barrier requires ALL workers from level k to finish before level k+1 starts, wg.Wait()
- DFS isn't level-based, naive parallelization breaks ordering
- Memory-bound when frontier is huge, bound concurrency to avoid OOM
Lock-Free Programming, CAS Loops
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Lock-free algorithms use atomic compare-and-swap (CAS) instead of mutexes. The pattern: read current value, compute new value, CAS-install, retry on failure. At least one thread always makes progress (no deadlocks, no waiting on a slow holder), but individual threads can starve under contention. Pays off when contention is high and locks would serialise.
Key Points
- Lock-free = at least one thread is always making progress (system-wide). Individual threads may retry many times.
- The CAS retry loop is the building block: load → compute → CAS → retry on failure.
- No deadlocks possible (no thread blocks on another). No priority inversion. No waiting on a slow holder.
- Costs: ABA problem in pointer-based structures, memory reclamation in non-GC languages, cache-line contention under hot CAS.
- Use when locks are measurably the bottleneck. For most code, mutexes are simpler AND faster.
Interview Follow-up Questions
Lock-free vs wait-free?
Lock-free: at least one thread makes progress at all times (system-wide). Individual threads may retry indefinitely under bad scheduling. Wait-free: every thread completes in bounded steps regardless of others. Wait-free is much harder to implement; lock-free is the practical bar for most production code.
When should I use lock-free instead of a mutex?
Once measurement shows lock contention as the bottleneck AND the critical section is small enough that CAS retry is cheaper than mutex parking. For counters, hot flags, simple atomic state, lock-free is a clear win. For more complex shared state (multi-field updates, long critical sections), locks are simpler and often faster.
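To make the retry loop concrete, here is a minimal Go sketch of the load → compute → CAS → retry pattern, as an atomic "store the maximum" helper (the name is illustrative):

```go
package casloop

import "sync/atomic"

// SetMax atomically raises cur to v if v is larger: the canonical CAS loop.
func SetMax(cur *atomic.Int64, v int64) {
	for {
		old := cur.Load() // 1. load the current value
		if v <= old {
			return // nothing to do
		}
		if cur.CompareAndSwap(old, v) { // 2. try to install the new value
			return // 3. success
		}
		// 4. CAS failed: another goroutine changed cur; loop and retry with a fresh value
	}
}
```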
What is the ABA problem?
In pointer-based lock-free structures, a CAS can succeed even though the value changed between the load and the CAS, because it was changed back to the same value. Without the original context (often a freed-and-reused node), the CAS installs garbage. Solutions: tagged pointers, hazard pointers, GC. See the ABA Problem & Tagged Pointers lesson.
Why doesn't lock-free help with multi-field updates?
CAS operates on one word at a time. If the invariant requires updating fields A and B together (head and tail of a linked list, balance and audit log), a single CAS can't do it atomically. Workarounds: pack both fields into one word (if they fit), use double-CAS (rare hardware support), or wrap them in an immutable struct and CAS the pointer. Or just use a lock.
Gotchas
- Treating 'lock-free' as 'always faster', measure first; mutexes often win for simple cases
- ABA in pointer structures, without GC, requires extra machinery
- Hot CAS contention causes cache-line ping-pong; lock-free counters can be slower than sharded
- Multi-field invariants need locks or pointer-swap; CAS only protects one word
- Lock-free does NOT mean wait-free; individual threads can starve
Memory Ordering: Acquire, Release, Relaxed, SeqCst
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
C++/Rust let the programmer specify how strict an atomic operation's memory ordering must be. SeqCst (sequentially consistent) is strongest and slowest. Acquire-release is the right default for publishing data. Relaxed gives no ordering, only atomicity (counters). Java and Go atomics are SeqCst by default; Go offers no weaker modes, and Java tucks them away behind VarHandles.
Key Points
- SeqCst: every thread sees the same global order of all SeqCst operations. Strongest, slowest.
- Acquire (on load) + Release (on store): publish-subscribe pattern. Writes before release are visible after acquire.
- Relaxed: atomic, but no ordering. Use for counters where order doesn't matter.
- Java atomics and Go's sync/atomic give SeqCst always. C++ and Rust allow per-operation choice.
- Wrong relaxation = wrong code. Acquire-release is the default if unsure; relax only after measuring AND reasoning carefully.
Interview Follow-up Questions
Why doesn't Java let me choose memory ordering?
Design choice. The Java team prioritised 'obviously correct' over 'maximally fast'. The everyday atomic APIs (AtomicInteger and friends, volatile) are sequentially consistent; weaker modes do exist, but only behind VarHandles (getAcquire, setRelease, getOpaque), not as the default API. The trade-off is some performance loss compared to hand-tuned C++; the upside is that ordinary code avoids an entire class of subtle bugs. Go made the same choice for sync/atomic, and the revised memory model published with Go 1.19 documents it as sequentially consistent.
When should I prefer acquire-release over SeqCst?
When the operation is a one-way publication or subscription. Producer releases, consumer acquires. Most lock-free patterns are this shape. SeqCst is needed when two independent atomics must agree on a global ordering (rare; mostly Dekker/Peterson-style algorithms).
Is relaxed ever wrong for counters?
Almost never wrong for the count itself. Wrong if other code reads the counter and decides something based on it relative to other state. Example: 'if hits > threshold, do X', under relaxed, hits can appear stale relative to other state. If the decision matters, use acquire on the load.
What is the cost difference between SeqCst and relaxed?
Hardware-dependent. On x86, SeqCst loads are basically free (loads are already strong); SeqCst stores require a fence (mfence or lock-prefixed). On ARM/POWER, SeqCst on either side requires fences. Relaxed is just the atomic instruction. The difference can be 2-10x for the operation, but the operation itself is nanoseconds; only matters in very hot paths.
Gotchas
- Mixing relaxed with code that depends on ordering = subtle bugs
- Acquire on a store / release on a load is invalid, and the compiler won't reliably save you: C++ treats it as undefined, Rust panics at runtime
- Java atomics are SeqCst; they can't be relaxed
- Memory ordering is ABOUT the compiler AND the CPU; both can reorder
- Reasoning by 'it works on my machine' fails on weakly-ordered ARM
The ABA Problem & Tagged Pointers
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Asked Often
ABA: thread A reads pointer P (= A); thread B pops A and pushes a new node at the same address; thread A's CAS succeeds even though the structure changed. Solutions: tagged pointers (pack version counter), hazard pointers (announce in-use), epoch-based reclamation, GC.
Key Points
- Pure CAS is unsafe when the same address can be reused (typical in C/C++ with manual memory)
- Java/Go avoid ABA via GC, freed memory not reused while references exist
- Tagged pointers: pack a version counter into unused pointer bits (low 3 bits typically)
- Hazard pointers: each thread declares pointers it's actively using; readers don't free those
- Epoch-based reclamation: defer free until all threads have moved past the freeing epoch
Interview Follow-up Questions
What's a 'hazard pointer'?
Each thread maintains a list of pointers it's currently using. Before freeing memory, the freer checks all hazard pointer lists; if any thread is using this pointer, defer the free. More memory overhead than CAS-only, but ABA-safe without tagged pointers.
What's epoch-based reclamation?
Track a global 'epoch' counter. Threads enter critical sections by reading the epoch; freed memory is tagged with its epoch. Don't actually free until ALL threads have advanced past the freeing epoch. Used by RCU in Linux kernel.
Tagged pointer, how does a counter fit in a pointer?
Most platforms use 64-bit pointers but only 48 bits of address. The high 16 bits are typically zero (or sign-extended). Pack a 16-bit version counter there. Or use the low bits, pointers to objects with alignment N have low log2(N) bits zero.
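To make the bit layout concrete, a hedged Go sketch of packing a 16-bit version tag above a 48-bit value. It uses indices into a pre-allocated node pool instead of raw pointers, which sidesteps platform pointer-layout questions while keeping the CAS-with-tag idea; all names are illustrative.

```go
package tagged

import "sync/atomic"

const addrBits = 48 // low 48 bits: node index; high 16 bits: version tag

func pack(idx uint64, tag uint16) uint64 { return uint64(tag)<<addrBits | idx }

func unpack(w uint64) (idx uint64, tag uint16) {
	return w & (1<<addrBits - 1), uint16(w >> addrBits)
}

// bumpHead swaps in a new index and increments the tag on every change,
// so a CAS against a stale (index, tag) snapshot fails even if the same
// index is reused: no ABA.
func bumpHead(head *atomic.Uint64, newIdx uint64) {
	for {
		old := head.Load()
		_, tag := unpack(old)
		if head.CompareAndSwap(old, pack(newIdx, tag+1)) {
			return
		}
	}
}
```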
Does Java really not have ABA?
ABA on object identity is impossible: the GC won't reclaim an object while any thread still references it, so an address can't silently come back as a different object. ABA on object STATE is still possible (same object reused, fields changed). AtomicStampedReference handles both, the stamp bumps on every change regardless of the reference.
Gotchas
- Reusing nodes from a pool reintroduces ABA even in GC languages
- AtomicStampedReference adds overhead, use only when ABA is a real risk
- Tagged pointers that borrow "unused" high bits break when the platform actually uses them (57-bit virtual addresses with 5-level paging, ARM pointer authentication/tagging)
- Hazard pointers add 1 cache line per thread per concurrent operation, memory cost
Treiber Stack: The Canonical Lock-Free Stack
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Lock-free stack using CAS on the head pointer. Push: build new node pointing at current head, CAS to install. Pop: read head, CAS-replace with head.next. The textbook lock-free data structure; suffers from the ABA problem and from contention on the head.
Key Points
- Push: allocate node, set node.next = head, CAS(&head, oldHead, node).
- Pop: load head, then CAS(&head, oldHead, oldHead.next). Retry on CAS failure.
- Vulnerable to ABA: between read and CAS, head can be popped, freed, and re-pushed at the same address.
- Memory reclamation is the hard part: when is it safe to free a popped node? Other threads might still be reading it.
- Single-cache-line contention on head: not great for many concurrent threads. Lock-free MPSC queues outperform for high contention.
Interview Follow-up Questions
Why is ABA a problem here?
Pop reads head=X, computes new_head = X.next, then CAS(head, X, X.next). If between the read and the CAS, another thread pops X (X is freed), pushes a new node Y, and pushes another node X' that happens to be allocated at address X, then head is X again and the CAS succeeds. But X' has a different X'.next than the X we read. We've installed garbage as the new head.
Why isn't ABA a problem in Java/Go?
GC. A node cannot be freed while any thread still holds a reference to it. The popping thread's local variable oldHead keeps it alive until that variable goes out of scope, so the address cannot be reused for a different node during the CAS and pointer-level ABA is impossible. Logical ABA is still possible if application code recycles node objects from a pool and re-pushes them with different contents.
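A minimal Treiber stack in Go (Go 1.19+ for atomic.Pointer, generics for the element type); an illustrative sketch, not a production container. The GC argument above is exactly why this version needs no tags or hazard pointers.

```go
package treiber

import "sync/atomic"

type node[T any] struct {
	val  T
	next *node[T]
}

// Stack is a Treiber stack: a lock-free LIFO with a single CAS'd head.
type Stack[T any] struct {
	head atomic.Pointer[node[T]]
}

func (s *Stack[T]) Push(v T) {
	n := &node[T]{val: v}
	for {
		old := s.head.Load()
		n.next = old
		if s.head.CompareAndSwap(old, n) {
			return
		}
	}
}

func (s *Stack[T]) Pop() (T, bool) {
	for {
		old := s.head.Load()
		if old == nil {
			var zero T
			return zero, false
		}
		// old cannot be reclaimed here: this goroutine still references it.
		if s.head.CompareAndSwap(old, old.next) {
			return old.val, true
		}
	}
}
```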
Is Treiber stack the best lock-free stack?
It's the simplest and the textbook one. For low contention it's fine. For high contention, all threads CAS the same head pointer, causing cache-line ping-pong; lock-free MPSC queues or elimination-backoff stacks scale better. For most production needs, a sync.Pool / object pool is simpler than a hand-rolled lock-free stack.
Gotchas
- ABA: in C/C++ without GC or hazard pointers, popped nodes can be reused, breaking CAS
- Memory reclamation: when is it safe to free a popped node?
- Cache-line contention on head: scales poorly with many writers
- Push allocates a node; allocator contention can dominate the lock-free win
- Lock-free does NOT mean wait-free; threads can starve indefinitely under bad scheduling
Hazard Pointers
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Memory reclamation scheme for lock-free structures in non-GC languages. Each thread publishes the pointers it's about to dereference (hazard pointers). Reclamation defers freeing a node until no thread's hazard pointer references it. Solves the 'when can I free this?' problem without a GC.
Key Points
- Each thread maintains a small fixed array of hazard pointers (typically 1-8).
- Before dereferencing a pointer obtained via shared atomic, publish it as a hazard pointer.
- On reclamation: collect all hazard pointers across threads, only free nodes not in the set.
- Bounded memory: at any moment, at most O(N × H) unlinked nodes can be stuck awaiting reclamation, where N = threads and H = hazard slots per thread.
- Alternative to epoch-based reclamation; lower latency for individual frees, more bookkeeping per access.
Interview Follow-up Questions
What problem do hazard pointers solve?
Memory reclamation in lock-free structures without GC. After a node is unlinked from a lock-free data structure, other threads might still hold pointers to it (they read it before the unlink). Freeing it immediately is unsafe (other threads will dereference garbage). Hazard pointers signal when it's finally safe.
Hazard pointers vs epoch-based reclamation?
Hazard pointers: per-thread per-access bookkeeping (slow per access, fast reclamation). Epoch-based (RCU): per-thread epoch counter (fast per access, slower reclamation). Hazard pointers free nodes sooner; epoch-based has lower per-access overhead. Both are correct; choice is a performance tuning.
How many hazard pointers do I need per thread?
Equal to the number of pointers a thread might be dereferencing simultaneously. For a Treiber stack pop: 1 (the head). For a Michael-Scott queue: 2 (head and head.next). For a more complex traversal: bounded by the depth of the access pattern. Typically 1-8 slots per thread are enough.
Why is C++ getting hazard pointers in the standard library?
Lock-free programming in C++ is rare but important (game engines, HFT, OS kernels). Memory reclamation has been the missing standard primitive. C++26 is expected to ship std::hazard_pointer based on years of work in folly, libcds, and the Concurrency TS. With it, application code can use lock-free structures without rolling its own reclamation.
Gotchas
- Publishing hazard pointer AFTER dereferencing = use-after-free
- Forgetting to clear hazard pointer = node never freed (memory leak)
- Too few hazard slots per thread = need to retire-and-restart on contention
- Hazard scan is O(threads × slots); doesn't scale to thousands of threads
- Mixing hazard pointers with epoch-based reclamation in the same structure: confusing
RCU: Read-Copy-Update
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Read-Copy-Update: writers create a new copy and atomically swap the pointer; readers see either the old or the new, never a partial state. Reads are nearly free (often just a pointer load). Writers are slower but rare. Used heavily in the Linux kernel for read-mostly data structures.
Key Points
- Reads are extremely cheap: dereference a pointer. No locks, no atomics on the read path.
- Writes: copy the data, modify, atomically swap the pointer to the new copy.
- Old copy must be reclaimed only after all readers that might be using it have moved on.
- Grace period: readers register a 'reader region'; reclamation waits for all reader regions to end.
- Best for read-mostly data: routing tables, configuration, dispatch tables. Not for hot writes.
Interview Follow-up Questions
What is the grace period?
The window between unlinking a node and freeing it. During the grace period, readers that started before the unlink might still be using the old node. The grace period ends when every reader that could have started before the unlink has finished. After that, free is safe. In the kernel, synchronize_rcu blocks until the grace period ends. In GC languages, the GC is essentially watching for the grace period continuously.
When should I use RCU vs sync.RWMutex (or RWMutex equivalent)?
RCU when reads vastly dominate writes (1000x or more) and the data is small enough to copy. Reads are essentially free (a pointer load). RWMutex has reader-counter overhead even on the read path. For small read-mostly data (config, dispatch tables, routing), RCU wins. For big data or write-heavy, RWMutex (or just Mutex) is fine.
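A minimal Go sketch of the RCU shape for read-mostly config: readers do one atomic pointer load; writers copy, mutate the copy, and CAS the pointer in. Type and function names are illustrative; the GC stands in for the grace period.

```go
package rcuconfig

import "sync/atomic"

// Config is treated as immutable once published.
type Config struct {
	RateLimit int
	Backends  []string
}

var current atomic.Pointer[Config] // readers: one pointer load, no locks

func init() { current.Store(&Config{}) }

// Get is the read path: effectively free, safe from any goroutine.
func Get() *Config { return current.Load() }

// Update is the write path: copy, modify the copy, atomically swap.
// Old snapshots stay valid for readers that already loaded them; the GC
// reclaims an old copy once no reader still holds a reference to it.
func Update(mutate func(c *Config)) {
	for {
		old := current.Load()
		copied := *old                                            // shallow copy of the struct
		copied.Backends = append([]string(nil), old.Backends...) // deep-copy the slice
		mutate(&copied)
		if current.CompareAndSwap(old, &copied) {
			return
		}
		// another writer won the race; retry against the new snapshot
	}
}
```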
Can I update a single field without copying?
Not safely. The whole point is that readers see a consistent snapshot. Mutating the existing copy lets a reader see the half-updated state. Always copy + atomic swap. For very large data structures where copying is expensive, look at sharded RCU (split into independent regions, RCU each separately).
Why is RCU so popular in the Linux kernel?
The kernel has many read-mostly data structures (process lists, routing tables, file descriptor tables, network namespaces) accessed by every system call. Lock-based access would serialise; RCU makes those reads free. The kernel team has invested years in RCU infrastructure (synchronize_rcu, call_rcu, srcu, sleepable RCU); the result is a major contributor to multi-core kernel scalability.
Gotchas
- Mutating in place defeats RCU; copy + swap is required
- Writers must wait for the grace period before freeing the old copy (in non-GC code)
- RCU is read-mostly; for write-heavy data, the copy cost dominates
- Multi-step updates (modify two related structures consistently) need extra coordination
- Long reader regions delay reclamation; in the kernel, this can pin memory
Wait-Free vs Lock-Free vs Obstruction-Free
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
Three progress guarantees. Wait-free: every thread completes in bounded steps regardless of others. Lock-free: at least one thread makes progress at any time (system-wide progress). Obstruction-free: a thread makes progress if it runs alone for long enough. Strict ordering: wait-free > lock-free > obstruction-free > blocking.
Key Points
- Wait-free: hardest. Bounded per-thread steps. No starvation possible.
- Lock-free: at least one thread makes progress always. Individual threads may starve.
- Obstruction-free: a thread makes progress if no other thread interferes. Helped by exponential backoff.
- Blocking (lock-based): a thread holding a lock can pause everyone else indefinitely.
- Most production 'lock-free' code is actually lock-free, not wait-free. Wait-free is a research goal; lock-free is the practical bar.
Interview Follow-up Questions
Why is wait-free so much harder than lock-free?
Wait-free requires bounded per-thread steps under any scheduling. That means a failed CAS can't just retry; a bounded way to make progress is required. The standard trick is 'helping': a thread completes the operation that interfered with it, so on the next iteration the conflicting state is gone. The bookkeeping is significant: operation descriptors, per-thread state arrays, generation counters. Few production systems pay the cost.
Is lock-free always faster than locks?
No. Under low contention, a well-implemented mutex is often faster than CAS-retry loops (less branching, simpler instructions). Lock-free wins under high contention or when bounded progress is required (no priority inversion, no waiting on a slow lock holder). Always benchmark; don't assume lock-free = faster.
What is helping?
A wait-free protocol where threads that find another thread's incomplete operation finish it on the original thread's behalf. The original thread, when it next checks, sees its op complete. Helping prevents starvation: a slow thread can have its work completed by faster threads. Significant complexity; used in academic wait-free algorithms, rare in production.
When would I actually want wait-free?
Hard real-time systems where no thread can be allowed to wait indefinitely. Avionics, automotive, medical devices. Some signal-handler contexts where blocking is forbidden. Most application code does not need wait-free; lock-free is enough.
Gotchas
- 'Lock-free' marketing often means 'doesn't block on a mutex' rather than the formal definition
- Lock-free does NOT prevent individual thread starvation
- Wait-free algorithms have bad constant factors; rarely worth it in practice
- Obstruction-free without backoff = livelock under contention
- Blocking algorithms inside a lock-free data structure (e.g., malloc) defeat the guarantee
Spinlocks: When Busy-Waiting Wins
Section: Lock-Free Programming | Difficulty: Advanced | Frequency: Sometimes
A spinlock is a mutex where the waiter spins in a tight CAS loop instead of going to sleep. Faster than mutex when the critical section is shorter than a context switch (~1µs). Awful when the critical section is long. Used inside OS kernels, sometimes in HFT systems; almost never appropriate in application code.
Key Points
- Spin in a CAS loop instead of parking the thread. No context switch cost.
- Wins when critical section < ~1µs (context switch cost) and there are spare cores.
- Loses badly when critical section is long: spinning threads burn 100% CPU doing nothing.
- Adaptive mutexes (Linux futex, Java synchronized): spin briefly, then park. Usually the right default.
- Spinlocks should NOT be used in user code without guaranteeing very short critical sections AND a spare core.
Interview Follow-up Questions
When do spinlocks actually win?
Three conditions must all hold: (1) critical section is shorter than a context switch (~1µs on Linux); (2) spare cores exist so spinning doesn't starve other work; (3) the lock holder won't be preempted (kernel context, RT thread, dedicated CPU). Most application code has none of these. Kernel code, lock-free data structure internals, and HFT systems sometimes do.
What's the difference between a spinlock and a busy-wait?
Mostly terminology. A spinlock is a busy-wait used to implement mutual exclusion. Busy-wait is the general technique of spinning until some condition is true. Both burn CPU; both win or lose under the same conditions.
Why does the PAUSE instruction matter?
Two things. It hints to the CPU that this is a spin loop, so the pipeline can deprioritise it (saves power, doesn't hurt a co-running hyperthread). And it adds a small delay (10-100 cycles), reducing the rate of CAS attempts and thus cache-line traffic. On x86, adding PAUSE to a spin loop costs essentially nothing and only helps; leaving it out is a small but real bug.
When does TTAS (test-test-and-set) help?
Under contention. With plain TAS, every spinning thread's CAS bounces the cache line; the lock holder can barely write back to release. With TTAS, spinners just read (which can keep the line in shared state across multiple cores); the actual CAS only happens when the flag flips. Drastic reduction in cache traffic, much faster under contention.
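A hedged Go sketch of a TTAS spinlock, for illustration only (application Go code should use sync.Mutex). Go exposes no PAUSE intrinsic, so runtime.Gosched stands in for the backoff; the test-then-CAS structure is the point.

```go
package spin

import (
	"runtime"
	"sync/atomic"
)

// TTASLock spins on a plain read until the flag looks free, and only
// then attempts the CAS. Spinning readers keep the cache line in shared
// state instead of bouncing it between cores.
type TTASLock struct {
	held atomic.Bool
}

func (l *TTASLock) Lock() {
	for {
		// Test: cheap read; no cache-line invalidation while spinning.
		for l.held.Load() {
			runtime.Gosched() // stand-in for PAUSE: yield instead of burning CPU
		}
		// Test-and-set: only now try to take the lock.
		if l.held.CompareAndSwap(false, true) {
			return
		}
	}
}

func (l *TTASLock) Unlock() {
	l.held.Store(false)
}
```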
Gotchas
- Spinlock with a long critical section = wasted CPU and bad latency
- Spinning waiter preventing the lock holder from being scheduled = livelock-like behaviour
- Plain TAS spinlock causes cache-line ping-pong; use TTAS
- Forgetting PAUSE on x86 wastes power and hurts co-running hyperthreads
- Userspace spinlocks where the OS can preempt the holder are almost always wrong
Lock-Free MPSC Queue
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Sometimes
A multi-producer, single-consumer queue using CAS on linked-list pointers. Producers append at tail; consumer pops from head. No locks. The Michael-Scott (1996) design is the textbook impl. Used in Go's runtime, Linux kernel, low-latency trading.
Key Points
- Linked list with atomic head and tail pointers
- enqueue: CAS to append; on failure, retry with new tail
- dequeue: CAS to advance head; on failure, retry
- ABA-prone, needs hazard pointers, GC (Java), or tagged pointers
- Note: the Michael-Scott algorithm shown below is actually MPMC; a true MPSC variant simplifies the consumer side (no CAS on head)
Interview Follow-up Questions
When is lock-free actually faster?
Under high contention with short critical sections. Lock-free avoids lock-acquire latency (~50ns) and immune to priority inversion. But: harder to write, ABA-prone in C/C++, often slower under low contention. Benchmark before assuming.
How does Java sidestep ABA?
GC. A popped node cannot be reclaimed, and its address cannot be reused, while any thread's CAS still references it; the GC keeps it alive until no references remain. In C/C++, hazard pointers, epoch-based reclamation, or tagged pointers are required.
MPSC vs MPMC, what's harder?
MPMC harder. MPSC has only one consumer, so head pointer contention is gone. MPMC needs CAS on both head and tail, plus more careful ABA handling. The Michael-Scott algorithm IS MPMC; the MPSC variant simplifies by trusting one consumer.
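Because the single consumer lets the pop path skip CAS entirely, here is a hedged Go sketch in the style of Dmitry Vyukov's intrusive MPSC queue rather than Michael-Scott (names illustrative; it relies on Go's GC for reclamation).

```go
package mpsc

import "sync/atomic"

type node struct {
	next atomic.Pointer[node]
	val  any
}

// Queue: producers atomically swap the head (the "in" end); the single
// consumer walks next pointers from tail without any CAS.
type Queue struct {
	head atomic.Pointer[node] // producers swap here
	tail *node                // owned by the single consumer
}

func New() *Queue {
	stub := &node{}
	q := &Queue{tail: stub}
	q.head.Store(stub)
	return q
}

// Push is safe from any number of goroutines.
func (q *Queue) Push(v any) {
	n := &node{val: v}
	prev := q.head.Swap(n) // claim our place with one atomic swap (no retry loop)
	prev.next.Store(n)     // publish: link the previous node to us
}

// Pop must be called from exactly one goroutine.
func (q *Queue) Pop() (any, bool) {
	next := q.tail.next.Load()
	if next == nil {
		// Empty, or a producer is between Swap and Store: treat as empty for now.
		return nil, false
	}
	q.tail = next
	v := next.val
	next.val = nil // let the GC reclaim the payload promptly
	return v, true
}
```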
Should I use this in production Java?
Almost never. `LinkedBlockingQueue`, `ConcurrentLinkedQueue`, or `Disruptor` (LMAX) are proven. Roll a custom one only after measurement shows them as the bottleneck, and even then, prefer Disruptor for HFT-grade work.
Gotchas
- ABA in C/C++: freed node reused at same address, CAS thinks unchanged
- Memory ordering: need acquire/release semantics or full fences
- Tail can lag, dequeue must help advance tail before returning empty
- Cache line contention: head and tail on same cache line → false sharing; pad to 64 bytes
LMAX Disruptor
Section: Lock-Free Programming | Difficulty: Expert | Frequency: Sometimes
A pre-allocated ring buffer plus per-consumer sequence cursors. Producers claim slots with CAS; consumers wait until the producer cursor advances past their position. Slots are reused (no per-message allocation), data lives in cache lines under explicit control (no false sharing), and the wait strategy is pluggable (busy spin, yield, block). Built by LMAX for trading, hits 25M+ messages per second per pipeline.
Key Points
- Ring buffer of fixed size (must be a power of two so index = sequence & (size - 1) is one AND instruction).
- Slots are pre-allocated objects that get reused; producers fill in fields, consumers read them. Zero per-message allocation.
- Producer cursor: an AtomicLong (cache-line padded) holding the highest published sequence. Consumers spin until their target sequence <= producer cursor.
- Consumer cursor: each consumer has its own cursor. Producers wait if the slowest consumer is too far behind (overwrites would lose data).
- Wait strategies are pluggable: BusySpinWaitStrategy (lowest latency, burns CPU), YieldingWaitStrategy (spin then yield), BlockingWaitStrategy (park on a condition).
- False sharing is the silent killer; every cursor is padded to occupy its own cache line.
- Single-producer mode skips the producer-side CAS and is much faster than multi-producer mode.
Interview Follow-up Questions
Why is a ring buffer so much faster than a BlockingQueue?
Three reasons. (1) Pre-allocation: slots are reused, no GC pressure. BlockingQueue allocates a node per put. (2) No locks: producers and consumers coordinate through atomic cursor advances; BlockingQueue takes a lock on every put and take. (3) Cache friendliness: the ring is contiguous memory; the next slot is nearly always in L1. A linked-list-backed queue scatters nodes across the heap. The trio together gives 10-100x throughput on hot paths.
Why power of two for the ring size?
Index = sequence & (size - 1) instead of sequence % size. AND is a single cycle on every CPU; modulo is several. At 25M messages per second this matters. The cost: ring sizes are restricted to 1024, 2048, 4096, etc. Easy to live with.
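Not the Disruptor itself, but a minimal single-producer/single-consumer ring in Go that shows two of its tricks: power-of-two masking for the index and padding to keep the producer and consumer cursors on separate cache lines. Names and padding sizes are illustrative.

```go
package ring

import "sync/atomic"

// SPSC is a single-producer single-consumer ring buffer.
type SPSC struct {
	buf  []any
	mask uint64
	_    [48]byte      // keep the cursors off the cache line holding buf/mask
	head atomic.Uint64 // next sequence the consumer will read
	_    [56]byte
	tail atomic.Uint64 // next sequence the producer will write
	_    [56]byte
}

// New panics unless size is a power of two.
func New(size uint64) *SPSC {
	if size == 0 || size&(size-1) != 0 {
		panic("size must be a power of two")
	}
	return &SPSC{buf: make([]any, size), mask: size - 1}
}

func (r *SPSC) Push(v any) bool {
	t := r.tail.Load()
	if t-r.head.Load() == uint64(len(r.buf)) {
		return false // full: backpressure on the producer
	}
	r.buf[t&r.mask] = v // one AND instruction instead of a modulo
	r.tail.Store(t + 1) // publish the slot
	return true
}

func (r *SPSC) Pop() (any, bool) {
	h := r.head.Load()
	if h == r.tail.Load() {
		return nil, false // empty
	}
	v := r.buf[h&r.mask]
	r.head.Store(h + 1) // release the slot back to the producer
	return v, true
}
```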
What happens when the slowest consumer falls too far behind?
Producers stall. A producer about to claim sequence N must check that the slowest consumer has acknowledged sequence (N - SIZE), or it would overwrite a slot still being read. Real systems either size the ring large enough that this never happens, or accept backpressure on the producer. There's no equivalent of 'drop oldest' built in; it has to be implemented as a separate consumer that drains.
Single-producer vs multi-producer, how big is the difference?
Big. Single-producer skips the CAS on claim and the availability tracking on publish. Numbers from the LMAX paper: single-producer hits 25M ops/sec on a single thread; multi-producer with 4 threads hits about 100M total but with much higher variance. If the workload can funnel through one producer (one network socket, one input thread), do it.
When is the Disruptor the wrong choice?
Whenever pre-allocation doesn't fit the message model (variable-size events that can't be pooled, references to large heap objects), whenever throughput isn't the bottleneck (most CRUD apps), or whenever richer queue semantics are required (priority, deadlines, conditional retrieval). For request/response with low latency, an unbounded BlockingQueue or a Java 21 virtual-thread-per-request model is simpler and fast enough.
Gotchas
- Forgetting padding. The single most common mistake: a 'simple ring buffer' loses 80% of its throughput to false sharing between cursors.
- Using Object[] of allocated objects, not pre-allocated slot fields. Defeats the no-allocation property.
- Multi-producer mode without availability tracking: consumers will read uninitialized slots.
- Wait strategies that burn CPU in production. BusySpinWaitStrategy is for dedicated cores; on shared boxes it starves other work.
- Treating the ring as a dynamically resizable queue. The size is fixed at construction; pick it based on max in-flight, not average.
Sharded HashMap
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Asked Often
Naive synchronized HashMap serializes every operation. The sharded version splits into N independent maps, each with its own lock: hash key → shard index → operate on that shard's map. Java's pre-8 ConcurrentHashMap did exactly this with 16 lock-striped segments; Java 8+ locks per bin, which is even finer. Reads proceed in parallel; writes serialize only within a shard.
Key Points
- N independent shards, each with own lock + map
- Hash key → shard index = key.hashCode() % N (better: use bit masking with power-of-2 N)
- Concurrent reads/writes on different shards proceed in parallel
- Resize: per-shard, or global (rare); ConcurrentHashMap uses tree-based bins instead
- Java 8+ ConcurrentHashMap uses CAS + per-bin synchronized, even finer than striping
Interview Follow-up Questions
How many shards is right?
Java's ConcurrentHashMap defaults to 16. Rough rule: at least N where N is the expected concurrent thread count. More shards = more parallelism but more memory and worse cache locality. Power-of-2 enables bit masking instead of modulo (faster).
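A hedged Go sketch of the same striping idea (Go has no ConcurrentHashMap, so a sharded map with one RWMutex per shard is the usual equivalent). The shard count and names are illustrative; maphash's random seed also addresses the collision-attack gotcha below.

```go
package shardedmap

import (
	"hash/maphash"
	"sync"
)

const numShards = 16 // power of two so the shard index is a mask, not a modulo

type shard struct {
	mu sync.RWMutex
	m  map[string]int
}

// Map is a sharded map[string]int: operations on keys that hash to
// different shards never contend on the same lock.
type Map struct {
	seed   maphash.Seed
	shards [numShards]shard
}

func New() *Map {
	sm := &Map{seed: maphash.MakeSeed()} // random seed resists collision attacks
	for i := range sm.shards {
		sm.shards[i].m = make(map[string]int)
	}
	return sm
}

func (sm *Map) shardFor(key string) *shard {
	h := maphash.String(sm.seed, key)
	return &sm.shards[h&(numShards-1)]
}

func (sm *Map) Get(key string) (int, bool) {
	s := sm.shardFor(key)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

func (sm *Map) Set(key string, v int) {
	s := sm.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = v
}
```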
Why is ConcurrentHashMap better than basic sharding?
Java 8+ uses CAS + per-bin synchronized. Each bin (linked list of entries) has its own lock; resize uses lock-free coordination. Effectively per-bin striping, way finer than 16 shards. Plus: tree-based bins for collision handling. State of the art for in-memory hash maps.
When does sharding NOT help?
Single hot key gets hammered, all writes go to same shard, that shard's lock is the bottleneck. Solution: don't share state; replicate per-thread/region. Or use an atomic counter or CAS-based primitive for that one key.
Gotchas
- Bad hash function → uneven shard distribution → some shards always hot
- Iterating across shards is non-atomic, snapshot is inconsistent
- size() across shards requires summing each, non-atomic without freeze
- Resize is per-shard, not global, capacity can drift unevenly
- Don't shard user-provided keys with a weak, predictable hash mod shardCount, hash-collision attacks can funnel every key into one shard
Concurrent Bloom Filter (CAS-Based)
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Sometimes
Bloom filter: probabilistic set membership with no false negatives, configurable false positives. Concurrent version: bits stored in atomic longs; insert sets bits via OR atomically; query reads atomically. Naturally lock-free because OR is monotonic.
Key Points
- Bloom filter: array of bits + N hash functions; insert sets N bits, query checks all N
- False positive: all N bits happen to be set by other entries (configurable rate)
- False negative: never (if all N bits are 0, definitely not in set)
- Concurrent: atomic OR, naturally lock-free, monotonic (bits only go 0→1)
- Cannot delete (would require resetting bits, but other entries may use them)
Interview Follow-up Questions
Why is Bloom filter naturally lock-free?
Inserts only flip bits 0→1, monotonic. CAS retry just merges OR operations; intermediate states are valid. No reordering hazards. Compare to hash table inserts which involve linked-list manipulation, much harder to make lock-free.
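A minimal concurrent Bloom filter sketch in Go: bits live in atomic words, Add ORs bits in via a CAS loop, Contains is plain atomic loads. The names, the fixed three hash functions, and the power-of-two bit count are illustrative choices.

```go
package cbloom

import (
	"hash/maphash"
	"sync/atomic"
)

// Filter is a concurrent Bloom filter. Bits only go 0→1, so any
// interleaving of concurrent inserts leaves the filter in a valid state.
type Filter struct {
	words []atomic.Uint64
	mask  uint64 // bit count - 1
	seeds [3]maphash.Seed
}

// New builds a filter with the given bit count; bits must be a power of
// two (and at least 64) so positions can be masked instead of reduced modulo.
func New(bits uint64) *Filter {
	f := &Filter{words: make([]atomic.Uint64, bits/64), mask: bits - 1}
	for i := range f.seeds {
		f.seeds[i] = maphash.MakeSeed()
	}
	return f
}

func (f *Filter) setBit(pos uint64) {
	w, bit := &f.words[pos/64], uint64(1)<<(pos%64)
	for {
		old := w.Load()
		if old&bit != 0 || w.CompareAndSwap(old, old|bit) {
			return // bit was already set, or we just set it
		}
	}
}

func (f *Filter) Add(key string) {
	for _, seed := range f.seeds {
		f.setBit(maphash.String(seed, key) & f.mask)
	}
}

// Contains may return false positives, never false negatives.
func (f *Filter) Contains(key string) bool {
	for _, seed := range f.seeds {
		pos := maphash.String(seed, key) & f.mask
		if f.words[pos/64].Load()&(uint64(1)<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}
```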
Counting Bloom filter for delete?
Replace each bit with a small counter (4 bits, so per-cell add/subtract). Add increments; delete decrements. Concurrency: atomic add/sub on counter, still lock-free but more memory.
Where's this used in production?
Cassandra, BigTable, RocksDB use Bloom filters to skip SST file lookups for non-existent keys. Web crawlers track visited URLs. Browsers use them for malicious-URL screening. CDNs for cache-miss hints.
What about false positive rate at scale?
Pre-size for the expected entries. Insertion beyond capacity inflates FP rate. Production systems either: (a) overprovision, (b) rotate filters, or (c) use Cuckoo filter (counting BF replacement that supports delete cleanly).
Gotchas
- False positive rate degrades as filter fills, pre-size for expected load
- Hash-function quality matters, bad hashes → uneven bit distribution → worse FP rate
- Cannot delete, once a bit is set it can't be safely cleared
- Multiple Bloom filters with different hash seeds aren't interoperable
- Use a counting Bloom filter or Cuckoo filter when delete is required
Concurrent Priority Queue (Skiplist-Based)
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Sometimes
Lock-free priority queue is hard, heap-based requires global rebalance. Skiplist-based works: probabilistic levels, ordered by priority, CAS-based insert/delete. Java's ConcurrentSkipListMap is the production-grade impl. Use it; don't roll a custom one.
Key Points
- Heap-based PQ: hard to make lock-free (global rebalancing on every op)
- Skiplist: probabilistic balanced structure, naturally lock-free
- Insert: choose random level, CAS pointers in at each level
- Java's ConcurrentSkipListMap is the canonical implementation
- PriorityBlockingQueue (Java) is heap-based with a single lock, fine for moderate scale
Interview Follow-up Questions
Why is heap-based PQ hard to make lock-free?
Heap operations percolate values up/down through the tree, modifying many nodes. CAS on individual nodes isn't enough, the heap invariant spans the structure. Lock-free heap papers exist but are complex; production systems use skiplist instead.
Skiplist vs B-tree for ordered concurrent map?
Skiplist easier to make lock-free (probabilistic levels = local CAS works). B-tree better cache locality but harder to make lock-free. Java picked skiplist for ConcurrentSkipListMap; many DB indexes use B-tree with locks.
When does PriorityBlockingQueue bottleneck?
At the single lock. Heap rebalancing is O(log n) under the lock; with many concurrent producers, lock contention dominates. Switching to ConcurrentSkipListMap-based design (or per-priority queues) helps.
How does this compare to Disruptor?
Disruptor is a single-producer/single-consumer (or MPMC) ring buffer with NO priority, strict FIFO. Priority semantics need a different structure. They solve different problems.
Gotchas
- PriorityBlockingQueue.iterator() is NOT in priority order, only poll() is
- ConcurrentSkipListMap.firstEntry() is a snapshot, value may have been polled by another thread by the time it's used
- Skiplist memory overhead: ~4x of equivalent array (multi-level pointers per node)
- size() on ConcurrentSkipListMap is O(n); don't call in hot paths
Concurrent Trie (RCU-Style Read Path)
Section: Lock-Free Programming | Difficulty: Hard | Frequency: Rare
Concurrent trie: prefix-indexed tree where reads traverse a stable snapshot (no locks) and writes copy-then-swap nodes (RCU style). Lock-free reads at the cost of some write overhead. Used in Linux routing tables, CHAMP-based immutable collections (e.g. Scala 2.13's HashMap), and persistent data structures generally.
Key Points
- Trie = tree indexed by string prefix; common in routing, autocomplete, dictionaries
- RCU (read-copy-update): readers see consistent snapshots; writers copy + atomic swap
- Lock-free reads: traverse atomic-pointer fields; never block
- Writers serialize on root pointer CAS, but contention only on the changed branch
- Memory: temporary copies on update (free old when no readers reference)
Interview Follow-up Questions
Why path-copy instead of in-place mutation?
In-place would race with concurrent readers, they could traverse partially-modified state. Path-copy guarantees readers see the OLD tree until the root CAS commits the NEW tree atomically. Like Git's commit model.
How is memory freed?
Old nodes are reachable by any reader that started before the swap. Garbage collector cleans up when no reference remains (Java/Go automatic). C/C++ needs hazard pointers or epoch-based reclamation, much harder.
When is this preferred over a HashMap?
Prefix queries: 'all words starting with X', trie traverses to that prefix and enumerates. HashMap can't. Used heavily in autocomplete, IP routing tables (longest-prefix match), file systems.
Where does Linux use RCU-like patterns?
Routing tables, file system dentry cache, network device list. Reads are extremely common (every packet); writes are rare (config change). RCU-style copy-on-update gives lock-free reads at the cost of write overhead.
Detecting Race Conditions in Production
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
Race conditions almost never reproduce locally. Catch them with built-in race detectors (go run -race, ThreadSanitizer, jcstress) integrated into CI, plus production monitoring for symptoms (counter mismatches, sporadic test failures, "impossible" log states).
Key Points
- Race conditions hide in unit tests, they need real load and timing
- go run -race is built into Go, use it in CI; 5-10x runtime cost
- Java has no built-in race detector; use jcstress for stress tests, JFR for production tracing
- Python has the GIL but still has races, no native detector; rely on stress tests and review
- Symptoms in production: sporadic test failures, counter mismatches, 'impossible' log states
- Always test on multiple architectures, x86 hides races that ARM exposes
Interview Follow-up Questions
Why don't unit tests catch race conditions?
Most unit tests run sequentially with predictable timing. Races need: (a) enough concurrency to expose interleaving, (b) enough trials to hit the unlucky timing, and (c) memory ordering scenarios that depend on hardware. Use stress tests that run the suspect code N×100 times under load.
Why does the Go race detector slow code down?
It instruments every memory access, load, store, atomic, and tracks happens-before relationships in a vector clock. The 5-10× slowdown is the cost of building and querying that data structure. Negligible for tests; too much for production.
Can I run go test -race in production?
Generally no. The slowdown is too much for production traffic. But for canaries or load-test environments? Sometimes yes, race-detected production runs catch what the test suite missed. Just be aware of the perf hit.
What does jcstress catch that go run -race doesn't?
go run -race catches accesses with no happens-before edge (data races). jcstress goes further, it demonstrates that specific Java Memory Model outcomes are observable, like (0,0) in the reordering example. Useful for verifying lock-free designs and understanding what JMM allows.
How to find a race in Python production?
It's hard. The tools: (1) log invariants, when invariants fail, the log line is evidence. (2) py-spy dump on the running process to see thread states. (3) faulthandler for crashing the process with full traceback on hangs. (4) Stress tests that exercise concurrency in CI. (5) Code review focused on every shared mutable variable.
Gotchas
- go run -race only catches races that ACTUALLY OCCUR during the run, coverage matters
- Race detector doesn't catch logical races (check-then-act with atomics)
- Python's GIL makes single bytecodes atomic, which hides many races; thread switches between bytecodes still expose compound read-modify-write races
- Tests passing 1000 times in a row doesn't prove no race, the unlucky 1001st time may fail in production
- x86 strong memory model hides races that fire on ARM, test on both architectures
- Sporadic test failures dismissed as 'flaky' are often races, investigate, don't retry
Race Detection in CI
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
Run race detectors as part of CI. Go: -race. C++/Rust: ThreadSanitizer (TSan). Java: jcstress for memory model assumptions, plus quality concurrent tests. The detectors find concurrent unprotected access regardless of platform; they catch the bugs hidden by x86's strong ordering before they hit ARM in production.
Key Points
- Go: go test -race covers most data races. Run in CI on every PR.
- C++/Rust: ThreadSanitizer (TSan). Compile with -fsanitize=thread, run tests.
- Java: jcstress for memory-model bugs (lock-free code). Standard JUnit for higher-level concurrency.
- Race detectors slow execution 2-10x and use more memory; run in CI, not production.
- Stress-test concurrent code with high goroutine/thread counts; race detector + load surfaces real bugs.
Interview Follow-up Questions
Should the race detector be on for production?
No. It slows execution 2-10x and uses 5-10x more memory. CI is where it belongs: every PR, every test. Production runs the un-instrumented binary. The race detector's job is to catch bugs before they ship; once shipped, production should never see a race.
How does Go's race detector compare to TSan?
Same theory (happens-before vector clocks), different implementations. Go's is built into the toolchain, no extra setup. TSan needs Clang/GCC with -fsanitize=thread. Both have very low false positive rates: if they report a race, there's a race.
Does the race detector catch all bugs?
No. It catches data races (concurrent unprotected access). Doesn't catch deadlocks, livelocks, missing notifications, lost wakeups, atomicity violations on multi-step operations. Also doesn't catch bugs that depend on specific scheduling that didn't happen during the test run. Useful, not sufficient.
What about Rust?
Rust's type system catches most data races at compile time (Send/Sync traits). TSan still useful for unsafe code or for FFI. The Rust compiler + clippy + TSan in CI is a strong combination.
Gotchas
- Not running -race / TSan in CI = bugs hide until ARM production
- Tests that don't actually exercise concurrency: race detector reports nothing
- Long test suites with -race timeout because of 10x slowdown; tune timeouts
- Race-free does not mean correct; algorithmic concurrency bugs (deadlock, livelock) still possible
- Skipping race detection on 'performance' tests: those are exactly where races hide
Profiling Concurrent Code
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
On-CPU profilers (perf, async-profiler, py-spy) show where threads spend CPU. Off-CPU profilers show where threads wait (locks, I/O). Concurrent code needs both: 'CPU is fine but throughput won't go higher' means off-CPU work, not on-CPU. Lock profilers and goroutine analyses are the specific tools.
Key Points
- On-CPU profile: where threads spend CPU. Use for CPU-bound bottlenecks.
- Off-CPU profile: where threads wait (block, sleep, park). Use for concurrency bottlenecks.
- Lock profilers: per-lock contention time. Async-profiler 'lock' mode, Go's mutex profiler.
- Thread/goroutine dumps: snapshot of every thread's stack. Use for hangs and deadlocks.
- Tracing (OpenTelemetry, runtime/trace): per-request timeline; useful for tail latency.
Interview Follow-up Questions
How to tell if the problem is CPU or contention?
Look at CPU utilisation under load. If CPU is at 100% and throughput plateaus, the workload is CPU-bound; profile on-CPU. If CPU is well below 100% but throughput plateaus, threads are blocked on something; profile off-CPU (lock, mutex, block profile). Both can be true; profile both.
What's the difference between CPU and wall profiling?
CPU profile only counts time the thread was running on a CPU. Wall profile counts everything including waits. For pure CPU bottlenecks, CPU is more focused. For 'why is this request slow', wall is the right pick (it shows the I/O waits, lock waits, GC pauses).
When to use tracing instead of profiling?
Profiling is statistical: aggregates over time, shows what's typical. Tracing is per-request: shows exactly what happened for one specific request. For tail latency ('the p99 is bad'), tracing wins (it surfaces what's different about slow requests). For throughput ('the average is too slow'), profiling wins (it shows where time goes in steady state).
How much overhead does profiling add?
On-CPU sampling at 100Hz: roughly 1-5%. Lock profiling: depends on event rate, can be 10%+. Tracing every request: 5-20%. For production profiling, use sampling: profile a fraction of requests / a percentage of the time. async-profiler at default rate is generally safe in production.
Gotchas
- Profiling only on-CPU when the bottleneck is contention: misses the problem
- Single thread dump is a snapshot; take 3-4 separated by seconds for hang diagnosis
- Profiling overhead can itself cause symptoms; use sampling at production scale
- Reading flame graphs without normalising by sample count: large boxes might be common, not slow
- Conflating tracing with profiling: tracing reports on specific requests, profiling on aggregate behaviour
Diagnosing Deadlocks and Hangs
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
When a service hangs, dump every thread/goroutine's stack trace. The pattern of who's waiting on what reveals whether it's a deadlock (cyclic wait), a livelock (busy spinning), or starvation (one thread never scheduled). Tools, jstack, py-spy, go pprof, kill -3.
Key Points
- Step 1 of any hang: dump every thread/goroutine's stack, without it, the diagnosis is guesswork
- Java: `jstack <pid>` or `kill -3 <pid>`; output shows 'Found X deadlocked threads'
- Go: send SIGQUIT or use /debug/pprof/goroutine?debug=2, full stack of every goroutine
- Python: `py-spy dump --pid <pid>` for sync code; asyncio.all_tasks() for async
- Deadlock pattern: thread A blocked on lock held by B, B blocked on lock held by A
- Hang ≠ deadlock, could also be infinite loop, blocked syscall, network timeout, or starvation
Interview Follow-up Questions
What's the very first thing to do when a Java service hangs?
`jstack <pid>` (or `kill -3 <pid>` if the process is in the foreground). The dump tells you within 30 seconds whether you have a deadlock (explicitly labeled), lock contention (many BLOCKED threads on one monitor), or something else (waiting on I/O, infinite loop, GC pause).
Service is at 100% CPU but hung. Deadlock?
Probably not, deadlocked threads are blocked, not running. 100% CPU + no progress = livelock (busy retry loops) or infinite loop. Take a CPU profile (`go tool pprof`, `async-profiler`, `py-spy record`) to find what's hot.
How does one tell deadlock from starvation?
Thread dumps. Deadlock: a closed cycle of threads each waiting on a lock held by the next. Starvation: ONE thread is never scheduled while OTHERS make progress. Both look like 'something is stuck' from outside; the dump distinguishes them.
Should pprof be exposed in production?
Yes, but on a private port (localhost or VPC-internal), never on a public port. The /debug/pprof endpoints can leak request data and let an attacker DOS the service. Standard practice: `localhost:6060` and access via SSH or admin VPN.
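A minimal Go sketch of the private-port pattern: the blank import registers the /debug/pprof handlers on the default mux, and binding to localhost keeps them off the public interface.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve pprof on loopback only; reach it via an SSH tunnel or admin VPN.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... rest of the service ...
	select {}
}
```

From there, `curl localhost:6060/debug/pprof/goroutine?debug=2` (over the tunnel) returns the full goroutine dump described above.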
How to debug a deadlock that won't reproduce?
Add a watchdog: every 30s, check for deadlock (Java's findDeadlockedThreads, or in any language, time-out individual operations and dump on timeout). Log full thread states when triggered. Eventually the deadlock fires; the evidence is waiting the next morning.
Gotchas
- kill -3 in production sends SIGQUIT, the Go runtime PRINTS THE DUMP AND EXITS the process. Use SIGUSR1/2 or a pprof endpoint instead.
- Java's `jstack` requires the same JVM version as the target process, and same UID; usually requires sudo or being the service user
- py-spy needs ptrace permissions; on hardened Linux, set CAP_SYS_PTRACE or run as root
- asyncio hangs don't show up as 'thread blocked', only one thread, looks idle. Use asyncio.all_tasks() instead.
- Goroutine dump alone doesn't show channel state, the dump shows 'chan receive' but not 'who has the channel'. Combine with code review.
- Hung threads waiting on a network socket look like deadlock, verify with strace/tcpdump before assuming
Bug Hunt: Find the Race in This Counter
Section: Debugging Concurrency | Difficulty: Basic | Frequency: Asked Often
The classic compound-update race. The counter looks atomic but isn't, read, increment, write are three separate steps that can interleave. Fix with an atomic primitive or a lock around the read-modify-write.
Key Points
- Compound operations (++ , += , x = !x) are NEVER atomic without explicit synchronization
- Two threads can both read 5, both compute 6, both write 6, losing one increment
- Tests pass on dev laptops; the race fires under production load
- Three correct fixes, atomic primitive (best), lock, or per-thread accumulator with merge
Interview Follow-up Questions
Why isn't `volatile` (Java) / a separate atomic Load and Store (Go) enough?
Those make individual reads and writes atomic and visible. But ++ requires THREE operations, a read, a compute, and a write. Even with each operation atomic, two threads can interleave between them. The fix is an atomic compound operation (incrementAndGet, atomic.Add) or a lock.
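The same bug and two of the fixes as a minimal Go sketch (type names illustrative):

```go
package counter

import (
	"sync"
	"sync/atomic"
)

// Racy: counter++ compiles to load, add, store. Two goroutines can both
// load 5 and both store 6, losing an increment; the race detector flags
// it when Inc is called concurrently.
type Racy struct{ n int64 }

func (c *Racy) Inc() { c.n++ }

// Fixed: one atomic read-modify-write.
type Atomic struct{ n atomic.Int64 }

func (c *Atomic) Inc() { c.n.Add(1) }

// Also fixed: a lock around the whole read-modify-write.
type Locked struct {
	mu sync.Mutex
	n  int64
}

func (c *Locked) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}
```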
Why does this 'work' on a dev machine?
Light load + few threads + x86's strong memory model often hide races. Production load fires them. Move to ARM (Apple Silicon, Graviton) and the race fires immediately. Always test under load and on multiple architectures.
Tests pass 1000 times, is the code safe?
No. A race that fires 1 in 10 million increments doesn't show in unit tests but does at 1M req/sec in production. Use the race detector (Go) or stress tests with 10K+ iterations under concurrency.
When is LongAdder better than AtomicLong?
When many threads contend on the same counter. AtomicLong's CAS retries fail under heavy contention, burning CPU. LongAdder shards into per-thread cells; sum() aggregates on read. Trades memory for throughput. Used heavily for metrics counters.
Gotchas
- Tests pass on x86, fail on ARM, same code, different memory model
- Java: AtomicInteger.get() then .set() is NOT atomic (two ops); use accumulateAndGet/updateAndGet
- Python: `count += 1` on a shared global is a read-modify-write and needs a lock; the GIL only makes individual bytecodes atomic
- Go: counter++ on int doesn't flag without -race; always run with -race in CI
- Locking around the wrong thing, protecting the read but not the write
Bug Hunt: Why Does This Deadlock Under Load?
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Two threads each acquire two locks, in opposite order. Most of the time it works because the timing doesn't line up; under load it eventually does, and both threads block forever waiting for each other. Fix with global lock ordering, or tryLock with a timeout.
Key Points
- Under low load, the unlucky interleaving is rare, the bug looks like a flaky hang
- Under high load, it becomes nearly inevitable
- Four Coffman conditions must all hold; break any ONE → no deadlock
- The cheapest fix: enforce a global total ordering on lock acquisition
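A Go sketch of that lock-ordering fix; `account` and `transfer` are hypothetical names, and the ordering key is just a stable numeric ID.

```go
package main

import "sync"

// account (hypothetical): a stable ID gives every lock a global rank.
type account struct {
	id      int64
	mu      sync.Mutex
	balance int64
}

// transfer always locks the lower-ID account first, so two concurrent
// transfers between the same pair can never hold the locks in opposite
// order -- the circular-wait condition is broken.
func transfer(from, to *account, amount int64) {
	first, second := from, to
	if second.id < first.id {
		first, second = second, first
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()

	from.balance -= amount
	to.balance += amount
}

func main() {
	a := &account{id: 1, balance: 100}
	b := &account{id: 2, balance: 100}
	transfer(a, b, 25)
	transfer(b, a, 10) // opposite direction, same acquisition order
}
```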
Interview Follow-up Questions
Why does the deadlock only happen 'sometimes'?
The bug requires a specific interleaving: thread 1 must acquire its first lock AND get scheduled out before acquiring its second, AND thread 2 must run during that window. Low load = rare interleaving. High load = inevitable interleaving.
What are the four Coffman conditions?
(1) Mutual exclusion (locks are non-shareable), (2) hold-and-wait (thread holds one lock, requests another), (3) no preemption (locks can't be forcibly taken), (4) circular wait (cycle in the wait-for graph). Break ANY one → deadlock impossible. Lock ordering breaks #4.
How is this detected in production?
Java: jstack <pid> shows 'Found N deadlocked threads' with the cycle. Go: SIGQUIT or /debug/pprof/goroutine?debug=2. Python: py-spy dump --pid. All-blocked threads in a cycle is the smoking gun.
When is tryLock preferable to lock ordering?
When ordering can't be enforced, e.g., locks are dynamic, or third-party code holds locks. tryLock with timeout breaks the no-preemption condition: if all locks can't be acquired, release what's held and retry.
Gotchas
- Reentrant locks (Java ReentrantLock, Python RLock) prevent SELF-deadlock but NOT mutual deadlock between threads
- Go's runtime only catches 'all goroutines asleep', partial deadlocks go undetected
- Holding a lock while calling user-supplied callback code is a deadlock waiting to happen
- Sorting locks by hash is fine for objects, but pointers across processes (shared memory) need stable IDs
Bug Hunt: Why Does My RWMutex Deadlock When I Re-enter?
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
RWMutex is not reentrant. A goroutine holding the read lock that calls a method which also tries to acquire the read lock can deadlock if a writer is waiting in between. Java's ReentrantReadWriteLock is reentrant by design; sync.RWMutex is intentionally not. The fix is to restructure so locks aren't reentered.
Key Points
- sync.RWMutex (Go): NOT reentrant, RLock-then-Lock or RLock-then-RLock can deadlock
- ReentrantReadWriteLock (Java): IS reentrant, same thread can re-acquire its own lock
- Common trap: reader calls a method that also reads; a writer arrives between them; deadlock
- Fix: restructure to acquire the lock once at the outer scope, or pass an already-locked context
Interview Follow-up Questions
Why is sync.RWMutex (Go) intentionally non-reentrant?
Reentrant locks hide design issues. Go's authors argue: code that needs reentrant locks is calling itself unsafely. The right answer is to refactor: extract the locked work into internal helpers, or release and re-acquire. The non-reentrant default forces good design.
When does the RWMutex deadlock actually fire?
When a writer is queued between two RLocks from the same goroutine. Without writers, two consecutive RLocks 'work' (RWMutex allows multiple readers). With a writer waiting, the second RLock blocks because writer preference forbids new readers; the first RLock can't be released because the function is still inside it → deadlock.
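A Go sketch of that trap and the restructure fix; `store`, `Has`, and `Get` are made-up names, the point is the nested RLock versus locking once at the outer edge.

```go
package main

import "sync"

type store struct {
	mu   sync.RWMutex
	data map[string]string
}

// BROKEN: Get RLocks and then calls Has, which RLocks again. If a writer
// calls mu.Lock() between the two RLocks, the second RLock blocks and the
// first can never be released -> deadlock.
func (s *store) Has(k string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.data[k]
	return ok
}

func (s *store) Get(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if !s.Has(k) { // re-enters the read lock
		return "", false
	}
	return s.data[k], true
}

// FIX: take the lock once at the outer edge; inner work assumes it's held.
func (s *store) GetFixed(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[k] // no second RLock
	return v, ok
}

func main() {
	s := &store{data: map[string]string{"a": "1"}}
	_, _ = s.GetFixed("a")
}
```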
Should ReentrantLock be the default?
Reentrant locks are easier to compose but pay a small cost (tracking the holder thread + count). For tight, performance-critical code with no nested calls, the non-reentrant variant is fine. Default to ReentrantLock unless the overhead has been measured and matters.
How does this differ from a normal deadlock?
Normal deadlock: two threads, two locks, opposite order. Reentrant deadlock: ONE thread re-acquiring the same lock with a non-reentrant primitive, when a writer (other thread) is waiting. Less obvious because there are not two distinct lock instances.
Gotchas
- Go: sync.RWMutex is writer-preferring, once Lock() is called, new RLocks block
- Java: StampedLock and Semaphore are NOT reentrant; ReentrantLock and ReentrantReadWriteLock are
- Python: threading.Lock is NOT reentrant; threading.RLock IS
- Calling external code (callbacks, listeners) while holding a lock can introduce reentrancy
- ReentrantLock is fine for nested calls; using one to call into completely unknown code is still risky (might block on something else)
Bug Hunt: Lost Wakeup with Condition Variable
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Producer signals condition before consumer waits → consumer waits forever. Classic bug from forgetting to check the predicate inside the lock before waiting. Fix: always check predicate inside a while loop while holding the lock; the predicate guards both the wait AND the wakeup.
Key Points
- Lost wakeup: signal happens before wait, signal is dropped (condition variables don't queue signals).
- Fix: hold the lock, check predicate in a while loop, wait if predicate is false.
- Update the predicate and signal under the same lock, so the waiter is guaranteed to see the new predicate value.
- Spurious wakeups also require while loop (not if): wait can return without anyone signalling.
- In Go, channels have built-in 'queued' semantics that avoid this; sync.Cond can lose signals.
Interview Follow-up Questions
Why is a while loop required around wait, not just an if?
Two reasons. (1) The predicate may have been satisfied between the wakeup and the time the thread re-acquired the lock; another waiter might have consumed the condition. (2) Spurious wakeups: the spec allows wait to return without anyone calling notify. The while loop handles both correctly; an if statement is a bug.
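The same pattern as a Go sync.Cond sketch: predicate checked in a loop under the lock, updated and signalled under the same lock.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu    sync.Mutex
		cond  = sync.NewCond(&mu)
		ready bool // the predicate, protected by mu
	)

	done := make(chan struct{})
	go func() { // consumer
		mu.Lock()
		for !ready { // a loop, never `if`: handles spurious wakeups and the
			cond.Wait() // case where the signal fired before we got here
		}
		mu.Unlock()
		fmt.Println("consumer saw ready")
		close(done)
	}()

	mu.Lock()
	ready = true  // update the predicate under the lock...
	cond.Signal() // ...and signal while still holding it
	mu.Unlock()

	<-done
}
```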
Why must the signal happen under the lock?
Because the waiter must either see the updated predicate or already be waiting when the signal fires. If the predicate is checked or updated outside the lock, this interleaving is possible: consumer checks predicate (false) → producer sets predicate and signals (no one is waiting yet, signal dropped) → consumer waits forever. Holding the lock for the predicate update AND the signal, while the waiter checks the predicate and waits under the same lock, closes that window: wait releases the lock atomically, so the signal can't fall into the gap.
How does notify vs notifyAll affect this?
notify wakes one waiter (any one); notifyAll wakes all of them. notifyAll can cause a 'thundering herd' (all wake, all re-check, only one proceeds, the rest wait again), while notify risks waking a waiter that can't act on the condition when different waiters wait for different predicates. With the while-loop pattern both are correct for a single shared predicate; notifyAll is safer when in doubt.
Are channels strictly better than condition variables?
For most cases in Go, yes. Channels queue values; a send can't be lost. For complex predicates that don't fit a queue model (multiple waiters, multiple conditions), sync.Cond is still useful. But the cases where Cond is the right tool are rare.
Gotchas
- Using if instead of while around wait: spurious wakeups break correctness
- Signalling outside the lock: predicate update may not be visible to waiter in time
- Using notify when multiple unrelated waiters wait on the same condition
- Forgetting to update the predicate before signal: wakes waiter that finds predicate false
- Believing the bug is rare: lost-wakeup happens reliably under load testing
Bug Hunt: The Broken Double-Checked Locking
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Asked Often
Pre-Java-5, double-checked locking was famously broken because the JMM allowed the constructor's writes to be reordered with the publication of the reference. A reader could see a non-null reference whose object's fields were still uninitialized. The fix is volatile (Java) or a static-holder idiom, and similar gotchas apply in Go and Python.
Key Points
- DCL trick: 'check, lock, check again, construct', saves a lock on the fast path
- Bug: without volatile, the constructor's writes can be reordered with the assignment to the field
- A reader sees a non-null reference; reads object fields; sees garbage / nulls
- Static-holder idiom (Java) is simpler and uses class-init guarantees instead
- Go's sync.Once is the right answer; in Python, module-level construction works
Interview Follow-up Questions
Why was DCL famously broken in Java before 1.5?
Pre-Java-5 JMM didn't constrain construction order. The line `instance = new Singleton()` could be split: allocate, assign field reference, then run the constructor. A reader could see the non-null reference before the constructor finished. The Java 5 JMM (JSR-133) fixed this, but ONLY when the field is volatile.
Why is the static-holder idiom safer than DCL?
Class initialization in Java is thread-safe by spec. Holder is initialized exactly once when first referenced. No lock is needed in user code. It's lazier than eager initialization (the holder class only loads when get() is called) and simpler than DCL.
Does volatile have a performance cost?
Yes, every volatile read/write emits memory barriers that prevent compiler optimizations and force CPU cache coherence traffic. For a singleton accessed millions of times, the cost is real but small. Static-holder avoids it entirely.
What's wrong with Python's DCL even with the GIL?
GIL serializes bytecode, but `_instance = Singleton()` is multiple bytecodes. Two threads can both check None and both run the constructor (serially under the GIL, but both run!), silently discarding one instance. With free-threaded CPython, the missing memory barrier becomes an issue too. The fix is module-level init or holding the lock around both the check and the construction.
Should I use sync.Once for everything in Go?
Yes, for any 'initialize exactly once' pattern, sync.Once is the right tool. It's lightweight after the first call (a single atomic load on the fast path) and correct on every architecture. One caveat: if the init function panics, the Once still counts it as done and never retries.
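A minimal sync.Once sketch; the `config` type and address are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

type config struct{ addr string }

var (
	once     sync.Once
	instance *config
)

// Instance is Go's lazy singleton: no double-checked locking, no volatile.
// sync.Once guarantees the init function runs exactly once and that every
// caller sees the fully constructed value.
func Instance() *config {
	once.Do(func() {
		instance = &config{addr: "localhost:5432"} // placeholder init
	})
	return instance
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); _ = Instance() }()
	}
	wg.Wait()
	fmt.Println(Instance().addr)
}
```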
Gotchas
- Java: forgetting volatile on the singleton field is the classic broken-DCL bug
- Java: the static-holder idiom doesn't run init until get() is called, true laziness
- Python: DCL without holding the lock for both check and set has a race on free-threaded CPython
- Go: if the function passed to sync.Once.Do panics, the Once is still marked done and init is never retried; wrap critical init with explicit retry handling if needed
- Reading instance.someField on the fast path skips ALL synchronization, must guarantee instance was safely published
- Enum singletons in Java prevent reflection-based instantiation; regular ones don't
Bug Hunt: Goroutine Count Growing 10/sec
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
A goroutine that blocks on a channel send/receive without a cancellation path lives forever. The fix is always a select on ctx.Done() or a buffered channel large enough to absorb all sends. The same bug appears in Java (futures never cancelled) and Python (asyncio tasks orphaned).
Key Points
- A goroutine that blocks forever holds memory + captured variables + channels
- Symptom: runtime.NumGoroutine() growing unbounded under load
- Diagnosis: /debug/pprof/goroutine?debug=2 shows what each goroutine is waiting on
- Fix pattern: every goroutine that sends/receives must have a cancellation path
Interview Follow-up Questions
How is a goroutine leak detected in production?
Expose `/debug/pprof/goroutine` and watch the count. Alert on `runtime.NumGoroutine()` growing unbounded. To diagnose: `curl /debug/pprof/goroutine?debug=2` shows every goroutine's stack, find the ones stuck on `chan send` or `chan receive`.
Why is the unbuffered-channel send pattern so prone to leaks?
Unbuffered = synchronous handoff. The sender blocks until a receiver is ready. If the receiver is gone (panic'd, request cancelled, slow), the sender blocks forever. Buffered channels add slack; ctx.Done() adds an exit path.
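A Go sketch of the fix: every send gets an exit path via ctx.Done(). `fetch` and the 50ms timeout are hypothetical.

```go
package main

import (
	"context"
	"time"
)

// fetch is a hypothetical per-request worker. The select gives the blocking
// send an exit path, so a vanished receiver can't strand the goroutine.
func fetch(ctx context.Context, out chan<- string) {
	result := "data" // stand-in for real work
	select {
	case out <- result: // receiver still there
	case <-ctx.Done(): // request cancelled or timed out: give up instead of leaking
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	out := make(chan string) // unbuffered: a bare send would block forever if nobody reads
	go fetch(ctx, out)

	// Simulate a caller that gives up without reading; fetch still exits via ctx.Done().
	<-ctx.Done()
}
```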
Is asyncio.create_task without await ever correct?
Rarely. Only for truly fire-and-forget work where completion and errors don't matter. Even then, keep a reference (Python may garbage-collect mid-run) and add a done_callback for exception logging. Prefer asyncio.TaskGroup for structured cleanup.
How is 'goroutine count growing' measured?
Expose runtime.NumGoroutine() as a Prometheus metric. Alert when it grows for >5 minutes without bound. Healthy services have a roughly bounded number of goroutines proportional to in-flight requests; unhealthy ones grow linearly with uptime.
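A small sketch of exposing that number with only the standard library (expvar serves it as JSON at /debug/vars); a real deployment would publish the same value as a Prometheus gauge and alert when it only ever goes up.

```go
package main

import (
	"expvar"
	"net/http"
	"runtime"
)

func main() {
	// expvar.Func is re-evaluated on every scrape of /debug/vars.
	expvar.Publish("goroutines", expvar.Func(func() any {
		return runtime.NumGoroutine()
	}))
	http.ListenAndServe("localhost:8081", nil) // internal/admin port only
}
```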
Gotchas
- fire-and-forget goroutines without ctx.Done() are the #1 source of Go production leaks
- Java's CompletableFuture without orTimeout silently inherits 'wait forever' semantics
- Python: asyncio.create_task without keeping a reference can be GC'd mid-execution
- context.WithTimeout/WithCancel without calling cancel() keeps the timer and the child context registered in the parent until the parent itself is cancelled
- Logging goroutines that block on a slow downstream → biggest leak risk in HTTP handlers
- Bounded queues + drop-on-full is often better than infinite buffering
Bug Hunt: Why Is This LongAdder Slower Than AtomicLong?
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
LongAdder shards counters across threads to win under contention, but its sum() walks every shard, so reads are O(N). When read rate is high relative to write rate, AtomicLong (single CAS) is faster. The bug is using the right tool for the wrong workload.
Key Points
- AtomicLong: cheap reads (one load), CAS retries on contention (slow when many writers contend)
- LongAdder: cheap writes under contention (per-thread cells), expensive reads (sum walks all cells)
- Read-heavy + few writers → AtomicLong wins
- Write-heavy + many writers → LongAdder wins
- Always benchmark with the actual workload, intuition lies
Interview Follow-up Questions
When does LongAdder definitively beat AtomicLong?
Many writer threads, infrequent reads. Classic example: metrics counters scraped every 10-60 seconds while every request increments. Goetz's original benchmarks showed crossover at 4+ contending threads. Below that, AtomicLong wins.
Why is AtomicLong's read O(1) but LongAdder's O(N)?
AtomicLong stores a single value, read is one cache-line load. LongAdder stores N cells (typically one per CPU); sum() walks all of them and adds. False sharing protection means each cell is on its own cache line, so reading sum is N cache misses.
Are sharded counters ever wrong?
sum() can return slightly stale values because cells are read sequentially without locking the whole structure. For metrics, this is fine. For exact accounting (billing, money), use AtomicLong or a Lock.
Why does the JDK ship both AtomicLong and LongAdder if one is just better under contention?
They're complementary, not substitutes. AtomicLong fits the 'consistent read at any time, low contention' niche. LongAdder fits the 'high write throughput, periodic read' niche. The JDK author (Doug Lea) explicitly says: profile to choose.
Gotchas
- Many counters are read on every request (rate limiters, header injection) → AtomicLong wins
- Metrics counters scraped via Prometheus/StatsD → LongAdder may win
- Sharded counters are eventually consistent, sum() can return slightly old values
- False sharing across cache lines kills sharded perf if cells aren't padded, JDK pads automatically; rolling-your-own often forgets
- Profile-driven optimization beats intuition every time
Bug Hunt: False Sharing in a Striped Counter
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Striped counter built from an array of AtomicLong shows worse throughput at high thread count than a single AtomicLong. Bug: the array elements share cache lines; threads writing different stripes cause cache-line ping-pong. Fix: pad each element to its own cache line (LongAdder, @Contended, or manual padding).
Key Points
- Symptom: striped counter scales worse than expected; CPU at 100%, throughput plateaus or degrades.
- Root cause: array elements occupy adjacent memory; multiple elements share each cache line.
- Two cores writing to different elements on the same line: the line bounces between L1 caches.
- Fix: pad each element to 64 bytes (cache line size). Java's LongAdder does this internally.
- Diagnose: profile shows hot atomic-add op, cache miss rate is high, IPC (instructions per cycle) is low.
Interview Follow-up Questions
How do I detect false sharing in profiling?
Three signals together: (1) high CPU utilisation but low throughput, (2) hot atomic operations in the profile flame graph, (3) a high cache-miss rate (e.g., perf stat -e LLC-load-misses, or perf c2c, which is designed for exactly this). The combination strongly suggests false sharing. Pad and re-measure; if throughput jumps, false sharing is confirmed.
Should I always pad atomics?
No. Padding wastes memory (64 bytes per atomic instead of 8). Only pad after measuring false sharing. The general rule: pad atomics that multiple cores write to in hot paths. Don't pad atomics that are read-mostly or written by one thread.
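A Go sketch of a padded striped counter, the same idea LongAdder implements; the 64-byte line size and the shard count are assumptions to measure against, not constants to trust blindly.

```go
package main

import (
	"runtime"
	"sync/atomic"
)

const cacheLine = 64

// paddedShard pads each counter out to its own cache line so two cores
// incrementing different shards never invalidate each other's line.
// Assumes a 64-byte line and an 8-byte atomic.Int64.
type paddedShard struct {
	n atomic.Int64
	_ [cacheLine - 8]byte
}

type stripedCounter struct {
	shards []paddedShard
}

func newStripedCounter() *stripedCounter {
	return &stripedCounter{shards: make([]paddedShard, runtime.GOMAXPROCS(0))}
}

func (c *stripedCounter) Inc(shard int) {
	c.shards[shard%len(c.shards)].n.Add(1)
}

// Sum walks every shard, which is why reads are O(number of shards).
func (c *stripedCounter) Sum() int64 {
	var total int64
	for i := range c.shards {
		total += c.shards[i].n.Load()
	}
	return total
}

func main() {
	c := newStripedCounter()
	c.Inc(0)
	c.Inc(1)
	_ = c.Sum()
}
```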
Why does cache line bouncing matter so much?
Cache coherency protocol (MESI) requires that only one core have a cache line in 'modified' state. When core A writes to the line, core B's copy is invalidated; B must re-fetch on its next access. If A and B alternate writes, the line bounces between L1 caches; each access pays the cache-miss latency (~40-100 cycles) instead of L1 hit (~1-5 cycles).
How does LongAdder choose the number of cells?
Dynamically. Starts with one cell. On contention (CAS failures), grows the cell array up to a power-of-two ceiling (typically core count or larger). Each thread hashes to a cell. With enough cells and padding, contention disappears. Reads (sum) walk all cells, so reads are O(num cells).
Gotchas
- Striped counters without padding scale worse than non-striped
- Padding only some atomics: misses the bouncing on the unpadded ones
- @Contended without -XX:-RestrictContended is silently ignored on application classes
- False sharing in field layout (two atomic fields adjacent in a struct): same problem as arrays
- Padding waste: 64 bytes per atomic adds up; use sparingly
Bug Hunt: Memory Ordering on ARM
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Code passes tests on x86, fails on ARM. The bug: missing memory barrier (volatile in Java, std::atomic in C++, sync/atomic or a mutex in Go) on a flag that the writer sets and the reader checks. x86's strong ordering hides the bug; ARM's weaker ordering exposes it. Fix: use proper synchronization primitives, never plain reads of shared variables.
Key Points
- x86 has strong ordering: most reorderings forbidden by hardware. Bugs hidden.
- ARM/POWER have weak ordering: writes can be reordered, loads can speculate ahead.
- The bug usually looks like: writer thread sets flag, reader thread sees the flag set but old data.
- Fix: use volatile / atomic / proper synchronisation. Never plain reads of shared mutable state.
- Detect via stress testing on ARM hardware (Apple Silicon, AWS Graviton, Raspberry Pi).
Interview Follow-up Questions
Why doesn't this bug appear on an x86 laptop?
x86 has strong memory ordering: stores become visible in program order and loads aren't reordered with respect to each other, so the hardware delivers ordering the language only promises when atomic/volatile is used. Buggy code 'just works' on x86 because the hardware is generous; on ARM/POWER the hardware is more permissive and the missing synchronisation is exposed.
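A Go sketch of the flag-publication pattern: with atomic.Bool, the Go memory model orders the data write before the flag store, so a reader that sees the flag also sees the data; with a plain bool it's a data race that x86 usually forgives and ARM may not.

```go
package main

import (
	"sync/atomic"
	"time"
)

var (
	data  int
	ready atomic.Bool // atomic store/load pair gives the needed ordering
)

func writer() {
	data = 42         // plain write...
	ready.Store(true) // ...published by the atomic store: a reader that
	// observes ready==true is guaranteed to also observe data==42
}

func reader() int {
	for !ready.Load() { // with a plain bool here, nothing orders this load
		time.Sleep(time.Millisecond) // against the later read of data
	}
	return data
}

func main() {
	go writer()
	_ = reader()
}
```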
Are these bugs detectable in tests?
Stress testing on ARM hardware exposes most. Race detectors (Go's -race, ThreadSanitizer for C++/Rust) catch concurrent unprotected access regardless of platform. Run TSan in CI. For Java, jcstress is the gold standard for stress-testing memory model assumptions.
What if the code is just for x86?
Three reasons to still fix it: (1) future-proofing, the code may move to ARM (Apple Silicon, AWS Graviton), (2) compiler reordering, even on x86, the compiler can reorder reads/writes within a thread, breaking lock-free protocols, (3) correctness, using proper synchronisation makes the code's intent clear to readers.
How does this relate to volatile in Java?
Java's volatile gives release-on-write and acquire-on-read semantics on a single field. It's the minimum required to publish a value safely across threads. Plain (non-volatile) Java fields can be cached in registers, reordered, or have stale values across threads. Use volatile (or atomic, or synchronized) for any shared mutable state.
Gotchas
- Testing only on x86 = bugs hide until ARM deployment
- 'Works in dev, fails in prod' when prod is on ARM
- Race detector catches issues regardless of platform; run it in CI
- C++: std::atomic defaults to memory_order_seq_cst (safe but slower); explicit memory_order_relaxed needs justification
- Java: plain long/double reads and writes aren't guaranteed atomic (they can tear, notably on 32-bit JVMs); use volatile or AtomicLong
Bug Hunt: Works on x86, Fails on ARM, Why?
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
x86 has a strong memory model that hides reordering bugs at runtime. ARM and POWER have weaker memory models that expose them. Code missing proper synchronization can 'work' on Intel and break on Apple Silicon, AWS Graviton, or mobile. The fix is to add a memory barrier, volatile, atomic, or a lock.
Key Points
- x86 uses TSO (Total Store Order), most reads and writes can't be reordered with each other
- ARM, POWER, RISC-V allow more reordering; the same code 'works' less reliably
- Compilers reorder too, even on x86, missing synchronization can fire on a JIT optimization
- Memory barriers (volatile, atomic store/load, lock acquire/release) prevent reordering
Interview Follow-up Questions
Why does x86 hide so many reordering bugs?
x86 uses TSO (Total Store Order). Most reads and writes can't be reordered with each other, only store-load reordering is allowed (and even that is rare in practice). The hardware is doing a lot of work to maintain the appearance of sequential consistency. ARM and POWER have weaker models that expose more reordering.
Does this affect compiled languages only?
No. Compilers and runtimes also reorder. The JIT can hoist a non-volatile read out of a loop on x86, and the bug fires immediately. Go's race detector catches some of these even on x86. The memory model rules apply at every layer, compiler, runtime, hardware.
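A Go sketch of the hoisting hazard: busy-waiting on a plain variable is a data race the compiler is allowed to "optimize" into an infinite loop, whereas an atomic load must be re-performed every iteration.

```go
package main

import (
	"sync/atomic"
	"time"
)

var stopPlain bool         // plain flag: the compiler may keep the read in a register
var stopAtomic atomic.Bool // atomic flag: each iteration performs a real, ordered load

// spinPlain is the bug: a racy busy-wait that may legally be compiled as
// `if !stopPlain { for {} }`, spinning forever even after the flag flips.
func spinPlain() {
	for !stopPlain {
	}
}

func spinAtomic() {
	for !stopAtomic.Load() {
	}
}

func main() {
	go spinAtomic()
	time.Sleep(10 * time.Millisecond)
	stopAtomic.Store(true)            // observed by spinAtomic's next Load
	time.Sleep(10 * time.Millisecond) // give it a moment to exit before main returns
}
```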
Are 64-bit reads/writes atomic in Java?
Not always: long/double on 32-bit JVMs can be torn. Even on 64-bit, the JLS only guarantees atomic long/double access when the field is declared `volatile`. In practice HotSpot makes 64-bit reads and writes atomic, but the spec doesn't require it.
Should I just make everything volatile / atomic?
It's tempting but expensive: every volatile/atomic access emits memory barriers that restrict reordering and generate cache-coherence traffic, costing tens of cycles. Use synchronization where there's actual cross-thread sharing; leave thread-local state alone. The cost is real on hot paths.
Gotchas
- Tests pass on x86 dev laptops, fail on ARM CI runners, TEST ON BOTH
- JIT can hoist a non-volatile read out of a loop entirely → 'why is my thread spinning forever?'
- Java: long/double on 32-bit JVMs aren't atomic without volatile, torn reads possible
- Python: free-threaded CPython (3.13+) loosens visibility guarantees that GIL-based code accidentally relied on
- Go: copying a struct that contains atomic fields silently breaks them
- x86 forbids store-store reordering in hardware, so those bugs never fire there; ARM permits it
Code Review: This Lock-Free Counter
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
A candidate's 'lock-free counter' looks slick: atomic operations, no synchronized blocks. But the get-then-set pattern isn't atomic, the read isn't ordered with the write, and overflow handling is missing. Walk through what a reviewer is expected to find.
Key Points
- An atomic.get() followed by atomic.set() is NOT atomic, same race as ++ on volatile
- Compound operations need compareAndSet (CAS) or accumulateAndGet, not get+set
- Reading a value, then checking and acting on it later, is a read-then-act race unless the whole sequence is atomic
- Overflow on counters that monotonically increase is a real production bug
Interview Follow-up Questions
Why is 'check-then-act' such a common bug?
Because it reads naturally as English: 'if X then do Y.' Engineers write this naively without realizing the gap between 'check' and 'act' is exactly where another thread can race in. Every concurrent algorithm has to ask: is this a single operation, or two? Atomicity must be explicit.
When is atomic.get() followed by atomic.set() acceptable?
When losing an update is acceptable. Examples: setting a 'last-seen timestamp' (overwriting is fine), publishing a configuration version (only one writer at a time anyway). For anything where two writers can race and both updates matter, use CAS or a lock.
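A Go sketch contrasting the broken two-step update with a CAS retry loop (and the simpler Add); the counter itself is the only state.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var counter atomic.Int64

// brokenInc is the reviewed bug: two atomic operations with a gap between
// them, so concurrent increments can overwrite each other.
func brokenInc() {
	counter.Store(counter.Load() + 1)
}

// casInc is the genuinely lock-free version: retry until the swap wins.
func casInc() {
	for {
		old := counter.Load()
		if counter.CompareAndSwap(old, old+1) {
			return
		}
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100_000; j++ {
				casInc() // or simply counter.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter.Load()) // exactly 800000; a brokenInc version typically loses updates
}
```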
Why does the candidate's code call itself 'lock-free'?
They confused 'uses atomic operations' with 'lock-free.' True lock-free means: no thread can be permanently blocked by another thread's actions. CAS retry loops are lock-free in this sense. But check-then-set with two atomic ops is neither lock-free nor correct, it's just broken.
Should this code be lock-free at all?
Probably not. For a counter accessed by ~10 threads, a Lock or synchronized method is simpler and fast enough. Lock-free programming pays off in extreme contention scenarios (e.g., kernel data structures, HFT systems). For most application code, just use a lock.
Gotchas
- atomic.get() then atomic.set() is two operations, not atomic
- atomic.set(atomic.get() + 1) is the textbook racy increment
- Reading outside a lock then acting inside it, if the read informed the action, it's a race
- Naming a class 'LockFree' doesn't make it correct
- Java's accumulateAndGet/updateAndGet build CAS retry loops automatically, prefer them when applicable
Code Review: Go Context Misuse
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Asked Often
Common Go context bugs: not passing context downstream, not checking ctx.Done() in loops, storing context in struct fields, ignoring cancel cleanup, calling cancel from inside the cancelled goroutine. Each one creates a goroutine leak or unresponsive cancellation.
Key Points
- Always pass ctx as the FIRST parameter; don't store in struct fields.
- Always defer cancel() after creating one with WithTimeout/WithCancel; otherwise leak.
- Pass ctx down through every call that does I/O or might block.
- In loops, check ctx.Done() either via select or ctx.Err().
- Returning early from cancelled context is a goroutine's responsibility; runtime won't kill it.
Interview Follow-up Questions
Why is storing ctx in a struct an anti-pattern?
Because ctx is per-request: it carries deadlines, cancellation, request-scoped values that apply to one call chain. Storing it in a struct field means subsequent calls inherit a stale ctx (the original request long-since done). The Go team explicitly recommends 'Don't store Contexts inside a struct type' in the standard library docs.
Should every function take ctx?
Every function that does I/O, blocks, or might run long. Pure compute that returns in microseconds doesn't need it. The convention: ctx as first parameter, named ctx, type context.Context. CPU-only helpers (sort, map transforms) usually don't need ctx.
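A Go sketch of those conventions in one place: ctx first, cancel deferred by the creator, ctx.Done() checked via select inside the loop. The job channel and intervals are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// poll shows the conventions: ctx as the first parameter, honored by every
// blocking wait via select, returned as the exit reason.
func poll(ctx context.Context, jobs <-chan string) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // context.Canceled or context.DeadlineExceeded
		case job := <-jobs:
			fmt.Println("processing", job)
		case <-ticker.C:
			// periodic housekeeping
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel() // always release the context's resources

	jobs := make(chan string)
	if err := poll(ctx, jobs); err != nil {
		fmt.Println("stopped:", err)
	}
}
```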
What's the difference between ctx.Done() and ctx.Err()?
ctx.Done() returns a channel that closes when ctx is cancelled; useful in select. ctx.Err() returns nil while not cancelled, returns context.Canceled or context.DeadlineExceeded after cancellation; useful for non-blocking checks. They are equivalent expressions of the same signal.
Can a context be cancelled from inside the goroutine that received it?
The cancel function works (when in scope). But typically the cancel function lives with the parent, not the child. The child observes ctx.Done() or ctx.Err() to know it should exit. When a child needs to signal 'I'm done' to siblings, prefer errgroup or a shared structured-concurrency primitive.
Gotchas
- Forgetting defer cancel() keeps the child context (and any timer) alive until the parent is cancelled
- Ignoring ctx in HTTP calls or DB queries: cancellation has no effect
- Storing ctx in struct: subsequent calls inherit stale ctx
- Using context.Background() in handlers: no inherited cancellation; should be the request's ctx
- Putting business data in ctx values: anti-pattern; use explicit parameters
Code Review: This asyncio Service for Blocking Calls
Section: Debugging Concurrency | Difficulty: Intermediate | Frequency: Sometimes
An async FastAPI handler calling a sync DB driver, sync HTTP client, and time.sleep stalls the entire event loop. Tail latency goes from 50ms to 500ms across all concurrent requests. Walk through every blocking call and either replace with async-aware libraries or wrap with asyncio.to_thread.
Key Points
- async def + a synchronous call inside = silent disaster, the event loop stalls
- All blocking I/O must be either async-native or wrapped in asyncio.to_thread
- asyncio.create_task without a reference can be garbage-collected mid-run
- asyncio.Lock and threading.Lock are NOT interchangeable
Interview Follow-up Questions
Why does ONE sync call inside async stall ALL concurrent requests?
asyncio is single-threaded. The event loop runs one coroutine at a time, switching between them at await points. A synchronous call blocks the only thread, no other coroutine can run until it returns. If 100 requests are concurrent and one of them sleeps for 500ms, all 100 see 500ms of stall.
How to detect blocking calls in production async code?
Set asyncio's debug mode (`PYTHONASYNCIODEBUG=1` or `loop.set_debug(True)`), it logs slow callbacks. Or use `aiomonitor` to dump active tasks. Production: track per-task duration; alert if any task occupies the loop for >50ms.
Is asyncio.to_thread fast?
Faster than threading.Thread (uses a shared default executor). Slower than native async (the cost is a context switch + Python's GIL release/reacquire). Use it as a bridge for unavoidable sync calls, not as a default.
Should threads be used instead of asyncio?
Without a need for 1000+ concurrent connections, threads are simpler. asyncio's value is at high concurrency where thread-per-task memory becomes prohibitive. For a web app handling 100 concurrent users, threading is fine and avoids the async-everywhere-or-nothing pitfall.
What's the right way to run a synchronous CPU task from asyncio?
ProcessPoolExecutor via `loop.run_in_executor`. The CPU work runs in another process (bypasses GIL), the event loop stays responsive. Don't run CPU work on the default thread executor, it still holds the GIL and stalls the loop.
Gotchas
- requests.get inside async def → silent stall, no warning
- time.sleep inside async def → freezes everything
- threading.Lock in asyncio code → blocks the event loop instead of yielding
- asyncio.create_task without keeping reference → task can be GC'd mid-run
- Mixing sync DB drivers with FastAPI → tail latency surge under load
- asyncio.gather with N=1M tasks → memory blow-up; bound with Semaphore
Production Postmortem: Concurrency Outages
Section: Debugging Concurrency | Difficulty: Advanced | Frequency: Sometimes
Real concurrency outages share patterns: thread pool exhaustion from a slow downstream, missing timeouts that turn local slowness into global outage, retry storms after a brief failure, deadlock under load that's invisible at low traffic. The fixes are usually defensive (timeouts, bulkheads, circuit breakers) rather than algorithmic.
Key Points
- Most outages have multiple contributing causes; rarely one bug.
- Common pattern: 'X got slow → no timeout → Y got tied up → cascade'.
- Timeline analysis: when did metrics change? Tie to deploys, traffic, downstream events.
- Action items: not just 'fix the bug' but 'prevent the class of bug': timeouts, bulkheads, alerts.
- The blameless postmortem: focus on systemic gaps, not individual mistakes.
Interview Follow-up Questions
Why do most concurrency outages have multiple causes?
Single failures are usually contained by the system's defenses (timeouts, breakers, retries). Outages happen when multiple defenses fail simultaneously: 'downstream slow' AND 'no timeout' AND 'no bulkhead' AND 'cascading retries'. The postmortem usually finds 3-5 contributing factors. Each one alone wouldn't have caused the outage.
What does a good postmortem cover?
Timeline (what happened, when, in what order). Root causes (the bugs and the gaps that allowed the bugs to escalate). Detection (how we noticed; could we have noticed sooner?). Mitigation (what we did to stop the bleeding). Action items (concrete fixes, with owners and due dates). The blameless framing: 'how did the system allow this' rather than 'who screwed up'.
How do I prevent the next outage instead of just fixing this one?
Find the class of bug, not just the instance. 'No timeout on this call' → 'audit all calls for missing timeouts'. 'Race in this code path' → 'add race detection to CI'. 'Pool exhaustion' → 'add per-downstream pools and bulkheads'. The postmortem's action items should leave the system safer against similar future incidents, not just patched against this one.
What metrics should I have for early detection?
Per-downstream latency (p50, p99). Pool utilisation (in-use / max). Queue depth (job queues, retry queues, message queues). Error rate per downstream. Circuit breaker state. The aggregate 'service latency' is too lagging; per-component metrics catch problems before they cascade.
Gotchas
- Postmortem that blames an individual instead of fixing the system
- Action items without owners or due dates: never get done
- Fixing only the immediate bug, not the class of bug
- Not exercising the postmortem's findings: 'we'll add timeouts later'
- Skipping postmortems for 'small' incidents that turn out to be patterns