Deadlock, Livelock, Starvation
Deadlock: threads block each other forever. Livelock: threads are running but making no progress. Starvation: one thread never gets the resource it needs because others keep grabbing it first. All three look similar from the outside; the fix is different for each.
Three failures that look alike from the outside
All three make a service hang or grind to a halt. The causes and fixes are completely different. Mixing them up costs days of debugging.
| Failure | CPU usage while stuck | Are threads doing anything? | Who's stuck? |
|---|---|---|---|
| Deadlock | ~0% | No, all parked | Everyone in the cycle |
| Livelock | ~100% | Yes, but the work cancels out | Everyone in the dance |
| Starvation | Normal | Yes, other threads are fine | One specific thread |
Symptom-to-cause map: 100% CPU + zero throughput = livelock. 0% CPU + nothing moves = deadlock. Throughput looks normal but one user/job never gets served = starvation.
Deadlock: the circular wait
The classic two-thread deadlock: each thread holds one lock and waits for the one the other holds.
Both threads wait. Neither can release. Forever.
Coffman's 1971 analysis: a deadlock requires four conditions, all at the same time.
1. Mutual exclusion: at least one resource can be held by only one thread at a time
2. Hold-and-wait: a thread holds one resource while waiting for another
3. No preemption: resources are released only voluntarily, never forcibly taken
4. Circular wait: there is a cycle in the wait-for graph
Break any single one and deadlock becomes impossible. The cheapest break in production code is #4: enforce a global lock acquisition order. If every thread always acquires Lock 1 before Lock 2, no cycle can form.
Livelock: the hallway dance
Two threads detect contention and both back off "to be nice." Their backoffs happen at the same time, so when they retry they collide again. They keep "yielding" to each other forever.
The hallway analogy: two people meet in a corridor. Both step left to let the other pass. They bump. Both step right. They bump. Repeat forever, neither moving.
```
TIME ------------------------------------------------------------------>
Thread A:  "you go first"   sleep 10ms   retry   "you go first"   sleep 10ms  ...
Thread B:  "no, after you"  sleep 10ms   retry   "no, after you"  sleep 10ms  ...
            ^                                     ^
            both yield at the same time           both yield again at the same time
```
Result: no thread blocks (CPU stays at 100%), but no thread makes progress either.
The fix is random jitter on the retry. Each thread sleeps for base + random(0, base) instead of a fixed base. The randomness breaks the symmetry, so on the next collision one thread retries before the other and wins the lock. Exponential backoff with jitter is the canonical pattern.
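A minimal sketch of backoff with jitter, assuming the contended work sits behind a ReentrantLock; the helper name and retry bounds are illustrative, not from the original:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class BackoffRetry {
    // Try to run 'action' under 'lock'. On contention, sleep base + random(0, base)
    // and double the base, so colliding threads desynchronise instead of retrying in lockstep.
    static boolean runWithBackoff(ReentrantLock lock, Runnable action) throws InterruptedException {
        long base = 1; // milliseconds
        for (int attempt = 0; attempt < 10; attempt++) {
            if (lock.tryLock()) {
                try {
                    action.run();
                    return true;
                } finally {
                    lock.unlock();
                }
            }
            long jitter = ThreadLocalRandom.current().nextLong(base + 1); // random(0, base)
            TimeUnit.MILLISECONDS.sleep(base + jitter);
            base = Math.min(base * 2, 100); // exponential growth, capped
        }
        return false; // caller decides what to do after bounded retries
    }
}
```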
Starvation: the never-picked thread
A thread is ready to run but the scheduler or lock keeps picking other threads. Common causes:
- Priority inversion: a high-priority thread waits on a lock held by a low-priority thread, which gets preempted by medium-priority threads. (See the priority-inversion lesson and the Mars Pathfinder story.)
- Unfair locks: a newcomer barges in front of queued waiters.
- Reader-preference RW locks: writers starve while readers keep arriving. (Both lock-fairness causes are sketched after this list.)
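For the two lock-related causes, the standard JDK knob is the fairness flag; a brief illustration (the field names are mine):

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class FairnessExamples {
    // Default (unfair) lock: a newly arriving thread may barge ahead of queued waiters.
    static final ReentrantLock unfairLock = new ReentrantLock();

    // Fair lock: the longest-waiting thread acquires next. Prevents barging-based
    // starvation at some throughput cost.
    static final ReentrantLock fairLock = new ReentrantLock(true);

    // Fair read-write lock: a waiting writer blocks later readers, so a steady
    // stream of readers can no longer starve writers.
    static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(true);
}
```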
The Mars Pathfinder bug (1997)
The Pathfinder spacecraft hit a priority-inversion bug: a low-priority weather-data thread held a lock that a high-priority bus-management thread needed, and a medium-priority communications thread kept preempting the low-priority one. The bus thread missed its deadlines and the watchdog kept rebooting the system. NASA debugged it remotely from Earth.
How to detect them in production
| Failure | Detection |
|---|---|
| Deadlock (Java) | jstack <pid> shows "Found X deadlocked threads" |
| Deadlock (Go) | All-asleep panic, runtime/pprof goroutine profile |
| Deadlock (Python) | faulthandler.dump_traceback_later, py-spy dump |
| Livelock | High CPU, low/no throughput on metrics, pprof shows hot retry loops |
| Starvation | Tail latency spikes for specific request types; thread/queue length metrics |
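On the JVM, the same check jstack performs can also be run in-process, for example from a health-check endpoint; a minimal sketch using ThreadMXBean (the surrounding class and method names are illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockProbe {
    // Logs any threads currently deadlocked on monitors or ownable synchronizers.
    static void logDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null if no deadlock found
        if (ids == null) return;
        for (ThreadInfo info : threads.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.err.println("Deadlocked: " + info.getThreadName()
                    + " waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}
```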
The interview rule of thumb
If asked how to debug a hang, check in this order: thread/goroutine dumps for blocked threads (deadlock), then CPU and retry counts (livelock), then tail latency by request type (starvation). The dump usually reveals which one it is.
Implementations
Thread A calls transfer(a, b): it locks a, then tries to lock b. Thread B simultaneously calls transfer(b, a): it locks b, then tries to lock a. Each holds what the other needs. Both block forever.
```java
import java.util.concurrent.locks.ReentrantLock;

class Account {
    private final ReentrantLock lock = new ReentrantLock();
    int balance;

    static void transfer(Account from, Account to, int amount) {
        from.lock.lock();
        try {
            to.lock.lock(); // ← if another thread holds 'to' and waits for 'from', deadlock
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally { to.lock.unlock(); }
        } finally { from.lock.unlock(); }
    }
}
```
Always acquire locks in the same order across the program. Any total ordering works; here, the account's identity hash. Both threads now acquire the lower-ranked lock first, so a circular wait cannot form.
```java
static void transfer(Account from, Account to, int amount) {
    // Order the two locks by identity hash so every thread acquires them in the
    // same global order. (On the rare hash collision, production code falls back
    // to a single tie-breaking lock.)
    Account first = System.identityHashCode(from) < System.identityHashCode(to) ? from : to;
    Account second = first == from ? to : from;
    first.lock.lock();
    try {
        second.lock.lock();
        try {
            from.balance -= amount;
            to.balance += amount;
        } finally { second.lock.unlock(); }
    } finally { first.lock.unlock(); }
}
```
Instead of preventing the cycle, break out of it: if the second lock isn't available within a timeout, release everything, back off, and retry. This breaks the 'no preemption' Coffman condition.
```java
// needs: import java.util.concurrent.TimeUnit
static boolean transfer(Account from, Account to, int amount) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(100);
    while (System.nanoTime() < deadline) {
        if (from.lock.tryLock(10, TimeUnit.MILLISECONDS)) {
            try {
                if (to.lock.tryLock(10, TimeUnit.MILLISECONDS)) {
                    try {
                        from.balance -= amount;
                        to.balance += amount;
                        return true;
                    } finally { to.lock.unlock(); }
                }
            } finally { from.lock.unlock(); }
        }
        Thread.sleep((long) (Math.random() * 5)); // jitter
    }
    return false;
}
```
Key points
- Deadlock = circular wait: Thread A holds X and waits for Y, Thread B holds Y and waits for X
- Coffman conditions: mutual exclusion + hold-and-wait + no preemption + circular wait
- Break ANY ONE of the four → no deadlock
- Lock ordering (always acquire locks in a fixed global order) is the most common deadlock prevention
- Livelock = threads keep retrying and stepping on each other: no blocking, no progress
- Starvation = unfair scheduling: a thread is permanently outrun by others
Follow-up questions
- What are the four Coffman conditions for deadlock?
- What's the difference between deadlock and livelock?
- How is deadlock prevented in practice?
- Why does Java's ReentrantLock have tryLock while Go's sync.Mutex didn't until Go 1.18?
- What's starvation?
Gotchas
- Reentrant locks (Java ReentrantLock, Python RLock) prevent self-deadlock but NOT mutual deadlock between threads
- Holding a lock while calling user-supplied code (a callback, a remote service) is a deadlock waiting to happen: that code may try to take a lock that is already held (sketched after this list)
- Go's runtime detects only 'all goroutines asleep' deadlocks; partial deadlocks (some goroutines stuck, others running) are silent
- Java's jstack and -XX:+PrintConcurrentLocks show deadlocked threads; use them on production hangs
- Database transactions can deadlock too: two transactions taking row locks (e.g. SELECT ... FOR UPDATE) on the same rows in different orders are the SQL version
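A sketch of the callback-under-lock gotcha, using a hypothetical EventBus to make it concrete:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class EventBus {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    public synchronized void subscribe(Consumer<String> l) { listeners.add(l); }

    // Risky: listener code runs while the bus's own lock is held. If a listener
    // takes another lock that a second thread holds while waiting for this bus,
    // the two threads form a circular wait.
    public synchronized void publish(String event) {
        for (Consumer<String> l : listeners) {
            l.accept(event);
        }
    }

    // Safer shape: copy the listener list under the lock, invoke callbacks outside it.
    public void publishSafely(String event) {
        List<Consumer<String>> snapshot;
        synchronized (this) { snapshot = new ArrayList<>(listeners); }
        for (Consumer<String> l : snapshot) {
            l.accept(event); // no lock held while user code runs
        }
    }
}
```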
Common pitfalls
- Adding more locks to 'fix' a race often introduces deadlock instead
- Acquiring locks in different orders on different code paths: the most common deadlock cause
- Using sleep-based backoff without jitter → livelock under contention
- Spinning on a try-lock without backoff → CPU burn livelock
Practice problems
Cycle detection in a directed graph (DFS or topological sort)
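A sketch of the DFS at the heart of a wait-for-graph deadlock detector; the graph is modelled here as a map from each thread to the threads it waits on (node names and shapes are illustrative):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WaitForGraph {
    // Returns true if the wait-for graph contains a cycle, i.e. a deadlock.
    static boolean hasCycle(Map<String, List<String>> waitsFor) {
        Set<String> onPath = new HashSet<>(); // grey: on the current DFS path
        Set<String> done = new HashSet<>();   // black: fully explored
        for (String node : waitsFor.keySet()) {
            if (dfs(node, waitsFor, onPath, done)) return true;
        }
        return false;
    }

    private static boolean dfs(String node, Map<String, List<String>> g,
                               Set<String> onPath, Set<String> done) {
        if (onPath.contains(node)) return true; // back edge => cycle => deadlock
        if (done.contains(node)) return false;
        onPath.add(node);
        for (String next : g.getOrDefault(node, List.of())) {
            if (dfs(next, g, onPath, done)) return true;
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }
}
```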
5 philosophers, 5 forks, classic deadlock setup. Solutions: enforce odd philosophers grab left first, even grab right first; or limit to 4 philosophers eating at once.
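A sketch of the odd/even ordering fix described above; the class shape and method names are mine, and eat() is a placeholder:

```java
import java.util.concurrent.locks.ReentrantLock;

class DiningPhilosophers {
    private final ReentrantLock[] forks;

    DiningPhilosophers(int n) {
        forks = new ReentrantLock[n];
        for (int i = 0; i < n; i++) forks[i] = new ReentrantLock();
    }

    // Even seats grab their right fork first, odd seats their left, so the
    // wait-for graph can never close into a full cycle (breaks circular wait).
    void dine(int seat) {
        int left = seat;
        int right = (seat + 1) % forks.length;
        ReentrantLock first  = (seat % 2 == 0) ? forks[right] : forks[left];
        ReentrantLock second = (seat % 2 == 0) ? forks[left]  : forks[right];
        first.lock();
        try {
            second.lock();
            try {
                eat(seat);
            } finally { second.unlock(); }
        } finally { first.unlock(); }
    }

    private void eat(int seat) { /* placeholder for the actual work */ }
}
```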
APIs worth memorising
- Java: ReentrantLock.tryLock(timeout), jstack, ThreadMXBean.findDeadlockedThreads()
- Python: threading.Lock.acquire(timeout), threading.Lock.acquire(blocking=False)
- Go: select with timeout case, sync.Mutex.TryLock (1.18+)
Database systems run deadlock detectors (cycle detection in the wait-for graph) and abort one of the transactions to break the cycle. The JVM never does this automatically, but jstack and ThreadMXBean can detect monitor deadlocks on demand. Production hangs are almost always one of: deadlock, network timeout, or infinite loop.