Bug Hunt: Lost Wakeup with Condition Variable
Producer signals condition before consumer waits → consumer waits forever. Classic bug from forgetting to check the predicate inside the lock before waiting. Fix: always check predicate inside a while loop while holding the lock; the predicate guards both the wait AND the wakeup.
The Bug
Lost wakeup is the canonical condition-variable bug. The pattern:
- Producer thread: acquires lock, sets predicate, calls notify.
- Consumer thread: acquires lock, calls wait without checking the predicate.
If the producer runs first, it notifies before any consumer is waiting. The notification is dropped (condition variables don't queue signals). The consumer later calls wait; nothing wakes it. Hangs forever.
The bug is intermittent: depends on which thread runs first. Easy to miss in unit tests, hits in production under load.
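The race can be made deterministic for illustration by letting the producer finish before the consumer starts. The following is a minimal sketch, not a definitive reproduction: the class name LostWakeupDemo and the method consumeBuggy are hypothetical, and a timed wait stands in for the untimed one so the demo terminates instead of hanging.

```java
class LostWakeupDemo {
    private final Object lock = new Object();
    private boolean ready = false;

    void produce() {
        synchronized (lock) {
            ready = true;
            lock.notify(); // no thread is waiting yet, so this wakes nobody
        }
    }

    /** Buggy consumer: waits without checking ready. Returns the time spent blocked, in ms. */
    long consumeBuggy(long timeoutMillis) throws InterruptedException {
        synchronized (lock) {
            long start = System.nanoTime();
            // Assumption for the demo: a timed wait, so the program ends.
            // With an untimed wait() this call would never return.
            lock.wait(timeoutMillis);
            return (System.nanoTime() - start) / 1_000_000;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LostWakeupDemo demo = new LostWakeupDemo();
        demo.produce();                        // producer runs first
        long blocked = demo.consumeBuggy(200); // consumer arrives late
        System.out.println("ready=" + demo.ready + ", consumer blocked ~" + blocked + " ms");
    }
}
```

On a typical run the consumer sits out the whole timeout even though ready is already true, which is the lost wakeup in miniature.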
What's Going Wrong
Two contracts that condition variables enforce:
- wait must happen with the lock held, and wait atomically releases the lock and parks the thread.
- The predicate is the truth; wait is just an optimisation to avoid spinning.
The bug ignores the second contract. The waiter assumes "if I call wait, I'll be woken when ready". Wrong. The waiter must check the predicate first; wait is only for the case where the predicate is currently false.
The Fix
Always:
acquire(lock)
while (!predicate):
    wait(lock)  # atomically releases lock, parks, reacquires on wake
# predicate is true; do work
release(lock)
Three properties of this pattern:
- Predicate check before wait: if produce already ran, we see the predicate true and skip wait entirely. No lost wakeup.
- While loop: handles spurious wakeups. wait can return without notification; the while ensures we re-check.
- Lock held during check and wait: the atomicity of "check predicate, conditionally wait" prevents the producer from sneaking in between the check and the wait.
The producer side:
acquire(lock)
update predicate
notify (or notifyAll)
release(lock)
The notification happens under the lock, so when the consumer eventually wakes and re-acquires, it sees the updated predicate.
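The consumer and producer pseudocode above maps fairly directly onto Java's explicit locks. This is a minimal sketch under stated assumptions: the class ReadyFlag and its methods markReady and awaitReady are hypothetical names, and signalAll is chosen over signal only to keep the example safe with several waiters.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ReadyFlag {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition becameReady = lock.newCondition();
    private boolean ready = false; // the predicate, guarded by lock

    /** Producer side: update the predicate, then signal, all under the lock. */
    void markReady() {
        lock.lock();
        try {
            ready = true;
            becameReady.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Consumer side: check the predicate first, wait only while it is false. */
    void awaitReady() throws InterruptedException {
        lock.lock();
        try {
            while (!ready) {
                becameReady.await(); // releases lock, parks, reacquires on wake
            }
            // ready is true here; do work
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReadyFlag flag = new ReadyFlag();
        flag.markReady();  // producer runs first
        flag.awaitReady(); // predicate already true, so await is skipped
        System.out.println("consumer proceeded");
    }
}
```

Because the predicate is checked before await, the order in which the two methods run no longer matters.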
Why this isn't intuitive
The name "wait" suggests "wait for the next signal". The actual semantics are "release lock and sleep until signaled OR spuriously woken; then reacquire lock". The signal is not queued; if no one is waiting, it's lost.
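This is a deliberate tradeoff: a condition variable carries no state of its own, so there is nowhere to record a signal that arrives when no one is waiting. The state lives in the predicate, and the price is that the user has to combine wait with a predicate check.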
Higher-level primitives (CountDownLatch, channels, futures) hide this complexity. Prefer them when possible; raw condition variables are sharp.
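As an illustration of a higher-level primitive absorbing the hazard, here is a minimal sketch using java.util.concurrent.CountDownLatch. The class LatchDemo and its method names are hypothetical, and the timeout is only there so the demo cannot hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchDemo {
    // The latch remembers that countDown happened; there is no separate predicate to forget.
    private final CountDownLatch ready = new CountDownLatch(1);

    void produce() {
        ready.countDown();
    }

    /** Returns true if the latch opened within the timeout, false otherwise. */
    boolean consume(long timeoutMillis) throws InterruptedException {
        return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        LatchDemo demo = new LatchDemo();
        demo.produce();                       // producer runs first
        boolean released = demo.consume(1000); // returns immediately
        System.out.println("consumer released: " + released);
    }
}
```

The latch keeps its count, so a countDown that happens before await is not lost; the consumer sees a count of zero and proceeds.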
In Go
Go's sync.Cond has the same hazard. But Go's idiomatic solution is channels, which don't have it. A buffered channel queues values; even if the send happens before the receive, the value waits. No lost wakeup.
For most "wait until X happens" cases in Go, use a channel (signal it by closing or by sending). Reach for sync.Cond only with multiple waiters and complex predicates that don't fit channels.
The discipline that prevents this bug class is small. Never call wait without first checking the predicate the wait depends on; the predicate is the truth, the wait is the optimisation. Always loop, never if, because spurious wakeups are part of the spec. And update the predicate and signal while holding the lock that protects it; otherwise the update and the signal can land between the waiter's check and its wait, and the wakeup is lost again.
Internalise these and condition variables become predictable. Skip them and you get "occasional hang" bugs of exactly this kind.
Implementations
The producer sets ready=true and notifies. The consumer hasn't reached wait yet, so the notification is dropped. When the consumer reaches wait, ready is already true, but nothing checks it, so the consumer waits forever.
public class Lost {
    private boolean ready = false;
    private final Object lock = new Object();

    public void produce() {
        synchronized (lock) {
            ready = true;
            lock.notify(); // notify with no waiter = dropped
        }
    }

    public void consume() throws InterruptedException {
        synchronized (lock) {
            // BUG: no check before wait
            lock.wait(); // hangs forever if produce already ran
            // ...
        }
    }
}
The while loop checks the predicate before waiting. If produce already ran, ready is true; wait is skipped. Spurious wakeups are also handled by the loop.
public class Fixed {
    private boolean ready = false;
    private final Object lock = new Object();

    public void produce() {
        synchronized (lock) {
            ready = true;
            lock.notify();
        }
    }

    public void consume() throws InterruptedException {
        synchronized (lock) {
            while (!ready) { // check predicate
                lock.wait(); // release lock, wait, reacquire
            }
            // ready is now true; do work
        }
    }
}
Key points
- Lost wakeup: signal happens before wait, signal is dropped (condition variables don't queue signals).
- Fix: hold the lock, check predicate in a while loop, wait if predicate is false.
- Signal must be inside the same lock to make the predicate update visible to the waiter.
- Spurious wakeups also require while loop (not if): wait can return without anyone signalling.
- In Go, channels have built-in 'queued' semantics that avoid this; sync.Cond can lose signals.
Follow-up questions
- Why is a while loop required around wait, not just an if?
- Why must the signal happen under the lock?
- How does notify vs notifyAll affect this?
- Are channels strictly better than condition variables?
Gotchas
- Using if instead of while around wait: spurious wakeups break correctness
- Signalling outside the lock: predicate update may not be visible to waiter in time
- Using notify when multiple unrelated waiters wait on the same condition
- Forgetting to update the predicate before signal: wakes a waiter that finds the predicate false and goes back to waiting
- Believing the bug is rare: lost-wakeup happens reliably under load testing
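The notify gotcha with unrelated waiters can be sketched as follows. This is an illustrative example under stated assumptions: the class TwoPredicates and its methods are hypothetical, and two independent flags share a single monitor. With a plain notify, the one wakeup could go to the thread waiting on the other flag, which would re-check, go back to sleep, and leave the intended waiter parked. Using notifyAll together with the while loop lets every waiter re-check its own predicate.

```java
class TwoPredicates {
    private final Object lock = new Object();
    private boolean aReady = false;
    private boolean bReady = false;

    void setA() {
        synchronized (lock) {
            aReady = true;
            lock.notifyAll(); // wake every waiter; each re-checks its own flag
        }
    }

    void setB() {
        synchronized (lock) {
            bReady = true;
            lock.notifyAll();
        }
    }

    void awaitA() throws InterruptedException {
        synchronized (lock) {
            while (!aReady) {
                lock.wait();
            }
        }
    }

    void awaitB() throws InterruptedException {
        synchronized (lock) {
            while (!bReady) {
                lock.wait();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TwoPredicates p = new TwoPredicates();
        Thread waiterA = new Thread(() -> {
            try { p.awaitA(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread waiterB = new Thread(() -> {
            try { p.awaitB(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        waiterA.start();
        waiterB.start();

        p.setB();
        waiterB.join(); // B's predicate is true, so B is released
        System.out.println("B released; A still waiting: " + waiterA.isAlive());

        p.setA();
        waiterA.join();
        System.out.println("A released");
    }
}
```

An alternative design is to give each predicate its own condition object, so that a single-waiter signal cannot reach the wrong thread.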