Bug Hunt: Lost Wakeup with Condition Variable

The Bug

Lost wakeup is the canonical condition-variable bug. The pattern:

Producer thread: acquires lock, sets predicate, calls notify.
Consumer thread: acquires lock, calls wait without checking the predicate.

If the producer runs first, it notifies before any consumer is waiting. The notification is dropped (condition variables don't queue signals). The consumer later calls wait; nothing wakes it. Hangs forever.

The bug is intermittent: depends on which thread runs first. Easy to miss in unit tests, hits in production under load.

What's Going Wrong

Two contracts that condition variables rely on:

1. wait must happen with the lock held, and wait atomically releases the lock and parks the thread.
2. The predicate is the truth; wait is just an optimisation to avoid spinning.

The bug ignores contract 2. The waiter assumes "if I call wait, I'll be woken when ready". Wrong. The waiter must check the predicate first; wait is only for the case where the predicate is currently false.

The Fix

Always:

acquire(lock)
while (!predicate):
    wait(lock)        # atomically releases lock, parks, reacquires on wake
# predicate is true; do work
release(lock)

Three properties of this pattern:

Predicate check before wait: if the producer already ran, we see the predicate true and skip wait entirely. No lost wakeup.
While loop: handles spurious wakeups. wait can return without notification; the while ensures we re-check.
Lock held during check and wait: the atomicity of "check predicate, conditionally wait" prevents the producer from sneaking in between the check and the wait.

The producer side:

acquire(lock)
update predicate
notify (or notifyAll)
release(lock)

The notification happens under the lock, so when the consumer eventually wakes and re-acquires, it sees the updated predicate.
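
The two pseudocode halves above translate fairly directly to Go's sync.Cond. This is a sketch, not a canonical implementation; the flag type and its set and await methods are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// flag is a hypothetical one-shot "ready" flag guarded by a mutex.
type flag struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool // the predicate
}

func newFlag() *flag {
	f := &flag{}
	f.cond = sync.NewCond(&f.mu)
	return f
}

// set is the producer side: update the predicate and notify under the lock.
func (f *flag) set() {
	f.mu.Lock()
	f.ready = true
	f.cond.Broadcast()
	f.mu.Unlock()
}

// await is the consumer side: check the predicate first and wait only while
// it is false. Wait releases the lock while parked and reacquires it on wake.
func (f *flag) await() {
	f.mu.Lock()
	for !f.ready {
		f.cond.Wait()
	}
	f.mu.Unlock()
}

func main() {
	f := newFlag()
	f.set()   // producer runs first; the notification reaches nobody
	f.await() // predicate is already true, so Wait is never called
	fmt.Println("consumer proceeded")
}
```

Because await checks ready before it ever waits, the order in which set and await run no longer matters.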

Why this isn't intuitive

The name "wait" suggests "wait for the next signal". The actual semantics are "release lock and sleep until signaled OR spuriously woken; then reacquire lock". The signal is not queued; if no one is waiting, it's lost.

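This is a deliberate tradeoff: a condition variable carries no state of its own, which keeps it cheap and lets it pair with any predicate. A primitive that remembers signals is a semaphore, which is a different tool. The price is that the user has to combine wait with a predicate check.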

Higher-level primitives (CountDownLatch, channels, futures) hide this complexity. Prefer them when possible; raw condition variables are sharp.

In Go

Go's sync.Cond has the same hazard. But Go's idiomatic solution is channels, which don't have it. A buffered channel queues values; even if the send happens before the receive, the value waits. No lost wakeup.

For most "wait until X happens" cases in Go, use a channel (signal it by closing or by sending). Reach for sync.Cond only with multiple waiters and complex predicates that don't fit channels.
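
Both channel styles can be checked in a few lines. The sketch below uses hypothetical helper names (closeBeforeReceive, sendBeforeReceive) and deliberately performs the signal before any receiver exists.

```go
package main

import "fmt"

// closeBeforeReceive closes the channel before anyone receives on it and
// reports whether a later receive still completes.
func closeBeforeReceive() bool {
	ready := make(chan struct{})
	close(ready) // the "signal" happens first
	select {
	case <-ready: // a closed channel is always ready to receive
		return true
	default:
		return false
	}
}

// sendBeforeReceive sends into a buffered channel with no receiver present
// and returns the value read from it afterwards.
func sendBeforeReceive() int {
	ch := make(chan int, 1)
	ch <- 42 // does not block: the buffer holds the value
	return <-ch
}

func main() {
	fmt.Println(closeBeforeReceive(), sendBeforeReceive())
}
```

Closing suits one-shot events with any number of waiters; sending suits handing a value to a single receiver.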

The discipline that prevents this bug class is small. Never call wait without first checking the predicate the wait depends on; the predicate is the truth, the wait is the optimisation. Always loop, never if, because spurious wakeups are part of the spec. And signal while holding the lock that protects the predicate, otherwise the update and the signal can race past each other.

Internalise these and condition variables become predictable. Skip them and "occasional hang" bugs come from exactly this pattern.

Follow-up questions

▸Why is a while loop required around wait, not just an if?

Two reasons. (1) The predicate may no longer hold by the time the woken thread re-acquires the lock; another waiter might have consumed the condition in between. (2) Spurious wakeups: the spec allows wait to return without anyone calling notify. The while loop handles both correctly; an if statement is a bug.
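
The first reason is easiest to see with more than one consumer. Here is a sketch, assuming a hypothetical slice-backed queue type with put and take methods: every put wakes all waiters, but only one of them finds an item, so the others must re-check and wait again.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical unbounded FIFO shared by several consumers.
type queue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []int
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// put appends an item and wakes every waiter.
func (q *queue) put(v int) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.cond.Broadcast()
	q.mu.Unlock()
}

// take blocks until an item is available and removes it.
func (q *queue) take() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	// With `if` instead of `for`, a waiter woken after another consumer
	// already took the item would index an empty slice and panic.
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v
}

func main() {
	q := newQueue()
	results := make(chan int, 2)
	for i := 0; i < 2; i++ {
		go func() { results <- q.take() }()
	}
	q.put(1)
	q.put(2)
	a, b := <-results, <-results
	fmt.Println(a + b) // each item is taken exactly once, in either order
}
```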

▸Why must the signal happen under the lock?

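Because the predicate update and the signal must be ordered against the waiter's check-then-wait. If the producer updates and signals without the lock, this interleaving is possible: consumer reads predicate (false) → producer sets predicate → producer signals (lost, nobody is waiting yet) → consumer waits, forever. Holding the lock during the predicate update AND the signal means the producer cannot run between the consumer's check and its wait, so the waiter either sees the updated predicate or is already parked when the signal arrives.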

▸How does notify vs notifyAll affect this?

notify wakes one waiter (any one). notifyAll wakes all. notifyAll can cause a 'thundering herd' when only one waiter can act on the change (all wake, all check, only one proceeds, others wait again). With the while-loop pattern, notifyAll is always correct, just sometimes wasteful. notify is correct only when every waiter waits on the same predicate and waking a single one is enough. notifyAll is safer when in doubt.

▸Are channels strictly better than condition variables?

For most cases in Go, yes. Channels queue values; a send can't be lost. For complex predicates that don't fit a queue model (multiple waiters, multiple conditions), sync.Cond is still useful. But the cases where Cond is the right tool are rare.

