Logical Clocks & Causality Tracking
Why Physical Clocks Fail
On a single machine, ordering events is trivial. The CPU has a monotonic clock, and events happen in sequence. But the moment you have two machines, things fall apart.
Machine A's clock says 14:00:00.005. Machine B's clock says 14:00:00.002. Did B's event happen before A's? Maybe. Or maybe B's clock is 10 milliseconds slow, and both events actually happened at the same physical instant. You cannot tell from timestamps alone.
NTP (Network Time Protocol) keeps clocks approximately synchronized, typically within a few milliseconds. But "approximately" is doing a lot of heavy lifting in that sentence. NTP corrections are not always smooth. If NTP discovers a clock is 50ms ahead, some implementations step the clock backward, which means timestamps can go backward in your logs. Others slew the clock (gradually adjusting the rate), which avoids backward jumps but means the clock runs at the wrong speed for a while.
Leap seconds make it worse. When a leap second is inserted, 23:59:59 is followed by 23:59:60 before 00:00:00. Many systems have never been tested with this, and the results are unpredictable. Google's approach of "smearing" the leap second over 24 hours (running clocks slightly slow for a day) avoids the discontinuity but means Google's clocks disagree with everyone else's for 24 hours.
The fundamental issue: physical clocks are approximate and independently maintained. They tell you roughly when an event happened, not whether one event could have influenced another. You need something that captures the actual causal relationships between events.
Lamport Timestamps
Leslie Lamport defined the happens-before relation in 1978. Event A happens before event B (written A -> B) if: A and B are on the same process and A occurred first, or A is the send of a message and B is the receipt of that message, or there exists an event C such that A -> C and C -> B (transitivity).
Lamport timestamps enforce a simple rule. Each process maintains a counter. On any local event, increment the counter. When sending a message, attach the current counter value. When receiving a message, set the counter to max(local_counter, message_counter) + 1.
This gives you a scalar value for every event in the system with one guarantee: if A -> B, then L(A) < L(B). Causal ordering is preserved.
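Here is a minimal sketch of those rules in Python (the class and method names are illustrative, not from any particular library):

```python
class LamportClock:
    """Minimal sketch of a Lamport clock for one process."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Any local event advances the counter.
        self.counter += 1
        return self.counter

    def send(self):
        # Sending is itself an event; attach the resulting counter to the message.
        self.counter += 1
        return self.counter

    def receive(self, message_counter):
        # Merge: jump past anything the sender has already seen.
        self.counter = max(self.counter, message_counter) + 1
        return self.counter
```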
But the converse does not hold. If L(A) < L(B), you cannot conclude that A happened before B. The events might be concurrent (neither caused the other). Lamport timestamps collapse all concurrent events into an arbitrary total order, which is fine for some use cases (like a total ordering for a replicated log) but loses information about concurrency.
Think of it this way. Lamport timestamps are a projection from a partially ordered set (the true causal order) into a totally ordered set (the integers). Projections lose information. The information lost here is precisely the distinction between "A caused B" and "A and B are unrelated."
When Lamport Timestamps Are Enough
Despite their limitations, Lamport timestamps are used widely because they are cheap (one integer per event) and sufficient for many use cases.
One common use is last-writer-wins conflict resolution. When two writes to the same key carry different Lamport timestamps, the higher one wins. This is not causally correct (a concurrent write with a higher timestamp is not necessarily "later"), but it is deterministic and consistent across replicas. For many workloads, a deterministic tiebreaker is all you need, even if it occasionally picks the "wrong" winner.
Lamport timestamps also provide the foundation for total order broadcast, which is what replicated state machines need. If you combine Lamport timestamps with node IDs as a tiebreaker (when two events have the same Lamport timestamp, the event from the higher-numbered node wins), you get a total order that all nodes agree on without any communication.
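A small illustration of that tiebreak, using ordinary tuple comparison (the node names and events are made up):

```python
# Sketch: a total order from (Lamport timestamp, node ID) pairs.
# Tuple comparison orders by timestamp first, then breaks ties by node ID,
# so the event from the higher-numbered node sorts later and "wins".
events = [(3, "node-2", "write x=1"), (3, "node-7", "write x=2"), (2, "node-9", "write y=5")]
for ts, node, op in sorted(events):
    print(ts, node, op)
# Every node sorting the same set of events produces the same order,
# with no communication beyond the timestamps themselves.
```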
Vector Clocks: Detecting Concurrency
Vector clocks solve the information loss problem by replacing the single counter with a vector of N counters, one per node. Node A tracks [count_A, count_B, count_C]. Node B tracks [count_A, count_B, count_C]. And so on.
The rules mirror Lamport timestamps but operate on vectors. On a local event at node A, increment count_A. When sending a message, attach the full vector. When receiving a message, take the element-wise max of the local vector and the received vector, then increment your own entry.
Now you can compare two events. Event X has vector VX and event Y has vector VY.
If VX[i] <= VY[i] for all i and VX != VY, then X happened before Y.
If VY[i] <= VX[i] for all i and VX != VY, then Y happened before X.
If neither vector dominates the other (VX[i] > VY[i] for some i and VX[j] < VY[j] for some j), the events are concurrent.
This is a strictly more powerful ordering than Lamport timestamps. You can detect all three relationships: happens-before, happens-after, and concurrent. The cost is carrying N integers instead of one, where N is the number of nodes.
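A sketch of the comparison rules, representing each vector as a map from node ID to counter (missing entries count as zero):

```python
def compare(vx: dict, vy: dict) -> str:
    """Sketch of vector clock comparison between two events."""
    nodes = set(vx) | set(vy)
    x_le_y = all(vx.get(n, 0) <= vy.get(n, 0) for n in nodes)
    y_le_x = all(vy.get(n, 0) <= vx.get(n, 0) for n in nodes)
    if x_le_y and y_le_x:
        return "equal"
    if x_le_y:
        return "X happened before Y"
    if y_le_x:
        return "Y happened before X"
    return "concurrent"

# Neither vector dominates the other, so the events are concurrent.
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 3}))  # -> concurrent
```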
Why Vector Clocks Do Not Scale
The problem is obvious. With 1000 nodes, every message carries 1000 integers. That is about 4-8KB of metadata per message. With 100,000 nodes, it is 400-800KB per message. This is prohibitively expensive for large-scale systems.
Worse, if nodes join and leave the cluster, the vector must grow (adding entries for new nodes) but should never shrink (removing an entry loses causal information). Long-running systems accumulate entries for nodes that no longer exist, bloating every message with dead weight.
This is why almost no production system uses full vector clocks. Amazon's Dynamo paper described its conflict-tracking mechanism as vector clocks, but what it actually tracks are per-key version vectors, which are substantially cheaper.
Version Vectors: The Practical Alternative
Version vectors look like vector clocks but track a subtly different thing. Instead of tracking causality between arbitrary events, version vectors track versions of specific data items.
The vector has one entry per replica that has written to a data item, not one entry per node in the cluster. In a 1000-node cluster where a particular key has only been written by 3 replicas, the version vector has 3 entries, not 1000.
When a client reads a key, it gets the value plus the version vector. When the client writes, it sends back the version vector it read, and the server increments its entry. If two clients read version [A:1, B:2] concurrently and both write, you get two divergent versions: [A:2, B:2] and [A:1, B:3]. Neither dominates the other, so the system knows the writes are concurrent and can present both to the client for resolution (Riak's approach) or pick a winner (DynamoDB's approach).
The key advantage over full vector clocks: version vectors are per-key, and their size scales with the number of replicas that have written to that key, which is typically 2-5, not the total cluster size.
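A sketch of that read-modify-write flow, with version vectors as maps from replica ID to counter (the replica names are illustrative):

```python
def dominates(v1, v2):
    """True if version v1 has seen everything version v2 has seen."""
    return all(v1.get(r, 0) >= c for r, c in v2.items())

read_version = {"A": 1, "B": 2}   # both clients read this version

# Client 1 writes through replica A, client 2 through replica B.
write1 = dict(read_version, A=read_version["A"] + 1)   # {"A": 2, "B": 2}
write2 = dict(read_version, B=read_version["B"] + 1)   # {"A": 1, "B": 3}

# Neither version dominates the other, so the store must treat them as
# concurrent siblings (or pick a winner with some other rule).
print(dominates(write1, write2), dominates(write2, write1))  # False False
```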
Dotted Version Vectors
Version vectors have a subtle bug that dotted version vectors fix. Consider this scenario. Client reads key K with version [A:1]. Client writes. Server A increments to [A:2] and stores the value. Now another client reads K from A, getting version [A:2], and writes concurrently with a third client that also read [A:2]. Both writes go to server A.
With plain version vectors, A would increment to [A:3] for the first write, then to [A:4] for the second write. But now the second write has version [A:4] which dominates [A:3], making it look like the second write happened after the first. The first write is silently lost, even though they were concurrent.
Dotted version vectors solve this by separating the "context" (the version vector the client read) from the "dot" (the specific event that created this version). The dot is a single (node, counter) pair that uniquely identifies this write event. When comparing, the system checks whether a dot is covered by another version's context, which correctly identifies concurrent writes even when they happen on the same server.
Riak adopted dotted version vectors specifically to fix sibling explosion bugs caused by this subtle issue.
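A sketch of the dotted comparison applied to the scenario above (the representation is simplified; real implementations carry more bookkeeping):

```python
# A version is (dot, context): the dot is the (node, counter) event that
# created it, the context is the version vector the writing client had read.

def happened_before(v1, v2):
    """v1 happened before v2 iff v1's dot is covered by v2's context."""
    (node, counter), _ = v1
    _, context2 = v2
    return context2.get(node, 0) >= counter

# Both clients read [A:2]; server A assigned dots (A,3) and (A,4) to their writes.
w1 = (("A", 3), {"A": 2})
w2 = (("A", 4), {"A": 2})

print(happened_before(w1, w2))  # False
print(happened_before(w2, w1))  # False -> concurrent, both kept as siblings
# A plain comparison of [A:3] vs [A:4] would have wrongly concluded that the
# second write supersedes the first.
```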
Hybrid Logical Clocks
Hybrid Logical Clocks (HLC), designed by Kulkarni, Demirbas, Madeppa, Avva, and Leone, combine the best parts of physical and logical clocks.
Each HLC timestamp has two components: a physical part (the best-known physical time) and a logical part (a counter that breaks ties and preserves causality). The physical part is always close to the actual wall clock time, never more than the maximum expected clock skew away.
The rules: on a local event, set the physical part to max(local_physical, wall_clock_time); if the physical part did not change, increment the logical part, otherwise reset it to zero. On sending a message, attach the HLC. On receiving, set the physical part to max(local_physical, message_physical, wall_clock_time), then set the logical part to zero if the wall clock won outright, or advance it past whichever logical counter accompanied the winning physical value.
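A minimal sketch of those update rules (a toy implementation, not CockroachDB's; the millisecond wall clock is an assumption):

```python
import time

class HLC:
    """Minimal sketch of Hybrid Logical Clock updates."""

    def __init__(self, wall_clock=lambda: int(time.time() * 1000)):
        self.wall_clock = wall_clock   # physical time source, here in milliseconds
        self.physical = 0
        self.logical = 0

    def now(self):
        """Local event or message send."""
        wall = self.wall_clock()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1
        return (self.physical, self.logical)

    def update(self, msg_physical, msg_logical):
        """Message receive: merge the sender's HLC with our own."""
        wall = self.wall_clock()
        new_physical = max(self.physical, msg_physical, wall)
        if new_physical == wall and wall > max(self.physical, msg_physical):
            self.logical = 0                                  # wall clock won outright
        elif new_physical == self.physical == msg_physical:
            self.logical = max(self.logical, msg_logical) + 1  # tie: advance past both
        elif new_physical == self.physical:
            self.logical += 1                                  # our physical part won
        else:
            self.logical = msg_logical + 1                     # sender's physical part won
        self.physical = new_physical
        return (self.physical, self.logical)
```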
The result is a timestamp that looks like a regular timestamp (you can read it as a date/time), preserves causal ordering (if A -> B, then HLC(A) < HLC(B)), and stays close to real time. CockroachDB uses HLC extensively to provide serializable transactions without requiring specialized clock hardware.
The trade-off is that HLC cannot detect concurrency the way vector clocks can. Two HLC timestamps are always comparable (one is always less than, equal to, or greater than the other). This is fine for CockroachDB because it uses HLC within a consensus protocol that resolves conflicts through other mechanisms, not through clock comparison alone.
Google Spanner's TrueTime
Google took the radical approach of fixing the clock problem at the hardware level. Every Spanner datacenter has GPS receivers and atomic clocks. TrueTime does not give you a timestamp. It gives you an interval: [earliest, latest]. The true time is somewhere in that interval, but you do not know exactly where.
The interval width is typically under 7 milliseconds. When Spanner needs to commit a transaction, it picks a timestamp at the end of its TrueTime interval, then waits until that timestamp is guaranteed to be in the past on all servers. This is the "commit wait" that people find surprising: Spanner deliberately pauses for a few milliseconds on every commit to ensure that no other server could assign a later-but-physically-earlier timestamp.
The math is elegant. Server A assigns its commit a timestamp T_A at the latest edge of its TrueTime interval, then waits until TrueTime guarantees T_A is in the past (earliest > T_A) before acknowledging. Any transaction that begins after A's acknowledgment must be assigned a timestamp greater than T_A, so timestamp order agrees with real-time order. That wait, roughly the uncertainty epsilon, is the price Spanner pays for the guarantee.
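A sketch of commit wait against a hypothetical TrueTime-style interface that returns an (earliest, latest) pair; the 4ms uncertainty is an assumed stand-in, not Spanner's actual figure:

```python
import time

def tt_now():
    """Pretend TrueTime: real time is within +/- epsilon of the local clock."""
    epsilon = 0.004  # assumed 4 ms of uncertainty
    now = time.time()
    return (now - epsilon, now + epsilon)

def commit():
    # Pick the commit timestamp at the latest edge of the current interval...
    _, commit_ts = tt_now()
    # ...then wait until the earliest possible current time has passed it,
    # so the timestamp is guaranteed to be in the past before we acknowledge.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```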
The practical cost: every write incurs a commit-wait latency of roughly the TrueTime uncertainty (typically 1-7ms). Reads that need strong consistency also incur this latency. For Google's workloads, a few milliseconds per transaction is acceptable, and the benefit of globally consistent timestamps is enormous.
For everyone who is not Google: you do not have GPS receivers and atomic clocks in every datacenter. CockroachDB has experimented with using cloud providers' NTP services to provide tighter clock bounds, but the uncertainty is typically 50-150ms, not 7ms. That 50ms commit-wait would be prohibitively expensive for most workloads, which is why CockroachDB uses HLC with a different consistency mechanism instead.
Choosing the Right Clock
The decision tree is shorter than you might expect.
If you need a total order for a replicated log and do not care about detecting concurrency, Lamport timestamps are cheap and correct. One integer per event.
If you need to detect concurrent writes to resolve conflicts in a multi-master database, use version vectors (not full vector clocks). The metadata scales with the number of writing replicas per key, which is usually small.
If you need timestamps that look like real times and preserve causality, and you are running a consensus-based system that handles conflicts through other means, Hybrid Logical Clocks give you the best of both worlds.
If you are Google and have the budget for GPS receivers and atomic clocks in every datacenter, TrueTime gives you globally consistent timestamps with bounded uncertainty. For everyone else, this is a fascinating design to study but not one you can replicate.
Whatever you choose, do not use physical timestamps alone. The bugs will be subtle, intermittent, and incredibly painful to diagnose. A "lost update" that only happens when NTP corrects a clock skew during peak traffic, on a node that was already under GC pressure, is the kind of production incident that consumes engineering weeks.
Key Points
- Physical clocks cannot reliably order events across machines. NTP drift, leap seconds, and clock skew mean that wall-clock timestamps from two different servers are not comparable for ordering purposes.
- Lamport timestamps give you a total order that is consistent with causality in one direction: if A happened before B, then L(A) < L(B). The converse is not true; two events with different timestamps might be concurrent.
- Vector clocks can detect concurrency: if neither vector dominates the other element-wise, the events are concurrent. This is what you need for conflict detection in multi-master replication.
- Riak uses dotted version vectors, CockroachDB uses Hybrid Logical Clocks, and Google Spanner uses GPS and atomic clocks to bound uncertainty. Each makes a different trade-off between metadata size, accuracy, and hardware requirements.
Common Mistakes
- ✗ Relying on System.currentTimeMillis() or time.Now() to order events across machines. Clock skew between machines is typically 1-10ms with NTP, but can spike to seconds during NTP corrections or leap second events.
- ✗ Using Lamport timestamps when you need to detect concurrency. Lamport timestamps cannot distinguish between 'A happened before B' and 'A and B are concurrent.' You need vector clocks or version vectors for that.
- ✗ Implementing full vector clocks at scale. With 1000 nodes, every message carries 1000 integers. Version vectors, which track versions per data item rather than per node, are almost always sufficient and much smaller.
- ✗ Ignoring clock skew when using LWW registers. If node A's clock is 5 seconds ahead of node B's, node A's writes always win concurrent conflicts regardless of actual ordering.