Hybrid Logical Clocks (HLC) — Distributed Systems | CrackingWalnuts

The Gap Between Physical and Logical Clocks

Physical clocks (wall clocks synchronized by NTP) give you real time but no causal guarantees. Event A at 10:00:00.001 might have happened before or after event B at 10:00:00.002 depending on clock skew between machines. NTP keeps clocks within a few milliseconds, but "within a few milliseconds" is not the same as "exactly synchronized."

Logical clocks (Lamport clocks, vector clocks) give you perfect causal ordering but no connection to real time. Lamport timestamp 42 tells you nothing about when the event actually happened. You cannot use it for TTLs, lease expiration, or anything that needs wall-clock semantics.

HLC, introduced by Kulkarni, Demirbas, Madappa, Avva, and Leone in 2014, bridges this gap. An HLC timestamp has two components:

pt (physical): tracks the maximum physical time seen
l (logical): breaks ties when pt is the same

The resulting timestamp behaves like a logical clock (preserves causal ordering) while staying close to the physical clock (useful for time-based reasoning).

The HLC Algorithm

Each node maintains two values, pt (the physical component of its HLC) and l (the logical component), and has access to its local physical clock through now().

Local Event or Send

When a node creates a local event or sends a message:

wall = now()          # read the physical clock once
if wall > pt:
    pt = wall
    l = 0
else:
    l = l + 1
timestamp = (pt, l)

If the physical clock has advanced past pt, reset the logical counter and use the new physical time. If it has not (same millisecond, the clock went backward, or pt was pushed ahead by an earlier received message), keep pt and increment the logical counter.
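The rule above can be sketched as a pure function in Python. The name `hlc_tick` and the explicit `wall` argument (a single reading of now()) are illustrative choices, not part of the original description, and timestamps are plain integers in milliseconds.

```python
def hlc_tick(pt, l, wall):
    """Advance an HLC state (pt, l) for a local event or a send.

    `wall` is one reading of the node's physical clock, i.e. now().
    Returns the new (pt, l), which is also the event's timestamp.
    """
    if wall > pt:
        # Physical time moved past pt: adopt it and reset the counter.
        return wall, 0
    # Physical time stalled or is behind pt: keep pt, bump the counter.
    return pt, l + 1


# Three events: clock reads 100, still 100, then 101.
state = (0, 0)
trace = []
for wall in (100, 100, 101):
    state = hlc_tick(*state, wall)
    trace.append(state)
print(trace)  # [(100, 0), (100, 1), (101, 0)]
```

Passing the clock reading in as an argument keeps the function deterministic, which makes the update rule easy to test.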

Receive Message

When a node receives a message with timestamp (msg_pt, msg_l):

old_pt = pt
pt = max(now(), pt, msg_pt)
if pt == old_pt and pt == msg_pt:
    l = max(l, msg_l) + 1
elif pt == old_pt:
    l = l + 1
elif pt == msg_pt:
    l = msg_l + 1
else:
    l = 0
timestamp = (pt, l)

The logic ensures: (1) pt never goes backward, (2) the timestamp is always strictly greater than any timestamp that causally precedes it, and (3) pt stays as close as possible to the actual physical time.
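A minimal Python sketch of the receive rule, with one call per branch. The function name `hlc_receive` and the `wall` parameter (one reading of now()) are assumptions made for illustration.

```python
def hlc_receive(pt, l, msg_pt, msg_l, wall):
    """Merge a received timestamp (msg_pt, msg_l) into local state (pt, l)."""
    new_pt = max(wall, pt, msg_pt)
    if new_pt == pt and new_pt == msg_pt:
        new_l = max(l, msg_l) + 1   # local and message pt tie
    elif new_pt == pt:
        new_l = l + 1               # local pt is the largest
    elif new_pt == msg_pt:
        new_l = msg_l + 1           # message pt is the largest
    else:
        new_l = 0                   # physical clock is ahead of both
    return new_pt, new_l


print(hlc_receive(100, 2, 100, 5, wall=99))   # (100, 6)
print(hlc_receive(100, 2, 90, 5, wall=99))    # (100, 3)
print(hlc_receive(100, 2, 120, 5, wall=99))   # (120, 6)
print(hlc_receive(100, 2, 120, 5, wall=130))  # (130, 0)
```

In every case the result is strictly greater than both the previous local timestamp and the message timestamp when compared as (pt, l) tuples.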

Why This Works

HLC preserves the happens-before relation: if event A causally precedes event B, then HLC(A) < HLC(B). The proof follows from the update rules: every event takes the max of all known timestamps and increments.

HLC timestamps also stay within epsilon of real time, where epsilon is the maximum clock skew across the system. Specifically, pt - now() <= epsilon at any point. This means HLC timestamps are usable for time-based operations (TTLs, lease expiration) with an error margin of the clock skew.
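Both properties can be checked on a toy two-node simulation. This is a sketch under stated assumptions: the `HLC` class, the injected `clock` callable, and the values of `EPSILON` and `true_time` are all hypothetical, chosen so that one node's clock runs exactly epsilon ahead of the other.

```python
class HLC:
    def __init__(self, clock):
        self.clock = clock  # callable returning this node's physical time
        self.pt, self.l = 0, 0

    def send(self):
        wall = self.clock()
        if wall > self.pt:
            self.pt, self.l = wall, 0
        else:
            self.l += 1
        return self.pt, self.l

    def receive(self, msg_pt, msg_l):
        old_pt = self.pt
        self.pt = max(self.clock(), old_pt, msg_pt)
        if self.pt == old_pt and self.pt == msg_pt:
            self.l = max(self.l, msg_l) + 1
        elif self.pt == old_pt:
            self.l += 1
        elif self.pt == msg_pt:
            self.l = msg_l + 1
        else:
            self.l = 0
        return self.pt, self.l


EPSILON = 5        # assumed maximum clock skew, in ms
true_time = 1000   # simulated real time, in ms
fast = HLC(lambda: true_time + EPSILON)  # node whose clock runs ahead
slow = HLC(lambda: true_time)            # node with an accurate clock

sent = fast.send()
received = slow.receive(*sent)
print(sent, received)                     # (1005, 0) (1005, 1)
print(sent < received)                    # True: send precedes receive
print(slow.pt - slow.clock() <= EPSILON)  # True: pt is within epsilon
```

The slow node adopts the sender's physical component, so its HLC runs ahead of its own clock, but never by more than the skew between the two machines.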

The combination is powerful: causal ordering for correctness, approximate physical time for usability.

CockroachDB: HLC in Production

CockroachDB is the most prominent production user of HLC and provides a concrete case study of how the theory translates to practice.

MVCC Timestamps

Every key-value pair in CockroachDB has an HLC timestamp. When a transaction writes a key, the write gets the transaction's commit timestamp. When reading, the system retrieves the version with the highest timestamp at or before the read timestamp.
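The read rule ("highest timestamp at or before the read timestamp") can be illustrated with a toy in-memory version list. This is not CockroachDB's storage layout; `VersionedKey` and its methods are hypothetical names, and timestamps are (pt, l) tuples compared lexicographically.

```python
import bisect


class VersionedKey:
    """All versions of one key, kept sorted by HLC timestamp (pt, l)."""

    def __init__(self):
        self.timestamps = []  # sorted list of (pt, l) tuples
        self.values = []      # values[i] was written at timestamps[i]

    def write(self, ts, value):
        i = bisect.bisect_left(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def read(self, read_ts):
        # i is the index of the first version strictly newer than read_ts.
        i = bisect.bisect_right(self.timestamps, read_ts)
        return self.values[i - 1] if i > 0 else None


key = VersionedKey()
key.write((100, 0), "v1")
key.write((105, 2), "v2")
print(key.read((99, 0)))   # None: no version exists yet
print(key.read((105, 1)))  # v1: v2 is newer than the read timestamp
print(key.read((105, 2)))  # v2: a version exactly at the read timestamp is visible
```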

HLC timestamps enable snapshot isolation and serializable isolation. A transaction reads a consistent snapshot defined by its timestamp. Writes are ordered by their HLC timestamps.

The Uncertainty Interval

Here is where bounded clock skew matters. When a transaction at timestamp T reads a key, it might find a value with a timestamp between T and T + max_offset (the configured maximum clock skew, default 500ms in CockroachDB). This value might have been written by a transaction that committed before T in real time but got a higher HLC timestamp due to clock skew.

CockroachDB handles this with an uncertainty interval. If a read at timestamp T encounters a value whose timestamp falls in the uncertainty window (T, T + max_offset], it cannot be sure whether the write truly happened before or after the read. In this case, CockroachDB restarts the transaction at a higher timestamp (above the uncertain value).
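The check can be sketched as follows. This is a simplified model, not CockroachDB's implementation: `check_read` is a hypothetical name, timestamps are plain integers in milliseconds, and restarting just above the highest uncertain value is an assumption made to keep the example short.

```python
def check_read(read_ts, version_timestamps, max_offset):
    """Return ("ok", visible_ts) or ("restart", new_read_ts)."""
    upper = read_ts + max_offset
    # Versions newer than the read, but close enough that clock skew
    # could be hiding a write that really happened first.
    uncertain = [ts for ts in version_timestamps if read_ts < ts <= upper]
    if uncertain:
        return "restart", max(uncertain) + 1
    visible = [ts for ts in version_timestamps if ts <= read_ts]
    return "ok", (max(visible) if visible else None)


print(check_read(1000, [900, 1200], max_offset=500))  # ('restart', 1201)
print(check_read(1000, [900, 1600], max_offset=500))  # ('ok', 900)
```

In the second call the newer version lies beyond the window, so it is safely in the future and the read proceeds without a restart.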

This is the practical cost of clock skew. Lower clock skew means smaller uncertainty intervals, fewer transaction restarts, and better performance. CockroachDB recommends tight NTP synchronization, and operators with a more accurate clock source can configure a smaller max_offset to get tighter bounds.

Why Not Just TrueTime?

Google Spanner uses TrueTime, which provides explicit confidence intervals on the current time using atomic clocks and GPS receivers. TrueTime can say "the current time is between 10:00:00.001 and 10:00:00.003" with high confidence. This lets Spanner wait out the uncertainty (a "commit wait" of a few milliseconds) and guarantee external consistency.
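Commit wait can be illustrated with a simulated interval clock. `SimulatedTrueTime` is a toy stand-in written for this example, not Google's API, and the 3 ms uncertainty is an arbitrary value.

```python
class SimulatedTrueTime:
    """Toy interval clock: real time advances only through sleep()."""

    def __init__(self, uncertainty_ms):
        self.real_ms = 0
        self.uncertainty_ms = uncertainty_ms

    def now(self):
        # Returns (earliest, latest) bounds on the current time.
        return (self.real_ms - self.uncertainty_ms,
                self.real_ms + self.uncertainty_ms)

    def sleep(self, ms):
        self.real_ms += ms


def commit(tt):
    """Pick a commit timestamp, then wait until it is certainly in the past."""
    commit_ts = tt.now()[1]          # latest possible current time
    waited = 0
    while tt.now()[0] <= commit_ts:  # earliest bound has not passed commit_ts
        tt.sleep(1)
        waited += 1
    return commit_ts, waited


tt = SimulatedTrueTime(uncertainty_ms=3)
tt.real_ms = 1000
print(commit(tt))  # (1003, 7)
```

The wait is roughly twice the uncertainty, which is why tight clock bounds matter so much for this design.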

CockroachDB uses HLC instead of TrueTime because TrueTime requires specialized hardware (atomic clocks and GPS receivers) that is only available in Google's infrastructure. HLC works with commodity NTP, making CockroachDB deployable on any cloud or bare-metal setup. The trade-off: larger uncertainty intervals (a 500ms default max_offset under NTP vs. ~7ms for TrueTime) and occasional transaction restarts.

YugabyteDB: Safe Time

YugabyteDB uses a variant of HLC for its "safe time" mechanism. Safe time is the timestamp up to which a node can serve consistent reads without needing to contact other nodes.

A node's safe time advances when it receives heartbeats from the leader with the leader's current HLC timestamp. Any read at a timestamp below safe time is guaranteed to see all committed writes up to that point.

The interplay between HLC and safe time gives YugabyteDB low-latency reads from followers without sacrificing consistency. The follower knows "I have all data up to safe time T" because the leader's HLC timestamp establishes a causal boundary.
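The follower-side logic described above can be sketched as below. This is a deliberately reduced model: the `Follower` class and its method names are hypothetical, and a real safe time computation has to account for more than heartbeats.

```python
class Follower:
    def __init__(self):
        self.safe_time = (0, 0)  # HLC timestamp (pt, l)

    def on_heartbeat(self, leader_hlc):
        # Safe time only moves forward, even if heartbeats arrive out of order.
        self.safe_time = max(self.safe_time, leader_hlc)

    def can_serve_locally(self, read_ts):
        # Reads below safe time see every committed write up to that point.
        return read_ts < self.safe_time


f = Follower()
f.on_heartbeat((2000, 0))
f.on_heartbeat((1990, 4))               # stale heartbeat, ignored
print(f.safe_time)                      # (2000, 0)
print(f.can_serve_locally((1999, 7)))   # True
print(f.can_serve_locally((2000, 0)))   # False: must wait or ask the leader
```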

HLC vs. Other Clock Approaches

vs. Lamport Clocks

Lamport clocks are simpler (a single integer counter) but have no physical time information. HLC adds physical time awareness with minimal extra complexity (one more counter). For systems that never need wall-clock semantics, Lamport clocks suffice. For databases with TTLs, lease management, or user-facing timestamps, HLC is worth the small additional cost.

vs. Vector Clocks

Vector clocks can detect concurrent events (neither causally ordered). HLC cannot, because it gives a total order (compare pt first, then l). Vector clocks grow with the number of nodes (O(n) space). HLC is constant size. For systems with hundreds or thousands of nodes, the space difference is significant.
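The difference shows up directly in how the two kinds of timestamp compare. In this sketch, `vc_compare` is a hypothetical helper and vector clocks are modeled as dicts from node name to counter.

```python
def vc_compare(a, b):
    """Compare two vector clocks given as dicts of node -> counter."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Two writes that never saw each other: vector clocks report it.
print(vc_compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))  # concurrent

# HLC timestamps are (pt, l) tuples and always compare one way or the other.
hlc_a, hlc_b = (100, 0), (100, 1)
print(hlc_a < hlc_b)  # True, whether or not the events were causally related
```

The vector clock needs one entry per node, while the HLC timestamp stays at two integers regardless of cluster size.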

For a replicated key-value store where you need to detect concurrent writes for conflict resolution, vector clocks (or dotted version vectors) are the right choice. For a distributed SQL database where you need timestamp-based MVCC, HLC is the right choice.

vs. TrueTime

TrueTime provides hard bounds on clock uncertainty using specialized hardware. HLC provides soft bounds using NTP. TrueTime enables commit-wait for external consistency (Spanner). HLC uses uncertainty intervals and transaction restarts (CockroachDB).

TrueTime gives tighter, guaranteed bounds but requires Google-grade infrastructure. HLC is the pragmatic alternative for everyone else.

NTP Considerations

HLC's guarantees depend on NTP keeping clocks reasonably synchronized. In practice:

Public NTP (pool.ntp.org): clock skew typically 1-10ms, sometimes up to 100ms during network issues. This is fine for most applications, but it forces a conservative max_offset, which keeps CockroachDB's uncertainty intervals large.

Cloud provider NTP (Amazon Time Sync, Google Cloud NTP): clock skew typically under 1ms. Much better for HLC-based systems. CockroachDB on AWS with Amazon Time Sync sees very few uncertainty restarts.

Chrony vs. ntpd: Chrony generally achieves tighter synchronization than traditional ntpd, especially after network disruptions. Many modern Linux distributions (RHEL and Fedora, for example) default to Chrony.

Clock jumps: NTP can step the clock forward or backward if the drift is too large. A backward jump is dangerous for HLC because pt might suddenly be far ahead of now(). HLC handles this gracefully (pt never decreases, logical counter absorbs the gap), but the logical counter grows until real time catches up with pt.
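A short trace makes the backward-jump behavior concrete. The `hlc_tick` function restates the local-event rule, and the clock readings are made-up values in which the clock is stepped back by 3 ms.

```python
def hlc_tick(pt, l, wall):
    """Local-event rule: adopt wall time if it is ahead, else bump l."""
    if wall > pt:
        return wall, 0
    return pt, l + 1


pt, l = 0, 0
history = []
# Physical clock: normal, stepped back to 998, then catching up again.
for wall in [1000, 1001, 998, 999, 1000, 1001, 1002]:
    pt, l = hlc_tick(pt, l, wall)
    history.append((wall, pt, l))
    print(wall, (pt, l))
# 1000 (1000, 0)
# 1001 (1001, 0)
# 998 (1001, 1)
# 999 (1001, 2)
# 1000 (1001, 3)
# 1001 (1001, 4)
# 1002 (1002, 0)
```

pt holds at 1001 while the logical counter absorbs the gap, and the counter resets as soon as the physical clock passes pt again.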

Monitoring clock skew in production (using metrics from Chrony or NTP) is essential for HLC-based systems. Alert on skew above your configured max_offset and investigate before it causes problems.
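A minimal sketch of such an alert rule, assuming per-node offsets have already been collected from Chrony or NTP by some other means. The function name `skew_alerts`, the input shape, and the `warn_fraction` early-warning threshold are all hypothetical.

```python
def skew_alerts(offsets_ms, max_offset_ms, warn_fraction=0.5):
    """Classify nodes by measured clock offset (signed, in ms).

    Returns a dict of node -> "warning" or "critical"; healthy nodes are omitted.
    """
    alerts = {}
    for node, offset in offsets_ms.items():
        skew = abs(offset)
        if skew > max_offset_ms:
            alerts[node] = "critical"   # skew exceeds the configured bound
        elif skew > warn_fraction * max_offset_ms:
            alerts[node] = "warning"    # early warning before the bound is hit
    return alerts


measured = {"n1": 0.4, "n2": -270.0, "n3": 620.0}
print(skew_alerts(measured, max_offset_ms=500))
# {'n2': 'warning', 'n3': 'critical'}
```

Warning well below max_offset leaves time to investigate before the skew starts affecting correctness.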

