Timers, Clocks & High-Resolution Timers
Mental Model
Three clocks on the same wall. The first syncs with the radio station every hour and occasionally jumps backward when it corrects itself -- great for knowing what time it is right now, unreliable for timing a recipe. The second started when the house was built and never adjusts. Perfect for measuring durations, but it pauses during power outages. The third also started at construction but keeps counting through outages -- the only way to know true total elapsed time. Pick the wrong clock and dinner burns or the alarm never fires.
The Problem
NTP corrects the clock backward by 5 seconds at 2 AM, and 47 services using CLOCK_REALTIME for their watchdog deadlines simultaneously think they are overdue. Cascade restart. Meanwhile, a 1ms setTimeout fires at 4-6ms on a 250Hz kernel, and the trading logic behind it loses $12,000 per missed millisecond. On a VM after live migration, System.nanoTime() goes backward, producing negative latency measurements that corrupt percentile histograms built from 5 million calls per second. And a process that creates 10,000 POSIX timers exhausts kernel memory for signal slots -- timer_create starts returning EAGAIN.
Architecture
A 10-second timeout is set. NTP adjusts the clock backward by 5 seconds. The timeout takes 15 seconds.
This is not a hypothetical. It happens in production. And the fix is embarrassingly simple: the wrong clock was in use.
Linux has multiple clocks for a reason. Pick the wrong one and timeouts break. Pick the right one and they work even through NTP adjustments, CPU frequency changes, and laptop suspends.
What Actually Happens
At boot, the kernel's clocksource framework evaluates available hardware timers -- TSC (per-CPU, ~1ns resolution), HPET (memory-mapped, ~100ns), ACPI PM Timer (legacy fallback) -- and selects the best one. The timekeeping subsystem converts raw counter readings into nanosecond timestamps, maintaining separate epochs for each clock type.
CLOCK_REALTIME tracks wall clock time. It can be adjusted by NTP, settimeofday, or adjtimex. It can jump backward. Use it for log timestamps and calendar scheduling.
CLOCK_MONOTONIC starts at boot and only moves forward. NTP can slew it (gradually speed up or slow down), but it never jumps. Use it for timeouts, elapsed time, and performance measurement.
CLOCK_BOOTTIME is CLOCK_MONOTONIC plus time spent in system suspend. A 10-minute CLOCK_MONOTONIC timer will not fire if the laptop sleeps for 2 hours -- CLOCK_BOOTTIME will fire immediately on wake.
The kernel offers two timer mechanisms. The timer wheel (struct timer_list) is a hierarchical timing wheel with O(1) insertion, optimized for the common case where most timeouts are cancelled before they fire (TCP retransmits, device polling). It runs at jiffies granularity (typically 4ms). hrtimers use a per-CPU red-black tree sorted by expiry time, backed by the local APIC timer. They provide nanosecond resolution and power nanosleep(), POSIX timers, and the scheduler tick.
For event-driven servers, timerfd turns timers into file descriptors. timerfd_create() makes an fd. timerfd_settime() arms it. The fd becomes readable when the timer expires -- integrating naturally with epoll, select, or poll. No signal handlers. No async-signal-safe constraints. systemd's sd-event loop is built on timerfd; event libraries such as libuv get the same single-loop integration by deriving the epoll_wait timeout from their nearest pending timer.
Under the Hood
The vDSO makes clock_gettime essentially free. The kernel maps a read-only vvar page into every process, containing current time values and TSC calibration parameters. When clock_gettime(CLOCK_MONOTONIC) is called, the vDSO reads the vvar page, reads the TSC (~25 cycles), and computes the time. All in user space. ~20ns instead of ~200ns for a real syscall.
Timer slack saves power. The kernel rounds timer expiries to align with other pending timers, reducing CPU wakeups. Default slack: 50 microseconds, settable via prctl(PR_SET_TIMERSLACK). This is why a 1ms setTimeout in an application rarely fires at exactly 1ms. For real-time tasks the kernel forces slack to 0.
Tickless operation. By default, the kernel fires a tick at CONFIG_HZ frequency (250 Hz = every 4ms). CONFIG_NO_HZ_IDLE stops the tick when the CPU is idle (saves power). CONFIG_NO_HZ_FULL stops it even when one task is running -- eliminating jitter for latency-sensitive workloads (HFT, real-time audio).
Common Questions
Why does Linux have both CLOCK_MONOTONIC and CLOCK_BOOTTIME?
CLOCK_MONOTONIC stops counting during suspend. Setting a 10-minute timer before the laptop sleeps for 2 hours means the timer fires 10 minutes after wake -- 2 hours and 10 minutes after it was set. CLOCK_BOOTTIME fires immediately on wake because the 10 minutes elapsed during sleep. Android's AlarmManager uses CLOCK_BOOTTIME for wake-up alarms.
How does timerfd improve on signal-based timers?
Signal-based timers (timer_create with SIGEV_SIGNAL) deliver via signals, which have pitfalls: handlers must be async-signal-safe (no malloc, no printf), signals can be lost, and signal handling interacts poorly with threads. timerfd makes timers into file descriptors that integrate with the same epoll loop as sockets. No signal handlers. No races.
What causes clock_gettime to return non-monotonic values?
On older CPUs without constant_tsc, the TSC varies with CPU frequency. If a thread reads TSC on one core, migrates, and reads on another, the second reading can be lower. Modern CPUs with constant_tsc and nonstop_tsc guarantee synchronization. In VMs, the hypervisor must offset vCPU TSCs -- a known source of timekeeping issues during live migration.
What is resolution vs precision?
Resolution is the smallest increment the clock can represent (clock_getres returns 1ns). Precision is the actual accuracy -- it depends on TSC frequency, interrupt latency, and timer slack. A clock with 1ns resolution and 1us precision means readings are accurate to ~1us despite representing 1ns granularity.
How Technologies Use This
At 2 AM, NTP corrects the system clock backward by 10 seconds. Dozens of services with WatchdogSec suddenly think they have missed their deadlines, and systemd restarts them all simultaneously. The cascade of false watchdog kills takes down half the services on the machine.
The cause is using wall-clock time (CLOCK_REALTIME) for timeout calculations. When NTP adjusts the clock backward, every active deadline computed from wall-clock time shifts into the future or past. A 30-second watchdog timeout becomes a 40-second wait or appears to have already expired, depending on the direction of the adjustment.
systemd uses CLOCK_MONOTONIC for all internal deadlines including WatchdogSec, RestartSec, and OnBootSec, which is immune to NTP adjustments. Only OnCalendar= scheduling uses CLOCK_REALTIME, because calendar dates genuinely require wall-clock correlation. This design choice prevents a wave of spurious watchdog restarts every time NTP steps the clock.
A 10-second HTTP client timeout in a Go service occasionally takes 15 seconds to fire. The issue is intermittent, happening only at specific times of day, and there is no network delay or server-side slowdown to explain it.
The root cause is that if the runtime used CLOCK_REALTIME, an NTP backward adjustment of 5 seconds mid-wait silently extends every active timeout by 5 seconds. The timeout was counting wall-clock seconds, not elapsed seconds, so a clock correction during the wait period stretches the deadline without any notification.
Go avoids this entirely by using CLOCK_MONOTONIC for time.After, time.Ticker, and scheduler preemption, all read via the vDSO at roughly 20ns per call without any syscall. Go deliberately never uses CLOCK_REALTIME for timeouts, deadlines, or internal scheduling. In a microservice making 10K requests per second, this saves about 200K unnecessary kernel transitions per second compared to a real syscall approach.
Lock.tryLock returns immediately without waiting, Thread.sleep wakes early, and latency histograms report impossible sub-zero durations. The application behavior is non-deterministic and only occurs during specific time windows.
The cause is that System.nanoTime() went backward between two calls in the same thread. A backward clock jump makes tryLock compute a negative remaining wait time, causes sleep to think the target time has already passed, and produces negative latency measurements when the end timestamp is smaller than the start timestamp.
The JVM prevents this by mapping System.nanoTime() to CLOCK_MONOTONIC via the vDSO, which the JIT compiler inlines to roughly 25ns per call. System.currentTimeMillis() uses CLOCK_REALTIME only for log timestamps where wall-clock correlation matters. On a server calling nanoTime() 5 million times per second for metrics collection, the vDSO path avoids 5 million kernel transitions per second -- at ~180ns saved per call, close to a full CPU core's worth of time.
setTimeout(1, callback) fires after 4-6 milliseconds instead of 1, and timing-sensitive operations in Node.js are consistently late. Developers assume the event loop is overloaded, but even an idle Node.js process shows the same delay.
The kernel's timer slack intentionally rounds expiry times to align with other pending timers, reducing CPU wakeups at the cost of precision. On a default 250Hz kernel, the minimum effective granularity is 4ms regardless of the value passed to setTimeout. This is not a Node.js bug -- it is a deliberate kernel power-saving optimization.
libuv sidesteps the problem at the event-loop level: it keeps a min-heap of timers keyed on a cached CLOCK_MONOTONIC reading and passes the nearest expiry as the timeout argument to epoll_wait, so timers and socket I/O share one loop with no signals involved. process.hrtime.bigint() provides nanosecond-resolution monotonic readings via the vDSO at roughly 20ns per call with zero syscall cost, making it far more precise than millisecond-resolution Date.now() for benchmarking hot code paths.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Monotonic clock | Container shares host clocksource | System.nanoTime() maps to CLOCK_MONOTONIC via vDSO | process.hrtime.bigint() uses CLOCK_MONOTONIC | time.Now() uses CLOCK_MONOTONIC for monotonic component | Lease durations use monotonic clock |
| Wall clock | Container inherits host CLOCK_REALTIME | System.currentTimeMillis() uses CLOCK_REALTIME | Date.now() uses CLOCK_REALTIME | time.Now().Unix() uses CLOCK_REALTIME | Certificate expiry checks use wall clock |
| Timer precision | Limited by host CONFIG_HZ | ScheduledExecutorService wraps hrtimers | libuv timerfd + epoll; setTimeout min ~4ms | time.After uses runtime timers (~1ms min) | Health check probes have 1s minimum granularity |
| NTP vulnerability | All containers affected by host NTP | Only System.currentTimeMillis() affected | Only Date.now() affected | Only the wall-clock reading of time.Now() affected | Pod clocks drift if NTP not configured on nodes |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Hardware | TSC (x86, ~1ns), HPET (~100ns), ACPI PM Timer (legacy) |
| Kernel clocksource | clocksource framework selects best hardware source |
| Kernel timers | timer_list wheel (jiffies, coarse) and hrtimer tree (ns, precise) |
| vDSO | vvar page + TSC read = ~20ns clock_gettime without syscall |
| Userspace | timerfd for event loops, POSIX timers for signal-based, clock_nanosleep for precise wakeup |
Design Rationale: Wall time and elapsed time are fundamentally different measurements -- one answers "what time is it?" and the other answers "how long has it been?" -- and conflating them causes real failures when NTP adjusts the clock. Separate clock IDs keep those concerns apart. The vDSO exists because reading the time is the single most frequent kernel interaction in most applications, and paying 200ns per call for a ring transition is absurd when a shared memory page and a TSC read can do it in 20ns. Timer slack rounds expiries to batch wakeups because waking the CPU once for five timers uses far less power than waking it five times -- the right tradeoff for everything except real-time workloads.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Timeouts fire 5-10 seconds late after NTP sync | Using CLOCK_REALTIME for deadline calculation | Audit code for CLOCK_REALTIME in timeout paths |
| setTimeout(1ms) fires at 4-6ms | Kernel CONFIG_HZ=250 and timer slack rounding | grep CONFIG_HZ /boot/config-$(uname -r) |
| Negative elapsed time measurements | TSC not synchronized across cores (VM or old CPU) | grep constant_tsc /proc/cpuinfo |
| Timer on laptop fires 2 hours late after resume | Using CLOCK_MONOTONIC instead of CLOCK_BOOTTIME | Switch to CLOCK_BOOTTIME for suspend-aware timers |
| Thousands of timer_create calls fail with EAGAIN | Too many POSIX timers per process | Replace with single timerfd + userspace wheel |
| clock_gettime takes 200ns instead of 20ns | vDSO not available or clocksource fell back to HPET | cat /sys/devices/system/clocksource/clocksource0/current_clocksource |
When to Use / Avoid
- Use CLOCK_MONOTONIC for all timeouts, deadlines, elapsed time measurements, and performance benchmarks
- Use CLOCK_REALTIME only for log timestamps, calendar scheduling, and human-readable wall clock display
- Use CLOCK_BOOTTIME for mobile wake-up alarms and any timer that must survive system suspend
- Use timerfd when timers need to integrate with epoll-based event loops (replaces signal-based POSIX timers)
- Avoid CLOCK_REALTIME for any duration calculation -- NTP adjustments will corrupt results
- Avoid creating thousands of POSIX timers per process -- use a single timerfd or userspace timing wheel
Try It Yourself
# Show the current clock source in use
cat /sys/devices/system/clocksource/clocksource0/current_clocksource 2>/dev/null && cat /sys/devices/system/clocksource/clocksource0/available_clocksource 2>/dev/null

# Show kernel timer frequency (CONFIG_HZ)
grep CONFIG_HZ /boot/config-$(uname -r) 2>/dev/null || echo 'Config not available'

# View all active timers in the kernel
sudo cat /proc/timer_list 2>/dev/null | head -40

# Measure clock_gettime resolution
python3 -c 'import time; times = [time.clock_gettime(time.CLOCK_MONOTONIC) for _ in range(10)]; diffs = [times[i]-times[i-1] for i in range(1,10)]; print(f"Min delta: {min(diffs)*1e9:.0f}ns, Mean: {sum(diffs)/len(diffs)*1e9:.0f}ns")' 2>/dev/null || echo 'python3 not available'

# Check TSC reliability flags
grep -oE '(constant_tsc|nonstop_tsc|tsc_reliable|tsc_known_freq)' /proc/cpuinfo 2>/dev/null | sort -u || echo 'Not x86 or no TSC flags'

# Show timer slack for current process
cat /proc/$$/timerslack_ns 2>/dev/null || echo 'timerslack_ns not available'

Debug Checklist
1. cat /sys/devices/system/clocksource/clocksource0/current_clocksource
2. grep -oE '(constant_tsc|nonstop_tsc)' /proc/cpuinfo | sort -u
3. cat /proc/timer_list | head -60
4. grep CONFIG_HZ /boot/config-$(uname -r) 2>/dev/null
5. cat /proc/$$/timerslack_ns
6. perf stat -e 'timer:*' -a -- sleep 1
Key Takeaways
- ✓ CLOCK_MONOTONIC never goes backward -- not during NTP adjustments, not during manual time changes. Use it for elapsed time and timeouts. CLOCK_REALTIME tracks wall clock time and CAN jump backward. Using it for timeouts causes hangs or early wakes.
- ✓ CLOCK_BOOTTIME includes time spent in suspend. A 10-minute CLOCK_MONOTONIC timer will not fire if the laptop sleeps for an hour. CLOCK_BOOTTIME will fire immediately on wake because the 10 minutes elapsed during sleep.
- ✓ The kernel tick runs at CONFIG_HZ (typically 250 Hz = 4ms). With CONFIG_NO_HZ_FULL, the tick stops entirely when one task is running -- reducing jitter for latency-sensitive workloads at the cost of slightly higher overhead when the tick fires.
- ✓ clock_gettime via the vDSO reads a shared page and the TSC -- no syscall, ~20ns. Most benchmarks measuring 'syscall overhead' with clock_gettime are actually measuring vDSO speed, not syscall cost.
- ✓ Timer slack rounds timer expiries to align with other timers, reducing CPU wakeups. Default is 50us for non-RT tasks. That is why setTimeout(1ms) rarely fires at 1ms. prctl(PR_SET_TIMERSLACK, 1) requests the minimum slack -- passing 0 resets to the default -- and the kernel already forces slack to 0 for real-time tasks.
Common Pitfalls
- ✗ Mistake: Using CLOCK_REALTIME for timeout calculations. Reality: NTP can adjust the clock backward, turning a 10-second timeout into a 15-second wait. Always use CLOCK_MONOTONIC for deadlines and elapsed time.
- ✗ Mistake: Expecting nanosleep() to wake at exactly the requested time. Reality: The kernel rounds to timer resolution and adds slack. On a 250 Hz kernel, nanosleep(1ms) typically sleeps for 4ms. Use clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) for precise wakeups.
- ✗ Mistake: Creating thousands of POSIX timers per process. Reality: Each consumes kernel memory and a signal slot. Use a single timerfd with the nearest expiry, or a userspace timing wheel.
- ✗ Mistake: Assuming TSC is synchronized across CPU cores. Reality: Older or misconfigured systems (NUMA, VMs without TSC offsetting) have per-core TSC drift. Check for constant_tsc and nonstop_tsc CPU flags. Modern CPUs are safe.
Reference
In One Line
MONOTONIC for timeouts, REALTIME for display, BOOTTIME for mobile alarms -- picking the wrong clock corrupts deadlines silently and the bug only shows up when NTP corrects.