On-Call Health Metrics

Why On-Call Health Matters

On-call is where reliability work meets human limits. A team can have perfect SLOs on paper, but if the on-call rotation is burning people out, those results are not sustainable. Engineers who are exhausted from overnight pages write worse code, review less carefully, and eventually leave. On-call health is a retention metric disguised as an operations metric.

Core Metrics to Track

Pages per shift: Count every page that fires during each on-call shift, meaning individual alerts rather than incidents. More than 2 actionable pages per 12-hour shift is unsustainable long-term. More than 5 and you have an alert quality problem. Track this per shift, and split the count into actionable pages and noise (false positives, auto-resolved alerts, informational alerts).
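The per-shift count can be sketched as follows. This is a minimal illustration, not a PagerDuty integration: the `Page` record, its `shift_id` and `actionable` fields, and the status labels are hypothetical names, and the sketch assumes the 5-page threshold applies to actionable pages.

```python
from collections import Counter
from dataclasses import dataclass

# Thresholds from the text: more than 2 actionable pages per 12-hour shift
# is unsustainable; more than 5 suggests an alert quality problem.
SUSTAINABLE_LIMIT = 2
ALERT_QUALITY_LIMIT = 5


@dataclass
class Page:
    shift_id: str     # hypothetical identifier for one 12-hour shift
    actionable: bool  # False for false positives, auto-resolved, informational


def summarize_shifts(pages):
    """Return {shift_id: (actionable_count, noise_count, status)}."""
    actionable = Counter()
    noise = Counter()
    for page in pages:
        if page.actionable:
            actionable[page.shift_id] += 1
        else:
            noise[page.shift_id] += 1

    summary = {}
    for shift_id in sorted(set(actionable) | set(noise)):
        count = actionable[shift_id]
        # Assumption: both thresholds are applied to actionable pages.
        if count > ALERT_QUALITY_LIMIT:
            status = "alert-quality-problem"
        elif count > SUSTAINABLE_LIMIT:
            status = "unsustainable"
        else:
            status = "ok"
        summary[shift_id] = (count, noise[shift_id], status)
    return summary
```

Keeping the noise count next to the actionable count makes it easy to see whether a bad shift was caused by real incidents or by alert quality.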

Sleep disruption: Pages between 10pm and 7am deserve special tracking. One overnight page ruins the next day's productivity. Two in one night can mean calling in sick. Track the percentage of on-call shifts that include at least one overnight page. If it's above 30%, you need to tune alert thresholds, defer non-critical alerts to business hours, or add a follow-the-sun rotation.
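A rough sketch of the overnight calculation, assuming page timestamps are already in the on-call engineer's local time and that shifts with no pages still appear in the input with an empty list. The function and variable names are illustrative.

```python
from datetime import datetime, time

NIGHT_START = time(22, 0)    # 10pm, from the text
NIGHT_END = time(7, 0)       # 7am, from the text
DISRUPTION_THRESHOLD = 0.30  # 30% of shifts, from the text


def is_overnight(timestamp):
    """True if a local-time timestamp falls in the 10pm to 7am window."""
    t = timestamp.time()
    # The window wraps past midnight, so it is a union of two ranges.
    return t >= NIGHT_START or t < NIGHT_END


def overnight_shift_fraction(pages_by_shift):
    """pages_by_shift maps shift_id -> list of page datetimes (may be empty)."""
    if not pages_by_shift:
        return 0.0
    disrupted = sum(
        1
        for stamps in pages_by_shift.values()
        if any(is_overnight(ts) for ts in stamps)
    )
    return disrupted / len(pages_by_shift)


def needs_attention(pages_by_shift):
    """True when more than 30% of shifts had at least one overnight page."""
    return overnight_shift_fraction(pages_by_shift) > DISRUPTION_THRESHOLD
```

Shifts without any pages must be included in the denominator, otherwise the percentage overstates the disruption.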

Interrupt rate: Beyond pages, how often does the on-call engineer get pulled into Slack threads, customer escalations, or "quick questions"? These interruptions don't show up in PagerDuty data but they fragment the on-call engineer's time. Survey your on-call engineers monthly on total interrupt burden.

Toil budget: Google's SRE guidance sets a ceiling of 50% on operational work per engineer. If your on-call engineers spend more than half their time on repetitive, manual, automatable tasks, the balance is wrong. Measure time spent on toil vs engineering projects for each engineer on a rolling monthly basis.
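The toil check reduces to a ratio per engineer. The sketch below assumes hours are self-reported or pulled from a time-tracking tool and are already aggregated for the month. The input shape is a hypothetical one chosen for illustration.

```python
TOIL_CEILING = 0.50  # 50% ceiling on operational work, from the text


def toil_fraction(toil_hours, engineering_hours):
    """Share of tracked time spent on toil, between 0.0 and 1.0."""
    total = toil_hours + engineering_hours
    if total == 0:
        return 0.0
    return toil_hours / total


def over_toil_budget(hours_by_engineer):
    """hours_by_engineer maps name -> (toil_hours, engineering_hours).

    Returns the names whose toil share exceeds the ceiling, sorted.
    """
    return sorted(
        name
        for name, (toil, engineering) in hours_by_engineer.items()
        if toil_fraction(toil, engineering) > TOIL_CEILING
    )
```

For example, `over_toil_budget({"ana": (90, 70), "bo": (40, 120)})` flags only `ana`, whose toil share is 90 of 160 hours.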

Load Distribution Fairness

A 6-person rotation where everyone takes equal shifts sounds fair. But if one shift consistently gets 5x the page volume (say, the shift that covers the weekly batch job window), the rotation is not actually equitable. Track cumulative pages per person over a quarter. If one engineer has absorbed 3x the pages of their teammates, you need to rebalance.

Some teams weight on-call shifts by expected load and give compensatory time off or reduce other responsibilities for high-load shifts. The specifics depend on your organization, but the principle is the same: measure the actual burden, not just the schedule.
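One way to check the quarterly distribution is to compare each engineer's page count against the median of their teammates. Using the median as the baseline is an assumption made here so that a single outlier does not hide itself by inflating the comparison value; the text only specifies the 3x ratio.

```python
from statistics import median

IMBALANCE_RATIO = 3.0  # "3x the pages of their teammates", from the text


def overloaded_engineers(pages_by_engineer):
    """pages_by_engineer maps name -> cumulative pages over the quarter.

    Returns the sorted names of engineers whose count is at least
    IMBALANCE_RATIO times the median count of everyone else.
    """
    flagged = []
    for name, count in pages_by_engineer.items():
        others = [c for n, c in pages_by_engineer.items() if n != name]
        if not others:
            continue  # a one-person rotation has nobody to compare against
        baseline = median(others)
        # If teammates received no pages at all, any pages count as imbalance.
        if count > 0 and count >= IMBALANCE_RATIO * baseline:
            flagged.append(name)
    return sorted(flagged)
```

With counts of 45, 12, 15, 10, 14, and 13 for a 6-person rotation, only the engineer with 45 pages is flagged, because the median of the other five is 13.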

Escalation Frequency

An escalation is any incident where the on-call engineer has to pull in a senior engineer or another team. High escalation rates (above 20% of incidents) point to one of three problems: runbooks are incomplete, the on-call engineer lacks context for the service, or the incidents are genuinely beyond the on-call scope.

Break down escalations by cause. If most escalations happen because the runbook doesn't cover the scenario, invest in better documentation. If they happen because the service is too complex for a single person to debug, consider splitting the service or creating better diagnostic tooling.
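The rate and the per-cause breakdown can be computed together. This sketch assumes each incident record carries an `escalated` flag and, when escalated, a `cause` label assigned during review. Both field names and the cause labels are hypothetical.

```python
from collections import Counter

ESCALATION_THRESHOLD = 0.20  # 20% of incidents, from the text


def escalation_report(incidents):
    """incidents is a list of dicts with 'escalated' (bool) and optional 'cause'."""
    total = len(incidents)
    escalated = [i for i in incidents if i["escalated"]]
    rate = len(escalated) / total if total else 0.0
    # Escalations without a recorded cause are grouped under "unknown".
    causes = Counter(i.get("cause", "unknown") for i in escalated)
    return {
        "rate": rate,
        "high": rate > ESCALATION_THRESHOLD,
        "causes": causes.most_common(),  # most frequent cause first
    }
```

Sorting causes by frequency puts the best investment first: a list led by runbook gaps points to documentation, while one led by service complexity points to tooling or a service split.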

Connecting to Retention

Track voluntary attrition alongside on-call health metrics. Teams with high page volumes and frequent sleep disruption consistently show higher attrition rates. The correlation is strong enough that on-call health should be a standing item in engineering leadership reviews. By the time someone hands in their notice citing on-call burnout, you lost the battle six months ago.
