On-Call Health Metrics
Why On-Call Health Matters
On-call is where reliability work meets human limits. A team can hit every SLO on paper, but if the on-call rotation is burning people out, that reliability is not sustainable. Engineers who are exhausted from overnight pages write worse code, review less carefully, and eventually leave. On-call health is a retention metric disguised as an operations metric.
Core Metrics to Track
Pages per shift: Count the total pages (not incidents, but individual alerts) that fire during each on-call shift. More than 2 actionable pages per 12-hour shift is unsustainable long-term. More than 5 and you have an alert quality problem. Track this per shift, and break it down into actionable pages vs noise (false positives, auto-resolved, informational).
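The per-shift breakdown can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Page` record, the outcome labels, and `summarize_shifts` are hypothetical names, and it assumes both thresholds apply to actionable pages and that any page not labeled as noise counts as actionable.

```python
from collections import Counter
from dataclasses import dataclass

# Thresholds taken from the text above; tune them for your own team.
UNSUSTAINABLE_ACTIONABLE = 2  # more than this per 12-hour shift
ALERT_QUALITY_PROBLEM = 5     # more than this per 12-hour shift

# Hypothetical outcome labels; map your paging tool's fields onto these.
NOISE_OUTCOMES = {"false_positive", "auto_resolved", "informational"}


@dataclass(frozen=True)
class Page:
    shift_id: str  # identifier of the 12-hour shift the page fired in
    outcome: str   # "actionable" or one of NOISE_OUTCOMES


def summarize_shifts(pages):
    """Return {shift_id: {"actionable": n, "noise": n, "status": str}}."""
    actionable = Counter()
    noise = Counter()
    for page in pages:
        if page.outcome in NOISE_OUTCOMES:
            noise[page.shift_id] += 1
        else:
            # Assumption: anything not labeled as noise is actionable.
            actionable[page.shift_id] += 1

    summary = {}
    for shift_id in sorted(set(actionable) | set(noise)):
        count = actionable[shift_id]
        if count > ALERT_QUALITY_PROBLEM:
            status = "alert quality problem"
        elif count > UNSUSTAINABLE_ACTIONABLE:
            status = "unsustainable"
        else:
            status = "ok"
        summary[shift_id] = {
            "actionable": count,
            "noise": noise[shift_id],
            "status": status,
        }
    return summary
```

Keeping the noise count next to the actionable count makes it easy to see whether a heavy shift was caused by real problems or by poor alerting.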
Sleep disruption: Pages between 10pm and 7am deserve special tracking. One overnight page ruins the next day's productivity. Two in one night can mean calling in sick. Track the percentage of on-call shifts that include at least one overnight page. If it's above 30%, you need to tune alert thresholds, defer non-critical alerts to business hours, or add a follow-the-sun rotation.
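A rough sketch of the overnight-page percentage follows. It assumes page timestamps have already been converted to the on-call engineer's local time, and that shifts with no pages are included in the input so they count toward the denominator. The function names are illustrative.

```python
from datetime import datetime, time

NIGHT_START = time(22, 0)   # 10pm, from the text above
NIGHT_END = time(7, 0)      # 7am, from the text above
MAX_DISRUPTED_SHARE = 0.30  # the 30% threshold from the text above


def is_overnight(fired_at):
    """True if a local timestamp falls in the 10pm-7am window.

    The window wraps past midnight, so it is an OR of two comparisons.
    """
    t = fired_at.time()
    return t >= NIGHT_START or t < NIGHT_END


def disrupted_shift_share(pages_by_shift):
    """Fraction of shifts with at least one overnight page.

    pages_by_shift maps shift_id -> list of local datetimes. Shifts with
    no pages should appear with an empty list.
    """
    if not pages_by_shift:
        return 0.0
    disrupted = sum(
        1
        for fired in pages_by_shift.values()
        if any(is_overnight(ts) for ts in fired)
    )
    return disrupted / len(pages_by_shift)
```

Comparing the result against `MAX_DISRUPTED_SHARE` gives the trigger for the remediation options listed above.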
Interrupt rate: Beyond pages, how often does the on-call engineer get pulled into Slack threads, customer escalations, or "quick questions"? These interruptions don't show up in PagerDuty data but they fragment the on-call engineer's time. Survey your on-call engineers monthly on total interrupt burden.
Toil budget: Google's SRE practice caps operational work at 50% of each engineer's time. If your on-call engineers spend more than half their time on repetitive, manual, automatable tasks, the balance is wrong. Measure time spent on toil vs engineering projects for each engineer on a rolling monthly basis.
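One way to compute the toil share, assuming engineers log hours against two categories. The tuple layout and function names are hypothetical, and this sketch groups by calendar month as a simple stand-in for a rolling window.

```python
from collections import defaultdict

TOIL_CEILING = 0.50  # the 50% ceiling described above


def toil_share_by_engineer(entries):
    """Return {(engineer, month): toil_fraction}.

    entries is an iterable of (engineer, month, category, hours) tuples,
    where category must be "toil" or "engineering".
    """
    totals = defaultdict(lambda: {"toil": 0.0, "engineering": 0.0})
    for engineer, month, category, hours in entries:
        totals[(engineer, month)][category] += hours

    shares = {}
    for key, hours in totals.items():
        total = hours["toil"] + hours["engineering"]
        if total > 0:  # skip months with no logged time
            shares[key] = hours["toil"] / total
    return shares


def over_ceiling(shares):
    """Sorted list of (engineer, month) pairs whose toil share exceeds the ceiling."""
    return sorted(key for key, share in shares.items() if share > TOIL_CEILING)
```

For example, an engineer who logs 90 hours of toil and 70 hours of project work in a month has a toil share of 90 / 160, about 56%, and would be flagged.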
Load Distribution Fairness
A 6-person rotation where everyone takes an equal number of shifts sounds fair. But if one shift consistently gets 5x the page volume (say, the shift that covers the weekly batch job window), the rotation is not actually equitable. Track cumulative pages per person over a quarter. If one engineer has absorbed 3x the pages of their teammates, you need to rebalance.
Some teams weight on-call shifts by expected load and give compensatory time off or reduce other responsibilities for high-load shifts. The specifics depend on your organization, but the principle is the same: measure the actual burden, not just the schedule.
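The per-person comparison might look like the sketch below. It interprets "3x the pages of their teammates" as three times the median of the other engineers' counts, which is an assumption rather than a standard definition. `overloaded_engineers` is a hypothetical helper.

```python
from collections import Counter
from statistics import median

REBALANCE_RATIO = 3.0  # the 3x threshold from the text above


def overloaded_engineers(page_owners, roster):
    """Return {engineer: ratio} for engineers at or above the rebalance ratio.

    page_owners has one engineer name per page received in the quarter.
    roster lists everyone on the rotation, including people with no pages.
    """
    counts = Counter(page_owners)
    flagged = {}
    for engineer in roster:
        others = [counts[name] for name in roster if name != engineer]
        if not others:
            continue  # a one-person rotation has nobody to compare against
        baseline = median(others)
        # A zero baseline makes the ratio undefined, so it is skipped here.
        if baseline > 0 and counts[engineer] >= REBALANCE_RATIO * baseline:
            flagged[engineer] = counts[engineer] / baseline
    return flagged
```

Using the median of teammates rather than the mean keeps a single quiet or noisy outlier from hiding an imbalance.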
Escalation Frequency
An escalation is any incident the on-call engineer hands off to a senior engineer or another team. High escalation rates (above 20% of incidents) point to one of three problems: runbooks are incomplete, the on-call engineer lacks context for the service, or the incidents are genuinely beyond the on-call scope.
Break down escalations by cause. If most escalations happen because the runbook doesn't cover the scenario, invest in better documentation. If they happen because the service is too complex for a single person to debug, consider splitting the service or creating better diagnostic tooling.
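The escalation rate and its breakdown by cause can be computed as follows. This is a sketch under assumptions: each incident is a dict with a boolean `escalated` field and an optional free-form `cause` label, both hypothetical field names.

```python
from collections import Counter

ESCALATION_RATE_THRESHOLD = 0.20  # the 20% threshold from the text above


def escalation_report(incidents):
    """Summarize escalation rate and causes for a list of incident dicts."""
    total = len(incidents)
    escalated = [inc for inc in incidents if inc["escalated"]]
    rate = len(escalated) / total if total else 0.0
    # Escalations without a recorded cause are grouped under "unknown".
    causes = Counter(inc.get("cause", "unknown") for inc in escalated)
    return {
        "rate": rate,
        "above_threshold": rate > ESCALATION_RATE_THRESHOLD,
        "by_cause": causes.most_common(),  # most frequent cause first
    }
```

Sorting causes by frequency puts the most common gap, such as missing runbook coverage, at the top of the report.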
Connecting to Retention
Track voluntary attrition alongside on-call health metrics. Teams with high page volumes and frequent sleep disruption tend to show higher attrition rates, which is why on-call health should be a standing item in engineering leadership reviews. By the time someone hands in their notice citing on-call burnout, you lost the battle six months earlier.
Key Points
- Pages per shift is the primary load metric: more than 2 actionable pages per 12-hour shift signals unsustainable alert volume
- Google SRE recommends a maximum of 50% operational work; the rest should be engineering projects
- Sleep disruption (pages between 10pm and 7am) deserves dedicated tracking because it is closely tied to on-call burnout
- On-call load distribution should be tracked per person to ensure fairness across the team
- Escalation frequency indicates gaps in runbooks, tooling, or the on-call engineer's context for the service
Common Mistakes
- Measuring only incident count while ignoring alert noise, false positives, and duplicate pages
- Distributing on-call by calendar rotation without accounting for page volume differences across shifts
- Treating toil reduction as optional cleanup instead of a reliability investment
- Not tracking on-call health until someone burns out and quits