On-Call Rotation Design
Rotation Models
Single-timezone rotation is the simplest model. One engineer holds the pager for a week (or 3-4 days), then hands it off. This works when your team is in one region, but it means overnight pages are a reality. To make this sustainable, keep the rotation pool at 6-8 people minimum so each person is on call once every 6-8 weeks.
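The pool-size math above can be sketched as a simple round-robin schedule. This is a minimal illustration (names and dates are hypothetical), not a scheduling tool:

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Round-robin weekly rotation: each engineer holds the pager for
    one week, then hands off to the next person in the pool."""
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        schedule.append((shift_start, engineers[week % len(engineers)]))
    return schedule

# With a 6-person pool, each engineer is on call once every 6 weeks.
pool = ["ana", "ben", "chen", "dina", "eli", "fay"]
for shift_start, engineer in weekly_rotation(pool, date(2024, 1, 1), 8):
    print(shift_start, engineer)
```

Shrink the pool to 4 and the same loop shows each person paged every 4 weeks, which is where burnout risk starts.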
Follow-the-sun distributes on-call across time zones so nobody gets paged overnight. You need teams in at least two regions with a 4-6 hour overlap for handoffs. This is the gold standard for quality of life but requires enough staffing in each region to maintain a local rotation pool. Companies like PagerDuty and Datadog use this model for their own operations.
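One way to picture follow-the-sun coverage is as paging windows in UTC, with overlap hours reserved for handoffs. The three regions and their hours below are purely illustrative assumptions, not a recommended split:

```python
# Hypothetical three-region split of the 24-hour day (UTC). Adjacent
# windows overlap by one hour, which is when handoffs happen; the 4-6
# hour working-hours overlap mentioned above gives slack around that.
REGIONS = {
    "apac": range(0, 9),    # 00:00-08:59 UTC
    "emea": range(8, 17),   # 08:00-16:59 UTC
    "amer": range(16, 24),  # 16:00-23:59 UTC
}

def on_call_regions(hour_utc):
    """Regions holding the pager at a given UTC hour; two regions
    during a handoff window, one otherwise."""
    return [name for name, hours in REGIONS.items() if hour_utc in hours]
```

A quick check that every hour returns at least one region catches coverage gaps before they catch you.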
Split rotation divides on-call between business hours and after hours. Some teams have a "primary" on-call during the day and a separate "night and weekend" rotation with different (usually lighter) expectations. This works for services with low after-hours traffic.
Compensation That Works
Ignoring on-call compensation is a guaranteed way to lose engineers. The most common models:
- Flat weekly stipend: $500-1,500 per week of on-call duty, paid regardless of whether pages happen. Simple and predictable.
- Per-incident payout: $50-200 per page, with multipliers for overnight or weekend pages. Aligns compensation with actual disruption.
- Extra PTO: A day of PTO for each week of on-call. Works well where adding cash compensation is impractical (e.g., early-stage startups with tight budgets).
Many companies combine approaches: a base stipend plus per-incident bonuses for after-hours pages.
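The combined model can be expressed as a one-line payout formula. The dollar amounts are illustrative picks from the ranges above, not recommendations:

```python
# Sketch of a combined model: base weekly stipend plus per-page
# bonuses for after-hours pages, with a weekend multiplier.
# All amounts are hypothetical examples, not benchmarks.
WEEKLY_STIPEND = 750        # flat, paid regardless of pages
AFTER_HOURS_BONUS = 100     # per page outside business hours
WEEKEND_MULTIPLIER = 1.5    # weekend pages pay more

def on_call_pay(after_hours_pages, weekend_pages):
    """Total pay for one week of on-call under the combined model."""
    return (WEEKLY_STIPEND
            + after_hours_pages * AFTER_HOURS_BONUS
            + weekend_pages * AFTER_HOURS_BONUS * WEEKEND_MULTIPLIER)
```

A quiet week pays the stipend alone; a week with two overnight pages and one weekend page pays the stipend plus $350 in bonuses.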
Burnout Prevention
Track on-call health metrics: pages per week, mean time to acknowledge, mean time to resolve, and the ratio of actionable to non-actionable alerts. If a team is averaging more than 2 pages per on-call shift, the alert noise is too high and needs tuning.
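The health check described above is easy to automate. A minimal sketch, assuming each shift is recorded as a list of pages tagged with whether they were actionable (the field names and thresholds mirror the text but are otherwise hypothetical):

```python
def rotation_health(shifts, max_pages_per_shift=2, min_actionable_ratio=0.5):
    """Flag a rotation for alert tuning when the average pages per
    shift exceed the threshold or too many alerts are non-actionable.

    Each shift is a list of pages; a page is a (seconds_to_ack,
    actionable) tuple.
    """
    pages = [p for shift in shifts for p in shift]
    pages_per_shift = len(pages) / len(shifts)
    actionable_ratio = sum(1 for _, actionable in pages if actionable) / len(pages)
    return {
        "pages_per_shift": pages_per_shift,
        "actionable_ratio": actionable_ratio,
        "needs_tuning": (pages_per_shift > max_pages_per_shift
                         or actionable_ratio < min_actionable_ratio),
    }
```

Mean time to acknowledge and resolve drop in the same way: average the timing fields across `pages` and alert when the trend moves the wrong direction.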
After a particularly rough on-call week (multiple SEV1s, extended outages), give the engineer a recovery day. Don't make them jump straight back into sprint work the morning after a 3 AM incident.
Tooling
PagerDuty is the market leader with deep integrations and analytics. Opsgenie (Atlassian) is a solid alternative, especially if you're already in the Atlassian ecosystem. Rootly focuses on incident management and pairs well with either paging tool. Grafana OnCall is the open-source option. Whichever tool you pick, make sure it supports automatic escalation, schedule overrides, and on-call analytics out of the box.
Key Points
- Sustainable on-call rotations need a minimum of 6-8 people. Fewer than that and individuals end up on call too frequently, which leads to burnout and attrition
- Follow-the-sun rotations (handing off the pager across time zones) eliminate overnight pages but require at least two geographically distributed teams with sufficient overlap for clean handoffs
- On-call compensation is not optional. Whether it's a flat weekly stipend ($500-1,500/week is common in US tech), extra PTO, or per-incident payouts, uncompensated on-call tells engineers their time outside work hours has no value
- Shadow on-call pairs a new team member with an experienced on-caller for 1-2 rotations before they carry the pager solo. This builds confidence and catches knowledge gaps before they result in a botched incident response
- Escalation policies should have clear timeouts. If a primary on-call doesn't acknowledge an alert within 5 minutes, it auto-escalates to secondary. If secondary doesn't respond in 10 minutes, it hits the engineering manager
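The escalation timeouts in the last point can be sketched as a walk down an ordered chain. This is an illustrative model of the policy, not any vendor's API:

```python
# Hypothetical escalation chain: primary has 5 minutes to acknowledge,
# secondary another 10, then the engineering manager (end of chain).
ESCALATION_POLICY = [
    ("primary", 5),                 # minutes before escalating past this level
    ("secondary", 10),
    ("engineering-manager", None),  # None = final level, no further escalation
]

def who_holds_alert(minutes_unacknowledged):
    """Return the level an unacknowledged alert has escalated to."""
    remaining = minutes_unacknowledged
    for level, timeout in ESCALATION_POLICY:
        if timeout is None or remaining < timeout:
            return level
        remaining -= timeout
    return ESCALATION_POLICY[-1][0]
```

In PagerDuty or Opsgenie terms, each tuple corresponds to one escalation rule with its delay; the point is that the timeouts are explicit and ordered, not tribal knowledge.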
Common Mistakes
- Putting on-call solely on the SRE or ops team. The engineers who build the service should share on-call responsibility. Shared pain creates shared ownership of reliability
- Alerting on metrics that aren't actionable. Every page should have a corresponding runbook with clear steps. If the on-call engineer can't do anything about an alert, it shouldn't be a page
- Ignoring on-call load distribution. Some weeks are quiet, others are brutal. Track pages per rotation and rebalance if certain shifts consistently get hit harder
- Skipping the on-call handoff. A 15-minute sync between outgoing and incoming on-call (open incidents, known risks, upcoming deployments) prevents context loss and repeated triage