Alerting & On-Call
Why It Exists
Nobody watches dashboards at 3 AM. That's the whole point. Monitoring without alerting is just data collection that feels productive.
Alerting connects observability data to actual human response. When a service degrades, the right engineer should know about it within minutes. Simple enough in theory. In practice, most teams get this badly wrong.
The most common failure mode isn't missing alerts. It's alert fatigue. Too many alerts firing, too few of them worth waking up for, and eventually the on-call starts ignoring pages. Google's SRE team nailed the principle years ago: every alert must be actionable, novel, and require human intelligence. If it can be auto-remediated, automate it. If it fires constantly without anyone doing anything about it, delete it. Seriously. Delete it.
How It Works
SLO-Based Alerting
Stop setting static thresholds like "CPU > 80%". That approach generates noise and misses real problems. SLO-based alerting measures error budget burn rate instead, and it's a significant improvement.
The multi-window, multi-burn-rate approach from Google's SRE Workbook defines two burn-rate conditions:
- Fast burn: 2% of the 30-day error budget consumed in 1 hour (that's 14.4x the normal burn rate). This pages someone immediately. It catches acute incidents where things are breaking right now.
- Slow burn: 5% of the 30-day error budget consumed in 6 hours (6x the normal burn rate). This creates a ticket. It catches the gradual degradation that nobody notices until it's too late.
Why does this work so much better? Because short traffic dips and one-off request errors don't trip the alert. Only sustained impact against the error budget triggers anything. False positives drop dramatically.
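As a rough sketch, here is what those two conditions can look like as Prometheus alerting rules, assuming a 99.9% availability SLO (0.1% error budget) and a placeholder http_requests_total metric with job and code labels. Each long window is paired with a shorter one, which is what makes the alert stop firing soon after the incident ends:

```yaml
groups:
  - name: checkout-slo-burn
    rules:
      # Fast burn: 14.4x the baseline burn rate, i.e. 2% of the 30-day
      # error budget consumed in 1 hour. Pages immediately.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="checkout"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical   # assumed routing label for "page someone"
        annotations:
          summary: "Checkout is burning its 30-day error budget at >= 14.4x"
      # Slow burn: 6x the baseline rate, i.e. 5% of the budget in 6 hours.
      # Creates a ticket instead of paging.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="checkout"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[30m]))
            / sum(rate(http_requests_total{job="checkout"}[30m]))
          ) > (6 * 0.001)
        labels:
          severity: ticket     # assumed routing label for "file a ticket"
        annotations:
          summary: "Checkout is burning its 30-day error budget at >= 6x"
```

The and between the long and short windows is the point: the alert only fires while the error rate is elevated in both, so a brief spike neither pages anyone nor lingers after recovery.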
Escalation Policy Design
The escalation chain needs to cover the case where someone is unavailable. No single points of failure.
| Level | Responder | Timeout | Channel |
|---|---|---|---|
| L1 | Primary on-call | 5 min | PagerDuty (phone + push) |
| L2 | Secondary on-call | 10 min | PagerDuty (phone + push) |
| L3 | Engineering manager | 10 min | PagerDuty + Slack |
| L4 | Incident commander | 15 min | All channels |
Keep the timeouts tight. If someone doesn't acknowledge within 5 minutes, escalate. Better to over-escalate and apologize than let an incident simmer for 30 minutes because someone was in the shower.
Runbook Design
Every alert needs a linked runbook. No exceptions. A runbook should answer five questions: what does this alert mean, what are the likely root causes (ordered by probability, not alphabetically), what diagnostic commands should I run, how do I fix it, and when should I pull in another team.
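One low-effort way to enforce "no exceptions" is to make the link part of the alert definition itself, for example as a runbook_url annotation that a CI lint can require. The alert, metric, and URL below are illustrative placeholders:

```yaml
groups:
  - name: checkout-alerts
    rules:
      - alert: CheckoutHighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m]))
          ) > 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "p99 checkout latency above 1s for 10 minutes"
          # The runbook link lives next to the alert definition; a CI check
          # can reject any rule that is missing this annotation.
          runbook_url: https://runbooks.example.com/checkout/high-latency
```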
The part most teams skip: runbooks are living documents. After every incident where the runbook was wrong or incomplete, update it. Otherwise, the result is just a museum of stale documentation.
Production Considerations
- Alert grouping. Configure Alertmanager to group alerts by service and alertname with a 5-minute group wait. When a database goes down, the goal is one notification, not 200 individual pod alerts flooding the on-call's phone. (A config sketch follows this list.)
- Inhibition rules. If the Kubernetes node is down, suppress all pod-level alerts on that node. If the database is unreachable, suppress the dependent service error alerts. Show the root cause, not the symptoms.
- Noise audits. Run these monthly. Any alert that fired more than 10 times without producing an incident needs to be tuned, automated, or deleted. Track the signal-to-noise ratio as a team metric and hold the team accountable.
- On-call health. Monitor pages-per-shift and MTTR. If an engineer consistently gets more than 2 actionable pages per 12-hour shift, the system is not reliable enough. The answer is engineering investment, not more on-call heroics.
- Silence management. Use time-bound silences during planned maintenance. Never create indefinite silences. They become permanent blind spots that surface months later when everyone has forgotten they exist.
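A minimal Alertmanager configuration sketch covering the grouping and inhibition points above. The receiver name, label values, and alert names are assumptions, not a drop-in config:

```yaml
route:
  receiver: pagerduty-oncall            # assumed receiver name
  group_by: ["service", "alertname"]
  group_wait: 5m       # hold the first notification to collect related alerts
  group_interval: 5m
  repeat_interval: 4h

inhibit_rules:
  # A down node suppresses pod-level alerts on the same node.
  - source_matchers: ['alertname = "NodeDown"']
    target_matchers: ['alertname =~ "Pod.*"']
    equal: ["node"]
  # A database outage suppresses error alerts from dependent services.
  - source_matchers: ['alertname = "DatabaseDown"']
    target_matchers: ['alertname = "ServiceErrorRateHigh"']
    equal: ["environment"]

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<events-api-v2-key>"   # placeholder secret
```

Time-bound silences for planned maintenance can then be created with amtool silence add and an explicit duration, which keeps them from turning into the indefinite blind spots described above.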
Failure Scenarios
Scenario 1: Alertmanager Cluster Split-Brain During Network Partition. A three-node Alertmanager cluster is running. A network partition isolates one node. Now both partitions think they're the active cluster and independently fire alerts. The on-call gets 47 alerts instead of 12 during a real database outage. Grouping and deduplication fail because each partition tracks state independently. The engineer spends 25 minutes triaging duplicates instead of fixing the actual problem, stretching MTTR from 15 to 40 minutes. Detection: Monitor alertmanager_cluster_members and alert when the member count drops below the expected value. Track alertmanager_alerts_received_total per instance and look for asymmetry. Recovery: Set --cluster.settle-timeout=60s and spread Alertmanager nodes across availability zones so no single failure domain can isolate a majority of the cluster. Use a gossip mesh with --cluster.peer for reliable membership. Test partition behavior quarterly. Most teams never test this, and they regret it.
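A hedged sketch of the detection side, assuming a three-node cluster (the group name and runbook URL are placeholders):

```yaml
groups:
  - name: meta-alerting
    rules:
      - alert: AlertmanagerClusterDegraded
        # Each Alertmanager instance reports how many peers it can see over
        # gossip. Fewer than the expected 3 for several minutes usually
        # means a partition or a dead node.
        expr: alertmanager_cluster_members < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager gossip cluster has fewer than 3 members"
          runbook_url: https://runbooks.example.com/alertmanager/split-brain
```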
Scenario 2: Alert Routing Misconfiguration Silences Critical Alerts. Someone updates the Alertmanager routing config to add a new team's Slack channel. A YAML indentation error causes all severity: critical alerts to route to the new (unmonitored) Slack channel instead of PagerDuty. For 6 hours overnight, no critical alerts reach the on-call phone. A payment processing outage goes unnoticed until customers report failures on Twitter. $340K in lost revenue. Detection: Run amtool check-config in CI to validate routing syntax before deploy. Implement a synthetic "watchdog" alert that fires continuously. If PagerDuty stops receiving the watchdog, the routing is broken. Monitor alertmanager_notifications_failed_total by integration. Recovery: Roll back the config change. Require two-person review for Alertmanager config changes going forward. Add end-to-end alert delivery tests: fire a test alert, verify it arrives in PagerDuty within 2 minutes. This scenario is preventable, and the fact that it still happens at companies of every size says a lot about how often people skip config validation.
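The watchdog half of that advice is a one-line rule. Wiring the receiving end to a dead-man's-switch service that pages when the alert stops arriving is assumed here, not shown:

```yaml
groups:
  - name: watchdog
    rules:
      - alert: Watchdog
        # vector(1) always returns a value, so this alert fires continuously.
        # Route it through the same path as critical alerts; if the receiving
        # service stops seeing it, the delivery pipeline itself is broken.
        expr: vector(1)
        labels:
          severity: none   # assumed label used only for watchdog routing
        annotations:
          summary: "Always-firing alert that proves the alerting pipeline is alive"
```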
Scenario 3: Error Budget Exhaustion Goes Unnoticed Due to Stale SLO Definition. The checkout service SLO targets 99.9% success rate. The SLI query was defined 18 months ago and references a deprecated metric name. The query silently returns 0 results, so the burn rate reads as 0. Meanwhile, the actual error rate climbs to 2.5% over three weeks. Nobody investigates because the SLO dashboard shows green. Detection: Alert on absent(sli_metric_name) to catch missing SLI data. Dashboard panels should show "NO DATA" prominently, not default to 0. Run quarterly SLO review meetings to verify the underlying queries actually return non-zero values. Recovery: Update the SLI query to reference the current metric. Backfill error budget calculations for the affected period. Add CI checks that SLO recording rules produce non-null results in staging. The lesson here is that stale configs are silent killers. Schedule reviews, or accept the consequences.
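Detection for this class of failure can itself be an alert; a sketch, reusing the scenario's placeholder metric name:

```yaml
groups:
  - name: slo-data-quality
    rules:
      - alert: SLIDataMissing
        # absent() returns 1 when the series does not exist at all, which is
        # exactly the failure mode a renamed or deprecated metric produces.
        expr: absent(sli_metric_name)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "SLI query for the checkout SLO returned no data for 15 minutes"
```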
Capacity Planning
Alert volume guidelines: A healthy alerting system should hit a signal-to-noise ratio of 3:1 at minimum. Three out of every four alerts should lead to meaningful human action. Google SRE recommends no more than 2 actionable pages per 12-hour on-call shift to keep people from burning out.
| Metric | Healthy | Warning | Critical | Reference |
|---|---|---|---|---|
| Pages per on-call shift (12h) | 0-2 | 3-5 | 6+ | Google SRE book |
| Alert-to-incident ratio | >50% | 25-50% | <25% | PagerDuty State of On-Call |
| MTTA (Mean Time to Acknowledge) | <5 min | 5-15 min | >15 min | Industry benchmark |
| MTTR (Mean Time to Resolve) | <30 min | 30-60 min | >60 min | Incident complexity dependent |
| Runbook coverage | >90% of alerts | 50-90% | <50% | Netflix practice |
| False positive rate | <10% | 10-30% | >30% | Sustainable on-call threshold |
Scaling on-call rotations: With fewer than 5 engineers, a weekly primary rotation with a secondary backup works fine. At 8-12 engineers, split into two teams by service area and consider follow-the-sun coverage given the right geographic distribution. At 20+ engineers, move to tiered on-call: L1 generalist triage handles all alerts, L2 domain-specific escalation (database, networking, application) handles the harder stuff. PagerDuty reports that organizations with 100+ services should budget 1 on-call engineer per 15-20 services. That ratio feels about right from experience, though it depends heavily on system maturity.
Architecture Decision Record
Decision: Choosing an Alerting & Incident Management Platform
| Criteria (Weight) | PagerDuty | Grafana OnCall | OpsGenie | Prometheus Alertmanager |
|---|---|---|---|---|
| Escalation policies (25%) | 5, most mature multi-level policies | 3, basic chains but improving | 4, solid escalation engine | 2, manual config with no scheduling |
| Integration ecosystem (20%) | 5, 700+ integrations | 3, Grafana-native and growing | 4, Atlassian ecosystem | 3, webhook + receiver-based |
| On-call scheduling (20%) | 5, advanced rotations and overrides | 3, basic schedules | 4, good scheduling with Jira integration | 1, no built-in scheduling |
| Cost (15%) | 2, $21-41/user/month | 5, free OSS with paid cloud option | 3, $9-35/user/month | 5, free |
| Analytics & reporting (10%) | 5, MTTA/MTTR dashboards plus postmortem | 2, basic stats | 3, reports and analytics | 1, no built-in analytics |
| ChatOps (10%) | 4, Slack/Teams integration | 4, native Slack with ChatOps-first design | 3, Slack integration | 2, webhook notifications only |
When to choose what:
- Team < 10, cost-sensitive: Prometheus Alertmanager + Grafana OnCall. Free, good enough for small teams, and grows with the Grafana ecosystem. Teams outgrow it eventually, but it buys time.
- Team 10-50, SaaS-preferred: PagerDuty. It's the industry standard for a reason. The analytics alone pay for themselves when actually used to drive down MTTR.
- Atlassian-heavy org: OpsGenie. The native Jira integration for incident-to-ticket workflows saves a lot of glue work. Confluence runbook linking is a nice bonus.
- Enterprise with compliance needs: PagerDuty. SOC 2 Type II certified, audit logs, HIPAA-eligible plan. For regulated industries, this is the path of least resistance.
- Team 50+, follow-the-sun: PagerDuty with Event Intelligence. The AIOps-based alert grouping can reduce noise by 40-60% at scale. At that size, manual routing just doesn't hold up anymore.
Key Points
- Every alert that fires should require a human to do something. If it doesn't, delete it.
- Alert on SLO burn rate, not raw metrics. "Burned 10% of error budget in 1 hour" beats "error rate > 1%" every time.
- Escalation policies route alerts to the right person: primary, then secondary, then manager, then incident commander.
- Runbooks cut MTTR by giving the on-call engineer an actual playbook instead of guesswork at 3 AM.
- If the on-call gets more than 2 pages per shift, the system needs fixing. More heroics won't help.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| PagerDuty | Commercial | Incident management, escalation policies, integrations | Medium-Enterprise |
| OpsGenie | Commercial | Atlassian integration, alert routing, on-call scheduling | Medium-Enterprise |
| Prometheus Alertmanager | Open Source | Prometheus-native, grouping, silencing, inhibition | Medium-Enterprise |
| Grafana OnCall | Open Source | Grafana-native, ChatOps, escalation chains | Small-Large |
Common Mistakes
- Alert fatigue. Too many non-actionable alerts train responders to ignore everything, including the real incidents.
- No deduplication or grouping. One cascading database failure should not produce 100 separate notifications.
- Missing runbooks. The alert fires at 3 AM and the on-call engineer has zero context on what to do next.
- Not tracking the alert-to-incident ratio. A high ratio means the team is drowning in false positives.
- Alerting on error rate without considering traffic volume. One error in two requests is a 50% error rate, but it's meaningless.