Alerting & On-Call
Why It Exists
Nobody watches dashboards at 3 AM. That's the whole point. Monitoring without alerting is just data collection that feels productive.
Alerting connects observability data to actual human response. When a service degrades, the right engineer should know about it within minutes. Simple enough in theory. In practice, most teams get this badly wrong.
The most common failure mode isn't missing alerts. It's alert fatigue. Too many alerts firing, too few of them worth waking up for, and eventually the on-call starts ignoring pages. Google's SRE team nailed the principle years ago: every alert must be actionable, novel, and require human intelligence. If it can be auto-remediated, automate it. If it fires constantly without anyone doing anything about it, delete it. Seriously. Delete it.
How It Works
SLO-Based Alerting
Stop setting static thresholds like "CPU > 80%". That approach generates noise and misses real problems. SLO-based alerting measures error budget burn rate instead, and it's a significant improvement.
The multi-window, multi-burn-rate approach from Google's SRE Workbook defines two burn-rate conditions:
- Fast burn: 2% of the 30-day error budget consumed in 1 hour (that's 14.4x the normal burn rate). This pages someone immediately. It catches acute incidents where things are breaking right now.
- Slow burn: 5% of the 30-day error budget consumed in 6 hours (6x the normal burn rate). This creates a ticket. It catches the gradual degradation that nobody notices until it's too late.
Why does this work so much better? Because short traffic dips and one-off request errors don't trip the alert. Only sustained impact against the error budget triggers anything. False positives drop dramatically.
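As a rough sketch, here is what those two conditions can look like as Prometheus alerting rules, assuming a 99.9% availability SLO (0.1% error budget) and a placeholder http_requests_total metric with job and code labels. Each long window is paired with a shorter one, which is what makes the alert stop firing soon after the incident ends:

```yaml
groups:
  - name: checkout-slo-burn
    rules:
      # Fast burn: 14.4x the baseline burn rate, i.e. 2% of the 30-day
      # error budget consumed in 1 hour. Pages immediately.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="checkout"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical   # assumed routing label for "page someone"
        annotations:
          summary: "Checkout is burning its 30-day error budget at >= 14.4x"
      # Slow burn: 6x the baseline rate, i.e. 5% of the budget in 6 hours.
      # Creates a ticket instead of paging.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="checkout"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="checkout",code=~"5.."}[30m]))
            / sum(rate(http_requests_total{job="checkout"}[30m]))
          ) > (6 * 0.001)
        labels:
          severity: ticket     # assumed routing label for "file a ticket"
        annotations:
          summary: "Checkout is burning its 30-day error budget at >= 6x"
```

The and between the long and short windows is the point: the alert only fires while the error rate is elevated in both, so a brief spike neither pages anyone nor lingers after recovery.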
Escalation Policy Design
The escalation chain needs to cover the case where someone is unavailable. No single points of failure.
| Level | Responder | Timeout | Channel |
|---|---|---|---|
| L1 | Primary on-call | 5 min | PagerDuty (phone + push) |
| L2 | Secondary on-call | 10 min | PagerDuty (phone + push) |
| L3 | Engineering manager | 10 min | PagerDuty + Slack |
| L4 | Incident commander | 15 min | All channels |
Keep the timeouts tight. If someone doesn't acknowledge within 5 minutes, escalate. Better to over-escalate and apologize than let an incident simmer for 30 minutes because someone was in the shower.
Runbook Design
Every alert needs a linked runbook. No exceptions. A runbook should answer five questions: what does this alert mean, what are the likely root causes (ordered by probability, not alphabetically), what diagnostic commands should I run, how do I fix it, and when should I pull in another team.
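One low-effort way to enforce "no exceptions" is to make the link part of the alert definition itself, for example as a runbook_url annotation that a CI lint can require. The alert, metric, and URL below are illustrative placeholders:

```yaml
groups:
  - name: checkout-alerts
    rules:
      - alert: CheckoutHighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m]))
          ) > 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "p99 checkout latency above 1s for 10 minutes"
          # The runbook link lives next to the alert definition; a CI check
          # can reject any rule that is missing this annotation.
          runbook_url: https://runbooks.example.com/checkout/high-latency
```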
The part most teams skip: runbooks are living documents. After every incident where the runbook was wrong or incomplete, update it. Otherwise, the result is just a museum of stale documentation.
Production Considerations
- Alert grouping. Configure Alertmanager to group alerts by service and alertname with a 5-minute group wait. When a database goes down, the goal is one notification, not 200 individual pod alerts flooding the on-call's phone. (A config sketch follows this list.)
- Inhibition rules. If the Kubernetes node is down, suppress all pod-level alerts on that node. If the database is unreachable, suppress the dependent service error alerts. Show the root cause, not the symptoms.
- Noise audits. Run these monthly. Any alert that fired more than 10 times without producing an incident needs to be tuned, automated, or deleted. Track the signal-to-noise ratio as a team metric and hold the team accountable.
- On-call health. Monitor pages-per-shift and MTTR. If an engineer consistently gets more than 2 actionable pages per 12-hour shift, the system is not reliable enough. The answer is engineering investment, not more on-call heroics.
- Silence management. Use time-bound silences during planned maintenance. Never create indefinite silences. They become permanent blind spots that surface months later when everyone has forgotten they exist.
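A minimal Alertmanager configuration sketch covering the grouping and inhibition points above. The receiver name, label values, and alert names are assumptions, not a drop-in config:

```yaml
route:
  receiver: pagerduty-oncall            # assumed receiver name
  group_by: ["service", "alertname"]
  group_wait: 5m       # hold the first notification to collect related alerts
  group_interval: 5m
  repeat_interval: 4h

inhibit_rules:
  # A down node suppresses pod-level alerts on the same node.
  - source_matchers: ['alertname = "NodeDown"']
    target_matchers: ['alertname =~ "Pod.*"']
    equal: ["node"]
  # A database outage suppresses error alerts from dependent services.
  - source_matchers: ['alertname = "DatabaseDown"']
    target_matchers: ['alertname = "ServiceErrorRateHigh"']
    equal: ["environment"]

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<events-api-v2-key>"   # placeholder secret
```

Time-bound silences for planned maintenance can then be created with amtool silence add and an explicit duration, which keeps them from turning into the indefinite blind spots described above.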
Failure Scenarios
Scenario 1: Alertmanager Cluster Split-Brain During Network Partition. A three-node Alertmanager cluster is running. A network partition isolates one node. Now both partitions think they're the active cluster and independently fire alerts. The on-call gets 47 alerts instead of 12 during a real database outage. Grouping and deduplication fail because each partition tracks state independently. The engineer spends 25 minutes triaging duplicates instead of fixing the actual problem, stretching MTTR from 15 to 40 minutes. Detection: Monitor alertmanager_cluster_members and alert when the member count drops below the expected value. Track alertmanager_alerts_received_total per instance and look for asymmetry. Recovery: Set --cluster.settle-timeout=60s and spread Alertmanager nodes across availability zones so no single failure domain can isolate a majority of the cluster. Use a gossip mesh with --cluster.peer for reliable membership. Test partition behavior quarterly. Most teams never test this, and they regret it.
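A hedged sketch of the detection side, assuming a three-node cluster (the group name and runbook URL are placeholders):

```yaml
groups:
  - name: meta-alerting
    rules:
      - alert: AlertmanagerClusterDegraded
        # Each Alertmanager instance reports how many peers it can see over
        # gossip. Fewer than the expected 3 for several minutes usually
        # means a partition or a dead node.
        expr: alertmanager_cluster_members < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager gossip cluster has fewer than 3 members"
          runbook_url: https://runbooks.example.com/alertmanager/split-brain
```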
Scenario 2: Alert Routing Misconfiguration Silences Critical Alerts. Someone updates the Alertmanager routing config to add a new team's Slack channel. A YAML indentation error causes all severity: critical alerts to route to the new (unmonitored) Slack channel instead of PagerDuty. For 6 hours overnight, no critical alerts reach the on-call phone. A payment processing outage goes unnoticed until customers report failures on Twitter. $340K in lost revenue. Detection: Run amtool check-config in CI to validate routing syntax before deploy. Implement a synthetic "watchdog" alert that fires continuously. If PagerDuty stops receiving the watchdog, the routing is broken. Monitor alertmanager_notifications_failed_total by integration. Recovery: Roll back the config change. Require two-person review for Alertmanager config changes going forward. Add end-to-end alert delivery tests: fire a test alert, verify it arrives in PagerDuty within 2 minutes. This scenario is preventable, and the fact that it still happens at companies of every size says a lot about how often people skip config validation.
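The watchdog half of that advice is a one-line rule. Wiring the receiving end to a dead-man's-switch service that pages when the alert stops arriving is assumed here, not shown:

```yaml
groups:
  - name: watchdog
    rules:
      - alert: Watchdog
        # vector(1) always returns a value, so this alert fires continuously.
        # Route it through the same path as critical alerts; if the receiving
        # service stops seeing it, the delivery pipeline itself is broken.
        expr: vector(1)
        labels:
          severity: none   # assumed label used only for watchdog routing
        annotations:
          summary: "Always-firing alert that proves the alerting pipeline is alive"
```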
Scenario 3: Error Budget Exhaustion Goes Unnoticed Due to Stale SLO Definition. The checkout service SLO targets 99.9% success rate. The SLI query was defined 18 months ago and references a deprecated metric name. The query silently returns 0 results, so the burn rate reads as 0. Meanwhile, the actual error rate climbs to 2.5% over three weeks. Nobody investigates because the SLO dashboard shows green. Detection: Alert on absent(sli_metric_name) to catch missing SLI data. Dashboard panels should show "NO DATA" prominently, not default to 0. Run quarterly SLO review meetings to verify the underlying queries actually return non-zero values. Recovery: Update the SLI query to reference the current metric. Backfill error budget calculations for the affected period. Add CI checks that SLO recording rules produce non-null results in staging. The lesson here is that stale configs are silent killers. Schedule reviews, or accept the consequences.
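Detection for this class of failure can itself be an alert; a sketch, reusing the scenario's placeholder metric name:

```yaml
groups:
  - name: slo-data-quality
    rules:
      - alert: SLIDataMissing
        # absent() returns 1 when the series does not exist at all, which is
        # exactly the failure mode a renamed or deprecated metric produces.
        expr: absent(sli_metric_name)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "SLI query for the checkout SLO returned no data for 15 minutes"
```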
Capacity Planning
Alert volume guidelines: A healthy alerting system should hit a signal-to-noise ratio of 3:1 at minimum. Three out of every four alerts should lead to meaningful human action. Google SRE recommends no more than 2 actionable pages per 12-hour on-call shift to keep people from burning out.
| Metric | Healthy | Warning | Critical | Reference |
|---|---|---|---|---|
| Pages per on-call shift (12h) | 0-2 | 3-5 | 6+ | Google SRE book |
| Alert-to-incident ratio | >50% | 25-50% | <25% | PagerDuty State of On-Call |
| MTTA (Mean Time to Acknowledge) | <5 min | 5-15 min | >15 min | Industry benchmark |
| MTTR (Mean Time to Resolve) | <30 min | 30-60 min | >60 min | Incident complexity dependent |
| Runbook coverage | >90% of alerts | 50-90% | <50% | Netflix practice |
| False positive rate | <10% | 10-30% | >30% | Sustainable on-call threshold |
Scaling on-call rotations: With fewer than 5 engineers, a weekly primary rotation with a secondary backup works fine. At 8-12 engineers, split into two teams by service area and consider follow-the-sun coverage given the right geographic distribution. At 20+ engineers, move to tiered on-call: L1 generalist triage handles all alerts, L2 domain-specific escalation (database, networking, application) handles the harder stuff. PagerDuty reports that organizations with 100+ services should budget 1 on-call engineer per 15-20 services. That ratio feels about right from experience, though it depends heavily on system maturity.
Architecture Decision Record
Decision: Choosing an Alerting & Incident Management Platform
| Criteria (Weight) | PagerDuty | Grafana OnCall | OpsGenie | Prometheus Alertmanager |
|---|---|---|---|---|
| Escalation policies (25%) | 5, most mature multi-level policies | 3, basic chains but improving | 4, solid escalation engine | 2, manual config with no scheduling |
| Integration ecosystem (20%) | 5, 700+ integrations | 3, Grafana-native and growing | 4, Atlassian ecosystem | 3, webhook + receiver-based |
| On-call scheduling (20%) | 5, advanced rotations and overrides | 3, basic schedules | 4, good scheduling with Jira integration | 1, no built-in scheduling |
| Cost (15%) | 2, $21-41/user/month | 5, free OSS with paid cloud option | 3, $9-35/user/month | 5, free |
| Analytics & reporting (10%) | 5, MTTA/MTTR dashboards plus postmortem | 2, basic stats | 3, reports and analytics | 1, no built-in analytics |
| ChatOps (10%) | 4, Slack/Teams integration | 4, native Slack with ChatOps-first design | 3, Slack integration | 2, webhook notifications only |
When to choose what:
- Team < 10, cost-sensitive: Prometheus Alertmanager + Grafana OnCall. Free, good enough for small teams, and grows with the Grafana ecosystem. Teams outgrow it eventually, but it buys time.
- Team 10-50, SaaS-preferred: PagerDuty. It's the industry standard for a reason. The analytics alone pay for themselves when actually used to drive down MTTR.
- Atlassian-heavy org: OpsGenie. The native Jira integration for incident-to-ticket workflows saves a lot of glue work. Confluence runbook linking is a nice bonus.
- Enterprise with compliance needs: PagerDuty. SOC 2 Type II certified, audit logs, HIPAA-eligible plan. For regulated industries, this is the path of least resistance.
- Team 50+, follow-the-sun: PagerDuty with Event Intelligence. The AIOps-based alert grouping can reduce noise by 40-60% at scale. At that size, manual routing just doesn't hold up anymore.
Key Points
- Every alert that fires should require a human to do something. If it doesn't, delete it.
- Alert on SLO burn rate, not raw metrics. "Burned 10% of error budget in 1 hour" beats "error rate > 1%" every time.
- Escalation policies route alerts to the right person: primary, then secondary, then manager, then incident commander.
- Runbooks cut MTTR by giving the on-call engineer an actual playbook instead of guesswork at 3 AM.
- If the on-call gets more than 2 pages per shift, the system needs fixing. More heroics won't help.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| PagerDuty | Commercial | Incident management, escalation policies, integrations | Medium-Enterprise |
| OpsGenie | Commercial | Atlassian integration, alert routing, on-call scheduling | Medium-Enterprise |
| Prometheus Alertmanager | Open Source | Prometheus-native, grouping, silencing, inhibition | Medium-Enterprise |
| Grafana OnCall | Open Source | Grafana-native, ChatOps, escalation chains | Small-Large |
Common Mistakes
- Alert fatigue. Too many non-actionable alerts train responders to ignore everything, including the real incidents.
- No deduplication or grouping. One cascading database failure should not produce 100 separate notifications.
- Missing runbooks. The alert fires at 3 AM and the on-call engineer has zero context on what to do next.
- Not tracking the alert-to-incident ratio. A high ratio means the team is drowning in false positives.
- Alerting on error rate without considering traffic volume. One error in two requests is a 50% error rate, but it's meaningless.