SLO, SLA & SLI Budgeting

The Three Layers

Google's Site Reliability Engineering book brought this framework into the mainstream, and it's still the best way to think about managing reliability in production. The three concepts build on each other.

Service Level Indicators (SLIs) are raw measurements of how your service behaves from the user's perspective. Good SLIs measure what users actually care about: request latency (P50, P99), error rate (percentage of 5xx responses), availability (successful requests divided by total requests), and throughput. Bad SLIs measure internal infrastructure details that users never see, like CPU utilization, disk IOPS, or pod restart counts. Those metrics matter for capacity planning and debugging, but they don't belong in your SLO tracking.
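As a rough illustration, the user-facing SLIs above can be computed from a batch of request records. This is a minimal sketch, not a production metrics pipeline: the `Request` type, the choice to count only 5xx responses as failures, and the nearest-rank percentile method are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    status: int        # HTTP status code
    latency_ms: float  # latency as observed at the edge


def availability(requests):
    """Successful requests divided by total requests (5xx counts as failure)."""
    if not requests:
        raise ValueError("no requests in window")
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def error_rate(requests):
    """Fraction of requests that returned a 5xx response."""
    return 1.0 - availability(requests)


def latency_percentile(requests, p):
    """Nearest-rank latency percentile, with p in (0, 100]."""
    if not requests:
        raise ValueError("no requests in window")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]
```

In a real system these numbers would more likely come from load balancer logs or a metrics backend than from an in-memory list, but the definitions stay the same.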

Service Level Objectives (SLOs) are the targets you set against your SLIs. Something like "99.9% of requests will succeed within 200ms over a 30-day rolling window" is specific enough to be actionable. Setting good SLOs means understanding your users. An internal batch processing pipeline can probably live with 99% availability. A payment API probably needs 99.99%. The mistake people keep making is setting SLOs based on what the system currently achieves rather than what users actually need.
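To make the example objective concrete, here is one way to express "99.9% of requests will succeed within 200ms over a 30-day rolling window" as data and check it against observed events. The `SLO` class, its field names, and the `(status, latency_ms)` tuple shape are hypothetical; treat this as a sketch under those assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO definition; names are illustrative."""
    target: float        # required fraction of good events, e.g. 0.999
    latency_ms: float    # a good request must finish within this bound
    window_days: int     # length of the rolling window


def is_good(status, latency_ms, slo):
    """A request is good if it did not fail and met the latency bound."""
    return status < 500 and latency_ms <= slo.latency_ms


def compliance(events, slo):
    """Fraction of good events; events are (status, latency_ms) tuples
    already filtered to the SLO's rolling window."""
    if not events:
        raise ValueError("no events in window")
    good = sum(1 for status, latency in events if is_good(status, latency, slo))
    return good / len(events)


def meets_slo(events, slo):
    return compliance(events, slo) >= slo.target


# The example objective from the text.
API_SLO = SLO(target=0.999, latency_ms=200.0, window_days=30)
```

Note that a slow success counts against the objective just like an error does, which is the point of combining latency and success in one SLO.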

Service Level Agreements (SLAs) are external contracts with money on the line. Your SLA should always be looser than your SLO. If your SLO is 99.9%, set your SLA at maybe 99.5%. That gives you breathing room before contractual penalties kick in. Plenty of internal services don't need SLAs at all. SLOs are enough.

Error Budgets in Practice

An error budget is just the flip side of your SLO. A 99.9% SLO gives you a 0.1% error budget, which works out to 43.2 minutes of total downtime in a 30-day month, or 1,000 failed requests per million. What makes error budgets powerful is that they turn reliability from a vague aspiration into a concrete, spendable resource.
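The arithmetic is simple enough to sketch directly. The function names below are made up for illustration, and the 30-day default window is an assumption.

```python
def error_budget_fraction(slo_target):
    """The error budget is whatever the SLO leaves over: 1 - target."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Total downtime the budget permits over the window, in minutes."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def allowed_failures(slo_target, total_requests):
    """Number of failed requests the budget permits out of total_requests."""
    return error_budget_fraction(slo_target) * total_requests
```

For a 99.9% target, `allowed_downtime_minutes(0.999)` comes out at roughly 43.2, and `allowed_failures(0.999, 1_000_000)` at roughly 1,000, up to floating-point rounding.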

When the budget is healthy, ship features aggressively. When the budget runs out, stop feature deployments and put the effort into reliability. That's the error budget policy, and it needs to be agreed on by both product and engineering leadership before anyone is in crisis mode.

Burn Rate Alerts

Traditional threshold alerts (like "error rate > 1%") tend to fire either too late or too often. Burn rate alerts fix this by measuring how fast you're eating through your error budget. A burn rate of 1x means you'll use up your budget exactly at the end of the 30-day window. A burn rate of 14.4x means you spend 2% of your 30-day budget every hour, so the whole budget will be gone in about 50 hours, and that's worth paging someone.
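A burn rate is the observed error rate divided by the error rate the SLO allows. The sketch below uses hypothetical function names and assumes a 30-day window by default.

```python
def burn_rate(observed_error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    return observed_error_rate / budget


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if this burn rate is sustained."""
    return window_days * 24 / rate


def budget_consumed(rate, elapsed_hours, window_days=30):
    """Fraction of the window's budget spent after elapsed_hours at this rate."""
    return rate * elapsed_hours / (window_days * 24)
```

With a 99.9% SLO, an observed error rate of 1.44% corresponds to a burn rate of about 14.4. Sustained, that exhausts a 30-day budget in about 50 hours and spends about 2% of it in a single hour.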

Google's SRE Workbook recommends a multi-window approach: a fast burn (14.4x over 1 hour, confirmed by a 5-minute window) catches critical incidents, while a slow burn (6x over 6 hours, confirmed by a 30-minute window) catches gradual degradations. The short confirmation window makes sure the problem is still happening when the alert fires. This cuts down on alert noise while still flagging both sudden failures and slow leaks.
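The multi-window logic can be sketched as a small rule table: a rule fires only when both its long window and its short confirmation window exceed the threshold. The `BurnRule` type, the window labels, and the `firing_rules` helper are hypothetical, and the thresholds are tunable; the values shown (14.4 and 6) follow the multiwindow table in the SRE Workbook and should be checked against your own SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    """Hypothetical alert rule; names and labels are illustrative."""
    name: str
    threshold: float    # burn-rate multiple that triggers the rule
    long_window: str    # window that detects the burn
    short_window: str   # window that confirms it is still happening


# Assumed thresholds; tune them for your own service.
RULES = [
    BurnRule("fast", 14.4, "1h", "5m"),
    BurnRule("slow", 6.0, "6h", "30m"),
]


def firing_rules(burn_rates, rules=RULES):
    """Return names of rules whose long AND short windows both exceed
    the threshold. burn_rates maps a window label to its burn rate."""
    return [
        rule.name
        for rule in rules
        if burn_rates[rule.long_window] >= rule.threshold
        and burn_rates[rule.short_window] >= rule.threshold
    ]
```

Requiring both windows means an incident that has already recovered (high 1-hour rate, low 5-minute rate) does not page anyone.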

Budget-Based Release Decisions

The most useful thing you can do with error budgets is gate your deployments. If your team has burned through 80% of the monthly budget with 20 days left, that's a clear signal to slow down and fix reliability before shipping more features. This isn't a punishment. It's a data-driven way to align incentives between product velocity and system stability. Teams that do this consistently end up with both better reliability and faster long-term delivery, because fewer incidents means less unplanned firefighting.
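One possible shape for such a gate is sketched below. The three outcomes, the `freeze_at` parameter, and the rule of comparing spend against a straight-line pace through the window are assumptions made for illustration; the actual policy is whatever product and engineering leadership have agreed on.

```python
def release_decision(budget_spent, days_elapsed, window_days=30, freeze_at=1.0):
    """Return 'freeze', 'slow', or 'ship' (hypothetical policy).

    budget_spent: fraction of the window's error budget already used.
    days_elapsed: days since the window started.
    """
    if budget_spent >= freeze_at:
        # Budget exhausted: stop feature deployments, work on reliability.
        return "freeze"
    expected = days_elapsed / window_days  # straight-line spend so far
    if budget_spent > expected:
        # Spending faster than the window allows: slow down.
        return "slow"
    return "ship"
```

For the scenario in the text, 80% of the budget spent with 20 of 30 days left, `release_decision(0.8, days_elapsed=10)` returns `"slow"`. A check like this could run as a step in a deployment pipeline, reading the spent fraction from whatever system tracks the SLO.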
