SLO, SLA & SLI Budgeting
The Three Layers
Google's Site Reliability Engineering book brought this framework into the mainstream, and it's still the best way to think about managing reliability in production. The three concepts build on each other.
Service Level Indicators (SLIs) are raw measurements of how your service behaves from the user's perspective. Good SLIs measure what users actually care about: request latency (P50, P99), error rate (percentage of 5xx responses), availability (successful requests divided by total requests), and throughput. Bad SLIs measure internal infrastructure stuff that users never see, like CPU utilization, disk IOPS, or pod restart counts. Those things matter for capacity planning, but they don't belong in your SLO tracking.
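To make the user-facing SLIs above concrete, here is a minimal sketch of computing availability and latency percentiles from a batch of request records. The `Request` type and both function names are hypothetical, invented for illustration. In practice these numbers usually come from your metrics system rather than from raw request lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape; real data would come from logs or metrics.
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Successful requests divided by total requests (5xx counts as failure)."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request latency, e.g. pct=50 or pct=99."""
    if not requests:
        raise ValueError("cannot compute a percentile of zero requests")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Calling `latency_percentile(requests, 99)` gives the P99 latency for the batch. Nearest-rank is only one of several percentile definitions, so results may differ slightly from what your monitoring tool reports.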
Service Level Objectives (SLOs) are the targets you set against your SLIs. Something like "99.9% of requests will succeed within 200ms over a 30-day rolling window" is specific enough to be actionable. Setting good SLOs means understanding your users. An internal batch processing pipeline can probably live with 99% availability. A payment API probably needs 99.99%. The mistake people keep making is setting SLOs based on what the system currently achieves rather than what users actually need.
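As a rough illustration, an SLO like "99.9% of requests will succeed within 200ms" can be checked by counting a request as good only if it both succeeds and is fast enough. The function names below are hypothetical, and requests are assumed to be simple `(status, latency_ms)` tuples collected over the rolling window.

```python
def good_request_ratio(requests, latency_threshold_ms=200.0):
    """Fraction of requests that succeeded within the latency threshold.

    `requests` is an iterable of (status, latency_ms) tuples; this shape
    is an assumption made for the sketch.
    """
    requests = list(requests)
    if not requests:
        return 1.0  # assumption: an empty window does not violate the SLO
    good = sum(
        1
        for status, latency_ms in requests
        if status < 500 and latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


def meets_slo(requests, target=0.999, latency_threshold_ms=200.0):
    """True if the window's good-request ratio is at or above the target."""
    return good_request_ratio(requests, latency_threshold_ms) >= target
```

Note that a slow success and a fast failure both count against the objective here, which matches the wording of the example SLO.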
Service Level Agreements (SLAs) are external contracts with money on the line. Your SLA should always be looser than your SLO. If your SLO is 99.9%, set your SLA at maybe 99.5%. That gives you breathing room before contractual penalties kick in. Plenty of internal services don't need SLAs at all. SLOs are enough.
Error Budgets in Practice
An error budget is just the flip side of your SLO. A 99.9% SLO gives you a 0.1% error budget, which works out to 43.2 minutes of total downtime over a 30-day window (30 days × 24 hours × 60 minutes × 0.001), or 1,000 failed requests per million. What makes error budgets powerful is that they turn reliability from a vague aspiration into a concrete, spendable resource.
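The budget arithmetic is easy to script. The sketch below assumes a 30-day window by default, and the function names are made up for this example.

```python
def error_budget_fraction(slo_target):
    """The share of requests (or time) allowed to fail, e.g. 0.001 for 99.9%."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Total downtime the budget permits over the window, in minutes."""
    return window_days * 24 * 60 * error_budget_fraction(slo_target)


def allowed_failures(slo_target, total_requests):
    """Number of failed requests the budget permits for a given volume."""
    return total_requests * error_budget_fraction(slo_target)


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the budget still unspent (negative once overspent)."""
    allowed = allowed_failures(slo_target, total_requests)
    if allowed == 0:
        return 0.0  # a 100% SLO, or no traffic, leaves nothing to spend
    return 1.0 - failed_requests / allowed


print(round(allowed_downtime_minutes(0.999), 2))     # minutes per 30 days
print(round(allowed_failures(0.999, 1_000_000), 2))  # failures per million
```

For a 99.9% target, the two `print` calls show roughly 43.2 minutes and 1,000 failures per million requests. Tightening the target to 99.99% shrinks the downtime allowance to about 4.3 minutes.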
When the budget is healthy, ship features aggressively. When the budget runs out, stop feature deployments and put the effort into reliability. That's the error budget policy, and it needs to be agreed on by both product and engineering leadership before anyone is in crisis mode.
Burn Rate Alerts
Traditional threshold alerts (like "error rate > 1%") tend to fire either too late or too often. Burn rate alerts fix this by measuring how fast you're eating through your error budget. A burn rate of 1x means you'll use up your budget exactly at the end of the 30-day window. A burn rate of 14.4x means you burn 2% of your 30-day budget every hour, so the whole budget will be gone in about 50 hours, and that's worth paging someone.
Google's SRE Workbook recommends a multi-window approach: a fast burn (14.4x over 1 hour, confirmed by a 5-minute window) catches critical incidents, while a slow burn (6x over 6 hours, confirmed by a 30-minute window) catches gradual degradations. Requiring both the long and the short window to exceed the threshold cuts down on alert noise while still flagging both sudden failures and slow leaks.
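Here is a small sketch of how a multi-window burn rate check might look. Burn rate is the observed error rate divided by the error budget fraction. The thresholds of 14.4 (1-hour and 5-minute windows) and 6 (6-hour and 30-minute windows) follow the example table in Google's SRE Workbook. The dictionary of per-window error rates, the `POLICIES` list, and the function names are all assumptions for illustration; a real setup would typically express this as alerting rules in your monitoring system.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


# (alert name, burn rate threshold, long window, short confirmation window)
POLICIES = [
    ("fast_burn", 14.4, "1h", "5m"),
    ("slow_burn", 6.0, "6h", "30m"),
]


def firing_alerts(error_rates, slo_target):
    """Return the names of alerts whose long AND short windows both exceed
    the threshold. `error_rates` maps a window label to its error rate."""
    fired = []
    for name, threshold, long_w, short_w in POLICIES:
        long_burn = burn_rate(error_rates[long_w], slo_target)
        short_burn = burn_rate(error_rates[short_w], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            fired.append(name)
    return fired
```

With a 99.9% SLO, a sustained 2% error rate is a burn rate of about 20x, which would trip the fast-burn alert once both the 1-hour and 5-minute windows reflect it. The short window is what lets the alert clear quickly after the problem is fixed.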
Budget-Based Release Decisions
The most useful thing you can do with error budgets is gate your deployments. If your team has burned through 80% of the budget only 10 days into a 30-day window, that's a clear signal to slow down and fix reliability before shipping more features. This isn't a punishment. It's a data-driven way to align incentives between product velocity and system stability. Teams that do this consistently end up with both better reliability and faster long-term delivery, because fewer incidents mean less unplanned firefighting.
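One way to turn this into a deployment gate is to compare how much budget has been spent against how much of the window has elapsed. The sketch below is only an illustration: the three outcomes, the `freeze_at` threshold, and the function name are hypothetical, and the actual rules should come from the error budget policy your product and engineering leads have agreed on.

```python
def release_decision(budget_consumed, days_elapsed, window_days=30, freeze_at=1.0):
    """Suggest a release posture from budget spend versus time elapsed.

    budget_consumed: fraction of the window's error budget already spent
    days_elapsed:    days since the budget window started
    freeze_at:       spend fraction at which feature deploys stop (assumed)
    """
    elapsed_fraction = days_elapsed / window_days
    if budget_consumed >= freeze_at:
        return "freeze"     # budget exhausted: reliability work only
    if budget_consumed > elapsed_fraction:
        return "slow_down"  # spending faster than the calendar allows
    return "ship"           # budget is healthy


# 80% of the budget gone after 10 of 30 days
print(release_decision(0.8, 10))
```

The example call returns `"slow_down"`, since 80% of the budget is gone with only a third of the window behind you. A check like this could run as a step in a CI/CD pipeline, but keeping a human override is usually wise for urgent fixes.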
Key Points
- SLIs are the measurements, SLOs are the targets, SLAs are the contracts. Do not confuse them.
- Error budgets quantify how much unreliability you can tolerate before pausing feature work.
- A 99.9% SLO allows 43.2 minutes of downtime per month. Know your budget in real time.
- Burn rate alerts detect when you are consuming error budget faster than expected.
- SLOs should be set based on user expectations, not on what your system currently achieves.
Common Mistakes
- Setting SLOs at 99.99% when your users would be perfectly happy with 99.9%, wasting engineering effort.
- Defining SLIs that do not reflect actual user experience (measuring server uptime instead of request success rate).
- Having SLOs without error budget policies. The budget is meaningless if nobody acts when it is exhausted.
- Treating SLAs and SLOs as the same thing. SLAs have financial penalties, SLOs are internal targets.