Reliability Engineering & SLO Design
What SLOs Actually Are (and What They're Not)
SLOs are targets for how well your service should perform, defined in terms that matter to users. They are not aspirational goals, not marketing numbers, and not a competition to see who can add the most nines. A good SLO answers one question: "How reliable does this service need to be for our users to be happy?"
That question is harder than it sounds. Different users have different expectations. A payment processing API needs to be far more reliable than a recommendation engine. An internal analytics dashboard can tolerate more downtime than a customer-facing checkout flow. Your SLOs should reflect these differences.
Here's a concrete example. Say you're setting SLOs for a food delivery app's order placement service. You'd talk to product managers about what users expect ("I should be able to place an order within 10 seconds, and it should succeed on the first try almost always"). You'd look at historical data to see what you're actually delivering. You'd check the SLOs of your dependencies (payment processor, restaurant API, geolocation service) because you can't be more reliable than your weakest link. From all of that, you'd arrive at something like: 99.9% of order placements succeed within 5 seconds, measured from the client side.
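A target like "99.9% of order placements succeed" becomes much easier to discuss once you translate the percentage into an allowed number of bad events at your traffic volume. A minimal sketch (the 50,000 orders/day figure is an assumed number for illustration):

```python
def allowed_failures(slo: float, total_events: int) -> int:
    """Number of events that can fail (or miss the latency target)
    while still meeting the SLO over the measurement window."""
    return round(total_events * (1 - slo))

# Hypothetical traffic: 50,000 order placements/day over a 30-day window.
window_events = 50_000 * 30

print(allowed_failures(0.999, window_events))   # -> 1500 bad orders per window
print(allowed_failures(0.9999, window_events))  # -> 150: one more nine, 10x less slack
```

Running the numbers for both targets makes the cost of an extra nine concrete before anyone commits to it.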
Designing SLIs That Reflect User Experience
The SLI is the measurement. Get it wrong and your SLO is meaningless. The most common mistake is measuring at the wrong layer.
Bad SLIs:
- Server CPU utilization below 70%
- HTTP 200 response rate at the load balancer
- Database query latency under 100ms
Good SLIs:
- Percentage of user-initiated requests that complete successfully within the target latency, measured from the client or edge
- Percentage of page loads that become interactive within 3 seconds, measured by real user monitoring
- Percentage of background jobs that complete within their SLA window, measured end-to-end
The difference is perspective. Bad SLIs measure component health. Good SLIs measure user outcomes. A server can return 200 OK while the user sees a failure, because the response was technically successful at the HTTP layer but carried an error message in the body. Your SLI should catch that.
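To see what "measure user outcomes" looks like in practice, here is a minimal sketch of a client-side availability SLI. The `Request` fields and the 5-second latency target are assumptions for illustration, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int           # HTTP status as observed by the client/edge
    latency_ms: float     # measured from the client, not the server
    body_has_error: bool  # e.g. a 200 response wrapping {"error": ...}

def availability_sli(requests: list[Request], latency_target_ms: float = 5000) -> float:
    """Fraction of user-initiated requests that truly succeeded:
    good status, within the latency target, and no error in the body."""
    if not requests:
        return 1.0  # no traffic: conventionally treated as meeting the SLI
    good = sum(
        1 for r in requests
        if 200 <= r.status < 300
        and r.latency_ms <= latency_target_ms
        and not r.body_has_error
    )
    return good / len(requests)
```

The `body_has_error` check is the point: it catches the "200 OK but the payload was an error" case that a load-balancer success rate would happily count as good.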
Error Budgets: The Bridge Between Product and Reliability
The error budget is the most powerful concept in SLO-based reliability. It's simple math: if your SLO is 99.9% availability over a rolling 30-day window, your error budget is 0.1%, which translates to about 43 minutes of downtime. That budget is yours to spend however you want.
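The arithmetic above is worth having as a one-liner you can rerun for any target and window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a rolling window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo)

print(round(error_budget_minutes(0.999)))   # -> 43 minutes per 30 days
print(round(error_budget_minutes(0.9999)))  # -> 4 minutes per 30 days
```

Note how quickly the budget shrinks: each extra nine cuts the allowable downtime by a factor of ten.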
When the budget is healthy, you have room to take risks. Ship that big migration. Try the new deployment strategy. Run the chaos engineering experiment. If something goes wrong, you have budget to absorb it.
When the budget is tight, slow down. Add more testing. Require extra review on deploys. Postpone risky changes until the budget recovers. This isn't about punishment. It's about making smart bets with a limited resource.
The key insight is that the error budget belongs to both product and engineering. If product wants to ship 10 features this sprint and engineering says the error budget is already at 60% consumed, that's a conversation, not a veto. Maybe you ship 5 features and spend the other capacity on reliability improvements. Maybe you accept the risk because a critical launch date is approaching. The budget makes the tradeoff visible and forces the right people to make the decision together.
Running Reliability Reviews
Reliability reviews are the practice that makes SLOs stick. Without them, SLOs become stale numbers that nobody looks at. Here's what a good monthly reliability review covers:
- SLO performance over the past 30 days. Are you meeting your targets? Are you trending up or down? If you've been consistently over-performing (say, hitting 99.99% against a 99.9% target), you might be over-investing in reliability at the expense of velocity.
- Error budget consumption. How much budget did you burn, and on what? Was it one big incident or a steady drip of small issues? The pattern matters as much as the number.
- Incident retrospectives. What broke, what did you learn, and what are you doing about it? Connect incidents to SLO impact so leadership can see the cost of reliability gaps in concrete terms.
- Upcoming risks. Are there big migrations, new feature launches, or infrastructure changes coming that might stress the system? Adjust your risk tolerance accordingly.
Keep these reviews short (30-45 minutes), data-driven, and attended by both engineering and product leadership. The goal is shared understanding of where you stand and shared ownership of what to do about it.
Sample Questions
How do you set SLOs for a new service? What inputs do you need?
The interviewer wants to see you think about user expectations, business requirements, and dependency SLOs. Don't pick a number out of thin air. Show how you'd gather data from user behavior, product requirements, and downstream service capabilities to arrive at a target that's both meaningful and achievable.
Your team has burned through 80% of its error budget in the first week. What do you do?
This tests judgment. Do you freeze all deploys immediately? Do you investigate first? The interviewer wants to see you balance urgency with good decision-making. Show that you'd assess the cause, determine if it's ongoing or a one-time event, and then decide on the appropriate response.
How do you balance feature velocity against reliability when the error budget is tight?
The error budget is a shared resource between product and engineering. This question tests whether you can use it as a negotiation tool rather than treating reliability as purely an engineering concern. Show that you understand the tradeoff is explicit and should involve product leadership.
Evaluation Criteria
- Sets SLOs based on user expectations and business needs, not arbitrary targets like 99.99%
- Uses error budgets as concrete decision-making tools for balancing velocity and reliability
- Designs alerting strategies derived from SLOs rather than arbitrary thresholds
- Understands the relationship between SLIs, SLOs, and SLAs and can explain when each matters
- Knows when to tighten or relax SLO targets based on changing requirements and historical data
Key Points
- SLOs should reflect what users actually experience, not what makes engineering feel good. A 99.99% target sounds impressive, but if your users are happy at 99.5% and the cost of the extra nines is enormous, you're wasting resources.
- Error budgets are the best negotiation tool between product and engineering. When the budget is healthy, ship fast and take risks. When it's tight, slow down and focus on reliability. This makes the velocity-reliability tradeoff explicit and data-driven.
- SLIs should measure what users see, not what servers report. A server returning 200 OK in 50ms doesn't matter if the user's page takes 4 seconds to render. Measure latency at the edge, success rates from the client perspective, and availability as users experience it.
- SLOs and SLAs are different things. An SLO is an internal target that drives engineering decisions. An SLA is an external commitment with contractual consequences. Your SLO should always be tighter than your SLA, giving you a buffer before you breach customer commitments.
- Reliability reviews should be a regular practice, not a response to incidents. Review your SLO performance monthly, discuss trends with product and engineering leadership, and adjust targets based on what you've learned. Treat reliability as an ongoing conversation, not a checkbox.
Common Mistakes
- ✗ Setting SLOs at 99.99% because it sounds good. Do the math: 99.99% uptime means about 52 minutes of downtime per year. For most services, that's far stricter than users need, and achieving it requires massive investment in redundancy. 99.9% (8.7 hours/year) is the right target for most internal services.
- ✗ Treating error budget violations as purely engineering problems. If you burned through your error budget because product pushed 15 risky features in a sprint, that's a product decision, not an engineering failure. The conversation about what to do next needs to include product leadership.
- ✗ Measuring server uptime instead of user-facing success rate. Your servers can be running perfectly while users experience failures due to CDN issues, DNS problems, or client-side errors. The SLI needs to capture what the user actually sees.
- ✗ Not having a clear policy for what happens when the error budget runs out. If the budget hits zero and nobody knows what that means, it's not a useful tool. Define the policy in advance: do you freeze deploys? Require extra review? Redirect engineering capacity to reliability work? Decide before the crisis.