Reliability Engineering & SLO Design
What SLOs Actually Are (and What They're Not)
SLOs are targets for how well your service should perform, defined in terms that matter to users. They are not aspirational goals, not marketing numbers, and not a competition to see who can add the most nines. A good SLO answers one question: "How reliable does this service need to be for our users to be happy?"
That question is harder than it sounds. Different users have different expectations. A payment processing API needs to be far more reliable than a recommendation engine. An internal analytics dashboard can tolerate more downtime than a customer-facing checkout flow. Your SLOs should reflect these differences.
Here's a concrete example. Say you're setting SLOs for a food delivery app's order placement service. You'd talk to product managers about what users expect ("I should be able to place an order within 10 seconds, and it should succeed on the first try almost always"). You'd look at historical data to see what you're actually delivering. You'd check the SLOs of your dependencies (payment processor, restaurant API, geolocation service) because you can't be more reliable than your weakest link. From all of that, you'd arrive at something like: 99.9% of order placements succeed within 5 seconds, measured from the client side.
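A target like "99.9% of order placements succeed" becomes much easier to discuss once you translate the percentage into an allowed number of bad events at your traffic volume. A minimal sketch (the 50,000 orders/day figure is an assumed number for illustration):

```python
def allowed_failures(slo: float, total_events: int) -> int:
    """Number of events that can fail (or miss the latency target)
    while still meeting the SLO over the measurement window."""
    return round(total_events * (1 - slo))

# Hypothetical traffic: 50,000 order placements/day over a 30-day window.
window_events = 50_000 * 30

print(allowed_failures(0.999, window_events))   # -> 1500 bad orders per window
print(allowed_failures(0.9999, window_events))  # -> 150: one more nine, 10x less slack
```

Running the numbers for both targets makes the cost of an extra nine concrete before anyone commits to it.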
Designing SLIs That Reflect User Experience
The SLI is the measurement. Get it wrong and your SLO is meaningless. The most common mistake is measuring at the wrong layer.
Bad SLIs:
- Server CPU utilization below 70%
- HTTP 200 response rate at the load balancer
- Database query latency under 100ms
Good SLIs:
- Percentage of user-initiated requests that complete successfully within the target latency, measured from the client or edge
- Percentage of page loads that become interactive within 3 seconds, measured by real user monitoring
- Percentage of background jobs that complete within their SLA window, measured end-to-end
The difference is perspective. Bad SLIs measure component health. Good SLIs measure user outcomes. A server can return 200 OK while the user sees a failure, because the response was technically successful at the HTTP layer but carried an error message in the body. Your SLI should catch that.
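To see what "measure user outcomes" looks like in practice, here is a minimal sketch of a client-side availability SLI. The `Request` fields and the 5-second latency target are assumptions for illustration, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int           # HTTP status as observed by the client/edge
    latency_ms: float     # measured from the client, not the server
    body_has_error: bool  # e.g. a 200 response wrapping {"error": ...}

def availability_sli(requests: list[Request], latency_target_ms: float = 5000) -> float:
    """Fraction of user-initiated requests that truly succeeded:
    good status, within the latency target, and no error in the body."""
    if not requests:
        return 1.0  # no traffic: conventionally treated as meeting the SLI
    good = sum(
        1 for r in requests
        if 200 <= r.status < 300
        and r.latency_ms <= latency_target_ms
        and not r.body_has_error
    )
    return good / len(requests)
```

The `body_has_error` check is the point: it catches the "200 OK but the payload was an error" case that a load-balancer success rate would happily count as good.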
Error Budgets: The Bridge Between Product and Reliability
The error budget is the most powerful concept in SLO-based reliability. It's simple math: if your SLO is 99.9% availability over a rolling 30-day window, your error budget is 0.1%, which translates to about 43 minutes of downtime. That budget is yours to spend however you want.
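The arithmetic above is worth having as a one-liner you can rerun for any target and window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a rolling window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo)

print(round(error_budget_minutes(0.999)))   # -> 43 minutes per 30 days
print(round(error_budget_minutes(0.9999)))  # -> 4 minutes per 30 days
```

Note how quickly the budget shrinks: each extra nine cuts the allowable downtime by a factor of ten.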
When the budget is healthy, you have room to take risks. Ship that big migration. Try the new deployment strategy. Run the chaos engineering experiment. If something goes wrong, you have budget to absorb it.
When the budget is tight, slow down. Add more testing. Require extra review on deploys. Postpone risky changes until the budget recovers. This isn't about punishment. It's about making smart bets with a limited resource.
The key insight is that the error budget belongs to both product and engineering. If product wants to ship 10 features this sprint and engineering says the error budget is already at 60% consumed, that's a conversation, not a veto. Maybe you ship 5 features and spend the other capacity on reliability improvements. Maybe you accept the risk because a critical launch date is approaching. The budget makes the tradeoff visible and forces the right people to make the decision together.
Running Reliability Reviews
Reliability reviews are the practice that makes SLOs stick. Without them, SLOs become stale numbers that nobody looks at. Here's what a good monthly reliability review covers:
- SLO performance over the past 30 days. Are you meeting your targets? Are you trending up or down? If you've been consistently over-performing (say, hitting 99.99% against a 99.9% target), you might be over-investing in reliability at the expense of velocity.
- Error budget consumption. How much budget did you burn, and on what? Was it one big incident or a steady drip of small issues? The pattern matters as much as the number.
- Incident retrospectives. What broke, what did you learn, and what are you doing about it? Connect incidents to SLO impact so leadership can see the cost of reliability gaps in concrete terms.
- Upcoming risks. Are there big migrations, new feature launches, or infrastructure changes coming that might stress the system? Adjust your risk tolerance accordingly.
Keep these reviews short (30-45 minutes), data-driven, and attended by both engineering and product leadership. The goal is shared understanding of where you stand and shared ownership of what to do about it.
Sample Questions
How do you set SLOs for a new service? What inputs do you need?
The interviewer wants to see you think about user expectations, business requirements, and dependency SLOs. Don't pick a number out of thin air. Show how you'd gather data from user behavior, product requirements, and downstream service capabilities to arrive at a target that's both meaningful and achievable.
Your team has burned through 80% of its error budget in the first week. What do you do?
This tests judgment. Do you freeze all deploys immediately? Do you investigate first? The interviewer wants to see you balance urgency with good decision-making. Show that you'd assess the cause, determine if it's ongoing or a one-time event, and then decide on the appropriate response.
How do you balance feature velocity against reliability when the error budget is tight?
The error budget is a shared resource between product and engineering. This question tests whether you can use it as a negotiation tool rather than treating reliability as purely an engineering concern. Show that you understand the tradeoff is explicit and should involve product leadership.
Evaluation Criteria
- Sets SLOs based on user expectations and business needs, not arbitrary targets like 99.99%
- Uses error budgets as concrete decision-making tools for balancing velocity and reliability
- Designs alerting strategies derived from SLOs rather than arbitrary thresholds
- Understands the relationship between SLIs, SLOs, and SLAs and can explain when each matters
- Knows when to tighten or relax SLO targets based on changing requirements and historical data
Key Points
- SLOs should reflect what users actually experience, not what makes engineering feel good. A 99.99% target sounds impressive, but if your users are happy at 99.5% and the cost of the extra nines is enormous, you're wasting resources.
- Error budgets are the best negotiation tool between product and engineering. When the budget is healthy, ship fast and take risks. When it's tight, slow down and focus on reliability. This makes the velocity-reliability tradeoff explicit and data-driven.
- SLIs should measure what users see, not what servers report. A server returning 200 OK in 50ms doesn't matter if the user's page takes 4 seconds to render. Measure latency at the edge, success rates from the client perspective, and availability as users experience it.
- SLOs and SLAs are different things. An SLO is an internal target that drives engineering decisions. An SLA is an external commitment with contractual consequences. Your SLO should always be tighter than your SLA, giving you a buffer before you breach customer commitments.
- Reliability reviews should be a regular practice, not a response to incidents. Review your SLO performance monthly, discuss trends with product and engineering leadership, and adjust targets based on what you've learned. Treat reliability as an ongoing conversation, not a checkbox.
Common Mistakes
- ✗ Setting SLOs at 99.99% because it sounds good. Do the math: 99.99% uptime means about 52 minutes of downtime per year. For most services, that's far stricter than users need, and achieving it requires massive investment in redundancy. 99.9% (8.7 hours/year) is the right target for most internal services.
- ✗ Treating error budget violations as purely engineering problems. If you burned through your error budget because product pushed 15 risky features in a sprint, that's a product decision, not an engineering failure. The conversation about what to do next needs to include product leadership.
- ✗ Measuring server uptime instead of user-facing success rate. Your servers can be running perfectly while users experience failures due to CDN issues, DNS problems, or client-side errors. The SLI needs to capture what the user actually sees.
- ✗ Not having a clear policy for what happens when the error budget runs out. If the budget hits zero and nobody knows what that means, it's not a useful tool. Define the policy in advance: do you freeze deploys? Require extra review? Redirect engineering capacity to reliability work? Decide before the crisis.