SRE Team Structure
Embedded vs Centralized
The embedded model places SREs directly inside product teams. They attend standups, join design reviews, and understand the service intimately. This works well when a service is complex enough that reliability work requires deep domain context. The tradeoff is that embedded SREs can drift toward feature work if management doesn't protect their time.
Centralized SRE teams sit apart from product teams and provide reliability services across the org. They own shared infrastructure like monitoring platforms, deployment pipelines, and incident tooling. The risk here is that centralized SREs become a ticketing queue, disconnected from the services they support.
Google originated SRE with a centralized model, but many companies that have scaled the practice well (LinkedIn, Dropbox, Uber) ended up with a hybrid. A small central SRE platform team builds shared tooling while embedded SREs work directly with the highest-impact product teams.
When to Hire Your First SRE
You probably don't need an SRE team until you hit around 30-40 engineers and have production services that real customers depend on. Before that, reliability ownership should sit with the engineers who build the services. Your first SRE hire should be someone who can set up observability foundations (Datadog, Grafana, PagerDuty), define SLOs, and build the on-call culture.
Don't hire a single SRE and expect them to fix everything. Start with 2-3 so they can share on-call and still have bandwidth for engineering work.
On-Call Expectations
SREs should not be the only people on call. The healthiest model is shared on-call where developers carry the primary pager for their own services and SREs provide escalation support plus incident coordination. This keeps developers accountable for the reliability of what they ship.
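The shared on-call model above can be sketched as a small escalation policy. This is a minimal illustration, not a description of any particular paging tool: the `EscalationStep` type, the `paged_roles` helper, and the 15-minute escalation delay are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    role: str
    after_minutes: int  # minutes a page stays unacknowledged before this role is paged


# Hypothetical policy: developers carry the primary pager for their own
# service, and SREs are paged only as escalation. The 15-minute delay is
# illustrative, not a recommendation from the text.
POLICY: Sequence[EscalationStep] = (
    EscalationStep("service developer (primary)", 0),
    EscalationStep("SRE (escalation and incident coordination)", 15),
)


def paged_roles(minutes_unacked: int, policy: Sequence[EscalationStep] = POLICY) -> List[str]:
    """Return every role that has been paged after the given unacknowledged time."""
    return [step.role for step in policy if minutes_unacked >= step.after_minutes]


print(paged_roles(0))
print(paged_roles(20))
```

At 0 minutes only the developer is paged; after the escalation delay the SRE is paged as well, which keeps the developer on the hook first.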
Compensation for on-call matters. Whether it's extra PTO, a flat stipend, or per-page payouts, ignoring the burden of on-call is the fastest way to burn out your SRE team and drive attrition.
Production Readiness Reviews
A PRR is a structured checklist that a service must pass before the SRE team will accept it into its operational scope. Google's version covers monitoring coverage, alerting thresholds, capacity projections, disaster recovery plans, and documented runbooks.
The key principle: SRE support is earned, not assumed. If a development team can't demonstrate that their service meets baseline reliability standards, SREs are empowered to decline support until the gaps are closed.
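One way to picture the "earned, not assumed" gate is as a checklist evaluation. The sketch below is illustrative only: the five areas come from the list above, while the `PRRResult` type, its field names, and the `checkout-api` service are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Checklist areas named in the text; the identifiers are illustrative.
PRR_AREAS = (
    "monitoring_coverage",
    "alerting_thresholds",
    "capacity_projections",
    "disaster_recovery_plan",
    "runbooks",
)


@dataclass
class PRRResult:
    service: str
    passed: Dict[str, bool] = field(default_factory=dict)

    def gaps(self) -> List[str]:
        """Return areas that failed or were never reviewed, in checklist order."""
        return [area for area in PRR_AREAS if not self.passed.get(area, False)]

    def accepted(self) -> bool:
        """The service is accepted only when every checklist area passes."""
        return not self.gaps()


review = PRRResult(
    service="checkout-api",  # hypothetical service name
    passed={
        "monitoring_coverage": True,
        "alerting_thresholds": True,
        "capacity_projections": False,
        "runbooks": True,
        # disaster_recovery_plan was never reviewed, so it counts as a gap
    },
)

print(review.accepted())
print(review.gaps())
```

Treating an unreviewed area as a gap is a deliberate choice here: the development team has to demonstrate each item rather than have it assumed.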
Key Points
- Google's original SRE model caps operational work at 50% of an SRE's time. The other 50% goes to engineering projects that improve reliability. If toil exceeds 50%, tickets get redirected back to the development team until the balance is restored.
- A commonly cited rule of thumb is 1 SRE for every 8-10 developers. Staffing well below that level means your SREs become permanent firefighters with no time for systemic improvements.
- Embedded SREs sit inside product teams and build deep domain knowledge. Centralized SREs maintain consistency across the org but risk becoming a bottleneck. Most mature orgs use a hybrid: a central SRE platform team plus embedded SREs for critical services.
- SRE, DevOps, and Platform Engineering are not the same thing. SREs own service reliability and SLOs. DevOps is a cultural philosophy about shared ownership. Platform Engineering builds internal developer tools and infrastructure.
- Production readiness reviews (PRRs) are gating checks before a service goes live. They cover monitoring, alerting, runbooks, capacity planning, and failure modes. Without PRRs, teams ship services that nobody knows how to operate at 3 AM.
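The 50% toil cap and the 8-10 developers-per-SRE ratio from the points above are simple arithmetic, sketched here. The function names and the sample numbers (26 operational hours in a 40-hour week, 120 developers) are assumptions chosen for illustration.

```python
import math
from typing import Tuple

TOIL_CAP = 0.50  # cap on operational work, as stated in the text


def toil_fraction(ops_hours: float, total_hours: float) -> float:
    """Share of working time spent on operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def over_toil_cap(ops_hours: float, total_hours: float, cap: float = TOIL_CAP) -> bool:
    """True when operational work strictly exceeds the cap."""
    return toil_fraction(ops_hours, total_hours) > cap


def sre_headcount_range(developers: int, low_ratio: int = 8, high_ratio: int = 10) -> Tuple[int, int]:
    """(min, max) SREs implied by 1 SRE per 8-10 developers, rounded up."""
    return math.ceil(developers / high_ratio), math.ceil(developers / low_ratio)


print(over_toil_cap(26, 40))     # 26 / 40 = 0.65, above the 0.50 cap -> True
print(sre_headcount_range(120))  # 120 / 10 = 12 and 120 / 8 = 15 -> (12, 15)
```

When `over_toil_cap` returns True, the text's remedy applies: operational tickets go back to the development team until the balance is restored.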
Common Mistakes
- Hiring SREs before you have enough production services to justify the role. If you have fewer than 5-6 production services, a senior backend engineer with ops experience can fill the gap.
- Treating SRE as a rebranded ops team. If your SREs aren't writing code to automate away toil, you've just renamed your sysadmins.
- Letting SREs own all on-call without developer participation. This creates a moral hazard where developers ship unreliable code because someone else deals with the consequences.