SRE Team Structure

Embedded vs Centralized

The embedded model places SREs directly inside product teams. They attend standups, join design reviews, and understand the service intimately. This works well when a service is complex enough that reliability work requires deep domain context. The tradeoff is that embedded SREs can drift toward feature work if management doesn't protect their time.

Centralized SRE teams sit apart from product teams and provide reliability services across the org. They own shared infrastructure like monitoring platforms, deployment pipelines, and incident tooling. The risk here is that centralized SREs become a ticketing queue, disconnected from the services they support.

Google invented SRE with a centralized model, but most companies that have scaled SRE well (LinkedIn, Dropbox, Uber) ended up with a hybrid: a small central SRE platform team builds shared tooling, while embedded SREs work directly with the highest-impact product teams.

When to Hire Your First SRE

You probably don't need an SRE team until you hit around 30-40 engineers and have production services that real customers depend on. Before that, reliability ownership should sit with the engineers who build the services. Your first SRE hire should be someone who can set up observability foundations (Datadog, Grafana, PagerDuty), define SLOs, and build the on-call culture.
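
Defining an SLO mostly comes down to simple arithmetic on an availability target. The sketch below is one way a first SRE hire might express that; the function names, the 30-day window, and the request counts are illustrative assumptions, not part of any particular vendor's API.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (assumed, illustrative value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, based on request counts.

    Returns 1.0 when nothing has been served yet; a negative result means
    the budget is overspent.
    """
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # Half of the allowed failures consumed leaves about half the budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

Expressing the SLO as a budget gives developers and SREs a shared number to argue about, rather than a vague sense of whether the service is "reliable enough".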

Don't hire a single SRE and expect them to fix everything. Start with two or three so they can share on-call and still have bandwidth for engineering work.

On-Call Expectations

SREs should not be the only people on call. The healthiest model is shared on-call where developers carry the primary pager for their own services and SREs provide escalation support plus incident coordination. This keeps developers accountable for the reliability of what they ship.
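
The shared model described above can be sketched as an escalation policy in which developers are paged first and SREs only after a delay. The role names and the 15- and 30-minute thresholds are hypothetical values chosen for illustration; real paging tools have their own configuration formats.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    role: str           # who gets paged at this step (hypothetical role name)
    after_minutes: int  # minutes the page has gone unacknowledged


# Assumed example policy: developers carry the primary pager,
# SREs act as the escalation point and incident coordinators.
DEFAULT_POLICY = (
    EscalationStep("service-developer-primary", 0),
    EscalationStep("service-developer-secondary", 15),
    EscalationStep("sre-escalation", 30),
)


def who_is_paged(minutes_unacked: int,
                 policy: Sequence[EscalationStep] = DEFAULT_POLICY) -> List[str]:
    """Return every role that has been paged after the given unacknowledged time."""
    if minutes_unacked < 0:
        raise ValueError("minutes_unacked must be non-negative")
    return [step.role for step in policy if step.after_minutes <= minutes_unacked]


if __name__ == "__main__":
    for minutes in (0, 20, 45):
        print(minutes, who_is_paged(minutes))
```

Keeping the SRE step last in the chain encodes the accountability point: the team that shipped the service sees the page first.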

Compensation for on-call matters. Whether you offer extra PTO, a flat stipend, or per-page payouts, compensate the work explicitly: ignoring the burden of on-call is the fastest way to burn out your SRE team and drive attrition.

Production Readiness Reviews

A Production Readiness Review (PRR) is a structured checklist that a service must pass before the SRE team will accept it into its operational scope. Google's version covers monitoring coverage, alerting thresholds, capacity projections, disaster recovery plans, and documented runbooks.

The key principle: SRE support is earned, not assumed. If a development team can't demonstrate that their service meets baseline reliability standards, SREs are empowered to decline support until the gaps are closed.
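
A PRR gate of this kind can be sketched as a checklist that reports its gaps. The five item names mirror the areas listed above; the identifiers, the dictionary-based evidence format, and the all-or-nothing acceptance rule are assumptions made for this example rather than any organization's actual process.

```python
from typing import Dict, List

# Checklist areas taken from the review description; identifiers are illustrative.
PRR_ITEMS = (
    "monitoring_coverage",
    "alerting_thresholds",
    "capacity_projections",
    "disaster_recovery_plan",
    "documented_runbooks",
)


def prr_gaps(evidence: Dict[str, bool]) -> List[str]:
    """List checklist items that are missing or not yet satisfied."""
    return [item for item in PRR_ITEMS if not evidence.get(item, False)]


def sre_accepts(evidence: Dict[str, bool]) -> bool:
    """SRE takes the service on only when every checklist item passes."""
    return not prr_gaps(evidence)


if __name__ == "__main__":
    submission = {
        "monitoring_coverage": True,
        "alerting_thresholds": True,
        "capacity_projections": False,
        "documented_runbooks": True,
    }
    print("accepted:", sre_accepts(submission))
    print("gaps to close:", prr_gaps(submission))
```

Returning the list of gaps, rather than a bare yes or no, gives the development team a concrete to-do list before they resubmit.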
