AI Adoption & Change Management — Operational Excellence
Difficulty: Advanced. Org Size: Growth
Key Points for AI Adoption & Change Management
- AI adoption is a change management problem, not a technology problem. The hardest part is never the model. It's getting people to trust it and actually change how they work
- Start with internal productivity use cases before customer-facing ones. Code review assistance, test generation, and documentation are low-risk wins that build confidence across the org
- Set up an AI Champions program with 2-3 champions per team who experiment first, document what works, and coach their peers. Grassroots adoption sticks better than top-down mandates
- Talk about job displacement concerns directly and honestly. Pretending the worry doesn't exist just makes people dig in harder
- Get a governance framework in place early, before the first production use case ships. Trying to bolt governance onto live AI systems after the fact is messy and disruptive
Common Mistakes with AI Adoption & Change Management
- Mandating AI tools from the top without understanding how teams actually work. A CEO email saying 'everyone must use Copilot' without any workflow context just breeds resentment and checkbox compliance
- Ignoring the skills gap between engineers who are comfortable with AI tools and those who aren't. That gap widens fast if you don't actively close it
- Treating AI adoption as a one-time rollout instead of an ongoing learning process. The tools change every few months, and the org needs ways to absorb new capabilities
- Measuring adoption poorly. 'Number of Copilot licenses' tells you nothing about whether the tool is actually helping or just gathering dust
Related to AI Adoption & Change Management
Incident Management Process, Scaling 10 to 50 Engineers
Chapters & Guilds — Team Structure
Difficulty: Intermediate. Org Size: Enterprise
Key Points for Chapters & Guilds
- Spotify itself has said the 2012 whitepaper was aspirational, not descriptive. Jeremiah Lee, a former Spotify engineer, wrote a widely-cited critique in 2020 confirming the model never fully worked as described. Treat the vocabulary as useful shorthand, not a blueprint
- Chapters solve a real problem: who owns career growth in a cross-functional team model? The answer is the chapter lead, who is both functional manager and technical standard-setter. Without this role, engineers on product squads have no clear growth path
- Guilds die when they become mandatory meetings. The ones that survive at scale (Atlassian's guilds, Shopify's craft groups) share three traits: a rotating facilitator, visible output like shared libraries or RFC templates, and permission to sunset when they've run their course
- The model fits 100-500 engineers where you need product-aligned delivery and functional consistency. Below 100, the matrix overhead costs more than it saves. Above 500, you need formal coordination mechanisms that guilds can't provide
- The biggest lesson from companies that adopted the Spotify Model is that structure without culture is just bureaucracy. Autonomous squads only work when you have high-trust engineering culture, strong hiring, and leaders who tolerate local decisions they disagree with
Common Mistakes with Chapters & Guilds
- Copying the org chart without copying the culture. ING famously adopted the Spotify Model and found that the hardest part wasn't the structure. It was getting managers to let go of control. The structure is the easy part
- Creating too many guilds. Start with 3-4 around your highest-value cross-team knowledge gaps. If you have 15 guilds and half of them are ghost Slack channels, you've diluted participation past the point of usefulness
- Letting chapter leads manage 12+ engineers across 4-5 squads. At that span, they can't stay close enough to each squad's context to give meaningful feedback. Cap chapters at 8-9 people or split them
- Treating squad autonomy as absolute. Autonomy without alignment creates a fragmented platform. Chapters and guilds exist precisely to provide the alignment rails that make autonomy sustainable
Related to Chapters & Guilds
Team Topologies Overview, Conway's Law & Inverse Maneuver
Conway's Law & Inverse Maneuver — Team Structure
Difficulty: Advanced. Org Size: Growth
Key Points for Conway's Law & Inverse Maneuver
- Conway's Law says your system architecture will mirror the communication structure of the org that builds it. This isn't a suggestion. It's just what happens
- The Inverse Conway Maneuver turns this around: you deliberately structure teams to produce the architecture you actually want
- Cross-team dependencies on the org chart become cross-service dependencies in the codebase. Reduce one and you reduce the other
- Splitting a monolith without splitting the team first almost always fails. The existing communication patterns just recreate the coupling
- API boundaries between teams should work like contracts. Team autonomy depends on stable interfaces
Common Mistakes with Conway's Law & Inverse Maneuver
- Reorganizing teams but leaving communication patterns unchanged. New boxes on an org chart don't mean anything if the same people still coordinate on the same code
- Ignoring Conway's Law during architecture planning, then being surprised when the system looks like the org chart instead of the whiteboard diagram
- Going too hard on the Inverse Conway Maneuver. Splitting into lots of small teams before you understand the product domain well enough just creates premature boundaries
- Forgetting that Conway's Law applies beyond engineering. Product, design, and data teams also shape system architecture through how they communicate
Related to Conway's Law & Inverse Maneuver
Team Topologies Overview, Scaling 50 to 200 Engineers
Cross-Functional Product Teams — Team Structure
Difficulty: Intermediate. Org Size: Growth
Key Points for Cross-Functional Product Teams
- A well-composed product team includes a product manager, a designer, 4-6 engineers, and QA capacity (dedicated or shared). This is the smallest unit that can independently discover, build, and ship a product increment
- Embedded relationships (designer sits on the team full-time) create tighter collaboration than matrix relationships (designer is 'loaned' from a design org). The tradeoff is that embedded designers lose connection with their functional peers
- Team APIs define what a team owns, what it provides to other teams, and how other teams should interact with it. Without explicit boundaries, teams step on each other's toes or leave gaps that nobody owns
- Shared services (auth, payments, notifications) create the hardest ownership challenges. The team that owns the shared service has to balance their own roadmap against requests from every team that depends on them
- The right time to split a team is when it owns too many domains to hold in a single sprint, when coordination overhead inside the team exceeds coordination overhead between teams, or when it grows past 8-9 people
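The Team API idea above can be sketched as a lightweight, machine-readable record. The field names and example values here are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class TeamAPI:
    """Explicit contract: what a team owns, provides, and how to engage it."""
    team: str
    owns: list        # domains and services this team is accountable for
    provides: list    # interfaces other teams may depend on
    interaction: str  # e.g. "x-as-a-service", "collaboration", "facilitating"
    contact: str      # where requests go instead of ad-hoc DMs

# Hypothetical example for a payments team
payments_api = TeamAPI(
    team="payments",
    owns=["billing-service", "invoice-db"],
    provides=["POST /charges endpoint", "charges.events topic"],
    interaction="x-as-a-service",
    contact="#payments-requests",
)
```

Keeping these records in the repo alongside the code makes boundary disputes a diff review instead of a meeting.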
Common Mistakes with Cross-Functional Product Teams
- Splitting teams by technical layer (frontend team, backend team, API team) instead of by product domain. This creates handoff dependencies that slow delivery and diffuse ownership
- Sharing a PM across 3-4 teams. A PM split that thin becomes a prioritization bottleneck and can't do meaningful discovery work for any team
- Ignoring dual reporting tension. If a designer reports to a design manager but is embedded in a product team, both the team lead and the design manager need to align on priorities. Without regular sync, the designer gets pulled in two directions
- Forming a team around a project instead of a product area. Project teams ship their deliverable and disband, losing all the domain knowledge they built. Product teams persist and compound their understanding over time
Related to Cross-Functional Product Teams
Team Topologies Overview, Conway's Law & Inverse Maneuver
Data Team Organization — Team Structure
Difficulty: Advanced. Org Size: Growth
Key Points for Data Team Organization
- Data engineering, data science, and analytics engineering are three distinct disciplines with different skills, tools, and career paths. Lumping them together under 'the data team' guarantees that at least one group gets neglected
- The analytics engineering role (popularized by dbt Labs) bridges the gap between raw data engineering and business analytics. They own the transformation layer and build the data models that analysts and scientists depend on
- Centralized data teams maintain consistency in data modeling and tooling but become bottlenecks as request volume grows. Embedded data people move faster for their specific domain but create silos and duplicate work
- Reporting structure matters more than people realize. Data teams under engineering tend to prioritize infrastructure and reliability. Under product, they skew toward experimentation and metrics. As a standalone org under a CDO, they get strategic focus but risk disconnection from the teams they serve
- A modern data stack (Fivetran/Airbyte for ingestion, Snowflake/BigQuery for warehousing, dbt for transformation, Looker/Metabase for BI) shapes team structure. Each layer needs clear ownership
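The layer-by-layer ownership point above can be made concrete as a simple ownership map. The team names here are assumptions for the sketch, not a recommendation:

```python
# Illustrative ownership map for the stack described above.
DATA_STACK = {
    "ingestion":      {"tools": ["Fivetran", "Airbyte"],   "owner": "data-engineering"},
    "warehouse":      {"tools": ["Snowflake", "BigQuery"], "owner": "data-engineering"},
    "transformation": {"tools": ["dbt"],                   "owner": "analytics-engineering"},
    "bi":             {"tools": ["Looker", "Metabase"],    "owner": "analytics"},
}

def owner_of(layer: str) -> str:
    """One accountable team per layer; a KeyError flags an unowned layer."""
    return DATA_STACK[layer]["owner"]
```

The useful property is that every layer resolves to exactly one team, so "who owns the dbt models?" has a lookup, not a debate.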
Common Mistakes with Data Team Organization
- Building a data team before you have enough data volume and business questions to justify it. If a single analyst with SQL access can answer 90% of your questions, you don't need a data platform team yet
- Hiring data scientists before data engineers. Your ML models are only as good as the data pipelines feeding them. Get the plumbing right first
- Letting every team build their own metrics definitions. Without a single source of truth for how 'active user' or 'revenue' is calculated, you end up with five dashboards showing five different numbers
- Treating data quality as someone else's problem. The teams producing data need ownership over its quality. Data teams can build validation frameworks, but they can't fix upstream data issues they don't control
Related to Data Team Organization
ML Team Structure, Team Topologies Overview
Engineering Career Ladders — People & Growth
Difficulty: Advanced. Org Size: Growth
Key Points for Engineering Career Ladders
- IC and management tracks should run parallel in seniority and comp. A Staff Engineer and an Engineering Manager should be peers, not one reporting to the other
- Each level needs clear, observable expectations across multiple dimensions: technical skill, scope of impact, leadership, and communication
- Cross-team calibration prevents title inflation and makes sure 'Senior Engineer' actually means the same thing everywhere in the org
- Promotion should be based on consistently doing next-level work, not just time served. Tenure is one data point, not the deciding factor
- Career ladders should be living documents that get reviewed annually. As the organization changes, the expectations at each level should change with it
Common Mistakes with Engineering Career Ladders
- Creating too many levels too early. A startup with 20 engineers doesn't need 8 IC levels, and the artificial granularity creates more politics than clarity
- Making management the only path to higher pay and seniority. This pushes great ICs into management roles they don't want and aren't wired for
- Defining levels with fuzzy language like 'demonstrates technical excellence' instead of observable behaviors like 'leads design of systems spanning 3+ services'
- Promoting based on tenure or likability rather than demonstrated impact. This kills the ladder's credibility and drives high performers to look elsewhere
- Skipping calibration entirely. Without cross-team calibration, every manager applies different standards and titles lose their meaning
Related to Engineering Career Ladders
Team Topologies Overview, Scaling 10 to 50 Engineers
Engineering Hiring Pipeline — People & Growth
Difficulty: Intermediate. Org Size: Growth
Key Points for Engineering Hiring Pipeline
- Employee referrals consistently produce the highest-quality hires with the shortest time-to-close. Build a referral program with meaningful incentives ($5,000-10,000 bonuses are standard at tech companies) and make the process frictionless for referring employees
- The target time-to-hire for engineering roles is 30-45 days from first contact to signed offer. Every week beyond that increases candidate drop-off by roughly 10-15%. Speed is a competitive advantage
- Structured interviews with scorecards reduce bias and produce more consistent outcomes than unstructured conversations. Define evaluation criteria before the interview starts and score independently before any debrief discussion
- The interview loop should assess four dimensions: technical skill (coding, system design), problem-solving approach, collaboration style, and alignment with team needs. Each stage should test something different
- Calibration sessions after every hiring cycle align interviewers on what 'strong hire' actually looks like. Without calibration, different interviewers apply different bars and your quality becomes inconsistent
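The scorecard idea above can be sketched as follows. The four criteria mirror the dimensions in the loop; the 1-4 scale and the averaging are assumptions for illustration, not a standard rubric:

```python
# Illustrative scorecard; scale and aggregation are assumptions.
SCALE = {1: "strong no hire", 2: "no hire", 3: "hire", 4: "strong hire"}

def score_candidate(scores: dict) -> float:
    """Average one interviewer's independent per-criterion scores,
    recorded before the debrief so discussion can't shift them."""
    if not scores:
        raise ValueError("score every criterion before the debrief")
    return sum(scores.values()) / len(scores)

# One interviewer's independent scores across the four loop dimensions
scores = {"technical": 4, "problem_solving": 3, "collaboration": 3, "team_fit": 4}
print(score_candidate(scores))  # 3.5
```

The point of the structure is that each interviewer commits numbers first; the debrief then compares committed scores rather than negotiating impressions.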
Common Mistakes with Engineering Hiring Pipeline
- Optimizing for false positive avoidance (never making a bad hire) at the expense of process speed. Requiring 6 interview rounds and a take-home project guarantees you'll lose top candidates to companies that move faster
- Letting the hiring manager make the final decision alone. Hiring committees or at least structured debriefs with all interviewers reduce individual bias and prevent one person's gut feeling from overriding signal
- Giving candidates a poor experience and expecting them to accept anyway. Unresponsive recruiters, rescheduled interviews, and weeks of silence between stages damage your employer brand. Rejected candidates talk to other candidates
- Filtering too aggressively on pedigree (top-tier universities, FAANG experience) and missing strong engineers from non-traditional backgrounds, bootcamps, or smaller companies
Related to Engineering Hiring Pipeline
Engineering Career Ladders, Scaling 10 to 50 Engineers
Engineering Manager to IC Ratio — Scaling Organizations
Difficulty: Intermediate. Org Size: Growth
Key Points for Engineering Manager to IC Ratio
- The sweet spot for most engineering managers is 5-8 direct reports. Below 5, there isn't enough people-management work to justify a dedicated manager. Above 8, 1:1s and career development conversations get squeezed and people start feeling invisible
- Span of control should decrease as team complexity increases. A manager overseeing a mature CRUD service team can handle 8 reports. A manager overseeing a team building distributed systems from scratch should have closer to 5
- Skip-level 1:1s (a manager's manager meeting directly with ICs) are essential for catching blind spots. Run them monthly or quarterly. They surface problems that people won't raise with their direct manager
- The player-coach model (manager who also writes production code) works briefly during transitions but fails as a permanent structure. Either the management work suffers or the engineering work suffers. Usually both
- Adding a management layer is a one-way door that's hard to undo. Only add a manager-of-managers role when you have 3+ managers, each with 5+ reports, who need coordination and career development that the director can't provide alone
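The span-of-control numbers above reduce to simple arithmetic when planning headcount. The default target of 7 is an assumption inside the 5-8 sweet spot:

```python
import math

def managers_needed(ic_count: int, target_span: int = 7) -> int:
    """First-line managers needed to keep every span at or below target_span."""
    return math.ceil(ic_count / target_span)

# 40 ICs at a target span of 7 reports each
print(managers_needed(40))  # 6
```

Running this at a few growth milestones (40, 60, 100 ICs) shows when the next hiring-manager search needs to start, well before anyone hits 9+ reports.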
Common Mistakes with Engineering Manager to IC Ratio
- Promoting the best engineer to manager without training or a trial period. Technical excellence and people management are orthogonal skills. Offer management rotations or tech lead roles as a bridge
- Keeping span of control too narrow because managers want to stay close to the code. If a manager has 3 reports and spends 60% of their time coding, you've created an expensive tech lead, not a manager
- Ignoring the transition pain when an IC's manager changes. Relationship continuity matters for career development. When you restructure, give people advance notice and let them have input on where they land
- Assuming flat structures scale. Companies like Valve and GitHub tried minimal management and eventually added layers because coordination costs grew faster than headcount
Related to Engineering Manager to IC Ratio
Scaling 10 to 50 Engineers, Engineering Career Ladders
Incident Management Process — Operational Excellence
Difficulty: Intermediate. Org Size: Growth
Key Points for Incident Management Process
- Severity levels (SEV1-SEV4) give everyone a shared vocabulary and set expectations for response urgency. Without clear definitions, every incident either feels like an emergency or gets shrugged off
- The Incident Commander role keeps coordination separate from debugging. One person drives the process while engineers focus on the technical investigation
- Communication templates (status pages, stakeholder updates, customer notifications) stop the ad-hoc messaging that makes high-stress situations even more confusing
- Post-incident reviews should focus on systemic fixes, not blame. The point is to make the system more resilient, not to find someone to pin it on
- Running regular incident drills and game days builds the kind of muscle memory that turns real incidents into practiced responses instead of panicked scrambles
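The shared severity vocabulary above can be pinned down in code so tooling and humans read the same definitions. The specific meanings, acknowledgment targets, and paging rules here are illustrative assumptions:

```python
# Illustrative SEV definitions; thresholds are assumptions, not a standard.
SEVERITIES = {
    "SEV1": {"meaning": "customer-facing outage",           "ack_within_min": 5,   "page": True},
    "SEV2": {"meaning": "major degradation",                "ack_within_min": 15,  "page": True},
    "SEV3": {"meaning": "minor impact, workaround exists",  "ack_within_min": 60,  "page": False},
    "SEV4": {"meaning": "cosmetic or internal-only",        "ack_within_min": 240, "page": False},
}

def should_page(severity: str) -> bool:
    """Only SEV1/SEV2 wake someone up; lower severities go to a ticket queue."""
    return SEVERITIES[severity]["page"]
```

Encoding the definitions once means the alerting config, the status page, and the runbooks can't quietly drift apart on what a SEV2 means.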
Common Mistakes with Incident Management Process
- Skipping post-incident reviews for 'minor' incidents. SEV3 and SEV4 issues often reveal the kind of systemic weaknesses that eventually cause a SEV1
- Having the on-call engineer also serve as incident commander. Debugging and coordinating compete for the same mental bandwidth, and both suffer
- Writing post-incident reviews that chalk things up to 'human error.' Humans will always make mistakes. The system should be built to handle that
- Never practicing incident response. Teams that only run the process during real emergencies end up slow, confused, and error-prone when it counts
Related to Incident Management Process
Engineering Career Ladders, Team Topologies Overview
ML & AI Team Structure Patterns — Team Structure
Difficulty: Advanced. Org Size: Growth
Key Points for ML & AI Team Structure Patterns
- Three organizational models exist for ML teams: centralized, embedded, and hybrid. Once you get past 5 ML engineers, hybrid tends to scale best because it balances specialization with product proximity
- Embedded ML engineers without a shared platform end up spending roughly 70% of their time on infrastructure plumbing instead of actual modeling work
- ML teams need different hiring profiles than product engineering. A strong ML engineer isn't just a software engineer who took a course, and treating them as interchangeable causes attrition
- The handoff between data science and engineering is where most ML projects die. If nobody owns that gap, models stay stuck in notebooks forever
- Conway's Law applies to ML systems too. If your ML team is cut off from product teams, your models will be cut off from the product experience
Common Mistakes with ML & AI Team Structure Patterns
- Hiring ML PhDs before your data infrastructure is in place. You can't do machine learning without reliable, accessible data pipelines. Senior researchers will leave if they spend months just waiting for clean data
- Running the ML team as a service desk that takes orders from product teams without owning any outcomes. This turns into a sweatshop where data scientists have zero product context and build models that never ship
- Expecting unicorn full-stack ML engineers who can do research, write production code, build pipelines, and operate models. Those people exist, but there are maybe 200 of them and they all work at DeepMind
- Keeping the ML team off the on-call rotation for their own models. If the team that built the model doesn't get paged when it degrades, model quality quietly rots
Related to ML & AI Team Structure Patterns
Team Topologies Overview, Conway's Law & Inverse Maneuver
On-Call Rotation Design — Operational Excellence
Difficulty: Intermediate. Org Size: Growth
Key Points for On-Call Rotation Design
- Sustainable on-call rotations need a minimum of 6-8 people. Fewer than that and individuals end up on call too frequently, which leads to burnout and attrition
- Follow-the-sun rotations (handing off the pager across time zones) eliminate overnight pages but require at least two geographically distributed teams with sufficient overlap for clean handoffs
- On-call compensation is not optional. Whether it's a flat weekly stipend ($500-1,500/week is common in US tech), extra PTO, or per-incident payouts, uncompensated on-call tells engineers their time outside work hours has no value
- Shadow on-call pairs a new team member with an experienced on-call engineer for 1-2 rotations before they carry the pager solo. This builds confidence and catches knowledge gaps before they result in a botched incident response
- Escalation policies should have clear timeouts. If a primary on-call doesn't acknowledge an alert within 5 minutes, it auto-escalates to secondary. If secondary doesn't respond in 10 minutes, it hits the engineering manager
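The escalation chain above (primary at 5 minutes, secondary at 10 more, then the engineering manager) can be sketched as a timeout table. The target names are illustrative:

```python
# Escalation chain from the timeouts above: primary -> secondary -> manager.
ESCALATION_POLICY = [
    {"target": "primary-oncall",      "ack_timeout_min": 5},
    {"target": "secondary-oncall",    "ack_timeout_min": 10},
    {"target": "engineering-manager", "ack_timeout_min": None},  # end of chain
]

def current_target(minutes_unacked: int) -> str:
    """Who holds an alert that has gone unacknowledged this many minutes."""
    elapsed = 0
    for step in ESCALATION_POLICY:
        timeout = step["ack_timeout_min"]
        if timeout is None or minutes_unacked < elapsed + timeout:
            return step["target"]
        elapsed += timeout
    return ESCALATION_POLICY[-1]["target"]

print(current_target(3))   # primary-oncall
print(current_target(12))  # secondary-oncall
print(current_target(20))  # engineering-manager
```

Paging tools like PagerDuty or Opsgenie express the same idea declaratively; the value of writing it down either way is that the chain has no gaps and a defined end.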
Common Mistakes with On-Call Rotation Design
- Putting on-call solely on the SRE or ops team. The engineers who build the service should share on-call responsibility. Shared pain creates shared ownership of reliability
- Alerting on metrics that aren't actionable. Every page should have a corresponding runbook with clear steps. If the on-call engineer can't do anything about an alert, it shouldn't be a page
- Ignoring on-call load distribution. Some weeks are quiet, others are brutal. Track pages per rotation and rebalance if certain shifts consistently get hit harder
- Skipping the on-call handoff. A 15-minute sync between outgoing and incoming on-call (open incidents, known risks, upcoming deployments) prevents context loss and repeated triage
Related to On-Call Rotation Design
Incident Management Process, SRE Team Structure
Remote-First Org Design — Scaling Organizations
Difficulty: Intermediate. Org Size: Growth
Key Points for Remote-First Org Design
- Remote-first is not the same as remote-friendly. Remote-first means every process, meeting, and decision is designed for distributed participants by default. If your office employees have an information advantage over remote ones, you're remote-friendly at best
- Async-first communication requires investing heavily in written documentation. RFCs, decision logs, and project updates should be written, not spoken. GitLab's handbook (2,000+ pages, publicly available) is the gold standard for this
- Time zone band strategies group engineers into overlapping windows (Americas, EMEA, APAC) with 3-4 hours of daily overlap for synchronous collaboration. Trying to have everyone overlap with everyone creates meetings at terrible hours for somebody
- Onboarding remote engineers takes deliberate structure. A 30-60-90 day plan, an assigned onboarding buddy, recorded walkthroughs of key systems, and weekly check-ins with their manager are all non-negotiable
- Meeting-free days (Automattic uses them, Shopify has 'No Meeting Wednesdays') protect deep work time. Two consecutive meeting-free days per week is even better for flow state
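The time zone band strategy above comes down to computing daily overlap between working windows. A minimal sketch, with illustrative band hours in UTC:

```python
def overlap_hours(window_a, window_b):
    """Daily overlap in hours between two working windows.
    Windows are (start_hour, end_hour) tuples in UTC; same-day windows only."""
    start = max(window_a[0], window_b[0])
    end = min(window_a[1], window_b[1])
    return max(0, end - start)

# Illustrative bands: a New York 9-5 is roughly 13-21 UTC, London's is 9-17 UTC
print(overlap_hours((13, 21), (9, 17)))  # 4 shared hours: enough for sync work
print(overlap_hours((0, 8), (13, 21)))   # 0 shared hours: async-only pairing
```

Running this across every pair of bands before assigning people to teams shows which pairings meet the 3-4 hour overlap bar and which must be fully async.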
Common Mistakes with Remote-First Org Design
- Defaulting to video calls for everything. If it can be a Loom recording, a Notion doc, or a Slack thread, it should be. Meetings should be reserved for decisions that require real-time discussion
- Measuring productivity by online status or hours logged instead of output. Surveillance tools like keystroke loggers destroy trust and push your best engineers to leave
- Assuming remote engineers will figure out the culture on their own. Without deliberate social infrastructure (virtual coffee chats, team offsites, interest-based Slack channels), remote teams become isolated individuals who happen to share a Jira board
- Keeping all-hands meetings at a time that only works for headquarters. Rotate meeting times or run multiple sessions to give every time zone a fair shot
Related to Remote-First Org Design
Scaling 10 to 50 Engineers, Scaling 50 to 200 Engineers
Scaling 10 to 50 Engineers — Scaling Organizations
Difficulty: Advanced. Org Size: Startup
Key Points for Scaling 10 to 50 Engineers
- This is the shift from 'everyone knows everything' to 'we need some structure.' It's the first real organizational growing pain
- Your first engineering managers show up here, usually strong tech leads who now split their time between people management and hands-on coding
- Specialization kicks in. Generalists start gravitating toward frontend, backend, infrastructure, or data as the codebase grows past what one person can hold in their head
- Communication overhead scales fast. At 10 people you have 45 possible communication channels. At 50 you have 1,225
- Process needs to be introduced on purpose. Standups, sprint planning, design reviews, and on-call rotations step in where hallway conversations used to work
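The communication overhead numbers above come from the pairwise channel formula n*(n-1)/2:

```python
def channels(n: int) -> int:
    """Possible pairwise communication channels among n people: n*(n-1)/2."""
    return n * (n - 1) // 2

print(channels(10))  # 45
print(channels(50))  # 1225
```

The quadratic growth is the whole argument for structure: teams and interfaces exist to prune most of those 1,225 channels down to a manageable few.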
Common Mistakes with Scaling 10 to 50 Engineers
- Promoting your best engineer to manager without any training or support. Management is a different skill set entirely, and losing a great IC while gaining a struggling manager is a double loss
- Avoiding process because 'we're still a startup.' The chaos that worked at 10 people turns into dysfunction at 30
- Hiring too quickly without onboarding infrastructure. New engineers joining a startup with no documentation or mentoring take 3-6 months to get productive instead of 3-6 weeks
- Not writing down technical standards early. Code style, review requirements, deployment process, and incident response should be documented before you forget the oral traditions
Related to Scaling 10 to 50 Engineers
Team Topologies Overview, Scaling 50 to 200 Engineers
Scaling 50 to 200 Engineers — Scaling Organizations
Difficulty: Expert. Org Size: Growth
Key Points for Scaling 50 to 200 Engineers
- This is the 'messy middle,' too big for startup informality but too small for enterprise bureaucracy. You have to find a careful balance between structure and speed
- New organizational layers show up. Directors manage managers, VPs manage directors, and the CEO can no longer stay close to every technical decision
- Platform teams become a necessity. Without shared infrastructure, every team ends up reinventing deployment, monitoring, and data pipelines on their own
- Technical governance moves from relying on individual experts to collective decision-making through architecture review boards, RFCs, and tech radar processes
- Preserving culture takes deliberate work. As new hires outnumber the original team, the founding culture fades unless you put explicit values and rituals in place
Common Mistakes with Scaling 50 to 200 Engineers
- Adding management layers without actually reducing the coordination burden. More managers should mean less cross-team coordination, not more meetings
- Building a platform team that's too large or too early. Start with 3-4 engineers tackling the most painful shared problems, then grow based on real demand
- Letting technical standards live as tribal knowledge. At 200 engineers, if it's not written in an RFC or ADR, it basically doesn't exist
- Reorging too often. Organizational changes take 3-6 months to settle in, and constant restructuring wrecks team cohesion and trust
Related to Scaling 50 to 200 Engineers
Scaling 10 to 50 Engineers, Conway's Law & Inverse Maneuver
Security Team Integration — Team Structure
Difficulty: Advanced. Org Size: Enterprise
Key Points for Security Team Integration
- Application Security (AppSec), Security Operations (SecOps), and Governance/Risk/Compliance (GRC) are three distinct functions that require different skill sets, tools, and team structures. Treating security as one monolithic team breaks down past 5-6 people
- The security champions model scales security knowledge without scaling the security team. Volunteer engineers from each product team get training, attend security guild meetings, and serve as the first line of review for their team's code
- Shifting left means integrating security checks into CI/CD pipelines (SAST with Semgrep, SCA with Snyk, secret scanning with GitLeaks) so issues get caught before code review, not after deployment
- Security review processes should be tiered. Low-risk changes get automated scanning only. Medium-risk changes get a security champion review. High-risk changes (auth flows, payment logic, new third-party integrations) get a full AppSec team review
- The CISO reporting structure signals organizational priority. Reporting to the CTO keeps security close to engineering. Reporting to the CEO or board gives security independence but can create distance from the teams doing the work
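The tiered review process above can be sketched as a routing function over changed file paths. The path patterns here are assumptions for illustration; a real policy would match your repo layout:

```python
# Illustrative risk tiering; path patterns are assumptions for the sketch.
HIGH_RISK_PATHS = ("auth/", "payments/", "vendor/")

def review_tier(changed_paths) -> str:
    """Route a change by risk: high-risk paths get a full AppSec review,
    infra changes get a security-champion review, the rest automated scans."""
    if any(p.startswith(HIGH_RISK_PATHS) for p in changed_paths):
        return "appsec-review"
    if any(p.endswith((".tf", "Dockerfile")) for p in changed_paths):
        return "champion-review"
    return "automated-scan"

print(review_tier(["auth/login.py"]))   # appsec-review
print(review_tier(["infra/main.tf"]))   # champion-review
print(review_tier(["docs/readme.md"]))  # automated-scan
```

Wiring a check like this into CI means the tiering is applied consistently on every PR instead of depending on someone remembering to tag security.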
Common Mistakes with Security Team Integration
- Making the security team a gate that every PR must pass through. This creates a bottleneck that slows shipping to a crawl and breeds resentment between security and engineering
- Hiring security specialists before establishing baseline security hygiene. SSO, dependency scanning, and secret rotation should be in place before you hire your first dedicated security engineer
- Running annual penetration tests as your only security assessment. Point-in-time assessments miss the vulnerabilities introduced between tests. Continuous scanning catches what annual pentests cannot
Related to Security Team Integration
Team Topologies Overview, Scaling 50 to 200 Engineers
SRE Team Structure — Team Structure
Difficulty: Advanced. Org Size: Growth
Key Points for SRE Team Structure
- Google's original SRE model caps operational work at 50% of an SRE's time. The other 50% goes to engineering projects that improve reliability. If toil exceeds 50%, tickets get redirected back to the development team until the balance is restored
- The standard ratio is 1 SRE for every 8-10 developers. Staffing below that means your SREs become permanent firefighters with no time for systemic improvements
- Embedded SREs sit inside product teams and build deep domain knowledge. Centralized SREs maintain consistency across the org but risk becoming a bottleneck. Most mature orgs use a hybrid: a central SRE platform team plus embedded SREs for critical services
- SRE, DevOps, and Platform Engineering are not the same thing. SREs own service reliability and SLOs. DevOps is a cultural philosophy about shared ownership. Platform Engineering builds internal developer tools and infrastructure
- Production readiness reviews (PRRs) are gating checks before a service goes live. They cover monitoring, alerting, runbooks, capacity planning, and failure modes. Without PRRs, teams ship services that nobody knows how to operate at 3 AM
Common Mistakes with SRE Team Structure
- Hiring SREs before you have enough production services to justify the role. If you have fewer than 5-6 production services, a senior backend engineer with ops experience can fill the gap
- Treating SRE as a rebranded ops team. If your SREs aren't writing code to automate away toil, you've just renamed your sysadmins
- Letting SREs own all on-call without developer participation. This creates a moral hazard where developers ship unreliable code because someone else deals with the consequences
Related to SRE Team Structure
Incident Management Process, Team Topologies Overview
Team Topologies Overview — Team Structure
Difficulty: Intermediate. Org Size: Growth
Key Points for Team Topologies Overview
- Four team types form the backbone of this framework: stream-aligned, platform, enabling, and complicated-subsystem. Each one serves a specific purpose and has its own ownership model
- Stream-aligned teams handle end-to-end delivery for a business capability, cutting down handoffs and keeping work flowing smoothly
- Platform teams take shared problems off other teams' plates by offering self-service tools for CI/CD, observability, and infrastructure
- Enabling teams act as temporary coaches. They help other teams pick up new practices, then move on before they become a dependency
- Complicated-subsystem teams handle domains that demand deep specialist knowledge (think ML models, video codecs, or financial engines) so stream-aligned teams aren't overwhelmed
Common Mistakes with Team Topologies Overview
- Treating team topologies as a one-time reorg rather than something that should keep evolving as the business and technology change
- Standing up platform teams before stream-aligned teams are actually struggling. If nobody feels the pain yet, you're solving problems that don't exist
- Letting enabling teams stick around too long. They should coach, hand off knowledge, and dissolve, not become permanent gatekeepers
- Skipping the conversation about interaction modes. Teams need to agree upfront on whether they're collaborating, providing X-as-a-Service, or facilitating
Related to Team Topologies Overview
Conway's Law & Inverse Maneuver, Scaling 10 to 50 Engineers
Technical Program Management — Operational Excellence
Difficulty: Advanced. Org Size: Enterprise
Key Points for Technical Program Management
- You need a TPM when your Slack channels are full of 'who owns this?' questions. The role exists to create clarity across organizational seams, not to manage tasks within a single team
- The real TPM skill is knowing when to escalate and when to quietly resolve. A TPM who escalates everything burns political capital. A TPM who escalates nothing lets programs die slowly in dependency hell
- Google runs roughly 1 TPM per 25 engineers on high-priority programs. Most orgs should start around 1:50 and tighten the ratio for programs with heavy cross-org dependencies or regulatory pressure
- Risk tracking only works as a weekly practice. Stripe's TPMs review their risk register in every program sync, with each risk assigned a single owner and a concrete mitigation by the next review. The risks you track are the ones you manage
- The best TPMs build program plans that fit in two pages. A milestone tracker in Linear or Jira, a dependency map on a whiteboard, and a decision log. Skip the 200-row Gantt chart
Common Mistakes with Technical Program Management
- Hiring TPMs who can't read a system design doc. If your TPM can't challenge an engineering estimate or spot an architectural bottleneck, you've hired a project coordinator at a Staff+ salary
- Using TPMs as status reporters. If the primary output is a weekly email summarizing what engineers already know, the role is being wasted. TPMs should be unblocking, not transcribing
- Assigning TPMs to single-team projects. An EM can run a project that lives inside one team. TPMs exist for the messy space between teams where nobody has clear ownership
- Letting TPMs own scope decisions. The moment a TPM starts deciding what to build instead of how to deliver it, you've blurred the PM/TPM boundary and created confusion about who speaks for the customer
Related to Technical Program Management
Scaling 50 to 200 Engineers, Incident Management Process