Data Team Organization
The Three Pillars
Data Engineering owns the infrastructure. They build and maintain ingestion pipelines, data warehouses, streaming systems, and orchestration (Airflow, Dagster, Prefect). Their output is reliable, well-structured data that other teams can use. Think of them as the platform team for data.
Analytics Engineering owns the transformation and modeling layer. They take raw data and turn it into clean, tested, documented datasets. dbt has become the standard tool here. A good analytics engineer knows SQL deeply, understands the business domain, and writes data models with the same rigor a software engineer brings to application code.
Data Science owns experimentation, ML models, and advanced analytics. They need clean, reliable data as input, which is why they depend on the other two groups. If your data scientists spend 80% of their time cleaning data, that's a sign your data engineering and analytics engineering functions are understaffed.
Centralized vs Embedded
With 3-5 data people, a centralized team is the only model that makes sense. One team, one backlog, one set of standards.
Between 8 and 15 data people, the centralized model starts to crack. Request queues grow long. Product teams complain that data work takes weeks. This is when companies start embedding data people into product teams while keeping a small central platform group that owns shared infrastructure and governance.
Beyond 20 data professionals, you almost certainly need a hub-and-spoke model. The hub maintains the warehouse, shared tooling, and data quality standards. The spokes are embedded analysts and analytics engineers who serve specific product areas.
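The size thresholds above can be written down as a rough rule of thumb. The sketch below is illustrative only: the function name `suggest_team_model` is hypothetical, and because the text says nothing about headcounts of 6-7 or 16-20, the code assumes those fall into the lower band. Treat the cutoffs as starting points for a conversation, not hard rules.

```python
def suggest_team_model(headcount: int) -> str:
    """Map a data team headcount to the org model described in the text.

    Assumption: headcounts the text does not mention (6-7 and 16-20)
    are assigned to the lower band.
    """
    if headcount < 1:
        raise ValueError("headcount must be at least 1")
    if headcount < 8:
        # One team, one backlog, one set of standards.
        return "centralized"
    if headcount <= 20:
        # Embedded data people plus a small central platform group.
        return "embedded with central platform group"
    # Hub owns warehouse, tooling, and quality standards; spokes serve product areas.
    return "hub-and-spoke"


if __name__ == "__main__":
    for size in (4, 12, 25):
        print(size, "->", suggest_team_model(size))
```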
When to Split Into Specialized Teams
The signal is usually workload imbalance. If your data engineers are constantly fighting fires on pipelines while data scientists wait around for clean data, you need to formalize the separation. Similarly, if analysts are writing the same complex SQL transformations over and over, that's a sign you need dedicated analytics engineers building reusable data models.
Reporting Structure
There's no single right answer, but the trend among companies that do data well (Airbnb, Spotify, Netflix) is toward a standalone data organization led by a Head of Data or CDO, with dotted-line relationships to the product and engineering teams they serve. This gives data professionals a clear career path and prevents them from being treated as a service desk.
Key Points
- Data engineering, data science, and analytics engineering are three distinct disciplines with different skills, tools, and career paths. Lumping them together under 'the data team' guarantees that at least one group gets neglected
- The analytics engineering role (popularized by dbt Labs) bridges the gap between raw data engineering and business analytics. Analytics engineers own the transformation layer and build the data models that analysts and scientists depend on
- Centralized data teams maintain consistency in data modeling and tooling but become bottlenecks as request volume grows. Embedded data people move faster for their specific domain but create silos and duplicate work
- Reporting structure matters more than people realize. Data teams under engineering tend to prioritize infrastructure and reliability. Under product, they skew toward experimentation and metrics. As a standalone org under a CDO, they get strategic focus but risk disconnection from the teams they serve
- A modern data stack (Fivetran/Airbyte for ingestion, Snowflake/BigQuery for warehousing, dbt for transformation, Looker/Metabase for BI) shapes team structure. Each layer needs clear ownership
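One lightweight way to make "each layer needs clear ownership" concrete is to write the stack down as data and check it for gaps. The sketch below is a hypothetical illustration: `StackLayer` and `unowned_layers` are made-up names, and the owner assignments follow the pillar descriptions earlier in this article. The text does not say who owns BI, so that layer is deliberately left unassigned to show what the check catches.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StackLayer:
    name: str
    tools: tuple[str, ...]
    owner: Optional[str]  # None means nobody has been assigned yet


# Tools come from the bullet above; owners are assumptions based on the
# "Three Pillars" descriptions. BI ownership is not stated, so it is None.
STACK = [
    StackLayer("ingestion", ("Fivetran", "Airbyte"), "data engineering"),
    StackLayer("warehousing", ("Snowflake", "BigQuery"), "data engineering"),
    StackLayer("transformation", ("dbt",), "analytics engineering"),
    StackLayer("bi", ("Looker", "Metabase"), None),
]


def unowned_layers(stack: list[StackLayer]) -> list[str]:
    """Return the names of layers that have no owning team."""
    return [layer.name for layer in stack if layer.owner is None]


if __name__ == "__main__":
    print("Layers without an owner:", unowned_layers(STACK))
```

A check like this could run in CI against whatever file records your ownership map, so a new tool cannot be added without someone's name next to it.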
Common Mistakes
- ✗Building a data team before you have enough data volume and business questions to justify it. If a single analyst with SQL access can answer 90% of your questions, you don't need a data platform team yet
- ✗Hiring data scientists before data engineers. Your ML models are only as good as the data pipelines feeding them. Get the plumbing right first
- ✗Letting every team build their own metrics definitions. Without a single source of truth for how 'active user' or 'revenue' is calculated, you end up with five dashboards showing five different numbers
- ✗Treating data quality as someone else's problem. The teams producing data need ownership over its quality. Data teams can build validation frameworks, but they can't fix upstream data issues they don't control
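To illustrate the metrics-definition mistake above, here is a minimal sketch of a shared metric registry: each metric is defined once, and a second team trying to redefine it gets an error instead of a fifth dashboard. Everything here is hypothetical, including the class names, the `events` table, and the SQL for 'active user'. In practice this role is often played by a semantic layer or by shared models in your transformation tool rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    sql: str    # the single agreed-upon calculation
    owner: str  # team accountable for the definition


class MetricRegistry:
    """Holds exactly one definition per metric name."""

    def __init__(self) -> None:
        self._metrics: dict[str, MetricDefinition] = {}

    def register(self, metric: MetricDefinition) -> None:
        # Refuse silent redefinition: that is how numbers drift apart.
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} is already defined")
        self._metrics[metric.name] = metric

    def get(self, name: str) -> MetricDefinition:
        return self._metrics[name]


if __name__ == "__main__":
    registry = MetricRegistry()
    registry.register(
        MetricDefinition(
            name="active_user",
            # Hypothetical table and column names, for illustration only.
            sql="SELECT COUNT(DISTINCT user_id) FROM events "
                "WHERE event_date >= CURRENT_DATE - 30",
            owner="analytics engineering",
        )
    )
    print(registry.get("active_user").sql)
```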