ML Pipeline & Feature Store Architecture
The ML Infrastructure Stack
Most conversations about ML focus on models. Which architecture to pick, how to tune hyperparameters, which framework is hot right now. But the teams that actually succeed with ML in production spend about 80% of their effort on infrastructure, not models.
The ML infrastructure stack has five layers: data ingestion, feature computation, training pipelines, model serving, and monitoring. Each one has its own set of hard problems. The tricky part is that they all need to work together seamlessly. A failure in any layer cascades into model quality degradation that can be incredibly hard to diagnose.
This is platform engineering work. Treat it that way.
Feature Store Architecture
A feature store is probably the most impactful piece of ML infrastructure you can build. It solves one specific, critical problem: making sure the features used during model training are computed identically to the features used during inference.
Without a feature store, training-serving skew creeps in everywhere. A data scientist computes features in a Spark job using one set of transformations. A backend engineer reimplements those transformations in Go for the serving path. The two implementations diverge in subtle ways. Maybe a different rounding behavior, a timezone handling difference, or a window boundary edge case. The model performs worse in production than in evaluation, and nobody can figure out why.
A feature store has three components:
- Feature registry that stores metadata, ownership, schemas, and documentation for every feature. Engineers browse this to discover existing features before building new ones.
- Offline store (typically a data warehouse like BigQuery or Snowflake) that serves historical feature values for training dataset construction. It supports point-in-time joins to prevent label leakage.
- Online store (typically Redis, DynamoDB, or a purpose-built store like Feast's online serving) that serves the latest feature values at low latency for real-time inference.
The feature computation layer writes to both stores. One set of transformation logic, two storage backends, zero skew.
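As a minimal illustration of the point-in-time join the offline store has to support, here is a sketch using `pandas.merge_asof` with hypothetical labels and feature snapshots. For each training label it picks the latest feature value known at or before the label's event time, which is what prevents leakage:

```python
import pandas as pd

# Hypothetical training labels: one row per (user, event time, label).
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-03-10", "2024-03-05"]),
    "label": [0, 1, 0],
})

# Hypothetical offline-store snapshot: feature values with the time they became valid.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-02-20", "2024-03-08", "2024-03-01"]),
    "spend_30d": [120.0, 340.0, 55.0],
})

# merge_asof picks, per label row, the latest feature value at or before event_ts,
# so no feature computed after the label's timestamp can leak into training.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
print(train[["user_id", "event_ts", "spend_30d", "label"]])
```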
Batch vs Real-Time Pipeline Design
Batch pipelines run on a schedule (hourly, daily) and process large volumes of data at once. They are simpler to build, cheaper to run, and easier to debug. Use batch for features that do not need to reflect events from the last few minutes. User profile aggregations, historical spending patterns, document embeddings. Spark, Flink in batch mode, or dbt are the common tools.
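As a rough sketch of what such a batch job can look like, assuming PySpark and a hypothetical transactions table (paths and column names are illustrative), here is a daily aggregation that writes one feature snapshot to the offline store:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_user_features").getOrCreate()

# Hypothetical raw source: one row per transaction.
txns = spark.read.parquet("s3://warehouse/raw/transactions/")

# Daily batch job: aggregate the trailing 30 days of spend per user.
cutoff = F.date_sub(F.current_date(), 30)
features = (
    txns.filter(F.col("txn_date") >= cutoff)
        .groupBy("user_id")
        .agg(
            F.sum("amount").alias("spend_30d"),
            F.count("*").alias("txn_count_30d"),
        )
        .withColumn("feature_ts", F.current_timestamp())
)

# Write one snapshot to the offline store; a separate step would push the
# same rows into the online store so both paths share this logic.
features.write.mode("overwrite").parquet("s3://warehouse/features/user_spend_30d/")
```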
Real-time pipelines process events as they arrive and update feature values within seconds. They are significantly harder to build and operate. Use them only when freshness directly impacts model quality. Fraud detection needs to know about the transaction that happened 30 seconds ago. A recommendation model for a news feed needs to reflect what the user just clicked.
Many teams default to real-time when batch would work perfectly well. A recommendation model that retrains daily does not gain anything from sub-second feature freshness. Match your pipeline latency to your model's actual sensitivity to data staleness.
The hardest architecture is the lambda pattern, where batch and real-time pipelines compute the same features with different technologies and their outputs get merged. This is expensive to operate. Prefer the kappa pattern (streaming-only with replay capability) when you genuinely need real-time, and batch-only when you do not.
Data Versioning and Lineage
You version your code. You version your model weights. You need to version your data too.
Data versioning means you can answer the question: "What exact data was used to train the model that is currently serving production traffic?" Tools like DVC, LakeFS, or Delta Lake's time travel feature give you this capability.
Data lineage means you can trace any model prediction back through the serving features, to the feature computation logic, to the raw source data. When a model starts producing bad predictions, lineage tells you whether the problem is the model, the features, or something upstream in the data.
Build lineage tracking into your pipelines from the start. Retrofitting it later is painful and usually incomplete. Tag every dataset, every feature computation job, and every training run with unique identifiers that can be joined together. Store this metadata in a centralized catalog.
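One way this can look in practice, sketched with hypothetical identifiers and a stand-in content hash (real dataset versioning would come from DVC, LakeFS, or Delta Lake), is to emit a small metadata record per run and join runs through their IDs:

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """One row in a hypothetical central metadata catalog."""
    run_id: str            # unique id for this pipeline run
    run_type: str          # "feature_job", "training_run", ...
    inputs: list           # run_ids or dataset versions consumed
    dataset_version: str   # content fingerprint of the produced dataset
    code_version: str      # git SHA of the transformation code
    created_at: str

def dataset_fingerprint(path: str, row_count: int, schema: list) -> str:
    """Cheap stand-in for a real content hash."""
    payload = json.dumps({"path": path, "rows": row_count, "schema": schema}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

# The feature job emits a record ...
feature_run = LineageRecord(
    run_id=str(uuid.uuid4()),
    run_type="feature_job",
    inputs=["raw.transactions@2024-03-10"],
    dataset_version=dataset_fingerprint("features/user_spend_30d", 1_000_000, ["user_id", "spend_30d"]),
    code_version="git:abc1234",
    created_at=datetime.now(timezone.utc).isoformat(),
)

# ... and the training run references it, so a prediction can be traced back
# through model -> training run -> feature job -> raw source.
training_run = LineageRecord(
    run_id=str(uuid.uuid4()),
    run_type="training_run",
    inputs=[feature_run.run_id],
    dataset_version=feature_run.dataset_version,
    code_version="git:def5678",
    created_at=datetime.now(timezone.utc).isoformat(),
)

catalog = [asdict(feature_run), asdict(training_run)]  # would live in the central catalog
```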
For regulated industries (finance, healthcare), lineage is not just good engineering practice. It is an audit requirement. Regulators will ask you to explain why a specific prediction was made, and "the model just outputs a number" is not an acceptable answer.
Model Serving Infrastructure
Model serving looks simple until you try to do it reliably at scale. You have a trained model artifact. You need to wrap it in an API and serve predictions. What could go wrong?
Quite a lot, actually.
Packaging: Models need to be serialized in a format the serving infrastructure can load. ONNX provides cross-framework compatibility. Framework-specific formats (SavedModel, TorchScript) offer better optimization. Pick one standard for your organization and stick with it.
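For instance, assuming a PyTorch model, exporting to ONNX and loading it in the serving path with onnxruntime might look roughly like this (model and tensor names are illustrative):

```python
import numpy as np
import torch
import onnxruntime as ort

# Hypothetical trained model: a tiny feed-forward scorer.
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
model.eval()

# Export once, with a fixed input signature the serving layer can rely on.
dummy = torch.randn(1, 8)
torch.onnx.export(
    model,
    dummy,
    "scorer.onnx",
    input_names=["features"],
    output_names=["score"],
    dynamic_axes={"features": {0: "batch"}},
)

# Serving path: load with onnxruntime, no PyTorch dependency required.
session = ort.InferenceSession("scorer.onnx", providers=["CPUExecutionProvider"])
batch = np.random.rand(4, 8).astype(np.float32)
(scores,) = session.run(["score"], {"features": batch})
print(scores.shape)  # (4, 1)
```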
Deployment: Treat model deployments exactly like code deployments. Canary rollouts that send 5% of traffic to the new model. Automated rollback if prediction latency or error rates spike. Shadow mode where the new model runs alongside the old one so you can compare outputs without affecting users. Blue-green deployments for zero-downtime switches.
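A minimal sketch of how canary plus shadow routing can be wired, assuming hypothetical stable_model and candidate_model objects that expose a predict method:

```python
import hashlib
import logging

CANARY_FRACTION = 0.05  # 5% of traffic goes to the candidate model

def bucket(request_id: str) -> float:
    """Deterministic 0-1 bucket so the same request always hits the same variant."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

def predict(request_id: str, features, stable_model, candidate_model, shadow: bool = False):
    # Canary: a small, deterministic slice of traffic is served by the new model.
    use_candidate = bucket(request_id) < CANARY_FRACTION
    primary = candidate_model if use_candidate else stable_model
    prediction = primary.predict(features)

    # Shadow mode: the candidate also scores stable traffic, but only for logging,
    # so outputs can be compared offline without affecting users.
    if shadow and not use_candidate:
        shadow_pred = candidate_model.predict(features)
        logging.info("shadow_diff request=%s delta=%s", request_id, shadow_pred - prediction)

    return prediction
```

Hashing the request ID keeps the split deterministic, so a given user sees consistent behavior for the duration of the rollout.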
Scaling: Models have different compute profiles than typical web services. Some are CPU-bound, some need GPUs. Autoscaling policies need to account for model loading time, which can be 30+ seconds for large models. Pre-warm instances before traffic shifts.
Latency budgets: Define a P99 latency SLO for each model endpoint. Include preprocessing, inference, and postprocessing in the budget. If the model itself takes 50ms but feature retrieval takes 200ms, your optimization effort should focus on feature retrieval, not model inference.
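A simple way to make that attribution possible is to time each stage separately and emit distinct metrics; a sketch, with hypothetical feature_client, model, and metrics objects:

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_ms) for one pipeline stage."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0

def handle_request(user_id, feature_client, model, metrics):
    # Measure each stage separately so the P99 budget can be attributed correctly.
    features, t_features = timed(feature_client.get_online_features, user_id)
    prediction, t_infer = timed(model.predict, features)

    metrics.observe("feature_retrieval_ms", t_features)
    metrics.observe("inference_ms", t_infer)
    return prediction
```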
Data Quality and Monitoring
ML systems fail silently. A broken microservice returns 500 errors and triggers alerts. A model receiving corrupted features returns confident-looking predictions that are subtly wrong. Nobody notices until a business metric drops weeks later.
Input monitoring checks that incoming feature values match the distributions seen during training. If a feature that was always between 0 and 1 suddenly has values in the thousands, something upstream changed. Statistical tests like Kolmogorov-Smirnov, or drift metrics like the Population Stability Index, can detect these shifts automatically.
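A sketch of both checks, using scipy's ks_2samp and a hand-rolled PSI over a hypothetical drifted feature (the 0.2 PSI threshold is a common rule of thumb, not a universal constant):

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(reference, current, bins=10):
    """PSI between a training-time reference sample and live feature values."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid log(0) / division by zero on empty buckets.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Hypothetical data: the feature drifted from [0, 1] to values in the thousands.
rng = np.random.default_rng(0)
training_sample = rng.uniform(0, 1, 10_000)
live_sample = rng.uniform(0, 1, 10_000) * 1000

ks_stat, p_value = ks_2samp(training_sample, live_sample)
psi = population_stability_index(training_sample, live_sample)

if psi > 0.2 or p_value < 0.01:
    print(f"feature drift detected: KS={ks_stat:.3f}, PSI={psi:.2f}")
```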
Output monitoring tracks prediction distributions over time. If a fraud model that normally flags 2% of transactions suddenly flags 15%, investigate immediately. It might be a real fraud spike, or it might be a broken feature feeding garbage to the model.
Feedback loops connect model predictions to real-world outcomes. Did the user actually click the recommended item? Was the flagged transaction actually fraudulent? These labels flow back into training data and evaluation metrics. Without them, you are flying blind.
Set up alerts on all three layers. The gap between a data quality issue and its impact on business metrics can be days or weeks. Catching problems at the input monitoring layer gives you a big head start.
Key Points
- Feature stores solve the training-serving skew problem by providing a single source of truth for feature computation logic, used in both offline training and online inference
- ML pipelines operate in two fundamentally different modes (batch and real-time), each with different latency, throughput, and consistency requirements that call for different infrastructure
- Data versioning and lineage tracking are not optional extras. Without them, you cannot reproduce results, debug model regressions, or meet audit requirements
- The feature computation layer is typically the most expensive component in the ML stack. Optimizing it through incremental computation and caching delivers the largest cost savings
- Model serving needs the same deployment rigor as code: canary rollouts, rollback capability, traffic splitting, and health checks are non-negotiable
Common Mistakes
- Letting data scientists build ad-hoc notebook pipelines for prototyping and then trying to push those same pipelines to production without re-engineering them
- Building separate feature computation logic for training and serving, which guarantees subtle numerical differences that degrade model performance in production
- Treating ML infrastructure as a data science problem instead of a platform engineering problem. This leads to fragile, undocumented systems that only their creator can operate
- Ignoring data quality monitoring, which means your model silently degrades for weeks before anyone notices that upstream data changes broke input distributions