ML Platform & AI Golden Paths
Why ML Needs Its Own Golden Paths
Standard software golden paths assume a pretty straightforward lifecycle: write code, test, deploy, monitor. ML workflows break that assumption. They involve large datasets, expensive compute, non-deterministic outputs, and a feedback loop where production performance quietly degrades over time without retraining.
Getting a model from "I have an idea" to "it is serving production traffic" typically means a dozen handoffs, three different compute environments, and at least two teams who do not share a common vocabulary. In organizations without ML golden paths, the median time from a successful experiment to a production deployment is 3-6 months. With a well-built ML platform, that drops to days.
The golden path approach works the same way for ML as it does for application development: make the right thing the easy thing. A data scientist should be able to run ml-platform deploy --model=fraud-v3 --canary=10% and get a production deployment with monitoring, alerting, and rollback capability. Everything underneath (the container builds, the Kubernetes scheduling, the load balancer configuration) should be invisible.
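To make that concrete, here is a minimal sketch of the thin wrapper such a command could be, using Python's argparse. The subcommand and flags mirror the example above; the deploy logic itself is stubbed out, because it is exactly the part the platform is supposed to hide.

```python
# Sketch of a golden-path deploy command. The "ml-platform deploy"
# interface mirrors the example above; everything behind the final
# print is the platform's job (builds, scheduling, load balancing).
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="ml-platform")
    sub = parser.add_subparsers(dest="command", required=True)

    deploy = sub.add_parser("deploy", help="Deploy a registered model")
    deploy.add_argument("--model", required=True,
                        help="Model name and version, e.g. fraud-v3")
    deploy.add_argument("--canary", default="10%",
                        help="Share of traffic for the new version")

    args = parser.parse_args()
    canary_pct = int(args.canary.rstrip("%"))
    # Container build, Kubernetes scheduling, load balancer config,
    # and monitoring hooks all happen behind this one call.
    print(f"Deploying {args.model} as a {canary_pct}% canary...")

if __name__ == "__main__":
    main()
```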
The ML Platform Layer Cake
Think of the ML platform as four layers, each building on the one below.
Data layer provides access to features, training datasets, and evaluation data. This includes a feature store for real-time and batch features, a data catalog so scientists can discover what data exists, and data validation that catches schema drift before it corrupts a model.
Experiment layer provides compute for training, hyperparameter tuning, and evaluation. This includes managed notebook environments, distributed training on GPU clusters, experiment tracking (MLflow, Weights & Biases, or similar), and artifact storage for trained models.
Deployment layer packages models and serves them in production. This includes a model registry for versioning, model serving infrastructure (Seldon, KServe, or custom), deployment strategies (canary, shadow, A/B), and automated rollback on metric degradation.
Monitoring layer tracks model health in production. This includes prediction quality metrics, feature drift detection, data quality monitoring, and alerting when model performance drops below thresholds.
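Drift detection in particular can start small. Here is a sketch, not any vendor's API, that flags a feature whose recent production values no longer look like its training distribution, using a two-sample Kolmogorov-Smirnov test from scipy; the threshold and data are illustrative.

```python
# Sketch of per-feature drift detection with a two-sample
# Kolmogorov-Smirnov test. Threshold and window are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values: np.ndarray,
                    live_values: np.ndarray,
                    p_threshold: float = 0.01) -> bool:
    """Flag drift when live values are unlikely to share the
    training distribution."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold

# Example: alert if the last hour of a feature's production values
# has drifted away from the training set.
train = np.random.default_rng(0).normal(0.0, 1.0, 10_000)
live = np.random.default_rng(1).normal(0.5, 1.0, 2_000)  # shifted mean
if feature_drifted(train, live):
    print("feature drift detected: alert and consider retraining")
```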
You do not need to build all four layers at once. Start with the deployment layer. The gap between experiment and production is where most ML projects go to die.
Feature Store as Platform Primitive
The feature store solves a problem that is invisible from the outside but absolutely cripples ML teams from the inside: feature consistency between training and serving.
During training, features are computed in batch from historical data. During serving, those same features need to be computed in real time from live data. If the computation logic differs even slightly (rounding, null handling, time zone conversion), the model's production behavior drifts from what was tested. This is called training-serving skew, and it is one of the most common reasons ML models underperform in production. It is also maddeningly hard to diagnose because everything looks correct in isolation.
A feature store provides a single definition for each feature that is used for both training and serving. Engineers define the feature transformation once. The feature store handles materializing it for batch (training) and online (serving) consumption.
Start simple. A feature store can be as basic as a shared Python library for feature transforms, a Postgres table for online features, and a Parquet dataset in S3 for offline features. You do not need Feast or Tecton on day one. You need consistency.
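A sketch of that minimal version, showing the contract that matters: one transform definition, called by both the offline (training) and online (serving) paths. The feature and column names are illustrative.

```python
# Sketch of the "define once, use twice" contract. The transform lives
# in a shared library; batch and online paths both call it, so training
# and serving cannot drift apart. Names are illustrative.
import pandas as pd

def days_since_last_order(last_order_ts: pd.Timestamp,
                          now: pd.Timestamp) -> float:
    """Single feature definition used for both training and serving."""
    # Null handling and rounding live here, in exactly one place.
    if pd.isna(last_order_ts):
        return -1.0
    return round((now - last_order_ts).total_seconds() / 86_400, 2)

# Offline (training): materialize the feature over historical data.
def build_offline_features(orders: pd.DataFrame,
                           as_of: pd.Timestamp) -> pd.Series:
    return orders["last_order_ts"].map(
        lambda ts: days_since_last_order(ts, as_of))

# Online (serving): compute the same feature for one live request.
def online_feature(last_order_ts: pd.Timestamp) -> float:
    return days_since_last_order(last_order_ts, pd.Timestamp.now())
```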
Model Deployment as Self-Service
The deployment golden path for ML models needs to handle several things that standard application deployments do not.
Model packaging standardizes how models are containerized. Data scientists should not need to write Dockerfiles. The platform takes a model artifact (a pickle file, an ONNX model, a saved TensorFlow graph) and a serving configuration, then produces a container image automatically. This is a small thing that removes an enormous amount of friction.
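As a rough sketch of what "the platform writes the Dockerfile" can mean in practice: a template filled in from the artifact and serving config. The mlserve server module and the config fields here are invented placeholders, not a real library.

```python
# Sketch of automated model packaging: turn an artifact plus a serving
# config into a generated Dockerfile, so data scientists never write
# one by hand. Base image, server module, and fields are illustrative.
from pathlib import Path

DOCKERFILE_TEMPLATE = """\
FROM python:3.11-slim
WORKDIR /app
COPY {artifact} /app/model/{artifact}
RUN pip install --no-cache-dir mlserve=={server_version}
ENV MODEL_PATH=/app/model/{artifact}
EXPOSE {port}
CMD ["python", "-m", "mlserve", "--port", "{port}"]
"""

def build_image_context(artifact: str, serving_config: dict,
                        out_dir: str) -> Path:
    """Write a generated Dockerfile into the container build context."""
    dockerfile = DOCKERFILE_TEMPLATE.format(
        artifact=artifact,
        server_version=serving_config.get("server_version", "1.0"),
        port=serving_config.get("port", 8080),
    )
    path = Path(out_dir) / "Dockerfile"
    path.write_text(dockerfile)
    return path

# Usage: the scientist hands over an artifact and a small config;
# the platform generates the build context and builds the image.
build_image_context("fraud_v3.onnx", {"port": 8080}, ".")
```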
Canary deployment routes a small percentage of traffic to the new model version while the old version handles the rest. The platform monitors prediction quality, latency, and error rates. If the new model degrades on any metric, it rolls back automatically. This is not optional for ML. Models are non-deterministic, and offline evaluation simply does not guarantee production quality.
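A minimal sketch of the canary gate's decision logic, with hypothetical metric names and tolerances; a real platform would pull these numbers from its monitoring layer rather than take them as arguments.

```python
# Sketch of a canary gate: compare the new version's health against
# the baseline and roll back automatically on degradation.
from dataclasses import dataclass

@dataclass
class Health:
    error_rate: float      # fraction of failed predictions
    p99_latency_ms: float

def canary_healthy(canary: Health, baseline: Health,
                   max_error_increase: float = 0.005,
                   max_latency_increase_ms: float = 50.0) -> bool:
    """The canary must not degrade on any metric beyond the tolerances."""
    if canary.error_rate > baseline.error_rate + max_error_increase:
        return False
    if canary.p99_latency_ms > baseline.p99_latency_ms + max_latency_increase_ms:
        return False
    return True

def evaluate_canary(canary: Health, baseline: Health) -> str:
    if canary_healthy(canary, baseline):
        return "promote"   # shift remaining traffic to the new version
    return "rollback"      # instant revert, as described below

print(evaluate_canary(Health(0.011, 180.0), Health(0.004, 120.0)))  # rollback
```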
A/B testing runs two model versions simultaneously for a statistical comparison. This differs from canary because both versions run for a sustained period to measure business metrics, not just technical health. The platform should handle traffic splitting, metric collection, and statistical significance calculation.
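The significance calculation does not need to be exotic. Here is a sketch of a two-sided two-proportion z-test on a binary business metric such as conversion; the sample counts are illustrative.

```python
# Sketch of the significance check for an A/B comparison of two model
# versions on a binary business metric (e.g. conversion rate).
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both versions convert equally."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * norm.sf(abs(z))

# Model A converted 4.1% of 50k requests, model B 4.4% of 50k.
p = two_proportion_z_test(2_050, 50_000, 2_200, 50_000)
print(f"p-value: {p:.4f}")  # ship B only if below your chosen alpha
```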
Rollback must be instant. One command to revert to the previous model version, with traffic shifting in seconds rather than minutes. When a model starts producing bad predictions, every minute of exposure matters.
The self-service contract looks like this: the data scientist provides a trained model artifact and a deployment config. The platform handles everything else. If a scientist needs to file a ticket or attend a meeting to deploy a model, you have not reached self-service yet.
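Spelled out as code, that contract might be no more than this; the field names are illustrative, not any real platform's schema.

```python
# Sketch of the full self-service contract: everything the scientist
# provides, and nothing more. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ModelDeployment:
    artifact_uri: str                  # e.g. path to the trained model
    model_name: str
    version: str
    canary_percent: int = 10           # initial traffic share
    rollback_on_degradation: bool = True
    resources: dict = field(
        default_factory=lambda: {"cpu": "1", "memory": "2Gi"})

request = ModelDeployment(
    artifact_uri="s3://models/fraud/v3/model.onnx",
    model_name="fraud",
    version="v3",
)
# The platform owns everything from here: build, rollout, monitoring.
```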
When to Build Your ML Platform
Timing matters more than architecture here. Build too early and you create infrastructure nobody uses. Build too late and teams have already created incompatible DIY solutions that are painful to migrate away from.
The right time to start is when you have 2-3 ML models running in production that were deployed manually. At that point, you understand the real pain points. You know which steps take the longest, which handoffs cause the most friction, and which operational tasks keep recurring.
Start with the deployment golden path. Get model packaging and serving to self-service. Then add experiment tracking (most teams are already using something ad hoc anyway). Then build the feature store once you have at least two models sharing features.
Budget for a team of 2-3 engineers focused on the ML platform. They should sit close to the data science team, attend their standups periodically, and treat them as customers. The best ML platform teams include at least one person with data science experience who can bridge the vocabulary gap between infrastructure engineers and ML practitioners.
Do not try to build a general-purpose ML platform that covers every possible use case. Build for the 2-3 model types your organization actually runs. A fraud detection model, a recommendation model, and a search ranking model have enough in common to share infrastructure, but enough differences to keep you honest about not over-generalizing.
Key Points
- • ML golden paths cut experiment-to-production time from months to days by standardizing the boring parts: packaging, deployment, monitoring, and rollback.
- • The key design principle is abstracting infrastructure complexity while preserving experiment flexibility. Data scientists should never write Dockerfiles, but they should always control hyperparameters.
- • Three capabilities are essential before anything else: a feature store for consistent feature computation, experiment tracking for reproducibility, and a model registry for versioning and lineage.
- • Self-service model deployment with built-in canary rollouts, A/B testing, and one-click rollback removes the biggest bottleneck in ML organizations.
- • Measure your ML platform by adoption rate and time-to-production. If data scientists are not using it, the platform is wrong, not the scientists.
Common Mistakes
- ✗ Building an ML platform before you have proven ML use cases in production. You need at least 2-3 models running manually before you know what to automate.
- ✗ Over-abstracting the platform so data scientists cannot debug failures. When a training job fails, they need to see logs, not a generic 'pipeline error' message.
- ✗ Treating the ML platform as entirely separate from the application platform. Shared capabilities like CI/CD, secrets management, and observability should be reused, not rebuilt.
- ✗ Not involving data scientists in platform design. Engineers build what is elegant. Scientists need what is practical. These are often different things.