What "Production" Actually Means for ML
A model in a Jupyter notebook is a prototype. A model serving predictions to users is a product. The engineering gap between the two is where most enterprise AI initiatives stall. Production ML means the model runs reliably 24/7, degrades gracefully under load, fails safely when inputs are unexpected, and can be updated without downtime.
That description should sound familiar: it is just software engineering. The difference is that ML systems have an additional failure mode beyond bugs. They can produce incorrect outputs silently. A crashing service is obvious. A model that started returning biased predictions three weeks ago because the training distribution shifted is not. This silent failure mode shapes every architectural decision in a production ML platform.
In our work across 200+ enterprise AI programs, the most common reason production ML platforms fail is not model quality. It is infrastructure discipline: no rollback mechanism, no canary deployment, no data drift alerts. The model was fine. The platform was not.
The Production ML Platform Stack
A complete production ML platform has six functional layers. Each layer can be built in-house, assembled from open source, or purchased from a cloud vendor. The choice matters less than ensuring the layer exists and is operated well.
Platform Maturity: Where Most Enterprises Actually Are
Before investing in platform tooling, it helps to be clear-eyed about where your organisation sits today. Most enterprises overestimate their maturity by one level. The four-level model below is deliberately blunt.
Most enterprises sit at Level 1 or early Level 2. The jump from Level 2 to Level 3 is where the real engineering investment happens. Level 4 is mature and typically only cost-justified for teams running 20 or more models in production.
The ML CI/CD Pipeline: What It Must Include
A production ML CI/CD pipeline is not a software CI/CD pipeline with a model artifact attached. It has unique validation gates that standard pipelines do not need.
| Pipeline Stage | What Happens | Priority | Common Failure |
|---|---|---|---|
| Data Validation | Schema checks, null rates, distribution tests against training baseline | Critical | Schema drift silently corrupts serving inputs |
| Feature Engineering Tests | Unit tests for every transformation function, with known-good fixtures | Critical | Training-serving skew from untested transform |
| Model Training | Reproducible training run with pinned dependencies and seed | Critical | Non-reproducible results impede debugging |
| Evaluation Gate | Comparison against champion model on holdout set and sliced cohorts | Critical | Aggregate accuracy hides regression on minority segments |
| Bias and Fairness Checks | Disparity testing across protected attributes before promotion | High | Regulatory exposure if deployed without checks |
| Performance Benchmarks | Latency, throughput, and memory under simulated production load | High | Model that passes accuracy gates fails SLA at volume |
| Canary Deployment | Route 5% of traffic to new model version, compare business metrics | High | Full rollout before business impact visible |
| Rollback Trigger | Automated rollback if error rate or latency exceeds threshold | Important | Manual rollback at 3am takes too long |
The evaluation gate deserves particular attention. Comparing a challenger model against the champion on aggregate metrics is table stakes. What most pipelines miss is slice evaluation: performance on underrepresented subpopulations, edge cases from production error logs, and recently changed business-critical segments. A model can improve by 2% overall while regressing 15% on the customer cohort that matters most. Only slice evaluation catches this.
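A slice-aware gate can be sketched in a few lines. The record shape, cohort labels, and 2% regression threshold below are illustrative assumptions, not any specific platform's API:

```python
from dataclasses import dataclass

# Hypothetical evaluation record: each holdout row carries a cohort label
# alongside the true label and both models' predictions.
@dataclass
class EvalRow:
    cohort: str
    label: int
    champion_pred: int
    challenger_pred: int

def slice_accuracy(rows, model):
    """Accuracy per cohort for one model ('champion' or 'challenger')."""
    by_cohort = {}
    for r in rows:
        pred = r.champion_pred if model == "champion" else r.challenger_pred
        hits, total = by_cohort.get(r.cohort, (0, 0))
        by_cohort[r.cohort] = (hits + (pred == r.label), total + 1)
    return {c: hits / total for c, (hits, total) in by_cohort.items()}

def evaluation_gate(rows, max_slice_regression=0.02):
    """Block promotion if any cohort regresses beyond the threshold,
    even when aggregate accuracy improves."""
    champ = slice_accuracy(rows, "champion")
    chall = slice_accuracy(rows, "challenger")
    regressions = {c: round(champ[c] - chall[c], 4)
                   for c in champ if champ[c] - chall[c] > max_slice_regression}
    return len(regressions) == 0, regressions
```

Run against a holdout set where the challenger wins on aggregate, the gate still fails if a single cohort regresses past the threshold, which is exactly the case aggregate metrics hide.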
Deployment Patterns for Production Models
There is no single correct deployment pattern. The right choice depends on model complexity, latency requirements, update frequency, and the blast radius if something goes wrong. Here are the four patterns used most by mature engineering teams.
In practice, most mature teams combine canary release with shadow mode for the highest-stakes models. Shadow mode builds confidence before any user is exposed. Canary handles the final validation at scale. Blue/green is the fallback for anything that cannot tolerate a gradual rollout.
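The traffic-splitting half of a canary release is simple to sketch. A hash-based split keeps each user pinned to one model version for the duration of the rollout, which keeps per-user business metrics comparable; the function name and 5% default below are illustrative:

```python
import hashlib

def route_model(user_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically route a stable fraction of traffic to the
    challenger. The same user always hits the same version, so cohort
    metrics can be compared cleanly during the canary window."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "challenger" if bucket < canary_fraction else "champion"
```

Hashing rather than random sampling is the design choice that matters here: a per-request coin flip would expose every user to both models and muddy the comparison.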
Training-Serving Skew: The Silent Killer
Training-serving skew is the condition where features are computed one way during training and a different way during serving. It is the most common source of unexplained model performance degradation in production and the hardest to debug retroactively.
The root cause is almost always a feature engineering function that exists in two places: once in a training pipeline (Python, Spark, SQL) and once in a serving path (Java, C++, a different Python version). Any divergence between them produces predictions at serving time that are different from what the model was trained on. The model has not changed. The world has not changed. The predictions are just wrong.
The only reliable fix is a feature store with a unified serving API. Features computed once, served consistently. The same Python function runs in batch training and online serving. Any team that cannot fund a feature store should at minimum enforce strict unit testing with production snapshots as test fixtures.
For teams using enterprise feature stores, this risk is substantially reduced by design. The feature store acts as the single point of truth for feature computation. Training reads from the same store as serving. Skew cannot exist unless the store itself has a bug, which is far easier to test than distributed feature logic spread across codebases.
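Both mitigations can be illustrated together. The `days_since_last_purchase` feature and the snapshot values below are hypothetical: the transform lives in exactly one function imported by both the batch pipeline and the serving path, and production-logged values pin it in a regression test:

```python
# Shared feature logic: one function imported by both the batch training
# pipeline and the online serving path, so the definition cannot diverge.
def days_since_last_purchase(last_purchase_ts: float, now_ts: float) -> float:
    # Clamp negatives: clock skew between services would otherwise emit
    # feature values the model never saw during training.
    return max(0.0, (now_ts - last_purchase_ts) / 86400.0)

# Production snapshot as test fixtures: feature values logged at serving
# time become the expected outputs the function must keep reproducing.
PRODUCTION_SNAPSHOT = [
    # (last_purchase_ts, now_ts, feature_value_logged_in_production)
    (1_700_000_000.0, 1_700_086_400.0, 1.0),
    (1_700_000_000.0, 1_700_000_000.0, 0.0),
    (1_700_086_400.0, 1_700_000_000.0, 0.0),  # clock-skew case clamps to zero
]

def test_no_training_serving_skew():
    for last_ts, now_ts, expected in PRODUCTION_SNAPSHOT:
        assert abs(days_since_last_purchase(last_ts, now_ts) - expected) < 1e-9
```

Any refactor of the transform that changes its output now fails the test before it reaches production, rather than degrading predictions silently afterwards.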
Five Anti-Patterns That Cause Production Failures
These are drawn from incident retrospectives across dozens of enterprise ML programs. Each anti-pattern is common, avoidable, and typically discovered only after it causes a production incident.
What a Minimal Viable Production Platform Looks Like
Not every enterprise needs a six-layer platform on day one. For organisations deploying their first five models in production, a minimal viable production platform (MVPP) can be assembled in eight to twelve weeks. The non-negotiables are:
- Model registry: Even a folder in S3 with a naming convention and a JSON manifest is better than nothing. MLflow is free and adds versioning, metrics, and lineage.
- A deployment script: One script that takes a model registry version and deploys it to the target environment. No manual steps. No tribal knowledge.
- A rollback command: One command that reverts to the previous version. This must be practiced in a staging environment before it is needed in production at 2am.
- Three monitoring metrics: Prediction count (is it serving?), error rate (is it crashing?), and at least one business metric (is it working?). These three cover the majority of production failures.
- A data validation step before training: At minimum, Great Expectations or a custom script that asserts schema, null rates, and value ranges. Garbage in stops at the pipeline, not at the model.
This is enough to run five to ten models without recurring firefighting. Everything beyond this is optimisation. See our guidance on enterprise MLOps model lifecycle management for how to evolve from this baseline as the portfolio grows.
Build vs Buy: How to Frame the Decision
The build-versus-buy question for ML platform components is usually framed wrong. The question is not "is this commercially available?" It is "is this a source of competitive differentiation for us?"
Feature transformation, model serving infrastructure, monitoring pipelines, and CI/CD tooling are not competitive differentiators. They are plumbing. A bank does not win on better Kubernetes YAML. An insurer does not win on a custom-built model registry. These components should be bought or assembled from open source. The differentiation lives in the models themselves, the proprietary data they are trained on, and the domain-specific feature engineering that no vendor can replicate.
The trap enterprises fall into is deciding to build a platform for six months before deploying a single model. By the time the platform is ready, the business requirements have changed and half the engineers who built it have left. Start with a vendor-managed or open source stack. Build what you cannot buy. Do not build what you can.
When evaluating ML platform vendors, the key question is not feature completeness. It is migration cost. Platforms with proprietary storage formats and API lock-in become expensive to exit. Favour vendors who store artifacts in open formats (ONNX, Parquet, JSON) and expose standard APIs. See our MLOps platform selection guide for a full evaluation framework.
The Organisational Side of Platform Engineering
The best-designed platform will fail if the organisational model does not support it. Two structural decisions determine whether a production ML platform gets adopted or worked around.
The first is the team model. A centralised platform team that builds tooling without using it produces tooling that does not match how models get built in practice. The most effective model is embedded ML engineers who work on both models and platform components, with a small core infrastructure team setting standards and running shared services. Platform components should be validated on real use cases before being declared stable.
The second is the ownership model. Every model in production must have a named owner responsible for its performance. Not a team. A person. When a model degrades, someone needs to receive the alert, triage it, and either fix it or escalate. Diffuse ownership means no alert gets acted on. Named ownership means models get maintained.
The AI implementation advisory work we do with enterprise clients almost always includes a platform ownership design phase. Technical architecture without organisational design produces shelfware.
Connecting Platform to Governance
Regulators are increasingly asking enterprise AI teams to demonstrate exactly how a production model makes decisions, what data it was trained on, and how its performance is monitored over time. A well-designed production ML platform makes this answerable. A poorly designed one makes it impossible.
The minimum documentation a model audit requires is: training dataset version and provenance, feature definitions and transformation logic, evaluation methodology including slice analysis, deployment history with timestamps and approvers, and monitoring configuration with alert history. If your platform cannot produce this documentation on demand, you have a governance gap independent of whether you have a regulatory requirement today. Regulatory requirements have a tendency to arrive faster than platforms can be retrofitted.
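One way to make that documentation producible on demand is a manifest object that travels with every registered model version. The field names below mirror the list above but are an assumption, not a standard schema:

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical audit manifest; a real platform would populate these fields
# from the registry, CI system, and monitoring stack rather than by hand.
@dataclass
class AuditManifest:
    model_name: str
    model_version: str
    training_dataset_version: str
    training_dataset_provenance: str
    feature_definitions: dict      # feature name -> transformation logic
    evaluation_methodology: str    # including slice analysis description
    slice_metrics: dict            # cohort -> metric value
    deployment_history: list = field(default_factory=list)  # (timestamp, approver)
    monitoring_alert_history: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for an auditor: stable key order, human-readable."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

If populating this object for a given model is painful, that pain is the governance gap made visible, which is rather the point.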
See our guidance on AI data governance for enterprise programs for how governance requirements should shape platform design from the beginning.
Summary: What Good Looks Like
A production ML platform at Level 3 maturity has: automated CI/CD that runs training, validation, and deployment without human intervention for standard cases; canary deployment on all model updates with business metric gates; a feature store or equivalent that eliminates training-serving skew; monitoring with alerts on at least three metrics per model; full lineage from data to prediction; and a rollback capability that can be exercised in under five minutes.
Most enterprises are six to eighteen months of focused engineering effort from that state. The organisations that get there are not the ones that buy the most expensive platform. They are the ones that treat ML engineering as a first-class software engineering discipline and staff it accordingly.
If your team is assessing where you are and what it would take to get to Level 3, the AI Readiness Assessment includes a production platform maturity evaluation as a standard component.