Why MLOps CI/CD Is Different from Software CI/CD
Software CI/CD pipelines verify that code behaves correctly — unit tests pass, the binary compiles, the API returns the right status code. ML pipelines must verify something fundamentally different: that a statistical model generalises well enough to be useful in production, and that its generalisation hasn't degraded since the last release.
This distinction drives a cascade of differences. ML CI/CD must manage data versioning alongside code versioning: a model trained on last week's data is a different artifact from one trained on today's data, even if the code is identical. It must enforce statistical evaluation gates: a 0.3 percentage-point accuracy gain on the holdout set does not make a model better if the gain falls within the sampling noise of that holdout set. And it must handle non-determinism: two training runs with the same seed on the same data can still produce slightly different artifacts, because parallel and GPU floating-point operations are not guaranteed to execute in the same order, which makes byte-level comparison of artifacts meaningless.
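To make the noise argument concrete, the sketch below estimates how wide the uncertainty around a single accuracy measurement is. The numbers (a 90%-accurate model, 10,000 holdout rows) are hypothetical, and the normal approximation is only a rough guide; a paired comparison of two models on the same rows is usually tighter.

```python
import math


def accuracy_ci_half_width(accuracy: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for accuracy."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return z * standard_error


# Hypothetical setup: a 90%-accurate model scored on 10,000 holdout rows.
half_width = accuracy_ci_half_width(accuracy=0.90, n_samples=10_000)
observed_gain = 0.003  # the 0.3 percentage-point improvement discussed above

print(f"95% CI half-width: {half_width:.4f}")
print("gain exceeds noise" if observed_gain > half_width else "gain is within noise")
```

Under these assumptions the interval is roughly twice as wide as the observed gain, which is why a raw metric comparison is not a sufficient gate.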
The practical solution is to treat ML CI/CD as three nested pipelines: a training pipeline that reproducibly builds a model artifact from versioned data and versioned code, an evaluation pipeline that statistically validates that artifact against a champion model, and a deployment pipeline that promotes the validated artifact through environments using progressive traffic-shifting strategies.
Training Pipeline: data versioning, feature engineering, hyperparameter search, artifact registration
Evaluation Pipeline: champion vs challenger, statistical significance, data drift, performance thresholds
Deployment Pipeline: shadow mode, canary release, A/B test, full rollout, rollback triggers
Reproducible Training with DVC and MLflow
DVC (Data Version Control) solves the data versioning problem by tracking dataset versions in Git-compatible metadata files while storing the actual bytes in object storage (S3, GCS, Azure Blob). A dvc.yaml pipeline definition describes each stage (data download, preprocessing, feature engineering, training) as a node in a DAG with explicit inputs, outputs, and commands. DVC's cache layer skips stages whose inputs haven't changed, giving you incremental pipeline execution without manual caching logic.
# dvc.yaml: reproducible ML pipeline definition
stages:
  preprocess:
    cmd: python src/preprocess.py --input data/raw --output data/processed
    deps:
      - src/preprocess.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/train.py --data data/processed --output models/
    deps:
      - src/train.py
      - data/processed
      - params.yaml
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py --model models/model.pkl --data data/test
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test
    metrics:
      - metrics/eval_metrics.json:
          cache: false

Within each stage, MLflow Tracking records every run with its hyperparameters, per-epoch metrics, and the final model artifact. The training script wraps the fit loop in an mlflow.start_run() context manager and logs everything to a central MLflow tracking server backed by PostgreSQL and S3.
# src/train.py: training with MLflow tracking
import json
import os

import joblib
import mlflow
import mlflow.sklearn
import yaml
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# load_features() and get_dvc_commit() are project helpers that are not shown here.


def train(data_path: str, output_path: str):
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]
    X_train, y_train = load_features(data_path)

    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.set_tag("pipeline_stage", "train")
        mlflow.set_tag("dvc_commit", get_dvc_commit())

        model = GradientBoostingClassifier(**params)
        model.fit(X_train, y_train)

        train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
        mlflow.log_metric("train_roc_auc", train_auc)

        # Log the model and register it in the MLflow Model Registry
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="churn-predictor",
            input_example=X_train.iloc[:5],
        )

    # --output is a directory (models/); dvc.yaml expects models/model.pkl inside it
    os.makedirs(output_path, exist_ok=True)
    joblib.dump(model, os.path.join(output_path, "model.pkl"))
    os.makedirs("metrics", exist_ok=True)
    with open("metrics/train_metrics.json", "w") as f:
        json.dump({"train_roc_auc": train_auc, "run_id": run.info.run_id}, f)
    return run.info.run_id

Note
Pin exact dependency versions in a conda.yaml or requirements.txt and log it as an MLflow artifact. A model trained on scikit-learn 1.3 may silently behave differently when loaded with scikit-learn 1.5; version pinning is the only defense.

Statistical Evaluation Gates: Champion vs Challenger
An evaluation gate is a CI check that compares the challenger model (the new artifact) against the champion model (the currently deployed artifact) on a held-out evaluation dataset. If the challenger fails to meet the gate conditions, the pipeline fails and the model is blocked from promotion — just as a failing unit test blocks a software deployment.
Three types of gates belong in every MLOps pipeline. An absolute threshold gate checks that the challenger meets a minimum acceptable performance floor, for example ROC-AUC above 0.75. A relative improvement gate checks that the challenger outperforms the champion by a statistically significant margin; a challenger that is marginally better but within noise should not trigger a deployment. The script below applies the lenient form of this gate: it blocks promotion when a bootstrap confidence interval cannot rule out a meaningful regression. A bias and fairness gate checks that the model's performance doesn't degrade disproportionately for protected subgroups.
For production systems backed by a Feast feature store, the evaluation dataset should be generated using the same feature retrieval path as inference — fetching point-in-time correct historical features from the offline store — to prevent evaluation-to-serving skew from inflating the evaluation metrics.
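Point-in-time correctness means that each evaluation row only sees feature values that already existed when the label event happened. The sketch below illustrates that rule with the standard library only; it is not Feast's API, and `feature_history` and `feature_as_of` are hypothetical names used for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one customer: (timestamp, value), sorted by time.
feature_history = [
    (datetime(2024, 1, 1), 3),
    (datetime(2024, 1, 10), 5),
    (datetime(2024, 1, 20), 9),
]


def feature_as_of(history, event_time):
    """Return the latest feature value recorded at or before event_time."""
    timestamps = [ts for ts, _ in history]
    position = bisect_right(timestamps, event_time)
    if position == 0:
        return None  # no feature value existed yet at event time
    return history[position - 1][1]


# A label observed on 15 January must not see the value written on 20 January.
value_at_label_time = feature_as_of(feature_history, datetime(2024, 1, 15))
value_before_history = feature_as_of(feature_history, datetime(2023, 12, 31))
print(value_at_label_time, value_before_history)
```

Joining on the most recent value instead of the as-of value would leak future information into the evaluation set and make the offline metrics look better than production will.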
# src/evaluate.py: champion vs challenger evaluation gate
import mlflow
import mlflow.sklearn
import numpy as np
from mlflow import MlflowClient
from sklearn.metrics import roc_auc_score

# load_features() is a project helper that is not shown here.


def evaluate_challenger(challenger_run_id: str, eval_data_path: str):
    client = MlflowClient()

    # Load challenger
    challenger_uri = f"runs:/{challenger_run_id}/model"
    challenger = mlflow.sklearn.load_model(challenger_uri)

    # Load current champion from Model Registry (Production stage)
    champion_versions = client.get_latest_versions(
        "churn-predictor", stages=["Production"]
    )
    champion = None
    if champion_versions:
        champion = mlflow.sklearn.load_model("models:/churn-predictor/Production")

    X_eval, y_eval = load_features(eval_data_path)
    y_true = np.asarray(y_eval)  # positional indexing for the bootstrap below
    challenger_preds = challenger.predict_proba(X_eval)[:, 1]
    challenger_auc = roc_auc_score(y_true, challenger_preds)

    # Gate 1: absolute threshold
    assert challenger_auc >= 0.75, (
        f"Challenger AUC {challenger_auc:.4f} below minimum threshold 0.75"
    )

    # Gate 2: champion comparison via a paired bootstrap on the AUC difference
    if champion is not None:
        champion_preds = champion.predict_proba(X_eval)[:, 1]
        champion_auc = roc_auc_score(y_true, champion_preds)

        n_bootstrap = 1000
        diffs = []
        rng = np.random.default_rng(42)
        for _ in range(n_bootstrap):
            # Resample the same rows for both models so the comparison stays paired
            idx = rng.integers(0, len(y_true), len(y_true))
            c_auc = roc_auc_score(y_true[idx], challenger_preds[idx])
            p_auc = roc_auc_score(y_true[idx], champion_preds[idx])
            diffs.append(c_auc - p_auc)
        ci_lower, ci_upper = np.percentile(diffs, [5, 95])
        print(f"Champion AUC: {champion_auc:.4f} | Challenger AUC: {challenger_auc:.4f}")
        print(f"90% CI on improvement: [{ci_lower:.4f}, {ci_upper:.4f}]")

        # Non-inferiority check: fail unless the CI rules out a drop of more than 0.005 AUC
        assert ci_lower > -0.005, (
            f"Challenger may be worse than champion (CI lower: {ci_lower:.4f})"
        )

    # Gate 3: per-segment fairness check
    check_fairness_by_segment(X_eval, y_true, challenger_preds)
    return {"challenger_auc": challenger_auc, "gate": "passed"}


def check_fairness_by_segment(X, y, preds, max_gap=0.08):
    overall_auc = roc_auc_score(y, preds)
    for segment_col in ["age_group", "region"]:
        if segment_col not in X.columns:
            continue
        for segment_val in X[segment_col].unique():
            mask = (X[segment_col] == segment_val).to_numpy()
            seg_auc = roc_auc_score(y[mask], preds[mask])
            gap = abs(overall_auc - seg_auc)
            assert gap <= max_gap, (
                f"Fairness gate failed: {segment_col}={segment_val} "
                f"AUC gap {gap:.4f} > {max_gap}"
            )

Model Registry: Staging to Production Promotion
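The MLflow Model Registry provides a centralised store for model versions with lifecycle stages: None → Staging → Production → Archived. A model version can only reach Production after the evaluation pipeline has passed and a human reviewer has approved the promotion, which is enforced via a GitHub Actions environment with required reviewers. Newer MLflow releases (2.9 onward) deprecate stages in favour of model version aliases, so check which API your tracking server version supports before adopting the stage-based calls used here.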
The promotion workflow looks like this: the training pipeline registers a new model version in None stage, the evaluation pipeline transitions it to Staging on pass, and a manual approval step (or an automated canary analysis) transitions it to Production, automatically archiving the previous champion.
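# scripts/promote_model.py: registry promotion logic
import argparse

from mlflow import MlflowClient


def promote_to_staging(model_name: str, run_id: str) -> str:
    client = MlflowClient()
    # Find the version registered from this run
    for mv in client.search_model_versions(f"name='{model_name}'"):
        if mv.run_id == run_id:
            client.transition_model_version_stage(
                name=model_name,
                version=mv.version,
                stage="Staging",
                archive_existing_versions=False,
            )
            client.set_model_version_tag(
                model_name, mv.version, "evaluation_gate", "passed"
            )
            return mv.version
    raise ValueError(f"No model version found for run_id={run_id}")


def promote_to_production(model_name: str, version: str):
    client = MlflowClient()
    # Promote this version; archive_existing_versions archives the current Production one
    client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )
    client.set_model_version_tag(
        model_name, version, "promoted_by", "ci-pipeline"
    )


if __name__ == "__main__":
    # Minimal CLI for the workflow calls `staging --run-id ...` and `production --version ...`.
    # The --model-name flag is an assumption added for illustration.
    parser = argparse.ArgumentParser()
    parser.add_argument("target", choices=["staging", "production"])
    parser.add_argument("--run-id")
    parser.add_argument("--version")
    parser.add_argument("--model-name", default="churn-predictor")
    args = parser.parse_args()
    if args.target == "staging":
        print(promote_to_staging(args.model_name, args.run_id))  # stdout feeds $VERSION
    else:
        promote_to_production(args.model_name, args.version)

GitHub Actions: Full MLOps CI/CD Workflow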
The GitHub Actions workflow below ties together DVC pipeline execution, MLflow experiment tracking, evaluation gates, and model registry promotion. It runs on every push to main and on a nightly schedule for automatic retraining. The deploy job targets a production GitHub Actions environment that requires at least one reviewer approval before running.
# .github/workflows/mlops-pipeline.yml
name: MLOps CI/CD Pipeline
on:
  push:
    branches: [main]
    paths: ["src/**", "params.yaml", "dvc.yaml", "data.dvc"]
  schedule:
    - cron: "0 2 * * *"  # nightly retraining at 02:00 UTC
  workflow_dispatch:
    inputs:
      force_retrain:
        type: boolean
        default: false
      # Declared because monitoring/trigger_retraining.py sends them with the dispatch
      trigger_reason:
        type: string
        required: false
      drift_share:
        type: string
        required: false
env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_DEFAULT_REGION: eu-west-1
jobs:
  train:
    runs-on: ubuntu-latest
    outputs:
      run_id: ${{ steps.train.outputs.run_id }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # assumption: match the training environment
      - name: Install dependencies
        run: pip install -r requirements.txt  # assumed to pin dvc[s3], mlflow, scikit-learn
      - name: Configure DVC remote (S3)
        run: |
          # --local keeps the credentials out of the Git-tracked DVC config
          dvc remote modify --local myremote access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
          dvc remote modify --local myremote secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Pull versioned data
        run: dvc pull data/processed data/test
      - name: Run DVC training pipeline
        id: train
        run: |
          dvc repro train evaluate
          RUN_ID=$(jq -r '.run_id' metrics/train_metrics.json)
          echo "run_id=$RUN_ID" >> "$GITHUB_OUTPUT"
      - name: Push pipeline outputs to DVC remote
        # Metrics files use cache: false, so DVC does not store them; this pushes the cached outputs
        run: dvc push
  evaluate:
    needs: train
    runs-on: ubuntu-latest
    outputs:
      model_version: ${{ steps.gate.outputs.model_version }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation gate
        id: gate
        env:
          CHALLENGER_RUN_ID: ${{ needs.train.outputs.run_id }}
        run: |
          python scripts/evaluate_challenger.py --run-id "$CHALLENGER_RUN_ID" --eval-data data/test
          VERSION=$(python scripts/promote_model.py staging --run-id "$CHALLENGER_RUN_ID")
          echo "model_version=$VERSION" >> "$GITHUB_OUTPUT"
      - name: Comment evaluation results on PR
        # Only runs if a pull_request trigger is added under `on:`
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const results = JSON.parse(
              require('fs').readFileSync('metrics/eval_metrics.json', 'utf8')
            );
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Model Evaluation Results\n` +
                `- Champion AUC: ${results.champion_auc}\n` +
                `- Challenger AUC: ${results.challenger_auc}\n` +
                `- Gate: ${results.gate}`,
            });
  deploy:
    needs: [train, evaluate]
    runs-on: ubuntu-latest
    environment: production  # requires human approval
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Promote model to Production
        run: |
          python scripts/promote_model.py production --version ${{ needs.evaluate.outputs.model_version }}
      - name: Trigger canary deployment
        run: |
          kubectl patch inferenceservice churn-predictor -n ml-serving --type=merge \
            --patch='{"spec":{"predictor":{"model":{"storageUri":"s3://models/churn-predictor/v${{ needs.evaluate.outputs.model_version }}"}}}}'

Note
Cache Python dependencies with actions/cache and the pip wheel cache directory. A cold install of scikit-learn, pandas, and their transitive dependencies can add 3-4 minutes to a workflow; caching reduces it to under 20 seconds on a cache hit.

Deployment Strategies: Shadow, Canary, and A/B
Unlike software deployments where a new version either works or doesn't, ML deployments must evaluate whether a model's statistical performance in production matches its offline evaluation metrics. Three deployment strategies form a progression of confidence: shadow mode, canary release, and full A/B test.
Shadow Mode
The challenger model receives a copy of every production request but its predictions are logged, not returned to users. After 24-48 hours of shadow traffic, you compare the challenger's predictions against the champion's on identical inputs. Shadow mode catches serving skew — differences between offline evaluation and production behaviour caused by feature preprocessing differences or schema mismatches — with zero user impact.
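A shadow comparison can be as simple as pairing the two models' logged scores per request and summarising how far apart they are. The sketch below assumes the scores have already been joined by request ID; the function name, the 0.5 decision threshold, and the sample values are hypothetical.

```python
def compare_shadow_logs(champion_scores, challenger_scores, threshold=0.5):
    """Summarise how two models differ on the same logged requests."""
    if len(champion_scores) != len(challenger_scores):
        raise ValueError("score lists must be paired per request")
    if not champion_scores:
        raise ValueError("no shadow traffic to compare")
    pairs = list(zip(champion_scores, challenger_scores))
    score_gaps = [abs(a - b) for a, b in pairs]
    # A disagreement is a request where the two models land on opposite sides of the threshold.
    disagreements = [(a >= threshold) != (b >= threshold) for a, b in pairs]
    return {
        "mean_abs_score_gap": sum(score_gaps) / len(pairs),
        "decision_disagreement_rate": sum(disagreements) / len(pairs),
    }


summary = compare_shadow_logs(
    champion_scores=[0.10, 0.80, 0.45, 0.60],
    challenger_scores=[0.12, 0.75, 0.55, 0.62],
)
print(summary)
```

A disagreement rate far above what the offline evaluation predicted is a strong hint of serving skew rather than a genuine model improvement.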
Canary Release
The challenger receives a small percentage of live traffic (typically 5-10%) while the champion handles the rest. Business metrics (conversion rate, click-through, prediction accuracy on labelled outcomes) are compared between the two groups. The canary percentage is progressively increased if the metrics hold, or the rollout is automatically reverted if an alerting rule fires.
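The promote-or-revert decision for a canary can be expressed as a small pure function. This is a minimal sketch under stated assumptions: the step schedule, the 10% tolerance on the error rate, and the name `next_canary_percent` are illustrative and not taken from any particular rollout tool.

```python
def next_canary_percent(
    current_percent,
    canary_error_rate,
    baseline_error_rate,
    max_relative_increase=0.10,
    steps=(5, 10, 25, 50, 100),
):
    """Return the next traffic percentage, or 0 to signal a rollback."""
    allowed_error_rate = baseline_error_rate * (1.0 + max_relative_increase)
    if canary_error_rate > allowed_error_rate:
        return 0  # metrics degraded: send all traffic back to the champion
    for step in steps:
        if step > current_percent:
            return step  # metrics held: move to the next step in the schedule
    return 100  # already fully rolled out


print(next_canary_percent(10, canary_error_rate=0.020, baseline_error_rate=0.020))
print(next_canary_percent(10, canary_error_rate=0.030, baseline_error_rate=0.020))
```

Keeping the decision in a pure function makes it easy to unit test separately from whatever system actually shifts the traffic.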
A/B Test
An A/B test is a structured experiment in which a statistical power calculation fixes the required sample size before the test begins. Users are split by a stable hash (not randomly per request, so that the same user never sees both models). The test runs until that pre-computed sample size has been collected and is then evaluated once; stopping as soon as the result first looks significant inflates the false-positive rate. The winner is then fully promoted. A/B tests are appropriate when the business metric of interest has a long feedback loop, such as 7-day or 30-day retention.
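Both ingredients of an A/B test, stable assignment and an up-front sample size, fit in a few lines of standard-library Python. The sketch below assumes a binary outcome metric and uses the usual normal-approximation formula for comparing two proportions; the function names and the experiment key are hypothetical.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(user_id: str, experiment: str, challenger_share: float = 0.5) -> str:
    """Map a user to a variant deterministically, so repeat requests agree."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


def required_sample_size(baseline_rate, minimum_detectable_lift, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test on two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


variant = assign_variant("user-123", "churn-model-v2")
users_per_group = required_sample_size(baseline_rate=0.10, minimum_detectable_lift=0.02)
print(variant, users_per_group)
```

Including the experiment name in the hash input keeps assignments independent across experiments, so a user who lands in the challenger group for one test is not systematically in the challenger group for the next.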
Model Serving with KServe on Kubernetes
KServe (formerly KFServing) provides a Kubernetes-native serving layer for ML models with built-in support for canary rollouts, shadow deployments, and multi-model serving. An InferenceService CRD describes the predictor, transformer, and explainer components. The canary traffic split is controlled by the canaryTrafficPercent field, making progressive rollouts a kubectl patch command.
For teams already familiar with MLflow for experiment tracking, KServe can load models directly from the MLflow Model Registry by pointing storageUri at the S3 artifact path exported by MLflow, eliminating a separate model packaging step.
# infra/inference-service.yaml: KServe canary deployment
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor
  namespace: ml-serving
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    # In v1beta1 the canary is the latest revision of this predictor: it receives 10% of
    # traffic while the previously rolled-out revision (the champion) keeps the other 90%.
    # There is no separate canary predictor block in this API version.
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      # Challenger artifact, set by the CI pipeline's kubectl patch
      storageUri: s3://ml-models/churn-predictor/challenger
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "2"
          memory: "2Gi"
      readinessProbe:
        httpGet:
          path: /v2/health/ready
          port: 8080
        initialDelaySeconds: 20
        periodSeconds: 10
  # Optional: transformer for feature preprocessing
  transformer:
    containers:
      - image: myregistry/churn-transformer:v2.1.0
        name: transformer
        env:
          - name: FEAST_FEATURE_SERVER_URL
            value: http://feast-server.feast.svc.cluster.local

Production Monitoring: Data Drift and Retraining Triggers
Deploying a model is the beginning of its operational life, not the end. Input feature distributions shift over time (data drift), the relationship between features and the target variable changes (concept drift), and the downstream business process the model supports evolves. Without continuous monitoring, model accuracy silently degrades until a business metric breaks and the root cause is unclear.
The monitoring stack for a production ML system has three layers. The infrastructure layer uses standard Prometheus/Grafana metrics: request latency, throughput, error rate, and pod resource utilisation. The prediction layer tracks the distribution of model outputs — if a binary classifier that normally returns 15% positive predictions starts returning 40% positives, something has changed. The data drift layer compares the statistical distribution of incoming features against the training distribution using the Evidently AI library or custom KS-test implementations.
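For teams that prefer a custom KS-test over a library, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the two empirical CDFs. The sketch below is a standard-library illustration, not Evidently's implementation; `ks_statistic`, `feature_drifted`, and the sample data are hypothetical, and 1.358 is the usual large-sample critical coefficient for a 5% significance level.

```python
import math


def ks_statistic(reference, current):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    n_ref, n_cur = len(ref_sorted), len(cur_sorted)
    i = j = 0
    max_gap = 0.0
    while i < n_ref and j < n_cur:
        value = min(ref_sorted[i], cur_sorted[j])
        # Step past every observation equal to the current value in both samples.
        while i < n_ref and ref_sorted[i] == value:
            i += 1
        while j < n_cur and cur_sorted[j] == value:
            j += 1
        max_gap = max(max_gap, abs(i / n_ref - j / n_cur))
    return max_gap


def feature_drifted(reference, current, critical_coefficient=1.358):
    """Flag drift when the KS statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(current)
    threshold = critical_coefficient * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > threshold


training_values = [i / 100 for i in range(100)]
shifted_values = [value + 0.5 for value in training_values]
print(feature_drifted(training_values, training_values))
print(feature_drifted(training_values, shifted_values))
```

With many features tested every hour, some false alarms are expected at a 5% level, which is one reason to alert on the share of drifted features rather than on any single test.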
For model evaluation frameworks, see our guide on LLM evaluation in production — many of the same principles around golden datasets, regression testing, and CI integration apply to classical ML models as well.
# monitoring/drift_detector.py: Evidently-based drift monitoring
# Written against Evidently's Report API (0.4.x line); check the import paths on other versions.
from datetime import datetime, timezone

import boto3
import pandas as pd
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.metrics import DatasetDriftMetric
from evidently.report import Report


def run_drift_report(
    reference_data: pd.DataFrame,
    production_data: pd.DataFrame,
    model_name: str,
    window_hours: int = 24,
) -> dict:
    report = Report(metrics=[
        DataDriftPreset(drift_share=0.3),  # alert if >30% features drift
        DataQualityPreset(),
        DatasetDriftMetric(),
    ])
    report.run(
        reference_data=reference_data,
        current_data=production_data,
    )
    results = report.as_dict()
    # Look the metric up by name: presets expand into several metrics, so a fixed index is fragile
    drift_result = next(
        m["result"] for m in results["metrics"] if m["metric"] == "DatasetDriftMetric"
    )
    dataset_drift = bool(drift_result["dataset_drift"])
    drift_share = float(drift_result["share_of_drifted_columns"])
    now = datetime.now(timezone.utc)

    # Publish to CloudWatch for alerting
    cw = boto3.client("cloudwatch", region_name="eu-west-1")
    cw.put_metric_data(
        Namespace=f"MLOps/{model_name}",
        MetricData=[
            {
                "MetricName": "DriftShare",
                "Value": drift_share,
                "Unit": "None",
                "Timestamp": now,
            }
        ],
    )

    # Save HTML report to S3 for the data team
    report_path = f"/tmp/drift_report_{window_hours}h.html"
    report.save_html(report_path)
    s3 = boto3.client("s3")
    s3.upload_file(
        report_path,
        "ml-monitoring",
        f"reports/{model_name}/{now.date()}/drift.html",
    )
    return {
        "dataset_drift": dataset_drift,
        "drift_share": drift_share,
        "trigger_retraining": dataset_drift and drift_share > 0.4,
    }

Automated Retraining Triggers
Retraining when drift is detected is more efficient than retraining on a fixed schedule. The drift monitoring job runs hourly, and when the trigger_retraining flag is set, it dispatches a workflow_dispatch event to the GitHub Actions MLOps pipeline. The pipeline trains a new challenger model on the most recent window of data, runs the evaluation gates, and if the challenger passes, promotes it to Staging for human review.
# monitoring/trigger_retraining.py: GitHub Actions dispatch on drift
import os

import requests


def trigger_retraining(reason: str, drift_share: float):
    token = os.environ["GITHUB_TOKEN"]
    repo = os.environ["GITHUB_REPOSITORY"]
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/actions/workflows/mlops-pipeline.yml/dispatches",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github.v3+json",
        },
        # Every key under "inputs" must be declared under workflow_dispatch.inputs
        # in the workflow file, and all values are passed as strings.
        json={
            "ref": "main",
            "inputs": {
                "force_retrain": "true",
                "trigger_reason": reason,
                "drift_share": str(drift_share),
            },
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Retraining triggered: {reason} (drift_share={drift_share:.2%})")
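Because the drift job runs hourly, the same drift episode can fire the dispatch repeatedly while a training run is still in progress. One way to guard against that is a cooldown check before calling the dispatch. This is a minimal sketch: the 12-hour window and the name `should_dispatch` are assumptions, and where the last trigger time is persisted is left out.

```python
from datetime import datetime, timedelta, timezone


def should_dispatch(last_trigger_at, now, cooldown=timedelta(hours=12)):
    """Allow a new retraining dispatch only after the cooldown has elapsed."""
    if last_trigger_at is None:
        return True  # nothing has been triggered yet
    return now - last_trigger_at >= cooldown


last_trigger = datetime(2024, 5, 1, 8, 0, tzinfo=timezone.utc)
print(should_dispatch(last_trigger, last_trigger + timedelta(hours=6)))
print(should_dispatch(last_trigger, last_trigger + timedelta(hours=12)))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the guard deterministic and easy to test.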
Tooling Choices: MLflow, BentoML, Seldon, and KServe
The MLOps tooling ecosystem is wide, and choosing the wrong combination creates a maintenance burden that outweighs the productivity gains. The decision usually turns on two axes: how much Kubernetes expertise the team has, and whether the model serving requirements are simple (single model, batch or online) or complex (multi-model pipelines, ensembles, transformers).
| Tool | Best For | Trade-offs |
|---|---|---|
| MLflow + Flask | Teams starting out, low traffic | No built-in canary; manual infra |
| BentoML | Python-first teams, containerised serving | No native K8s canary; good DX |
| KServe | K8s-native canary, shadow deployments | Requires Kubernetes expertise |
| Seldon Core | Multi-model pipelines, explainability | Complex to operate, higher overhead |
| SageMaker | AWS-native, managed retraining | Vendor lock-in; expensive at scale |
MLOps CI/CD Production Checklist
Before declaring your MLOps CI/CD pipeline production-ready, verify each of the following:
Data and code are versioned together — every model artifact traces back to an exact dataset version and Git commit
Every training run is logged to MLflow with hyperparameters, metrics, and the model artifact — no manual experiment records
Evaluation gates include absolute threshold, champion-challenger comparison, and per-segment fairness checks
Model versions move through a registry lifecycle (Staging → Production) with automated gate transitions and human approval for Production
Shadow mode is used for every new model before any live traffic is served
Canary rollout percentage is automatically increased on metric health and reverted on alerting rule fire
Prediction distribution is monitored continuously and alerts fire if the output distribution shifts significantly
Data drift detection runs on production traffic against the training reference set, with retraining triggered on significant drift
Retraining cooldown is enforced to prevent overlapping training jobs
Model serving environment pins the same Python and library versions used at training time
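The last checklist item can be verified mechanically at container start-up. The sketch below compares the versions installed in the serving environment with the pins recorded at training time; it only understands simple name==version lines, and both function names are hypothetical.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Extract name==version pins, ignoring comments and blank lines."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_version_mismatches(pinned: dict) -> dict:
    """Return {package: (expected, installed)} for every package that differs."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is missing from the serving environment
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


pins = parse_pins("scikit-learn==1.3.2\n# training environment\npandas==2.1.0  # pinned\n")
print(find_version_mismatches(pins))
```

Failing the readiness check when the returned dictionary is non-empty turns a silent behavioural difference into a visible deployment error.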