The gap between a working Jupyter notebook and a production ML system is enormous. After building multiple production ML pipelines at Zalando, I’ve learned that the “last mile” of ML — getting models into production reliably — is where most projects fail.
Here are the key lessons I’ve learned about building ML systems that actually work.
Lesson 1: Your Notebook is Not Your Pipeline
The biggest mindset shift for data scientists moving to production: notebooks are for exploration, not production.
A production pipeline needs:
- Reproducibility: Same inputs → same outputs, every time
- Testability: Unit tests, integration tests, data validation
- Observability: Logging, metrics, alerting
- Recoverability: Graceful failure handling, retries, checkpoints
We use Kedro to bridge this gap:
```python
import numpy as np
import pandas as pd
from kedro.pipeline import Pipeline, node

# Instead of notebook cells, we write nodes
def preprocess_data(raw_data: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Preprocessing step with explicit inputs and outputs."""
    df = raw_data.copy()
    # Handle missing values
    df = df.fillna(params['fill_values'])
    # Feature engineering
    df['log_price'] = np.log1p(df['price'])
    return df

# Nodes compose into pipelines
pipeline = Pipeline([
    node(func=load_data, inputs="raw_data", outputs="loaded_data"),
    node(func=preprocess_data, inputs=["loaded_data", "params:preprocessing"], outputs="clean_data"),
    node(func=train_model, inputs=["clean_data", "params:model"], outputs="model"),
])
```

Every function is:
- Pure (no side effects)
- Typed (clear contracts)
- Testable (easy to unit test)
- Documented (docstrings required)
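Because every node is a pure, typed function, unit tests stay small. A minimal sketch for the preprocess_data node above, assuming it is importable from the pipeline module (pytest as the runner; the columns and fill values are illustrative):

```python
import pandas as pd

def test_preprocess_data_fills_missing_and_adds_log_price():
    raw = pd.DataFrame({"price": [10.0, None], "quantity": [1, 2]})
    params = {"fill_values": {"price": 0.0}}

    result = preprocess_data(raw, params)

    # Missing prices are filled with the configured default
    assert result["price"].isna().sum() == 0
    # The engineered feature is present
    assert "log_price" in result.columns
    # The original input is untouched (pure function, no side effects)
    assert raw["price"].isna().sum() == 1
```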
Lesson 2: Version Everything
In production ML, you need to trace any prediction back to:
- The exact model version
- The training data version
- The feature engineering code version
- The configuration used
We use MLFlow for experiment tracking:
```python
import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "model_type": "xgboost",
        "n_estimators": 100,
        "learning_rate": 0.1
    })

    # Train model
    model = train_model(X_train, y_train)

    # Log metrics
    mlflow.log_metrics({
        "rmse": rmse,
        "mape": mape,
        "r2": r2
    })

    # Log model artifact
    mlflow.sklearn.log_model(model, "model")

    # Log data version
    mlflow.log_param("data_version", data_hash)
```

Combined with Kedro's data versioning, we can reproduce any historical run.
Lesson 3: Data Validation is Non-Negotiable
Bad data is the #1 cause of ML failures in production. We validate data at every stage:
```python
import pandas as pd
import pandera as pa

# Define schema for input data
input_schema = pa.DataFrameSchema({
    "product_id": pa.Column(str, nullable=False),
    "price": pa.Column(float, checks=[
        pa.Check.greater_than(0),
        pa.Check.less_than(10000)
    ]),
    "quantity": pa.Column(int, checks=pa.Check.greater_than_or_equal_to(0)),
    "date": pa.Column(pa.DateTime)
})

@pa.check_input(input_schema)
def process_sales(df: pd.DataFrame) -> pd.DataFrame:
    """This function will fail if input doesn't match schema."""
    return df.groupby("product_id").agg({"quantity": "sum"})
```

We also monitor for distribution drift:
```python
import pandas as pd
from evidently.metrics import ColumnDriftMetric
from evidently.report import Report

def check_drift(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
    """Detect distribution drift between training and production data."""
    report = Report(metrics=[
        ColumnDriftMetric(column_name="price"),
        ColumnDriftMetric(column_name="quantity"),
    ])
    report.run(reference_data=reference, current_data=current)
    return report.as_dict()
```

Lesson 4: Design for Failure
Production systems fail. Design accordingly:
Graceful Degradation
When the ML model fails, fall back to simpler logic:
```python
def get_prediction(features: dict) -> float:
    try:
        # Try ML prediction
        return ml_model.predict(features)
    except Exception as e:
        logger.error(f"ML prediction failed: {e}")
        # Fall back to rule-based default
        return calculate_fallback(features)
```

Retry Logic
Transient failures are common:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def fetch_features(product_id: str) -> dict:
    """Fetch features from feature store with retries."""
    return feature_store.get(product_id)
```

Circuit Breakers
Prevent cascade failures:
```python
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def call_recommendation_service(user_id: str) -> list:
    """Call external service with circuit breaker."""
    return recommendation_api.get_recommendations(user_id)
```

Lesson 5: Monitoring is a First-Class Citizen
You can’t improve what you don’t measure. We monitor at multiple levels:
Model Performance
```python
# Track prediction accuracy over time
metrics_client.gauge(
    "model.mape",
    value=calculate_mape(actuals, predictions),
    tags={"model": "price_elasticity", "version": "v2.1"}
)
```

Business Metrics
```python
# Track business impact
metrics_client.counter(
    "recommendations.accepted",
    tags={"model": "stock_recommendation"}
)
```

Infrastructure
- CPU/memory usage
- Prediction latency
- Queue depths
- Error rates
Our monitoring stack: Prometheus for metrics, Grafana for dashboards, PagerDuty for alerting.
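As a concrete example of the metrics layer, here is a minimal sketch using the prometheus_client library; the metric names, labels, and port are illustrative, and ml_model stands in for whatever model object the service holds:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Latency of individual predictions, bucketed so Grafana can query percentiles
PREDICTION_LATENCY = Histogram(
    "prediction_latency_seconds",
    "Time spent producing a single prediction",
    labelnames=["model"],
)

# Count of failed prediction requests, broken down by model
PREDICTION_ERRORS = Counter(
    "prediction_errors_total",
    "Number of failed prediction requests",
    labelnames=["model"],
)

def predict_with_metrics(model_name: str, features: dict) -> float:
    # Time the prediction and record errors before re-raising
    with PREDICTION_LATENCY.labels(model=model_name).time():
        try:
            return ml_model.predict(features)
        except Exception:
            PREDICTION_ERRORS.labels(model=model_name).inc()
            raise

# Expose /metrics for Prometheus to scrape
start_http_server(8000)
```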
Lesson 6: CI/CD for ML is Different
ML pipelines need specialized CI/CD:
```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run unit tests
        run: pytest tests/unit
      - name: Run data validation tests
        run: pytest tests/data_validation
      - name: Run model tests
        run: pytest tests/model

  integration:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Run integration tests
        run: pytest tests/integration
      - name: Run Kedro pipeline
        run: kedro run --pipeline=test

  deploy:
    needs: integration
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: ./deploy.sh staging
      - name: Run smoke tests
        run: pytest tests/smoke
      - name: Deploy to production
        run: ./deploy.sh production
```

Key additions for ML:
- Model performance tests (does accuracy meet threshold?)
- Data contract tests (do data sources match expected schema?)
- Shadow mode deployment (run new model alongside old, compare)
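A model performance test can be an ordinary pytest case that fails the build when accuracy regresses. A minimal sketch, assuming CI stages a candidate model and a held-out dataset at the paths shown (the paths and the 15% MAPE threshold are illustrative, not our actual gates):

```python
import joblib
import numpy as np
import pandas as pd

MAPE_THRESHOLD = 0.15  # illustrative gate; tune per model

def mape(actuals: np.ndarray, predictions: np.ndarray) -> float:
    return float(np.mean(np.abs((actuals - predictions) / actuals)))

def test_model_meets_accuracy_threshold():
    # Placeholder paths for wherever CI stages the candidate model and eval data
    model = joblib.load("artifacts/candidate_model.joblib")
    eval_df = pd.read_parquet("data/holdout.parquet")

    predictions = model.predict(eval_df.drop(columns=["target"]))
    score = mape(eval_df["target"].to_numpy(), predictions)

    assert score < MAPE_THRESHOLD, f"MAPE {score:.3f} exceeds threshold {MAPE_THRESHOLD}"
```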
Lesson 7: Documentation is Part of the System
ML systems are complex. Documentation isn’t optional:
- Model cards: What does this model do? What are its limitations?
- Data dictionaries: What does each feature mean?
- Runbooks: How do I debug common failures?
- Architecture diagrams: How do components interact?
We auto-generate documentation from code:
```python
def calculate_price_elasticity(
    sales: pd.DataFrame,
    prices: pd.DataFrame,
    params: ElasticityParams
) -> ElasticityResults:
    """
    Calculate price elasticity using causal inference.

    Args:
        sales: Historical sales data with columns [product_id, date, quantity]
        prices: Historical prices with columns [product_id, date, price]
        params: Model configuration parameters

    Returns:
        ElasticityResults containing:
        - elasticity_estimates: Point estimates per product
        - confidence_intervals: 95% CIs
        - model_diagnostics: Fit statistics

    Raises:
        InsufficientDataError: If product has < 30 observations
        DataQualityError: If data validation fails
    """
```

The Tech Stack That Works
After years of iteration, here’s what works for us:
| Component | Tool | Why |
|---|---|---|
| Orchestration | Kedro + Airflow | Kedro for pipelines, Airflow for scheduling |
| Experiment Tracking | MLFlow | Industry standard, good UI |
| Feature Store | Feast | Open source, flexible |
| Model Serving | FastAPI + K8s | Fast, scalable, familiar |
| Data Validation | Pandera + Great Expectations | Catches issues early |
| Monitoring | Prometheus + Grafana | Proven reliability |
| CI/CD | GitHub Actions | Native integration |
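To make the serving row concrete, here is a minimal sketch of the FastAPI pattern; the endpoint, payload fields, and version string are illustrative, and ml_model again stands in for the loaded model:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    product_id: str
    price: float
    quantity: int

class PredictionResponse(BaseModel):
    prediction: float
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    try:
        # Hand the validated payload to the model held in memory
        value = ml_model.predict(request.dict())
    except Exception:
        raise HTTPException(status_code=503, detail="prediction unavailable")
    return PredictionResponse(prediction=float(value), model_version="v2.1")
```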
Final Thoughts
The gap between notebook and production isn’t about tools — it’s about mindset. Think about:
- Reliability: What happens when things go wrong?
- Observability: How will I know something’s wrong?
- Maintainability: Can someone else understand this in 6 months?
- Reproducibility: Can I recreate any historical result?
If you can answer these questions confidently, you’re building production ML systems, not just experiments.
Want to discuss MLOps practices? I’m always happy to chat about production ML. Find me on LinkedIn.