
From Notebook to Production: MLOps Lessons from Zalando

Lessons learned building production ML systems at Europe's largest online fashion platform.

The gap between a working Jupyter notebook and a production ML system is enormous. After building multiple production ML pipelines at Zalando, I’ve learned that the “last mile” of ML — getting models into production reliably — is where most projects fail.

Here are the key lessons I’ve learned about building ML systems that actually work.

Lesson 1: Your Notebook is Not Your Pipeline

The biggest mindset shift for data scientists moving to production: notebooks are for exploration, not production.

A production pipeline needs:

  • Reproducibility: Same inputs → same outputs, every time
  • Testability: Unit tests, integration tests, data validation
  • Observability: Logging, metrics, alerting
  • Recoverability: Graceful failure handling, retries, checkpoints

We use Kedro to bridge this gap:

import numpy as np
import pandas as pd
from kedro.pipeline import Pipeline, node

# Instead of notebook cells, we write nodes
def preprocess_data(raw_data: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Preprocessing step with explicit inputs and outputs."""
    df = raw_data.copy()

    # Handle missing values
    df = df.fillna(params['fill_values'])

    # Feature engineering
    df['log_price'] = np.log1p(df['price'])

    return df

# Nodes compose into pipelines (load_data and train_model are other nodes in the project)
pipeline = Pipeline([
    node(func=load_data, inputs="raw_data", outputs="loaded_data"),
    node(func=preprocess_data, inputs=["loaded_data", "params:preprocessing"], outputs="clean_data"),
    node(func=train_model, inputs=["clean_data", "params:model"], outputs="model"),
])

Every function is:

  • Pure (no side effects)
  • Typed (clear contracts)
  • Testable (easy to unit test; see the sketch below)
  • Documented (docstrings required)
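
Because each node is a pure, typed function, unit testing it needs nothing more than a small in-memory DataFrame. A minimal pytest sketch against preprocess_data above (the fixture values and import path are illustrative):

import numpy as np
import pandas as pd

# from my_project.nodes import preprocess_data  # illustrative import path

def test_preprocess_data_fills_missing_and_adds_log_price():
    raw = pd.DataFrame({"price": [10.0, None], "quantity": [1, 2]})
    params = {"fill_values": {"price": 0.0}}

    result = preprocess_data(raw, params)

    # Missing prices are filled from the configured defaults
    assert result["price"].isna().sum() == 0
    # The log1p feature is derived from the (filled) price column
    assert np.allclose(result["log_price"], np.log1p(result["price"]))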

Lesson 2: Version Everything

In production ML, you need to trace any prediction back to:

  • The exact model version
  • The training data version
  • The feature engineering code version
  • The configuration used

We use MLflow for experiment tracking:

import mlflow

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "model_type": "xgboost",
        "n_estimators": 100,
        "learning_rate": 0.1
    })

    # Train model
    model = train_model(X_train, y_train)

    # Log metrics
    mlflow.log_metrics({
        "rmse": rmse,
        "mape": mape,
        "r2": r2
    })

    # Log model artifact
    mlflow.sklearn.log_model(model, "model")

    # Log data version
    mlflow.log_param("data_version", data_hash)

Combined with Kedro’s data versioning, we can reproduce any historical run.
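
For completeness, the data_hash logged above can be any deterministic fingerprint of the training data. A minimal sketch using pandas' built-in row hashing (hash_dataframe is an illustrative helper, not an MLflow API):

import hashlib

import pandas as pd

def hash_dataframe(df: pd.DataFrame) -> str:
    """Deterministic content hash of a DataFrame, used as its version tag."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:12]

data_hash = hash_dataframe(X_train)  # assuming X_train is a DataFrame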

Lesson 3: Data Validation is Non-Negotiable

Bad data is the #1 cause of ML failures in production. We validate data at every stage:

import pandas as pd
import pandera as pa

# Define schema for input data
input_schema = pa.DataFrameSchema({
    "product_id": pa.Column(str, nullable=False),
    "price": pa.Column(float, checks=[
        pa.Check.greater_than(0),
        pa.Check.less_than(10000)
    ]),
    "quantity": pa.Column(int, checks=pa.Check.greater_than_or_equal_to(0)),
    "date": pa.Column(pa.DateTime)
})

@pa.check_input(input_schema)
def process_sales(df: pd.DataFrame) -> pd.DataFrame:
    """This function will fail if input doesn't match schema."""
    return df.groupby("product_id").agg({"quantity": "sum"})

We also monitor for distribution drift:

import pandas as pd

from evidently.metrics import ColumnDriftMetric
from evidently.report import Report

def check_drift(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
    """Detect distribution drift between training and production data."""
    report = Report(metrics=[
        ColumnDriftMetric(column_name="price"),
        ColumnDriftMetric(column_name="quantity"),
    ])

    report.run(reference_data=reference, current_data=current)

    return report.as_dict()

Lesson 4: Design for Failure

Production systems fail. Design accordingly:

Graceful Degradation

When the ML model fails, fall back to simpler logic:

import logging

logger = logging.getLogger(__name__)

def get_prediction(features: dict) -> float:
    try:
        # Try ML prediction
        return ml_model.predict(features)
    except Exception as e:
        logger.error(f"ML prediction failed: {e}")
        # Fall back to rule-based default
        return calculate_fallback(features)

Retry Logic

Transient failures are common:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def fetch_features(product_id: str) -> dict:
    """Fetch features from feature store with retries."""
    return feature_store.get(product_id)

Circuit Breakers

Prevent cascade failures:

from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def call_recommendation_service(user_id: str) -> list:
    """Call external service with circuit breaker."""
    return recommendation_api.get_recommendations(user_id)

Lesson 5: Monitoring is a First-Class Citizen

You can’t improve what you don’t measure. We monitor at multiple levels:

Model Performance

# metrics_client: whatever client wraps your metrics backend (StatsD, Prometheus, etc.)
# Track prediction accuracy over time
metrics_client.gauge(
    "model.mape",
    value=calculate_mape(actuals, predictions),
    tags={"model": "price_elasticity", "version": "v2.1"}
)

Business Metrics

# Track business impact
metrics_client.counter(
    "recommendations.accepted",
    tags={"model": "stock_recommendation"}
)

Infrastructure

  • CPU/memory usage
  • Prediction latency
  • Queue depths
  • Error rates
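
On the infrastructure side, here is a minimal sketch of instrumenting latency and error counts with the prometheus_client library (the metric names and the predict_with_metrics wrapper are illustrative):

from prometheus_client import Counter, Histogram

# Illustrative metric names; follow your own naming convention
PREDICTION_LATENCY = Histogram(
    "prediction_latency_seconds", "Time spent serving one prediction"
)
PREDICTION_ERRORS = Counter(
    "prediction_errors_total", "Failed prediction requests"
)

@PREDICTION_LATENCY.time()
def predict_with_metrics(features: dict) -> float:
    try:
        return get_prediction(features)  # the fallback-aware function from Lesson 4
    except Exception:
        PREDICTION_ERRORS.inc()
        raise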

Our monitoring stack: Prometheus for metrics, Grafana for dashboards, PagerDuty for alerting.

Lesson 6: CI/CD for ML is Different

ML pipelines need specialized CI/CD:

# .github/workflows/ml-pipeline.yml
name: ML Pipeline CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run unit tests
        run: pytest tests/unit

      - name: Run data validation tests
        run: pytest tests/data_validation

      - name: Run model tests
        run: pytest tests/model

  integration:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run integration tests
        run: pytest tests/integration

      - name: Run Kedro pipeline
        run: kedro run --pipeline=test

  deploy:
    needs: integration
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to staging
        run: ./deploy.sh staging

      - name: Run smoke tests
        run: pytest tests/smoke

      - name: Deploy to production
        run: ./deploy.sh production

Key additions for ML:

  • Model performance tests (does accuracy meet threshold? see the sketch below)
  • Data contract tests (do data sources match expected schema?)
  • Shadow mode deployment (run new model alongside old, compare)
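
A model performance test is just a pytest case that fails the build when a candidate model regresses past an agreed limit. A minimal sketch (the trained_model and holdout_set fixtures and the 0.15 MAPE limit are illustrative):

import numpy as np

# trained_model and holdout_set are pytest fixtures defined in conftest.py (illustrative)
def test_model_meets_mape_threshold(trained_model, holdout_set):
    """Block deployment if the candidate model regresses on the holdout set."""
    X, y = holdout_set
    predictions = trained_model.predict(X)
    mape = np.mean(np.abs((y - predictions) / y))
    assert mape < 0.15, f"MAPE {mape:.3f} exceeds the agreed 0.15 limit"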

Lesson 7: Documentation is Part of the System

ML systems are complex. Documentation isn’t optional:

  • Model cards: What does this model do? What are its limitations?
  • Data dictionaries: What does each feature mean?
  • Runbooks: How do I debug common failures?
  • Architecture diagrams: How do components interact?

We auto-generate documentation from code:

def calculate_price_elasticity(
    sales: pd.DataFrame,
    prices: pd.DataFrame,
    params: ElasticityParams
) -> ElasticityResults:
    """
    Calculate price elasticity using causal inference.

    Args:
        sales: Historical sales data with columns [product_id, date, quantity]
        prices: Historical prices with columns [product_id, date, price]
        params: Model configuration parameters

    Returns:
        ElasticityResults containing:
            - elasticity_estimates: Point estimates per product
            - confidence_intervals: 95% CIs
            - model_diagnostics: Fit statistics

    Raises:
        InsufficientDataError: If product has < 30 observations
        DataQualityError: If data validation fails
    """

The Tech Stack That Works

After years of iteration, here’s what works for us:

| Component | Tool | Why |
| --- | --- | --- |
| Orchestration | Kedro + Airflow | Kedro for pipelines, Airflow for scheduling |
| Experiment Tracking | MLflow | Industry standard, good UI |
| Feature Store | Feast | Open source, flexible |
| Model Serving | FastAPI + K8s | Fast, scalable, familiar |
| Data Validation | Pandera + Great Expectations | Catches issues early |
| Monitoring | Prometheus + Grafana | Proven reliability |
| CI/CD | GitHub Actions | Native integration |

Final Thoughts

The gap between notebook and production isn’t about tools — it’s about mindset. Think about:

  • Reliability: What happens when things go wrong?
  • Observability: How will I know something’s wrong?
  • Maintainability: Can someone else understand this in 6 months?
  • Reproducibility: Can I recreate any historical result?

If you can answer these questions confidently, you’re building production ML systems, not just experiments.


Want to discuss MLOps practices? I’m always happy to chat about production ML. Find me on LinkedIn.