
Production-Ready Time-Series Forecasting at Scale

Lessons learned building forecasting systems that serve millions of predictions daily.

Forecasting is one of those problems that seems simple until you try to do it at scale. Predicting next week’s sales for a single product? Easy. Predicting daily sales for millions of product-location combinations with real-time updates? That’s where things get interesting.

In this post, I’ll share lessons from building forecasting systems at Zalando that serve millions of predictions daily.

The Scale Challenge

Our forecasting requirements:

  • Millions of time series: Each product × location combination needs forecasts
  • Multiple horizons: From next-day to next-quarter predictions
  • Daily updates: Fresh predictions every morning for operations teams
  • Probabilistic outputs: Point forecasts aren’t enough — we need uncertainty
  • Low latency: Downstream systems expect forecasts within SLA

A single model doesn’t cut it. We needed a system.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    Forecasting Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐ │
│  │ Feature  │───▶│  Model   │───▶│ Forecast │───▶│ Store │ │
│  │ Pipeline │    │ Training │    │ Serving  │    │       │ │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘ │
│       │                                              │      │
│       └──────────────── Kedro ───────────────────────┘      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Model Selection Strategy

Not all time series are created equal. We segment our products:

High-volume products

  • Enough historical data for complex models
  • Use neural forecasters (N-BEATS, TFT)
  • Individual models or fine-tuned global models

Medium-volume products

  • Global models with product embeddings
  • Prophet with hierarchical priors
  • Transfer learning from similar products

Low-volume / new products

  • Pure hierarchical forecasting
  • Aggregate to higher levels, then disaggregate
  • Heavy reliance on seasonality patterns from category

In code, the routing is a simple dispatch on history length:

def select_model(product_id: str, history: pd.DataFrame) -> BaseForecaster:
    n_observations = len(history)

    if n_observations > 365:
        return NeuralForecaster(product_id)
    elif n_observations > 90:
        return GlobalModelWithEmbedding(product_id)
    else:
        return HierarchicalForecaster(product_id)

Building with Kedro

Kedro transformed how we build ML pipelines. The key benefits:

1. Reproducibility

Every run is versioned. We can trace any prediction back to its exact inputs and model version.

# kedro catalog.yml
sales_forecast:
  type: pandas.ParquetDataSet
  filepath: data/07_model_output/forecasts.parquet
  versioned: true
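
With versioning on, any snapshot can be pulled back for debugging. A minimal sketch, assuming the catalog object from your Kedro context; the timestamp is illustrative and follows Kedro's version format:

# Reload the forecasts exactly as they were written by a specific run
old_forecasts = catalog.load("sales_forecast", version="2024-01-15T06.00.00.000Z")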

2. Modular Pipelines

Separate pipelines for feature engineering, training, and inference:

# nodes.py
def create_features(sales: pd.DataFrame, calendar: pd.DataFrame) -> pd.DataFrame:
    """Generate forecasting features."""
    features = sales.copy()

    # Lag features
    for lag in [7, 14, 28]:
        features[f'lag_{lag}'] = features.groupby('product_id')['sales'].shift(lag)

    # Rolling statistics
    for window in [7, 28]:
        features[f'rolling_mean_{window}'] = (
            features.groupby('product_id')['sales']
            .transform(lambda x: x.rolling(window).mean())
        )

    # Calendar features
    features = features.merge(calendar, on='date')

    return features

3. Easy Experimentation

A/B testing different model configurations is straightforward:

# pipeline.py
def create_pipeline(**kwargs):
    return Pipeline([
        node(
            func=create_features,
            inputs=["sales", "calendar"],
            outputs="features",
        ),
        node(
            func=train_model,
            inputs=["features", "params:model_config"],
            outputs="model",
        ),
        node(
            func=generate_forecasts,
            inputs=["model", "features"],
            outputs="forecasts",
        ),
    ])
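
The params:model_config input above comes from Kedro's parameters files, so trying a new configuration is just a different parameters entry or environment. The keys below are illustrative:

# conf/base/parameters.yml
model_config:
  model_type: "nbeats"
  forecast_horizon: 28
  learning_rate: 0.001

Overrides live in conf/<env>/parameters.yml and run with kedro run --env=<env>, so candidate configurations never touch the production defaults.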

Prediction Intervals

Point forecasts are dangerous. Downstream systems need to know uncertainty:

import numpy as np
import pandas as pd
from scipy import stats

def forecast_with_intervals(
    model,
    X: pd.DataFrame,
    quantiles: tuple = (0.1, 0.5, 0.9)
) -> pd.DataFrame:
    """Generate probabilistic forecasts."""

    # Point prediction
    y_pred = model.predict(X)

    # Estimate prediction variance (simplified)
    residuals = model.residuals_
    sigma = np.std(residuals)

    # Generate quantile forecasts
    forecasts = pd.DataFrame({'point_forecast': y_pred})

    for q in quantiles:
        z = stats.norm.ppf(q)
        forecasts[f'q{int(q*100)}'] = y_pred + z * sigma

    return forecasts

For neural models, we use Monte Carlo dropout or quantile regression directly.
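
The quantile-regression route just swaps the training objective for the pinball loss, with one output per quantile. A minimal PyTorch sketch (names are illustrative):

import torch

def pinball_loss(y_pred: torch.Tensor, y_true: torch.Tensor, q: float) -> torch.Tensor:
    """Pinball (quantile) loss: under-forecasts cost q, over-forecasts cost 1 - q."""
    error = y_true - y_pred
    return torch.mean(torch.maximum(q * error, (q - 1) * error))

# Train one head per quantile, e.g. summing pinball_loss(pred_q, y, q) over q in (0.1, 0.5, 0.9)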

Forecast Reconciliation

When you forecast at multiple aggregation levels, they often don’t add up. Hierarchical reconciliation fixes this:

from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import MinTrace

# Base forecasts for every level (total, category, product) stacked in one
# long dataframe, alongside the summation matrix and level tags that
# describe the hierarchy.
reconciler = HierarchicalReconciliation(reconcilers=[MinTrace(method='ols')])
reconciled = reconciler.reconcile(
    Y_hat_df=base_forecasts,
    S=summation_matrix,
    tags=tags,
)

This ensures that product-level forecasts sum to category totals, which sum to the overall total.
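
Coherence is easiest to see through the summation matrix: every level is a fixed linear combination of the bottom-level series. A toy hierarchy with two categories and four products:

import numpy as np

# Rows: total, category A, category B, then the four products (bottom level).
# Columns: the four bottom-level products.
S = np.array([
    [1, 1, 1, 1],   # total      = all products
    [1, 1, 0, 0],   # category A = product 1 + product 2
    [0, 0, 1, 1],   # category B = product 3 + product 4
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

# Any coherent forecast vector across all levels equals S @ y_bottom.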

Monitoring & Alerting

Forecasting models degrade silently. We monitor:

Accuracy Metrics

  • MAPE, RMSE, and weighted variants
  • Coverage of prediction intervals
  • Bias detection (systematic over/under-forecasting)

Data Quality

  • Missing data patterns
  • Distribution shifts (a simple check is sketched after this list)
  • Outlier frequency
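
A lightweight sketch of the distribution-shift check mentioned above (window choice and threshold are illustrative):

from scipy.stats import ks_2samp

def sales_distribution_shifted(reference: pd.Series, current: pd.Series, alpha: float = 0.01) -> bool:
    """Two-sample KS test between a historical reference window and the latest window."""
    _, p_value = ks_2samp(reference.dropna(), current.dropna())
    return p_value < alpha  # True -> the distributions likely differ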

Operational Metrics

  • Pipeline latency
  • Prediction serving latency
  • Model freshness

# Alerting on forecast degradation
def check_forecast_quality(actuals: pd.Series, forecasts: pd.Series) -> None:
    # Ignore zero-demand days so the percentage error stays finite
    nonzero = actuals != 0
    mape = np.mean(np.abs((actuals[nonzero] - forecasts[nonzero]) / actuals[nonzero]))

    if mape > MAPE_THRESHOLD:
        send_alert(
            severity="warning",
            message=f"Forecast MAPE ({mape:.2%}) exceeds threshold"
        )
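
Interval coverage gets the same treatment: if the nominal 80% interval covers far more or far fewer than 80% of actuals, the uncertainty estimates are off. A sketch reusing the same send_alert helper (thresholds are illustrative):

def check_interval_coverage(
    actuals: pd.Series,
    lower: pd.Series,
    upper: pd.Series,
    nominal: float = 0.8,
    tolerance: float = 0.05,
) -> None:
    coverage = ((actuals >= lower) & (actuals <= upper)).mean()
    if abs(coverage - nominal) > tolerance:
        send_alert(
            severity="warning",
            message=f"Interval coverage ({coverage:.1%}) deviates from nominal {nominal:.0%}"
        )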

Lessons Learned

1. Simple Models Go Far

Prophet with good feature engineering beats complex neural models for many use cases. Start simple.

2. Data Quality > Model Complexity

Spending time on data cleaning, outlier handling, and feature engineering pays off more than model tuning.

3. Business Context Matters

A 5% MAPE improvement means nothing if it doesn’t help business decisions. Align metrics with business outcomes.

4. Automate Everything

Manual interventions don’t scale. Build systems that handle edge cases automatically.

5. Communicate Uncertainty

Stakeholders need to understand forecast confidence. Invest in visualization and explanation.

Tech Stack Summary

  • Orchestration: Kedro + Airflow
  • Training: PyTorch (neural), Prophet (classical)
  • Serving: FastAPI + Redis cache (sketched below)
  • Storage: Parquet on S3, Redshift for queries
  • Monitoring: Prometheus + Grafana
  • Experiment tracking: MLflow
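
For completeness, a rough sketch of the serving layer: forecasts precomputed by the daily pipeline, keyed by product and location in Redis, behind a FastAPI endpoint (paths and key layout are illustrative):

import json

import redis
from fastapi import FastAPI, HTTPException

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379)

@app.get("/forecast/{product_id}/{location_id}")
def get_forecast(product_id: str, location_id: str) -> dict:
    """Serve forecasts written to the cache by the daily batch run."""
    payload = cache.get(f"forecast:{product_id}:{location_id}")
    if payload is None:
        raise HTTPException(status_code=404, detail="No forecast for this key")
    return json.loads(payload)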

Building forecasting systems? I’d love to hear about your challenges. Connect with me on LinkedIn.