Forecasting is one of those problems that seems simple until you try to do it at scale. Predicting next week’s sales for a single product? Easy. Predicting daily sales for millions of product-location combinations with real-time updates? That’s where things get interesting.
In this post, I’ll share lessons from building forecasting systems at Zalando that serve millions of predictions daily.
The Scale Challenge
Our forecasting requirements:
- Millions of time series: Each product × location combination needs forecasts
- Multiple horizons: From next-day to next-quarter predictions
- Daily updates: Fresh predictions every morning for operations teams
- Probabilistic outputs: Point forecasts aren’t enough — we need uncertainty
- Low latency: Downstream systems expect forecasts within SLA
A single model doesn’t cut it. We needed a system.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│                    Forecasting Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐  │
│  │ Feature  │───▶│  Model   │───▶│ Forecast │───▶│ Store │  │
│  │ Pipeline │    │ Training │    │ Serving  │    │       │  │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘  │
│       │                                              │      │
│       └──────────────── Kedro ───────────────────────┘      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Model Selection Strategy
Not all time series are created equal. We segment our products:
High-volume products
- Enough historical data for complex models
- Use neural forecasters (N-BEATS, TFT)
- Individual models or fine-tuned global models
Medium-volume products
- Global models with product embeddings
- Prophet with hierarchical priors
- Transfer learning from similar products
Low-volume / new products
- Pure hierarchical forecasting
- Aggregate to higher levels, then disaggregate
- Heavy reliance on seasonality patterns from category
In code, the routing looks roughly like this (the forecaster classes are internal wrappers around the underlying models):
import pandas as pd

def select_model(product_id: str, history: pd.DataFrame) -> BaseForecaster:
    """Route a product to a model family based on available history."""
    n_observations = len(history)
    if n_observations > 365:
        # Over a year of daily history: enough data for a neural forecaster
        return NeuralForecaster(product_id)
    elif n_observations > 90:
        # At least a quarter: global model with a learned product embedding
        return GlobalModelWithEmbedding(product_id)
    else:
        # Sparse history: fall back to hierarchical forecasting
        return HierarchicalForecaster(product_id)

Building with Kedro
Kedro transformed how we build ML pipelines. The key benefits:
1. Reproducibility
Every run is versioned. We can trace any prediction back to its exact inputs and model version.
# conf/base/catalog.yml
sales_forecast:
  type: pandas.ParquetDataSet
  filepath: data/07_model_output/forecasts.parquet
  versioned: true

2. Modular Pipelines
Separate pipelines for feature engineering, training, and inference:
# nodes.py
import pandas as pd

def create_features(sales: pd.DataFrame, calendar: pd.DataFrame) -> pd.DataFrame:
    """Generate forecasting features (assumes rows are sorted by date within each product)."""
    features = sales.copy()
    # Lag features
    for lag in [7, 14, 28]:
        features[f'lag_{lag}'] = features.groupby('product_id')['sales'].shift(lag)
    # Rolling statistics
    for window in [7, 28]:
        features[f'rolling_mean_{window}'] = (
            features.groupby('product_id')['sales']
            .transform(lambda x: x.rolling(window).mean())
        )
    # Calendar features
    features = features.merge(calendar, on='date')
    return features

3. Easy Experimentation
A/B testing different model configurations is straightforward:
# pipeline.py
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([
        node(
            func=create_features,
            inputs=["sales", "calendar"],
            outputs="features",
        ),
        node(
            func=train_model,
            inputs=["features", "params:model_config"],
            outputs="model",
        ),
        node(
            func=generate_forecasts,
            inputs=["model", "features"],
            outputs="forecasts",
        ),
    ])

Prediction Intervals
Point forecasts are dangerous. Downstream systems need to know uncertainty:
import numpy as np
import pandas as pd
from scipy import stats

def forecast_with_intervals(
    model,
    X: pd.DataFrame,
    quantiles: list = [0.1, 0.5, 0.9],
) -> pd.DataFrame:
    """Generate probabilistic forecasts under a Gaussian error assumption."""
    # Point prediction
    y_pred = model.predict(X)
    # Estimate prediction variance from in-sample residuals
    # (simplified: assumes homoscedastic, normally distributed errors)
    residuals = model.residuals_
    sigma = np.std(residuals)
    # Shift the point forecast by the z-score of each quantile
    forecasts = pd.DataFrame({'point_forecast': y_pred})
    for q in quantiles:
        z = stats.norm.ppf(q)
        forecasts[f'q{int(q * 100)}'] = y_pred + z * sigma
    return forecasts

For neural models, we use Monte Carlo dropout or quantile regression directly.
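A minimal sketch of the Monte Carlo dropout variant, assuming a PyTorch model with dropout layers; the helper name, sample count, and quantile choices here are illustrative, not our production code:

import numpy as np
import torch

def mc_dropout_quantiles(model: torch.nn.Module, X: torch.Tensor,
                         n_samples: int = 100) -> dict:
    """Approximate forecast quantiles by sampling with dropout enabled."""
    model.train()  # keeps dropout active at inference time; a careful
                   # version would leave batch-norm layers in eval mode
    with torch.no_grad():
        samples = torch.stack([model(X) for _ in range(n_samples)])
    preds = samples.cpu().numpy()
    return {
        'q10': np.quantile(preds, 0.10, axis=0),
        'q50': np.quantile(preds, 0.50, axis=0),
        'q90': np.quantile(preds, 0.90, axis=0),
    }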
Forecast Reconciliation
When you forecast at multiple aggregation levels, they often don’t add up. Hierarchical reconciliation fixes this:
from hierarchicalforecast.methods import MinTrace

# Forecasts at different levels
forecasts = {
    'total': total_forecast,
    'category': category_forecasts,
    'product': product_forecasts,
}

# Reconcile to ensure consistency (schematic call; see the
# hierarchicalforecast docs for the exact reconcile() signature)
reconciler = MinTrace(method='ols')
reconciled = reconciler.reconcile(forecasts, S=summation_matrix)

This ensures that product-level forecasts sum to category totals, which sum to the overall total.
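To make the summation matrix concrete, here is a toy sketch (hypothetical numbers) for a hierarchy of one total, two categories, and four products; a coherent forecast at every level is S times the bottom-level vector:

import numpy as np

# Rows: [total, cat_A, cat_B, prod_1, prod_2, prod_3, prod_4]
# Columns: the 4 bottom-level (product) series.
S = np.array([
    [1, 1, 1, 1],  # total = sum of all products
    [1, 1, 0, 0],  # category A = products 1 + 2
    [0, 0, 1, 1],  # category B = products 3 + 4
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

# Bottom-up reconciliation: any bottom-level forecast vector b yields
# coherent forecasts at every level via S @ b.
b = np.array([10.0, 5.0, 8.0, 2.0])  # product-level forecasts
coherent = S @ b                      # [25, 15, 10, 10, 5, 8, 2]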
Monitoring & Alerting
Forecasting models degrade silently. We monitor:
Accuracy Metrics
- MAPE, RMSE, and weighted variants
- Coverage of prediction intervals
- Bias detection (systematic over/under-forecasting); a coverage-and-bias check is sketched below
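A minimal sketch of those two checks, assuming a frame with 'actual' and 'point_forecast' columns plus the q10/q90 interval columns from earlier (column names illustrative):

import pandas as pd

def coverage_and_bias(df: pd.DataFrame) -> dict:
    """Check 80% interval coverage and systematic bias."""
    # Fraction of actuals inside the [q10, q90] band; should be close to 0.80
    covered = (df['actual'] >= df['q10']) & (df['actual'] <= df['q90'])
    # Mean error: persistently positive means we under-forecast,
    # persistently negative means we over-forecast
    bias = (df['actual'] - df['point_forecast']).mean()
    return {'coverage_80': covered.mean(), 'bias': bias}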
Data Quality
- Missing data patterns
- Distribution shifts (a simple detector is sketched below)
- Outlier frequency
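For distribution shifts, a simple per-feature check is a two-sample Kolmogorov-Smirnov test; a hedged sketch, with an illustrative alpha threshold:

import numpy as np
from scipy import stats

def detect_shift(train_values: np.ndarray, recent_values: np.ndarray,
                 alpha: float = 0.01) -> bool:
    """Flag a feature whose recent distribution drifted from training."""
    statistic, p_value = stats.ks_2samp(train_values, recent_values)
    return p_value < alpha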
Operational Metrics
- Pipeline latency
- Prediction serving latency
- Model freshness
# Alerting on forecast degradation
# (MAPE_THRESHOLD and send_alert are defined elsewhere in the codebase)
import numpy as np
import pandas as pd

def check_forecast_quality(actuals: pd.Series, forecasts: pd.Series) -> None:
    # Exclude zero-sales days to avoid division by zero
    nonzero = actuals != 0
    mape = np.mean(np.abs((actuals[nonzero] - forecasts[nonzero]) / actuals[nonzero]))
    if mape > MAPE_THRESHOLD:
        send_alert(
            severity="warning",
            message=f"Forecast MAPE ({mape:.2%}) exceeds threshold",
        )

Lessons Learned
1. Simple Models Go Far
Prophet with good feature engineering beats complex neural models for many use cases. Start simple.
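For illustration, a minimal Prophet setup with engineered regressors might look like this; the file path and the 'price' and 'is_promo' feature names are hypothetical:

import pandas as pd
from prophet import Prophet

# df needs Prophet's ds (date) and y (target) columns plus regressors
df = pd.read_parquet("data/05_model_input/features.parquet")

m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.add_regressor("price")
m.add_regressor("is_promo")
m.fit(df[["ds", "y", "price", "is_promo"]])

# The future frame must carry the same regressor columns
future = m.make_future_dataframe(periods=28)
future = future.merge(df[["ds", "price", "is_promo"]], on="ds", how="left")
future[["price", "is_promo"]] = future[["price", "is_promo"]].ffill()
forecast = m.predict(future)  # yhat, yhat_lower, yhat_upper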
2. Data Quality > Model Complexity
Spending time on data cleaning, outlier handling, and feature engineering pays off more than model tuning.
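As one example of outlier handling, a robust rolling cap; a sketch, with illustrative window and multiplier:

import pandas as pd

def cap_spikes(sales: pd.Series, window: int = 28, k: float = 3.0) -> pd.Series:
    """Clip one-off spikes to a rolling median-absolute-deviation band."""
    med = sales.rolling(window, min_periods=7).median()
    mad = (sales - med).abs().rolling(window, min_periods=7).median()
    upper = med + k * 1.4826 * mad  # 1.4826 scales MAD to a std-dev estimate
    return sales.clip(upper=upper)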
3. Business Context Matters
A 5% MAPE improvement means nothing if it doesn’t help business decisions. Align metrics with business outcomes.
4. Automate Everything
Manual interventions don’t scale. Build systems that handle edge cases automatically.
5. Communicate Uncertainty
Stakeholders need to understand forecast confidence. Invest in visualization and explanation.
Tech Stack Summary
- Orchestration: Kedro + Airflow
- Training: PyTorch (neural), Prophet (classical)
- Serving: FastAPI + Redis cache
- Storage: Parquet on S3, Redshift for queries
- Monitoring: Prometheus + Grafana
- Experiment tracking: MLflow
Building forecasting systems? I’d love to hear about your challenges. Connect with me on LinkedIn.