Pipelines and Cross-Validation

Published: Jun 2026

ID: DS-L12
Type: Premium
Audience: Intermediate to Advanced
Theme: From single results to stable, reproducible modeling

In previous lessons, we evaluated models using a single train/test split.

That gave us a useful starting point.

But a single split gives only one estimate of performance.

A model may look stronger or weaker depending on which observations happen to fall into the training set and which observations happen to fall into the test set.
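The effect is easy to see in a small experiment. The sketch below is illustrative only: it uses synthetic data from `make_regression` as a stand-in for the lesson's dataset (an assumption), and changes nothing except the seed that decides which rows land in the test set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the lesson uses data/diabetes.csv).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=25.0, random_state=0
)

split_mae = {}
for seed in range(5):
    # Only the seed changes, so only the assignment of rows to train/test changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    split_mae[seed] = mean_absolute_error(y_test, model.predict(X_test))

for seed, mae in split_mae.items():
    print(f"seed={seed}  MAE={mae:.2f}")
```

The model and the data are identical in every run. Any difference between the printed values comes from the split alone.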

To move toward reliable model evaluation, we need two things:

consistent preprocessing
repeated evaluation across multiple splits

This is where pipelines and cross-validation become essential.

A pipeline helps us keep the modeling workflow reproducible.

Cross-validation helps us understand whether performance is stable.

Together, they move us from a single result toward a more defensible evaluation system.

How to Run This Lesson

This lesson can be run in two ways.

First, run the supporting script from the project root to generate the reusable outputs:

python scripts/python/12a_cross_validate_pipeline.py

This creates the expected cross-validation outputs in the reports/ directory:

reports/diabetes-pipeline-cross-validation-folds.csv
reports/diabetes-pipeline-cross-validation-summary.csv
reports/figures/diabetes-cross-validation-mae.png
reports/figures/diabetes-cross-validation-r2.png

Then render the Quarto site:

quarto render

You can also run the code blocks inside this chapter interactively.

The script-based workflow is preferred for reproducibility because it leaves behind files that can be inspected, compared, committed, or reused in later chapters.

Load the Dataset

We continue with the same diabetes dataset used in the previous lessons.

#| label: 12-load-data
import pandas as pd


df = pd.read_csv("data/diabetes.csv")

X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]

df.head()

The feature table contains the predictors.

#| label: 12-feature-table
X.head()

The target contains the outcome we want to predict.

#| label: 12-target-column
y.head()

Using the same saved dataset keeps the workflow stable across chapters.

This is important because cross-validation should evaluate model behavior, not hidden changes in the data.

Why Pipelines Matter

So far, we have often shown preprocessing and modeling as separate steps.

That is useful for learning.

But in real workflows, separating preprocessing from modeling can create risk.

Common problems include:

applying transformations to training data but forgetting them on test data
applying transformations inconsistently across scripts
leaking information from the full dataset into preprocessing
making the workflow harder to reproduce

A pipeline solves this by combining preprocessing and modeling into one object.

The pipeline defines the full sequence of steps needed to move from input features to predictions.

That makes the workflow more reliable.
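As a minimal sketch of what the pipeline replaces, the example below fits the same scaler and model twice: once as separate manual steps and once as a single pipeline. The data is synthetic and the names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical, chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Manual workflow: separate calls that must be kept in sync by hand.
scaler = StandardScaler().fit(X_train)          # fit on training rows only
manual_model = LinearRegression().fit(scaler.transform(X_train), y_train)
manual_pred = manual_model.predict(scaler.transform(X_test))

# Pipeline workflow: the same sequence, defined once and applied as one object.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipeline.fit(X_train, y_train)
pipeline_pred = demo_pipeline.predict(X_test)

same_predictions = np.allclose(manual_pred, pipeline_pred)
print("Predictions match:", same_predictions)
```

When the manual steps are done correctly, the two workflows agree. The pipeline's advantage is that there is no transform call to forget or to apply to the wrong data.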

Build a Modeling Pipeline

Here we create a pipeline that first scales the features and then fits a linear regression model.

#| label: 12-build-pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

pipeline

The pipeline has two steps:

Step	Purpose
`scaler`	Standardizes the feature values
`model`	Fits the linear regression model

This means scaling is not handled as a separate manual step.

It becomes part of the model workflow.

That matters because the same transformation is applied consistently during evaluation.
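Because each step has a name, the fitted pieces can be inspected afterwards. The following sketch assumes synthetic data and a hypothetical `demo_pipeline` with the same two steps as above. It uses the pipeline's `named_steps` attribute to look at what the scaler and the model learned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=100, n_features=3, noise=5.0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# Each fitted step is available under the name it was given.
fitted_scaler = demo_pipeline.named_steps["scaler"]
fitted_model = demo_pipeline.named_steps["model"]

print("feature means learned by the scaler:", np.round(fitted_scaler.mean_, 3))
print("coefficients on the scaled features:", np.round(fitted_model.coef_, 3))
```

The coefficients refer to the standardized features, not the raw ones, because the model only ever sees the scaler's output.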

Why Cross-Validation Matters

A single train/test split gives one estimate of performance.

But that estimate depends on how the data happens to be split.

Cross-validation addresses this problem by evaluating the model across multiple train/test partitions.

In k-fold cross-validation, the data is divided into k folds of roughly equal size.

The model is trained k times, each time on k − 1 folds and evaluated on the one fold that was held out.

Each fold gets exactly one turn as the validation set.

The result is not one score.

The result is a set of scores.

This lets us ask better questions:

What is the average performance?
How much does performance vary across folds?
Are there folds where the model performs unusually poorly?
Is the model stable enough to support a real-world claim?

This is much more informative than relying on a single split.
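To make the fold mechanics concrete, here is a small sketch that splits ten row indices with scikit-learn's `KFold`. The ten-row array is a toy example, not the lesson's data.

```python
import numpy as np
from sklearn.model_selection import KFold

row_ids = np.arange(10)      # toy example: ten row indices
kfold = KFold(n_splits=5)    # no shuffling, so folds are contiguous blocks

validation_rows = []
for fold_number, (train_idx, val_idx) in enumerate(kfold.split(row_ids), start=1):
    print(f"fold {fold_number}: train={train_idx.tolist()} validate={val_idx.tolist()}")
    validation_rows.extend(val_idx.tolist())

# Every row is used for validation exactly once across the five folds.
print(sorted(validation_rows) == row_ids.tolist())
```

With ten rows and five folds, each validation set holds two rows and each training set holds the other eight.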

Apply Cross-Validation

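We now evaluate the pipeline using 5-fold cross-validation.

With `cv=5` and a regression model, scikit-learn uses an unshuffled `KFold` split, so the two calls below score the same five folds.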

#| label: 12-cross-validation
from sklearn.model_selection import cross_val_score
import numpy as np


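# scikit-learn reports MAE as a negative score, so negate it to get ordinary MAE.
mae_scores = -cross_val_score(
    pipeline,
    X,
    y,
    cv=5,
    scoring="neg_mean_absolute_error"
)

# R² already follows "higher is better", so no sign change is needed.
r2_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=5,
    scoring="r2"
)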

mae_scores, r2_scores

Scikit-learn reports MAE as a negative number under `neg_mean_absolute_error` because its scoring convention is that higher scores are always better.

We multiply by -1 so that the values read as ordinary MAE.

For MAE, lower values are better.

For R², higher values are better.
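An alternative worth knowing is `cross_validate`, which can score several metrics in a single pass over the folds. The sketch below is not the lesson's script. It uses synthetic data, and the dictionary keys `"mae"` and `"r2"` are labels chosen for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=25.0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# One call, two metrics: each fold is fitted once and scored twice.
cv_results = cross_validate(
    demo_pipeline,
    X_demo,
    y_demo,
    cv=5,
    scoring={"mae": "neg_mean_absolute_error", "r2": "r2"},
)

# Results are stored under "test_<key>"; MAE still needs its sign flipped.
demo_mae = -cv_results["test_mae"]
demo_r2 = cv_results["test_r2"]

print("MAE per fold:", demo_mae.round(2))
print("R² per fold: ", demo_r2.round(3))
```

This fits the pipeline once per fold instead of once per fold per metric, which matters more as models become slower to train.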

Create a Cross-Validation Results Table

We now place the fold-level results into a table.

#| label: 12-results-table
results_df = pd.DataFrame({
    "fold": range(1, len(mae_scores) + 1),
    "mae": mae_scores,
    "r2": r2_scores
})

results_df.round(3)

Each row represents one fold.

This table makes the evaluation more transparent.

Instead of reporting only one number, we can see how performance changes across folds.

Summarize Cross-Validation Performance

Next, we summarize the average and variability of the scores.

#| label: 12-summary
summary_df = pd.DataFrame({
    "metric": ["mae", "r2"],
    "mean": [np.mean(mae_scores), np.mean(r2_scores)],
    "std": [np.std(mae_scores), np.std(r2_scores)]
})

summary_df.round(3)

The mean tells us the average performance.

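The standard deviation tells us how much performance varies across folds. With only five folds it is a rough estimate, and `np.std` computes the population form (`ddof=0`) by default.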

A model with good average performance but high variability may be less reliable than it first appears.

A model with moderate performance and low variability may be more dependable.
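The choice of standard deviation formula is visible with only a handful of scores. The following worked example uses five made-up fold scores (hypothetical values, not results from this lesson) to compare NumPy's default with the sample form.

```python
import numpy as np

# Hypothetical fold scores, invented for this illustration only.
example_scores = np.array([42.0, 45.0, 44.0, 47.0, 43.0])

population_std = np.std(example_scores)        # divides by n (ddof=0, NumPy default)
sample_std = np.std(example_scores, ddof=1)    # divides by n - 1

print(f"mean           = {example_scores.mean():.3f}")
print(f"std (ddof=0)   = {population_std:.3f}")
print(f"std (ddof=1)   = {sample_std:.3f}")
```

The sample form is always the larger of the two, and the gap is widest when there are few folds. Note that pandas uses `ddof=1` by default, so `results_df["mae"].std()` and `np.std(mae_scores)` will not agree exactly. Whichever is chosen, it should be reported consistently.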

Save Cross-Validation Outputs

In an applied data science workflow, evaluation results should be saved as project artifacts.

This allows the results to be reused in reports, reviewed later, or compared with future model versions.

#| label: 12-save-cross-validation-results
from pathlib import Path


Path("reports").mkdir(exist_ok=True)

results_df.to_csv(
    "reports/diabetes-pipeline-cross-validation-folds.csv",
    index=False
)

summary_df.to_csv(
    "reports/diabetes-pipeline-cross-validation-summary.csv",
    index=False
)

results_df.round(3)

This creates two project outputs:

reports/diabetes-pipeline-cross-validation-folds.csv
reports/diabetes-pipeline-cross-validation-summary.csv

These files make the evaluation reproducible beyond the rendered chapter.

Visualize Performance Across Folds

Tables are useful, but plots make variability easier to see.

First, we visualize MAE across folds.

#| label: 12-plot-mae
import matplotlib.pyplot as plt


folds = results_df["fold"]

plt.figure(figsize=(7, 5))
plt.plot(folds, results_df["mae"], marker="o")
plt.xlabel("Fold")
plt.ylabel("MAE")
plt.title("MAE Across Cross-Validation Folds")
plt.grid(alpha=0.2)
plt.show()

Next, we visualize R² across folds.

#| label: 12-plot-r2
plt.figure(figsize=(7, 5))
plt.plot(folds, results_df["r2"], marker="o")
plt.xlabel("Fold")
plt.ylabel("R²")
plt.title("R² Across Cross-Validation Folds")
plt.grid(alpha=0.2)
plt.show()

These plots show whether performance is relatively stable or highly dependent on the fold.

The goal is not identical performance across all folds.

Some variation is expected.

The goal is limited variability around a consistent average.

Save Cross-Validation Figures

We can also save the plots as report figures.

#| label: 12-save-cross-validation-figures
Path("reports/figures").mkdir(parents=True, exist_ok=True)

plt.figure(figsize=(7, 5))
plt.plot(folds, results_df["mae"], marker="o")
plt.xlabel("Fold")
plt.ylabel("MAE")
plt.title("MAE Across Cross-Validation Folds")
plt.grid(alpha=0.2)
plt.savefig("reports/figures/diabetes-cross-validation-mae.png", dpi=300, bbox_inches="tight")
plt.close()

plt.figure(figsize=(7, 5))
plt.plot(folds, results_df["r2"], marker="o")
plt.xlabel("Fold")
plt.ylabel("R²")
plt.title("R² Across Cross-Validation Folds")
plt.grid(alpha=0.2)
plt.savefig("reports/figures/diabetes-cross-validation-r2.png", dpi=300, bbox_inches="tight")
plt.close()

This creates two reusable visual outputs:

reports/figures/diabetes-cross-validation-mae.png
reports/figures/diabetes-cross-validation-r2.png

These figures can be included in reports, presentations, or later case studies.

Interpret the Results

Cross-validation replaces a single result with a distribution of outcomes.

That distribution is more informative than one train/test score.

For MAE, we ask:

Are the errors consistently low?
Are there folds with much higher error?
Is the average error acceptable for the problem?

For R², we ask:

Does the model explain a meaningful amount of variation?
Is explained variance stable across folds?
Are some folds much weaker than others?

This gives us a more complete view of model behavior.

What Cross-Validation Reveals

Each fold exposes the model to a different training subset and validation subset.

This affects:

which patterns are learned
which observations are used for evaluation
how sensitive the model is to data partitioning
whether the model generalizes consistently

If performance changes dramatically across folds, that is a warning sign.

It may suggest that the model is sensitive to the split, the dataset is small, important subgroups are unevenly distributed, or the model is not stable.

If performance varies only modestly around a consistent average, the model is more defensible as a baseline.
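One way to probe sensitivity to partitioning, beyond what this lesson's script does, is to repeat the k-fold procedure with different shuffles. The sketch below uses scikit-learn's `RepeatedKFold` on synthetic data. The choice of three repeats is arbitrary and made only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=25.0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Three independent 5-fold partitions of the same data: 15 scores in total.
repeated_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
repeated_mae = -cross_val_score(
    demo_pipeline,
    X_demo,
    y_demo,
    cv=repeated_cv,
    scoring="neg_mean_absolute_error",
)

# Scores arrive repeat by repeat, so each row below is one full 5-fold run.
per_repeat_mean = repeated_mae.reshape(3, 5).mean(axis=1)

print("mean MAE per repeat:", per_repeat_mean.round(2))
print("overall mean / std: ", repeated_mae.mean().round(2), repeated_mae.std().round(2))
```

If the per-repeat means sit close together, the average is not an accident of one particular partition. If they differ widely, the dataset may be too small for the estimate to be trusted.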

Pipelines Reduce Leakage Risk

Cross-validation is most useful when preprocessing is handled correctly.

This is why we used a pipeline.

When preprocessing is included inside the pipeline, its parameters are re-learned from the training portion of every split.

For example, the scaler is fitted only on the training portion of each fold.

Then that fitted transformation is applied to the validation portion.

This avoids using information from the validation fold during preprocessing.

That is important because using validation information too early can make performance look better than it really is.

This problem is called data leakage.

Pipelines help prevent it.
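This behavior can be checked directly. The sketch below assumes synthetic data and a hypothetical `demo_pipeline`. It asks `cross_validate` to return the fitted pipeline from each fold, then compares each scaler's learned means with the means of that fold's training rows and with the means of the full dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=100, n_features=4, noise=10.0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

kfold = KFold(n_splits=5)  # unshuffled, so split() yields the same folds each time
cv_results = cross_validate(
    demo_pipeline, X_demo, y_demo, cv=kfold, return_estimator=True
)

full_data_mean = X_demo.mean(axis=0)
matches_training_rows = []
matches_full_data = []
for (train_idx, _), fitted in zip(kfold.split(X_demo), cv_results["estimator"]):
    fold_scaler_mean = fitted.named_steps["scaler"].mean_
    # Compare against the statistics of this fold's training rows only...
    matches_training_rows.append(
        np.allclose(fold_scaler_mean, X_demo[train_idx].mean(axis=0))
    )
    # ...and against the statistics a scaler fitted on all rows would have learned.
    matches_full_data.append(np.allclose(fold_scaler_mean, full_data_mean))

print("scaler means match training rows:", matches_training_rows)
print("scaler means match full dataset: ", matches_full_data)
```

Each fold's scaler reflects its own training rows, not the full dataset. Scaling all the data once before cross-validation would instead let every fold see statistics computed partly from its validation rows.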

CDI Interpretation

Pipelines and cross-validation are not just technical conveniences.

They are part of defensible analytical practice.

A pipeline answers:

Was the modeling workflow applied consistently?

Cross-validation answers:

Is performance stable across different data splits?

Together, they help us move from a single model result to a reproducible evaluation system.

This matters because real-world decisions should not depend on one lucky split.

Summary

In this lesson, we moved from single-split evaluation to pipeline-based cross-validation.

We:

loaded the saved diabetes dataset
built a preprocessing and modeling pipeline
evaluated the model using 5-fold cross-validation
summarized MAE and R² across folds
saved fold-level and summary results
visualized performance variability
saved cross-validation figures
explained how pipelines reduce leakage risk

The key lesson is:

Stable model evaluation requires both consistent workflows and repeated testing across data splits.

What Comes Next

Cross-validation helps us understand whether model performance is stable.

But performance alone does not explain how the model works.

In the next lesson, we will examine:

feature importance
model interpretation
coefficient inspection
what we can and cannot infer from model structure

This moves us from evaluating predictions to understanding model behavior.

→ Feature importance and model interpretation