Evaluating Models Beyond a Single Result
In the previous lesson, we built a model and evaluated it using a single train/test split.
We obtained metrics.
We visualized predictions.
We inspected coefficients.
That was necessary.
But it is not sufficient.
A single evaluation result does not tell us whether a model is reliable.
Load the Dataset
We continue using the diabetes dataset introduced in the previous lesson.
Because the dataset was saved into the project data/ directory, we now load it directly from disk.
This keeps the workflow consistent across lessons and reflects how real projects are usually organized.
import pandas as pd
df = pd.read_csv("data/diabetes.csv")
X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]
df.head()
Why Evaluation Is Not a Single Number
When we compute metrics such as:
- Mean Absolute Error (MAE)
- R-squared (R²)
we are summarizing model performance under one specific condition:
- one train/test split
- one random state
- one sample of the data
This raises an important question:
What happens if the data is split differently?
Repeating the Split
Let us repeat the modeling process using different random splits.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
def evaluate_once(X, y, random_state):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mae, r2

results = [evaluate_once(X, y, rs) for rs in range(5)]
results
Even though:
- the data is the same
- the model is the same
the performance changes.
This tells us something important:
Measured performance is not a fixed property of the model: it depends on how the data is partitioned.
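To see why the random state matters, it can help to look at what a random split does to the rows. The following is a minimal sketch using only the standard library. It is not how `train_test_split` is implemented; `split_indices` and `mae_of_mean_baseline` are hypothetical helpers, and the target values are made up for illustration. Each seed selects a different set of test rows, so even a trivial baseline that always predicts the training mean is scored on different data each time.

```python
import random
from statistics import mean


def split_indices(n_samples, test_fraction, seed):
    """Shuffle 0..n_samples-1 with a seeded generator and cut off a test block."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def mae_of_mean_baseline(y, train_idx, test_idx):
    """MAE on the test rows of a baseline that always predicts the training mean."""
    prediction = mean(y[i] for i in train_idx)
    return mean(abs(y[i] - prediction) for i in test_idx)


# Made-up target values, for illustration only.
y = [12, 15, 9, 30, 22, 18, 41, 7, 25, 16, 33, 11, 28, 19, 14, 37, 21, 10, 26, 17]

for seed in range(3):
    train_idx, test_idx = split_indices(len(y), 0.2, seed)
    score = mae_of_mean_baseline(y, train_idx, test_idx)
    print(seed, sorted(test_idx), round(score, 2))
```

The same seed always reproduces the same split, which is why fixing `random_state` makes a result repeatable without making it representative.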
Aggregating Results
results_df = pd.DataFrame(results, columns=["MAE", "R2"])
results_df
results_df.describe().round(3)
Instead of a single value, we now see a range of performance.
This is a more realistic view.
Script 10A — Evaluate Repeated Train/Test Splits
Create a file called:
scripts/python/10a_evaluate_repeated_splits.py
Run it with:
python scripts/python/10a_evaluate_repeated_splits.py
Expected outputs:
reports/diabetes-repeated-split-metrics.csv
reports/diabetes-repeated-split-summary.csv
This script repeats model fitting across multiple random train/test splits and summarizes how performance changes.
Introducing Cross-Validation
To make evaluation more stable, we can systematically repeat this process.
This is the idea behind cross-validation.
from sklearn.model_selection import cross_val_score
model = LinearRegression()
cv_scores = cross_val_score(
    model,
    X,
    y,
    cv=5,
    scoring="r2"
)
cv_scores
pd.DataFrame({
    "cv_fold": range(1, len(cv_scores) + 1),
    "r2": cv_scores
}).round(3)
cv_scores.mean(), cv_scores.std()
Cross-validation gives us:
- an average performance
- a measure of variability
This is more informative than a single split.
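The partitioning behind `cv=5` can be sketched by hand. According to the scikit-learn documentation, an integer `cv` with a regressor uses K-fold splitting without shuffling, so the folds are contiguous blocks of rows. The helper below, `kfold_indices`, is a hypothetical standard-library illustration of that scheme, not the library's code. It shows that every row is used for testing exactly once.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Twelve rows split into five folds, for illustration only.
for fold, (train, test) in enumerate(kfold_indices(12, 5), start=1):
    print(fold, test)
```

Because the folds follow row order, a dataset that is sorted in some meaningful way can produce folds that differ systematically from one another. This is one reason fold scores sometimes vary more than repeated random splits do.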
Script 10B — Run Cross-Validation
Create a file called:
scripts/python/10b_cross_validate_linear_model.py
Run it with:
python scripts/python/10b_cross_validate_linear_model.py
Expected outputs:
reports/diabetes-cross-validation-r2.csv
reports/diabetes-cross-validation-summary.csv
This script evaluates the model across cross-validation folds and records both fold-level and summary results.
Interpreting Metrics Carefully
Metrics are useful, but incomplete.
Mean Absolute Error (MAE)
- easy to interpret
- expressed in the units of the target, so its size depends on the target's scale
But it does not tell us:
- where errors occur
- whether errors are systematic
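A small worked example makes the limitation concrete. The sketch below uses made-up numbers and two hypothetical helpers, `mae` and `mean_error`. Both sets of predictions have the same MAE, but only one of them is wrong in the same direction every time.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Average size of the errors, ignoring their sign."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_error(y_true, y_pred):
    """Signed average of (actual - predicted); values far from zero suggest bias."""
    return mean(t - p for t, p in zip(y_true, y_pred))


# Made-up values, for illustration only.
y_true = [100, 150, 200, 250]
balanced = [110, 140, 210, 240]  # misses alternate between too high and too low
biased = [110, 160, 210, 260]    # every prediction is 10 too high

print(mae(y_true, balanced), mean_error(y_true, balanced))
print(mae(y_true, biased), mean_error(y_true, biased))
```

MAE alone cannot distinguish these two cases, which is why it is worth pairing with a look at the signed errors.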
R-squared (R²)
- measures the proportion of variance in the target that the model explains
But it can be misleading:
- a moderate R² may still be useful
- a high R² does not guarantee generalization
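It can help to compute R² by hand once. The sketch below follows the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. The function name `r_squared` and the numbers are illustrative assumptions. The example shows that R² compares the model to a baseline that always predicts the mean, and that the score becomes negative when the model does worse than that baseline.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values, for illustration only.
y_true = [10, 20, 30, 40]

print(r_squared(y_true, [10, 20, 30, 40]))  # perfect predictions
print(r_squared(y_true, [25, 25, 25, 25]))  # always predict the mean
print(r_squared(y_true, [40, 30, 20, 10]))  # worse than predicting the mean
```

Because the baseline is the mean of whichever rows are being scored, the same model can receive noticeably different R² values on different test sets.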
Residual Analysis
To examine residuals clearly, we recompute predictions using a defined train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
residual_model = LinearRegression()
residual_model.fit(X_train, y_train)
y_pred = residual_model.predict(X_test)
import matplotlib.pyplot as plt
import numpy as np
residuals = y_test - y_pred
plt.figure(figsize=(7, 5))
scatter = plt.scatter(
    y_pred,
    residuals,
    c=np.abs(residuals),
    cmap="viridis",
    edgecolors="black",
    linewidth=0.4,
    alpha=0.9
)
plt.axhline(0, linestyle="--", linewidth=1)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.colorbar(scatter, label="Absolute Error")
plt.grid(alpha=0.2)
plt.show()
This plot helps us examine how prediction errors behave across the range of model outputs.
It allows us to detect:
- systematic bias: if residuals (actual minus predicted) sit mostly above zero, the model tends to under-predict; if they sit mostly below zero, it tends to over-predict
- uneven error distribution: if the spread of residuals changes across predicted values, the model performs inconsistently across ranges
- regions of poor performance: clusters of large residuals indicate where the model struggles to capture the underlying pattern
Rather than summarizing error with a single number, this plot reveals how and where the model fails, which is essential for reliable interpretation.
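The same questions can be checked numerically alongside the plot. The sketch below is one possible approach, not part of the lesson's scripts. The helper `residual_summary` is hypothetical and the values are made up. It sorts the test rows by predicted value, splits them into a lower and an upper half, and reports the mean and spread of the residuals in each half.

```python
from statistics import mean, pstdev


def residual_summary(y_true, y_pred):
    """Mean and spread of residuals in the lower and upper half of predictions."""
    pairs = sorted(zip(y_pred, y_true))  # order rows by predicted value
    half = len(pairs) // 2
    summary = {}
    for name, chunk in (("low", pairs[:half]), ("high", pairs[half:])):
        residuals = [actual - predicted for predicted, actual in chunk]
        summary[name] = {"mean": mean(residuals), "spread": pstdev(residuals)}
    return summary


# Made-up values, for illustration only.
y_pred = [80, 100, 120, 140, 160, 180, 200, 220]
y_true = [82, 98, 121, 139, 150, 195, 180, 245]

for name, stats in residual_summary(y_true, y_pred).items():
    print(name, round(stats["mean"], 2), round(stats["spread"], 2))
```

A mean far from zero in either half points toward bias in that range, and a much larger spread in one half points toward uneven errors. Two halves is a coarse choice; more bins give a finer picture when the test set is large enough.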
Script 10C — Create Residual Plot
Create a file called:
scripts/python/10c_plot_residuals.py
Run it with:
python scripts/python/10c_plot_residuals.py
Expected outputs:
reports/diabetes-residuals.csv
reports/figures/diabetes-residual-plot.png
What Reliable Evaluation Requires
A model is not reliable because:
- it produces a number
- it performs well on one split
A model becomes more reliable when:
- performance is stable across splits
- errors are understood, not just measured
- evaluation reflects how the model will be used
What Can Go Wrong
Over-trusting a Single Metric
One number cannot capture full model behavior.
Ignoring Variability
Some variation across splits is expected, but large swings indicate that the performance estimate is unstable.
Misinterpreting Metrics
Metrics must be interpreted in context.
Skipping Error Analysis
Without examining residuals, important patterns may be missed.
CDI Insight
A model is not evaluated once.
It is evaluated across conditions.
Reliability comes from consistency, not a single result.
Summary
- A single train/test split provides limited insight.
- Model performance varies depending on data partitioning.
- Cross-validation provides a more stable estimate.
- Metrics must be interpreted carefully.
- Residual analysis reveals patterns hidden by summary statistics.
- Reliable evaluation requires multiple perspectives.
Exercise
Run the three scripts for this lesson.
Then answer:
- How much does MAE vary across repeated splits?
- How much does R-squared vary across repeated splits?
- What does cross-validation reveal that a single split does not?
- Are residuals evenly distributed around zero?
- Where does the model appear to make larger errors?
- Why is stable performance more trustworthy than a single strong score?
Looking Ahead
In the next lesson, we begin improving models through controlled changes.
The focus shifts from asking:
How did this model perform?
to asking:
What should we change, and how do we know whether the change helped?