Evaluating Models Beyond a Single Result

Published

Jun 2026

  • ID: DS-L10
  • Type: Premium
  • Audience: Intermediate to Advanced
  • Theme: Reliable models require stable and context-aware evaluation

In the previous lesson, we built a model and evaluated it using a single train/test split.

We obtained metrics.

We visualized predictions.

We inspected coefficients.

That was necessary.

But it is not sufficient.

A single evaluation result does not tell us whether a model is reliable.


Load the Dataset

We continue using the diabetes dataset introduced in the previous lesson.

Because the dataset was saved into the project data/ directory, we now load it directly from disk.

This keeps the workflow consistent across lessons and reflects how real projects are usually organized.

import pandas as pd

df = pd.read_csv("data/diabetes.csv")

X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]

df.head()

Why Evaluation Is Not a Single Number

When we compute metrics such as:

  • Mean Absolute Error (MAE)
  • R-squared (R²)

we are summarizing model performance under one specific condition:

  • one train/test split
  • one random state
  • one sample of the data

This raises an important question:

What happens if the data is split differently?


Repeating the Split

Let us repeat the modeling process using different random splits.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

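def evaluate_once(X, y, random_state):
    """Fit and score one linear model on a single random train/test split."""
    # Hold out 20% of the rows; random_state controls which rows are held out.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    return mae, r2

# Same data, same model, five different splits.
results = [evaluate_once(X, y, rs) for rs in range(5)]

results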

Even though:

  • the data is the same
  • the model is the same

the performance changes.

This tells us something important:

Measured performance is not fixed.
It depends on how the data is partitioned.


Aggregating Results

results_df = pd.DataFrame(results, columns=["MAE", "R2"])
results_df

# Mean, standard deviation, and range of each metric across the splits
results_df.describe().round(3)

Instead of a single value, we now see a range of performance.

This is a more realistic view.


Script 10A — Evaluate Repeated Train/Test Splits

Create a file called:

scripts/python/10a_evaluate_repeated_splits.py

Run it with:

python scripts/python/10a_evaluate_repeated_splits.py

Expected outputs:

reports/diabetes-repeated-split-metrics.csv
reports/diabetes-repeated-split-summary.csv

This script repeats model fitting across multiple random train/test splits and summarizes how performance changes.
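
The script itself is not reproduced in this lesson, so what follows is only a minimal sketch of what 10a_evaluate_repeated_splits.py could look like. The file paths and the disease_progression column come from this lesson; the function names (evaluate_repeated_splits, summarize, main) and the choice of 20 splits are assumptions made for illustration.

```python
"""Sketch of scripts/python/10a_evaluate_repeated_splits.py (assumed layout)."""
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

DATA_PATH = Path("data/diabetes.csv")
REPORTS_DIR = Path("reports")
TARGET = "disease_progression"
N_SPLITS = 20  # assumption: the lesson does not fix the number of repeats


def evaluate_repeated_splits(df, target=TARGET, n_splits=N_SPLITS):
    """Return one row of MAE and R2 per random train/test split."""
    X = df.drop(columns=[target])
    y = df[target]
    rows = []
    for seed in range(n_splits):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = LinearRegression().fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "random_state": seed,
            "MAE": mean_absolute_error(y_test, y_pred),
            "R2": r2_score(y_test, y_pred),
        })
    return pd.DataFrame(rows)


def summarize(metrics):
    """Summarize the metric columns (count, mean, std, min, quartiles, max)."""
    return metrics[["MAE", "R2"]].describe().round(3)


def main():
    if not DATA_PATH.exists():
        print(f"Dataset not found at {DATA_PATH}; save it there first.")
        return
    df = pd.read_csv(DATA_PATH)
    metrics = evaluate_repeated_splits(df)
    summary = summarize(metrics)
    REPORTS_DIR.mkdir(parents=True, exist_ok=True)
    metrics.to_csv(REPORTS_DIR / "diabetes-repeated-split-metrics.csv", index=False)
    summary.to_csv(REPORTS_DIR / "diabetes-repeated-split-summary.csv")
    print(summary)


if __name__ == "__main__":
    main()
```

Keeping the evaluation in a function that takes a DataFrame, separate from file handling in main, makes it easy to reuse the same logic in a notebook or to try a different number of splits.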


Introducing Cross-Validation

To make evaluation more stable, we can systematically repeat this process.

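This is the idea behind cross-validation: the data is divided into k folds (here, cv=5), and each fold is used once as the test set while the model is trained on the remaining folds.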

from sklearn.model_selection import cross_val_score

model = LinearRegression()

cv_scores = cross_val_score(
    model,
    X,
    y,
    cv=5,
    scoring="r2"
)

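# Raw R² score for each of the five folds
cv_scores

# The same scores as a table
pd.DataFrame({
    "cv_fold": range(1, len(cv_scores) + 1),
    "r2": cv_scores
}).round(3)

# Average performance and variability across folds
cv_scores.mean(), cv_scores.std()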

Cross-validation gives us:

  • an average performance
  • a measure of variability

This is more informative than a single split.
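
To make the mechanics concrete, here is a minimal sketch of what a 5-fold evaluation does internally. It uses a synthetic stand-in dataset from make_regression rather than the lesson's data. For a regressor, an integer cv in cross_val_score uses KFold without shuffling, so the manual loop below should reproduce the same scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption, not the diabetes dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

kfold = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks of rows

manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Each fold serves as the test set exactly once;
    # the remaining four folds form the training set.
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = model.predict(X_demo[test_idx])
    manual_scores.append(r2_score(y_demo[test_idx], y_pred))

manual_scores = np.array(manual_scores)

# The library call that the manual loop mirrors.
library_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="r2"
)

print(manual_scores.round(3))
print(library_scores.round(3))
```

Because the folds are not shuffled, rows stored in a meaningful order can produce unrepresentative folds. Passing KFold(n_splits=5, shuffle=True, random_state=0) as cv is one way to address that.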


Script 10B — Run Cross-Validation

Create a file called:

scripts/python/10b_cross_validate_linear_model.py

Run it with:

python scripts/python/10b_cross_validate_linear_model.py

Expected outputs:

reports/diabetes-cross-validation-r2.csv
reports/diabetes-cross-validation-summary.csv

This script evaluates the model across cross-validation folds and records both fold-level and summary results.
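
As with the other scripts, the lesson does not show the code, so the following is a minimal sketch of what 10b_cross_validate_linear_model.py could look like. The paths and the target column come from this lesson; the function names and the columns of the summary file are assumptions.

```python
"""Sketch of scripts/python/10b_cross_validate_linear_model.py (assumed layout)."""
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

DATA_PATH = Path("data/diabetes.csv")
REPORTS_DIR = Path("reports")
TARGET = "disease_progression"


def cross_validate_r2(df, target=TARGET, cv=5):
    """Return fold-level R2 scores, one row per fold."""
    X = df.drop(columns=[target])
    y = df[target]
    scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
    return pd.DataFrame({"cv_fold": range(1, len(scores) + 1), "r2": scores})


def summarize_folds(fold_scores):
    """Return a one-row summary of the fold-level R2 scores."""
    r2 = fold_scores["r2"]
    return pd.DataFrame([{
        "n_folds": len(r2),
        "r2_mean": r2.mean(),
        # ddof=0 matches NumPy's default, i.e. what cv_scores.std() reports.
        "r2_std": r2.std(ddof=0),
        "r2_min": r2.min(),
        "r2_max": r2.max(),
    }])


def main():
    if not DATA_PATH.exists():
        print(f"Dataset not found at {DATA_PATH}; save it there first.")
        return
    df = pd.read_csv(DATA_PATH)
    fold_scores = cross_validate_r2(df)
    summary = summarize_folds(fold_scores)
    REPORTS_DIR.mkdir(parents=True, exist_ok=True)
    fold_scores.to_csv(REPORTS_DIR / "diabetes-cross-validation-r2.csv", index=False)
    summary.to_csv(REPORTS_DIR / "diabetes-cross-validation-summary.csv", index=False)
    print(fold_scores.round(3))
    print(summary.round(3))


if __name__ == "__main__":
    main()
```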


Interpreting Metrics Carefully

Metrics are useful, but incomplete.

Mean Absolute Error (MAE)

  • easy to interpret
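  • scale-dependent: expressed in the units of the target, so values are not comparable across targets measured on different scales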

But it does not tell us:

  • where errors occur
  • whether errors are systematic

R-squared (R²)

  • measures the proportion of variance in the target that the model explains

But it can be misleading:

  • a moderate R² may still be useful
  • a high R² does not guarantee generalization
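
Both metrics are simple enough to compute by hand, which makes their limits easier to see. The sketch below uses made-up numbers and hypothetical helper names (mae, r_squared). It is an illustration only, not a replacement for sklearn.metrics.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, relative to always predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

print(mae(y_true, y_pred))                  # 10.0
print(round(r_squared(y_true, y_pred), 3))  # 0.968

# Rescaling the target (for example, changing units) rescales MAE
# but leaves R2 unchanged.
y_true_scaled = [v / 10 for v in y_true]
y_pred_scaled = [v / 10 for v in y_pred]
print(mae(y_true_scaled, y_pred_scaled))                  # 1.0
print(round(r_squared(y_true_scaled, y_pred_scaled), 3))  # 0.968

# Different error patterns, same MAE: the first prediction set is off by 10
# everywhere, this one is exact except for a single error of 40.
y_pred_one_big = [100.0, 150.0, 200.0, 210.0]
print(mae(y_true, y_pred_one_big))  # 10.0
```

The last comparison is the reason MAE alone cannot say where errors occur: two very different error patterns collapse to the same number.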

Residual Analysis

To examine residuals clearly, we recompute predictions using a defined train/test split.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

residual_model = LinearRegression()
residual_model.fit(X_train, y_train)

y_pred = residual_model.predict(X_test)

import matplotlib.pyplot as plt
import numpy as np

# Residual = actual minus predicted; positive values mean the model under-predicted.
residuals = y_test - y_pred

plt.figure(figsize=(7, 5))

scatter = plt.scatter(
    y_pred,
    residuals,
    c=np.abs(residuals),
    cmap="viridis",
    edgecolors="black",
    linewidth=0.4,
    alpha=0.9
)

plt.axhline(0, linestyle="--", linewidth=1)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.colorbar(scatter, label="Absolute Error")
plt.grid(alpha=0.2)
plt.show()

This plot helps us examine how prediction errors behave across the range of model outputs.

It allows us to detect:

  • systematic bias
    If residuals (actual minus predicted) are mostly above zero, the model consistently under-predicts; if they are mostly below zero, it consistently over-predicts

  • uneven error distribution
    If the spread of residuals changes across predicted values, the model performs inconsistently across ranges

  • regions of poor performance
    Clusters of large residuals indicate where the model struggles to capture the underlying pattern

Rather than summarizing error with a single number, this plot reveals how and where the model fails, which is essential for reliable interpretation.
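
The visual checks above can be complemented with a few numbers. This is a minimal sketch using NumPy only and synthetic values; the helper name residual_diagnostics and the choice of three bins are assumptions for illustration.

```python
import numpy as np


def residual_diagnostics(y_true, y_pred, n_bins=3):
    """Summarize residuals overall and within equal-sized bins of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred  # positive residual = model under-predicted

    # Sort by predicted value, then split into bins of (nearly) equal size.
    order = np.argsort(y_pred)
    bins = np.array_split(order, n_bins)

    per_bin = []
    for idx in bins:
        per_bin.append({
            "pred_min": y_pred[idx].min(),
            "pred_max": y_pred[idx].max(),
            "mean_residual": residuals[idx].mean(),  # bias within the bin
            "residual_std": residuals[idx].std(),    # spread within the bin
        })

    return {
        "mean_residual": residuals.mean(),         # overall bias
        "share_positive": (residuals > 0).mean(),  # fraction under-predicted
        "per_bin": per_bin,
    }


# Synthetic example (assumption): noise grows with the predicted value.
rng = np.random.default_rng(0)
y_pred_demo = np.linspace(50, 250, 300)
y_true_demo = y_pred_demo + rng.normal(scale=0.1 * y_pred_demo)

report = residual_diagnostics(y_true_demo, y_pred_demo)
print(round(float(report["mean_residual"]), 2))
for row in report["per_bin"]:
    print({key: round(float(value), 2) for key, value in row.items()})
```

An overall mean residual far from zero points to systematic bias, while a residual spread that differs markedly between bins points to uneven error distribution. With the lesson's variables, the call would be residual_diagnostics(y_test, y_pred).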


Script 10C — Create Residual Plot

Create a file called:

scripts/python/10c_plot_residuals.py

Run it with:

python scripts/python/10c_plot_residuals.py

Expected outputs:

reports/diabetes-residuals.csv
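reports/figures/diabetes-residual-plot.png

This script fits the model on a fixed train/test split, saves the residuals, and writes the residual plot.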
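
The script itself is not reproduced in this lesson, so what follows is a minimal sketch of what 10c_plot_residuals.py could look like. The paths, the target column, and the plot styling come from this lesson; the function names, the CSV column names, and the non-interactive Agg backend are assumptions.

```python
"""Sketch of scripts/python/10c_plot_residuals.py (assumed layout)."""
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # assumption: render to file without needing a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

DATA_PATH = Path("data/diabetes.csv")
REPORTS_DIR = Path("reports")
FIGURES_DIR = REPORTS_DIR / "figures"
TARGET = "disease_progression"


def compute_residuals(df, target=TARGET, random_state=42):
    """Fit on a fixed split and return actual, predicted, and residual values."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    actual = y_test.to_numpy()
    return pd.DataFrame({
        "actual": actual,
        "predicted": y_pred,
        "residual": actual - y_pred,  # positive = model under-predicted
    })


def plot_residuals(residuals_df, out_path):
    """Save a residuals-vs-predictions scatter plot to out_path."""
    fig, ax = plt.subplots(figsize=(7, 5))
    scatter = ax.scatter(
        residuals_df["predicted"],
        residuals_df["residual"],
        c=residuals_df["residual"].abs(),
        cmap="viridis",
        edgecolors="black",
        linewidth=0.4,
        alpha=0.9,
    )
    ax.axhline(0, linestyle="--", linewidth=1)
    ax.set_xlabel("Predicted values")
    ax.set_ylabel("Residuals")
    ax.set_title("Residual Plot")
    fig.colorbar(scatter, ax=ax, label="Absolute Error")
    ax.grid(alpha=0.2)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)


def main():
    if not DATA_PATH.exists():
        print(f"Dataset not found at {DATA_PATH}; save it there first.")
        return
    residuals_df = compute_residuals(pd.read_csv(DATA_PATH))
    FIGURES_DIR.mkdir(parents=True, exist_ok=True)
    residuals_df.to_csv(REPORTS_DIR / "diabetes-residuals.csv", index=False)
    plot_residuals(residuals_df, FIGURES_DIR / "diabetes-residual-plot.png")


if __name__ == "__main__":
    main()
```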

What Reliable Evaluation Requires

A model is not reliable because:

  • it produces a number
  • it performs well on one split

A model becomes more reliable when:

  • performance is stable across splits
  • errors are understood, not just measured
  • evaluation reflects how the model will be used

What Can Go Wrong

Over-trusting a Single Metric

One number cannot capture full model behavior.

Ignoring Variability

Some variation across splits is expected, but large swings indicate that the estimate, or the model itself, is unstable.
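
One way to quantify that variability is to repeat cross-validation with different shuffles and report the spread alongside the mean. The following is a minimal sketch, assuming synthetic stand-in data and scikit-learn's RepeatedKFold; the numbers of splits and repeats are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (assumption, not the diabetes dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

# 5 folds, reshuffled and repeated 10 times: 50 fits in total.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print(f"R2 over {len(scores)} fits: mean={scores.mean():.3f}, std={scores.std():.3f}")
print(f"min={scores.min():.3f}, max={scores.max():.3f}")
```

Reporting the mean together with the standard deviation and the range makes it harder to overlook a model whose score depends heavily on the split.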

Misinterpreting Metrics

Metrics must be interpreted in context: the scale of the target, the cost of errors, and how the model will be used.

Skipping Error Analysis

Without examining residuals, important patterns may be missed.


CDI Insight

A model is not evaluated once.

It is evaluated across conditions.

Reliability comes from consistency, not a single result.


Summary

  • A single train/test split provides limited insight.
  • Model performance varies depending on data partitioning.
  • Cross-validation provides a more stable estimate.
  • Metrics must be interpreted carefully.
  • Residual analysis reveals patterns hidden by summary statistics.
  • Reliable evaluation requires multiple perspectives.

Exercise

Run the three scripts for this lesson.

Then answer:

  1. How much does MAE vary across repeated splits?
  2. How much does R-squared vary across repeated splits?
  3. What does cross-validation reveal that a single split does not?
  4. Are residuals evenly distributed around zero?
  5. Where does the model appear to make larger errors?
  6. Why is stable performance more trustworthy than a single strong score?

Looking Ahead

In the next lesson, we begin improving models through controlled changes.

The focus shifts from asking:

How did this model perform?

to asking:

What should we change, and how do we know whether the change helped?