Improving Models Through Controlled Changes

Published: Jun 2026

ID: DS-L11
Type: Premium
Audience: Intermediate to Advanced
Theme: Model improvement is a controlled and interpretable process, not random model shopping

In the previous lesson, we evaluated model performance beyond a single result.

We saw that performance can change depending on:

how the data is split
how the model is evaluated
whether the result is stable across repeated evaluations
whether the model behaves consistently enough to support a real claim

This leads to the next practical question:

How do we improve a model without fooling ourselves?

Model improvement is not simply trying many algorithms until one number looks better.

That approach can easily produce a model that appears better by chance.

A defensible improvement process is different.

It starts with a baseline.

It changes one thing at a time.

It compares models fairly.

It asks whether the change improves usefulness, stability, interpretability, or deployment readiness.

Load the Dataset

We continue using the diabetes dataset saved earlier in the project's data/ directory.

#| label: 11-load-diabetes-file
import pandas as pd

df = pd.read_csv("data/diabetes.csv")

X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]

df.head()

The target remains:

#| label: 11-target-column
y.head()

The feature table remains:

#| label: 11-feature-table
X.head()

Using the same saved dataset keeps the workflow consistent across lessons.

This matters because model improvement should not depend on hidden changes in the data.

Why Model Improvement Must Be Controlled

A common beginner mistake is to treat model improvement as a competition between algorithms.

For example:

Try linear regression, random forest, gradient boosting, support vector machines, and neural networks. Then choose the one with the best score.

That may sound practical, but it can be misleading.

If we try many models and only report the best result, we may be selecting a model that performed well because of one favorable split rather than because it generalizes better.
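The selection effect described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: every candidate has the same true error, and the noise added by the split is drawn independently for each candidate (in practice, errors on a shared split are correlated, so the effect is usually smaller). The constants `TRUE_MAE`, `SPLIT_NOISE`, `N_CANDIDATES`, and `N_REPEATS` are hypothetical values chosen for illustration, not results from the diabetes data.

```python
import random
import statistics

random.seed(0)

TRUE_MAE = 45.0      # hypothetical: every candidate has the same true error
SPLIT_NOISE = 3.0    # hypothetical: spread in test MAE caused by the split alone
N_CANDIDATES = 5
N_REPEATS = 2000

single_model_scores = []
best_of_many_scores = []

for _ in range(N_REPEATS):
    # Observed test MAE for each candidate = true MAE + noise from the split.
    observed = [random.gauss(TRUE_MAE, SPLIT_NOISE) for _ in range(N_CANDIDATES)]
    single_model_scores.append(observed[0])      # report one model chosen in advance
    best_of_many_scores.append(min(observed))    # report whichever looked best

mean_single = statistics.mean(single_model_scores)
mean_best = statistics.mean(best_of_many_scores)

print(f"model chosen in advance, mean reported MAE: {mean_single:.2f}")
print(f"best of {N_CANDIDATES} candidates, mean reported MAE:  {mean_best:.2f}")
```

Although no candidate is actually better than the others, reporting only the best-looking one produces a systematically lower MAE than the true value.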

A controlled improvement process is more disciplined.

It asks:

What exactly are we changing?
Why might this change help?
How will we compare it against the baseline?
Does the improvement persist under fair evaluation?
Is the new model still interpretable enough for the use case?

In applied data science, the goal is not to find the most impressive number.

The goal is to produce a model whose behavior can be explained, reproduced, and defended.

Start With the Baseline

We begin by recreating the same baseline model used earlier.

#| label: 11-baseline-model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)

baseline_predictions = baseline_model.predict(X_test)

baseline_mae = mean_absolute_error(y_test, baseline_predictions)
baseline_r2 = r2_score(y_test, baseline_predictions)

baseline_mae, baseline_r2

This model is our reference point.

Every proposed improvement will be compared against it.

Define Candidate Model Changes

We will compare a small set of controlled alternatives.

Each alternative changes one modeling idea:

Model	Change introduced
Linear regression	Baseline model
Scaled linear regression	Adds feature scaling
Ridge regression	Adds regularization
Decision tree	Changes model family

These are not the only possible models.

They are enough to demonstrate the improvement workflow.

The purpose of this lesson is not to declare a universal winner.

The purpose is to show how to compare model changes carefully.

Candidate 1: Scaled Linear Regression

Feature scaling transforms features so they are on a comparable numerical scale.

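For ordinary linear regression with an intercept, scaling does not change the predictions beyond floating-point noise: the fitted coefficients rescale to compensate, so MAE and R² should match the baseline.

However, scaling matters for regularized models such as ridge regression, where the penalty depends on coefficient size, and for many other models that are sensitive to feature magnitude.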

Here we include it because it introduces a reproducible pipeline pattern.

#| label: 11-scaled-linear-model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaled_linear_model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

scaled_linear_model.fit(X_train, y_train)

scaled_linear_predictions = scaled_linear_model.predict(X_test)

scaled_linear_mae = mean_absolute_error(y_test, scaled_linear_predictions)
scaled_linear_r2 = r2_score(y_test, scaled_linear_predictions)

scaled_linear_mae, scaled_linear_r2

This model changes the preprocessing step while keeping the model family the same.

That makes it a controlled comparison.
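The claim that scaling leaves ordinary linear regression performance essentially unchanged can be checked directly. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption, not the lesson's diabetes file) and deliberately puts the features on very different scales before fitting both versions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
# Put the features on very different numerical scales.
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 0.1, 0.01])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

plain = LinearRegression().fit(X_tr, y_tr)
scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
]).fit(X_tr, y_tr)

# Predictions agree; coefficients differ because they are in different units.
predictions_match = np.allclose(plain.predict(X_te), scaled.predict(X_te))
coefficients_match = np.allclose(plain.coef_, scaled.named_steps["model"].coef_)

print("same predictions: ", predictions_match)
print("same coefficients:", coefficients_match)
```

The predictions agree while the coefficients do not, which is why scaling changes how coefficients are read without changing how the unregularized model predicts.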

Candidate 2: Ridge Regression

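Ridge regression is a regularized version of linear regression.

Regularization adds a penalty on the squared size of the coefficients, which discourages very large coefficient values. Because that penalty depends on feature scale, the pipeline keeps the scaling step, so the single change here is relative to the scaled linear model from Candidate 1.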

This can help when:

predictors are correlated
coefficients are unstable
the model is sensitive to small data changes
interpretability and stability matter

#| label: 11-ridge-model
from sklearn.linear_model import Ridge

ridge_model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0))
])

ridge_model.fit(X_train, y_train)

ridge_predictions = ridge_model.predict(X_test)

ridge_mae = mean_absolute_error(y_test, ridge_predictions)
ridge_r2 = r2_score(y_test, ridge_predictions)

ridge_mae, ridge_r2

Ridge regression may not always reduce prediction error.

Its value is often in producing a more stable model.

That distinction is important.

A model can be useful even when it does not produce the lowest error on one split.
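The stability point can be made concrete by refitting both models on resampled data and watching how much the coefficients move. This is a minimal sketch on synthetic data with two nearly identical predictors (an assumption chosen to make the effect visible, not a property of the diabetes data). The scaler is omitted because both synthetic features already have roughly unit variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data with two highly correlated predictors (illustrative assumption).
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)

ols_coefs, ridge_coefs = [], []
for _ in range(200):
    # Refit both models on a bootstrap resample of the rows.
    idx = rng.integers(0, n, size=n)
    ols_coefs.append(LinearRegression().fit(X_demo[idx], y_demo[idx]).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X_demo[idx], y_demo[idx]).coef_)

# Standard deviation of each coefficient across the resamples.
ols_spread = np.std(ols_coefs, axis=0)
ridge_spread = np.std(ridge_coefs, axis=0)

print("OLS coefficient std across resamples:  ", ols_spread.round(2))
print("Ridge coefficient std across resamples:", ridge_spread.round(2))
```

A smaller spread means the coefficients change less when the data changes slightly, which is the kind of stability that matters for interpretation and reporting.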

Candidate 3: Decision Tree

A decision tree changes the model family.

Unlike linear models, decision trees can capture nonlinear relationships and feature interactions.

However, trees can also overfit easily, especially when depth and leaf size are left unconstrained, as they are with the default settings used here.

#| label: 11-decision-tree-model
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(
    random_state=42
)

tree_model.fit(X_train, y_train)

tree_predictions = tree_model.predict(X_test)

tree_mae = mean_absolute_error(y_test, tree_predictions)
tree_r2 = r2_score(y_test, tree_predictions)

tree_mae, tree_r2

This comparison asks whether a more flexible model family improves prediction on this dataset.

Compare Model Results

Now we combine the results into one comparison table.

#| label: 11-model-comparison-table
comparison_df = pd.DataFrame({
    "model": [
        "linear_regression_baseline",
        "scaled_linear_regression",
        "ridge_regression",
        "decision_tree"
    ],
    "mae": [
        baseline_mae,
        scaled_linear_mae,
        ridge_mae,
        tree_mae
    ],
    "r2": [
        baseline_r2,
        scaled_linear_r2,
        ridge_r2,
        tree_r2
    ]
})

comparison_df = comparison_df.sort_values("mae").reset_index(drop=True)

comparison_df.round(3)

Lower MAE is better.

Higher R² is better.

However, the table should not be interpreted mechanically.

The best-looking model on one split is not automatically the best model overall.
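One way to see how much a single split can mislead is to repeat the same comparison with different split seeds. The sketch below is an illustration on synthetic data from `make_regression` (an assumption, not the diabetes file); the number of seeds and the two models compared are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=25.0, random_state=0
)

rows = []
for seed in range(20):
    # Only the split changes between iterations; the models stay the same.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    linear_mae = mean_absolute_error(
        y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te)
    )
    tree_mae = mean_absolute_error(
        y_te, DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr).predict(X_te)
    )
    rows.append((linear_mae, tree_mae))

maes = np.array(rows)  # column 0: linear, column 1: tree
print("linear MAE min/max:", maes[:, 0].min().round(1), maes[:, 0].max().round(1))
print("tree MAE   min/max:", maes[:, 1].min().round(1), maes[:, 1].max().round(1))
print("splits where linear wins:", int((maes[:, 0] < maes[:, 1]).sum()), "of", len(maes))
```

If the ranges overlap or the winner changes between seeds, a ranking taken from one split should be treated as provisional.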

Save the Model Comparison Output

In an applied workflow, model comparisons should be saved as project outputs.

This allows the comparison to be inspected later, used in reports, and reproduced by others.

#| label: 11-save-model-comparison
from pathlib import Path

Path("reports").mkdir(exist_ok=True)

comparison_df.to_csv(
    "reports/diabetes-model-comparison.csv",
    index=False
)

comparison_df

This creates a reusable artifact:

reports/diabetes-model-comparison.csv

The result is no longer only visible inside the notebook or rendered chapter.

It becomes part of the project record.

Interpret the Comparison

The comparison table gives us a first view of how the candidate models behave.

But model improvement requires interpretation.

A model should not be selected only because it has the best number in one table.

We need to ask what the result means.

If the baseline remains best

If the baseline linear regression model performs as well as or better than the alternatives, that is not a failure.

It suggests that the dataset may be reasonably well captured by a linear structure.

In that case, the simplest model may be preferred because it is:

easier to explain
easier to inspect
easier to reproduce
easier to communicate
less likely to introduce unnecessary complexity

A simple model that performs well is often a better choice than a complex model that performs only slightly better.

If ridge performs similarly

If ridge regression performs similarly to linear regression, it may still be valuable.

Ridge may improve coefficient stability even when predictive metrics are similar.

This matters when the model is used for interpretation, reporting, or repeated deployment.

In applied settings, stability can be as important as raw accuracy.

If the decision tree performs poorly

If the decision tree performs worse, that may indicate overfitting.

A fully grown tree can learn patterns that are too specific to the training data.

When that happens, it may perform well on training data but poorly on unseen data.

Poor test performance is a warning that flexibility alone does not guarantee generalization.
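A quick way to check for this pattern is to compare training and test error for the same tree, with and without a depth limit. This is a sketch on synthetic data from `make_regression` (an assumption, not the diabetes file); `max_depth=3` is an arbitrary illustrative constraint, not a recommended setting.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=25.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for label, depth in [("fully grown", None), ("max_depth=3", 3)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    results[label] = {
        "train_mae": mean_absolute_error(y_tr, tree.predict(X_tr)),
        "test_mae": mean_absolute_error(y_te, tree.predict(X_te)),
    }

for label, scores in results.items():
    gap = scores["test_mae"] - scores["train_mae"]
    print(
        f"{label:12s} train MAE={scores['train_mae']:.1f}  "
        f"test MAE={scores['test_mae']:.1f}  gap={gap:.1f}"
    )
```

A near-zero training error next to a much larger test error is the signature of a tree that has memorized the training rows.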

Improvement Is More Than Better Metrics

A model is not improved just because one number moved in the desired direction.

Useful improvement may involve:

lower prediction error
more stable performance
simpler deployment
better interpretability
more reliable coefficients
fewer fragile assumptions
clearer communication to decision-makers

A model that improves MAE slightly but becomes much harder to explain may not be better for the real-world use case.

A model that is slightly less accurate but more stable and interpretable may be preferred.

The meaning of improvement depends on the analytical goal.

What Can Go Wrong During Model Improvement

Chasing a Single Metric

A model can improve one metric while becoming worse in another way.

For example, a model may reduce average error but produce worse errors for certain subgroups or ranges of the outcome.

Metrics should guide judgment, not replace it.
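Breaking the error down by outcome range is one simple way to look past the average. The sketch below uses hypothetical outcomes and predictions, generated so that the error grows with the outcome value; the column names and the choice of quartiles are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical outcomes and predictions; error grows with the outcome value.
y_true = rng.uniform(50, 300, size=400)
y_pred = y_true + rng.normal(0, 0.15 * y_true)

errors = pd.DataFrame({
    "y_true": y_true,
    "abs_error": np.abs(y_true - y_pred),
})
# Split the outcome into four equally sized ranges (quartiles).
errors["outcome_range"] = pd.qcut(
    errors["y_true"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

overall_mae = errors["abs_error"].mean()
mae_by_range = errors.groupby("outcome_range", observed=True)["abs_error"].mean()

print(f"overall MAE: {overall_mae:.1f}")
print(mae_by_range.round(1))
```

A single overall MAE hides the fact that errors in the highest range are several times larger than in the lowest, which may matter more than the average for the decision at hand.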

Overfitting the Test Set

If we repeatedly try models and choose the one that performs best on the same test set, the test set becomes part of the modeling process.

That weakens its value as an honest estimate of future performance.

This is why cross-validation and final holdout testing are important.
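One common way to protect the final estimate is to choose between candidates on a validation set and touch the holdout set only once. The following is a minimal sketch on synthetic data from `make_regression` (an assumption, not the diabetes file); the split proportions and the two candidates are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the lesson's dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=25.0, random_state=0
)

# First set aside a final holdout set that plays no role in model selection.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Then split the remaining data into training and validation parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42
)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
}

# Choose between candidates using the validation set only.
validation_mae = {
    name: mean_absolute_error(y_val, model.fit(X_tr, y_tr).predict(X_val))
    for name, model in candidates.items()
}
chosen_name = min(validation_mae, key=validation_mae.get)

# Score the chosen model once on the untouched holdout set.
holdout_mae = mean_absolute_error(
    y_holdout, candidates[chosen_name].predict(X_holdout)
)

print("validation MAE:", {k: round(v, 2) for k, v in validation_mae.items()})
print("chosen:", chosen_name, "| holdout MAE:", round(holdout_mae, 2))
```

Because the holdout rows never influenced which model was chosen, the holdout MAE is a more honest estimate of future performance than the validation score that drove the choice.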

Changing Too Many Things at Once

If we change the features, preprocessing, algorithm, and hyperparameters all at once, we cannot tell what caused the change in performance.

Controlled changes make model improvement interpretable.

Confusing Complexity With Progress

A more complex model is not automatically better.

Complexity should be justified by better performance, better stability, or better fit to the problem structure.

Without that justification, complexity becomes a liability.

CDI Insight

Model improvement is not model shopping.

It is controlled experimentation.

A defensible improvement process compares models fairly, explains why each change was made, and avoids treating one favorable result as proof.

The best model is not always the most complex model.

The best model is the one that is accurate enough, stable enough, interpretable enough, and appropriate for the decision context.

Summary

In this lesson, we moved from model evaluation to model improvement.

We:

loaded the same saved diabetes dataset
recreated the baseline model
tested controlled model alternatives
compared scaling, regularization, and a tree-based model
saved the model comparison table as a project artifact
interpreted improvement beyond MAE and R²
discussed overfitting, metric chasing, and unnecessary complexity

The key idea is simple:

A model improves only when the change produces a more useful and defensible analytical system.

What Comes Next

So far, we have compared model variants using a single train/test split.

That is useful, but still incomplete.

A model that looks better on one split may not remain better across many splits.

In the next lesson, we will introduce:

reusable modeling pipelines
cross-validation
fair model comparison across multiple folds
more stable estimates of performance

This moves us from one comparison table to a more reliable evaluation system.

→ Building reproducible pipelines and cross-validation