Improving Models Through Controlled Changes
In the previous lesson, we evaluated model performance beyond a single result.
We saw that performance can change depending on:
- how the data is split
- how the model is evaluated
- whether the result is stable across repeated evaluations
- whether the model behaves consistently enough to support a real claim
This leads to the next practical question:
How do we improve a model without fooling ourselves?
Model improvement is not simply trying many algorithms until one number looks better.
That approach can easily produce a model that appears better by chance.
A defensible improvement process is different.
It starts with a baseline.
It changes one thing at a time.
It compares models fairly.
It asks whether the change improves usefulness, stability, interpretability, or deployment readiness.
Load the Dataset
We continue using the diabetes dataset saved earlier in the project data/ directory.
#| label: 11-load-diabetes-file
import pandas as pd
df = pd.read_csv("data/diabetes.csv")
X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]
df.head()
The target remains:
#| label: 11-target-column
y.head()
The feature table remains:
#| label: 11-feature-table
X.head()
Using the same saved dataset keeps the workflow consistent across lessons.
This matters because model improvement should not depend on hidden changes in the data.
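One way to make "no hidden changes" checkable is to record a checksum of the data file and compare it whenever the file is loaded. The sketch below is a minimal illustration using only the standard library. The helper name `file_sha256` and the throwaway demo file are assumptions for illustration, not part of the project.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the sketch runs anywhere.
# In the project, the path would be "data/diabetes.csv".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_bytes(b"age,bmi\n0.03,0.06\n")
    recorded = file_sha256(demo_path)

    # The same bytes always give the same digest.
    unchanged = file_sha256(demo_path) == recorded

    # Any edit to the file changes the digest.
    demo_path.write_bytes(b"age,bmi\n0.03,0.07\n")
    changed = file_sha256(demo_path) != recorded

print(unchanged, changed)
```

Storing the recorded digest next to the saved reports would let a later reader confirm that the comparison was run on the same file.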
Why Model Improvement Must Be Controlled
A common beginner mistake is to treat model improvement as a competition between algorithms.
For example:
Try linear regression, random forest, gradient boosting, support vector machines, and neural networks. Then choose the one with the best score.
That may sound practical, but it can be misleading.
If we try many models and only report the best result, we may be selecting a model that performed well because of one favorable split rather than because it generalizes better.
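This selection effect can be illustrated with a small simulation. The sketch below assumes ten hypothetical models that all have the same true MAE, so any difference measured on a single split is pure noise. The specific numbers (a true MAE of 45 and a split-to-split standard deviation of 3) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 10        # hypothetical candidate models
n_repeats = 5000     # simulated single-split experiments
true_mae = 45.0      # every model is equally good by construction
split_noise = 3.0    # assumed split-to-split variation in measured MAE

# Each row is one experiment: the MAE measured for each model on one split.
measured = true_mae + split_noise * rng.standard_normal((n_repeats, n_models))

# Reporting only the best model per experiment picks the luckiest draw.
best_reported = measured.min(axis=1)

print(f"true MAE of every model:     {true_mae:.1f}")
print(f"average MAE of the 'winner': {best_reported.mean():.1f}")
```

The reported winner looks better than its true performance even though, by construction, no model is actually better than any other.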
A controlled improvement process is more disciplined.
It asks:
- What exactly are we changing?
- Why might this change help?
- How will we compare it against the baseline?
- Does the improvement persist under fair evaluation?
- Is the new model still interpretable enough for the use case?
In applied data science, the goal is not to find the most impressive number.
The goal is to produce a model whose behavior can be explained, reproduced, and defended.
Start With the Baseline
We begin by recreating the same baseline model used earlier.
#| label: 11-baseline-model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)
baseline_predictions = baseline_model.predict(X_test)
baseline_mae = mean_absolute_error(y_test, baseline_predictions)
baseline_r2 = r2_score(y_test, baseline_predictions)
baseline_mae, baseline_r2
This model is our reference point.
Every proposed improvement will be compared against it.
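Because every candidate is scored the same way, the fit, predict, and score steps can be wrapped in one helper so that no candidate is accidentally evaluated differently. This is a sketch rather than the lesson's own code: the name `evaluate_model` is hypothetical, and synthetic data from `make_regression` stands in for the diabetes file so the block runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit on the training split and score on the test split."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "mae": mean_absolute_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }


# Synthetic stand-in data (assumption): 10 features, noisy linear target.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=20.0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

baseline_scores = evaluate_model(
    LinearRegression(), Xd_train, Xd_test, yd_train, yd_test
)
print(baseline_scores)
```

Any scikit-learn regressor or pipeline could be passed as `model`, which keeps the split and the metrics fixed while only the model changes.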
Define Candidate Model Changes
We will compare a small set of controlled alternatives.
Each alternative changes one modeling idea:
| Model | Change introduced |
|---|---|
| Linear regression | Baseline model |
| Scaled linear regression | Adds feature scaling |
| Ridge regression | Adds regularization |
| Decision tree | Changes model family |
These are not the only possible models.
They are enough to demonstrate the improvement workflow.
The purpose of this lesson is not to declare a universal winner.
The purpose is to show how to compare model changes carefully.
Candidate 1: Scaled Linear Regression
Feature scaling transforms features so they are on a comparable numerical scale.
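For ordinary linear regression with an intercept, standardizing the features does not change the predictions beyond floating-point error; only the coefficient values are rescaled.
However, scaling matters for regularized models and for many other model families, where the fit depends on the numerical scale of the features.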
Here we include it because it introduces a reproducible pipeline pattern.
#| label: 11-scaled-linear-model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
scaled_linear_model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])
scaled_linear_model.fit(X_train, y_train)
scaled_linear_predictions = scaled_linear_model.predict(X_test)
scaled_linear_mae = mean_absolute_error(y_test, scaled_linear_predictions)
scaled_linear_r2 = r2_score(y_test, scaled_linear_predictions)
scaled_linear_mae, scaled_linear_r2
This model changes the preprocessing step while keeping the model family the same.
That makes it a controlled comparison.
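For ordinary least squares with an intercept, standardizing the features changes the coefficients but not the predictions, and this can be checked directly. The sketch below uses synthetic features on deliberately different scales (an assumption, not the diabetes data) and compares the two fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic features on very different scales.
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.05, 300.0]) + rng.normal(size=200)

plain = LinearRegression().fit(X_demo, y_demo)
scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# Predictions agree up to floating-point error...
same_predictions = np.allclose(plain.predict(X_demo), scaled.predict(X_demo))

# ...but the coefficients are expressed in different units.
same_coefficients = np.allclose(plain.coef_, scaled.named_steps["model"].coef_)

print(same_predictions, same_coefficients)
```

This is why the scaled linear model is best read as a check on the pipeline pattern rather than as a candidate expected to score differently.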
Candidate 2: Ridge Regression
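Ridge regression is a regularized version of linear regression. The pipeline keeps the scaler from Candidate 1 because the penalty depends on feature scale, so relative to the scaled linear model the only new element is the regularization term.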
Regularization discourages very large coefficient values.
This can help when:
- predictors are correlated
- coefficients are unstable
- the model is sensitive to small data changes
- interpretability and stability matter
#| label: 11-ridge-model
from sklearn.linear_model import Ridge
ridge_model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0))
])
ridge_model.fit(X_train, y_train)
ridge_predictions = ridge_model.predict(X_test)
ridge_mae = mean_absolute_error(y_test, ridge_predictions)
ridge_r2 = r2_score(y_test, ridge_predictions)
ridge_mae, ridge_r2
Ridge regression may not always reduce prediction error.
Its value is often in producing a more stable model.
That distinction is important.
A model can be useful even when it does not produce the lowest error on one split.
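To see what "discourages very large coefficient values" means in practice, the sketch below fits ridge regression at several values of `alpha` on synthetic data with two nearly duplicate predictors and records the size of the coefficient vector. The data and the `alpha` grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Two nearly identical predictors plus one independent predictor.
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
x3 = rng.normal(size=100)
X_demo = np.column_stack([x1, x2, x3])
y_demo = 3.0 * x1 + 3.0 * x2 + 1.0 * x3 + rng.normal(size=100)

alphas = [0.01, 1.0, 10.0, 100.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>6}: coefficients={np.round(model.coef_, 2)}")

# A larger alpha means a stronger penalty and a smaller coefficient vector.
print(np.round(coef_norms, 3))
```

The norm of the coefficient vector shrinks as `alpha` grows, which is the mechanism behind the stability benefit described above.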
Candidate 3: Decision Tree
A decision tree changes the model family.
Unlike linear models, decision trees can capture nonlinear relationships and feature interactions.
However, trees can also overfit easily, especially when not constrained.
#| label: 11-decision-tree-model
from sklearn.tree import DecisionTreeRegressor
tree_model = DecisionTreeRegressor(
    random_state=42
)
tree_model.fit(X_train, y_train)
tree_predictions = tree_model.predict(X_test)
tree_mae = mean_absolute_error(y_test, tree_predictions)
tree_r2 = r2_score(y_test, tree_predictions)
tree_mae, tree_r2
This comparison asks whether a more flexible model family improves prediction on this dataset.
Compare Model Results
Now we combine the results into one comparison table.
#| label: 11-model-comparison-table
comparison_df = pd.DataFrame({
    "model": [
        "linear_regression_baseline",
        "scaled_linear_regression",
        "ridge_regression",
        "decision_tree"
    ],
    "mae": [
        baseline_mae,
        scaled_linear_mae,
        ridge_mae,
        tree_mae
    ],
    "r2": [
        baseline_r2,
        scaled_linear_r2,
        ridge_r2,
        tree_r2
    ]
})
comparison_df = comparison_df.sort_values("mae").reset_index(drop=True)
comparison_df.round(3)
Lower MAE is better.
Higher R² is better.
However, the table should not be interpreted mechanically.
The best-looking model on one split is not automatically the best model overall.
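One way to see why a single split should not be read mechanically is to repeat the split with different seeds and watch the scores move. The sketch below uses synthetic data and two of the lesson's model types. The seeds, sample size, and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=30.0, random_state=0
)

linear_maes, ridge_maes = [], []
for seed in range(20):
    # Only the split changes between iterations; data and models stay fixed.
    Xa, Xb, ya, yb = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    linear_maes.append(
        mean_absolute_error(yb, LinearRegression().fit(Xa, ya).predict(Xb))
    )
    ridge_maes.append(
        mean_absolute_error(yb, Ridge(alpha=1.0).fit(Xa, ya).predict(Xb))
    )

linear_maes = np.array(linear_maes)
ridge_maes = np.array(ridge_maes)
ridge_wins = int((ridge_maes < linear_maes).sum())

print(f"linear MAE range: {linear_maes.min():.1f} to {linear_maes.max():.1f}")
print(f"ridge  MAE range: {ridge_maes.min():.1f} to {ridge_maes.max():.1f}")
print(f"splits where ridge has lower MAE: {ridge_wins} of 20")
```

If the spread across seeds is larger than the gap between two models, a ranking taken from one split says little about which model is better.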
Save the Model Comparison Output
In an applied workflow, model comparisons should be saved as project outputs.
This allows the comparison to be inspected later, used in reports, and reproduced by others.
#| label: 11-save-model-comparison
from pathlib import Path
Path("reports").mkdir(exist_ok=True)
comparison_df.to_csv(
    "reports/diabetes-model-comparison.csv",
    index=False
)
comparison_df
This creates a reusable artifact:
reports/diabetes-model-comparison.csv
The result is no longer only visible inside the notebook or rendered chapter.
It becomes part of the project record.
Interpret the Comparison
The comparison table gives us a first view of how the candidate models behave.
But model improvement requires interpretation.
A model should not be selected only because it has the best number in one table.
We need to ask what the result means.
If the baseline remains best
If the baseline linear regression model performs as well as or better than the alternatives, that is not a failure.
It suggests that the dataset may be reasonably well captured by a linear structure.
In that case, the simplest model may be preferred because it is:
- easier to explain
- easier to inspect
- easier to reproduce
- easier to communicate
- less likely to introduce unnecessary complexity
A simple model that performs well is often a better choice than a complex model that performs only slightly better.
If ridge performs similarly
If ridge regression performs similarly to linear regression, it may still be valuable.
Ridge may improve coefficient stability even when predictive metrics are similar.
This matters when the model is used for interpretation, reporting, or repeated deployment.
In applied settings, stability can be as important as raw accuracy.
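Coefficient stability can be made concrete by refitting each model on bootstrap resamples and measuring how much the coefficients move. The sketch below is a rough illustration on synthetic data with two highly correlated predictors. The helper `bootstrap_coef_std`, the data, and the choice of 200 resamples are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two highly correlated predictors (synthetic, for illustration only).
n = 80
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)


def bootstrap_coef_std(make_model, n_boot=200):
    """Refit on bootstrap resamples; return the std of each coefficient."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs.append(make_model().fit(X_demo[idx], y_demo[idx]).coef_)
    return np.std(coefs, axis=0)


ols_std = bootstrap_coef_std(LinearRegression)
ridge_std = bootstrap_coef_std(lambda: Ridge(alpha=1.0))

print("OLS coefficient std:  ", np.round(ols_std, 2))
print("Ridge coefficient std:", np.round(ridge_std, 2))
```

With correlated predictors like these, the ridge coefficients vary much less from resample to resample, even when the predictive metrics of the two models are close.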
If the decision tree performs poorly
If the decision tree performs worse, that may indicate overfitting.
A fully grown tree can learn patterns that are too specific to the training data.
When that happens, it may perform well on training data but poorly on unseen data.
Poor test performance is a warning that flexibility alone does not guarantee generalization.
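The gap between training and test error is the usual way to spot this. The sketch below fits an unconstrained tree on synthetic data (a stand-in for the diabetes file, not the lesson's result) and compares MAE on the two splits.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=30.0, random_state=0
)
Xa, Xb, ya, yb = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# No depth limit: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeRegressor(random_state=42).fit(Xa, ya)

train_mae = mean_absolute_error(ya, full_tree.predict(Xa))
test_mae = mean_absolute_error(yb, full_tree.predict(Xb))

print(f"train MAE: {train_mae:.2f}")
print(f"test MAE:  {test_mae:.2f}")
print(f"depth: {full_tree.get_depth()}, leaves: {full_tree.get_n_leaves()}")
```

A training error near zero alongside a much larger test error means the tree has memorized the training rows rather than learned a pattern that transfers.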
Improvement Is More Than Better Metrics
A model is not improved just because one number moved in the desired direction.
Useful improvement may involve:
- lower prediction error
- more stable performance
- simpler deployment
- better interpretability
- more reliable coefficients
- fewer fragile assumptions
- clearer communication to decision-makers
A model that improves MAE slightly but becomes much harder to explain may not be better for the real-world use case.
A model that is slightly less accurate but more stable and interpretable may be preferred.
The meaning of improvement depends on the analytical goal.
What Can Go Wrong During Model Improvement
Chasing a Single Metric
A model can improve one metric while becoming worse in another way.
For example, a model may reduce average error but produce worse errors for certain subgroups or ranges of the outcome.
Metrics should guide judgment, not replace it.
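A simple check is to break the average error down by range of the outcome. The sketch below uses invented predictions whose error grows with the true value, so the overall MAE hides a much worse result in the top range. The data and the choice of four bins are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: predictions whose error grows with the outcome.
y_true = rng.uniform(50, 300, size=400)
y_pred = y_true + rng.normal(scale=0.1 * y_true)

errors = pd.DataFrame({
    "y_true": y_true,
    "abs_error": np.abs(y_true - y_pred),
})

# Split the outcome into four equal-sized ranges and compute MAE in each.
errors["outcome_range"] = pd.qcut(errors["y_true"], q=4)
mae_by_range = errors.groupby("outcome_range", observed=True)["abs_error"].mean()

overall_mae = errors["abs_error"].mean()
print(f"overall MAE: {overall_mae:.1f}")
print(mae_by_range.round(1))
```

The same breakdown works for any grouping column, such as a subgroup label, in place of the outcome quantiles.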
Overfitting the Test Set
If we repeatedly try models and choose the one that performs best on the same test set, the test set becomes part of the modeling process.
That weakens its value as an honest estimate of future performance.
This is why cross-validation and final holdout testing are important.
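The combination of cross-validation for selection and a holdout used only once can be sketched as follows. This is a minimal illustration on synthetic data, with an assumed pair of candidates and 5 folds; it is not a full evaluation protocol.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=30.0, random_state=0
)

# Set the final holdout aside first; it is not used for model selection.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
}

# Compare candidates with 5-fold cross-validation on the development data.
cv_mae = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_dev, y_dev, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores.mean()  # scorer returns negated MAE

chosen = min(cv_mae, key=cv_mae.get)

# Touch the holdout once, only for the chosen model.
final_model = candidates[chosen].fit(X_dev, y_dev)
holdout_mae = mean_absolute_error(y_holdout, final_model.predict(X_holdout))

print(cv_mae, chosen, round(holdout_mae, 1))
```

Because the holdout plays no part in choosing the model, its MAE remains an honest estimate for the model that was selected.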
Changing Too Many Things at Once
If we change the features, preprocessing, algorithm, and hyperparameters all at once, we cannot tell what caused the change in performance.
Controlled changes make model improvement interpretable.
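One lightweight way to enforce this is to describe each experiment as a small configuration and check how many entries differ from the baseline. The helper `changed_keys` and the configuration fields below are hypothetical, chosen only to illustrate the idea.

```python
def changed_keys(baseline_config, candidate_config):
    """Return the sorted keys whose values differ between two configs."""
    all_keys = set(baseline_config) | set(candidate_config)
    return sorted(
        key for key in all_keys
        if baseline_config.get(key) != candidate_config.get(key)
    )


# Hypothetical experiment descriptions (field names are illustrative).
baseline = {"features": "all", "scaler": None, "model": "linear", "alpha": None}
scaled = {"features": "all", "scaler": "standard", "model": "linear", "alpha": None}
everything = {"features": "top5", "scaler": "standard", "model": "ridge", "alpha": 1.0}

print(changed_keys(baseline, scaled))      # one controlled change
print(changed_keys(baseline, everything))  # too many changes to attribute
```

A comparison where this list has more than one entry cannot attribute a change in performance to any single decision.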
Confusing Complexity With Progress
A more complex model is not automatically better.
Complexity should be justified by better performance, better stability, or better fit to the problem structure.
Without that justification, complexity becomes a liability.
CDI Insight
Model improvement is not model shopping.
It is controlled experimentation.
A defensible improvement process compares models fairly, explains why each change was made, and avoids treating one favorable result as proof.
The best model is not always the most complex model.
The best model is the one that is accurate enough, stable enough, interpretable enough, and appropriate for the decision context.
Summary
In this lesson, we moved from model evaluation to model improvement.
We:
- loaded the same saved diabetes dataset
- recreated the baseline model
- tested controlled model alternatives
- compared scaling, regularization, and a tree-based model
- saved the model comparison table as a project artifact
- interpreted improvement beyond MAE and R²
- discussed overfitting, metric chasing, and unnecessary complexity
The key idea is simple:
A model improves only when the change produces a more useful and defensible analytical system.
What Comes Next
So far, we have compared model variants using a single train/test split.
That is useful, but still incomplete.
A model that looks better on one split may not remain better across many splits.
In the next lesson, we will introduce:
- reusable modeling pipelines
- cross-validation
- fair model comparison across multiple folds
- more stable estimates of performance
This moves us from one comparison table to a more reliable evaluation system.
→ Building reproducible pipelines and cross-validation