Building Your First Model with Real-World Health Data
In the previous lesson, we worked with controlled examples to understand feature engineering.
That was intentional.
It allowed us to focus on how data representation affects what a model can learn.
Now we move into a real-world dataset.
This is where premium work begins to feel different.
Real datasets are not just collections of variables.
They represent a prediction problem, a measurement process, and a set of assumptions about what can be learned.
In this lesson, we will use a health-related tabular dataset to build a first model in a way that is structured, cautious, and interpretable.
Why We Are Using a Health Dataset
Health-related data is useful for modeling lessons because it is both practical and consequential.
Variables such as:
- age
- body mass index
- blood pressure
- blood chemistry
can carry meaningful signals, but they also require careful interpretation.
This makes health data a good setting for learning how modeling decisions affect results.
Our goal is not to make a clinical claim.
Our goal is to understand how a model behaves when applied to structured health data.
The Prediction Task
We will use the diabetes dataset available in scikit-learn.
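This dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for each of 442 patients, along with a quantitative measure of disease progression one year after baseline.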
That gives us a regression task:
use baseline patient measurements to predict a continuous outcome
This is a good starting point because it allows us to focus on:
- feature matrix design
- training and testing logic
- baseline model behavior
- interpretation of prediction error
Load the Dataset and Save It for Reuse
We first load the dataset from scikit-learn, then save it into the project data/ directory.
This makes the dataset part of the project itself.
In later lessons, we will load the same file directly from disk so that the workflow remains consistent and reproducible.
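from pathlib import Path
import pandas as pd
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["disease_progression"] = diabetes.target
# Create data/ if it is missing; to_csv does not create directories.
Path("data").mkdir(parents=True, exist_ok=True)
df.to_csv("data/diabetes.csv", index=False)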
df.head()
For a reusable version, use the prepared script below.
Script 09A — Save the Diabetes Dataset
Create a file called:
scripts/python/09a_save_diabetes_dataset.py
Run it with:
python scripts/python/09a_save_diabetes_dataset.py
Expected output:
data/diabetes.csv
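The script body is not reproduced in this section. A minimal sketch of what such a script might contain, assuming it simply wraps the steps shown above; the function name `save_diabetes_dataset` and the directory-creation step are illustrative assumptions, not the course's actual script:

```python
"""Illustrative sketch of a dataset-saving script (not the official 09A script)."""
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_diabetes


def save_diabetes_dataset(output_path="data/diabetes.csv"):
    """Load the scikit-learn diabetes data and write it to a CSV file."""
    diabetes = load_diabetes()
    df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
    df["disease_progression"] = diabetes.target

    output_path = Path(output_path)
    # Assumption: the target directory may not exist yet, so create it.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)
    return output_path


if __name__ == "__main__":
    # Print the path that was written, mirroring the expected output.
    print(save_diabetes_dataset())
```

Keeping the logic in a function with a path argument makes it easy to point the same code at a different location when testing.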
Create Features and Target
X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]
X.head()
Let us also inspect the target.
y.head()
Understand the Data Structure
Before modeling, confirm what the data represents.
X.shape, y.shape
X.columns.tolist()
Each row represents one patient observation.
Each column represents one measured feature.
The target is a continuous variable, so this is not a classification problem.
This distinction matters because it determines:
- what model types are appropriate
- what evaluation metrics make sense
- how results should be interpreted
A Brief Look at the Features
X.describe().round(3)
The features in this dataset are already numeric.
That makes it suitable for a first regression workflow.
But numeric does not mean automatically meaningful.
We still need to think about:
- scale
- signal strength
- correlation structure
- whether the relationship to the outcome is likely to be simple or complex
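Two of these points, scale and correlation structure, can be checked quickly. The sketch below reloads the data so that it runs on its own; the `as_frame=True` option and the variable names are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Reload the features as a DataFrame so this sketch is self-contained.
X_check = load_diabetes(as_frame=True).data

# Scale: scikit-learn ships these features mean-centered and scaled so that
# each column's sum of squares is 1.
column_means = X_check.mean()
column_sum_squares = (X_check ** 2).sum()
print(column_means.round(6))
print(column_sum_squares.round(6))

# Correlation structure: keep the upper triangle (each pair once, no diagonal)
# and list the most strongly correlated feature pairs.
corr = X_check.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().sort_values(key=np.abs, ascending=False).head(5)
print(top_pairs.round(3))
```

Strongly correlated pairs matter later: when two features carry overlapping information, a linear model can split weight between them in ways that are hard to interpret.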
Split the Data Properly
A model should not be evaluated on the same data used to fit it.
That would produce an overly optimistic result.
We create separate training and testing sets so that performance is assessed on unseen data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
X_train.shape, X_test.shape
This is one of the first major shifts from exploratory analysis to modeling:
- we are no longer only asking what patterns exist
- we are asking whether those patterns generalize beyond the data used to fit the model
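It can be worth confirming that a split really does keep the two sets apart. The sketch below repeats the split on a fresh copy of the data and checks three properties; the variable names are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X_full, y_full = load_diabetes(return_X_y=True, as_frame=True)

# Run the same split twice to check that random_state makes it repeatable.
split_a = train_test_split(X_full, y_full, test_size=0.2, random_state=42)
split_b = train_test_split(X_full, y_full, test_size=0.2, random_state=42)
X_train_a, X_test_a = split_a[0], split_a[1]

# No row may appear in both sets.
rows_disjoint = set(X_train_a.index).isdisjoint(X_test_a.index)
# Every row lands in exactly one of the two sets.
rows_complete = len(X_train_a) + len(X_test_a) == len(X_full)
# A fixed random_state reproduces the same test rows.
split_repeatable = X_test_a.index.equals(split_b[1].index)

print(rows_disjoint, rows_complete, split_repeatable)
```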
Start with a Baseline Model
Premium modeling should begin with a baseline, not with complexity.
A simple model gives us a reference point.
We will start with linear regression.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Now generate predictions on the test set.
y_pred = model.predict(X_test)
y_pred[:10]
Evaluate the Model
For regression, two common metrics are:
- Mean Absolute Error (MAE): average absolute difference between observed and predicted values
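- R-squared (R²): proportion of the variance in the target that the model explains; on held-out data it can be negative when the model does worse than always predicting the mean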
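Both definitions are short enough to compute by hand. The sketch below uses three made-up values, chosen only so the arithmetic is easy to follow; they are not taken from the diabetes data.

```python
import numpy as np

# Made-up numbers for illustration only.
observed = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

# MAE: mean of the absolute errors -> (1 + 0 + 2) / 3
mae_by_hand = np.mean(np.abs(observed - predicted))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((observed - predicted) ** 2)        # 1 + 0 + 4 = 5
ss_tot = np.sum((observed - observed.mean()) ** 2)  # 4 + 0 + 4 = 8
r2_by_hand = 1 - ss_res / ss_tot

print(mae_by_hand)  # 1.0
print(r2_by_hand)   # 0.375
```

scikit-learn's `mean_absolute_error` and `r2_score` apply these same definitions.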
from sklearn.metrics import mean_absolute_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
pd.DataFrame({
"metric": ["MAE", "R-squared"],
"value": [mae, r2]
}).round(3)
These values are not simply “good” or “bad” in isolation.
They must be interpreted relative to:
- the complexity of the problem
- the amount of noise in the data
- the expectations of the use case
- alternative models or baselines
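One inexpensive reference point is a model that ignores the features and always predicts the training mean. The sketch below compares that reference with linear regression on the same split. Using `DummyRegressor` here is an addition for illustration, not part of the lesson's scripts, and the variable names are assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X_ref, y_ref = load_diabetes(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ref, y_ref, test_size=0.2, random_state=42
)

candidates = {
    "mean-only reference": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
}

# Fit each candidate on the training rows and score it on the held-out rows.
scores = {}
for name, estimator in candidates.items():
    predictions = estimator.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {
        "MAE": mean_absolute_error(y_te, predictions),
        "R-squared": r2_score(y_te, predictions),
    }
    print(name, scores[name])
```

If the fitted model's MAE is not clearly below the mean-only MAE, the features are adding little under this setup, whatever the raw numbers look like.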
Script 09B — Train and Evaluate the Baseline Model
Create a file called:
scripts/python/09b_train_baseline_linear_model.py
Run it with:
python scripts/python/09b_train_baseline_linear_model.py
Expected outputs:
models/diabetes-linear-regression.joblib
reports/diabetes-model-metrics.csv
reports/diabetes-observed-vs-predicted.csv
reports/diabetes-model-coefficients.csv
This script creates the reusable modeling outputs for this lesson.
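The script body is not reproduced in this section. The sketch below shows one way such a script could produce the four files listed above, assuming it reads data/diabetes.csv and repeats the steps from this lesson. The function name `train_baseline`, its parameters, and the CSV column names are illustrative assumptions.

```python
"""Illustrative sketch of a baseline-training script (not the official 09B script)."""
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def train_baseline(data_path="data/diabetes.csv", models_dir="models", reports_dir="reports"):
    """Fit the baseline model and write the model file plus three report CSVs."""
    df = pd.read_csv(data_path)
    X = df.drop(columns=["disease_progression"])
    y = df["disease_progression"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Create the output directories if they do not exist yet.
    models_dir, reports_dir = Path(models_dir), Path(reports_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    reports_dir.mkdir(parents=True, exist_ok=True)

    joblib.dump(model, models_dir / "diabetes-linear-regression.joblib")
    metrics = pd.DataFrame({
        "metric": ["MAE", "R-squared"],
        "value": [mean_absolute_error(y_test, y_pred), r2_score(y_test, y_pred)],
    })
    metrics.to_csv(reports_dir / "diabetes-model-metrics.csv", index=False)
    pd.DataFrame({"observed": y_test, "predicted": y_pred}).to_csv(
        reports_dir / "diabetes-observed-vs-predicted.csv", index=False
    )
    pd.DataFrame({"feature": X.columns, "coefficient": model.coef_}).to_csv(
        reports_dir / "diabetes-model-coefficients.csv", index=False
    )
    return metrics


if __name__ == "__main__":
    if Path("data/diabetes.csv").exists():
        print(train_baseline())
    else:
        print("data/diabetes.csv not found; save the dataset first.")
```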
Compare Observed and Predicted Values
A good diagnostic plot does more than show predictions.
It reveals where the model is accurate and where it fails.
Points close to the diagonal line represent accurate predictions.
Points farther away indicate larger errors.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
results_df = pd.DataFrame({
"observed": y_test,
"predicted": y_pred
})
results_df["error"] = np.abs(results_df["observed"] - results_df["predicted"])
plt.figure(figsize=(7, 5))
scatter = plt.scatter(
results_df["observed"],
results_df["predicted"],
c=results_df["error"],
cmap="viridis",
edgecolors="black",
linewidth=0.4,
alpha=0.9
)
min_val = min(results_df["observed"].min(), results_df["predicted"].min())
max_val = max(results_df["observed"].max(), results_df["predicted"].max())
plt.plot([min_val, max_val], [min_val, max_val], linestyle="--", linewidth=1)
plt.xlabel("Observed progression")
plt.ylabel("Predicted progression")
plt.title("Observed vs Predicted")
plt.colorbar(scatter, label="Absolute Error")
plt.grid(alpha=0.2)
plt.show()
Script 09C — Plot Observed vs Predicted Results
Create a file called:
scripts/python/09c_plot_observed_vs_predicted.py
Run it with:
python scripts/python/09c_plot_observed_vs_predicted.py
Expected output:
reports/figures/diabetes-observed-vs-predicted.png
Inspect Coefficients Carefully
coef_df = pd.DataFrame({
"feature": X_train.columns,
"coefficient": model.coef_
}).sort_values("coefficient", ascending=False)
coef_df
These coefficients are useful, but they must be interpreted carefully.
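A coefficient is the change in the model’s prediction for a one-unit change in that feature, holding the other features constant.
Because the features in scikit-learn's version of this dataset are already centered and scaled to a common norm, coefficient sizes can be compared with one another: a larger absolute value means the model relies more heavily on that feature when forming predictions, and the sign gives the direction.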
However, a coefficient does not automatically imply:
- causation
- clinical importance
- independent biological truth
Instead, it describes how the fitted model uses that feature within this dataset, under this specific modeling setup, and given the presence of other variables.
This distinction is critical.
Coefficients represent model behavior, not real-world mechanisms.
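One way to see this is that changing a feature's units changes its coefficient without changing a single prediction. The sketch below uses synthetic data invented for this illustration, not the diabetes dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data invented for this illustration (not the diabetes dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Same information, different units: multiply the first feature by 10.
X_rescaled = X_demo.copy()
X_rescaled[:, 0] *= 10.0

model_original = LinearRegression().fit(X_demo, y_demo)
model_rescaled = LinearRegression().fit(X_rescaled, y_demo)

# The first coefficient shrinks by a factor of 10; the predictions are unchanged.
print(model_original.coef_)
print(model_rescaled.coef_)
same_predictions = np.allclose(
    model_original.predict(X_demo), model_rescaled.predict(X_rescaled)
)
print(same_predictions)
```

The coefficient depends on how the feature was measured, which is why its size says something about the model's arithmetic and nothing by itself about the underlying biology.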
What This First Model Actually Tells Us
At this stage, we have not proven that we can predict diabetes progression well in a real clinical sense.
What we have shown is something more fundamental:
- the dataset supports a structured prediction task
- a baseline regression model can be fit successfully
- predictions can be evaluated on held-out data
- coefficients and errors can be inspected systematically
This is the beginning of modeling discipline.
What Can Go Wrong at This Stage
Even a simple first model can fail for important reasons.
Leakage
If preprocessing or feature selection uses the full dataset before splitting, test performance becomes misleading.
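A common safeguard is to split first and then wrap preprocessing and model in a single pipeline that is fit on the training rows only. The sketch below uses `StandardScaler` purely as an example preprocessing step; the pipeline and variable names are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_diabetes(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Split first, then fit. The scaler's statistics come from the training rows
# only, and the same fitted transformation is reused on the test rows.
leak_free = make_pipeline(StandardScaler(), LinearRegression())
leak_free.fit(X_tr, y_tr)

# R-squared on held-out data; the test rows never influenced the scaler.
print(leak_free.score(X_te, y_te))
```

The leaky alternative would be calling `StandardScaler().fit(X_all)` before the split, which lets test rows shape the transformation applied to the training data.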
Over-interpretation
A moderate or weak metric does not mean the dataset is useless.
It may mean the problem is difficult, the model is simple, or the features are limited.
False Confidence
A fitted model is not the same as a validated workflow.
We still need to examine evaluation, robustness, and reproducibility more carefully.
Confusing Association with Explanation
A predictive relationship is not the same as a biological mechanism.
CDI Insight
In analysis, we often ask:
what patterns are present in the data?
In modeling, we must also ask:
do these patterns remain useful when tested on unseen data?
That is the beginning of reliable prediction.
Summary
- We introduced a real-world health-related tabular dataset.
- The diabetes dataset defines a regression task with clinical-style variables.
- We saved the dataset into data/ so later lessons can reuse the same file.
- We created separate training and testing sets.
- We fit a baseline linear regression model.
- We evaluated performance using MAE and R-squared.
- We examined observed versus predicted values and model coefficients.
- We interpreted results cautiously rather than treating model output as truth.
Exercise
Run the three scripts for this lesson.
Then answer:
- What is the target variable?
- Is this a regression or classification task?
- What does MAE tell you?
- What does R-squared tell you?
- Why should coefficients not be treated as causal explanations?
- What could go wrong if evaluation is done on the training data?
Looking Ahead
In the next lesson, we evaluate models beyond a single train-test split.
The focus shifts from asking:
Can this model fit once?
to asking:
How stable and trustworthy is this model when evaluated more carefully?