Building Your First Model with Real-World Health Data

Published: Jun 2026

ID: DS-L09
Type: Premium
Audience: Intermediate to Advanced
Theme: Modeling begins when representation meets a real prediction task

In the previous lesson, we worked with controlled examples to understand feature engineering.

That was intentional.

It allowed us to focus on how data representation affects what a model can learn.

Now we move into a real-world dataset.

This is where premium work begins to feel different.

Real datasets are not just collections of variables.

They represent a prediction problem, a measurement process, and a set of assumptions about what can be learned.

In this lesson, we will use a health-related tabular dataset to build a first model in a way that is structured, cautious, and interpretable.

Why We Are Using a Health Dataset

Health-related data is useful for modeling lessons because it is both practical and consequential.

Variables such as:

age
body mass index
blood pressure
blood chemistry

can carry meaningful signals, but they also require careful interpretation.

This makes health data a good setting for learning how modeling decisions affect results.

Our goal is not to make a clinical claim.

Our goal is to understand how a model behaves when applied to structured health data.

The Prediction Task

We will use the diabetes dataset available in scikit-learn.

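This dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for each of 442 patients, along with a quantitative measure of disease progression one year after baseline.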

That gives us a regression task:

use baseline patient measurements to predict a continuous outcome

This is a good starting point because it allows us to focus on:

feature matrix design
training and testing logic
baseline model behavior
interpretation of prediction error

Load the Dataset and Save It for Reuse

We first load the dataset from scikit-learn, then save it into the project data/ directory.

This makes the dataset part of the project itself.

In later lessons, we will load the same file directly from disk so that the workflow remains consistent and reproducible.

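from pathlib import Path

import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset bundled with scikit-learn
diabetes = load_diabetes()

# Combine the ten features and the target into one table
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["disease_progression"] = diabetes.target

# Create data/ if it does not exist yet, then save without the index column
Path("data").mkdir(parents=True, exist_ok=True)
df.to_csv("data/diabetes.csv", index=False)

df.head()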
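Because later lessons read this file back from disk, it can help to confirm that the saved CSV reproduces the original table. The following is a minimal sketch, not part of the prepared scripts: it writes to a temporary directory that stands in for data/ (an assumption made so the example does not touch the project folder), and the names csv_path, reloaded, same_columns, and same_values are illustrative.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

# Rebuild the same table as in the lesson
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["disease_progression"] = diabetes.target

# A temporary directory stands in for the project's data/ folder (assumption)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "diabetes.csv"
    df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# The reloaded table should have the same columns and numerically equal values
same_columns = list(reloaded.columns) == list(df.columns)
same_values = np.allclose(reloaded.to_numpy(), df.to_numpy())

print(reloaded.shape, same_columns, same_values)
```

If either check came back False, the saved file would not be a faithful copy and later lessons would be working with different data.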

For a reusable version, use the prepared script below.

Script 09A — Save the Diabetes Dataset

Create a file called:

scripts/python/09a_save_diabetes_dataset.py

Run it with:

python scripts/python/09a_save_diabetes_dataset.py

Expected output:

data/diabetes.csv

Create Features and Target

X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]

X.head()

Let us also inspect the target.

y.head()

Understand the Data Structure

Before modeling, confirm what the data represents.

X.shape, y.shape

X.columns.tolist()

Each row represents one patient observation.

Each column represents one measured feature.

The target is a continuous variable, so this is not a classification problem.

This distinction matters because it determines:

what model types are appropriate
what evaluation metrics make sense
how results should be interpreted
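One quick way to check that the target really is continuous is to look at its type, range, and number of distinct values. The sketch below is illustrative only: it uses load_diabetes(return_X_y=True, as_frame=True) as a shortcut that yields the same features and target as the X and y built above.

```python
from sklearn.datasets import load_diabetes

# Shortcut that returns the feature table and target directly (assumption:
# equivalent to the X and y built from df in the lesson)
X, y = load_diabetes(return_X_y=True, as_frame=True)

print(X.shape, y.shape)       # rows and columns of features, length of target
print(y.dtype)                # numeric type of the target
print(y.nunique())            # many distinct values suggests a continuous outcome
print(y.describe().round(1))  # range and spread of the target
```

A target with only a handful of distinct values would point toward classification instead; here the outcome takes many values across a wide range.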

A Brief Look at the Features

X.describe().round(3)

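The features in this dataset are already numeric.

With scikit-learn's default settings, they have also been mean-centered and rescaled, so the values are not in their original units.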

That makes it suitable for a first regression workflow.

But numeric does not mean automatically meaningful.

We still need to think about:

scale
signal strength
correlation structure
whether the relationship to the outcome is likely to be simple or complex
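The first three points in the list above can be checked directly. This is a rough sketch under the assumption that the dataset is loaded with scikit-learn's default scaling; the variable names (column_means, column_sum_sq, target_corr, feature_corr) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Scale: with the default loader settings each column is centered on zero
# and rescaled so that its squared values sum to one
column_means = X.mean()
column_sum_sq = (X ** 2).sum()

# Signal strength: linear correlation of each feature with the target,
# ordered by absolute value
target_corr = X.corrwith(y).sort_values(key=lambda s: s.abs(), ascending=False)

# Correlation structure: pairwise correlations between features
feature_corr = X.corr()

print(np.round(column_means, 6).tolist())
print(np.round(column_sum_sq, 6).tolist())
print(target_corr.round(3))
print(feature_corr.loc["s1", "s2"].round(3))
```

Correlation only captures linear association, so it says little about the last point in the list; whether the relationship is simple or complex is better judged from model errors later on.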

Split the Data Properly

A model should not be evaluated on the same data used to fit it.

That would produce an overly optimistic result, because the model would be scored on examples it has already seen.

We create separate training and testing sets so that performance is assessed on unseen data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape

This is one of the first major shifts from exploratory analysis to modeling:

we are no longer only asking what patterns exist
we are asking whether those patterns generalize beyond the data used to fit the model
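Two properties of the split are worth verifying: the training and testing rows do not overlap, and fixing random_state gives the same split every time. A small sketch of both checks follows; the names overlap and X_train_again are illustrative, and the data is loaded with the as_frame shortcut rather than from df.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No patient should appear in both sets
overlap = X_train.index.intersection(X_test.index)

# Repeating the call with the same random_state reproduces the same split
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 442 patients with test_size=0.2 gives 353 training rows and 89 test rows
print(len(X_train), len(X_test), len(overlap))
print(X_train.index.equals(X_train_again.index))
```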

Start with a Baseline Model

Premium modeling should begin with a baseline, not with complexity.

A simple model gives us a reference point.

We will start with linear regression.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

Now generate predictions on the test set.

y_pred = model.predict(X_test)

y_pred[:10]

Evaluate the Model

For regression, two common metrics are:

Mean Absolute Error (MAE): average absolute difference between observed and predicted values
R-squared (R²): proportion of the target's variance explained by the model; on held-out data it can be negative when the model predicts worse than a constant mean

from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

pd.DataFrame({
    "metric": ["MAE", "R-squared"],
    "value": [mae, r2]
}).round(3)
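To make the two definitions concrete, both metrics can be computed by hand and compared with scikit-learn's results. This is a sketch for illustration; the names residuals, mae_manual, ss_res, ss_tot, and r2_manual are not part of the lesson scripts.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# MAE: mean of the absolute differences between observed and predicted values
residuals = y_test.to_numpy() - y_pred
mae_manual = np.mean(np.abs(residuals))

# R-squared: 1 minus (squared error of the model) / (squared error of
# always predicting the mean of the observed test values)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test.to_numpy() - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(round(mae_manual, 3), round(mean_absolute_error(y_test, y_pred), 3))
print(round(r2_manual, 3), round(r2_score(y_test, y_pred), 3))
```

Written this way, it is easier to see why R-squared drops below zero whenever the model's squared error exceeds that of the mean-only prediction.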

These values are not simply “good” or “bad” in isolation.

They must be interpreted relative to:

the complexity of the problem
the amount of noise in the data
the expectations of the use case
alternative models or baselines
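The last point in the list can be made concrete with a trivial reference model that ignores the features and always predicts the training mean. The sketch below uses scikit-learn's DummyRegressor for this; the labels "mean_only" and "linear_regression" and the comparison table are illustrative choices, not part of the lesson scripts.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = [
    ("mean_only", DummyRegressor(strategy="mean")),  # ignores all features
    ("linear_regression", LinearRegression()),
]

rows = []
for name, estimator in candidates:
    estimator.fit(X_train, y_train)
    pred = estimator.predict(X_test)
    rows.append({
        "model": name,
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

The gap between the two rows is a more informative number than either metric alone: it shows how much the features contribute beyond a guess that uses none of them.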

Script 09B — Train and Evaluate the Baseline Model

Create a file called:

scripts/python/09b_train_baseline_linear_model.py

Run it with:

python scripts/python/09b_train_baseline_linear_model.py

Expected outputs:

models/diabetes-linear-regression.joblib
reports/diabetes-model-metrics.csv
reports/diabetes-observed-vs-predicted.csv
reports/diabetes-model-coefficients.csv

This script creates the reusable modeling outputs for this lesson.

Compare Observed and Predicted Values

A good diagnostic plot does more than show predictions.

It reveals where the model is accurate and where it fails.

Points close to the diagonal line represent accurate predictions.

Points farther away indicate larger errors.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

results_df = pd.DataFrame({
    "observed": y_test,
    "predicted": y_pred
})

results_df["error"] = np.abs(results_df["observed"] - results_df["predicted"])

plt.figure(figsize=(7, 5))

scatter = plt.scatter(
    results_df["observed"],
    results_df["predicted"],
    c=results_df["error"],
    cmap="viridis",
    edgecolors="black",
    linewidth=0.4,
    alpha=0.9
)

min_val = min(results_df["observed"].min(), results_df["predicted"].min())
max_val = max(results_df["observed"].max(), results_df["predicted"].max())
plt.plot([min_val, max_val], [min_val, max_val], linestyle="--", linewidth=1)

plt.xlabel("Observed progression")
plt.ylabel("Predicted progression")
plt.title("Observed vs Predicted")
plt.colorbar(scatter, label="Absolute Error")
plt.grid(alpha=0.2)
plt.show()
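The same question the plot answers visually, where the model is accurate and where it fails, can also be summarized in a small table. The following sketch groups test observations into three equal-sized bands of observed progression and reports the absolute error in each; the band labels and the name error_by_range are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

results_df = pd.DataFrame({
    "observed": y_test.to_numpy(),
    "predicted": y_pred,
})
results_df["abs_error"] = (results_df["observed"] - results_df["predicted"]).abs()

# Split the observed values into three equal-sized bands
results_df["observed_range"] = pd.qcut(
    results_df["observed"], q=3, labels=["low", "middle", "high"]
)

# Count, average, and worst-case absolute error within each band
error_by_range = (
    results_df
    .groupby("observed_range", observed=True)["abs_error"]
    .agg(["count", "mean", "max"])
)
print(error_by_range.round(1))
```

If the mean error differs noticeably between bands, the model is not equally reliable across the range of outcomes, which a single overall MAE would hide.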

Script 09C — Plot Observed vs Predicted Results

Create a file called:

scripts/python/09c_plot_observed_vs_predicted.py

Run it with:

python scripts/python/09c_plot_observed_vs_predicted.py

Expected output:

reports/figures/diabetes-observed-vs-predicted.png

Inspect Coefficients Carefully

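# Pair each feature name with its fitted coefficient and
# order the rows by absolute size, so large negative values are not buried
coef_df = pd.DataFrame({
    "feature": X_train.columns,
    "coefficient": model.coef_
}).sort_values("coefficient", key=lambda s: s.abs(), ascending=False)

coef_df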

These coefficients are useful, but they must be interpreted carefully.

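A coefficient reflects how much the model’s prediction changes for a one-unit increase in a feature, holding the other features constant.

In this sense, coefficients that are larger in absolute value indicate that the model relies more heavily on that feature when forming predictions, provided the features are on comparable scales, as they are in this dataset.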

However, a coefficient does not automatically imply:

causation
clinical importance
independent biological truth

Instead, it describes how the fitted model uses that feature within this dataset, under this specific modeling setup, and given the presence of other variables.

This distinction is critical.

Coefficients represent model behavior, not real-world mechanisms.
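One way to see that a coefficient depends on which other variables are present is to refit the model with a correlated feature removed. The sketch below drops s2, which is strongly correlated with s1 in this dataset, and compares the two sets of coefficients; the choice of s2 and the names full_model, reduced_model, and comparison are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same model type, fit once with all features and once without s2
full_model = LinearRegression().fit(X_train, y_train)
reduced_model = LinearRegression().fit(X_train.drop(columns=["s2"]), y_train)

full_coefs = pd.Series(full_model.coef_, index=X_train.columns)
reduced_coefs = pd.Series(reduced_model.coef_, index=X_train.columns.drop("s2"))

# Side-by-side view, ordered by absolute size in the full model;
# s2 has no value in the reduced model, so it shows as NaN
comparison = pd.DataFrame({"with_s2": full_coefs, "without_s2": reduced_coefs})
comparison = comparison.reindex(full_coefs.abs().sort_values(ascending=False).index)

print(comparison.round(1))
print(np.round(comparison.loc["s1", "with_s2"] - comparison.loc["s1", "without_s2"], 1))
```

Nothing about the patients changed between the two fits, yet the coefficient on s1 shifts. That is the practical meaning of a coefficient describing the fitted model rather than a mechanism.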

What This First Model Actually Tells Us

At this stage, we have not proven that we can predict diabetes progression well in a real clinical sense.

What we have shown is something more fundamental:

the dataset supports a structured prediction task
a baseline regression model can be fit successfully
predictions can be evaluated on held-out data
coefficients and errors can be inspected systematically

This is the beginning of modeling discipline.

What Can Go Wrong at This Stage

Even a simple first model can fail for important reasons.

Leakage

If preprocessing or feature selection uses the full dataset before splitting, test performance becomes misleading.
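A common safeguard is to wrap preprocessing and the model in a single scikit-learn Pipeline, so that any statistics are learned from the training rows only. The sketch below uses StandardScaler as a stand-in preprocessing step; this lesson's workflow does not require scaling, and the step names "scaler" and "regressor" are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Split first, so the test rows play no part in preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),       # fitted on training rows only
    ("regressor", LinearRegression()),
])
pipeline.fit(X_train, y_train)

# The scaler's learned means come from X_train, not from the full dataset
scaler = pipeline.named_steps["scaler"]
uses_train_means = np.allclose(scaler.mean_, X_train.mean().to_numpy())

# At prediction time the test rows are transformed with the training statistics
r2_pipeline = r2_score(y_test, pipeline.predict(X_test))
print(uses_train_means, round(r2_pipeline, 3))
```

The leaky alternative would be to call the scaler's fit on all of X before splitting, which lets information from the test rows shape the features the model is trained on.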

Over-interpretation

A moderate or weak metric does not mean the dataset is useless, just as a strong metric on a single split does not prove the model is reliable.

It may mean the problem is difficult, the model is simple, or the features are limited.

False Confidence

A fitted model is not the same as a validated workflow.

We still need to examine evaluation, robustness, and reproducibility more carefully.
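A quick way to see why a single fitted model is not yet a validated workflow is to repeat the split with a few different seeds and watch the score move. This is only a rough sketch with an arbitrary choice of five seeds; it is not a substitute for the more careful evaluation that comes later.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

scores = []
for seed in range(5):  # five seeds chosen arbitrarily for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    fitted = LinearRegression().fit(X_tr, y_tr)
    scores.append(r2_score(y_te, fitted.predict(X_te)))

# Same data, same model type, different splits
print([round(s, 3) for s in scores])
print(round(max(scores) - min(scores), 3))
```

The spread between the best and worst score gives a first impression of how much a single reported number depends on which rows happened to land in the test set.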

Confusing Association with Explanation

A predictive relationship is not the same as a biological mechanism.

CDI Insight

In analysis, we often ask:

what patterns are present in the data?

In modeling, we must also ask:

do these patterns remain useful when tested on unseen data?

That is the beginning of reliable prediction.

Summary

We introduced a real-world health-related tabular dataset.
The diabetes dataset defines a regression task with clinical-style variables.
We saved the dataset into data/ so later lessons can reuse the same file.
We created separate training and testing sets.
We fit a baseline linear regression model.
We evaluated performance using MAE and R-squared.
We examined observed versus predicted values and model coefficients.
We interpreted results cautiously rather than treating model output as truth.

Exercise

Run the three scripts for this lesson.

Then answer:

What is the target variable?
Is this a regression or classification task?
What does MAE tell you?
What does R-squared tell you?
Why should coefficients not be treated as causal explanations?
What could go wrong if evaluation is done on the training data?

Looking Ahead

In the next lesson, we evaluate models beyond a single train-test split.

The focus shifts from asking:

Can this model fit once?

to asking:

How stable and trustworthy is this model when evaluated more carefully?