End-to-End Case Study: From Data to Decision
In the previous lessons, we built the applied data science workflow step by step:
- feature engineering
- model building
- model evaluation
- model improvement
- pipelines and cross-validation
- model interpretation
- responsible claims
- clear communication
- decision framing
Now we bring these pieces together in one case study.
The goal is not to introduce a new concept.
The goal is to show how the full CDI workflow operates from beginning to end on a real dataset.
How to Run This Lesson
Run the supporting script from the project root:
python scripts/python/17a_run_end_to_end_case_study.py
This creates the expected outputs in the reports/ directory:
reports/diabetes-end-to-end-cross-validation-folds.csv
reports/diabetes-end-to-end-cross-validation-summary.csv
reports/diabetes-end-to-end-test-metrics.csv
reports/diabetes-end-to-end-coefficients.csv
reports/diabetes-end-to-end-case-summary.md
reports/figures/diabetes-end-to-end-cv-mae.png
reports/figures/diabetes-end-to-end-cv-r2.png
reports/figures/diabetes-end-to-end-observed-vs-predicted.png
reports/figures/diabetes-end-to-end-coefficients.png
Then render the Quarto site:
quarto render
You can also run the code blocks inside this chapter interactively.
The script-based workflow is preferred because it leaves behind reproducible project artifacts that can be inspected, compared, committed, or reused.
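After running the script, a short check can confirm that every listed artifact was produced. The following is a minimal sketch, not part of the lesson's script: the helper name `missing_outputs` is hypothetical, and the file list simply repeats the paths given above.

```python
from pathlib import Path

# Output paths copied from the list above, relative to the project root.
EXPECTED_OUTPUTS = [
    "reports/diabetes-end-to-end-cross-validation-folds.csv",
    "reports/diabetes-end-to-end-cross-validation-summary.csv",
    "reports/diabetes-end-to-end-test-metrics.csv",
    "reports/diabetes-end-to-end-coefficients.csv",
    "reports/diabetes-end-to-end-case-summary.md",
    "reports/figures/diabetes-end-to-end-cv-mae.png",
    "reports/figures/diabetes-end-to-end-cv-r2.png",
    "reports/figures/diabetes-end-to-end-observed-vs-predicted.png",
    "reports/figures/diabetes-end-to-end-coefficients.png",
]


def missing_outputs(project_root="."):
    """Return the expected output paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in EXPECTED_OUTPUTS if not (root / p).is_file()]


# Hypothetical usage: run from the project root after the script finishes.
missing = missing_outputs(".")
if missing:
    print("Missing outputs:")
    for path in missing:
        print(f"  {path}")
else:
    print("All expected outputs are present.")
```

A check like this is useful mainly before committing or rendering, when a silently missing figure is easy to overlook.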
Case Study Objective
We will use the diabetes dataset to answer a practical question:
Can baseline clinical measurements support useful prediction of disease progression, and how should those predictions be interpreted for real-world use?
This is not a causal question.
It is a structured prediction question.
That distinction matters from the beginning.
A prediction workflow can tell us whether the available measurements are useful for estimating future outcomes.
It cannot, by itself, prove that changing one measurement will cause the outcome to change.
Load the Dataset
import pandas as pd
df = pd.read_csv("data/diabetes.csv")
df.head()
The dataset is already stored inside the project data/ directory.
This keeps the case study connected to the earlier lessons and avoids hidden data dependencies.
Define Features and Target
X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]
X.shape, y.shape
Each row represents one observation.
Each column represents one measured feature.
The target is continuous, so this is a regression problem.
Step 1: Understand the Data Structure
Before modeling, we inspect the feature table.
X.describe().round(3)
At this stage, the key question is not whether the data is exciting.
The key question is whether the data is structured well enough to support a defensible workflow.
Here, the answer is yes:
- the dataset is tabular
- the features are numeric
- the target is clearly defined
- the project has a stable input file
This is the minimum foundation needed for reproducible applied modeling.
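The structural conditions listed above can also be checked in code rather than by eye. The sketch below is one possible way to do that; the function name `check_structure` is hypothetical, and the small frame is made up for illustration and is not the lesson's data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_structure(df, target):
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    if target not in df.columns:
        problems.append(f"target column '{target}' is missing")
        return problems
    # Every column, including the target, should be numeric for this workflow.
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")
    return problems


# Tiny made-up frame for illustration only.
toy = pd.DataFrame({
    "bmi": [24.1, 30.5, 27.3],
    "s5": [4.2, 4.9, 4.6],
    "disease_progression": [120.0, 95.0, 160.0],
})
print(check_structure(toy, "disease_progression"))
```

In the lesson's setting, the same call would be `check_structure(df, "disease_progression")` on the loaded DataFrame.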
Step 2: Build a Reproducible Workflow
We use a pipeline so that scaling and modeling are applied consistently.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])
pipeline
The pipeline combines preprocessing and modeling into one object.
This reduces the chance of applying transformations inconsistently.
It also makes the workflow easier to reuse in scripts, reports, and later projects.
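To see what "applied consistently" means in practice, the sketch below fits a pipeline of the same shape on synthetic data and checks that the scaler learned its means from the training rows only. The data and the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions made for illustration; they are not the lesson's dataset or objects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# The scaler's stored means come from the training rows only,
# so the held-out rows never influence the preprocessing.
fitted_means = demo_pipeline.named_steps["scaler"].mean_
print("scaler means:      ", fitted_means.round(3))
print("training-row means:", X_tr.mean(axis=0).round(3))
print("all-row means:     ", X_demo.mean(axis=0).round(3))

# predict() reuses those stored means; nothing is refit on the test rows.
test_predictions = demo_pipeline.predict(X_te)
```

Scaling the full table before splitting would instead use the all-row means, which lets information from the test rows leak into training.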
Step 3: Evaluate With Cross-Validation
A single train/test split provides only one estimate of performance.
Cross-validation provides a more stable view by evaluating the workflow across multiple folds.
from sklearn.model_selection import cross_val_score
import numpy as np
mae_scores = -cross_val_score(
    pipeline,
    X,
    y,
    cv=5,
    scoring="neg_mean_absolute_error"
)
r2_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=5,
    scoring="r2"
)
mae_scores, r2_scores
cv_summary = pd.DataFrame({
    "metric": ["MAE", "R2"],
    "mean": [np.mean(mae_scores), np.mean(r2_scores)],
    "std": [np.std(mae_scores), np.std(r2_scores)]
}).round(3)
cv_summary
This gives us two important views:
- average model performance
- variability across folds
That is more informative than a single score.
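As an alternative to calling `cross_val_score` once per metric, scikit-learn's `cross_validate` can score several metrics in a single pass, so each fold is fit only once. The sketch below uses synthetic data (an assumption, not the lesson's dataset) and passes an explicit shuffled `KFold`; with `cv=5` and a regressor, scikit-learn splits the rows in order without shuffling, which can matter if the file is sorted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([3.0, -1.0, 0.0, 2.0]) + rng.normal(scale=2.0, size=150)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Shuffled folds with a fixed seed keep the split reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    demo_pipeline,
    X_demo,
    y_demo,
    cv=cv,
    scoring=["neg_mean_absolute_error", "r2"],
)

# Scores are returned under "test_<scorer name>"; MAE is negated by convention.
mae_per_fold = -results["test_neg_mean_absolute_error"]
r2_per_fold = results["test_r2"]
print("MAE per fold:", mae_per_fold.round(3))
print("R2 per fold: ", r2_per_fold.round(3))
```

Whether to shuffle is a judgment call: it is appropriate for independent rows, but not for data with a time order.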
Step 4: Visualize Stability Across Folds
import matplotlib.pyplot as plt
folds = range(1, len(mae_scores) + 1)
plt.figure(figsize=(7, 5))
plt.plot(folds, mae_scores, marker="o", linewidth=1)
plt.xlabel("Fold")
plt.ylabel("MAE")
plt.title("MAE Across Cross-Validation Folds")
plt.grid(alpha=0.2)
plt.show()
plt.figure(figsize=(7, 5))
plt.plot(folds, r2_scores, marker="o", linewidth=1)
plt.xlabel("Fold")
plt.ylabel("R²")
plt.title("R² Across Cross-Validation Folds")
plt.grid(alpha=0.2)
plt.show()
These plots help us see whether performance is:
- stable
- highly variable
- dependent on a small number of favorable splits
Here, the variation is present but not extreme.
That suggests the model is reasonably stable as a baseline system.
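A judgment such as "present but not extreme" is easier to defend when it is backed by a few numbers. The sketch below summarizes fold-to-fold spread using only the standard library. The fold values are hypothetical, made up for illustration, and are not results from this case study.

```python
import statistics

# Hypothetical per-fold MAE values, made up for illustration.
fold_mae = [42.0, 45.5, 47.0, 43.5, 46.0]

mean_mae = statistics.fmean(fold_mae)
pop_std = statistics.pstdev(fold_mae)    # divides by n, like np.std's default
sample_std = statistics.stdev(fold_mae)  # divides by n - 1
spread = max(fold_mae) - min(fold_mae)   # worst fold minus best fold
relative_spread = sample_std / mean_mae  # std as a fraction of the mean

print(f"mean MAE:        {mean_mae:.2f}")
print(f"population std:  {pop_std:.2f}")
print(f"sample std:      {sample_std:.2f}")
print(f"range:           {spread:.2f}")
print(f"std / mean:      {relative_spread:.1%}")
```

With only five folds, the sample standard deviation is noticeably larger than the population version, so it is worth stating which one a summary table reports.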
Step 5: Fit a Model for Interpretation
Cross-validation gives us stability evidence.
For interpretation and visualization, we also fit one model on a train/test split.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
test_mae, test_r2
This gives us one fitted model for inspection.
The single split should not replace cross-validation.
It is used here as a practical way to inspect predictions, residual behavior, and coefficients.
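It can help to see what the two test metrics compute. The sketch below works them out by hand from the residuals on four made-up values; these follow the same definitions as `mean_absolute_error` and `r2_score`, and the arrays are illustrative, not predictions from the case study.

```python
import numpy as np

# Made-up observed and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

# Residuals: observed minus predicted.
residuals = y_true - y_hat

# MAE: the average size of the residuals, in the target's own units.
mae = np.mean(np.abs(residuals))

# R2: one minus the ratio of residual variation to total variation
# around the mean of the observed values.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE = {mae:.2f}")  # MAE = 0.75
print(f"R2  = {r2:.2f}")   # R2  = 0.85
```

MAE answers "how far off is a typical prediction", while R2 answers "how much better is the model than always predicting the mean".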
Step 6: Compare Observed and Predicted Values
error = np.abs(y_test - y_pred)
plt.figure(figsize=(7, 5))
scatter = plt.scatter(
    y_test,
    y_pred,
    c=error,
    edgecolors="black",
    linewidth=0.4,
    alpha=0.9
)
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], linestyle="--", linewidth=1)
plt.xlabel("Observed progression")
plt.ylabel("Predicted progression")
plt.title("Observed vs Predicted")
plt.colorbar(scatter, label="Absolute Error")
plt.grid(alpha=0.2)
plt.show()
Points close to the diagonal line represent more accurate predictions.
Points farther away indicate larger errors.
This visual reminds us that a model can be useful without being perfect.
Step 7: Interpret Feature Influence
Because the model inside the pipeline is linear regression, we can inspect coefficients.
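Because the model was fit after StandardScaler, each coefficient applies to a standardized feature: it is the change in predicted progression for a one-standard-deviation increase in that feature, with the other features held fixed.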
coef_series = pd.Series(
    pipeline.named_steps["model"].coef_,
    index=X.columns
).sort_values()
coef_series
plt.figure(figsize=(7, 5))
coef_series.plot(kind="barh")
plt.xlabel("Coefficient")
plt.title("Feature Coefficients from the Fitted Linear Model")
plt.grid(alpha=0.2)
plt.show()
These coefficients describe how the model uses features for prediction.
They do not establish causation.
For example, if BMI has a strong positive coefficient, we can say:
- BMI is an important feature for prediction in this model.
We cannot say:
- BMI directly causes disease progression based on this model alone.
This distinction is one of the most important parts of responsible model interpretation.
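A small simulation can make the gap between prediction and causation concrete. In the sketch below, everything is synthetic and the variable names are assumptions: a hidden driver `z` influences both the observed feature `x` and the outcome `y`, while `x` itself has no effect on `y` at all. A regression of `y` on `x` still finds a clearly positive slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.normal(size=n)            # hidden driver, not given to the model
x = z + rng.normal(size=n)        # observed feature, correlated with z
y = 2.0 * z + rng.normal(size=n)  # outcome depends on z only, never on x

# Ordinary least squares of y on x, with an intercept column.
design = np.column_stack([np.ones(n), x])
intercept, slope = np.linalg.lstsq(design, y, rcond=None)[0]

print(f"slope on x: {slope:.2f}")
```

The slope comes out near 1 even though changing `x` directly would leave `y` untouched. The feature is useful for prediction because it carries information about `z`, which is exactly the kind of statement a coefficient can support.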
Step 8: Translate Results Into a Defensible Claim
At this point, a careful claim might be:
The model shows moderate predictive ability, and features such as BMI and s5 contribute meaningfully to predicted disease progression in this dataset.
This is a model-based statement.
It is appropriately limited.
It does not overreach into causal or clinical certainty.
A stronger claim such as “BMI causes disease progression” is not supported by this workflow.
Step 9: Translate Analysis Into Decision Context
Now we ask the practical question:
What could someone do with this result?
A reasonable decision-oriented interpretation is:
- use the model as a screening or prioritization aid
- identify observations that may deserve closer follow-up
- avoid using predictions as the sole basis for intervention
Why?
Because the model is:
- informative
- reasonably stable
- but still imperfect
This is where analysis becomes useful without becoming overconfident.
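As an illustration of the prioritization idea, the sketch below ranks cases by predicted value and flags the top share for closer follow-up. The function name `flag_for_follow_up`, the identifiers, the scores, and the 20% cut-off are all hypothetical; the cut-off in particular is a policy choice that the model does not supply.

```python
def flag_for_follow_up(predictions, top_fraction=0.2):
    """Return the ids with the highest predicted values, highest first.

    predictions maps an id to a predicted score.
    top_fraction is the share of cases to flag (a policy choice).
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    n_flagged = max(1, round(len(predictions) * top_fraction))
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    return ranked[:n_flagged]


# Hypothetical ids and predicted scores, made up for illustration.
demo_predictions = {
    "p01": 98.0,
    "p02": 212.5,
    "p03": 143.0,
    "p04": 187.0,
    "p05": 76.5,
}

flagged = flag_for_follow_up(demo_predictions, top_fraction=0.2)
print("Flagged for closer review:", flagged)
```

The output is a review list, not a decision: a flagged case is one that deserves a closer look by someone with the relevant context.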
Step 10: Save the Case Study Outputs
The script for this lesson saves the case study outputs into reports/.
Those outputs include:
- cross-validation fold results
- cross-validation summary
- test-set metrics
- model coefficients
- interpretation summary
- decision framing summary
- figures for model performance and interpretation
Saving these files turns the case study from an interactive analysis into a reproducible project artifact.
That is the difference between a one-time notebook and an applied analytical system.
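A sketch of what saving one of these artifacts might look like is shown below. It is not the lesson's script: the helper name `save_metrics` and the two-column layout are assumptions, and the demo writes to a temporary folder so that it does not touch the real reports/ directory.

```python
import csv
import tempfile
from pathlib import Path


def save_metrics(metrics, path):
    """Write a {name: value} mapping to a two-column CSV, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["metric", "value"])
        for name, value in metrics.items():
            writer.writerow([name, value])
    return path


# Demo with made-up numbers, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    out = save_metrics({"MAE": 1.25, "R2": 0.5}, Path(tmp) / "reports" / "demo-metrics.csv")
    print(out.read_text(encoding="utf-8"))
```

In the project itself, a call such as `save_metrics({"MAE": test_mae, "R2": test_r2}, "reports/diabetes-end-to-end-test-metrics.csv")` would write to the path named in the output list.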
Case Study Summary
Data
We used a structured health-related tabular dataset.
Model
We built a reproducible pipeline with scaling and linear regression.
Evaluation
Cross-validation showed moderate and fairly stable performance.
Interpretation
Several features, including BMI-related signals, contributed meaningfully to prediction.
Communication
Claims were framed as predictive and associative, not causal.
Decision Context
The model could support monitoring or prioritization, but not standalone decision-making.
CDI Insight
An end-to-end workflow is not complete when a model is fit.
It is complete when the path from:
- data
- to model
- to evaluation
- to interpretation
- to communication
- to decision context
remains clear, disciplined, and defensible.
Key Takeaway
The value of a model is not only in its output.
It is in how responsibly that output is connected to real-world use.
What Comes Next
In the next lesson, we will reflect on limitations and responsible use.
This is important because even a complete workflow has boundaries.
A responsible analyst must understand what the system can support and what it cannot.
→ Limitations and responsible use