Appendix
This appendix collects technical reference material that supports the Applied Data Science System.
It documents the project structure, environment setup, rendering workflow, script-based execution pattern, and reusable Python patterns used throughout the guide.
The purpose of the appendix is simple:
make the workflow easier to rerun, inspect, adapt, and extend.
How to Use This Appendix
Use this appendix as a reference when you need to:
- recreate the project environment
- understand the folder structure
- rerun lesson scripts
- regenerate reports and figures
- render the Quarto site
- prepare changes for GitHub Pages
Most chapters in this guide can be run in two ways:
- interactively, by reading and running the code blocks in the chapter
- reproducibly, by running the supporting Python script from the project root
The script-based workflow is preferred when preparing a public or production-facing version of the guide.
Project Structure Overview
A standard CDI applied data science project uses a Quarto-first structure with supporting scripts and saved outputs.
applied-data-science/
├── index.qmd
├── 00-preface.qmd
├── 01-setting-up-environment.qmd
├── 08-feature-engineering.qmd
├── 09-model-building.qmd
├── 10-model-evaluation.qmd
├── 11-model-improvement.qmd
├── 12-pipelines-and-cross-validation.qmd
├── 13-feature-importance-and-interpretation.qmd
├── 14-from-model-outputs-to-real-world-claims.qmd
├── 15-communicating-results-clearly.qmd
├── 16-from-analysis-to-decision-making.qmd
├── 17-end-to-end-case-study.qmd
├── 18-limitations-and-responsible-use.qmd
├── 19-from-foundations-to-real-world-practice.qmd
├── 999-appendix.qmd
├── 999-references.qmd
├── data/
│   ├── README.md
│   ├── diabetes.csv
│   ├── raw/
│   │   └── people-example.csv
│   └── processed/
│       └── model-ready-example.csv
├── models/
│   └── diabetes-linear-regression.joblib
├── reports/
│   ├── *.csv
│   ├── *.md
│   └── figures/
│       └── *.png
├── scripts/
│   ├── bash/
│   └── python/
├── docs/
├── _quarto.yml
├── requirements.txt
├── requirements-lock.txt
└── .gitignore
This layout separates source files, data, scripts, generated reports, model artifacts, and rendered documentation.
Key Directories
data/
The data/ directory stores project datasets.
In this guide, it includes:
data/diabetes.csv
data/raw/people-example.csv
data/processed/model-ready-example.csv
The diabetes dataset supports the modeling lessons.
The people example supports feature engineering and model-ready table construction.
scripts/python/
The scripts/python/ directory contains executable lesson scripts.
These scripts turn chapter logic into reusable project workflows.
Examples include:
scripts/python/09b_train_baseline_linear_model.py
scripts/python/10a_evaluate_repeated_splits.py
scripts/python/12a_cross_validate_pipeline.py
scripts/python/17a_run_end_to_end_case_study.py
Each script should be runnable from the project root.
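One way to keep a script runnable from the project root is to build every path relative to the working directory and fail early when the expected layout is missing. The sketch below is an assumption about how such a script could be organised, not the guide's actual lesson code. The helper names `require_project_root` and `output_path` are hypothetical, and using `_quarto.yml` as the root marker is a choice made for illustration.

```python
from pathlib import Path


def require_project_root(root: Path) -> Path:
    """Return root if it looks like the project root, otherwise raise.

    Assumption: the root is recognised by the presence of _quarto.yml,
    which the project structure places at the top level.
    """
    if not (root / "_quarto.yml").is_file():
        raise FileNotFoundError(
            f"{root} does not contain _quarto.yml; run the script from the project root."
        )
    return root


def output_path(root: Path, *parts: str) -> Path:
    """Build a path under root and create its parent directory if needed."""
    path = root.joinpath(*parts)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

A lesson script could call `require_project_root(Path.cwd())` once at the top and then use `output_path(root, "reports", "figures", "some-figure.png")` for each file it writes, which keeps all paths relative and avoids machine-specific locations.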
scripts/bash/
The scripts/bash/ directory contains helper scripts for project setup and build operations.
Examples include:
scripts/bash/01a-create-project-structure.sh
scripts/bash/01c-build-project.sh
scripts/bash/01d-freeze-dependencies.sh
These scripts support repeatable project management.
reports/
The reports/ directory stores reusable outputs created by the scripts.
These may include:
- model metrics
- cross-validation summaries
- coefficient tables
- communication summaries
- responsible-use checklists
- decision framing tables
Saved reports make the workflow auditable.
They also allow later chapters to refer to previous outputs without hiding the work that created them.
reports/figures/
The reports/figures/ directory stores saved plots.
Examples include:
reports/figures/diabetes-observed-vs-predicted.png
reports/figures/diabetes-residual-plot.png
reports/figures/diabetes-cross-validation-mae.png
Saving figures as files makes them easier to reuse in reports, slides, websites, and README pages.
models/
The models/ directory stores trained teaching models.
For example:
models/diabetes-linear-regression.joblib
In this guide, the model artifact is small and intentional.
In larger projects, model artifacts should be tracked carefully because they can become large or sensitive.
docs/
The docs/ directory contains the rendered Quarto website.
For GitHub Pages using a docs/ deployment workflow, this is the public site output directory.
Do not edit files inside docs/ manually unless there is a specific reason.
Instead, edit the .qmd source files and rerender the site.
Environment Setup Reference
This project uses a local Python virtual environment.
Create and activate the environment from the project root:
python3 -m venv .venv
source .venv/bin/activate
Install the required packages:
pip install -r requirements.txt
The requirements.txt file defines the main dependencies needed to run the project.
Reproducibility Lock File
The project may also include:
requirements-lock.txt
This file records an exact package snapshot.
A useful distinction is:
requirements.txt = minimal dependency list
requirements-lock.txt = frozen reproducibility snapshot
To refresh the lock file after installing packages:
pip freeze > requirements-lock.txt
The lock file helps preserve the environment used when the guide was built.
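To see whether the active environment still matches the frozen snapshot, the pinned versions can be compared with what is installed. The following is a minimal sketch using only the standard library. The function names `parse_lock_file` and `find_mismatches` are hypothetical, and the parser assumes the simple `name==version` lines that `pip freeze` normally writes; other line forms are skipped.

```python
from __future__ import annotations

from importlib import metadata


def parse_lock_file(text: str) -> dict[str, str]:
    """Parse 'name==version' lines; skip blanks, comments, and other forms."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, tuple[str, str | None]]:
    """Return {package: (pinned, installed)} for every package that differs.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Reading `requirements-lock.txt` with `Path("requirements-lock.txt").read_text()` and passing the parsed pins to `find_mismatches` gives an empty dictionary when the environment matches the snapshot.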
Running a Lesson Script
Most applied lessons include a supporting Python script.
Run scripts from the project root.
For example:
python scripts/python/12a_cross_validate_pipeline.py
A script should usually create outputs in:
reports/
reports/figures/
models/
data/processed/
depending on the lesson.
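A script can prepare these locations before writing and report which expected files are still missing afterwards. This is a sketch under stated assumptions: the directory list mirrors the four locations above, while the function names and the example file list are hypothetical.

```python
from pathlib import Path

# Output locations named in this appendix.
OUTPUT_DIRS = ["reports", "reports/figures", "models", "data/processed"]


def ensure_output_dirs(root: Path) -> list[Path]:
    """Make sure each standard output directory exists under root."""
    dirs = []
    for rel in OUTPUT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        dirs.append(path)
    return dirs


def missing_outputs(root: Path, expected: list[str]) -> list[str]:
    """Return the expected relative paths that are not files under root."""
    return [rel for rel in expected if not (root / rel).is_file()]
```

After a lesson script finishes, `missing_outputs(Path.cwd(), ["models/diabetes-linear-regression.joblib"])` should return an empty list if the model artifact was written.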
Rendering the Quarto Site
After running scripts, render the site:
quarto render
The rendered output is written to:
docs/
Open the site locally:
open docs/index.html
On non-macOS systems, open docs/index.html using your file browser or web browser.
Standard Lesson Workflow
A typical lesson workflow is:
source data
↓
Python script
↓
reports and figures
↓
Quarto chapter
↓
quarto render
↓
docs site
The preferred pattern is:
python scripts/python/<lesson-script>.py
quarto render
This ensures that outputs are regenerated before the site is rebuilt.
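The same ordering can be enforced from Python so that the render step never runs after a failed script. The sketch below assumes `quarto` is available on the PATH. The names `build_commands` and `run_lesson` are hypothetical, and the `runner` parameter exists only so the function can be exercised without launching real processes.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(script: str) -> list:
    """Return the two commands in the order the guide recommends."""
    return [
        [sys.executable, script],  # regenerate reports, figures, and models
        ["quarto", "render"],      # rebuild the site from the fresh outputs
    ]


def run_lesson(script: str, runner=subprocess.run) -> None:
    """Run the lesson script, then render the site.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so a failing script prevents the render step from running.
    """
    if not Path(script).is_file():
        raise FileNotFoundError(script)
    for command in build_commands(script):
        runner(command, check=True)
```

Calling `run_lesson("scripts/python/12a_cross_validate_pipeline.py")` from the project root would then be equivalent to typing the two commands by hand.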
GitHub Pages Workflow
For a public Quarto site using GitHub Pages, the usual workflow is:
edit .qmd files and scripts
↓
run scripts
↓
quarto render
↓
commit source and docs
↓
push to GitHub
↓
GitHub Pages publishes docs/
If a custom domain is used, the rendered site should include a CNAME file containing the domain.
For this project, the custom domain is:
applied-datascience.complexdatainsights.com
If the workflow creates the CNAME file automatically, confirm that the deployed docs/ output preserves it.
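That confirmation can be automated with a few lines of Python run after each render. This is a minimal sketch; the function name `cname_matches` is hypothetical, and it assumes the CNAME file holds nothing but the domain.

```python
from pathlib import Path


def cname_matches(docs_dir: Path, expected_domain: str) -> bool:
    """Return True if docs_dir/CNAME exists and contains only the expected domain."""
    cname = docs_dir / "CNAME"
    if not cname.is_file():
        return False
    return cname.read_text(encoding="utf-8").strip() == expected_domain
```

For this project the call would be `cname_matches(Path("docs"), "applied-datascience.complexdatainsights.com")`.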
Reusable Pandas Patterns
Read a CSV file
import pandas as pd
df = pd.read_csv("data/diabetes.csv")
Select columns
df[["col1", "col2"]]
Drop columns
X = df.drop(columns=["target"])
Select a target column
y = df["target"]
Filter rows
filtered_df = df[df["value"] > 0]
Create a new column
df["new_column"] = df["col1"] + df["col2"]
Group and aggregate
summary_df = df.groupby("group_col").agg(
    mean_value=("value_col", "mean"),
    count=("value_col", "count")
).reset_index()
Handle missing values
df["col"] = df["col"].fillna(df["col"].median())
Convert data type
df["category_col"] = df["category_col"].astype("category")
Save a result
df.to_csv("reports/output-table.csv", index=False)
Reusable Scikit-Learn Patterns
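The patterns in this section compose into one short workflow. As a rough sketch, the function below chains a train/test split, a scaled linear model, a holdout score, and cross-validation. It loads scikit-learn's bundled diabetes data as a stand-in for data/diabetes.csv so that it runs without local files; that substitution and the function name `run_workflow` are assumptions, not the guide's lesson code.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def run_workflow(random_state: int = 42) -> dict:
    """Split, fit a scaled linear model, and score it two ways."""
    # Assumption: the bundled dataset stands in for data/diabetes.csv.
    frame = load_diabetes(as_frame=True).frame.rename(
        columns={"target": "disease_progression"}
    )
    X = frame.drop(columns=["disease_progression"])
    y = frame["disease_progression"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ])
    pipeline.fit(X_train, y_train)
    holdout_mae = mean_absolute_error(y_test, pipeline.predict(X_test))

    # This scorer returns negated MAE, so flip the sign to get MAE.
    cv_mae = -cross_val_score(
        pipeline, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    return {
        "holdout_mae": holdout_mae,
        "cv_mae_mean": cv_mae.mean(),
        "n_folds": len(cv_mae),
    }
```

Passing the whole pipeline to `cross_val_score` means the scaler is refit inside each fold, so no information from the validation fold leaks into the scaling step.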
Define features and target
X = df.drop(columns=["disease_progression"])
y = df["disease_progression"]
Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
Fit a baseline model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Generate predictions
predictions = model.predict(X_test)
Evaluate predictions
from sklearn.metrics import mean_absolute_error, r2_score
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
Build a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])
Run cross-validation
from sklearn.model_selection import cross_val_score
mae_scores = -cross_val_score(
    pipeline,
    X,
    y,
    cv=5,
    scoring="neg_mean_absolute_error"
)
Save a trained model
import joblib
joblib.dump(model, "models/diabetes-linear-regression.joblib")
Load a trained model
import joblib
model = joblib.load("models/diabetes-linear-regression.joblib")
Reusable Matplotlib Patterns
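In a script, the plotting patterns in this section are often wrapped in a function that saves the figure to disk instead of showing it. The sketch below is one possible arrangement. The function name `save_observed_vs_predicted` is hypothetical, and selecting the non-interactive Agg backend is an assumption that suits scripts run without a display.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import numpy as np


def save_observed_vs_predicted(observed, predicted, path: Path) -> Path:
    """Draw an observed-vs-predicted scatter plot and save it to path."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    low = min(observed.min(), predicted.min())
    high = max(observed.max(), predicted.max())

    fig, ax = plt.subplots(figsize=(7, 5))
    ax.scatter(observed, predicted)
    ax.plot([low, high], [low, high], linestyle="--")  # line of perfect prediction
    ax.set_xlabel("Observed")
    ax.set_ylabel("Predicted")
    ax.set_title("Observed vs Predicted")
    fig.tight_layout()

    path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(path, dpi=300, bbox_inches="tight")
    plt.close(fig)  # release the figure once it is on disk
    return path
```

Working with an explicit `fig` and `ax` rather than the global `plt` state keeps each figure independent, which helps when one script produces several plots.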
Basic scatter plot
import matplotlib.pyplot as plt
plt.figure(figsize=(7, 5))
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter Plot")
plt.tight_layout()
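plt.show()
Save a figure
# Call savefig() before show(); once the figure has been shown and closed, a later savefig() can write an empty image.
plt.savefig("reports/figures/example-figure.png", dpi=300, bbox_inches="tight")
Observed vs predicted plot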
plt.figure(figsize=(7, 5))
plt.scatter(observed, predicted)
min_value = min(observed.min(), predicted.min())
max_value = max(observed.max(), predicted.max())
plt.plot([min_value, max_value], [min_value, max_value], linestyle="--")
plt.xlabel("Observed")
plt.ylabel("Predicted")
plt.title("Observed vs Predicted")
plt.tight_layout()
plt.show()
Reproducibility Checklist
Before finalizing any chapter, confirm that:
- the lesson script runs from the project root
- expected outputs are created
- paths are relative, not machine-specific
- no private files or credentials are included
- code blocks use the same workflow described in the chapter
- figures render correctly
- saved report files are updated
- the Quarto site builds successfully
- docs/index.html opens locally
- git status shows only expected changes
A useful final check is:
quarto render
git status
Public Repository Safety Checklist
Before making a project public, check for sensitive files or accidental local paths.
Search for common sensitive terms:
grep -R "password\|token\|secret\|api_key\|apikey\|Users/" . \
  --exclude-dir=.git \
  --exclude-dir=.venv \
  --exclude-dir=docs
Check for large files:
find . -type f -size +10M \
  -not -path "./.git/*" \
  -not -path "./.venv/*"
Review tracked files:
git status
git diff --cached --name-status
These checks help keep public CDI repositories clean and safe.
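Where grep is not available, or where the scan should run as part of a Python check, the same search can be approximated with the standard library. This is a sketch, not a replacement for a dedicated secret scanner. It uses the same terms and excluded directories as the grep command in this section, and the function name `scan_for_sensitive_terms` is hypothetical.

```python
import re
from pathlib import Path

# Same terms and excluded directories as the grep command in this section.
SENSITIVE = re.compile(r"password|token|secret|api_key|apikey|Users/")
SKIP_DIRS = {".git", ".venv", "docs"}


def scan_for_sensitive_terms(root: Path) -> list:
    """Return (relative path, line number, line) for each matching line."""
    hits = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or SKIP_DIRS.intersection(rel.parts):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if SENSITIVE.search(line):
                hits.append((rel.as_posix(), number, line.strip()))
    return hits
```

An empty list from `scan_for_sensitive_terms(Path("."))` means none of the listed terms were found; any hits still need a manual look, because a match on a word such as "token" is not necessarily a leaked credential.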
Running the Full System
bash scripts/bash/end-to-end-workflow.sh
Data Sources
This guide uses the diabetes dataset distributed through scikit-learn and saved locally as:
data/diabetes.csv
The dataset is used for educational modeling examples throughout the applied lessons.
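A local copy of this kind can be produced directly from scikit-learn. The sketch below shows one way to do it and is not necessarily how the guide's own file was created. The function name `export_diabetes_csv` is hypothetical, and renaming the target to `disease_progression` is an assumption made to match the column name used in the scikit-learn patterns in this appendix. Note that `load_diabetes` returns mean-centered, scaled features by default, so the guide's file may differ in scaling.

```python
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_diabetes


def export_diabetes_csv(path: Path) -> pd.DataFrame:
    """Write scikit-learn's diabetes data to a CSV file and return the table."""
    frame = load_diabetes(as_frame=True).frame
    # Assumption: rename the target to the column name used in this appendix.
    frame = frame.rename(columns={"target": "disease_progression"})
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return frame
```

Calling `export_diabetes_csv(Path("data/diabetes.csv"))` from the project root writes a table with ten feature columns and one target column.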
Software and Tools
This guide uses a small set of widely adopted tools for reproducible applied data science:
- Python (Python Software Foundation 2024)
- pandas (McKinney et al. 2010)
- NumPy (Harris et al. 2020)
- matplotlib (Hunter 2007)
- scikit-learn (Pedregosa et al. 2011)
- joblib (Team, n.d.)
- Quarto (Posit Software, PBC, n.d.)
Together, these tools support tabular data handling, modeling, evaluation, visualization, reproducible reporting, and static site publishing.
Closing Note
The goal of the Applied Data Science System is not only to teach syntax.
It is to teach a disciplined workflow:
data
↓
analysis
↓
modeling
↓
evaluation
↓
interpretation
↓
communication
↓
decision context
↓
responsible use
The same structure can be reused across:
- new datasets
- client projects
- public guides
- domain-specific CDI pathways
- future analytical systems
A strong data science workflow does more than produce outputs.
It produces evidence that can be inspected, explained, reproduced, and used responsibly.