Feature Engineering and Data Representation

Published: Jun 2026

ID: DS-L08
Type: Premium
Audience: Intermediate to Advanced
Theme: Data representation determines what a model can learn

In the foundations track, you worked with data as it was given.

You explored it, cleaned it, and interpreted it.

In this track, that is no longer sufficient.

Before building a model, you must decide:

What information should the model receive, and in what form?

This is the role of feature engineering.

Representation Comes Before Modeling

Models do not understand raw data.

They operate on structured representations.

In most applied data science workflows, this means numerical or encoded variables arranged in a model-ready table.

As a consequence:

the model does not understand context automatically
it can only learn from what is made visible
it can only use information that has been represented correctly
it may ignore important structure if that structure is hidden

Feature engineering determines:

which patterns are detectable
which relationships are emphasized
which signals are ignored
which sources of variation become available to the model

Two datasets with the same information can lead to very different models depending on how features are constructed.

This is why feature engineering is not only a technical step.

It is a modeling design decision.
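The point about identical information producing different models can be illustrated with a minimal sketch. The data here is hypothetical and not part of the lesson's example: a target that depends on the square of a single variable, fitted once with the raw value and once with the square made explicit. The variable names and the choice of LinearRegression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: the target depends on the square of x.
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()

# Representation A: the raw value only.
r2_raw = LinearRegression().fit(x, y).score(x, y)

# Representation B: the same information, with the square made explicit.
x_squared = x ** 2
r2_squared = LinearRegression().fit(x_squared, y).score(x_squared, y)

print(f"R^2 with raw x:     {r2_raw:.3f}")
print(f"R^2 with x squared: {r2_squared:.3f}")
```

Both representations contain the same underlying values, but a linear model can only find the pattern in the second one.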

A Simple Example

Start with a small table.

import pandas as pd

people = pd.DataFrame({
    "age": [22, 35, 47, 51],
    "income": [25000, 48000, 52000, 61000],
    "city": ["A", "B", "A", "C"]
})

people

At this stage:

numerical variables have different scales
categorical data is not yet represented numerically
relationships between variables are implicit
the table is understandable to a person, but not yet ready for many models

This is not yet a modeling dataset.
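These observations can be checked directly. The sketch below rebuilds the same table and inspects column types and value ranges; it is only one possible way to look at the data before modeling.

```python
import pandas as pd

people = pd.DataFrame({
    "age": [22, 35, 47, 51],
    "income": [25000, 48000, 52000, 61000],
    "city": ["A", "B", "A", "C"],
})

# Column types: age and income are integers, city holds strings.
print(people.dtypes)

# Value ranges show the difference in scale between the numeric columns.
ranges = people[["age", "income"]].agg(["min", "max"])
print(ranges)
```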

Step 1 — Making Variables Comparable

Some models are sensitive to the scale of input variables.

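For example, income values here are in the tens of thousands, while age values are in the tens.

Without scaling, models that rely on distances or gradient-based optimization (such as k-nearest neighbors or regularized linear models) may give more influence to a variable simply because its values are larger.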

from sklearn.preprocessing import StandardScaler

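scaler = StandardScaler()

# For illustration, the scaler is fit on the whole table.
# With a train/test split, fit it on the training rows only.
people_scaled = people.copy()
people_scaled[["age", "income"]] = scaler.fit_transform(
    people[["age", "income"]]
)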

people_scaled

Scaling does not change the meaning of the variables.

It changes how the values are represented for the model.

After standardization, each numerical column has mean 0 and standard deviation 1, so the columns are directly comparable.
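To see what the scaler does to the values, the following sketch compares StandardScaler with the equivalent manual calculation. It assumes the same age and income columns as above; the variable names scaled and manual are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

people = pd.DataFrame({
    "age": [22, 35, 47, 51],
    "income": [25000, 48000, 52000, 61000],
})

scaled = StandardScaler().fit_transform(people)

# Equivalent manual calculation: subtract the column mean and divide by
# the population standard deviation (ddof=0), which StandardScaler uses.
manual = (people - people.mean()) / people.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Note that pandas uses the sample standard deviation (ddof=1) by default, so ddof=0 has to be passed explicitly to match the scaler.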

Step 2 — Making Categories Visible

Many models cannot use text categories directly.

The city column contains meaningful information, but it must be encoded before modeling.

people_encoded = pd.get_dummies(people_scaled, columns=["city"])

people_encoded

This converts the city column into indicator variables: one column per category (city_A, city_B, city_C), marking which city each row belongs to.

Now the model can use city information as part of the input representation.

The goal is not to make the data look more complicated.

The goal is to make useful information visible to the model.
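As a small sketch of what the encoding produces, the example below applies pd.get_dummies to the city column alone. The dtype=int and drop_first=True arguments are optional choices shown for illustration, not something the lesson prescribes.

```python
import pandas as pd

people = pd.DataFrame({"city": ["A", "B", "A", "C"]})

# One indicator column per category, stored as 0/1 integers.
full = pd.get_dummies(people, columns=["city"], dtype=int)
print(full)

# Optional variant: drop the first category to avoid a redundant column,
# which some linear models prefer.
reduced = pd.get_dummies(people, columns=["city"], dtype=int, drop_first=True)
print(reduced)
```

In the full version, each row has exactly one indicator set to 1. In the reduced version, a row with all zeros stands for the dropped category.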

Step 3 — Making Relationships Explicit

Some relationships are not directly visible in the original columns.

For example, income relative to age may carry different information from income alone.

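people_features = people_encoded.copy()

# Compute the ratio from the original, unscaled columns.
# Dividing standardized values would produce a ratio with no clear meaning
# and a denominator that can be close to zero.
people_features["income_per_age"] = people["income"] / people["age"]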

people_features

This creates a derived feature.

A derived feature is a new variable created from existing variables.

Derived features can help expose relationships that may otherwise be difficult for a model to learn.

However, derived features should be created carefully.

A feature should be meaningful, reproducible, and explainable.
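One way to keep a derived feature reproducible is to wrap it in a small function. This is a minimal sketch, assuming the ratio is computed from the original, unscaled age and income columns; the function name add_income_per_age is hypothetical.

```python
import pandas as pd


def add_income_per_age(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with income divided by age, using unscaled values."""
    if (df["age"] <= 0).any():
        raise ValueError("age must be positive to compute income_per_age")
    out = df.copy()
    out["income_per_age"] = out["income"] / out["age"]
    return out


people = pd.DataFrame({
    "age": [22, 35, 47, 51],
    "income": [25000, 48000, 52000, 61000],
})

people_with_ratio = add_income_per_age(people)
print(people_with_ratio)
```

Because the function returns a copy and states its input requirement, the same feature can be rebuilt on new data without surprises.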

Important Warning About Leakage

Feature engineering must avoid data leakage.

Data leakage occurs when information is used during modeling that would not be available at the time of prediction.

Examples include:

using future outcomes to create a feature
fitting a scaler or encoder on the full dataset before splitting into training and testing sets
using post-event information to predict a pre-event decision
including variables that directly reveal the answer

Leakage can make a model look strong during evaluation but fail in real use.

In applied data science, a good feature is not only predictive.

It must also be valid for the real decision context.
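The scaling case can be sketched as follows. The feature matrix is randomly generated and purely hypothetical; the point is only the order of operations: split first, fit the scaler on the training rows, then apply it to both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix: 100 rows, 2 columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit the scaler on the training rows only ...
scaler = StandardScaler().fit(X_train)

# ... then apply the same fitted transformation to both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))  # centered by construction
print(X_test_scaled.mean(axis=0).round(3))   # generally not exactly zero
```

The test rows are transformed with statistics they did not contribute to, which mirrors what happens when the model meets new data.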

Transition to Real-World Data

So far, we have worked with small, controlled examples.

This is intentional.

It allows us to isolate key ideas without distraction.

In practice, however, data rarely arrives in this form.

Real datasets are:

larger
messier
less consistent
less explicit in their structure
shaped by collection processes, measurement limits, and human decisions

As we move forward, we will begin working with real-world data.

This introduces a new responsibility:

ensuring that your data is structured in a way that supports analysis and modeling

Preparing Your Own Data

Before applying models, your dataset should follow a clear tabular structure:

each row represents one observation
each column represents one variable
each column has a consistent data type
missing values are handled or explicitly represented
target variables are separated from predictor variables
identifiers are preserved but not accidentally used as model signals

In CDI terms, this is a tidy and structured dataset.

If your data does not follow this structure:

feature engineering becomes inconsistent
models become unreliable
results become difficult to interpret
the workflow becomes difficult to reuse

A model-ready table should be understandable both to the computer and to the analyst.
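Some of these structural checks can be expressed in a few lines. The table below is hypothetical (the person_id and purchased columns are assumptions, not part of the lesson's dataset), and the checks are a starting point rather than a complete validation.

```python
import pandas as pd

# Hypothetical table with an identifier, two predictors, and a target.
df = pd.DataFrame({
    "person_id": [101, 102, 103, 104],
    "age": [22, 35, None, 51],
    "city": ["A", "B", "A", "C"],
    "purchased": [0, 1, 0, 1],
})

# One row per observation: identifiers should not repeat.
assert df["person_id"].is_unique

# Make missing values explicit rather than discovering them later.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Separate the target and keep the identifier out of the predictors.
y = df["purchased"]
X = df.drop(columns=["person_id", "purchased"])
print(X.dtypes)
```

The identifier stays in df for traceability, but it never reaches the predictor table X.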

Script 08A — Create an Example Dataset

Create a file called:

scripts/python/08a_create_example_people_dataset.py

This script creates a small example dataset and saves it in data/raw/.
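The script body is not reproduced here. A minimal sketch of what it might contain, assuming it writes the same four-row people table used earlier in this lesson, could look like this. The function name create_example_dataset is hypothetical.

```python
from pathlib import Path

import pandas as pd


def create_example_dataset(output_path: Path) -> pd.DataFrame:
    """Write the small people table to output_path and return it."""
    people = pd.DataFrame({
        "age": [22, 35, 47, 51],
        "income": [25000, 48000, 52000, 61000],
        "city": ["A", "B", "A", "C"],
    })
    # Create the parent folder (for example data/raw/) if it does not exist.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    people.to_csv(output_path, index=False)
    return people


if __name__ == "__main__":
    create_example_dataset(Path("data/raw/people-example.csv"))
    print("Saved data/raw/people-example.csv")
```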

Run it with:

python scripts/python/08a_create_example_people_dataset.py

Expected output:

data/raw/people-example.csv

Script 08B — Build a Model-Ready Feature Table

Create a file called:

scripts/python/08b_build_feature_table.py

This script reads the raw example dataset, scales numerical variables, encodes the categorical variable, creates a derived feature, and saves a model-ready table.
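The script body is not reproduced here either. One possible sketch, assuming the raw file has age, income, and city columns and that the derived feature is the income-to-age ratio computed from unscaled values, is shown below. The function name build_feature_table is hypothetical.

```python
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import StandardScaler


def build_feature_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Scale numeric columns, encode city, and add a derived ratio."""
    features = raw.copy()
    # Derived feature from the unscaled values, before standardization.
    features["income_per_age"] = raw["income"] / raw["age"]
    # Standardize the numeric columns (fit on the full table: illustration only).
    numeric = ["age", "income"]
    features[numeric] = StandardScaler().fit_transform(raw[numeric])
    # One indicator column per city.
    return pd.get_dummies(features, columns=["city"], dtype=int)


if __name__ == "__main__":
    raw_path = Path("data/raw/people-example.csv")
    out_path = Path("data/processed/model-ready-example.csv")
    if raw_path.exists():
        table = build_feature_table(pd.read_csv(raw_path))
        out_path.parent.mkdir(parents=True, exist_ok=True)
        table.to_csv(out_path, index=False)
        print(f"Saved {out_path}")
    else:
        print(f"{raw_path} not found; run script 08A first")
```

Keeping the transformation in one function makes it easier to move into a pipeline later.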

Run it with:

python scripts/python/08b_build_feature_table.py

Expected output:

data/processed/model-ready-example.csv

This script is intentionally simple.

Later chapters will improve this pattern by using training and testing splits, pipelines, and evaluation workflows.

Feature Engineering Checklist

Before moving into model building, ask:

What is the prediction or modeling objective?
What is the observation unit?
Which columns are predictors?
Which column is the target, if supervised learning is used?
Which variables are numerical?
Which variables are categorical?
Which variables require scaling?
Which variables require encoding?
Are any derived features meaningful?
Could any feature introduce leakage?

This checklist helps keep feature engineering connected to the real analytical question.
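The questions about numerical and categorical variables can be given a first answer from the column types. This is a rough sketch using the lesson's example table; column types are only a starting point, since a numeric column can still represent a category.

```python
import pandas as pd

people = pd.DataFrame({
    "age": [22, 35, 47, 51],
    "income": [25000, 48000, 52000, 61000],
    "city": ["A", "B", "A", "C"],
})

# Numerical columns are candidates for scaling.
numerical = people.select_dtypes(include="number").columns.tolist()

# Non-numerical columns are candidates for encoding.
categorical = people.select_dtypes(exclude="number").columns.tolist()

print("scale: ", numerical)
print("encode:", categorical)
```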

CDI Insight

Controlled examples teach concepts.

Real datasets test your ability to apply them.

The transition between the two is where analytical skill develops.

Feature engineering is where domain understanding, data structure, and modeling objectives meet.

Summary

Models learn from representations, not raw data.
Scaling controls how numerical variables influence learning.
Encoding makes categorical information usable.
Derived features expose relationships.
Feature engineering must avoid leakage.
A model-ready dataset should be tidy, structured, and aligned with the decision context.
Feature engineering is a design process with direct impact on model behavior.

Exercise

Using a small dataset of your choice, identify:

the observation unit
the target variable, if one exists
at least two numerical predictors
at least one categorical predictor
one feature that may need scaling
one feature that may need encoding
one possible derived feature
one possible leakage risk

Then create a model-ready version of the dataset.

Save the output as:

data/processed/model-ready-example.csv

Looking Ahead

In the next lesson, we begin building models with intention and control.

The focus shifts from preparing model-ready inputs to asking:

Which model should we build, and how do we know whether it is useful?