Regression ML Pipeline Tutorial in Python (Scikit-Learn)#

This notebook demonstrates how to build a machine learning pipeline for a regression task using the California Housing dataset. The details of machine learning concepts can be found here. In this notebook, you will learn to:

  • Preprocess data (missing values, scaling)

  • Train a regression model

  • Evaluate the model using standard metrics

The task is to create a machine learning model that predicts the median house value (MedHouseVal), the target variable, using all the other columns as features.

Prerequisites (install the libraries needed for machine learning, if not already installed)#

%pip install scikit-learn

Step 1: Import Libraries#

# data analysis packages
import pandas as pd
import numpy as np

# Plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning packages from scikit-learn
from sklearn.datasets import fetch_california_housing #fetch dataset
from sklearn.model_selection import train_test_split  
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score # model evaluation

Step 2: Load the data#

We use the California Housing dataset, which includes features such as median income, house age, and average rooms, along with the target variable MedHouseVal (median house value).

At this step, if the data were instead stored in a CSV, TXT, or Excel file, we would load it with the corresponding pandas reader rather than fetching it from scikit-learn.
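
For example, a minimal sketch of reading a local file with pandas (the file name housing.csv is hypothetical):

# Hypothetical: load the data from a local file instead of scikit-learn
# df = pd.read_csv("housing.csv")                # comma-separated values
# df = pd.read_csv("housing.txt", sep="\t")      # tab-separated text
# df = pd.read_excel("housing.xlsx")             # Excel workbook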

# Load sklearn built-in dataset of California housing price
data = fetch_california_housing(as_frame=True)
df = data.frame

# Preview the dataset
df.head() # print first 5 rows of the data
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Step 3: Explore the dataset; find the number of rows and columns, the column names, and the data type of each column#

# Summary of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

The data has 20,640 rows and 9 columns. We can see from the summary above that there is no missing data and that all the columns have the float64 data type.
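
As a quick check, we can also count the missing values in each column directly:

# Count missing values in each column (expect all zeros for this dataset)
df.isnull().sum()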

# Summary statistics of the data
df.describe()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744 3.070655 35.631861 -119.569704 2.068558
std 1.899822 12.585558 2.474173 0.473911 1132.462122 10.386050 2.135952 2.003532 1.153956
min 0.499900 1.000000 0.846154 0.333333 3.000000 0.692308 32.540000 -124.350000 0.149990
25% 2.563400 18.000000 4.440716 1.006079 787.000000 2.429741 33.930000 -121.800000 1.196000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000 2.818116 34.260000 -118.490000 1.797000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000 3.282261 37.710000 -118.010000 2.647250
max 15.000100 52.000000 141.909091 34.066667 35682.000000 1243.333333 41.950000 -114.310000 5.000010
# plot heatmap of correlation between all the features of dataset df; we are interested in the correlation between the features (X)
plt.figure(figsize=(5, 3))
sns.heatmap(df.corr(), cmap='coolwarm', square=True)
plt.title('Feature Correlation')
plt.show()
[Figure: heatmap of feature correlations]

Darker red cells indicate a stronger positive correlation between features. For example, the correlation between "AveBedrms" and "AveRooms" is greater than 0.75. At this point, we could discard one of two highly correlated features before further analysis.
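
A minimal sketch of checking that correlation and (optionally) dropping one of the pair; dropping AveBedrms here is only an illustration:

# Correlation between the two most strongly related features
print(df['AveRooms'].corr(df['AveBedrms']))

# Optionally keep only one of the highly correlated pair
# df = df.drop(columns=['AveBedrms'])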

Step 4: Split the data into Feature and Target form#

X = df.drop(columns=['MedHouseVal']) # X has ALL columns except "MedHouseVal"
y = df['MedHouseVal']

Step 5: Train/Test Split#

Randomly split the data into a training set and a test set in the ratio 80% (training) to 20% (testing); ratios such as 70/30 or 75/25 are also common.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
14196 3.2596 33.0 5.017657 1.006421 2300.0 3.691814 32.71 -117.03
8267 3.8125 49.0 4.473545 1.041005 1314.0 1.738095 33.77 -118.16
17445 4.1563 4.0 5.645833 0.985119 915.0 2.723214 34.66 -120.48
14265 1.9425 36.0 4.002817 1.033803 1418.0 3.994366 32.69 -117.11
2271 3.5542 43.0 6.268421 1.134211 874.0 2.300000 36.78 -119.80

Step 6: Data Preprocessing Pipeline (optional; if data has missing values)#

We will preprocess:

  • Numeric features: impute missing values with the median and scale them

  • Categorical features (if any): impute with the most frequent value and one-hot encode

To detect the categorical columns, the code below chains three operations:

    (a) X.select_dtypes(include=['object', 'category']): selects the columns of the DataFrame X whose data types are either 'object' (typically strings) or 'category' (the pandas categorical type).

    (b) .columns: retrieves the names of the selected columns.

    (c) .tolist(): converts the column names (an Index object) into a standard Python list.

    Result: categorical_features is a list of the names of all columns in X that contain text or categorical data. numeric_features is built the same way, selecting the 'int64' and 'float64' columns instead.

# Detect column types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Define transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine into column transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
# Create Full Pipeline. We will use "Linear Regression" as the base model. You can replace this with any regressor like Random Forest or XGBoost.

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])
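
For example, a minimal sketch of swapping in a random forest instead of linear regression (RandomForestRegressor is imported from sklearn.ensemble; the hyperparameters shown are illustrative):

from sklearn.ensemble import RandomForestRegressor

# Same preprocessing, different regressor
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# The full pipeline is fitted and used like any other estimator:
# rf_model.fit(X_train, y_train)
# rf_pred = rf_model.predict(X_test)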

Step 6 (alternative): Standardize the feature values directly. We can come to this step straight after Step 5 if there are no missing values.#

The numerical features may have very different magnitudes (for example, the maximum value of "Population" is 35682, whereas the maximum value of "HouseAge" is just 52). Hence, we scale the features to bring them to a similar scale (StandardScaler gives each feature zero mean and unit standard deviation), so that features with large numerical values do not dominate the model.

Only the feature values are scaled, NOT the target values. The scaler is fitted on the training set only and then applied to both the training and test sets.
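
StandardScaler transforms each feature value x using the mean μ and standard deviation σ computed from the training data:

\[ z = \frac{x - \mu}{\sigma} \]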

scaler = StandardScaler()
# Fit the scaler on the training features only, then apply the same scaling to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 7: Train the Model#

model_reg = LinearRegression()
model_reg.fit(X_train_scaled, y_train)
LinearRegression()

Step 8: Evaluate the Model#

We use common regression metrics:

Metric Definitions#

  • Mean Squared Error (MSE): Measures average squared difference between actual and predicted values. Lower is better.

    \[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
  • Mean Absolute Error (MAE): Average absolute difference between actual and predicted values.
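
    \[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]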

  • R-squared (R²): Proportion of variance in the dependent variable that is predictable from the independent variables.

    \[ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \]

    R² is at most 1 (higher is better); it can be negative when the model performs worse than simply predicting the mean.

y_pred = model_reg.predict(X_test_scaled)  # predict on the scaled test features (same scaler fitted on the training set)

print("R² Score: ", r2_score(y_test, y_pred))
print("Mean Absolute Error: ", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error: ", mean_squared_error(y_test, y_pred))