Case Study – Regression

Regression analysis is a procedure for assessing relationships between variables. It is used for numeric prediction and forecasting. The objective is to analyze the association between a dependent (i.e., response) variable and one or more independent (i.e., predictor) variables. In particular, regression analysis can identify which of the independent variables are most closely associated with the dependent variable and reveal the form of these relationships.
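For the linear case used later in this tutorial, the relationship can be written compactly. The notation below is an assumption added for illustration and does not come from the original text: y is the response, x_1 to x_p are the predictors, the beta terms are the coefficients to be estimated, epsilon is the error term, and n is the number of training samples. The second line is the usual least-squares criterion for choosing the coefficients.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```

In the case study, p would be 13 (one coefficient per feature) plus the intercept \beta_0.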

In this tutorial, we will predict house prices for the “Boston Housing Dataset” using multiple linear regression. The case study uses the Python modules pandas, numpy, and scikit-learn. You can download the dataset from here.

 

Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. It contains 506 samples and 13 feature variables. The objective is to predict the house price (the MEDV column) from the given features.

Here is the description of the features present in the dataset:

 

Prediction using Multiple Linear Regression in Python

In this case study, we take the Boston Housing dataset, which contains information about different houses in Boston. Since the dataset has no missing values, no preprocessing is required. As in classification, the dataset is divided into a training set and a testing set using the k-fold cross-validation (CV) technique. We choose k=10 for this case study, so each split uses nine folds for training and one fold for testing. We then apply regression analysis to predict the house prices.
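To make the 10-fold split concrete, here is a minimal dependency-free sketch of how 506 row indices could be partitioned. The helper name k_fold_indices and the seed value are hypothetical, chosen for illustration; the sketch is meant to mirror the general idea of k-fold splitting, not to reproduce scikit-learn's exact shuffling.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # The first n_samples % k folds get one extra sample so every row is used.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                    # held-out fold
        train = indices[:start] + indices[start + size:]      # remaining folds
        yield train, test
        start += size

folds = list(k_fold_indices(506, 10))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(fold_number, len(train), len(test))
```

Every row lands in exactly one test fold, so each sample is used for testing once and for training in the other nine splits.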

Like classification, prediction consists of two steps. The first step, the training phase, builds a predictive model by learning from a training dataset in which each sample is paired with its known target value. The second step, the testing phase, uses the model to make predictions and evaluates its performance on the test dataset using performance metrics for prediction.
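The two phases can be sketched on a toy problem. The example below is a simplification that assumes a single predictor rather than the 13 features of the case study, and the data values and function names (fit_simple_linear, predict) are made up for illustration.

```python
def fit_simple_linear(xs, ys):
    """Training phase: least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, xs):
    """Testing phase: apply the fitted line to unseen inputs."""
    slope, intercept = model
    return [intercept + slope * x for x in xs]

# Made-up training and test data (not taken from the Boston dataset).
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]
test_x = [6.0, 7.0]
test_y = [13.1, 14.8]

model = fit_simple_linear(train_x, train_y)   # learn from the training set
predictions = predict(model, test_x)          # predict on the test set
errors = [actual - predicted for actual, predicted in zip(test_y, predictions)]
print(model, predictions, errors)
```

The model never sees test_y while fitting; those values are used only afterwards to measure how far the predictions are from the truth.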

The following Python code implements the solution using the concept of Multiple Linear regression.

regression_boston_housing.py

# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Read the dataset
df = pd.read_csv('boston_housing.csv')
print(df.describe())

# Step 3: Separate the input features and output from dataset
X = df.drop('MEDV', axis=1)
Y = df['MEDV']
print(X.shape)
print(Y.shape)
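# Split into training and testing sets with 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=None)
for train_index, test_index in kf.split(X):
    # Each pass overwrites the previous split, so the steps below use the last fold
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]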

# Step 4: Train the model
model = linear_model.LinearRegression()
model.fit(X_train, Y_train)
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print('\n',coeff_df)

# Step 5: Predict the output
y_pred = model.predict(X_test)
# Use a new name so the original dataset in df is not overwritten
results_df = pd.DataFrame({'Actual': Y_test, 'Predicted': y_pred})
print('\n', results_df.head(10))

# Step 6: Evaluation
rmse = np.sqrt(mean_squared_error(Y_test, y_pred))  # root mean square error
r2_value = r2_score(Y_test, y_pred)                 # coefficient of determination
print("\nIntercept: \n", model.intercept_)
print("\nRoot Mean Square Error \n", rmse)
print("\nR^2 Value: \n", r2_value)

 

Output:

             CRIM          ZN  ...       LSTAT        MEDV
count  506.000000  506.000000  ...  506.000000  506.000000
mean     3.613524   11.363636  ...   12.653063   22.532806
std      8.601545   23.322453  ...    7.141062    9.197104
min      0.006320    0.000000  ...    1.730000    5.000000
25%      0.082045    0.000000  ...    6.950000   17.025000
50%      0.256510    0.000000  ...   11.360000   21.200000
75%      3.677082   12.500000  ...   16.955000   25.000000
max     88.976200  100.000000  ...   37.970000   50.000000

[8 rows x 14 columns]
(506, 13)
(506,)

         Coefficient
CRIM       -0.123900
ZN          0.046496
INDUS       0.017622
CHAS        2.415943
NOX       -17.710202
RM          3.659401
AGE        -0.002479
DIS        -1.449379
RAD         0.315472
TAX        -0.012937
PTRATIO    -0.914142
B           0.009345
LSTAT      -0.500881

     Actual  Predicted
2      34.7  30.328810
18     20.2  16.517505
21     19.6  17.826574
27     14.8  14.942342
34     13.5  13.950088
60     18.7  18.221199
71     21.7  21.992554
76     20.0  22.914642
85     26.6  27.736285
103    19.3  20.370922

Intercept:
36.690559436197574

Root Mean Square Error
5.362069565203584

R^2 Value:
0.7574121046803222

N.B. Some of the attributes may not be visible in the output of the Python program, because pandas truncates wide tables and marks the omitted columns with "...".
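To show what the two reported metrics measure, here is a small sketch that computes them by hand using only the standard library. The function names rmse and r2 are illustrative. The inputs are the ten Actual/Predicted rows printed above; because that is only a sample of the test fold, the results are not expected to match the full-fold figures in the output.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square root of the average squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# The ten rows shown in the tutorial's output (actual vs. predicted MEDV).
actual = [34.7, 20.2, 19.6, 14.8, 13.5, 18.7, 21.7, 20.0, 26.6, 19.3]
predicted = [30.328810, 16.517505, 17.826574, 14.942342, 13.950088,
             18.221199, 21.992554, 22.914642, 27.736285, 20.370922]

print("RMSE on the ten rows:", rmse(actual, predicted))
print("R^2 on the ten rows:", r2(actual, predicted))
```

A lower RMSE means predictions are closer to the actual prices, in the same units as MEDV, while an R^2 closer to 1 means the model explains more of the variation in the prices.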

I hope you now feel confident tackling prediction problems using regression. In the next tutorial, I will discuss clustering analysis (i.e., unsupervised learning).