Case Study – Regression - TechGuruSpeaks

Regression analysis is a procedure for assessing relationships between variables. It is used for numeric prediction and forecasting. The objective is to analyze the association between a dependent (i.e. response) variable and one or more independent (i.e. predictor) variables. In essence, regression analysis can identify which of the independent variables are closely associated with the dependent variable and reveal the form of those relationships.

In this tutorial, we will predict house prices for the “Boston Housing Dataset” using multiple linear regression. The case study uses the Python modules pandas, numpy, and scikit-learn. You can download the dataset from here.
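For reference, a multiple linear regression model with p predictors is commonly written as shown below. The symbols are standard textbook notation introduced here for illustration: y is the response, x_1 to x_p are the predictors, the beta terms are the intercept and coefficients to be estimated, and epsilon is the error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon
```

In this case study, y corresponds to MEDV and p = 13.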

Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices (the MEDV variable) using the given features.

Here is the description of the 13 features and the target variable (MEDV) in the dataset:

CRIM:     Per capita crime rate by town
ZN:       Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS:    Proportion of non-retail business acres per town
CHAS:     Charles River dummy variable (=1 if tract bounds river; 0 else)
NOX:      Nitric oxide concentration (parts per 10 million)
RM:       Average number of rooms per dwelling
AGE:      Proportion of owner-occupied units built prior to 1940
DIS:      Weighted distances to five Boston employment centers
RAD:      Index of accessibility to radial highways
TAX:      Full-value property tax rate per $10,000
PTRATIO:  Pupil-teacher ratio by town
B:        1000(Bk - 0.63)², where Bk is the proportion of blacks by town
LSTAT:    Percentage of lower status of the population
MEDV:     Median value of owner-occupied homes in $1000s

Prediction using Multiple Linear Regression in Python

In this case study, we take the Boston Housing dataset, which contains information about different houses in Boston. Since there are no missing values in this dataset, we don’t have to preprocess it. As in classification, the dataset is divided into training and testing subsets using the k-fold cross-validation (CV) technique: the samples are split into k folds, and each fold in turn serves as the testing set while the remaining k-1 folds form the training set. We choose k=10 for this case study. We then apply regression analysis to predict the house prices.
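As a rough illustration of how 10-fold cross-validation uses every fold as a test set exactly once, here is a minimal sketch built on scikit-learn's `cross_val_score`. The synthetic data from `make_regression`, the seed value 42, and all variable names are assumptions made for illustration; they are not part of the case study's dataset or code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data: 506 samples, 13 features (assumption)
X_demo, y_demo = make_regression(n_samples=506, n_features=13,
                                 noise=10.0, random_state=42)

# 10 folds; a fixed random_state makes the shuffled split reproducible
kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)

# Each fold serves as the test set once; with no scoring argument,
# the score is the estimator's default, which is R^2 for LinearRegression
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_demo)

print("R^2 per fold:", np.round(fold_scores, 3))
print("Mean R^2 over 10 folds:", fold_scores.mean())
```

Averaging the per-fold scores gives a more stable estimate of performance than a single train/test split.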

Like classification, prediction consists of two steps. The first step, called the training phase, builds a predictive model by learning from a training dataset in which each sample is accompanied by its known target value (here, MEDV). The second step, called the testing phase, evaluates the performance of the derived model on the test dataset using performance metrics for prediction, such as the root mean square error (RMSE) and R².
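To make the two evaluation metrics concrete, the sketch below computes RMSE and R² by hand and compares them with scikit-learn's helpers. The four actual and predicted values are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The manual values should agree with scikit-learn's helpers
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print("RMSE:", rmse_manual, rmse_sklearn)
print("R^2: ", r2_manual, r2_sklearn)
```

A lower RMSE means the predictions sit closer to the actual values, while an R² closer to 1 means the model explains more of the variance in the target.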

The following Python code implements the solution using multiple linear regression.

regression_boston_housing.py

# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Read the dataset
df = pd.read_csv('boston_housing.csv')
print(df.describe())

# Step 3: Separate the input features and output from dataset
X = df.drop('MEDV', axis=1)
Y = df['MEDV']
print(X.shape)
print(Y.shape)
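# 10-fold split; random_state=None means the shuffle differs on every run
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    # Each pass overwrites the previous split, so only the last fold's
    # train/test split is used in the steps below
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]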

# Step 4: Train the model
model = linear_model.LinearRegression()
model.fit(X_train, Y_train)
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print('\n',coeff_df)

# Step 5: Predict the output
y_pred = model.predict(X_test)
# Use a separate name so the original dataset in df is not overwritten
results_df = pd.DataFrame({'Actual': Y_test, 'Predicted': y_pred})
print('\n', results_df.head(10))

# Step 6: Evaluation
rmsd = np.sqrt(mean_squared_error(Y_test, y_pred)) 
r2_value = r2_score(Y_test, y_pred) 
print("\nIntercept: \n", model.intercept_)
print("\nRoot Mean Square Error \n", rmsd)
print("\nR^2 Value: \n", r2_value)

Output:

             CRIM          ZN     ...           LSTAT        MEDV
count  506.000000  506.000000     ...      506.000000  506.000000
mean     3.613524   11.363636     ...       12.653063   22.532806
std      8.601545   23.322453     ...        7.141062    9.197104
min      0.006320    0.000000     ...        1.730000    5.000000
25%      0.082045    0.000000     ...        6.950000   17.025000
50%      0.256510    0.000000     ...       11.360000   21.200000
75%      3.677082   12.500000     ...       16.955000   25.000000
max     88.976200  100.000000     ...       37.970000   50.000000

[8 rows x 14 columns]
(506, 13)
(506,)

         Coefficient
CRIM       -0.123900
ZN          0.046496
INDUS       0.017622
CHAS        2.415943
NOX       -17.710202
RM          3.659401
AGE        -0.002479
DIS        -1.449379
RAD         0.315472
TAX        -0.012937
PTRATIO    -0.914142
B           0.009345
LSTAT      -0.500881

     Actual  Predicted
2      34.7  30.328810
18     20.2  16.517505
21     19.6  17.826574
27     14.8  14.942342
34     13.5  13.950088
60     18.7  18.221199
71     21.7  21.992554
76     20.0  22.914642
85     26.6  27.736285
103    19.3  20.370922

Intercept: 
36.690559436197574

Root Mean Square Error 
5.362069565203584

R^2 Value: 
0.7574121046803222

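N.B. pandas abbreviates wide tables, so some attributes are replaced by "..." in the output above. Because the folds are shuffled without a fixed random_state, the coefficients and scores will also vary from run to run.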
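To read the coefficient table, recall that a linear model's prediction is the intercept plus the coefficient-weighted sum of the features. In the run shown above, for example, the RM coefficient of about 3.66 suggests that predicted MEDV rises by roughly $3,660 for each additional room when the other features are held fixed. The minimal sketch below checks that relationship on a tiny made-up dataset; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset generated from y = 1 + 2*x1 + 3*x2 (no noise)
X_small = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_small = 1.0 + 2.0 * X_small[:, 0] + 3.0 * X_small[:, 1]

model_small = LinearRegression().fit(X_small, y_small)

# A prediction is the intercept plus the coefficient-weighted sum of the features
manual_pred = model_small.intercept_ + X_small @ model_small.coef_

print("Coefficients:", model_small.coef_)
print("Intercept:", model_small.intercept_)
print("Matches model.predict:",
      np.allclose(manual_pred, model_small.predict(X_small)))
```

Because the toy data contain no noise, the fitted coefficients and intercept recover the values used to generate y, up to floating-point error.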

I hope that you have now gained the confidence to deal with any kind of prediction problem using regression. In the next tutorial, I will discuss clustering analysis (i.e. unsupervised learning).