**Case Study – Regression**

Regression analysis is a procedure for assessing the relationships between variables. It is used for numeric prediction and forecasting. The objective is to analyze the association between a *dependent* (i.e. *response*) *variable* and one or more *independent* (i.e. *predictor*) *variables*. Regression analysis can identify which of the independent variables are most closely associated with the dependent variable and reveal the form of these relationships.

In this tutorial, we will perform prediction on the “Boston Housing Dataset” using *Multiple Linear Regression*. This case study uses the Python modules *pandas*, *numpy*, and *scikit-learn*. You can download the dataset from here.
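As a point of reference, multiple linear regression is usually written as follows. The symbols are notation assumed here, not taken from the dataset: $y$ is the response (the house price), $x_1, \dots, x_p$ are the $p$ predictors, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $\varepsilon$ is the error term.

$$
y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon
$$

Ordinary least squares, the fitting method behind scikit-learn's `LinearRegression`, chooses the coefficients that minimize the sum of squared residuals over the $n$ training samples:

$$
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
$$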

**Boston Housing Dataset**

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices from the given features.

Here is the description of the features present in the dataset:

**CRIM**: Per capita crime rate by town

**ZN**: Proportion of residential land zoned for lots over 25,000 sq. ft

**INDUS**: Proportion of non-retail business acres per town

**CHAS**: Charles River dummy variable (=1 if tract bounds river; 0 otherwise)

**NOX**: Nitric oxide concentration (parts per 10 million)

**RM**: Average number of rooms per dwelling

**AGE**: Proportion of owner-occupied units built prior to 1940

**DIS**: Weighted distances to five Boston employment centers

**RAD**: Index of accessibility to radial highways

**TAX**: Full-value property tax rate per $10,000

**PTRATIO**: Pupil-teacher ratio by town

**B**: 1000(Bk - 0.63)², where Bk is the proportion of Black residents by town

**LSTAT**: Percentage of lower status of the population

**MEDV**: Median value of owner-occupied homes in $1000s

**Prediction using Multiple Linear Regression in Python**

In this case study, we use the Boston Housing dataset, which contains information about houses in the Boston area. Since there are no missing values in this dataset, no preprocessing is required. As in classification, the dataset is divided into a *training set* and a *testing set* using the *k-fold cross-validation (CV)* technique, with k=10 for this case study. We then apply regression analysis to predict house prices.
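To see what k=10 means for 506 samples, the fold sizes can be worked out directly. The sketch below is a simplified illustration in plain Python, not the scikit-learn implementation: `kfold_indices` is a hypothetical helper, and it cuts contiguous, unshuffled folds only to show the sizes.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = kfold_indices(506, 10)
fold_sizes = [len(fold) for fold in folds]    # test-set size in each round
train_sizes = [506 - s for s in fold_sizes]   # remaining rows are used for training

print("Test sizes: ", fold_sizes)
print("Train sizes:", train_sizes)
```

Each sample lands in exactly one test fold, so across the ten rounds every row is used for testing once and for training nine times.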

Like classification, prediction consists of two steps. The first step, called the *training phase*, builds a predictive model by learning from a training dataset in which each sample is accompanied by its known target value. The second step, called the *testing phase*, evaluates the performance of the learned model on the test dataset using performance metrics for prediction.
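With k-fold cross-validation, the two phases repeat once per fold, and the per-fold metrics are typically averaged. The following is a minimal sketch of that loop under stated assumptions: it uses synthetic data (the sizes, coefficients, and noise level are made up for illustration) and NumPy's least-squares solver in place of scikit-learn, so it runs without the dataset file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a real dataset (all values here are assumptions).
n_samples, n_features, k = 200, 3, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Shuffle the row indices once, then cut them into k folds.
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

rmse_per_fold, r2_per_fold = [], []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])

    # Training phase: ordinary least squares with an intercept column.
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    # Testing phase: predict on the held-out fold and score it.
    A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    residual = y[test_idx] - A_test @ beta
    ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
    rmse_per_fold.append(float(np.sqrt(np.mean(residual ** 2))))
    r2_per_fold.append(float(1.0 - np.sum(residual ** 2) / ss_tot))

mean_rmse = float(np.mean(rmse_per_fold))
mean_r2 = float(np.mean(r2_per_fold))
print("Mean RMSE over folds:", mean_rmse)
print("Mean R^2 over folds:", mean_r2)
```

Averaging over all folds gives a steadier estimate than a single train/test split, because no single lucky or unlucky split decides the reported score.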

The following Python code implements the solution using the concept of *Multiple Linear regression*.

*regression_boston_housing.py*

    # Step 1: Import libraries
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn import linear_model
    from sklearn.metrics import mean_squared_error, r2_score

    # Step 2: Read the dataset
    df = pd.read_csv('boston_housing.csv')
    print(df.describe())

    # Step 3: Separate the input features and output from dataset
    X = df.drop('MEDV', axis=1)
    Y = df['MEDV']
    print(X.shape)
    print(Y.shape)

    # Split the rows into 10 shuffled folds. The loop overwrites the split on
    # every iteration, so the steps below use only the last fold's split.
    kf = KFold(n_splits=10, random_state=None, shuffle=True)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
        Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

    # Step 4: Train the model
    model = linear_model.LinearRegression()
    model.fit(X_train, Y_train)
    coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
    print('\n', coeff_df)

    # Step 5: Predict the output
    y_pred = model.predict(X_test)
    # Use a separate name so the original dataset in `df` is not overwritten
    results_df = pd.DataFrame({'Actual': Y_test, 'Predicted': y_pred})
    print('\n', results_df.head(10))

    # Step 6: Evaluation
    rmsd = np.sqrt(mean_squared_error(Y_test, y_pred))
    r2_value = r2_score(Y_test, y_pred)
    print("\nIntercept: \n", model.intercept_)
    print("\nRoot Mean Square Error \n", rmsd)
    print("\nR^2 Value: \n", r2_value)

**Output:**

                 CRIM          ZN  ...       LSTAT        MEDV
    count  506.000000  506.000000  ...  506.000000  506.000000
    mean     3.613524   11.363636  ...   12.653063   22.532806
    std      8.601545   23.322453  ...    7.141062    9.197104
    min      0.006320    0.000000  ...    1.730000    5.000000
    25%      0.082045    0.000000  ...    6.950000   17.025000
    50%      0.256510    0.000000  ...   11.360000   21.200000
    75%      3.677082   12.500000  ...   16.955000   25.000000
    max     88.976200  100.000000  ...   37.970000   50.000000

    [8 rows x 14 columns]
    (506, 13)
    (506,)

             Coefficient
    CRIM       -0.123900
    ZN          0.046496
    INDUS       0.017622
    CHAS        2.415943
    NOX       -17.710202
    RM          3.659401
    AGE        -0.002479
    DIS        -1.449379
    RAD         0.315472
    TAX        -0.012937
    PTRATIO    -0.914142
    B           0.009345
    LSTAT      -0.500881

         Actual  Predicted
    2      34.7  30.328810
    18     20.2  16.517505
    21     19.6  17.826574
    27     14.8  14.942342
    34     13.5  13.950088
    60     18.7  18.221199
    71     21.7  21.992554
    76     20.0  22.914642
    85     26.6  27.736285
    103    19.3  20.370922

    Intercept:
     36.690559436197574

    Root Mean Square Error
     5.362069565203584

    R^2 Value:
     0.7574121046803222

**N.B.** Some of the attributes may not be visible in the output of the Python program, because pandas abbreviates wide tables with `...`.
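To make the two evaluation metrics concrete, they can be recomputed by hand from the ten Actual/Predicted rows printed in the output. This is only an illustrative sketch in plain Python: the values come from ten rows rather than the whole test fold, so the results differ from the RMSE and R² reported for the full fold.

```python
import math

# The ten Actual/Predicted pairs shown in the output
actual = [34.7, 20.2, 19.6, 14.8, 13.5, 18.7, 21.7, 20.0, 26.6, 19.3]
predicted = [30.328810, 16.517505, 17.826574, 14.942342, 13.950088,
             18.221199, 21.992554, 22.914642, 27.736285, 20.370922]

n = len(actual)

# Residual sum of squares: squared gaps between actual and predicted values
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: squared gaps between actual values and their mean
mean_actual = sum(actual) / n
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

rmse = math.sqrt(ss_res / n)   # root mean square error, in $1000s like MEDV
r2 = 1.0 - ss_res / ss_tot     # share of variance explained by the model

print("RMSE on these 10 rows:", rmse)
print("R^2 on these 10 rows:", r2)
```

RMSE is expressed in the same unit as the target, so it reads as a typical prediction error in thousands of dollars, while R² is unitless and compares the model against always predicting the mean.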

I hope you now feel confident tackling prediction problems using regression. In the next tutorial, I will discuss clustering analysis (i.e. unsupervised learning).
