#### Case Study – Regression

Regression analysis is a procedure for assessing the relationships between variables. It is used for numeric prediction and forecasting. The objective is to analyze the association between a dependent (i.e. response) variable and one or more independent (i.e. predictor) variables. In particular, regression analysis can identify which of the independent variables are most closely associated with the dependent variable, and reveal the form of those relationships.
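To make the idea concrete, here is a minimal sketch that fits a multiple linear regression to synthetic data. The data, the true coefficients (4, 3 and -2) and the variable names are assumptions chosen for illustration only; they are not part of the Boston case study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y = 4 + 3*x1 - 2*x2 + small noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 4.0 + 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Fit the model; the learned coefficients describe how strongly, and in
# which direction, each predictor is associated with the response
toy_model = LinearRegression().fit(X_toy, y_toy)
print("Intercept:", toy_model.intercept_)
print("Coefficients:", toy_model.coef_)
```

The fitted intercept and coefficients should land close to the values used to generate the data, which is what "discovering the form of the relationship" means in practice.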

In this tutorial, we will predict house prices in the “Boston Housing Dataset” using multiple linear regression. The case study uses the Python modules pandas, numpy, and scikit-learn. You can download the dataset from here.

#### Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices (the `MEDV` column) from the given features.

Here is the description of the features present in the dataset:

#### Prediction using Multiple Linear Regression in Python

In this case study, we take the Boston Housing dataset, which contains information about different houses in Boston. Since there are no missing values in this dataset, we don’t have to preprocess it. As with classification, the dataset is divided into training and testing subsets using the k-fold cross-validation (CV) technique: the data is split into k folds, and each fold serves once as the test set while the remaining k-1 folds form the training set. We choose k=10 for this case study. We then apply regression analysis to predict house prices.
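The way k-fold CV partitions a dataset can be sketched on a toy array. The array, the choice of 5 folds and the fixed `random_state` are assumptions made to keep the printout short and reproducible; the case study itself uses k=10.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with 2 features each (illustrative data only)
X_demo = np.arange(20).reshape(10, 2)

# 5 folds here for brevity; a fixed seed makes the split repeatable
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_seen = []
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_seen.extend(test_idx.tolist())
    fold_sizes.append(len(test_idx))
```

Every sample index appears in exactly one test fold, so each observation is used for testing once and for training in the other folds.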

Like classification, prediction consists of two steps. The first step, called the training phase, builds a predictive model by learning from a training dataset in which each sample is accompanied by its known target value (for regression, a numeric value rather than a class label). The second step, called the testing phase, evaluates the performance of the derived model on the test dataset using performance metrics for prediction, here the root mean square error and the R^2 score.
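For readers who want to see what the two metrics measure, the sketch below computes them by hand and compares the results with scikit-learn's helpers. The four actual and predicted values are made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 10.0])

# Root mean square error: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The library functions should agree with the manual calculations
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)
print("RMSE:", rmse_manual, rmse_sklearn)
print("R^2:", r2_manual, r2_sklearn)
```

A lower RMSE means predictions sit closer to the actual values, while an R^2 nearer to 1 means the model explains more of the variation in the target.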

The following Python code implements the solution using the concept of Multiple Linear regression.

regression_boston_housing.py

```python
# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Read the dataset
df = pd.read_csv('boston_housing.csv')
print(df.describe())

# Step 3: Separate the input features and output from dataset
X = df.drop('MEDV', axis=1)
Y = df['MEDV']
print(X.shape)
print(Y.shape)

# Split into 10 folds. With shuffle=True and random_state=None the split
# differs on every run, so the reported numbers vary between runs.
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

# Note: the steps below run once, after the loop has finished, so they
# use only the train/test split produced by the last fold.

# Step 4: Train the model
model = linear_model.LinearRegression()
model.fit(X_train, Y_train)
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print('\n', coeff_df)

# Step 5: Predict the output
y_pred = model.predict(X_test)
results_df = pd.DataFrame({'Actual': Y_test, 'Predicted': y_pred})
print('\n', results_df.head(10))

# Step 6: Evaluation
rmsd = np.sqrt(mean_squared_error(Y_test, y_pred))
r2_value = r2_score(Y_test, y_pred)
print("\nIntercept: \n", model.intercept_)
print("\nRoot Mean Square Error \n", rmsd)
print("\nR^2 Value: \n", r2_value)
```

Output:

```
             CRIM          ZN     ...           LSTAT        MEDV
count  506.000000  506.000000     ...      506.000000  506.000000
mean     3.613524   11.363636     ...       12.653063   22.532806
std      8.601545   23.322453     ...        7.141062    9.197104
min      0.006320    0.000000     ...        1.730000    5.000000
25%      0.082045    0.000000     ...        6.950000   17.025000
50%      0.256510    0.000000     ...       11.360000   21.200000
75%      3.677082   12.500000     ...       16.955000   25.000000
max     88.976200  100.000000     ...       37.970000   50.000000

[8 rows x 14 columns]
(506, 13)
(506,)

         Coefficient
CRIM       -0.123900
ZN          0.046496
INDUS       0.017622
CHAS        2.415943
NOX       -17.710202
RM          3.659401
AGE        -0.002479
DIS        -1.449379
RAD         0.315472
TAX        -0.012937
PTRATIO    -0.914142
B           0.009345
LSTAT      -0.500881

     Actual  Predicted
2      34.7  30.328810
18     20.2  16.517505
21     19.6  17.826574
27     14.8  14.942342
34     13.5  13.950088
60     18.7  18.221199
71     21.7  21.992554
76     20.0  22.914642
85     26.6  27.736285
103    19.3  20.370922

Intercept: 
 36.690559436197574

Root Mean Square Error 
 5.362069565203584

R^2 Value: 
 0.7574121046803222
```
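The output above reports a single intercept, RMSE and R^2 value, which suggests it reflects one train/test split rather than an average over all 10 folds. One possible way to summarise every fold is scikit-learn's `cross_val_score`, sketched below. Because the CSV file is not bundled here, the sketch uses synthetic data (an assumption); with the real dataset, `X_syn` and `y_syn` would be replaced by the feature matrix and the `MEDV` column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 100 samples, 3 features
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# The same 10-fold scheme, with a fixed seed so results are repeatable
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# One R^2 score per fold
r2_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=cv, scoring="r2")

# scikit-learn reports MSE as a negative score, so flip the sign before the root
neg_mse = cross_val_score(LinearRegression(), X_syn, y_syn, cv=cv,
                          scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-neg_mse)

print("Mean R^2 over folds:", r2_scores.mean())
print("Mean RMSE over folds:", rmse_scores.mean())
```

Averaging over the folds usually gives a steadier estimate of performance than any single split, since each split can be lucky or unlucky in which samples land in the test set.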

N.B. Some of the attributes may not be visible in the output of the Python program, because pandas abbreviates wide tables and marks the hidden columns with "...".
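If you want to see every column, one option is to relax the pandas display settings temporarily. The sketch below uses a made-up 30-column DataFrame (an assumption, standing in for the housing data) and the `display.max_columns` and `display.width` options.

```python
import numpy as np
import pandas as pd

# Made-up wide table: 2 rows, 30 columns named c0..c29 (illustrative only)
wide = pd.DataFrame(np.arange(60).reshape(2, 30),
                    columns=[f"c{i}" for i in range(30)])

# Lift the column limit inside this block only; outside it the defaults apply
with pd.option_context("display.max_columns", None, "display.width", 200):
    shown = repr(wide.describe())

print(shown)
```

With the column limit lifted, pandas wraps the table across several blocks of lines instead of hiding the middle columns.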

I hope you now feel confident tackling prediction problems using regression. In the next tutorial, I will discuss clustering analysis (i.e. unsupervised learning).