Case Study – Preprocessing

Data preprocessing denotes the set of tasks applied to the target data set to ensure consistency in naming conventions, encoding structures, and attribute measures. Preprocessing mainly includes data integration and data cleaning. Data integration combines multiple data sources into a single data set. The data set is then cleaned, where the term ‘cleaning’ denotes processing the data to reduce noise and to treat missing values.

A data transformation procedure may be applied to the preprocessed dataset prior to classification. For example, normalization is used because neural network and regression-based techniques rely on distance measurements for analysis and are therefore sensitive to the scale of attribute values. Normalization transforms attribute values to a small range such as [-1.0, +1.0] or [0.0, 1.0]. Occasionally, researchers follow aggregation or consolidation approaches for performing data transformation.
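For instance, min-max normalization maps each attribute value v to v′ = (v − min) / (max − min), which falls in [0.0, 1.0]. Here is a minimal sketch using scikit-learn's MinMaxScaler; the column names and values are illustrative only, not taken from the loan dataset:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only (not from the loan dataset)
data = pd.DataFrame({'income': [1500, 3000, 81000],
                     'loan_amount': [100, 150, 700]})

# Min-max normalization: v' = (v - min) / (max - min), mapped to [0.0, 1.0]
scaler = MinMaxScaler(feature_range=(0.0, 1.0))
scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(scaled)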

In this tutorial, we will perform data preprocessing for the “Loan Prediction Problem Dataset”. This case study is done using Python. You can download the dataset from here.


Loan Prediction Problem Dataset

Here is the description of the variables present in the dataset:

Variable             Description                           Type
Loan_ID              Unique Loan ID                        Object
Gender               Male / Female                         Categorical
Married              Applicant married (Y/N)               Categorical
Dependents           Number of dependents                  Categorical
Education            Education (Graduate / Not Graduate)   Categorical
Self_Employed        Self-employed (Y/N)                   Categorical
ApplicantIncome      Applicant’s income                    Numerical
CoapplicantIncome    Coapplicant’s income                  Numerical
LoanAmount           Loan amount in thousands              Numerical
Loan_Amount_Term     Term of loan in months                Numerical
Credit_History       Credit history meets guidelines       Numerical
Property_Area        Urban / Semiurban / Rural             Categorical
Loan_Status          Loan approved (Y/N) – class label     Categorical


Preprocessing and analysis in Python using Pandas

The pandas module is one of the most useful data analysis libraries in Python. We will now use it to read the dataset and perform preprocessing for this problem. Additionally, we need the scikit-learn module (imported as sklearn) for label encoding. See the code below.

loan_preprocessing.py

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Reading the dataset in a dataframe using Pandas
df = pd.read_csv('loandata.csv') # The downloaded training file, renamed to 'loandata.csv'

# Gathering Info
print('\nInfo of numeric columns: ')
print(df.describe()) # by default, describe() summarizes numeric columns only
print('\nInfo of all columns: ')
print(df.columns)
print(df.dtypes)
print('\nSize Info: ')
print(df.shape)

# Checking missing values in the dataset
print('\nMissing values info in the dataset: ')

# Counting missing values column-wise: isnull() flags missing entries,
# and sum() counts them for each column
print(df.isnull().sum())

# Replacing missing values for numeric attributes using the mean
# (plain assignment is used because column-wise fillna(inplace=True)
# is deprecated in recent pandas versions)
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
# Credit_History is binary, so its maximum (1.0) is also its most frequent value
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].max())

# Replacing missing values for categorical attributes using the most frequent value
# Example: print(df['Gender'].value_counts()) gives the frequency count for 'Gender'
df['Gender'] = df['Gender'].fillna('Male')
df['Married'] = df['Married'].fillna('Yes')
df['Dependents'] = df['Dependents'].fillna('0')
df['Self_Employed'] = df['Self_Employed'].fillna('No')

# Removing unnecessary column(s)
# Example: Loan_ID is a mere record identifier with no predictive value

df.drop(['Loan_ID'], axis=1, inplace=True)
print('\n', df.head(5))

# Converting all categorical variables into numeric by encoding the categories
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
print(df.dtypes)

# Saving the dataframe "df" as "loan.csv" on our local machine
# (index=False keeps the row index from being written as an extra column)
df.to_csv('loan.csv', index=False)
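Note that the loop above reuses a single LabelEncoder object, so after the loop its classes_ attribute reflects only the last encoded column. If the original category-to-integer mappings are needed later (e.g. to decode predictions), one encoder per column can be kept instead; a minimal alternative sketch:

# Keeping one encoder per column so every mapping can be recovered later
encoders = {col: LabelEncoder() for col in var_mod}
for col in var_mod:
    df[col] = encoders[col].fit_transform(df[col])
# e.g. encoders['Gender'].classes_ lists the original labels of 'Gender'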

Output:

Info of numeric columns: 
ApplicantIncome ... Credit_History
count 614.000000 ... 564.000000
mean 5403.459283 ... 0.842199
std 6109.041673 ... 0.364878
min 150.000000 ... 0.000000
25% 2877.500000 ... 1.000000
50% 3812.500000 ... 1.000000
75% 5795.000000 ... 1.000000
max 81000.000000 ... 1.000000

[8 rows x 5 columns]

Info of all columns:
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
dtype='object')
Loan_ID object
Gender object
Married object
Dependents object
Education object
Self_Employed object
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Property_Area object
Loan_Status object
dtype: object

Size Info:
(614, 13)

Missing values info in the dataset:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

Gender Married ... Property_Area Loan_Status

0 Male No ... Urban Y
1 Male Yes ... Rural N
2 Male Yes ... Urban Y
3 Male Yes ... Urban Y
4 Male No ... Urban Y

[5 rows x 12 columns]
Gender int32
Married int32
Dependents int32
Education int32
Self_Employed int32
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Property_Area int32
Loan_Status int32
dtype: object

N.B. Some of the attributes may not be visible in the output because pandas truncates wide dataframes when printing (shown as '...').
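If all columns are needed in the printed output, the pandas display option can be widened before printing (this is a standard pandas setting, not specific to this dataset):

# Show every column when printing wide dataframes
pd.set_option('display.max_columns', None)
print(df.head(5))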

After the data preprocessing is done, the dataset (saved as ‘loan.csv’) is ready for classification. But before moving on to classification, it is better to know how the preprocessed dataset is divided into training and test datasets. This is called distribution of the dataset, and it is applied before the actual classification task begins.


Distribution of Dataset

We can divide the dataset in two major ways:

1. predefined division method

2. k-fold Cross-validation (CV) method

We will describe them briefly using coding examples.


1. Predefined division method

Using this method, we divide the preprocessed dataset into training and test sets according to a fraction value. Here, the evaluation is based on a single random split, so the outcome depends on which records happen to fall into the test set; this sampling bias may creep in and affect the measured quality of classification.

For example, if we have a dataset of N (= 500) data records, using the fraction value t = 0.3 (which indicates the portion of test data in the original dataset), the number of data records in the training set will be

N × (1 – t) = 500 × 0.7 = 500 × 7/10 = 350

and the number of data records in the test set will be

N × t = 500 × 0.3 = 500 × 3/10 = 150

Python code snippet:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('loan.csv') # Load the preprocessed dataset
X = df.drop('Loan_Status', axis=1) # Class label column is 'Loan_Status'
y = df['Loan_Status']

# Split data into training and test sets using the predefined division method
# (a fixed random_state is optional but makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
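A quick check of the resulting shapes confirms the arithmetic above. For the 614-record loan dataset with t = 0.3, scikit-learn rounds the test portion up, giving 185 test records and 429 training records (the column count below assumes the 11 feature columns left after preprocessing):

# Verify the split sizes
print(X_train.shape, X_test.shape) # e.g. (429, 11) and (185, 11)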


2. k-fold Cross-validation (CV) method

Using this method, we divide the preprocessed dataset into k disjoint folds; in each round, one fold serves as the test set and the remaining k − 1 folds form the training set. The folds are formed randomly, there is never common data between the training set and the test set within a round, and every record appears in a test set exactly once across the k rounds.

For example, if we have a dataset of N (=500) data records, using 10-fold CV (i.e. k=10), the number of data records in the training set will be

N × (k – 1)/k = 500 × 9/10 = 450

and the number of data records in the test set will be

N × 1/k = 500 × 1/10 = 50

Python code snippet:

import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv('loan.csv') # Load the preprocessed dataset
X = df.drop('Loan_Status', axis=1) # Class label column is 'Loan_Status'
y = df['Loan_Status']

# Split data into training and test sets using 10-fold CV method
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
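    # In each round, a classifier would normally be fitted on the training
    # part and scored on the test part, with the k scores averaged afterwards
    print(len(train_index), len(test_index)) # e.g. 553 61 for 614 records

Since 614 is not divisible by 10, KFold produces test folds of 61 or 62 records, which matches the N × 1/k arithmetic above.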


In the next case study, we will perform classification (including distribution of the dataset) on this preprocessed dataset using several important ML-based classifiers.