Python library Pandas
Pandas is an open source, BSD-licensed library (i.e. module) for providing high-performance, easy-to-use data structures and data analysis tools for Python language. It is used in a wide range of fields involving academic and commercial domains including finance, economics, statistics, analytics, etc.
Pandas library is built on top of Numpy, meaning Pandas needs Numpy to operate. Pandas provide an easy way to create, manipulate and analyze the data.
Data scientists use Pandas for its following advantages:
- It easily handles missing data
- It uses Series for one-dimensional data structure
- It employs DataFrame for multi-dimensional data structure
- It provides an efficient way to slice the data
- It provides a flexible way to merge, concatenate or reshape the data
- It includes a powerful time series tool to work with
In brief, Pandas is a useful library in data analysis. It can be used to perform data manipulation and analysis. Pandas provide powerful and easy-to-use data structures, as well as the means to quickly perform operations on these structures.
In this tutorial, we will learn the various features of Pandas and how to use them in practice using Python.
1) Pandas for Series
We can create a data series from a given array. Then, we can apply various operations on this series. Let us consider the following program to deal with a data series created from a numeric array:
# Example of Pandas Series
import numpy as np
import pandas as pd
# A sample array
data = np.array([11, 22, 33, 44, 55, 66, 77, 88, 99])
# Create series from array
series = pd.Series(data)
print("Full series:~")
print(series, "\n")
# Retrieve the first six elements
print("First 6 elements of series:~")
print(series[:6], "\n")
# Calculate Sum
print("Sum:", series.sum(), "\n")
# Calculate Mean
print("Mean:", series.mean(), "\n")
# Calculate Standard Deviation
val = series.std()
res = format(val, '.2f')
print("Standard Deviation:", res, "\n")
# Use loc() method
print("\n<-- loc -->")
result = series.loc[2:5]
print(result, "\n")
# Use loc() method
print("\n<-- iloc -->")
result = series.iloc[2:5]
print(result, "\n")
Output:
Full series:~
0 11
1 22
2 33
3 44
4 55
5 66
6 77
7 88
8 99
dtype: int32
First 6 elements of series:~
0 11
1 22
2 33
3 44
4 55
5 66
dtype: int32
Sum: 495
Mean: 55.0
Standard Deviation: 30.12
<-- loc -->
2 33
3 44
4 55
5 66
dtype: int32
<-- iloc -->
2 33
3 44
4 55
dtype: int32
2) Pandas for DataFrame
Let us consider this CSV file named “students.csv” denoting a students’ database with the following attributes:
See the example below to create dataframe from the above database and then manipulate & analyze the data in this dataframe:
# Example of Pandas DataFrame
import pandas as pd
# Reading the dataset in a dataframe using Pandas
df = pd.read_csv('students.csv')
# Gathering Info
print('\nInfo of numeric columns: ');
print(df.describe())
# print(df.describe(include="all"))
print('\nInfo of all columns: ');
print(df.columns)
print(df.dtypes)
print('\nSize Info: ');
print(df.shape, "\n")
print('\nBasic Info: ');
print(df.head(10))
# print(df.info) # look at the info of "df"
# df.loc[] gets rows (or columns) with particular labels from the index
print("\n<-- loc 1 -->")
print(df.loc[(df['Grade'] > 8.5), ['Roll', 'Name', 'Stream']])
# df.iloc[] gets rows (or columns) at particular positions in the index (integer value)
print("\n<-- iloc 1 -->")
# DataFrame.values attribute return a Numpy representation of the given DataFrame
print(df.iloc[(df['Grade'] > 8.5).values, [0, 1, 2]])
# Another example of df.loc[] to show it is label-based
print("\n<-- loc 2 -->")
print(df.loc[(df['Grade'] > 9.0) & (df['Stream'] == 'CSE'), ['Roll', 'Name']])
# Another example of df.iloc[] to show it is index-based
print("\n<-- iloc 2 -->")
print(df.iloc[((df['Grade'] > 9.0) & (df['Stream'] == 'CSE')).values, [0, 1]])
Output:
Info of numeric columns:
Roll Age Grade
count 10.000000 10.000000 10.000000
mean 55.000000 22.500000 8.715000
std 30.276504 1.080123 0.569878
min 10.000000 21.000000 7.780000
25% 32.500000 22.000000 8.382500
50% 55.000000 22.500000 8.650000
75% 77.500000 23.000000 9.107500
max 100.000000 24.000000 9.650000
Info of all columns:
Index(['Roll', 'Name', 'Stream', 'Gender', 'Age', 'Grade'], dtype='object')
Roll int64
Name object
Stream object
Gender object
Age int64
Grade float64
dtype: object
Size Info:
(10, 6)
Basic Info:
Roll Name Stream Gender Age Grade
0 10 Sandip Das IT M 23 9.29
1 20 Aparna Roy EE F 22 8.75
2 30 Samar Khan ME M 22 8.15
3 40 Sandesh Srivastav ECE M 23 9.12
4 50 Peter Parkar CSE M 21 8.55
5 60 Annie Smith IT F 24 9.07
6 70 Amar Thakur CSE M 23 7.78
7 80 Hasina Begam ECE F 21 8.37
8 90 Samira Reddy EE F 22 9.65
9 100 Somenath Ghosh CSE M 24 9.02
<-- loc 1 -->
Roll Name Stream
0 10 Sandip Das IT
1 20 Aparna Roy EE
3 40 Sandesh Srivastav ECE
4 50 Peter Parkar CSE
5 60 Annie Smith IT
8 90 Samira Reddy EE
9 100 Somenath Ghosh CSE
<-- iloc 1 -->
Roll Name Stream
0 10 Sandip Das IT
1 20 Aparna Roy EE
3 40 Sandesh Srivastav ECE
4 50 Peter Parkar CSE
5 60 Annie Smith IT
8 90 Samira Reddy EE
9 100 Somenath Ghosh CSE
<-- loc 2 -->
Roll Name
9 100 Somenath Ghosh
<-- iloc 2 -->
Roll Name
9 100 Somenath Ghosh
I hope that now you have got the confidence of working with Pandas module. This is particularly useful when you will develop solutions for the problems related to data science and machine learning.