Case Study – Apriori algorithm

The Apriori algorithm is an Association Rule Mining (ARM) algorithm for boolean association rules. It exploits prior knowledge of the frequent-itemset property, which states that all nonempty subsets of a frequent itemset must also be frequent. At every iteration the algorithm uses two functions: candidate generation and pruning.

In general, an association rule is an expression of the form X⇒Y, where X, Y ⊆ I (the set of all items). Here, X is called the antecedent and Y is called the consequent. An association rule shows how often Y occurs when X has already occurred, subject to the minimum support (s) and minimum confidence (c) values.

 

ARM Measures

Support: The support of the rule X⇒Y in the transaction database D is the support of the itemset X ∪ Y in D:

support(X⇒Y) = count(X ∪ Y) / N  –––> (1)

where ‘N’ is the total number of transactions in the database and count(X ∪ Y) is the number of transactions that contain X ∪ Y.

 

Confidence: The confidence of the rule X⇒Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions in D that contain X:

confidence(X⇒Y) = count(X ∪ Y) / count(X) = support(X ∪ Y) / support(X)   –––> (2)

It basically denotes the conditional probability P(Y|X).

 

Lift: The lift of the rule X⇒Y, also referred to as an interestingness measure, takes the prior probability of the rule consequent into account as follows:

lift(X⇒Y) = support(X ∪ Y) / (support(X) ∗ support(Y))   –––> (3)

The measure ‘lift‘ is new in this context. Its significance in ARM is given below:

  • lift(X⇒Y) = 1 means that there is no correlation between X and Y,
  • lift(X⇒Y) > 1 means that there is a positive correlation between X and Y, and
  • lift(X⇒Y) < 1 means that there is a negative correlation between X and Y.

A greater lift value indicates a stronger association. We will use this measure in our experiment.
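The three measures in equations (1)–(3) can be sketched in a few lines of plain Python. The five toy transactions below are illustrative only and are not the case-study data:

```python
# A minimal sketch of equations (1)-(3) on a toy transaction list.
transactions = [
    {'Bread', 'Milk', 'Butter'},
    {'Bread', 'Milk'},
    {'Bread', 'Butter'},
    {'Milk', 'Butter'},
    {'Bread', 'Milk', 'Butter'},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in 'itemset'."""
    return sum(itemset <= t for t in transactions) / N

X, Y = {'Bread', 'Milk'}, {'Butter'}
sup = support(X | Y)        # equation (1): support(X => Y)
conf = sup / support(X)     # equation (2): confidence(X => Y)
lift = conf / support(Y)    # equation (3): lift(X => Y)
print(round(sup, 2), round(conf, 2), round(lift, 2))  # → 0.4 0.67 0.83
```

Here the lift is below 1, so in this toy data {Bread, Milk} and {Butter} are negatively correlated.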

 

Dataset Description

The following dataset (transaction.csv) contains the transactional records of a departmental store on a particular day. The dataset has 30 records and six items: Juice, Chips, Bread, Butter, Milk, and Banana. A snapshot of the dataset, viewed in MS Excel, is given below.

transaction.csv

Juice,Chips,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,Milk,
,,Bread,Butter,Milk,
,Chips,,,,Banana
Juice,Chips,Bread,Butter,Milk,Banana
Juice,Chips,,,Milk,
Juice,Chips,Bread,Butter,,Banana
Juice,Chips,,,Milk,
Juice,,Bread,,,Banana
Juice,,Bread,Butter,Milk,
,Chips,Bread,Butter,,Banana
Juice,,,Butter,Milk,Banana
Juice,Chips,Bread,Butter,Milk,
Juice,,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,Milk,Banana
Juice,Chips,Bread,Butter,Milk,Banana
,Chips,Bread,Butter,Milk,Banana
,Chips,,Butter,Milk,Banana
Juice,Chips,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,Milk,Banana
Juice,Chips,Bread,,Milk,Banana
Juice,Chips,,,,
,,Bread,Butter,,Banana
,,Bread,Butter,Milk,Banana
Juice,Chips,,,,
,,Bread,Butter,,Banana
,Chips,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,,Banana
,Chips,Bread,Butter,Milk,Banana
,Chips,Bread,Butter,,Banana

 

Python Environment Setup

Before we start coding, we need to install the ‘apyori’ module first.

pip install apyori

This step is mandatory because the ‘apriori‘ function is a member of the ‘apyori’ module.

 

Implementation of Apriori algorithm

We provide here the implementation of the Apriori algorithm in Python. The objective is to discover the association rules whose support, confidence, and lift are greater than or equal to min_support, min_confidence, and min_lift respectively. See the code below.

arm.py

# Step 1: Import the libraries
import pandas as pd
from apyori import apriori

# Step 2: Load the dataset
df = pd.read_csv('transaction.csv', header=None)

# Step 3: Display statistics of records
print("Display statistics: ")
print("===================")
print(df.describe())

# Step 4: Display shape of the dataset
print("\nShape:",df.shape)

# Step 5: Convert the dataframe into a nested list
# (each row becomes one transaction; blank cells become the string 'nan')
database = []
for i in range(0, 30):
    database.append([str(df.values[i, j]) for j in range(0, 6)])

# Step 6: Develop the Apriori model
arm_rules = apriori(database, min_support=0.5, min_confidence=0.7, min_lift=1.2)
arm_results = list((arm_rules))

# Step 7: Display the number of rule(s)
print("\nNo. of rule(s):",len(arm_results))

# Step 8: Display the rule(s)
print("\nResults: ")
print("========")
print(arm_results)
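Note that in Step 5 every blank cell is read by pandas as NaN and then carried into the transactions as the literal string 'nan'. A hypothetical variant of that step, sketched below on two stand-in rows (not the case-study file), drops those cells instead so that transactions contain only real items:

```python
import pandas as pd

# Hypothetical variant of Step 5: filter out blank cells (read as NaN)
# instead of converting them to the string 'nan'.
df = pd.DataFrame([['Juice', None, 'Bread'],   # stand-in rows, not transaction.csv
                   [None, 'Chips', None]])
database = [[str(v) for v in row if pd.notna(v)] for row in df.values]
print(database)  # → [['Juice', 'Bread'], ['Chips']]
```

The same list comprehension can be applied to the transaction.csv dataframe loaded in Step 2.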

 

Output:

Display statistics: 
===================
            0      1      2       3     4       5
count      19     18     23      23    20      22
unique      1      1      1       1     1       1
top     Juice  Chips  Bread  Butter  Milk  Banana
freq       19     18     23      23    20      22


Shape: (30, 6)

No. of rule(s): 1

Results:
========
[RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'Bread', 'Milk'}),
items_add=frozenset({'Butter'}), confidence=0.9375, lift=1.2228260869565217)])]
 
Explanation

The program generates only one rule for the user-specified input measures: min_support = 0.5, min_confidence = 0.7, and min_lift = 1.2.

The support value for the rule is 0.5. This number is calculated by dividing the number of transactions containing ‘Butter’, ‘Bread’, and ‘Milk’ by the total number of transactions.

The confidence level for the rule is 0.9375, which shows that out of all the transactions that contain both ‘Bread’ and ‘Milk’, 93.75 % contain ‘Butter’ too.

The lift of 1.22 tells us that ‘Butter’ is 1.22 times more likely to be bought by customers who buy both ‘Bread’ and ‘Milk’ than by customers in general.
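These figures can be re-derived by hand from the raw counts in the dataset: 15 of the 30 transactions contain all of ‘Bread’, ‘Butter’, and ‘Milk’; 16 contain both ‘Bread’ and ‘Milk’; and 23 contain ‘Butter’. Plugging these counts into equations (1)–(3):

```python
# Re-deriving the reported rule measures from raw transaction counts.
N = 30           # total number of transactions
n_bbm = 15       # transactions containing Bread, Butter and Milk
n_bm = 16        # transactions containing Bread and Milk
n_butter = 23    # transactions containing Butter

support = n_bbm / N                   # equation (1): 0.5
confidence = n_bbm / n_bm             # equation (2): 0.9375
lift = confidence / (n_butter / N)    # equation (3): ~1.2228
print(support, confidence, lift)
```

The results agree with the support, confidence, and lift reported by apyori above.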