Case Study – Apriori algorithm

The Apriori algorithm is an Association Rule Mining (ARM) algorithm for boolean association rules. It exploits prior knowledge of the frequent-itemset property, which states that all nonempty subsets of a frequent itemset must also be frequent. At every iteration the algorithm uses two functions: candidate generation and pruning.

In general, an association rule is an expression of the form X⇒Y, where X, Y ⊆ I (the set of all items). Here, X is called the antecedent and Y is called the consequent. An association rule shows how often Y occurs when X has already occurred, subject to the minimum support (s) and minimum confidence (c) values.

 

ARM Measures

Support: The support of the rule X⇒Y in the transaction database D is the support of the itemset X ∪ Y in D:

support(X⇒Y) = count(X ∪ Y) / N  –––> (1)

where ‘N’ is the total number of transactions in the database and count(X ∪ Y) is the number of transactions that contain X ∪ Y.

 

Confidence: The confidence of the rule X⇒Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions in D that contain X:

confidence(X⇒Y) = count(X ∪ Y) / count(X) = support(X ∪ Y) / support(X)   –––> (2)

It basically denotes the conditional probability P(Y|X).

 

Lift: The lift of the rule X⇒Y, also referred to as an interestingness measure, takes the prior probability of the rule consequent into account as follows:

lift(X⇒Y) = support(X ∪ Y) / (support(X) ∗ support(Y))   –––> (3)

The measure ‘lift‘ is new in this context. Its significance in ARM is given below:

  • lift(X⇒Y) = 1 means that there is no correlation between X and Y,
  • lift(X⇒Y) > 1 means that there is a positive correlation between X and Y, and
  • lift(X⇒Y) < 1 means that there is a negative correlation between X and Y.

A greater lift value indicates a stronger association. We will use this measure in our experiment.
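The three measures in equations (1)–(3) can be sketched in a few lines of plain Python. The five toy transactions below are illustrative only and are not the case-study data:

```python
# A minimal sketch of equations (1)-(3) on a toy transaction list.
transactions = [
    {'Bread', 'Milk', 'Butter'},
    {'Bread', 'Milk'},
    {'Bread', 'Butter'},
    {'Milk', 'Butter'},
    {'Bread', 'Milk', 'Butter'},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in 'itemset'."""
    return sum(itemset <= t for t in transactions) / N

X, Y = {'Bread', 'Milk'}, {'Butter'}
sup = support(X | Y)        # equation (1): support(X => Y)
conf = sup / support(X)     # equation (2): confidence(X => Y)
lift = conf / support(Y)    # equation (3): lift(X => Y)
print(round(sup, 2), round(conf, 2), round(lift, 2))  # → 0.4 0.67 0.83
```

Here the lift is below 1, so in this toy data {Bread, Milk} and {Butter} are negatively correlated.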

 

Dataset Description

The following dataset (transaction.csv) contains the transactional records of a departmental store on a particular day. The dataset has 30 records and six items: Juice, Chips, Bread, Butter, Milk, and Banana. A snapshot of the dataset, viewed in MS Excel, is given below.

transaction.csv

Juice,Chips,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,Milk,
,,Bread,Butter,Milk,
,Chips,,,,Banana
Juice,Chips,Bread,Butter,Milk,Banana
Juice,Chips,,,Milk,
Juice,Chips,Bread,Butter,,Banana
Juice,Chips,,,Milk,
Juice,,Bread,,,Banana
Juice,,Bread,Butter,Milk,
,Chips,Bread,Butter,,Banana
Juice,,,Butter,Milk,Banana
Juice,Chips,Bread,Butter,Milk,
Juice,,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,Milk,Banana
Juice,Chips,Bread,Butter,Milk,Banana
,Chips,Bread,Butter,Milk,Banana
,Chips,,Butter,Milk,Banana
Juice,Chips,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,Milk,Banana
Juice,Chips,Bread,,Milk,Banana
Juice,Chips,,,,
,,Bread,Butter,,Banana
,,Bread,Butter,Milk,Banana
Juice,Chips,,,,
,,Bread,Butter,,Banana
,Chips,Bread,Butter,Milk,Banana
Juice,,Bread,Butter,,Banana
,Chips,Bread,Butter,Milk,Banana
,Chips,Bread,Butter,,Banana

 

Python Environment Setup

Before we start coding, we need to install the ‘apyori’ module first.

pip install apyori

This step is mandatory because the ‘apriori‘ function is a member of the ‘apyori’ module.

 

Implementation of Apriori algorithm

We provide here the implementation of the Apriori algorithm in Python. The objective is to discover the association rules whose support, confidence, and lift are greater than or equal to min_support, min_confidence, and min_lift respectively. See the code below.

arm.py

# Step 1: Import the libraries
import pandas as pd
from apyori import apriori

# Step 2: Load the dataset
df = pd.read_csv('transaction.csv', header=None)

# Step 3: Display statistics of records
print("Display statistics: ")
print("===================")
print(df.describe())

# Step 4: Display shape of the dataset
print("\nShape:",df.shape)

# Step 5: Convert the dataframe into a nested list
# (each row becomes one transaction; blank cells become the string 'nan')
database = []
for i in range(0, 30):
    database.append([str(df.values[i, j]) for j in range(0, 6)])

# Step 6: Develop the Apriori model
arm_rules = apriori(database, min_support=0.5, min_confidence=0.7, min_lift=1.2)
arm_results = list((arm_rules))

# Step 7: Display the number of rule(s)
print("\nNo. of rule(s):",len(arm_results))

# Step 8: Display the rule(s)
print("\nResults: ")
print("========")
print(arm_results)
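Note that in Step 5 every blank cell is read by pandas as NaN and then carried into the transactions as the literal string 'nan'. A hypothetical variant of that step, sketched below on two stand-in rows (not the case-study file), drops those cells instead so that transactions contain only real items:

```python
import pandas as pd

# Hypothetical variant of Step 5: filter out blank cells (read as NaN)
# instead of converting them to the string 'nan'.
df = pd.DataFrame([['Juice', None, 'Bread'],   # stand-in rows, not transaction.csv
                   [None, 'Chips', None]])
database = [[str(v) for v in row if pd.notna(v)] for row in df.values]
print(database)  # → [['Juice', 'Bread'], ['Chips']]
```

The same list comprehension can be applied to the transaction.csv dataframe loaded in Step 2.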

 

Output:

Display statistics: 
===================
            0      1      2       3     4       5
count      19     18     23      23    20      22
unique      1      1      1       1     1       1
top     Juice  Chips  Bread  Butter  Milk  Banana
freq       19     18     23      23    20      22


Shape: (30, 6)

No. of rule(s): 1

Results:
========
[RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'Bread', 'Milk'}),
items_add=frozenset({'Butter'}), confidence=0.9375, lift=1.2228260869565217)])]
 
Explanation

The program generates only one rule for the user-specified input measures: min_support = 0.5, min_confidence = 0.7, and min_lift = 1.2.

The support value for the rule is 0.5. This number is calculated by dividing the number of transactions containing ‘Butter’, ‘Bread’, and ‘Milk’ by the total number of transactions.

The confidence level for the rule is 0.9375, which shows that out of all the transactions that contain both ‘Bread’ and ‘Milk’, 93.75 % contain ‘Butter’ too.

The lift of 1.22 tells us that ‘Butter’ is 1.22 times more likely to be bought by customers who buy both ‘Bread’ and ‘Milk’ than by customers in general.
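These figures can be re-derived by hand from the raw counts in the dataset: 15 of the 30 transactions contain all of ‘Bread’, ‘Butter’, and ‘Milk’; 16 contain both ‘Bread’ and ‘Milk’; and 23 contain ‘Butter’. Plugging these counts into equations (1)–(3):

```python
# Re-deriving the reported rule measures from raw transaction counts.
N = 30           # total number of transactions
n_bbm = 15       # transactions containing Bread, Butter and Milk
n_bm = 16        # transactions containing Bread and Milk
n_butter = 23    # transactions containing Butter

support = n_bbm / N                   # equation (1): 0.5
confidence = n_bbm / n_bm             # equation (2): 0.9375
lift = confidence / (n_butter / N)    # equation (3): ~1.2228
print(support, confidence, lift)
```

The results agree with the support, confidence, and lift reported by apyori above.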