Wednesday, November 4, 2015

Association Rules Analysis on Groceries Dataset

----------------------------------------------------------------------------------------------------------

by Manoj Patra and Nikita Naidu


----------------------------------------------------------------------------------------------------------

Abstract


Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrence of other items in the same transaction. The aim of this project is to perform Market Basket Analysis on the groceries.csv dataset by finding association rules among the purchased items. Market basket analysis can give the retailer information about the purchase behaviour of buyers. This information enables the retailer to understand buyers' needs and rearrange the store's layout accordingly, develop cross-promotional programs, or even attract new buyers. We used Weka 3.6.10 to perform the analysis.




----------------------------------------------------------------------------------------------------------


1.0 Introduction

Association analysis is a data mining technique used to find relationships among two or more items based on transactions. If item A is frequently purchased together with item B, we can infer that the two items are associated. A business can leverage this fact to improve sales and the user experience. Market Basket Analysis is performed using this technique. Placing associated items on adjacent racks in a grocery store makes shopping more convenient for the customer. Similarly, cross-selling can be increased by offering deals on cold drinks to customers who buy pizza, which encourages customers to spend more. Targeted marketing is another use: customers who buy specific products can be emailed recommendations and offers for other products that go with their first purchase.
Association analysis is thus an important technique that businesses use on a large scale to increase their sales.

2.0 Problem definition

To find association rules on the groceries dataset.
The dataset is in csv format and holds the purchase details of 9835 transactions. The Apriori and FP Growth algorithms were used to find associations between the items. The evaluation metrics used to select the best association rules are support and confidence.

3.0 Dataset Description

The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions, and the items are aggregated into 169 categories. The Groceries dataset was converted from csv to arff format using Java code.
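The original conversion was done in Java and is not shown in this post. As a rough illustration only, here is a minimal Python sketch of the same idea. It assumes a basket-style CSV (one transaction per line, items separated by commas), replaces spaces in item names with underscores, and writes each item as a nominal attribute with the single value t and ? for absent items, which matches the `item=t` form seen in the Weka output. The function name `baskets_to_arff` and the sample lines are hypothetical.

```python
import csv

def baskets_to_arff(lines, relation="groceries"):
    """Build ARFF text from basket-style CSV lines (one transaction per line)."""
    baskets = []
    for row in csv.reader(lines):
        # Normalise item names: "whole milk" -> "whole_milk".
        items = {field.strip().replace(" ", "_") for field in row if field.strip()}
        if items:
            baskets.append(items)

    # One attribute per distinct item, in a stable (sorted) order.
    attributes = sorted(set().union(*baskets))

    out = [f"@relation {relation}", ""]
    # Names are quoted because some items contain characters such as "/".
    out += [f"@attribute '{name}' {{t}}" for name in attributes]
    out += ["", "@data"]
    for items in baskets:
        # "t" marks a purchased item, "?" marks a missing (not purchased) one.
        out.append(",".join("t" if name in items else "?" for name in attributes))
    return "\n".join(out) + "\n"

# Hypothetical sample lines, not copied from groceries.csv.
sample = ["whole milk,yogurt", "rolls/buns", "yogurt,other vegetables"]
arff = baskets_to_arff(sample)
print(arff)
```

Marking absent items as missing rather than as a second value keeps the rule miner from producing rules about items that were not bought.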

4.0 Methodology and results

4.1 Algorithm and evaluation metric

We used the Apriori and FP Growth algorithms to find the association rules, with the following evaluation metrics:

1. Support

The support of an item-set X with respect to a set of transactions T is the proportion of transactions in T that contain X.
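A minimal sketch of this definition, assuming each transaction is represented as a set of item names. The function name `support` and the toy baskets are hypothetical and are not taken from groceries.csv.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical toy baskets for illustration.
baskets = [
    {"whole_milk", "yogurt"},
    {"whole_milk", "other_vegetables"},
    {"rolls/buns"},
    {"whole_milk", "yogurt", "other_vegetables"},
]

milk_yogurt = support({"whole_milk", "yogurt"}, baskets)
print(milk_yogurt)  # 2 of the 4 baskets contain both items -> 0.5
```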

2. Confidence

The confidence of a rule X⇒Y with respect to a set of transactions T is the proportion of the transactions containing X that also contain Y.
                                   conf(X⇒Y) = supp(X∪Y) / supp(X)
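As a worked check, confidence is the share of transactions containing X that also contain Y, so it can be computed directly from two counts. The sketch below uses the counts reported in the Results section of this post (other_vegetables in 1903 transactions, 736 of them also containing whole_milk); the function name `confidence` is an assumption made for illustration.

```python
def confidence(count_xy, count_x):
    """conf(X => Y): transactions with both X and Y divided by transactions with X."""
    return count_xy / count_x

# other_vegetables => whole_milk
forward = confidence(736, 1903)
# whole_milk => other_vegetables (whole_milk appears in 2513 transactions)
backward = confidence(736, 2513)

print(round(forward, 2), round(backward, 2))  # 0.39 0.29
```

The two directions differ because confidence divides by the count of the antecedent, which is why the same pair of items yields two rules with different confidence values.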

3. Conviction

Conviction compares the frequency with which X would be expected to appear without Y if X and Y were independent with the actual frequency with which X appears without Y.
                                 conviction(X⇒Y) = (1 − supp(Y)) / (1 − conf(X⇒Y)) = P(X)P(~Y) / P(X∧~Y)
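A small sketch of conviction computed from raw counts, assuming the 9835 transactions and the rule counts reported in the Results section. The function name `conviction` and its argument names are hypothetical.

```python
def conviction(count_x, count_y, count_xy, n):
    """conviction(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y))."""
    supp_y = count_y / n
    conf = count_xy / count_x
    if conf == 1.0:
        # The rule never fails, so conviction is unbounded.
        return float("inf")
    return (1 - supp_y) / (1 - conf)

# other_vegetables (1903) => whole_milk (2513), 736 transactions with both.
conv = conviction(count_x=1903, count_y=2513, count_xy=736, n=9835)
print(round(conv, 2))  # 1.21
```

A value of 1 would mean X and Y are unrelated; values above 1 mean the rule fails less often than independence would predict.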
                                 

4. Lift

Lift measures how many times more often X and Y occur together than would be expected if they were statistically independent. Lift is not downward closed and does not suffer from the rare item problem. However, lift is susceptible to noise in small databases: rare itemsets with low counts (low probability) that by chance occur together a few times (or only once) can produce enormous lift values.
lift(X⇒Y) = lift(Y⇒X) = conf(X⇒Y) / supp(Y) = conf(Y⇒X) / supp(X) = P(X∧Y) / (P(X)P(Y))
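The sketch below computes lift from counts and illustrates two points made here: lift is symmetric in X and Y, and two very rare items that happen to co-occur once can produce a huge value. The counts for the first example come from the Results section; the rare-item example is hypothetical, as is the function name `lift`.

```python
def lift(count_x, count_y, count_xy, n):
    """lift(X => Y) = P(X and Y) / (P(X) * P(Y)), computed from counts."""
    return (count_xy * n) / (count_x * count_y)

n = 9835  # number of transactions in the Groceries data set

# other_vegetables (1903) and whole_milk (2513) appear together 736 times.
veg_milk = lift(1903, 2513, 736, n)
milk_veg = lift(2513, 1903, 736, n)
print(round(veg_milk, 2), veg_milk == milk_veg)  # 1.51 True

# Hypothetical rare items: each bought once, and in the same basket.
rare = lift(1, 1, 1, n)
print(rare)  # 9835.0
```

This is one reason a minimum support threshold is applied before ranking rules by lift.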
  

5. Leverage

Leverage measures the difference between how often X and Y appear together in the data set and how often they would be expected to appear together if X and Y were statistically independent. The rationale in a sales setting is to find out how many more units (items X and Y together) are sold than expected from independent sales.
PS(X⇒Y) = leverage(X⇒Y) = supp(X∪Y) − supp(X)supp(Y) = P(X∧Y) − P(X)P(Y)
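To make the sales interpretation concrete, the sketch below computes leverage from counts and then scales it by the number of transactions to get the number of extra joint purchases relative to independence. It uses the other_vegetables and whole_milk counts from the Results section; the function name `leverage` is an assumption for illustration.

```python
def leverage(count_x, count_y, count_xy, n):
    """leverage(X => Y) = P(X and Y) - P(X) * P(Y), computed from counts."""
    return count_xy / n - (count_x / n) * (count_y / n)

n = 9835
lev = leverage(1903, 2513, 736, n)
print(round(lev, 2))  # 0.03

# Scaling by n gives the surplus of joint purchases over the independent case.
extra_baskets = lev * n
print(round(extra_baskets))  # 250
```

So of the 736 baskets containing both items, roughly 250 are in excess of what independent purchasing would produce.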

4.2 Results 

The screenshots and the association rules obtained are as below:

1. Apriori

Lower Bound On Support = 0.07 (688 instances) :: Lower Bound On Confidence = 0.1

1. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39) 
2. whole_milk=t 2513 ==> other_vegetables=t 736 conf:(0.29)

Lower Bound On Support = 0.05 (492 instances) :: Lower Bound On Confidence = 0.1

1. yogurt=t 1372 ==> whole_milk=t 551 conf:(0.4)
2. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39)
3. rolls/buns=t 1809 ==> whole_milk=t 557 conf:(0.31)
4. whole_milk=t 2513 ==> other_vegetables=t 736 conf:(0.29)
5. whole_milk=t 2513 ==> rolls/buns=t 557 conf:(0.22)
6. whole_milk=t 2513 ==> yogurt=t 551 conf:(0.22)

Lower Bound On Support = 0.04 (393 instances) :: Lower Bound On Confidence = 0.1

1. root_vegetables=t 1072 ==> whole_milk=t 481 conf:(0.45) 
2. root_vegetables=t 1072 ==> other_vegetables=t 466 conf:(0.43) 
3. tropical_fruit=t 1032 ==> whole_milk=t 416 conf:(0.4) 
4. yogurt=t 1372 ==> whole_milk=t 551 conf:(0.4) 
5. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39) 
6. yogurt=t 1372 ==> other_vegetables=t 427 conf:(0.31) 
7. rolls/buns=t 1809 ==> whole_milk=t 557 conf:(0.31) 
8. whole_milk=t 2513 ==> other_vegetables=t 736 conf:(0.29) 
9. other_vegetables=t 1903 ==> root_vegetables=t 466 conf:(0.24) 
10. rolls/buns=t 1809 ==> other_vegetables=t 419 conf:(0.23) 
11. soda=t 1715 ==> whole_milk=t 394 conf:(0.23) 
12. other_vegetables=t 1903 ==> yogurt=t 427 conf:(0.22) 
13. whole_milk=t 2513 ==> rolls/buns=t 557 conf:(0.22) 
14. other_vegetables=t 1903 ==> rolls/buns=t 419 conf:(0.22) 
15. whole_milk=t 2513 ==> yogurt=t 551 conf:(0.22) 
16. whole_milk=t 2513 ==> root_vegetables=t 481 conf:(0.19) 
17. whole_milk=t 2513 ==> tropical_fruit=t 416 conf:(0.17) 
18. whole_milk=t 2513 ==> soda=t 394 conf:(0.16)

Lower Bound On Support = 0.02 (197 instances) :: Lower Bound On Confidence = 0.2

1. yogurt=t other_vegetables=t 427 ==> whole_milk=t 219 conf:(0.51) 
2. butter=t 545 ==> whole_milk=t 271 conf:(0.5) 
3. curd=t 524 ==> whole_milk=t 257 conf:(0.49) 
4. other_vegetables=t root_vegetables=t 466 ==> whole_milk=t 228 conf:(0.49) 
5. whole_milk=t root_vegetables=t 481 ==> other_vegetables=t 228 conf:(0.47) 
6. domestic_eggs=t 624 ==> whole_milk=t 295 conf:(0.47) 
7. whipped/sour_cream=t 705 ==> whole_milk=t 317 conf:(0.45) 
8. root_vegetables=t 1072 ==> whole_milk=t 481 conf:(0.45) 
9. root_vegetables=t 1072 ==> other_vegetables=t 466 conf:(0.43) 
10. frozen_vegetables=t 473 ==> whole_milk=t 201 conf:(0.42) 
11. margarine=t 576 ==> whole_milk=t 238 conf:(0.41) 
12. beef=t 516 ==> whole_milk=t 209 conf:(0.41) 
13. tropical_fruit=t 1032 ==> whole_milk=t 416 conf:(0.4) 
14. whipped/sour_cream=t 705 ==> other_vegetables=t 284 conf:(0.4) 
15. yogurt=t 1372 ==> whole_milk=t 551 conf:(0.4) 
16. pip_fruit=t 744 ==> whole_milk=t 296 conf:(0.4) 
17. yogurt=t whole_milk=t 551 ==> other_vegetables=t 219 conf:(0.4) 
18. brown_bread=t 638 ==> whole_milk=t 248 conf:(0.39) 
19. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39) 
20. pork=t 567 ==> whole_milk=t 218 conf:(0.38)
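The level-wise search that Apriori performs can be sketched as follows. This is a simplified illustration on hypothetical toy baskets, not Weka's implementation: it finds frequent itemsets only (rule generation and confidence filtering are omitted), and the function name `apriori_frequent_itemsets` is an assumption.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        single = frozenset([item])
        s = supp(single)
        if s >= min_support:
            current[single] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = supp(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy baskets, not taken from groceries.csv.
baskets = [
    {"whole_milk", "yogurt", "other_vegetables"},
    {"whole_milk", "yogurt"},
    {"whole_milk", "other_vegetables"},
    {"rolls/buns", "soda"},
    {"whole_milk", "rolls/buns"},
]
frequent = apriori_frequent_itemsets(baskets, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step relies on support being downward closed: an itemset cannot be frequent if any of its subsets is infrequent, so the three-item candidate here is discarded without scanning the data.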

2. FP Growth

Lower Bound On Support = 0.07 (688 instances) :: Lower Bound On Confidence = 0.1

1. [other_vegetables=t]: 1903 ==> [whole_milk=t]: 736 <conf:(0.39)> lift:(1.51) lev:(0.03) conv:(1.21) 
2. [whole_milk=t]: 2513 ==> [other_vegetables=t]: 736 <conf:(0.29)> lift:(1.51) lev:(0.03) conv:(1.14)

Lower Bound On Support = 0.05 (492 instances) :: Lower Bound On Confidence = 0.1
1. [yogurt=t]: 1372 ==> [whole_milk=t]: 551 <conf:(0.4)> lift:(1.57) lev:(0.02) conv:(1.24) 
2. [other_vegetables=t]: 1903 ==> [whole_milk=t]: 736 <conf:(0.39)> lift:(1.51) lev:(0.03) conv:(1.21) 
3. [rolls/buns=t]: 1809 ==> [whole_milk=t]: 557 <conf:(0.31)> lift:(1.21) lev:(0.01) conv:(1.07) 
4. [whole_milk=t]: 2513 ==> [other_vegetables=t]: 736 <conf:(0.29)> lift:(1.51) lev:(0.03) conv:(1.14) 
5. [whole_milk=t]: 2513 ==> [rolls/buns=t]: 557 <conf:(0.22)> lift:(1.21) lev:(0.01) conv:(1.05) 
6. [whole_milk=t]: 2513 ==> [yogurt=t]: 551 <conf:(0.22)> lift:(1.57) lev:(0.02) conv:(1.1)

FPGrowth found 124 rules (displaying top 20)
Lower Bound On Support = 0.04 (393 instances) :: Lower Bound On Confidence = 0.1

1. [other_vegetables=t, yogurt=t]: 427 ==> [whole_milk=t]: 219 <conf:(0.51)> lift:(2.01) lev:(0.01) conv:(1.52)
2. [butter=t]: 545 ==> [whole_milk=t]: 271 <conf:(0.5)> lift:(1.95) lev:(0.01) conv:(1.48)
3. [curd=t]: 524 ==> [whole_milk=t]: 257 <conf:(0.49)> lift:(1.92) lev:(0.01) conv:(1.46)
4. [other_vegetables=t, root_vegetables=t]: 466 ==> [whole_milk=t]: 228 <conf:(0.49)> lift:(1.91) lev:(0.01) conv:(1.45)
5. [whole_milk=t, root_vegetables=t]: 481 ==> [other_vegetables=t]: 228 <conf:(0.47)> lift:(2.45) lev:(0.01) conv:(1.53)
6. [domestic_eggs=t]: 624 ==> [whole_milk=t]: 295 <conf:(0.47)> lift:(1.85) lev:(0.01) conv:(1.41)
7. [whipped/sour_cream=t]: 705 ==> [whole_milk=t]: 317 <conf:(0.45)> lift:(1.76) lev:(0.01) conv:(1.35)
8. [root_vegetables=t]: 1072 ==> [whole_milk=t]: 481 <conf:(0.45)> lift:(1.76) lev:(0.02) conv:(1.35)
9. [root_vegetables=t]: 1072 ==> [other_vegetables=t]: 466 <conf:(0.43)> lift:(2.25) lev:(0.03) conv:(1.42)
10. [frozen_vegetables=t]: 473 ==> [whole_milk=t]: 201 <conf:(0.42)> lift:(1.66) lev:(0.01) conv:(1.29)
11. [margarine=t]: 576 ==> [whole_milk=t]: 238 <conf:(0.41)> lift:(1.62) lev:(0.01) conv:(1.26)
12. [beef=t]: 516 ==> [whole_milk=t]: 209 <conf:(0.41)> lift:(1.59) lev:(0.01) conv:(1.25)
13. [tropical_fruit=t]: 1032 ==> [whole_milk=t]: 416 <conf:(0.4)> lift:(1.58) lev:(0.02) conv:(1.25)
14. [whipped/sour_cream=t]: 705 ==> [other_vegetables=t]: 284 <conf:(0.4)> lift:(2.08) lev:(0.02) conv:(1.35)
15. [yogurt=t]: 1372 ==> [whole_milk=t]: 551 <conf:(0.4)> lift:(1.57) lev:(0.02) conv:(1.24)
16. [pip_fruit=t]: 744 ==> [whole_milk=t]: 296 <conf:(0.4)> lift:(1.56) lev:(0.01) conv:(1.23)
17. [whole_milk=t, yogurt=t]: 551 ==> [other_vegetables=t]: 219 <conf:(0.4)> lift:(2.05) lev:(0.01) conv:(1.33)
18. [brown_bread=t]: 638 ==> [whole_milk=t]: 248 <conf:(0.39)> lift:(1.52) lev:(0.01) conv:(1.21)
19. [other_vegetables=t]: 1903 ==> [whole_milk=t]: 736 <conf:(0.39)> lift:(1.51) lev:(0.03) conv:(1.21)
20. [pork=t]: 567 ==> [whole_milk=t]: 218 <conf:(0.38)> lift:(1.5) lev:(0.01) conv:(1.21)
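The FP Growth listing above is ordered by confidence. To rank the same rules by another metric, the lines can be parsed into records. The sketch below assumes the exact line format shown in this post; the pattern `RULE_RE`, the function `parse_rule`, and the field names (for example `both_count` for the number after the consequent) are hypothetical names chosen for illustration.

```python
import re

# Pattern for one FP Growth rule line in the format listed in this post.
RULE_RE = re.compile(
    r"^\s*\d+\.\s*\[(?P<lhs>[^\]]+)\]:\s*(?P<lhs_count>\d+)\s*==>\s*"
    r"\[(?P<rhs>[^\]]+)\]:\s*(?P<both_count>\d+)\s*"
    r"<conf:\((?P<conf>[\d.]+)\)>\s*lift:\((?P<lift>[\d.]+)\)\s*"
    r"lev:\((?P<lev>-?[\d.]+)\)\s*conv:\((?P<conv>[\d.]+)\)"
)

def parse_rule(line):
    """Parse one rule line into a dict, or return None if it does not match."""
    m = RULE_RE.match(line)
    if m is None:
        return None

    def items(text):
        # "whole_milk=t, yogurt=t" -> ["whole_milk", "yogurt"]
        return [part.strip().split("=")[0] for part in text.split(",")]

    return {
        "lhs": items(m["lhs"]),
        "rhs": items(m["rhs"]),
        "lhs_count": int(m["lhs_count"]),
        "both_count": int(m["both_count"]),
        "conf": float(m["conf"]),
        "lift": float(m["lift"]),
        "lev": float(m["lev"]),
        "conv": float(m["conv"]),
    }

# Three lines copied from the listing above.
lines = [
    "1. [other_vegetables=t, yogurt=t]: 427 ==> [whole_milk=t]: 219 <conf:(0.51)> lift:(2.01) lev:(0.01) conv:(1.52)",
    "2. [butter=t]: 545 ==> [whole_milk=t]: 271 <conf:(0.5)> lift:(1.95) lev:(0.01) conv:(1.48)",
    "5. [whole_milk=t, root_vegetables=t]: 481 ==> [other_vegetables=t]: 228 <conf:(0.47)> lift:(2.45) lev:(0.01) conv:(1.53)",
]
rules = [r for r in (parse_rule(line) for line in lines) if r is not None]
by_lift = sorted(rules, key=lambda r: r["lift"], reverse=True)
print(by_lift[0]["lhs"], "=>", by_lift[0]["rhs"], by_lift[0]["lift"])
```

Among these three lines, the rule with the highest confidence is not the one with the highest lift, so the choice of ranking metric changes which rule comes first.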

4.3 Inference

We have obtained the associations between the items. The following rules have the highest confidence, meaning these items are most often purchased together:
1. yogurt=t other_vegetables=t 427 ==> whole_milk=t 219 conf:(0.51)
2. butter=t 545 ==> whole_milk=t 271 conf:(0.5)
3. curd=t 524 ==> whole_milk=t 257 conf:(0.49)
4. other_vegetables=t root_vegetables=t 466 ==> whole_milk=t 228 conf:(0.49)
5. domestic_eggs=t 624 ==> whole_milk=t 295 conf:(0.47)

5.0 Conclusion

Association rule analysis performed on the groceries dataset revealed frequent co-occurrence between some of the items, most often involving whole milk. Thus, if the store makes offers on related products, such as a discount on whole milk with the purchase of yogurt, and/or places milk and vegetables on nearby shelves, its sales may increase.

