Tuesday, November 17, 2015

How to Create Wifi Hotspot in Ubuntu 14.04


1. Disable Wi-Fi and plug an internet cable into your laptop, so that your Ubuntu is connected to the wired internet and wireless is disabled.

2. Go to the network icon on the top panel -> Edit Connections …, then click the Add button in the pop-up window.




3. Choose Wi-Fi from the drop-down menu when you’re asked to choose a connection type:



4. In the next window, do the following:
  • Type in a connection name. The name will be used later.
  • Type in an SSID.
  • Select mode: Infrastructure.
  • Device MAC address: select your wireless card from the drop-down menu.

5. Go to the Wi-Fi Security tab, select the security type WPA & WPA2 Personal, and set a password.




6. Go to the IPv4 Settings tab and, from the Method drop-down box, select Shared to other computers.



When done, click the Save button.
After the above steps, a configuration file is created under the /etc/NetworkManager/system-connections directory. The file name is the same as the connection name you typed in step 4.
Now press Ctrl+Alt+T on the keyboard to open a terminal. When it opens, run the command below (replacing max-Patra with your own connection name) and hit Enter to edit the configuration file:

gksu gedit /etc/NetworkManager/system-connections/max-Patra
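
Note: gksu may not be installed by default on Ubuntu 14.04. If the command is not found, you can install it first (or substitute any other graphical editor run as root):

sudo apt-get install gksu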

When the file opens, find the line mode=infrastructure and change it to mode=ap. Finally, save the file.
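
For reference, after the edit the relevant parts of the keyfile should look roughly like the sketch below. The values here are illustrative only — your SSID, password and MAC address will differ, and the exact section names can vary between NetworkManager versions:

[connection]
id=max-Patra
type=802-11-wireless

[802-11-wireless]
ssid=max-Patra
mode=ap
mac-address=XX:XX:XX:XX:XX:XX

[802-11-wireless-security]
key-mgmt=wpa-psk
psk=your-password

[ipv4]
method=shared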

When everything’s done, enable Wi-Fi from the Network Manager icon on the panel. It should automatically connect to the hotspot you created. If not, select “Connect to Hidden Wi-Fi Network …” and pick your hotspot from the drop-down box.
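
To double-check that the wireless card has actually switched to access-point mode, you can run iwconfig in a terminal; the wireless interface should report Mode:Master (wlan0 is an assumed interface name — yours may differ):

iwconfig wlan0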



If you like what you see,
please like, share and comment in the comment box below.
Big thanks for reading!
 


Sunday, November 15, 2015

Best Motivational Video



Here are the things we can learn from this video:

1. Keep trying until you succeed. (The ducklings fall down again and again, but they keep trying.)

2. If you cannot succeed the way you are going, try a different way. (They move from one place to another and try jumping.)

3. Don't leave behind your friends, colleagues or anyone who is with you when you are one step ahead of them. (They wait for the rest to come.)

Wednesday, November 4, 2015

Association Rules Analysis on Groceries Dataset

----------------------------------------------------------------------------------------------------------

by Manoj Patra and Nikita Naidu

Click here to download the project

----------------------------------------------------------------------------------------------------------

Abstract


Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction. The aim of this project is to perform Market Basket Analysis on the groceries.csv dataset by finding association rules on the purchased items. Market basket analysis may provide the retailer with information to understand the purchase behaviour of a buyer. This information will enable the retailer to understand the buyer's needs, rearrange the store's layout accordingly, develop cross-promotional programs, or even capture new buyers. We used Weka 3.6.10 for performing the analysis.




----------------------------------------------------------------------------------------------------------


1.0 Introduction

Association analysis is a data mining technique used to find correlations among two or more items based on a set of transactions. If an item A is always purchased or preferred together with item B, an inference can be made that the two items have some association between them. A business can leverage this fact to improve sales and user experience; Market Basket Analysis is performed using this technique. In a grocery store, placing associated items on adjacent racks makes the shopping experience more convenient for the customer. Similarly, a store can increase cross-domain sales by offering deals on cold drinks to customers who buy pizza, encouraging customers to spend more, or deliver targeted marketing by emailing customers who buy specific products with recommendations and offers for other products that go with their first purchase.
Association analysis is thus a very essential technique, used by businesses on a large scale to increase their sales.

2.0 Problem definition

To find association rules on the groceries dataset.
The dataset is in CSV format and holds the purchase details of 9835 transactions. The Apriori and FP-Growth algorithms were used to find associations between the items. The evaluation metrics used to find the best association rules are support and confidence.

3.0 Dataset Description

The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions, and the items are aggregated to 169 categories. The Groceries dataset was converted from CSV to ARFF format using Java code.
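
As an alternative to custom Java code, Weka's bundled CSVLoader can perform the same conversion from the command line — a sketch, in which the weka.jar classpath and the file names are assumptions:

java -cp weka.jar weka.core.converters.CSVLoader groceries.csv > groceries.arff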

4.0 Methodology and results

4.1 Algorithm and evaluation metric

We used Apriori and FP-Growth to find the association rules, with the following evaluation metrics:

1. Support

The support value of an item-set X with respect to a set of transactions T is defined as the proportion of transactions in the database that contain X:
                                   supp(X) = |{t ∈ T : X ⊆ t}| / |T|

2. Confidence

The confidence value of a rule X⇒Y, with respect to a set of transactions T, is the proportion of the transactions that contain X which also contain Y:
                                   conf(X⇒Y) = supp(X∪Y)/supp(X)

3. Conviction

Conviction compares the expected frequency with which X would appear without Y if the two were independent against the actual frequency with which X appears without Y:
                                 conviction(X⇒Y) = (1 − supp(Y)) / (1 − conf(X⇒Y)) = P(X)P(¬Y) / P(X∧¬Y)

4. Lift

Lift measures how many times more often X and Y occur together than expected if they were statistically independent. Lift is not downward closed and does not suffer from the rare-item problem. However, lift is susceptible to noise in small databases: rare item-sets with low counts (low probability), which by chance occur a few times (or only once) together, can produce enormous lift values.
lift(X⇒Y) = lift(Y⇒X) = conf(X⇒Y)/supp(Y) = conf(Y⇒X)/supp(X) = P(X∧Y) / (P(X)P(Y))

5. Leverage

Leverage measures the difference between how often X and Y appear together in the data set and how often they would be expected to appear together if X and Y were statistically independent. The rationale in a sales setting is to find out how many more units (items X and Y together) are sold than expected from the independent sales.
PS(X⇒Y) = leverage(X⇒Y) = supp(X∪Y) − supp(X)supp(Y) = P(X∧Y) − P(X)P(Y)
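
As a worked check of these definitions, take the strongest pairwise rule from our results below (other_vegetables ⇒ whole_milk): out of 9835 transactions, 1903 contain other_vegetables, 2513 contain whole_milk, and 736 contain both.

supp(X∪Y) = 736/9835 ≈ 0.075
conf(X⇒Y) = 736/1903 ≈ 0.39
lift(X⇒Y) = (736/1903) / (2513/9835) ≈ 1.51
leverage(X⇒Y) = 736/9835 − (1903/9835)(2513/9835) ≈ 0.03
conviction(X⇒Y) = (1 − 2513/9835) / (1 − 736/1903) ≈ 1.21

These agree with the conf, lift, lev and conv values Weka reports for this rule in Section 4.2.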

4.2 Results 

The screenshots and the association rules obtained are as below:
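
The rules were obtained through Weka's graphical interface, but the same runs can also be reproduced from the command line. A sketch for the first configuration (the weka.jar classpath and the file name groceries.arff are assumptions; -M sets the lower bound on minimum support, -C the minimum confidence, and -N the number of rules to report):

java -cp weka.jar weka.associations.Apriori -t groceries.arff -M 0.07 -C 0.1 -N 20
java -cp weka.jar weka.associations.FPGrowth -t groceries.arff -M 0.07 -C 0.1 -N 20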

1. Apriori

Lower Bound On Support = 0.07 (688 instances) :: Lower Bound On Confidence = 0.1

1. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39) 
2. whole_milk=t 2513 ==> other_vegetables=t 736 conf:(0.29)

Lower Bound On Support = 0.05 (492 instances) :: Lower Bound On Confidence = 0.1

1. yogurt=t 1372 ==> whole_milk=t 551 conf:(0.4)
2. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39)
3. rolls/buns=t 1809 ==> whole_milk=t 557 conf:(0.31)
4. whole_milk=t 2513 ==> other_vegetables=t 736 conf:(0.29)
5. whole_milk=t 2513 ==> rolls/buns=t 557 conf:(0.22)
6. whole_milk=t 2513 ==> yogurt=t 551 conf:(0.22)

Lower Bound On Support = 0.04 (393 instances) :: Lower Bound On Confidence = 0.1

1. root_vegetables=t 1072 ==> whole_milk=t 481 conf:(0.45) 
2. root_vegetables=t 1072 ==> other_vegetables=t 466 conf:(0.43) 
3. tropical_fruit=t 1032 ==> whole_milk=t 416 conf:(0.4) 
4. yogurt=t 1372 ==> whole_milk=t 551 conf:(0.4) 
5. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39) 
6. yogurt=t 1372 ==> other_vegetables=t 427 conf:(0.31) 
7. rolls/buns=t 1809 ==> whole_milk=t 557 conf:(0.31) 
8. whole_milk=t 2513 ==> other_vegetables=t 736 conf:(0.29) 
9. other_vegetables=t 1903 ==> root_vegetables=t 466 conf:(0.24) 
10. rolls/buns=t 1809 ==> other_vegetables=t 419 conf:(0.23) 
11. soda=t 1715 ==> whole_milk=t 394 conf:(0.23) 
12. other_vegetables=t 1903 ==> yogurt=t 427 conf:(0.22) 
13. whole_milk=t 2513 ==> rolls/buns=t 557 conf:(0.22) 
14. other_vegetables=t 1903 ==> rolls/buns=t 419 conf:(0.22) 
15. whole_milk=t 2513 ==> yogurt=t 551 conf:(0.22) 
16. whole_milk=t 2513 ==> root_vegetables=t 481 conf:(0.19) 
17. whole_milk=t 2513 ==> tropical_fruit=t 416 conf:(0.17) 
18. whole_milk=t 2513 ==> soda=t 394 conf:(0.16)

Lower Bound On Support = 0.02 (197 instances) :: Lower Bound On Confidence = 0.2

1. yogurt=t other_vegetables=t 427 ==> whole_milk=t 219 conf:(0.51) 
2. butter=t 545 ==> whole_milk=t 271 conf:(0.5) 
3. curd=t 524 ==> whole_milk=t 257 conf:(0.49) 
4. other_vegetables=t root_vegetables=t 466 ==> whole_milk=t 228 conf:(0.49) 
5. whole_milk=t root_vegetables=t 481 ==> other_vegetables=t 228 conf:(0.47) 
6. domestic_eggs=t 624 ==> whole_milk=t 295 conf:(0.47) 
7. whipped/sour_cream=t 705 ==> whole_milk=t 317 conf:(0.45) 
8. root_vegetables=t 1072 ==> whole_milk=t 481 conf:(0.45) 
9. root_vegetables=t 1072 ==> other_vegetables=t 466 conf:(0.43) 
10. frozen_vegetables=t 473 ==> whole_milk=t 201 conf:(0.42) 
11. margarine=t 576 ==> whole_milk=t 238 conf:(0.41) 
12. beef=t 516 ==> whole_milk=t 209 conf:(0.41) 
13. tropical_fruit=t 1032 ==> whole_milk=t 416 conf:(0.4) 
14. whipped/sour_cream=t 705 ==> other_vegetables=t 284 conf:(0.4) 
15. yogurt=t 1372 ==> whole_milk=t 551 conf:(0.4) 
16. pip_fruit=t 744 ==> whole_milk=t 296 conf:(0.4) 
17. yogurt=t whole_milk=t 551 ==> other_vegetables=t 219 conf:(0.4) 
18. brown_bread=t 638 ==> whole_milk=t 248 conf:(0.39) 
19. other_vegetables=t 1903 ==> whole_milk=t 736 conf:(0.39) 
20. pork=t 567 ==> whole_milk=t 218 conf:(0.38)

2. FP Growth

Lower Bound On Support = 0.07 (688 instances) :: Lower Bound On Confidence = 0.1

1. [other_vegetables=t]: 1903 ==> [whole_milk=t]: 736 <conf:(0.39)> lift:(1.51) lev:(0.03) conv:(1.21) 
2. [whole_milk=t]: 2513 ==> [other_vegetables=t]: 736 <conf:(0.29)> lift:(1.51) lev:(0.03) conv:(1.14)

Lower Bound On Support = 0.05 (492 instances) :: Lower Bound On Confidence = 0.1
1. [yogurt=t]: 1372 ==> [whole_milk=t]: 551 <conf:(0.4)> lift:(1.57) lev:(0.02) conv:(1.24) 
2. [other_vegetables=t]: 1903 ==> [whole_milk=t]: 736 <conf:(0.39)> lift:(1.51) lev:(0.03) conv:(1.21) 
3. [rolls/buns=t]: 1809 ==> [whole_milk=t]: 557 <conf:(0.31)> lift:(1.21) lev:(0.01) conv:(1.07) 
4. [whole_milk=t]: 2513 ==> [other_vegetables=t]: 736 <conf:(0.29)> lift:(1.51) lev:(0.03) conv:(1.14) 
5. [whole_milk=t]: 2513 ==> [rolls/buns=t]: 557 <conf:(0.22)> lift:(1.21) lev:(0.01) conv:(1.05) 
6. [whole_milk=t]: 2513 ==> [yogurt=t]: 551 <conf:(0.22)> lift:(1.57) lev:(0.02) conv:(1.1)

Lower Bound On Support = 0.04 (393 instances) :: Lower Bound On Confidence = 0.1
FPGrowth found 124 rules (displaying top 20)

1. [other_vegetables=t, yogurt=t]: 427 ==> [whole_milk=t]: 219 <conf:(0.51)> lift:(2.01) lev:(0.01) conv:(1.52)
2. [butter=t]: 545 ==> [whole_milk=t]: 271 <conf:(0.5)> lift:(1.95) lev:(0.01) conv:(1.48)
3. [curd=t]: 524 ==> [whole_milk=t]: 257 <conf:(0.49)> lift:(1.92) lev:(0.01) conv:(1.46)
4. [other_vegetables=t, root_vegetables=t]: 466 ==> [whole_milk=t]: 228 <conf:(0.49)> lift:(1.91) lev:(0.01) conv:(1.45)
5. [whole_milk=t, root_vegetables=t]: 481 ==> [other_vegetables=t]: 228 <conf:(0.47)> lift:(2.45) lev:(0.01) conv:(1.53)
6. [domestic_eggs=t]: 624 ==> [whole_milk=t]: 295 <conf:(0.47)> lift:(1.85) lev:(0.01) conv:(1.41)
7. [whipped/sour_cream=t]: 705 ==> [whole_milk=t]: 317 <conf:(0.45)> lift:(1.76) lev:(0.01) conv:(1.35)
8. [root_vegetables=t]: 1072 ==> [whole_milk=t]: 481 <conf:(0.45)> lift:(1.76) lev:(0.02) conv:(1.35)
9. [root_vegetables=t]: 1072 ==> [other_vegetables=t]: 466 <conf:(0.43)> lift:(2.25) lev:(0.03) conv:(1.42)
10. [frozen_vegetables=t]: 473 ==> [whole_milk=t]: 201 <conf:(0.42)> lift:(1.66) lev:(0.01) conv:(1.29)
11. [margarine=t]: 576 ==> [whole_milk=t]: 238 <conf:(0.41)> lift:(1.62) lev:(0.01) conv:(1.26)
12. [beef=t]: 516 ==> [whole_milk=t]: 209 <conf:(0.41)> lift:(1.59) lev:(0.01) conv:(1.25)
13. [tropical_fruit=t]: 1032 ==> [whole_milk=t]: 416 <conf:(0.4)> lift:(1.58) lev:(0.02) conv:(1.25)
14. [whipped/sour_cream=t]: 705 ==> [other_vegetables=t]: 284 <conf:(0.4)> lift:(2.08) lev:(0.02) conv:(1.35)
15. [yogurt=t]: 1372 ==> [whole_milk=t]: 551 <conf:(0.4)> lift:(1.57) lev:(0.02) conv:(1.24)
16. [pip_fruit=t]: 744 ==> [whole_milk=t]: 296 <conf:(0.4)> lift:(1.56) lev:(0.01) conv:(1.23)
17. [whole_milk=t, yogurt=t]: 551 ==> [other_vegetables=t]: 219 <conf:(0.4)> lift:(2.05) lev:(0.01) conv:(1.33)
18. [brown_bread=t]: 638 ==> [whole_milk=t]: 248 <conf:(0.39)> lift:(1.52) lev:(0.01) conv:(1.21)
19. [other_vegetables=t]: 1903 ==> [whole_milk=t]: 736 <conf:(0.39)> lift:(1.51) lev:(0.03) conv:(1.21)
20. [pork=t]: 567 ==> [whole_milk=t]: 218 <conf:(0.38)> lift:(1.5) lev:(0.01) conv:(1.21)

4.3 Inference

We have obtained the associations between the items. The following items are most often purchased together:
1. other_vegetables=t yogurt=t 427 ==> whole_milk=t 219 conf:(0.51)
2. butter=t 545 ==> whole_milk=t 271 conf:(0.5)
3. curd=t 524 ==> whole_milk=t 257 conf:(0.49)
4. other_vegetables=t root_vegetables=t 466 ==> whole_milk=t 228 conf:(0.49)
5. domestic_eggs=t 624 ==> whole_milk=t 295 conf:(0.47)

5.0 Conclusion

Association rule analysis, when performed on the groceries dataset, revealed strong co-occurrence between some of the items. Thus, if the store makes offers on related products, such as a discount on whole milk on purchase of yoghurt, and/or places milk and vegetables on nearby racks, its sales may increase dramatically.

6.0 References

1. http://snowplowanalytics.com/documentation/recipes/catalog-analytics/market-basket-analysis-identifying-products-that-sell-well-together.html
2. https://en.wikipedia.org/wiki/Affinity_analysis

Credit Scoring on German Credit Dataset


----------------------------------------------------------------------------------------------------------

by Manoj Patra and Nikita Naidu

Click here to download the project

----------------------------------------------------------------------------------------------------------


Abstract

The main aim of this project is to obtain a model to perform credit scoring. Credit scoring, or credit risk assessment, is an important research issue in the banking industry. The major challenge of credit scoring is to identify the profitable customers by predicting the likely defaulters. The data set for the experiment is taken from the UCI Machine Learning repository. The German Credit data set has data on 1000 past credit applicants, described by 20 attributes; each applicant is rated as “Good” or “Bad” credit. The model will help in deciding whether a loan should be granted to a new customer, i.e. whether new applicants present a good or bad credit risk.


----------------------------------------------------------------------------------------------------------


1.0 Introduction

A customer credit scoring model is a statistical method used to predict the probability that a loan applicant or existing borrower will default or become delinquent. It is built from the characteristics of historical sample data in order to isolate the effects of various applicant characteristics on delinquencies and defaults. With credit cards and a variety of personal consumption credit expanding rapidly, the prevention of credit risk has become an issue of great concern to financial institutions. It is therefore essential to establish a credit scoring model matching the customer characteristics, which can provide intelligent support for decision makers. Credit scoring models have been widely studied in the areas of statistics, machine learning, and artificial intelligence. The advantages of credit scoring include reducing the cost of credit analysis, enabling faster credit decisions, closer monitoring of existing accounts, and prioritizing collections. With the growth of the credit industry and the large loan portfolios under management today, the industry is actively developing more accurate credit scoring models. The main aim of this project is to compare the performance of the typical methods. We consider four different classification methods: Logistic Regression, Multilayer Perceptron, Decision Tree and Support Vector Machine.

2.0 Problem Definition

Credit scoring, or credit risk assessment, is an important research issue in the banking industry. The major challenge of credit scoring is to identify the profitable customers by predicting the likely defaulters, and so to decide whether a loan should be granted to a new customer. We want to obtain a model that may be used to determine whether new applicants present a good or bad credit risk. There are 1000 instances and 20 attributes in the data set. We have to build a model which can predict the class of a new customer, i.e. good or bad credit. We also measure the performance of each model using Accuracy, Sensitivity, Specificity and the ROC Curve to obtain the best model.
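
For reference, taking “Good” as the positive class (an assumption; either class can be treated as positive), these metrics are defined from the counts of true/false positives and negatives as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)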

3.0 Methodology Followed

Classification is a data mining (machine learning) technique used to predict the class label of data instances. In this project, we use four major kinds of classification method to predict the class label: Logistic Regression, Multilayer Perceptron, Support Vector Machine and decision tree induction. We use the WEKA suite to build the different models; their outputs are shown below in tabular form.
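
The models were built in the WEKA graphical interface. As a rough command-line equivalent (a sketch only: the weka.jar classpath, the file name german.arff, and the choice of J48 and SMO as the decision-tree and SVM implementations are all assumptions), an 80%/20% split run would look like:

java -cp weka.jar weka.classifiers.functions.MultilayerPerceptron -t german.arff -split-percentage 80
java -cp weka.jar weka.classifiers.functions.Logistic -t german.arff -split-percentage 80
java -cp weka.jar weka.classifiers.trees.J48 -t german.arff -split-percentage 80
java -cp weka.jar weka.classifiers.functions.SMO -t german.arff -split-percentage 80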



(Comparison of the different models in tabular form)

1. Multilayer Perceptron: 

A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. The output we got from this model is attached below.



2. Logistic Regression:



3. Decision Tree:


4. Support Vector Machine:



5. Tree Generated:


6. Logistic Regression with SMOTE 100%



7. Support Vector Machine with SMOTE 100%



8. ROC Curve



9. Workflow Diagram



4.0 Research and Discussion

First of all, we went through all the models available in the WEKA suite; the best results we got are described above in tabular form. The first result was an accuracy of 73.5% using the Multilayer Perceptron. Then, using Logistic Regression, we got an accuracy of 76.5%, followed by 77.0% and 77.5% using the Decision Tree and Support Vector Machine respectively. All the above results were found by splitting the data set into 80% training data and 20% test data. As this was not the best solution we wanted, we used SMOTE with Logistic Regression and SMOTE with Support Vector Machine. The best results we got with these two models were accuracies of 79.6% and 80.6% using Logistic Regression and Support Vector Machine respectively.
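
For reproducibility, the SMOTE oversampling can be applied as a preprocessing filter before re-running the classifier. A sketch, with file names assumed (in Weka 3.6 SMOTE lives under weka.filters.supervised.instance.SMOTE, while newer versions install it as a separate package; -P 100 oversamples the minority class by 100%, and -c last marks the last attribute as the class):

java -cp weka.jar weka.filters.supervised.instance.SMOTE -i german.arff -o german-smote.arff -c last -P 100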

5.0 Conclusion

Classification is a form of data analysis that extracts models describing important data classes. We have developed an effective and scalable model using SMOTE in combination with a Support Vector Machine and 10-fold cross-validation. We evaluated the model using several metrics, including Accuracy, Sensitivity, Specificity, Mean Absolute Error, Root Mean Squared Error and Relative Absolute Error. 10-fold cross-validation is recommended for accuracy estimation, while significance tests and ROC curves are useful for model selection.