Friday, 7 August 2020

HANA Machine Learning (ML): Association Analysis with the Frequent Pattern (FP) Growth Algorithm using Python

Association analysis is the process of finding interesting relationships in large datasets. Grocery stores use it for things like targeted coupons, packaged deals, and deciding which items are displayed together on the shelves. Some common examples of data associations are:

“People who buy bread tend to buy butter or jam as well, because breakfast normally goes with bread and butter.”

“People who buy diapers tend to buy beer as well, because raising kids is a stressful job.”

There is a lot that grocery stores do, and can do, with association analysis. A number of algorithms are available, and there are some very good explanations on Git and in SCN blogs (a link in the reference section covers most association analysis algorithms with code). I like the FP (frequent pattern) Growth algorithm, and in this article I’ll try to shed some light on it using the powerful HANA PAL (Predictive Analysis Library), along with some details of the Python code and steps.

FP-Growth is an algorithm that finds frequent patterns in transactions without generating candidate itemsets.

In PAL, the FP-Growth algorithm is extended to find association rules in three steps:

1. Converts the transactions into a compressed frequent pattern tree (FP-Tree);
2. Recursively finds frequent patterns from the FP-Tree;
3. Generates association rules based on the frequent patterns found in Step 2.
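The three steps above can be sketched in plain Python. This is a small illustrative version, not PAL's implementation; the names `Node`, `build_tree`, `mine` and `rules`, and the toy baskets, are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


class Node:
    """One FP-tree node: an item, a count, a parent link and child links."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(weighted_transactions, min_count):
    """Step 1: compress (items, weight) pairs into an FP-tree."""
    counts = Counter()
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root, links = Node(None, None), defaultdict(list)
    for items, weight in weighted_transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        path = sorted((i for i in set(items) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                links[item].append(node.children[item])
            node = node.children[item]
            node.count += weight
    return frequent, links


def mine(weighted_transactions, min_count, suffix=frozenset()):
    """Step 2: recursively collect {itemset: count} from conditional trees."""
    frequent, links = build_tree(weighted_transactions, min_count)
    patterns = {}
    for item, count in frequent.items():
        itemset = suffix | {item}
        patterns[itemset] = count
        # Conditional pattern base: the prefix path above every node of `item`.
        base = []
        for node in links[item]:
            prefix, parent = [], node.parent
            while parent.item is not None:
                prefix.append(parent.item)
                parent = parent.parent
            if prefix:
                base.append((prefix, node.count))
        patterns.update(mine(base, min_count, itemset))
    return patterns


def rules(patterns, n_transactions, min_confidence):
    """Step 3: derive (antecedent, consequent, support, confidence, lift)."""
    found = []
    for itemset, count in patterns.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                confidence = count / patterns[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / (patterns[consequent] / n_transactions)
                    found.append((set(antecedent), set(consequent),
                                  count / n_transactions, confidence, lift))
    return found


baskets = [["bread", "butter"], ["bread", "butter", "jam"],
           ["bread", "jam"], ["beer", "diapers"], ["bread", "butter", "beer"]]
patterns = mine([(basket, 1) for basket in baskets], min_count=2)
for rule in rules(patterns, len(baskets), min_confidence=0.7):
    print(rule)
```

No candidate itemsets are enumerated here: each recursion only looks at the prefix paths that actually occur above an item in the tree.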

The intention here is to keep complexity low so that the approach is easy to explain. Association analysis offers other methods, such as Apriori, but I use only FP-Growth to keep the focus and make it easier to understand.

Indicators

1. Support
2. Confidence
3. Lift

Suppose we need to find support, confidence and lift for two products, A and B.

Support:

Support of Product A to B = (Transactions involving both Product A and Product B) / (Total transactions).

A low support value tells us that the item combination occurs in only a small share of all transactions.

Confidence:

Confidence of Product A to B = (Transactions involving both Product A and Product B) / (Total transactions involving Product A).

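Lift:

Lift measures how much more likely Product B is to be sold when Product A is sold, compared with how often Product B is sold overall.

Lift = (Confidence of A to B) / (Fraction of transactions containing Product B)

A lift greater than 1 indicates a positive association between A and B; a lift close to 1 means the two are bought largely independently of each other.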
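The three indicators can be computed directly from their definitions. The sketch below uses a made-up list of baskets and a hypothetical helper named `association_metrics`; it is only meant to make the formulas concrete.

```python
def association_metrics(transactions, a, b):
    """Return (support, confidence, lift) for the rule a -> b."""
    total = len(transactions)
    with_a = sum(1 for t in transactions if a in t)
    with_b = sum(1 for t in transactions if b in t)
    with_both = sum(1 for t in transactions if a in t and b in t)

    support = with_both / total          # share of all baskets holding A and B
    confidence = with_both / with_a      # share of A-baskets that also hold B
    lift = confidence / (with_b / total) # confidence relative to B's base rate
    return support, confidence, lift


# Toy data, invented for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"beer", "diapers"},
    {"bread", "butter", "beer"},
]

support, confidence, lift = association_metrics(baskets, "bread", "butter")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.60 confidence=0.75 lift=1.25
```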

Prerequisite


To try this hands-on, the software and environment below need to be available. I used HANA 2.0 with XSA, but this can be done without XSA as well; there is an option to host the server only, without XSA.

Environments/Software

HANA 2.0 with XSA Hosted on Google Cloud.

Python 3.7, Anaconda 1.9.12

Jupyter Notebook 6.0.3

HANA ML 1.0.8

Details

To demonstrate this I used the HANA machine learning library (hana_ml) installed on Python 3.x with Anaconda. The database can be any HANA instance, either on your laptop or hosted on a cloud such as Amazon.

To work with HANA ML, PAL needs to be enabled on SAP HANA. A code snippet for this is available at the link below.

https://github.com/saphanaacademy/PAL/blob/master/Code%20Snippets/PAL%20146%20Getting%20Started%20with%20HANA%2020%20SPS02.sql

This code also created the user “devuser” under my tenant database (HXE).

Installing HANA Client on Python

Since I already had the HANA client installed, I didn’t install it again. It can be installed easily with the command below.

pip install hdbcli

In my case I just ran pip show hdbcli to confirm the installation.

The following machine learning libraries are also installed on my machine.
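A rough Python equivalent of pip show for these packages is sketched below. It assumes Python 3.8 or later (where `importlib.metadata` is in the standard library) and assumes the PyPI package names `hdbcli` and `hana-ml`; the helper name `installed_versions` is made up for the example.

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names are assumptions; adjust them to what your environment uses.
print(installed_versions(["hdbcli", "hana-ml"]))
```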

Creating HANA Table to store data

I created a table named “FP_GROWTH_ASSOCIATION” under devuser, with only two fields: Transaction and Items.

Table Structure
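A two-column table of this kind could be created from Python roughly as follows. This is a sketch under assumptions: the schema name `DEVUSER`, the column names and the column types are guesses for illustration, and `connection` stands for any open hdbcli (DB-API style) connection.

```python
def create_table_sql(schema, table):
    """Build the CREATE TABLE statement; column names and types are assumed."""
    return (
        f'CREATE COLUMN TABLE "{schema}"."{table}" ('
        '"TRANSACTION" INTEGER, '
        '"ITEMS" NVARCHAR(100))'
    )


def create_table(connection, schema="DEVUSER", table="FP_GROWTH_ASSOCIATION"):
    """Run the statement on an open DB-API connection (for example hdbcli)."""
    cursor = connection.cursor()
    try:
        cursor.execute(create_table_sql(schema, table))
    finally:
        cursor.close()


print(create_table_sql("DEVUSER", "FP_GROWTH_ASSOCIATION"))
```

The identifiers are double-quoted so that the statement does not depend on whether a column name happens to be a reserved word.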

Data for this exercise

Kaggle is an open platform for various datasets. I used the “Random shopping cart” data, which can be found at the link below, though I used only two columns (Transaction and Items).

https://www.kaggle.com/acostasg/random-shopping-cart

I loaded this data into the HANA table manually from a flat file, giving 16,753 records.

The algorithm requires data with no null values and no duplicates. I removed the duplicates at this stage; the null values are removed in a later step.

Data Glimpse

The data consists of shopping-cart transactions, each containing different grocery items.

Coming to Python again

I imported all the libraries needed to support this algorithm and connected to the HANA database.

hxehost is the hostname of the HANA server

39015 is the port

devuser is the username

These details can be found on your SAP HANA database.

Connection command syntax

connection_context = dataframe.ConnectionContext(URL, PORT, UN, PWD)

I then checked whether the connection to HANA was established.
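A minimal sketch of the connection step, assuming the hana_ml package is installed. The environment variable names (`HANA_HOST`, `HANA_PORT`, `HANA_USER`, `HANA_PASSWORD`) and the helper names are hypothetical; the defaults simply mirror the host, port and user mentioned above.

```python
import os


def connection_settings(env=os.environ):
    """Read connection details from a mapping, falling back to the article's values."""
    return {
        "address": env.get("HANA_HOST", "hxehost"),
        "port": int(env.get("HANA_PORT", "39015")),
        "user": env.get("HANA_USER", "DEVUSER"),
        "password": env.get("HANA_PASSWORD", ""),  # never hard-code a password
    }


def connect(settings):
    """Open a hana_ml ConnectionContext and report whether it is connected."""
    from hana_ml import dataframe  # imported here so the sketch loads without hana_ml

    connection_context = dataframe.ConnectionContext(
        settings["address"], settings["port"],
        settings["user"], settings["password"],
    )
    # The underlying hdbcli connection exposes isconnected().
    print("Connected:", connection_context.connection.isconnected())
    return connection_context


# Usage against a reachable HANA instance:
# connection_context = connect(connection_settings())
```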

Assigning Dataframe

Think of a DataFrame as a 2D table in Python, like a spreadsheet or a simple database table, with columns of different types.
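Pointing a hana_ml DataFrame at the table might look like the sketch below. A hana_ml DataFrame describes a query that runs in HANA, so no rows are copied into Python until `collect()` is called. The schema name `DEVUSER` and the helper name `load_transactions` are assumptions for illustration.

```python
def load_transactions(connection_context,
                      table="FP_GROWTH_ASSOCIATION", schema="DEVUSER"):
    """Return a hana_ml DataFrame that refers to the table (no rows fetched yet)."""
    return connection_context.table(table, schema=schema)


# Usage with an open ConnectionContext:
# df = load_transactions(connection_context)
# print(df.head(5).collect())  # collect() pulls the rows into a pandas DataFrame
```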

Dropping nulls

The algorithm requires data with no null values, so the nulls are dropped from the DataFrame at this point.

The describe command shows whether any null values remain.
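These two steps could be written as in the sketch below, assuming `df` is a hana_ml DataFrame. `dropna()` and `describe()` are hana_ml DataFrame methods; the helper name `drop_nulls` is made up, and the exact columns in the summary may differ between hana_ml versions.

```python
def drop_nulls(df):
    """Drop rows containing nulls, then summarise the cleaned data."""
    clean = df.dropna()                   # still evaluated inside HANA
    summary = clean.describe().collect()  # small summary fetched into pandas
    return clean, summary


# Usage with a hana_ml DataFrame:
# df, summary = drop_nulls(df)
# print(summary)  # check the null counts reported per column
```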

Importing the FP-Growth algorithm

Assigning parameter values for the algorithm

Details of each parameter are available on help.sap.com via the link provided in the reference section.
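Importing and running FP-Growth could look roughly like this. It is a sketch, not the exact call used for the results in this article: the threshold values are placeholders, the helper names are hypothetical, and the handling of `conn_context` assumes that hana_ml 1.0.x expects the connection in the constructor while later releases do not.

```python
def fp_growth_params(min_support=0.2, min_confidence=0.5, min_lift=1.0):
    """Collect FPGrowth keyword arguments; the default thresholds are placeholders."""
    for name, value in (("min_support", min_support),
                        ("min_confidence", min_confidence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return {
        "min_support": min_support,
        "min_confidence": min_confidence,
        "min_lift": min_lift,
        "relational": False,  # one combined result table
    }


def mine_rules(df, connection_context=None, **thresholds):
    """Fit PAL FP-Growth on a hana_ml DataFrame and return the rules DataFrame."""
    from hana_ml.algorithms.pal.association import FPGrowth

    params = fp_growth_params(**thresholds)
    if connection_context is not None:
        # Assumption: older 1.0.x releases take the connection as conn_context.
        params["conn_context"] = connection_context
    fpg = FPGrowth(**params)
    fpg.fit(df)  # by default the first column is the transaction, the second the item
    return fpg.result_


# Usage with a hana_ml DataFrame:
# rules = mine_rules(df, connection_context=connection_context, min_support=0.2)
# print(rules.collect())
```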

Result

How to read it: consider the first line. It shows that if someone buys poultry, there is a 78% chance he/she will buy vegetables too. The support is good and the lift is above 1, which indicates a positive association between these items.

Math

I just want to highlight how the algorithm calculates these values, to better understand how it finds associations.

There are 378 transactions involving both poultry and vegetables.

There are 1140 transactions in total for this dataset.

Total transactions involving poultry are 480.

Total transactions involving vegetables are 842.

The fraction of all transactions that contain vegetables is 842/1140 = 0.7386.

Support: 378/1140 = 0.33

Confidence: 378/480 = 0.7875 (shown as 0.78 when truncated to two decimals)

Lift: 0.7875/0.7386 = 1.066
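The same arithmetic can be checked in a few lines of Python, using only the four counts quoted above:

```python
both, total, poultry, vegetables = 378, 1140, 480, 842

support = both / total                    # poultry and vegetables together
confidence = both / poultry               # vegetables given poultry
lift = confidence / (vegetables / total)  # confidence relative to vegetables' base rate

print(f"support={support:.4f} confidence={confidence:.4f} lift={lift:.4f}")
# prints: support=0.3316 confidence=0.7875 lift=1.0662
```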

Note: This is just for demonstration purposes, to show how HANA ML with Python can be leveraged for machine learning using PAL. The data is an open dataset and I have not verified each transaction myself.

I have shown just one example of FP-Growth. One can keep filtering the data as needed to extract useful information, and even use the relational option by setting its value to “True”.
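As one example of such filtering, the rules could be narrowed down and ordered by lift. The sketch assumes `rules` is the hana_ml DataFrame returned by FP-Growth and that it has a `LIFT` column, as in the PAL result table; the helper name `strong_rules` is made up.

```python
def strong_rules(rules, min_lift=1.0):
    """Keep rules whose lift exceeds min_lift, strongest first."""
    return rules.filter(f"LIFT > {min_lift}").sort("LIFT", desc=True)


# Usage with the FP-Growth result DataFrame:
# print(strong_rules(rules, min_lift=1.05).collect())
```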
