Friday, 19 July 2019

End to End Model of Data Analysis & Prediction using Python on SAP HANA data

This blog post helps to connect with SAP HANA DB (Version 1.0 SPS12) then extract the data from HANA table/View and analyze the data using Python Pandas library. Then you can clean and select independent variables/features data to feed the Machine learning algorithms to predict dependent variables or find insights.

Scenario: I am taking the state wise startup company’s expenditure (R&D Spend, Administration Spend, and Marketing Spend) and profit data Source of the data in SAP HANA DB.

Goal is to predict the Profit for the given set of expenditure values. I am not explaining details about the ML Algorithm and the parameter tuning here. I would like to show the end to end process of Data extraction from SAP HANA DB, analyzing, cleaning, feature selection, and applying machine learning model and finally write back the results and ML algorithm performance metrics to the HANA tables.
The linear regression is the most commonly used model in research and business and is the simplest to understand, so using the random forest regression method we will predict the Profit.

The below diagram shows ML Prediction life cycle and steps fallowed in the use case.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

The basic steps involved in this process are:

1. Check the HANA Table data and analyze it using SQL in HANA Studio/WEB IDE.   (Make sure you have required privileges to do DML Operations on the tables in SAP HANA DB.)

2. Import pyodbc, pandas, Sklearn, Matplotlib, seaborn libraries in python.

3. Create connection to HANA data base and execute required SQL.

4. Extract all the historical data into data frame object and start analyzing it in Python using pandas.

5. Do the feature engineering, data cleaning and then feed the final set of independent variables to Machine learning algorithm (Random Forest) to predict dependent variable (Profit).

6. Analyze the Machine learning algorithm metrics and fine tune for the better accuracy by repeating the step 5.Store the Machine learning algorithm metrics in log table and also update the predicted value of historical data into the HANA Table.

7. For the new data set, create the python program which reads the new data using pyodbc connection and predict the dependent variable (Profit) and updates the actual transactional table for reporting.

8. Schedule this program and keep monitoring the model metrics and predicted value.

1. Check the HANA Table data and analyze it using SQL.


I have created two tables, one contains the actual company data which is used to store the transactional data. Second one to store the metrics of the Machine learning algorithm (Ex: MAE, R Squared, MAPE, RMSE, Accuracy etc.)

Please find the structures of both tables below.

Main transactional data table structure.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

ML Metrics table to store the evaluation parameters.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

The historical data (which is lesser than current month) before prediction with state wise expenditure and profit shown below. I added Predicted profit column as well, to store the predicted value by ML Algorithm using python and it is updated as NULL now.

SELECT * FROM “SCHEMA”.”STARTUP_DATA2″

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

Let us analyze the data using SQL in HANA Studio with available functions, we can find mean, standard deviation,median,max,min and count of nulls using below SQL. (You can try for all the measures)

Select State, sum(Profit),avg(Profit),count(Profit),max(Profit),

count( distinct profit) as “unique”,stddev(Profit),min(Profit),median(Profit),

sum(case when Profit IS NULL THEN 1 ELSE 0 END) AS “NO OF NULLS”

from “SCHEMA”.”STARTUP_DATA2″ group by State

order by State;

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

2. Import pyodbc, pandas, Sklearn libraries in python.


Now I am coming to Python scripting interface (Jupyter or spyder).

Import the required libraries.

PYODBC is an open source Python module that makes accessing ODBC databases simple. It implements the DB API 2.0 specification but is packed with even more Pythonic convenience. Using pyodbc, you can easily connect Python applications to data sources with an ODBC driver.

Typically, pyodbc is installed like any other Python package by running:

pip install pyodbc

from a Windows DOS prompt or Unix shell.

To install this package with conda run:
conda install -c anaconda pyodbc

Seaborn: Seaborn is a graphic library built on top of Matplotlib. It allows to make your charts prettier, and facilitates some of the common data visualization needs

Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Numpy: NumPy is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data

Sklearn: Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy

3. Create connection to HANA data base and execute required SQL.


Establish the connection and once a connection has been established, your application can execute selects, inserts, or other ODBC operations supported by your driver and database.

conn = pyodbc.connect(‘DRIVER={HDBODBC};SERVERNODE=IPADDRESS:30115;SERVERDB=HV1;UID=USERID;PWD=PASSWORD’)

#Open connection to SAP HANA and check for count of records.

# check if table has entries

cursor = conn.cursor()

sql_query1 = ‘SELECT count(*) FROM SCHEMA.STARTUP_DATA2’

dcount = pd.read_sql(sql_query1, conn)

dcount

print (‘Table exists and contains’, dcount.head(1), ‘records’ )

4. Extract the data into data frame object and start analyzing it in Python using pandas.


Execute the Select SQL and read the data and save it to data frame. Once you get the data into data frame, you can apply all statistical functions to analyze the data as shown below.

#querying the sap hana db data and store in data frame

sql_query2 = ‘SELECT * FROM SCHEMA.STARTUP_DATA2’

df = pd.read_sql(sql_query2, conn)

df.head()

#USING GROUP BY CLAUSE TO ANALYZE THE DATA

df1.groupby(‘STATE’).agg({‘PROFIT’:[‘sum’,’mean’,’count’,’max’,’std’,’min’,’median’]}).round(0)

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

5. Do the feature engineering, data cleaning, feed the final set of independent variables to Machine learning algorithm (Random Forest) to predict dependent variable


Check for the missing values or nulls using the functions and replace the values with mean/mode accordingly.

For Example I have two empty values in Rnd Spend column, so replacing with Mean/Median value. If the column is having more null values, then you can drop the column form the analysis.

Encode the nominal values, for example State has 3 unique values (‘New York’,’California’,’Florida’) replacing them with 0,1,2 codes using map function

Map ({‘New York’:0,’California’:1,’Florida’:2})

Now find the correlation between the dependent variable and independent variables to finalize the final independent variables/features for prediction. After observing the data choosing (R&D Spend, Administration Spend, and Marketing Spend) as independent variables and profit as a dependent variable.

Now I am using random forest regression to predict the profit.

Random forests or random decision forests are an ensemble learning method for classification and regression (and other tasks) .Random forests are bagged decision tree models that train on a subset of features on each split (in each iteration).

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

6. Store the Machine learning algorithm metrics in log table and also update the predicted value into the HANA Table.


Using the insert and update statements you can send the data back to the data base tables (Ex: ML Metrics and Predicted values). Take care of the primary key while doing insertions.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

SAP HANA table data :

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

updating the predicted profit value to data frame.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

Check the data in SAP HANA table to see the updated values in predicted profit column.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

7. For the new data set, create the python program which reads the new data using pyodbc connection and predict the dependent variable (Profit) and updates the actual transactional table for reporting.


In a business new transactional data will be flowing into Tables/data ware house. (In my case SAP HANA SQL data ware house). For the new data set, suppose you want to know what could be the profit if I spend “X” expenditure for next month.

Analyze whether this cost increase/decrease gives you return on investment/profit. Then,

Create the Python script in which you perform following tasks

◈ fetch the new dataset and process it through the predictor/model

◈ Update predicted value back to the DB.

◈ Schedule this python script using windows Scheduler/ python scheduler

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

8. Schedule this program and keep monitoring the model metrics and predicted value using HANA Models/reports.


You can schedule the Python script using Windows Scheduler/ using the schedule Library.

Now in SAP HANA Client tools WEB IDE/ HANA Studio, Create the HANA model to join the main transactional data with dimension tables like (Time, Location) for reporting purpose.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

State wise Actual Vs Predicted Profit Comparison Chart.

SAP HANA Online Exam, SAP HANA Tutorials and Materials, SAP HANA Certifications, SAP HANA Study Materials

When using Python IDE’s such as Jupyter, the data is persisted to the client with the above approach and this means more processing time when you have large data set, which leads to drop the productivity of Data Scientists.

This is where the SAP HANA Data Frame can add real value to a Data Scientist’s work. More features and capabilities are included in SAP HANA 2.0 SPS03 Version to analyze / address the data science use cases with Python driver (hdbcli) and then the Python Client API for machine learning algorithms.

No comments:

Post a Comment