A small write-up on the HANA ML dataframe. Writing about what you learn along with your day-to-day job is a rewarding way to share knowledge, so, holding my passion for technology in both hands, here comes my first post of 2020. The topic is interesting enough, and everyone's favorite: HANA Machine Learning, and this time it's about the dataframe.
Dataframe methods, their meaning, and their Python implementation.
Let’s go step by step.
HANA ML Dataframe – A Skeleton for data
◉ Represents a frame that is backed by a database SQL statement; it can also be created by the table statement.
◉ The SAP HANA dataframe provides a set of methods for analyzing the data in SAP HANA without bringing the data to the client.
This module represents a database query as a dataframe. Most operations are designed to not bring data back from the database unless explicitly requested.
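This lazy design is worth pausing on. The sketch below is an illustration of the idea only (it is not the hana_ml implementation): each operation composes a new SELECT string around the previous one, and nothing touches the database until the result is explicitly requested.

```python
# Minimal sketch of the lazy-SQL idea behind HANA ML dataframes.
# Illustration only -- NOT the hana_ml implementation.

class LazyFrame:
    def __init__(self, select_statement):
        # The SQL is stored as a string; it is never executed here
        self.select_statement = select_statement

    def filter(self, condition):
        # Each operation wraps the previous statement in a new SELECT
        return LazyFrame(
            'SELECT * FROM ({}) AS dt WHERE {}'.format(self.select_statement, condition)
        )

    def head(self, n):
        return LazyFrame(
            'SELECT TOP {} * FROM ({}) AS dt'.format(n, self.select_statement)
        )

df = LazyFrame('SELECT * FROM "ADULT_DATA_FULL_TBL"')
df2 = df.filter('AGE > 60').head(10)
print(df2.select_statement)
# Only a string has been composed so far; no query has hit the database.
```

This is why chaining many dataframe operations is cheap: you are building SQL, not moving data.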
DataFrame Methods –
Import dataframe library from hana_ml package.
import hana_ml
from hana_ml import dataframe ##import dataframe from HANA ML
Before diving deep into dataframes, let's first connect to the SAP HANA system and load some data into a dataframe for manipulation, filtering, slicing, etc.
from data_load_utils import DataSets , Settings
url , port , usr , pwd = Settings.load_config(r"C:\Users\abc\config\e2edata.ini")
ConnectionContext – This represents a connection to the SAP HANA system. Let's code this:
connection_context = dataframe.ConnectionContext(url,port,usr,pwd) ##pass user id and password host and port for connection
connection_context.connection.isconnected() ## returns True in the console if the connection succeeded
Load Data – Create a dataframe from a table. If you choose the table method, pass the table name from the corresponding schema.
dataset1 = connection_context.table("ADULT_DATA_FULL_TBL") ##Table Name of Schema
##Another way of achieving it
dataset1 = connection_context.sql("SELECT * FROM ADULT_DATA_FULL_TBL") ##SQL Query for dataframe
This returns a dataframe which is just a skeleton; it doesn't hold any data yet. Most operations on dataframes are optimized to execute inside SAP HANA on the fly, which results in a huge performance benefit.
dataset1.select_statement ## This is the select statement for the table which we used to create the dataframe
Output
'SELECT * FROM "ADULT_DATA_FULL_TBL"'
Execute the dataset1 Jupyter cell and the output is a HANA ML dataframe. It doesn't contain any data as of now; we need to call the .collect() method, which executes the select statement and gives the result back to the client (in our case the Jupyter notebook or Python environment).
count() – Return the number of records in the dataframe.
dataset1.count()
Output
48822
collect() – This triggers the select statement on the dataframe and fetches data from SAP HANA. Use it when you need the data in the Python environment for display, validation, or any other requirement. The design goal is to perform calculations on the fly and hand the result back to the Python environment for further processing.
example:
dataset1.head(10).collect() ## Use Collect method to load data from sap hana into client , this will fetch first 10 records
head(10) means top 10 records
result :
columns – List the columns in the dataset.
dataset1.columns
Output
['ID',
'AGE',
'WORKCLASS',
'FNLWGT',
'EDUCATION',
'EDUCATIONNUM',
'MARITALSTATUS',
'OCCUPATION',
'RELATIONSHIP',
'RACE',
'CAPITALGAIN',
'CAPITALLOSS',
'HOURSPERWEEK',
'NATIVECOUNTRY',
'INCOME']
Limit rows during display – just pass the lower and upper limits in square brackets. This is useful to get an overview of the data when you want to sample something.
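The article doesn't show code for the square-bracket limiting, so here is a minimal sketch. It assumes the data has already been brought to the client with collect(), which returns a pandas dataframe; the small stand-in frame below mimics the ID/AGE values used elsewhere in this post.

```python
import pandas as pd

# Stand-in for dataset1.head(5).collect(); in the article this comes from SAP HANA
pdf = pd.DataFrame({'ID': [28, 75, 196, 264, 303],
                    'AGE': [27, 57, 48, 52, 35]})

# Pass the lower and upper limits in square brackets to sample a range of rows:
# row 1 (inclusive) up to row 4 (exclusive)
sample = pdf[1:4]
print(sample)
```

Note that this slicing happens at the client on the collected pandas dataframe; to limit rows on the database side, use head() before collect() as shown earlier.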
Copy a dataframe into a new dataframe – useful when you don't want to change or alter the original dataframe.
dataset3 = dataset1.head(100).collect()
dataset3 ##it will display the 100 records copied from dataset1
Drop duplicates – drop duplicate records from the dataset (data cleansing). This is required when you feed the data into algorithms that need unique entries.
dataset4 = dataset1.drop_duplicates()
##Check the select statement for dataset4
dataset4.select_statement
'SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'
When removing duplicates, the select statement selects only the unique records from the SAP HANA database table on the fly; hence your original table in the SAP HANA database remains intact.
Drop columns from the dataset – drop the columns which are not required. Removing useless columns helps algorithms perform better and makes regression/classification results more accurate.
dataset5 = dataset1.drop(["AGE"])
dataset5.select_statement ## There is no AGE column here; this is also called slicing of data
'SELECT "ID", "WORKCLASS", "FNLWGT", "EDUCATION", "EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", "NATIVECOUNTRY", "INCOME" FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'
So every time you perform an operation on the dataset using any method, it prepares the corresponding select statement, executes it on the fly, and gives the result back to the client.
Select the top 10 records from the dataset, but for two columns only –
The select statement on the dataframe contains only two columns and selects only 10 records once the collect method is called.
dataset6 = dataset1.select('ID', 'WORKCLASS').head(10)
dataset6.select_statement ##Dynamic select statement
'SELECT TOP 10 * FROM (SELECT "ID", "WORKCLASS" FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") dt'
dataset6.collect()
ID WORKCLASS
0 28 Private
1 75 Private
2 196 Private
3 264 Private
4 303 Private
5 324 Private
6 390 Private
7 442 Private
8 447 Private
9 472 Private
NAs in the dataset – check for null values in the dataset. This is required before feeding the data into any algorithm; drop the nulls or fill them with an average value.
dataset1.hasna('CAPITALGAIN') ## Returns True or False depending on whether the CAPITALGAIN column contains any NULL value
Output
True
Fill NAs with a chosen value
dataset1 = dataset1.fillna(10 , ["CAPITALGAIN"])
dataset1.hasna('CAPITALGAIN')
Output
False
Operate on the dataframe and manipulate fields –
A new column TWICE_AGE has been added, computed by multiplying the original age by 2.
dsp = dataset1.select('ID' , 'AGE' , ('"AGE"*2' , "TWICE_AGE")) ##Age Column has been added with multiplication
dsp.head(5).collect()
Output –
ID AGE TWICE_AGE
0 1 71 142
1 2 44 88
2 3 39 78
3 4 27 54
4 5 44 88
Check out the select statement of the above dataframe (dsp):
dsp.select_statement
Output
'SELECT "ID", "AGE", "AGE"*2 AS "TWICE_AGE" FROM
(SELECT "ID", "AGE", "WORKCLASS", "FNLWGT", "EDUCATION",
"EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION",
"RELATIONSHIP", "RACE", COALESCE("CAPITALGAIN", 10)
AS "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK",
"NATIVECOUNTRY", "INCOME"
FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") dt) AS "DT_18"'
Data Filtration methods –
dataset4.head(5).collect() ## Select 5 Sample records from the dataset
Output
ID AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS OCCUPATION RELATIONSHIP RACE GENDER CAPITALGAIN CAPITALLOSS HOURSPERWEEK NATIVECOUNTRY INCOME
0 41959 39 Private 286789 Doctorate 16 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
1 33040 22 Private 349212 Some-college 10 Never-married Other-service Own-child White Female 0 0 20 United-States <=50K
2 46142 53 Private 191389 HS-grad 9 Divorced Exec-managerial Not-in-family White Female 0 0 40 United-States <=50K
3 1319 37 Private 162424 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 45 United-States <=50K
4 35823 56 Private 200316 Some-college 10 Married-civ-spouse Adm-clerical Husband White Male 0 0
Query – we want those records having AGE greater than 60.
dataset_age = dataset4.filter('AGE > 60') ## records AGE greater than 60
dataset_age.collect()
Output:
dataset_age.count()
Output:
3606
dataset_age.select_statement
Output:
'SELECT * FROM (SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_FULL_TBL")
AS "DT_6") AS "DT_20" WHERE AGE > 60'
Output: 3606 records displayed.
Add Multiple conditions for multiple columns –
dataset4_res = dataset4.filter('AGE > 70 AND EDUCATIONNUM = 8')
dataset4_res.select_statement
Output:
'SELECT * FROM (SELECT DISTINCT * FROM
(SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6")
AS "DT_20"
WHERE AGE > 70 AND EDUCATIONNUM = 8'
Here multiple conditions have been added; see the WHERE clause of this dynamic select statement.
dataset4_res.count()
Output:
7
Only 7 records match AGE > 70 and EDUCATIONNUM = 8.
## Suppose we want those records where EDUCATIONNUM = 9
dataset4.filter('EDUCATIONNUM = 9').head(5).collect()
## Suppose we want those records where AGE = 27
dataset4.filter('AGE = 27').head(5).collect()
Multiple filter conditions on Pandas Dataframe –
dataset3[(dataset3['AGE'] > 27) & ( dataset3['OCCUPATION'] == 'Craft-repair')]
If you have a pandas dataframe at the client and want to filter records at the client end only, the above filter syntax will suffice.
Output :
Slicing of a dataframe –
Slice a few columns from the dataset:
## Copy the original dataset into a new one
new_dataset1 = dataset1.head(5).collect()
new_dataset1[['ID' , 'AGE']]
Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35
Copy the sliced data into a third dataframe:
new_dataset2 = new_dataset1[['ID' , 'AGE']]
new_dataset2
Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35
Create a dataframe for a few columns only –
dataset_id_age = connection_context.sql('SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL')
dataset_id_age.collect()
Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35
5 324 24
6 390 34
7 442 49
8 447 35
9 472 42
dataset_id_age.select_statement
'SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL'
In short, a dataframe is a select statement which just waits to be triggered until it is called.
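This "waiting select statement" idea can be demonstrated end to end with a runnable stand-in. The example below uses Python's built-in sqlite3 in place of SAP HANA (an assumption made purely so the snippet runs anywhere): the SQL string exists first, and only the explicit collect-style call actually executes it.

```python
import sqlite3

# In-memory database as a stand-in for SAP HANA (illustration only)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ADULT_DATA (ID INTEGER, AGE INTEGER)')
conn.executemany('INSERT INTO ADULT_DATA VALUES (?, ?)',
                 [(1, 71), (2, 44), (3, 39)])

# The "dataframe": just a select statement, nothing executed yet
select_statement = 'SELECT ID, AGE FROM ADULT_DATA WHERE AGE > 40 ORDER BY ID'

def collect(connection, statement):
    # Only here does the query actually run against the database
    return connection.execute(statement).fetchall()

rows = collect(conn, select_statement)
print(rows)  # [(1, 71), (2, 44)]
```

Until collect() is called, the statement is inert; that is exactly the behavior the HANA ML dataframe gives you at scale.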
describe() – Describe the dataframe with all the statistics information.
dataset4.describe().collect()
empty() – Returns True if this DataFrame has 0 rows
dataset2.empty()
Output:
False
has() – Returns true if a column is in the DataFrame.
dataset1.columns
Output:
['ID',
'AGE',
'WORKCLASS',
'FNLWGT', 'EDUCATION','EDUCATIONNUM',
'MARITALSTATUS',
'OCCUPATION',
'RELATIONSHIP',
'RACE',
'GENDER',
'CAPITALGAIN',
'CAPITALLOSS',
'HOURSPERWEEK',
'NATIVECOUNTRY',
'INCOME']
dataset1.has('ID')
Output:
True
join() – Join two datasets based on a condition, e.g. ID = ID.
condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)
## Display Condition
condition
Output:
'"DT_7"."ID"="DT_3"."ID"'
## Let's join dataset4 and dataset2 and copy the result into a new dataset7
dataset7 = dataset4.join(dataset2, condition)
dataset7.head(5).collect()
Output:
dataset7
save() – Save the dataframe into a table –
dataset4.save('#SAVEDATA')
## retrieve the saved dataframe into a new dataframe
new_dataset4 = connection_context.table('#SAVEDATA')
##Fetch first 10 records and display
new_dataset4.head(10).collect()
Output:
10 Records displayed
new_dataset4.select_statement ## Here the data comes from a temporary database table, or possibly a pointer to the original table
'SELECT * FROM "#SAVEDATA"'
AGGREGATE FUNCTIONS – one of the most important topics under the dataframe methods. Calculating an average age, summing up prices, counting items sold: such requirements are what these functions were invented for. Let's code and see how they work.
## Suppose we need to find out the average age of employees working in a specific department
agg_list = [('AVG' , 'AGE' , 'AVG_AGE')] ## Aggregate function, column, and new column name
dataset_avg_age = dataset4.agg(agg_list=agg_list , group_by='WORKCLASS').collect()
## Display the results
dataset_avg_age
Grouped by WORKCLASS; the average age of employees is displayed here.
## Count the number of people of same age
dataset_age_count = dataset4.agg(agg_list=[('COUNT' , 'ID' , 'ID_COUNT')] , group_by='AGE').collect()
dataset_age_count
Output:
from IPython.display import HTML
HTML(dataset_age_count.head(5).to_html())
## Maximum AGE of employees in each group
dataset4.agg([('max', 'AGE', 'MAX_AGE')], group_by='WORKCLASS').collect()
Output:
## MIN Age of employees in each workclass
dataset4.agg([('min', 'AGE', 'MIN_AGE')], group_by='WORKCLASS').collect()
Output:
corr() – Find the correlation between columns of a dataset.
dataset4.corr('ID' , 'AGE').collect()
Create a dataframe from pandas – creates an SAP HANA DataFrame that contains the data in the pandas dataframe.
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df_new = hana_ml.dataframe.create_dataframe_from_pandas(connection_context, df, 'pd_df', force=False, replace=True)
df ##Pandas dataframe
Output:
col1 col2
0 1 3
1 2 4
df_new.collect() ##HANA Ml dataframe
Output:
col1 col2
0 1 3
1 2 4