Friday, 1 April 2022

Two simple tips to boost the working efficiency of a Data Science Project

How can we make our daily work more efficient? Is there any straightforward answer? For me, the answer is a single word: experience.

Having participated in several Data Science projects over the last few years, I was really amazed at how quickly you can confirm the saying “almost 70-80% of a Data Science project is spent on data preparation”. This blog post presents two simple tips related to the data preparation process.

The first compares four different ways in which a data scientist can create random sample datasets from an initial dataset in SAP HANA, and discusses their potential usage. The second demonstrates the power of SAP HANA ML for creating and automating a set of new aggregated columns (for example max(), sum(), avg()) from existing columns without having to write long and complex SQL queries (the feature engineering part).

Different ways to create random sample datasets

During a data science project, several pain points are typically raised by a Data Scientist.


◉ I want to decrease the time spent on the cross validation process; sampling my initial dataset is really slow!
◉ I do not want to wait for hours for every sample subset execution. I need speed!
◉ I want to get a first sense of how my data can influence the accuracy of my model.
◉ I want a quick generation of descriptive statistics from my dataset, using the describe() function of SAP HANA ML.

The size of the dataset is the keyword behind all of the previous pain points. If the size is small, then everything is fine. But what if we have hundreds of millions of records and hundreds of columns? How do we handle such a demanding size? A straightforward answer is to take repeated samples from the initial dataset and then evaluate the results in order to make the appropriate decisions. But how easy and fast is it to take a sample from a huge dataset? Below we compare several ways in order to understand which approach is the most optimal.

Potential usage of sampling the initial datasets in a Data Science project



Let’s assume that our dataset has a couple of million rows and a couple of hundred columns, for example a typical production dataset of invoice records or IoT data. In order to conclude which model design is the most accurate, we will probably need to add many derived columns (the feature engineering process). Depending on the resulting accuracy, this has to be repeated several times in order to reach a first version of our training dataset.

This is a highly repetitive, time-consuming process which can last for days.

The trick


The trick here is to create several sample subsets from our initial dataset and execute the process multiple times until we are confident about the accuracy, in a cross-validation-like approach. Once we have concluded on the final data model, we only need to run it once against the full dataset. This reduces the time dramatically. Moreover, it extracts initial insights regarding the level of our model accuracy.

Insights regarding the level of our model accuracy


For example, if we run a couple of sample subsets and their accuracy is more or less always between 60 and 80, then we can be quite confident that the final accuracy should fall within this range (I).

If the accuracy of the repeated subset executions is consistently low, then we really need to redesign our model (II).

If the accuracy varies widely among the sub-executions, then it is easy to see that the final accuracy is highly driven by the data that we will use (III).
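To make the approach concrete, the sketch below pulls repeated random samples from SAP HANA using the TABLESAMPLE syntax (Case 3 further below), trains a simple model on each sample and then inspects the spread of the resulting accuracies. It is a minimal illustration, not the exact procedure of this post: the table MY_SCHEMA.TRAIN_DATA, its LABEL column, the connection details and the use of scikit-learn on the collected sample are all hypothetical.

# Minimal sketch: repeated sampling + accuracy spread.
# Assumptions: a hypothetical table MY_SCHEMA.TRAIN_DATA with numeric feature
# columns and a label column LABEL; scikit-learn is used locally on the
# collected sample purely for illustration.
import statistics
import hana_ml.dataframe as dataframe
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

conn = dataframe.ConnectionContext('hana_host', 30015, 'user', 'password')

accuracies = []
for i in range(10):  # 10 repeated sample subsets
    # pull a fresh random sample from SAP HANA (TABLESAMPLE, see Case 3 below)
    sample = conn.sql(
        'SELECT * FROM MY_SCHEMA.TRAIN_DATA TABLESAMPLE SYSTEM (1) LIMIT 100000'
    ).collect()  # collect() brings the sample into a pandas DataFrame

    X = sample.drop(columns=['LABEL'])
    y = sample['LABEL']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

# a narrow range -> case (I), consistently low -> case (II),
# a wide range  -> case (III): accuracy is highly driven by the data used
print('mean accuracy :', statistics.mean(accuracies))
print('stdev accuracy:', statistics.stdev(accuracies))

Repeated sampling sketch (illustrative)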


Comparison of random sample creation


We have created the column table below with 49 M rows. This table contains randomly created values in all four columns. We will test its sampling performance in four different ways.

CREATE COLUMN TABLE "CHECK_tbl" (
"RN" BIGINT,
"RN1" BIGINT,
"RN2" BIGINT,
"RN3" BIGINT);0

DDL of table “CHECK_tbl”

--CASE 1, USING RAND() ON ORDER BY
SELECT *  
  FROM "CHECK_tbl"
 ORDER BY rand() 
 LIMIT 1000000
WITH HINT (IGNORE_PLAN_CACHE);

--CASE 2, USING RAND() ON WHERE CLAUSE
SELECT *  
  FROM "CHECK_tbl"
 WHERE rand() < 0.9 
 LIMIT 1000000
WITH HINT (IGNORE_PLAN_CACHE);

--CASE 3, USING TABLESAMPLE SYNTAX
SELECT *
  FROM "CHECK_tbl"
 TABLESAMPLE SYSTEM (90)
 LIMIT 1000000
WITH HINT (IGNORE_PLAN_CACHE);

--CASE 4, USING THE PAL PREPROCESSING ALGORITHM (SAMPLING)
DROP TABLE #PAL_PARAMETER_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL (
    "PARAM_NAME" VARCHAR (256),
    "INT_VALUE" INTEGER,
    "DOUBLE_VALUE" DOUBLE,
    "STRING_VALUE" VARCHAR (1000)
);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('SAMPLING_METHOD', 5, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('SAMPLING_SIZE', 1000000, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('INTERVAL', 5, NULL, NULL);
INSERT INTO #PAL_PARAMETER_TBL VALUES ('COLUMN_CHOOSE', 2, NULL, NULL);
CALL "_SYS_AFL"."PAL_SAMPLING"("CHECK_tbl", #PAL_PARAMETER_TBL, "OUT_");

4 sampling cases

Metrics of executions

◉ As we can see from the results above, the most efficient execution in terms of time and memory is Case 3.

◉ Case 1 is the worst of all, since it has to order the whole dataset randomly first.

◉ Moreover, by moving the rand() function to the WHERE clause, Case 2 becomes the second most optimal of the 4.

◉ Furthermore, Case 4 uses the “Sampling” preprocessing algorithm provided by the SAP HANA Predictive Analysis Library (PAL); a Python equivalent through hana_ml is sketched below.
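For reference, the same PAL sampling can also be triggered from Python, which keeps the whole workflow in one place. The snippet below is a sketch under the assumption that the Sampling class from hana_ml.algorithms.pal.preprocessing is available in your hana_ml version and that conn is an existing ConnectionContext; the method name is indicative and should be mapped to the SAMPLING_METHOD value you actually need.

# Sketch: invoking PAL sampling through hana_ml instead of plain SQLScript.
# Assumptions: conn is an existing ConnectionContext; the method name below is
# indicative and should be checked against the PAL / hana_ml documentation.
from hana_ml.algorithms.pal.preprocessing import Sampling

data = conn.table('CHECK_tbl')                      # hana_ml DataFrame on the table
smp = Sampling(method='simple_random_without_replacement',
               sampling_size=1000000)               # roughly matching the Case 4 parameters
sample = smp.fit_transform(data=data)               # returns a hana_ml DataFrame
print(sample.count())                               # number of rows in the sampled subset

PAL sampling through hana_ml (sketch)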

Upscaling this exercise to a real-life scenario


A normal dataset could have 20-30 times more rows and a couple of hundred more columns. This translates, as a quick calculation, into a multiplier of at least 10 on the above metrics. Adding the repeated samples, for example 10 different executions, we can easily calculate the new execution-time metrics for each case, just to sample the data.

Case 1 : 44  (secs) * 10 (multiplier) * 10 (repeated executions) =  74 minutes of execution

Case 2 : 0,8 (secs) * 10 (multiplier) * 10 (repeated executions) =  2 minutes of execution

Case 3 : 0,01 (secs) * 10 (multiplier) * 10 (repeated executions) =  0,02 minutes of execution

Case 4 : 6,5 (secs) * 10 (multiplier) * 10 (repeated executions) =  11 minutes of execution

This exercise demonstrates 4 different approaches/solutions to the same problem. The only thing that differentiates them is how we use the features of SAP HANA. Taking into consideration the needs of the project, it is up to us to choose the most appropriate one each time.

SAP HANA ML / easy creation of new aggregated columns


In many scenarios, the data that we receive is really massive, for example sensor / IoT data and so on. In these cases, we not only have to understand the business meaning / benefit of each column (out of the hundreds available), but we also need to create new ones in order to increase the performance of our machine learning model. These derived columns are usually aggregations of the existing ones.

If the number of columns is low, then we can easily write a few lines of SQL in order to create them. The real challenge comes when we have hundreds of columns of sensor data and we want to apply several aggregations to them, such as min/max/avg/stddev and more. For those tasks an automated approach is required. Let’s see how hana_ml can be used to automate the creation of aggregated columns.

SAP HANA ML exercise


We created the table below, which is composed of an ID column and 10 measures (sensor readings, for example). For every measure (column), we want to create the aggregates based on the whole dataset as well as the aggregates based on the ID. The aggregations to be created in both cases are count(), avg(), stddev(), min(), max() and median().

CREATE COLUMN TABLE "DEMO_USER"."DEMO_AGG" (
"ID" INTEGER NOT NULL ,
"Measure_1" DOUBLE NOT NULL ,
"Measure_2" DOUBLE NOT NULL ,
"Measure_3" DOUBLE NOT NULL ,
"Measure_4" DOUBLE NOT NULL ,
"Measure_5" DOUBLE NOT NULL ,
"Measure_6" DOUBLE NOT NULL ,
"Measure_7" DOUBLE NOT NULL ,
"Measure_8" DOUBLE NOT NULL ,
"Measure_9" DOUBLE NOT NULL ,
"Measure_10" DOUBLE NOT NULL )

SAP HANA table creation

.....
# The HANA ML library
import hana_ml.dataframe as dataframe 
conn = dataframe.ConnectionContext(server_no_port, port, uname, passw)

dset = conn.sql('''SELECT * FROM DEMO_USER.DEMO_AGG''')

Connecting to SAP HANA and defining our HANA ML DataFrame

Our Hana ML Data frame
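Before building any derived columns, the DataFrame can be inspected directly from Python; the lines below are a small sketch using standard hana_ml DataFrame methods (columns, count(), head(), collect()). Nothing is transferred to the client until collect() is called.

# Quick inspection of the hana_ml DataFrame defined above
print(dset.columns)             # column names of DEMO_AGG
print(dset.count())             # number of rows in the table
print(dset.head(5).collect())   # fetch the first 5 rows as a pandas DataFrame

Inspecting the HANA ML DataFrame (sketch)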

#############################################################
## ADD DERIVED AGGREGATED COLUMN FROM ALL DATA SET ##
#############################################################
corr_cols = list(map(lambda y: y[0], filter(lambda x: (x[1]=='DOUBLE' or x[1]=='DECIMAL'), dset.dtypes())))
corr_cols
#define the group by clm
groupby_col = 'ID'
#define the aggr functions
agg = [ 'count','avg', 'stddev', 'min', 'max', 'median']
agg_sql = []
for operation in agg:
    agg_string = map(lambda x: tuple((operation, x, operation+'_'+x)), corr_cols)
    agg_sql.extend(agg_string)
dset1 = dset.agg(agg_sql, group_by=groupby_col)
#Run Statement
#dset1.collect() 
#Get the sql generated
dset1.select_statement

SAP HANA ML code

This code implements the steps required to produce the result (the new aggregated columns) that we want. Initially, we keep in a list only the columns whose type is DOUBLE or DECIMAL. Then we define:

◉ The column of the table on which we want the grouping to be applied (groupby_col = ‘ID’)
◉ The aggregation functions that we want to apply to these columns (agg = [‘count’, ‘avg’, ‘stddev’, ‘min’, ‘max’, ‘median’])

According to the SAP HANA ML documentation, in order to use the agg() function we need to provide the following parameters:

agg_list : A list of tuples. Each tuple is a triplet. The triplet consists of (aggregate_operator, expression, name) where:

◉ aggregate_operator is one of [‘max’, ‘min’, ‘count’, ‘avg’]
◉ expression is a str that is a column or column expression
◉ name is the name of this aggregate in the project list

group_by : str or list of str. The group by column. Only a column is allowed, although expressions are allowed in SQL.

For every derived column, we build the corresponding name with the following syntax:

“{Aggregation}_{Initial column name}”
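For illustration, the first entries of the agg_sql list built by the loop above look as follows (excerpt for the count operation only):

# Illustrative excerpt of agg_sql, the list of (aggregate_operator, expression, name) triplets
print(agg_sql[:3])
# [('count', 'Measure_1', 'count_Measure_1'),
#  ('count', 'Measure_2', 'count_Measure_2'),
#  ('count', 'Measure_3', 'count_Measure_3')]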

Then, by using the select_statement attribute, we gather the constructed SQL.

'SELECT "ID", 
 count("Measure_1") AS "count_Measure_1", 
 count("Measure_2") AS "count_Measure_2", 
 count("Measure_3") AS "count_Measure_3", 
 count("Measure_4") AS "count_Measure_4", 
 count("Measure_5") AS "count_Measure_5", 
 count("Measure_6") AS "count_Measure_6", 
 count("Measure_7") AS "count_Measure_7", 
 count("Measure_8") AS "count_Measure_8", 
 count("Measure_9") AS "count_Measure_9", 
 count("Measure_10") AS "count_Measure_10", 
 avg("Measure_1") AS "avg_Measure_1", 
 avg("Measure_2") AS "avg_Measure_2", 
 avg("Measure_3") AS "avg_Measure_3", 
 avg("Measure_4") AS "avg_Measure_4", 
 avg("Measure_5") AS "avg_Measure_5", 
 avg("Measure_6") AS "avg_Measure_6", 
 avg("Measure_7") AS "avg_Measure_7", 
 avg("Measure_8") AS "avg_Measure_8", 
 avg("Measure_9") AS "avg_Measure_9", 
 avg("Measure_10") AS "avg_Measure_10", 
 stddev("Measure_1") AS "stddev_Measure_1", 
 stddev("Measure_2") AS "stddev_Measure_2", 
 stddev("Measure_3") AS "stddev_Measure_3", 
 stddev("Measure_4") AS "stddev_Measure_4", 
 stddev("Measure_5") AS "stddev_Measure_5", 
 stddev("Measure_6") AS "stddev_Measure_6", 
 stddev("Measure_7") AS "stddev_Measure_7", 
 stddev("Measure_8") AS "stddev_Measure_8", 
 stddev("Measure_9") AS "stddev_Measure_9", 
 stddev("Measure_10") AS "stddev_Measure_10", 
 min("Measure_1") AS "min_Measure_1", 
 min("Measure_2") AS "min_Measure_2", 
 min("Measure_3") AS "min_Measure_3", 
 min("Measure_4") AS "min_Measure_4", 
 min("Measure_5") AS "min_Measure_5", 
 min("Measure_6") AS "min_Measure_6", 
 min("Measure_7") AS "min_Measure_7", 
 min("Measure_8") AS "min_Measure_8", 
 min("Measure_9") AS "min_Measure_9", 
 min("Measure_10") AS "min_Measure_10", 
 max("Measure_1") AS "max_Measure_1", 
 max("Measure_2") AS "max_Measure_2", 
 max("Measure_3") AS "max_Measure_3", 
 max("Measure_4") AS "max_Measure_4", 
 max("Measure_5") AS "max_Measure_5", 
 max("Measure_6") AS "max_Measure_6", 
 max("Measure_7") AS "max_Measure_7", 
 max("Measure_8") AS "max_Measure_8", 
 max("Measure_9") AS "max_Measure_9", 
 max("Measure_10") AS "max_Measure_10", 
 median("Measure_1") AS "median_Measure_1", 
 median("Measure_2") AS "median_Measure_2", 
 median("Measure_3") AS "median_Measure_3", 
 median("Measure_4") AS "median_Measure_4", 
 median("Measure_5") AS "median_Measure_5", 
 median("Measure_6") AS "median_Measure_6", 
 median("Measure_7") AS "median_Measure_7", 
 median("Measure_8") AS "median_Measure_8", 
 median("Measure_9") AS "median_Measure_9", 
 median("Measure_10") AS "median_Measure_10" 
 FROM (SELECT * FROM DEMO_USER.DEMO_AGG ) AS "DT_0" GROUP BY "ID"'

Extending the current scenario with even more aggregations based on a specific column, for example the ID column in our case, is really easy to implement. One final use of the SAP HANA ML join function is all that is needed.

Final code

#############################################################
## ADD DERIVED AGGREGATED COLUMN FROM ALL DATA SET##
#############################################################
corr_cols = list(map(lambda y: y[0], filter(lambda x: (x[1]=='DOUBLE' or x[1]=='DECIMAL'), dset.dtypes())))
corr_cols
#define the group by clm
groupby_col = 'ID'
#define the aggr functions
agg = [ 'count','avg', 'stddev', 'min', 'max', 'median']
agg_sql = []
for operation in agg:
    agg_string = map(lambda x: tuple((operation, x, operation+'_'+x)), corr_cols)
    agg_sql.extend(agg_string)
dset1 = dset.agg(agg_sql, group_by=groupby_col)
#Run Statement
#dset1.collect() 
#Get the sql generated
#dset1.select_statement
#############################################################

#############################################################
## ADD DERIVED AGGREGATED COLUMN PARTITION BY THE ID COLUMN##
#############################################################
agg = [ 'count','avg', 'stddev', 'min', 'max', 'median']
agg_sql = []
for operation in agg:
    agg_string = map(lambda x :'%s("%s") over (partition by ID) AS "pct_%s_%s"'%(operation,x,operation,x),corr_cols)
    agg_sql.extend(agg_string)
agg_string =", ".join(agg_sql)
df_window_fct_ren = conn.sql('select distinct * from (select "ID", ' + agg_string + ' from DEMO_USER.DEMO_AGG)')
result = dset1.set_index('ID').join(df_window_fct_ren.set_index('ID'))
#Run Statement
#result.collect()
#Get the sql generated - FINAL#
result.select_statement
#############################################################

Again, by using the select_statement attribute, we gather the newly constructed SQL.

SELECT T0."ID", 
 T0."count_Measure_1", 
 T0."count_Measure_2", 
 T0."count_Measure_3", 
 T0."count_Measure_4", 
 T0."count_Measure_5", 
 T0."count_Measure_6", 
 T0."count_Measure_7", 
 T0."count_Measure_8", 
 T0."count_Measure_9", 
 T0."count_Measure_10", 
 T0."avg_Measure_1", 
 T0."avg_Measure_2", 
 T0."avg_Measure_3", 
 T0."avg_Measure_4", 
 T0."avg_Measure_5", 
 T0."avg_Measure_6", 
 T0."avg_Measure_7", 
 T0."avg_Measure_8", 
 T0."avg_Measure_9", 
 T0."avg_Measure_10", 
 T0."stddev_Measure_1", 
 T0."stddev_Measure_2", 
 T0."stddev_Measure_3", 
 T0."stddev_Measure_4", 
 T0."stddev_Measure_5", 
 T0."stddev_Measure_6", 
 T0."stddev_Measure_7", 
 T0."stddev_Measure_8", 
 T0."stddev_Measure_9", 
 T0."stddev_Measure_10", 
 T0."min_Measure_1", 
 T0."min_Measure_2", 
 T0."min_Measure_3", 
 T0."min_Measure_4", 
 T0."min_Measure_5", 
 T0."min_Measure_6", 
 T0."min_Measure_7", 
 T0."min_Measure_8", 
 T0."min_Measure_9", 
 T0."min_Measure_10", 
 T0."max_Measure_1", 
 T0."max_Measure_2", 
 T0."max_Measure_3", 
 T0."max_Measure_4", 
 T0."max_Measure_5", 
 T0."max_Measure_6", 
 T0."max_Measure_7", 
 T0."max_Measure_8", 
 T0."max_Measure_9", 
 T0."max_Measure_10", 
 T0."median_Measure_1", 
 T0."median_Measure_2", 
 T0."median_Measure_3", 
 T0."median_Measure_4", 
 T0."median_Measure_5", 
 T0."median_Measure_6", 
 T0."median_Measure_7", 
 T0."median_Measure_8", 
 T0."median_Measure_9", 
 T0."median_Measure_10", 
 T1."pct_count_Measure_1", 
 T1."pct_count_Measure_2", 
 T1."pct_count_Measure_3", 
 T1."pct_count_Measure_4", 
 T1."pct_count_Measure_5", 
 T1."pct_count_Measure_6", 
 T1."pct_count_Measure_7", 
 T1."pct_count_Measure_8", 
 T1."pct_count_Measure_9", 
 T1."pct_count_Measure_10", 
 T1."pct_avg_Measure_1", 
 T1."pct_avg_Measure_2", 
 T1."pct_avg_Measure_3", 
 T1."pct_avg_Measure_4", 
 T1."pct_avg_Measure_5", 
 T1."pct_avg_Measure_6", 
 T1."pct_avg_Measure_7", 
 T1."pct_avg_Measure_8", 
 T1."pct_avg_Measure_9", 
 T1."pct_avg_Measure_10", 
 T1."pct_stddev_Measure_1", 
 T1."pct_stddev_Measure_2", 
 T1."pct_stddev_Measure_3", 
 T1."pct_stddev_Measure_4", 
 T1."pct_stddev_Measure_5", 
 T1."pct_stddev_Measure_6", 
 T1."pct_stddev_Measure_7", 
 T1."pct_stddev_Measure_8", 
 T1."pct_stddev_Measure_9", 
 T1."pct_stddev_Measure_10", 
 T1."pct_min_Measure_1", 
 T1."pct_min_Measure_2", 
 T1."pct_min_Measure_3", 
 T1."pct_min_Measure_4", 
 T1."pct_min_Measure_5", 
 T1."pct_min_Measure_6", 
 T1."pct_min_Measure_7", 
 T1."pct_min_Measure_8", 
 T1."pct_min_Measure_9", 
 T1."pct_min_Measure_10", 
 T1."pct_max_Measure_1", 
 T1."pct_max_Measure_2", 
 T1."pct_max_Measure_3", 
 T1."pct_max_Measure_4", 
 T1."pct_max_Measure_5", 
 T1."pct_max_Measure_6", 
 T1."pct_max_Measure_7", 
 T1."pct_max_Measure_8", 
 T1."pct_max_Measure_9", 
 T1."pct_max_Measure_10", 
 T1."pct_median_Measure_1", 
 T1."pct_median_Measure_2", 
 T1."pct_median_Measure_3", 
 T1."pct_median_Measure_4", 
 T1."pct_median_Measure_5", 
 T1."pct_median_Measure_6", 
 T1."pct_median_Measure_7", 
 T1."pct_median_Measure_8", 
 T1."pct_median_Measure_9", 
 T1."pct_median_Measure_10"\nFROM (SELECT "ID", 
 count("Measure_1") AS "count_Measure_1", 
 count("Measure_2") AS "count_Measure_2", 
 count("Measure_3") AS "count_Measure_3", 
 count("Measure_4") AS "count_Measure_4", 
 count("Measure_5") AS "count_Measure_5", 
 count("Measure_6") AS "count_Measure_6", 
 count("Measure_7") AS "count_Measure_7", 
 count("Measure_8") AS "count_Measure_8", 
 count("Measure_9") AS "count_Measure_9", 
 count("Measure_10") AS "count_Measure_10", 
 avg("Measure_1") AS "avg_Measure_1", 
 avg("Measure_2") AS "avg_Measure_2", 
 avg("Measure_3") AS "avg_Measure_3", 
 avg("Measure_4") AS "avg_Measure_4", 
 avg("Measure_5") AS "avg_Measure_5", 
 avg("Measure_6") AS "avg_Measure_6", 
 avg("Measure_7") AS "avg_Measure_7", 
 avg("Measure_8") AS "avg_Measure_8", 
 avg("Measure_9") AS "avg_Measure_9", 
 avg("Measure_10") AS "avg_Measure_10", 
 stddev("Measure_1") AS "stddev_Measure_1", 
 stddev("Measure_2") AS "stddev_Measure_2", 
 stddev("Measure_3") AS "stddev_Measure_3", 
 stddev("Measure_4") AS "stddev_Measure_4", 
 stddev("Measure_5") AS "stddev_Measure_5", 
 stddev("Measure_6") AS "stddev_Measure_6", 
 stddev("Measure_7") AS "stddev_Measure_7", 
 stddev("Measure_8") AS "stddev_Measure_8", 
 stddev("Measure_9") AS "stddev_Measure_9", 
 stddev("Measure_10") AS "stddev_Measure_10", 
 min("Measure_1") AS "min_Measure_1", 
 min("Measure_2") AS "min_Measure_2", 
 min("Measure_3") AS "min_Measure_3", 
 min("Measure_4") AS "min_Measure_4", 
 min("Measure_5") AS "min_Measure_5", 
 min("Measure_6") AS "min_Measure_6", 
 min("Measure_7") AS "min_Measure_7", 
 min("Measure_8") AS "min_Measure_8", 
 min("Measure_9") AS "min_Measure_9", 
 min("Measure_10") AS "min_Measure_10", 
 max("Measure_1") AS "max_Measure_1", 
 max("Measure_2") AS "max_Measure_2", 
 max("Measure_3") AS "max_Measure_3", 
 max("Measure_4") AS "max_Measure_4", 
 max("Measure_5") AS "max_Measure_5", 
 max("Measure_6") AS "max_Measure_6", 
 max("Measure_7") AS "max_Measure_7", 
 max("Measure_8") AS "max_Measure_8", 
 max("Measure_9") AS "max_Measure_9", 
 max("Measure_10") AS "max_Measure_10", 
 median("Measure_1") AS "median_Measure_1", 
 median("Measure_2") AS "median_Measure_2", 
 median("Measure_3") AS "median_Measure_3", 
 median("Measure_4") AS "median_Measure_4", 
 median("Measure_5") AS "median_Measure_5", 
 median("Measure_6") AS "median_Measure_6", 
 median("Measure_7") AS "median_Measure_7", 
 median("Measure_8") AS "median_Measure_8", 
 median("Measure_9") AS "median_Measure_9", 
 median("Measure_10") AS "median_Measure_10" 
 FROM (SELECT * FROM DEMO_USER.DEMO_AGG ) AS "DT_0" GROUP BY "ID") T0 
 
 INNER JOIN (select distinct * from (select "ID", 
 count("Measure_1") over (partition by ID) AS "pct_count_Measure_1", 
 count("Measure_2") over (partition by ID) AS "pct_count_Measure_2", 
 count("Measure_3") over (partition by ID) AS "pct_count_Measure_3", 
 count("Measure_4") over (partition by ID) AS "pct_count_Measure_4", 
 count("Measure_5") over (partition by ID) AS "pct_count_Measure_5", 
 count("Measure_6") over (partition by ID) AS "pct_count_Measure_6", 
 count("Measure_7") over (partition by ID) AS "pct_count_Measure_7", 
 count("Measure_8") over (partition by ID) AS "pct_count_Measure_8", 
 count("Measure_9") over (partition by ID) AS "pct_count_Measure_9", 
 count("Measure_10") over (partition by ID) AS "pct_count_Measure_10", 
 avg("Measure_1") over (partition by ID) AS "pct_avg_Measure_1", 
 avg("Measure_2") over (partition by ID) AS "pct_avg_Measure_2", 
 avg("Measure_3") over (partition by ID) AS "pct_avg_Measure_3", 
 avg("Measure_4") over (partition by ID) AS "pct_avg_Measure_4", 
 avg("Measure_5") over (partition by ID) AS "pct_avg_Measure_5", 
 avg("Measure_6") over (partition by ID) AS "pct_avg_Measure_6", 
 avg("Measure_7") over (partition by ID) AS "pct_avg_Measure_7", 
 avg("Measure_8") over (partition by ID) AS "pct_avg_Measure_8", 
 avg("Measure_9") over (partition by ID) AS "pct_avg_Measure_9", 
 avg("Measure_10") over (partition by ID) AS "pct_avg_Measure_10", 
 stddev("Measure_1") over (partition by ID) AS "pct_stddev_Measure_1", 
 stddev("Measure_2") over (partition by ID) AS "pct_stddev_Measure_2", 
 stddev("Measure_3") over (partition by ID) AS "pct_stddev_Measure_3", 
 stddev("Measure_4") over (partition by ID) AS "pct_stddev_Measure_4", 
 stddev("Measure_5") over (partition by ID) AS "pct_stddev_Measure_5", 
 stddev("Measure_6") over (partition by ID) AS "pct_stddev_Measure_6", 
 stddev("Measure_7") over (partition by ID) AS "pct_stddev_Measure_7", 
 stddev("Measure_8") over (partition by ID) AS "pct_stddev_Measure_8", 
 stddev("Measure_9") over (partition by ID) AS "pct_stddev_Measure_9", 
 stddev("Measure_10") over (partition by ID) AS "pct_stddev_Measure_10", 
 min("Measure_1") over (partition by ID) AS "pct_min_Measure_1", 
 min("Measure_2") over (partition by ID) AS "pct_min_Measure_2", 
 min("Measure_3") over (partition by ID) AS "pct_min_Measure_3", 
 min("Measure_4") over (partition by ID) AS "pct_min_Measure_4", 
 min("Measure_5") over (partition by ID) AS "pct_min_Measure_5", 
 min("Measure_6") over (partition by ID) AS "pct_min_Measure_6", 
 min("Measure_7") over (partition by ID) AS "pct_min_Measure_7", 
 min("Measure_8") over (partition by ID) AS "pct_min_Measure_8", 
 min("Measure_9") over (partition by ID) AS "pct_min_Measure_9", 
 min("Measure_10") over (partition by ID) AS "pct_min_Measure_10", 
 max("Measure_1") over (partition by ID) AS "pct_max_Measure_1", 
 max("Measure_2") over (partition by ID) AS "pct_max_Measure_2", 
 max("Measure_3") over (partition by ID) AS "pct_max_Measure_3", 
 max("Measure_4") over (partition by ID) AS "pct_max_Measure_4", 
 max("Measure_5") over (partition by ID) AS "pct_max_Measure_5", 
 max("Measure_6") over (partition by ID) AS "pct_max_Measure_6", 
 max("Measure_7") over (partition by ID) AS "pct_max_Measure_7", 
 max("Measure_8") over (partition by ID) AS "pct_max_Measure_8", 
 max("Measure_9") over (partition by ID) AS "pct_max_Measure_9", 
 max("Measure_10") over (partition by ID) AS "pct_max_Measure_10", 
 median("Measure_1") over (partition by ID) AS "pct_median_Measure_1", 
 median("Measure_2") over (partition by ID) AS "pct_median_Measure_2", 
 median("Measure_3") over (partition by ID) AS "pct_median_Measure_3", 
 median("Measure_4") over (partition by ID) AS "pct_median_Measure_4", 
 median("Measure_5") over (partition by ID) AS "pct_median_Measure_5", 
 median("Measure_6") over (partition by ID) AS "pct_median_Measure_6", 
 median("Measure_7") over (partition by ID) AS "pct_median_Measure_7", 
 median("Measure_8") over (partition by ID) AS "pct_median_Measure_8", 
 median("Measure_9") over (partition by ID) AS "pct_median_Measure_9", 
 median("Measure_10") over (partition by ID) AS "pct_median_Measure_10" from DEMO_USER.DEMO_AGG)) T1
 ON T0."ID" = T1."ID"

This example is based on only 10 columns, just to expose the power of SAP HANA ML. On a recent project, we had to handle more or less 800 measures coming from sensor data. Writing the SQL for those transformations by hand would be a nightmare, not only in terms of the time needed to produce the SQL, but also because of how difficult it would be to make any changes. In a few words, the win with this trick is not limited to a fast and easily created SQL script; we also gain a scalable and automated process at our disposal.
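To close, the aggregation logic above can be packaged into a small reusable helper, so the same feature engineering can be reapplied to any table, whether it has 10 or 800 measures. The sketch below works under the same assumptions as the earlier code (an existing connection, DOUBLE/DECIMAL measure columns, an ID key column); the helper name build_aggregates and the target table name DEMO_AGG_FEATURES are hypothetical, and the result is persisted with the hana_ml DataFrame save() method.

# Sketch of a reusable helper (hypothetical name build_aggregates) packaging the
# steps shown above: detect the numeric columns, build the
# (aggregate_operator, expression, name) triplets and aggregate per key column.
import hana_ml.dataframe as dataframe

def build_aggregates(conn, table, schema, key_col='ID',
                     agg_ops=('count', 'avg', 'stddev', 'min', 'max', 'median')):
    dset = conn.table(table, schema=schema)
    # keep only the DOUBLE / DECIMAL columns, as in the original code
    num_cols = [c[0] for c in dset.dtypes() if c[1] in ('DOUBLE', 'DECIMAL')]
    # build the (aggregate_operator, expression, name) triplets
    agg_sql = [(op, col, '%s_%s' % (op, col)) for op in agg_ops for col in num_cols]
    return dset.agg(agg_sql, group_by=key_col)

# usage (connection details and the target table name are placeholders)
conn = dataframe.ConnectionContext('hana_host', 30015, 'user', 'password')
features = build_aggregates(conn, 'DEMO_AGG', 'DEMO_USER')
features.save('DEMO_AGG_FEATURES')   # persist the result as a new table in SAP HANA

Reusable aggregation helper (sketch)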
