Wednesday, 10 June 2020

Removing outlier using standard deviation in SAP HANA

In this blog post we will learn how to remove the outlier in the data-set using the standard deviation , We can have one sample data set with product sales for all the years.

Before moving into the topic we should know what is a outlier and why it used. The Outlier is the values that lies above or below form the particular range of values . For example consider the data set (20,10,15,40,200,50) So in this 200 is the outlier value, There are many technique adopted to remove the outlier but we  are going to use standard deviation technique.

We already know that the standard deviation is square root of the variance. In HANA we have aggregation type STDDEV to calculate it.

i have created a sample data with a product and it no of sales during all the years.

SAP HANA Certification, SAP HANA Guides, SAP HANA Learning, SAP HANA Exam Prep, SAP HANA Study Material

Create a calculation view and use the table which you have created. Add the Product and sales value to semantics the sales is a integer data type convert the aggregation to the Standard deviation.(stddev). We also need to calculate the average value of sales, Create a calculated column as “sales_average” with aggregation type as Average in the semantics.

Now we have the Standard deviation value and the average value of the sales  for the product.

SAP HANA Certification, SAP HANA Guides, SAP HANA Learning, SAP HANA Exam Prep, SAP HANA Study Material

SAP HANA Certification, SAP HANA Guides, SAP HANA Learning, SAP HANA Exam Prep, SAP HANA Study Material

By using this calculated values we have to frame the equation that can replace the outlier value with Zeros.

For this we have to create a another calculation view and use the same table in projection and the calculation view with standard deviation calculation in the other projection.

SAP HANA Certification, SAP HANA Guides, SAP HANA Learning, SAP HANA Exam Prep, SAP HANA Study Material

We have to join both by using the product field. Create an new calculated column as “NEW_SALES” and as the following expression.

SAP HANA Certification, SAP HANA Guides, SAP HANA Learning, SAP HANA Exam Prep, SAP HANA Study Material

if(“SALES”<=(“AVERAGE_SALES”+(2*”SALES_STD”)) and (“SALES”>=(“AVERAGE_SALES”+(-2*”SALES_STD”))),”SALES”,0)

By this it compares the value above or below the range if it doesn’t satisfies the condition it is replaced with zero

SAP HANA Certification, SAP HANA Guides, SAP HANA Learning, SAP HANA Exam Prep, SAP HANA Study Material

So now we can see that in these case the value of sales for 2003 is replaced with zero and it can be removed.

No comments:

Post a Comment