Monday, 11 October 2021

Detecting Contextual Anomalies with SAP HANA ML

Introduction

What is an Anomaly?

The goal here is to detect outlier data points, which do not follow the collective common pattern of the majority of data points, hence can be easily separated from the group.

SAP HANA ML, SAP HANA Tutorial and Materials, SAP HANA Certification, SAP HANA Guides, SAP HANA Preparation, SAP HANA Career

Some of the possible use cases here are: 

IT DevOps: Intrusion Detection system, monitoring network traffic

Banking: Fraudulent transactions, stock market analysis

Healthcare: Condition monitoring, tumor detection

Architecture


For this example, we will be using SAP HANA Python Client API for ML and calling the Automated Predictive Library (APL) with our preferred Python environment (I typically use Jupyter notebooks).

Time Series Anomaly Detection


For this POC, we will be conducting an anomaly detection within a time series data. What this means is we are trying to find data points that are not following the common collective trend or seasonal or cyclic pattern of the entire data. Example use cases here could be monitoring spikes in web traffic or spikes in global temperature. 

SAP HANA ML, SAP HANA Tutorial and Materials, SAP HANA Certification, SAP HANA Guides, SAP HANA Preparation, SAP HANA Career

Are dates and the outcome variable all that we need? That rarely is the case. Most of the time, we will need to supplement our data with more context. For example, if we’re trying to detect temperature anomalies for all states around the US. We can’t just collect all temperature data and feed it to the model, because different states have varying temperatures on any given day. New York experiences drastic temperature changes over the year which is common, while Hawaii will be warm and sunny all year round. So our goal here is to find anomalies for the respective states. If we don’t include context (US states) into our model here, it is likely that the model might flag New York temperatures as anomalies when in fact the low temperatures are common during Nov-Mar. So depending on your use case, we should think about what other factors could affect our outcome to ensure that it is as accurate as possible. This is known as Contextual Anomalies.

The figure below is an example of how the contextual anomaly detection model should function for New York temperature data. 

SAP HANA ML, SAP HANA Tutorial and Materials, SAP HANA Certification, SAP HANA Guides, SAP HANA Preparation, SAP HANA Career

Hypothetical Use Case: Flagging Abnormal Car Registrations for Countries


Now let’s jump right into our demo. 

The World Vehicle Association (WVA) needs to monitor and keep track of car registrations across the world. Car registrations vary across each country and time of the month but their patterns should be relatively consistent. If there are any sudden surges or drops in car registrations, the WVA needs to investigate and find out the causes and plan for next steps.

If there’s a surge, the WVA needs to make sure the country can accommodate the sudden increase in vehicles. Do they need to build more freeways, gas stations?

If there’s a drop, the WVA will need to investigate why? Is this just a one-off situation or are there consecutive drops for the next few months? Are people migrating out of the country, or are they using more public transportation?

This is what we are trying to figure out. We have a dataset of car registrations from different countries. It consists of 3 columns: Date, Country, Number of Car Registrations. Let’s get to it. 

1. First, let’s establish a connection to our SAP HANA system. 

import hana_ml
print(hana_ml.__version__)

hana_address = '' 
hana_port = 443 # Adjust if needed / as advised
hana_user = '' 
hana_password = '' 
hana_encrypt = 'true' # Adjust if needed / as advised
hana_sslcertificate = 'false'

# Instantiate connection object
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address = hana_address,
                                   port = hana_port, 
                                   user = hana_user, 
                                   password = hana_password, 
                                   encrypt = hana_encrypt,
                                   sslValidateCertificate = hana_sslcertificate
                                  )

# Send basic SELECT statement and display the result
sql = 'SELECT 12345 FROM DUMMY'
df_remote = conn.sql(sql)
print(df_remote.collect())

Next, we load our data into our python environment and upload it as a table called “VEHICLEREGISTRATIONS_ORG” to our HANA database. 

import pandas as pd
df_data = pd.read_csv("VEHICLEREGISTRATIONS.txt", sep=";")
df_data.columns = map(str.upper, df_data.columns) 

df_data.MONTH = pd.to_datetime(df_data.MONTH)

df_data.head(5)

SAP HANA ML, SAP HANA Tutorial and Materials, SAP HANA Certification, SAP HANA Guides, SAP HANA Preparation, SAP HANA Career

df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, 
                                                   pandas_df = df_data, 
                                                   table_name = 'VEHICLEREGISTRATIONS_ORG',
                                                   force = True,
                                                   replace = False)

Here df_remote is calling our HANA table that we just created.

df_remote = conn.table("VEHICLEREGISTRATIONS_ORG")

df_segments shows us a list of all the unique countries we have in our dataset.

df_segments = df_remote.select("COUNTRY").distinct().sort("COUNTRY", desc = False).collect()
df_segments

SAP HANA ML, SAP HANA Tutorial and Materials, SAP HANA Certification, SAP HANA Guides, SAP HANA Preparation, SAP HANA Career

The following code is where the magic happens. The year we are trying to analyze is 2017 – this will be our audit year. The for loop basically takes the raw data up until our audit year (2017) for each respective country and runs it through the automated time series algorithm. This returns a prediction for the expected car registrations in 2017. We compare this against the actuals in 2017 and return those that fall out of the prediction interval. This threshold can be modified as you wish depending on your use case.

audit_year = 2017

import pandas as pd
from hana_ml.algorithms.apl.time_series import AutoTimeSeries
pd.set_option('expand_frame_repr', False) # Print all pandas column without line break

#for index, row in df_segments[:3].iterrows(): 
for index, row in df_segments.iterrows(): 
   
    # Get raw data of each country
    segment = row['COUNTRY']
    print("Now starting with: " + segment)
    df_remote_segment = conn.sql("SELECT COUNTRY, MONTH, REGISTRATIONS FROM VEHICLEREGISTRATIONS_ORG WHERE COUNTRY = '" + segment + "'")
    
    # Sort data ascending, a requirement of the Automated Predictive Library
    df_remote_segment = df_remote_segment.sort("MONTH", desc = False)
    
    # Training data and hold out 
    df_remote_train = df_remote_segment.filter("YEAR(MONTH) < " + str(audit_year))
    df_remote_holdout = df_remote_segment.filter("YEAR(MONTH) = " + str(audit_year))
    #print(df_remote_holdout.collect())

    # Predict known past
    tsapl = AutoTimeSeries(time_column_name = 'MONTH', target = 'REGISTRATIONS', horizon = 12)
    tsapl.fit(data = df_remote_train.drop("COUNTRY"))
    df_remote_aplforecast = tsapl.forecast().filter("YEAR(MONTH) = " + str(audit_year))
    #print(df_remote_aplforecast.collect())
    
    # Join hold out with predictions
    #df_remote_joined = df_remote_aplforecast.alias('L').join(df_remote_holdout.alias('R'), 'L.MONTH = R.MONTH')
    df_remote_joined = df_remote_aplforecast.join(df_remote_holdout.rename_columns({'MONTH': 'RMONTH'}), 'MONTH = RMONTH')
    df_remote_joined = df_remote_joined.drop(['ACTUAL', 'RMONTH'])
    df_remote_joined = df_remote_joined.to_head('COUNTRY')
    
    # Filter on true values outside the predition interval
    df_remote_joined = df_remote_joined.filter('REGISTRATIONS < LOWER_INT_95PCT OR REGISTRATIONS > UPPER_INT_95PCT')
    print(df_remote_joined.collect())

The output is as followed:

E.g. Two anomalies were detected for Austria where the actual registrations fall outside the lower/upper prediction intervals.

Predicted November Range= (22230, 28200) || November Actuals = 29547

SAP HANA ML, SAP HANA Tutorial and Materials, SAP HANA Certification, SAP HANA Guides, SAP HANA Preparation, SAP HANA Career

We can save this dataframe as a new table ( back into HANA which can be connected to SAC for user consumption.

#saving results into HANA
df_remote_joined.save(where=('schema name', 'Car Registration Anomalies'), 
                       table_type = 'COLUMN', 
                       force = True)

And there you have it! We utilized our time series prediction model and tweaked it into an anomaly detection model. 

No comments:

Post a Comment