Friday, 10 December 2021

SHAP-explained models with Automated Predictive (APL)

To address classification and regression machine learning scenarios, APL uses the Gradient Boosting technique. As for explaining what the predictive model does, APL relies on the SHAP framework (SHapley Additive exPlanations).

In this blog we will see how to extract and visualize SHAP values to obtain a:

1. Global explanation from the APL trained model.
2. Local explanation regarding an individual prediction.

Let’s go through a classification example: Detection of Car Insurance Fraud.

Global Explanation


First, we define a HANA dataframe pointing to the training data stored in the database.

from hana_ml import dataframe as hd
conn = hd.ConnectionContext(userkey='MLMDA_KEY')
sql_cmd = 'SELECT * FROM "APL_SAMPLES"."INSURANCE_CLAIMS" ORDER BY "Id"'
hdf_train = hd.DataFrame(conn, sql_cmd)
We request a couple of claims and display them using a Pandas dataframe.

df = hdf_train.filter('"Id" in (1766,1832)').collect()
df.style.hide_index() 

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

We learn from past insurance data including both fraudulent as well as genuine claims.

from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
apl_model = GradientBoostingBinaryClassifier()
apl_model.set_params(other_train_apl_aliases={'APL/VariableAutoSelection':'true'})
apl_model.fit(hdf_train, label='Fraud', key='Id')

Now we can look at the importance of each input variable for predicting fraud.

from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(apl_model).build().display()

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

This global explanation is based on the method: ExactSHAP. The shap value for each variable is converted into a relative importance. All the calculations are performed on the database side. The total of all bars is equal to 1.

Some variables don’t appear in the chart. This is because they have been excluded by the APL automated selection. Here is the list of excluded variables with the reasons for exclusion.

df = apl_model.get_debrief_report('ClassificationRegression_VariablesExclusion').collect()
df = df[['Variable', 'Reason For Exclusion']]
df.style.hide_index()

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

Local Explanation


With the predict method, one can request the individual contributions (aka shap values) from the APL trained model. The score and the shap values are computed on the HANA database side.

apl_model.set_params( extra_applyout_settings=
{ 'APL/ApplyExtraMode': 'Advanced Apply Settings', 
  'APL/ApplyPredictedValue': 'true', 
  'APL/ApplyProbability': 'true', 
  'APL/ApplyDecision': 'true',  
  'APL/ApplyContribution': 'all' }
)
df_shap = apl_model.predict(hdf_train).collect()
dict = {'TRUE_LABEL': 'Target Actual', 
        'PREDICTED': 'Target Predicted', 
        'gb_score_Fraud': 'Score', 
        'gb_proba_Fraud': 'Probability',
        'gb_contrib_constant_bias': 'Shap Baseline' }
df_shap.rename(columns=dict, inplace=True)
df_shap.columns = [hdr.replace("gb_contrib_", "Shap ") for hdr in df_shap.columns]
df_shap[df_shap['Id'].isin([1766,1832])].style.hide_index()

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

If you add together the shap values of the predictors and the shap baseline (last column), you will get the score (4th column).

col_list= [col for col in df_shap.columns if col.startswith('Shap')]
df_shap['Total Shap'] = df_shap[col_list].sum(axis=1)
df_shap[df_shap['Id'].isin([1766,1832])].style.hide_index()

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

That code snippet was only to illustrate the additive property of shap values.

Often business users are not comfortable with the score obtained from classification models (regression models are simpler from that standpoint because they predict a value in the unit of the target, for example euros if the target is an amount). The predicted score of our classification model is an unbounded value, hard to interpret. It can be used though to sort the claims from highest to smallest score; that way the claims that are most likely to be fraudulent will show first. One can also use the decision probability, ranging from 0 to 1. Let’s look at the relationship between the two.

import matplotlib
%matplotlib inline
import seaborn as sns
sns.relplot(x="Score", y="Probability", hue="Target Predicted", data=df_shap, height=2.2, aspect=1.3);

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

The score and the decision probability are related by the logit and sigmoid transforms:

◉ score = logit(probability)
◉ probability = sigmoid(score)

The blue segment of the curve being longer than the orange one, we are under the wrong impression that there are more Yes predictions than No predictions. In fact, there are many more orange claims than blue claims. Here is the distribution of the predicted target:

df_shap['Target Predicted'].value_counts(normalize=True)*100

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

The threshold value to decide whether a claim should be classified as Fraudulent or Not fraudulent, is determined so that the ratio of yes predictions approaches the ratio found in the actual data during the learning phase. One can retrieve the threshold value in the score space and probability space.

df = apl_model.get_summary().filter("KEY like '%DecisionThreshold'").collect()
df = df[['KEY','VALUE']]
df['KEY'] = df['KEY'].str.replace('Apply', '').str.replace('Decision', ' ')
df['VALUE'] = df['VALUE'].astype(float)
df.columns = ['Item', 'Value']
df.style.hide_index()

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

An article about SHAP would not be complete without showing a force plot and a beeswarm plot.

We get ready for the shap visualizations.

from shap._explanation import Explanation
from shap.plots import force, waterfall, beeswarm
shap_columns = col_list[:-1]   # Removing the total
predictors_names = [c[5:] for c in shap_columns]   # Removing prefix Shap 
shap_values = df_shap[shap_columns].values
actual_values = hdf_train.collect()[predictors_names].values
baseline = df_shap['Shap Baseline']
xpl = Explanation(shap_values, base_values=baseline, data=actual_values, feature_names=predictors_names)

We show below an example of force plot for a claim with a low score:

idx=df_shap[df_shap['Id']==1832].index.values.astype(int)[0]
force(xpl[0].base_values, xpl[idx].values, features=xpl[idx].data, feature_names=predictors_names, matplotlib=True)

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

A force plot for a claim having a high score:

idx=df_shap[df_shap['Id']==1766].index.values.astype(int)[0]
force(xpl[0].base_values, xpl[idx].values, features=xpl[idx].data, feature_names=predictors_names, matplotlib=True)

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

Don’t be surprised if you see two claims with, for example, the same Bodily Injury amount but different force values. The shap value includes actually the interaction with other predictors, it is composed of a main effect and an interaction effect.

We conclude this blog with the famous beeswarm. That plot gives a summary view where each dot is an individual claim.

beeswarm(xpl, max_display=8)

SAP HANA, SAP HANA Exam, SAP HANA Exam Prep, SAP HANA Exam Preparation, SAP HAN Tutorial and Materials, SAP HANA Career, SAP HANA Guides, SAP HANA Leaning, SAP HANA Study Materials, SAP HANA Jobs, SAP HANA Skills, SAP HANA Learn, SAP HANA Studio, SAP HANA

You recognize the features we saw earlier in the variable importance bar chart. Note that the color gradient from blue to red does not apply to nominal variables (in grey) like gender or payment type.

No comments:

Post a Comment