Wednesday, 7 September 2022

Python hana_ml: PAL Classification Training (UnifiedClassification)

I am writing this blog to show basic classification training procedures using the Python package hana_ml. With the class UnifiedClassification, you can use several classification algorithms, and the training result can easily be exported as an HTML report.


Environment


The environment is as follows.

◉ Python: 3.7.13 (Google Colaboratory)
◉ HANA: Cloud Edition 2022.16

Python packages and their versions:

◉ hana_ml: 2.13.22072200
◉ pandas: 1.3.5
◉ scikit-learn: 1.0.2
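
You can verify the installed versions with a quick check (a minimal sketch; each of these packages exposes a __version__ attribute):

import hana_ml, pandas, sklearn
# Print the versions actually installed in the runtime.
print(hana_ml.__version__, pandas.__version__, sklearn.__version__)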

As for HANA Cloud, I activated the scriptserver and created my user. I don’t recall any other special configuration, but I may be missing something, since our HANA Cloud instance was created quite a while ago.
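
If you are not sure whether the scriptserver is active, one way to check is to query the M_SERVICES monitoring view (a minimal sketch; it assumes the connection “conn” established in section 3 below):

# Sketch: list running services; a "scriptserver" row should appear
# when the script server has been activated for the instance.
print(conn.sql('SELECT SERVICE_NAME, ACTIVE_STATUS FROM M_SERVICES').collect())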

To keep the environment simple, I didn’t use HDI here.

Generated Model Report


First, let me show you the model report. The report can be displayed within Jupyter or downloaded as an HTML file. I didn’t optimize hyperparameters, so there is no optimal parameter page.

The report covers training and validation, not the test dataset.
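
For reference, the report in this section was produced with the two calls covered in section 9.2 (uc_rdt is the trained instance created in section 8):

# Display the report inline in Jupyter, or save it as an HTML file.
uc_rdt.generate_notebook_iframe_report()
uc_rdt.generate_html_report('result')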

Statistic


Basic classification results are shown on the Statistic page.

I didn’t know much about KAPPA and MCC. To my understanding, KAPPA is Cohen’s kappa, which measures the agreement between predicted and actual classes corrected for chance agreement, and MCC is the Matthews correlation coefficient, a correlation between predicted and actual classes that stays meaningful even when the class sizes differ.
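
As a sanity check, both statistics can be recomputed from the confusion matrix counts shown in section 9.1 (a minimal sketch in plain Python; the four counts are copied from the raw training result below):

import math

# Confusion matrix counts from the raw training result (section 9.1).
tn, fp, fn, tp = 770, 29, 5, 796   # actual/predicted: 0/0, 0/1, 1/0, 1/1
n = tn + fp + fn + tp

# Cohen's kappa: observed accuracy corrected for chance agreement.
p_o = (tn + tp) / n
p_e = ((tn + fp) * (tn + fn) + (tp + fn) * (tp + fp)) / n ** 2
kappa = (p_o - p_e) / (1 - p_e)

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f'KAPPA: {kappa:.6f}, MCC: {mcc:.6f}')
# KAPPA: 0.957498, MCC: 0.957930 -- matching the Statistic page values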



Parameter


The parameters used for classification.


Confusion Matrix


The confusion matrix page shows the counts of actual versus predicted classes.

Variables


The Variables page shows variable statistics.


Variable Importance


Variable importance is visualized as a pie chart or a bar chart.
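
The same numbers can also be pulled from the raw results and plotted manually (a minimal sketch; it assumes the trained instance uc_rdt from section 8, and that the importance table is the fifth dataframe in model_, as the raw output in section 9.1 suggests):

# Sketch: draw variable importance as a bar chart with pandas.
imp = uc_rdt.model_[4].collect()   # columns: VARIABLE_NAME, IMPORTANCE
imp.plot.bar(x='VARIABLE_NAME', y='IMPORTANCE')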


Metrics


24 types of metric charts, such as ROC, cumulative gains, and (cumulative) lift, are exported.
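
The underlying numbers are available as raw data as well; for example, the ROC curve can be rebuilt from the metrics table (a minimal sketch; it assumes the metrics table is the sixth dataframe in model_, as the raw output in section 9.1 suggests):

# Sketch: reconstruct the ROC curve from the raw metrics table.
m = uc_rdt.model_[5].collect()
fpr = m[m['NAME'] == 'ROC_FPR'].sort_values('X')['Y'].values
tpr = m[m['NAME'] == 'ROC_TPR'].sort_values('X')['Y'].values
pd.DataFrame({'FPR': fpr, 'TPR': tpr}).plot(x='FPR', y='TPR')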


Python Script


1. Install Python packages

Install the Python package hana_ml, which is not pre-installed on Google Colaboratory.

As for pandas and scikit-learn, I used the pre-installed versions.

!pip install hana_ml

2. Import modules

Import the required Python modules.

from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
import pandas as pd
from sklearn.datasets import make_classification

3. Connect to HANA Cloud

Connect to HANA Cloud and check its version.

The ConnectionContext class manages the connection to HANA.

HOST = '<HANA HOST NAME>'
SCHEMA = USER = '<USER NAME>'
PASS = '<PASSWORD>'
conn = ConnectionContext(address=HOST, port=443, user=USER,
                           password=PASS, schema=SCHEMA) 
print(conn.hana_version())
4.00.000.00.1660640318 (fa/CE2022.16)

4. Create test data

Create test data using scikit-learn.

There are 3 features and 1 target variable.

def make_df():
    X, y = make_classification(n_samples=1000, 
                               n_features=3, n_redundant=0)
    df = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
    df['CLASS'] = y
    return df

df = make_df()
print(df)
df.info()

Here is an overview of the dataframe.

           X1        X2        X3  CLASS
0    0.964229  1.995667  0.244143      1
1   -1.358062 -0.254956  0.502890      0
2    1.732057  0.261251 -2.214177      1
3   -1.519878  1.023710 -0.262691      0
4    4.020262  1.381454 -1.582143      1
..        ...       ...       ...    ...
995 -0.247950  0.500666 -0.219276      1
996 -1.918810  0.183850 -1.448264      0
997 -0.605083 -0.491902  1.889303      0
998 -0.742692  0.265878 -0.792163      0
999  2.189423  0.742682 -2.075825      1

[1000 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      1000 non-null   float64
 1   X2      1000 non-null   float64
 2   X3      1000 non-null   float64
 3   CLASS   1000 non-null   int64  
dtypes: float64(3), int64(1)
memory usage: 31.4 KB

5. Define table and upload data

Define the HANA table and upload the data using the function “create_dataframe_from_pandas”.

The function is very useful, since it automatically defines the table and uploads the data at the same time. Please check its options for further details.

TRAIN_TABLE = 'PAL_TRAIN'
dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE,
                             schema=SCHEMA, 
                             force=True, # True: truncate and insert
                             replace=True) # True: Null is replaced by 0

6. Check upload result

Check the table definition and the upload result. A HANA dataframe is only a reference to a HANA table, so you need to call “collect” to fetch the data into Python.

def show_hana_df(dfh):
    print(dfh.collect())
    print(f'Table Structure: {dfh.get_table_structure()}')
    print(dfh.describe().collect())

show_hana_df(dfh)
          X1        X2        X3  CLASS
0    0.964229  1.995667  0.244143      1
1   -1.358062 -0.254956  0.502890      0
2    1.732057  0.261251 -2.214177      1
3   -1.519878  1.023710 -0.262691      0
4    4.020262  1.381454 -1.582143      1
..        ...       ...       ...    ...
995 -0.247950  0.500666 -0.219276      1
996 -1.918810  0.183850 -1.448264      0
997 -0.605083 -0.491902  1.889303      0
998 -0.742692  0.265878 -0.792163      0
999  2.189423  0.742682 -2.075825      1

[1000 rows x 4 columns]
Table Structure: {'X1': 'DOUBLE', 'X2': 'DOUBLE', 'X3': 'DOUBLE', 'CLASS': 'INT'}
  column  count  unique  nulls      mean       std       min       max  \
0     X1   1000    1000      0 -0.022545  1.403956 -4.543441  4.020262   
1     X2   1000    1000      0  0.006131  0.987222 -3.019512  3.926238   
2     X3   1000    1000      0 -0.048433  1.322137 -3.836929  3.994644   
3  CLASS   1000       2      0  0.498000  0.500246  0.000000  1.000000   

     median  25_percent_cont  25_percent_disc  50_percent_cont  \
0 -0.197133        -1.054538        -1.056985        -0.197133   
1 -0.000251        -0.648000        -0.649373        -0.000251   
2 -0.224366        -1.021008        -1.021405        -0.224366   
3  0.000000         0.000000         0.000000         0.000000   

   50_percent_disc  75_percent_cont  75_percent_disc  
0        -0.197984         0.990549         0.990513  
1        -0.000709         0.666021         0.665924  
2        -0.229456         0.969732         0.967854  
3         0.000000         1.000000         1.000000  

7. Split data into train and test dataset

Split the dataset using the function “train_test_val_split”. The function needs a key column, so I added one using the function “add_id”.

train, test, _ = train_test_val_split(dfh.add_id(), 
                                      testing_percentage=0.2,
                                      validation_percentage=0)
print(f'Train shape: {train.shape}, Test Shape: {test.shape}')

Train shape: [8000, 5], Test Shape: [2000, 5]

8. Training

Train a random forest with the class “UnifiedClassification”. When fitting, I used partitioning for validation.

rdt_params = dict(n_estimators=10, max_depth=10)
uc_rdt = UnifiedClassification(func = 'RandomDecisionTree', **rdt_params)
uc_rdt.fit(data=train, training_percent=0.8, ntiles=2, key='ID', 
           partition_method='stratified', stratified_column='CLASS', 
           build_report=True)

9. Training result

9.1. Raw result

The raw result is in the attribute “model_”.

for model in uc_rdt.model_:
    print(model.collect(), '\n')

The third dataframe is empty, since I didn’t optimize hyperparameters.

    ROW_INDEX  PART_INDEX                                      MODEL_CONTENT
0           0          -1  <PMML version="4.0" xmlns="http://www.dmg.org/...
1           1           0  <PMML version="4.0" xmlns="http://www.dmg.org/...
2           2           0  " />\n<sd v="1" n="191" />\n<Node id="40" sc="...

-- omitted --

36         36           9  ode>\n<Node id="40" sc="0" n="76" >\n<sp x="X2...
37         37           9  "0" n="74" />\n<sd v="1" n="705" />\n<Node id=...
38         38           9  " op="lt" v="-0.476949" />\n<sd v="0" n="0" />... 

    STAT_NAME          STAT_VALUE CLASS_NAME
0         AUC             0.97875       None
1      RECALL  0.9637046307884856          0
2   PRECISION  0.9935483870967742          0
3    F1_SCORE  0.9783989834815756          0
4     SUPPORT                 799          0
5      RECALL  0.9937578027465668          1
6   PRECISION  0.9648484848484848          1
7    F1_SCORE  0.9790897908979089          1
8     SUPPORT                 801          1
9    ACCURACY             0.97875       None
10      KAPPA  0.9574983397788976       None
11        MCC  0.9579295387997437       None 

Empty DataFrame
Columns: [PARAM_NAME, INT_VALUE, DOUBLE_VALUE, STRING_VALUE]
Index: [] 

 ACTUAL_CLASS PREDICTED_CLASS COUNT
0            0               0    770
1            0               1     29
2            1               0      5
3            1               1    796 

  VARIABLE_NAME  IMPORTANCE
0            X1    0.125881
1            X2    0.018198
2            X3    0.855921 

               NAME    X        Y
0   RANDOM_CUMGAINS  0.0  0.00000
1   RANDOM_CUMGAINS  1.0  1.00000
2       RANDOM_LIFT  0.0  1.00000
3       RANDOM_LIFT  1.0  1.00000
4    RANDOM_CUMLIFT  0.0  1.00000
5    RANDOM_CUMLIFT  1.0  1.00000
6     PERF_CUMGAINS  0.0  0.00000
7     PERF_CUMGAINS  0.5  1.00000
8     PERF_CUMGAINS  1.0  1.00000
9         PERF_LIFT  0.0  2.00000
10        PERF_LIFT  0.5  2.00000
11        PERF_LIFT  1.0  0.00000
12     PERF_CUMLIFT  0.0  2.00000
13     PERF_CUMLIFT  0.5  2.00000
14     PERF_CUMLIFT  1.0  1.00000
15          ROC_FPR  0.0  0.00000
16          ROC_TPR  0.0  0.00000
17          ROC_FPR  1.0  0.02125
18          ROC_TPR  1.0  0.97875
19          ROC_FPR  2.0  1.00000
20          ROC_TPR  2.0  1.00000
21         CUMGAINS  0.0  0.00000
22         CUMGAINS  0.5  0.97875
23         CUMGAINS  1.0  1.00000
24             LIFT  0.0  2.00000
25             LIFT  0.5  1.00000
26             LIFT  1.0  0.00000
27          CUMLIFT  0.0  2.00000
28          CUMLIFT  0.5  1.95750
29          CUMLIFT  1.0  1.00000 

9.2. Model report

A “UnifiedClassification” instance can generate the model report.


The “generate_notebook_iframe_report” function displays the report within Jupyter, and the “generate_html_report” function saves the report as an HTML file. Its parameter is the file name prefix, so “result_unified_classification_model_report.html” is the complete file name in this script.

uc_rdt.generate_notebook_iframe_report()
uc_rdt.generate_html_report('result')

10. Test

10.1. Scoring

With the “score” function, you can get metrics on the test dataset.

results = uc_rdt.score(data=test, key='ID')
for result in results:
    print(result.collect())
         ID SCORE  CONFIDENCE  \
0         2     0         1.0   
1         3     0         0.6   
2         7     1         1.0   
3        13     1         1.0   
4        28     1         1.0   
...     ...   ...         ...   
1995   9963     1         1.0   
1996   9975     0         0.8   
1997   9995     1         1.0   
1998   9998     1         1.0   
1999  10000     1         1.0   

                                            REASON_CODE  
0     [{"attr":"X3","pct":98.0,"val":0.4868758174245...  
1     [{"attr":"X1","pct":65.0,"val":0.3245506374878...  
2     [{"attr":"X3","pct":78.0,"val":0.4052506627718...  
3     [{"attr":"X3","pct":90.0,"val":0.4588332547405...  
4     [{"attr":"X1","pct":74.0,"val":0.6337552163132...  
...                                                 ...  
1995  [{"attr":"X3","pct":91.0,"val":0.4566825270920...  
1996  [{"attr":"X3","pct":88.0,"val":0.4430019995088...  
1997  [{"attr":"X1","pct":76.0,"val":0.6453786485350...  
1998  [{"attr":"X3","pct":85.0,"val":0.4241043164237...  
1999  [{"attr":"X3","pct":88.0,"val":0.4362227541608...  

[2000 rows x 4 columns]
    STAT_NAME          STAT_VALUE CLASS_NAME
0         AUC          0.99131975       None
1      RECALL  0.9710578842315369          0
2   PRECISION  0.9898270600203459          0
3    F1_SCORE   0.980352644836272          0
4     SUPPORT                1002          0
5      RECALL  0.9899799599198397          1
6   PRECISION  0.9714847590953786          1
7    F1_SCORE  0.9806451612903225          1
8     SUPPORT                 998          1
9    ACCURACY              0.9805       None
10      KAPPA  0.9610013259549175       None
11        MCC  0.9611748218717623       None
  ACTUAL_CLASS PREDICTED_CLASS  COUNT
0            0               0    973
1            0               1     29
2            1               0     10
3            1               1    988
                NAME     X         Y
0    RANDOM_CUMGAINS  0.00  0.000000
1    RANDOM_CUMGAINS  1.00  1.000000
2        RANDOM_LIFT  0.00  1.000000
3        RANDOM_LIFT  1.00  1.000000
4     RANDOM_CUMLIFT  0.00  1.000000
..               ...   ...       ...
141          CUMLIFT  0.80  1.246250
142          CUMLIFT  0.85  1.174118
143          CUMLIFT  0.90  1.109444
144          CUMLIFT  0.95  1.051579
145          CUMLIFT  1.00  1.000000

[146 rows x 3 columns]

10.2. Prediction

If you want to get each individual prediction, just call the “predict” function.

df_pred = uc_rdt.predict(data=test, key='ID')
print(df_pred.collect())
         ID SCORE  CONFIDENCE  \
0         2     0         1.0   
1         3     0         0.6   
2         7     1         1.0   
3        13     1         1.0   
4        28     1         1.0   
...     ...   ...         ...   
1995   9963     1         1.0   
1996   9975     0         0.8   
1997   9995     1         1.0   
1998   9998     1         1.0   
1999  10000     1         1.0   

                                            REASON_CODE  
0     [{"attr":"X3","pct":98.0,"val":0.4868758174245...  
1     [{"attr":"X1","pct":65.0,"val":0.3245506374878...  
2     [{"attr":"X3","pct":78.0,"val":0.4052506627718...  
3     [{"attr":"X3","pct":90.0,"val":0.4588332547405...  
4     [{"attr":"X1","pct":74.0,"val":0.6337552163132...  
...                                                 ...  
1995  [{"attr":"X3","pct":91.0,"val":0.4566825270920...  
1996  [{"attr":"X3","pct":88.0,"val":0.4430019995088...  
1997  [{"attr":"X1","pct":76.0,"val":0.6453786485350...  
1998  [{"attr":"X3","pct":85.0,"val":0.4241043164237...  
1999  [{"attr":"X3","pct":88.0,"val":0.4362227541608...  

[2000 rows x 4 columns]

11. Close connection

Last but not least, it is preferable to close the connection explicitly.

conn.close()
