Wednesday, 16 June 2021

Scheduling Python code on Cloud Foundry

This blog starts with a very simple example of scheduling a Python file on Cloud Foundry, just to introduce the most important steps. That concept is then extended to schedule a Python file that applies a trained Machine Learning model in SAP HANA.

Run Python file locally

We would like to schedule a Python file, not a Jupyter Notebook. Hence use your preferred local Python IDE or editor to run this simple file, helloworld.py.

import os
import sys

outputstring = 'Hello world'

if os.getenv('VCAP_APPLICATION'):
    # File is executed in Cloud Foundry
    outputstring += ' from Python in Cloud Foundry'
else:
    # File is executed locally
    outputstring += ' from local Python'

# Print the output
sys.stdout.write(outputstring)

The only special feature of this code is that it can already detect whether it is executed locally or in Cloud Foundry. This will be important later, for example when your code needs to securely access logon credentials for SAP HANA.
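The environment check can be taken one step further. Cloud Foundry sets VCAP_APPLICATION to a JSON document that describes the running app. The following is only a minimal sketch: the helper name describe_environment is made up for this illustration, and it assumes that the JSON contains an application_name field, falling back to a generic text if it does not.

```python
import json
import os


def describe_environment(environ=os.environ):
    """Return a short description of where the script is running."""
    raw = environ.get('VCAP_APPLICATION')
    if not raw:
        # Variable is not set: the file is executed locally
        return 'local Python'
    try:
        app_info = json.loads(raw)
    except json.JSONDecodeError:
        # Variable is set but not valid JSON: still treat it as Cloud Foundry
        return 'Cloud Foundry'
    if not isinstance(app_info, dict):
        return 'Cloud Foundry'
    # 'application_name' is an assumption; fall back if it is missing
    app_name = app_info.get('application_name', 'unknown app')
    return 'Cloud Foundry (app: ' + app_name + ')'


if __name__ == '__main__':
    print('Hello world from ' + describe_environment())
```

Passing the environment as a parameter makes it easy to try the function locally with a hand-written dictionary instead of real environment variables.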

Run the same Python file on Cloud Foundry


Now deploy this file on Cloud Foundry. If you haven’t worked with Cloud Foundry yet, the “Getting started” guide is a great resource. The free trial of the SAP Business Technology Platform is sufficient for the scheduling.

To run and schedule the file in Cloud Foundry, some further configuration is required. In the space of your Cloud Foundry Environment:

- Create an instance of the Job Scheduler service as explained by Carlos Roggan. I have chosen the same name for the service as he did: jobschedulerinstance
- Then create an instance of the Authorization & Trust Management (XSUAA) service, again as explained by Carlos. I have again used the same name as he did: xsuaaforsimplejobs. Once you have created that instance, you can come back to this blog. You do not need to create the Node.js application to continue here.

Now prepare the app to be pushed to Cloud Foundry. In the local folder in which the file helloworld.py is saved, create a file called runtime.txt that contains just this single line, which specifies the Python runtime environment Cloud Foundry should use.

python-3.6.x

Create another file called manifest.yml with the following content. It specifies, for example, the name of the application and the memory limit for the app, and it binds our two new services to the app.

---
applications:
- name: helloworld
  memory: 512M
  command: python helloworld.py
  services:
    - xsuaaforsimplejobs
    - jobschedulerinstance

Now push the app to Cloud Foundry. You require the Cloud Foundry Command Line Interface (CLI). In a command prompt, navigate to the folder that contains your local files (helloworld.py, manifest.yml, runtime.txt).

Log in to Cloud Foundry, if you haven’t yet.

cf login

Then push the app to Cloud Foundry. The --task flag ensures that the app is pushed and staged, but not started. Without this flag, Cloud Foundry would assume that the app crashed once the Python code has executed and terminated. It would therefore constantly restart the app, which we don’t want in this case. If your CLI doesn’t recognise this flag, please ensure your CLI is at least on version 7.

cf push --task

Once pushed, the app is shown as “down”.

Now run the application once as a task, from the command line. You can choose your own name for this task; “taskfromcli” has no specific meaning.

cf run-task helloworld --command "python helloworld.py" --name taskfromcli
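If you prefer to trigger the task from a Python script rather than typing the command, the same call can be assembled as an argument list. This is only a sketch: the helper name build_run_task_command is made up for this illustration, and the code just prints the command instead of executing it.

```python
import shlex


def build_run_task_command(app_name, command, task_name):
    """Assemble the 'cf run-task' call shown above as an argument list."""
    return ['cf', 'run-task', app_name,
            '--command', command,
            '--name', task_name]


args = build_run_task_command('helloworld', 'python helloworld.py', 'taskfromcli')

# Print the command in a form that can be pasted into a POSIX shell
print(' '.join(shlex.quote(arg) for arg in args))
```

On a machine where the cf CLI is installed and logged in, the list could be passed to subprocess.run(args, check=True) to actually start the task.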

Confirm that the code ran as expected, by looking into the application log, which should include the hello statement.

cf logs helloworld --recent
 
And indeed, the code correctly identified that it was running in Cloud Foundry. After writing the output, the task ended and the container was destroyed.

Schedule that Python file on Cloud Foundry


So far we triggered the task manually in Cloud Foundry through the Command Line Interface. Now let’s create a schedule for it. In the SAP Business Technology Platform, find the instance of the Job Scheduler service that was created earlier and open its dashboard.

On the “Tasks” menu create a new task as shown on the screenshot.

Open the newly created task. In the “Schedules” tab you can create the schedule we are after. In the screenshot the application is set to run every hour. In case you specify a certain time, note that “SAP Job Scheduling service runs jobs in the UTC time zone”.
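Because schedules are interpreted in UTC, a local start time has to be converted before it is entered. The following sketch does this for a fixed UTC offset. The function name local_time_to_utc is made up for this illustration, and the offset is an assumption you have to supply yourself, keeping in mind that it changes with daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def local_time_to_utc(hour, minute, utc_offset_hours):
    """Convert a local wall-clock time to UTC for a fixed UTC offset."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    # The date is arbitrary; only the time of day matters here
    local_dt = datetime(2021, 6, 16, hour, minute, tzinfo=local_zone)
    utc_dt = local_dt.astimezone(timezone.utc)
    return utc_dt.hour, utc_dt.minute


# 08:00 in a time zone that is two hours ahead of UTC
print(local_time_to_utc(8, 0, 2))  # (6, 0)
```

A job that should run at 08:00 local time in such a zone would therefore be scheduled for 06:00.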

Click into the new schedule. On the left-hand side, the “Run Log” shows the history of all runs.

You can also access the logs through the CLI as before.

cf logs helloworld --recent

Schedule Python file to trigger SAP HANA Machine Learning


Now that we understand how Python code can be scheduled in Cloud Foundry, let’s put this to good use. Imagine you have some master data in SAP HANA, with missing values in one column. You want to use SAP HANA’s Automated Predictive Library (APL) to estimate those missing values. All calculations will be done in SAP HANA. Cloud Foundry is just the trigger.

In this blog I will shorten the Machine Learning workflow to the most important steps. For this scenario:

◉ A Data Scientist / IT Expert will use Python to manually trigger the training of the Machine Learning model in SAP HANA and save the model in SAP HANA.
◉ Cloud Foundry then applies the model through a scheduled task and writes the predictions into a SAP HANA table.

Should you want to implement these steps, please check with your SAP Account Executive whether these steps are allowed by your SAP HANA license.

Some prerequisites:

◉ You have access to a SAP HANA environment that has the Automated Predictive Library (APL) installed. This can be a productive environment of SAP HANA Cloud. The trial of SAP HANA Cloud does not contain the embedded HANA Machine Learning.
◉ Your logon credentials to SAP HANA are stored in your local hdbuserstore
◉ You have the hana_ml library installed in your local Python environment.

Train the Machine Learning model


Begin by uploading this data about the prices of used vehicles to SAP HANA. Our model will predict the vehicle type, which is not known for all vehicles. This dataset was compiled by scraping offers on eBay and shared as “Ebay Used Car Sales Data” on Kaggle.

In your local Python environment load the data into a pandas DataFrame and carry out a few transformations. Here I am using Jupyter Notebooks, but any Python environment should be fine.

import hana_ml
import pandas as pd
df_data = pd.read_csv('autos.csv', encoding = 'Windows-1252')

# Column names to upper case
df_data.columns = map(str.upper, df_data.columns)

# Simplify by dropping a few columns
df_data = df_data.drop(['NOTREPAIREDDAMAGE', 
                        'NAME', 
                        'DATECRAWLED', 
                        'SELLER', 
                        'OFFERTYPE', 
                        'ABTEST', 
                        'BRAND', 
                        'DATECREATED',
                        'NROFPICTURES', 
                        'POSTALCODE', 
                        'LASTSEEN', 
                        'MONTHOFREGISTRATION'],
                        axis = 1)

# Rename a few columns
df_data = df_data.rename(index = str, columns = {'YEAROFREGISTRATION': 'YEAR',
                                                 'POWERPS': 'HP'})

# Add ID column
df_data.insert(0, 'CAR_ID', df_data.reset_index().index)

Load that data into a SAP HANA table. The userkey is pointing to SAP HANA credentials that have been added to the hdbuserstore.

import hana_ml
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey='HANACRED_LOCAL', encrypt = True, sslValidateCertificate = False)
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, 
                                                   pandas_df = df_data, 
                                                   table_name = 'USEDCARPRICES',
                                                   force = True,
                                                   replace = False)
 
Before training the Machine Learning model, have a quick check of the data.

import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey='HANACRED_LOCAL', encrypt = True, sslValidateCertificate = False)
df_remote = conn.table("USEDCARPRICES")
df_remote.describe().collect()
 
There are 371,528 entries in the table. For 37,869 of them the vehicle type is not known. Our model will estimate those values.
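To put these two numbers into proportion, a quick calculation in plain Python is enough. The figures are simply copied from the output above; nothing is read from SAP HANA here.

```python
total_rows = 371528    # rows in USEDCARPRICES, as reported above
missing_rows = 37869   # rows without a vehicle type

missing_share = missing_rows / total_rows
known_rows = total_rows - missing_rows

print('Missing vehicle type: {:.1%}'.format(missing_share))
print('Rows with a known vehicle type: {}'.format(known_rows))
```

Roughly one in ten vehicles has no vehicle type, which leaves 333,659 rows with a known type for training and testing.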

Focus on the cars for which the vehicle types are known. What are the different types and how often do they occur?

top_n = 10
df_remote = df_remote.filter("VEHICLETYPE IS NOT NULL")
df_remote_col_frequency = df_remote.agg([('count', 'VEHICLETYPE', 'COUNT')],  group_by = 'VEHICLETYPE')
df_col_frequency = df_remote_col_frequency.sort('COUNT', desc = True).head(top_n).collect()
df_col_frequency

Limousines are most frequent, followed by “kleinwagen”, which is German for a smaller vehicle.

Go ahead and train the Machine Learning model. This step will take a while, so it is definitely a chance for a coffee, or rather lunch. On my environment it took about 20 minutes.

# Split the data into train and test
from hana_ml.algorithms.pal import partition
df_remote_train, df_remote_test, df_remote_ignore = partition.train_test_val_split(random_seed = 1972,
                                                                                   data = df_remote, 
                                                                                   training_percentage = 0.7, 
                                                                                   testing_percentage = 0.3,
                                                                                   validation_percentage = 0)

# Parameterise the Machine Learning model
from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingClassifier
gbapl_model = GradientBoostingClassifier()
col_target = 'VEHICLETYPE'
col_id = 'CAR_ID'
col_predictors = df_remote_train.columns
col_predictors.remove(col_target)
col_predictors.remove(col_id)
gbapl_model.set_params(other_train_apl_aliases={'APL/VariableAutoSelection': 'true', 
                                                'APL/Interactions': 'true',
                                                'APL/InteractionsMaxKept': 10})

# Train the Machine Learning model
gbapl_model.fit(data = df_remote_train, 
                key = col_id, 
                features = col_predictors,
                label = col_target)
 

Apply the trained model on the test data and evaluate the model quality with a confusion matrix.

# Apply the model on the test data and keep only the strongest predictions
df_remote_predict = gbapl_model.predict(df_remote_test)
df_remote_predict = df_remote_predict.filter('PROBABILITY > 0.9')

# Confusion Matrix
from hana_ml.algorithms.pal.metrics import confusion_matrix
df_remote__confusion_matrix = confusion_matrix(df_remote_predict, col_id, label_true = 'TRUE_LABEL', label_pred = 'PREDICTED')
df_remote__confusion_matrix[1].collect()
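To illustrate what the filter and the confusion matrix do, here is a small sketch in plain Python. It is not the hana_ml implementation, and the sample rows are made up for this illustration; in the real scenario the calculation stays in SAP HANA.

```python
from collections import Counter


def confusion_counts(rows, threshold=0.9):
    """Count (true label, predicted label) pairs among confident predictions."""
    counts = Counter()
    for true_label, predicted_label, probability in rows:
        # Keep only the strongest predictions, as in the filter above
        if probability > threshold:
            counts[(true_label, predicted_label)] += 1
    return counts


# Made-up example rows: (true label, predicted label, probability)
sample_rows = [
    ('limousine', 'limousine', 0.97),
    ('limousine', 'kombi', 0.93),
    ('kleinwagen', 'kleinwagen', 0.95),
    ('kleinwagen', 'limousine', 0.55),  # dropped: below the threshold
    ('kombi', 'kombi', 0.91),
]

counts = confusion_counts(sample_rows)
correct = sum(n for (true_label, predicted_label), n in counts.items()
              if true_label == predicted_label)
accuracy = correct / sum(counts.values())

print(counts)
print('Accuracy on confident predictions: {:.2f}'.format(accuracy))
```

Each key of the counter corresponds to one cell of the confusion matrix. Pairs with identical labels are the correct predictions.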

Save the trained model in SAP HANA, where the scheduled process can pick it up later.

from hana_ml.model_storage import ModelStorage
model_storage = ModelStorage(connection_context = conn)
gbapl_model.name = 'Master data Vehicle type'
model_storage.save_model(model=gbapl_model, if_exists = 'replace')
 

Schedule the application of the Machine Learning model


To trigger the application of the trained model, follow the same steps as before, when the helloworld app was deployed. You require four files altogether. Place these in a separate folder on your local environment.

File 1 of 4: main.py

This file uses the same logic as the earlier helloworld.py to identify whether it is executed locally or in Cloud Foundry. Depending on the environment, it retrieves the SAP HANA logon credentials either from the local hdbuserstore or from a user-defined variable in Cloud Foundry. The trained Machine Learning model is then used to predict the missing values. The strongest predictions are saved in the VEHICLTETYPE_ESTIMATED table in SAP HANA. Again, no data ever leaves SAP HANA.

import os
import sys
import json
import hana_ml
import hana_ml.dataframe as dataframe

hana_encrypt = 'true' 
hana_sslcertificate = 'false'

if os.getenv('VCAP_APPLICATION'):
    # File is executed in Cloud Foundry
    sys.stdout.write('Python executing in Cloud Foundry')

    # Get SAP HANA logon credentials from user-provided variable in Cloud Foundry
    hana_credentials_env = os.getenv('HANACRED')
    hana_credentials = json.loads(hana_credentials_env)
    hana_address = hana_credentials['address']
    hana_port = hana_credentials['port']
    hana_user = hana_credentials['user']
    hana_password = hana_credentials['password']

    # Instantiate connection object
    conn = dataframe.ConnectionContext(address = hana_address,
                                       port = hana_port,
                                       user = hana_user,
                                       password = hana_password,
                                       encrypt = hana_encrypt,
                                       sslValidateCertificate = hana_sslcertificate)

else:
    # File is executed locally
    sys.stdout.write('Python executing locally')

    # Get SAP HANA logon credentials from the local client's secure user store
    conn = dataframe.ConnectionContext(userkey='HANACRED_LOCAL',
                                       encrypt = hana_encrypt,
                                       sslValidateCertificate = hana_sslcertificate)

# Load the trained model
import hana_ml.dataframe as dataframe
from hana_ml.model_storage import ModelStorage
model_storage = ModelStorage(connection_context = conn)
gbapl_model = model_storage.load_model(name = 'Master data Vehicle type', version = 1)

# Apply the model on rows for which vehicle type is not known
df_remote = conn.table("USEDCARPRICES")
df_remote = df_remote.filter("VEHICLETYPE IS NULL")
df_remote_predict = gbapl_model.predict(df_remote)

# Save the predictions with probability > 0.9 to table
df_remote_predict = df_remote_predict.filter('PROBABILITY > 0.9')
df_remote_predict.save('VEHICLTETYPE_ESTIMATED')
  
# Print success message
sys.stdout.write('\nMaster data imputation process completed')
 
Run the file locally and you should see an update that the process completed successfully.
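When the file later runs in Cloud Foundry, a missing or malformed HANACRED variable would only surface as a rather cryptic error. A small check before connecting can make such problems easier to spot. This is just a sketch: the helper name parse_hana_credentials is made up for this illustration and is not part of hana_ml.

```python
import json

# Keys that main.py reads from the credentials JSON
REQUIRED_KEYS = ('address', 'port', 'user', 'password')


def parse_hana_credentials(raw):
    """Parse the JSON credentials string and check that all keys exist."""
    if not raw:
        raise ValueError('HANACRED is not set')
    try:
        credentials = json.loads(raw)
    except json.JSONDecodeError as error:
        raise ValueError('HANACRED is not valid JSON: {}'.format(error))
    if not isinstance(credentials, dict):
        raise ValueError('HANACRED must be a JSON object')
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise ValueError('HANACRED is missing: ' + ', '.join(missing))
    return credentials
```

In main.py the call could then look like hana_credentials = parse_hana_credentials(os.getenv('HANACRED')), replacing the direct json.loads call.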

Before pushing this application to Cloud Foundry, you must create the remaining 3 files in the same folder.

File 2 of 4: manifest.yml

No surprises here, along the same lines as before.

---
applications:
- name: hanamlapply
  memory: 512M
  command: python main.py
  services:
    - xsuaaforsimplejobs
    - jobschedulerinstance
 
File 3 of 4: runtime.txt

Exactly as before.

python-3.6.x
 
File 4 of 4: requirements.txt

This file is new. It instructs the Python environment in Cloud Foundry to install the hana_ml library.

hana-ml==2.8.21042100
 
From your local command line push the app to Cloud Foundry.

cf push --task
 
Before running the app, store the SAP HANA logon credentials in a user-defined variable in Cloud Foundry. The name of the variable (here “HANACRED”) has to match the name used in main.py above. Port 443 is the port of SAP HANA Cloud. If you are using the CLI on Windows, the following syntax should work. If you are on a Mac, you might need to put double quotes around the JSON parameter.

cf set-env hanamlapply HANACRED {\"address\":\"REPLACEWITHYOURHANASERVER\",\"port\":443,\"user\":\"REPLACEWITHYOURUSER\",\"password\":\"REPLACEWITHYOURPASSWORD\"}
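Since the quoting differs between shells, one way to avoid escaping the JSON by hand is to let Python generate it. The sketch below builds the JSON with json.dumps and quotes it with shlex.quote, which targets POSIX shells (for example on a Mac or Linux), not the Windows command prompt. The placeholder values are the same as in the command above.

```python
import json
import shlex

# Placeholders, to be replaced with your own values
credentials = {
    'address': 'REPLACEWITHYOURHANASERVER',
    'port': 443,
    'user': 'REPLACEWITHYOURUSER',
    'password': 'REPLACEWITHYOURPASSWORD',
}

# Compact JSON without spaces, as in the command above
credentials_json = json.dumps(credentials, separators=(',', ':'))

# Quote the JSON so that a POSIX shell passes it on as a single argument
command = 'cf set-env hanamlapply HANACRED ' + shlex.quote(credentials_json)
print(command)
```

The printed command can be pasted into the terminal. Keep in mind that it contains the password in clear text.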

From here, it’s exactly the same steps as before. Run the task from the CLI if you like.

cf run-task hanamlapply --command "python main.py" --name taskfromcli

Check the log.

cf logs hanamlapply --recent
 
And indeed, the task completed successfully. The SAP HANA logon credentials were used to apply the trained model, to predict the missing data and to save the strongest predictions into the target table.

Back in the SAP Business Technology Platform, go to the Job Scheduler service’s dashboard and create a task for the hanamlapply app.

Create a schedule, lean back and the model is applied automatically!
