Learn how to script R in your preferred editor with SAP HANA data and how to deploy the code directly in SAP HANA.
If you are a Data Scientist, there is a good chance that you enjoy working with R. You can bring in your creativity, choose from thousands of R packages and script some very agile data analysis and predictions.
You probably know that SAP HANA has multiple options to create predictive insight. I am a big fan of creating predictive models automatically and efficiently with SAP Predictive Analytics. But if the needed arises and you have the time and skill to script a specific task yourself, it can be fun to open up your favorite R editor.
Working in your preferred R editor is what I call “freestyle”. Do what you do. And once you have found the code that solves the business challenge you are addressing, you will feel a nice sense of accomplishment. But the project has not yet been completed. How do you integrate your R code into your existing SAP HANA landscape, under the governance of IT? How to deploy freestyle? Well, read on and try it out!
If you are a Data Scientist, there is a good chance that you enjoy working with R. You can bring in your creativity, choose from thousands of R packages and script some very agile data analysis and predictions.
You probably know that SAP HANA has multiple options to create predictive insight. I am a big fan of creating predictive models automatically and efficiently with SAP Predictive Analytics. But if the needed arises and you have the time and skill to script a specific task yourself, it can be fun to open up your favorite R editor.
Working in your preferred R editor is what I call “freestyle”. Do what you do. And once you have found the code that solves the business challenge you are addressing, you will feel a nice sense of accomplishment. But the project has not yet been completed. How do you integrate your R code into your existing SAP HANA landscape, under the governance of IT? How to deploy freestyle? Well, read on and try it out!
Scenario
I will introduce the concept of working with R and SAP HANA along a time series forecasting scenario. We will use the monthly history of vehicle registration numbers from 20+ European countries to predict how many vehicles will be registered in future in each country. Think of it as a demand forecast or revenue forecast.
In reality, I believe such a use case ought to be implemented using the automated concept of SAP Predictive Analytics, which provides a comprehensive framework to operationalise such forecasting. The R code we are using here, will only cover a fraction of what the automated framework delivers, but let’s stick with it for the sake of learning the concept of scripting R with SAP HANA.
Disclaimer: Just bear in mind some small print. I am hoping this blog will help you get familiar with the R integration in SAP HANA, but the code or any of the blog’s content is not supported by SAP. Should you want to use part of the code, please carry out your own tests to ensure the syntax is working for your use case. Also, please have look at the “Things to consider” section towards the end of this blog. And finally, please verify with your Account Executive at SAP any license implications such an R integration might have.
Prerequisites
R Server connected to SAP HANA
You must have a SAP HANA system with a connected R Server to follow the whole example. Adding such an R Server to SAP HANA is described in the SAP HANA R Integration Guide. The supported R versions are listed in SAP Note 2185029. You can also work with the free SAP HANA, express edition as described in this tutorial.
The SAP HANA system might or might not have the Automated Predictive Library (APL) or the Predictive Analysis Library (PAL) installed.
Local R installation
I assume you are already comfortable scripting with R. Don’t despair if you are new to R. This collection of online learning resources maintained by RStudio is one of many ways to get started.
For the R scripting itself we should use an R development environment to be efficient. I personally like using RStudio, but there are many alternatives. We could develop directly on the R Server, for instance by running RStudio Server on that system. However, for this tutorial I am using a local installation of RStudio Desktop on my own laptop to develop the code. This code will be deployed later to SAP HANA and run on the connected R Server.
Specifically, the laptop must have:
- An installation of R, in the same version of the R Server that is connected to SAP HANA. The versions should be identical to ensure that the code that was developed will run well on the SAP HANA’s R Server.
- The R development environment, ie RStudio Desktop.
- The SAP HANA client to be able to connect from R to SAP HANA.
- A SAP HANA development environment to deploy the R syntax. I am using the Eclipse-based SAP HANA Modeler.
Historic data in SAP HANA
For our hands-on example you need to load the historic data of vehicle registrations into SAP HANA. I used the straight-forward import from SAP HANA Studio: “File” → “Import” → “SAP HANA Content” → “Data from Local File”. Just make sure to change the DataType of the MONTH column to DATE before uploading, as shown in the screenshot.
The example below assumes that the table is called VEHICLEREGISTRATIONS and exists in the schema ML.
A big thanks goes to the European Automobile Manufacturers Association for collecting and providing these registration figures.
R freestyle on SAP HANA
Now that the data is in SAP HANA we can start scripting with R. We begin by loading a few rows of the data into R. For our time series forecast we just need the history of a single country to get going. Later on, we will see how the logic is applied to all countries.
Open up the R development environment on your laptop!
Execute the following code to load the history of Switzerland into a data frame. Just adjust the connection parameters according to your environment. In my example I am using the JDBC driver.
You may also need to adjust the SELECT statement if your schema or table name are different.
library("RJDBC")
jdbc_driver <- JDBC(driverClass = "com.sap.db.jdbc.Driver",
classPath = "C:/Program Files/SAP/hdbclient/ngdbc.jar")
jdbc_connection <- dbConnect(jdbc_driver,
"jdbc:sap://SERVER:PORT",
"YOURUSER",
"ANDYOURPASSWORD")
data <- dbGetQuery(jdbc_connection, "SELECT COUNTRY, MONTH, REGISTRATIONS FROM ML.VEHICLEREGISTRATIONS WHERE VEHICLEREGISTRATIONS.Country = 'Switzerland' ORDER BY MONTH ASC")
The historic registration numbers are now loaded in a data frame. Feel free to use R for some exploratory data analysis on these numbers.
We continue by using this history to create the forecast. The code below is commented, so I will not explain all details here. But overall, a number of important parameters are set first, such as the forecast horizon (12 months, see dates_to_forecast).
The script then uses the forecast package from Rob Hyndman to try out different forecasting methods and tests their accuracy on a hold out sample. The approach with the lowest MAPE (mean absolute percentage error) is then used to create the final forecast of the truly unknown future. This forecast is displayed in a plot.
You may wonder why the R code is printing a status update, specifying which time series (meaning which country) it is currently working on. Strictly speaking this is not required, but there is a good reason for it anyway that will be revealed later…
# Load required libraries ------------------------------------------
library(lubridate)
library(forecast)
library(MLmetrics)
library(gplots)
# Specify any parameters, adjust to the use case -------------------------
col_name_date <- "MONTH"
col_name_measure <- "REGISTRATIONS"
col_name_segment <- "COUNTRY"
date_format <- "%Y-%m-%d"
dates_to_forecast <- 12
confidence_level <- 0.95
forecast_methods <- c('arima', 'ets', 'rwdrift', 'naive')
frequency <- 12 # This value must not be changed
# Retrieve the individual columns from the data frame -------------------
col_date <- as.character(data[, col_name_date])
col_measure <- data[, col_name_measure]
col_segment <- data[1, col_name_segment]
# Print status update -----------------------------------------------
print(paste("Now starting with: ", col_segment, sep = ""))
# Ensure data is sorted on the date in ascending order -------------------
data <- data [order(data[, col_name_date]), ]
# Convert time series into ts object (required by forecast function) -------
start_date <- as.Date(data[1, col_name_date], date_format)
ts_historic <- ts(data[, col_name_measure],
start = c(year(start_date), month(start_date)),
frequency = frequency)
# Keep a hold out sample of forecast length to test forecast accuracy -------------
ts_short <- head(ts_historic, n = length(ts_historic)-dates_to_forecast)
ts_hold_out <- tail(ts_historic, n = dates_to_forecast)
# Assess all forecasting methods on their performance on the hold out sample ------
ts_short_mapes <- rep(NA, length(forecast_methods))
for (ii in 1:length(forecast_methods)) {
stlf_forecast <- stlf(ts_short,
method = forecast_methods[ii],
h = dates_to_forecast,
level = confidence_level)
ts_short_mapes[ii] <- MAPE(as.numeric(ts_hold_out), stlf_forecast$mean)
}
# Select the best performing method to carry out the final forecast -----
forecast_best_mape <- min(ts_short_mapes)
ts_forecast <- stlf(ts_historic,
method = forecast_methods[which(ts_short_mapes == forecast_best_mape)],
h = dates_to_forecast,
level = confidence_level)
plot(ts_forecast)
We have a forecast!
In our example we want to write the forecast into a table in SAP HANA. Hence, we create a data frame with the structure and content we would like to store the data in SAP HANA. The following script combines the history with the forecasted values and enriches the dataset with additional context, ie the bounds of the forecast interval, some information on the model type, etc.
# Dates, name of time series (segment) and date type (Actual or Forecast) ---------
dates_all <- as.character(seq(from = start_date, by = "month", length.out = length(ts_historic)+dates_to_forecast))
col_segments <- rep(col_segment, length(dates_all))
model_descr <- rep(paste(ts_forecast$method, "- MAPE:", round(forecast_best_mape, 3)), length(dates_all))
date_types <- as.character(c(rep("Actual", length(ts_historic)), rep("Forecast", dates_to_forecast)))
# Actual and historic measures ----------------------------------
forecast_mean <- rep(NA, dates_to_forecast)
forecast_mean <- ts_forecast$mean
forecast_upper <- ts_forecast$upper
forecast_lower <- ts_forecast$lower
dates_all_mean <- as.numeric(c(as.numeric(ts_historic), as.numeric(forecast_mean)))
dates_all_lower <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_lower)))
dates_all_upper <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_upper)))
# Return the combined data ----------------------------------------
result <- data.frame(SEGMENT = col_segments,
MONTH = dates_all,
MEASURETYPE = date_types,
MEASURE = dates_all_mean,
MEASURELOWER = dates_all_lower,
MEASUREUPPER = dates_all_upper,
MODEL = model_descr)
result
The output for Switzerland:
You may want to change the filter in the SELECT statement and re-run the code to produce the forecasts for different countries.
Eventually we need to deploy this code in a more governed context. SAP HANA and R should produce the forecasts for all countries without requiring the laptop. Let’s see how to deploy this syntax in SAP HANA and how to apply it dynamically for each country.
R deployment on SAP HANA
Now open your modelling environment for SAP HANA. I am using the Eclipse-based SAP HANA Modeler. Go into the SQL console and specify the schema into which you want to write the predictions for the various countries. Then create the table itself, in the structure of the data frame created by R. I called it VEHICLEREGISTRATIONS_FORECAST.
SET SCHEMA ML;
--- Create table for the output
DROP TABLE "VEHICLEREGISTRATIONS_FORECAST";
CREATE COLUMN TABLE "VEHICLEREGISTRATIONS_FORECAST"(
SEGMENT NVARCHAR(14),
MONTH DATE,
MEASURETYPE NVARCHAR(20),
MEASURE DECIMAL,
MEASURELOWER DECIMAL,
MEASUREUPPER DECIMAL,
MODEL NVARCHAR(100));
Now create an RLANG procedure with the R code we wrote earlier in RStudio. The JDBC connection does not need to be specified since the R code is now maintained by SAP HANA, which will also provide the data to the procedure.
The procedure’s parameters specify that:
◈ data in the structure of the historic VEHICLEREGISTRATIONS table will be passed into it (input data)
◈ and that the procedure returns data in in the structure of the VEHICLEREGISTRATIONS_FORECAST table (output data)
When the RLANG procedure is called, SAP HANA will pass the historic data and the R code to the R Server. The “CREATE PROCEDURE” statement below specifies in my example that the input data will be available to the R code as data frame named “data” (see the “IN data” element). The statement also specifies that the RLANG procedure will return the content of a data frame called “result” (see the “OUT result” element”. Note that you can use different variable names in in your own projects. You don’t have to use “data” and “result”.
Since the procedure will return the data, the line that was plotting the chart is removed. Apart from this, it is exactly the code we created earlier in RStudio. Since the code is using a few additional libraries you have to make sure that these libraries are also installed on the R Server.
--- Procedure for a single monthly time series
DROP PROCEDURE SINGLE_FORECAST_MONTHLY_R;
CREATE PROCEDURE SINGLE_FORECAST_MONTHLY_R(IN data "ML"."VEHICLEREGISTRATIONS", OUT result "VEHICLEREGISTRATIONS_FORECAST")
LANGUAGE RLANG
AS
BEGIN
# Load required libraries --------------------------------------------
library(lubridate)
library(forecast)
library(MLmetrics)
# Specify any parameters, adjust to the use case ----------------------
col_name_date <- "MONTH"
col_name_measure <- "REGISTRATIONS"
col_name_segment <- "COUNTRY"
date_format <- "%Y-%m-%d"
dates_to_forecast <- 12
confidence_level <- 0.95
forecast_methods <- c('arima', 'ets', 'rwdrift', 'naive')
frequency <- 12 # This value must not be changed
# Retrieve the individual columns from the data frame --------------------
col_date <- as.character(data[, col_name_date])
col_measure <- data[, col_name_measure]
col_segment <- data[1, col_name_segment]
# Print status update ------------------------------------------------
print(paste("Now starting with: ", col_segment, sep = ""))
# Ensure data is sorted on the date in ascending order ------------------
data <- data [order(data[, col_name_date]), ]
# Convert time series into ts object (required by forecast function) ---------
start_date <- as.Date(data[1, col_name_date], date_format)
ts_historic <- ts(data[, col_name_measure],
start = c(year(start_date), month(start_date)),
frequency = frequency)
# Keep a hold out sample of forecast length to test forecast accuracy ----------
ts_short <- head(ts_historic, n = length(ts_historic)-dates_to_forecast)
ts_hold_out <- tail(ts_historic, n = dates_to_forecast)
# Assess all forecasting methods on their performance on the hold out sample ------
ts_short_mapes <- rep(NA, length(forecast_methods))
for (ii in 1:length(forecast_methods)) {
stlf_forecast <- stlf(ts_short,
method = forecast_methods[ii],
h = dates_to_forecast,
level = confidence_level)
ts_short_mapes[ii] <- MAPE(as.numeric(ts_hold_out), stlf_forecast$mean)
}
# Select the best performing method to carry out the final forecast ------------
forecast_best_mape <- min(ts_short_mapes)
ts_forecast <- stlf(ts_historic,
method = forecast_methods[which(ts_short_mapes == forecast_best_mape)],
h = dates_to_forecast,
level = confidence_level)
# Dates, name of time series (segment) and date type (Actual or Forecast) ---------
dates_all <- as.character(seq(from = start_date, by = "month", length.out = length(ts_historic)+dates_to_forecast))
col_segments <- rep(col_segment, length(dates_all))
model_descr <- rep(paste(ts_forecast$method, "- MAPE:", round(forecast_best_mape, 3)), length(dates_all))
date_types <- as.character(c(rep("Actual", length(ts_historic)), rep("Forecast", dates_to_forecast)))
# Actual and historic measures --------------------------------------
forecast_mean <- rep(NA, dates_to_forecast)
forecast_mean <- ts_forecast$mean
forecast_upper <- ts_forecast$upper
forecast_lower <- ts_forecast$lower
dates_all_mean <- as.numeric(c(as.numeric(ts_historic), as.numeric(forecast_mean)))
dates_all_lower <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_lower)))
dates_all_upper <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_upper)))
# Return the combined data ------------------------------------------------
result <- data.frame(SEGMENT = col_segments,
MONTH = dates_all,
MEASURETYPE = date_types,
MEASURE = dates_all_mean,
MEASURELOWER = dates_all_lower,
MEASUREUPPER = dates_all_upper,
MODEL = model_descr)
END;
The procedure is now available and can be called. Remember that the R code was written for an individual country. This allows us now to pass the data of each country individually into the procedure. Let’s test this with a single country first, before using the concept for all countries.
I found it easiest to split the historic data by creating a temporary table that contains only one specific country. This table is then used as input parameter when calling the RLANG procedure. Since a “?” is used as output parameter, the results are not yet written into a table, they are just displayed in the SAP HANA Modeler.
--- Forecast a single country
DROP TABLE #"tmpTable" ;
CREATE LOCAL TEMPORARY TABLE #"tmpTable"
(COUNTRY NVARCHAR(14),
MONTH DATE,
REGISTRATIONS INTEGER);
INSERT INTO #"tmpTable" SELECT COUNTRY, MONTH, REGISTRATIONS FROM ML.VEHICLEREGISTRATIONS WHERE VEHICLEREGISTRATIONS.COUNTRY = 'United Kingdom';
CALL SINGLE_FORECAST_MONTHLY_R (#"tmpTable",?);
You should now see the forecast in SAP HANA Studio! SAP HANA has passed the historic data of the United Kingdom and the R code to the R Server. The server did its forecast and returned the output into SAP HANA Modeler. The local installation of R on your own laptop was not used anymore at this point!
You don’t have to execute the following statement. It just shows how such a single forecast could be written into our target table. Compare this to the earlier procedure call and you will notice that the “?” was replaced with the output table name and that “WITH OVERVIEW” was added to the end of the statement.
CALL SINGLE_FORECAST_MONTHLY_R (#"tmpTable","VEHICLEREGISTRATIONS_FORECAST") WITH OVERVIEW;
SELECT * FROM "ML"."VEHICLEREGISTRATIONS_FORECAST";
By now we can forecast a single country. So, let’s build on this logic to dynamically create a forecast for each country in the historic data.
We just need to create an SQL procedure that calls the RLANG procedure for each individual country. The “WITH OVERVIEW” option we saw above is not supported for nested calls, hence we will implement the logic slightly differently. Instead of writing the forecast directly into the output table, the forecast is first passed into a temporary table. From that temporary table the data is then inserted into the final target table.
The purpose of the iterative approach, passing the data of each country individually, is to make the task digestible for R. If you have very large datasets, you may not want to pass it all at once.
--- Procedure to forecast multiple monthly time series
DROP PROCEDURE MULTIPLE_FORECASTS_MONTHLY_R;
CREATE PROCEDURE MULTIPLE_FORECASTS_MONTHLY_R
LANGUAGE SQLSCRIPT AS
CURSOR cursor_segment (SEGMENT NVARCHAR(256))
FOR SELECT DISTINCT COUNTRY FROM "VEHICLEREGISTRATIONS" ORDER BY COUNTRY ASC;
BEGIN
FOR cursor_row AS cursor_segment('input_value_1') DO
single_country = SELECT COUNTRY, MONTH, REGISTRATIONS FROM ML.VEHICLEREGISTRATIONS WHERE VEHICLEREGISTRATIONS.COUNTRY = cursor_row."COUNTRY";
CALL SINGLE_FORECAST_MONTHLY_R (:single_country, tmp_single_forecast) ;
INSERT INTO "VEHICLEREGISTRATIONS_FORECAST" SELECT * FROM :tmp_single_forecast;
END FOR;
END;
We are almost done, just call the procedure to trigger the forecasts.
--- Forecast all monthly time series
CALL MULTIPLE_FORECASTS_MONTHLY_R();
SELECT * FROM "ML"."VEHICLEREGISTRATIONS_FORECAST";
This call will take longer to complete now, as forecasts for 29 different countries are created. And if you want to see what is currently happening in R, you can see the R output in the index server’s trace file. In the screenshot below we see that R just started forecasting the time series for Greece. Now you know why the code we are using is printing a status update.
The target table should now include the forecasts for every country that was found in the historic data. The call to this procedure can now be scheduled to create new forecasts when needed.
The code that was produced freestyle in RStudio is now deployed in SAP HANA!
Things to consider
If you have read the blog this far, hopefully it has become clear how R can be integrated with SAP HANA. If you have even implemented the logic yourself, hopefully you managed to execute the code without too much trouble. Well done!
Just to point out, that even though the forecasts have been produced, chances are that more efforts will be needed to use this approach in a common business environment. A few thoughts that come to mind are:
- The code expects at least 3 years of history to produce the forecasts. It takes the most recent 12 months as hold-out sample, leaving two years to identify seasonality. If you have a shorter history, you might get an error.
- The code is expecting a complete history without missing data. If there is data missing within the time series, the output will be inaccurate. Similarly, if you have missing data at the end of the time series, the forecast will be out of sync.
- The code does not allow for additional predictor variables that might help improve the accuracy.
- The code is tweaked towards monthly forecasting. Different intervals can be handled by R of course, but the code needs to be adjusted.
- You have to take care of the scheduling process.
- You don’t have any visualisations showing the predictions.
These points go back to what I mentioned at the beginning of the blog, that I am a big fan of creating predictive models automatically and efficiently with SAP Predictive Analytics. The automated component of SAP Predictive Analytics provides such functionality out of the box.
I also mentioned that it is fun scripting in R. I also enjoy this myself, that’s why I wrote this blog after all! I just want to point out that different projects and different tasks might require different approaches. Automated Predictive Analytics is one of them. R is another.
No comments:
Post a Comment