Why is parallelization relevant?
The R integration with SAP HANA aims at leveraging R's rich set of statistical and data-mining capabilities, as well as its fast, high-level, built-in convenience operations for data manipulation (e.g. matrix multiplication, data subsetting) in the context of an SAP HANA-based application. To benefit from the power of R, the R integration framework requires a setup with two separate hosts: one for SAP HANA and one for the R/Rserve environment. In brief, R processing from an SAP HANA application works as follows:
- SAP HANA triggers the creation of a dedicated R-process on the R-host machine.
- The R code plus the data accessible from SAP HANA are transferred via TCP/IP to the spawned R-process.
- The computational tasks take place within the R-process.
- The results are sent back from R to SAP HANA for consumption and further processing.
For more details, see the SAP HANA R Integration Guide.
There are certain performance-related bottlenecks within the default integration setup which should be considered. The main ones are the following:
- Firstly, latency is incurred when transferring large datasets from SAP HANA to the R-process for computation on the remote host machine.
- Secondly, R inherently executes in single-threaded mode. This means that, irrespective of the number of CPU cores available on the R-host machine, an R-process will by default execute on a single core. So while memory on the R-host machine may be fully utilized, the available CPU processing capacity remains largely idle.
A straightforward way to gain performance in this setup is to leverage parallelization. In this document I present an overview of the avenues for parallelization within the R integration with SAP HANA.
Overview of parallelization options
The parallelization options to consider range from hardware scaling (the host machine) to R-process scaling. The three main paths to leverage parallelization are the following:
(1) Trigger the execution of multiple R-calls in parallel from within SQLScript procedures in SAP HANA
(2) Use parallel R libraries to spawn child (worker) R processes within parent (master) R-process execution
(3) Scale the number of R-host machines connected to SAP HANA for parallel execution (scale memory and add computational power)
While each option can be implemented independently, they can also be combined. For example, if you go for (3) – scaling the number of R-hosts – you still need (1) – triggering multiple R-calls – for parallelism to take place. Without (1), you merely end up with a better high-availability/fault-tolerance scenario.
Based on the following use case, I will illustrate the different parallelization approaches using some code examples:
A health-care unit wishes to predict a cancer patient's survival probability over different time horizons after various treatment options, based on the diagnosis. Let's assume the following:
- The survival periods for prediction are half a year, one year, and two years.
- Accordingly, 3 predictive models have been trained (HALF, ONE, TWO) to predict a new patient's survival probability over these periods, given a set of predictor variables based on historical treatment data.
In a default approach without leveraging parallelization, you would have one R-call transferring the full set of new patient data to be evaluated, plus all three models, from SAP HANA to the R-host. On the R-host, a single-threaded R-process will be spawned, and the survival predictions for all 3 periods are executed sequentially. An example of the SAP HANA stored procedure of type RLANG is shown below.
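Since the original screenshot is not reproduced here, the following is a minimal sketch of such a sequential RLANG procedure. The procedure name, table types, column layout and the assumption that the models are stored as serialized BLOBs are illustrative, not the original code:

```sql
-- Sketch of the default (sequential) RLANG procedure.
-- Names, table types and the BLOB model format are assumptions.
CREATE PROCEDURE PREDICT_SURVIVAL_ALL(
    IN  eval      "PATIENT_EVAL_T",   -- new patient data to score
    IN  tr_models "MODELS_T",         -- 3 trained models, serialized as BLOBs
    OUT result    "SURVIVAL_RESULT_T")
LANGUAGE RLANG AS
BEGIN
    result <- data.frame()
    # Loop sequentially over the 3 trained models (HALF, ONE, TWO)
    for (i in 1:nrow(tr_models)) {
        model <- unserialize(tr_models$MODEL[[i]])   # restore the R model object
        pred  <- predict(model, newdata = eval)
        result <- rbind(result,
                        data.frame(PERIOD        = tr_models$PERIOD[i],
                                   PATIENT_ID    = eval$PATIENT_ID,
                                   SURVIVAL_PROB = as.numeric(pred)))
    }
END;
```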
In the code above, the 3 trained models (variable tr_models) are passed to the R-process to predict survival for the new patient data (variable eval). The survival prediction for each model takes place in the body of the for loop shown above.
Performance measurement: for a dataset of 1,038,024 observations (~16.15 MB) and 3 trained BLOB model objects (each ~26.8 MB), an execution time of 8.900 seconds was recorded.
There are various sources of overhead involved in this scenario. The most notable ones are:
- Network communication overhead, in copying the full dataset plus 3 models (BLOBs) from SAP HANA to R.
- Sequential execution, with each model evaluated one after another in a single-threaded R-process. Furthermore, the for-loop control construct, though built into base R, may not be efficient from a performance perspective in this case.
By employing parallelization techniques, I hope to achieve better results in terms of performance. Let the results of this scenario constitute our benchmark for parallelization.
Applying the 3 parallelization options to the example scenario
1. Parallelize by executing multiple R-calls from SAP HANA
We can exploit the inherent parallel nature of SAP HANA's database processing engines by triggering multiple R-calls to run in parallel, as described above. For each R-call triggered by SAP HANA, the Rserve process spawns an independent R-runtime process on the R-host machine.
An example of an SAP HANA SQLScript stored procedure with multiple parallel calls of a stored procedure of type RLANG is given below. In this example, the idea is to separate patient survival prediction across 3 separate R-calls as follows:
- Create an RLANG stored procedure handling survival prediction for just one model (see input variable tr_model).
- Include the expression READS SQL DATA in the RLANG procedure definition (see the sketch after this list) for parallel execution of the R-operators to occur when embedded in a procedure of type SQLScript. Without this instruction, R-calls embedded in an SQLScript procedure will execute sequentially.
- Then create an SQLScript procedure and embed 3 RLANG procedure calls within it, as shown in the sketch after this list. Notice that the same RLANG procedure defined previously is called each time, but different trained model objects (trModelHalf, trModelOne, trModelTwo) are passed in, to separate survival prediction across the R-calls.
- In the SQLScript procedure definition you can also include READS SQL DATA (recommended for security reasons, as documented in the SAP HANA SQLScript Reference), but it is not mandatory for triggering R-calls in parallel. If included, however, you cannot use DDL/DML statements (INSERT/UPDATE/DELETE, etc.) within the SQLScript procedure.
- On the R-host, 3 R-processes will be triggered and run in parallel; consequently, 3 CPU cores will be utilized on the R machine.
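The original screenshots are not reproduced here; the sketch below shows the two procedures described above under assumed names, table types and table names (PREDICT_SURVIVAL_ONE, PREDICT_SURVIVAL_PARALLEL, PATIENT_EVAL, TRAINED_MODELS and the result types are hypothetical):

```sql
-- Sketch of the RLANG procedure scoring against a single model.
-- READS SQL DATA is required for the embedded R-calls to run in parallel.
CREATE PROCEDURE PREDICT_SURVIVAL_ONE(
    IN  eval     "PATIENT_EVAL_T",
    IN  tr_model "MODEL_T",            -- exactly one trained model (BLOB)
    OUT result   "SURVIVAL_RESULT_T")
LANGUAGE RLANG READS SQL DATA AS
BEGIN
    model  <- unserialize(tr_model$MODEL[[1]])
    pred   <- predict(model, newdata = eval)
    result <- data.frame(PATIENT_ID    = eval$PATIENT_ID,
                         SURVIVAL_PROB = as.numeric(pred))
END;

-- Sketch of the SQLScript wrapper triggering the 3 R-calls in parallel.
CREATE PROCEDURE PREDICT_SURVIVAL_PARALLEL(
    OUT resHalf "SURVIVAL_RESULT_T",
    OUT resOne  "SURVIVAL_RESULT_T",
    OUT resTwo  "SURVIVAL_RESULT_T")
LANGUAGE SQLSCRIPT READS SQL DATA AS
BEGIN
    eval        = SELECT * FROM PATIENT_EVAL;
    trModelHalf = SELECT * FROM TRAINED_MODELS WHERE PERIOD = 'HALF';
    trModelOne  = SELECT * FROM TRAINED_MODELS WHERE PERIOD = 'ONE';
    trModelTwo  = SELECT * FROM TRAINED_MODELS WHERE PERIOD = 'TWO';

    -- The three calls are independent, so the calc engine can execute them in parallel
    CALL PREDICT_SURVIVAL_ONE(:eval, :trModelHalf, resHalf);
    CALL PREDICT_SURVIVAL_ONE(:eval, :trModelOne,  resOne);
    CALL PREDICT_SURVIVAL_ONE(:eval, :trModelTwo,  resTwo);
END;
```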
Performance measurement: In this parallel R-calls scenario, an execution time of 6.278 seconds was recorded. This represents a performance gain of roughly 29.46%. Although this is an improvement, we might theoretically have expected an improvement closer to 67% (one third of the original runtime), given that we trigger 3 R-calls in parallel. The explanation for this gap is overhead. But which one?
In this example, I parallelized survival prediction across 3 R-calls, but still transmit the full patient dataset in each R-call. The improvement in performance can be explained, firstly, by the fact that HANA now transmits less data per R-call (only one model, as opposed to three in the default scenario), so each data transfer is faster; secondly, the survival prediction for each model is performed in a separate R-runtime.
There are two other avenues we could explore for optimization in this use case. One is to further parallelize the R-runtime prediction itself (see section 2). The other is to further reduce the amount of data transmitted per R-call by splitting the patient dataset in HANA and parallelizing the data transfer across separate R-calls (see section 4).
Please note that without the READS SQL DATA instruction in the RLANG procedure definition, an execution time of 13.868 seconds was recorded. This is because each R-call embedded in the SQLScript procedure is then executed sequentially (3 R-call roundtrips).
2. Parallelize the R-runtime execution using parallel R libraries
By default, R execution is single-threaded. No matter how many processing resources are available on the R-host machine (64, 32, 8 CPU cores, etc.), a single R-runtime process will only use one core. In the following I give examples of techniques to improve execution performance by running R code in parallel.
Several open-source R packages offer support for parallelism in R. The most popular packages for R-runtime parallelism on a single host are "parallel" and "foreach". The "parallel" package offers a range of parallel functions, each specific to the nature of the data (lists, arrays, etc.) subject to parallelism. For historical reasons, these functions can be classified roughly into two broad categories, prefixed "par-" (snow-style cluster) and "mc-" (multicore, fork-based).
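As a brief, generic illustration of the two families (not the script used in this scenario), the following R snippet contrasts a snow-style cluster with the fork-based multicore approach:

```r
library(parallel)

square <- function(x) x^2

# "par-" family: explicit snow-style cluster of worker R-processes
cl   <- makeCluster(3)                      # start 3 worker processes
res1 <- parLapply(cl, 1:9, square)          # distribute the work across workers
stopCluster(cl)                             # always release the workers

# "mc-" family: fork-based multicore workers (not available on Windows)
res2 <- mclapply(1:9, square, mc.cores = 3)
```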
In the following example I use the multicore function mclapply() to invoke parallel R-processes on the patient dataset. Within each of the 3 parallel R-runtimes triggered from HANA, I split the patient data into 3 subsets and then parallelize survival prediction on each subset. See the sketch below.
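The original script is shown as an image in the source post; the sketch below reconstructs its general shape from the description that follows. The variable names n.cores, split.idx and scoreFun come from that description, while the procedure name, table types and model handling are assumptions:

```sql
-- Sketch: one R-call per model, with mclapply()-based data parallelism
-- inside each R-runtime.
CREATE PROCEDURE PREDICT_SURVIVAL_ONE_MC(
    IN  eval     "PATIENT_EVAL_T",
    IN  tr_model "MODEL_T",
    OUT result   "SURVIVAL_RESULT_T")
LANGUAGE RLANG READS SQL DATA AS
BEGIN
    library(parallel)
    n.cores   <- 3                                  # worker processes per R-runtime
    model     <- unserialize(tr_model$MODEL[[1]])
    # Partition the row indices of the patient data into n.cores chunks
    split.idx <- splitIndices(nrow(eval), n.cores)
    # Task executed by each worker: score one chunk of the patient data
    scoreFun  <- function(idx) {
        chunk <- eval[idx, ]
        data.frame(PATIENT_ID    = chunk$PATIENT_ID,
                   SURVIVAL_PROB = as.numeric(predict(model, newdata = chunk)))
    }
    # Fork n.cores child R-processes, one per chunk, and recombine the results
    result <- do.call(rbind, mclapply(split.idx, scoreFun, mc.cores = n.cores))
END;
```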
The script example above highlights the following:
- 3 CPU cores are used by the R-process (variable n.cores).
- The patient data is split into 3 partitions, according to the number of chosen cores, using the splitIndices function.
- The task to be performed by each CPU core (survival prediction) is defined in the function scoreFun.
- Then I call mclapply(), passing the data partitions (split.idx), how many CPU cores to use, and which function should be executed by each core.
In this example, 3 master R-processes are initially triggered in parallel on the R-host by the 3 R-calls. Within each master R-runtime, 3 additional child (worker) R-processes are then spawned by calling mclapply(). On the R-host we therefore have 3 processing groups executing in parallel, each consisting of 4 R-runtimes (1 master and 3 workers). Each group is dedicated to predicting patient survival based on one model. For this setup, 12 CPU cores will be used in total.
Performance measurement: In this parallel R-package scenario using mclapply(), an execution time of 4.603 seconds was observed. This represents roughly a 48.28% gain in performance over the default (benchmark) scenario and, measured against the benchmark time, roughly 20 percentage points better than the parallel R-call example presented in section 1.
3. Parallelize by scaling the number of R-Host machines connected to HANA for parallel execution
It is also possible to connect SAP HANA to multiple R-hosts and exploit this setup for parallelization. The major motivation for choosing this option is to increase the number of processing units (and the amount of memory) available for computation when the resources of a single host are not sufficient. With this setup, however, it is not possible to control which R-host receives which R request; the assignment is made via an equally weighted round-robin scheme. From an SQLScript procedure perspective, nothing changes: you can reuse the same parallel R-call scripts as shown in section 1 above.
Setup Prerequisites
- Include more than one IPv4 address in the calc engine parameter cer_rserve_addresses in the indexserver.ini (or xsengine.ini) file (see section 3.3 of the SAP HANA R Integration Guide).
- Set up the parallel R-calls within an SQLScript procedure, as described in section 1.
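As a sketch, and assuming the parameter location described in the R Integration Guide, the two addresses could be registered as follows (the IP addresses and ports are placeholders):

```sql
-- Sketch: register two Rserve hosts for the calc engine (placeholder addresses).
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM')
    SET ('calcengine', 'cer_rserve_addresses') = '10.0.0.11:30120,10.0.0.12:30120'
    WITH RECONFIGURE;
```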
I configure 2 R-host addresses in the calc engine Rserve address parameter, as sketched above. While still using the same SQLScript procedure as in the 3-R-calls scenario example (nothing changes in the code), I trigger the parallelization of the 3 R-calls across two R-host machines.
Performance measurement: The scenario took 6.342 seconds to execute. This execution time is similar to the times recorded in the parallel R-calls example, so this example only demonstrates that parallelism works in a multi-R-host setup. Its real benefit comes into play when the computational resources (CPUs, memory) available on a single R-host are not sufficient.
4. Optimizing data transfer latency between SAP HANA and R
As discussed in section 1, one source of overhead is the transmission of the full patient dataset in each parallel R-call from HANA to R. We can further reduce the data-transfer latency by splitting the dataset into 3 subsets in HANA and then using 3 parallel R-calls to transfer one subset each from HANA to R for prediction. In each R-call, however, we also have to transfer all 3 models.
An example illustrating this concept is sketched below.
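The original figure is not reproduced here; the following SQLScript sketch illustrates the idea under assumed names, with an arbitrary modulo split on a hypothetical PATIENT_ID column. PREDICT_SURVIVAL_SUBSET is the RLANG procedure sketched at the end of the list below:

```sql
-- Sketch: split the patient data in HANA and send one subset per R-call.
-- The splitting criterion (modulo on PATIENT_ID) and all names are assumptions.
CREATE PROCEDURE PREDICT_SURVIVAL_SPLIT(
    OUT res1 "SURVIVAL_RESULT_T",
    OUT res2 "SURVIVAL_RESULT_T",
    OUT res3 "SURVIVAL_RESULT_T")
LANGUAGE SQLSCRIPT READS SQL DATA AS
BEGIN
    tr_models = SELECT * FROM TRAINED_MODELS;   -- all 3 models (HALF, ONE, TWO)

    eval1 = SELECT * FROM PATIENT_EVAL WHERE MOD(PATIENT_ID, 3) = 0;
    eval2 = SELECT * FROM PATIENT_EVAL WHERE MOD(PATIENT_ID, 3) = 1;
    eval3 = SELECT * FROM PATIENT_EVAL WHERE MOD(PATIENT_ID, 3) = 2;

    -- 3 independent R-calls, each transferring one subset plus all 3 models
    CALL PREDICT_SURVIVAL_SUBSET(:eval1, :tr_models, res1);
    CALL PREDICT_SURVIVAL_SUBSET(:eval2, :tr_models, res2);
    CALL PREDICT_SURVIVAL_SUBSET(:eval3, :tr_models, res3);
END;
```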
In the example above, the following is performed:
- The patient dataset (eval) is split into 3 subsets in HANA (eval1, eval2, eval3).
- 3 R-calls are triggered, each transferring one data subset together with all 3 models.
- On the R-host, 3 master R-processes will be triggered. Within each master R-process I parallelize survival prediction across 3 cores using the function pair mcparallel()/mccollect() from the "parallel" package (task parallelism), as shown in the sketch after this list.
- I create an R function (scoreFun) to specify a particular task. This function predicts survival based on a single model passed as an input parameter.
- For each call of the mcparallel() function, a child R-process is started in parallel and evaluates the expression defined in scoreFun. I assign each model individually.
- With the list of assigned tasks, I then call mccollect() to retrieve the results of the parallel survival predictions.
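A sketch of that RLANG procedure (again with hypothetical names and table types) could look as follows:

```sql
-- Sketch of the RLANG procedure called above: task parallelism with
-- mcparallel()/mccollect(), one forked child R-process per model.
CREATE PROCEDURE PREDICT_SURVIVAL_SUBSET(
    IN  eval      "PATIENT_EVAL_T",    -- one subset of the patient data
    IN  tr_models "MODELS_T",          -- all 3 models (HALF, ONE, TWO)
    OUT result    "SURVIVAL_RESULT_T")
LANGUAGE RLANG READS SQL DATA AS
BEGIN
    library(parallel)
    # Task definition: score the subset against a single model
    scoreFun <- function(i) {
        model <- unserialize(tr_models$MODEL[[i]])
        data.frame(PERIOD        = tr_models$PERIOD[i],
                   PATIENT_ID    = eval$PATIENT_ID,
                   SURVIVAL_PROB = as.numeric(predict(model, newdata = eval)))
    }
    # Start one child R-process per model; each evaluates scoreFun(i) in parallel
    jobs <- lapply(1:nrow(tr_models), function(i) mcparallel(scoreFun(i)))
    # Collect the partial results from the children and combine them
    result <- do.call(rbind, mccollect(jobs))
END;
```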
In this manner, the overall data-transfer latency is reduced to the size of each subset, while we still maintain completeness of the data via the parallel R-calls. The consistency of the results is guaranteed as long as there is no dependency between the computations for individual observations in the dataset.
Performance measurement: With this scenario, an execution time of 2.444 seconds was observed. This represents a 72.54% performance gain over the default benchmark scenario. Measured against the benchmark time, it is roughly 43 percentage points better than the parallel R-call scenario in section 1, and 24.26 percentage points better than the parallel R-runtime execution (with parallel R libraries) example in section 2. A fantastic result supporting the case for parallelization.
Concluding Remarks
The purpose of this document is to illustrate how techniques of parallelization can be implemented to address performance-related bottlenecks within the default integration setup between SAP HANA and R. The document presented 3 parallelization options one could consider:
- Trigger parallel R-calls from HANA
- Use parallel R libraries to parallelize the R-execution
- Parallelize R-calls across multiple R-hosts.
With parallel R libraries you can improve the performance of a triggered R-process execution by spawning additional R-runtime instances on the R-host (see section 2). You can either parallelize by data (split the dataset computation across multiple R-runtimes) or by task (split the algorithmic computation across multiple R-runtimes). A good understanding of the nature of the data and the algorithm is, therefore, fundamental to choosing how to parallelize. When executing parallel R-runtimes using R libraries, we should remember that there is an additional setup overhead incurred by the system when spawning child (worker) R-processes and terminating them. The benefits of parallelism using this option should, therefore, be assessed through prior testing in an environment similar to the productive environment in which it will eventually run.
On the other hand, when using the parallel R-calls option, no additional overhead is incurred on the overall performance. This option provides a means to increase the number of data-transmission lanes between HANA and the R-host, and it allows us to spawn multiple parent R-runtime processes on the R-host. Exploiting this option led to the following key finding: the data-transfer latency between HANA and R can, in fact, be significantly reduced by splitting the dataset in HANA and then parallelizing the transfer of each subset from HANA to R using parallel R-calls (as illustrated in section 4).
Source: scn.sap.com