Wednesday, 4 November 2020

Why are scalar UDFs so slow?

When loading data from source systems, there are requirements in SAP BW that are often repeated at the field level. This includes, above all, logic for cleaning up and processing the data, such as:

◉ Add and remove leading zeros

◉ Case distinctions

◉ Remove spaces at the beginning or end

◉ Conversion to capital letters

◉ Derive one field from another with fixed pattern sub-strings, e.g. COAREA ==> COUNTRY

◉ Derivation of time characteristics

The following listing shows a few examples in the source code:

outTab = SELECT TO_VARCHAR(TO_DATE(calday), 'YYYYMM') AS calmonth,

                "/BIC/SRCSYS",

                ...

                "/BIC/CURTYP",

                LEFT("/BIC/COAREA", 2) AS "/BIC/COUNTRY",

                SUBSTRING("/BIC/BI_PROFCT", 9, 2) as "/BIC/PCACCID",

                CASE SUBSTRING("/BIC/PROFCT", 8, 3)

                  when '643' then '1' 

                  when '655' then '1'

                  when '641' then '2' 

                  when '651' then '2'

                  when '643' then '3' 

                  when '655' then '3'

                  else ''

                end as  "/BIC/PRBY',

                ...

           FROM :intab;

Typical expressions in the field list of a transformation routine

What all these requirements have in common is that they

◉ can be easily implemented in SQLScript using the existing SQL functions

◉ repeat themselves often

◉ can be executed very quickly by SAP HANA as long as they are implemented directly in the transformation routines.

DRY – Don’t Repeat Yourself

The well-known DRY principle should actually come into play here. So that you don’t repeat the source code all the time. In the field list, the scalar, user-defined functions (UDF) are available for this purpose. Until recently, these were not available for the transformation routines in the BW, because they can only be created with the AMDP Framework since AS ABAP Release 753 . But scalar UDFs would be ideal for precisely these requirements. This ensures a uniform implementation. And in the source code you can move from a technical description of the expressions to a business view:

outTab = SELECT "ZCL_CALDAY=>TO_CALMONTH"(calday) AS calday

                "/BIC/SRCSYS",

                ...

                "/BIC/CURTYP",

                "ZCL_COAREA=>TO_COUNTRY"("/BIC/COAREA") AS "/BIC/COUNTRY",

                "ZCL_PROFCT=>TO_PCACCID"(/BIC/PROFCT") as "/BIC/PCACCID",

                "ZCL_PROFCT=>TO_PRBY"(/BIC/PROFCT") as "/BIC/PRBY',

                ...

           FROM :intab;

Outsourcing the expressions to a scalar UDF

The outsourced functions look much more elegant. And they are also superior from the point of view of maintenance. For example, if another profit center appears in the list, there is exactly one function that needs to be adjusted. Using the example of ZCL_PROFCT=>TO_PRBY I show such a scalar AMDP function:

CLASS zcl_profct DEFINITION

  PUBLIC

  FINAL

  CREATE PUBLIC .

  PUBLIC SECTION.

    INTERFACES if_amdp_marker_hdb.

    METHODS to_prby IMPORTING VALUE(iv_profct) TYPE char10

                    RETURNING VALUE(rv_prby)   TYPE char1.

ENDCLASS.

CLASS zcl_profct IMPLEMENTATION.

  METHOD to_prby BY DATABASE FUNCTION FOR HDB LANGUAGE SQLSCRIPT OPTIONS DETERMINISTIC READ-ONLY.

    rv_prby = case SUBSTRING(:iv_profct, 8, 3)

                  WHEN '643' then '1'

                  WHEN '655' then '1'

                  WHEN '641' then '2'

                  WHEN '651' then '2'

                  WHEN '643' then '3'

                  WHEN '655' then '3'

                  ELSE ''

                END ;

  ENDMETHOD.

ENDCLASS.

The example shows that the function is just a simple wrapper for a CASE expression. By the way, outsourcing to the ABAP world also makes it easy to write UnitTests for the functions. Since the functions in the database are only generated from the ABAP on the first call, UnitTests are also suitable for generating them.

Elegant but slow

As elegant as the outsourcing of logic in UDFs is, the concept is unfortunately not to be used for large amounts of data. This is because the UDFs have a significant influence on the runtime. I would like to show this in another expression from the example above:

TO_VARCHAR(TO_DATE(calday), 'YYYYMM')

This simple expression converts an ABAP DATS value into a SQLScript date, which is then output to a YYYYMM string. This corresponds to the format of the popular InfoObject CALMONTH.

Example: 20200928 ==> 202009

For this purpose, I create an AMDP UDF according to the above pattern:

  METHOD to_calmonth BY DATABASE FUNCTION FOR HDB LANGUAGE SQLSCRIPT OPTIONS DETERMINISTIC READ-ONLY.

    rv_result = TO_VARCHAR(TO_DATE(:iv_calday), 'YYYYMM');

  ENDMETHOD.

I made two queries in the SQL console on a small BW table with about 3 million entries: one directly with the expression and one with the function call.

Without UDF

do begin

    select budat, 

           TO_VARCHAR(TO_DATE(budat), 'YYYYMM'),

           account, 

           amount, 

           curr

           from "/BIC/AZBR_E1_S2";     

end;

Console output

Statement 'do begin select budat, TO_VARCHAR(TO_DATE(budat), 'YYYYMM'), account, amount, curr from ...'

successfully executed in 501 ms 237 µs (server processing time: 801 ms 664 µs)

Fetched 1000 row(s) in 102 ms 468 µs (server processing time: 0 ms 677 µs)

Result limited to 1000 row(s) due to value configured in the Preferences

With UDF

do begin

    select budat, 

           "ZCL_CALDAY=>TO_CALMONTH"(BUDAT),

           account, 

           amount, 

           curr

           from "/BIC/AZBR_E1_S2";     

end;

Console output:

Statement 'do begin select budat, "ZCL_CALDAY=>TO_CALMONTH"(BUDAT), account, amount, curr from ...'

successfully executed in 2:09.418 minutes (server processing time: 4:18.632 minutes)

Fetched 1000 row(s) in 103 ms 144 µs (server processing time: 0 ms 675 µs)

Result limited to 1000 row(s) due to value configured in the Preferences

The runtime is so different that you actually have to assume a bug. Because the complexity of the example is minimal. With less complexity, I don’t need UDF anymore. What I find amazing:

◉ In fact, there are only 365 different values of BUDAT in the table. Thus, the system has the information: With the keyword DETERMINISTIC the system has the information: The same input should create the same output.

◉ I would also have expected only the dictionary of the column to be converted, so each value exactly once.

◉ If you had replaced the function call with the function content with a primitive text editor with simple search/replacement, you are over 100 times faster. This is what the optimizer of an SAP HANA 7 years after the introduction of scalar UDFs should be able to do, right? 

In order to fully document the whole thing, I also did the analysis with PlanViz.

SAP HANA Exam Prep, SAP HANA Tutorial and Material, SAP ABAP Guides, SAP HANA Study Material
Planviz of the Query without UDF


SAP HANA Exam Prep, SAP HANA Tutorial and Material, SAP ABAP Guides, SAP HANA Study Material
PlanViz of the Query with UDF

It is clearly visible that, by and large, the same implementation plan is chosen. But the runtime of the one, red-marked JECalculate node contains the entire runtime for the UDF call. This should at least be parallelized.

I also stumbled across the HINTS. The HINT (INLINE) is supposed to trigger exactly the desired behavior: Optimization should be done globally. But unfortunately there is nothing in the documentation about functions. Everything refers to procedures. And that corresponds to my observation: the hint brings absolutely no change. It remains slow.

do begin
    select budat, 
           "ZCL_CALDAY=>TO_CALMONTH"(BUDAT),
           account, 
           amount, 
           curr
           from "/BIC/AZBR_E1_S2"
           WITH HINT(INLINE);     
end;

No comments:

Post a Comment