1. Introduction
SAP HANA Smart Data Integration (SDI) can serve various needs when it comes to data extraction, loading and transformation. In terms of data loading flavors, it supports real-time replication as well as batch processing, both push- and pull-based.
In this blog entry I would like to outline one option for pull-based micro-batching with SDI in a simplified use case. The showcase leverages the HANA XSC runtime (not state of the art, but still in use out there) and therein mostly design-time objects. Some material data sets will be loaded in a batch-based manner, using scheduled XS cron jobs that offer some extended control options.
2. Context
In our use case, material entities have to be loaded from a remote source every minute, hourly or on a daily basis. There is no simple way to replicate the data in real time or via standard batch replication. This might be due to various reasons:
◈ No CDC possible/allowed in source
◈ Missing authorizations in source
◈ Compliance issues
Therefore the changed data must be identified with some basic custom logic.
The prerequisite is date/time-based creation and/or change attributes. These act as markers used to scope the next delta data set. As the tables involved are MARA and MARC, a creation date and a change date are available. Finally, an XS cron job triggers the execution of the data flow at a defined frequency.
3. HANA/SDI Toolset
We will use a simple SDI flowgraph in combination with some stored procedures to implement the requirements. The flowgraph comprises two data sources (MARA and MARC) which are joined, filtered and pushed into a data sink, the target table. The source tables can be based on a remote source of any adapter; the approach is generally applicable to such ELT flows.
A variable enables filtering the data changed since the last successful load. Stored procedures take over to look up the load date of the last successful HANA runtime task execution. The actual execution of the flowgraph task is triggered by a HANA XS job. Within the XS job definition you can decide between an initial and a delta load.
The following visual outlines the described steps.
4. Limitations
◈ Deletions in the source tables are not reflected. The reason is the change pointer employed, i.e. the creation or change date. With this approach there is no way to identify deletions, let alone apply them to the target using the flowgraph. It is assumed that records, once applied, are kept in the target.
5. Implementation/Data Flow Modeling
5.1 FlowGraph
First and foremost, a variable (varDate) of type expression with some default value is introduced.
The simplified data flow looks as follows:
◈ Two data sources: ERP tables MARA and MARC
◈ FILTER_DELTA node: subset of MARA columns + filter logic using the varDate variable
◈ JOIN_MARA_MARC: inner join of the two tables
◈ MATERIALS_TEMPLATE_TABLE: target table
The filter node applies some simple filtering logic: it filters on the creation date (ERSDA) or the change date (LAEDA).
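Logically, the FILTER_DELTA and JOIN_MARA_MARC nodes correspond to a statement roughly like the sketch below. The column selection is illustrative only, and the exact syntax for referencing the varDate variable depends on the flowgraph editor; here it is shown as a plain placeholder.

SELECT mara."MATNR", mara."ERSDA", mara."LAEDA", marc."WERKS" -- further columns as required
  FROM "MARA" AS mara
 INNER JOIN "MARC" AS marc
    ON marc."MATNR" = mara."MATNR"
 WHERE mara."ERSDA" >= :varDate  -- created since the last successful load
    OR mara."LAEDA" >= :varDate; -- or changed since the last successful load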
The writer type of your target/template table should be set to upsert; otherwise you might run into unique constraint violations.
5.2 Stored Procedures
SP_loadDateLookup
The stored procedure SP_loadDateLookup returns the date of the last successful execution of a HANA runtime task. The task name is defined as an input parameter. Alternatively, and depending on your requirements, you can define the output parameter as date, time or timestamp.
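A minimal sketch of such a lookup, assuming the task execution monitoring view "_SYS_TASK"."TASK_EXECUTIONS" and a STATUS value of 'COMPLETED' (column and status names may differ in your HANA revision), could look as follows:

PROCEDURE "SYSTEM"."sdi.prototyping::SP_loadDateLookup" ( IN IV_FGNAME VARCHAR(256), OUT OV_LOAD DATE )
    LANGUAGE SQLSCRIPT SQL SECURITY INVOKER READS SQL DATA AS
BEGIN
    -- date of the latest successfully completed execution of the given task;
    -- fall back to a date far in the past if the task never ran successfully
    SELECT IFNULL(MAX(TO_DATE("START_TIME")), TO_DATE('1950-01-01'))
      INTO OV_LOAD
      FROM "_SYS_TASK"."TASK_EXECUTIONS"
     WHERE "TASK_NAME" = :IV_FGNAME
       AND "STATUS"    = 'COMPLETED';
END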
SP_loadMaterial
The stored procedure SP_loadMaterial triggers the execution of the HANA runtime task of the SDI flowgraph. An initial load flag is defined as an input parameter. This flag provides a flexible way to control the execution of the data flow and to decide between an initial load (where the target table is truncated first) and a delta load (where the variable and the last successful load date are considered and no truncation takes place).
Your stored procedure to trigger respective task executions may look as follows:
PROCEDURE "SYSTEM"."sdi.prototyping::SP_loadMaterial" ( IN initialFlag VARCHAR(1))
LANGUAGE SQLSCRIPT SQL SECURITY INVOKER AS
BEGIN DECLARE v_date DATE; DECLARE v_dateChar VARCHAR(256);
BEGIN AUTONOMOUS TRANSACTION
IF (initialFlag = 'X') THEN
DELETE FROM "SYSTEM"."MATERIALS";
END IF;
END;
IF (initialFlag = 'X') THEN
v_dateChar := '19500101';
ELSE
--get latest successfull execution date
CALL "SYSTEM"."sdi.prototyping::SP_loadDateLookup"(IV_FGNAME => 'sdi.prototyping::FG_MATERIAL', OV_LOAD => v_date);
v_dateChar := REPLACE(TO_VARCHAR(v_date), '-', '');
END IF;
EXEC 'START TASK "SYSTEM"."sdi.flowgraphs::FG_MATERIAL" ("varDate" => ''' || v_dateChar || ''' )';
END
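For manual testing, the procedure can be called directly, independently of the XS job (a usage sketch based on the objects above):

-- initial load: truncate the target table and load all records
CALL "SYSTEM"."sdi.prototyping::SP_loadMaterial"('X');

-- delta load: only records created or changed since the last successful run
CALL "SYSTEM"."sdi.prototyping::SP_loadMaterial"('');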
5.3 XS Cron Job
In order to schedule the execution of the SP_loadMaterial procedure, an XS cron job is created. The scheduler.xsjs file defines the functions that the .xsjob file triggers. In the current use case, the functions propagate the initial load flag to the procedure call, so that you can decide within the job definition whether to run an initial or a delta load.
The following shows one way to implement the scheduler.xsjs file:
function loadMaterial(initialLoadFlag) {
    // normalize the flag: any non-blank value triggers an initial load ('X'),
    // everything else results in a delta load ('')
    if (initialLoadFlag && initialLoadFlag !== '' && initialLoadFlag !== ' ') {
        initialLoadFlag = 'X';
    } else {
        initialLoadFlag = '';
    }

    var query = "{CALL \"SYSTEM\".\"sdi.prototyping::SP_loadMaterial\"('" + initialLoadFlag + "')}";
    $.trace.debug(query);

    var conn = $.db.getConnection();
    var pcall = conn.prepareCall(query);
    pcall.execute();
    pcall.close();
    conn.commit();
    conn.close();
}
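The accompanying .xsjob definition could look roughly like the following sketch; the package path, descriptions and xscron pattern are assumptions to be adapted to your environment (this example schedules an hourly delta load by passing an empty flag):

{
    "description": "Batch load of material data via SDI flowgraph",
    "action": "sdi.prototyping:scheduler.xsjs::loadMaterial",
    "schedules": [
        {
            "description": "Hourly delta load (empty flag = no initial load)",
            "xscron": "* * * * * 0 0",
            "parameter": { "initialLoadFlag": "" }
        }
    ]
}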
From the XS admin jobs section you can decide between initial and delta load (URL: <host>:<port>/sap/hana/xs/admin/jobs):
6. Alternative Approaches/Thoughts
◈ The table comparison transform offers another approach to implementing batch processing with HANA SDI. The clear downside I see is the need to introduce DB sequences and drop the real primary keys. This may have consequences on the data modeling/consumption side of those tables, so it is debatable which option to choose. However, the table comparison transform is capable of also reflecting delete operations, which the given scenario does not cover (as stated in the limitations section).
◈ Depending on how frequently you want to process batches, a finer granularity can be achieved using date + time. In that case, change the lookup against the task execution table to return e.g. a timestamp of the last successful load instead of just the date (see the sketch after this list).
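A minimal sketch of that timestamp-based lookup, again assuming the "_SYS_TASK"."TASK_EXECUTIONS" monitoring view and the task name used above:

-- timestamp (instead of just the date) of the latest successful run
SELECT MAX("START_TIME")
  FROM "_SYS_TASK"."TASK_EXECUTIONS"
 WHERE "TASK_NAME" = 'sdi.prototyping::FG_MATERIAL'
   AND "STATUS"    = 'COMPLETED';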