When batch-loading data from a source system into your Data Mart or Data Warehouse, you usually look for ways to reduce the to-be extracted data set to minimize extraction time and resource consumption. Ideally, there is a way to only extract changed records, for example, using a change date column. However, this does not always fly. For example, change date columns are not always reliable or deletions are not always tracked in the source table. An alternative would be to switch from batch to log or trigger replication on the source, but you don’t always have this luxury. So, now you have to load the entire source data set, but still you want to identify the change data, to only process delta’s in downstream data flows. How can you identify such change data using batch loading?
In this blog, I will explain how you can identify the delta data set using Smart Data Integration (SDI) flowgraphs inside the SAP Web IDE for HANA. A flowgraph is used that writes to a so-called changelog table. In this blog, only the changelog table is filled. Once you have the change data, it’s up to you how to use it in downstream data flows.
For this blog, version SPS04 Patch 0 of the Web IDE for HANA was used. In earlier or later versions, or in the Web IDE Full Stack for CF, UI or behaviour might differ from what is described here.
The following use case will be used, which was recently described to me by a customer:
◈ The source table is
◈ in an external database, connected to the SAP HANA SQL DW via SDI.
◈ rather large, but there is no field that tells us if data has been changed or deleted.
◈ The comparison table (the table that the source table needs to be compared with)
◈ tracks the history of all changes using validity timestamps
◈ has different column names than the source
◈ is populated by a tool that expects only changed records as input
You might have wondered if the Native Data Store Object (NDSO) would have been a candidate to solve this issue. After all, the NDSO provides a delta-mechanism out of the box. However, it is the three characteristics of the target table that discard that option, and therefore we solve this use case using SDI flowgraph transforms which cost more time to build, but allow for more flexibility.
The source table is defined as follows:
Figure 12: Changelog table contents as result of test run
In this blog, I will explain how you can identify the delta data set using Smart Data Integration (SDI) flowgraphs inside the SAP Web IDE for HANA. A flowgraph is used that writes to a so-called changelog table. In this blog, only the changelog table is filled. Once you have the change data, it’s up to you how to use it in downstream data flows.
For this blog, version SPS04 Patch 0 of the Web IDE for HANA was used. In earlier or later versions, or in the Web IDE Full Stack for CF, UI or behaviour might differ from what is described here.
Use case description
The following use case will be used, which was recently described to me by a customer:
◈ The source table is
◈ in an external database, connected to the SAP HANA SQL DW via SDI.
◈ rather large, but there is no field that tells us if data has been changed or deleted.
◈ The comparison table (the table that the source table needs to be compared with)
◈ tracks the history of all changes using validity timestamps
◈ has different column names than the source
◈ is populated by a tool that expects only changed records as input
You might have wondered if the Native Data Store Object (NDSO) would have been a candidate to solve this issue. After all, the NDSO provides a delta-mechanism out of the box. However, it is the three characteristics of the target table that discard that option, and therefore we solve this use case using SDI flowgraph transforms which cost more time to build, but allow for more flexibility.
Source and comparison table
The source table is defined as follows:
Figure 1: Source table
The comparison table (the table that the source table needs to be compared with) is defined as follows. As you can see, column “ValidityFrom” is part of the key, and serves tracking the history of the data set together with the “ValidityTo” column. Also, there is a field “SalesRename”, which should be compared to field “Sales” of the source table.
Figure 2: Comparison table
Creating the changelog table and the flowgraph
First, we need to define a changelog table that stores the change data. Initially, only a table is needed that identifies the changed data for each individual run, so we do not need to store changelog history.
The changelog table is defined as follows. It has a similar structure to the source, except for the renamed field, and the two added fields that indicate the Change Type (I for Insert, U for Update, D for Delete) and the ChangeTime (a timestamp of when the flowgraph detected the changes). Note that the flowgraph will expect such fields if you want to construct a changelog, although you are free to change the ChangeTime column to a date format or leave it out completely and only stick with the ChangeType column. I prefer to have the ChangeTime column to know for sure which results I am looking at in the changelog table.
Figure 3: The changelog table that should contain “Change Data”
The flowgraph needs four nodes, as you can see in the below illustration.
Figure 4: Main view of the created flowgraph
The first node is the data source, which does not need further explanation. The second node serves to rename the incoming fields to match with the table we want to compare with. Since in our source we have a field “Sales” and in the table to compare with we have a field “SalesRename”, the “Sales” field is renamed to “SalesRename”. Please note that any renames should be done at this point. You cannot rename fields in the Table Comparison node itself, and the Table Comparison node expects fields from the source and compare table to be the same.
Figure 5: Rename of a field in the Projection Node
In the table comparison node on the source screen, first select the table that you want to compare with. Once you do this, the fields of that table are displayed to the right (which has no further use than to inform you on the definition of that table). The “Generated Key Attribute” is left empty. This option is only necessary if you expect that the comparison table has duplicate entries on the compare criteria, and the compare table has a surrogate key. In that case, the record with the largest surrogate key is chosen to compare with.
Since the comparison table tracks history of the data set, the comparison should only take place on records that are “current”. In the example case, these are the records of which the ValidityTo timestamp is equal to ‘9999-12-31 23:59:59’. Also, since the source can delete records, and these should be tracked, the “Deleted rows detection” checkbox is checked. Please note that this option will slow down the operation, so only use it if necessary.
Figure 6: Comparison table settings in the Table Comparison node
In the attributes section, by default no fields are shown, which means that the comparion will take place on all fields of the source table. In case you do not want to compare all those fields, you can add the ones needed for comparison. The less fields to compare with, the faster the operation will be.
The Primary Key checkbox represent the fields on which the comparison is based. In the example use case, this is the SalesOrderID, so the default selection is correct.
Figure 7: Source table settings in the Table Comparison node
In the target node, the ChangeType and ChangeTime columns are left empty. They will be filled in the Settings panel.
Figure 8: Changelog table mapping in the Target node
In the Settings panel, the “Truncate Table” checkbox is checked, because the changelog should only store the records from single flowgraph executions. The Writer Type should be set to “Insert”, because Updates and Deletes should actually be stored as individual records, and not treated as an actual Update or Delete.
The Change Type Attribute and Change Time Attribute, obviously, are mapped to the ChangeType and ChangeTime column.
In this example, we are not tracking the full changelog history. If you would like to, you should make sure the changelog table contains a surrogate key column, which you then map to the “Key Generation Attribute” field. Also, you should create a sequence using an hdbsequence file, and map the sequence in the “Key Generation Sequence” field. Don’t forget to uncheck the “Truncate Table” checkbox.
Figure 9: Writer settings in the Target node
Test the flowgraph – example input and output
To establish a starting point for testing the flowgraph, records are inserted to the source table, as well as the comparison table. In the comparison table, one business record has two versions, of which only one is current.
TRUNCATE TABLE "deltas_for_fulls::SalesOrderHeader_TableCompare_HistoryTarget";
TRUNCATE TABLE "deltas_for_fulls::SalesOrderHeader";
INSERT INTO "deltas_for_fulls::SalesOrderHeader" VALUES (1,'13.10.2010','Muhammed MacIntyre','Eldon Simple','6','261.54','United States of America');
INSERT INTO "deltas_for_fulls::SalesOrderHeader" VALUES (2,'01.10.2012','Barry French','Cardinal Regular','2','199.99','Germany');
INSERT INTO "deltas_for_fulls::SalesOrderHeader" VALUES (3,'21.06.2019','Zeph Skater','SomeBoard','30','300.00','Spain');
INSERT INTO "deltas_for_fulls::SalesOrderHeader" VALUES (4,'10.07.2011','Clay Rozendal','R380','30','4965.76','Luxembourg');
INSERT INTO "deltas_for_fulls::SalesOrderHeader_TableCompare_HistoryTarget" VALUES (1,'12.06.2019 00:00:00','31.12.9999 23:59:59','13.05.2019','Muhammed MacIntyre','Eldon Base for stackable storage shelf platinum','6','261.54','United States of America');
INSERT INTO "deltas_for_fulls::SalesOrderHeader_TableCompare_HistoryTarget" VALUES (2,'12.06.2019 00:00:00','31.12.9999 23:59:59','14.05.2019','Barry French','Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl','27','244.57','Brazil');
INSERT INTO "deltas_for_fulls::SalesOrderHeader_TableCompare_HistoryTarget" VALUES (3,'01.05.2018 00:00:00','11.06.2019 23:59:59','11.06.2019','Zeph Skater','Shortboard','40','400.00','Colombia');
INSERT INTO "deltas_for_fulls::SalesOrderHeader_TableCompare_HistoryTarget" VALUES (3,'12.06.2019 00:00:00','31.12.9999 23:59:59','11.06.2019','Zeph Skater','Longboard','30','300.00','Colombia');
The source table now has the following content:
Figure 10: Source table contents for test run
The comparison table now has the following content, and comparing it with the source, you should see that the difference consists of one insert, two updates, and one deletion.
Figure 11: Comparison table contents for test run
In the results after running the flowgraph, you can see the captured change data.
No comments:
Post a Comment