Overview
Recently, one of my customers had an interesting data movement problem. They needed to extract 150+ tables from a traditional RDBMS, some with 500M+ rows, and move them to an SAP HANA instance. This would seem to be a straightforward exercise, but there were some gotchas:
1. The volume of the data was large and the network bandwidth between the servers was restricted due to WAN limitations.
2. The goal was to move and import all the data into their SAP HANA instance in less than one hour.
3. The only open port between the two servers was SSH (port 22), and this could not be changed due to security policies.
To address the volume and bandwidth issues, it was clear that compression would be necessary. To meet the one-hour loading window, some form of parallelism would be needed. The last limitation, SSH only, suggested a flat file copy using secure copy (SCP) between the machines. The flat file format for this project ended up being CSV.
Architecture
The customer’s source database was an Oracle 12c instance and the target was an SAP HANA 2.0 SPS03 on-premise instance. Since the Oracle RDBMS has a powerful job scheduler along with the ability to execute operating system scripts, the architecture of this process seemed pretty straightforward: use the job scheduler to process a list of tables asynchronously in as many parallel jobs as were reasonable on the source system.
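For readers unfamiliar with running operating system work from inside Oracle, the shape of the idea is shown below. This is only a sketch; the script path and job name are hypothetical placeholders, not the tool's actual code.

    BEGIN
      -- Hypothetical example: one scheduler job per parallel worker. Each worker
      -- runs an operating system script that compresses the exported CSV files
      -- and copies them to the HANA host over SSH/SCP.
      DBMS_SCHEDULER.CREATE_JOB(
        job_name   => 'CSV_EXPORT_WORKER_1',
        job_type   => 'EXECUTABLE',
        job_action => '/home/oracle/csv_exporter/push_files.sh',  -- placeholder path
        enabled    => TRUE,
        auto_drop  => TRUE,
        comments   => 'Compress CSV output and SCP it to the SAP HANA host');
    END;
    /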
For this proof of concept, two very small virtual machines (4 cores, 16GB of RAM) were used. Even with these very limited resources, the tool moved and loaded 50M rows in 4 minutes, including creating the necessary schema objects on the SAP HANA instance.
Note: the customer did not require transactional integrity across the source tables while the export was running.
Implementation
OK, the architecture pretty much described itself, but what’s the best way to get the data out of the source database? There are lots of examples of CSV dumping procedures on the Internet, but specific information on handling SAP HANA data type translations and, most importantly, BLOB handling was hard to find.
In the PL/SQL code installed on the source Oracle database, there is a specific procedure called, not surprisingly, dump_table_to_csv that dumps the contents of a table to CSV format. This procedure was created specifically to handle translations to appropriate SAP HANA data types. To verify BLOBs were being handled correctly, a 4MB Microsoft PowerPoint file was dumped with this procedure and the output matched a native SAP HANA CSV export of the same content.
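One common way to represent binary data in a CSV field is a hexadecimal string, written out in chunks so that arbitrarily large BLOBs can be streamed. The fragment below is only an illustrative sketch of that approach, not the code from dump_table_to_csv, and it assumes the CSV file was opened in byte mode ('wb') so UTL_FILE.PUT_RAW can be used.

    -- Sketch of chunked BLOB-to-hex output (illustrative only; see the
    -- repository's dump_table_to_csv for the real implementation).
    PROCEDURE write_blob_as_hex(p_file IN OUT UTL_FILE.FILE_TYPE, p_blob IN BLOB) IS
      c_chunk CONSTANT PLS_INTEGER := 2000;                -- bytes read per pass
      l_len   PLS_INTEGER := DBMS_LOB.GETLENGTH(p_blob);
      l_pos   PLS_INTEGER := 1;
    BEGIN
      WHILE l_pos <= l_len LOOP
        -- each 2000-byte chunk becomes 4000 hex characters in the CSV field
        UTL_FILE.PUT_RAW(p_file,
          UTL_RAW.CAST_TO_RAW(RAWTOHEX(DBMS_LOB.SUBSTR(p_blob, c_chunk, l_pos))));
        l_pos := l_pos + c_chunk;
      END LOOP;
    END write_blob_as_hex;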
Note: for this proof of concept, the tool does not handle exotic types such as spatial columns and nested tables. These could easily be added but were not required at this time.
To get started, specific privileges need to be granted on both the source Oracle database and the target SAP HANA instance. On the source side, the ability to write files to the operating system was clearly a must (UTL_FILE). The ability to run operating system level scripts was also necessary (DBMS_SCHEDULER, job_type=EXECUTABLE). See the create-users.txt file for details on all the configuration and grants needed on the source and target databases.
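The exact statements are in create-users.txt; in broad strokes, the Oracle-side setup looks something like the following (user, directory, and path names here are placeholders):

    -- Run as a DBA on the source database. Names are placeholders.
    CREATE OR REPLACE DIRECTORY csv_export_dir AS '/u01/csv_export';
    GRANT READ, WRITE ON DIRECTORY csv_export_dir TO csv_exporter;
    GRANT EXECUTE ON UTL_FILE TO csv_exporter;     -- write the CSV files
    GRANT CREATE JOB TO csv_exporter;              -- DBMS_SCHEDULER jobs
    GRANT CREATE EXTERNAL JOB TO csv_exporter;     -- job_type => 'EXECUTABLE'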
With the necessary privileges nailed down, the next step was to build the asynchronous process. It turned out to be an interesting problem: how to ensure 100% utilization of the parallel resources with no “blocking” operations. This is where the scheduling aspect of the source database was the right answer.
To manage the scheduler, we created two tables. The first is a “job” definition table that includes information about the local Oracle environment and the target HANA instance, including SSH details like hostname and Linux user. The second table contains the list of tables chosen to be exported. This list is fed sequentially to the parallel jobs, which keeps every job at 100% utilization exporting tables. Once a job is defined, it can be executed any number of times.
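Conceptually, the two tables look something like the following. The column names are illustrative only; the real DDL ships with the tool.

    -- Illustrative shapes only; see the tool's install scripts for the real DDL.
    CREATE TABLE csv_export_jobs (
      job_name        VARCHAR2(30) PRIMARY KEY,
      parallel_degree NUMBER,            -- how many scheduler jobs to run
      export_dir      VARCHAR2(128),     -- Oracle DIRECTORY holding the CSV files
      hana_host       VARCHAR2(128),     -- SSH/SCP target
      hana_os_user    VARCHAR2(30)       -- Linux user on the HANA host
    );

    CREATE TABLE csv_export_tables (
      job_name   VARCHAR2(30) REFERENCES csv_export_jobs (job_name),
      owner      VARCHAR2(30),
      table_name VARCHAR2(30),
      status     VARCHAR2(10) DEFAULT 'PENDING'   -- claimed by the next free worker
    );

A common way to hand the next table to whichever worker is free, without one worker ever waiting on another, is Oracle's FOR UPDATE SKIP LOCKED. The sketch below shows that general pattern rather than the tool's internal logic.

    DECLARE
      -- Each worker claims the first pending table no other worker has locked.
      CURSOR c_next IS
        SELECT owner, table_name
          FROM csv_export_tables
         WHERE job_name = 'NIGHTLY_MOVE'
           AND status   = 'PENDING'
           FOR UPDATE SKIP LOCKED;
      l_owner csv_export_tables.owner%TYPE;
      l_table csv_export_tables.table_name%TYPE;
    BEGIN
      OPEN c_next;
      FETCH c_next INTO l_owner, l_table;
      IF c_next%FOUND THEN
        UPDATE csv_export_tables
           SET status = 'RUNNING'
         WHERE CURRENT OF c_next;
        COMMIT;   -- releases the lock; the worker now exports l_owner.l_table
      END IF;
      CLOSE c_next;
    END;
    /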
Running the tool
To define a job, there is a simple API to create a job, assign individual tables or entire schemas to it, and stop a running job. Refer to the CSV_EXPORTER.pls file for details on the API. The key decision is how many parallel threads to use for a job: this should be a number smaller than the number of cores on the source machine to avoid overrunning the source server.
The sample-job.sql script in the tests folder shows an example of creating a job, adding a schema, and running the new or re-created job.
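The call sequence in sample-job.sql follows roughly this shape. The procedure and parameter names below are illustrative stand-ins; the exact signatures are in CSV_EXPORTER.pls.

    -- Illustrative only; exact procedure and parameter names are in CSV_EXPORTER.pls.
    BEGIN
      csv_exporter.create_job(p_job_name => 'NIGHTLY_MOVE', p_parallel => 3);
      csv_exporter.add_schema(p_job_name => 'NIGHTLY_MOVE', p_owner    => 'SALES');
      csv_exporter.run_job(p_job_name => 'NIGHTLY_MOVE');
    END;
    /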
When a job is started with the run_job procedure, a series of scheduler jobs is created to start processing the list of tables. The tool orders the tables from largest to smallest based on the current optimizer statistics for each table; this could have been done any number of ways and we may revisit it with the customer.
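Ordering by statistics can be as simple as a join against the data dictionary, along the lines of the sketch below. The table names reuse the illustrative schema from above, and DBA_TABLES.NUM_ROWS is only as accurate as the last statistics gathering.

    -- Sketch: feed a job's tables to the workers largest-first, by row-count stats.
    SELECT t.owner, t.table_name
      FROM csv_export_tables x
      JOIN dba_tables t
        ON t.owner = x.owner
       AND t.table_name = x.table_name
     WHERE x.job_name = 'NIGHTLY_MOVE'
     ORDER BY t.num_rows DESC NULLS LAST;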