Overview
Ever wondered how to load a data file from your local machine into HANA Cloud, Data Lake?
Then you are at the right place. This blog provides a step-by-step guide showing how anyone can easily load data from their local machine into HANA Cloud, Data Lake using the Data Lake File store.
Step-by-step process:
Firstly, one must provision a Data Lake instance from SAP HANA Cloud Central through the SAP BTP Cockpit. You can learn how to do so by going through the following tutorial – Provision a Standalone Data Lake in SAP HANA Cloud | Tutorials for SAP Developers
If you haven’t had a chance to try out SAP HANA Cloud and its built-in data lake, you can do that now by signing up for the Free trial – https://www.sap.com/cmp/td/sap-hana-cloud-trial.html
(This tutorial teaches you how to provision a standalone Data Lake instance. You can also provision a HANA-managed data lake (HDLRE) – it is just one additional step when provisioning a HANA instance: simply select the data lake option while doing so.) You could also watch the video – HDL_DEMO_pt1.mp4 (sapjam.com) – to get an overview of provisioning a Data Lake instance in Cloud Central and connecting the Data Lake instance in the SAP HANA Database Explorer.
So, in this blog, we will first learn how to upload a local data file to the HANA Data Lake file store and then load it from there into a SQL table in HANA Data Lake.
Step 1:
Firstly, one needs to download the SAP HANA Data Lake Client. It can be downloaded from Software Downloads – SAP ONE Support Launchpad. Download the most recent build of the Data Lake Client.
Unzip the folder, run the setup.exe installer, and proceed with installing the Data Lake Client on your machine.
Step 2:
To make a secure connection between the HDLFSCLI and a HANA Data Lake file container, a few certificates must be generated. Follow step 3 in Getting Started with Data Lake Files HDLFSCLI | Tutorials for SAP Developers to generate the certificates, and follow step 4 to add the trust and authorizations in your HDL (HANA Data Lake) instance.
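For orientation only, the certificate setup in that tutorial boils down to creating a small certificate authority and a client certificate signed by it. The sketch below is an assumption of how that can be done with openssl; the file names and subject values are illustrative, so follow the tutorial for the exact steps and the values your instance expects.
# Create a certificate authority: private key plus self-signed CA certificate (illustrative names)
openssl genrsa -out ca.key 2048
openssl req -x509 -new -key ca.key -days 365 -out ca.crt -subj "/CN=MyHDLFSCA"
# Create a client key and a certificate signing request
openssl genrsa -out client.key 2048
openssl req -new -key client.key -out client.csr -subj "/CN=hdlfsclient"
# Sign the client certificate with the CA
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial -days 365 -out client.crt
The resulting ca.crt, client.crt, and client.key are the files referenced by the HDLFSCLI commands in the next step.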
Step 3:
To put files into the file store, we will use a tool known as the HANA Data Lake File Store command line interface, or HDLFSCLI for short.
We will now verify the configuration we did in Step 2.
Note: The following command won't return anything in the output, since the file container is still empty.
hdlfscli -cert <PATH>\client.crt -key <PATH>\client.key -cacert <PATH>\ca.crt -k -s https://<REST API Endpoint> -filecontainer <Instance ID> ls
The <REST API Endpoint> and <Instance ID> can be found in the SAP HANA Cloud Central Cockpit. <PATH> is the path to the corresponding certificate.
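As an illustration, a fully substituted call might look like the one below. The endpoint, instance ID, and certificate paths are made-up example values; replace them with the ones from your own instance.
hdlfscli -cert C:\certs\client.crt -key C:\certs\client.key -cacert C:\certs\ca.crt -k -s https://1a2b3c4d-5e6f-7a8b.files.hdl.prod-eu10.hanacloud.ondemand.com -filecontainer 1a2b3c4d-5e6f-7a8b ls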
[Optional]: Configure a configuration file to make using the CLI simpler.
Note: The configuration will be placed in the user’s root directory. It is saved as a JSON file that can be modified in any text editor.
hdlfscli -cert <PATH>\client.crt -key <PATH>\client.key -k -s <REST API Endpoint> -config myconfig -dump-config ls
Test the configuration that was just created.
hdlfscli -config myconfig ls
Enter this into your command line, using the configuration name you just created (I called it myconfig).
The “ls” command will display the files that are already present in your Data Lake.
Step 4:
Upload a file to the SAP HANA Data Lake file container. Ensure you know the path to the data files you wish to load into the Data Lake File container.
To upload a new file, you will need to use the HDLFSCLI upload command.
hdlfscli -config myconfig upload customer_as_pipe.csv customer_demo_csv.csv
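In general terms (using only the flags shown above), the first argument after upload is the local source file and the second is the name the file will get in the file container:
hdlfscli -config <your config> upload <local source file> <target name in file container>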
You can also upload the file directly into the File Container through the Database Explorer connection. Right-click on your Data Lake File Container connection and select the Upload File option.
The next step is to give the Relative Path and browse for the Local File. The Relative Path determines where the file will appear in your Data Lake File Container, and the path prefix is ‘/’.
Step 5:
The following section will demonstrate how to load data from the file we just uploaded to the Data Lake File store into a table in HANA Data Lake.
You need to open a SQL console and make sure you are connected to the HANA Data Lake as the HDLADMIN user.
Let’s create a standard table to better understand how the loading works.
CREATE TABLE "CUSTOMER" (
C_CUSTKEY integer not null,
C_NAME varchar(25) not null,
C_ADDRESS varchar(40) not null,
C_NATIONKEY integer not null,
C_PHONE varchar(15) not null,
C_ACCTBAL decimal(15,2) not null,
C_MKTSEGMENT varchar(10) not null,
C_COMMENT varchar(117) not null,
primary key (C_CUSTKEY)
);
Following is the syntax for loading data into the CUSTOMER table that we just created.
LOAD TABLE CUSTOMER(
C_CUSTKEY,
C_NAME,
C_ADDRESS,
C_NATIONKEY,
C_PHONE,
C_ACCTBAL,
C_MKTSEGMENT,
C_COMMENT
)
FROM 'hdlfs:///customer_demo_csv.csv'
DELIMITED BY '|'
FORMAT CSV
ESCAPES OFF;
'hdlfs:///customer_demo_csv.csv' – this is the path within the HANA Data Lake File store to the file we just uploaded.
Execute the CREATE TABLE and LOAD statements in the HDLRE and give them a little while to complete.
To verify that the data has been imported into the table, browse to
Your HDL Connection –> Catalog –> Tables –> CUSTOMER table.
Right-click on the table and select “Open Data”; you should see the data loaded successfully in the table.
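As an additional quick check (not part of the original upload flow), you can also run a couple of simple queries in the SQL console against the table created above:
-- Count the loaded rows
SELECT COUNT(*) FROM CUSTOMER;
-- Peek at the first few rows
SELECT TOP 10 * FROM CUSTOMER ORDER BY C_CUSTKEY;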
Note: The above method shows how to load a file/data into HANA Data Lake using HANA Data Lake File store (load data from a local file).
But one can also load data into the Data Lake using Amazon S3 buckets and Azure Data Lake.
Step 6 (Optional):
Let’s see how one can load data using Amazon S3 buckets. (There’s not much of a difference)
You need to follow the same steps concerning Provisioning a Data Lake instance in Cloud Central.
Once that is done, you need to open a SQL console connected as the HDLADMIN user.
Create the CUSTOMER table:
CREATE TABLE "CUSTOMER" (
C_CUSTKEY integer not null,
C_NAME varchar(25) not null,
C_ADDRESS varchar(40) not null,
C_NATIONKEY integer not null,
C_PHONE varchar(15) not null,
C_ACCTBAL decimal(15,2) not null,
C_MKTSEGMENT varchar(10) not null,
C_COMMENT varchar(117) not null,
primary key (C_CUSTKEY)
);
Once the table is created, run the following load statement to load the data into the Data Lake table.
BEGIN
LOAD TABLE CUSTOMER(
C_CUSTKEY,
C_NAME,
C_ADDRESS,
C_NATIONKEY,
C_PHONE,
C_ACCTBAL,
C_MKTSEGMENT,
C_COMMENT
)
USING FILE 's3://hdl-test-data-eu-central-1/tpch-sf1000/customer.tbl'
DELIMITED BY '|'
FORMAT CSV
ACCESS_KEY_ID '<Access Key>'
SECRET_ACCESS_KEY '<Secret Access Key>'
REGION '<Region>'
ESCAPES OFF
QUOTES OFF;
END
The above statement will load the data into the table from the specified Amazon S3 bucket.
The path it loads data from is 's3://hdl-test-data-eu-central-1/tpch-sf1000/customer.tbl'.
The ACCESS_KEY_ID, SECRET_ACCESS_KEY, and REGION parameters belong to the AWS account and region the data is being pulled from; all three must be supplied to access that S3 bucket.
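Once the load completes, the same kind of quick check used earlier confirms that rows arrived:
-- Count the rows loaded from S3
SELECT COUNT(*) FROM CUSTOMER;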