SAP HANA Text Analysis using Twitter Data

In this tutorial, we will do the following:
  1. Use the Twitter API to get the tweets
  2. Save the tweets to an SAP HANA system over a JDBC connection
  3. Run Text Analysis in HANA on top of the tweets.
After this tutorial, you will have learned about:
  • SAP HANA integration with Twitter
  • Programming against SAP HANA using JDBC in Java
  • SAP HANA Text Analysis
Prerequisites:

Register an Application at Twitter Developers:
Because we will use the Twitter API to extract data from Twitter, we first need to create an application at Twitter Developers. We will use the application's authentication information to invoke the APIs later.

If you haven't used Twitter before, you need to create a Twitter account first.
You can register an application and create your OAuth tokens at Twitter Developers by following the steps below.
1. Log on with your Twitter account, click your profile picture and then click "My applications".


2. Click on the button "Create a new application".


3. Provide the information. You can give any name and description of your choice. 


4. Follow the instructions and finally click "Create your Twitter application".
5. Scroll down and click the "Create my access token" button to generate the token. 


6. After that, you will see the OAuth settings as shown below. Save the values of Consumer key, Consumer secret, Access token and Access token secret. 


Download Twitter API Java library - Twitter4J
Twitter4J is an unofficial open source Java library for the Twitter API. With Twitter4J, you can easily integrate your Java application with the Twitter services. 
You can download it from http://twitter4j.org/en/index.html 

Download "twitter4j-3.0.5.zip" and save it. We will need it later.

Prepare the HANA JDBC library
To access SAP HANA from Java, we need the JDBC library, which you can find at
  • C:\Program Files\SAP\hdbclient\ngdbc.jar on Windows
  • and /usr/sap/hdbclient/ngdbc.jar on Linux.
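Once ngdbc.jar is on the project's build path, a quick check like the one below can confirm that Java can actually see the driver. This is a minimal sketch: the class name DriverCheck is made up for illustration, while com.sap.db.jdbc.Driver is the driver class contained in ngdbc.jar.

```java
// Sketch: check whether the SAP HANA JDBC driver is visible on the classpath.
// DriverCheck is a hypothetical helper, not part of the tutorial project.
class DriverCheck {
    // Driver class shipped in ngdbc.jar.
    static final String HANA_DRIVER = "com.sap.db.jdbc.Driver";

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isOnClasspath(HANA_DRIVER)
                ? "HANA JDBC driver found"
                : "HANA JDBC driver missing: add ngdbc.jar to the build path");
    }
}
```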

Download Eclipse IDE for Java Developers
In this exercise, we will use Eclipse IDE for Java Developers to run the Java Project. 
You can either add the Java plug-ins to your HANA Studio or download the IDE directly from here.

Now we are ready!! Let's fetch data from Twitter and save it in HANA.

Create a column table in HANA:

Before running the Java program, we need to create a table in HANA to store the tweets fetched from the Twitter services. 
Copy and paste the script below into the SQL editor and execute it.
Note: You need to replace <SCHEMA_NAME> with your own schema. 

CREATE COLUMN TABLE <SCHEMA_NAME>.TWEETS(
      "ID" INTEGER NOT NULL,
      "USER_NAME" NVARCHAR(100),
      "CREATED_AT" DATE,
      "TEXT" NVARCHAR (140),
      "HASH_TAGS" NVARCHAR (100),
      PRIMARY KEY("ID")
);

CREATE SEQUENCE <SCHEMA_NAME>."TWEET_SEQUENCE" 
     INCREMENT BY 1 START WITH 1 NO CYCLE; 
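
To show how the table and the sequence work together, here is a minimal sketch of inserting one row from Java, with the ID taken from TWEET_SEQUENCE. The class and method names are hypothetical and need not match the TweetDAO in the downloaded project. The schema name is concatenated into the SQL, so it should come from your own configuration, never from user input.

```java
import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper: inserts a tweet into <schema>.TWEETS using TWEET_SEQUENCE for the ID.
class TweetInsert {
    /** Builds the INSERT statement; the four '?' are bound per tweet. */
    static String insertSql(String schema) {
        return "INSERT INTO " + schema + ".TWEETS "
             + "(\"ID\", \"USER_NAME\", \"CREATED_AT\", \"TEXT\", \"HASH_TAGS\") "
             + "VALUES (" + schema + ".\"TWEET_SEQUENCE\".NEXTVAL, ?, ?, ?, ?)";
    }

    /** Inserts one tweet and returns the number of rows written. */
    static int insert(Connection conn, String schema, String userName,
                      Date createdAt, String text, String hashTags) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(insertSql(schema))) {
            ps.setString(1, userName);
            ps.setDate(2, createdAt);
            ps.setString(3, text);
            ps.setString(4, hashTags);
            return ps.executeUpdate();
        }
    }
}
```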

Create and configure JAVA program:

1. Download the Java project "TwitterAnalysis.zip" from here and save it to your local computer.
2. Open Eclipse and create a Java project called "TwitterAnalysis". 



3. Go to File -> Import and select "Archive File" 


4. Click on browse and select the "TwitterAnalysis.zip" file you downloaded in step 1. Click on finish.


5. You will now see the project with a structure like this: 


Understanding the Java Project:

TwitterConnection.java
Builds the connection to the Twitter services

HDBConnection.java 
Builds the JDBC connection to HANA

Configurations.java 
The public interface holding the network, HANA and Twitter authentication configurations; override its values with your own account and settings

Tweet.java
The Java bean class for the tweet objects

TweetDAO.java 
The data access object

ngdbc.jar 
SAP HANA JDBC library

twitter4j-core-3.0.3.jar 
Twitter4J library for the Twitter services in Java
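
As an illustration of the bean role described above, here is a hypothetical sketch of a tweet bean whose fields mirror the columns of the TWEETS table. The actual Tweet.java in the downloaded project may look different; the field and method names here are assumptions.

```java
import java.util.Date;

// Hypothetical sketch of a bean mirroring the TWEETS columns.
// The ID is left out because the sequence generates it on insert.
class Tweet {
    private final String userName;  // USER_NAME
    private final Date createdAt;   // CREATED_AT
    private final String text;      // TEXT
    private final String hashTags;  // HASH_TAGS

    Tweet(String userName, Date createdAt, String text, String hashTags) {
        this.userName = userName;
        this.createdAt = new Date(createdAt.getTime()); // defensive copy, Date is mutable
        this.text = text;
        this.hashTags = hashTags;
    }

    String getUserName() { return userName; }
    Date getCreatedAt() { return new Date(createdAt.getTime()); }
    String getText() { return text; }
    String getHashTags() { return hashTags; }
}
```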


Update the configurations

To make the configuration easy to maintain, we put all the required information in a single interface. You must update it with your own account and settings before you can connect to either HANA or Twitter. 

Open the file Configurations.java in your project. There are four categories of settings you can override: 
Network Proxy Settings: 
The proxy host and port. Set HAS_PROXY to false if you do not need to use a proxy. 
To find the proxy host, open a command prompt and type "ping proxy". The output shows the proxy host. 
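
For orientation, the sketch below shows one way a proxy flag like HAS_PROXY could be applied in plain Java, using the standard JDK proxy system properties that java.net.HttpURLConnection honours. How the tutorial project actually applies its proxy settings is not shown here, so the class name ProxySetup and its parameters are assumptions.

```java
// Hypothetical helper: applies JDK-wide proxy settings when a proxy is required.
class ProxySetup {
    /** Sets the standard JDK proxy properties for HTTP and HTTPS if hasProxy is true. */
    static void apply(boolean hasProxy, String host, int port) {
        if (!hasProxy) {
            return; // direct connection, nothing to configure
        }
        String portText = Integer.toString(port);
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", portText);
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", portText);
    }
}
```

For example, ProxySetup.apply(true, "proxy.example.com", 8080) would route subsequent HTTP and HTTPS requests through that (made-up) proxy.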



HANA Connection Settings: 
Replace the HANA URL, user, password and schema with your own HANA host and port, your credentials, and the schema where you created your table. 
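
As a rough guide to what these values end up doing, here is a minimal sketch of opening a HANA connection over JDBC. The URL form jdbc:sap://host:port with the currentschema option is the one understood by ngdbc.jar; the class HanaConnect and its parameters are hypothetical, and the port is often 3<instance number>15 (for example 30015), which you should verify for your own system.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper: builds the JDBC URL and opens a connection to HANA.
class HanaConnect {
    /** Builds a HANA JDBC URL that also sets the default schema for the session. */
    static String jdbcUrl(String host, int port, String schema) {
        return "jdbc:sap://" + host + ":" + port + "/?currentschema=" + schema;
    }

    /** Opens a connection; requires ngdbc.jar on the classpath at run time. */
    static Connection open(String host, int port, String schema,
                           String user, String password) throws SQLException {
        return DriverManager.getConnection(jdbcUrl(host, port, schema), user, password);
    }
}
```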


Twitter Authentication Settings: 
Replace these with the authentication information from your own Twitter application, as described in the prerequisites.

Search Term: 
We will search Twitter for the term "HANA Training" to find out what people are saying about HANA training. You can replace it with your own term if you are interested in other topics.
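
Taken together, the four groups of settings could be laid out in a constants interface along the following lines. Only the name HAS_PROXY and the search term come from this tutorial; every other constant name and all placeholder values are assumptions, so use the names you actually find in the downloaded Configurations.java.

```java
// Hypothetical sketch of a configuration interface; replace the placeholders with your own values.
interface Configurations {
    // Network proxy settings
    boolean HAS_PROXY = false;
    String PROXY_HOST = "proxy.example.com";
    int PROXY_PORT = 8080;

    // HANA connection settings
    String HANA_URL = "jdbc:sap://<host>:<port>";
    String HANA_USER = "<user>";
    String HANA_PASSWORD = "<password>";
    String HANA_SCHEMA = "<SCHEMA_NAME>";

    // Twitter authentication settings (from your Twitter application)
    String CONSUMER_KEY = "<consumer key>";
    String CONSUMER_SECRET = "<consumer secret>";
    String ACCESS_TOKEN = "<access token>";
    String ACCESS_TOKEN_SECRET = "<access token secret>";

    // Search term
    String SEARCH_TERM = "HANA Training";
}
```

Keeping everything in one place means the connection and search classes never hard-code credentials themselves. Avoid committing a file like this with real secrets to a shared repository.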


Test Connection to Twitter

Once the Twitter authentication settings from the previous step are correct, you can open TwitterConnection.java and run it. 

You will see the message "Connection to Twitter Successfully!" followed by your Twitter user ID in the console, as the screenshot below shows.


Test Connection to SAP HANA

Now let us open the file HDBConnection.java and run it. 
You will see the message "Connection to HANA Successfully!" in the console, as the screenshot below shows. 
Check Configurations.java if you encounter any issues. 


Invoke Twitter API and save the tweets into HANA:

Now it's time to do the real work. Open the file SearchTweets.java and run it. It searches for tweets based on the search term we specified in Configurations.java and saves everything it finds to the HANA table. 
Messages in the console indicate that the tweets have been inserted into HANA successfully, as the screenshot shows: 
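
The TWEETS table stores all hashtags of a tweet in a single HASH_TAGS NVARCHAR(100) column. How the downloaded project flattens them is not shown in this tutorial; one possible approach, sketched below as an assumption, is to join the tags with commas and stop before the column width is exceeded, so that the insert cannot fail on an overlong value.

```java
import java.util.List;

// Hypothetical helper: flattens a tweet's hashtags to fit HASH_TAGS NVARCHAR(100).
class HashTagColumn {
    static final int MAX_LENGTH = 100; // width of the HASH_TAGS column

    /** Joins tags with commas, dropping trailing tags that would overflow the column. */
    static String join(List<String> tags) {
        StringBuilder sb = new StringBuilder();
        for (String tag : tags) {
            String next = (sb.length() == 0 ? "" : ",") + tag;
            if (sb.length() + next.length() > MAX_LENGTH) {
                break; // stop before exceeding the column width
            }
            sb.append(next);
        }
        return sb.toString();
    }
}
```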


After that, you can run the data preview in HANA studio and see the contents of the table TWEETS in your schema like this: 


Run text analysis in HANA:

We now have the tweets stored in the HANA table. Next, we will run the text analysis to see what people are saying about "HANA Training" on Twitter. 

To run the text analysis, all we need to do is create a full-text index on the table column we want to analyze. HANA then performs the linguistic analysis, entity extraction and stemming for us and saves the results in a generated table $TA_YOUR_INDEX_NAME in the same schema. 
After that, you can build views on top of the table and use the existing analysis tools around HANA for visualization or even predictive analysis. 

Copy the SQL statement and execute it in the SQL console:
Note: Replace <SCHEMA_NAME> with your own schema. 

Create FullText Index <SCHEMA_NAME>."TWEETS_FTI" 
On <SCHEMA_NAME>."TWEETS"("TEXT") 
TEXT ANALYSIS ON 
CONFIGURATION 'EXTRACTION_CORE';

You will see the generated table $TA_TWEETS_FTI under your schema. 
If you don't see it, try refreshing the folder. 


Text Analysis is done!! Yes it was that simple.

Open the data preview of $TA_TWEETS_FTI and go to the Analysis tab. 
Select the chart type as "Other" - "Tag Cloud" to have a better view. 
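
If you would rather look at the numbers behind such a tag cloud, the same kind of aggregation can be run over JDBC. The sketch below counts how often each extracted token occurs, using the TA_TOKEN and TA_TYPE columns of the generated $TA table. The class TopTokens is hypothetical, and the query is only one plausible way to summarize the results.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper: lists the most frequent tokens found by the text analysis.
class TopTokens {
    /** Builds a query that counts occurrences per token and entity type. */
    static String topTokensSql(String schema, int limit) {
        return "SELECT TA_TOKEN, TA_TYPE, COUNT(*) AS OCCURRENCES "
             + "FROM " + schema + ".\"$TA_TWEETS_FTI\" "
             + "GROUP BY TA_TOKEN, TA_TYPE "
             + "ORDER BY OCCURRENCES DESC LIMIT " + limit;
    }

    /** Runs the query on an open HANA connection and prints one line per token. */
    static void print(Connection conn, String schema, int limit) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(topTokensSql(schema, limit))) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " [" + rs.getString(2) + "]: " + rs.getLong(3));
            }
        }
    }
}
```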


Reference: This example was taken from the SAP Startup Focus program. 
If you are from a startup and are interested in developing on top of SAP HANA, the in-memory database and application platform, you may want to check the SAP Startup Focus program for help. 
