Implement SAP HANA Text Analysis in 10 minutes

In SAP HANA Text Analysis - One of the coolest features of SAP HANA we explained what is Text Analysis and why it is so important for business now-a-days.
In this article we will show you how you can easily implement Text Analysis in SAP HANA.

Use-case:

Suppose I am planning to buy a new iPhone 5 and I want to know the review of this over internet. I wanted to get a pulse of the iPhone 5 before I buy it not just from the critics but actual users like me. I also want to search the blogs, news and social media to find out whether people's review are positive, negative or neutral.
Lets see how we can do this with the help of SAP HANA Text Analysis.

Prerequisites:

Download unstructured data (iPhone-News.pdf) 
To save time, I have created a pdf file which contains news and blog articles on iPhone 5. Download this from here .

Create Table in SAP HANA
Create a table in SAP HANA which will contain this unstructured data. Replace <SCHEMA_NAME> with your schema.
CREATE COLUMN TABLE <SCHEMA_NAME>."IPHONE_NEWS" (
          "File_Name" NVARCHAR(20),
          "File_Content" BLOB ,
          PRIMARY KEY ("File_Name"));

Upload pdf file to SAP HANA using Python
Use the below Python code to upload pdf file to SAP HANA.

Note: Check below article to configure Python before running the Python code.
Power of Python Integrated with SAP HANA

import dbapi
# assume HANA host id is abcd1234 and instance no is 00
# and SAP HANA user id is USER1 and password is Password1
conn = dbapi.connect('abcd1234', 30015, 'USER1', 'Password1')

#Check if database connection was successful or not
print conn.isconnected()

#Open a cursor
cur = conn.cursor()
#Open file in read-only and binary
file = open('iPhone-News.pdf', 'rb')
#Save the content of the file in a variable
content = file.read()

#Save the content to the table - Replace SCHEMA1 with your schema
cur.execute("INSERT INTO SCHEMA1.IPHONE_NEWS VALUES(?,?)", ('iPhone-News.pdf',content))
print 'pdf file uploaded to HANA'

#Close the file
file.close()
#Close the cursor
cur.close()
#Close the connection
conn.close()

After executing the above Python script the pdf data will be uploaded in HANA table.

Implement Text Analysis in SAP HANA:

The most impressive thing about Text Analysis is how easy it is to implement it.
The only thing we need to do is run the following statement:

Create FullText Index "PDF_FTI" On <SCHEMA_NAME>."IPHONE_NEWS"("File_Content")
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';

This will create a full text index called "PDF_FTI" (you can use any name) on the BLOB column "File_Content" of the table "IPHONE_NEWS".
With the execution of this script a new column table is created called $TA_PDF_FTI ($TA_<Index_Name>) that contains the result of our Text Analysis Process.

Note: If you do not see this table under your schema, try to refresh that.


That's it. Yes, Text Analysis is implemented. Rest everything is done by SAP HANA. 

Further Analysis:
2 columns of the table $TA_PDF_FTI is very important for us.

TA_TOKEN:
This column contains the extracted entity or element (for example, an identifiable person, place, topic, organization, or sentiment).
TA_TYPE: 
This is the category the entity falls under. For example PERSON, PLACE, PRODUCT etc.

To know people's review and sentiments about iPhone, we can query the table $TA_PDF_FTI like this. 

SELECT "TA_TYPE", ROUND("SENTIMENT_VALUE"/ "TOTAL_SENTIMENT_VALAUE" * 100,2) AS "SENTIMENT_VALAUE_PERCENTAGE" 
FROM 
SELECT "TA_TYPE", SUM("TA_COUNTER") AS "SENTIMENT_VALUE" 
     FROM <SCHEMA_NAME>."$TA_PDF_FTI" 
    where TA_TYPE in('WeakPositiveSentiment','StrongPositiveSentiment','NeutralSentiment',
       'WeakNegativeSentiment','StrongNegativeSentiment','MajorProblem','MinorProblem')
    GROUP BY "TA_TYPE"
) AS TABLE1,

(
SELECT SUM("TA_COUNTER") AS "TOTAL_SENTIMENT_VALAUE" 
    FROM <SCHEMA_NAME>."$TA_PDF_FTI" 
    where TA_TYPE in('WeakPositiveSentiment','StrongPositiveSentiment','NeutralSentiment',
       'WeakNegativeSentiment','StrongNegativeSentiment','MajorProblem','MinorProblem')
) AS TABLE2

You will get the output like this. 


The result shows that more percentage of people are giving positive review of this product. 
Good, now i can go ahead and buy my new iPhone 5. 

What's Next:

We can use this full text index table to get a lot of information other than just sentiments. 
Lets take a look into the structure of this table.

Column Name Key Description Data Type
File_Name Yes This is the primary key of my table. If you have more than one column in your primary key, the $TA table will include every single column Same as in source table. In this case: NVARCHAR(20)
RULE Yes Stores the rule package that yielded the token. In my case: "Entity Extraction" NVARCHAR(200)
COUNTER Yes Counts all tokens across the document BIGINT
TOKEN No The token that was extracted (the "who", "what", "where", "when" and "how much") NVARCHAR(250)
LANGUAGE No You can either specify a language column when you create the fulltext index or it can be derived from the text. In my case it was derived from the text and is English (en) NVARCHAR(2)
TYPE No The Token Type, whether it is a "who", a "what", a "where", etc. NVARCHAR(100)
NORMALIZED No Stores a normalized representation of the token. This becomes relevant e.g. for German with umlauts, or ./ss. Normalization with regards to capitalization would not be as important as to justify this column. NVARCHAR(250)
STEM No Stores the linguistic stemming information, e.g. the singular nominative for nouns, or the indicative for verbs. If text analysis yields several stems, only the first stem will be stored, assuming this to be the best match. NVARCHAR(300)
PARAGRAPH No The paragraph number where my token is located in the document INTEGER
SENTENCE No The sentence number where my token is located in the document INTEGER
CREATED_AT No Creation timestamp TIMESTAMP

Hope you liked this article. If you have any question please leave a comment. 
Continue reading: 

No comments:

Post a Comment