Taking unstructured data, extracting sentiments from it and then putting spatial data in the mix can yield valuable insights for any company.
Intro
Many companies have a large pool of unstructured feedback from customers (e.g. complaints). Most of the time, this data also has geographical information attached (e.g. the address).
Using HANA, we can easily extract value from all of this data, helping customers improve their services. For example, we could help an Internet Service Provider (ISP) to identify complaint hot-spots and pin-point places where their infrastructure should be improved.
As an example, we’ll look into some Twitter posts about #SAP and analyse the feedback SAP is receiving. Initially, I wanted to use the Twitter API to pull a large number of tweets and feed them directly into HANA. Unfortunately, almost all of the tweets about SAP are:
◈ job postings,
◈ random advertisements for SAP or related companies,
◈ sharing blog posts or articles.
The solution: I had to cherry-pick the tweets. In a real-life scenario though, this problem would be solved through the feedback collection mechanism. For example, most ISPs have a standard form for making complaints, in which you must enter:
◈ your address / subscription ID,
◈ the problem that you are facing.
Using such a form to collect data ensures that you only get actual complaints (and you don’t have random job postings in the middle).
Tweets can also carry location data: either the location from which the tweet was made (preferred) or the location of the author (fallback).
The result of my cherry-picking is a simple Excel file which looks like so:
Importing the data
We need to define a simple data model for storing the raw tweets. For that we can build a CDS context:
namespace spet.data;
@Schema: 'SPET'
context core {
entity Raw {
text: String(512) not null;
address: String(128) not null;
};
}
For simplicity, I just used a hdbtableimport to fill the “Raw” table from a CSV file:
import = [
    {
        cdstable = "spet.data::core.Raw";
        schema = "SPET";
        file = "spet.data:tweets.csv";
        header = false;
        delimField = ",";
        delimEnclosing = "\"";
    }
];
Geocoding
Our geographical data is textual, but we want to work with coordinates. The transformation from a textual address to coordinates is called geocoding. First we need a table in which to store the coordinates for each tweet:
entity Processed {
    key id: Integer not null;
    text: String(512) not null;
    address: String(128) not null;
    location: hana.ST_POINT(4326);
};
We could use the GEOCODE INDEX capability to automatically process the data when we insert it into our table, but it has one big drawback: we would need to split the address into components (country, locality, etc.).
I used the Google Geocoding API instead, because it accepts a free-form address string. With a little JavaScript, we can process all of our data at once. After running the XSJS geocoding service, each row of the Processed table carries its coordinates in the location column.
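To make the geocoding step more concrete, here is a minimal sketch of the two pure helpers such a service needs: one builds the request URL and one reads the coordinates out of the JSON response. The function names `buildGeocodeUrl` and `parseGeocodeResponse` are hypothetical and not taken from the original service, and the HTTP call itself is deliberately left out. The sketch relies only on the documented response layout of the Google Geocoding API (`status`, `results[0].geometry.location`).

```javascript
// Sketch only: helper names are hypothetical; the HTTP request is omitted.

// Build the request URL for one free-form address.
function buildGeocodeUrl(sAddress, sApiKey) {
    return 'https://maps.googleapis.com/maps/api/geocode/json'
        + '?address=' + encodeURIComponent(sAddress)
        + '&key=' + encodeURIComponent(sApiKey);
}

// Extract {lat, lng} from a response body, or return null if the
// body is not valid JSON or the API found no match.
function parseGeocodeResponse(sBody) {
    var oData;
    try {
        oData = JSON.parse(sBody);
    } catch (e) {
        return null;
    }
    if (!oData || oData.status !== 'OK' || !oData.results || !oData.results.length) {
        return null;
    }
    var oLocation = oData.results[0].geometry.location;
    return { lat: oLocation.lat, lng: oLocation.lng };
}
```

Returning null on failure lets the caller substitute a default with `|| {lat: 0, lng: 0}`.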
Extracting sentiments
We want to see how users feel about SAP. For this, HANA has the built-in functionality of Text Analysis, with a configuration specialised in processing customer feedback (Voice of Customer).
Again, we could use a dedicated index, the FULL TEXT INDEX, for analysing rows as soon as they are inserted. But we also have a JavaScript API for doing this. As we already go through all the data for geocoding, we can simultaneously extract the sentiments using this API.
First we need to define another table in which to store the sentiments. Each sentiment has the following attributes:
◈ Type: positive, negative, neutral,
◈ Strong: yes, no,
◈ The words forming the sentiment,
◈ The position of the sentiment in the text.
entity Sentiment {
    key id: Integer not null;
    startIndex: Integer;
    endIndex: Integer;
    type: String(32);
    strong: hana.TINYINT;
    text: String(64);
    processedId: Integer not null;
};
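As a rough illustration of how text analysis output could be turned into rows of this shape, the sketch below maps Voice of Customer sentiment type names onto the `type` and `strong` attributes. The input shape (`{type, text, offset}`) is a simplified assumption and not the exact structure returned by the XSJS text analysis API. The function name `toSentiments` is hypothetical, and the five type names are assumed to match those produced by the Voice of Customer configuration.

```javascript
// Sketch only: the input entity shape and the function name are assumptions.
var SENTIMENT_TYPES = {
    StrongPositiveSentiment: { type: 'POSITIVE', strong: true },
    WeakPositiveSentiment: { type: 'POSITIVE', strong: false },
    NeutralSentiment: { type: 'NEUTRAL', strong: false },
    WeakNegativeSentiment: { type: 'NEGATIVE', strong: false },
    StrongNegativeSentiment: { type: 'NEGATIVE', strong: true }
};

// Keep only sentiment entities and convert them to Sentiment-like objects.
function toSentiments(aEntities) {
    var aResult = [];
    aEntities.forEach(function (oEntity) {
        if (!Object.prototype.hasOwnProperty.call(SENTIMENT_TYPES, oEntity.type)) {
            return; // not a sentiment (e.g. a topic), skip it
        }
        var oKind = SENTIMENT_TYPES[oEntity.type];
        aResult.push({
            startIndex: oEntity.offset,
            endIndex: oEntity.offset + oEntity.text.length,
            type: oKind.type,
            strong: oKind.strong,
            text: oEntity.text
        });
    });
    return aResult;
}
```

The upper-case `type` values are chosen to match the string comparisons used later when scoring the sentiments.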
Lastly, we need a simple XSJS function for processing all the raw entries:
function process() {
    var oConn = $.hdb.getConnection(),
        aEntries = oConn.executeQuery('SELECT "text", "address" FROM '
            + '"SPET"."spet.data::core.Raw"');

    // Clear previous results so that the function can be re-run.
    oConn.executeUpdate('DELETE FROM "SPET"."spet.data::core.Processed"');
    oConn.executeUpdate('DELETE FROM "SPET"."spet.data::core.Sentiment"');

    var j = 0; // running id for the Sentiment rows
    for (var i = 0; i < aEntries.length; ++i) {
        // Fall back to (0, 0) and to no sentiments if a lookup returns nothing.
        var oLatLng = geocodeAddress(aEntries[i].address) || {lat: 0, lng: 0};
        var aSentiments = sentimentAnalysis(aEntries[i].text) || [];

        // ST_POINT takes (x, y), i.e. (longitude, latitude).
        oConn.executeUpdate('INSERT INTO "SPET"."spet.data::core.Processed" '
            + 'VALUES (?, ?, ?, NEW ST_POINT(TO_DECIMAL(?, 9, 6), TO_DECIMAL(?, 9, 6)))',
            i, aEntries[i].text, aEntries[i].address, oLatLng.lng, oLatLng.lat);

        for (var k = 0; k < aSentiments.length; ++k) {
            oConn.executeUpdate('INSERT INTO "SPET"."spet.data::core.Sentiment" '
                + 'VALUES (?, ?, ?, ?, ?, ?, ?)', ++j, aSentiments[k].startIndex,
                aSentiments[k].endIndex, aSentiments[k].type,
                aSentiments[k].strong ? 1 : 0, aSentiments[k].text, i);
        }
    }
    oConn.commit();
    oConn.close();
}
Extraction results
After running the above script, we obtain some sentiments stored in our dedicated table:
You may wonder how accurate these results are. Let’s look at some examples of good and bad results:
◈ BAD: It doesn’t understand the surrounding context and wrongly categorises some sentiments:
◈ GOOD: It understands emoticons.
◈ BAD: It interpreted a sarcastic emoticon as positive.
◈ GOOD: It found a positive rating and interpreted it correctly.
◈ BAD: Several tweets with obvious sentiments are ignored.
Visualisation
Now that all our data is analysed and in a decent structure, let’s visualise it. First we want to generate a score for each tweet, on a scale from 1 to 5, where 1 is bad and 5 is good. We compute this score from the sentiments.
First we build a simple view which assigns a signed weight to each sentiment:
view Score as select from Sentiment {
    processedId,
    CASE
        WHEN type = 'NEGATIVE' AND strong = 1 THEN -2
        WHEN type = 'NEGATIVE' AND strong = 0 THEN -1
        WHEN type = 'POSITIVE' AND strong = 0 THEN 1
        WHEN type = 'POSITIVE' AND strong = 1 THEN 2
        ELSE 0
    END as score
};
Then we combine this data with the rest of the tweet attributes. The join below reads the per-tweet sum of these weights from a TotalScore view (columns processedId and total; its definition is not shown) and translates it into the 1-5 scale:
view Tweet as select from Processed
    left join TotalScore on Processed.id = TotalScore.processedId {
    id,
    text,
    address,
    location.ST_Y() as latitude,
    location.ST_X() as longitude,
    CASE
        WHEN total IS NULL THEN 3
        WHEN total >= 4 THEN 5
        WHEN total <= -4 THEN 1
        ELSE FLOOR(total / 2) + 3
    END as score
};
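To see how the scoring behaves on concrete inputs, the same logic can be sketched in plain JavaScript. This mirrors the two CASE expressions above rather than replacing the views. The function names `sentimentWeight` and `tweetScore` are hypothetical, and the sketch assumes that `total / 2` yields a fractional result before FLOOR is applied.

```javascript
// Weight of one sentiment, mirroring the CASE expression of the Score view.
function sentimentWeight(oSentiment) {
    var iMagnitude = oSentiment.strong ? 2 : 1;
    if (oSentiment.type === 'POSITIVE') { return iMagnitude; }
    if (oSentiment.type === 'NEGATIVE') { return -iMagnitude; }
    return 0;
}

// Map the summed weights of one tweet onto the 1-5 scale of the Tweet view.
// An empty list stands for "no sentiments found" (total IS NULL in SQL).
function tweetScore(aSentiments) {
    if (!aSentiments.length) { return 3; }
    var iTotal = aSentiments.reduce(function (iSum, oSentiment) {
        return iSum + sentimentWeight(oSentiment);
    }, 0);
    if (iTotal >= 4) { return 5; }
    if (iTotal <= -4) { return 1; }
    return Math.floor(iTotal / 2) + 3;
}
```

Because flooring rounds towards negative infinity, the scale is slightly asymmetric under this assumption: a total of +1 maps to 3, while a total of -1 maps to 2.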
Displaying all of our data in a simple table yields a fairly nice overview:
An alternative view is to show all our tweets on a map, with markers coloured based on the score:
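Most map libraries can consume point data as GeoJSON, so one possible way to prepare the rows of the Tweet view for such a map is sketched below. The colour palette, the `colour` property and the function name `toFeatureCollection` are hypothetical choices made for illustration; only the GeoJSON structure itself is standard.

```javascript
// Sketch only: palette and names are illustrative assumptions.
// Index 0 corresponds to score 1 (bad), index 4 to score 5 (good).
var SCORE_COLOURS = ['#d7191c', '#fdae61', '#ffffbf', '#a6d96a', '#1a9641'];

// Convert rows with id, text, latitude, longitude and score
// into a GeoJSON FeatureCollection of points.
function toFeatureCollection(aTweets) {
    return {
        type: 'FeatureCollection',
        features: aTweets.map(function (oTweet) {
            return {
                type: 'Feature',
                geometry: {
                    type: 'Point',
                    // GeoJSON expects [longitude, latitude]
                    coordinates: [oTweet.longitude, oTweet.latitude]
                },
                properties: {
                    id: oTweet.id,
                    text: oTweet.text,
                    score: oTweet.score,
                    colour: SCORE_COLOURS[oTweet.score - 1]
                }
            };
        })
    };
}
```

Keeping the colour as a feature property means the map layer only needs to read one attribute per marker.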
For extracting some more deeply-embedded insights, we can use some of the examples from my previous blog post: HANA Spatial Demos: Geocoding, Clustering, Aggregation:
◈ Clustering:
◈ Aggregation: