How to use the Fuzzy Search in SAP HANA

In this article we will talk about

  • What is Fuzzy Search?
  • Why Fuzzy Search is important?
  • Real Time Example of Fuzzy Search Based Applications.
  • How to Implement Fuzzy Search in SAP HANA?

What is Fuzzy Search?

Also known as approximate string matching.
Fuzzy search is the technique of finding strings that match a pattern approximately (rather than exactly).
It is a type of search that will find matches even when users misspell words or enter in only partial words for the search.
purpose:
With the help of Fuzzy Search Misspellings and typos still provide relevant results.

A Real World Example: 
If a user types "SAP HANA Tutorl" into Yahoo or Google (both of which use fuzzy matching), a list of hits is returned along with the question, "Did you mean "SAP HANA Tutorial"?"

Fuzzy Search in SAP HANA:

In SAP HANA, you can call the fuzzy search by using the CONTAINS predicate with the FUZZY option in the WHERE clause of a SELECT statement.
Syntax: 
SELECT * FROM <tablename>
WHERE CONTAINS (<column_name>, <search_string>, FUZZY (0.8))

A search with FUZZY(x) returns all values that have a fuzzy score greater than or equal to x.

The SCORE() Function
The fuzzy search algorithm calculates a fuzzy score for each string comparison. The higher the score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of 0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function.

You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). When a fuzzy search of multiple columns is used in a SELECT statement, the score is returned as an average of the scores of all columns used.

So not only does it find a "fault tolerant" match, it also puts a score behind it.

Example:
When searching with 'SAP', a record like 'SAP AG' gets a high score, because the term 'SAP' exists in the texts. A record like "BSAP Corp" gets a lower score, because 'SAP' is only a part of the longer term 'BSAP Corp'.

Create the table and data:
-- REPLACE <Schema_Name> WITH YOUR SCHEMA NAME
CREATE COLUMN TABLE <Schema_Name>.COMPANIES(
               ID INTEGER PRIMARY KEY,
               COMPANY_NAME SHORTTEXT(200) FUZZY SEARCH INDEX ON);

INSERT INTO <Schema_Name>.COMPANIES VALUES (1, 'SAP');
INSERT INTO <Schema_Name>.COMPANIES VALUES (2, 'SAP in Walldorf');
INSERT INTO <Schema_Name>.COMPANIES VALUES (3, 'SAP AG');
INSERT INTO <Schema_Name>.COMPANIES VALUES (4, 'ASAP Corp');
INSERT INTO <Schema_Name>.COMPANIES VALUES (5, 'BSAP orp');
INSERT INTO <Schema_Name>.COMPANIES VALUES (6, 'IBM Corp');

Perform the search on one column:
SELECT SCORE() AS score, * FROM <Schema_Name>.COMPANIES
        WHERE CONTAINS(COMPANY_NAME,'SAP',
                FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
        ORDER BY score DESC;

The output of fuzzy search contains 5 entries. Based on the fuzzy search factor (which is 0.7 in this case), it will also consider the similar words. In this case "SAP AG", "BSAP orp" etc.


A Real Time Example of Fuzzy Search:

Use Case 
A call center agent who receives an order by phone needs to know the customer number or, in the case of a new entry, the system has to inform him about a potentially duplicate entry. 
There are chances that name can be misspelled or there can be different person with same name but different spellings. For example "Jimi Hendricks" can be misspelled as "Jimy Hendricks" or "Jimi Hendrix". Or the address can also be spelled differently. For example "Berliner Platz 43" or "Berliner Plats 43" or "Berliner Platz" 

Without fuzzy search system can only find the exact match means the only entries that are 100% identical. But with fuzzy search system can find the misspelled words too.

Create table and some data: 
-- REPLACE <Schema_Name> WITH YOUR SCHEMA NAME 
create column table <Schema_Name>."CUSTOMERS"( 
                "CUSTOMER_ID" VARCHAR (5) not null default '', 
                "FIRST_NAME" VARCHAR (20) null default '', 
                "LAST_NAME" VARCHAR (20) null default '', 
                "STREET" VARCHAR (20) null default '', 
                "CITY" VARCHAR (20) null default '', 
                "COUNTRY" VARCHAR (20) null default '', 
                "POSTAL_CODE" VARCHAR (20) null default '', 
                primary key ("CUSTOMER_ID")); 

insert into <Schema_Name>."CUSTOMERS" values('00001','Jimi','Hendricks','Berliner Platz 43','Munchen','Germany','80805'); 
insert into <Schema_Name>."CUSTOMERS" values('00002','Jimy','Hendricks','Berlinr Platz 43','Munchen','Germany','80805'); 
insert into <Schema_Name>."CUSTOMERS" values('00003','Jimi','Hendrix','Berliner Plats 43','Munchen','Germany','80805'); 
insert into <Schema_Name>."CUSTOMERS" values('00004','Jimy','Feuer','Berliner','Munchen','Germany','80805'); 
insert into <Schema_Name>."CUSTOMERS" values('00006','Sven','Ottlieb','Walserweg 21','Aachen','Germany','52066'); 
insert into <Schema_Name>."CUSTOMERS" values('00007','Philip','Cramer','Maubelstr. 90','Brandenburg','Germany','14776'); 
insert into <Schema_Name>."CUSTOMERS" values('00008','Renate','Messner','Magazinweg 7','Frankfurt','Germany','60528'); 
insert into <Schema_Name>."CUSTOMERS" values('00009','Alexander','Feuer','Heerstr. 22','Leipzig','Germany','04179'); 
insert into <Schema_Name>."CUSTOMERS" values('00010','Antonio','Moreno','Mataderos 2312','Mexico','Mexico','05023'); 
insert into <Schema_Name>."CUSTOMERS" values('00011','Thomas','Hardy','120 Hanover','London','UK','WA1 1DP'); 
insert into <Schema_Name>."CUSTOMERS" values('00012','Christina','Berglund','Berguvsvagen 8','Lulea','Sweden','S-958 22');

Without Fuzzy Search: 
Suppose you want to search a customer with name "Jimi".
SQL Query:
SELECT * FROM <Schema_Name>."CUSTOMERS" 
                WHERE CONTAINS(FIRST_NAME, 'Jimi') 
                ORDER BY "CUSTOMER_ID" DESC;

The output will contain only one entry which contains exact match of "Jimi". 


Now let us try the fuzzy search function. 
SQL Query: 

SELECT SCORE() AS score, * FROM <Schema_Name>."CUSTOMERS" 
                WHERE 
               CONTAINS(FIRST_NAME, 'Jimi', FUZZY(0.7)) 
               ORDER BY score DESC;

The output of fuzzy search contains 4 entries. Based on the fuzzy search factor (which is 0.7 in this case), it will also consider the similar words. In this case "Jimy". 


We can also do fuzzy search on 2 columns. For example First Name and Last Name. 
SQL Query: 
SELECT SCORE() AS score, * FROM <Schema_Name>."CUSTOMERS" 
                WHERE 
                        CONTAINS(FIRST_NAME, 'Jimi', FUZZY(0.7)) 
                        and CONTAINS(LAST_NAME, 'Hendricks', FUZZY(0.7)) 
                ORDER BY score DESC;

The output contains 3 entries. Based on the fuzzy search factor (which is 0.7 in this case), it will also consider the similar names. In this case "Jimy Hendricks" and "Jimi Hendrix". 


No comments:

Post a Comment