Start there to learn how to create your own SAP HANA cloud instance and develop with SAP Web IDE Full-Stack.
Our goal is to create a graph workspace that represents routes travelled between airports by different airlines.
Our goal is to create a graph workspace that represents routes travelled between airports by different airlines.
The graph describes the relationships(=edges) between the different entities(=nodes) and all the properties of these relationships and entities.
The relationships allow data to be linked together directly and retrieved in one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are perpetually stored in the database.
Relationships can be intuitively visualized using graph databases, making them useful for heavily inter-connected data.
Graph databases are a type of NoSQL database, they can be very flexible, as data does not have to reside within pre-fixed schemas. Users can add or remove properties from nodes and edges as they go.
SAP HANA is not a NoSQL database. All data in SAP HANA reside within a schema. So how do we manage to offer graph capabilities ?
All data are stored in the vertex table (set of vertices) and the edge table (set of edges). Vertex attributes match to columns of the vertex table. Edge attributes match to columns of the edge table.
One of the vertex attributes (the vertex key) uniquely identifies vertices.
One of the edge attributes (the edge key) uniquely identifies edges.
The edge table contains two additional columns referencing the source vertex and the target vertex of each edge.
Now, I call them tables, but they do not have to be physical tables. They can be based on views which aggregate several different tables without data replication. They can also be based on virtual tables which access data stored in another database.
Relational storage allows all the functions of SAP HANA to be applied to the graph data: access control, backup and recovery, etc. It also allows all SAP HANA Graph functions to be applied to the graph data stored in relational format coming from business applications. SAP HANA Graph provides a dedicated catalog object, which is referred to as a graph workspace, for defining a graph in terms of the existing SAP HANA tables.
Here is an example with articles and their authors :
SAP HANA’s native graph engine
◉ Property graph model embedded in a relational model : full transaction properties (ACID)
◉ Built-in functions : shortest path(1-1, 1-all, top k) and strongly connected components, Neighbors, Breadth First Search
◉ Support for pattern matching using openCypher
◉ GraphScript to develop custom graph algorithms
◉ Graph viewer, SAP HANA Database Explorer
Benefits
◉ Store and analyze connected data in real-time
◉ Combine text, spatial, and advanced analytics with graph intelligence
◉ Tightly integrated in SAP HANA operations (security, backup/restore, scale-out, import/export etc.)
Create an edge and a vertex table
These two tables will become our edge and vertex tables.
Start by importing some data.
Insert data in csv format for airports and routes.
Import that data to tables with .hdbtabledata files
Build your project and you will have data within your two tables. You can check them in the Database Explorer.
Create a graph workspace
Now that you have a vortex table(AIRPORTS) and an edge table (ROUTES), you can create a graph workspace with a .hdbgraphworkspace file.
Airport code will be the key for airports.
Route ID will be the key for routes, and each route will have an origin and a destination.
Once you build your graph workspace, you can use all HANA graph algorithms on your data.
Let’s start by exploring our data through the Database Explorer.
By clicking your graph workspace, you can see an overview of the edge and vertex table. Press the mark on the upper-right to explore the graph.
Explore the graph algorithms
This opens a graphical view of the data, from here you can see the details of each node and edge by clicking them and filter the data.
You can test algorithms directly within the graphical view of Database Explorer on SAP HANA, on-premise. This features which is not yet available on SAP HANA Cloud as of April 2020.
You can use the Shortest Path algorithm to find the shortest path between two airports. By selecting the weight column, you can choose between distance and time, for example.
You can find all nodes which are related to a particular node with the Neighborhood algorithm. You can set up the depth of search, and whether it searches for incoming or outgoing edges.
Find clusters within your data with the Strongly Connected Components alogrithm. In this case, all airports are interconnected, so they form one big cluster.
SAP HANA supports two languages to perform graph computation : GraphScript and openCypher.
I prepared a procedure using the LANGUAGE GRAPH in a .hdbprocedure file.
This takes the origin and destination airports as input, and outputs the total number of segment, the total distance, the total duration and routing. It uses the shortest path algorithm to compute the most efficient path between nodes.
In the database explorer, right-click the procedure and select “Generate CALL Statement with UI”.
This opens a UI where you can input the parameters. In this case we’ll run the procedure from Dubai to Paris.
There is one stop in Amsterdam, and you can see the airline, distance and duration for each flight.
In order to find nodes related to any particular node, you can use the Traversal Breadth-First Search algorithm.
In this example, we compute how many airports have direct flights to Atlanta, how many airports are connected through one transfer, or two transfers.
Pattern matching with openCypher
SAP HANA supports openCypher for pattern matching. Let’s use Cypher to match this pattern : All flights from A to B, with filters on : the country, the distance and the altitude of the target, as well as a custom filter.
In openCypher, nodes are represented within parenthese : (a)
Edges are represented as arrows : -[e]->
We select all flights coming from the United States, to any destination within 4000km, with an altitude less than 50m.
We add a custom filter : the airline name must contain Airline, but the search is fuzzy, which means that similar words such as Air line or Airlines are also returned.
SELECT * FROM OPENCYPHER_TABLE( GRAPH WORKSPACE "TRAVEL"."Flights" QUERY
'
MATCH (a)-[e]->(b)
WHERE a.country = ''United States'' AND e.distance> 4000 AND b.altitude < 50 AND SYS.TEXT_CONTAINS(e.airlineName, ''Airline'', ''FUZZY(0.4)'')
RETURN a.airportCode AS airportCodeFrom, b.airportCode AS airportCodeTo, e.airlineName AS airlineName, e.distance AS distance, b.altitude AS altitude
ORDER BY b.altitude, e.distance DESC
'
)
Let’s see another example of openCypher. We match 3 different patterns in one query.
MATCH (a)-[e1]->(b)
MATCH (b)-[e2]->(c)
MATCH (c)-[e3]->(d)
Then we apply filters to search for flights from NTE to PDX.
SELECT * FROM OPENCYPHER_TABLE( GRAPH WORKSPACE "TRAVEL"."Flights" QUERY
'
MATCH (a)-[e1]->(b)
MATCH (b)-[e2]->(c)
MATCH (c)-[e3]->(d)
WHERE a.airportCode = ''NTE'' AND d.airportCode = ''PDX''
RETURN e1.airlineName AS airlineName1, b.airportCode AS transferAirportCode1, e2.airlineName AS
airlineName2, c.airportCode AS transferAirportCode2, e3.airlineName AS airlineName3
'
)
Use case : Supply chain risk analysis
Who is affected when a supplier shuts down ?
Which supplier was responsible for quality issues ?
These are some questions that customers answer with SAP HANA graph. In this use case, the customer stores purchasing data, engineering data, and sales data in different systems. To support risk analysis, the data is consolidated in a single SAP HANA system.
SAP HANA Graph is used to support a global “where-used list”, essentially providing cross-system “reachability analysis”. Which customer depends on which suppliers?
The graph has around 250k Nodes (supplier, material, customer) and 1 million edges (supplies, is-used-in, purchases). The impact analysis implementation with GraphScript takes about 10 lines of code using build-in BFS traversal operation with filter conditions. There is no data redundancy, the graph is based on views.
It takes –100 ms to aggregate all reachable materials and customers from a single supplier.
It takes 2 seconds to identify/count all customers for all suppliers
No comments:
Post a Comment