Augment your Python Analysis with Multi-Model data in SAP HANA Cloud
Thursday, 22 July 2021
A lot of data science and machine learning work is done in Python. For this, the “Python machine learning client for SAP HANA” (hana-ml for short) provides easy access to SAP HANA tables in the form of Pandas-compatible data-frames. However, with its multi-model capabilities (like spatial, graph, and document store), SAP HANA has more to offer. In this blog post, you will learn about the enhancements of the library that support these multi-model capabilities and how you can leverage them for your work. We will walk through an end-to-end example based on Wellington’s stormwater network. We will evaluate and visualize the graph and show how we can analyze problems that might occur in the network, based on built-in graph algorithms. The data and code are available on github.com. If you want to run the Jupyter Notebook yourself, you need to take care of the following prerequisites.
SAP HANA Cloud
◉ Register for a trial account of SAP HANA Cloud
◉ Provision an SAP HANA Cloud, HANA database instance
Python Environment
◉ You need a running Jupyter environment (e.g. Jupyter Notebook or JupyterLab)
◉ For the visualization, you need to install KeplerGl. The example will also work without KeplerGl, but you will miss out on the visualizations.
◉ Install the hana-ml library from PyPI with pip install hana-ml. Make sure that the version you install is >= 2.9
◉ Get the Jupyter Notebook and sample data from github.com
We use a separate JSON file to store database logon data…
{
"user": "YourUserName",
"pwd": "YourPassword",
"url": "database instance URL",
"port": 443
}
… the Notebook loads the logon information from this file, so we don’t have to put it into the notebook itself. The database URL can be obtained from the BTP Cockpit.
First, we import the libraries we need to create a connection to the database.
import sys
import os
import json
from hana_ml.dataframe import ConnectionContext
Next, we create a connection context that represents the physical connection to your SAP HANA database instance. This connection context is used in the subsequent code to let the hana-ml library know which database to connect to. The ConnectionContext is part of the hana-ml library.
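The following is a minimal sketch of the connection setup; the file name ./logon.json is an assumption based on the JSON file shown above.
# Load the logon data from the JSON file (the path is an assumption)
with open("./logon.json") as f:
    logon = json.load(f)
# The physical connection to the SAP HANA database instance
cc = ConnectionContext(
    address=logon["url"],
    port=logon["port"],
    user=logon["user"],
    password=logon["pwd"])
print(cc.hana_version())  # simple sanity check that the connection works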
Once the connection is set up, we create a Graph Workspace (in the subsequent text referred to as ‘graph’) based on imported vertices and edges in SAP HANA. As a result, we have the stormwater network stored as a graph in SAP HANA and we can run network analysis processes.
The hana-ml library offers different functions to create a graph:
◉ create_graph_from_dataframes() creates a graph based on either HANA data-frames or Pandas data-frames. It requires both vertex and edge data.
◉ create_graph_from_edges_dataframe() is similar to the one above, but only requires edge data. A minimal vertex table is created which only contains the vertex keys found in the edges table.
◉ create_graph_from_hana_dataframes() is a special version of create_graph_from_dataframes() which only accepts HANA data-frames, but gives you more control over how the data-frames are processed.
HANA DataFrames – a small excursion
A hana_ml.dataframe.DataFrame represents a database query as a data-frame (conceptually similar to a Pandas data-frame). Thus, it seamlessly fits into the toolset a data scientist is used to. The HANA DataFrame is designed not to bring data back from the database unless explicitly requested (lazy evaluation). That means all operations (like column filters, where clauses, …) are performed on the query level, and only when certain methods (like collect()) are called is the data materialized and transferred to the client. All materialized data is returned as Pandas data-frames, so it can directly be used for further processing.
The hana-ml library also provides functions to create data-frames from local data and store this data in a SAP HANA database table.
For our example, the data-frame’s geospatial support comes in handy. During the data import, we can already specify which columns of our source are geometries. When we query the data and use it on the client, the geometries from database tables are directly converted to objects which can be used in GeoPandas or other spatial libraries. As we will see, these geometries can also be used directly with geospatial functions on the database layer.
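To illustrate the lazy evaluation, here is a minimal sketch; the table name MY_TABLE and the columns ID and NAME are placeholders.
# No data is transferred yet – this only defines a query
hdf = cc.table("MY_TABLE")
# Column filter and where clause are still applied on the query level
hdf_filtered = hdf.select("ID", "NAME").filter("ID < 100")
# Only now is the result materialized and returned as a Pandas data-frame
pdf = hdf_filtered.collect()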
Create the Graph
In our example, we create the graph from two CSV files – the data in one file represents the edges, i.e. the actual pipe segments; the other file contains the data for the pipe segments’ junctions. We use Pandas to import the data.
from hana_ml.dataframe import create_dataframe_from_pandas
import pandas as pd
# example SHAPE column: POINT (1752440.6821975708 5439964.327102661)
v_hdf = create_dataframe_from_pandas(
connection_context=cc,
pandas_df=pd.read_csv("./datasets/wwc_stormwater_vertices.csv"),
table_name="LM_STROM_WATER_VERTICES",
primary_key="ID",
geo_cols=["SHAPE"],
srid=2193,
force=True)
# example SHAPE column: LINESTRING (1749169.286201477 5422260.568099976, 1749162.987197876 5422242.643096924)
e_hdf = create_dataframe_from_pandas(
connection_context=cc,
pandas_df=pd.read_csv("./datasets/wwc_stormwater_edges.csv"),
table_name="LM_STORM_WATER_EDGES",
primary_key="ID",
not_nulls=["SOURCE", "TARGET"],
geo_cols=["SHAPE"],
srid=2193,
force=True)
So, we execute the create_dataframe_from_pandas() function of the hana-ml library to upload the data from a Pandas data-frame to SAP HANA and store it in a database table. It’s worth mentioning that we already make use of the spatial features here. The datasets include a SHAPE column containing the spatial data in Well-Known Text (WKT) format. hana-ml can directly convert WKT columns to geometries. This results in a column with the suffix _GEO (in our example, SHAPE_GEO) in the table. You could even generate a geometry out of longitude and latitude columns, as sketched below.
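As a hedged sketch of the latter: the data-frame df and its LON/LAT columns, as well as the table name, are assumptions.
# Passing a (longitude, latitude) tuple in geo_cols generates a point geometry
hdf_points = create_dataframe_from_pandas(
    connection_context=cc,
    pandas_df=df,  # assumed to contain LON and LAT columns
    table_name="MY_POINTS",
    geo_cols=[("LON", "LAT")],
    srid=4326,
    force=True)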
import hana_ml.graph as hg
g_storm = hg.create_graph_from_dataframes(
connection_context=cc,
vertices_df=v_hdf,
vertex_key_column="ID",
edges_df=e_hdf,
edge_source_column="SOURCE",
edge_target_column="TARGET",
edge_key_column="ID",
workspace_name="LM_STORM_WATER",
)
The discover_graph_workspaces(cc) function lists all the graph workspaces available in your database instance, so you can check that the graph was created:
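A quick check could look like this.
from hana_ml.graph import discover_graph_workspaces
# List all graph workspaces in the database instance
print(discover_graph_workspaces(cc))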
Visualize the Graph with KeplerGl
KeplerGl is a popular framework to visualize geospatial data. We use it in our example to display the stormwater network we’re working with.
We already created the g_storm object above, so the following step is optional. It just shows how you would create a graph object representation in python from an existing graph workspace in SAP HANA.
# Instantiate existing graph
# here only for demo purposes, since we already instantiated it during creation
g_storm = hg.Graph(
connection_context=cc,
workspace_name='LM_STORM_WATER',
schema="MM_USER" # Optional, only needed if the schema differs from your logon schema
)
Next, we run print(g_storm) and get some technical information about the graph. In particular, you see the database tables the graph is based on and the columns that define the source and target vertices of the edges.
The next statements bring up a map to visualize our stormwater network (a sketch of the code follows the list below).
This is what happens:
g_storm.edges_hdf: the graph object provides direct access to several graph properties. The most important ones are edges_hdf and vertices_hdf, which both represent a HANA data-frame referring to the respective table in the database. In this example, we use the edges data-frame of our graph.
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')): a HANA data-frame provides the select method as a filter on the columns you want to query. It can also be used to create new calculated columns on the fly. In this example, we create a column GJ that is based on the column SHAPE_GEO (remember that this was automatically generated as a geometry when we imported the CSV data). First, we transform the data into the SRS 4326 (the spatial reference system expected by Kepler for visualization – remember, our source data used the SRS 2193). Then we transform the data into GeoJSON format, which can be interpreted by Kepler directly. This results in a data-frame with two columns: ID and GJ. The calculated column is not persisted to the database; its lifespan is restricted to the lifespan of the data-frame object. However, you could save() the data-frame to a HANA catalog object (either a new table or a view – see the API documentation for further details).
.collect(): this method materializes the data of the HANA data-frame and transfers it to the client. The data is returned as a Pandas data-frame.
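Putting it together, a sketch of the visualization code could look like this; it assumes KeplerGl is installed, and the variable and layer names are arbitrary.
from keplergl import KeplerGl
# Query the edge geometries, transformed to SRS 4326 and encoded as GeoJSON
pdf_storm_edges = g_storm.edges_hdf \
    .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')) \
    .collect()
map_storm = KeplerGl(height=600)
map_storm.add_data(pdf_storm_edges, name='Stormwater Network')
map_storm  # renders the map in the notebook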
Analyze the Graph
The hana-ml library provides functionality to analyze and process the graph. In this section, we explore some of them.
The function g_storm.describe() gives you statistical information about the graph (e.g. counts, min and max values, density). In our example, we focus on IS_CONNECTED, which indicates that we don’t have a single connected network, but multiple unconnected ones.
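A minimal call on the graph object we created above:
# Statistical summary of the graph; IS_CONNECTED shows whether the
# network forms a single connected graph
print(g_storm.describe())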
Use Graph Algorithms to analyze and process your Graph
The hana-ml library provides algorithms that can be executed on the graph (like shortest path, neighbors, topological sort, or weakly connected components). They can be found in the hana_ml.graph.algorithms package. The execution pattern is always the same:
result = hana_ml.graph.algorithms.<algorithm_name>(graph=<graph_instance>).execute(<parameters>)
Every algorithm is implemented in a separate Python class which you can instantiate (<algorithm_name>). This class expects the graph instance as a constructor parameter (graph=<graph_instance>), so it knows on which graph to perform the algorithm. Finally, every algorithm class implements an execute() method, which triggers the execution of the algorithm. The result gives you access to the results of the algorithm execution. The properties available depend on the algorithm you executed.
WeaklyConnectedComponents
In our case, we can use the WeaklyConnectedComponents algorithm to identify the parts of our network – each weakly connected component represents a sub-graph that is not connected to the other components.
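As a sketch, the call could look like this; the components_count property name follows the hana-ml API, but treat it as an assumption.
import hana_ml.graph.algorithms as hga
# Identify the weakly connected components of the storm water graph
wcc = hga.WeaklyConnectedComponents(graph=g_storm).execute()
print(f"Number of components: {wcc.components_count}")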
The result indicates that we have 8332 independent components in the graph. Let’s take a look at the number of vertices in each component.
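A minimal sketch, assuming the components data-frame contains the columns COMPONENT and NUMBER_OF_VERTICES:
components = wcc.components
# Show the two components with the most vertices
print(components.nlargest(2, 'NUMBER_OF_VERTICES'))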
Since components is a Pandas data-frame, we can use Pandas capabilities to return only the two largest components – 25 and 5. With wcc.vertices, the algorithm provides a data-frame that maps each vertex to the component it belongs to. Let’s store the components in SAP HANA and use the subgraph() method of the graph for our visualization. To upload the data to SAP HANA, we use the same method as when we created the graph. The following statement stores the algorithm results to HANA.
hdf_wcc = create_dataframe_from_pandas(
connection_context=cc,
pandas_df=wcc.vertices,
drop_exist_tab=True,
table_name='LM_STORMWATER_WCC',
force=True,
allow_bigint=True,
primary_key='ID')
Sub-Graphs
As we have seen, the components 25 and 5 are the biggest ones in the graph. Let’s put them on a map to gain further insights. The Graph object provides a method to create subgraphs based on filter criteria on vertices or edges.
g_storm_comp1 = g_storm.subgraph(
workspace_name = "LM_STORMWATER_COMP1",
vertices_filter='ID IN (SELECT ID FROM LM_STORMWATER_WCC WHERE COMPONENT = 25)',
force = True
)
This creates a new graph with the name LM_STORMWATER_COMP1 based on a vertex filter. The filter selects only those of the vertices we persisted in the previous step that belong to component 25. In other words, we now have a new graph, i.e. the vertices and edges that belong to component 25. We do the same for the vertices of component 5.
g_storm_comp2 = g_storm.subgraph(
workspace_name = "LM_STORMWATER_COMP2",
vertices_filter='ID IN (SELECT ID FROM LM_STORMWATER_WCC WHERE COMPONENT = 5)',
force = True
)
Like before, we use Kepler to visualize the two largest weakly connected components.
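A sketch that reuses the visualization pattern from above; the variable and layer names are arbitrary.
from keplergl import KeplerGl
# Query the edge geometries of both subgraphs as GeoJSON in SRS 4326
pdf_comp1_edges = g_storm_comp1.edges_hdf \
    .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
pdf_comp2_edges = g_storm_comp2.edges_hdf \
    .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
map_components = KeplerGl(height=600)
map_components.add_data(pdf_comp1_edges, name='Component 25')
map_components.add_data(pdf_comp2_edges, name='Component 5')
map_components  # renders the map in the notebook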
In this section, we evaluate how one could analyze which sections of our water network could cause a problem reported at a specific access point (i.e. vertex). We look at the Neighbors, NeighborsSubgraph, and ShortestPathOneToAll algorithms.
Get some Information about the Vertex
Let’s assume somebody reported a problem with the node WCC_SW002719 (e.g. there is less flow than expected, which might indicate a broken pipe somewhere else). We want to analyze that further. First, let’s load the vertex itself. We use the same mechanisms as before. Since we want to map it later, we materialize the geo-location as a GeoJSON string.
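A sketch of that query, reusing the select pattern from the visualization section; the vertex key is taken from the text above.
start_vertex_id = 'WCC_SW002719'
# Materialize the reported vertex with its geo-location as a GeoJSON string
pdf_start_vertex = g_storm_comp2.vertices_hdf \
    .filter(f"ID = '{start_vertex_id}'") \
    .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')) \
    .collect()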
Neighbors
Let’s use the Neighbors algorithm to find all neighbors of our reported vertex and display the 5 closest:
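A sketch of the call; the hop bounds are assumptions chosen to match the “5 hops” used below.
import hana_ml.graph.algorithms as hga
# All neighbors within 5 hops of the reported vertex
neighbors = hga.Neighbors(graph=g_storm_comp2).execute(
    start_vertex=start_vertex_id,
    lower_bound=1,
    upper_bound=5)
print(neighbors.vertices.head(5))  # the 5 closest neighbors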
If we want to plot the vertices on the map, we need to read the geometries from the database. For that we use the filter() method of the HANA data-frame.
vkc=g_storm_comp2.vertex_key_column
in_list = neighbors.vertices.ID.str.cat(sep="','")
filter = f"{vkc} IN ('{in_list}')" # Dynamically build the filter condition as SQL WHERE
print(filter)
pdf_storm_comp2_neighbors = g_storm_comp2.vertices_hdf \
.filter(filter) \
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
With that, we query and materialize the geo-locations of all vertices that are within 5 hops of our start vertex. Now we plot the “problem” node and its neighbors.
Upstream and Downstream with NeighborsSubgraph
The image above only shows the vertices reachable within 5 hops of the start vertex. However, to find the root cause of our reported problem, we need to identify all connected vertices. The water network is a directed graph, which means we’re only interested in upstream nodes: if we see an issue at a node (e.g. less flow than expected), the root cause must be somewhere upstream (i.e. incoming to the node we’re looking at). So instead of individual points, let’s look at subgraphs again.
g_neighbors_upstream = hga.NeighborsSubgraph(graph=g_storm_comp2).execute(
start_vertex=start_vertex_id, direction='INCOMING',
lower_bound=0, upper_bound=10000)
This creates a subgraph starting from a given vertex and only evaluates incoming edges (i.e. upstream in a directed graph). We use the same approach for downstream, i.e. following outgoing edges of the start node.
g_neighbors_downstream = hga.NeighborsSubgraph(graph=g_storm_comp2).execute(
start_vertex=start_vertex_id, direction='OUTGOING',
lower_bound=0, upper_bound=10000)
Like above, g_neighbors_upstream.edges and g_neighbors_downstream.edges return only the source and target IDs without the additional information, so we have to load the spatial information of the edges in a separate step again.
ekc = g_storm_comp2.edge_key_column
in_list = g_neighbors_upstream.edges.ID.astype(str).str.cat(sep=',' )
pdf_storm_comp2_neighbors_upstream_edges = g_storm_comp2.edges_hdf \
.filter(f"{ekc} IN ({in_list})") \
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
in_list = g_neighbors_downstream.edges.ID.astype(str).str.cat(sep=',' )
pdf_storm_comp2_neighbors_downstream_edges = g_storm_comp2.edges_hdf \
.filter(f"{ekc} IN ({in_list})") \
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
Looking at the map, we understand that we might need to investigate the red, upstream part of the network. The question now is in which order to check the intakes, to find the one that causes the problems.
ShortestPathOneToAll
The ShortestPathOneToAll algorithm calculates the shortest-path distances from one vertex (our reported node) to all other vertices in the graph.
With direction='INCOMING', we specify that we’re only interested in the upstream (i.e. incoming) vertices. weight='LENGTH_M' specifies which column is used to calculate the ‘distance’. In our case, we take the length of the pipe segment. If no weight is specified, the algorithm calculates the hop distance. With this information, we can investigate our network, assuming that it makes sense to start looking at the closer vertices first to find the problem.
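A sketch of the call; note that, to our knowledge, the algorithm class in hana-ml is named ShortestPathsOneToAll, and the DISTANCE column in the result is an assumption.
# Shortest-path distances from the reported node to all upstream vertices,
# weighted by the pipe segment length
spoa = hga.ShortestPathsOneToAll(graph=g_storm_comp2).execute(
    source=start_vertex_id,
    direction='INCOMING',
    weight='LENGTH_M')
# Closest upstream vertices first – a possible order of inspection
print(spoa.vertices.sort_values('DISTANCE').head())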