In this blog post, I will show how to use some basic SAP HANA Graph functions to understand dependencies in a directed graph. The demo data describes python packages and their dependencies. Resolving package dependencies is required if a package manager like “Anaconda” needs to check for pre-requisites before installing a new package. However, the patterns discussed in this article are generic and apply to use cases in manufacturing, utilities, IT and many other industries. The data and code is available on github.
dependencies among python packages
Many complex real-world systems are networks. Most often, a failure or change of one of its parts also impacts dependent parts and shocks spreads through the system. Understanding dependencies is a crucial step in many risk analysis and impact simulation scenarios, for instance:
◉ Utility networks – in an electricity network you need to understand which parts of the network go down if you switch a breaker; in a water network you need to understand which parts of the network feed into a pipe you would like repair.
◉ Manufacturing – which customers are affected if one of your suppliers can not deliver a raw material and the production process is affected.
◉ IT operations – in data center you have to know which systems/applications/processes went down in case of a hardware failure.
Usually, such networks are modelled as graphs, i.e. a set of vertices connected by a set of edges. If you are a database developer working on an application to analyze dependencies in such graphs, you would either have to execute a lot of join queries on relational tables… or you would go for a graph database. SAP HANA has a built-in graph engine, so you can run graph operations for dependency analysis directly on your relational data.
Let’s explore some fundamental patterns and how they can be implemented using database procedures in SAP HANA. As an example, we use a (slightly outdated) dataset on python package dependencies which I found. A subset of this network is depicted below – the “networkx” package and its dependencies.
dependencies of the networkx package
Introduction
In SAP HANA, you create a so-called GRAPH WORKSPACE to expose your data to the built-in graph engine. A graph workspace is defined on two data structures – vertices and edges. The “python packages” dataset we use is simple: the vertices are the python packages, the edges describe their direct dependencies.
There are “root” packages which do not depend on any other, and there are “top-level” packages which no other packages depend upon. But most packages have dependencies and are themselves a pre-requisite for others.
The guiding questions in our dependency analysis scenario are:
◉ Which packages does “networkx” – directly or indirectly – depend upon?
◉ Which packages depend on “networkx”?
Schematically, we are looking for downstream (orange) and upstream (blue) dependencies of the black node in the middle.
Basic Reachability Analysis
To get downstream dependencies, we need to identify all vertices in our graph that can be reached from “networkx” following the edges in outgoing direction. In SAP HANA, this can be handled by a database procedure:
CREATE PROCEDURE "DEPENDENCY_ANALYSIS"."GS_REACHABLE_VERTICES"( ... )
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
GRAPH g = Graph( "DEPENDENCY_ANALYSIS", "PACKAGE_GRAPH");
VERTEX v_start = Vertex(:g, 'networkx');
MULTISET<Vertex> m_reachVertices = REACHABLE_VERTICES(:g, :v_start, 'OUTGOING');
o_vertices = SELECT :v."PACKAGE_NAME" FOREACH v IN :m_reachVertices;
END;
First, a graph “g” is created by referring to a graph workspace.
Then, a vertex “v_start” is constructed using its key ‘networkx’.
Finally, the REACHABLE_VERTICES function is called on graph “g”, starting at vertex “v_start”, traversing the edges in OUTGOING direction.
The procedure then returns a simple flat list of all reachable vertices, i.e. all packages required to run “networkx”. A python package manager software would need to check/install the following packages before installing “networkx”.
Limit Traversal Depth
The REACHABLE_VERTICES function returns all dependencies – not matter the hop distance, i.e. the number of edges in between. If we want to limit the result to only include packages which are a defined number of hops away from the start package, we can use the NEIGHBORS function. The neighbors function takes a minimum and maximum distance parameter.
MULTISET<Vertex> m_neighbors = NEIGHBORS(:g, :v_start, 2, 3, 'OUTGOING');
The function above returns vertices which are 2..3 hops away from the start vertex. The procedure below shows how to use i_min and i_max parameters in a procedure.
CREATE PROCEDURE "DEPENDENCY_ANALYSIS"."GS_NEIGHBORS"(
IN i_start NVARCHAR(5000), -- the key of the start vertex, which is its package name
IN i_dir VARCHAR(10), -- the direction of the traversal: OUTGOING, INCOMING, ANY
IN i_min BIGINT, -- the minimum hop distance
IN i_max BIGINT, -- the maximum hop distance
OUT o_vertices "DEPENDENCY_ANALYSIS"."TT_VERTICES", -- the vertices result set
OUT o_edges "DEPENDENCY_ANALYSIS"."TT_EDGES" -- the edges result set
)
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
GRAPH g = Graph("DEPENDENCY_ANALYSIS","PACKAGE_GRAPH");
VERTEX v_start = Vertex(:g, :i_start);
MULTISET<Vertex> m_neighbors = NEIGHBORS(:g, :v_start, :i_min, :i_max, :i_dir);
o_vertices = SELECT :v."PACKAGE_NAME" FOREACH v IN :m_neighbors;
o_edges = SELECT :e."ID", :e."SOURCE", :e."TARGET"
FOREACH e IN EDGES(:g, :m_neighbors, :m_neighbors);
END;
Note that the procedure returns two result sets: o_vertices and o_edges.
◉ o_vertices contains the result from the neighbors function.
◉ o_edges contains all the edges between any of the neighbors.
This pattern can be used to identify a vertex’ so-called “ego-graph“, i.e. the graph of all vertices which are less than a certain distance away. Networkx’ 1-hop ego-graph could look like this:
Hop Distance and Spanning Tree
When dealing with larger dependency graphs, it is sometime helpful to return the distance information with the result. One way to do this is by leveraging the SHORTEST_PATHS_ONE_TO_ALL (SPOA) function. The SPOA function below returns a graph and stores DISTANCE information for each vertex.
GRAPH g_spoa = SHORTEST_PATHS_ONE_TO_ALL(:g, :v_start, "DISTANCE", :i_dir);
Looking at the vertex result, we see that “sphinx” is an indirect dependency, it is 2 hops away from “networkx”.
In addition to the vertices, the SPOA graph also contains edges. In contrast to the ego-graph of the NEIGHBORS example, the SPOA graph contains only edges which are on a shortest path – the SPOA graph is a spanning tree. For outgoing dependencies this might look like this.
Filter and Stop Conditions
In more complex tracing scenarios it is sometimes required to have more control over the traversal… you might want to check “local stop conditions” to decide if you want to “walk over” a certain vertex or edge. In an electricity network, you might want to stop the traversal when hitting a transformer (i.e. you cross the boundary of a low voltage to high voltage network). This kind of control is provided by the BREADTH FIRST SEARCH (BFS) traversal operator. It allows you to explore the graph incrementally, hooking into each vertex or edge visit to conditionally stop the traversal or run custom logic. In the below picture, the BFS traversal starts at “networkx” and hits a stop condition at the crossed-out vertex, leaving the left lower vertex unexplored.
The below BFS statement explores graph g, starting from vertex v_start. At each vertex visited, it checks the package’s version and conditionally ends the traversal of a specific path. Furthermore, the number of incoming and outgoing edges (IN- and OUT-DEGREE) of each visited vertex is checked and only “root” or “top-level” packages are put into a container for later output.
TRAVERSE BFS (:i_dir) :g FROM :v_start ON VISIT VERTEX (Vertex v_visited) {
IF (:v_visited."VERSION" > '1' OR :v_visited."PACKAGE_NAME" == :i_start) {
IF ( OUT_DEGREE(:v_visited) == 0L OR IN_DEGREE(:v_visited) == 0L ) {
s_o_vertices = :s_o_vertices || :v_visited;
}
}
ELSE { END TRAVERSE; }
};
Set Operations and Basic Aggregation
So we can go “downstream” and “upstream”, calculate hop distance, and deal with stop conditions. How can we analyze the dependencies for multiple start vertices?
◉ Which requirements do “networkx” and “weihnachtsgurke” have in common?
◉ What happens to my production if all 12 suppliers from a specific region cannot deliver?
◉ Which materials do products A, B, and C have in common?
Technically, we are talking about UNION and INTERSECT set operations. Our graph procedure needs to take a set of start vertices. This can be done be using a table parameter “i_startVertices”. The below procedure identifies the common neighbors for multiple start vertices.
CREATE PROCEDURE "DEPENDENCY_ANALYSIS"."GS_REACHABLE_VERTICES_INTERSECTION"(
IN i_startVertices "DEPENDENCY_ANALYSIS"."TT_VERTICES",
IN i_dir VARCHAR(10),
OUT o_vertices "DEPENDENCY_ANALYSIS"."TT_VERTICES"
)
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
GRAPH g = Graph( "DEPENDENCY_ANALYSIS", "PACKAGE_GRAPH");
MULTISET<NVARCHAR> m_startVertices = Multiset<NVARCHAR> (:i_startVertices."PACKAGE_NAME");
MULTISET<Vertex> m_commonNeighbors = Multiset<Vertex>(:g);
FOREACH package IN :m_startVertices {
MULTISET<Vertex> m_reachableVertices = REACHABLE_VERTICES(:g, Vertex(:g, :package), :i_dir);
IF (IS_EMPTY(:m_commonNeighbors)) { m_commonNeighbors = :m_reachableVertices; }
ELSE { m_commonNeighbors = :m_commonNeighbors INTERSECT :m_reachableVertices; }
}
o_vertices = SELECT :v."PACKAGE_NAME" FOREACH v IN :m_commonNeighbors;
END;
Looking at the intersection of the dependencies of “weihnachtsgurke” and “networkx”, the common neighbors procedure above would find the green vertices.
If we take a step back and look at our python dependency graph as a whole, we can ask:
◉ Which packages have the most dependencies on “root” packages?
◉ Which packages are required by most “top-level” packages?
Obviously, this type of analysis requires to take a look at all of the ~26k packages in our sample dataset. The below procedure loops over all packages and stores the dependency counts in a table.
CREATE PROCEDURE "DEPENDENCY_ANALYSIS"."GS_DEPENDENCY_COUNT"( ... )
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
GRAPH g = Graph( "DEPENDENCY_ANALYSIS", "PACKAGE_GRAPH");
BIGINT i = 1L;
MULTISET<Vertex> m_v = Vertices(:g);
MULTISET<Vertex> m_reach = MULTISET<Vertex>(:g);
FOREACH v_start IN :m_v {
m_reach = NEIGHBORS(:g, :v_start, -100, 100);
o_vertices."PACKAGE_NAME"[:i] = :v_start."PACKAGE_NAME";
o_vertices."UPSTREAM_TOP_LEVEL_PACKAGES"[:i] =
COUNT(v_reach IN :m_reach WHERE IN_DEGREE(:v_reach) == 0L);
o_vertices."DOWNSTREAM_ROOT_PACKAGES"[:i] =
COUNT(v_reach IN :m_reach WHERE OUT_DEGREE(:v_reach) == 0L);
i = :i + 1L;
}
END;
Sorting the result of the above procedure reveals that 5999 top-level packages require “sphinx”. The “onegov.onboarding” package has 162 root package dependencies.
No comments:
Post a Comment