Friday, 22 December 2017

Using Topology for Data Analysis II

In this second part 1, we use Topological Data Analysis (TDA) on a dataset consisting on spatial information related to traffic. We’ll compare to usual “DBSCAN” method from Machine Learning.

DBSCAN is a method for finding clusters in data. It means Density-Based Spatial Clustering of Applications with Noise. It usually requires two parameters and the data: a radius, known as eps, and the minimum number of points required to form a cluster, that is the “density” part.

In any case, the parameters are unknown a priori. TDA in this case can help giving connected components as initial election of clusters and also, being robust against noise, these clusters will persist.

And a quick visualization of the dataset can be helpful for comparing these methods:

We can inmediately see some clusters given by traffic in this trajectory. The data is nested in a small range so the radius is gonna be 0.0005 and the minimal number of points we set it to 3 so that we can keep close gps positions.

-- In the standard PAL procedure we introduce the parameters we defined:
INSERT INTO DB_PARAMS VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO DB_PARAMS VALUES ('DISTANCE_METHOD', 2, null, null);
INSERT INTO DB_PARAMS VALUES ('MINPTS', 3, null, null); 
INSERT INTO DB_PARAMS VALUES ('RADIUS', null, 0.0005, null); 
-- Remeber to call this procedure using a predefined view.
CALL _SYS_AFL.PAL_DB (DB_DATA, DB_PARAMS, DB_RESULTS) WITH OVERVIEW;

The results can be visualized in R with the “dbscan” package and a very interesting idea that lets you see the convex hull of the clusters:


We can see that the eps was too small and the algorithm detected to many clusters but it kept the ones corresponding to traffic. Taking a bigger parameter we can find better results:


So the algorithm finds the traffic but it’s sensitive to noise induce by the fact that gps positions are not nicely distributed. So, to make a better approximation we turn to Topological Data Analysis expecting 4 clusters or more, but as connected components.

DROP PROCEDURE "TDA";
-- procedure with R script using TDA package
CREATE PROCEDURE "TDA" (IN gps_data "GPS_DATA", OUT persistence "PERSISTENCE")
LANGUAGE RLANG AS 
BEGIN
library(TDA)
persist <- function(gps_data){
    #You can find how to construct this example in TDA package documentation

    gps_vector <- cbind("V1" = gps_data$longitude, "V2" = gps_data$latitude)
    xlimit <- c(-37.1, -37.0)
    ylimit <- c(-11.0, -10.8)
    by <- 0.0002
    x_step <- seq(from = Xlim[1], to = Xlim[2], by = by)
    y_step <- seq(from = Ylim[1], to = Ylim[2], by = by)
    grid <- expand.grid(x_step, y_step)
    diag <- gridDiag(X = gps_vector, FUN = distFct, lim = cbind(xlimit, ylimit), by = by,
                 sublevel = FALSE, library = "Dionysus", printProgress = FALSE)
    # Since gridDiag returns a list, we access this in any way we want:
    diagram <- Diag[["diagram"]]
    topology <- data.frame(cbind("dimension"=a[,1],"death"=a[,2],"birth"=a[,3]))
    return(topology)
    }
# Use the function
persistence <- persist(gps_data)
END;
-- call to keep results in a table
CALL "TDA" ("VEHIC_DATA", "PERSISTENCE") WITH OVERVIEW;

And the resulting barcode is the following:


This is a very good result, we can explore datasets related to traffic by looking at its topological properties and extract information relevant to us, these are topological features. After finding these we can take a look at other tools in machine learning to have more detailed information.

In this case, topological information plays a crucial role since it gives geometric insight to start our research, it is a frame for machine learning and gives us mathematical support for the choice of the parameters usually given by a rule of thumb.

Topology has a big part to play in the development of Machine Learning in general and many different ideas are being explored, not only persistent homology. Also, we are currently working on new applications of this tool, so KeepTheThread.

No comments:

Post a Comment