Tuesday, 27 December 2016

Port of Antwerp from the Opendata challenge perspective – part 4

In the last part (part 3), we saw how to import a CSV file into HANA using HANA Studio where we converted the geoJSON filed into WKT, let’s now see how we can take care of the same content but in a JSON format.
if you remember, our CSV file (http://datasets.antwerpen.be/v4/gis/grondwaterstroming.csv) is also available in JSON format.

So the link to our JSON content is: http://datasets.antwerpen.be/v4/gis/grondwaterstroming.json, this will only return the first thousand rows (and luckily this has only 123 rows, but I will show you how to implement the pagination).


We will also be using Node.js to convert the JSON content into CSV and take care of the geometry field at the same time.

I will assume from now that you have installed Node.js and the packages from the previous blog.

You will to install the following additional package using the following comand from your working directory:

npm install fast-csv

You can now create a new file named ‘convert-geojson-to-wkt-json.js‘ and add the following code:

var csv = require('fast-csv');
var fs = require('fs');
var request     = require("super-request");
var Terraformer = require('terraformer');
var WKT = require('terraformer-wkt-parser');

var input_host = "http://datasets.antwerpen.be";
var input_path = "/v4/gis/grondwaterstroming.json";

var writeOptions = {
headers : true,
quoteHeaders : true,
quoteColumns : true,
rowDelimiter :'\n',
delimiter : ';'
};

var csvWriteStream = csv
.format(writeOptions);

var writableStream = fs.createWriteStream("C:/temp/grondwaterstroming-json-out.csv")
.on("finish", function(){
console.log("All done");
});

csvWriteStream.pipe(writableStream);

function transformOpenDataInCSVForHANA (json) {
var i = 0;
json.data.forEach(function(item) {
var geometry = new Terraformer.Primitive(JSON.parse(item.geometry));
if (geometry.hasHoles()) {
console.log("found holes in " + item.id + ". let's remove the holes");
} else {
item.geometry = WKT.convert(JSON.parse(item.geometry));
csvWriteStream.write(item);
}
i++;
    });
}

function getJSONOpenData (param_host, param_path) {
var record_count = 1;
// Start the request
request(param_host)
// first let's get the number of records from the first page
.get(param_path)
.qs(function () {
return {page_size: record_count};
})
.expect(200)
.end(
function (error, response, json) {
if (!error && response.statusCode === 200) {
var page = JSON.parse(json);
record_count = page.paging.records;
} else {
console.log(error);
}
}
)
// now let's get all the records
.get(param_path)
.qs(function () {
return {page_size: record_count};
})
.expect(200)
.end(
function (error, response, json) {
console.log("page_size " + record_count);
if (!error && response.statusCode === 200) {
transformOpenDataInCSVForHANA (JSON.parse(json));
} else {
console.log(error);
}
csvWriteStream.end();
}
);
}
getJSONOpenData (input_host, input_path);

The program will read the data from the URL made of the input_host and input_host variables (in other words: http://datasets.antwerpen.be/v4/gis/grondwaterstroming.json), then will write the output in ‘C:/temp/grondwaterstroming-json-out.csv‘ where the JSON properties will be transposed and the geoJSON field will be converted in a WKT format.

You can now run the following command:

node convert-geojson-to-wkt-json.js

Then you should be able to run the import just like in the last blog part.

A few more word about the pagination on this Opendata web site, they implemented it using a HTTP parameter named ‘page_size’. The first page will return you the number of records.
This is why in the code I call twice the URL but in the first one, I just retrieve one record, just so I can get the total number of records and use it in the second call.
Obviously this may lead to some performance issue if the the number of record is really huge, and might require some rework on the way I call the the transformOpenDataInCSVForHANA function.

No comments:

Post a Comment