Monday, February 8, 2016

(Web)Mapping Elephants with Sparks

CSV files (though neither the most efficient format nor the most expressive, given their meager header metadata) are among the most ubiquitous formats for placing data in BigData stores like HDFS. In addition, geospatial fields such as latitude and longitude are now the norm in CSV files originating from, say, a moving GPS-based device.
A request that I receive all the time is “How do I visualize all these data points on the web?” There is a legitimate concern behind this question, which is “How do I visualize millions and millions of points on the web?”. Well, the short answer is “You Don’t!” (actually, you can…but that is a blog post for another day). Though you can download a couple of million points to a web client, beyond that the transfer time becomes prohibitive. However, if you process the data on the server and send down only the aggregated information to be symbolized on the client, then things become more interesting.
A common aggregation technique is binning: imagine you have a virtual fishnet and you cast that fishnet over your point space. All the points that fall in the same fishnet cell are collapsed together and represented by that cell. What you return now are the cells and their associated aggregates, rather than the individual points.
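To make the fishnet idea concrete, here is a minimal sketch in plain Python. It assumes square cells measured in degrees and a simple count as the aggregate; the function names `bin_points` and `cell_center` and the sample coordinates are mine, made up for illustration, and are not taken from the project.

```python
import math
from collections import Counter


def bin_points(points, cell_size):
    """Count points per square fishnet cell.

    points    -- iterable of (lon, lat) tuples
    cell_size -- cell width and height, in the same units as the points
    Returns a Counter keyed by (col, row) cell index.
    """
    counts = Counter()
    for lon, lat in points:
        # floor (not int) so that negative coordinates land in the right cell
        col = math.floor(lon / cell_size)
        row = math.floor(lat / cell_size)
        counts[(col, row)] += 1
    return counts


def cell_center(cell, cell_size):
    """Return the (lon, lat) center of a cell, e.g. for placing a symbol."""
    col, row = cell
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)


# Hypothetical sample points: two close together, two far apart.
points = [(-77.03, 38.89), (-77.04, 38.88), (-76.4, 39.2), (2.35, 48.85)]
cells = bin_points(points, 0.5)
```

Four points go in and three cells come out; with millions of points the reduction is what makes the result small enough to send to a web client.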

This project is a collection of Python tools using the ArcGIS System that retrieves CSV data with geospatial fields from HDFS and displays the aggregation in the form of hexagonal areas using ArcGIS Online web maps and web apps. The processing is done in Python using Apache Spark.
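For the hexagonal flavor of binning, the core step is deciding which hexagon a point belongs to. The sketch below is not the project's code; it is one well-known approach (axial coordinates with cube rounding, pointy-top hexagons) and it assumes the points are already in a planar coordinate system such as projected meters. The names `hex_cell`, `hex_center` and `hex_bin` are hypothetical.

```python
import math
from collections import Counter


def hex_cell(x, y, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y).

    size is the distance from a hexagon's center to any of its corners.
    """
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Cube rounding: round all three cube coordinates, then recompute the one
    # with the largest rounding error so that they still sum to zero.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)


def hex_center(q, r, size):
    """Return the planar (x, y) center of hexagon (q, r)."""
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def hex_bin(points, size):
    """Count points per hexagon; returns a Counter keyed by (q, r)."""
    return Counter(hex_cell(x, y, size) for x, y in points)


# Hypothetical sample: two points near the origin, one about 100 units east.
counts = hex_bin([(0.0, 0.0), (1.0, 1.0), (100.0, 0.0)], 10.0)
```

In a Spark job, a function like `hex_cell` could be mapped over the parsed CSV rows and the resulting keys reduced to counts; that wiring is left out here so the example runs without a cluster.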

The ArcGIS System is a sequential composition of:

  • Desktop with Python-based GeoProcessing extensions for authoring.
  • Server with GeoProcessing endpoints for publishing.
  • Online with WebMaps and WebApps built using AppBuilder for presenting.

As usual, all the source code is here.