I just finished a 3-day training on Cascading by Concurrent, and it was worth every minute. I always knew about Cascading but never invested in it. I wish I had, especially last month when I was doing a BigData ETL job in MapReduce. My development time would have been significantly reduced (pun intended :-) had I thought of the problem in terms of a cascading water flow rather than in MapReduce.
So in Cascading, you compose a data flow from a set of pipes that apply operations such as filtering, joining and grouping, and Cascading turns that flow into a MapReduce job that you can execute on a Hadoop cluster.
Being spatially aware, I _had_ to add a spatial function to Cascading using our GIS Tools For Hadoop geometry API. The spatial function that I decided to implement bins location data by area, such that at the end of the process each area has a count of the locations it covers. This is a nice way to visualize massive data sets.
So, we start with:
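(a hypothetical tab-separated sample matching the ID, X and Y fields the flow expects; the values are made up purely for illustration)

1	-122.4194	37.7749
2	-118.2437	34.0522
3	-122.4083	37.7879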
to produce:
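(again a made-up illustration; the grouping field name comes from the SpatialDensity function, so POLYGON_ID here is only a placeholder)

POLYGON_ID,POPULATION
06075,2
06037,1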
Again, rather than thinking in MapReduce, think data water flow:
package com.esri;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

import java.net.URISyntaxException;
import java.util.Properties;

/**
 * Hello spatial cascading !
 */
public class App
{
    public static void main(final String[] args) throws URISyntaxException
    {
        if (args.length != 3)
        {
            System.err.println("Arguments: [data.tsv] [polygon.shp] [output]");
            return;
        }

        // Input tap: tab-delimited text on HDFS with typed ID, X and Y fields.
        final Fields inFields = new Fields("ID", "X", "Y").
                applyTypes(long.class, double.class, double.class);
        final Tap inTap = new Hfs(new TextDelimited(inFields, false, "\t"), args[0]);

        // Output tap: comma-delimited text on HDFS with a header row.
        final Tap outTap = new Hfs(new TextDelimited(true, ","), args[2], SinkMode.REPLACE);

        // Point-in-polygon function, then group and count by polygon identifier.
        final SpatialDensity spatialDensity = new SpatialDensity();
        Pipe pipe = new Each("start", new Fields("X", "Y"), spatialDensity);
        pipe = new GroupBy(pipe, spatialDensity.getFieldDeclaration());
        pipe = new Every(pipe, Fields.GROUP, new Count(new Fields("POPULATION")));

        // Pass the polygon source to the function and run the flow on Hadoop.
        final Properties properties = AppProps.appProps().
                setJarClass(App.class).
                buildProperties();
        properties.put(SpatialDensity.KEY_SHP, args[1]);
        final FlowConnector connector = new HadoopFlowConnector(properties);
        final Flow flow = connector.connect(inTap, outTap, pipe);
        flow.complete();
    }
}
Here, I have an input tap that accepts text data from HDFS. Each text record is composed of fields separated by a tab character. In Cascading, a tap can define the field names and types. A pipe is created to select the “X” and “Y” fields to be processed by a function. This is a spatial function that utilizes the Esri Geometry API. It loads a set of polygons, defined as a property value, into an in-memory spatial index, which is used to perform a point-in-polygon operation on each input X/Y tuple. The overlapping polygon identifier is emitted as the pipe output. The output polygon identifiers are grouped together and counted by yet another pipe. The tuple set of polygon identifier/count pairs is written to a comma-separated, HDFS-based file using an output tap. The count field is labeled POPULATION to make it ArcGIS friendly :-)
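The actual SpatialDensity function is part of the source code linked below, but here is a minimal sketch of what such a Cascading function can look like. The output field name POLYGON_ID, the KEY_SHP property value and the linear polygon scan are placeholders for illustration only; the real implementation reads the shapefile passed as the second argument and loads the polygons into an in-memory spatial index.

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.operation.OperationCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.Polygon;
import com.esri.core.geometry.SpatialReference;

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only -- the real SpatialDensity lives in the linked repo.
 * Performs a point-in-polygon test on each X/Y tuple and emits the identifier
 * of the containing polygon.
 */
public class SpatialDensity extends BaseOperation<Void> implements Function<Void>
{
    public static final String KEY_SHP = "spatial.density.shp"; // property key value assumed

    private transient List<Polygon> polygons;   // polygon geometries
    private transient List<String> polygonIds;  // matching identifiers
    private transient SpatialReference spatialReference;

    public SpatialDensity()
    {
        super(2, new Fields("POLYGON_ID")); // two args (X, Y); output field name assumed
    }

    @Override
    public void prepare(final FlowProcess flowProcess, final OperationCall<Void> operationCall)
    {
        // Look up the polygon source path from the job properties and load the
        // polygons into memory.  Reading the shapefile (and building a real
        // spatial index) is omitted from this sketch.
        final Object shapePath = flowProcess.getProperty(KEY_SHP);
        polygons = new ArrayList<Polygon>();
        polygonIds = new ArrayList<String>();
        spatialReference = SpatialReference.create(4326);
        // ... load polygons and their identifiers from shapePath ...
    }

    @Override
    public void operate(final FlowProcess flowProcess, final FunctionCall<Void> functionCall)
    {
        final TupleEntry arguments = functionCall.getArguments();
        final Point point = new Point(arguments.getDouble("X"), arguments.getDouble("Y"));
        // Emit the identifier of the first polygon that contains the point.
        for (int i = 0; i < polygons.size(); i++)
        {
            if (GeometryEngine.contains(polygons.get(i), point, spatialReference))
            {
                functionCall.getOutputCollector().add(new Tuple(polygonIds.get(i)));
                break;
            }
        }
    }
}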
As usual, all the source code can be found here.