Sunday, January 18, 2015

Spark, Cassandra, Tessellation and ArcGIS

If you do BigData and have not heard of or used Spark, then… you are living under a rock!
When executing a Spark job, you can read data from all kinds of sources with URI schemes like file://, hdfs://, and s3://, and can write data to all kinds of sinks with schemes like file:// and hdfs://.
One BigData repository that I’ve been exploring is Cassandra. The DataStax folks released a Cassandra connector for Spark that enables reading data from and writing data to Cassandra.
I’ve posted on Github a sample project that reads the NYC trip data from a local file and tessellates a hexagonal mosaic with aggregates of pickup locations. That aggregation is persisted to Cassandra.
To visualize the aggregated mosaic, I extended ArcMap with an ArcPy toolbox that fetches the content of a Cassandra table and converts it to a set of features in a FeatureClass. The resulting FeatureClass is associated with a graduated symbology to become a layer on the map as follows:

Like usual, all the source code is here.

Saturday, January 17, 2015

Scala Hexagon Tessellation

I've committed myself for 2015 to learning Scala, and I wish I had done so earlier, after 20 years of Java (wow, that makes me sound old :-). I've placed on Github a simple Scala-based library to compute the row/column pair of a planar x/y value on a hexagonal grid.
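The core computation in such a library - mapping a planar x/y to the coordinates of its hexagon - can be sketched with the standard axial-coordinates and cube-rounding technique. This is an illustrative sketch in Java (the actual library is Scala, and its class and method names differ):

```java
// Illustrative sketch: map a planar x/y to axial hex coordinates (q, r)
// on a pointy-top hexagonal grid of a given cell size, using the
// standard fractional-axial-coordinates + cube-rounding technique.
public final class HexGrid {
    private final double size; // distance from a hex center to a corner

    public HexGrid(double size) { this.size = size; }

    /** Returns {q, r}, the axial coordinates of the hex containing (x, y). */
    public long[] xyToQR(double x, double y) {
        // Fractional axial coordinates for a pointy-top layout.
        double q = (Math.sqrt(3.0) / 3.0 * x - y / 3.0) / size;
        double r = (2.0 / 3.0 * y) / size;
        return cubeRound(q, r);
    }

    // Round fractional axial coords to the nearest hex via cube coordinates.
    private static long[] cubeRound(double q, double r) {
        double s = -q - r;
        long rq = Math.round(q), rr = Math.round(r), rs = Math.round(s);
        double dq = Math.abs(rq - q), dr = Math.abs(rr - r), ds = Math.abs(rs - s);
        if (dq > dr && dq > ds) {
            rq = -rr - rs; // repair the coordinate with the largest rounding error
        } else if (dr > ds) {
            rr = -rq - rs;
        }
        return new long[]{rq, rr};
    }
}
```

The cube-rounding step is what keeps points near a hex boundary from being assigned to the wrong cell.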
Will be using that library in following posts...
In the meantime, like usual, all the source code is available here.

Friday, January 2, 2015

Spark SQL DBF Library

Happy new year all… It’s been a while. I was crazy busy from May till mid December of last year implementing BigData geospatial solutions at client sites all over the world. I was in Japan a couple of times, plus Singapore, Malaysia, and the UK, and I do not recall how many times I was in Redlands, Texas, and DC. In addition, I’ve been investing heavily in Spark and Scala. I do not recall the last time I implemented a Hadoop MapReduce job!

One of my resolutions for the new year (in addition to the usual eating right, exercising more, and the never-off-the-bucket-list biking of Mt Ventoux) is to blog more: one post per month at a minimum.

So… to kick the year off right, I’ve implemented a library to query DBF files using Spark SQL. With the advent of Spark 1.2, a custom relation (table) can be defined as a SchemaRDD. A sample implementation is demonstrated by Databricks’ spark-avro: Avro files embed both schema and data, so it is relatively easy to convert them to a SchemaRDD. We in the geo community have just such an “old” format that encapsulates schema and data: the DBF format. Using the Shapefile project, I was able to create an RDD using the Spark context’s Hadoop file API and an implementation of a DBFInputFormat. Then, using the DBFHeader field information, each record was mapped onto a Row to be processed by Spark SQL. This is mostly work in progress and far from being optimized, but it works!
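To give a feel for what the header-reading side of this involves, here is a hedged sketch of parsing a DBF header and its field descriptors - the schema information needed to map each record onto a Row. The class and field names are illustrative, not the project’s:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the project code): pull the record layout out
// of a dBASE III style DBF header. Counts and lengths are little-endian;
// field descriptors start at byte 32, 32 bytes each, terminated by 0x0D.
public final class DbfHeader {
    public final int recordCount, headerLength, recordLength;
    public final List<String> fieldNames = new ArrayList<>();
    public final List<Character> fieldTypes = new ArrayList<>();

    public DbfHeader(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(4);
        recordCount = buf.getInt();             // bytes 4-7: number of records
        headerLength = buf.getShort() & 0xFFFF; // bytes 8-9: header size
        recordLength = buf.getShort() & 0xFFFF; // bytes 10-11: record size
        // Walk the field descriptors until the 0x0D terminator.
        for (int off = 32; bytes[off] != 0x0D; off += 32) {
            StringBuilder name = new StringBuilder();
            for (int i = 0; i < 11 && bytes[off + i] != 0; i++) {
                name.append((char) bytes[off + i]); // null-padded 11-byte name
            }
            fieldNames.add(name.toString());
            fieldTypes.add((char) bytes[off + 11]); // C=char, N=numeric, D=date…
        }
    }
}
```

With the names and types in hand, each fixed-width record can be sliced into values and emitted as a Row.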


Like usual, all the source code can be downloaded from here. Happy new year all.

Sunday, May 4, 2014

Spatially Enabling In-Memory BigData Stores

I deeply believe that the future of BigData stores and processing will be driven by GPUs and based purely on distributed in-memory engines backed by something resilient to hardware failure, like HDFS.
HBase, Accumulo, and Cassandra depend heavily on their in-memory capabilities for their performance. And when it comes to processing, SQL is still King… MemSQL is combining both - pretty impressive.
However, ALL of them lack something that is so important in today’s BigData world: true spatial storage, indexing, and processing of native points, lines, and polygons. SpaceCurve is making great progress on that front.
A lot of smart people have taken advantage of the native lexicographical indexing of these key-value stores and used geohashes to save, index, and search spatial elements, and have solved the Z-order range search. Though these are great implementations, I always thought that the end did not justify the means. There is a need for true and effective BigData spatial capabilities.
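For illustration, the geohash/Z-order keying trick boils down to bit interleaving: quantize x and y, interleave their bits into one Morton code, and a lexicographic scan of the keys roughly preserves spatial locality. A minimal sketch (my own simplification to 16-bit coordinates, not any store’s actual key scheme):

```java
// Sketch of Z-order (Morton) keying: interleave the bits of quantized
// x/y so nearby cells tend to have nearby keys in a lexicographically
// sorted key-value store. Simplified to 16-bit inputs / a 32-bit code.
public final class ZOrder {
    /** Interleave the low 16 bits of x (odd positions) and y (even positions). */
    public static int interleave(int x, int y) {
        return (spread(x) << 1) | spread(y);
    }

    // Spread the low 16 bits of v into the even bit positions of a 32-bit int.
    private static int spread(int v) {
        v &= 0xFFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }
}
```

The Z-order range-search problem mentioned above comes from the fact that a query rectangle maps to many discontiguous runs of this key space.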
I’ve been a big fan of Hazelcast for quite some time now and have always been impressed by their technology. In their latest release, they added a MapReduce API, so you can now send programs to the data - very cool!
But… like the others, they lack the spatial aspect when it comes to my world. So… here is a set of small tweaks that truly spatially enables this in-memory BigData engine. I used the MapReduce API and the spatial index in an example to visualize conflict hotspots in Africa.
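The actual tweaks are in the linked project; as a rough illustration of the idea, here is a minimal grid-based spatial index of the kind that can sit behind an in-memory map - points are bucketed by cell, and a box query only visits the cells it overlaps (all names here are hypothetical, not Hazelcast APIs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a uniform-grid spatial index. Each point lands in
// a cell keyed by its floor-divided coordinates; a bounding-box query
// scans only the overlapping cells and refines point-by-point.
public final class GridIndex {
    private final double cellSize;
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    public GridIndex(double cellSize) { this.cellSize = cellSize; }

    private long key(long cx, long cy) {
        return (cx << 32) ^ (cy & 0xFFFFFFFFL); // pack col/row into one long key
    }

    public void put(double x, double y) {
        long cx = (long) Math.floor(x / cellSize);
        long cy = (long) Math.floor(y / cellSize);
        cells.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(new double[]{x, y});
    }

    /** Count points inside [xmin,xmax] x [ymin,ymax]. */
    public int query(double xmin, double ymin, double xmax, double ymax) {
        int count = 0;
        long cx0 = (long) Math.floor(xmin / cellSize), cx1 = (long) Math.floor(xmax / cellSize);
        long cy0 = (long) Math.floor(ymin / cellSize), cy1 = (long) Math.floor(ymax / cellSize);
        for (long cx = cx0; cx <= cx1; cx++)
            for (long cy = cy0; cy <= cy1; cy++) {
                List<double[]> pts = cells.get(key(cx, cy));
                if (pts == null) continue;
                for (double[] p : pts)
                    if (p[0] >= xmin && p[0] <= xmax && p[1] >= ymin && p[1] <= ymax) count++;
            }
        return count;
    }
}
```

In a distributed map, the cell key doubles as the partition key, so a query fans out only to the members owning the touched cells.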

Like usual, all the source code can be downloaded from here.

Monday, March 3, 2014

Yet Another Temporal/Spatial Storage/Analysis of BigData

At this year’s FedUC, I presented an introduction and a practice session on spatial types in BigData. In these sessions, I demonstrated how to analyze temporal and spatial BigData in the form of Automatic Identification System (AIS) data. This post discusses the ensemble of projects that made a section of this demo possible.

The storage and subsequent processing of the data is very specific to this project, where data is stored and analyzed in a temporal and then a spatial order. Please note the order: temporal, then spatial. However, I see this pattern in a lot of the BigData projects that I have worked on. This order enables us to take advantage of the native partitioning scheme of paths in HDFS for temporal indexing, which we later augment with a “local” spatial index. So time, or more specifically an hour’s data file, can be located by traversing the HDFS file system in the form /YYYY/MM/DD/HH/UUID, where YYYY is the year, MM is the numeric month, DD is the day, HH is the hour, and UUID is a file with a unique name that holds all the data for that hour.

You can imagine a process such as GeoEvent Processor continuously adding data in this pattern. In my case, however, I received the data as a set of GZIP files and used the Hadoop MapReduce AISImport tool to place the data in the correct folders/files in HDFS. Once an hour file is closed, a spatial index file is created that enables spatial search of the data for that hour. This spatial index is based on the QuadTree model for point-specific geometry and is initiated using the AISTools.

I rewrote my “ubiquitous” density MapReduce job to take this new spatial index into account, so that I can now rapidly ask questions such as “What is the AIS density by unique MMSI, and what does it look like at the entry of the harbor every day at 10AM in the month of January?” The following is a visualization of the answer in ArcMap from the Miami Port sample data.
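The /YYYY/MM/DD/HH/UUID partitioning described above can be sketched as a small helper that derives the hour folder for a record’s timestamp (an illustrative snippet, not project code; the UUID file name would be appended by whatever process owns the open hour file):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative sketch: map an event timestamp to its temporal HDFS
// partition folder, so locating an hour's data is a pure path traversal.
public final class HourPath {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    /** Returns the /YYYY/MM/DD/HH folder for the given epoch-millisecond time. */
    public static String folderFor(long epochMillis) {
        return "/" + FMT.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

Because the partitioning is purely lexical, a temporal range query reduces to listing a handful of folders before any spatial filtering happens.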

Like usual, all the source code can be found here.

Wednesday, January 29, 2014

Cascading Workflow for Spatial Binning of BigData

I just finished a 3-day training on Cascading by Concurrent, and it was worth every minute. I always knew about Cascading but never invested in it, and I wish I had, especially last month when I was doing a BigData ETL job in MapReduce. My development time would have been significantly reduced (pun intended :-) if I had thought of the problem in terms of a cascading water flow rather than in MapReduce.
So in Cascading, you compose a data flow with a set of pipes carrying operations such as filtering, joining, and grouping, and Cascading turns that flow into a MapReduce job that you can execute on a Hadoop cluster.
Being spatially aware, I _had_ to add a spatial function to Cascading using our GIS Tools For Hadoop geometry API. The spatial function that I decided to implement bins location data by area, such that at the end of the process each area has a count of the locations it covers. This is a nice way to visualize massive data.
So, we start with:

to produce:

Again, rather than thinking in MapReduce, think data water flow:

Here, I have an input tap that accepts text data from HDFS. Each text record is composed of fields separated by a tab character. In Cascading, a tap can define the field names and types. A pipe is created to select the “X” and “Y” fields to be processed by a function. This is a spatial function that utilizes the Esri Geometry API. It loads into an in-memory spatial index a set of polygons defined as a property value, which is used to perform a point-in-polygon operation on each input X/Y tuple. The overlapping polygon’s identifier is emitted as the pipe output. The output polygon identifiers are grouped together and counted by yet another pipe. The tuple set of polygon identifier/count is written to a comma-separated, HDFS-based file using an output tap. The count field is labeled POPULATION to make it ArcGIS friendly :-)
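Outside of Cascading, the heart of that spatial function can be shown with a simplified stand-in: a ray-casting point-in-polygon test followed by a group-and-count. This is an illustration only - the real function uses the Esri Geometry API’s spatial index and emits tuples through pipes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the spatial binning described above: classify
// each x/y into the first polygon containing it (ray casting), then
// count occurrences per polygon id, as the GroupBy/Count pipes would.
public final class SpatialBin {
    /** Ray-casting point-in-polygon test against a simple polygon ring. */
    static boolean contains(double[][] ring, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double xi = ring[i][0], yi = ring[i][1], xj = ring[j][0], yj = ring[j][1];
            if ((yi > y) != (yj > y) && x < (xj - xi) * (y - yi) / (yj - yi) + xi)
                inside = !inside;
        }
        return inside;
    }

    /** Emulates the flow: map x/y to a polygon id, then group and count. */
    public static Map<Integer, Integer> bin(Map<Integer, double[][]> polygons, double[][] points) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (double[] p : points)
            for (Map.Entry<Integer, double[][]> e : polygons.entrySet())
                if (contains(e.getValue(), p[0], p[1])) {
                    counts.merge(e.getKey(), 1, Integer::sum);
                    break; // emit the first overlapping polygon id, like the pipe
                }
        return counts;
    }
}
```

Points falling outside every polygon are simply dropped, which matches the filtering behavior of the flow: only overlapping identifiers reach the counting pipe.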
Like usual, all the source code can be found here.
 

Friday, January 24, 2014

Hadoop and Shapefiles

Shapefiles are still today the ubiquitous way to share and exchange geospatial data. I’ve been getting a lot of requests lately from BigData Hadoop users to read shapefiles directly off HDFS - I mean, after all, the 3rd V (variety) should allow me to do that. Since the shapefile format was developed by Esri, there was always an "uneasiness" in me, as an Esri employee, about using third-party open source tools (GeoTools and JTS) to read these shapefiles when we have just released our geometry API on Github. In addition, I always thought that these powerful libraries were too heavy for my needs, when I just wanted to plow through a shapefile in a map or reduce phase of a job. So, I decided to write my own simple Java implementation that for now reads just points and polygons. This 20% implementation should for now cover my 80% usage. I know that there exist a lot of Java implementations on the net that read the shp and dbf formats, but I wanted one that is tailored to my BigData needs, especially when it comes to writable instances, and more importantly, one that generates geometry instances based on our geometry model and API. Like usual, all the source code can be found here.
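As a hedged illustration of what such a minimal reader deals with - not the project’s code - here is a sketch that pulls point records out of a .shp byte array. The format mixes byte orders: a 100-byte file header, big-endian record headers, and little-endian record bodies:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: extract Point records (shape type 1) from the
// bytes of a .shp file. Record headers are big-endian; the shape type
// and the x/y doubles in each record body are little-endian.
public final class ShpPoints {
    public static List<double[]> read(byte[] shp) {
        List<double[]> points = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(shp);
        buf.position(100);                       // skip the 100-byte file header
        while (buf.remaining() >= 8) {
            buf.order(ByteOrder.BIG_ENDIAN);
            buf.getInt();                        // record number (unused here)
            int contentLength = buf.getInt();    // content length in 16-bit words
            buf.order(ByteOrder.LITTLE_ENDIAN);
            int shapeType = buf.getInt();
            if (shapeType == 1) {                // 1 = Point
                points.add(new double[]{buf.getDouble(), buf.getDouble()});
            } else {                             // skip non-point record bodies
                buf.position(buf.position() + contentLength * 2 - 4);
            }
        }
        return points;
    }
}
```

Wrapping something like this in a Hadoop InputFormat then lets each split emit writable geometry instances straight into the map phase.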