Monday, March 16, 2015

BigData, MemSQL and ArcGIS Interceptors

Last week, at the Developer Summit, we unveiled Server Object Interceptors. They have the same API as Server Object Extensions, and are intended to extend an ArcGIS Server with custom capabilities. An SOI intercepts REST and/or SOAP calls on a MapServer before and/or after it executes the operation on an SOE or SO. Think servlet filters.

A use case of an SOI associated with a published MXD is to intercept an export image operation on its MapService and digitally watermark the original resulting image. Another use case of an interceptor is to use the associated user credentials in the single-sign-on request to restrict the visibility of layers or data fields.

This is pretty neat, and being the BigData Advocate, I started thinking about how to use this interceptor in a BigData context. The stars could not have been more aligned than when I heard that the MemSQL folks had announced geospatial capabilities in their InMemory database. See, I knew for a while that they were spitballing native geospatial types, but the fact that they showcased it at Strata + Hadoop World made me reach out to them to see how we could collaborate.
The idea is that since ArcGIS Server does not natively support MemSQL, and since MemSQL natively supports the MySQL wire protocol, I can use the MySQL JDBC driver to query MemSQL from an SOI and display the result on a map.
The good folks at MemSQL bootstrapped a set of AWS instances with their “new” engine and loaded the now-very-famous New York City taxi trips data. This (very very small) set consists of about 170 million records with geospatial and temporal information such as pickup and drop-off locations and times. Each trip has additional attributes such as travel time, distance and number of passengers. It was now up to me to dynamically query and display this information in a standard WebMap on every map pan and zoom. What I mean by “standard” here is that an out-of-the-box WebMap should be able to interact with this MemSQL database without being augmented with a new layer type or any other functionality. Thus the usage of an SOI. It intercepts the call to an export image operation on a “stand-in” MapService, with the map extent as an argument, and executes a spatial MemSQL query on the AWS instances. The result set is drawn on an off-screen PNG image and is sent back to the requesting WebMap for display as a layer on the map.
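To make the "draw the result set on an off-screen image" step a bit more concrete, here is a minimal Python sketch of the coordinate math only. It is not the SOI code (that is Java against the ArcGIS API); the function names `to_pixel` and `rasterize` are made up for illustration, and it assumes the requested extent and image size have already been parsed into numbers and that the query has returned plain x/y pairs.

```python
def to_pixel(x, y, extent, width, height):
    """Map a map coordinate to a pixel of an image that covers `extent`."""
    xmin, ymin, xmax, ymax = extent
    px = int((x - xmin) / (xmax - xmin) * (width - 1))
    # Image rows grow downward while map y grows upward, so flip the axis.
    py = int((ymax - y) / (ymax - ymin) * (height - 1))
    return px, py


def rasterize(points, extent, width, height):
    """Count how many points land on each pixel of a width x height grid."""
    xmin, ymin, xmax, ymax = extent
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Skip anything outside the requested extent.
        if xmin <= x <= xmax and ymin <= y <= ymax:
            px, py = to_pixel(x, y, extent, width, height)
            grid[py][px] += 1
    return grid
```

The per-pixel counts could then be mapped to colors and encoded as a PNG by whatever imaging library is at hand.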

Like usual, all the source code can be found here.

Tuesday, February 17, 2015

A Whale, an Elephant, a Dolphin and a Python walked into a bar....

This project started simply as an experiment in trying to execute a Spark job that writes to specific path locations based on partitioned key/value tuples. Once I figured out the usage of rdd.saveAsHadoopFile with a customized MultipleOutputFormat implementation and a customized RecordWriter, I was partitioning and shuffling data in all the right places.
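The routing idea, stripped of Spark and Hadoop, can be sketched in a few lines of plain Python. This is only an illustration of "key decides the output path": the `name=value` directory layout and the function names are my assumptions here, not what the MultipleOutputFormat implementation in the project necessarily produces.

```python
from collections import defaultdict
import posixpath


def partition_path(base, key):
    """Build an output directory from an ordered tuple of (name, value) pairs."""
    parts = ["{}={}".format(name, value) for name, value in key]
    return posixpath.join(base, *parts)


def route(records, base):
    """Group (key, value) tuples by the path that their key maps to."""
    outputs = defaultdict(list)
    for key, value in records:
        outputs[partition_path(base, key)].append(value)
    return dict(outputs)
```

In the real job, each path would get its own record writer instead of an in-memory list.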
Though I could read the content of a file in a path, I could not selectively query the content. So to query the data, I need to map the content to SQL. Enter Hive. It enables me to define a table that is externally mapped by partition to path locations. What makes Hive so neat is that the schema is applied on read rather than on write, which is very unlike a traditional RDBMS. Now, to execute HQL statements, I need a fast engine. Enter SparkSQL. It is such an active project, and with all the optimizations that can be applied to the engine, I think it will rival Impala and Hive on Tez!!
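Mapping a partition to a path location comes down to one HQL statement per partition. The small Python helper below just builds that statement as a string; the table name, partition columns and path in the usage are hypothetical placeholders, and values are naively quoted, so treat it as a sketch rather than something to feed untrusted input.

```python
def add_partition_hql(table, spec, location):
    """Build an ALTER TABLE ... ADD PARTITION statement for an external table.

    `spec` is an ordered sequence of (column, value) pairs.
    """
    cols = ", ".join("{}='{}'".format(name, value) for name, value in spec)
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}'"
        .format(table, cols, location)
    )
```

For example, `add_partition_hql("trips", [("year", 2015), ("month", 2)], "/data/year=2015/month=2")` yields a statement that points that one partition of a hypothetical `trips` table at that directory.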
So I came to a point where I can query the data using SQL. But what if the data becomes too big? Enter HDFS. So now, I need to run HDFS on my Mac. I could download a bloated Hadoop distribution VM like Cloudera QuickStart or Hortonworks Sandbox, but I just need HDFS (and maybe YARN :-) Enter Docker. Found the perfect Hadoop image from SequenceIQ that just runs HDFS and YARN on a single node. So now, with a small addition of a config file to my classpath, I can write the data into HDFS. And since I have Docker now, I can also move the Hive Metastore from the embedded Derby to an external RDBMS. Found a post that describes that and bootstrapped yet another container with a MySQL instance to house the Hive Metastore.
Seeing data streaming on the screen like in the Matrix is no fun for me - but placing that data on a map, now that is expressive and can tell a story. Enter ArcMap (using Pro is on the TODO list). Using a Python Toolbox extension, I can include a library that lets me communicate with SparkSQL to query the data and turn it into a set of features on the map.

Wow...Here is what the "Zoo" looks like:

And like usual, all the source code and how to do this yourself is available here.

Monday, February 2, 2015

Accumulo and Docker

If you want to experiment with BigData and Accumulo, then you can use docker to build an image and run a single node instance using this docker project.

In that container, you will have a single instance of Zookeeper, YARN, HDFS and Accumulo. You can 'hadoop fs -put files' in HDFS, run MapReduce jobs and start an Accumulo interactive shell.

There was an issue with setting vm.swappiness directly in the docker container: it was not taking effect. The only way I could make it stick was to set it in the docker daemon environment, so that it is "inherited" (not sure if this is the correct term) by the container.

This project was an experiment for me in the hot topic of container based applications using docker, and as a way to share with colleagues a common running environment for some upcoming Accumulo based projects.

And so far it has been a success :-) You can pull the image using:

docker pull mraad/accumulo

And like usual, all the source code is here.

Sunday, January 18, 2015

Spark, Cassandra, Tessellation and ArcGIS

If you do BigData and have not heard of or used Spark, then… you are living under a rock!
When executing a Spark job, you can read data from all kinds of sources with URI schemes like file, hdfs and s3, and you can write data to all kinds of sinks with schemes like file and hdfs.
One BigData repository that I’ve been exploring is Cassandra.  The DataStax folks released a Cassandra connector to Spark enabling the reading and writing of data from and to Cassandra.
I’ve posted on Github a sample project that reads the NYC trip data from a local file and tessellates a hexagonal mosaic with aggregates of pickup locations.  That aggregation is persisted onto Cassandra.
To visualize the aggregated mosaic, I extended ArcMap with an ArcPy toolbox that fetches the content of a Cassandra table and converts it to a set of features in a FeatureClass. The resulting FeatureClass is associated with a graduated symbology to become a layer on the map as follows:

Like usual all the source code is here.

Saturday, January 17, 2015

Scala Hexagon Tessellation

I've committed myself for 2015 to learn Scala, and I wish I had done that earlier after 20 years of Java (wow, that makes me sound old :-). I've placed on Github a simple Scala based library to compute the row/column pair of a planar x/y value on a hexagonal grid.
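For readers who want to see the idea without opening the Scala code, here is a minimal Python sketch of one common way to do this: convert x/y to fractional axial coordinates on a pointy-top grid and snap to the nearest cell with cube rounding. It is not a port of the library; the axial (q, r) pair stands in for the column/row, and `size` (center to corner distance) and the function names are my own choices for illustration.

```python
import math


def hex_cell(x, y, size):
    """Return the axial (q, r) cell that contains the planar point (x, y)."""
    # Fractional axial coordinates on a pointy-top hexagonal grid.
    q = (math.sqrt(3.0) / 3.0 * x - y / 3.0) / size
    r = (2.0 / 3.0 * y) / size
    # Cube rounding: round all three cube coordinates, then repair the one
    # with the largest rounding error so that they still sum to zero.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def hex_center(q, r, size):
    """Return the planar center of the axial cell (q, r)."""
    return size * math.sqrt(3.0) * (q + r / 2.0), size * 1.5 * r
```

Aggregating points per cell is then just a matter of counting how often each (q, r) pair comes back.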
Will be using that library in upcoming posts...
In the meantime, like usual, all the source code is available here.

Friday, January 2, 2015

Spark SQL DBF Library

Happy new year all…It’s been a while. I was crazy busy from May till mid December of last year implementing BigData geospatial solutions at client sites all over the world. Was in Japan a couple of times, Singapore, Malaysia, UK, and I do not recall how many times I was in Redlands, Texas and DC. In addition, I’ve been investing heavily in Spark and Scala. Do not recall the last time I implemented a Hadoop MapReduce job!

One of the resolutions for the new year (in addition to the usual eating right, exercising more and the never-off-the-bucket-list biking Mt Ventoux) is to blog more. One post per month as a minimum.

So…to kick the year off right, I’ve implemented a library to query DBF files using Spark SQL. With the advent of Spark 1.2, a custom relation (table) can be defined as a SchemaRDD. A sample implementation is demonstrated by Databricks’ spark-avro: Avro files have an embedded schema along with the data, so it is relatively easy to convert them to a SchemaRDD. We in the geo community have such an “old” format that encapsulates schema and data: the DBF format. Using the Shapefile project, I was able to create an RDD using the Spark context Hadoop file API and the implementation of a DBFInputFormat. Then, using the DBFHeader fields information, each record was mapped onto a Row to be processed by SparkSQL. This is mostly work in progress and is far from being optimized, but it works!
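To show why DBF lends itself to this, here is a small, hedged Python sketch that reads the field descriptors from a DBF header and slices one fixed-width record by them. It assumes the classic dBASE III layout (32-byte header, 32-byte field descriptors, 0x0D terminator) and keeps every value as a stripped string. It is not the DBFInputFormat or DBFHeader code from the Shapefile project, and the function names are made up.

```python
import struct


def read_dbf_fields(header):
    """Return (record_count, [(name, type, length, decimals), ...])."""
    # Bytes 4-7: record count, 8-9: header length, 10-11: record length.
    record_count, header_len, _record_len = struct.unpack_from("<IHH", header, 4)
    fields = []
    offset = 32  # field descriptors start right after the 32-byte header
    while offset < header_len and header[offset] != 0x0D:
        raw_name, ftype, length, decimals = struct.unpack_from(
            "<11sc4xBB", header, offset)
        name = raw_name.split(b"\x00", 1)[0].decode("ascii")
        fields.append((name, ftype.decode("ascii"), length, decimals))
        offset += 32
    return record_count, fields


def decode_record(record, fields):
    """Slice one fixed-width record into a dict of stripped strings."""
    row = {}
    pos = 1  # byte 0 of every record is the deletion flag
    for name, _ftype, length, _decimals in fields:
        row[name] = record[pos:pos + length].decode("ascii").strip()
        pos += length
    return row
```

The field list plays the role of the schema, and each decoded record plays the role of a row, which is the same split the SchemaRDD needs.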

Like usual, all the source code can be downloaded from here. Happy new year all.

Sunday, May 4, 2014

Spatially Enabling In-Memory BigData Stores

I deeply believe that the future of BigData stores and processing will be driven by GPUs and purely based on distributed InMemory engines that are backed by something resilient to hardware failure like HDFS.
HBase, Accumulo, Cassandra depend heavily on their in-memory capabilities for their performance. And when it comes to processing, SQL is still King….MemSQL is combining both - pretty impressive.
However, ALL lack something that is so important in today’s BigData world, and that is true spatial storage, indexing and processing of native points, lines and polygons. SpaceCurve is making great progress on that front.
A lot of smart people have taken advantage of the native lexicographical indexing of these key value stores and used geohash to save, index, and search spatial elements, and have solved the Z-order range search. Though these are great implementations, I always thought that the end did not justify the means. There is a need for true and effective BigData spatial capabilities.
I’ve been a big fan of Hazelcast for quite some time now and was always impressed by their technology. In their latest implementation, they have added a MapReduce API, so that now you can send programs to data - very cool!
But…like the others, they lack the spatial aspect when it comes to my world. So…here is a set of small tweaks that truly spatially enable this in-memory BigData engine. I’ve used the MapReduce API and the spatial index in an example to visualize hotspot conflict in Africa.
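For context, the trick those implementations rely on is bit interleaving: quantize longitude and latitude, then weave their bits into a single sortable number so that nearby points tend to share key prefixes. The Python sketch below shows a plain Morton (Z-order) code under that assumption. It is not a geohash encoder (geohash adds its own bit order and base-32 text encoding), and the function names and the default of 16 bits per axis are arbitrary choices for illustration.

```python
def interleave(ix, iy, bits=16):
    """Interleave the low `bits` of ix and iy into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((ix >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((iy >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code


def z_value(lon, lat, bits=16):
    """Quantize lon/lat to `bits` per axis and return the Morton code."""
    cells = (1 << bits) - 1
    ix = int((lon + 180.0) / 360.0 * cells)
    iy = int((lat + 90.0) / 180.0 * cells)
    return interleave(ix, iy, bits)
```

A bounding box query then turns into one or more ranges over these codes, which is where the Z-order range search mentioned above comes in.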

Like usual, all the source code can be downloaded from here.