Monday, May 20, 2013

Export FeatureClass to Hadoop, Run MapReduce, Visualize in ArcMap

In the previous post we launched a CDH cluster on EC2 in under 5 minutes. In this post, we will use that cluster to perform geospatial analytics in the form of MapReduce and visualize the results in ArcMap. ArcMap is one of the desktop tools that a geodata scientist will use when working with and visualizing spatial data.  The use case I have in mind is something like the following: point data is streamed, through GeoEventProcessor for example, into Amazon S3. The user has a set of polygons in ArcGIS that needs to be spatially joined with that big data point content. The result of the big data join is then linked back to the polygon set for symbol classification and visualization.

After editing the polygons in ArcMap, the user exports the feature class into HDFS using the ExportToHDFSTool.

Using the new Esri Geometry API for Java, a MapReduce job is written as a GeoProcessing extension, such that it can be executed directly from within ArcMap. The result of the job is converted directly back into a feature class.

The tool expects the following parameters:

  • A Hadoop configuration in the form of a properties file.
  • A user name that Hadoop will use as its credentials, with that user's privileges, when executing the job.
  • The big data input to use as the source - in the above use case that will be the S3 data.
  • The small data polygon set to spatially join and aggregate.
  • The remote output folder - where the MapReduce job will put its results.
  • A list of jar files that will be used by the DistributedCache to augment the MapReduce classpath. That is because the Esri Geometry API for Java jar is not part of the Hadoop distribution, and we use the distributed cache mechanism to "push" it to each node.
  • The output feature class to create when returning the MapReduce job results from HDFS.
This code borrows resources from Spatial Framework for Hadoop for the JSON serialization.

For the ArcGIS Java GeoProcessing Extension developers out there - I would like to point you to a couple of tricks that will make your development a bit easier and hopefully will make you bang your head a little bit less when dealing with ArcObjects - LOL!

I strongly recommend that you use Apache Maven for your build process. In addition to encouraging a good code structure and a unified repository, it comes with a great set of plugins to assist in the deployment process.  The first plugin is the maven-dependency-plugin, which copies all the runtime dependencies into a specified folder.  I learned the hard way that when ArcMap starts up, it introspects all the classes in all the jars in the extension folder, looking for the @ArcGISExtension annotation declaration.  Now this is fine if you are writing a nice "hello world" GP extension, but in our case, where we depend on 30+ jars, well… ArcMap never starts. The solution is to put all the jars in a separate subfolder and declare a Class-Path entry in the main jar's manifest file that references all the dependencies. This is where the second plugin comes to the rescue: the maven-jar-plugin can be configured to automatically generate a manifest containing a Class-Path entry that references all the dependencies declared in the Maven POM.
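As a sketch, the relevant pom.xml configuration looks roughly like this (the lib/ folder name and the package phase binding are my assumptions; adjust to your project layout):

```xml
<build>
  <plugins>
    <!-- Copy all runtime dependencies into target/lib -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>copy-dependencies</goal></goals>
          <configuration>
            <outputDirectory>${project.build.directory}/lib</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <!-- Generate a manifest whose Class-Path entry references the jars in lib/ -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <addClasspath>true</addClasspath>
            <classpathPrefix>lib/</classpathPrefix>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>
```

With this in place, only the main jar lives in the extension folder, so ArcMap's annotation scan stays fast, while the manifest's Class-Path pulls in everything under lib/ at runtime.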

Speaking of the classpath: you might think that all is well and good with the above setup, but there is one more thing you have to do to force the correct classpath when executing the GP task, and that is to set the current thread's context class loader to the system class loader, as follows:


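A minimal sketch of that call (the class and main method here are just scaffolding for illustration; in practice the one-line call goes at the top of your GP tool's execute method):

```java
public class ContextClassLoaderFix {
    public static void main(String[] args) {
        // Force class resolution through the system class loader rather than
        // whatever loader ArcMap assigned to the current thread, so the GP
        // task sees the jars referenced by the manifest Class-Path entry.
        Thread.currentThread().setContextClassLoader(ClassLoader.getSystemClassLoader());
    }
}
```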
It took me 3 weeks to figure this out - so hopefully somebody will find this useful.

As usual, all the source code can be found here.


Priya said...

Could you please outline how you upload earthquakes.csv (the big data)? Did you use the ExportToHDFSTool here as well?

Mansour Raad said...

For earthquakes.csv, I used the Hadoop CLI:
$ hadoop fs -put earthquakes.csv earthquakes.csv