Saturday, May 18, 2013

BigData: Launch CDH on EC2 from ArcMap in under 5 minutes

well....after you get all the necessary software, certificates and... setup everything correctly :-)

Update: Regarding the above comment - you can download a zip file containing all the necessary jars and the toolbox so like that you do not have to package the project from scratch.

The idea here is that I would like an ArcGIS user to just push a button from within ArcMap and have a Cloudera based Hadoop cluster started on Amazon EC2. From there on, a user can edit features in ArcMap that can be exported into that cluster to be used as an input to a MapReduce job. The output of the MapReduce job is imported back into ArcMap for further analysis. This combination of SmallData (GeoDatabase) and BigData (Hadoop) is a great fusion in a geo-data scientist arsenal. When done with the analysis, the user again will push a button and destroys the cluster, thus paying for what he/she used while having access to elastic resources.

The following is a sequence of prerequisite steps that you need to execute to get going:

Create an Amazon Web Service account and setup AWS security credentials. You need to remember your 'Access Key ID' and 'Secret Access Key'. They are ugly long strings that we will need later to create our Hadoop cluster in EC2.

Download the PuTTY Windows installer from and run it. This will add two additional programs to your system; putty.exe and puttygen.exe. Will use puttygen.exe to generate a OpenSSH formatted public and private key to enable a secure tunneled communication with the soon to be created AWS instances.

Here are the steps to create the keys:

  • Launch PuTTYGen.
  • Set the number of bits in generated keys to 2048.
  • Click the 'Generate' button and move your mouse randomly in the 'blank' area.
  • To save your private key, click the 'Conversions' menu option and select the 'Export OpenSSH key' menu item. An alert will be displayed, warning you that you are exporting this key without a passphrase. Press 'Yes' to save it
  • To save your public key, select and copy the content of the text area. Open notepad, paste the content and save the file.
  • Finally, save this key in a PuTTY format by clicking on the 'Save private key' button.

Run as administrator the 'JavaConfigTool' located in C:\Program Files (x86)\ArcGIS\Desktop10.1\bin and adjust the Minimum heap size to 512, Maximum heap size to 999, and the Thread stack size to 512.

The deployment of the Cloudera Distribution of Hadoop (CDH) on Amazon EC2 is executed by Apache Whirr. You can find great documentation here on how to run it from the command line. However in my case, I wanted to execute the process from within ArcMap. So, I wrote a GeoProcessing extension that you can add to ArcCatalog enabling the invocation of the deployment tool from within ArcMap.

Whirr depends on a set of configurations to launch the cluster. The values of some of the parameters depend on the previous two steps. Here is a snippet of my '' file.

whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,3 hadoop-datanode+hadoop-tasktracker

Please follow the instructions on Github to install the WhirrToolbox in ArcMap.

To launch the cluster, start ArcMap, show the catalog window, expand the 'WhirrToolbox' and double click the 'LaunchClusterTool'. Select a cluster configuration file and click the 'Ok' button.

Now, this will take a little bit less that 5 minutes.

 To see the Whirr progress, disable the background processing in the GeoProcessing Options by unchecking the 'Enable' checkbox:

Once the cluster is launched successfully, Whirr will create a folder named '.whirr' in your home folder. In addition, underneath that folder, a subfolder will be created and will be named after the value of 'whirr.cluster-name' property in the configuration file.

The file 'hadoop-site.xml' contains the Hadoop cluster properties.

I've written yet another GP tool (ClusterPropertiesTool) that will convert this xml file in a properties file. This properties file will be used in the subsequent tools when exporting, importing feature classes from and to the cluster and running map reduce jobs.

Now, all the communication with the cluster is secure and should be tunneled through an ssh connection. To test this, we will use PuTTY. But before starting PuTTY, we need the name node IP. This can be found in the 'instance' file in the '.whirr' subfolder. Make sure to get the first IP in the 'namenode' row, not the one that starts with ''.

Start PuTTY. Put in the 'Host Name (or IP Address)' field the value of the name node IP.

In the Connection / SSH / Auth form, browse for the PuTTY formatted private key that we save previously.

 In The Connection / SSH / Tunnels, select the 'Dynamic' radio button and enter '6666' in the 'Source port' field and click the 'Add' Button.

Click the 'Open' button, and that should connect you to your cluster.

Enter your username (this is your Windows login username) and hit enter, and you should see a welcome banner and a bash shell prompt.

Once you are done with the cluster, the DestroyClusterTool GP tool will destroy the cluster.

In the next post, we will use that cluster to export, import feature classes and run map reduce jobs.

Like usual, all the source code can be found here.


Priya said...

I'm getting this error "unable to execute the selected tool". Pls advise on what might be wrong.(Any debug logs that I can look into.)


Priya said...
This comment has been removed by the author.
Priya said...

I was able to resolve the issue. i I was trying to run the tool in ARCGIS 10.0 moving it to 10.1 worked like a charm. Thanks for the wonderful blog

Shafi said...

I have a cloudera VM (linux) installed locally can I use it instead of Amzon.
Thanks for your help

Shafi said...

Please note in Continuation of above comment am using Arcmap on windows