Tuesday, February 17, 2015

A Whale, an Elephant, a Dolphin and a Python walked into a bar....

This project started simply as an experiment in trying to execute a Spark job that writes to specific path locations based on partitioned key/value tuples. Once I figured out the usage of rdd.saveAsHadoopFile with a customized MultipleOutputFormat implementation and a customized RecordWriter, I was partitioning and shuffling data in all the right places.
Though I could read the content of a file in a path, I could not query selectively the content. So to query the data, I need to SQL map the content. Enter Hive.  It enables me to define a table that is externally mapped by partition to path locations. What makes Hive so neat is that schema is applied on read rather than on write, this is very unlike traditional RDBMS systems. Now, to execute HQL statements, I need a fast engine. Enter SparkSQL. It is such an active project, and with all the optimizations that can be applied to the engine, I think it will rival Impala and Hive on Tez !!
So I came to a point where I can query the data using SQL. But, what if the data becomes too big ? Enter HDFS.  So now, I need to run HDFS on my mac. I could download a bloated Hadoop distribution VM like Cloudera QuickStart or HortworkWorks Sandbox, but I just need HDFS (and maybe YARN :-) Enter Docker. Found the perfect Hadoop image from SequenceIQ that just runs HDFS and YARN on a single node. So now, with a small addition of a config file to my classpath, I can write the data into HDFS and since I have docker now, this enables me to move the Hive Metastore from the embedded Derby to an external RDBMS. Found a post that describes that and bootstrapped yet another container with a MySQL instance to house the Hive Metastore.
Seeing data streaming on the screen like in the Matrix is no fun for me - but placing that data on a map, now that is expressive and can tell a story.  Enter ArcMap (On the TODO list, is to use Pro). Using a Python Toolbox extension, I can include a library that can make me communicate with SparkSQL to query the data and turn it into a set of features on the map.

Wow...Here is what the "Zoo" looks like:

And like usual, all the source code and how to do this yourself is available here.


David Kaiser said...

Wow. If you think the Hortonworks Sandbox is 'bloated', I'd be curious what your definition of "big" data is? 1 Gigabyte?

Move your demo off the laptop to a real cluster... :)

David Kaiser said...

SparkSQL is going to need a LOT of work to be on par with Hive on Tez or even Hive on Spark. SparkSQL so far is only reasonably fast at very small data scales. Hive I/O from RDD won't be able to keep up with Tez I/O from HDFS on larger data sizes.

Here's the presentation that was shown at Strata last week showing some realworld (30TB) scale analysis: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final