Wednesday, July 29, 2015

BigData Goes to Die on USB Drives

With the User Conference plenary behind me, I can catch up on my blog posts. I want to share something with you all: a BigData request pattern that I have been encountering a lot lately, and how I've been responding to it.
See, people are accumulating lots of data, and their traditional means to store and process (forget visualize) this data have been overwhelmed. So they offload their "old" data onto USB drives to make room for the new data. And IMHO, BigData goes to die on USB drives, and that is a shame. A lot of this offloading happens as CSV file exports. I wish they did the exports in something like Avro, where the data and its schema are stored together.
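Just to illustrate that point, here is a minimal sketch of what such an export could look like with the fastavro package. The file names and fields are made up; the point is that the schema travels inside the .avro file instead of living in someone's head.

```python
# Minimal sketch: export rows to Avro so the schema travels with the data.
# File names and field names are hypothetical.
import csv
from fastavro import writer, parse_schema

schema = parse_schema({
    "name": "Record",
    "type": "record",
    "fields": [
        {"name": "id",  "type": "string"},
        {"name": "lon", "type": "double"},
        {"name": "lat", "type": "double"},
        {"name": "ts",  "type": "long"},  # epoch millis
    ],
})

with open("export.csv") as src, open("export.avro", "wb") as dst:
    rows = ({"id": r["id"],
             "lon": float(r["lon"]),
             "lat": float(r["lat"]),
             "ts": int(r["ts"])} for r in csv.DictReader(src))
    writer(dst, schema, rows)  # the schema is embedded in the .avro file header
```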
Nevertheless, they now want to process all this data, and I'm handed these drives with the question, “What can we do with these?”
A lot of this data (and I'm talking millions and millions of records) has a spatial and temporal component, and the question to me (as I do geo) is more like, “How can I filter, dissect, and view this data on a map?”
Typically, these folks have virtualized environments on or off premises and can spin up a couple of Linux or Windows machines very quickly. Well...more than a couple; I usually request 4 to start. Once I have the machines' IPs, I install Hadoop (using either Hortonworks or Cloudera, depending on what the client has adopted) and Elasticsearch.
From the Hadoop stack, I only install ZooKeeper, YARN, HDFS, and Spark. Some folks might wonder why I don't use Solr, since it is part of the distribution. I find ES easier to use, and Hadoop is "temporary" anyway: eventually, all I need is Spark and Elasticsearch.
Then, I start transferring all the data off the USB drives into HDFS. See, HDFS gives me the parallel read that I need to massively bulk load the data with Spark into ES, where all the data is indexed in a spatial, temporal, and attribute manner. A rough sketch of that bulk load is below.
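This is not my actual loader, just a sketch using the current PySpark DataFrame API, assuming the elasticsearch-hadoop connector jar is on the Spark classpath; the HDFS path, index name, and column names are all hypothetical.

```python
# Rough sketch: bulk load the CSVs sitting in HDFS into Elasticsearch with Spark.
# Assumes the elasticsearch-hadoop connector is on the Spark classpath;
# the HDFS path, index name, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("usb-to-es")
         .config("es.nodes", "10.0.0.10")   # any one of the ES nodes
         .config("es.port", "9200")
         .getOrCreate())

# The parallel read off HDFS is what makes this fast.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/usb/*.csv"))

# Shape each record into a document with a geo point, a timestamp, and attributes.
docs = (df
        .withColumn("location", F.concat_ws(",", F.col("lat"), F.col("lon")))
        .select("id", "location", "ts", "attr1", "attr2"))

# Index into ES; the target index should be mapped with a geo_point for
# "location" and a date for "ts" so the spatial and temporal queries work.
(docs.write
     .format("org.elasticsearch.spark.sql")
     .option("es.mapping.id", "id")
     .mode("append")
     .save("tracks/records"))
```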
Once the data is in Elasticsearch, I can query it using ArcPy from ArcGIS Desktop or Server as a geoprocessing (GP) extension or service.
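Again, a hedged sketch of what that GP side could look like: a script tool that filters the (hypothetical) index by extent and time window with the elasticsearch Python client and hands the hits back as a FeatureSet. The index, fields, and parameter order are illustrative, not my actual toolbox.

```python
# Illustrative GP script tool: filter the ES index by bounding box and time
# window, then return the hits as a FeatureSet. Index name, field names,
# and parameter order are hypothetical.
import arcpy
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://10.0.0.10:9200"])

# Tool parameters: extent as "xmin ymin xmax ymax", then a time range.
xmin, ymin, xmax, ymax = [float(v) for v in arcpy.GetParameterAsText(0).split()]
t0 = arcpy.GetParameterAsText(1)
t1 = arcpy.GetParameterAsText(2)

body = {
    "size": 10000,
    "query": {
        "bool": {
            "filter": [
                {"range": {"ts": {"gte": t0, "lte": t1}}},
                {"geo_bounding_box": {"location": {
                    "top_left": {"lat": ymax, "lon": xmin},
                    "bottom_right": {"lat": ymin, "lon": xmax}}}}
            ]
        }
    }
}

# Build an in-memory point feature class and fill it with the hits.
fc = arcpy.CreateFeatureclass_management(
    "in_memory", "hits", "POINT",
    spatial_reference=arcpy.SpatialReference(4326))[0]
arcpy.AddField_management(fc, "TS", "TEXT")

with arcpy.da.InsertCursor(fc, ["SHAPE@XY", "TS"]) as cursor:
    for hit in es.search(index="tracks", body=body)["hits"]["hits"]:
        src = hit["_source"]
        lat, lon = [float(v) for v in src["location"].split(",")]
        cursor.insertRow([(lon, lat), str(src["ts"])])

fs = arcpy.FeatureSet()
fs.load(fc)
arcpy.SetParameter(3, fs)
```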
If the resulting data is "small", then the GP returns a FeatureSet instance; however, that is not what is interesting in the data. What these folks want to see is how the data is clustered and what temporal and spatial patterns are being formed. What are the “life” patterns? What they are interested in is seeing the density, the hotspots, and more importantly the emerging hotspots. That is where the Esri platform excels!
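For the density side, I don't even have to pull the raw features back; I can ask Elasticsearch to aggregate. Something along these lines (again just a sketch, with the same hypothetical index and geo_point field): a geohash_grid aggregation that returns per-cell counts to feed a density or hotspot rendering.

```python
# Sketch: ask ES for a density grid instead of raw features. Each geohash
# cell comes back with a document count that can be rendered as a density
# or hotspot surface. Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://10.0.0.10:9200"])

body = {
    "size": 0,  # no hits, aggregations only
    "aggs": {
        "cells": {
            "geohash_grid": {"field": "location", "precision": 5}
        }
    }
}

resp = es.search(index="tracks", body=body)
for bucket in resp["aggregations"]["cells"]["buckets"]:
    # bucket["key"] is a geohash string; decode it to a cell centroid
    # before writing it out as a weighted point for the map.
    print(bucket["key"], bucket["doc_count"])
```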
As usual, all the source code to do this is here. It is a bit crude, but it works.