Ever since I became a Cloudera Certified Developer for Apache Hadoop, I've been walking around with a hammer with "Map Reduce" written on it, looking for Big Data nails to pound. Finally, a real-world problem from a customer came to my attention where a Hadoop implementation would solve his dilemma: given a 250GB (I know, I know, this is _not_ big) CSV data set of demographic data consisting of gender, race, age, income and of course location, and given a set of Point of Interest locations, generate a 50 mile heatmap for each demographic attribute for each of the POI locations.
Using the "traditional" GeoProcessing with Python would take way more than a couple of days to run and would generate over 850GB of raster data. What do I mean by the "traditional" way? You load the CSV data into a GeoDatabase, and then you write an ArcPy script that, for each location, generates a 50 mile buffer, cookie-cuts the demographic data based on an attribute using the buffer, and passes that feature set to the statistical package for density analysis, which generates a raster file. Sure, you can manually partition the process onto acquired high-CPU machines, but as I said, all that has to be done manually and will still take days.
There's gotta be a better way!
Enter Hadoop and a "different" way to process the data by taking advantage of:
- The Hadoop File System's fast, splittable input streaming
- The distributed nature of Hadoop's map and reduce tasks
- The distributed cache for "join" data
- An external, fast, Java-based computational geometry library
- Producing vector data rather than raster images
The last advantage is very important. It is something I call "cooperative processing". See, people forget that there is a CPU/GPU on their client machines. If the server can produce vector data and we let the client render that data based on its capabilities, we get a way more expressive application, and the size of the data is way smaller. I'll explain that in a bit.
Let me go back to the data processing. Actually, there is nothing to do: there is no need for a transform-and-load process, as the CSV data can be placed directly into an HDFS folder. The Hadoop job will take the HDFS folder as input.
The Mapper Task - After instantiation, the 'configure' method is invoked to load the POI locations from the distributed cache, and a 50 mile buffer is generated for each POI location using the fast computational geometry library, whereupon the buffer polygons are stored in a memory-based spatial index for fast intersection lookup. The 'map' method is invoked on every record of the 250GB CSV input, where each record is tokenized for its coordinates and demographic values. Using the coordinates and the prebuilt spatial index, we can find the associated POI locations. Each 50 mile buffer is logically divided into kernel cells. Knowing the POI location, we can mathematically determine the relative kernel cell. We emit the combination of the POI location and the demographic value as the map key, and the relative kernel cell as the map value.
map(k1,v1) -> list(k2,v2)
k1 = lineno
v1 = csv text
k2 = POI location,demographic
v2 = cellx,celly,1
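The relative kernel cell math can be sketched roughly as follows. This is a minimal illustration only; the class name, the 100x100 grid resolution, and the planar mile coordinates are my assumptions here, not the original implementation:

```java
// Sketch of mapping a demographic point to a kernel cell relative to a
// POI's 50 mile buffer. All names and the grid resolution are assumed.
public class KernelCell {

    static final double BUFFER_MILES = 50.0;  // buffer radius around each POI
    static final int CELLS_PER_SIDE = 100;    // kernel grid resolution (assumed)

    // Cell size in miles: the buffer spans 2 * 50 miles per side.
    static final double CELL_MILES = (2.0 * BUFFER_MILES) / CELLS_PER_SIDE;

    // Compute the kernel cell of (x, y) relative to the POI at (poiX, poiY).
    // Coordinates are assumed to be miles on a local planar projection.
    static int[] relativeCell(double x, double y, double poiX, double poiY) {
        int cellx = (int) Math.floor((x - poiX + BUFFER_MILES) / CELL_MILES);
        int celly = (int) Math.floor((y - poiY + BUFFER_MILES) / CELL_MILES);
        return new int[]{cellx, celly};
    }

    public static void main(String[] args) {
        int[] cell = relativeCell(3.0, -7.0, 0.0, 0.0);
        // Emit in the "cellx,celly,1" shape used as the map value.
        System.out.println(cell[0] + "," + cell[1] + ",1");
    }
}
```

Because the cell is relative to the POI, two points the same distance and bearing from two different POIs land in the same cell indices, which is what lets the reduce side aggregate purely on counts.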
Again, by taking advantage of Hadoop's powerful shuffle-and-sort capability on the POI location/demographic key, I ensure that a single reduce task will receive all the cells for a given POI location/demographic combination.
The Reduce Task - For a POI location/demographic combination, the 'reduce' method is invoked with the list of its associated cells. Cells with the same cellx,celly values are aggregated to produce a new list. We compose a JSON document from the new list and emit the string representation of the JSON document using a custom output formatter, in which we override the 'generateFileNameForKeyValue' method to return something of the form "poi-location…".
reduce(k2,list(v2)) -> list(k3,v3)
k2: POI location, demographic
v2: cellx, celly, 1
k3: POI location, demographic
v3: JSON text
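The reduce-side aggregation can be sketched like this. Again a minimal, self-contained illustration; the class name, the helper, and the exact JSON shape (an array of weighted points) are my assumptions, not the original code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the reduce step: "cellx,celly,1" values with the same
// cellx,celly are summed, then serialized as a small JSON document.
public class CellAggregator {

    static String aggregateToJson(List<String> cells) {
        // Sum the trailing count for each distinct "cellx,celly" key.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String cell : cells) {
            int comma = cell.lastIndexOf(',');
            String key = cell.substring(0, comma);
            int weight = Integer.parseInt(cell.substring(comma + 1));
            counts.merge(key, weight, Integer::sum);
        }
        // Compose a JSON array of {x, y, w} weighted points.
        StringBuilder json = new StringBuilder("[");
        boolean first = true;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (!first) json.append(',');
            first = false;
            String[] xy = e.getKey().split(",");
            json.append("{\"x\":").append(xy[0])
                .append(",\"y\":").append(xy[1])
                .append(",\"w\":").append(e.getValue()).append('}');
        }
        return json.append(']').toString();
    }

    public static void main(String[] args) {
        List<String> cells = List.of("3,4,1", "3,4,1", "7,2,1");
        System.out.println(aggregateToJson(cells));
        // → [{"x":3,"y":4,"w":2},{"x":7,"y":2,"w":1}]
    }
}
```

The weight `w` on each point is what the client later uses to drive the gradient intensity when rasterizing the heatmap.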
I was able to validate my progress by running MRUnit tests against my codebase to ensure the soundness of my logic.
I packaged my map/reduce code and the geometry library into a jar, and I was ready to test it on the 250GB CSV.
But where to run this?
Enter Amazon Elastic MapReduce. With a virtual swipe of a credit card, I was able to start up 10 large instances, passing them a reference to my data and my jar in S3. 30 minutes later, a set of JSON files was produced in S3, occupying 238MB of space! Pretty cool, eh? Compare that to days of execution time and 850GB of rasters. What is even more exciting: after a set of trials and errors and density kernel adjustments, I looked up my account balance, and I owed Amazon $37.67 (it would cost more to process a reimbursement request)!
Next comes the fun part: how to represent this JSON data for a particular POI location/demographic? Enter the Flex API for ArcGIS, with its amazing extensibility, and the Flash Player, with its vector graphics and GPU-enhanced bitmap capabilities. See, by using the gradient filling of a drawn circle and the screen blending mode when placing that circle onto a bitmap, a set of close points will dissolve into a heatpoint. So, by taking advantage of this collaborative process between the server and the client, where the server generates a weighted point and lets the client rasterize that point based on its weight, you get an expressive, dynamic application.

Let's push the visualization further to the coolest platform...the iPad. Flex code can be cross-compiled to run natively on an iOS device. Let's push a bit more...3D. Taking advantage of the Stage3D capability, the heatmap vector data can be downloaded at runtime and dynamically morphed into a heightmap and a texture that can be draped over that heightmap. And here is the result...I call it "Heatmap in the Cloud". You can download the pdf of this Esri UC 2012 presentation from here. Have fun.
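For the curious, the screen blend that makes nearby points merge into a heatpoint can be illustrated numerically. The actual rendering happens in the Flash Player; this is just a sketch of the per-channel blend formula, with illustrative intensity values of my choosing:

```java
// Screen blend per channel: result = 255 - (255 - a) * (255 - b) / 255.
// Stacking translucent circles brightens the overlap, which is what
// dissolves a cluster of close points into a single heatpoint.
public class ScreenBlend {

    static int screen(int a, int b) {
        return 255 - ((255 - a) * (255 - b)) / 255;
    }

    public static void main(String[] args) {
        int base = 0;                  // empty bitmap channel
        int circle = 128;              // one gradient circle's intensity (assumed)
        int one = screen(base, circle);
        int two = screen(one, circle); // a second, overlapping circle
        // Intensity grows where points overlap, but never clips past 255.
        System.out.println(one + " -> " + two);
    }
}
```

Unlike plain addition, screen blending saturates smoothly toward 255, so dense clusters glow hotter without the hard clipping an additive blend would produce.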