Sunday, May 26, 2013

Creating Spatial Crunch Pipelines

Josh Wills (@Josh_wills) introduced me to Apache Crunch which is now a top-level project within the Apache Foundation. Crunch simplifies the coding of data aggregation in MapReduce.

Here is a proof-of-concept project that spatially enables a crunch pipeline with a Point-In-Polygon function from a very large set of static point data with a small set of dynamic polygons.

Crunch has simplified so much so the process, that is came down to a one line syntax:

final PTable<Long, Long> counts = pipeline. readTextFile(args[0]). parallelDo(new PointInPolygon(), Writables.longs()). count();

Crunch's strength is in processing BigData that cannot be stored in the "traditional means", such a time series and graphs. Will be interesting to perform some kind to spatial and temporal analysis with it in a followup post.

Like usual, all the source code can be found here.

No comments: