Josh Wills (@Josh_wills) introduced me to Apache Crunch, which is now a top-level project within the Apache Software Foundation. Crunch simplifies the coding of data aggregation in MapReduce.
Here is a proof-of-concept project that spatially enables a Crunch pipeline with a point-in-polygon function, matching a very large set of static point data against a small set of dynamic polygons.
Crunch simplified the process so much that it came down to a one-line syntax:
final PTable<Long, Long> counts = pipeline.readTextFile(args[0]).parallelDo(new PointInPolygon(), Writables.longs()).count();
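The post does not show what PointInPolygon does internally. As a rough sketch only, and leaving Crunch itself out so the example stays self-contained, the logic could look something like the following. Everything here is an assumption made for illustration: the class name PointInPolygonSketch, the "x,y" text format of each input line, the use of a polygon's array index as its id, and the choice of the ray casting (even-odd) test. In the pipeline, the per-point lookup would presumably live in the function passed to parallelDo, and the tally would be what count() produces.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the point-in-polygon logic; it has no Crunch dependency. */
final class PointInPolygonSketch {

    /**
     * Ray casting (even-odd rule). xs/ys hold the polygon vertices in order;
     * the last vertex is implicitly connected back to the first.
     */
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
            // Only edges that straddle the horizontal line through the point can be crossed.
            boolean straddles = (ys[i] > py) != (ys[j] > py);
            if (straddles) {
                // x coordinate where the edge meets that horizontal line.
                double xCross = xs[j] + (py - ys[j]) * (xs[i] - xs[j]) / (ys[i] - ys[j]);
                if (px < xCross) {
                    inside = !inside; // each crossing to the right flips the state
                }
            }
        }
        return inside;
    }

    /** Returns the index of the first polygon containing the point, or -1 if none does. */
    static long findPolygon(double[][] polyXs, double[][] polyYs, double px, double py) {
        for (int p = 0; p < polyXs.length; p++) {
            if (contains(polyXs[p], polyYs[p], px, py)) {
                return p;
            }
        }
        return -1L;
    }

    /** Counts points per polygon id from "x,y" lines (assumed format); unmatched points are skipped. */
    static Map<Long, Long> countByPolygon(double[][] polyXs, double[][] polyYs, String[] lines) {
        Map<Long, Long> counts = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.split(",");
            double x = Double.parseDouble(tokens[0].trim());
            double y = Double.parseDouble(tokens[1].trim());
            long id = findPolygon(polyXs, polyYs, x, y);
            if (id >= 0) {
                counts.merge(id, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two axis-aligned squares used as sample polygons.
        double[][] xs = {{0, 4, 4, 0}, {10, 14, 14, 10}};
        double[][] ys = {{0, 0, 4, 4}, {10, 10, 14, 14}};
        String[] points = {"1,1", "2,3", "11,12", "20,20"};
        System.out.println(countByPolygon(xs, ys, points));
    }
}
```

Because the polygon set is small, a linear scan over the polygons per point is tolerable in this sketch; a spatial index would be the natural next step if the polygon count grew.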
Crunch's strength is in processing big data that cannot be stored by "traditional means", such as time series and graphs. It will be interesting to perform some kind of spatial and temporal analysis with it in a follow-up post.
As usual, all the source code can be found here.