Thursday, August 13, 2015

BigData Point-In-Polygon GeoEnrichment

I’m always handed a huge set of CSV records with lat/lon coordinates, and the task at hand is to spatially join these records with a huge set of feature polygons where the output is a geoenhancement of the orignal points with the intersecting polygon’s attributes. An example is a set of retailer customer locations that need to be spatially intersected with demographic polygons for targeted advertisement (Sorry to send you all more junk mail :-).

This is a reference implementation, where both the points data and the polygon data are stored in raw text TSV format and the polygon geometries are in WKT format. Not the most efficient format, but at least the input is splittable for massive parallelization.

The feature class polygons can be converted to WKT using this ArcPy tool.

This Spark based job can be executed in local mode, or better in this docker container.

One of these days will have to re-implement the reading of the polygons from a binary source such as shapefiles or file geodatabases. Until then, you can download all the source code from here.