At this year’s FedUC, I presented an introduction and a practice session on spatial types in BigData. In these sessions, I demonstrated how to analyze temporal and spatial BigData in the form of Automatic Identification System (AIS) data. This post discusses an ensemble of projects that made a section of this demo possible.

The storage and subsequent processing of the data is very specific to this project, where data is stored and analyzed in a temporal and then a spatial order. Please note the order: temporal, then spatial. That said, I see this pattern in a lot of the BigData projects that I have worked on. This order enables us to take advantage of the native partitioning scheme of paths in HDFS for temporal indexing, which we later augment with a “local” spatial index. So time, or more specifically an hour’s data file, can be located by traversing the HDFS file system in the form /YYYY/MM/DD/HH/UUID, where YYYY is the year, MM is the numeric month, DD is the day, HH is the hour and UUID is a file with a unique name that holds all the data for that hour.

You can imagine a process, such as GeoEvent Processor, that continuously adds data in this pattern. In my case, however, I received the data as a set of GZIP files and used the Hadoop MapReduce AISImport tool to place the data in the correct folders and files in HDFS. Once an hour file is closed, a spatial index file is created that enables spatial search of the data for that hour. This spatial index is based on the QuadTree model for point geometries and is initiated using the AISTools.

I rewrote my “ubiquitous” density MapReduce job to take this new spatial index into account, so that I can now rapidly ask questions such as “What is the AIS density by unique MMSI, and what does it look like at the entry of the harbor every day at 10AM in the month of January?” The following is a visualization of the answer in ArcMap from the Miami Port sample data.
As usual, all the source code can be found here.
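To make the temporal-then-spatial pattern a bit more concrete, here is a minimal Python sketch. It is not the project's code (the real work is done by the Hadoop MapReduce tools mentioned above) and it does not touch HDFS; it only shows how the /YYYY/MM/DD/HH layout narrows a query to a handful of hour folders, and how a small point QuadTree can then answer a bounding-box question within one hour's data. Every name in it (hour_dir, daily_hour_dirs, QuadTree, distinct_mmsi), the capacity and max_depth parameters, and the (lon, lat, mmsi) record layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Point = Tuple[float, float, str]  # (lon, lat, mmsi); this layout is an assumption


def hour_dir(ts: datetime) -> str:
    """Return the /YYYY/MM/DD/HH folder that would hold the data for ts's hour."""
    return ts.strftime("/%Y/%m/%d/%H")


def daily_hour_dirs(year: int, month: int, hour: int) -> List[str]:
    """List the hour folders for the same hour on every day of one month."""
    day = datetime(year, month, 1, hour)
    dirs = []
    while day.month == month:
        dirs.append(hour_dir(day))
        day += timedelta(days=1)
    return dirs


class QuadTree:
    """Minimal point quadtree, a stand-in for the per-hour spatial index."""

    def __init__(self, xmin, ymin, xmax, ymax, capacity=4, depth=0, max_depth=16):
        self.bounds = (xmin, ymin, xmax, ymax)
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points: List[Point] = []
        self.children: Optional[List["QuadTree"]] = None

    def insert(self, x, y, mmsi) -> bool:
        xmin, ymin, xmax, ymax = self.bounds
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            return False  # the point lies outside this node
        if self.children is not None:
            # any() stops at the first child that accepts the point
            return any(c.insert(x, y, mmsi) for c in self.children)
        self.points.append((x, y, mmsi))
        # max_depth keeps many identical points from splitting forever
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()
        return True

    def _split(self) -> None:
        xmin, ymin, xmax, ymax = self.bounds
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            QuadTree(xmin, ymin, xmid, ymid, *args),
            QuadTree(xmid, ymin, xmax, ymid, *args),
            QuadTree(xmin, ymid, xmid, ymax, *args),
            QuadTree(xmid, ymid, xmax, ymax, *args),
        ]
        points, self.points = self.points, []
        for p in points:  # push the stored points down into the new children
            self.insert(*p)

    def query(self, qxmin, qymin, qxmax, qymax) -> List[Point]:
        """Return every stored point inside the (inclusive) query box."""
        xmin, ymin, xmax, ymax = self.bounds
        if qxmax < xmin or qxmin > xmax or qymax < ymin or qymin > ymax:
            return []  # the query box does not overlap this node
        found = [p for p in self.points
                 if qxmin <= p[0] <= qxmax and qymin <= p[1] <= qymax]
        for child in self.children or []:
            found.extend(child.query(qxmin, qymin, qxmax, qymax))
        return found


def distinct_mmsi(tree: QuadTree, xmin, ymin, xmax, ymax) -> int:
    """Count the unique MMSI values reported inside a bounding box."""
    return len({mmsi for _, _, mmsi in tree.query(xmin, ymin, xmax, ymax)})


if __name__ == "__main__":
    # Hypothetical year: the folders to visit for "every day at 10AM in January"
    print(daily_hour_dirs(2013, 1, 10)[:3])
```

In this sketch the time part of the question is answered purely by building paths, so only the matching hour folders would ever be opened, and the spatial part is answered by one index per hour. That is the reason for the temporal-then-spatial ordering: the file system does the coarse pruning for free, and the QuadTree only has to cover a single hour's points.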