Thunderhead Explorer: 2020

Monday, August 24, 2020

On Machine Learning in ArcGIS and Data Preparation using Spark

Artificial Intelligence / Machine Learning implementations have been part of Esri software in ArcGIS for a very long time. Geographic Weighted Regression (GWR) and Hot Spot Analysis are Machine Learning algorithms. ArcGIS users have been utilizing supervised and unsupervised learning like clustering to solve a myriad of problems and gain geospatial insight from their data. We just did not call these learning algorithms AI/ML back then. Not until the recent popularity of DeepLearning that finally blossomed from the AI Winter and made AI a household name. And guess what? Esri software has now DeepLearning implementations!

It is important to understand the difference between AI, ML, and DL. I came across this very insightful article that analogizes the relationship to Russian dolls.

Typically, collected data is very "dirty", and a lot of cleaning has to performed on it before processing it through a Machine Learning algorithm. The bigger the data, the harder is the process especially when pruning noisy outliers and anomalies. But then, that could be what you are looking for, outliers and anomalies. That is why "Data Janitor" is a new job title, as a majority of your time when dealing with data is cleaning it! That is why I love to use Apache Spark for this task. It can handle BigData very efficiently in a distributed, parallel share-nothing environment, and the usage of the DataFrame API and SQL makes the maniplation a breeze. Utilizing the latter two in an interactive environment like a Jupyter Notebook enables quick cleaning and more importantly insightful explorations.

Now the best part, we can have all 3, Jupyter, Spark and ML, in one enviroment; ArcGIS Pro!

This notebook and that notebook demonstrate the usage of Spark in Notebook in Pro to clean and explore the data, and then prepare it for Machine Learning. We are using the built-in Forest Based Regression to predict (or attempt to predict) the trip duration of a New York City taxi given a pickup and dropoff location.

The last notebook, enables the user to select pickup locations (in the below case, about JFK) and "see" the errors of the model on a map.

Some will argue that all this could have been done using Pandas, and that is true. But that is not the point of this demonstration and this assumes that all the data could be held in memory which is true in this case (1.45) but will fail when you have 100's of million of rows and you have to process all this data on one machine with limited resources. If you like Pandas, I recommend that you check out Koalas.

Monday, August 3, 2020

ArcGIS Pro, Jupyter Notebook and Databricks¶

Yet another post in the continuing saga of the usage of Apache Spark from a Jupyter notebook within ArcGIS Pro.

In the previous posts, the execution was always within the ArcGIS Pro environment on a single machine, albeit taking advantage of all the cores of that machine. Here, we take a different angle, the execution is performed on a remote cluster of machines in the cloud.

So, we author the notebook locally, but we execute it remotely.

In this notebook, we demonstrate the spatial binning of AIS broadcast points on a Databricks cluster on Azure. In addition, to colocate the data storage with the execution engine for performance purposes, we converted the local feature class of the AIS broadcast points to a parquet file and placed it in the Databricks distributed file system.

More to come :-)

Sunday, August 2, 2020

Virtual Gate Crossing

Yet another continuation post regarding Pro, Notebook, and Spark :-). In this notebook, we will demonstrate a parallel, distributed, share-nothing spatial join between a relatively large dataset and a small dataset.

In this case, virtual gates are defined at various locations in a port, and the outcome is an account of the number of crossings of these gates by ships using their AIS target positions.

Note that the join is to a "small" spatial dataset that we can:

Broadcast to all the spark workers.
Brutly traverse it on each worker, as it is cheaper and faster to do so that spatially index it.

The following are sample gates:

And the following is a sample processed output:

More to come...

Saturday, August 1, 2020

MicroPath Reconstruction of AIS Broadcast Points

This is a continuation of the last post regarding ArcGIS Pro, Jupyter Notebook, and Spark. And, this is a rehash of an older post in a more "modern" way.

Micropathing is the construction of a target's path from a limited set of a consecutive sequence of target points. Typically, the sequence is time-based, and the collection is limited to 2 or 3 target points. The following is an illustration of 2 micropaths derived from 3 target points:

Micropathing is different than path reconstruction, in such that the latter produced one polyline for the path of a target. Path reconstruction losses insightful in-path behavior, as a large number of attributes cannot be associated with the path parts. Some can argue that the points along the path can be enriched with these attributes. However, with the current implementations of Point objects, we are limited to only the extra M and Z to the necessary X and Y. You can also join the PathID and M to a lookup table and gain back that insight, but that joining is typically expensive and is difficult to derive from it the "expression" of the path using traditional mapping. A micropath overcomes today's limitations with today's traditional means to express the path insight better.

So, a micropath is a line composed of typically 2 points only and is associated with a set of attributes that describe that line. These attributes are typical enrichment metrics derived from its two ends. An attribute can be, for example, the traveled distance, time, or speed.

In this notebook, we will construct "clean" micropaths using SparkSQL. What do I mean by clean? As we all know, emitted target points are notoriously affected by noise, so using SparkSQL, we will eliminate that noise during the micropath construction.

Here is a result:

More to come...

Friday, July 31, 2020

On ArcGIS Pro, Jupyter Notebook and Apache Spark

Been a while since I posted something, and thank you, faithful reader, for coming back :-)

I'm intending to writing a series of posts on how to use Apache Spark and Machine Learning within a jupyter notebook within ArcGIS Pro. Yes, you can now start a jupyter notebook instance in ArcGIS Pro to create an amazing data science and data exploration experience. Check out this link to see how to get started with a Jupyter notebook in Pro. But...my favorite hidden "GeoGem", is that Pro comes with built-in Apache Spark, and y'all know how much I love Spark. People think that Spark is intended for only BigData analytics. That is so far from the truth. What I love about it, is the frictionless movement of data and analysis locally or remotely and the language fusion. In my case, I'm using Python, SQL, and Scala.

The usage of Apache Spark in Pro was demonstrated in the publically shared Covid-19 Contact Tracing Application and the Proximity Tracing Application.

In this first notebook, we will start by loading selected features into a Spark dataframe from a local feature class, process the dataframe using Spark SQL, and write the result back to an ephemeral feature class that will be displayed on the map.

Like usual, all the source code can be found here.