Monday, August 24, 2020

On Machine Learning in ArcGIS and Data Preparation using Spark

Artificial Intelligence / Machine Learning implementations have been part of Esri's ArcGIS software for a very long time. Geographically Weighted Regression (GWR) and Hot Spot Analysis are Machine Learning algorithms, and ArcGIS users have been applying supervised and unsupervised learning, like clustering, to solve a myriad of problems and gain geospatial insight from their data. We just did not call these learning algorithms AI/ML back then, not until the recent popularity of Deep Learning, which finally blossomed out of the AI Winter and made AI a household name. And guess what? Esri software now has Deep Learning implementations!

It is important to understand the difference between AI, ML, and DL. I came across a very insightful article that analogizes the relationship to Russian nesting dolls: DL sits inside ML, which in turn sits inside AI.

Typically, collected data is very "dirty", and a lot of cleaning has to be performed on it before processing it through a Machine Learning algorithm. The bigger the data, the harder the process, especially when pruning noisy outliers and anomalies. But then again, those outliers and anomalies could be exactly what you are looking for. There is a reason "Data Janitor" is a new job title: the majority of your time when dealing with data is spent cleaning it! That is why I love to use Apache Spark for this task. It handles BigData very efficiently in a distributed, parallel, share-nothing environment, and the DataFrame API and SQL make the manipulation a breeze. Using the latter two in an interactive environment like a Jupyter Notebook enables quick cleaning and, more importantly, insightful exploration.
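To make that concrete, here is a minimal sketch of the kind of cleaning pass I mean, mixing the DataFrame API and SQL. The file path, column names, and cutoff values are illustrative assumptions modeled on the NYC taxi data used later, not the actual notebook code.

```python
# A minimal sketch: cleaning NYC taxi trips with the Spark DataFrame API and SQL.
# The file path, column names, and cutoffs are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("TaxiCleanup").getOrCreate()

# Read the raw trips; inferSchema keeps the sketch short (declare a schema for real jobs).
df = spark.read.csv("trips.csv", header=True, inferSchema=True)

# DataFrame API: prune obviously noisy rows before any modeling.
clean = (df
         .dropna(subset=["trip_duration", "pickup_longitude", "pickup_latitude"])
         .filter((F.col("trip_duration") > 60) & (F.col("trip_duration") < 4 * 3600))
         .filter(F.col("pickup_longitude").between(-74.3, -73.6))
         .filter(F.col("pickup_latitude").between(40.4, 41.0)))

# SQL: the same data, explored interactively in the notebook.
clean.createOrReplaceTempView("trips")
spark.sql("""
    SELECT hour(pickup_datetime) AS hr,
           count(1) AS trips,
           avg(trip_duration) AS avg_sec
    FROM trips
    GROUP BY hr
    ORDER BY hr
""").show()
```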

Now the best part: we can have all three, Jupyter, Spark, and ML, in one environment, ArcGIS Pro!
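For readers who want to try this themselves, the sketch below shows one way to stand up a local Spark session from a notebook inside Pro. It assumes pyspark has been added to Pro's conda environment; the memory setting is an illustrative value, not a recommendation.

```python
# A hedged sketch: starting a local Spark session from a notebook in ArcGIS Pro.
# Assumes pyspark was installed into the active Pro conda environment.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")                    # use all local cores
         .appName("ProNotebook")
         .config("spark.driver.memory", "8g")   # illustrative value; size to your machine
         .getOrCreate())

print(spark.version)
# ...cleaning and exploration as shown above...
spark.stop()  # release the JVM when done
```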

This notebook and that notebook demonstrate the usage of Spark in a Notebook in Pro to clean and explore the data, and then prepare it for Machine Learning. We use the built-in Forest-based Classification and Regression tool to predict (or attempt to predict) the trip duration of a New York City taxi given a pickup and dropoff location.
The last notebook enables the user to select pickup locations (in the case below, around JFK) and "see" the errors of the model on a map.
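The post itself relies on Pro's built-in tool, but the same idea can be illustrated outside Pro with scikit-learn's RandomForestRegressor, a comparable forest-based regressor. Everything here, the column names and the CSV exported from the Spark step, is an assumed setup for illustration, not the notebooks' actual code.

```python
# Illustrative only: a forest-based regression on the cleaned trips, using
# scikit-learn's RandomForestRegressor in place of Pro's built-in tool.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assumed export of the cleaned Spark output from the earlier step.
trips = pd.read_csv("clean_trips.csv")

features = ["pickup_longitude", "pickup_latitude",
            "dropoff_longitude", "dropoff_latitude"]
X_train, X_test, y_train, y_test = train_test_split(
    trips[features], trips["trip_duration"], test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

# Per-trip errors like these are what the last notebook renders on a map.
errors = y_test - model.predict(X_test)
print("MAE (seconds):", mean_absolute_error(y_test, model.predict(X_test)))
```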

Some will argue that all of this could have been done using Pandas, and that is true. But that is not the point of this demonstration, and it assumes that all the data can be held in memory, which is true in this case (1.45) but will fail when you have hundreds of millions of rows and have to process all of the data on one machine with limited resources. If you like Pandas, I recommend that you check out Koalas.
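For the curious, Koalas exposes the pandas API on top of Spark, so pandas-style code scales out without a rewrite. A small sketch, reusing the assumed dataset from above:

```python
# A small sketch of Koalas: pandas-style code, executed by Spark under the hood.
# The file name and columns are the same assumed dataset as above.
import databricks.koalas as ks

kdf = ks.read_csv("clean_trips.csv")

# Familiar pandas idioms, executed as distributed Spark jobs.
print(kdf["trip_duration"].describe())
print(kdf.groupby("passenger_count")["trip_duration"].mean())
```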
