Monday, August 24, 2020

On Machine Learning in ArcGIS and Data Preparation using Spark

Artificial Intelligence / Machine Learning implementations have been part of Esri's ArcGIS software for a very long time. Geographically Weighted Regression (GWR) and Hot Spot Analysis are Machine Learning algorithms, and ArcGIS users have been applying supervised and unsupervised learning, like clustering, to solve a myriad of problems and gain geospatial insight from their data.  We just did not call these learning algorithms AI/ML back then, not until the recent popularity of Deep Learning, which finally blossomed out of the AI Winter and made AI a household name. And guess what? Esri software now ships with Deep Learning implementations!

It is important to understand the difference between AI, ML, and DL.  I came across this very insightful article that analogizes the relationship to Russian dolls.

Typically, collected data is very "dirty", and a lot of cleaning has to be performed on it before running it through a Machine Learning algorithm.  The bigger the data, the harder the process, especially when pruning noisy outliers and anomalies.  But then again, the outliers and anomalies could be exactly what you are looking for. That is why "Data Janitor" is a new job title; the majority of your time when dealing with data is spent cleaning it!  And that is why I love to use Apache Spark for this task. It handles BigData very efficiently in a distributed, parallel, share-nothing environment, and the DataFrame API and SQL make the manipulation a breeze. Using the latter two in an interactive environment like a Jupyter Notebook enables quick cleaning and, more importantly, insightful exploration.
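To make the pruning concrete, here is a minimal, plain-Python sketch of the kind of outlier rule (Tukey's IQR fences) that a Spark DataFrame filter would express. The function names, the crude index-based quartiles, and the 1.5 multiplier are my own illustrative choices, not code from the notebooks:

```python
def iqr_bounds(values, k=1.5):
    """Tukey fences: values outside [q1 - k*iqr, q3 + k*iqr] are outliers.

    Quartiles are approximated by simple index positions, which is
    good enough for a sketch.
    """
    xs = sorted(values)
    n = len(xs)
    q1 = xs[n // 4]
    q3 = xs[(3 * n) // 4]
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def prune_outliers(values, k=1.5):
    """Keep only the values inside the Tukey fences."""
    lo, hi = iqr_bounds(values, k)
    return [v for v in values if lo <= v <= hi]
```

In Spark, the same rule becomes a `WHERE` clause (or a `DataFrame.filter`) evaluated in parallel across the workers, so it scales to data that would never fit on one machine.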

Now the best part: we can have all three, Jupyter, Spark, and ML, in one environment: ArcGIS Pro!

This notebook and that notebook demonstrate the usage of Spark in a Notebook in Pro to clean and explore the data, and then prepare it for Machine Learning.  We are using the built-in Forest-based Regression to predict (or attempt to predict) the trip duration of a New York City taxi given a pickup and drop-off location.
The last notebook enables the user to select pickup locations (in the below case, around JFK) and "see" the errors of the model on a map.
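A natural predictor for trip duration is the great-circle distance between the pickup and drop-off locations. This haversine sketch is my own illustration of that feature (the notebooks' actual feature engineering may differ):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2.0 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius
```

Registered as a Spark UDF (or written directly as a SQL expression), a column like this gives the regressor something far more predictive than raw coordinates.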

Some will argue that all this could have been done using Pandas, and that is true. But that is not the point of this demonstration, and it assumes that all the data can be held in memory, which is true in this case (1.45) but will fail when you have hundreds of millions of rows and have to process all that data on one machine with limited resources.  If you like Pandas, I recommend that you check out Koalas.

Monday, August 3, 2020

ArcGIS Pro, Jupyter Notebook and Databricks

Yet another post in the continuing saga of the usage of Apache Spark from a Jupyter notebook within ArcGIS Pro.

In the previous posts, the execution was always within the ArcGIS Pro environment on a single machine, albeit taking advantage of all the cores of that machine.  Here, we take a different angle, the execution is performed on a remote cluster of machines in the cloud.

So, we author the notebook locally, but we execute it remotely.

In this notebook, we demonstrate the spatial binning of AIS broadcast points on a Databricks cluster on Azure. In addition, to colocate the data storage with the execution engine for performance, we converted the local feature class of the AIS broadcast points to a Parquet file and placed it in the Databricks distributed file system.
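At its core, spatial binning snaps each point to a grid cell and counts the points per cell. This plain-Python sketch shows the arithmetic; in the notebook the same idea is expressed in Spark over the Parquet data, and the 0.1-degree cell size here is just an illustrative choice of mine:

```python
from collections import Counter

def bin_points(points, cell_deg=0.1):
    """Snap each (lon, lat) to an integer grid cell and count points per cell.

    Floor division maps a coordinate to its cell index; two points in the
    same cell produce the same (col, row) key.
    """
    return Counter(
        (int(lon // cell_deg), int(lat // cell_deg)) for lon, lat in points
    )
```

In Spark SQL this becomes a `GROUP BY` on the two floor-divided columns, which distributes trivially since each point maps to its cell independently of all the others.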

More to come :-)

Sunday, August 2, 2020

Virtual Gate Crossing

Yet another continuation post regarding Pro, Notebook, and Spark :-). In this notebook, we will demonstrate a parallel, distributed, share-nothing spatial join between a relatively large dataset and a small dataset.

In this case, virtual gates are defined at various locations in a port, and the outcome is a count of the crossings of these gates by ships, using their AIS target positions.

Note that the join is to a "small" spatial dataset that we can:

  • Broadcast to all the Spark workers.
  • Brute-force traverse on each worker, as it is cheaper and faster to do so than to spatially index it.
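The broadcast-and-brute-force idea can be sketched in plain Python. Here I represent each gate as a rectangular extent and count the AIS positions that fall inside it; this is my own simplification (the notebook's gates may well be segments or polygons with a proper crossing test), but the key point survives: with only a handful of gates, a linear scan per point beats building a spatial index.

```python
def count_hits(points, gates):
    """Brute-force test every point against every gate extent.

    points: iterable of (x, y) positions.
    gates:  dict of name -> (xmin, ymin, xmax, ymax) extents.
    Returns a dict of name -> number of points inside that extent.
    """
    counts = {name: 0 for name in gates}
    for x, y in points:
        for name, (xmin, ymin, xmax, ymax) in gates.items():
            if xmin <= x <= xmax and ymin <= y <= ymax:
                counts[name] += 1
    return counts
```

In Spark, the `gates` dict is what gets broadcast to every worker; each worker then runs this inner loop over its own partition of the AIS points, and the per-partition counts are summed at the end.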

The following are sample gates:

And the following is a sample processed output:

More to come...

Saturday, August 1, 2020

MicroPath Reconstruction of AIS Broadcast Points

This is a continuation of the last post regarding ArcGIS Pro, Jupyter Notebook, and Spark. And, this is a rehash of an older post in a more "modern" way.

Micropathing is the construction of a target's path from a limited, consecutive sequence of target points. Typically, the sequence is time-based, and the collection is limited to 2 or 3 target points.  The following is an illustration of 2 micropaths derived from 3 target points:

Micropathing is different from path reconstruction, in that the latter produces one polyline for the whole path of a target. Path reconstruction loses insightful in-path behavior, as a large number of attributes cannot be associated with the path parts. Some may argue that the points along the path can be enriched with these attributes. However, with the current implementations of Point objects, we are limited to only the extra M and Z on top of the necessary X and Y. You can also join the PathID and M to a lookup table and regain that insight, but that join is typically expensive, and it is difficult to derive from it the "expression" of the path using traditional mapping. A micropath overcomes these limitations and expresses the path insight better with today's traditional means.

So, a micropath is a line typically composed of only 2 points and is associated with a set of attributes that describe that line.  These attributes are typically enrichment metrics derived from its two ends. An attribute can be, for example, the traveled distance, time, or speed.

In this notebook, we will construct "clean" micropaths using SparkSQL.  What do I mean by clean? As we all know, emitted target points are notoriously affected by noise, so using SparkSQL, we will eliminate that noise during the micropath construction.
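The construction boils down to pairing each target point with its successor in time and dropping pairs whose implied speed is physically implausible. This plain-Python sketch mirrors what a `LEAD()` window expression does in SparkSQL; the tuple layout, Euclidean distance, and the speed threshold are my own illustrative choices:

```python
def micropaths(points, max_speed=50.0):
    """Pair consecutive target points into 2-point micropaths.

    points: list of (t, x, y) tuples for one target.
    Pairs whose implied speed exceeds max_speed (units per time unit)
    are treated as noise and dropped.
    Returns a list of (start, end, distance, duration, speed) tuples.
    """
    pts = sorted(points)  # order by time, like ORDER BY t in a window spec
    paths = []
    for (t1, x1, y1), (t2, x2, y2) in zip(pts, pts[1:]):
        dt = t2 - t1
        if dt <= 0:
            continue  # duplicate or out-of-order timestamp
        dist = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        speed = dist / dt
        if speed <= max_speed:
            paths.append(((x1, y1), (x2, y2), dist, dt, speed))
    return paths
```

In SparkSQL, the pairing is `LEAD(x) OVER (PARTITION BY target ORDER BY t)` and the noise rule is just a `WHERE` clause on the derived speed column, so the whole construction runs in parallel per target.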

Here is a result:

More to come...

Friday, July 31, 2020

On ArcGIS Pro, Jupyter Notebook and Apache Spark

Been a while since I posted something, and thank you, faithful reader, for coming back :-)

I'm intending to write a series of posts on how to use Apache Spark and Machine Learning within a Jupyter notebook within ArcGIS Pro.  Yes, you can now start a Jupyter notebook instance in ArcGIS Pro to create an amazing data science and data exploration experience. Check out this link to see how to get started with a Jupyter notebook in Pro. My favorite hidden "GeoGem" is that Pro comes with built-in Apache Spark, and y'all know how much I love Spark. People think that Spark is intended only for BigData analytics.  That is so far from the truth. What I love about it is the frictionless movement of data and analysis, locally or remotely, and the language fusion.  In my case, I'm using Python, SQL, and Scala.

The usage of Apache Spark in Pro was demonstrated in the publicly shared Covid-19 Contact Tracing Application and the Proximity Tracing Application.

In this first notebook, we will start by loading selected features into a Spark dataframe from a local feature class, process the dataframe using Spark SQL, and write the result back to an ephemeral feature class that will be displayed on the map.

Like usual, all the source code can be found here.

Sunday, May 27, 2018

On Patterns Of Life: From MacroData to PicoData

Insight (or GeoInsight in our case) is lost in the deluge of data that we are acquiring today from everyday sensors, whether machine- or human-generated. This project is a set of heuristic Spark-based implementations to reveal signals from the movement of ships in and out of the Port of Miami.

The idea is to extract small, clean data (PicoData) from the overlap of a massive amount of data (MacroData). The aggregation of "clean" PicoData derived from MacroData thrusts patterns of life into the forefront.

For example, given the following display of AIS broadcasts:


We can mutate the data to reveal the "clean" influx of ships into the harbor at high tide:


Like usual, you can download all the source code from here.

Monday, January 1, 2018

On ML and Elastic Principal Graphs

Happy 2018 all. It has been a while since my last post. Thank you for your patience, dear reader. Like usual, the perpetual resolutions for every year, in addition to blogging more, are to eat well, exercise often, and climb Ventoux.


I genuinely believe that 2018 will be the year of the ubiquity of Geo-AI. It will be the year when Machine Learning and Spatial Awareness will blossom inside and mostly outside the GIS community.

We at Esri have had Machine Learning based tools in our "shed" for a long time. Every time an ArcGIS user performs a geographically weighted regression, trains a random trees classifier, or detects an emerging hot spot, that user is using a form of Machine Learning without knowing it!

So one of my "missions" for 2018 is to make this knowledge more explicit to our users and to non-traditional GIS users, and also to start implementing new forms of Machine Learning.

Machine Learning (ML), a branch of Artificial Intelligence (AI), is a disruptive force that is changing how today's industries are gaining new insight from their data. ML uses math, statistics and probability to find hidden patterns and make predictions from the data without being explicitly programmed. It is this last statement that is disruptive, "No explicit programming"! An ML algorithm iterates "intelligently" over the data, and the patterns emerge. Being iterative, the more data an ML algorithm is exposed to, the more refined the output becomes. Thus the coupling of BigData and ML is a perfect marriage fueled by cheap storage, ever more powerful computational power (think GPU) and faster networking.

The reemergence of this "No Explicit Programming" paradigm, in forms such as Deep Learning, Reinforcement Learning, and Self-Organization, is skyrocketing the likes of Google's AlphaGo Zero, Facebook, and Uber.

So, I am starting this launch with something I have been fascinated by for quite some time, and that is "Elastic Principal Graphs."

It is a "deep" extension of PCA that I came across during my research on mapping noisy 2D data to a curve, and I was fascinated by its self-organization.
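As a rough reminder of what gets optimized (my paraphrase of the elastic principal graph formulation, so treat the exact symbols and weights as approximate, not the paper's notation): the graph nodes $y_j$ are embedded in the data space by minimizing an energy that balances approximation of the data, stretching of the edges, and bending at the stars:

```latex
U \;=\;
\underbrace{\frac{1}{N}\sum_{j}\sum_{x \in K_j}\lVert x - y_j\rVert^2}_{\text{approximation}}
\;+\;
\underbrace{\lambda \sum_{\text{edges }(i,k)} \lVert y_i - y_k\rVert^2}_{\text{stretching}}
\;+\;
\underbrace{\mu \sum_{\text{stars } S} \Bigl\lVert y_{S_0} - \tfrac{1}{\deg(S_0)}\sum_{i \in S} y_i \Bigr\rVert^2}_{\text{bending}}
```

where $K_j$ is the set of data points closest to node $y_j$, and $\lambda$, $\mu$ control the graph's elasticity. The self-organization I found fascinating falls out of iterating the node assignments and this quadratic minimization.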


After reading (and rereading for the nth time) this paper, this GitHub repo is a minimalist implementation in Scala.

Happy New Year All.