Sunday, May 4, 2014

Spatially Enabling In-Memory BigData Stores

I deeply believe that the future of BigData stores and processing will be driven by GPUs and based purely on distributed in-memory engines that are backed by something resilient to hardware failure, like HDFS.
HBase, Accumulo, and Cassandra depend heavily on their in-memory capabilities for their performance. And when it comes to processing, SQL is still king. MemSQL is combining both - pretty impressive.
However, ALL of them lack something that is so important in today’s BigData world: true spatial storage, indexing, and processing of native points, lines, and polygons. SpaceCurve is making great progress on that front.
A lot of smart people have taken advantage of the native lexicographical indexing of these key-value stores, used geohashes to save, index, and search spatial elements, and solved the Z-order range search. Though these are great implementations, I always thought that the end did not justify the means. There is a need for true and effective BigData spatial capabilities.
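For illustration only - this is the kind of key those implementations rely on, not code from any of them - a Z-order style key can be built by quantizing a longitude/latitude pair and interleaving the bits, so nearby locations tend to share key prefixes and can be scanned lexicographically. The 16-bit depth is an arbitrary choice for the sketch:

// Minimal Z-order (Morton) key sketch: quantize lon/lat to 16 bits each and
// interleave the bits so the resulting key sorts lexicographically by locality.
public final class ZOrderKey {

    private ZOrderKey() {
    }

    public static long encode(final double lon, final double lat) {
        final int x = quantize(lon, -180.0, 180.0); // 16-bit column
        final int y = quantize(lat, -90.0, 90.0);   // 16-bit row
        return interleave(x, y);
    }

    private static int quantize(final double value, final double min, final double max) {
        return (int) ((value - min) / (max - min) * 0xFFFF);
    }

    private static long interleave(final int x, final int y) {
        long key = 0L;
        for (int bit = 15; bit >= 0; bit--) {
            key = (key << 1) | ((x >> bit) & 1);
            key = (key << 1) | ((y >> bit) & 1);
        }
        return key;
    }
}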
I’ve been a big fan of Hazelcast for quite some time now and have always been impressed by their technology. In their latest release, they added a MapReduce API, so now you can send programs to the data - very cool!
But…like the others, they lack the spatial aspect when it comes to my world. So…here is a set of small tweaks that truly spatially enables this in-memory BigData engine. I’ve used the MapReduce API and the spatial index in an example to visualize conflict hotspots in Africa.
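To give a flavor of that MapReduce API, here is a minimal sketch, not the project's code - the map name, the double[] lon/lat values, and the crude 1-degree binning are assumptions, and the actual spatial index is not shown:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.mapreduce.Context;
import com.hazelcast.mapreduce.Job;
import com.hazelcast.mapreduce.JobTracker;
import com.hazelcast.mapreduce.KeyValueSource;
import com.hazelcast.mapreduce.Mapper;

import java.util.List;
import java.util.Map;

public class HotspotSketch {

    // Bin each event location into a 1x1 degree cell - a crude stand-in for a real spatial index lookup.
    public static class CellMapper implements Mapper<String, double[], String, Integer> {
        @Override
        public void map(final String eventId, final double[] lonlat, final Context<String, Integer> context) {
            context.emit((int) Math.floor(lonlat[0]) + "/" + (int) Math.floor(lonlat[1]), 1);
        }
    }

    public static void main(final String[] args) throws Exception {
        final HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        final IMap<String, double[]> events = hz.getMap("events"); // eventId -> {lon, lat}
        events.put("e1", new double[]{36.8, -1.3});
        events.put("e2", new double[]{36.9, -1.2});

        final JobTracker tracker = hz.getJobTracker("default");
        final Job<String, double[]> job = tracker.newJob(KeyValueSource.fromMap(events));
        // No reducer here to keep the sketch small: each cell key maps to a list of 1s,
        // so the list size is the event count for that cell.
        final Map<String, List<Integer>> grouped = job.mapper(new CellMapper()).submit().get();
        for (final Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size());
        }
        hz.shutdown();
    }
}

In a real job, a ReducerFactory (or a combiner) would aggregate the counts on the members instead of shipping the grouped values back to the caller.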

Like usual, all the source code can be downloaded from here.

Monday, March 3, 2014

Yet Another Temporal/Spatial Storage/Analysis of BigData

At this year’s FedUC, I presented an introduction and a practice session on spatial types in BigData. In these sessions, I demonstrated how to analyze temporal and spatial BigData in the form of Automatic Identification System (AIS) data. This post discusses an ensemble of projects that made a section of this demo possible. The storage and subsequent processing of the data is very specific to this project, where data is stored and analyzed in a temporal and then a spatial order. Please note the order: temporal, then spatial. However, I see this pattern in a lot of the BigData projects that I have worked on.

This order enables us to take advantage of the native partitioning scheme of paths in HDFS for temporal indexing, which we later augment with a “local” spatial index. So time, or more specifically an hour’s data file, can be located by traversing the HDFS file system in the form /YYYY/MM/DD/HH/UUID, where YYYY is the year, MM is the numeric month, DD is the day, HH is the hour, and UUID is a file with a unique name that holds all the data for that hour. You can imagine a process, such as GeoEvent Processor, that continuously adds data in this pattern. However, in my case, I received the data as a set of GZIP files and used the Hadoop MapReduce AISImport tool to place the data in the correct folders/files in HDFS. Once an hour file is closed, a spatial index file is created that enables the spatial search of the data for that hour. This spatial index is based on the QuadTree model for point-specific geometries and is created using the AISTools.

I rewrote my “ubiquitous” density MapReduce job to take this new spatial index into account, so that I can now rapidly ask questions such as “What is the AIS density by unique MMSI, and what does it look like at the entry of the harbor every day at 10 AM in the month of January?” The following is a visualization of the answer in ArcMap from the Miami Port sample data.
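As an aside, deriving the hour bucket for a record timestamp is a one-liner; here is a tiny illustrative sketch (the base path and UTC time zone are assumptions, not the project's code):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import java.util.UUID;

public final class HourPath {

    private HourPath() {
    }

    // Map an event timestamp to its /YYYY/MM/DD/HH bucket plus a unique file name.
    public static String toPath(final String basePath, final long epochMillis) {
        final SimpleDateFormat format = new SimpleDateFormat("yyyy/MM/dd/HH");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return basePath + "/" + format.format(new Date(epochMillis)) + "/" + UUID.randomUUID();
    }

    public static void main(final String[] args) {
        // e.g. /ais/2014/01/15/10/<uuid> for an AIS record received in that hour
        System.out.println(toPath("/ais", System.currentTimeMillis()));
    }
}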

Like usual, all the source code can be found here.

Wednesday, January 29, 2014

Cascading Workflow for Spatial Binning of BigData

I just finished a 3-day training on Cascading by Concurrent, and it was worth every minute. I always knew about Cascading but never invested in it, and I wish I had, especially last month when I was doing a BigData ETL job in MapReduce. My development time would have been significantly reduced (pun intended :-) if I had thought of the problem in terms of a cascading water flow rather than in MapReduce.
In Cascading, you compose a data flow with a set of pipes that perform operations such as filtering, joining, and grouping, and Cascading turns that flow into a MapReduce job that you can execute on a Hadoop cluster.
Being spatially aware, I _had_ to add a spatial function to Cascading using our GIS Tools For Hadoop geometry API. The spatial function that I decided to implement bins location data into areas, so that at the end of the process each area has a count of the locations that it covers. This is a nice way to visualize massive data.
So, we start with:

to produce:

Again, rather than thinking in MapReduce, think of the data as water flowing through pipes:

Here, I have an input tap that accepts text data from HDFS. Each text record is composed of fields separated by a tab character. In Cascading, a tap can define the field names and types. A pipe is created to select the “X” and “Y” fields to be processed by a function. This is a spatial function that utilizes the Esri Geometry API. It loads into an in-memory spatial index a set of polygons defined as a property value, and it performs a point-in-polygon operation on each input X/Y tuple. The overlapping polygon identifier is emitted as the pipe output. The output polygon identifiers are grouped together and counted by yet another pipe. The tuple set of polygon identifier/count is written to a comma-separated, HDFS-based file using an output tap. The count field is labeled as POPULATION to make it ArcGIS friendly :-)
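A rough sketch of how such a flow can be wired up with the Cascading 2.x API - the field names and paths are placeholders, and the real point-in-polygon function (which wraps the Esri Geometry API lookup described above) is replaced by a trivial cell-snapping stand-in so the sketch stays self-contained:

import cascading.flow.FlowDef;
import cascading.flow.FlowProcess;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.operation.aggregator.Count;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

import java.util.Properties;

public class SpatialBinningFlow {

    // Stand-in for the real point-in-polygon function: snaps each X/Y to a 1x1 unit cell id.
    public static class BinFunction extends BaseOperation implements Function {
        public BinFunction() {
            super(2, new Fields("POLYGON_ID"));
        }

        @Override
        public void operate(final FlowProcess flowProcess, final FunctionCall functionCall) {
            final double x = functionCall.getArguments().getDouble(0);
            final double y = functionCall.getArguments().getDouble(1);
            functionCall.getOutputCollector().add(new Tuple((int) Math.floor(x) + "/" + (int) Math.floor(y)));
        }
    }

    public static void main(final String[] args) {
        // Source tap: tab-separated text in HDFS with declared field names.
        final Tap source = new Hfs(new TextDelimited(new Fields("ID", "X", "Y"), "\t"), args[0]);

        // Select X/Y, bin each tuple, then group and count per bin.
        Pipe pipe = new Pipe("binning");
        pipe = new Each(pipe, new Fields("X", "Y"), new BinFunction(), Fields.RESULTS);
        pipe = new GroupBy(pipe, new Fields("POLYGON_ID"));
        pipe = new Every(pipe, new Count(new Fields("POPULATION")));

        // Sink tap: comma-separated polygon id / count pairs.
        final Tap sink = new Hfs(new TextDelimited(new Fields("POLYGON_ID", "POPULATION"), ","), args[1], SinkMode.REPLACE);

        final FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(pipe, sink);
        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
    }
}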
Like usual, all the source code can be found here.
 

Friday, January 24, 2014

Hadoop and Shapefiles

Shapefiles are still today the ubiquitous way to share and exchange geospatial data. I’ve been getting a lot of requests lately from BigData Hadoop users to read shapefiles directly off HDFS - I mean, after all, the 3rd V (variety) should allow me to do that. Since the shapefile format was developed by Esri, there was always an "uneasiness" in me, as an Esri employee, about using third-party open source tools (GeoTools and JTS) to read these shapefiles when we had just released our geometry API on GitHub. In addition, I always thought that these powerful libraries were too heavy for my needs, when I just wanted to plow through a shapefile in a map or reduce phase of a job. So, I decided to write my own simple Java implementation that for now just reads points and polygons. This 20% implementation should for now cover my 80% usage. I know that there already exist a lot of Java implementations on the net that read the shp and dbf formats, but I wanted one that is tailored to my BigData needs, especially when it comes to Writable instances, and more importantly, one that generates geometry instances based on our geometry model and API. Like usual, all the source code can be found here.
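To give a feel for how simple the point case is, here is a sketch (not the actual reader) that assumes a point-only .shp read from a local stream rather than the HDFS-backed record reader: the main file is a 100-byte header followed by records, each with a big-endian record header and a little-endian X/Y payload.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal point-only .shp reader sketch: skip the 100-byte file header, then
// read each record header (big-endian) and its point payload (little-endian).
public final class SimplePointShpReader {

    public static void main(final String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            in.skipBytes(100); // main file header: file code, length, bounding box, shape type...
            final byte[] content = new byte[20]; // shape type (4) + x (8) + y (8)
            while (in.available() > 0) {
                in.readInt();                           // record number (big-endian)
                final int lengthInWords = in.readInt(); // content length in 16-bit words
                in.readFully(content, 0, 20);
                final ByteBuffer buffer = ByteBuffer.wrap(content).order(ByteOrder.LITTLE_ENDIAN);
                final int shapeType = buffer.getInt();  // 1 = point
                final double x = buffer.getDouble();
                final double y = buffer.getDouble();
                System.out.println(shapeType + " " + x + " " + y);
                in.skipBytes(lengthInWords * 2 - 20);   // skip any extra content (e.g. Z/M values)
            }
        }
    }
}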

Wednesday, January 15, 2014

Apache Spark, Spatial Functions and ArcGIS for Desktop

A while back, I watched with great fascination a webinar presented by the UC Berkeley AMPLab on Spark and Shark. I wanted to spatially enable Spark, and that has been on my todo list for a while.
Spark has “graduated” and has joined the real world as Databricks, which has raised some serious cash to take on MapReduce. Even Cloudera is teaming up with Databricks to support Spark.
So it was time for me to bring that project back to the front burner, and I posted on GitHub a project that enables me to invoke a Spark job from ArcGIS for Desktop to perform a density analysis on data residing in HDFS. The density calculation is based on a honeycomb-style layer that I think produces some pretty neat looking maps; a sample rendering is shown below.
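The gist of the density computation, sketched with the Spark Java API (1.x style) - the honeycomb cell lookup is replaced by a plain square bin here, and the paths and field layout are assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class DensityJob {

    public static void main(final String[] args) {
        final SparkConf conf = new SparkConf().setAppName("density");
        final JavaSparkContext sc = new JavaSparkContext(conf);

        // Each input line is assumed to be tab separated with x and y as the first two fields.
        sc.textFile(args[0])
                .mapToPair(new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(final String line) {
                        final String[] tokens = line.split("\t");
                        final double x = Double.parseDouble(tokens[0]);
                        final double y = Double.parseDouble(tokens[1]);
                        // Stand-in for the honeycomb lookup: snap to a 1x1 unit square cell.
                        final String cell = (int) Math.floor(x) + "/" + (int) Math.floor(y);
                        return new Tuple2<String, Integer>(cell, 1);
                    }
                })
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(final Integer a, final Integer b) {
                        return a + b;
                    }
                })
                .saveAsTextFile(args[1]);

        sc.stop();
    }
}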

Anyway, like usual all the source code can be found here. Have fun and happy new year.

Wednesday, October 30, 2013

BigData Spatial Joins

There has been a lot of research on performing spatial joins on BigData using Hadoop MapReduce, especially when both sets are very big. A notable example is the SpatialHadoop project at the University of Minnesota. This post is derived from that body of work and uses the Esri Geometry API in the GIS Tools for Hadoop project to perform the spatial operations in the map and reduce phases.
A lexicographical join of native types (numerical, textual) in BigData can be performed in the map phase or in the reduce phase, depending on the size of the input data and how much memory the mapper or reducer has access to. Actually, Hadoop provides a join framework package, as this is a common pattern. The same applies to spatially joining two big data sets: the join can be performed in the map phase or the reduce phase depending on the size of the data and how much memory each phase has access to.
Let's start with a map-phase spatial join. Let's say you have a billion point records and the polygon areas of the US zip codes, and the task at hand is to find the count of points per zip code. Since the US zip code feature class is a 'small' set (~41,700), it can be fully loaded into each mapper's memory space at startup. In addition, it can be spatially indexed in memory using the API's QuadTree for quick lookups based on the envelope of each feature. The GeometryEngine comes with a handy 'contains' method that can further refine the spatial constraint. The source of the zip code feature class can be the DistributedCache or HDFS, and it can be in the Esri JSON format, GeoJSON, or an OGC format; the GeometryEngine API can parse all these formats into Geometry POJOs. As the mapper reads each point feature, it locates the zip code polygon that the point falls into and emits the zip code to the reducer. The reducer's task is to sum the emitted values per zip code.
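Here is a condensed sketch of that map-side pattern (not the project's code) - the zip code side file is assumed to be a local tab-separated file of WKT polygons, which is just one of the formats mentioned above:

import com.esri.core.geometry.Envelope2D;
import com.esri.core.geometry.Geometry;
import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.QuadTree;
import com.esri.core.geometry.SpatialReference;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Map-side join sketch: load the 'small' zip code polygons into an in-memory
// QuadTree at setup time, then do a point-in-polygon lookup per input record.
public class PointInPolygonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final SpatialReference wgs84 = SpatialReference.create(4326);
    private final List<String> zipCodes = new ArrayList<String>();
    private final List<Geometry> polygons = new ArrayList<Geometry>();
    private QuadTree quadTree;

    @Override
    protected void setup(final Context context) throws IOException {
        quadTree = new QuadTree(new Envelope2D(-180, -90, 180, 90), 8);
        // Side file from the DistributedCache: "zip<TAB>polygon-as-WKT" per line.
        final BufferedReader reader = new BufferedReader(new FileReader("zipcodes.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            final String[] tokens = line.split("\t");
            final Geometry polygon = GeometryEngine.geometryFromWkt(tokens[1], 0, Geometry.Type.Unknown);
            final Envelope2D bounds = new Envelope2D();
            polygon.queryEnvelope2D(bounds);
            quadTree.insert(polygons.size(), bounds);
            zipCodes.add(tokens[0]);
            polygons.add(polygon);
        }
        reader.close();
    }

    @Override
    protected void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException {
        final String[] tokens = value.toString().split("\t");
        final Point point = new Point(Double.parseDouble(tokens[0]), Double.parseDouble(tokens[1]));
        final Envelope2D query = new Envelope2D(point.getX(), point.getY(), point.getX(), point.getY());
        final QuadTree.QuadTreeIterator iterator = quadTree.getIterator(query, 0.0);
        for (int handle = iterator.next(); handle >= 0; handle = iterator.next()) {
            final int index = quadTree.getElement(handle);
            if (GeometryEngine.contains(polygons.get(index), point, wgs84)) {
                context.write(new Text(zipCodes.get(index)), ONE);
                break;
            }
        }
    }
}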
Now, let's say that you have a billion polygons that you want to spatially join with yet another billion polygons and return the intersection set. How do you proceed in MapReduce when clearly a billion features cannot all fit in the mapper's memory space (at least not on a typical node that I have access to :-)? This is where we take advantage of the shuffle and sort phase of Hadoop's MapReduce implementation to partition and group all the "similar" elements, so that the "reduced" set can be held in memory for processing. In our case, that similarity is based on spatial grouping. So, in the map phase, the extent that the two polygon sets occupy is projected onto a grid with predefined cell sizes. The minimum bounding rectangle (MBR) of each input polygon from each set is overlaid onto the grid. The overlapping cells are emitted as keys whose value is a tuple consisting of an input source reference and the polygon shape. So now, all the polygons that overlap a specific cell are sent to the same reducer. In the reduce phase, we know the cell that we are operating over, as it is derived from the key; we divide the list of values based on their source into separate lists, and we can then perform a spatial cartesian product from one list to the other. Actually, one of the lists can be spatially indexed for fast lookup. Since an input polygon's MBR can straddle multiple cells, there is a trick to avoid having the same spatial join dispatched multiple times by different reducers. It is best explained using the below picture.

Given the MBRs of two polygons A and B, and given a grid partitioning of depth 1 with 4 quads, we can see that A and B overlap the top two quads (1,0) and (1,1). So the mapper will emit the keys (1,0) and (1,1), each with the values (A,B). In the reducer for key (1,0), we have a reference to A and B, and we calculate the lower-left corner of the intersection of the A and B MBRs (the red dot). Because that intersection point falls into quad (1,0), we can proceed with the spatial join and emit the result. Now, in the reducer for key (1,1), we can see that the intersection point does _not_ fall into quad (1,1), indicating that the spatial join was performed or will be performed by the reducer for another quad, so any further processing is stopped and nothing is emitted.
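The de-duplication check itself is tiny; here is a small sketch of it, with the reducer's cell and the two MBRs passed as plain coordinates for illustration:

// Reference-point de-duplication sketch: only the reducer whose cell contains
// the lower-left corner of the two MBRs' intersection performs the join.
public final class ReferencePoint {

    private ReferencePoint() {
    }

    public static boolean shouldJoin(
            final double cellXmin, final double cellYmin, final double cellXmax, final double cellYmax,
            final double[] mbrA, final double[] mbrB) { // each MBR as {xmin, ymin, xmax, ymax}
        // Lower-left corner of the intersection of the two MBRs (the red dot).
        final double refX = Math.max(mbrA[0], mbrB[0]);
        final double refY = Math.max(mbrA[1], mbrB[1]);
        // Join only if that reference point falls inside this reducer's cell.
        return refX >= cellXmin && refX < cellXmax && refY >= cellYmin && refY < cellYmax;
    }
}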
Neat trick. However, the question at hand is "What is the 'correct' grid size?" This takes me back to my days as an SDE specialist, when I had to tweak the spatial grid sizes and levels of a layer for best performance based on the layer's content and usage. A nice follow-on post would be an MR preprocessing task that scans through the static data and 'recommends' a value. In addition, the current implementation assumes that you will want to spatially join _all_ the polygons in the full extent. What is missing is a spatial index to enable an initial cookie-cutting of a region to operate over. This is what the above project does. However, it assumes the data is static and can be migrated into a new, spatially optimized layout. Yet another post to come on how to do that with dynamic, existing data.
Like usual all the source code is available here.

Saturday, September 21, 2013

BigData: Apache Flume, HDFS and HBase

In this post, I will show how to log a very large number of web requests to a BigData store for traffic analysis. The source code for the project is on GitHub. We will rely on the log4j logging library and the associated Flume NG appender implementation. For storage, we will place the log information into a set of HDFS files or into an HBase table. The HDFS files will be mapped into a Hive table as partitions for query and aggregation.
In this demonstration, all web requests are handled by a very simple web application servlet that is mapped to an access URL.
In practice, all the logic is handled by the servlet, and the logging of the request information is handled by a servlet filter that uses the log4j logging API.
In this demonstration, the path info defines the log level and the query string defines the log message content.
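A minimal sketch of what such a filter could look like (the class name and the level parsing are illustrative, not the project's code):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import java.io.IOException;

// Logs every request: the path info drives the log level, the query string becomes the message.
public class RequestLogFilter implements Filter {

    private static final Logger LOGGER = Logger.getLogger(RequestLogFilter.class);

    @Override
    public void init(final FilterConfig filterConfig) {
    }

    @Override
    public void doFilter(final ServletRequest request, final ServletResponse response, final FilterChain chain)
            throws IOException, ServletException {
        final HttpServletRequest httpRequest = (HttpServletRequest) request;
        final String pathInfo = httpRequest.getPathInfo();       // e.g. /warn
        final String queryString = httpRequest.getQueryString(); // e.g. foo=bar
        final Level level = pathInfo == null ? Level.INFO : Level.toLevel(pathInfo.substring(1), Level.INFO);
        LOGGER.log(level, queryString);
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
    }
}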
Log4j is configured using the resource log4j.properties to use a Flume NG appender:
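Something along these lines does the trick - the host, port, and pattern below are placeholders, and the appender class is the one shipped with the Flume NG log4j client:

log4j.rootLogger=INFO,flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=flumehost
log4j.appender.flume.Port=51515
log4j.appender.flume.layout=org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss}\t%p\t%c\t%m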
The appender will send all log events to a Flume agent instantiated on the host defined by the property log4j.appender.flume.Hostname.
The log event body content is defined by the ConversionPattern property, where each of the string tokens are separated by a tab character. This is how the content will be stored into an HDFS file and parsed back into tokens to be stored as columns into an HBase table.
Before starting the web application, a flume agent has to be instantiated with the predefined configuration:
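A minimal agent definition along these lines will do - the component names, capacities, and HDFS path prefix are placeholders, and the agent is named "agent" to match the -n flag used below:

agent.sources = avro-source
agent.channels = memory-channel
agent.sinks = hdfs-sink

agent.sources.avro-source.type = avro
agent.sources.avro-source.bind = 0.0.0.0
agent.sources.avro-source.port = 51515
agent.sources.avro-source.interceptors = ts
agent.sources.avro-source.interceptors.ts.type = timestamp
agent.sources.avro-source.channels = memory-channel

agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.hdfs.path = /flume/events/%Y/%m/%d/%H
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.writeFormat = Text
agent.sinks.hdfs-sink.hdfs.rollInterval = 3600
agent.sinks.hdfs-sink.hdfs.rollSize = 0
agent.sinks.hdfs-sink.hdfs.rollCount = 0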
This agent has an avro source that accepts incoming events on port 51515. The source has an interceptor to ensure that all incoming events have a timestamp.  The events are forwarded to an HDFS sink through a memory channel.  The HDFS sink writes the events to a file where the path of the file is specified by the current year, month, day and hour. Each file is closed on the hour and all events are "rolled" into a newly created file.  The content of the file is in a pure text format in this demonstration.   For efficiency, I could have used a binary compressed format.
You can start an agent from the command line as follows:
$ flume-ng agent -n agent -c conf -f conf/agent.properties
Or you can use a GUI like Cloudera Manager:

The HDFS files are mapped onto a Hive partitioned table for query as follows:
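A sketch of such a table, assuming the columns mirror the tab-separated tokens of the ConversionPattern (the table and column names here are illustrative):

CREATE EXTERNAL TABLE weblog (
  log_time STRING,
  log_level STRING,
  log_category STRING,
  log_message STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/flume/events';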
And as file partitions are added onto the file system, the table can be altered to account for that new partition.  Here is an example:
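For example, something like the following (the sample date and path are made up):

ALTER TABLE weblog ADD IF NOT EXISTS
PARTITION (year=2013, month=9, day=21, hour=10)
LOCATION '/flume/events/2013/09/21/10';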
The following is an agent configuration that will place all log events into an HBase table where each of the log event elements populates a column:
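A sketch of that agent - the port and table/column family match the description below, while the agent name, regex, and column names are illustrative:

hbase-agent.sources = avro-source
hbase-agent.channels = memory-channel
hbase-agent.sinks = hbase-sink

hbase-agent.sources.avro-source.type = avro
hbase-agent.sources.avro-source.bind = 0.0.0.0
hbase-agent.sources.avro-source.port = 61616
hbase-agent.sources.avro-source.channels = memory-channel

hbase-agent.channels.memory-channel.type = memory

hbase-agent.sinks.hbase-sink.type = hbase
hbase-agent.sinks.hbase-sink.channel = memory-channel
hbase-agent.sinks.hbase-sink.table = log4j
hbase-agent.sinks.hbase-sink.columnFamily = info
hbase-agent.sinks.hbase-sink.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
hbase-agent.sinks.hbase-sink.serializer.regex = ([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)
hbase-agent.sinks.hbase-sink.serializer.colNames = time,level,category,message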
Note: Make sure to first create the HBase table using the shell:
create 'log4j',{NAME=>'info',VERSIONS=>1}
This agent has an avro source that listens for incoming events on port 61616 and forwards these events to an HBase sink through a memory channel. The HBase sink uses a regular expression parser to tokenize the event body content, and each token is respectively placed into a column in the table.
Like usual, all the source code is available here.