tag:blogger.com,1999:blog-44425733903566435222024-03-05T05:17:03.022-05:00Thunderhead ExplorerTips and Tricks using GIS BigData, GeoAI, GenAI, ML and other fun stuff :-)thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comBlogger192125tag:blogger.com,1999:blog-4442573390356643522.post-16871375423637413862024-02-03T01:10:00.000-05:002024-02-03T01:10:07.483-05:00On Using AutoML for NYC Taxi Trip Duration Prediction<p><span style="background-color: white; font-family: arial;"><span style="font-size: 16px;">While explaining</span><span style="font-size: 16px;"> </span><a href="https://optuna.org/" rel="nofollow" style="box-sizing: border-box; font-size: 16px; text-underline-offset: 0.2rem;">Optuna</a><span style="font-size: 16px;"> </span><span style="font-size: 16px;">to a client in the context of hyperparameter tuning, and performing more research on the topic, I came across</span><span style="font-size: 16px;"> </span><a href="https://auto.gluon.ai/stable/index.html" rel="nofollow" style="box-sizing: border-box; font-size: 16px; text-underline-offset: 0.2rem;">AutoGluon</a><span style="font-size: 16px;"> </span><span style="font-size: 16px;">to perform "</span><a href="https://www.automl.org/automl/" rel="nofollow" style="box-sizing: border-box; font-size: 16px; text-underline-offset: 0.2rem;">AutoML</a><span style="font-size: 16px;"> </span><span style="font-size: 16px;">for images, text, and tabular data". 
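As a hedged illustration, here is a pure-Python sketch of the kind of trip features one might derive before handing the table to AutoGluon (the column names, timestamp format, and feature choices are my assumptions for illustration; the project's actual feature engineering runs in Spark and lives in the linked repository):

```python
import math
from datetime import datetime

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * 6371.0 * math.asin(math.sqrt(a))

def trip_features(row: dict) -> dict:
    """Derive tabular model inputs from one raw trip record (assumed column names)."""
    pickup = datetime.fromisoformat(row["pickup_datetime"])
    return {
        "hour": pickup.hour,          # captures rush-hour effects
        "weekday": pickup.weekday(),  # weekday vs. weekend traffic
        "dist_km": haversine_km(row["pickup_lat"], row["pickup_lon"],
                                row["dropoff_lat"], row["dropoff_lon"]),
    }
```

AutoGluon then consumes the resulting table with something along the lines of `TabularPredictor(label="trip_duration").fit(train)`; see the AutoGluon documentation for the exact call.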
After a quick scan of the documentation, I decided to give it a try and see how it performs on a simple project.</span></span></p><p dir="auto" style="box-sizing: border-box; font-size: 16px; margin-bottom: 16px; margin-top: 0px;"><span style="background-color: white; font-family: arial;">I always loved the Kaggle competition NYC Taxi Trip Duration, as the data has spatial, temporal, and other traditional attribute information and is a great dataset to test various models and feature engineering techniques.</span></p><p dir="auto" style="box-sizing: border-box; font-size: 16px; margin-bottom: 16px; margin-top: 0px;"><span style="background-color: white; font-family: arial;">I used a local <a href="https://spark.apache.org/" rel="nofollow" style="box-sizing: border-box; text-underline-offset: 0.2rem;">Apache Spark</a> instance (as it is my go-to ETL engine) to perform some feature engineering before letting AutoGluon do its magic. As a quick proof of concept, the results are quite impressive, and <a href="https://github.com/mraad/auto_ml_nyc" rel="nofollow" target="_blank">here</a> are the steps to reproduce the project and visualize the result.</span></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-65248042525220853522024-01-28T11:16:00.000-05:002024-01-28T11:16:25.629-05:00Arabic SDK For Apache Spark<p><span style="color: #0e101a;">I recently attended the <a href="https://www.esrisaudiarabia.com/en-sa/about/events/esrisaudi-uc2024/overview" target="_blank">Esri Saudi Arabia User Conference</a> and was amazed by the changes in the Kingdom. The capital city of Riyadh is booming and proliferating. During the conference, I presented on integrating GenerativeAI and GIS in the plenary session and led a session on BigData and <a href="https://developers.arcgis.com/geoanalytics/" rel="nofollow" target="_blank">GeoAnalytics Engine</a>. 
GeoAnalytics Engine, based on <a href="https://spark.apache.org/" rel="nofollow" target="_blank">Apache Spark</a>, allows spatial operations on Spark data frames. We showcased a project called "A Day in the Life," which used historical traffic data from <a href="https://www.here.com/learn/blog/traffic-analytics-speed-data" rel="nofollow" target="_blank">HERE</a> to demonstrate traffic congestion during peak hours. Traffic is notoriously bad in the city, so this was a fitting example. My colleague Mahmoud H. presented a traditional workflow process in a Jupyter Notebook off a <a href="https://cloud.google.com/dataproc?hl=en" rel="nofollow" target="_blank">Google Cloud DataProc</a> cluster, efficiently processing over 300 million records (this is relatively "small"). The processed traffic information was then displayed in <a href="https://www.esri.com/en-us/arcgis/products/arcgis-pro/overview" rel="nofollow" target="_blank">ArcGIS Pro</a> in a time-aware layer to reflect the congestion visually while activating a time slider. At the end of the presentation, we surprised the audience by using <b>ChatGPT</b> to translate <b>Arabic</b> sentences to SparkSQL code, and Azure OpenAI GPT4 handled the translation very well. Look <a href="https://github.com/mraad/GeoLLM/blob/main/PySparkAI.ipynb" rel="nofollow" target="_blank">here</a> for code snippets. 
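That natural-language-to-SparkSQL step can be sketched as follows; `ask_llm` is a hypothetical placeholder for the chat-completion call, and the table name and schema are illustrative, not the actual HERE traffic schema. The key ideas are pinning the schema in the prompt and guarding the generated SQL before executing it:

```python
import re

# Illustrative schema for the traffic table; not the real HERE column names.
TRAFFIC_SCHEMA = {"link_id": "bigint", "speed_kmh": "double", "hour": "int"}

def build_prompt(question: str) -> str:
    """Pin the table schema so the model can only target real columns."""
    cols = ", ".join(f"{c} {t}" for c, t in TRAFFIC_SCHEMA.items())
    return (
        "You translate questions (in any language, including Arabic) into a single "
        f"SparkSQL SELECT over the table traffic({cols}). Reply with SQL only.\n"
        f"Question: {question}"
    )

def is_safe_select(sql: str) -> bool:
    """Cheap guard before spark.sql(sql): SELECT-only, known identifiers only."""
    if not sql.lstrip().lower().startswith("select"):
        return False
    known = set(TRAFFIC_SCHEMA) | {"traffic"}
    sql_keywords = {"select", "from", "where", "group", "by", "order", "avg",
                    "count", "sum", "min", "max", "and", "or", "as", "desc",
                    "asc", "limit", "between"}
    idents = set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))
    return idents <= known | sql_keywords
```

In use, something like `sql = ask_llm(build_prompt("ما متوسط السرعة عند الساعة 8؟"))` ("What is the average speed at 8 o'clock?") followed by `spark.sql(sql)` once `is_safe_select(sql)` passes; the real notebook linked above shows the actual wiring.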
This form of interaction <b>IS</b> the future, and I am excited to invest more in this technology and in the following areas:</span></p><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Enhanced Visualization and Real-time Data Integration:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: 
initial; margin-bottom: 0pt; margin-top: 0pt;">Dynamic Visualization:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Integrating real-time traffic data feeds into existing models. This will not only show historical congestion but also provide live updates. Dynamic heatmaps can be particularly effective in visualizing the intensity of traffic at different times.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">3D Modeling:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Utilize ArcGIS's 3D scene capabilities to give a more immersive view of traffic congestion and urban planning scenarios.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; 
background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Improved Data Analysis through Machine Learning:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Predictive Analytics:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Integrate machine learning models to predict future traffic patterns based on historical data, weather conditions, events, and other variables.</span></li><li class="ql-indent-1" 
style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Anomaly Detection:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Implement anomaly detection algorithms to identify unusual traffic patterns, which can be crucial for incident response and urban planning.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Enhancing User Interaction and Accessibility:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; 
background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Multilingual Support:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> While we showcased the translation of Arabic sentences to SparkSQL code, we should consider expanding this feature to include more languages, making your tool more accessible to a global audience.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Voice Commands and Chatbots:</strong><span 
data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Integrate voice command functionality and develop a chatbot using Azure OpenAI GPT4 for querying and controlling the GeoAnalytics Engine, making the system more interactive and user-friendly.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Scalability and Performance Optimization:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; 
background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Optimization for Large Datasets:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Continue to refine the efficiency of processing large datasets. Explore the latest advancements in distributed computing and in-memory processing to handle even larger datasets more efficiently.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Cloud Integration:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Ensure the solutions are cloud-agnostic and can be deployed on any public or private cloud provider, enhancing the system's scalability and reliability.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: 
initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Collaboration and Sharing:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Collaborative Features:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: 
initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Develop features that allow multiple users to work on the same project simultaneously, including version control and change tracking for shared projects.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Export and Sharing Options:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Enhance the ability to export results and visualizations in various formats and share them across different platforms, facilitating easier collaboration and reporting.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: 
initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Ethical Considerations and Transparency:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Data Privacy:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Address data privacy concerns by implementing robust data encryption and anonymization techniques, ensuring that individual privacy is respected while analyzing traffic patterns.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; 
margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Algorithm Transparency:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Provide clear documentation and explanations of the algorithms used, promoting transparency and trust in your system.</span></li></ul></ul><p style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-46546808446772496022024-01-27T14:40:00.001-05:002024-01-27T14:40:03.876-05:00Back in Action: GenAI Meets GeoSpatial<p> Hello, everyone. It has been a while since my last post, and I wanted to explain my absence. I have been working on demanding client projects requiring confidentiality, so I couldn't share anything.</p><p>But now I'm back and excited to dive into something new and exciting. </p><p>Generative AI (GenAI) has gained much attention lately, but I'm taking it to a different level by merging Large Language Models (LLMs) with insights from geospatial analysis. 
It's GenAI with a GeoSpatial twist.</p><p>I want to introduce a simple project, "<a href="https://github.com/mraad/GeoLLM" rel="nofollow" target="_blank">ReAct geospatial logic with Ollama</a>," which uses resources such as the Python <a href="https://www.langchain.com/" rel="nofollow" target="_blank">Langchain</a> and <a href="https://ollama.ai/" rel="nofollow" target="_blank">Ollama</a>.</p><p>I'm thrilled to be back and can't wait to start this new journey with you. Keep an eye on this space for future updates, tips, and unique code snippets. Your feedback and questions are valuable, so please don't hesitate to reach out.</p><p>I'll see you in the next post, and as usual, you can check out the source code <a href="https://github.com/mraad/GeoLLM" rel="nofollow" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-3784984559525803562020-08-24T08:28:00.000-04:002020-08-24T08:28:02.909-04:00On Machine Learning in ArcGIS and Data Preparation using Spark<span style="font-family: arial;">Artificial Intelligence / Machine Learning implementations have been part of Esri software in ArcGIS for a very long time. <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/geographically-weighted-regression.htm" rel="nofollow" target="_blank">Geographic Weighted Regression (GWR)</a> and <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/hot-spot-analysis.htm" rel="nofollow" target="_blank">Hot Spot Analysis</a> <b>are</b> Machine Learning algorithms. 
ArcGIS users have been utilizing <a href="https://en.wikipedia.org/wiki/Supervised_learning" rel="nofollow" target="_blank">supervised</a> <b>and</b> <a href="https://en.wikipedia.org/wiki/Unsupervised_learning" rel="nofollow" target="_blank">unsupervised</a> learning like <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/multivariate-clustering.htm" rel="nofollow" target="_blank">clustering</a> to solve a myriad of problems and gain geospatial insight from their data. We just did not call these learning algorithms AI/ML back then. That changed with the recent popularity of <a href="https://en.wikipedia.org/wiki/Deep_learning" rel="nofollow" target="_blank">DeepLearning</a>, which finally blossomed out of the <a href="https://en.wikipedia.org/wiki/AI_winter" rel="nofollow" target="_blank">AI Winter</a> and made AI a household name. And guess what? Esri software now has <a href="https://pro.arcgis.com/en/pro-app/help/analysis/image-analyst/deep-learning-in-arcgis-pro.htm" rel="nofollow" target="_blank">DeepLearning implementations</a>!</span><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">It is important to understand the difference between AI, ML, and DL. I came across this very insightful <a href="https://www.7wdata.be/big-data/what-is-the-difference-between-ai%EF%BB%BF-machine-learning-and-deep-learning/" rel="nofollow" target="_blank">article</a> that analogizes the relationship to <a href="https://en.wikipedia.org/wiki/Matryoshka_doll" rel="nofollow" target="_blank">Russian dolls</a>.</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">Typically, collected data is very "dirty", and a lot of cleaning has to be performed on it before processing it through a Machine Learning algorithm. The bigger the data, the harder the process, especially when pruning noisy outliers and anomalies. 
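That pruning step can be illustrated in pure Python, standing in for the equivalent Spark DataFrame filter (the Tukey-fence rule, the `k` factor, and the column name are illustrative choices, not the notebooks' exact logic):

```python
import statistics

def iqr_fences(values, k=1.5):
    """Tukey fences: anything outside [q1 - k*IQR, q3 + k*IQR] is an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def prune(trips, column="trip_duration"):
    """Keep rows whose value falls inside the fences (cf. df.filter in Spark)."""
    values = [t[column] for t in trips]
    lo, hi = iqr_fences(values)
    return [t for t in trips if lo <= t[column] <= hi]
```

In Spark, the same idea is roughly `df.approxQuantile(...)` to get the fences, then `df.filter(...)` to drop the tails, which scales to data that never fits on one machine.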
Then again, outliers and anomalies could be exactly what you are looking for. That is why "Data Janitor" is a new job title, as the majority of your time when dealing with data is spent cleaning it! That is why I love to use <a href="https://spark.apache.org/" rel="nofollow" target="_blank">Apache Spark</a> for this task. It can handle BigData very efficiently in a distributed, parallel share-nothing environment, and the usage of the <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html" rel="nofollow" target="_blank">DataFrame API and SQL</a> makes the manipulation a breeze. Utilizing the latter two in an interactive environment like a <a href="https://jupyter.org/" rel="nofollow" target="_blank">Jupyter</a> Notebook enables quick cleaning and, more importantly, insightful exploration.</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">Now the best part: we can have all three (Jupyter, Spark, and ML) in one environment: <a href="https://www.esri.com/en-us/arcgis/products/arcgis-pro/resources" rel="nofollow" target="_blank">ArcGIS Pro</a>!</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;"><a href="https://github.com/mraad/spark-esri/blob/master/taxi_trips_duration_train.ipynb" rel="nofollow" target="_blank">This</a> notebook and <a href="https://github.com/mraad/spark-esri/blob/master/taxi_trips_duration_error.ipynb" rel="nofollow" target="_blank">that</a> notebook demonstrate the usage of Spark in a notebook in Pro to clean and explore the data, and then prepare it for Machine Learning. 
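Once a model is trained on the prepared data, its quality needs a number; the Kaggle competition these trips come from scores submissions with RMSLE (root mean squared logarithmic error), sketched below as a quick reference (my addition, not code from the notebooks):

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error: penalizes relative, not absolute, error."""
    n = len(y_true)
    return math.sqrt(sum((math.log1p(p) - math.log1p(t)) ** 2
                         for t, p in zip(y_true, y_pred)) / n)
```

Because the metric works on log1p of the durations, a 1-minute miss on a 5-minute trip hurts far more than on a 1-hour trip, which is why duration models are often trained on the log of the target.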
We are using the built-in <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/forestbasedclassificationregression.htm" rel="nofollow" target="_blank">Forest Based Regression</a> to predict (or attempt to predict) the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration" rel="nofollow" target="_blank">trip duration of a New York City taxi</a> given a pickup and dropoff location.<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0xgyKAYwn235jabKI6olLt0aP4h_gHbH87bGx4UvO6u8rAGhXx11-S8vbl-vGshkv6Tch4bMyW-SAt8GQylTcY8ik01Nbt-vK7SvjVnkiR9kNTUJAuKFplPcK_wSEfxMiQSZe13Jj1Oo/s768/PickupBins200x200.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="580" data-original-width="768" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0xgyKAYwn235jabKI6olLt0aP4h_gHbH87bGx4UvO6u8rAGhXx11-S8vbl-vGshkv6Tch4bMyW-SAt8GQylTcY8ik01Nbt-vK7SvjVnkiR9kNTUJAuKFplPcK_wSEfxMiQSZe13Jj1Oo/s640/PickupBins200x200.png" width="640" /></a></span></div><div><div><span style="font-family: arial;">The last notebook, enables the user to select pickup locations (in the below case, about JFK) and "see" the errors of the model on a map.</span></div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8-LFCNF0vEAxkj02AjyuNcjXyZysGGmKG23T6CesDzeEhNgJmSt_RptyAco_Bar309_LxOzmzuidrR1l70Vy6DusRq0yCbIaM3hRKGQLyilhHsJ9AVE-bFSZiXSppylB8OzDpqQQR0VI/s768/TripErrors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: arial;"><img border="0" data-original-height="580" data-original-width="768" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8-LFCNF0vEAxkj02AjyuNcjXyZysGGmKG23T6CesDzeEhNgJmSt_RptyAco_Bar309_LxOzmzuidrR1l70Vy6DusRq0yCbIaM3hRKGQLyilhHsJ9AVE-bFSZiXSppylB8OzDpqQQR0VI/s640/TripErrors.png" width="640" /></span></a></div><div><span 
style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">Some will argue that all this could have been done using <a href="https://pandas.pydata.org/" rel="nofollow" target="_blank">Pandas</a>, and that is true. But that is not the point of this demonstration. Pandas also assumes that all the data can be held in memory, which is true in this case (1.45) but will fail when you have hundreds of millions of rows and have to process all this data on one machine with limited resources. If you like Pandas, I recommend that you check out <a href="https://koalas.readthedocs.io/en/latest/" rel="nofollow" target="_blank">Koalas</a>.</span></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-31154561260512890112020-08-03T21:48:00.001-04:002020-08-04T04:01:12.650-04:00ArcGIS Pro, Jupyter Notebook and Databricks<font face="verdana">Yet another post in the continuing saga of the usage of <a href="https://spark.apache.org/" rel="nofollow" target="_blank">Apache Spark</a> from a <a href="https://www.esri.com/arcgis-blog/products/arcgis-pro/analytics/introducing-arcgis-notebooks-in-arcgis-pro/" rel="nofollow" target="_blank">Jupyter notebook within ArcGIS Pro</a>.</font><div><font face="verdana"><br /></font></div><div><font face="verdana">In the <a href="http://thunderheadxpler.blogspot.com/2020/07/on-arcgis-pro-notebook-and-apache-spark.html" rel="nofollow" target="_blank">previous posts</a>, the execution was always within the ArcGIS Pro environment on a single machine, albeit taking advantage of all the cores of that machine. 
Here, we take a different angle: the execution is performed on a remote cluster of machines in the cloud.</font></div><div><font face="verdana"><br /></font></div><div><font face="verdana">So, we author the notebook locally, but we execute it remotely.</font></div><div><font face="verdana"><br /></font></div><div><font face="verdana">In <a href="https://github.com/mraad/spark-esri/blob/master/spark_dbconnect.ipynb" rel="nofollow" target="_blank">this</a> notebook, we demonstrate the spatial binning of AIS broadcast points on a <a href="https://databricks.com/product/azure" rel="nofollow" target="_blank">Databricks cluster on Azure</a>. In addition, to colocate the data storage with the execution engine for performance purposes, we converted the local feature class of the AIS broadcast points to a <a href="https://parquet.apache.org/" rel="nofollow" target="_blank">parquet</a> file and placed it in the <a href="https://docs.databricks.com/data/databricks-file-system.html" rel="nofollow" target="_blank">Databricks distributed file system</a>.</font></div><div><font face="verdana"><br /></font></div><div><font face="verdana">More to come :-)</font></div><div><font face="verdana"><br /></font></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-54363941029350365432020-08-02T16:58:00.001-04:002020-08-02T16:58:20.460-04:00Virtual Gate Crossing<font face="verdana">Yet another continuation post regarding Pro, Notebook, and Spark :-). 
In <a href="https://github.com/mraad/spark-esri/blob/master/virtual_gates.ipynb" rel="nofollow" target="_blank">this</a> notebook, we will demonstrate <span style="background-color: white; text-align: justify;">a parallel, distributed, share-nothing spatial join between a relatively large dataset and a small dataset.</span></font><div><div style="text-align: justify;"><span style="background-color: white;"><font face="verdana"><br /></font></span></div><div style="text-align: justify;"><span style="background-color: white;"><font face="verdana">In this case, virtual gates are defined at various locations in a port, and the outcome is a count of the number of crossings of these gates by ships, using their AIS target positions.</font></span></div><div style="text-align: justify;"><p style="background-color: white; margin: 1em 0px 0px; padding: 0px;"><font face="verdana">Note that the join is to a "small" spatial dataset that we can:</font></p><ul style="background-color: white; list-style-image: initial; list-style-position: initial; margin: 1em 2em 0px; padding: 0px; text-align: start;"><li style="line-height: 20px; margin: 0px; padding: 0px;"><font face="verdana"><a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables" rel="nofollow" style="color: #0088cc; margin: 0px; padding: 0px;" target="_blank">Broadcast</a> to all the Spark workers.</font></li><li style="line-height: 20px; margin: 0px; padding: 0px;"><font face="verdana">Brute-force traverse on each worker, as it is cheaper and faster to do so than to spatially index it.</font></li></ul></div><div><span style="background-color: white; text-align: justify;"><font face="verdana"><br /></font></span></div></div><div style="text-align: justify;"><span style="background-color: white;"><font face="verdana">The following are sample gates:</font></span></div><div style="text-align: justify;"><font face="verdana"><br /></font></div><div class="separator" style="clear: both; 
text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxYnFr2AvzQ6OkVkBSxNysKhVsiyikpdVIJsmIpFkfAKBmyqK4nK9hTVNwkMuK6h-oz7boKSlCCF8nmJ5nMz07oyZXYswoAmMVd6ryFjC9N2FP7Wcy1uB_laMwIPP-hhbJsK1fCGPsC5w/s768/Gates1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="verdana"><img border="0" data-original-height="575" data-original-width="768" height="383" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxYnFr2AvzQ6OkVkBSxNysKhVsiyikpdVIJsmIpFkfAKBmyqK4nK9hTVNwkMuK6h-oz7boKSlCCF8nmJ5nMz07oyZXYswoAmMVd6ryFjC9N2FP7Wcy1uB_laMwIPP-hhbJsK1fCGPsC5w/w512-h383/Gates1.png" width="512" /></font></a></div><div style="text-align: justify;"><font face="verdana"><br /></font></div><div style="text-align: justify;"><font face="verdana">And the following is a sample processed output:</font></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX0O-nIeyK7SVxj-gh3LEkOsyJVRFYbmbLkcJcw2K7VdUdLmT37YNhoN6fQx5Ib4JgqCTeppxgRABylCNxKwkSjeBW4WuXX7A0kUfGLbgLw-1liblPW6znaydSy2kebBv7lyFHARQbPyQ/s768/Gates2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="verdana"><img border="0" data-original-height="480" data-original-width="768" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX0O-nIeyK7SVxj-gh3LEkOsyJVRFYbmbLkcJcw2K7VdUdLmT37YNhoN6fQx5Ib4JgqCTeppxgRABylCNxKwkSjeBW4WuXX7A0kUfGLbgLw-1liblPW6znaydSy2kebBv7lyFHARQbPyQ/w512-h320/Gates2.png" width="512" /></font></a></div><div style="text-align: justify;"><font face="verdana"><br /></font></div><div style="text-align: justify;"><font face="verdana">More to come...</font></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-69621317834398004712020-08-01T11:52:00.001-04:002020-08-01T11:56:10.163-04:00MicroPath Reconstruction of AIS Broadcast Points<font 
face="helvetica">This is a continuation of the last post regarding ArcGIS Pro, Jupyter Notebook, and Spark. And this is a rehash of an <a href="http://thunderheadxpler.blogspot.com/2018/05/on-patterns-of-life-from-macrodata-to.html" rel="nofollow" target="_blank">older post</a> in a more "modern" way.</font><div><font face="helvetica"><br /></font></div><div><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">Micropathing is the construction of a target's path from a limited, consecutive sequence of target points. Typically, the sequence is time-based, and the collection is limited to 2 or 3 target points.<span class="Apple-converted-space"> </span>The following is an illustration of 2 micropaths derived from 3 target points:</font></p><p class="p2" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px; min-height: 14px;"><font face="helvetica"><br /></font></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ6JGi7a46HwF77b_JFhU0iFHCj9OoYWiWbdTLhZiEPMDIs6waNaOu1S3IrXvYRcinp1eCUYT7lka7wY1GH5N8id9sPtI38i2UdHYiXJlVJ_cnNM1Ag-7nwDezu47DPR7YfNd81pIvU5c/s320/Micropath0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="helvetica"><img border="0" data-original-height="162" data-original-width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ6JGi7a46HwF77b_JFhU0iFHCj9OoYWiWbdTLhZiEPMDIs6waNaOu1S3IrXvYRcinp1eCUYT7lka7wY1GH5N8id9sPtI38i2UdHYiXJlVJ_cnNM1Ag-7nwDezu47DPR7YfNd81pIvU5c/s0/Micropath0.png" /></font></a></div><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; 
font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">Micropathing is different from path reconstruction, in that the latter produces one polyline for the whole path of a target. Path reconstruction loses insightful in-path behavior, as a large number of attributes cannot be associated with the path parts. Some can argue that the points along the path can be enriched with these attributes. However, with the current implementations of Point objects, we are limited to only the extra <i>M</i> and <i>Z</i> in addition to the necessary <i>X</i> and <i>Y</i>. You can also join the <i>PathID</i> and <i>M</i> to a lookup table and gain back that insight, but that joining is typically expensive, and it is difficult to derive from it the "expression" of the path using traditional mapping. A micropath overcomes these limitations and better expresses the path insight with today's traditional means.</font></p><p class="p2" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px; min-height: 14px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">So, a micropath is a line composed of typically 2 points only and is associated with a set of attributes that describe that line.<span class="Apple-converted-space"> </span>These attributes are typically enrichment metrics derived from its two ends. 
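In Spark SQL, this pairing of each point with its predecessor is naturally expressed with a LAG window function over each target's time-ordered points. The same logic in a standalone Python sketch; the field names, the toy planar distance, and the noise threshold are illustrative assumptions, not the notebook's actual values:

```python
def micropaths(points, max_kmh=80.0):
    """Pair each time-ordered point with its predecessor (the SQL LAG idea)
    and keep only pairs whose implied speed is not noise."""
    pts = sorted(points, key=lambda p: p["t"])
    paths = []
    for a, b in zip(pts, pts[1:]):
        hours = (b["t"] - a["t"]) / 3600.0
        # Toy planar distance in km; a real notebook would use geodesic distance.
        km = ((b["x"] - a["x"]) ** 2 + (b["y"] - a["y"]) ** 2) ** 0.5
        if hours > 0 and km / hours <= max_kmh:
            paths.append({"line": ((a["x"], a["y"]), (b["x"], b["y"])),
                          "km": km, "hours": hours, "kmh": km / hours})
    return paths

track = [{"t": 0, "x": 0.0, "y": 0.0},
         {"t": 3600, "x": 10.0, "y": 0.0},    # 10 km/h: kept
         {"t": 3660, "x": 500.0, "y": 0.0}]   # GPS spike: dropped as noise
```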
An attribute can be, for example, the traveled distance, time, or speed.</font></p><p class="p2" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px; min-height: 14px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">In <a href="https://github.com/mraad/spark-esri/blob/master/micro_path.ipynb" rel="nofollow" target="_blank">this</a> notebook, we will construct "clean" micropaths using SparkSQL.<span class="Apple-converted-space"> </span>What do I mean by clean? As we all know, emitted target points are notoriously affected by noise, so using SparkSQL, we will eliminate that noise during the micropath construction.</font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">Here is a result:</font></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdEHQvgRrkyFnkI9A9a2XqmC5NUOb0fpxmnWXShyphenhyphenwUdpQ42Zi98W1AUDiaNS9PYkBc63ijbMzAmMVtaHaypwVbzv2nXk92ftItwoncoc87o0BT0vwHhA734h-0v3_tqpd4pTU8n7x3od4/s768/Micropath1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="helvetica"><img border="0" data-original-height="480" data-original-width="768" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdEHQvgRrkyFnkI9A9a2XqmC5NUOb0fpxmnWXShyphenhyphenwUdpQ42Zi98W1AUDiaNS9PYkBc63ijbMzAmMVtaHaypwVbzv2nXk92ftItwoncoc87o0BT0vwHhA734h-0v3_tqpd4pTU8n7x3od4/w512-h320/Micropath1.png" width="512" /></font></a></div><p class="p1" 
style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">More to come...</font></p></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-15375752035413752922020-07-31T09:47:00.003-04:002020-07-31T09:52:00.354-04:00On ArcGIS Pro, Jupyter Notebook and Apache SparkBeen a while since I posted something, and thank you, faithful reader, for coming back :-)<div><br /></div><div>I'm intending to write a series of posts on how to use <a href="https://spark.apache.org/" target="_blank">Apache Spark</a> and Machine Learning within a Jupyter notebook <b>within</b> ArcGIS Pro. Yes, you can now start a Jupyter notebook instance in ArcGIS Pro to create an amazing data science and data exploration experience. Check out <a href="https://pro.arcgis.com/en/pro-app/arcpy/get-started/pro-notebooks.htm" target="_blank">this link</a> to see how to get started with a Jupyter notebook in Pro. But...my favorite hidden "GeoGem" is that Pro comes with built-in Apache Spark, and y'all know how much I love Spark. People think that Spark is intended only for BigData analytics. That is so far from the truth. What I love about it is the frictionless movement of data and analysis locally or remotely and the language fusion. 
In my case, I'm using Python, SQL, and Scala.</div><div><br /></div><div>The usage of Apache Spark in Pro was demonstrated in the publicly shared <a href="https://www.esri.com/about/newsroom/blog/introducing-community-contact-tracing/" target="_blank">Covid-19 Contact Tracing Application</a> and the <a href="https://www.esri.com/arcgis-blog/products/arcgis-pro/health/use-proximity-tracing-to-identify-possible-contact-events/" target="_blank">Proximity Tracing Application</a>.</div><div><br /></div><div>In this first <a href="https://github.com/mraad/spark-esri/blob/master/spark_esri.ipynb" target="_blank">notebook</a>, we will start by loading selected features from a local feature class into a Spark dataframe, then process the dataframe using Spark SQL, and write the result back to an ephemeral feature class that will be displayed on the map.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLzWiYWaccCyJt3-QVXyUoViO5JUjKIcIi8y02JAnZeF1Tgu5IazRzegxLrNaqX4W3ESfe4lf4AQktgKKEumNyRFrNKc8XUaxr3CAm_I021tS9Ys5r_kgXV5DpRngZUHXBooZvIDi_BDk/s768/Notebook.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="480" data-original-width="768" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLzWiYWaccCyJt3-QVXyUoViO5JUjKIcIi8y02JAnZeF1Tgu5IazRzegxLrNaqX4W3ESfe4lf4AQktgKKEumNyRFrNKc8XUaxr3CAm_I021tS9Ys5r_kgXV5DpRngZUHXBooZvIDi_BDk/w512-h320/Notebook.png" width="512" /></a></div><div><br /></div><div><br /></div><div>Like usual, all the source code can be found <a href="https://github.com/mraad/spark-esri" target="_blank">here</a>.</div><div><br /></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-12721983040049938852018-05-27T13:31:00.001-04:002018-05-27T13:32:00.106-04:00On Patterns Of Life: From MacroData to PicoData<p>Insight (or 
GeoInsight in our case) is lost in the deluge of data that we are acquiring today from everyday sensors that are machine- or human-generated. <a href="https://github.com/mraad/spark-pico-path">This</a> project is a set of heuristic <a href="https://spark.apache.org/">Spark</a>-based implementations to reveal signals from the movement of ships in and out of the Port of Miami.</p><p>The idea is to extract small clean data (PicoData) from the overlap of a massive amount of data (MacroData). The aggregation of "clean" PicoData derived from MacroData thrusts patterns of life into the forefront.</p><p>For example, given the following display of AIS broadcasts:</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoehkFtzYW3BCH0DkwPhZVFLbQXuEO8sXdb7q0C99GQQXV4wkauZrH-9EOAGs3riTlW85JlxRGL3YvAoDsz6aNH68TuJtimyMxkkrbkJJ2MlUTYTPUmC5fcvf3wLP2syJsHIyswnw0F0E/s9999/1527442048.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>We can mutate the data to reveal the "clean" influx of ships into the harbor at high tide:</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDPvmC12tYiQBfZ0QzepzhVvIjIkM0sdgafMWY1idjCRb-XCPg54JrfOJe5BsxIlaCrLJEn7iGWZkM_2TAqG9U4zfDzPhbHCrG1DLEvVx8LiFVzxP1M9sW3YF9cVkL9Npg8ky8Zy93b3s/s9999/1527442066.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Like usual, you can download all the source code from <a href="https://github.com/mraad/spark-pico-path">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-5952057478169969032018-01-01T11:06:00.000-05:002018-01-01T11:06:29.604-05:00On ML and Elastic Principal Graphs<p>Happy 2018 all. It has been a while since my last post. Thank you for your patience, dear reader. 
Like usual, the perpetual resolutions for every year, in addition to blogging more, are to eat well, exercise often, and climb <a href="https://en.wikipedia.org/wiki/Mont_Ventoux">Ventoux</a>.</p><p>Onward.</p><p>I genuinely believe that 2018 will be the year of the ubiquity of Geo-AI. It will be the year when Machine Learning and Spatial Awareness will blossom inside and mostly outside the GIS community.</p><p>We at Esri have had Machine Learning-based tools in our "shed" for a long time. Every time an ArcGIS user performs a <a href="http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/geographically-weighted-regression.htm">geographically weighted regression</a>, trains a <a href="http://pro.arcgis.com/en/pro-app/tool-reference/spatial-analyst/train-random-trees-classifier.htm">random trees classifier</a>, or detects an <a href="http://pro.arcgis.com/en/pro-app/tool-reference/space-time-pattern-mining/emerginghotspots.htm">emerging hot spot</a>, that user is using a form of Machine Learning without knowing it!</p><p>So one of my "missions" for 2018 is to make this knowledge more explicit to our users and non-traditional GIS users. Also, to start implementing new forms of Machine Learning.</p><p>Machine Learning (ML), a branch of Artificial Intelligence (AI), is a disruptive force that is changing how today's industries are gaining new insight from their data. ML uses math, statistics, and probability to find hidden patterns and make predictions from the data without being explicitly programmed. It is this last statement that is disruptive: "No explicit programming"! An ML algorithm iterates "intelligently" over the data, and the patterns emerge. Being iterative, the more data an ML algorithm is exposed to, the more refined the output becomes. 
Thus the coupling of BigData and ML is a perfect marriage, fueled by cheap storage, ever more powerful computation (think GPUs), and faster networking.</p><p>The reemergence of this "No Explicit Programming" paradigm, in forms such as <a href="https://en.wikipedia.org/wiki/Deep_learning">Deep Learning</a>, <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a>, and <a href="https://en.wikipedia.org/wiki/Self-organization#In_learning">Self Organization</a>, is propelling the likes of Google (with <a href="https://en.wikipedia.org/wiki/AlphaGo_Zero">AlphaGo Zero</a>), Facebook, and Uber.</p><p>So, I am starting this launch with something I have been fascinated by for quite some time, and that is "<a href="https://github.com/auranic/Elastic-principal-graphs/wiki">Elastic Principal Graphs</a>."</p><p>It is a "deep" extension of <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> that I came across during my research on mapping noisy 2D data to a curve, and I was fascinated by its self-organization.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgACQn2zSQLUx_vDTaoHaRrTUBvh0T4BZ2bFveUdyXPdNV7KveHiWgmi8vS8k44DhCiHylE0rfZkovCjaltX2OL1j3_k-3902YCAEZdlD_q6Ms3RqBaJx3x8cISTuI8u_xbyxsRnXS-1fo/s9999/1514821804.gif" alt="img-alternative-text" style="max-width: 100%;"></div><p>After reading (and rereading for the nth time) <a href="https://github.com/auranic/Elastic-principal-graphs/blob/master/ElPiGraph_Methods.pdf">this</a> paper, I wrote <a href="https://github.com/mraad/elastic-graph">this</a> GitHub repo as a minimalist implementation in Scala.</p><p>Happy New Year All.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-69105396707710795962017-04-16T07:32:00.001-04:002017-04-16T07:51:16.692-04:00On Machine Learning and General Path Recognition<p>This is part II 
of my journey back into <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=7&cad=rja&uact=8&ved=0ahUKEwi2i4_CwqbTAhVG6iYKHZgyBNEQFghHMAY&url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSelf-organizing_map&usg=AFQjCNEJQ-mp56DiF1TRY3TFrrtQ7Y1B0w&sig2=q1xhvQWWbDDdVEtut0-TLA&bvm=bv.152479541,d.eWE">SOMs</a>. Actually this journey started with the below picture:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" title="Map.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPtg0y7288usy2YvcpO_xbIjyQbA2ydDGb9NEP3l1mfygECQNG4jx6cseuQ0C2couVp83A3oS3hrTYmGYIR-myVOt4uPeJA9MydpNdMRh3LLVWbvhy8JQYXmenCmmA7nq5LB_JV8AXlrA/?imgmax=1600" alt="Map" width="480" height="480" border="0" /></p>
<p>It is a map of <a href="https://marinecadastre.gov/ais">AIS</a> broadcast points around the harbor of Miami, FL. Now, when we sentient creatures look at this map, we can quickly and clearly see patterns formed by the points. There is a sequence of points that starts from the harbor and goes northeast. There is a sequence of points that starts at the harbor and goes south-southeast. And there are plenty of north-south paths; some are very close to the shore, others are on the "edge" to the east. And there are paths in the "middle".</p>
<p>Wouldn't it be wonderful if the Machine could see these patterns and formulate the general paths? That is actually what started this journey. I needed an unsupervised way for the Machine to recognize the patterns and emit the paths. I'm sure there are multiple ways to solve this, but I remembered that a while back I used <a href="https://en.wikipedia.org/wiki/Self-organizing_map">Self Organizing Maps</a> due to their simplicity and, crucially, because they belong to a class of unsupervised machine learning algorithms. So, in <a href="http://thunderheadxpler.blogspot.com/2017/04/on-machine-learning-with-self.html">Part I</a>, I reacquainted myself with SOMs, and in this part, I completed the journey by showing the paths.</p>
<p>However, faithful reader, I have not been totally honest with you. Please forgive me. This is part III in this journey. Along the way, I diverted a bit, as I needed a way to assemble tracks from targets. There exists a hidden gem in <a href="http://thunderheadxpler.blogspot.com/2017/03/arcgis-spark-and-alluxio-integration.html">this</a> project. The <a href="https://github.com/mraad/arcgis-alluxio/blob/master/src/main/scala/com/esri/PathFinder.scala">PathFinder</a> application is an important stop on this journey, as, in addition to assembling tracks from targets, it quantizes the path into grid cells. The quantization of paths is the <a href="https://en.wikipedia.org/wiki/Linchpin">linchpin</a> between the raw target points and the unsupervised path detection.</p>
<p>The below picture will help in my explanation:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" title="Track.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbq21cLSCCF5M-o7MPXBNZgHUSRP3LhJmRrLYgBezUC2Xu5xGjVdt2rK3e6idvrXIs6xPENUc-rtowI5C4w_EUVn06FnJt-suV9kFhikdM36KgHbc4H1Mf8zMHUrK5nv24EMfeZjHTubQ/?imgmax=1600" alt="Track" width="480" height="480" border="0" /></p>
<p>If a virtual grid is overlaid on the map (the grid cells in the above map are purposely coarse for illustration), then a linear vector that represents a path can be composed by scanning the grid cells from left to right and then from top to bottom. The existence of targets in a cell is the binary value of the corresponding element in the vector. So in the above case, the path will be represented by the vector [0,0,0,0,1,0,0,0,1,1,1,1,0,0,0,1]. Side note: In this implementation the vector is composed of binary values; however, a vector with real numbers can be composed where the element value is proportional to the number of targets. In addition, the cell values can "bleed" to neighboring cells in, say, a Gaussian way for better path recognition. Will have to come back to this one day. A Master's thesis can be made out of this.</p>
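The scan described above can be sketched in a few lines of Python; this is a hypothetical illustration of the quantization idea, not the PathFinder implementation (the grid origin, cell size, and points are made up):

```python
def path_vector(points, x_min, y_min, cell, cols, rows):
    """Quantize a target's points into a row-major binary occupancy vector,
    scanning cells left to right, then top to bottom."""
    vec = [0] * (cols * rows)
    for x, y in points:
        c = int((x - x_min) / cell)
        r = int((y - y_min) / cell)
        if 0 <= c < cols and 0 <= r < rows:
            # Element 0 of the vector is the top-left cell of the map, hence the flip.
            vec[(rows - 1 - r) * cols + c] = 1
    return vec

# Three points tracing a short hook on a 4x4 grid of 1x1 cells.
vec = path_vector([(0.5, 0.5), (1.5, 0.5), (1.5, 1.5)],
                  x_min=0.0, y_min=0.0, cell=1.0, cols=4, rows=4)
```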
<p>Now that we can compose a set of vectors from a set of targets, we can train the SOM with these vectors. The below figure is the visual representation of a 3x3 SOM result. Each sub-map is a visual representation of the settled weights of a SOM node, where each node's weights are a linear representation of a quantized grid as described above.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" title="fig2.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirlvlSWhpqK6gEDx6K4jpjuhO09b2EuLAFZiNmKTWaG9WUMtX9ub4juOuTetaMU7SWLlGJDuKWvEhyphenhyphen2KwVug6ikOzkTEatG26cVWl9kS0c0AjaX9ZPMF84YYNstc9mi9AL8Y-non6oy8A/?imgmax=1600" alt="Fig2" width="270" height="270" border="0" /></p>
<p>We can see the path patterns that have self-organized, and they do reflect what we have implicitly seen in the first map as humans. I highlighted in red the cells in each map with the most target associations, thus forming a path. Isn't it amazing?</p>
<p>Like usual, you can download all the source code for this from <a href="https://github.com/mraad/spark-som-path">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-82321539590044773432017-04-02T12:22:00.001-04:002017-04-02T12:22:13.866-04:00On Machine Learning with Self Organizing Maps<p>Self Organizing Map (<a href="https://en.wikipedia.org/wiki/Self-organizing_map">SOM</a>) is a form of Artificial Neural Network (ANN) belonging to a class of Machine Learning models. AI Junkie has a GREAT <a href="http://www.ai-junkie.com/ann/som/som1.html">tutorial</a> about it. What I like about SOMs is that they belong to a class of unsupervised learning models and they hold true to the first law of geography.</p>
<blockquote>
<p>"Everything is related to everything else, but near things are more related than distant things." - <a href="https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography">Tobler</a></p>
</blockquote>
<p>I encountered them and used them over 20 years ago, and since AI/ML is the hottest topic these days, I'm reacquainting myself with them. There are plenty of SOM libraries, but I learn (or in this case re-learn) by doing. <a href="https://github.com/mraad/spark-som">This</a> project is my learning journey in implementing SOMs and "<a href="http://spark.apache.org/">Sparkyfing</a>" them.</p>
<p>The following is a sample output of the obligatory RGB classifier, where a million random RGB triples are organized by a <a href="https://github.com/mraad/spark-som/blob/master/src/main/scala/com/esri/SparkApp.scala">Spark-based SOM</a> into a 10x10 square lattice:</p>
<p><img title="som2.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtKyMmGsqY2M8WvOwjwNJGBH6mpoiS2kX2GVy4xlFV4Cso6ONizYJW8b1BqJY2bblUzaPPBLDiuwnvoAxOa-ia1mZYXmhGrDcyouFXiIZEDPv3lZrX813Y8tUlOyfgGnCGxIAHk8vFej8/?imgmax=1600" alt="Som2" width="100" height="100" border="0" /></p>
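The training loop behind such a lattice is short. The following is a hedged NumPy sketch of the classic SOM update rule; the lattice size, learning rate, radius, and decay schedule are arbitrary illustrative choices, not the values used in the repo:

```python
import numpy as np

def train_som(samples, rows=10, cols=10, iters=2000, lr0=0.5, seed=7):
    """Classic SOM loop: find the best-matching unit (BMU), then pull it and
    its lattice neighbors toward the sample with a shrinking rate and radius."""
    rng = np.random.default_rng(seed)
    w = rng.random((rows, cols, samples.shape[1]))        # lattice weights
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)  # each unit's (r, c)
    sigma0 = max(rows, cols) / 2.0
    for t in range(iters):
        frac = t / iters
        lr = lr0 * (1.0 - frac)                  # linearly decaying rate
        sigma = sigma0 * (1.0 - frac) + 1e-9     # linearly decaying radius
        x = samples[rng.integers(len(samples))]
        d = np.linalg.norm(w - x, axis=2)        # distance in sample space
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Gaussian neighborhood measured on the lattice, not in sample space.
        dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))[..., None]
        w += lr * h * (x - w)
    return w

rgb = np.random.default_rng(1).random((1000, 3))  # random color triples
lattice = train_som(rgb)  # nearby units settle on similar colors
```

The neighborhood function is the key design choice: because closeness is measured on the lattice rather than in color space, nearby nodes are forced to agree, which is exactly Tobler's first law at work.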
<p>And the following is a sample solution to a <a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">TSP</a> using SOM:</p>
<p><img title="tsp.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl7slvk8mNA8JrYmEQDaZy7iWgnxu-Xh_5W2UbDGu2v1JMlgjOXAzuKS8QKX66V80YK908-WCs1g5H-m5Z_i5ASusc7QgXZqje0XF1TVHQQtvepmoiBBAyJbBwSKacyupzWYTViQvQf88/?imgmax=1600" alt="Tsp" width="200" height="200" border="0" /></p>
<p>Like usual, all the source code can be found <a href="https://github.com/mraad/spark-som">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-43644836454516540192017-03-27T18:15:00.004-04:002017-03-27T18:18:52.443-04:00ArcGIS, Spark and Alluxio Integration<p>There exists a plethora of backend distributed data stores. I am always using <a href="https://aws.amazon.com/s3/">S3</a>, <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Hadoop HDFS</a>, or <a href="https://docs.openstack.org/developer/swift/">OpenStack Swift</a> with my GIS applications to read geospatial data from, or save my data into, these backends. Some of these distributed data stores are not natively supported by the <a href="http://www.esri.com/arcgis/about-arcgis">ArcGIS platform</a>. However, the platform can be extended with <a href="http://pro.arcgis.com/en/pro-app/arcpy/get-started/what-is-arcpy-.htm">ArcPy</a> to handle these situations. Depending on the data store, I will have to use a different API (mostly Python-based) to read and write geospatial information. This is where <a href="http://www.alluxio.org/">Alluxio</a> comes in very handy. It provides an abstraction layer between the application and the data store, and (here is the best part) it caches this information in memory in a distributed and resilient-to-failure manner. So, at the application level, the code to access the data is invariant. On the backend, I can configure Alluxio to use either S3, HDFS, or Swift. 
Finally, the advent of a <a href="https://www.alluxio.com/blog/whats-new-in-alluxio-140">REST endpoint</a> in Alluxio eases the integration with ArcGIS to write, read, and visualize geospatial data.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_lYwy4HcY3Z2UplhLnkjQfyLEbhzmLchuXsEhgajIjmN0SXzCxzmh4OPsNJOo2jWjuaFqesmXts3i2kluTqqK7xP1N_bEbtmskaaWguwdsS9yWvzI3BumbOFH2XNMzNgD10UcSnZZnuk/s9999/1490652880.png" alt="img-alternative-text" style="max-width: 100%;"></div><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFFWqXm7Hpk-wIxINZao0ysyoSzNM4_9crCc7EH66Hcxf7wSJhqHmbL__nwFg_i54nsRj7p7L6l36wf-lXS1NoZZkdORMIFuqOdxdopSEVFjYAY3Xcg3_vtswjESDDL1NKx9MjAWZS7V0/s9999/1490653122.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Like usual, all the source code for this integration can be found <a href="https://github.com/mraad/arcgis-alluxio">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-23856370421210020072017-03-18T15:32:00.000-04:002017-03-18T15:32:19.194-04:00ArcGIS, Spark & MemSQL Integration<p>Just got back from the fantastic <a href="https://conferences.oreilly.com/strata/strata-ca">Strata + Hadoop 2017</a> conference, where the topics ranged from BigData and Spark to lots of AI/ML, and not so much Hadoop explicitly, at least not in the sessions that I attended. I think that is why the conference is renamed Strata + Data from now on, as there is more to BigData than Hadoop.</p><p>While strolling the exhibition hall, I walked into the booth of our friends at <a href="http://www.memsql.com/">MemSQL</a> and got a BIG hug from <a href="https://www.linkedin.com/in/garyorenstein/">Gary</a>. 
We reminisced about our co-presentations at various conferences regarding the integration of ArcGIS and MemSQL as they natively support <a href="http://www.memsql.com/content/geospatial/">geospatial</a> types.</p><p>This post is a refresher on the integration with a "modern" twist, where we are using the <a href="https://github.com/memsql/memsql-spark-connector">Spark Connector</a> to ETL geo spatial data into MemSQL in a <a href="https://github.com/memsql/memsql-docker-quickstart">Docker</a> container. To view the bulk loaded data, <a href="https://pro.arcgis.com/en/pro-app/">ArcGIS Pro</a> is extended with an ArcPy toolbox to query MemSQL, aggregate and view the result set of features on a map.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig_G2yjz2gJvkqwLIBFFzlRnFQJZqeNtPkEKPIFnoVn22h4TqigYq3mcG9UyZbgoEPmxLcaOvTZ64WDmbtyxyOI6OLJwcxzIeHPn_Z1PbDfs-3qxGtC2qKDoQ1i1qFenzwH16CnGBfC6Q/s9999/1489865504.png" alt="img-alternative-text" style="max-width: 100%;"></div><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg6mXUuxzaRCB5dypxRXKBUsMTt-lRzg-PQhQkIAZLBN5njg1OiEJ8O4KruQzF8h4JXeVnn9dDft4As8hcgle96r8RyOhtyfmRHk8h9IBPuy00zbT9kYo6DreqAg9pvpTSLfni-4i4beU/s9999/1489865354.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Like usual, all the source code can be found <a href="https://github.com/mraad/memsql-arcgis">here</a></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-16088949861380486832017-03-06T09:17:00.000-05:002017-03-06T09:17:46.405-05:00GeoBinning On IBM Bluemix Spark<p>This is a proof of concept project to enable <a href="https://pro.arcgis.com/en/pro-app/">ArcGIS Pro</a> to invoke a <a href="https://console.ng.bluemix.net/catalog/services/apache-spark">Spark</a> based geo analytics on <a 
href="https://www.ibm.com/cloud-computing/bluemix/what-is-bluemix">IBM Bluemix</a> and view the result of the analysis as features in a map.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtw5XRej6zR7pO8AZB6cVfWdlSF67G1pG2tY4JwbR6UBLBhWMUQlffb-9m8Lg8MgFrbt4cJjenlSgko0-rIzty4JOo42rHrjAaF3_ci4R0l0wG0ezrFte2WHGHiRePZl_AwQ9a5H49Qmk/s9999/1488809832.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Check out the source code <a href="https://github.com/mraad/arcgis-bluemix">here</a></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-68464840078295861272017-03-01T13:12:00.003-05:002017-10-03T15:25:23.693-04:00Space Time Ripples<div dir="ltr" style="text-align: left;" trbidi="on">
Start by looking at <a href="http://duor9l3bemy01.cloudfront.net/" title="trips">this</a> application and <a href="http://d1fa42h05jrwsl.cloudfront.net/">that</a> one. Make sure to tilt the map by holding down the right mouse button and sliding the mouse up. Then, slide the bottom slider back and forth to see the data "ripple" through time.<br />
<div class="separator" style="clear: both; text-align: center;">
<img alt="img-alternative-text" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZw7ilIRm_tJ6ZZcKtTs98k_VSzxTQh0EdKhIfpAmJxDi0jGZ1iVjnjulk258i2IUM-jn5IRed-JpEDo3k9vdImofmoSHMdTE6uwR7RstbbO2yeQ2pGtTrM58IbSHT2410bKAnXh22AVk/s9999/1488392889.png" style="max-width: 100%;" /></div>
This type of visualization is something I have wanted to do for a long time and is now possible with the advent of the new <a href="https://developers.arcgis.com/javascript/">4.2 ArcGIS API for JavaScript</a>. The new API has "hooks" to enable a developer to invoke <a href="https://www.khronos.org/webgl/">WebGL</a> <a href="https://webglfundamentals.org/webgl/lessons/webgl-shaders-and-glsl.html">shaders</a> directly, which can render a massive amount of data very efficiently and very quickly.<br />
<div class="separator" style="clear: both; text-align: center;">
<img alt="img-alternative-text" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinIr_rayLq02ubR-22PN7iKgPa0SN2t3uD-OLiNWVele-FqG0DVDFlBh9HGs572lsz_q_MDxqra6St4QArYrdSf6mdkPT9Fb-8m_zAW3gNnCMjaflUiXpuKmC2dQkG9IIyiWoSlYzsd64/s9999/1488392903.png" style="max-width: 100%;" /></div>
The authoring of the data for the above applications is based on <a href="https://pro.arcgis.com/en/pro-app">ArcGIS Pro</a> extended with a custom ArcPy based toolbox. The tool queries features from a user selected feature class, bins the features by space and time and emits a space-time "cube" in the form of a <a href="http://dojotoolkit.org/documentation/tutorials/1.10/modules/">Dojo AMD module</a> to be loaded by a JavaScript application. The source of the feature class can be a geodatabase, a relational data store, or the new SpatialTemporal BigData store.<br />
Yes, I should have written a web service to do that, but this is my blog post and leaving that as an exercise for the reader :-)<br />
I have to admit that I am a bit selfish in building the JavaScript application in "mixing" two languages: JavaScript and <a href="https://www.typescriptlang.org/">TypeScript</a>. I wanted to try out the TypeScript <a href="https://github.com/Esri/jsapi-resources/tree/master/4.x/typescript">extension</a> to our JavaScript API, and I long for a type-safe language when building front end applications like in my olde Flex/AS3 days. It turned out that TypeScript is very cool, especially when used within <a href="https://www.jetbrains.com/help/idea/2016.3/typescript-support.html">IntelliJ</a> :-)<br />
Like usual, all the source code can be found <a href="https://github.com/mraad/space-time-ripple">here</a>, and I will be talking about it more next week at my presentation at <a href="http://www.esri.com/events/devsummit">DevSummit</a>. See some of you in Palm Springs.</div>
thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-23929943055794408712016-05-25T09:23:00.003-04:002016-05-25T09:28:39.407-04:00Snapping Points To Lines And ArcGIS Pro<p>Been wanting to post on this subject for quite some time (actually over a year) as associating a world coordinate with the proper nearby linear feature provides tremendous insight based on the fusion of their attributes. Moreover, doing that on a massive scale and quickly is even more imperative in today's BigData world, thus the usage of <a href="http://spark.apache.org/" target="_blank">Apache Spark</a>. I’ve posted a standalone <a href="https://github.com/mraad/spark-snap-points" target="_blank">implementation</a> that relies on well-documented <a href="http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html">simple math</a> and <a href="http://www-users.cs.umn.edu/~mokbel/papers/icde15a.pdf">published methodology</a> to perform searches on massive datasets in batch mode. What is exciting to me in writing this post was the viewing of the snap results in <a href="http://www.esri.com/en/software/arcgis-pro">ArcGIS Pro</a>. My lack of knowledge in extending ArcGIS Pro with downloadable Python <a href="https://docs.python.org/2/tutorial/modules.html">modules</a> contributed to the delay (and slight case of procrastination :-). 
However, with the help of a colleague, I was able to <a href="https://docs.python.org/3.6/installing/index.html">pip</a> install modules that can be imported by my custom <a href="http://pro.arcgis.com/en/pro-app/arcpy/main/arcgis-pro-arcpy-reference.htm">ArcPy</a> based toolboxes without any errors.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhysNxnkHkeI42GmLeZgND5a2AmQs2brXstQC3wiyXihWX3QAatozHiu9sAVVNmYJ6dZ0IfQiWcORE1EyP1MvGg94ldN9wh9EU4ld1nmOcDibvTA9FyH1BU9Ghblq6ykS71-zF8Vd4f3kc/" alt="img-alternative-text"></div><p>Also, since this is all based on BigData, well it has to be tested in a BigData environment. The post describes the usage of <a href="https://www.docker.com/" target="_blank">Docker</a> and the <a href="http://www.cloudera.com/documentation/enterprise/5-6-x/topics/quickstart_docker_container.html" target="_blank">Cloudera QuickStart container</a> to check the snap and the visualization. The following illustrates my development environment.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFO1Z3IDvjErwgF4-A7G4Ns4izzujlNDBTy70SJvf1jPamRSGPCdEz7ukWJ7gpSk1RCNC6Vgi_97lFBmzMv-L7_rnF5wDbSyrEw38Yj3U3Y-zB6la3rASDHLLlkjP9U-2P2KRBR4X7fWU/" alt="img-alternative-text"></div><p>Like usual, all the source code can be found <a href="https://github.com/mraad/spark-snap-points" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-69637620360505662992016-04-10T00:38:00.004-04:002016-04-10T08:11:31.361-04:00Vector Tiles: The Third Wave<p>When it comes to web mapping, we are surfing on a third wave in our digital ocean. 
And the “collaborative processing” between the digital entities while surfing that wave is making the ride more fun, insightful and expressive.</p><p>The first web wave was back in the mid 1990s, where interactive maps in the form of html image tags relied heavily on the server and requests parameters to regenerate the image when you clicked on the edge arrows to pan and zoom. Remember <a href="https://en.wikipedia.org/wiki/MapQuest" target="_blank">MapQuest</a> and <a href="http://www.esri.com/software/arcgis/arcims" target="_blank">ArcIMS</a> ?</p><p>Then in the mid 2000s came the second wave or more like a tsunami, Google Maps. You hold down the right mouse button <strong><em>on</em></strong> the map and drag to pan, you use the scroll wheel to zoom in and out, and… when you click on the map, a bubble appears <strong><em>on</em></strong> the map showing the details of the clicked location. <strong>Disruptive</strong> ! And all was smooth, responsive and <a href="https://en.wikipedia.org/wiki/Ajax_(programming)" target="_blank">AJAX</a>y. This is when I believe that this collaborative processing concept took root and materialized itself in the web mappers’ minds. Soon after, more expressiveness was required as HTML was lacking in power and functionalities and the capitalization on browser plugins emerged to create <a href="https://en.wikipedia.org/wiki/Single-page_application" target="_blank">Single Page Applications</a>. Remember <a href="https://en.wikipedia.org/wiki/Apache_Flex" target="_blank">Flex</a> and <a href="https://en.wikipedia.org/wiki/Microsoft_Silverlight" target="_blank">Silverlight</a> ?</p><p>We are now in the mid 2010s. <a href="http://shirt.woot.com/offers/poison-apple" target="_blank">Flash is dead because he ate an “Apple”</a>. 
HTML5, CSS3 and Javascript are in full swing and though <a href="https://en.wikipedia.org/wiki/Web_Map_Tile_Service" target="_blank">Tile Services</a> are fast as the tile images are preprocessed and prepared to be displayed, they are still image based, and dynamic styling of the features in a tile is not easy. In addition, with the ubiquity of GPUs on edge devices, faster rendering for expressiveness is now possible through the elusive “collaborative processing”.</p><p>Enter <a href="https://en.wikipedia.org/wiki/Vector_tiles" target="_blank">Vector Tiles</a>. <a href="https://en.wikipedia.org/wiki/Mapbox" target="_blank">Map box</a> has defined a vector tile <a href="https://github.com/mapbox/vector-tile-spec" target="_blank">specification</a> that we at Esri have adopted it in our <a href="https://developers.arcgis.com/javascript/jsapi/vectortilelayer-amd.html" target="_blank">Javascript API</a>, and <a href="http://video.esri.com/watch/4645/vector-map-tiles" target="_blank">demonstrated</a> its versatility at the 2015 User Conference. <a href="https://twitter.com/ajturner" target="_blank">Andrew Turner</a> has a nice <a href="https://blogs.esri.com/esri/arcgis/2015/07/20/vector-tiles-preview/" target="_blank">writeup</a> about it. And found this nice in-depth <a href="http://www.diva-portal.org/smash/get/diva2:851452/FULLTEXT02" target="_blank">paper</a> that analyses the dynamic rendering of vector-based maps with WebGL.</p><p>I wanted to know more about it and I learn by doing. So I implemented two projects, a Mapbox Vector Tile <a href="https://github.com/mraad/vector-tiles" target="_blank">encoder</a> and a <a href="https://github.com/mraad/vector-tiles-boot" target="_blank">visualizer</a> as heuristic experiments to be used with the <a href="https://developers.arcgis.com/javascript/beta/" target="_blank">Esri Javascript API</a>. 
Again, these are experiments and will report on more updates.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-70815232742857759592016-03-15T08:35:00.003-04:002016-03-15T08:50:34.486-04:00ArcGIS For Server On Docker<p>“But…It works on _my_ machine !!!” How many times did you hear that ? That is exactly one of the use cases of <a href="https://www.docker.com/" target="_blank">Docker</a> for developers - Create an exact reproducible environment for each developer, even down to the hardware specification. And, that same environment can be on premise or in the cloud.<br>With the advent of <a href="http://server.arcgis.com/en/server/latest/get-started/windows/what-s-new-in-arcgis-10-4-for-server.htm" target="_blank">ArcGIS For Server 10.4</a>, I wanted to run it on my mac so I can try out some of the new features like chaining multiple SOIs.<br>I could have started a Windows based VM and gone through the GUI based setup, which is a pretty straight forward process (My friend Georges G. calls this, a PhD process, <em>P</em>ush <em>H</em>ere <em>D</em>ummy). But, I wanted to automate the whole install process in a headless way (I’m sure there is a way to do that using Windows, just I do not know how, maybe a blog post for another day)<br>Enter Docker. 
After downloading the ArcGIS For Linux tarball and the license file from <a href="http://my.esri.com" target="_blank">my.esri.com</a>, you can build a <a href="https://docs.docker.com/engine/reference/builder/" target="_blank">Dockerfile</a> that automates the whole install process in a headless way - DevOps love this - In addition, once a build is done, you can run the image on premise or in the cloud by referencing a <a href="https://docs.docker.com/machine/" target="_blank">docker-machine</a>.<br>Like usual, you can check out the whole source code on how you can do this <a href="https://github.com/mraad/docker-arcgis" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-84037623902681222482016-02-08T07:21:00.001-05:002016-02-08T07:21:34.394-05:00(Web)Mapping Elephants with Sparks<p><a href="https://en.wikipedia.org/wiki/Comma-separated_values" target="_blank">CSV</a> files (though not the most efficient format and least expressive due to meager header metadata) is one of the most ubiquitous formats to place data in BigData stores like <a href="https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html" target="_blank">HDFS</a>. In addition, geospatial information such as latitude and longitude is now the norm as fields in those CSV files origination from say a moving GPS based device.<br>A constant request that I receive all the time is “How do I visualize on the web all these data points?” There is a legitimate concern in this question which is “How do I visualize on the web millions and millions on points?”. Well the short answer is “You Don’t!” (actually, you can…but that is blog post for another day). Though you can download a couple of million points to a web client, after a while the transfer time will be prohibitive. 
However, if you process the data on the server and send down the aggregated information to be symbolized on the client, then things become more interesting.<br>A common aggregation processing is binning, where, imagine you have a virtual fishnet and you cast that fishnet on your point space. All the points that are in the same fishnet cell are collapsed together to be represented by that cell. What you return now are the cells and their associated aggregates.<br><a href="https://github.com/mraad/hdfs-geohex" target="_blank"><br>This</a> project is a collection of Python tools using the ArcGIS System that retrieves CSV data with geospatial fields from HDFS and displays the aggregation in the form of hexagonal areas using <a href="https://www.arcgis.com/home/" target="_blank">ArcGIS online</a> web maps and web apps. The processing is done in Python using <a href="http://spark.apache.org/" target="_blank">Apache Spark</a>.</p><p>The ArcGIS System is a sequential composition of:<br></p><ul><li><a href="http://www.esri.com/software/arcgis/arcgis-for-desktop" target="_blank">Desktop</a> with Python based GeoProcessing extensions for authoring.<br></li><li><a href="http://www.esri.com/software/arcgis/arcgisserver" target="_blank">Server</a> with GeoProcessing endpoints for publishing.<br></li><li><a href="https://www.arcgis.com/home/" target="_blank">Online</a> with <a href="http://doc.arcgis.com/en/arcgis-online/reference/what-is-web-map.htm" target="_blank">WebMaps</a> and WebApps built using <a href="http://doc.arcgis.com/en/web-appbuilder/" target="_blank">AppBuilder</a> for presenting.</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_Pp0fE80-my5fjFDggScgzGXBD8qbkdDYVnZypiOf0X6aC3_9iQQ22OPCHS-U0u8wAH-rZOfXkNGJ8njfrOzPR6tlZxaPEYFWPWjcYnN0am3i2tlu-JkeDVQw98hEVfeNbd9xZVzUEOY/" target="_blank" style="margin-left: auto; margin-right: auto;"><img 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB_LjnQXHxFbk8fthN0x2-b9D1i709FWfUkAXFjZ3hIFb8BZ4-NsQMsHmtT-RxeXyLUo3zxH_wuiCwAn-HsT81G3seOCIvj_JrQP0whLNLGbHzXefPtROh9H_XnJY7TjGWZ7ZaZad17JY/"></a></div><p>Like usual, all the source code is <a href="https://github.com/mraad/hdfs-geohex" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-18157932584446855892016-01-30T13:12:00.001-05:002016-01-31T13:48:17.972-05:00DBSCAN on Spark<div dir="ltr" style="text-align: left;" trbidi="on">
The applications of <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> clustering straddle various domains including machine learning, anomaly detection and feature learning. But my favorite part about it, is that you do not have to specify apriori the number of clusters to classify the input data. You specify a neighborhood distance and the minimum numbers of points to form a cluster and it will return back a set of clusters with the associated points in the cluster that meet the input parameters.<br />
However, DBSCAN can consume a lot of memory when the input is very large. And since I do BigData, my data inputs will overwhelm my MacBook Pro very quickly. Since I know <a href="https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Hadoop MapReduce</a> fairly well, and MR has been around for quite some time, I decided to see how other folks implemented such a solution in a distributed share nothing environment. I came across <a href="https://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf">this</a> paper, which was very inspiring and found out that <a href="https://github.com/irvingc/dbscan-on-spark">IrvingC</a> used it too as a reference implementation. So I decided to implement my own DBSCAN on <a href="http://spark.apache.org/">Spark</a> as a way to further my education in <a href="http://www.scala-lang.org/">Scala</a>. And boy did I learn a lot when it comes to immutable data structures, <a href="http://alvinalexander.com/scala/scala-type-examples-type-aliases-members">type aliasing</a> and <a href="http://oldfashionedsoftware.com/2009/07/10/scala-code-review-foldleft-and-foldright/">collection folding</a>. BTW, I highly recommend the <a href="https://twitter.github.io/scala_school/">Twitter Scala School</a>.<br />
Like usual, all the source code can be found <a href="https://github.com/mraad/dbscan-spark">here</a>, and make sure to check out the “<a href="https://github.com/mraad/dbscan-spark#how-does-it-work-">How It Works?</a>” section.<br />
<br />
[Update] After posting - I saw <a href="https://www.oreilly.com/ideas/clustering-geolocated-data-using-spark-and-dbscan">this</a> post - very nice video too!</div>
thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-59483727779194234572016-01-04T15:56:00.005-05:002016-01-04T16:38:32.905-05:00Spark Library To Read File Geodatabase<p>Happy 2016 all. Yes it has been a while and thanks for your patience. Like usual, at the beginning of every year, there are the promises to eat less, exercise more, climb <a href="https://en.wikipedia.org/wiki/Mont_Ventoux" target="_blank">Ventoux</a> and blog more. Was listening to <a href="http://freakonomics.com/2015/12/31/when-willpower-isnt-enough-a-freakonomics-radio-rebroadcast/" target="_blank">Feakonomic (When Willpower Isn’t Enough)</a>, and this initial post of the year is to harness the power of a fresh start.</p><p>Esri has been advocating for a while to use <a href="http://www.esri.com/news/arcuser/0309/files/9reasons.pdf" target="_blank">FileGeodatabase</a>, and actually released a <a href="http://www.esri.com/apps/products/download/index.cfm?fuseaction=#File_Geodatabase_API_1.3" target="_blank">C++ based API</a> to perform read-only operations on it. However, the read has to be performed off a local file system and the read is single threaded (you could write an abstract layer on top of the API to perform a parallel partitioned read if you have the time).</p><p>In my BigData uses cases, I need to place the GDB files in <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html" target="_blank">HDFS</a> so I can perform <a href="http://spark.apache.org/" target="_blank">Spark</a> based GeoAnalytics. 
Well, that made the usage of the C++ API difficult (as it is not using the Hadoop File System API) and will have to map the Spark API to a native API and will have to publish the DLL and…(well you can imagine the pain) - I attempted this in my <a href="https://github.com/mraad/ibn-battuta" target="_blank">Ibn Battuta Project</a> where I relied on the GeoTools implementation of the <a href="https://github.com/geotools/geotools/tree/master/modules/plugin/ogr/ogr-bridj" target="_blank">FileGeodatabase</a>, but was not too happy with it.</p><p>I asked the core team if they will have a pure Java implementation of the API, but they told me it was low on their list of priorities. Googling around, I found <a href="https://github.com/rouault/dump_gdbtable/wiki/FGDB-Spec" target="_blank">somebody</a> that published a reversed engineered specification. My co-worker Danny H. took an initial stab at the implementation and over the holidays, I took over targeting the Spark API and the <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html" target="_blank">DataFrames</a> API. The implementation will enable me to do something like:<br></p><pre><code>sc.gdbFile("hdfs:///data/Test.gdb", "Points", numPartitions = 2).map(row => { row.getAs[Geometry](row.fieldIndex("Shape")).buffer(1)}).foreach(println)</code></pre><p>and in SQL:</p><pre>val df = sqlContext.read.<br> format("com.esri.gdb").<br> option("path", "hdfs:///data/Test.gdb").<br> option("name", "Lines").<br> option("numPartitions", "2").<br> load()<br>df.registerTempTable("lines")<br>sqlContext.sql("select * from lines").show()<br></pre><p>Pretty cool, no ? 
Like usual all the source code can be found <a href="https://github.com/mraad/spark-gdb" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-30518706728317439172015-08-21T23:06:00.004-04:002015-08-22T08:22:49.200-04:00A Whale and a Python GeoSearching on a Photon Wave<div dir="ltr" style="text-align: left;" trbidi="on">
In the last post, we <a href="https://github.com/mraad/bulk-geo-es" target="_blank">walked through</a> how to setup <a href="https://www.elastic.co/products/elasticsearch" target="_blank">Elasticsearch</a> in a <a href="https://www.docker.com/" target="_blank">Docker</a> container and how to bulk load the content of an <a href="http://resources.arcgis.com/en/help/main/10.1/index.html#//003n00000005000000" target="_blank">ArcGIS feature class</a> into ES, in such that it can be spatially searchable from an <a href="http://help.arcgis.com/EN/arcgisdesktop/10.0/help/index.html#//000v00000001000000" target="_blank">ArcPy</a> based tool.<br />
<br />
There was something nagging me about my mac development environment, as I was using docker in <a href="https://www.virtualbox.org/" target="_blank">VirtualBox</a> and ArcGIS Desktop on Windows in <a href="http://www.vmware.com/products/fusion" target="_blank">WMWare Fusion</a>. I wish I had one unified virtualized environment.<br />
<br />
Well, while at <a href="http://events.linuxfoundation.org/events/mesoscon" target="_blank">MesosCon</a> in Seattle, I stopped by the VMWare booth and the folks there told me about a new project named <a href="https://vmware.github.io/photon/" target="_blank">Photon™</a>. It is "a minimal Linux container host. It is designed to have a small footprint and boot extremely quickly on VMware platforms. Photon™ is intended to invite collaboration around running containerized applications in a virtualized environment.” - That was exactly what I needed, and docker is built <em>into</em> it !<br />
<br />
See, what also got me excited, was the fact that in a couple of weeks, I will be visiting a very forward thinking client that is willing to bootstrap a cluster on an on-premise WMWare based cloud with Linux for a BigData project. See, his IT department is a Windows shop and I was going to ask him to install CentOS and yum install docker and all that jazz. As you can imagine, that was going to raise some eyebrows. However, now that Photon™ is made by VMWare, it will trusted by the customer (I hope) to move forward with focusing on the BigData aspect of the project and not be dragged down with Linux installation issues.<br />
<br />
The following, is a retrofit of the walk through, but using Photon™. And the best part is….there are no changes due to docker’s universality.<br />
<br />
I’m using VMWare Fusion on mac, so I followed <a href="https://vmware.github.io/photon/assets/files/getting_started_with_photon_on_vmware_fusion.pdf" target="_blank">these</a> instructions. However, I set up Photon™ with 4 CPUs and 4 GB of RAM.<br />
<br />
Once the system was up, I logged in as root, and got the IP address that is bound to <em>eth0</em> using the <em>ifconfig</em> command.<br />
<br />
I created a folder named <em>config</em>, and populated it with the following Elasticsearch configuration files:<br />
<pre><code>$ mkdir config
$ cat << EOF > config/elasticsearch.yml
cluster.name: elasticsearch
index.number_of_shards: 1
index.number_of_replicas: 0
network.bind_host: dev
network.publish_host: dev
cluster.routing.allocation.disk.threshold_enabled: false
action.disable_delete_all_indices: true
EOF
$ cat << EOF > config/logging.yml
es.logger.level: INFO
rootLogger: ${es.logger.level}, console
logger:
action: DEBUG
com.amazonaws: WARN
appender:
console:
type: console
layout:
type: consolePattern
conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
EOF</code></pre>
<br />
Next, I started Elasticsearch in docker:<br />
<pre>
</pre>
<pre>docker run -d -p 9200:9200 -p 9300:9300 -h dev -v /root/config:/usr/share/elasticsearch/config elasticsearch</pre>
<br />
And validated that ES is up and running by opening a browser on my mac and navigated to <em>IP_ADDRESS:9200 and got:</em><br />
<pre><code>
</code></pre>
<pre><code>{
status: 200,
name:"Longshot",
cluster_name: "elasticsearch",
version: {
number: "1.7.1",
build_hash: "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
build_timestamp: "2015-07-29T09:54:16Z",
build_snapshot: false,
lucene_version: "4.10.4"
},
tagline: "You Know, for Search"
}</code></pre>
<br />
Excellent! From then on, the walk through is as previously described, but now I have one unified environment and that will be the same when in two weeks I will be on-site.<br />
<em><br /></em>
<em>Final note</em>: I set to <code>yes</code> the value of <code>PermitRootLogin</code> in the <code>/etc/ssh/sshd_config</code> file to able remote login as root into the VM from my mac iTerm. I recommend that you check out the <a href="https://github.com/vmware/photon/wiki/Frequently-Asked-Questions" target="_blank">FAQ</a>s.<br />
<br />
<i>Resources</i>: Update to <a href="http://www.vcritical.com/2015/06/quickly-update-to-docker-1-6-on-photon-by-using-tdnf/">Docker 1.6</a></div>
thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-50501892933105537392015-08-21T09:35:00.001-04:002015-08-21T09:35:58.108-04:00Bulk Load Features from ArcGIS Into Elasticsearch<p>I really like <a href="https://www.elastic.co/products/elasticsearch" target="_blank">Elasticsearch</a> because it natively supports geo spatial types and queries. I just added to <a href="https://github.com/mraad/bulk-geo-es" target="_blank">gitbub</a> a ArcPy based toolbox to bulk load the content of a feature class into an ES index/type.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguRkqiU8McNnkrQmqLffK4TDmKx8tNi9hxulXYqc704SUD5N3mWM0UxiHcvTmcwCKvO1td_rDGxkPz6_upozU9738qyVUGSjtG-815zPUk8_7c4njIUaOcaRj4FK8RN264CBkYOVED1kM/" target="_blank" style="margin-left: auto; margin-right: auto;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1wSThXzA-QJpxry_CxKRmXHNaeFgep3Cq0DHrt-43VMFzWFmffDY-q81fD3teHFFa3vDRe4wpJtTdGj934ngb2kMyODuD6sy4jvWeuc3iXk_S0EblPFhn1EZV7TtLopK5WKb2t3A-m34/"></a></div><p>The toolbox contains yet another tool as a proof-of-concept to spatially query the loaded document.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_9igy6bl-YYPzupNyJKjbi7bZDCrDGKLSIQu4JTHdLla8ZpKKq381l60y5bMZ0_3DzjELnlSByMz05ZMqsXlVc7tQYJtSJBcAM7DVnInk9p1A8y1Fe2dEF9wlVCCuq36maeyfhTAxDhc/" target="_blank" style="margin-left: auto; margin-right: auto;"><img 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI_tTrpgCY20LE1BruveLt9GnAC9j50R0KCbWwVfJW5QEpX3vUtd3VzgrZGrxCUZ70dooSroxWyV-IP7cnVdUTSOWc5_nm47xtqMeNbRWt0IMJo_oGS-9PqGHx26muxPk8DsnH5U2QJyM/"></a></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-72788038439615222782015-08-13T20:38:00.001-04:002015-08-13T20:38:11.897-04:00BigData Point-In-Polygon GeoEnrichment<p>I’m always handed a huge set of CSV records with lat/lon coordinates, and the task at hand is to spatially join these records with a huge set of feature polygons where the output is a geoenhancement of the orignal points with the intersecting polygon’s attributes. An example is a set of retailer customer locations that need to be spatially intersected with demographic polygons for targeted advertisement (Sorry to send you all more junk mail :-).</p><p><a href="https://github.com/mraad/spark-pip" target="_blank">This</a> is a reference implementation, where both the points data and the polygon data are stored in raw text TSV format and the polygon geometries are in <a href="https://en.wikipedia.org/wiki/Well-known_text" target="_blank">WKT</a> format. 
Not the most efficient format, but at least the input is splittable for massive parallelization.</p><p>The feature class polygons can be converted to WKT using <a href="https://gist.github.com/mraad/ee63a7c787d42ca6a1e8" target="_blank">this</a> ArcPy tool.</p><p>This <a href="http://spark.apache.org/" target="_blank">Spark</a> based job can be executed in local mode, or better in <a href="https://hub.docker.com/r/mraad/hdfs-yarn-spark/" target="_blank">this</a> docker container.</p><p>One of these days will have to re-implement the reading of the polygons from a binary source such as <a href="https://en.wikipedia.org/wiki/Shapefile" target="_blank">shapefiles</a> or <a href="http://www.esri.com/news/arcuser/0309/files/9reasons.pdf" target="_blank">file geodatabases</a>. Until then, you can download all the source code from <a href="https://github.com/mraad/spark-pip" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com