tag:blogger.com,1999:blog-44425733903566435222024-03-05T05:17:03.022-05:00Thunderhead ExplorerTips and Tricks using GIS BigData, GeoAI, GenAI, ML and other fun stuff :-)thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comBlogger192125tag:blogger.com,1999:blog-4442573390356643522.post-16871375423637413862024-02-03T01:10:00.000-05:002024-02-03T01:10:07.483-05:00On Using AutoML for NYC Taxi Trip Duration Prediction<p><span style="background-color: white; font-family: arial;"><span style="font-size: 16px;">While explaining</span><span style="font-size: 16px;"> </span><a href="https://optuna.org/" rel="nofollow" style="box-sizing: border-box; font-size: 16px; text-underline-offset: 0.2rem;">Optuna</a><span style="font-size: 16px;"> </span><span style="font-size: 16px;">to a client in the context of hyperparameter tuning, and performing more research on the topic, I came across</span><span style="font-size: 16px;"> </span><a href="https://auto.gluon.ai/stable/index.html" rel="nofollow" style="box-sizing: border-box; font-size: 16px; text-underline-offset: 0.2rem;">AutoGluon</a><span style="font-size: 16px;"> </span><span style="font-size: 16px;">to perform "</span><a href="https://www.automl.org/automl/" rel="nofollow" style="box-sizing: border-box; font-size: 16px; text-underline-offset: 0.2rem;">AutoML</a><span style="font-size: 16px;"> </span><span style="font-size: 16px;">for images, text, and tabular data". 
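As a hedged illustration, here is a pure-Python sketch of the kind of trip features one might derive before handing the table to AutoGluon (the column names, timestamp format, and feature choices are my assumptions for illustration; the project's actual feature engineering runs in Spark and lives in the linked repository):

```python
import math
from datetime import datetime

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * 6371.0 * math.asin(math.sqrt(a))

def trip_features(row: dict) -> dict:
    """Derive tabular model inputs from one raw trip record (assumed column names)."""
    pickup = datetime.fromisoformat(row["pickup_datetime"])
    return {
        "hour": pickup.hour,          # captures rush-hour effects
        "weekday": pickup.weekday(),  # weekday vs. weekend traffic
        "dist_km": haversine_km(row["pickup_lat"], row["pickup_lon"],
                                row["dropoff_lat"], row["dropoff_lon"]),
    }
```

AutoGluon then consumes the resulting table with something along the lines of `TabularPredictor(label="trip_duration").fit(train)`; see the AutoGluon documentation for the exact call.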
After a quick scan of the documentation, I decided to give it a try and see how it performs on a simple project.</span></span></p><p dir="auto" style="box-sizing: border-box; font-size: 16px; margin-bottom: 16px; margin-top: 0px;"><span style="background-color: white; font-family: arial;">I always loved the Kaggle competition NYC Taxi Trip Duration, as the data has spatial, temporal, and other traditional attribute information and is a great dataset to test various models and feature engineering techniques.</span></p><p dir="auto" style="box-sizing: border-box; font-size: 16px; margin-bottom: 16px; margin-top: 0px;"><span style="background-color: white; font-family: arial;">I used a local <a href="https://spark.apache.org/" rel="nofollow" style="box-sizing: border-box; text-underline-offset: 0.2rem;">Apache Spark</a> instance (as it is my go-to ETL engine) to perform some feature engineering before letting AutoGluon do its magic. As a quick proof of concept, the results are quite impressive, and <a href="https://github.com/mraad/auto_ml_nyc" rel="nofollow" target="_blank">here</a> are the steps to reproduce the project and visualize the result.</span></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-65248042525220853522024-01-28T11:16:00.000-05:002024-01-28T11:16:25.629-05:00Arabic SDK For Apache Spark<p><span style="color: #0e101a;">I recently attended the <a href="https://www.esrisaudiarabia.com/en-sa/about/events/esrisaudi-uc2024/overview" target="_blank">Esri Saudi Arabia User Conference</a> and was amazed by the changes in the Kingdom. The capital city of Riyadh is booming and proliferating. During the conference, I presented on integrating GenerativeAI and GIS in the plenary session and led a session on BigData and <a href="https://developers.arcgis.com/geoanalytics/" rel="nofollow" target="_blank">GeoAnalytics Engine</a>. 
GeoAnalytics Engine, based on <a href="https://spark.apache.org/" rel="nofollow" target="_blank">Apache Spark</a>, allows spatial operations on Spark data frames. We showcased a project called "A Day in the Life," which used historical traffic data from <a href="https://www.here.com/learn/blog/traffic-analytics-speed-data" rel="nofollow" target="_blank">HERE</a> to demonstrate traffic congestion during peak hours. Traffic is notoriously bad in the city, so this was a fitting example. My colleague Mahmoud H. presented a traditional workflow process in a Jupyter Notebook off a <a href="https://cloud.google.com/dataproc?hl=en" rel="nofollow" target="_blank">Google Cloud DataProc</a> cluster, efficiently processing over 300 million records (this is relatively "small"). The processed traffic information was then displayed in <a href="https://www.esri.com/en-us/arcgis/products/arcgis-pro/overview" rel="nofollow" target="_blank">ArcGIS Pro</a> in a time-aware layer to reflect the congestion visually while activating a time slider. At the end of the presentation, we surprised the audience by using <b>ChatGPT</b> to translate <b>Arabic</b> sentences to SparkSQL code, and Azure OpenAI GPT4 handled the translation very well. Look <a href="https://github.com/mraad/GeoLLM/blob/main/PySparkAI.ipynb" rel="nofollow" target="_blank">here</a> for code snippets. 
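That natural-language-to-SparkSQL step can be sketched as follows; `ask_llm` is a hypothetical placeholder for the chat-completion call, and the table name and schema are illustrative, not the actual HERE traffic schema. The key ideas are pinning the schema in the prompt and guarding the generated SQL before executing it:

```python
import re

# Illustrative schema for the traffic table; not the real HERE column names.
TRAFFIC_SCHEMA = {"link_id": "bigint", "speed_kmh": "double", "hour": "int"}

def build_prompt(question: str) -> str:
    """Pin the table schema so the model can only target real columns."""
    cols = ", ".join(f"{c} {t}" for c, t in TRAFFIC_SCHEMA.items())
    return (
        "You translate questions (in any language, including Arabic) into a single "
        f"SparkSQL SELECT over the table traffic({cols}). Reply with SQL only.\n"
        f"Question: {question}"
    )

def is_safe_select(sql: str) -> bool:
    """Cheap guard before spark.sql(sql): SELECT-only, known identifiers only."""
    if not sql.lstrip().lower().startswith("select"):
        return False
    known = set(TRAFFIC_SCHEMA) | {"traffic"}
    sql_keywords = {"select", "from", "where", "group", "by", "order", "avg",
                    "count", "sum", "min", "max", "and", "or", "as", "desc",
                    "asc", "limit", "between"}
    idents = set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))
    return idents <= known | sql_keywords
```

In use, something like `sql = ask_llm(build_prompt("ما متوسط السرعة عند الساعة 8؟"))` ("What is the average speed at 8 o'clock?") followed by `spark.sql(sql)` once `is_safe_select(sql)` passes; the real notebook linked above shows the actual wiring.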
This form of interaction <b>IS</b> the future, and I am excited to invest more in this technology and in the following areas:</span></p><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Enhanced Visualization and Real-time Data Integration:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: 
initial; margin-bottom: 0pt; margin-top: 0pt;">Dynamic Visualization:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Integrating real-time traffic data feeds into existing models. This will not only show historical congestion but also provide live updates. Dynamic heatmaps can be particularly effective in visualizing the intensity of traffic at different times.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">3D Modeling:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Utilize ArcGIS's 3D scene capabilities to give a more immersive view of traffic congestion and urban planning scenarios.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; 
background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Improved Data Analysis through Machine Learning:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Predictive Analytics:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Integrate machine learning models to predict future traffic patterns based on historical data, weather conditions, events, and other variables.</span></li><li class="ql-indent-1" 
style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Anomaly Detection:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Implement anomaly detection algorithms to identify unusual traffic patterns, which can be crucial for incident response and urban planning.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Enhancing User Interaction and Accessibility:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; 
background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Multilingual Support:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> While we showcased the translation of Arabic sentences to SparkSQL code, we should consider expanding this feature to include more languages, making your tool more accessible to a global audience.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Voice Commands and Chatbots:</strong><span 
data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Integrate voice command functionality and develop a chatbot using Azure OpenAI GPT4 for querying and controlling the GeoAnalytics Engine, making the system more interactive and user-friendly.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Scalability and Performance Optimization:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; 
background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Optimization for Large Datasets:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Continue to refine the efficiency of processing large datasets. Explore the latest advancements in distributed computing and in-memory processing to handle even larger datasets more efficiently.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Cloud Integration:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Ensure the solutions are cloud-agnostic and can be deployed on any public or private cloud provider, enhancing the system's scalability and reliability.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: 
initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Collaboration and Sharing:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Collaborative Features:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: 
initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Develop features that allow multiple users to work on the same project simultaneously, including version control and change tracking for shared projects.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Export and Sharing Options:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Enhance the ability to export results and visualizations in various formats and share them across different platforms, facilitating easier collaboration and reporting.</span></li></ul></ul><ol style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><li style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: decimal; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: 
initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Ethical Considerations and Transparency:</strong></li></ol><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"><ul style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Data Privacy:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Address data privacy concerns by implementing robust data encryption and anonymization techniques, ensuring that individual privacy is respected while analyzing traffic patterns.</span></li><li class="ql-indent-1" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; list-style-type: disc; margin-bottom: 0pt; 
margin-top: 0pt;"><strong style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;">Algorithm Transparency:</strong><span data-preserver-spaces="true" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin-bottom: 0pt; margin-top: 0pt;"> Provide clear documentation and explanations of the algorithms used, promoting transparency and trust in your system.</span></li></ul></ul><p style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-46546808446772496022024-01-27T14:40:00.001-05:002024-01-27T14:40:03.876-05:00Back in Action: GenAI Meets GeoSpatial<p> Hello, everyone. It has been a while since my last post, and I wanted to explain my absence. I have been working on demanding client projects requiring confidentiality, so I couldn't share anything.</p><p>But now I'm back and excited to dive into something new and exciting. </p><p>Generative AI (GenAI) has gained much attention lately, but I'm taking it to a different level by merging Large Language Models (LLMs) with insights from geospatial analysis. 
It's GenAI with a GeoSpatial twist.</p><p>I want to introduce a simple project, "<a href="https://github.com/mraad/GeoLLM" rel="nofollow" target="_blank">ReAct geospatial logic with Ollama</a>," which uses resources such as the Python <a href="https://www.langchain.com/" rel="nofollow" target="_blank">Langchain</a> and <a href="https://ollama.ai/" rel="nofollow" target="_blank">Ollama</a>.</p><p>I'm thrilled to be back and can't wait to start this new journey with you. Keep an eye on this space for future updates, tips, and unique code snippets. Your feedback and questions are valuable, so please don't hesitate to reach out.</p><p>I'll see you in the next post, and as usual, you can check out the source code <a href="https://github.com/mraad/GeoLLM" rel="nofollow" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-3784984559525803562020-08-24T08:28:00.000-04:002020-08-24T08:28:02.909-04:00On Machine Learning in ArcGIS and Data Preparation using Spark<span style="font-family: arial;">Artificial Intelligence / Machine Learning implementations have been part of Esri software in ArcGIS for a very long time. <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/geographically-weighted-regression.htm" rel="nofollow" target="_blank">Geographic Weighted Regression (GWR)</a> and <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/hot-spot-analysis.htm" rel="nofollow" target="_blank">Hot Spot Analysis</a> <b>are</b> Machine Learning algorithms. 
ArcGIS users have been utilizing <a href="https://en.wikipedia.org/wiki/Supervised_learning" rel="nofollow" target="_blank">supervised</a> <b>and</b> <a href="https://en.wikipedia.org/wiki/Unsupervised_learning" rel="nofollow" target="_blank">unsupervised</a> learning like <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/multivariate-clustering.htm" rel="nofollow" target="_blank">clustering</a> to solve a myriad of problems and gain geospatial insight from their data. We just did not call these learning algorithms AI/ML back then. That changed with the recent popularity of <a href="https://en.wikipedia.org/wiki/Deep_learning" rel="nofollow" target="_blank">DeepLearning</a>, which finally blossomed out of the <a href="https://en.wikipedia.org/wiki/AI_winter" rel="nofollow" target="_blank">AI Winter</a> and made AI a household name. And guess what? Esri software now has <a href="https://pro.arcgis.com/en/pro-app/help/analysis/image-analyst/deep-learning-in-arcgis-pro.htm" rel="nofollow" target="_blank">DeepLearning implementations</a>!</span><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">It is important to understand the difference between AI, ML, and DL. I came across this very insightful <a href="https://www.7wdata.be/big-data/what-is-the-difference-between-ai%EF%BB%BF-machine-learning-and-deep-learning/" rel="nofollow" target="_blank">article</a> that analogizes the relationship to <a href="https://en.wikipedia.org/wiki/Matryoshka_doll" rel="nofollow" target="_blank">Russian dolls</a>.</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">Typically, collected data is very "dirty", and a lot of cleaning has to be performed on it before processing it through a Machine Learning algorithm. The bigger the data, the harder the process, especially when pruning noisy outliers and anomalies. 
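That pruning step can be illustrated in pure Python, standing in for the equivalent Spark DataFrame filter (the Tukey-fence rule, the `k` factor, and the column name are illustrative choices, not the notebooks' exact logic):

```python
import statistics

def iqr_fences(values, k=1.5):
    """Tukey fences: anything outside [q1 - k*IQR, q3 + k*IQR] is an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def prune(trips, column="trip_duration"):
    """Keep rows whose value falls inside the fences (cf. df.filter in Spark)."""
    values = [t[column] for t in trips]
    lo, hi = iqr_fences(values)
    return [t for t in trips if lo <= t[column] <= hi]
```

In Spark, the same idea is roughly `df.approxQuantile(...)` to get the fences, then `df.filter(...)` to drop the tails, which scales to data that never fits on one machine.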
Then again, outliers and anomalies could be exactly what you are looking for. That is why "Data Janitor" is a new job title, as the majority of your time when dealing with data is spent cleaning it! That is why I love to use <a href="https://spark.apache.org/" rel="nofollow" target="_blank">Apache Spark</a> for this task. It can handle BigData very efficiently in a distributed, parallel share-nothing environment, and the usage of the <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html" rel="nofollow" target="_blank">DataFrame API and SQL</a> makes the manipulation a breeze. Utilizing the latter two in an interactive environment like a <a href="https://jupyter.org/" rel="nofollow" target="_blank">Jupyter</a> Notebook enables quick cleaning and, more importantly, insightful exploration.</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">Now the best part: we can have all three (Jupyter, Spark, and ML) in one environment: <a href="https://www.esri.com/en-us/arcgis/products/arcgis-pro/resources" rel="nofollow" target="_blank">ArcGIS Pro</a>!</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;"><a href="https://github.com/mraad/spark-esri/blob/master/taxi_trips_duration_train.ipynb" rel="nofollow" target="_blank">This</a> notebook and <a href="https://github.com/mraad/spark-esri/blob/master/taxi_trips_duration_error.ipynb" rel="nofollow" target="_blank">that</a> notebook demonstrate the usage of Spark in a notebook in Pro to clean and explore the data, and then prepare it for Machine Learning. 
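Once a model is trained on the prepared data, its quality needs a number; the Kaggle competition these trips come from scores submissions with RMSLE (root mean squared logarithmic error), sketched below as a quick reference (my addition, not code from the notebooks):

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error: penalizes relative, not absolute, error."""
    n = len(y_true)
    return math.sqrt(sum((math.log1p(p) - math.log1p(t)) ** 2
                         for t, p in zip(y_true, y_pred)) / n)
```

Because the metric works on log1p of the durations, a 1-minute miss on a 5-minute trip hurts far more than on a 1-hour trip, which is why duration models are often trained on the log of the target.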
We are using the built-in <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/forestbasedclassificationregression.htm" rel="nofollow" target="_blank">Forest Based Regression</a> to predict (or attempt to predict) the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration" rel="nofollow" target="_blank">trip duration of a New York City taxi</a> given a pickup and dropoff location.<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0xgyKAYwn235jabKI6olLt0aP4h_gHbH87bGx4UvO6u8rAGhXx11-S8vbl-vGshkv6Tch4bMyW-SAt8GQylTcY8ik01Nbt-vK7SvjVnkiR9kNTUJAuKFplPcK_wSEfxMiQSZe13Jj1Oo/s768/PickupBins200x200.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="580" data-original-width="768" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0xgyKAYwn235jabKI6olLt0aP4h_gHbH87bGx4UvO6u8rAGhXx11-S8vbl-vGshkv6Tch4bMyW-SAt8GQylTcY8ik01Nbt-vK7SvjVnkiR9kNTUJAuKFplPcK_wSEfxMiQSZe13Jj1Oo/s640/PickupBins200x200.png" width="640" /></a></span></div><div><div><span style="font-family: arial;">The last notebook, enables the user to select pickup locations (in the below case, about JFK) and "see" the errors of the model on a map.</span></div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8-LFCNF0vEAxkj02AjyuNcjXyZysGGmKG23T6CesDzeEhNgJmSt_RptyAco_Bar309_LxOzmzuidrR1l70Vy6DusRq0yCbIaM3hRKGQLyilhHsJ9AVE-bFSZiXSppylB8OzDpqQQR0VI/s768/TripErrors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: arial;"><img border="0" data-original-height="580" data-original-width="768" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8-LFCNF0vEAxkj02AjyuNcjXyZysGGmKG23T6CesDzeEhNgJmSt_RptyAco_Bar309_LxOzmzuidrR1l70Vy6DusRq0yCbIaM3hRKGQLyilhHsJ9AVE-bFSZiXSppylB8OzDpqQQR0VI/s640/TripErrors.png" width="640" /></span></a></div><div><span 
style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">Some will argue that all this could have been done using <a href="https://pandas.pydata.org/" rel="nofollow" target="_blank">Pandas</a>, and that is true. But that is not the point of this demonstration. Pandas also assumes that all the data can be held in memory, which is true in this case (1.45) but will fail when you have hundreds of millions of rows and have to process all this data on one machine with limited resources. If you like Pandas, I recommend that you check out <a href="https://koalas.readthedocs.io/en/latest/" rel="nofollow" target="_blank">Koalas</a>.</span></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-31154561260512890112020-08-03T21:48:00.001-04:002020-08-04T04:01:12.650-04:00ArcGIS Pro, Jupyter Notebook and Databricks<font face="verdana">Yet another post in the continuing saga of the usage of <a href="https://spark.apache.org/" rel="nofollow" target="_blank">Apache Spark</a> from a <a href="https://www.esri.com/arcgis-blog/products/arcgis-pro/analytics/introducing-arcgis-notebooks-in-arcgis-pro/" rel="nofollow" target="_blank">Jupyter notebook within ArcGIS Pro</a>.</font><div><font face="verdana"><br /></font></div><div><font face="verdana">In the <a href="http://thunderheadxpler.blogspot.com/2020/07/on-arcgis-pro-notebook-and-apache-spark.html" rel="nofollow" target="_blank">previous posts</a>, the execution was always within the ArcGIS Pro environment on a single machine, albeit taking advantage of all the cores of that machine. 
Here, we take a different angle: the execution is performed on a remote cluster of machines in the cloud.</font></div><div><font face="verdana"><br /></font></div><div><font face="verdana">So, we author the notebook locally, but we execute it remotely.</font></div><div><font face="verdana"><br /></font></div><div><font face="verdana">In <a href="https://github.com/mraad/spark-esri/blob/master/spark_dbconnect.ipynb" rel="nofollow" target="_blank">this</a> notebook, we demonstrate the spatial binning of AIS broadcast points on a <a href="https://databricks.com/product/azure" rel="nofollow" target="_blank">Databricks cluster on Azure</a>. In addition, to colocate the data storage with the execution engine for performance purposes, we converted the local feature class of the AIS broadcast points to a <a href="https://parquet.apache.org/" rel="nofollow" target="_blank">parquet</a> file and placed it in the <a href="https://docs.databricks.com/data/databricks-file-system.html" rel="nofollow" target="_blank">Databricks distributed file system</a>.</font></div><div><font face="verdana"><br /></font></div><div><font face="verdana">More to come :-)</font></div><div><font face="verdana"><br /></font></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-54363941029350365432020-08-02T16:58:00.001-04:002020-08-02T16:58:20.460-04:00Virtual Gate Crossing<font face="verdana">Yet another continuation post regarding Pro, Notebook, and Spark :-). 
In <a href="https://github.com/mraad/spark-esri/blob/master/virtual_gates.ipynb" rel="nofollow" target="_blank">this</a> notebook, we will demonstrate <span style="background-color: white; text-align: justify;">a parallel, distributed, share-nothing spatial join between a relatively large dataset and a small dataset.</span></font><div><div style="text-align: justify;"><span style="background-color: white;"><font face="verdana"><br /></font></span></div><div style="text-align: justify;"><span style="background-color: white;"><font face="verdana">In this case, virtual gates are defined at various locations in a port, and the outcome is a count of the number of crossings of these gates by ships, using their AIS target positions.</font></span></div><div style="text-align: justify;"><p style="background-color: white; margin: 1em 0px 0px; padding: 0px;"><font face="verdana">Note that the join is to a "small" spatial dataset that we can:</font></p><ul style="background-color: white; list-style-image: initial; list-style-position: initial; margin: 1em 2em 0px; padding: 0px; text-align: start;"><li style="line-height: 20px; margin: 0px; padding: 0px;"><font face="verdana"><a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables" rel="nofollow" style="color: #0088cc; margin: 0px; padding: 0px;" target="_blank">Broadcast</a> to all the Spark workers.</font></li><li style="line-height: 20px; margin: 0px; padding: 0px;"><font face="verdana">Brute-force traverse on each worker, as it is cheaper and faster to do so than to spatially index it.</font></li></ul></div><div><span style="background-color: white; text-align: justify;"><font face="verdana"><br /></font></span></div></div><div style="text-align: justify;"><span style="background-color: white;"><font face="verdana">The following are sample gates:</font></span></div><div style="text-align: justify;"><font face="verdana"><br /></font></div><div class="separator" style="clear: both; 
text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxYnFr2AvzQ6OkVkBSxNysKhVsiyikpdVIJsmIpFkfAKBmyqK4nK9hTVNwkMuK6h-oz7boKSlCCF8nmJ5nMz07oyZXYswoAmMVd6ryFjC9N2FP7Wcy1uB_laMwIPP-hhbJsK1fCGPsC5w/s768/Gates1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="verdana"><img border="0" data-original-height="575" data-original-width="768" height="383" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxYnFr2AvzQ6OkVkBSxNysKhVsiyikpdVIJsmIpFkfAKBmyqK4nK9hTVNwkMuK6h-oz7boKSlCCF8nmJ5nMz07oyZXYswoAmMVd6ryFjC9N2FP7Wcy1uB_laMwIPP-hhbJsK1fCGPsC5w/w512-h383/Gates1.png" width="512" /></font></a></div><div style="text-align: justify;"><font face="verdana"><br /></font></div><div style="text-align: justify;"><font face="verdana">And the following is a sample processed output:</font></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX0O-nIeyK7SVxj-gh3LEkOsyJVRFYbmbLkcJcw2K7VdUdLmT37YNhoN6fQx5Ib4JgqCTeppxgRABylCNxKwkSjeBW4WuXX7A0kUfGLbgLw-1liblPW6znaydSy2kebBv7lyFHARQbPyQ/s768/Gates2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="verdana"><img border="0" data-original-height="480" data-original-width="768" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX0O-nIeyK7SVxj-gh3LEkOsyJVRFYbmbLkcJcw2K7VdUdLmT37YNhoN6fQx5Ib4JgqCTeppxgRABylCNxKwkSjeBW4WuXX7A0kUfGLbgLw-1liblPW6znaydSy2kebBv7lyFHARQbPyQ/w512-h320/Gates2.png" width="512" /></font></a></div><div style="text-align: justify;"><font face="verdana"><br /></font></div><div style="text-align: justify;"><font face="verdana">More to come...</font></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-69621317834398004712020-08-01T11:52:00.001-04:002020-08-01T11:56:10.163-04:00MicroPath Reconstruction of AIS Broadcast Points<font 
face="helvetica">This is a continuation of the last post regarding ArcGIS Pro, Jupyter Notebook, and Spark. And this is a rehash of an <a href="http://thunderheadxpler.blogspot.com/2018/05/on-patterns-of-life-from-macrodata-to.html" rel="nofollow" target="_blank">older post</a> in a more "modern" way.</font><div><font face="helvetica"><br /></font></div><div><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">Micropathing is the construction of a target's path from a limited, consecutive sequence of target points. Typically, the sequence is time-based, and the collection is limited to 2 or 3 target points.<span class="Apple-converted-space"> </span>The following is an illustration of 2 micropaths derived from 3 target points:</font></p><p class="p2" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px; min-height: 14px;"><font face="helvetica"><br /></font></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ6JGi7a46HwF77b_JFhU0iFHCj9OoYWiWbdTLhZiEPMDIs6waNaOu1S3IrXvYRcinp1eCUYT7lka7wY1GH5N8id9sPtI38i2UdHYiXJlVJ_cnNM1Ag-7nwDezu47DPR7YfNd81pIvU5c/s320/Micropath0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="helvetica"><img border="0" data-original-height="162" data-original-width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ6JGi7a46HwF77b_JFhU0iFHCj9OoYWiWbdTLhZiEPMDIs6waNaOu1S3IrXvYRcinp1eCUYT7lka7wY1GH5N8id9sPtI38i2UdHYiXJlVJ_cnNM1Ag-7nwDezu47DPR7YfNd81pIvU5c/s0/Micropath0.png" /></font></a></div><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; 
font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">Micropathing is different from path reconstruction, in that the latter produces one polyline for the whole path of a target. Path reconstruction loses insightful in-path behavior, as a large number of attributes cannot be associated with the path parts. Some can argue that the points along the path can be enriched with these attributes. However, with the current implementations of Point objects, we are limited to only the extra <i>M</i> and <i>Z</i> in addition to the necessary <i>X</i> and <i>Y</i>. You can also join the <i>PathID</i> and <i>M</i> to a lookup table and gain back that insight, but that joining is typically expensive, and it is difficult to derive from it the "expression" of the path using traditional mapping. A micropath overcomes these limitations and better expresses the path insight with today's traditional means.</font></p><p class="p2" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px; min-height: 14px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">So, a micropath is a line composed of typically 2 points only and is associated with a set of attributes that describe that line.<span class="Apple-converted-space"> </span>These attributes are typically enrichment metrics derived from its two ends. 
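In Spark SQL, this pairing of each point with its predecessor is naturally expressed with a LAG window function over each target's time-ordered points. The same logic in a standalone Python sketch; the field names, the toy planar distance, and the noise threshold are illustrative assumptions, not the notebook's actual values:

```python
def micropaths(points, max_kmh=80.0):
    """Pair each time-ordered point with its predecessor (the SQL LAG idea)
    and keep only pairs whose implied speed is not noise."""
    pts = sorted(points, key=lambda p: p["t"])
    paths = []
    for a, b in zip(pts, pts[1:]):
        hours = (b["t"] - a["t"]) / 3600.0
        # Toy planar distance in km; a real notebook would use geodesic distance.
        km = ((b["x"] - a["x"]) ** 2 + (b["y"] - a["y"]) ** 2) ** 0.5
        if hours > 0 and km / hours <= max_kmh:
            paths.append({"line": ((a["x"], a["y"]), (b["x"], b["y"])),
                          "km": km, "hours": hours, "kmh": km / hours})
    return paths

track = [{"t": 0, "x": 0.0, "y": 0.0},
         {"t": 3600, "x": 10.0, "y": 0.0},    # 10 km/h: kept
         {"t": 3660, "x": 500.0, "y": 0.0}]   # GPS spike: dropped as noise
```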
An attribute can be, for example, the traveled distance, time, or speed.</font></p><p class="p2" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px; min-height: 14px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">In <a href="https://github.com/mraad/spark-esri/blob/master/micro_path.ipynb" rel="nofollow" target="_blank">this</a> notebook, we will construct "clean" micropaths using SparkSQL.<span class="Apple-converted-space"> </span>What do I mean by clean? As we all know, emitted target points are notoriously affected by noise, so using SparkSQL, we will eliminate that noise during the micropath construction.</font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica"><br /></font></p><p class="p1" style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">Here is a result:</font></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdEHQvgRrkyFnkI9A9a2XqmC5NUOb0fpxmnWXShyphenhyphenwUdpQ42Zi98W1AUDiaNS9PYkBc63ijbMzAmMVtaHaypwVbzv2nXk92ftItwoncoc87o0BT0vwHhA734h-0v3_tqpd4pTU8n7x3od4/s768/Micropath1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><font face="helvetica"><img border="0" data-original-height="480" data-original-width="768" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdEHQvgRrkyFnkI9A9a2XqmC5NUOb0fpxmnWXShyphenhyphenwUdpQ42Zi98W1AUDiaNS9PYkBc63ijbMzAmMVtaHaypwVbzv2nXk92ftItwoncoc87o0BT0vwHhA734h-0v3_tqpd4pTU8n7x3od4/w512-h320/Micropath1.png" width="512" /></font></a></div><p class="p1" 
style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><font face="helvetica">More to come...</font></p></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-15375752035413752922020-07-31T09:47:00.003-04:002020-07-31T09:52:00.354-04:00On ArcGIS Pro, Jupyter Notebook and Apache SparkBeen a while since I posted something, and thank you, faithful reader, for coming back :-)<div><br /></div><div>I'm intending to write a series of posts on how to use <a href="https://spark.apache.org/" target="_blank">Apache Spark</a> and Machine Learning within a Jupyter notebook <b>within</b> ArcGIS Pro. Yes, you can now start a Jupyter notebook instance in ArcGIS Pro to create an amazing data science and data exploration experience. Check out <a href="https://pro.arcgis.com/en/pro-app/arcpy/get-started/pro-notebooks.htm" target="_blank">this link</a> to see how to get started with a Jupyter notebook in Pro. But...my favorite hidden "GeoGem" is that Pro comes with built-in Apache Spark, and y'all know how much I love Spark. People think that Spark is intended only for BigData analytics. That is so far from the truth. What I love about it is the frictionless movement of data and analysis locally or remotely and the language fusion. 
In my case, I'm using Python, SQL, and Scala.</div><div><br /></div><div>The usage of Apache Spark in Pro was demonstrated in the publicly shared <a href="https://www.esri.com/about/newsroom/blog/introducing-community-contact-tracing/" target="_blank">Covid-19 Contact Tracing Application</a> and the <a href="https://www.esri.com/arcgis-blog/products/arcgis-pro/health/use-proximity-tracing-to-identify-possible-contact-events/" target="_blank">Proximity Tracing Application</a>.</div><div><br /></div><div>In this first <a href="https://github.com/mraad/spark-esri/blob/master/spark_esri.ipynb" target="_blank">notebook</a>, we will start by loading selected features from a local feature class into a Spark dataframe, then process the dataframe using Spark SQL, and write the result back to an ephemeral feature class that will be displayed on the map.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLzWiYWaccCyJt3-QVXyUoViO5JUjKIcIi8y02JAnZeF1Tgu5IazRzegxLrNaqX4W3ESfe4lf4AQktgKKEumNyRFrNKc8XUaxr3CAm_I021tS9Ys5r_kgXV5DpRngZUHXBooZvIDi_BDk/s768/Notebook.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="480" data-original-width="768" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLzWiYWaccCyJt3-QVXyUoViO5JUjKIcIi8y02JAnZeF1Tgu5IazRzegxLrNaqX4W3ESfe4lf4AQktgKKEumNyRFrNKc8XUaxr3CAm_I021tS9Ys5r_kgXV5DpRngZUHXBooZvIDi_BDk/w512-h320/Notebook.png" width="512" /></a></div><div><br /></div><div><br /></div><div>Like usual, all the source code can be found <a href="https://github.com/mraad/spark-esri" target="_blank">here</a>.</div><div><br /></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-12721983040049938852018-05-27T13:31:00.001-04:002018-05-27T13:32:00.106-04:00On Patterns Of Life: From MacroData to PicoData<p>Insight (or 
GeoInsight in our case) is lost in the deluge of data that we are acquiring today from everyday sensors that are machine- or human-generated. <a href="https://github.com/mraad/spark-pico-path">This</a> project is a set of heuristic <a href="https://spark.apache.org/">Spark</a>-based implementations to reveal signals from the movement of ships in and out of the Port of Miami.</p><p>The idea is to extract small clean data (PicoData) from the overlap of a massive amount of data (MacroData). The aggregation of "clean" PicoData derived from MacroData thrusts patterns of life into the forefront.</p><p>For example, given the following display of AIS broadcasts:</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoehkFtzYW3BCH0DkwPhZVFLbQXuEO8sXdb7q0C99GQQXV4wkauZrH-9EOAGs3riTlW85JlxRGL3YvAoDsz6aNH68TuJtimyMxkkrbkJJ2MlUTYTPUmC5fcvf3wLP2syJsHIyswnw0F0E/s9999/1527442048.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>We can mutate the data to reveal the "clean" influx of ships into the harbor at high tide:</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDPvmC12tYiQBfZ0QzepzhVvIjIkM0sdgafMWY1idjCRb-XCPg54JrfOJe5BsxIlaCrLJEn7iGWZkM_2TAqG9U4zfDzPhbHCrG1DLEvVx8LiFVzxP1M9sW3YF9cVkL9Npg8ky8Zy93b3s/s9999/1527442066.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Like usual, you can download all the source code from <a href="https://github.com/mraad/spark-pico-path">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-5952057478169969032018-01-01T11:06:00.000-05:002018-01-01T11:06:29.604-05:00On ML and Elastic Principal Graphs<p>Happy 2018 all. It has been a while since my last post. Thank you for your patience, dear reader. 
Like usual, the perpetual resolutions for every year, in addition to blogging more, are to eat well, exercise often, and climb <a href="https://en.wikipedia.org/wiki/Mont_Ventoux">Ventoux</a>.</p><p>Onward.</p><p>I genuinely believe that 2018 will be the year of the ubiquity of Geo-AI. It will be the year when Machine Learning and Spatial Awareness will blossom inside and mostly outside the GIS community.</p><p>We at Esri have had Machine Learning-based tools in our "shed" for a long time. Every time an ArcGIS user performs a <a href="http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/geographically-weighted-regression.htm">geographically weighted regression</a>, trains a <a href="http://pro.arcgis.com/en/pro-app/tool-reference/spatial-analyst/train-random-trees-classifier.htm">random trees classifier</a>, or detects an <a href="http://pro.arcgis.com/en/pro-app/tool-reference/space-time-pattern-mining/emerginghotspots.htm">emerging hot spot</a>, that user is using a form of Machine Learning without knowing it!</p><p>So one of my "missions" for 2018 is to make this knowledge more explicit to our users and non-traditional GIS users. Also, to start implementing new forms of Machine Learning.</p><p>Machine Learning (ML), a branch of Artificial Intelligence (AI), is a disruptive force that is changing how today's industries are gaining new insight from their data. ML uses math, statistics, and probability to find hidden patterns and make predictions from the data without being explicitly programmed. It is this last statement that is disruptive: "No explicit programming"! An ML algorithm iterates "intelligently" over the data, and the patterns emerge. Being iterative, the more data an ML algorithm is exposed to, the more refined the output becomes. 
Thus the coupling of BigData and ML is a perfect marriage, fueled by cheap storage, ever more powerful computation (think GPUs), and faster networking.</p><p>The reemergence of this "No Explicit Programming" paradigm, in forms such as <a href="https://en.wikipedia.org/wiki/Deep_learning">Deep Learning</a>, <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a>, and <a href="https://en.wikipedia.org/wiki/Self-organization#In_learning">Self Organization</a>, is propelling the likes of Google (with <a href="https://en.wikipedia.org/wiki/AlphaGo_Zero">AlphaGo Zero</a>), Facebook, and Uber.</p><p>So, I am starting this launch with something I have been fascinated by for quite some time, and that is "<a href="https://github.com/auranic/Elastic-principal-graphs/wiki">Elastic Principal Graphs</a>."</p><p>It is a "deep" extension of <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> that I came across during my research on mapping noisy 2D data to a curve, and I was fascinated by its self-organization.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgACQn2zSQLUx_vDTaoHaRrTUBvh0T4BZ2bFveUdyXPdNV7KveHiWgmi8vS8k44DhCiHylE0rfZkovCjaltX2OL1j3_k-3902YCAEZdlD_q6Ms3RqBaJx3x8cISTuI8u_xbyxsRnXS-1fo/s9999/1514821804.gif" alt="img-alternative-text" style="max-width: 100%;"></div><p>After reading (and rereading for the nth time) <a href="https://github.com/auranic/Elastic-principal-graphs/blob/master/ElPiGraph_Methods.pdf">this</a> paper, I wrote <a href="https://github.com/mraad/elastic-graph">this</a> GitHub repo as a minimalist implementation in Scala.</p><p>Happy New Year All.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-69105396707710795962017-04-16T07:32:00.001-04:002017-04-16T07:51:16.692-04:00On Machine Learning and General Path Recognition<p>This is part II 
of my journey back into <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=7&cad=rja&uact=8&ved=0ahUKEwi2i4_CwqbTAhVG6iYKHZgyBNEQFghHMAY&url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FSelf-organizing_map&usg=AFQjCNEJQ-mp56DiF1TRY3TFrrtQ7Y1B0w&sig2=q1xhvQWWbDDdVEtut0-TLA&bvm=bv.152479541,d.eWE">SOMs</a>. Actually this journey started with the below picture:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" title="Map.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPtg0y7288usy2YvcpO_xbIjyQbA2ydDGb9NEP3l1mfygECQNG4jx6cseuQ0C2couVp83A3oS3hrTYmGYIR-myVOt4uPeJA9MydpNdMRh3LLVWbvhy8JQYXmenCmmA7nq5LB_JV8AXlrA/?imgmax=1600" alt="Map" width="480" height="480" border="0" /></p>
<p>It is a map of <a href="https://marinecadastre.gov/ais">AIS</a> broadcast points around the harbor of Miami, FL. Now, when we sentient creatures look at this map, we can quickly and clearly see patterns formed by the points. There is a sequence of points that starts from the harbor and goes northeast. There is a sequence of points that starts at the harbor and goes south-southeast. And there are plenty of north-south paths; some are very close to the shore, others are on the "edge" to the east. And there are paths in the "middle".</p>
<p>Wouldn't it be wonderful if the Machine could see these patterns and formulate the general paths? That is actually what started this journey. I needed an unsupervised way for the Machine to recognize the patterns and emit the paths. I'm sure there are multiple ways to solve this, but I remembered that a while back I used <a href="https://en.wikipedia.org/wiki/Self-organizing_map">Self Organizing Maps</a> due to their simplicity and, crucially, because they belong to a class of unsupervised machine learning algorithms. So, in <a href="http://thunderheadxpler.blogspot.com/2017/04/on-machine-learning-with-self.html">Part I</a>, I reacquainted myself with SOMs, and in this part, I completed the journey by showing the paths.</p>
<p>However, faithful reader, I have not been totally honest with you. Please forgive me. This is part III in this journey. Along the way, I diverted a bit, as I needed a way to assemble tracks from targets. There exists a hidden gem in <a href="http://thunderheadxpler.blogspot.com/2017/03/arcgis-spark-and-alluxio-integration.html">this</a> project. The <a href="https://github.com/mraad/arcgis-alluxio/blob/master/src/main/scala/com/esri/PathFinder.scala">PathFinder</a> application is an important stop on this journey, as, in addition to assembling tracks from targets, it quantizes the path into grid cells. The quantization of paths is the <a href="https://en.wikipedia.org/wiki/Linchpin">linchpin</a> between the raw target points and the unsupervised path detection.</p>
<p>The below picture will help in my explanation:</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" title="Track.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbq21cLSCCF5M-o7MPXBNZgHUSRP3LhJmRrLYgBezUC2Xu5xGjVdt2rK3e6idvrXIs6xPENUc-rtowI5C4w_EUVn06FnJt-suV9kFhikdM36KgHbc4H1Mf8zMHUrK5nv24EMfeZjHTubQ/?imgmax=1600" alt="Track" width="480" height="480" border="0" /></p>
<p>If a virtual grid is overlaid on the map (the grid cells in the above map are purposely coarse for illustration), then a linear vector that represents a path can be composed by scanning the grid cells from left to right and then from top to bottom. The existence of targets in a cell is the binary value of the corresponding element in the vector. So in the above case, the path will be represented by the vector [0,0,0,0,1,0,0,0,1,1,1,1,0,0,0,1]. Side note: In this implementation the vector is composed of binary values; however, a vector with real numbers can be composed where the element value is proportional to the number of targets. In addition, the cell values can "bleed" to neighboring cells in, say, a Gaussian way for better path recognition. Will have to come back to this one day. A Master's thesis can be made out of this.</p>
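The scan described above can be sketched in a few lines of Python; this is a hypothetical illustration of the quantization idea, not the PathFinder implementation (the grid origin, cell size, and points are made up):

```python
def path_vector(points, x_min, y_min, cell, cols, rows):
    """Quantize a target's points into a row-major binary occupancy vector,
    scanning cells left to right, then top to bottom."""
    vec = [0] * (cols * rows)
    for x, y in points:
        c = int((x - x_min) / cell)
        r = int((y - y_min) / cell)
        if 0 <= c < cols and 0 <= r < rows:
            # Element 0 of the vector is the top-left cell of the map, hence the flip.
            vec[(rows - 1 - r) * cols + c] = 1
    return vec

# Three points tracing a short hook on a 4x4 grid of 1x1 cells.
vec = path_vector([(0.5, 0.5), (1.5, 0.5), (1.5, 1.5)],
                  x_min=0.0, y_min=0.0, cell=1.0, cols=4, rows=4)
```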
<p>Now that we can compose a set of vectors from a set of targets, we can train the SOM with these vectors. The below figure is the visual representation of a 3x3 SOM result. Each sub-map is a visual representation of the settled weights of a SOM node, where each node's weights are a linear representation of a quantized grid as described above.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" title="fig2.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirlvlSWhpqK6gEDx6K4jpjuhO09b2EuLAFZiNmKTWaG9WUMtX9ub4juOuTetaMU7SWLlGJDuKWvEhyphenhyphen2KwVug6ikOzkTEatG26cVWl9kS0c0AjaX9ZPMF84YYNstc9mi9AL8Y-non6oy8A/?imgmax=1600" alt="Fig2" width="270" height="270" border="0" /></p>
<p>We can see the path patterns that have self-organized, and they do reflect what we have implicitly seen in the first map as humans. I highlighted in red the cells in each map with the most target associations, thus forming a path. Isn't it amazing?</p>
<p>Like usual, you can download all the source code for this from <a href="https://github.com/mraad/spark-som-path">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-82321539590044773432017-04-02T12:22:00.001-04:002017-04-02T12:22:13.866-04:00On Machine Learning with Self Organizing Maps<p>Self Organizing Map (<a href="https://en.wikipedia.org/wiki/Self-organizing_map">SOM</a>) is a form of Artificial Neural Network (ANN) belonging to a class of Machine Learning models. AI Junkie has a GREAT <a href="http://www.ai-junkie.com/ann/som/som1.html">tutorial</a> about it. What I like about SOMs is that they belong to a class of unsupervised learning models and they hold true to the first law of geography.</p>
<blockquote>
<p>"Everything is related to everything else, but near things are more related than distant things." - <a href="https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography">Tobler</a></p>
</blockquote>
<p>I encountered them and used them over 20 years ago, and since AI/ML is the hottest topic these days, I'm reacquainting myself with them. There are plenty of SOM libraries, but I learn (or in this case re-learn) by doing. <a href="https://github.com/mraad/spark-som">This</a> project is my learning journey in implementing SOMs and "<a href="http://spark.apache.org/">Sparkyfing</a>" them.</p>
<p>The following is a sample output of the obligatory RGB classifier, where a million random RGB triples are organized by a <a href="https://github.com/mraad/spark-som/blob/master/src/main/scala/com/esri/SparkApp.scala">Spark-based SOM</a> into a 10x10 square lattice:</p>
<p><img title="som2.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtKyMmGsqY2M8WvOwjwNJGBH6mpoiS2kX2GVy4xlFV4Cso6ONizYJW8b1BqJY2bblUzaPPBLDiuwnvoAxOa-ia1mZYXmhGrDcyouFXiIZEDPv3lZrX813Y8tUlOyfgGnCGxIAHk8vFej8/?imgmax=1600" alt="Som2" width="100" height="100" border="0" /></p>
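The training loop behind such a lattice is short. The following is a hedged NumPy sketch of the classic SOM update rule; the lattice size, learning rate, radius, and decay schedule are arbitrary illustrative choices, not the values used in the repo:

```python
import numpy as np

def train_som(samples, rows=10, cols=10, iters=2000, lr0=0.5, seed=7):
    """Classic SOM loop: find the best-matching unit (BMU), then pull it and
    its lattice neighbors toward the sample with a shrinking rate and radius."""
    rng = np.random.default_rng(seed)
    w = rng.random((rows, cols, samples.shape[1]))        # lattice weights
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)  # each unit's (r, c)
    sigma0 = max(rows, cols) / 2.0
    for t in range(iters):
        frac = t / iters
        lr = lr0 * (1.0 - frac)                  # linearly decaying rate
        sigma = sigma0 * (1.0 - frac) + 1e-9     # linearly decaying radius
        x = samples[rng.integers(len(samples))]
        d = np.linalg.norm(w - x, axis=2)        # distance in sample space
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Gaussian neighborhood measured on the lattice, not in sample space.
        dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))[..., None]
        w += lr * h * (x - w)
    return w

rgb = np.random.default_rng(1).random((1000, 3))  # random color triples
lattice = train_som(rgb)  # nearby units settle on similar colors
```

The neighborhood function is the key design choice: because closeness is measured on the lattice rather than in color space, nearby nodes are forced to agree, which is exactly Tobler's first law at work.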
<p>And the following is a sample solution to a <a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">TSP</a> using SOM:</p>
<p><img title="tsp.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl7slvk8mNA8JrYmEQDaZy7iWgnxu-Xh_5W2UbDGu2v1JMlgjOXAzuKS8QKX66V80YK908-WCs1g5H-m5Z_i5ASusc7QgXZqje0XF1TVHQQtvepmoiBBAyJbBwSKacyupzWYTViQvQf88/?imgmax=1600" alt="Tsp" width="200" height="200" border="0" /></p>
<p>Like usual, all the source code can be found <a href="https://github.com/mraad/spark-som">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-43644836454516540192017-03-27T18:15:00.004-04:002017-03-27T18:18:52.443-04:00ArcGIS, Spark and Alluxio Integration<p>There exists a plethora of backend distributed data stores. I am always using <a href="https://aws.amazon.com/s3/">S3</a>, <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Hadoop HDFS</a>, or <a href="https://docs.openstack.org/developer/swift/">OpenStack Swift</a> with my GIS applications to read geospatial data from, or save my data into, these backends. Some of these distributed data stores are not natively supported by the <a href="http://www.esri.com/arcgis/about-arcgis">ArcGIS platform</a>. However, the platform can be extended with <a href="http://pro.arcgis.com/en/pro-app/arcpy/get-started/what-is-arcpy-.htm">ArcPy</a> to handle these situations. Depending on the data store, I will have to use a different API (mostly Python-based) to read and write geospatial information. This is where <a href="http://www.alluxio.org/">Alluxio</a> comes in very handy. It provides an abstraction layer between the application and the data store, and (here is the best part) it caches this information in memory in a distributed and resilient-to-failure manner. So, at the application level, the code to access the data is invariant. On the backend, I can configure Alluxio to use either S3, HDFS, or Swift. 
Finally, the advent of a <a href="https://www.alluxio.com/blog/whats-new-in-alluxio-140">REST endpoint</a> in Alluxio eases the integration with ArcGIS to write, read, and visualize geospatial data.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_lYwy4HcY3Z2UplhLnkjQfyLEbhzmLchuXsEhgajIjmN0SXzCxzmh4OPsNJOo2jWjuaFqesmXts3i2kluTqqK7xP1N_bEbtmskaaWguwdsS9yWvzI3BumbOFH2XNMzNgD10UcSnZZnuk/s9999/1490652880.png" alt="img-alternative-text" style="max-width: 100%;"></div><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFFWqXm7Hpk-wIxINZao0ysyoSzNM4_9crCc7EH66Hcxf7wSJhqHmbL__nwFg_i54nsRj7p7L6l36wf-lXS1NoZZkdORMIFuqOdxdopSEVFjYAY3Xcg3_vtswjESDDL1NKx9MjAWZS7V0/s9999/1490653122.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Like usual, all the source code for this integration can be found <a href="https://github.com/mraad/arcgis-alluxio">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-23856370421210020072017-03-18T15:32:00.000-04:002017-03-18T15:32:19.194-04:00ArcGIS, Spark & MemSQL Integration<p>Just got back from the fantastic <a href="https://conferences.oreilly.com/strata/strata-ca">Strata + Hadoop 2017</a> conference, where the topics ranged from BigData and Spark to lots of AI/ML, and not so much Hadoop explicitly, at least not in the sessions that I attended. I think that is why the conference is renamed Strata + Data from now on, as there is more to BigData than Hadoop.</p><p>While strolling the exhibition hall, I walked into the booth of our friends at <a href="http://www.memsql.com/">MemSQL</a> and got a BIG hug from <a href="https://www.linkedin.com/in/garyorenstein/">Gary</a>. 
We reminisced about our co-presentations at various conferences regarding the integration of ArcGIS and MemSQL as they natively support <a href="http://www.memsql.com/content/geospatial/">geospatial</a> types.</p><p>This post is a refresher on the integration with a "modern" twist, where we are using the <a href="https://github.com/memsql/memsql-spark-connector">Spark Connector</a> to ETL geo spatial data into MemSQL in a <a href="https://github.com/memsql/memsql-docker-quickstart">Docker</a> container. To view the bulk loaded data, <a href="https://pro.arcgis.com/en/pro-app/">ArcGIS Pro</a> is extended with an ArcPy toolbox to query MemSQL, aggregate and view the result set of features on a map.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig_G2yjz2gJvkqwLIBFFzlRnFQJZqeNtPkEKPIFnoVn22h4TqigYq3mcG9UyZbgoEPmxLcaOvTZ64WDmbtyxyOI6OLJwcxzIeHPn_Z1PbDfs-3qxGtC2qKDoQ1i1qFenzwH16CnGBfC6Q/s9999/1489865504.png" alt="img-alternative-text" style="max-width: 100%;"></div><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg6mXUuxzaRCB5dypxRXKBUsMTt-lRzg-PQhQkIAZLBN5njg1OiEJ8O4KruQzF8h4JXeVnn9dDft4As8hcgle96r8RyOhtyfmRHk8h9IBPuy00zbT9kYo6DreqAg9pvpTSLfni-4i4beU/s9999/1489865354.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Like usual, all the source code can be found <a href="https://github.com/mraad/memsql-arcgis">here</a></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-16088949861380486832017-03-06T09:17:00.000-05:002017-03-06T09:17:46.405-05:00GeoBinning On IBM Bluemix Spark<p>This is a proof of concept project to enable <a href="https://pro.arcgis.com/en/pro-app/">ArcGIS Pro</a> to invoke a <a href="https://console.ng.bluemix.net/catalog/services/apache-spark">Spark</a> based geo analytics on <a 
href="https://www.ibm.com/cloud-computing/bluemix/what-is-bluemix">IBM Bluemix</a> and view the result of the analysis as features in a map.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtw5XRej6zR7pO8AZB6cVfWdlSF67G1pG2tY4JwbR6UBLBhWMUQlffb-9m8Lg8MgFrbt4cJjenlSgko0-rIzty4JOo42rHrjAaF3_ci4R0l0wG0ezrFte2WHGHiRePZl_AwQ9a5H49Qmk/s9999/1488809832.png" alt="img-alternative-text" style="max-width: 100%;"></div><p>Check out the source code <a href="https://github.com/mraad/arcgis-bluemix">here</a></p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-68464840078295861272017-03-01T13:12:00.003-05:002017-10-03T15:25:23.693-04:00Space Time Ripples<div dir="ltr" style="text-align: left;" trbidi="on">
Start by looking at <a href="http://duor9l3bemy01.cloudfront.net/" title="trips">this</a> application and <a href="http://d1fa42h05jrwsl.cloudfront.net/">that</a> one. Make sure to tilt the map by holding down the right mouse button and sliding the mouse up. Then, slide the bottom slider back and forth to see the data "ripple" through time.<br />
<div class="separator" style="clear: both; text-align: center;">
<img alt="img-alternative-text" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZw7ilIRm_tJ6ZZcKtTs98k_VSzxTQh0EdKhIfpAmJxDi0jGZ1iVjnjulk258i2IUM-jn5IRed-JpEDo3k9vdImofmoSHMdTE6uwR7RstbbO2yeQ2pGtTrM58IbSHT2410bKAnXh22AVk/s9999/1488392889.png" style="max-width: 100%;" /></div>
This type of visualization is something I have wanted to do for a long time and is now possible with the advent of the new <a href="https://developers.arcgis.com/javascript/">4.2 ArcGIS API for JavaScript</a>. The new API has "hooks" to enable a developer to invoke <a href="https://www.khronos.org/webgl/">WebGL</a> <a href="https://webglfundamentals.org/webgl/lessons/webgl-shaders-and-glsl.html">shaders</a> directly, which can render a massive amount of data very efficiently and very quickly.<br />
<div class="separator" style="clear: both; text-align: center;">
<img alt="img-alternative-text" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinIr_rayLq02ubR-22PN7iKgPa0SN2t3uD-OLiNWVele-FqG0DVDFlBh9HGs572lsz_q_MDxqra6St4QArYrdSf6mdkPT9Fb-8m_zAW3gNnCMjaflUiXpuKmC2dQkG9IIyiWoSlYzsd64/s9999/1488392903.png" style="max-width: 100%;" /></div>
The authoring of the data for the above applications is based on <a href="https://pro.arcgis.com/en/pro-app">ArcGIS Pro</a> extended with a custom ArcPy based toolbox. The tool queries features from a user selected feature class, bins the features by space and time and emits a space-time "cube" in the form of a <a href="http://dojotoolkit.org/documentation/tutorials/1.10/modules/">Dojo AMD module</a> to be loaded by a JavaScript application. The source of the feature class can be a geodatabase, a relational data store, or the new SpatialTemporal BigData store.<br />
Yes, I should have written a web service to do that, but this is my blog post and leaving that as an exercise for the reader :-)<br />
I have to admit that I am a bit selfish in building the JavaScript application in "mixing" two languages: JavaScript and <a href="https://www.typescriptlang.org/">TypeScript</a>. I wanted to try out the TypeScript <a href="https://github.com/Esri/jsapi-resources/tree/master/4.x/typescript">extension</a> to our JavaScript API, and I long for a type-safe language when building front end applications like in my olde Flex/AS3 days. It turned out that TypeScript is very cool, especially when used within <a href="https://www.jetbrains.com/help/idea/2016.3/typescript-support.html">IntelliJ</a> :-)<br />
Like usual, all the source code can be found <a href="https://github.com/mraad/space-time-ripple">here</a>, and I will be talking about it more next week at my presentation at <a href="http://www.esri.com/events/devsummit">DevSummit</a>. See some of you in Palm Springs.</div>
thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-23929943055794408712016-05-25T09:23:00.003-04:002016-05-25T09:28:39.407-04:00Snapping Points To Lines And ArcGIS Pro<p>Been wanting to post on this subject for quite some time (actually over a year) as associating a world coordinate with the proper nearby linear feature provides tremendous insight based on the fusion of their attributes. Moreover, doing that on a massive scale and quickly is even more imperative in today's BigData world, thus the usage of <a href="http://spark.apache.org/" target="_blank">Apache Spark</a>. I’ve posted a standalone <a href="https://github.com/mraad/spark-snap-points" target="_blank">implementation</a> that relies on well-documented <a href="http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html">simple math</a> and <a href="http://www-users.cs.umn.edu/~mokbel/papers/icde15a.pdf">published methodology</a> to perform searches on massive datasets in batch mode. What is exciting to me in writing this post was the viewing of the snap results in <a href="http://www.esri.com/en/software/arcgis-pro">ArcGIS Pro</a>. My lack of knowledge in extending ArcGIS Pro with downloadable Python <a href="https://docs.python.org/2/tutorial/modules.html">modules</a> contributed to the delay (and slight case of procrastination :-). 
However, with the help of a colleague, I was able to <a href="https://docs.python.org/3.6/installing/index.html">pip</a> install modules that can be imported by my custom <a href="http://pro.arcgis.com/en/pro-app/arcpy/main/arcgis-pro-arcpy-reference.htm">ArcPy</a> based toolboxes without any errors.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhysNxnkHkeI42GmLeZgND5a2AmQs2brXstQC3wiyXihWX3QAatozHiu9sAVVNmYJ6dZ0IfQiWcORE1EyP1MvGg94ldN9wh9EU4ld1nmOcDibvTA9FyH1BU9Ghblq6ykS71-zF8Vd4f3kc/" alt="img-alternative-text"></div><p>Also, since this is all based on BigData, well it has to be tested in a BigData environment. The post describes the usage of <a href="https://www.docker.com/" target="_blank">Docker</a> and the <a href="http://www.cloudera.com/documentation/enterprise/5-6-x/topics/quickstart_docker_container.html" target="_blank">Cloudera QuickStart container</a> to check the snap and the visualization. The following illustrates my development environment.</p><div class="separator" style="clear: both; text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFO1Z3IDvjErwgF4-A7G4Ns4izzujlNDBTy70SJvf1jPamRSGPCdEz7ukWJ7gpSk1RCNC6Vgi_97lFBmzMv-L7_rnF5wDbSyrEw38Yj3U3Y-zB6la3rASDHLLlkjP9U-2P2KRBR4X7fWU/" alt="img-alternative-text"></div><p>Like usual, all the source code can be found <a href="https://github.com/mraad/spark-snap-points" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-69637620360505662992016-04-10T00:38:00.004-04:002016-04-10T08:11:31.361-04:00Vector Tiles: The Third Wave<p>When it comes to web mapping, we are surfing on a third wave in our digital ocean. 
And the “collaborative processing” between the digital entities while surfing that wave is making the ride more fun, insightful and expressive.</p><p>The first web wave was back in the mid 1990s, where interactive maps in the form of html image tags relied heavily on the server and requests parameters to regenerate the image when you clicked on the edge arrows to pan and zoom. Remember <a href="https://en.wikipedia.org/wiki/MapQuest" target="_blank">MapQuest</a> and <a href="http://www.esri.com/software/arcgis/arcims" target="_blank">ArcIMS</a> ?</p><p>Then in the mid 2000s came the second wave or more like a tsunami, Google Maps. You hold down the right mouse button <strong><em>on</em></strong> the map and drag to pan, you use the scroll wheel to zoom in and out, and… when you click on the map, a bubble appears <strong><em>on</em></strong> the map showing the details of the clicked location. <strong>Disruptive</strong> ! And all was smooth, responsive and <a href="https://en.wikipedia.org/wiki/Ajax_(programming)" target="_blank">AJAX</a>y. This is when I believe that this collaborative processing concept took root and materialized itself in the web mappers’ minds. Soon after, more expressiveness was required as HTML was lacking in power and functionalities and the capitalization on browser plugins emerged to create <a href="https://en.wikipedia.org/wiki/Single-page_application" target="_blank">Single Page Applications</a>. Remember <a href="https://en.wikipedia.org/wiki/Apache_Flex" target="_blank">Flex</a> and <a href="https://en.wikipedia.org/wiki/Microsoft_Silverlight" target="_blank">Silverlight</a> ?</p><p>We are now in the mid 2010s. <a href="http://shirt.woot.com/offers/poison-apple" target="_blank">Flash is dead because he ate an “Apple”</a>. 
HTML5, CSS3 and Javascript are in full swing and though <a href="https://en.wikipedia.org/wiki/Web_Map_Tile_Service" target="_blank">Tile Services</a> are fast as the tile images are preprocessed and prepared to be displayed, they are still image based, and dynamic styling of the features in a tile is not easy. In addition, with the ubiquity of GPUs on edge devices, faster rendering for expressiveness is now possible through the elusive “collaborative processing”.</p><p>Enter <a href="https://en.wikipedia.org/wiki/Vector_tiles" target="_blank">Vector Tiles</a>. <a href="https://en.wikipedia.org/wiki/Mapbox" target="_blank">Map box</a> has defined a vector tile <a href="https://github.com/mapbox/vector-tile-spec" target="_blank">specification</a> that we at Esri have adopted it in our <a href="https://developers.arcgis.com/javascript/jsapi/vectortilelayer-amd.html" target="_blank">Javascript API</a>, and <a href="http://video.esri.com/watch/4645/vector-map-tiles" target="_blank">demonstrated</a> its versatility at the 2015 User Conference. <a href="https://twitter.com/ajturner" target="_blank">Andrew Turner</a> has a nice <a href="https://blogs.esri.com/esri/arcgis/2015/07/20/vector-tiles-preview/" target="_blank">writeup</a> about it. And found this nice in-depth <a href="http://www.diva-portal.org/smash/get/diva2:851452/FULLTEXT02" target="_blank">paper</a> that analyses the dynamic rendering of vector-based maps with WebGL.</p><p>I wanted to know more about it and I learn by doing. So I implemented two projects, a Mapbox Vector Tile <a href="https://github.com/mraad/vector-tiles" target="_blank">encoder</a> and a <a href="https://github.com/mraad/vector-tiles-boot" target="_blank">visualizer</a> as heuristic experiments to be used with the <a href="https://developers.arcgis.com/javascript/beta/" target="_blank">Esri Javascript API</a>. 
Again, these are experiments and will report on more updates.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-70815232742857759592016-03-15T08:35:00.003-04:002016-03-15T08:50:34.486-04:00ArcGIS For Server On Docker<p>“But…It works on _my_ machine !!!” How many times did you hear that ? That is exactly one of the use cases of <a href="https://www.docker.com/" target="_blank">Docker</a> for developers - Create an exact reproducible environment for each developer, even down to the hardware specification. And, that same environment can be on premise or in the cloud.<br>With the advent of <a href="http://server.arcgis.com/en/server/latest/get-started/windows/what-s-new-in-arcgis-10-4-for-server.htm" target="_blank">ArcGIS For Server 10.4</a>, I wanted to run it on my mac so I can try out some of the new features like chaining multiple SOIs.<br>I could have started a Windows based VM and gone through the GUI based setup, which is a pretty straight forward process (My friend Georges G. calls this, a PhD process, <em>P</em>ush <em>H</em>ere <em>D</em>ummy). But, I wanted to automate the whole install process in a headless way (I’m sure there is a way to do that using Windows, just I do not know how, maybe a blog post for another day)<br>Enter Docker. 
After downloading the ArcGIS For Linux tarball and the license file from <a href="http://my.esri.com" target="_blank">my.esri.com</a>, you can build a <a href="https://docs.docker.com/engine/reference/builder/" target="_blank">Dockerfile</a> that automates the whole install process in a headless way - DevOps love this - In addition, once a build is done, you can run the image on premise or in the cloud by referencing a <a href="https://docs.docker.com/machine/" target="_blank">docker-machine</a>.<br>Like usual, you can check out the whole source code on how you can do this <a href="https://github.com/mraad/docker-arcgis" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-84037623902681222482016-02-08T07:21:00.001-05:002016-02-08T07:21:34.394-05:00(Web)Mapping Elephants with Sparks<p><a href="https://en.wikipedia.org/wiki/Comma-separated_values" target="_blank">CSV</a> files (though not the most efficient format and least expressive due to meager header metadata) is one of the most ubiquitous formats to place data in BigData stores like <a href="https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html" target="_blank">HDFS</a>. In addition, geospatial information such as latitude and longitude is now the norm as fields in those CSV files origination from say a moving GPS based device.<br>A constant request that I receive all the time is “How do I visualize on the web all these data points?” There is a legitimate concern in this question which is “How do I visualize on the web millions and millions on points?”. Well the short answer is “You Don’t!” (actually, you can…but that is blog post for another day). Though you can download a couple of million points to a web client, after a while the transfer time will be prohibitive. 
However, if you process the data on the server and send down the aggregated information to be symbolized on the client, then things become more interesting.<br>A common aggregation processing is binning, where, imagine you have a virtual fishnet and you cast that fishnet on your point space. All the points that are in the same fishnet cell are collapsed together to be represented by that cell. What you return now are the cells and their associated aggregates.<br><a href="https://github.com/mraad/hdfs-geohex" target="_blank"><br>This</a> project is a collection of Python tools using the ArcGIS System that retrieves CSV data with geospatial fields from HDFS and displays the aggregation in the form of hexagonal areas using <a href="https://www.arcgis.com/home/" target="_blank">ArcGIS online</a> web maps and web apps. The processing is done in Python using <a href="http://spark.apache.org/" target="_blank">Apache Spark</a>.</p><p>The ArcGIS System is a sequential composition of:<br></p><ul><li><a href="http://www.esri.com/software/arcgis/arcgis-for-desktop" target="_blank">Desktop</a> with Python based GeoProcessing extensions for authoring.<br></li><li><a href="http://www.esri.com/software/arcgis/arcgisserver" target="_blank">Server</a> with GeoProcessing endpoints for publishing.<br></li><li><a href="https://www.arcgis.com/home/" target="_blank">Online</a> with <a href="http://doc.arcgis.com/en/arcgis-online/reference/what-is-web-map.htm" target="_blank">WebMaps</a> and WebApps built using <a href="http://doc.arcgis.com/en/web-appbuilder/" target="_blank">AppBuilder</a> for presenting.</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_Pp0fE80-my5fjFDggScgzGXBD8qbkdDYVnZypiOf0X6aC3_9iQQ22OPCHS-U0u8wAH-rZOfXkNGJ8njfrOzPR6tlZxaPEYFWPWjcYnN0am3i2tlu-JkeDVQw98hEVfeNbd9xZVzUEOY/" target="_blank" style="margin-left: auto; margin-right: auto;"><img 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB_LjnQXHxFbk8fthN0x2-b9D1i709FWfUkAXFjZ3hIFb8BZ4-NsQMsHmtT-RxeXyLUo3zxH_wuiCwAn-HsT81G3seOCIvj_JrQP0whLNLGbHzXefPtROh9H_XnJY7TjGWZ7ZaZad17JY/"></a></div><p>Like usual, all the source code is <a href="https://github.com/mraad/hdfs-geohex" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-18157932584446855892016-01-30T13:12:00.001-05:002016-01-31T13:48:17.972-05:00DBSCAN on Spark<div dir="ltr" style="text-align: left;" trbidi="on">
The applications of <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> clustering straddle various domains including machine learning, anomaly detection and feature learning. But my favorite part about it, is that you do not have to specify apriori the number of clusters to classify the input data. You specify a neighborhood distance and the minimum numbers of points to form a cluster and it will return back a set of clusters with the associated points in the cluster that meet the input parameters.<br />
However, DBSCAN can consume a lot of memory when the input is very large. And since I do BigData, my data inputs will overwhelm my MacBook Pro very quickly. Since I know <a href="https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Hadoop MapReduce</a> fairly well, and MR has been around for quite some time, I decided to see how other folks implemented such a solution in a distributed share nothing environment. I came across <a href="https://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf">this</a> paper, which was very inspiring and found out that <a href="https://github.com/irvingc/dbscan-on-spark">IrvingC</a> used it too as a reference implementation. So I decided to implement my own DBSCAN on <a href="http://spark.apache.org/">Spark</a> as a way to further my education in <a href="http://www.scala-lang.org/">Scala</a>. And boy did I learn a lot when it comes to immutable data structures, <a href="http://alvinalexander.com/scala/scala-type-examples-type-aliases-members">type aliasing</a> and <a href="http://oldfashionedsoftware.com/2009/07/10/scala-code-review-foldleft-and-foldright/">collection folding</a>. BTW, I highly recommend the <a href="https://twitter.github.io/scala_school/">Twitter Scala School</a>.<br />
Like usual, all the source code can be found <a href="https://github.com/mraad/dbscan-spark">here</a>, and make sure to check out the “<a href="https://github.com/mraad/dbscan-spark#how-does-it-work-">How It Works?</a>” section.<br />
<br />
[Update] After posting - I saw <a href="https://www.oreilly.com/ideas/clustering-geolocated-data-using-spark-and-dbscan">this</a> post - very nice video too!</div>
thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com0tag:blogger.com,1999:blog-4442573390356643522.post-59483727779194234572016-01-04T15:56:00.005-05:002016-01-04T16:38:32.905-05:00Spark Library To Read File Geodatabase<p>Happy 2016 all. Yes it has been a while and thanks for your patience. Like usual, at the beginning of every year, there are the promises to eat less, exercise more, climb <a href="https://en.wikipedia.org/wiki/Mont_Ventoux" target="_blank">Ventoux</a> and blog more. Was listening to <a href="http://freakonomics.com/2015/12/31/when-willpower-isnt-enough-a-freakonomics-radio-rebroadcast/" target="_blank">Feakonomic (When Willpower Isn’t Enough)</a>, and this initial post of the year is to harness the power of a fresh start.</p><p>Esri has been advocating for a while to use <a href="http://www.esri.com/news/arcuser/0309/files/9reasons.pdf" target="_blank">FileGeodatabase</a>, and actually released a <a href="http://www.esri.com/apps/products/download/index.cfm?fuseaction=#File_Geodatabase_API_1.3" target="_blank">C++ based API</a> to perform read-only operations on it. However, the read has to be performed off a local file system and the read is single threaded (you could write an abstract layer on top of the API to perform a parallel partitioned read if you have the time).</p><p>In my BigData uses cases, I need to place the GDB files in <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html" target="_blank">HDFS</a> so I can perform <a href="http://spark.apache.org/" target="_blank">Spark</a> based GeoAnalytics. 
Well, that made the usage of the C++ API difficult (as it is not using the Hadoop File System API) and will have to map the Spark API to a native API and will have to publish the DLL and…(well you can imagine the pain) - I attempted this in my <a href="https://github.com/mraad/ibn-battuta" target="_blank">Ibn Battuta Project</a> where I relied on the GeoTools implementation of the <a href="https://github.com/geotools/geotools/tree/master/modules/plugin/ogr/ogr-bridj" target="_blank">FileGeodatabase</a>, but was not too happy with it.</p><p>I asked the core team if they will have a pure Java implementation of the API, but they told me it was low on their list of priorities. Googling around, I found <a href="https://github.com/rouault/dump_gdbtable/wiki/FGDB-Spec" target="_blank">somebody</a> that published a reversed engineered specification. My co-worker Danny H. took an initial stab at the implementation and over the holidays, I took over targeting the Spark API and the <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html" target="_blank">DataFrames</a> API. The implementation will enable me to do something like:<br></p><pre><code>sc.gdbFile("hdfs:///data/Test.gdb", "Points", numPartitions = 2).map(row => { row.getAs[Geometry](row.fieldIndex("Shape")).buffer(1)}).foreach(println)</code></pre><p>and in SQL:</p><pre>val df = sqlContext.read.<br> format("com.esri.gdb").<br> option("path", "hdfs:///data/Test.gdb").<br> option("name", "Lines").<br> option("numPartitions", "2").<br> load()<br>df.registerTempTable("lines")<br>sqlContext.sql("select * from lines").show()<br></pre><p>Pretty cool, no ? 
Like usual all the source code can be found <a href="https://github.com/mraad/spark-gdb" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-30518706728317439172015-08-21T23:06:00.004-04:002015-08-22T08:22:49.200-04:00A Whale and a Python GeoSearching on a Photon Wave<div dir="ltr" style="text-align: left;" trbidi="on">
In the last post, we <a href="https://github.com/mraad/bulk-geo-es" target="_blank">walked through</a> how to setup <a href="https://www.elastic.co/products/elasticsearch" target="_blank">Elasticsearch</a> in a <a href="https://www.docker.com/" target="_blank">Docker</a> container and how to bulk load the content of an <a href="http://resources.arcgis.com/en/help/main/10.1/index.html#//003n00000005000000" target="_blank">ArcGIS feature class</a> into ES, in such that it can be spatially searchable from an <a href="http://help.arcgis.com/EN/arcgisdesktop/10.0/help/index.html#//000v00000001000000" target="_blank">ArcPy</a> based tool.<br />
<br />
There was something nagging me about my mac development environment, as I was using docker in <a href="https://www.virtualbox.org/" target="_blank">VirtualBox</a> and ArcGIS Desktop on Windows in <a href="http://www.vmware.com/products/fusion" target="_blank">WMWare Fusion</a>. I wish I had one unified virtualized environment.<br />
<br />
Well, while at <a href="http://events.linuxfoundation.org/events/mesoscon" target="_blank">MesosCon</a> in Seattle, I stopped by the VMWare booth and the folks there told me about a new project named <a href="https://vmware.github.io/photon/" target="_blank">Photon™</a>. It is "a minimal Linux container host. It is designed to have a small footprint and boot extremely quickly on VMware platforms. Photon™ is intended to invite collaboration around running containerized applications in a virtualized environment.” - That was exactly what I needed, and docker is built <em>into</em> it !<br />
<br />
See, what also got me excited, was the fact that in a couple of weeks, I will be visiting a very forward thinking client that is willing to bootstrap a cluster on an on-premise WMWare based cloud with Linux for a BigData project. See, his IT department is a Windows shop and I was going to ask him to install CentOS and yum install docker and all that jazz. As you can imagine, that was going to raise some eyebrows. However, now that Photon™ is made by VMWare, it will trusted by the customer (I hope) to move forward with focusing on the BigData aspect of the project and not be dragged down with Linux installation issues.<br />
<br />
The following, is a retrofit of the walk through, but using Photon™. And the best part is….there are no changes due to docker’s universality.<br />
<br />
I’m using VMWare Fusion on mac, so I followed <a href="https://vmware.github.io/photon/assets/files/getting_started_with_photon_on_vmware_fusion.pdf" target="_blank">these</a> instructions. However, I set up Photon™ with 4 CPUs and 4 GB of RAM.<br />
<br />
Once the system was up, I logged in as root, and got the IP address that is bound to <em>eth0</em> using the <em>ifconfig</em> command.<br />
<br />
I created a folder named <em>config</em>, and populated it with the following Elasticsearch configuration files:<br />
<pre><code>$ mkdir config
$ cat << EOF > config/elasticsearch.yml
cluster.name: elasticsearch
index.number_of_shards: 1
index.number_of_replicas: 0
network.bind_host: dev
network.publish_host: dev
cluster.routing.allocation.disk.threshold_enabled: false
action.disable_delete_all_indices: true
EOF
$ cat << EOF > config/logging.yml
es.logger.level: INFO
rootLogger: ${es.logger.level}, console
logger:
action: DEBUG
com.amazonaws: WARN
appender:
console:
type: console
layout:
type: consolePattern
conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
EOF</code></pre>
<br />
Next, I started Elasticsearch in docker:<br />
<pre>
</pre>
<pre>docker run -d -p 9200:9200 -p 9300:9300 -h dev -v /root/config:/usr/share/elasticsearch/config elasticsearch</pre>
<br />
And validated that ES is up and running by opening a browser on my mac and navigated to <em>IP_ADDRESS:9200 and got:</em><br />
<pre><code>
</code></pre>
<pre><code>{
status: 200,
name:"Longshot",
cluster_name: "elasticsearch",
version: {
number: "1.7.1",
build_hash: "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
build_timestamp: "2015-07-29T09:54:16Z",
build_snapshot: false,
lucene_version: "4.10.4"
},
tagline: "You Know, for Search"
}</code></pre>
<br />
Excellent! From then on, the walk through is as previously described, but now I have one unified environment and that will be the same when in two weeks I will be on-site.<br />
<em><br /></em>
<em>Final note</em>: I set to <code>yes</code> the value of <code>PermitRootLogin</code> in the <code>/etc/ssh/sshd_config</code> file to able remote login as root into the VM from my mac iTerm. I recommend that you check out the <a href="https://github.com/vmware/photon/wiki/Frequently-Asked-Questions" target="_blank">FAQ</a>s.<br />
<br />
<i>Resources</i>: Update to <a href="http://www.vcritical.com/2015/06/quickly-update-to-docker-1-6-on-photon-by-using-tdnf/">Docker 1.6</a></div>
thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-50501892933105537392015-08-21T09:35:00.001-04:002015-08-21T09:35:58.108-04:00Bulk Load Features from ArcGIS Into Elasticsearch<p>I really like <a href="https://www.elastic.co/products/elasticsearch" target="_blank">Elasticsearch</a> because it natively supports geo spatial types and queries. I just added to <a href="https://github.com/mraad/bulk-geo-es" target="_blank">gitbub</a> a ArcPy based toolbox to bulk load the content of a feature class into an ES index/type.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguRkqiU8McNnkrQmqLffK4TDmKx8tNi9hxulXYqc704SUD5N3mWM0UxiHcvTmcwCKvO1td_rDGxkPz6_upozU9738qyVUGSjtG-815zPUk8_7c4njIUaOcaRj4FK8RN264CBkYOVED1kM/" target="_blank" style="margin-left: auto; margin-right: auto;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1wSThXzA-QJpxry_CxKRmXHNaeFgep3Cq0DHrt-43VMFzWFmffDY-q81fD3teHFFa3vDRe4wpJtTdGj934ngb2kMyODuD6sy4jvWeuc3iXk_S0EblPFhn1EZV7TtLopK5WKb2t3A-m34/"></a></div><p>The toolbox contains yet another tool as a proof-of-concept to spatially query the loaded document.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_9igy6bl-YYPzupNyJKjbi7bZDCrDGKLSIQu4JTHdLla8ZpKKq381l60y5bMZ0_3DzjELnlSByMz05ZMqsXlVc7tQYJtSJBcAM7DVnInk9p1A8y1Fe2dEF9wlVCCuq36maeyfhTAxDhc/" target="_blank" style="margin-left: auto; margin-right: auto;"><img 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI_tTrpgCY20LE1BruveLt9GnAC9j50R0KCbWwVfJW5QEpX3vUtd3VzgrZGrxCUZ70dooSroxWyV-IP7cnVdUTSOWc5_nm47xtqMeNbRWt0IMJo_oGS-9PqGHx26muxPk8DsnH5U2QJyM/"></a></div>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.comtag:blogger.com,1999:blog-4442573390356643522.post-72788038439615222782015-08-13T20:38:00.001-04:002015-08-13T20:38:11.897-04:00BigData Point-In-Polygon GeoEnrichment<p>I’m always handed a huge set of CSV records with lat/lon coordinates, and the task at hand is to spatially join these records with a huge set of feature polygons where the output is a geoenhancement of the orignal points with the intersecting polygon’s attributes. An example is a set of retailer customer locations that need to be spatially intersected with demographic polygons for targeted advertisement (Sorry to send you all more junk mail :-).</p><p><a href="https://github.com/mraad/spark-pip" target="_blank">This</a> is a reference implementation, where both the points data and the polygon data are stored in raw text TSV format and the polygon geometries are in <a href="https://en.wikipedia.org/wiki/Well-known_text" target="_blank">WKT</a> format. 
Not the most efficient format, but at least the input is splittable for massive parallelization.</p><p>The feature class polygons can be converted to WKT using <a href="https://gist.github.com/mraad/ee63a7c787d42ca6a1e8" target="_blank">this</a> ArcPy tool.</p><p>This <a href="http://spark.apache.org/" target="_blank">Spark</a> based job can be executed in local mode, or better in <a href="https://hub.docker.com/r/mraad/hdfs-yarn-spark/" target="_blank">this</a> docker container.</p><p>One of these days will have to re-implement the reading of the polygons from a binary source such as <a href="https://en.wikipedia.org/wiki/Shapefile" target="_blank">shapefiles</a> or <a href="http://www.esri.com/news/arcuser/0309/files/9reasons.pdf" target="_blank">file geodatabases</a>. Until then, you can download all the source code from <a href="https://github.com/mraad/spark-pip" target="_blank">here</a>.</p>thunderheadhttp://www.blogger.com/profile/09200852299600047243noreply@blogger.com