Saturday, February 3, 2024

On Using AutoML for NYC Taxi Trip Duration Prediction

While explaining Optuna to a client in the context of hyperparameter tuning, and performing more research on the topic, I came across AutoGluon to perform "AutoML for images, text, and tabular data". After a quick scan of the documentation, I decided to give it a try and see how it performs on a simple project.

I always loved the Kaggle competition NYC Taxi Trip Duration, as the data has spatial, temporal, and other traditional attribute information and is a great dataset to test various models and feature engineering techniques.

I used a local Apache Spark instance (as it is my go-to ETL engine) to perform some feature engineering before letting AutoGluon do its magic. As a quick proof of concept, the results are quite impressive, and here are the steps to reproduce the project and visualize the result.

Sunday, January 28, 2024

Arabic SDK For Apache Spark

I recently attended the Esri Saudi Arabia User Conference and was amazed by the changes in the Kingdom. The capital city of Riyadh is booming and proliferating. During the conference, I presented on integrating GenerativeAI and GIS in the plenary session and led a session on BigData and GeoAnalytics Engine. GeoAnalytics Engine, based on Apache Spark, allows spatial operations on Spark data frames. We showcased a project called "A Day in the Life," which used historical traffic data from HERE to demonstrate traffic congestion during peak hours. Traffic is notoriously bad in the city, so this was a fitting example. My colleague Mahmoud H. presented a traditional workflow process in a Jupyter Notebook off a Google Cloud DataProc cluster, efficiently processing over 300 million records (this is relatively "small"). The processed traffic information was then displayed in ArcGIS Pro in a time-aware layer to reflect the congestion visually while activating a time slider. At the end of the presentation, we surprised the audience by using ChatGPT to translate Arabic sentences to SparkSQL code, and Azure OpenAI GPT4 handled the translation very well. Look here for code snippets. This form of interaction IS the future, and I am excited to invest more in this technology and in the following areas:

  1. Enhanced Visualization and Real-time Data Integration:
    • Dynamic Visualization: Integrating real-time traffic data feeds into existing models. This will not only show historical congestion but also provide live updates. Dynamic heatmaps can be particularly effective in visualizing the intensity of traffic at different times.
    • 3D Modeling: Utilize ArcGIS's 3D scene capabilities to give a more immersive view of traffic congestion and urban planning scenarios.
  1. Improved Data Analysis through Machine Learning:
    • Predictive Analytics: Integrate machine learning models to predict future traffic patterns based on historical data, weather conditions, events, and other variables.
    • Anomaly Detection: Implement anomaly detection algorithms to identify unusual traffic patterns, which can be crucial for incident response and urban planning.
  1. Enhancing User Interaction and Accessibility:
    • Multilingual Support: While we showcased the translation of Arabic sentences to SparkSQL code, we should consider expanding this feature to include more languages, making your tool more accessible to a global audience.
    • Voice Commands and Chatbots: Integrate voice command functionality and develop a chatbot using Azure OpenAI GPT4 for querying and controlling the GeoAnalytics Engine, making the system more interactive and user-friendly.
  1. Scalability and Performance Optimization:
    • Optimization for Large Datasets: Continue to refine the efficiency of processing large datasets. Explore the latest advancements in distributed computing and in-memory processing to handle even larger datasets more efficiently.
    • Cloud Integration: Ensure the solutions are cloud-agnostic and can be deployed on any public or private cloud provider, enhancing the system's scalability and reliability.
  1. Collaboration and Sharing:
    • Collaborative Features: Develop features that allow multiple users to work on the same project simultaneously, including version control and change tracking for shared projects.
    • Export and Sharing Options: Enhance the ability to export results and visualizations in various formats and share them across different platforms, facilitating easier collaboration and reporting.
  1. Ethical Considerations and Transparency:
    • Data Privacy: Address data privacy concerns by implementing robust data encryption and anonymization techniques, ensuring that individual privacy is respected while analyzing traffic patterns.
    • Algorithm Transparency: Provide clear documentation and explanations of the algorithms used, promoting transparency and trust in your system.

Saturday, January 27, 2024

Back in Action: GenAI Meets GeoSpatial

 Hello, everyone. It has been a while since my last post, and I wanted to explain my absence. I have been working on demanding client projects requiring confidentiality, so I couldn't share anything.

But now I'm back and excited to dive into something new and exciting. 

Generative AI (GenAI) has gained much attention lately, but I'm taking it to a different level by merging Large Language Models (LLMs) with insights from geospatial analysis. It's GenAI with a GeoSpatial twist.

I want to introduce a simple project, "ReAct geospatial logic with Ollama," which uses resources such as the Python Langchain and Ollama.

I'm thrilled to be back and can't wait to start this new journey with you. Keep an eye on this space for future updates, tips, and unique code snippets. Your feedback and questions are valuable, so please don't hesitate to reach out.

I'll see you in the next post, and as usual, you can check out the source code here.