Friday, August 21, 2015

A Whale and a Python GeoSearching on a Photon Wave

In the last post, we walked through how to set up Elasticsearch in a Docker container and how to bulk load the content of an ArcGIS feature class into ES, so that it can be spatially searched from an ArcPy-based tool.

There was something nagging me about my Mac development environment: I was running Docker in VirtualBox and ArcGIS Desktop on Windows in VMware Fusion. I wished I had one unified virtualized environment.

Well, while at MesosCon in Seattle, I stopped by the VMware booth and the folks there told me about a new project named Photon™. It is "a minimal Linux container host. It is designed to have a small footprint and boot extremely quickly on VMware platforms. Photon™ is intended to invite collaboration around running containerized applications in a virtualized environment." That was exactly what I needed, and Docker is built into it!

See, what also got me excited was the fact that in a couple of weeks I will be visiting a very forward-thinking client who is willing to bootstrap a cluster on an on-premise VMware-based cloud with Linux for a BigData project. His IT department is a Windows shop, and I was going to ask him to install CentOS, yum install docker, and all that jazz. As you can imagine, that was going to raise some eyebrows. However, now that Photon™ is made by VMware, it will (I hope) be trusted by the customer, so we can focus on the BigData aspect of the project and not get dragged down with Linux installation issues.

The following is a retrofit of that walkthrough, but using Photon™. And the best part is… there are no changes, thanks to Docker's universality.

I'm using VMware Fusion on my Mac, so I followed these instructions. However, I set up Photon™ with 4 CPUs and 4 GB of RAM.

Once the system was up, I logged in as root, and got the IP address that is bound to eth0 using the ifconfig command.

I created a folder named config, and populated it with the following Elasticsearch configuration files:
$ mkdir config

$ cat << EOF > config/elasticsearch.yml
cluster.name: elasticsearch
index.number_of_shards: 1
index.number_of_replicas: 0
network.bind_host: dev
network.publish_host: dev
cluster.routing.allocation.disk.threshold_enabled: false
action.disable_delete_all_indices: true
EOF

$ cat << EOF > config/logging.yml
es.logger.level: INFO
rootLogger: ${es.logger.level}, console
logger:
  action: DEBUG
  com.amazonaws: WARN
appender:
  console:
    type: console
    layout:
      type: consolePattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
EOF

Next, I started Elasticsearch in Docker:

docker run -d -p 9200:9200 -p 9300:9300 -h dev -v /root/config:/usr/share/elasticsearch/config elasticsearch

I validated that ES was up and running by opening a browser on my Mac and navigating to IP_ADDRESS:9200, which returned:

{
  "status": 200,
  "name": "Longshot",
  "cluster_name": "elasticsearch",
  "version": {
    "number": "1.7.1",
    "build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
    "build_timestamp": "2015-07-29T09:54:16Z",
    "build_snapshot": false,
    "lucene_version": "4.10.4"
  },
  "tagline": "You Know, for Search"
}

Excellent! From then on, the walkthrough is as previously described, but now I have one unified environment, and it will be the same environment when I am on-site in two weeks.
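If you would rather script that health check than eyeball it in a browser, here is a minimal Python sketch; it assumes the requests package is installed, and IP_ADDRESS is a placeholder for the address ifconfig reported for eth0.

# Minimal sketch: ping Elasticsearch and print a couple of fields from
# the info document. Assumes the 'requests' package is installed and
# IP_ADDRESS is replaced with the eth0 address of the Photon VM.
import requests

info = requests.get("http://IP_ADDRESS:9200", timeout=5).json()
print(info["version"]["number"])  # e.g. "1.7.1"
print(info["tagline"])            # "You Know, for Search"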

Final note: I set the value of PermitRootLogin to yes in the /etc/ssh/sshd_config file to enable remote login as root into the VM from iTerm on my Mac. I recommend that you check out the FAQs.

Resources: Update to Docker 1.6

Bulk Load Features from ArcGIS Into Elasticsearch

I really like Elasticsearch because it natively supports geospatial types and queries. I just added to GitHub an ArcPy-based toolbox to bulk load the content of a feature class into an ES index/type.
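The toolbox on GitHub is the real tool, but to give a feel for what the bulk load boils down to, here is a hedged sketch using the elasticsearch-py client against a point feature class; the index, type, field names, and paths are purely illustrative and this is not the toolbox code.

# Sketch only - see the GitHub toolbox for the real tool. Assumes the
# elasticsearch-py client and a point feature class with a NAME field.
import arcpy
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["IP_ADDRESS:9200"])

# Map the location field as a geo_point so documents are spatially searchable.
es.indices.create(index="miami", ignore=400, body={
    "mappings": {"points": {"properties": {
        "name": {"type": "string"},
        "location": {"type": "geo_point"}
    }}}
})

def actions(feature_class):
    # Stream the feature class rows as bulk index actions.
    with arcpy.da.SearchCursor(feature_class, ["SHAPE@XY", "NAME"]) as cursor:
        for (lon, lat), name in cursor:
            yield {
                "_index": "miami",
                "_type": "points",
                "_source": {"name": name, "location": {"lat": lat, "lon": lon}}
            }

helpers.bulk(es, actions("C:/Data/sample.gdb/points"))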

The toolbox contains yet another tool, a proof of concept to spatially query the loaded documents.
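Continuing the sketch above (same client, same illustrative index and field names), a geo_distance filter is one way to express such a spatial query in the Elasticsearch 1.x DSL:

# Sketch: find documents within 10 km of a lat/lon, using the 1.x
# filtered query with a geo_distance filter. Names are illustrative.
results = es.search(index="miami", doc_type="points", body={
    "query": {
        "filtered": {
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "location": {"lat": 25.77, "lon": -80.19}
                }
            }
        }
    }
})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["name"])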

Thursday, August 13, 2015

BigData Point-In-Polygon GeoEnrichment

I'm always handed a huge set of CSV records with lat/lon coordinates, and the task at hand is to spatially join these records with a huge set of polygon features, where the output is a geoenrichment of the original points with the attributes of the intersecting polygons. An example is a set of retailer customer locations that need to be spatially intersected with demographic polygons for targeted advertisement (sorry to send you all more junk mail :-).

This is a reference implementation, where both the point data and the polygon data are stored in raw text TSV format and the polygon geometries are in WKT. Not the most efficient format, but at least the input is splittable for massive parallelization.

The feature class polygons can be converted to WKT using this ArcPy tool.
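For a sense of what that conversion involves, a rough sketch is below; the paths and the attribute field are illustrative, and the linked tool is the real implementation.

# Rough sketch: export polygons and one attribute to a TSV file with
# WKT geometries. Paths and field names are illustrative.
import arcpy

with open("polygons.tsv", "w") as tsv:
    fields = ["OID@", "SHAPE@WKT", "POPULATION"]
    with arcpy.da.SearchCursor("C:/Data/demographics.gdb/blockgroups", fields) as cursor:
        for oid, shape_wkt, population in cursor:
            tsv.write("{}\t{}\t{}\n".format(oid, shape_wkt, population))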

This Spark-based job can be executed in local mode, or better, in this Docker container.
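To make the approach concrete, here is a hedged PySpark sketch of the point-in-polygon join; it assumes shapely for WKT parsing, illustrative file layouts, and a polygon set small enough to broadcast. The job in the repository may be structured differently.

# Hedged sketch of the point-in-polygon join in PySpark. Assumes shapely
# for WKT parsing and a polygon set small enough to broadcast; file
# layouts are illustrative.
from pyspark import SparkContext
from shapely import wkt
from shapely.geometry import Point

sc = SparkContext(appName="PointInPolygonSketch")

# polygons.tsv: POLY_ID <tab> WKT <tab> ATTRIBUTE
polygons = []
with open("polygons.tsv") as lines:
    for line in lines:
        poly_id, poly_wkt, attribute = line.rstrip("\n").split("\t")
        polygons.append((wkt.loads(poly_wkt), attribute))
bc_polygons = sc.broadcast(polygons)

def enrich(line):
    # points.tsv: POINT_ID <tab> LON <tab> LAT
    point_id, lon, lat = line.rstrip("\n").split("\t")
    point = Point(float(lon), float(lat))
    for polygon, attribute in bc_polygons.value:
        if polygon.contains(point):
            # Emit the original point enriched with the polygon attribute.
            return ["\t".join((point_id, lon, lat, attribute))]
    return []

sc.textFile("points.tsv").flatMap(enrich).saveAsTextFile("enriched")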

One of these days I will have to re-implement the reading of the polygons from a binary source such as shapefiles or file geodatabases. Until then, you can download all the source code from here.