Tuesday, March 29, 2011

Hacking the Kinect with Flash in a Mapping Application

At this year's DevSummit, I did a couple of demo theater presentations. One of them was about hacking the Microsoft Kinect using the Flash platform in a mapping application. Here is a video.
Kinect is a very successful and important product for Microsoft. And if you have ever played with one on an Xbox, you will understand why it is such a neat piece of technology.
Now, have you seen the movie "Minority Report"? Remember, near the beginning of the movie, when Tom Cruise steps up to a console and starts waving his hands to manipulate images (video)? One of the gestures that most fascinated me was the one where he twisted his wrist while his fingers were pretending to grab a baseball-sized object. That twisting gesture moved a sequence of images through time. Twist to the right and the sequence moves forward in time; twist to the left and the sequence moves back in time. In the movie, he goes back and forth to detect a pattern in that time sequence. Now, wouldn't it be cool if we could do the same with Flash and a Kinect?
A couple of months back, I was working on a project that visualized flash flood levels from Hurricane Hermine over Austin, TX. The data is temporal and is localized to the nodes of a virtual grid on top of Austin, with a cell size of about one and a half kilometers. Using a Flex interface to the Kinect, I wanted to detect rotation gestures performed by my hands and use them to animate the flash flood levels back and forth over a time period.
My flood data was stored in a DBF file. Using the most excellent DBF library from Edwin van Rijkom, I was able to parse and load the data into memory. This data is about 71 megabytes, as it spans an area of about 25 by 25 kilometers and covers about 4 days' worth of hourly information. All of that data is loaded into memory and bucketed by space and time, such that for each hour, I can instantly look up the flood level value at any location.
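The bucketing idea above can be sketched in a few lines. This is an illustrative Python sketch (not the ActionScript used in the app), with a hypothetical record layout of (hour, grid row, grid column, flood level):

```python
from collections import defaultdict

# Hypothetical record layout: (hour_index, grid_row, grid_col, flood_level).
records = [
    (0, 10, 12, 0.4),
    (0, 10, 13, 0.9),
    (1, 10, 12, 1.7),
]

# Bucket by hour, then by grid cell, so each (hour, location) lookup is O(1).
buckets = defaultdict(dict)
for hour, row, col, level in records:
    buckets[hour][(row, col)] = level

def flood_level(hour, row, col, default=0.0):
    """Instant lookup of the flood level at a grid node for a given hour."""
    return buckets.get(hour, {}).get((row, col), default)
```

With the data keyed by hour first, stepping an hour index backward or forward only touches one bucket per frame.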
Next, to make this spatial data morph over time visually and efficiently using the ArcGIS API for Flex, I created my own custom temporal layer. This custom layer relies heavily on the bitmap capabilities of the Flash Player. At application startup, I create a base bitmap that is proportional to the map's pixel width and height. Then, at each frame refresh during the life cycle of the application, I advance or rewind an hour index value. For a given hour index, I can look up all the nodes and their values and convert them to small rectangular bitmaps whose pixel width and height are adjusted proportionally to the map scale, based on the roughly 1.5-kilometer cell size. Each bitmap is filled with a color picked from a color ramp proportional to the range of the loaded flood level values: blue on the lower end, red on the upper end. Each bitmap is bit-blitted onto the base bitmap, which is then bit-blitted onto the Flash Player display. By repeating this process over time, each location varies its color, giving the illusion of motion as a color traverses from one node to another. Cool, eh?
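The blue-to-red ramp described above amounts to a linear interpolation between two colors. Here is a minimal sketch in Python (function name and channel math are my own, not from the app):

```python
def ramp_color(value, lo, hi):
    """Map a flood level in [lo, hi] to an RGB triple on a blue-to-red ramp.
    Low values come out blue, high values red; t interpolates the channels."""
    t = max(0.0, min(1.0, (value - lo) / (hi - lo)))  # clamp to [0, 1]
    r = int(255 * t)        # red rises with the value
    b = int(255 * (1 - t))  # blue fades as the value rises
    return (r, 0, b)
```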
In my MVC implementation of the application, the hour index was a bindable property in my model, and my temporal layer was bound to this property in such a way that whenever the property changed, the layer reflected the change by bit-blitting the node data stored in the model.
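Flex's [Bindable] metadata does this wiring for you; stripped of the framework, the pattern is just an observed property. A minimal sketch in Python (class and method names are mine):

```python
class Model:
    """Minimal observer sketch of a 'bindable' hour index property."""
    def __init__(self):
        self._hour_index = 0
        self._listeners = []

    def bind(self, listener):
        """Register a callback fired whenever hour_index changes."""
        self._listeners.append(listener)

    @property
    def hour_index(self):
        return self._hour_index

    @hour_index.setter
    def hour_index(self, value):
        self._hour_index = value
        for listener in self._listeners:
            listener(value)  # e.g. the temporal layer's redraw
```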
Now, that was the easy part of the implementation. The difficult part was figuring out how to hook the Kinect to my Mac and consume Kinect depth frames from Flex to detect gestures that would be translated into positive or negative changes to my model's hour index property (remember, I modify that value and the layer reflects the change). Googling around, I came across the OpenKinect project. In the Wrappers section, I was delighted to find that somebody had already developed an ActionScript implementation, and from the looks of it, I thought I was almost there. The AS3Kinect client implementation opens a persistent socket connection from Flash to a daemon process written in C that is linked at compile time with the libfreenect library. This daemon process reads the Kinect depth frames as a byte stream from the USB port and forwards that stream of bytes through the open socket to the Flash application. Once on the client side, and again taking advantage of the bitmap capabilities, blobs can be detected. A blob is a small region of the bitmap with the same color. See, the Kinect sends depth information as bitmap frames. If you extend your hands in front of you and make your palms face the Kinect, your palms and your body are at different depths relative to the Kinect device. By defining a virtual front plane and back plane, the depth data can be filtered and converted to either white or black color-encoded bytes on a bitmap: white for the bytes in front, black for the bytes in back. A continuous patch of white bytes (a palm) can be converted to a blob. A single blob's movement (one palm) can be translated into gestures like a swipe up, down, left, or right. And the movement of multiple blobs (your two palms) can be translated, for example, into a rotation when the blobs swirl around a point, or a scale up or down when the blobs are pulled apart or brought together diagonally.
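The depth-plane filtering and blob detection described above can be sketched language-agnostically. Below is a small Python illustration (the real implementations work on BitmapData; the near/far thresholds and the 4-connected flood fill here are my own simplifications):

```python
def binarize_depth(frame, near, far):
    """Map each depth sample to white (255) if it lies between the virtual
    front and back planes (your palms), black (0) otherwise."""
    return [[255 if near <= d <= far else 0 for d in row] for row in frame]

def count_blobs(mask):
    """Count 4-connected patches of white pixels: a crude blob detector."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 255 and not seen[r][c]:
                blobs += 1
                stack = [(r, c)]  # iterative flood fill over this patch
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < rows and 0 <= x < cols \
                            and mask[y][x] == 255 and not seen[y][x]:
                        seen[y][x] = True
                        stack.extend([(y + 1, x), (y - 1, x),
                                      (y, x + 1), (y, x - 1)])
    return blobs
```

Two white patches in the mask, one per palm, is exactly the state the rotation gesture looks for.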
I wrote a simple program to test the transfer of the Kinect bytes through that daemon proxy to my Flex application, and it was sluggish and unresponsive, even though the supplied test program, rgbdemo (written in pure C and using OpenGL), worked flawlessly. Now, the AS3Kinect forum said that all should be fine, as they had tested it on their PCs. It was that last word that prompted me to test it on my Windows machine, and on that machine, it worked! I did a little bit of investigation, and that led me to write this blog post. Summarizing the post: the problem is that there exists a 64 KB chunking limit on sockets in the Flash Player on Mac OS, and I needed to process 2 MB chunks (the size of a Kinect depth frame). That 64 KB limit throttled the data stream way back, resulting in a slow-to-respond application :-( BTW, in the post, I did not want to give away too much of what I was working on, as I was preparing a surprise demo for the GISWORX'11 plenary session.
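To make the problem concrete: a fixed-size frame arriving over a socket in small chunks has to be reassembled before it can be processed, and at 64 KB per read, a 2 MB frame takes dozens of reads. A sketch of the reassembly logic in Python (class name and small frame sizes are mine, for illustration):

```python
FRAME_SIZE = 2 * 1024 * 1024  # one Kinect depth frame, roughly 2 MB

class FrameAssembler:
    """Accumulate socket chunks (e.g. 64 KB each) until full frames arrive."""
    def __init__(self, frame_size=FRAME_SIZE):
        self.frame_size = frame_size
        self.buffer = bytearray()

    def feed(self, chunk):
        """Append a chunk; return any complete frames carved off the buffer."""
        self.buffer.extend(chunk)
        frames = []
        while len(self.buffer) >= self.frame_size:
            frames.append(bytes(self.buffer[:self.frame_size]))
            del self.buffer[:self.frame_size]
        return frames
```

When the per-read ceiling is far smaller than the frame, this loop runs many times per frame, which is exactly where the Mac OS Flash Player fell behind.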
I tried to circumvent the chunking problem by using Alchemy and LocalConnection, but that too had its chunking limits, and I came to the realization that blob detection and gesture recognition had to occur down in the proxy and be passed along as small packets of information to the client. AS3Kinect had yet another project, based on the OpenNI specification, that did exactly what I needed, but that implementation was based on the Windows API, and I have a Mac. I was very disappointed and started to panic with GISWORX'11 looming so closely. Googling around some more, I came across another natural interface specification, TUIO. The TUIO home page mentioned a Kinect implementation that runs on Mac OS. In addition, somebody had implemented an AS3 interface to the TUIO protocol. The process is the same as with AS3Kinect: a TUIO server (TUIOKinect.app) is started, which reads the USB data stream and converts the chunks into frames, and each frame is analyzed to detect blobs. All of that happens on the server. The detected blobs and their trajectories are converted into gestures. The gestures are encoded into byte arrays per the TUIO specification and broadcast as UDP packets. The AS3 TUIO protocol implementation reads each UDP packet, decodes it, and dispatches it as an AS3 event. To test the implementation, a simple sample application was provided to rotate and translate any display object whose mouseEnabled property is set to true. The sample application worked beautifully. Now, it was time to hook the TUIO AS3 implementation into my mapping application.
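The win here is bandwidth: instead of streaming 2 MB depth frames to the client, the server sends tiny gesture packets. The sketch below uses a made-up 10-byte packet layout purely to illustrate that idea; the actual TUIO wire format is OSC-based and quite different:

```python
import struct

# Hypothetical 10-byte gesture packet: event type, blob id, x, y.
# The real TUIO wire format is OSC-based; this only illustrates why small
# gesture packets beat streaming raw 2 MB depth frames to the client.
FMT = "!BBff"  # network byte order: uint8 type, uint8 blob id, two floats

ROTATE = 1

def encode(event_type, blob_id, x, y):
    """Pack a gesture event into a compact binary packet."""
    return struct.pack(FMT, event_type, blob_id, x, y)

def decode(packet):
    """Unpack a gesture packet back into (type, blob_id, x, y)."""
    return struct.unpack(FMT, packet)
```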
UDP packets can only be consumed by an AIR application, so first, I had to convert my web application to an AIR application. I used the AIR Launchpad application to generate the base template, and because my web application was based on MVC, porting the model, the controls, and some of the views was simple. The main application had to change from an Application subclass to a WindowedApplication subclass. In the application's creation complete event handler, a TUIOClient is instantiated with a UDP connector, as is a gesture manager with a stage reference and a rotation gesture listener. The gesture manager, having a reference to the stage, watches for any added display object with its mouseEnabled property set. These display objects can become gesture listeners. The custom layer, being mouse enabled and a child of the stage, is a candidate listener for rotation gesture events. Now, the Kinect is very responsive, and TUIOKinect will blast the application with rotation events. To smooth this fast sequence of rotation values, which could have spikes in the stream, I implemented a digital low-pass filter, giving me smooth rotation values as my extended palms perform rotation gestures. A rotation to the right gives me a positive angle, which translates to incrementing the hour index value in my model, which, through binding, automagically refreshes the layer to show the flood level values at each node at that time instance. A rotation to the left does the opposite. Works like a charm, and this was a huge success at the GISWORX'11 plenary. Modest, ain't I?
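A simple way to build such a digital low-pass filter is exponential smoothing, where each new sample only nudges the running value by a fraction alpha. A Python sketch of that idea (my own minimal version, not the app's filter; the alpha value is a tuning knob):

```python
class LowPassFilter:
    """Exponential smoothing of noisy rotation angles; alpha in (0, 1].
    Smaller alpha smooths harder but lags more behind the gesture."""
    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None  # no state until the first sample arrives

    def update(self, sample):
        if self.value is None:
            self.value = sample
        else:
            # Move a fraction alpha of the way toward the new sample,
            # so a single spiky reading barely shifts the output.
            self.value += self.alpha * (sample - self.value)
        return self.value
```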
If you have a Kinect and want to try it on your Mac, download the AIR application from here. As usual, the source is available for you to check out what I have done. Just right-click the date label of the running application, and it will give you the option to view and download the source code.
Happy Kinecting.