Saturday, January 30, 2016

DBSCAN on Spark

The applications of DBSCAN clustering straddle various domains including machine learning, anomaly detection and feature learning. But my favorite part about it is that you do not have to specify a priori the number of clusters into which the input data should be grouped. You specify a neighborhood distance and the minimum number of points needed to form a cluster, and it returns a set of clusters, each with the associated points that meet those input parameters.
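
To make those two parameters concrete, here is a minimal single-machine sketch of the textbook algorithm in Python. It is not the Spark code from this post, and the names `dbscan`, `region_query`, `eps` and `min_points` are my own picks for illustration. It assumes points are tuples of numbers and uses Euclidean distance.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, 2, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # points[i] is a core point, so it starts a new cluster
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point too, keep growing
    return labels
```

Calling `dbscan(points, 1.5, 3)` on two tight groups of three points plus one far-away straggler gives two clusters and one noise point, without ever being told how many clusters to look for. Note that this naive version compares every point against every other point, which is fine for a handful of points.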
However, DBSCAN can consume a lot of memory when the input is very large. And since I do BigData, my data inputs will overwhelm my MacBook Pro very quickly. Since I know Hadoop MapReduce fairly well, and MR has been around for quite some time, I decided to see how other folks implemented such a solution in a distributed shared-nothing environment. I came across this paper, which was very inspiring, and found out that IrvingC had used it as a reference too. So I decided to implement my own DBSCAN on Spark as a way to further my education in Scala. And boy did I learn a lot when it comes to immutable data structures, type aliasing and collection folding. BTW, I highly recommend the Twitter Scala School.
As usual, all the source code can be found here, and make sure to check out the “How It Works?” section.

[Update] After posting - I saw this post - very nice video too!
