Implementing Distributed Versions of Stream Clustering Algorithms in SAMOA-Flink

Abstract

Clustering is an unsupervised machine learning task in which a data set is divided into disjoint subsets (clusters) such that the elements of each cluster are similar to each other and dissimilar to the elements of other clusters. In a streaming context, where data keep arriving indefinitely and the data set is unbounded, each data element can be seen exactly once. Thus, traditional clustering algorithms that require several passes over the data cannot be used.

A number of stream clustering approaches have been devised with the single-pass restriction as their main requirement, e.g., CluStream [1]. Although the algorithm is designed to cluster data on the move, it assumes central processing, i.e., the stream of data is directed to a single machine where clustering takes place. With today’s data stream volumes, which can reach petabytes per day, the central processing paradigm cannot cope with the load, and the resulting high latency and failures would not be acceptable.
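
The single-pass restriction is commonly met by keeping a small additive summary per cluster instead of the points themselves. The following Python sketch illustrates that idea with a simplified cluster-feature summary (count, linear sum, squared sum) of the kind micro-cluster methods build on. It omits the temporal statistics that CluStream [1] also maintains, and the class and method names are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence


class MicroCluster:
    """Hypothetical, simplified cluster summary: count, linear sum, squared sum."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.n = 0                        # number of points absorbed so far
        self.ls: List[float] = [0.0] * dim  # per-dimension sum of values
        self.ss: List[float] = [0.0] * dim  # per-dimension sum of squared values

    def insert(self, point: Sequence[float]) -> None:
        """Absorb one point; the point itself is not stored afterwards."""
        self.n += 1
        for i, value in enumerate(point):
            self.ls[i] += value
            self.ss[i] += value * value

    def centroid(self) -> List[float]:
        """Mean of all absorbed points, derived from the summary alone."""
        return [s / self.n for s in self.ls]

    def radius(self) -> float:
        """Root-mean-square deviation of the absorbed points from the centroid."""
        variance = sum(
            self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
            for i in range(self.dim)
        )
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


# Each point of the stream is touched exactly once.
mc = MicroCluster(dim=2)
for point in [(1.0, 2.0), (3.0, 4.0)]:
    mc.insert(point)
print(mc.centroid())  # [2.0, 3.0]
```

Memory use depends only on the dimensionality, not on how many points have been seen, which is what makes such a summary usable on an unbounded stream.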

Apache SAMOA (Scalable Advanced Massive Online Analytics) [2,3] is an ambitious project incubated by Apache to bring scalability to stream analytics. Currently, SAMOA supports scalable classification over data streams. For clustering, however, only the centralized approach is supported so far.

The objective of this thesis is to bring distributed stream clustering, e.g., as presented in [4], to SAMOA, to devise distributed versions of other stream clustering approaches such as ClusTree [6], and to realize them on the stream processing engine Apache Flink [5]. SAMOA already supports mapping to different stream processing engines, among them Flink.
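
Micro-cluster style summaries (count, linear sum, squared sum) are additive: summaries built independently on different partitions of a stream can be combined by component-wise addition. The Python sketch below illustrates this property under simplifying assumptions. It is not the design from [4], nor SAMOA or Flink code, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# A summary is (count, linear sum per dimension, squared sum per dimension).
Summary = Tuple[int, List[float], List[float]]


def summarize(points: Sequence[Sequence[float]]) -> Summary:
    """Build a summary of a non-empty batch of points in a single pass."""
    dim = len(points[0])
    n = 0
    ls = [0.0] * dim
    ss = [0.0] * dim
    for point in points:
        n += 1
        for i, value in enumerate(point):
            ls[i] += value
            ss[i] += value * value
    return (n, ls, ss)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries by adding their components."""
    return (
        a[0] + b[0],
        [x + y for x, y in zip(a[1], b[1])],
        [x + y for x, y in zip(a[2], b[2])],
    )


# Two hypothetical workers each see one half of the stream.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
worker_a = summarize(points[:2])
worker_b = summarize(points[2:])

# Merging the partial summaries matches summarizing everything in one place.
print(merge(worker_a, worker_b) == summarize(points))  # True
```

Because only the compact summaries need to be exchanged, the raw points never have to be collected on a single machine.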

References

  1. Aggarwal, Charu C., et al. "A framework for clustering evolving data streams." Proceedings of the 29th international conference on Very large data bases-Volume 29. VLDB Endowment, 2003.
  2. Kourtellis, Nicolas, Gianmarco De Francisci Morales, and Albert Bifet. "Large-Scale Learning from Data Streams with Apache SAMOA." Learning from Data Streams in Evolving Environments. Springer, Cham, 2019. 177-207.
  3. SAMOA Developer Guide: https://samoa.incubator.apache.org/documentation/SAMOA-Developers-Guide-0.0.1.pdf
  4. Karunaratne, Pasan, Shanika Karunasekera, and Aaron Harwood. "Distributed stream clustering using micro-clusters on Apache Storm." Journal of Parallel and Distributed Computing 108 (2017): 74-8
  5. Flink Windowing: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/windows.html
  6. Kranen, Philipp, et al. "The ClusTree: indexing micro-clusters for anytime stream mining." Knowledge and information systems 29.2 (2011): 249-272.
  7. Flink Working with State: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/state.html#using-managed-keyed-state
  8. Supporting YouTube video about SAMOA: https://www.youtube.com/watch?v=VwpmDRC0-bQ
  9. Supporting YouTube video about SAMOA: https://www.youtube.com/watch?v=UB1DYCyJqVo
  10. Massive Online Analytics: https://moa.cms.waikato.ac.nz/