Auto Tuning of Flink Jobs: A Machine Learning Approach

Abstract

Stream processing evaluates continuous queries over unbounded data in motion. Streaming jobs are long-running applications that typically keep running until they are undeployed. Because they run for so long, such jobs commonly need parameter tuning to meet performance objectives. Two major objectives are latency, the time a data item takes from entering the system until it is completely processed, and throughput, the number of items processed per unit of time. During the lifetime of a streaming job, several parameters can be adjusted to keep these objectives fulfilled.

Apache Flink is an open-source distributed stream processing engine. When a job, that is, a stream processing application (query), is deployed on Flink, several parameters can be tuned. However, given the large number of parameters and their interdependence, tuning them manually is not feasible.

In [1], the authors present an algorithmic approach to tuning Apache Storm topology parameters. The approach samples values from the domains of the individual parameters, applies the sampled configuration, measures the change in performance, and repeats until no further performance gain is possible or the tuning budget is consumed.
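
The sample, apply, measure, repeat loop described above can be sketched as follows. This is a simplified random-search illustration, not the algorithm of [1]: the parameter names, their domains, the `patience` stopping rule, and the synthetic `measure_throughput` function are all assumptions made for the example. In practice the measurement step would deploy the configuration and observe the running job.

```python
import random

# Hypothetical parameter domains; the names are illustrative and are not
# real Storm or Flink configuration keys.
PARAM_DOMAINS = {
    "parallelism": [1, 2, 4, 8, 16],
    "buffer_timeout_ms": [0, 10, 50, 100],
}


def measure_throughput(config):
    """Synthetic stand-in for applying a configuration and measuring the job."""
    return config["parallelism"] * 100 - abs(config["buffer_timeout_ms"] - 50)


def tune(domains, measure, budget=20, patience=5, seed=0):
    """Sample configurations until the budget is spent or gains stop.

    budget   -- maximum number of configurations to try
    patience -- consecutive non-improving trials tolerated before stopping
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    stale = 0
    trials = 0
    for _ in range(budget):
        # Sample one value per parameter from its domain.
        config = {name: rng.choice(values) for name, values in domains.items()}
        score = measure(config)
        trials += 1
        if score > best_score:
            best_config, best_score = config, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # no further gain observed recently
    return best_config, best_score, trials


best_config, best_score, trials = tune(PARAM_DOMAINS, measure_throughput)
print(best_config, best_score, trials)
```

Each trial here costs one call to the measurement function, which is why the budget matters: on a real cluster a single measurement may take minutes.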

The objective of this thesis is to apply machine learning approaches to train a model that correlates performance metrics, configuration parameters, and characteristics of the input data, e.g. arrival rate and number of data elements per partition [2]. Training the model shall be a continuous process: the chosen configuration parameter values are applied and monitored, and the resulting performance metrics are fed back to retrain and improve the model.
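
A minimal sketch of such a continuous training loop is given below, under strong simplifying assumptions. The linear model, the two features (a normalized arrival rate and a hypothetical `parallelism_fraction`), and the synthetic `observe_latency` function are illustrative choices only and are not prescribed by this proposal. Configurations are picked at random to keep the example short, whereas a real tuner would use the model's predictions to choose them.

```python
import random

# Hypothetical candidate settings: fraction of the maximum parallelism.
CANDIDATES = [0.25, 0.5, 0.75, 1.0]


class OnlineLatencyModel:
    """Linear latency predictor updated one observation at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, observed):
        # One gradient step on the squared prediction error.
        error = self.predict(features) - observed
        self.bias -= self.learning_rate * error
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        return error


def observe_latency(arrival_rate, parallelism_fraction):
    """Synthetic stand-in for monitoring the latency of a running job."""
    return 1.0 + 2.0 * arrival_rate - 1.5 * parallelism_fraction


def run_feedback_loop(model, steps=5000, seed=0):
    """Apply a configuration, observe latency, and feed it back to the model."""
    rng = random.Random(seed)
    for _ in range(steps):
        arrival_rate = rng.random()                      # input characteristic
        parallelism_fraction = rng.choice(CANDIDATES)    # applied configuration
        latency = observe_latency(arrival_rate, parallelism_fraction)
        model.update([arrival_rate, parallelism_fraction], latency)


def recommend(model, arrival_rate):
    """Pick the candidate with the lowest predicted latency."""
    return min(CANDIDATES, key=lambda p: model.predict([arrival_rate, p]))


model = OnlineLatencyModel(n_features=2)
run_feedback_loop(model)
print(recommend(model, arrival_rate=0.5))
```

The incremental update is what makes the process continuous: each new observation refines the model without retraining from scratch.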

References

[1] Bilal, Muhammad, and Marco Canini. "Towards automatic parameter tuning of stream processing systems." Proceedings of the 2017 Symposium on Cloud Computing. ACM, 2017.
[2] Van Aken, Dana, et al. "Automatic database management system tuning through large-scale machine learning." Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 2017.