Thesis Topics

The aim of this project is to develop a novel mechanism for the automated selection and optimization of distributed machine learning models based on the characteristics of the underlying data sets.
The aim of this project is to implement an efficient execution engine for the G-Core language on top of distributed graph processing platforms.

RDF (Resource Description Framework) is the main ingredient and data representation format of Linked Data and the Semantic Web. It provides a generic graph-based data model and representation format for describing things, including their relationships with other things. In practice, the SPARQL query language has been recommended by the W3C as the standard language for querying RDF data. The size of RDF databases is growing fast, so RDF query processing engines must be able to deal with increasing amounts of data.
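
A minimal sketch (assuming the rdflib Python library) of building a small RDF graph and querying it with SPARQL; the example resources and predicates are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
g.add((EX.alice, EX.knows, EX.bob))           # triple: (subject, predicate, object)
g.add((EX.alice, EX.name, Literal("Alice")))

# SPARQL query asking who Alice knows.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person WHERE { ex:alice ex:knows ?person . }
""")
for row in results:
    print(row.person)
```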

Energy analytics is gaining attention with the wide adoption of smart meters. Such meters report energy consumption at least every 15 minutes, which amounts to 96 readings per day. The analysis of such consumption data can provide insight and help predict energy demand.
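
A minimal sketch (assuming pandas is available) of aggregating 15-minute smart-meter readings into daily consumption; the data here are synthetic:

```python
import numpy as np
import pandas as pd

# 96 readings per day at a 15-minute rate (4 per hour * 24 hours), for one week.
idx = pd.date_range("2024-01-01", periods=96 * 7, freq="15min")
readings = pd.Series(np.random.rand(len(idx)), index=idx, name="kwh")

daily = readings.resample("D").sum()   # total consumption per day
print(daily.head())
```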

With the advent of Bitcoin as a cryptocurrency, blockchain, the technology behind Bitcoin, has gained a lot of attention. The attraction is mainly driven by the tamper-proof promise of a blockchain: a distributed ledger that no single entity owns or manages, with a consensus protocol for approving transactions and immutability of written transactions. These properties open the door for countless applications.
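
A toy Python sketch of why a hash-chained ledger is tamper-evident (this illustrates only the chaining, not a consensus protocol); the transactions are made up:

```python
import hashlib
import json

def block_hash(block):
    # Hash the block's canonical JSON representation.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})
    return chain

def verify(chain):
    # Valid only if every stored prev_hash matches the recomputed hash
    # of the preceding block.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = append_block([], ["alice pays bob 5"])
chain = append_block(chain, ["bob pays carol 2"])
print(verify(chain))                                  # True
chain[0]["transactions"][0] = "alice pays bob 500"    # tamper with history
print(verify(chain))                                  # False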

Big stream processing frameworks such as Apache Flink [1,2] provide rich APIs for building real-time data analytics applications. Conceptually, unbounded streams of data flow into the system and results are computed continuously. To tackle this unbounded nature, streams are divided into chunks by means of window operators. A window can be seen as a way to take a snapshot of the stream and apply user-defined logic to its contents.
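
A minimal Python sketch of the tumbling-window idea (not Flink's actual API); the window size and the aggregation are illustrative:

```python
from itertools import islice

def tumbling_windows(stream, size):
    it = iter(stream)
    while True:
        window = list(islice(it, size))   # snapshot of the next `size` events
        if not window:
            break
        yield window

# Example: average of every window of 4 readings from an (in principle
# unbounded) stream of sensor values.
stream = iter([3, 5, 7, 9, 2, 4, 6, 8])
for w in tumbling_windows(stream, 4):
    print(sum(w) / len(w))
```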

Stream processing is a form of continuous querying over unbounded, moving data. Streaming jobs are long-running applications that typically keep running until they are undeployed. Because they are long-running, such jobs commonly need parameter tuning to meet performance objectives. Two major objectives are latency, the time a data item takes from entering the system until it is completely processed, and throughput, the number of items processed per unit of time.
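
A minimal sketch of how the two objectives can be measured from per-item entry and exit timestamps; the timestamps below are illustrative:

```python
records = [
    # (entry_time, exit_time) in seconds
    (0.00, 0.12),
    (0.05, 0.20),
    (0.10, 0.21),
]

latencies = [exit - entry for entry, exit in records]
avg_latency = sum(latencies) / len(latencies)            # seconds per item

elapsed = max(e for _, e in records) - min(s for s, _ in records)
throughput = len(records) / elapsed                      # items per second

print(f"avg latency: {avg_latency:.3f}s, throughput: {throughput:.1f} items/s")
```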

Clustering is an unsupervised machine learning task in which a data set is divided into disjoint subsets (clusters) such that each cluster’s elements are similar to each other and different from the elements of other clusters. In a streaming context, where data keep arriving indefinitely and the data set is unbounded, each data element can be seen exactly once. Thus, traditional clustering algorithms that require several passes over the data cannot be used.
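
A minimal sketch of a single-pass (online) k-means variant, one of several possible streaming clustering strategies; the centroid seeds and the data are synthetic:

```python
def online_kmeans(stream, centroids):
    counts = [0] * len(centroids)
    for x in stream:
        # Assign the element to the closest centroid (1-D for simplicity).
        j = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        counts[j] += 1
        # Move the centroid toward x with a decreasing learning rate.
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids

stream = iter([1.0, 1.2, 9.8, 0.9, 10.1, 10.3, 1.1])
print(online_kmeans(stream, centroids=[0.0, 5.0]))   # roughly [1.05, 10.07]
```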

Nowadays, traditional machine learning models are more powerful and easier to use than ever before. The main bottleneck and most time-consuming step used to be feature engineering, but now we can feed raw data to models that learn features on their own. Such models, like deep neural networks, perform very well on most tasks; the only problem is that they require vast amounts of labeled training data. Such large amounts of labeled data often do not exist or are hard to obtain in practice, and creating them is extremely expensive.

Machine learning techniques have been used in different areas such as finance, advertising, marketing, and medicine, and achieve satisfactory performance. In practice, complex machine learning models such as Random Forests, Support Vector Machines, and Neural Networks usually achieve better performance than interpretable models such as Linear Regression and Decision Trees.

Despite the growing use of machine learning-based prediction models in the medical domain [1,2], clinicians still do not trust these models in practice for many reasons. One important reason is that most of the developed models focus on predictive performance (accuracy, area under the curve) but rarely explain the prediction in a form understandable to users. Thus, most of the currently available predictive systems depend on the knowledge of domain experts [3,4].

Fantasy Premier League (https://fantasy.premierleague.com/) has become a very popular game in the soccer world, with more than 5 million users. The aim of this project is to use machine learning and deep learning techniques to build a recommender system for the weekly team formation. In order to work on this project, you need at least two seasons of experience with the game and its rules.


Data-driven interval analytics (D2IA) is an approach that enables analytics over user-defined, data-driven windows [1]. In this approach, the user defines conditions on the properties of events that continuously arrive on a stream. A condition can be absolute, i.e., refer to one or more properties of a single event, or relative, i.e., compare the event to an aggregate or to another event that has been matched before. The events for which the conditions are satisfied are grouped into an interval. Moreover, the user selects an aggregate function to be computed on those elements.
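
A minimal Python sketch of the idea (not the D2IA implementation from [1]); the event fields, the condition, and the aggregate are illustrative:

```python
def data_driven_intervals(events, condition, aggregate):
    current = []
    for e in events:
        if condition(e):
            current.append(e)          # event extends the open interval
        elif current:
            yield aggregate(current)   # condition broken: close the interval
            current = []
    if current:
        yield aggregate(current)

# Example: intervals of consecutive temperature readings above 30 degrees,
# aggregated by their average.
events = [{"temp": 25}, {"temp": 32}, {"temp": 35}, {"temp": 28}, {"temp": 31}]
for avg in data_driven_intervals(
        events,
        condition=lambda e: e["temp"] > 30,
        aggregate=lambda es: sum(e["temp"] for e in es) / len(es)):
    print(avg)    # 33.5, then 31.0
```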

Stream processing is concerned with analyzing data as they are created. Many use cases require such on-the-fly analysis. IoT applications in smart cities [1], smart homes [2], and healthcare [3] are just a few examples of such scenarios. In all of these scenarios, data originate from sensors. As most sensors communicate their readings wirelessly, there is a large potential for interference. Moreover, sensors might malfunction and start producing inaccurate readings. All of these are forms of uncertainty in the data [4,5].

Explaining the decisions of machine learning black boxes has received huge attention, especially after the EU General Data Protection Regulation (GDPR). Current interpretability techniques are roughly partitioned into two groups: saliency approaches and perturbation approaches.
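
A minimal sketch of the perturbation idea using feature permutation (one of several perturbation-style techniques); the model and the data are synthetic, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)   # feature 0 matters most

model = RandomForestClassifier(random_state=0).fit(X, y)
baseline = model.score(X, y)

for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # destroy feature j's signal
    print(f"feature {j}: importance ~ {baseline - model.score(X_perm, y):.3f}")
```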

Machine learning prediction models have been widely used in different domains; however, their lack of interpretability can limit their adoption in many critical domains. Robustness and identity are key required properties of any interpretability technique. Robustness states that similar inputs should have similar explanations. Identity states that identical inputs should have identical explanations. Robustness is very important for several reasons.
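
A minimal sketch of how the two properties could be checked for a generic, hypothetical explanation function explain(x); the toy linear model and its weight-times-value attribution are assumptions for illustration only:

```python
import numpy as np

def identity_holds(explain, x):
    # Identity: identical inputs must receive identical explanations.
    return np.allclose(explain(x), explain(x.copy()))

def robustness(explain, x, eps=1e-2):
    # Robustness: a small input perturbation should only change the
    # explanation a little (smaller values mean more robust).
    x_near = x + eps * np.random.default_rng(0).normal(size=x.shape)
    return np.linalg.norm(explain(x) - explain(x_near))

# Toy example: a linear model with a weight-times-value feature attribution,
# which is trivially identical and robust.
w = np.array([0.5, -1.0, 2.0])
explain = lambda x: w * x
x = np.array([1.0, 2.0, 3.0])
print(identity_holds(explain, x), robustness(explain, x))
```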