Thesis Topics

The University of Tartu is running a research project with the Nordic Institute for Interoperability Solutions (NIIS) to take trust management of the X-Road data exchange layer towards full automation. X-Road is used by many organizations and communities, including the Estonian government, to secure and automate the exchange of messages between their information systems.

Multi-modal deep CEP
Architecture for deep multi-modal CEP

Deep learning has served object detection in computer vision very well [2]. However, such models struggle with identifying complex patterns that span both the spatial and the temporal dimension [2].

Machine learning prediction models have been widely used in many domains; however, their lack of interpretability can limit their adoption in critical domains. Robustness and identity are key requirements for any interpretability technique. Robustness states that similar inputs should have similar explanations. Identity states that identical inputs should have identical explanations. Robustness is particularly important because explanations that change drastically for nearly identical inputs undermine user trust.
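A minimal sketch of how these two properties could be checked empirically, assuming a generic explain() function that returns a feature-attribution vector (here illustrated with a toy linear model whose attributions are simply weight-times-feature contributions; both are placeholders, not a specific interpretability method):

```python
import numpy as np

# Toy "model": a linear scorer whose attribution for feature i is w[i] * x[i].
# Both the model and explain() stand in for a real black box and a real technique.
rng = np.random.default_rng(0)
w = rng.normal(size=5)

def explain(x):
    return w * x  # per-feature contribution vector

def similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

x = rng.normal(size=5)
x_similar = x + rng.normal(scale=0.01, size=5)  # a slightly perturbed input
x_identical = x.copy()                          # an identical input

# Robustness: similar inputs should receive similar explanations.
print("robustness similarity:", similarity(explain(x), explain(x_similar)))

# Identity: identical inputs should receive identical explanations.
print("identity holds:", np.allclose(explain(x), explain(x_identical)))
```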

Explaining machine learning black-box decisions has received considerable attention, especially after the EU General Data Protection Regulation (GDPR). Current interpretability techniques are roughly partitioned into two groups: saliency-based and perturbation-based approaches.
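As an illustration of the perturbation family, the following sketch estimates feature importance by measuring how much a model's output changes when each feature is replaced with a baseline value; the black-box function here is made up for illustration:

```python
import numpy as np

def black_box(x):
    # Stand-in for any opaque model; only its output is observed.
    return 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[2] ** 2

def perturbation_importance(f, x, baseline=0.0):
    """Importance of feature i = |f(x) - f(x with feature i set to baseline)|."""
    base_score = f(x)
    importances = []
    for i in range(len(x)):
        perturbed = x.copy()
        perturbed[i] = baseline
        importances.append(abs(base_score - f(perturbed)))
    return np.array(importances)

x = np.array([1.0, 2.0, 3.0])
print(perturbation_importance(black_box, x))  # larger values = more influential features
```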

Data-driven interval analytics (D2IA) is an approach that enables analytics over user-defined, data-driven windows [1]. In this approach, the user defines conditions on the properties of events that continuously arrive on a stream. A condition can be absolute, i.e., it refers to one or more properties of a single event, or relative, i.e., it compares the event to an aggregate or to another event matched before. The set of events for which the conditions are satisfied is grouped into an interval. Moreover, the user selects an aggregate function to be computed over those elements.
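A simplified sketch of the idea (not the actual D2IA implementation): events are scanned in arrival order, an absolute condition on a single event's property opens or extends an interval, and an aggregate is emitted when the interval closes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Event:
    ts: int        # event timestamp
    value: float   # an event property

def data_driven_intervals(events: Iterable[Event],
                          condition: Callable[[Event], bool],
                          aggregate: Callable[[List[float]], float]):
    """Group consecutive events satisfying `condition` into intervals and
    emit (start_ts, end_ts, aggregate) for each closed interval."""
    buffer: List[Event] = []
    for e in events:
        if condition(e):
            buffer.append(e)               # event matches: extend the interval
        elif buffer:
            yield (buffer[0].ts, buffer[-1].ts,
                   aggregate([x.value for x in buffer]))
            buffer = []                    # interval closed by a non-matching event
    if buffer:                             # flush a trailing open interval
        yield (buffer[0].ts, buffer[-1].ts,
               aggregate([x.value for x in buffer]))

stream = [Event(1, 10), Event(2, 55), Event(3, 70), Event(4, 20), Event(5, 90)]
# Absolute condition: value above a threshold; aggregate: average per interval.
for interval in data_driven_intervals(stream, lambda e: e.value > 50,
                                      lambda vs: sum(vs) / len(vs)):
    print(interval)
```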

Despite the growing use of machine learning-based prediction models in the medical domain [1,2], clinicians still do not trust these models in practice for many reasons. One important reason is that most of the developed models focus on predictive performance (accuracy, area under the curve) but rarely explain their predictions in a form understandable to users. Thus, most of the currently available predictive systems depend on the knowledge of domain experts [3,4].

Machine learning techniques have been used in areas such as finance, advertising, marketing, and medicine, and have achieved satisfactory performance. In practice, complex machine learning models such as Random Forests, Support Vector Machines, and Neural Networks usually achieve better predictive performance than interpretable models such as Linear Regression and Decision Trees.
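For instance, a quick scikit-learn comparison (dataset and default hyperparameters chosen only for illustration) usually shows the ensemble outperforming a single, more interpretable tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("decision tree", DecisionTreeClassifier(random_state=0)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```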

Nowadays, traditional machine learning models are more powerful and easier to use than ever before. The main bottleneck, and the most time-consuming step, used to be feature engineering, but we can now feed raw data to models that learn features on their own. Such models, like deep neural networks, perform very well on most tasks; the only problem is that they require vast amounts of labeled training data. Such large amounts of labeled training data rarely exist in practice, are hard to obtain, and are extremely expensive to create.

Stream processing is a form of continuous querying over unbounded, moving data. Streaming jobs are long-running applications that typically keep running until they are undeployed. Because they are long-running, such jobs commonly need parameter tuning to meet performance objectives. Two major objectives are latency, i.e., how long a data item takes from entering the system until it is completely processed, and throughput, i.e., how many items are processed per unit of time.
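A minimal sketch of how these two metrics can be computed from per-item entry and completion timestamps (the data and names are illustrative):

```python
# Each record: (entry_time, completion_time) in seconds for one processed item.
processed = [(0.00, 0.12), (0.05, 0.20), (0.10, 0.21), (0.90, 1.05)]

# Latency: per-item time from entering the system until fully processed.
latencies = [done - entry for entry, done in processed]
avg_latency = sum(latencies) / len(latencies)

# Throughput: items completed per unit of wall-clock time.
span = max(done for _, done in processed) - min(entry for entry, _ in processed)
throughput = len(processed) / span

print(f"avg latency = {avg_latency:.3f}s, throughput = {throughput:.2f} items/s")
```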

Big stream processing frameworks such as Apache Flink [1,2] provide rich APIs for building real-time data analytics applications. Conceptually, unbounded streams of data flow into the system and results are calculated continuously. To tackle this unbounded nature, streams are divided into chunks by means of window operators. A window can be seen as a way to take a snapshot of the stream and apply user-defined logic to its contents.
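A conceptual sketch of a tumbling (fixed-size, non-overlapping) window in plain Python, not the Flink API itself: events are bucketed by timestamp and a user-defined function is applied to each bucket.

```python
from collections import defaultdict

def tumbling_windows(events, size, user_logic):
    """Assign each (timestamp, value) event to a fixed-size window
    [k*size, (k+1)*size) and apply `user_logic` to every window's contents."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[ts // size].append(value)
    return {k * size: user_logic(vals) for k, vals in sorted(windows.items())}

events = [(1, 4), (2, 7), (6, 1), (7, 3), (11, 9)]
print(tumbling_windows(events, size=5, user_logic=sum))
# {0: 11, 5: 4, 10: 9} -- one aggregate per 5-unit window
```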

With the advent of Bitcoin as a cryptocurrency, blockchain, the technology behind Bitcoin, has gained a lot of attention. The attraction is mainly driven by the tamper-proof promise of a blockchain. Being a distributed ledger that no single entity owns or manages, with a consensus protocol to approve transactions and immutability of written transactions, opens the door for countless applications.
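The tamper-evidence property can be illustrated with a minimal hash chain, a deliberately simplified stand-in for a real blockchain with no consensus or networking: each block stores the hash of its predecessor, so altering any past transaction invalidates every later link.

```python
import hashlib
import json

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def build_chain(transactions):
    chain, prev = [], "0" * 64                      # genesis predecessor
    for tx in transactions:
        block = {"tx": tx, "prev": prev}
        prev = block_hash(block)
        chain.append(block)
    return chain

def is_valid(chain):
    prev = "0" * 64
    for block in chain:
        if block["prev"] != prev:
            return False
        prev = block_hash(block)
    return True

chain = build_chain(["A pays B 5", "B pays C 2"])
print(is_valid(chain))           # True
chain[0]["tx"] = "A pays B 500"  # tamper with an already-written transaction
print(is_valid(chain))           # False: the change breaks every later link
```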

RDF (Resource Description Framework) is the main ingredient and data representation format of Linked Data and the Semantic Web. It provides a generic graph-based data model and representation format for describing things, including their relationships with other things. The SPARQL query language has been recommended by the W3C as the standard language for querying RDF data. The size of RDF databases is growing fast, so RDF query processing engines must be able to deal with increasing amounts of data.
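For illustration, a few RDF triples in Turtle queried with SPARQL via the rdflib Python library; the namespace and data are made up for the example:

```python
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/> .
ex:Alice ex:knows ex:Bob .
ex:Bob   ex:knows ex:Carol .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# SPARQL: whom does Alice know directly?
query = """
PREFIX ex: <http://example.org/>
SELECT ?person WHERE { ex:Alice ex:knows ?person . }
"""
for row in g.query(query):
    print(row.person)
```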

The aim of this project is to implement an efficient execution engine for the G-Core language on top of distributed graph processing platforms.
The aim of this project is to develop a novel mechanism for the automated selection and optimization of distributed machine learning models based on the characteristics of the underlying data sets.