Thesis Topics

The aim of this project is to develop a novel mechanism for the automated selection and optimization of distributed machine learning models based on the characteristics of the underlying data sets.
The aim of this project is to implement an efficient execution engine for the G-Core language on top of distributed graph processing platforms.

RDF (Resource Description Framework) is the main ingredient and data representation format of Linked Data and the Semantic Web. It provides a generic graph-based data model and representation format for describing things, including their relationships with other things. In practice, the SPARQL query language has been recommended by the W3C as the standard language for querying RDF data. The size of RDF databases is growing fast, so RDF query processing engines must be able to deal with increasing amounts of data.
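
A minimal sketch (assuming the rdflib Python library) of building a small RDF graph and querying it with SPARQL; the example resources and predicates are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
g.add((EX.alice, EX.knows, EX.bob))           # triple: (subject, predicate, object)
g.add((EX.alice, EX.name, Literal("Alice")))

# SPARQL query asking who Alice knows.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person WHERE { ex:alice ex:knows ?person . }
""")
for row in results:
    print(row.person)
```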

Energy analytics is gaining attention with the wide adoption of smart meters. Such meters report energy consumption at least every 15 minutes, which amounts to 96 readings per day. The analysis of such consumption data can provide insight and help predict energy demand.
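
A minimal sketch (assuming pandas is available) of aggregating 15-minute smart-meter readings into daily consumption; the data here are synthetic:

```python
import numpy as np
import pandas as pd

# 96 readings per day at a 15-minute rate (4 per hour * 24 hours), for one week.
idx = pd.date_range("2024-01-01", periods=96 * 7, freq="15min")
readings = pd.Series(np.random.rand(len(idx)), index=idx, name="kwh")

daily = readings.resample("D").sum()   # total consumption per day
print(daily.head())
```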

With the advent of Bitcoin as a cryptocurrency, blockchain, the technology behind Bitcoin, has gained a lot of attention. The attraction is mainly driven by the tamper-proof promise of a blockchain: a distributed ledger that no single entity owns or manages, with a consensus protocol for approving transactions and immutability of written transactions. These properties open the door for countless applications.
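
A toy Python sketch of why a hash-chained ledger is tamper-evident (this illustrates only the chaining, not a consensus protocol); the transactions are made up:

```python
import hashlib
import json

def block_hash(block):
    # Hash the block's canonical JSON representation.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})
    return chain

def verify(chain):
    # Valid only if every stored prev_hash matches the recomputed hash
    # of the preceding block.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = append_block([], ["alice pays bob 5"])
chain = append_block(chain, ["bob pays carol 2"])
print(verify(chain))                                  # True
chain[0]["transactions"][0] = "alice pays bob 500"    # tamper with history
print(verify(chain))                                  # False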

Big stream processing frameworks such as Apache Flink [1,2] provide rich APIs for building real-time data analytics applications. Conceptually, unbounded streams of data flow into the system and results are computed continuously. To tackle this unbounded nature, streams are divided into chunks by means of window operators. A window can be seen as a way to take a snapshot of the stream and apply user-defined logic to its contents.
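
A minimal Python sketch of the tumbling-window idea (not Flink's actual API); the window size and the aggregation are illustrative:

```python
from itertools import islice

def tumbling_windows(stream, size):
    it = iter(stream)
    while True:
        window = list(islice(it, size))   # snapshot of the next `size` events
        if not window:
            break
        yield window

# Example: average of every window of 4 readings from an (in principle
# unbounded) stream of sensor values.
stream = iter([3, 5, 7, 9, 2, 4, 6, 8])
for w in tumbling_windows(stream, 4):
    print(sum(w) / len(w))
```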

Stream processing is a form of continuous querying over unbounded, moving data. Streaming jobs are long-running applications that typically keep running until they are undeployed. Because they are long-running, such jobs commonly need parameter tuning to meet performance objectives. Two major objectives are latency, the time a data item takes from entering the system until it is completely processed, and throughput, the number of items processed per unit of time.
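
A minimal sketch of how the two objectives can be measured from per-item entry and exit timestamps; the timestamps below are illustrative:

```python
records = [
    # (entry_time, exit_time) in seconds
    (0.00, 0.12),
    (0.05, 0.20),
    (0.10, 0.21),
]

latencies = [exit - entry for entry, exit in records]
avg_latency = sum(latencies) / len(latencies)            # seconds per item

elapsed = max(e for _, e in records) - min(s for s, _ in records)
throughput = len(records) / elapsed                      # items per second

print(f"avg latency: {avg_latency:.3f}s, throughput: {throughput:.1f} items/s")
```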

Clustering is an unsupervised machine learning task in which a data set is divided into disjoint subsets (clusters) such that each cluster’s elements are similar to each other and different from the elements of other clusters. In a streaming context, where data keep arriving indefinitely and the data set is unbounded, each data element can be seen exactly once. Thus, traditional clustering algorithms that require several passes over the data cannot be used.
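
A minimal sketch of a single-pass (online) k-means variant, one of several possible streaming clustering strategies; the centroid seeds and the data are synthetic:

```python
def online_kmeans(stream, centroids):
    counts = [0] * len(centroids)
    for x in stream:
        # Assign the element to the closest centroid (1-D for simplicity).
        j = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        counts[j] += 1
        # Move the centroid toward x with a decreasing learning rate.
        centroids[j] += (x - centroids[j]) / counts[j]
    return centroids

stream = iter([1.0, 1.2, 9.8, 0.9, 10.1, 10.3, 1.1])
print(online_kmeans(stream, centroids=[0.0, 5.0]))   # roughly [1.05, 10.07]
```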

Nowadays, traditional machine learning models are more powerful and easier to use than ever before. The main bottleneck and most time-consuming step used to be feature engineering, but now we can feed raw data to models that learn features on their own. Such models, like deep neural networks, perform very well on most tasks; the only problem is that they require vast amounts of labeled training data. Such large amounts of labeled data often do not exist or are hard to obtain in practice, and creating them is extremely expensive.

Machine learning techniques have been used in different areas such as finance, advertising, marketing, and medicine, and achieve satisfactory performance. In practice, complex machine learning models such as Random Forests, Support Vector Machines, and Neural Networks usually achieve better performance than interpretable models such as Linear Regression and Decision Trees.

Despite the growing use of machine learning-based prediction models in the medical domain [1,2], clinicians still do not trust these models in practice for many reasons. One important reason is that most of the developed models focus on predictive performance (accuracy, area under the curve) but rarely explain the prediction in a form understandable to users. Thus, most of the currently available predictive systems depend on the knowledge of domain experts [3,4].

Fantasy Premier League (https://fantasy.premierleague.com/) has become a very popular game in the soccer world, with more than 5 million users. The aim of this project is to use machine learning and deep learning techniques to build a recommender system for the weekly team formation. In order to work on this project, you need at least two seasons of experience with the game and its rules.


Data-driven interval analytics (D2IA) is an approach that enables analytics over user-defined, data-driven windows [1]. In this approach, the user defines conditions on the properties of events that continuously arrive on a stream. A condition can be absolute, i.e., refer to one or more properties of a single event, or relative, i.e., compare the event to an aggregate or to another event that has been matched before. The events for which the conditions are satisfied are grouped into an interval. Moreover, the user selects an aggregate function to be computed on those elements.
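
A minimal Python sketch of the idea (not the D2IA implementation from [1]); the event fields, the condition, and the aggregate are illustrative:

```python
def data_driven_intervals(events, condition, aggregate):
    current = []
    for e in events:
        if condition(e):
            current.append(e)          # event extends the open interval
        elif current:
            yield aggregate(current)   # condition broken: close the interval
            current = []
    if current:
        yield aggregate(current)

# Example: intervals of consecutive temperature readings above 30 degrees,
# aggregated by their average.
events = [{"temp": 25}, {"temp": 32}, {"temp": 35}, {"temp": 28}, {"temp": 31}]
for avg in data_driven_intervals(
        events,
        condition=lambda e: e["temp"] > 30,
        aggregate=lambda es: sum(e["temp"] for e in es) / len(es)):
    print(avg)    # 33.5, then 31.0
```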

Stream processing is concerned with analyzing data as they are created. Many use cases require such on-the-fly analysis. IoT applications in smart cities [1], smart homes [2], and healthcare [3] are just a few examples of such scenarios. In all of these scenarios, data originate from sensors. As most sensors communicate their readings wirelessly, there is a large potential for interference. Moreover, sensors might malfunction and start producing inaccurate readings. All of these are forms of uncertainty in the data [4,5].

Explaining the decisions of machine learning black boxes has received huge attention, especially after the EU General Data Protection Regulation (GDPR). Current interpretability techniques are roughly partitioned into two groups: saliency approaches and perturbation approaches.
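
A minimal sketch of the perturbation idea using feature permutation (one of several perturbation-style techniques); the model and the data are synthetic, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)   # feature 0 matters most

model = RandomForestClassifier(random_state=0).fit(X, y)
baseline = model.score(X, y)

for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # destroy feature j's signal
    print(f"feature {j}: importance ~ {baseline - model.score(X_perm, y):.3f}")
```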

Machine learning prediction models have been widely used in different domains; however, their lack of interpretability can limit their adoption in many critical domains. Robustness and identity are key required properties of any interpretability technique. Robustness states that similar inputs should have similar explanations. Identity states that identical inputs should have identical explanations. Robustness is very important for several reasons.
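
A minimal sketch of how the two properties could be checked for a generic, hypothetical explanation function explain(x); the toy linear model and its weight-times-value attribution are assumptions for illustration only:

```python
import numpy as np

def identity_holds(explain, x):
    # Identity: identical inputs must receive identical explanations.
    return np.allclose(explain(x), explain(x.copy()))

def robustness(explain, x, eps=1e-2):
    # Robustness: a small input perturbation should only change the
    # explanation a little (smaller values mean more robust).
    x_near = x + eps * np.random.default_rng(0).normal(size=x.shape)
    return np.linalg.norm(explain(x) - explain(x_near))

# Toy example: a linear model with a weight-times-value feature attribution,
# which is trivially identical and robust.
w = np.array([0.5, -1.0, 2.0])
explain = lambda x: w * x
x = np.array([1.0, 2.0, 3.0])
print(identity_holds(explain, x), robustness(explain, x))
```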