In general, a major obstacle to developing machine learning models on big data is the challenging and time-consuming process of identifying and training an adequate predictive model. Model building is therefore a highly iterative, exploratory process in which practitioners search for the model or algorithm that best meets their performance requirements. In practice there is no one-size-fits-all solution: no single model or algorithm can handle every variety of data set, or the changes in data that may occur over time. Moreover, machine learning algorithms require user-defined inputs to strike a balance between accuracy and generalizability, a task referred to as hyperparameter tuning. These tuning parameters govern how the algorithm searches for an optimal solution. The iterative, exploratory nature of model building becomes prohibitively expensive on big data sets in distributed settings. The aim of this project is to develop a novel mechanism for the automated selection and optimization of distributed machine learning models based on the characteristics of the underlying data sets.
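To make the tuning step concrete, the sketch below shows the simplest form of automated hyperparameter search: an exhaustive grid search that scores every combination of user-defined inputs on held-out data and keeps the best one. This is a minimal illustration of the kind of search the project aims to automate, not the project's proposed mechanism; the `train`/`validate` callables and the toy scoring function are hypothetical placeholders.

```python
from itertools import product

def grid_search(train, validate, param_grid):
    """Score every hyperparameter combination and return the best.

    train: callable mapping a params dict to a fitted model (placeholder).
    validate: callable scoring a model on held-out data (higher is better).
    param_grid: dict mapping each hyperparameter name to candidate values.
    """
    best_score, best_params = float("-inf"), None
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = train(params)      # fit one candidate configuration
        score = validate(model)    # evaluate it on validation data
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy demo: the "model" is just its params, and validation rewards values
# near a hypothetical optimum (lr=0.1, depth=4) -- stand-ins for real
# training and validation routines.
grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best, _ = grid_search(lambda p: p,
                      lambda m: -abs(m["lr"] - 0.1) - abs(m["depth"] - 4),
                      grid)
# best == {"depth": 4, "lr": 0.1}
```

Grid search is exhaustive, so its cost grows multiplicatively with each added hyperparameter; on big data, where a single training run is expensive, this is exactly the bottleneck that motivates smarter automated selection.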
Related Resources:
- Spark MLlib: https://spark.apache.org/docs/latest/ml-guide.html
- TensorFlow: https://www.tensorflow.org/
- Auto-WEKA: https://www.cs.ubc.ca/labs/beta/Projects/autoweka/