Fast Creation of Training Data using Weak Supervision

Abstract

Traditional machine learning models are more powerful and easier to use than ever before. The main bottleneck used to be the time-consuming feature engineering step, but we can now feed raw data to models that learn features on their own. Such models, like deep neural networks, perform very well on most tasks; their only drawback is that they require vast amounts of labeled training data. In real applications, such data often does not exist, is hard to obtain, or is extremely expensive to create. An alternative for training machine learning models without hand-labeled training data is weak supervision. Weak supervision uses domain knowledge (from users or domain experts) about the specific problem, or heuristics, to approximate the true labels. A key challenge for weak supervision is that the errors made by the weak supervision signals may be biased. Using multiple sources of weak supervision can mitigate this concern. The goal of this project is to develop different weak supervision techniques that can be integrated together to predict the true labels and thereby automate the creation of training datasets [1-3].
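As a minimal sketch of the idea (all names and heuristics here are hypothetical, not taken from the references), multiple weak supervision sources can be written as labeling functions that vote on each unlabeled example; a simple majority vote over their non-abstaining outputs then yields an approximate label:

```python
# Minimal sketch (hypothetical example): combining several heuristic
# labeling functions by majority vote to approximate true labels.
import numpy as np

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Each labeling function encodes a noisy heuristic and may abstain.
def lf_contains_keyword(text):
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_is_short(text):
    return NEGATIVE if len(text.split()) < 4 else ABSTAIN

def lf_has_exclamation(text):
    return POSITIVE if "!" in text else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_keyword, lf_is_short, lf_has_exclamation]

def weak_label(text):
    """Apply all labeling functions and majority-vote over non-abstaining outputs."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN  # no signal; the example stays unlabeled
    return int(np.argmax(np.bincount(votes)))

unlabeled_docs = ["I want a refund now!", "ok thanks", "great product, no complaints"]
noisy_labels = [weak_label(d) for d in unlabeled_docs]
print(noisy_labels)  # e.g. [1, 0, -1]
```

In the approaches described in [1-3], the simple majority vote is replaced by a generative label model that learns the accuracies and correlations of the labeling functions from their agreements and disagreements, producing probabilistic training labels.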

Best References:

  1. Ratner, Alexander, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. "Snorkel: Rapid training data creation with weak supervision." Proceedings of the VLDB Endowment 11, no. 3 (2017): 269-282.
  2. Ratner, Alexander J., Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. "Data programming: Creating large training sets, quickly." In Advances in Neural Information Processing Systems, pp. 3567-3575. 2016.
  3. Bach, Stephen H., Bryan He, Alexander Ratner, and Christopher Ré. "Learning the structure of generative models without labeled data." arXiv preprint arXiv:1703.00854 (2017).