The majority of large datasets are unlabeled, while the majority of machine learning algorithms that you are likely to use require labeled data. Of course this is a simplification, but it captures my experience in practice quite well.
One approach that we used in a recent research project is what you might call consensus labeling. Here is a high-level outline of the approach:
- Select three or more high-quality classifiers that have been trained on small amounts of labeled data. These classifiers will be used in the next step to assign labels to unlabeled data.
- Apply the ensemble of classifiers to a large dataset of unlabeled data to create a labeled dataset. Labels can be assigned either by using a majority vote or by only labeling those records in which the classifiers all agree (a consensus).
- From this larger labeled dataset, train and validate a classifier or other machine learning algorithm.
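The labeling step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `consensus_label` function and its two modes are hypothetical names for the two label-assignment rules described in the outline (unanimous consensus versus majority vote).

```python
# Sketch of the consensus-labeling step: combine the predictions of
# several pre-trained classifiers into a single label (or no label).
from collections import Counter

def consensus_label(predictions, mode="consensus"):
    """Combine per-classifier predictions for one record.

    predictions: list of labels, one per classifier.
    mode="consensus": assign a label only if all classifiers agree.
    mode="majority":  assign the most common label, if it is a strict majority.
    Returns None when no label can be assigned; such records are simply
    left out of the new labeled dataset.
    """
    if mode == "consensus":
        return predictions[0] if len(set(predictions)) == 1 else None
    label, count = Counter(predictions).most_common(1)[0]
    # Require a strict majority so ties are handled conservatively.
    return label if count > len(predictions) / 2 else None

# Example: three classifiers label two records (hypothetical POS tags).
records = [["NOUN", "NOUN", "NOUN"], ["NOUN", "VERB", "NOUN"]]
labeled = [(r, consensus_label(r)) for r in records]
```

Under the strict consensus rule only the first record receives a label; under majority vote both do. The consensus rule yields a smaller but cleaner training set, which is the trade-off between the two options.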
The goal of the project was to explore a class of algorithms that each night could use a large computing infrastructure (in our case the Open Cloud Consortium's petabyte-scale OCC-Y Cloud) to analyze an ever changing collection of text documents and build a new model for entity extraction, part of speech tagging, etc.
The project was joint work with Andrey Rzhetsky and Shi Yu, and I have described just a small part of it here. You can find more details in the paper: Shi Yu, Robert Grossman and Andrey Rzhetsky, Global and Local Approach of Part-of-Speech Tagging for Large Corpora, Information Retrieval and Knowledge Discovery in Biomedical Text: Papers from the 2012 AAAI Fall Symposium, AAAI Press, Menlo Park, California, 2012.