The rsync utility is a wonderfully useful tool for keeping two datasets synchronized, but it was never designed to keep two large datasets synchronized over a long distance. Over the past couple of years, we at the Laboratory for Advanced Computing at the University of Chicago have developed a utility called UDR that integrates rsync with the high performance network protocol UDT.
UDT is a reliable UDP-based protocol that was designed to move large datasets over wide area, high performance networks. UDT is open source and has been used as the basis for over six commercial products.
UDR is open source and available on GitHub.
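In broad strokes, UDR acts as a wrapper around rsync: you prefix an ordinary rsync command with udr, and the data moves over UDT rather than TCP. The host name and paths below are placeholders, and the exact flags may vary with your setup, so consult the README on GitHub for details.

```
# udr must be installed on both the local and the remote machine.
# The rsync options are passed through unchanged.
udr rsync -av /data/genomics/ user@remote.example.org:/data/genomics/
```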
Here are the results of some tests conducted by Erich Weiler of the University of California, Santa Cruz, moving genomic data:
Allison Heath is the Project Lead for UDR.
If you are a graduate student or post-doc interested in improving your big data skills, you might want to consider applying for an Open Science Data Cloud (OSDC) PIRE 2013 Fellowship. These fellowships are supported by the NSF PIRE Program and provide support for up to eight weeks of work.
The OSDC allows researchers to compute over 1 PB of scientific data from a variety of scientific disciplines.
We provide a big data bootcamp for OSDC PIRE Fellows, who then spend time working with one of the OSDC's foreign collaborators on a variety of projects, including:
- Expanding the OSDC to other countries.
- Developing infrastructure so that the OSDC can interoperate with science clouds in other countries.
- Working on the OSDC software infrastructure.
- Developing domain specific OSDC applications in the biological sciences, earth sciences, social sciences, or digital humanities.
To apply for an OSDC PIRE Fellowship, please fill out the application here. Only U.S. citizens or permanent residents are eligible for OSDC PIRE Fellowships.
The majority of large datasets are unlabeled, while the majority of machine learning algorithms that you are likely to use require labeled data. Of course this is a simplification, but it captures my experience in practice quite well.
One approach that we used in a recent research project is what you might call consensus labeling. Here is a high level outline of the approach (a short Python sketch follows the list):
- Select three or more high quality classifiers that have been trained on small amounts of labeled data. These classifiers will be used in the next step to assign labels to unlabeled data.
- Apply the ensemble of classifiers to a large dataset of unlabeled data to create a labeled dataset. Labels can be assigned either by a majority vote or by labeling only those records on which all of the classifiers agree (a consensus).
- From this larger labeled dataset, train and validate a classifier or other machine learning algorithm.
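Here is a minimal sketch of the approach in Python. The scikit-learn classifiers and the synthetic dataset are illustrative assumptions; the project itself used its own models and text corpora.

```python
# A minimal sketch of consensus labeling with scikit-learn.
# The base classifiers, the synthetic data, and the split sizes
# are illustrative choices, not the ones used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Step 0: a small labeled set and a much larger "unlabeled" pool.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_small, X_pool, y_small, _ = train_test_split(X, y, train_size=500,
                                               random_state=0)

# Step 1: train three classifiers on the small labeled set.
base_clfs = [
    LogisticRegression(max_iter=1000).fit(X_small, y_small),
    RandomForestClassifier(n_estimators=100, random_state=0).fit(X_small, y_small),
    GaussianNB().fit(X_small, y_small),
]

# Step 2: apply the ensemble to the unlabeled pool.
votes = np.array([clf.predict(X_pool) for clf in base_clfs])  # (3, n_pool)

# Keep only the records on which all three classifiers agree (consensus);
# a majority vote over `votes` is the looser alternative.
consensus = (votes == votes[0]).all(axis=0)
X_labeled = X_pool[consensus]
y_labeled = votes[0][consensus]

# Step 3: train a new model on the larger, consensus-labeled dataset.
final_clf = RandomForestClassifier(n_estimators=200, random_state=0)
final_clf.fit(X_labeled, y_labeled)
print(f"consensus kept {consensus.mean():.0%} of the pool "
      f"({consensus.sum()} records)")
```

Requiring unanimous agreement trades coverage for label quality; a majority vote keeps more of the pool but lets noisier labels through.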
The goal of the project was to explore a class of algorithms that could, each night, use a large computing infrastructure (in our case the Open Cloud Consortium’s petabyte-scale OCC-Y Cloud) to analyze an ever-changing collection of text documents and build a new model for entity extraction, part-of-speech tagging, and so on.
This was a joint project with Andrey Rzhetsky and Shi Yu, and I have described just a small part of it here. You can find more details in the paper: Shi Yu, Robert Grossman and Andrey Rzhetsky, Global and Local Approach of Part-of-Speech Tagging for Large Corpora, Information Retrieval and Knowledge Discovery in Biomedical Text: Papers from the 2012 AAAI Fall Symposium, AAAI Press, Menlo Park, California, 2012. pdf.