Bionimbus Protected Data Cloud (PDC) Update

The Bionimbus Protected Data Cloud (PDC) is an open source, petabyte-scale cloud designed to manage, analyze, and share large genomic datasets for the research community in a secure and compliant fashion. The Bionimbus PDC now contains all of the data available to date from The Cancer Genome Atlas (TCGA). Today this is over 600 TB of data, and it will grow to over 2.5 PB over the next two years. This includes both the controlled access BAM files containing the genomic data and the open access aggregated data derived from the BAM files.

I’ll be giving a talk today about the Bionimbus PDC at the O’Reilly Strata Health Rx Conference in Boston.

Strata Rx Conference 2013

To analyze TCGA data using the Bionimbus PDC, you will need the required approvals from dbGaP. Any researcher authorized to analyze controlled access TCGA data is welcome to use modest amounts of compute and storage resources on the PDC. If you need additional resources, you can apply for a PDC research allocation.

Please contact us if you would like to contribute some data to the PDC, have a project that would like to join the PDC, or have a biomedical cloud that would like to interoperate with the PDC.

Posted in big data | Comments Off

A Tool for Keeping Big Data in Sync

The rsync utility is a wonderfully useful tool for keeping two datasets synchronized, but it was never designed to keep two large datasets synchronized when they are separated by a long distance. Over the past couple of years, we at the Laboratory for Advanced Computing at the University of Chicago have developed a utility called UDR, which integrates rsync with the high performance network protocol UDT.

UDT is a reliable UDP-based protocol that was designed to move large datasets over wide area, high performance networks. UDT is open source and has been used as the basis for over six commercial products.

UDR is open source and available from GitHub.

Here are the results of some tests conducted by Erich Weiler of the University of California, Santa Cruz, moving genomic data:

Source        Destination   UDR (Mb/s)   rsync (Mb/s)
Santa Cruz    Milwaukee     500          160
Santa Cruz    Detroit       600          150
Santa Cruz    Bielefeld     600          6
Santa Cruz    Aarhus        350          6
Santa Cruz    Brisbane      550          3
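
To put these numbers in perspective, here is a small Python sketch that computes the UDR-to-rsync speedup for each destination and a rough transfer time. It rests on assumptions that are not stated in the test results: that Mb/s means megabits per second, that the rate is sustained for the whole transfer, and that the dataset is 1 TB (a purely illustrative size). The function names are made up for this example.

```python
# Throughput in Mb/s, copied from the table above: (UDR, rsync).
RESULTS = {
    "Milwaukee": (500, 160),
    "Detroit": (600, 150),
    "Bielefeld": (600, 6),
    "Aarhus": (350, 6),
    "Brisbane": (550, 3),
}


def speedup(udr_mbps, rsync_mbps):
    """Ratio of UDR throughput to rsync throughput."""
    return udr_mbps / rsync_mbps


def transfer_hours(size_tb, rate_mbps):
    """Hours needed to move size_tb terabytes (10**12 bytes each)
    at a sustained rate of rate_mbps megabits per second (assumed unit)."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    size_tb = 1  # hypothetical dataset size, for illustration only
    for dest, (udr, rsync) in RESULTS.items():
        print(
            f"{dest:10s} speedup {speedup(udr, rsync):6.1f}x  "
            f"UDR {transfer_hours(size_tb, udr):6.1f} h  "
            f"rsync {transfer_hours(size_tb, rsync):7.1f} h"
        )
```

Under these assumptions the gap is modest within North America (roughly 3x to 4x) and much larger on the intercontinental paths, where rsync drops to single-digit Mb/s.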

Allison Heath is the Project Lead for UDR.

Posted in big data | Comments Off

Do You Want Hands On Experience Working with Big Data?

If you are a graduate student or post-doc interested in improving your big data skills, you might want to consider applying for an Open Science Data Cloud (OSDC) PIRE 2013 Fellowship. These fellowships are supported by the NSF PIRE Program and provide support for up to eight weeks of work.

The OSDC allows researchers to run computations over 1 PB of scientific data drawn from a variety of disciplines.

We provide a big data bootcamp for OSDC PIRE Fellows. OSDC PIRE Fellows then spend time working with one of the OSDC foreign collaborators on a variety of projects, including:

  • Expanding the OSDC to other countries.
  • Developing infrastructure so that the OSDC can interoperate with science clouds in other countries.
  • Working on the OSDC software infrastructure.
  • Developing domain specific OSDC applications in the biological sciences, earth sciences, social sciences, or digital humanities.

To apply for an OSDC PIRE Fellowship, please fill out the application here. Only U.S. citizens or permanent residents are eligible for OSDC PIRE Fellowships.

Posted in big data, data science | Comments Off