The GDC contains approximately 5 PB of cancer genomics and associated clinical data. Unlike a data repository or data coordinating center, the GDC re-analyzes (or harmonizes) all the data submitted to it using a common set of bioinformatics pipelines. In other words, alignment and variant calling are done with a common set of parameters using a common set of pipelines, no matter which research group produced the data. This reduces the batch effects that can be a problem with next-generation sequencing data.
More precisely, it reduces the batch effects associated with the processing of data using different bioinformatics pipelines by different research groups, but not necessarily the batch effects associated with the production of the research data.
The GDC is an example of a data commons and is designed so that anyone can build an application over it using the GDC API. In fact, the GDC Data Portal, the GDC Data Submission System, and the various GDC re-analysis applications are themselves applications built by the GDC team using the GDC API.
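As a rough illustration of what building on the GDC API can look like, the sketch below assembles a query URL and reads the matching records from the JSON response. The base URL, the `projects` endpoint, the `fields`, `filters`, and `size` parameters, and the `data.hits` response layout are assumptions based on the public GDC API documentation rather than anything stated in this post, and the helper names `build_query_url` and `fetch_hits` are hypothetical. Check the current API documentation before relying on any of them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public GDC API (not given in this post).
GDC_API_BASE = "https://api.gdc.cancer.gov"


def build_query_url(endpoint, fields, filters=None, size=10):
    """Build a GET URL for a GDC API search endpoint such as 'projects'.

    `fields` is a list of field names to return; `filters` is an optional
    dict in the API's JSON filter syntax; `size` caps the number of hits.
    """
    params = {"fields": ",".join(fields), "size": str(size), "format": "json"}
    if filters is not None:
        # The filter expression is passed as a JSON string in the query.
        params["filters"] = json.dumps(filters)
    return f"{GDC_API_BASE}/{endpoint}?{urlencode(params)}"


def fetch_hits(url, timeout=30):
    """Perform the request and return the list of matching records.

    Assumes the response is JSON shaped like {"data": {"hits": [...]}}.
    """
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload["data"]["hits"]


# Illustrative filter: projects whose primary site includes "Kidney".
# The field name and value are examples only.
KIDNEY_FILTER = {
    "op": "in",
    "content": {"field": "primary_site", "value": ["Kidney"]},
}
```

Calling `fetch_hits(build_query_url("projects", ["project_id", "name"], KIDNEY_FILTER))` would issue the request over the network. Keeping URL construction separate from the request makes it possible to inspect or log a query without sending it.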
The GDC software stack is open source, and we are currently using it to develop other disease-specific and project-specific commons.
The GDC was developed by the National Cancer Institute, the Center for Data Intensive Science (CDIS) at the University of Chicago, the Ontario Institute for Cancer Research, and Leidos Biomedical Research.
We are hiring, so if you have the DevOps, AnalyticOps, bioinformatics, or software development skills required, and want to work on one of the largest data commons in the world and have an impact on cancer, please apply.