Clusters, Clouds and Commons for Big Data

Oct 24, 2014

Analytic infrastructure has evolved from clusters to clouds over the past decade. Over the next decade, commons will emerge as an alternative infrastructure for big data.

Over the past few months I have given several talks about data commons, discussing some of the requirements that are emerging for them and some of the relevant technical challenges in computer systems. In September, I gave a plenary talk at the 2014 IEEE Cluster Conference in Madrid, and in October, I gave a colloquium at DePaul University in Chicago. You can find the DePaul colloquium here.

By a commons (or sometimes data commons), I mean cyber infrastructure that co-locates data, storage, computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the community.

In these talks, I identified five requirements. As a mnemonic, each requirement is associated with a word beginning with the letter "P" (Permanent, Pods, Peering, Portability, and Pay). I apologize in advance to those readers who are offended by the liberties I took to make this alliteration work.

Requirement 1. Permanent Digital IDs. The first requirement is to support permanent identifiers for the digital objects that represent the data. We also assume that associated with each digital ID is public, and perhaps private, metadata. In our implementations, we access the objects with an S3-compatible API.
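To make this concrete, here is a minimal sketch of what resolving a digital ID over an S3-compatible API can look like. The endpoint, bucket name, and the convention of using the digital ID as the object key are assumptions for illustration, not a specification of any particular commons.

```python
# Sketch: resolve a permanent digital ID to object bytes plus public metadata
# via an S3-compatible API. Endpoint, bucket, and key scheme are hypothetical;
# credentials are assumed to come from the environment.
import boto3

def fetch_object(digital_id, endpoint="https://commons.example.org"):
    """Fetch the bytes and public metadata associated with a digital ID."""
    s3 = boto3.client("s3", endpoint_url=endpoint)
    # Assumed convention: one bucket of digital objects, ID used as the key.
    response = s3.get_object(Bucket="commons-objects", Key=digital_id)
    data = response["Body"].read()
    # S3 user-defined metadata travels as x-amz-meta-* headers and surfaces
    # here as a plain dict -- a natural home for the public metadata.
    metadata = response["Metadata"]
    return data, metadata
```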

Requirement 2. Software stacks that scale to cyber pods. The second requirement is to support a software stack that scales out to what you might call a cyber pod. Data centers are sometimes divided into "pods," which can be built out and customized as needed. Pods vary in size but usually contain between 10 and 100 or more racks of computing infrastructure. Think of a cyber pod as a data center pod containing hardware and integrated software. Most software does not scale out to cyber pods, and a lot of work is required to give the software the fault tolerance and resiliency needed at that scale. At the scale of a cyber pod, a wide variety of different types of failures are common.
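As one small illustration of the defensive style this scale forces on software, here is a sketch of retrying an idempotent operation with exponential backoff and jitter. Real pod-scale stacks layer this with replication, failure detection, and much more; the function and parameters below are hypothetical.

```python
# Sketch: at pod scale, node and network failures are routine, not
# exceptional, so every remote call needs a retry policy around it.
import random
import time

def with_retries(operation, attempts=5, base_delay=0.5):
    """Run an idempotent operation, retrying transient failures."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            # Exponential backoff with jitter avoids synchronized retry
            # storms across thousands of workers in the same pod.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```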

Requirement 3. Data Peering. By data peering we mean cross connecting two commons so that: 1) data can be transferred from one commons to the other at no cost; 2) the two commons are connected by high performance links; and 3) there are sufficient cross connections between commons that a researcher using one commons can access data at another commons that peers with it, simply by referencing its digital ID.
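From the researcher's point of view, peering means a digital ID resolves no matter which peered commons actually holds the object. The toy resolver below sketches that behavior; the class, its interface, and the in-memory object store are illustrative only.

```python
# Sketch: resolve a digital ID against the local commons first, then its
# peers, so the researcher never needs to know where the object lives.
class Commons:
    def __init__(self, name, peers=()):
        self.name = name
        self.objects = {}          # digital ID -> object bytes (toy store)
        self.peers = list(peers)   # peered commons, reachable at no cost

    def resolve(self, digital_id, _seen=None):
        """Return the object for a digital ID, searching peers if needed."""
        _seen = set() if _seen is None else _seen
        if self.name in _seen:     # guard against cycles in mutual peering
            return None
        _seen.add(self.name)
        if digital_id in self.objects:
            return self.objects[digital_id]
        for peer in self.peers:
            found = peer.resolve(digital_id, _seen)
            if found is not None:
                return found
        return None

# Usage: an object stored at commons B is visible from commons A via peering.
b = Commons("B")
b.objects["ark:/0000/example"] = b"genome data"
a = Commons("A", peers=[b])
assert a.resolve("ark:/0000/example") == b"genome data"
```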

Requirement 4. Data Portability. Today, many systems are designed to import data easily, but exporting data can be quite challenging. Simplifying the export of data is necessary for data commons to be widely accepted. Some data commons will succeed and some will fail, and just as we have developed systems to deal with the failure of financial institutions, we will need to develop systems that can manage the failure of a data commons so that essential research data is not lost.
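One way to think about the export side of portability is as a self-describing bundle: the object, its digital ID, its metadata, and an integrity check, packaged so that another commons (or the researcher) can re-import it without help from the originating system. The bundle layout below is a sketch, not a standard.

```python
# Sketch: export a digital object plus a manifest into a portable archive.
import hashlib
import io
import json
import tarfile

def export_bundle(digital_id, data, metadata, path):
    """Write object bytes plus a manifest into a portable tar archive."""
    manifest = {
        "digital_id": digital_id,
        "metadata": metadata,
        "sha256": hashlib.sha256(data).hexdigest(),  # verified on re-import
    }
    with tarfile.open(path, "w:gz") as tar:
        for name, payload in [("manifest.json", json.dumps(manifest).encode()),
                              ("data", data)]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```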

Requirement 5. Pay for Compute. Data peering and data portability are about providing efficient, high performance access to research data. For such a system to be economically feasible, costs must be recovered in some other way, and incentives must be created for using resources rationally. One simple way to do this is to require researchers to pay for compute: payment can be via a credit card, via allocations (as is done at supercomputing centers), or via "chits" provided to researchers by funding agencies.
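Whatever the funding mechanism, the accounting underneath looks roughly the same: each researcher draws down an allocation as jobs consume resources, and jobs are refused once the balance is exhausted. The class, units, and rates below are placeholders, not a billing design.

```python
# Sketch: debit compute usage against a researcher's allocation, however
# that allocation was funded (credit card, grant allocation, or chits).
class ComputeAllocation:
    def __init__(self, owner, core_hours):
        self.owner = owner
        self.balance = core_hours  # remaining allocation in core-hours

    def charge(self, cores, hours):
        """Debit a job's usage; refuse the job if the allocation is spent."""
        cost = cores * hours
        if cost > self.balance:
            raise RuntimeError(f"{self.owner}: allocation exhausted "
                               f"(need {cost}, have {self.balance})")
        self.balance -= cost
        return self.balance

# Usage: a 16-core, 3-hour job against a 1,000 core-hour allocation.
alloc = ComputeAllocation("researcher-42", core_hours=1000)
remaining = alloc.charge(cores=16, hours=3)  # leaves 952 core-hours
```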