The Case for Data Peering

Dec 23, 2014

The case for data peering of research data between data commons.

Peering relationships established between the largest Internet Service Providers (sometimes called Tier 1 ISPs) were one of the important components in the success of the Internet. Peering meant that two ISPs agreed to carry each other's network traffic at no cost, so that the customers of each could reach the customers of the other at no cost beyond what they paid to connect to their own ISP. Peering was established when ISPs connected to common large switches located at Internet Points of Presence (cross-connects).

Today, generally, the first step in analyzing data is downloading it. For small datasets, this is not a problem, but as the size of data grows, just downloading the data can be a hurdle. One approach is to co-locate research data in what are sometimes called data commons.
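To get a feel for the scale of the problem, here is a back-of-the-envelope calculation in Python. It borrows two figures that appear elsewhere in this post (1 PB of data and a 10 Gbps link) and assumes an ideal, fully utilized link with no protocol overhead, so real transfers would take longer. The function name is made up for this illustration.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return size_bytes * 8 / link_bits_per_second


PB = 10**15    # one decimal petabyte, in bytes
GBPS = 10**9   # one gigabit per second, in bits per second

seconds = transfer_seconds(1 * PB, 10 * GBPS)
days = seconds / 86_400  # seconds per day
print(f"{seconds:.0f} s, about {days:.1f} days")  # 800000 s, about 9.3 days
```

Even under these generous assumptions, moving a petabyte takes more than a week, which is why moving the analysis to the data can be more attractive than moving the data.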

Define a data commons as a repository for data that supports i) a digital ID (DID) service so that researchers can easily discover and access data; and ii) an API that allows researchers to access the metadata associated with a DID and the data itself.
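As a rough illustration of this definition, the following Python sketch models a data commons as an in-memory store with the two capabilities named above. Everything here is hypothetical: the class and method names, the use of a random UUID as the digital ID, and the in-memory dictionaries are assumptions made for the example, not a description of any existing commons.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DataCommons:
    """Toy data commons: a digital ID service plus a metadata/data API."""
    name: str
    _metadata: dict = field(default_factory=dict)
    _data: dict = field(default_factory=dict)

    def register(self, data: bytes, metadata: dict) -> str:
        """Store a dataset and mint a digital ID for it (assumed format)."""
        did = f"{self.name}:{uuid.uuid4()}"
        self._data[did] = data
        self._metadata[did] = dict(metadata)
        return did

    def get_metadata(self, did: str) -> dict:
        """Return a copy of the metadata associated with a digital ID."""
        return dict(self._metadata[did])

    def get_data(self, did: str) -> bytes:
        """Return the data associated with a digital ID."""
        return self._data[did]


commons = DataCommons("example-commons")
did = commons.register(b"ACGT", {"format": "text", "size_bytes": 4})
print(did, commons.get_metadata(did), commons.get_data(did))
```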

The Open Science Data Cloud is one data commons for research data that today contains over 1 PB of data.

Here are three rules that data commons might adopt to support data peering. Two Research Data Commons with a Tier 1 data peering relationship agree as follows:

  1. To transfer research data between them at no cost.

  2. To connect to at least two other Tier 1 Research Data Commons at 10 Gbps or higher.

  3. To support Digital IDs (of a form to be determined by mutual agreement) so that a researcher using infrastructure associated with one Tier 1 Research Data Commons can access data transparently from any of the Tier 1 Research Data Commons that hold the desired data.
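The third rule can be sketched in a few lines of Python. This is only one possible reading: the class name, the methods, and the "did:example:" identifier format are invented for the illustration, and the lookup consults direct peers only rather than every Tier 1 commons.

```python
class PeeredCommons:
    """Toy commons that resolves a digital ID locally, then via its peers."""

    def __init__(self, name: str):
        self.name = name
        self._store = {}   # digital ID -> data held locally
        self._peers = []   # commons with which this one peers

    def put(self, did: str, data: bytes) -> None:
        self._store[did] = data

    def peer_with(self, other: "PeeredCommons") -> None:
        """Peering is mutual: each side records the other."""
        if other not in self._peers:
            self._peers.append(other)
        if self not in other._peers:
            other._peers.append(self)

    def holds(self, did: str) -> bool:
        return did in self._store

    def get(self, did: str) -> bytes:
        """Return the data for a digital ID, wherever among the peers it lives."""
        if self.holds(did):
            return self._store[did]
        for peer in self._peers:
            if peer.holds(did):
                return peer._store[did]
        raise KeyError(f"{did} not found in {self.name} or its peers")


commons_a = PeeredCommons("commons-a")
commons_b = PeeredCommons("commons-b")
commons_b.put("did:example:1", b"genome slice")
commons_a.peer_with(commons_b)

# The researcher asks commons A; the data is actually held by commons B.
print(commons_a.get("did:example:1"))
```

The caller never names the commons that holds the data, which is the sense in which access is transparent.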

With data peering, a researcher working in one commons or cloud can access data transparently using its digital ID, even if the data is located in another commons or cloud, as long as the two peer. There would be no costs incurred, and, at least for smaller datasets, the difference in latency may not be noticeable. As datasets grow in size, data slicing can be used to retrieve just the slice of data that is needed.
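The post does not say how data slicing would be implemented. One simple interpretation, assumed here purely for illustration, is a byte-range read: the caller asks for an offset and a length rather than the whole object. The function name and the in-memory dataset below are hypothetical.

```python
import io


def read_slice(source, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at `offset` from a seekable binary source."""
    source.seek(offset)
    return source.read(length)


# Stand-in for a large remote dataset: 1024 bytes, repeating 0..255.
dataset = io.BytesIO(bytes(range(256)) * 4)

chunk = read_slice(dataset, offset=256, length=4)
print(chunk)  # b'\x00\x01\x02\x03'
```

Only the four requested bytes are read, so the cost of the request depends on the size of the slice rather than the size of the dataset.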

Data peering can be supported by cross-connecting two data commons at co-location facilities. With this approach, the cost to exchange research data between two peering entities is low and fixed: it is simply the cost of the cross-connect.