Why is it still so hard to analyze remote and distributed data?

Jul 09, 2012

A perspective on some of the challenges of analyzing remote and distributed data.


If the web of documents, which is built upon open standards around HTML (for describing documents) and HTTP (for accessing documents), has been so successful, why don't we have a web of data, built upon open standards around XML (or something perhaps a bit more concise) for describing data and a protocol for accessing data (and metadata)?


About ten years ago, I published a paper called DataSpace: A Web Infrastructure for the Exploratory Analysis and Mining of Data, which described an infrastructure called DataSpace for creating a web of data. DataSpace uses HTML and XML for describing data and metadata, and a protocol we introduced, the DataSpace Transfer Protocol (DSTP), for transferring data. The key idea in DataSpace was to make it lightweight and minimal. It was based upon distributed columns of data, each of which was attached to a key called a universal correlation key, or UCK. We developed reference implementations of DSTP servers to serve columns of data, their associated metadata, and their associated UCKs. Correlating distributed columns of data was simple: applications just used UCKs. Discovery of data and metadata used standard mechanisms.
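
To make the column-and-UCK idea concrete, here is a minimal sketch, in Python, of what correlating two distributed columns might look like from an application's point of view. The server URLs, the simple "uck,value" text format, and the column names are assumptions made purely for illustration; they are not part of DSTP itself.

```python
# Hypothetical sketch: correlate two remote columns of data by their UCK.
# The server URLs, the "uck,value" response format, and the column names are
# illustrative assumptions, not part of the actual DSTP specification.
import urllib.request


def fetch_column(url):
    """Fetch a remote column as a dict mapping UCK -> value."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    column = {}
    for line in text.splitlines():
        uck, value = line.split(",", 1)
        column[uck] = value
    return column


# Two columns served by two different (hypothetical) servers.
temperature = fetch_column("http://server-a.example.org/columns/temperature")
rainfall = fetch_column("http://server-b.example.org/columns/rainfall")

# Correlating the distributed columns is just a join on the shared UCK.
joined = {uck: (temperature[uck], rainfall[uck])
          for uck in temperature.keys() & rainfall.keys()}
print(len(joined), "correlated records")
```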

The W3C Semantic Web effort, which was more ambitious, started at approximately the same time. Despite millions of dollars of funding, it too hasn't really caught on.

It is an interesting exercise to think about why the semantic web, DataSpace, and similar ideas haven't caught on.

Today, we have linked data, whose key concepts are relatively close to those of DataSpace. Linked data is much simpler than the semantic web and is based upon four principles, which Tim Berners-Lee listed in his note Design Issues: Linked Data (a short example of putting them into practice follows the list):

  1. Use URIs to identify things.
  2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
  3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
  4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
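
As a short illustration of principles 2 and 3, dereferencing a linked data URI is just an HTTP GET that asks for a machine-readable representation. The sketch below uses DBpedia only as a familiar example of a resource published this way:

```python
# Sketch: dereference a linked data URI, asking for RDF/XML.
# DBpedia is used purely as a familiar example of a linked data source.
import urllib.request

request = urllib.request.Request(
    "http://dbpedia.org/resource/Berlin",
    headers={"Accept": "application/rdf+xml"},
)
with urllib.request.urlopen(request) as response:
    rdf_xml = response.read().decode("utf-8")

# The returned RDF describes the resource and links it to related URIs
# (principle 4), which a client can dereference in turn.
print(rdf_xml[:500])
```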

DataSpace is quite similar, except that it encourages the use of UCKs so that columns of data can be correlated.

More recently, Stuart Bailey, the Founder and CTO of Infoblox, has been working on IF-MAP, a standard for describing and securely accessing distributed collections of objects and their links, as well as metadata about those objects and links. IF-MAP is an abbreviation for Interface to Metadata Access Points and is a Trusted Computing Group (TCG) standard.
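
The sketch below is not the IF-MAP protocol or its API; it is only a toy illustration, with made-up objects and attributes, of the kind of structure the standard is concerned with: objects, links between objects, and metadata attached to both.

```python
# Toy illustration only: objects, links between objects, and metadata on both.
# The identifiers and attributes are invented; this is not the IF-MAP API.
objects = {
    "device-1": {"type": "laptop"},      # metadata about an object
    "user-42": {"role": "analyst"},
}

links = [
    # (object, object, metadata about the link)
    ("user-42", "device-1", {"relation": "authenticated-on"}),
]


def links_for(name):
    """Return every link (with its metadata) that touches the given object."""
    return [link for link in links if name in (link[0], link[1])]


print(links_for("device-1"))
```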

Stuart Bailey was part of the original DataSpace effort and IF-MAP is an interesting evolution of some of the key ideas in DataSpace.

It still seems like a great time to ask: why don't we have a web of data supporting simple discovery, exploration, correlation, and access?