What is Big Data?

The discipline of data intensive computing has been growing in importance and in popularity recently. It has now become popular enough that the term “big data” is beginning to be used instead. The graph below is from Google Trends and shows the growth of the term “big data” over the past couple of years.

I used to think that data came in three sizes depending upon how you managed it: either small enough to fit into memory, small enough to fit into a database, or too big for a database.

During the last few years, I have changed my view of how to measure the size of big data. The most common approach is to measure data in bytes: megabytes, gigabytes, terabytes, petabytes, and exabytes. But over the past few years, I have noticed that people with very large amounts of data measure their data, and the computing power required to process it, in megawatts (MW).

Here are some examples:

  • A good sweet spot for a data center is 15 MW.
  • Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
  • Facebook’s new Prineville data center is 30 MW.
  • Google’s computing infrastructure uses 260 MW.

Today, the Open Science Data Cloud requires about 0.5 MW. Our goal over the next 3 to 5 years is to develop and operate a facility of about 5 MW devoted to science.

The perspective when you measure data in MW is somewhat different. You would like the facility to be uniform. You would like to be able to add new racks and retire old racks with little, if any, manual intervention. You would like to be able to optimize the amount of data you can manage and the amount of data you can process per MW.
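As a rough sketch of what "per MW" measures might look like, the snippet below normalizes the data managed and the data processed by the power a facility draws. The class name, field names, and the processing figure are hypothetical and chosen only for illustration; the 0.5 MW figure is the one quoted above, and the 1 PB figure is an assumed amount of data under management.

```python
from dataclasses import dataclass


@dataclass
class Facility:
    """A hypothetical description of a data facility, for illustration only."""
    name: str
    power_mw: float              # power drawn by the facility, in MW
    storage_pb: float            # data under management, in PB (assumed)
    processed_pb_per_day: float  # data processed per day, in PB (assumed)

    def storage_per_mw(self) -> float:
        """PB of data managed per MW of power."""
        return self.storage_pb / self.power_mw

    def processed_per_mw(self) -> float:
        """PB of data processed per day per MW of power."""
        return self.processed_pb_per_day / self.power_mw


# Illustrative numbers only: 0.5 MW is from the text, the rest is assumed.
example = Facility(name="example", power_mw=0.5, storage_pb=1.0,
                   processed_pb_per_day=0.1)
print(example.storage_per_mw(), "PB managed per MW")
print(example.processed_per_mw(), "PB processed per day per MW")
```

Comparing these two ratios across facilities, or across rack generations within one facility, is one simple way to track whether a design change actually improves the amount of data handled per MW.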

Today, it takes too long for us to add and retire racks from the OSDC. If you would like to join a research project to develop open source software to simplify this, please write us at info at opencloudconsortium.org.


Some Research Topics Related to Big Data Science and Cloud Computing

IEEE Cloud 2011 took place in Washington DC from July 4 to 6, 2011.

The full name of the conference is The 4th International Conference on Cloud Computing and it was co-located with three related conferences: 1) IEEE Services 2011 (The 7th World Congress on Services), 2) IEEE SCC 2011 (The 8th International Conference on Services Computing), and 3) IEEE ICWS 2011 (The 9th International Conference on Web Services).

There were a lot of different technical topics covered. The diagram below shows you some of them.

In addition, all four conferences worked together and sponsored several plenary panels. I participated in one of them called “Science in Cloud Computing.” I have posted my slides on slideshare and you can find them here.

One of the topics that I work on these days is data intensive computing and in particular its impact on science. The popular term is big data science. Data intensive computing and big data have had an important impact on business over the past decade, but their impact on science is just beginning to be felt.

In my talk for the plenary panel, I described a project that I have been working on called the Open Science Data Cloud (OSDC). The OSDC is sponsored by the not-for-profit Open Cloud Consortium (OCC). We are working with OCC partners and sponsors to stand up a cloud devoted to science. Initially it will contain approximately 1 PB of data from a variety of scientific disciplines.

We are looking for volunteers to help with the OSDC, so please contact us at info at opencloudconsortium.org if you would like to get involved. We are looking for help with loading and curating the data, building the data intensive computing cloud infrastructure, maintaining the web site, and doing outreach.

Based upon my experience with the OSDC over the past year, I ended my presentation in the plenary panel with three research problems related to data intensive computing and cloud computing:

  1. Develop technology to encapsulate a scientist’s data and analysis tools and to export, save and move these between clouds.
  2. Develop protocols, utilities, and applications so that new racks and containers can be added to data clouds with minimal human involvement.
  3. Develop technology to support the long term (20+ years), low cost preservation of data and metadata in clouds.

Source: The diagram is from http://www.servicescongress.org/2011/.


Small, Medium and Big Data

What is big data? From the point of view of the infrastructure required to do analytics, data comes in three sizes:

  • Small data. Small data fits into the memory of a single machine. A good example of a small dataset is the dataset for the Netflix Prize. The Netflix Prize dataset consists of over 100 million movie ratings by 480 thousand randomly-chosen, anonymous Netflix customers who rated over 17 thousand movie titles. This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop.
  • Medium data. A good working definition of medium size data is to think of data as medium size if it fits into a single disk or disk array and can be managed by a database. It is becoming common today for companies and organizations to create 10 to 100 TB or larger size data warehouses, so medium size data can grow quite large.
  • Big data. Big data is so large that it is challenging to manage it in a database and instead specialized systems are used. The most popular such system these days is Hadoop, although I expect we will have more choices in a few years. What have become known as NoSQL databases can also be used to manage big data sets.
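The three sizes above can be sketched as a simple decision rule. This is only an illustration: the function name, its parameters, and the example capacities (an 8 GB laptop and a 100 TB data warehouse) are assumptions, not fixed thresholds, since what counts as small or medium depends on the machine and database at hand.

```python
GB = 10**9
TB = 10**12
PB = 10**15


def classify(dataset_bytes: int, memory_bytes: int,
             database_capacity_bytes: int) -> str:
    """Classify a dataset by the infrastructure needed to analyze it.

    memory_bytes is the memory of a single machine; database_capacity_bytes
    is the largest size the available database can manage. Both are
    assumptions supplied by the caller.
    """
    if dataset_bytes <= memory_bytes:
        return "small"    # fits into the memory of a single machine
    if dataset_bytes <= database_capacity_bytes:
        return "medium"   # fits on a disk or disk array, managed by a database
    return "big"          # needs a specialized system


# Hypothetical environment: an 8 GB laptop and a 100 TB data warehouse.
laptop_memory = 8 * GB
warehouse_capacity = 100 * TB

print(classify(2 * GB, laptop_memory, warehouse_capacity))   # a 2 GB dataset
print(classify(50 * TB, laptop_memory, warehouse_capacity))  # a 50 TB warehouse
print(classify(1 * PB, laptop_memory, warehouse_capacity))   # a 1 PB collection
```

Note that the boundaries move with the hardware: the same dataset can shift from medium to small simply because machines with more memory become available.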

There have always been large datasets, but until about 2000, most large datasets were produced by the scientific and defense communities. For example, the Large Hadron Collider (LHC) will produce a large data set.

Two things have changed during the last decade: First, large datasets are now produced by a third community: companies that provide Internet services, such as search, on-line advertising and social media. Second, the ability to analyze these datasets is critical for advertising systems that produce the bulk of the revenue for these companies. This provides a metric (dollars of online revenue produced) by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, companies such as Google settled upon analytic infrastructure that was quite different from the grid-based infrastructure that is generally used by the scientific community.

This is an update of a post that I originally wrote in 2009 and that is no longer available.


What is Analytic Infrastructure and Why Should You Care?

I have been building analytic models for over 20 years. The names have changed a lot over the years: 20 years ago we built statistical models, 10 years ago we built data mining models, and today we build analytic models. The algorithms have changed some: classification and regression trees became common 20 years ago, support vector machines about 10 years ago, and today graph-based algorithms are popular.

Perhaps what has changed the most is my perspective.

Analytic algorithms and models. Twenty years ago, I was focused on algorithms and was concerned with the different types of models that you could build using different types of algorithms on different types of data. This worked fine as long as the data fit into the memory of the computer.

Analytic infrastructure. For better or worse, I ran into problems with so much data that it was too big to fit into memory. Some projects required a disk, some required many disks, and a few required tertiary storage. I spent over two decades working on what you might call analytic infrastructure. I first worked on teams that developed specialized data management infrastructures for the high energy physics community. These were optimized for efficient reads (instead of safe writes) and accessed the data by columns (instead of rows) in order to speed up numerical computations. They turned out to be some of the first examples of data warehouses (the name was not used at that time), increased by 1 to 3 orders of magnitude the size of data that we could model, and were heavily criticized by the database community. Of course, several years later the database community embraced data warehouses, at least for reports, if not for data intensive computing and modeling.
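To illustrate why accessing data by columns can speed up numerical computations, here is a minimal in-memory sketch. It is not the system described above; the field names (event_id, energy, weight) and the values are hypothetical. The point is only that a computation over one field needs to touch a single contiguous array in the column layout, while the row layout forces it to walk through every record.

```python
from array import array

# Row-oriented layout: all fields of a record are stored together.
rows = [
    (1, 45.2, 0.98),   # (event_id, energy, weight), hypothetical values
    (2, 51.7, 1.02),
    (3, 48.9, 0.95),
]

# Column-oriented layout: each field is stored in its own contiguous array.
columns = {
    "event_id": array("q", [1, 2, 3]),
    "energy": array("d", [45.2, 51.7, 48.9]),
    "weight": array("d", [0.98, 1.02, 0.95]),
}


def mean_energy_rows(records):
    """Visit every record, although only one field is needed."""
    return sum(record[1] for record in records) / len(records)


def mean_energy_columns(cols):
    """Read only the 'energy' column; the other fields are never touched."""
    energy = cols["energy"]
    return sum(energy) / len(energy)


print(mean_energy_rows(rows))
print(mean_energy_columns(columns))
```

On disk the difference is larger than this toy suggests, because a column store can skip reading the unused fields entirely, which matters when each record has many fields and the analysis uses only a few.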

About five years ago, I began working on what are today called cloud computing platforms. Again, these increase by 1 to 3 orders of magnitude the size of data that we can model, and again they have been heavily criticized by some in the database community as a big step backwards.

In 2009, I edited a special issue of the ACM SIGKDD Explorations about analytic infrastructure. In an article there, I define analytic infrastructure as the applications, services, utilities and systems that are used for either preparing data for modeling, estimating models, validating models, scoring, or related analytic activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling; it simply needs to be useful as part of the modeling process. The article is available as a pdf from the SIGKDD Explorations web site (it’s Issue 1 in Volume 11).

I don’t really like this definition and encourage you to provide a better one. What is important though is that using the appropriate analytic infrastructure is critical to building models for problems with so much data that simply putting it into memory and forgetting about it is not a viable solution.

Analytic strategy. Returning to how my perspective has evolved: over the past several years, I have become increasingly concerned with what is usually called analytic strategy. Analytic strategy is concerned with making sure that you are asking the right analytic question, that you are building a model that can be deployed efficiently, that the output of the model is actionable, that the actions have a business impact, that the business impact is aligned with corporate strategy, that there is an appropriate governance process in place, and related questions.

My perspective these days is that analytics requires a firm foundation and that the foundation has three columns: 1) analytic strategy; 2) analytic infrastructure; and 3) analytic algorithms and models.

This is a slightly updated version of a post from February 16, 2010.
