Small, Medium and Big Data

What is big data? From the point of view of the infrastructure required to do analytics, data comes in three sizes:

  • Small data. Small data fits into the memory of a single machine. A good example of a small dataset is the dataset for the Netflix Prize, which consists of over 100 million movie ratings by 480 thousand randomly chosen, anonymous Netflix customers who rated over 17 thousand movie titles. This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop.
  • Medium data. A good working definition is that data is medium size if it fits on a single disk or disk array and can be managed by a database. It is becoming common today for companies and organizations to create data warehouses of 10 to 100 TB or larger, so medium size data can grow quite large.
  • Big data. Big data is so large that it is challenging to manage in a database, so specialized systems are used instead. The most popular such system these days is Hadoop, although I expect we will have more choices in a few years. What have become known as NoSQL databases can also be used to manage big datasets.
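
As a rough illustration of the three sizes above, here is a minimal Python sketch that labels a dataset by comparing its size to the memory of one machine and to the capacity of one disk array. The function name and the default thresholds (16 GB of memory, 100 TB of storage) are my own assumptions for the example, not fixed definitions; in practice the boundaries move as hardware improves.

```python
# Illustrative units and thresholds only; these numbers are assumptions
# for the example and will shift as hardware changes.
GB = 10**9
TB = 10**12


def classify_data_size(dataset_bytes, memory_bytes=16 * GB, storage_bytes=100 * TB):
    """Return 'small', 'medium' or 'big' for a dataset of the given size."""
    if dataset_bytes <= memory_bytes:
        return "small"   # fits into the memory of a single machine
    if dataset_bytes <= storage_bytes:
        return "medium"  # fits on a single disk or disk array
    return "big"         # calls for a specialized system


print(classify_data_size(2 * GB))     # the Netflix Prize dataset
print(classify_data_size(50 * TB))    # a typical data warehouse
print(classify_data_size(5000 * TB))  # a hypothetical multi-petabyte collection
```

This sketch only looks at raw size. The working definitions above also depend on whether a database can manage the data, which a size check alone does not capture.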

There have always been large datasets, but until about 2000, most large datasets were produced by the scientific and defense communities. For example, the Large Hadron Collider (LHC) will produce a large dataset.

Two things have changed during the last decade. First, large datasets are now produced by a third community: companies that provide Internet services, such as search, online advertising and social media. Second, the ability to analyze these datasets is critical for the advertising systems that produce the bulk of the revenue for these companies. This provides a metric (dollars of online revenue produced) by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, companies such as Google settled upon analytic infrastructure that is quite different from the grid-based infrastructure generally used by the scientific community.

This is an update of a post that I originally wrote in 2009 and that is no longer available.
