As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center. The stack consists of a storage service (the Google File System (GFS)), a compute service based upon MapReduce, and a table service (BigTable).
Although the Google stack of services is not directly available, the open source Hadoop system, which has a broadly similar architecture, is available.
The Google stack, consisting of GFS/MapReduce/Bigtable, and the Hadoop system, consisting of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce, are examples of clouds designed for data intensive computing — these types of clouds provide computing capacity on demand, with capacity scaling all the way up to the size of a data center.
There are still many open questions about how best to design clouds for data intensive computing. During the best several years, I have been involved with a cloud designed for data intensive computing called Sector. The lead developer of Sector is Yunhong Gu of the University of Illinois at Chicago. Sector was developed independently of Hadoop and the Google cloud services and makes several different design choices (see the table below).
To quantify the impact of some of these choices, I have been involved with the development of a benchmark for data intensive computing called MalStone. I will talk more about MalStone in a future post, but briefly, MalStone is a stylized analytic computing that can be done simply using MapReduce, as well as variants and generalizations of MapReduce. The open source MalStone code comes with a generator of synthetic records and one benchmark (called MalStone B) generates 10 billion 100-byte records (similar to terasort).
MalStone B Benchmarks
| System | Time (min) |
|---|---|
| Hadoop MapReduce | 799 min |
| Hadoop Streaming with Python | 143 min |
| Sector | 44 min |
Tests were done using 20 nodes on the Open Cloud Testbed. Each node contained 500 million 100-byte records.
Comparing Sector and Hadoop
| Hadoop | Sector | |
|---|---|---|
| Storage cloud | block-based file system | file-based |
| Programming model | MapReduce | user defined functions and MapReduce |
| Protocol | TCP | UDP |
| Security | NA | HIPAA capable |
| Replication | at time of writing | periodically |
| Language | Java | C++ |
I’ll be giving a talk on Sector at CloudSlam ‘09 on Monday, April 20, 2009 at 4pm ET. CloudSlam is a virtual conference, so that it is easy to listen to any of the talks that interest you.
#1 by lszyba1 on April 28, 2009 - 11:58 am
I wonder if you can add a comparison of hadoop running on kosmosfs which is written in C++?
http://www.linux-magazine.com/w3/issue/90/048-051_kosmos.pdf
Thanks,
Lucas
#2 by rgrossman on April 28, 2009 - 9:34 pm
MalStone tests with KFS are in progress. I’ll post a note when the technical report is available.
–Bob