cloud computing

The Data Center as the Unit of Computing

I’m at the KDD 2010 conference this week in Washington, D.C. On Sunday, I gave the keynote at the 2nd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010), one of the workshops co-located with the conference. The title of my talk was “My Other Computer is a Data Center: The Sector Perspective on Big Data.” You can download the talk from Slideshare.

The first part of the talk argued that it may be useful to think of a data center as a “device” for extracting relationships from data, in broadly the same way that we view a telescope as a device for looking at things that are very far away and a microscope as a device for looking at things that are very small. Continuing in this way, you can think of a supercomputer as a device for computing simulations.

The table below is my rough “back of the envelope” computation of the scale up provided by each of these devices over what was possible before (these scale up numbers are very rough and if you have better numbers, please let me know).

In each of these cases, the device resulted in some pretty interesting new science. So it is interesting to speculate what type of new science might arise when you think of a data center as a device for extracting patterns from very large collections of data.

In the second part of the talk, I described at a very high level some of the components and layers in a software stack for a data center device.

Instrument Year Scale up
Telescope 1609 30x
Microscope 1670 250x
Supercomputing 1976 10x-100x
Data center 2003 10x-100x

Open Source Cloud Computing Software at SC 09

SC 09 is in Portland this coming week from November 14 to 20. The Laboratory for Advanced Computing will have a booth and be showcasing a number of open source cloud computing technologies including:

Sector. Sector/Sphere is a high performance storage and compute cloud that scales to wide area networks. With Sector’s simplified parallel programming framework, you can easily apply a user defined function (UDF) to datasets that fill data centers. The current version of Sector is version 1.24 and includes support for streams and multiple master servers. Sector was the basis for an application that won the SC 08 Bandwidth Challenge. For more information, see sector.sourceforge.net.

As measured by the MalStone Benchmark, Sector was over 2x as fast as Hadoop. Sector was one of six technologies selected by SC 09 as a disruptive technology.
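To give a feel for the UDF idea, here is a minimal sketch in Python. It is not Sector/Sphere's API: the function names `apply_udf` and `count_records` are made up for illustration, the "dataset" is a short list of in-memory byte strings, and a local thread pool stands in for the nodes of a cluster. The only point is the pattern of handing each data segment to the same user defined function and collecting the results.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def apply_udf(segments: Iterable[bytes],
              udf: Callable[[bytes], int],
              workers: int = 4) -> List[int]:
    """Apply the same UDF to every segment in parallel.

    Results come back in the same order as the input segments.
    A thread pool is only a stand-in for real cluster nodes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(udf, segments))


def count_records(segment: bytes) -> int:
    """Hypothetical UDF: count newline-terminated records in one segment."""
    return segment.count(b"\n")


if __name__ == "__main__":
    segments = [b"a\nb\n", b"c\n", b"", b"d\ne\nf\n"]
    per_segment = apply_udf(segments, count_records)
    print(per_segment, sum(per_segment))
```

In a real large data cloud the segments would live on different nodes and the function would be shipped to the data rather than the other way around; that part is not modeled here.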

How efficient is your cloud?

This snapshot is from the LAC Cloud Monitor monitoring a Sector computation on the Open Cloud Testbed.

Cistrack. The Chicago Utilities for Biological Science or CUBioS is a set of integrated utilities for managing, processing, analyzing and sharing biological data. CUBioS integrates databases with cloud computing to provide an infrastructure that scales to high throughput sequencing platforms. CUBioS uses the Sector/Sphere cloud to process images produced by high throughput sequencing platforms. Cistrack is a CUBioS instance for cis-regulatory data. For more information, see www.cistrack.org.

Canopy. With clouds, it is now possible with a portal to create, monitor, and migrate Virtual Machines (VMs). With the open source Canopy application, it is now possible to create, monitor and migrate Virtual Networks (VNs) containing multiple VMs connected with virtualized network infrastructure. Canopy provides a standardized library of functions to programmatically control switch VLAN assignments to create VNs at line speed. Canopy is an open source project with an alpha release planned for 2010.

UDT. UDT is a widely deployed (with millions of deployed instances) application level network transport protocol designed for large data transfers over wide area high performance networks. For more information, see udt.sourceforge.net.

UDX. UDX is a version of UDT that is designed for wide area high performance research and corporate networks within a single security domain (UDX does not contain the code UDT uses for traversing firewalls). In recent tests, UDX was able to achieve over 9.2 Gbps on a 10 Gbps wide area testbed. For more information, see udt.sourceforge.net.

LAC Cloud Monitor (LACCM). The LAC Cloud Monitor is a low overhead monitor for clouds that gathers system performance for thousands of servers along multiple dimensions. It integrates with the Argus Monitoring System and Nagios for logging and alerting. LACCM is used to monitor the OCC Open Cloud Testbed. LACCM is open source.

LAC Cloud Scheduler (LACCS). The LAC Cloud Scheduler is a system for scheduling clouds for exclusive use by researchers. It is simple to use, scalable, and easy to deploy. Using LACCS, multiple groups can easily share a local or wide area cloud. LACCS is used for scheduling the Open Cloud Testbed. LACCS is open source.

This is a segment about cloud computing that aired on WTTW’s Chicago Matters and described Sector/Sphere and the Open Cloud Testbed. You need to select the episode on the right hand side of the page dated November 10, 2009 and titled “Chicago Matters Beyond Burnham (9:40).”

What is the “Unit” of Cloud Computing? Virtual Machines, Virtual Networks, and Virtual Data Centers

This is a post that summarizes some conversations that Stuart Bailey (from Infoblox) and I have been having.

There is a lot of market clutter today about cloud computing and it can be challenging at times to identify the core technical issues. Sometimes it is helpful with an emerging technology to ask the question: “What is the ‘unit’ of deployment for the technology?” There are two important related questions: “How are the units named?” “How do the units communicate?”

Sometimes the perspective matters.

Before we think about the answers for cloud computing, let’s warm up with some other examples.

  • For the web, the “unit” is the web page, web pages are identified by URLs (or URIs), and the units “communicate” using HTTP and related protocols. Of course, web pages aggregate into web sites.
  • In networking, the “unit” is the IP address (at Layer 3) or the MAC address (at Layer 2). DNS is the link between domain names (as used in URLs) and IP addresses (allowing them to communicate), while ARP (or NDP in IPv6) is the link between MAC addresses and IP addresses.
  • In grid computing, the “unit” is a computer in a cluster (“a grid resource”) and computers communicate using the Message Passing Interface (MPI).

Depending upon your perspective and your role in the cloud computing eco-system, you could argue that any of the following are the units:

Infrastructure Perspective

  • A virtual machine (VM).
  • A virtual network (VN), consisting of multiple VMs and all required information to network the VMs.
  • A virtual data center (VDC), consisting of one or more VNs.

Data/Content/Resource Perspective

  • An identifier specifying the name of a resource for a cloud storage service. Examples include an object managed by Amazon’s S3 service, or a file managed by the Hadoop Distributed File System (HDFS).
  • An identifier specifying the name of a data resource for a cloud data service. Examples include a domain (database table) managed by Amazon’s SimpleDB service or a table (or row) managed by a BigTable-like service.

Once we take this point of view, a number of issues become much easier to discuss.

Intercloud Protocols. Today with clouds, we are in the same situation that networking was in before Internet protocols enabled internetworking by supporting communication between networks. Until TCP and related Internet protocols were developed, there were no agreed-upon standards for identifying the appropriate entities and layers or for passing names of entities between layers. We can ask what the appropriate mechanisms are for naming VMs, VNs and VDCs, as well as cloud and table services; how we pass the names of objects between layers; and how the objects in the infrastructure stack communicate with objects in the data stack.

Virtual networks also count. Most of the cloud virtualization discussion today focuses on VMs and their migration, but it is just as essential to support VNs and their migration. If we look to how IP addresses arose, then it is tempting to think about using names for VMs that include information about VNs. Today, depending upon the units we feel are important, we will need layers in the cloud for naming and linking VMs, VNs and VDCs, not just VMs.
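As a rough illustration of names for VMs that carry information about their VNs, here is a small Python sketch. Everything in it is an assumption made for the example: the class names, the dotted `vm.vn.vdc` format (chosen by loose analogy with DNS names), and the sample identifiers. It is one possible scheme, not a proposal for a standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VirtualMachine:
    name: str


@dataclass
class VirtualNetwork:
    name: str
    vms: List[VirtualMachine] = field(default_factory=list)


@dataclass
class VirtualDataCenter:
    name: str
    vns: List[VirtualNetwork] = field(default_factory=list)

    def qualified_vm_names(self) -> List[str]:
        """Return hypothetical hierarchical names of the form vm.vn.vdc."""
        return [
            f"{vm.name}.{vn.name}.{self.name}"
            for vn in self.vns
            for vm in vn.vms
        ]


if __name__ == "__main__":
    vdc = VirtualDataCenter("vdc1", [
        VirtualNetwork("vn-a", [VirtualMachine("vm1"), VirtualMachine("vm2")]),
        VirtualNetwork("vn-b", [VirtualMachine("vm1")]),
    ])
    for name in vdc.qualified_vm_names():
        print(name)
```

With a scheme like this, two VMs can share the short name `vm1` as long as they sit in different VNs, and migrating a whole VN means changing only the VDC part of each name.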

Removing the distinction between clouds and large data clouds. There are two fundamentally different approaches to cloud services for storage or data. In the first, there is an implicit assumption that the storage or data service must fit in a single VM (S3) or other device (such as NAS). In the second, the whole point is to develop cloud storage and data services that span multiple VMs and devices (Google’s GFS/MapReduce/BigTable, Hadoop HDFS/MapReduce, Sector Distributed File System/Sphere UDFs, etc.).

Services that link virtual infrastructure and data. In many discussions, no effort is made to connect the entities of the virtual infrastructure perspective (VMs, VNs) with the data perspective. One simple approach is to provide a dynamic infrastructure service so that data/content/resource services could easily determine which VMs and VNs support their service (this is usually done with static configuration files today). With this approach, large data cloud services are simply data/content/resource services that are engineered to scale to multiple VMs (and perhaps VNs).
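A dynamic infrastructure service of this kind could be as simple as a directory that VMs register with when they start serving a data service. The Python sketch below is hypothetical throughout: the class `InfrastructureDirectory`, its method names, and the sample service and VM names are invented for illustration, and a real service would need persistence, authentication, and failure detection.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class InfrastructureDirectory:
    """Hypothetical in-memory directory: data service -> (VN, VM) pairs backing it."""

    def __init__(self) -> None:
        self._backing: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def register(self, service: str, vn: str, vm: str) -> None:
        """Record that a VM in a given VN now supports the service."""
        self._backing[service].add((vn, vm))

    def deregister(self, service: str, vn: str, vm: str) -> None:
        """Remove a VM, for example after it is shut down or migrated."""
        self._backing[service].discard((vn, vm))

    def vms_for(self, service: str) -> List[str]:
        """Sorted VM names currently supporting the service."""
        return sorted(vm for _, vm in self._backing.get(service, set()))

    def vns_for(self, service: str) -> List[str]:
        """Sorted, de-duplicated VN names currently supporting the service."""
        return sorted({vn for vn, _ in self._backing.get(service, set())})


if __name__ == "__main__":
    directory = InfrastructureDirectory()
    directory.register("storage", "vn-a", "vm1")
    directory.register("storage", "vn-a", "vm2")
    directory.register("storage", "vn-b", "vm3")
    print(directory.vms_for("storage"), directory.vns_for("storage"))
```

Compared with a static configuration file, the lookup reflects whatever is registered at the moment of the query, which is what a service spanning a changing set of VMs needs.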

Scaling services to data centers. One attribute that I think is core to certain types of clouds is the ability of a service to scale beyond a single machine or VM to an entire data center or VDC. Defining these types of scalable services is relatively easy to do from the perspective here.

Acknowledgements: The photograph is from the Flickr photostream of bourget_82 and was posted with an Attribution-No Derivative Works 2.0 Generic Creative Commons License.

Building Your Own Large Data Clouds (Raywulf Clusters)

We recently added four new racks to the Open Cloud Testbed. The racks are designed to support cloud computing, both clouds that support on demand VMs as well as those that support data intensive computing. Since there is not a lot of information available describing how to put together these types of clouds, I thought I would share how we configured our racks.

These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out.  Photograph by Michal Sabala.

These racks can be used as a basis for private clouds, hybrid clouds, or condo clouds.

There is a lot of information about building Beowulf clusters, which are designed for compute intensive computing. Here is one of the first tutorials and some more recent information.

In contrast, our racks are designed to support data intensive computing. We sometimes call these Raywulf clusters. Briefly, the goal is to make sure that there are enough spindles moving data in parallel with enough cores to process the data being moved. (Our data intensive middleware is called Sector, Graywulf is already taken, and there are not many words that rhyme with Beo- left. Other suggestions are welcome. Please use the comments below.)

The racks cost about $85,000 (with standard discounts), consist of 32 nodes and 124 cores with 496 GB of RAM, 124 TB of disk & 124 spindles, and consume about 10.3 kW of power (excluding the power required for cooling).

With 3x replication, there is about 40 TB of usable storage available, which means that the cost to provide balanced long term storage and compute power is about $2,000 per TB. So, for example, a single rack could be used as a basis for a private cloud that can manage and analyze approximately 40 TB of data. At the end of this note is some performance information about a single rack system.
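The arithmetic behind these figures is easy to redo for other prices or replication factors. The short Python sketch below uses only the numbers quoted above (an $85,000 rack, 124 TB of raw disk, 3x replication); the function and constant names are my own.

```python
RACK_COST_USD = 85_000     # approximate rack cost quoted above
RAW_DISK_TB = 124          # raw disk per rack
REPLICATION_FACTOR = 3     # each block stored three times


def usable_tb(raw_tb: float, replicas: int) -> float:
    """Usable capacity once every block is stored `replicas` times."""
    return raw_tb / replicas


def cost_per_usable_tb(cost_usd: float, raw_tb: float, replicas: int) -> float:
    """Rack cost divided by usable (post-replication) capacity."""
    return cost_usd / usable_tb(raw_tb, replicas)


if __name__ == "__main__":
    capacity = usable_tb(RAW_DISK_TB, REPLICATION_FACTOR)
    unit_cost = cost_per_usable_tb(RACK_COST_USD, RAW_DISK_TB, REPLICATION_FACTOR)
    print(f"usable: {capacity:.1f} TB, cost: ${unit_cost:,.0f} per usable TB")
```

The unrounded results are a little over 41 TB and a little over $2,000 per TB, consistent with the rounded numbers in the text.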

Each rack is a standard 42U computer rack and consists of a head node and 31 compute/storage nodes. We installed Debian GNU/Linux 5.0 as the operating system. Here is the configuration of the rack and of the compute/storage nodes.

In contrast, there are specialized configurations, such as those designed by Backblaze, that provide 67 TB for $8,000. This is about 1/2 the storage for about 1/10 the cost. The difference is that Raywulf clusters are designed for data intensive computing using middleware such as Hadoop and Sector/Sphere, not just storage.

Rack Configuration

  • 31 compute/storage nodes (see below)
  • 1 head node (see below)
  • 2 Force10 S50N switches, with two 10 Gbps uplinks so that the inter-rack bandwidth is 20 Gbps
  • 1 10GE module
  • 2 optics and stacking modules
  • 1 3Com Baseline 2250 switch to provide additional cat5 ports for IPMI management interfaces
  • cabling

Compute/storage node.

  • Intel Xeon 5410 Quad Core CPU with 16GB of RAM
  • SATA RAID controller
  • four (4) SATA 1TB hard drives in RAID-0 configuration
  • 1 Gbps NIC
  • IPMI management

Benchmarks. We benchmarked these new racks using the Terasort Benchmark with version 0.20.1 of Hadoop and version 1.24a of Sector/Sphere. Replication was turned off in both Hadoop and Sector. All the racks were located within one data center. It is clear from these tests that the new versions of Hadoop and Sector/Sphere are both faster than the previous versions.

Configuration Sector/Sphere Hadoop
1 rack (32 nodes) 28m 25s 85m 49s
2 racks (64 nodes) 15m 20s 37m 0s
3 racks (96 nodes) 10m 19s 24m 14s
4 racks (128 nodes) 7m 56s 17m 45s
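To make the table easier to compare, the Python sketch below converts the times to seconds and computes, for each configuration, the ratio of Hadoop time to Sector/Sphere time. It also computes the speedup of each system relative to its own one-rack time; that second number is only meaningful if the same amount of data was sorted in every configuration, which is an assumption on my part. The function names are made up for this example.

```python
# Terasort times copied from the table above: racks -> (Sector/Sphere, Hadoop).
TERASORT = {
    1: ("28m 25s", "85m 49s"),
    2: ("15m 20s", "37m 0s"),
    3: ("10m 19s", "24m 14s"),
    4: ("7m 56s", "17m 45s"),
}


def to_seconds(text: str) -> int:
    """Convert a string such as '28m 25s' to seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def hadoop_to_sector_ratio(racks: int) -> float:
    """How many times longer Hadoop took than Sector/Sphere."""
    sector, hadoop = TERASORT[racks]
    return to_seconds(hadoop) / to_seconds(sector)


def speedup_vs_one_rack(racks: int, system: int) -> float:
    """Speedup over the one-rack run; system is 0 for Sector/Sphere, 1 for Hadoop.

    Assumes the same input size in every configuration (not confirmed).
    """
    return to_seconds(TERASORT[1][system]) / to_seconds(TERASORT[racks][system])


if __name__ == "__main__":
    for racks in sorted(TERASORT):
        print(f"{racks} rack(s): "
              f"Hadoop/Sector = {hadoop_to_sector_ratio(racks):.2f}, "
              f"Sector speedup = {speedup_vs_one_rack(racks, 0):.2f}, "
              f"Hadoop speedup = {speedup_vs_one_rack(racks, 1):.2f}")
```

On these figures, Sector/Sphere finished between about 2.2 and 3.0 times faster than Hadoop in every configuration.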

The Raywulf clusters were designed by Michal Sabala and Yunhong Gu of the National Center for Data Mining at the University of Illinois at Chicago.

We are working on putting together more information on how to build a Raywulf cluster.

Sector/Sphere and our Raywulf Clusters were selected as one of the Disruptive Technologies that will be highlighted at SC 09.

The photograph above of two racks from the Open Cloud Testbed was taken by Michal Sabala.

Revisiting the Case for Cloud Computing

The backlash to the hype over cloud computing is in full swing. I have given a number of talks on cloud computing over the past few months and have been struck by a few things.

First, at an industry event that I attended, although there were quite a few talks on cloud computing (it was one of the tracks), it seemed that only a small number of speakers had actually participated in a cloud computing project, and I was one of only a handful who had actually completed several cloud computing projects. Many of the other speakers were simply summarizing second and third hand reports about cloud computing. In my opinion, something was lost in the translation.

Rack of servers

Second, I think some of the backlash has gone too far. At one breakfast meeting I attended, there was essentially no acknowledgement of the potential that clouds offer today, simply emphasis on why “real companies” that have to worry about security could never use (public) clouds. Private and condo clouds were not mentioned as alternatives for companies whose security or compliance requirements preclude the use of today’s public clouds. Also not mentioned was the trade-off, which is always present, between the risk of breaches from performing certain operations in public clouds and the productivity gains that such clouds can provide.

Because of this backlash, I think it is a good time to revisit the case for cloud computing. There are three basic reasons for deploying certain operations to clouds:

Cost savings. By employing virtualization and making use of the economies of scale that cloud service providers can take advantage of, deploying certain operations to clouds can lead to improved efficiencies. This advantage seems to be well understood, and is, for example, one of the factors driving the Federal CIO’s push for cloud computing. See for example, the recent RFQ from the GSA for a cloud computing store front.

Productivity. The elastic, virtualized services that clouds provide lead directly to productivity improvements. As a simple example, I was building an analytic model over the weekend to meet a deadline and the computation took over 4 hours. Since I was using a virtualized resource in a cloud, I was able to use the portal that controlled the various machine images to double the memory in my resource. Five minutes later, I had a new virtualized image and the computation now took less than 5 minutes. (By the way, this is typical of analytic computations. When the data is so large that a computation can no longer be done in memory and requires accessing the disk, the time required increases dramatically.) If, instead, I had gone through a standard procurement process to get a new machine with twice the memory, it would have been quite some time before the model would have been completed.

As another example, I work with a Fortune 500 client in which the analytic models are taking weeks to build instead of days. The modeling environment has neither enough disk space for the entire team to hold all the temporary files and datasets required when building analytic models, nor computers powerful enough to compute models fast enough to provide timely feedback to the modeler. This is unfortunately fairly typical of modeling environments in Fortune 500 companies (I’ll discuss this situation in a later post). A simple cloud would dramatically improve the situation.

New capabilities. Clouds also provide new capabilities. For example, large data clouds enable the processing and analysis of large datasets that was simply not possible with architectures that manage the data using databases. As a simple example, the type of analytic computations abstracted by the MalStone Benchmark are relatively straightforward using a Hadoop or Sector based cloud, even when there are 100 TB of data, but are not practical using a traditional database when the data is that size.

What’s new. Many of the ideas behind cloud computing are quite old. On the other hand, the combination of 1) the scale, 2) the utility based pricing, and 3) the simplicity provided by cloud computing makes cloud computing a disruptive technology. If you are interested in understanding cloud computing from this point of view, you might find a recent talk I gave for an IEEE Conference on New Technologies called My Other Computer is a Data Center interesting. There is also a written version of a portion of the talk that recently appeared in the IEEE Bulletin on Data Engineering called On the Varieties of Clouds for Data Intensive Computing.

The image is by John Seb and is available from Flickr under the Creative Commons license.
