<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robert Grossman &#187; cloud computing</title>
	<atom:link href="http://rgrossman.com/category/blog/cloud-computing/feed/" rel="self" type="application/rss+xml" />
	<link>http://rgrossman.com</link>
	<description>analytics, analytic strategy and analytic infrastructure</description>
	<lastBuildDate>Wed, 28 Jul 2010 02:49:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The Data Center as the Unit of Computing</title>
		<link>http://rgrossman.com/2010/07/27/the-data-center-as-the-unit-of-computing/</link>
		<comments>http://rgrossman.com/2010/07/27/the-data-center-as-the-unit-of-computing/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 02:49:33 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=672</guid>
		<description><![CDATA[I&#8217;m at the KDD 2010 conference this week in Washington, D.C.  On Sunday, I gave the keynote at the 2nd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010), which was one of the workshops co-located with the conference.  The title of my talk was &#8220;My Other Computer is a Data [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m at the <a href="http://www.kdd.org/kdd2010/">KDD 2010</a> conference this week in Washington, D.C.  On Sunday, I gave the keynote at the 2nd Workshop on Large-scale Data Mining: Theory and Applications (<a href="http://arnetminer.org/LDMTA2010">LDMTA 2010</a>), which was one of the workshops co-located with the conference.  The title of my talk was &#8220;My Other Computer is a Data Center: The Sector Perspective on Big Data.&#8221;  You can download the talk from <a href="http://www.slideshare.net/rgrossman/my-other-computer-is-a-data-center-the-sector-perspective-on-big-data">Slideshare</a>.</p>
<p><img src="http://rgrossman.com/files/2010/07/kdd10.png" alt="KDD 2010" title="KDD 2010" width="542" height="115" class="alignleft size-full wp-image-681" /></p>
<p>The first part of the talk argued that it may be useful to think of a data center as a &#8220;device&#8221; for extracting relationships from data, in broadly the same way that we view a telescope as a device for looking at things that are very far away and a microscope as a device for looking at things that are very small.   Continuing in this way, you can think of a supercomputer as a device for computing simulations.  </p>
<p>The table below is my rough &#8220;back of the envelope&#8221; computation of the scale up provided by each of these devices over what was possible before (these scale up numbers are very rough and if you have better numbers, please let me know).</p>
<p>In each of these cases, the device resulted in some pretty interesting new science.  So it is interesting to speculate what type of new science might arise when you think of a data center as a device for extracting patterns from very large collections of data.</p>
<p>In the second part of the talk, I described at a very high level some of the components and layers in a software stack for a data center device.</p>
<table border="1">
<tr>
<td><b>Instrument</b></td>
<td><b>Year</b></td>
<td><b>Scale up</b></td>
</tr>
<tr>
<td>Telescope</td>
<td>1609</td>
<td>30x</td>
</tr>
<tr>
<td>Microscope</td>
<td>1670</td>
<td>250x</td>
</tr>
<tr>
<td>Supercomputing</td>
<td>1976</td>
<td>10x-100x</td>
</tr>
<tr>
<td>Data center</td>
<td>2003</td>
<td>10x-100x</td>
</tr>
</table>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2010/07/27/the-data-center-as-the-unit-of-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Source Cloud Computing Software at SC 09</title>
		<link>http://rgrossman.com/2009/11/11/cloud-computing-at-sc-09-from-la/</link>
		<comments>http://rgrossman.com/2009/11/11/cloud-computing-at-sc-09-from-la/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 17:42:22 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[Canopy]]></category>
		<category><![CDATA[LAC Cloud Monitor]]></category>
		<category><![CDATA[LAC Cloud Scheduler]]></category>
		<category><![CDATA[Sector/Sphere]]></category>
		<category><![CDATA[UDT]]></category>
		<category><![CDATA[UDX]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=500</guid>
		<description><![CDATA[SC 09 is in Portland this coming week from November 14 to 20.   The Laboratory for Advanced Computing will have a booth and be showcasing a number of open source cloud computing technologies including:
Sector.  Sector/Sphere is a high performance storage and compute cloud that scales to wide area networks.  With Sector&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://sc09.supercomputing.org/">SC 09</a> is in Portland this coming week from November 14 to 20.   The Laboratory for Advanced Computing will have a booth and be showcasing a number of open source cloud computing technologies including:</p>
<p><b>Sector.</b>  Sector/Sphere is a high performance storage and compute cloud that scales to wide area networks.  With Sector&#8217;s simplified parallel programming framework, you can easily apply a user defined function (UDF) to datasets that fill data centers.   The current version of Sector is version 1.24 and includes support for streams and multiple master servers.  Sector was the basis for an application that won the SC 08 Bandwidth Challenge.   For more information, see <a href="http://sector.sf.net">sector.sourceforge.net</a>.</p>
<p>As measured by the <a href="http://blog.rgrossman.com/2009/05/25/malstone-benchmark/">MalStone Benchmark</a>, Sector was over 2x as fast as <a href="http://hadoop.apache.org/">Hadoop</a>.   Sector was one of six technologies selected by SC 09 as a <a href="http://sc09.supercomputing.org/?pg=disrupttech.html">disruptive technology</a>.</p>
<div id="attachment_511" class="wp-caption alignleft" style="width: 310px"><img src="http://rgrossman.files.wordpress.com/2009/11/sector-snapshot.jpg?w=300" alt="How efficient is your cloud?" title="How efficient is your cloud?" width="300" height="232" class="size-medium wp-image-511" /><p class="wp-caption-text">This snapshot is from the LAC Cloud Monitor monitoring a Sector computation on the Open Cloud Testbed.</p></div>
<p><b>Cistrack. </b>  The Chicago Utilities for Biological Science or CUBioS is a set of integrated utilities for managing, processing, analyzing and sharing biological data.  CUBioS integrates databases with cloud computing to provide an infrastructure that scales to high throughput sequencing platforms. CUBioS uses the Sector/Sphere cloud to process images produced by high throughput sequencing platforms.  Cistrack is a CUBioS instance for cis-regulatory data.  For more information, see <a href="http://www.cistrack.org">www.cistrack.org</a>.</p>
<p><b>Canopy.</b>  With clouds, it is now possible with a portal to create, monitor, and migrate Virtual Machines (VMs).  With the open source Canopy application, it is now possible to create, monitor and migrate Virtual Networks (VNs) containing multiple VMs connected with virtualized network infrastructure.  Canopy provides a standardized library of functions to programmatically control switch VLAN assignments to create VNs at line speed.  Canopy is an open source project with an alpha release planned for 2010.</p>
<p><b>UDT.</b>   UDT is a widely deployed (with millions of deployed instances) application level network transport protocol designed for large data transfers over wide area high performance networks.  For more information, see  <a href="http://udt.sf.net">udt.sourceforge.net</a>.</p>
<p><b>UDX.</b> UDX is a version of UDT that is designed for wide area high performance research and corporate networks within a single security domain (UDX does not contain the code UDT uses for traversing firewalls).  In recent tests, UDX was able to achieve over 9.2 Gbps on a 10 Gbps wide area testbed.  For more information, see <a href="http://udt.sf.net">udt.sourceforge.net</a>.</p>
<p><b>LAC Cloud Monitor (LACCM).</b>   The LAC Cloud Monitor is a low overhead monitor for clouds that gathers system performance data for thousands of servers along multiple dimensions.  It integrates with the Argus Monitoring System and Nagios for logging and alerting.  LACCM is used to monitor the OCC <a href="http://www.opencloudconsortium.org/testbed.html">Open Cloud Testbed</a>.   LACCM is open source.</p>
<p><b>LAC Cloud Scheduler (LACCS).</b>   The LAC Cloud Scheduler (LACCS) is a system for scheduling clouds for exclusive use by researchers.  It is simple to use, scalable, and easy to deploy.  Using LACCS, multiple groups can easily share a local or wide area cloud.  LACCS is used for scheduling the Open Cloud Testbed.   LACCS is open source.</p>
<p>This is a <a href="http://www.wttw.com/main.taf?p=42,8,80">segment</a> that aired on WTTW&#8217;s Chicago Matters about cloud computing and described Sector/Sphere and the Open Cloud Testbed.   You need to select the episode on the right hand side of the page dated November 10, 2009 and titled &#8220;Chicago Matters Beyond Burnham (9:40)&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/11/11/cloud-computing-at-sc-09-from-la/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>What is the &#8220;Unit&#8221; of Cloud Computing?  Virtual Machines, Virtual Networks, and Virtual Data Centers</title>
		<link>http://rgrossman.com/2009/10/21/cloud-computing-units/</link>
		<comments>http://rgrossman.com/2009/10/21/cloud-computing-units/#comments</comments>
		<pubDate>Wed, 21 Oct 2009 20:04:26 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[cloud standards]]></category>
		<category><![CDATA[layers]]></category>
		<category><![CDATA[VDC]]></category>
		<category><![CDATA[virtual data centers]]></category>
		<category><![CDATA[virtual machines]]></category>
		<category><![CDATA[virtual networks]]></category>
		<category><![CDATA[VM]]></category>
		<category><![CDATA[VN]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=475</guid>
		<description><![CDATA[This is a post that summarizes some conversations that Stuart Bailey (from Infoblox) and I have been having.
There is a lot of market clutter today about cloud computing and it can be challenging at times to identify the core technical issues.  Sometimes it is helpful with an emerging technology to ask the question: &#8220;What [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post that summarizes some conversations that Stuart Bailey (from <a href="http://www.infoblox.com">Infoblox</a>) and I have been having.</p>
<p>There is a lot of market clutter today about cloud computing and it can be challenging at times to identify the core technical issues.  Sometimes it is helpful with an emerging technology to ask the question: &#8220;What is the &#8216;unit&#8217; of deployment for the technology?&#8221;    There are two important related questions: &#8220;How are the units named?&#8221;   &#8220;How do the units communicate?&#8221;</p>
<div id="attachment_488" class="wp-caption alignleft" style="width: 310px"><img src="http://rgrossman.files.wordpress.com/2009/10/spiral_staircase.jpg?w=300" alt="Sometimes the perspective matters." title="Perspective Counts" width="300" height="199" class="size-medium wp-image-488" /><p class="wp-caption-text">Sometimes the perspective matters.</p></div>
<p>Before we think about the answers for cloud computing, let&#8217;s warm up with some other examples.</p>
<ul>
<li>For the web, the &#8220;unit&#8221; is the web page, web pages are identified by URLs (or URIs), and the units &#8220;communicate&#8221; using HTTP and related protocols.    Of course, web pages aggregate into web sites.</li>
<li>In networking, the &#8220;unit&#8221; is the IP address (at Layer 3) or the MAC address (at Layer 2) and DNS is the link between domain names and IP addresses (allowing them to communicate), while ARP (or NDP in IPv6) is the link between MAC addresses and IP addresses.</li>
<li>In grid computing, the &#8220;unit&#8221; is a computer in a cluster (&#8220;a grid resource&#8221;) and computers communicate using the Message Passing Interface (MPI).</li>
</ul>
<p>Depending upon your perspective and your role in the cloud computing eco-system, you could argue that any of the following are the units:</p>
<p><strong>Infrastructure Perspective</strong></p>
<ul>
<li>A virtual machine (VM).</li>
<li>A virtual network (VN), consisting of multiple VMs and all required information to network the VMs.</li>
<li>A virtual data center (VDC), consisting of one or more VNs.</li>
</ul>
<p><strong>Data/Content/Resource Perspective</strong></p>
<ul>
<li>An identifier specifying the name of a resource for a cloud storage service.   Examples include an object managed by Amazon&#8217;s S3 service, or a file managed by the Hadoop Distributed File System (HDFS).</li>
<li>An identifier specifying the name of a data resource for a cloud data service.  Examples include a domain (database table) managed by Amazon&#8217;s SimpleDB service or a table (or row) managed by a BigTable-like service.</li>
</ul>
<p>Once we take this point of view, a number of issues become much easier to discuss.</p>
<p><b>Intercloud Protocols. </b>  Today with clouds, we are in the same situation that networking was in before Internet protocols enabled internetworking by supporting communication between networks.   Until TCP and related Internet protocols were developed, there were no agreed-upon standards for identifying the appropriate entities and layers or for passing names of entities between layers.   We can ask what the appropriate mechanisms are for naming VMs, VNs and VDCs, as well as cloud storage and table services, how we pass the names of objects between layers, and how the objects in the infrastructure stack communicate with objects in the data stack.</p>
<p><b>Virtual networks also count.</b>  Most of the cloud virtualization discussion today focuses on VMs and their migration, but it is just as essential to support VNs and their migration.   If we look to how IP addresses arose, then it is tempting to think about using names for VMs that include information about VNs.  Today, depending upon the units we feel are important, we will need layers in the cloud for naming and linking VMs, VNs and VDCs, not just VMs.</p>
<p><b>Removing the distinction between clouds and large data clouds. </b> There are two fundamentally different approaches to cloud services for storage or data.  In the first, there is an implicit assumption that the storage or data service must fit in a single VM (S3) or other device (such as NAS).  In the second, the whole point is to develop cloud storage and data services that span multiple VMs and devices (Google&#8217;s GFS/MapReduce/BigTable, Hadoop HDFS/MapReduce, Sector Distributed File System/Sphere UDFs, etc.).</p>
<p><b>Services that link virtual infrastructure and data. </b>  In many discussions, no effort is made to connect the virtual infrastructure perspective entities (VMs, VNs) with the data perspective.   One simple approach is to provide a dynamic infrastructure service so that data/content/resource services could easily determine which VMs and VNs support their service (this is usually done with static configuration files today).   With this approach, large data cloud services are simply data/content/resource services that are engineered to scale to multiple VMs (and perhaps VNs).</p>
<p><b>Scaling services to data centers. </b>  One attribute that I think is core to certain types of clouds is the ability of a service to scale beyond a single machine or VM to an entire data center or VDC.  Defining these types of scalable services is something that is relatively easy to do from the perspective here.</p>
<p><b>Acknowledgements:</b>  The photograph is from the Flickr photostream of <a href="http://www.flickr.com/photos/bourget_82/291349047/">bourget_82</a> and was posted with an Attribution-No Derivative Works 2.0 Generic Creative Commons License.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/10/21/cloud-computing-units/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Building Your Own Large Data Clouds (Raywulf Clusters)</title>
		<link>http://rgrossman.com/2009/09/27/building-your-own-large-data-clouds/</link>
		<comments>http://rgrossman.com/2009/09/27/building-your-own-large-data-clouds/#comments</comments>
		<pubDate>Sun, 27 Sep 2009 16:59:19 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[large data clouds]]></category>
		<category><![CDATA[Sector Sphere]]></category>
		<category><![CDATA[Terasort]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=453</guid>
		<description><![CDATA[We recently added four new racks to the Open Cloud Testbed.  The racks are designed to support cloud computing, both clouds that support on demand VMs as well as those that support data intensive computing.  Since there is not a lot of information available describing how to put together these types of clouds, [...]]]></description>
			<content:encoded><![CDATA[<p>We recently added four new racks to the <a href="http://opencloudconsortium.org/testbed.html">Open Cloud Testbed</a>.  The racks are designed to support cloud computing, both clouds that support on demand VMs as well as those that support data intensive computing.  Since there is not a lot of information available describing how to put together these types of clouds, I thought I would share how we configured our racks.</p>
<div id="attachment_464" class="wp-caption alignleft" style="width: 194px"><img src="http://rgrossman.files.wordpress.com/2009/09/oct-gen2-09.jpg?w=184" alt="These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out.  Photograph by Michal Sabala." title="Two Racks from the Open Cloud Testbed" width="184" height="300" class="size-medium wp-image-464" /><p class="wp-caption-text">These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out.  Photograph by Michal Sabala.</p></div>
<p>These racks can be used as a basis for private clouds, hybrid clouds, or <a href="http://blog.rgrossman.com/2009/06/08/condo-clouds/">condo clouds</a>.</p>
<p>There is a lot of information about building Beowulf clusters, which are designed for compute intensive computing.  Here is one of the first <a href="http://www.cacr.caltech.edu/beowulf/tutorial/building.html">tutorials</a> and some more recent <a href="http://www.beowulf.org">information</a>.</p>
<p>In contrast, our racks are designed to support data intensive computing.  We sometimes call these Raywulf clusters.  Briefly, the goal is to make sure that there are enough spindles moving data in parallel with enough cores to process the data being moved.   (Our data intensive middleware is called Sector, Graywulf is already taken, and there are not many words that rhyme with Beo- left.  Other suggestions are welcome.  Please use the comments below.)</p>
<p>Each rack costs about $85,000 (with standard discounts), consists of 32 nodes and 124 cores with 496 GB of RAM, 124 TB of disk &amp; 124 spindles, and consumes about 10.3 kW of power (excluding the power required for cooling).</p>
<p>With 3x replication, there is about 40 TB of usable storage available, which means that the cost to provide balanced long term storage and compute power is about $2,000 per TB.   So, for example, a single rack could be used as a basis for a private cloud that can manage and analyze approximately 40 TB of data.  At the end of this note is some performance information about a single rack system.</p>
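<p>The arithmetic behind the usable storage and cost-per-TB figures can be checked with a short sketch.  The variable names are illustrative only; the inputs are the rack cost, raw disk capacity and replication factor quoted above, and the sketch assumes that usable capacity is simply raw capacity divided by the replication factor.</p>

```python
# Figures quoted in the post for a single rack.
rack_cost_usd = 85_000
raw_disk_tb = 124
replication_factor = 3  # every block is stored three times

# Assumption: usable capacity is raw capacity divided by the number of replicas.
usable_tb = raw_disk_tb / replication_factor

# Cost of balanced storage plus compute, per usable terabyte.
cost_per_usable_tb = rack_cost_usd / usable_tb

print(f"usable storage: {usable_tb:.1f} TB")
print(f"cost per usable TB: ${cost_per_usable_tb:,.0f}")
```

<p>The results come out slightly above 41 TB and slightly above $2,000 per TB, which is consistent with the rounded figures of about 40 TB and about $2,000 per TB.</p>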
<p>Each rack is a standard 42U computer rack and consists of a head node and 31 compute/storage nodes.  We installed Debian GNU/Linux 5.0 as the operating system.  Here is the configuration of the rack and of the compute/storage nodes.</p>
<p>In contrast, there are specialized <a href="http://blog.backblaze.com/category/cloud-storage/">configurations</a>, such as the one designed by Backblaze, that provide 67TB for $8,000.  This is about 1/2 the storage for about 1/10 the cost.   The difference is that Raywulf clusters are designed for data intensive computing using middleware such as Hadoop and Sector/Sphere, not just storage.</p>
<p><b>Rack Configuration </b></p>
<ul>
<li>31 compute/storage nodes (see below)</li>
<li>1  head node (see below)</li>
<li>2 Force10 S50N switches, with two 10 Gbps uplinks so that the inter-rack bandwidth is 20 Gbps</li>
<li>1 10GE module </li>
<li>2 optics and stacking modules </li>
<li>1 3Com Baseline 2250 switch to provide additional cat5 ports for IPMI management interfaces. </li>
<li> cabling </li>
</ul>
<p><b>Compute/storage node. </b></p>
<ul>
<li>Intel Xeon 5410 Quad Core CPU with 16GB of RAM </li>
<li> SATA RAID controller </li>
<li> four (4) SATA 1TB hard drives in RAID-0 configuration </li>
<li> 1 Gbps NIC </li>
<li> IPMI management </li>
</ul>
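<p>As a quick consistency check, the per-node figures in the list above can be multiplied out to rack totals.  This is only a sketch: the names are illustrative, and it assumes that the totals count the 31 compute/storage nodes and not the head node.</p>

```python
# Per-node figures taken from the compute/storage node list above.
compute_nodes = 31      # assumption: the head node is excluded from the totals
cores_per_node = 4      # one quad-core Xeon per node
ram_gb_per_node = 16
drives_per_node = 4
tb_per_drive = 1

rack_totals = {
    "cores": compute_nodes * cores_per_node,
    "ram_gb": compute_nodes * ram_gb_per_node,
    "spindles": compute_nodes * drives_per_node,
    "disk_tb": compute_nodes * drives_per_node * tb_per_drive,
}
print(rack_totals)
```

<p>Under that assumption the totals are 124 cores, 496 GB of RAM, 124 spindles and 124 TB of disk, which matches the per-rack numbers quoted earlier in this note.</p>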
<p><b>Benchmarks.</b>  We benchmarked these new racks using the Terasort Benchmark and version 0.20.1 of <a href="http://hadoop.apache.org/">Hadoop</a> and version 1.24a of <a href="http://sector.sourceforge.net">Sector/Sphere</a>.   Replication was turned off in both Hadoop and Sector.  All the racks were located within one data center.  It is clear from these tests that the new versions of Hadoop and Sector/Sphere are both faster than the previous versions.</p>
<table>
<tr>
<th>Configuration </th>
<th>Sector/Sphere</th>
<th>Hadoop</th>
</tr>
<tr>
<td>1 rack (32 nodes) </td>
<td>28m 25s </td>
<td>85m 49s</td>
</tr>
<tr>
<td>2 racks (64 nodes) </td>
<td>15m 20s </td>
<td>37m 0s</td>
</tr>
<tr>
<td>3 racks (96 nodes) </td>
<td>10m 19s </td>
<td>24m 14s</td>
</tr>
<tr>
<td>4 racks (128 nodes) </td>
<td>7m 56s </td>
<td>17m 45s</td>
</tr>
</table>
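<p>One way to read the table is to convert the times to seconds and compute ratios.  The sketch below does this for the numbers above; the function and key names are illustrative, and the "speedup" is simply the one-rack time divided by the time on more racks.</p>

```python
# Terasort wall-clock times from the table above, as (minutes, seconds).
results = {
    1: {"sector": (28, 25), "hadoop": (85, 49)},
    2: {"sector": (15, 20), "hadoop": (37, 0)},
    3: {"sector": (10, 19), "hadoop": (24, 14)},
    4: {"sector": (7, 56), "hadoop": (17, 45)},
}

def to_seconds(minutes_seconds):
    """Convert a (minutes, seconds) pair to seconds."""
    minutes, seconds = minutes_seconds
    return 60 * minutes + seconds

def summarize(results):
    """Return, per rack count, the Hadoop/Sector time ratio and each system's speedup over one rack."""
    base = {name: to_seconds(t) for name, t in results[1].items()}
    rows = []
    for racks, times in sorted(results.items()):
        sector = to_seconds(times["sector"])
        hadoop = to_seconds(times["hadoop"])
        rows.append({
            "racks": racks,
            "hadoop_over_sector": hadoop / sector,
            "sector_speedup": base["sector"] / sector,
            "hadoop_speedup": base["hadoop"] / hadoop,
        })
    return rows

rows = summarize(results)
for row in rows:
    print(f"{row['racks']} rack(s): Hadoop/Sector = {row['hadoop_over_sector']:.2f}, "
          f"Sector speedup = {row['sector_speedup']:.2f}x, "
          f"Hadoop speedup = {row['hadoop_speedup']:.2f}x")
```

<p>On these numbers, Hadoop takes roughly 2.2 to 3 times as long as Sector/Sphere, and the gap narrows as racks are added.</p>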
<p>The Raywulf clusters were designed by Michal Sabala and Yunhong Gu of the <a href="http://www.ncdm.uic.edu">National Center for Data Mining</a> at the University of Illinois at Chicago.</p>
<p>We are working on putting together more information on how to build a Raywulf cluster.</p>
<p>Sector/Sphere and our Raywulf Clusters were selected as one of the <a href="http://sc09.supercomputing.org/?pg=disrupttech.html">Disruptive Technologies</a> that will be highlighted at <a href="http://sc09.supercomputing.org">SC 09</a>.</p>
<p>The photograph above of two racks from the Open Cloud Testbed was taken by Michal Sabala.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/09/27/building-your-own-large-data-clouds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Revisiting the Case for Cloud Computing</title>
		<link>http://rgrossman.com/2009/09/06/revisiting-the-case-for-cloud-computing/</link>
		<comments>http://rgrossman.com/2009/09/06/revisiting-the-case-for-cloud-computing/#comments</comments>
		<pubDate>Sun, 06 Sep 2009 16:17:52 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[backlash]]></category>
		<category><![CDATA[case for cloud computing]]></category>
		<category><![CDATA[cost savings]]></category>
		<category><![CDATA[new capabilities]]></category>
		<category><![CDATA[productivity]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=424</guid>
		<description><![CDATA[The backlash to the hype over cloud computing is in full swing.  I have given a number of talks on cloud computing over the past few months and have been struck by a few things.
First, at an industry event that I attended, although there were quite a few talks on cloud computing (it was [...]]]></description>
			<content:encoded><![CDATA[<p>The backlash to the hype over cloud computing is in full swing.  I have given a number of talks on cloud computing over the past few months and have been struck by a few things.</p>
<p>First, at an industry event that I attended, although there were quite a few talks on cloud computing (it was one of the tracks), it seemed that only a small number of speakers had actually participated in a cloud computing project and I was one of only a handful that had actually completed several cloud computing projects.  Many of the other speakers were simply summarizing second and third hand reports about cloud computing.  In my opinion, something was lost in the translation.</p>
<p><img class="alignleft size-medium wp-image-437" title="Rack of servers" src="http://rgrossman.files.wordpress.com/2009/09/dell-server.jpg?w=300" alt="Rack of servers" width="300" height="225" /></p>
<p>Second, I think some of the backlash has gone too far.   At one breakfast meeting I attended, there was essentially no acknowledgement of the potential that clouds offer today, simply emphasis on why &#8220;real companies&#8221; that have to worry about security could never use (public) clouds.  Private and <a href="http://blog.rgrossman.com/2009/06/08/condo-clouds/">condo clouds</a> were not mentioned as alternatives for companies whose security or compliance requirements preclude the use of today&#8217;s public clouds.  The trade-off, which is always present, between the risk of breaches from performing certain operations in public clouds and the productivity gains that such clouds can provide was also not mentioned.</p>
<p>Because of this backlash, I think it is a good time to revisit the case for cloud computing.  There are three basic reasons for deploying certain operations to clouds:</p>
<p><strong>Cost savings. </strong> By employing virtualization and making use of the economies of scale that cloud service providers can take advantage of, deploying certain operations to clouds can lead to improved efficiencies.   This advantage seems to be well understood, and is, for example, one of the factors driving the Federal CIO&#8217;s push for cloud computing.  See for example, the recent <a href="http://www.scribd.com/doc/17914883/US-Federal-Cloud-Computing-Initiative-RFQ-GSA">RFQ</a> from the GSA for a cloud computing store front.</p>
<p><strong>Productivity. </strong> The elastic, virtualized services that clouds provide lead directly to productivity improvements.  As a simple example, I was building an analytic model over the weekend to meet a deadline and the computation took over 4 hours.  Since I was using a virtualized resource in a cloud, I was able to use the portal that controlled the various machine images to double the memory in my resource.  Five minutes later, I had a new virtualized image and the computation now took less than 5 minutes.   (By the way, this is typical of analytic computations.   When the data is so large that a computation can no longer be done in memory and requires accessing the disk, the time required increases dramatically.)   If, instead, I had gone through a standard procurement process to get a new machine with twice the memory, it would have been quite some time before the model would have been completed.</p>
<p>As another example, I work with a Fortune 500 client in which the analytic models are taking weeks to build instead of days because the modeling environment does not have enough disk space for the entire team to hold all the temporary files and datasets required when building analytic models nor powerful enough computers for models to be computed fast enough to provide timely feedback to the modeler.  This is unfortunately fairly typical of modeling environments in Fortune 500 companies (I&#8217;ll discuss this situation in a later post).     A simple cloud would dramatically improve the situation.</p>
<p><strong>New capabilities. </strong> Clouds also provide new capabilities.   For example, <a href="http://blog.rgrossman.com/2009/07/16/large-data-clouds-faq/">large data clouds</a> enable the processing and analysis of large datasets that was simply not possible with architectures that manage the data using databases.    As a simple example, the type of analytic computation abstracted by the <a href="http://blog.rgrossman.com/2009/05/25/malstone-benchmark/">MalStone Benchmark</a> is relatively straightforward, even when there are 100 TB of data, using a <a href="http://hadoop.apache.org/">Hadoop</a> or <a href="http://sector.sourceforge.net">Sector</a> based cloud, but is not practical using a traditional database when the data is that size.</p>
<p><strong>What&#8217;s new. </strong> Many of the ideas behind cloud computing are quite old.  On the other hand, the combination of: 1) the scale,  2) the utility based pricing, and 3) the simplicity provided by cloud computing makes cloud computing a disruptive technology.   If you are interested in understanding cloud computing from this point of view, you might find a recent talk I gave for an IEEE Conference on New Technologies called <a href="http://www.slideshare.net/rgrossman/an-introduction-to-cloud-computing-2009-v19">My Other Computer is a Data Center</a> interesting.   There is also a written version of a portion of the talk that recently appeared in the IEEE Bulletin on Data Engineering called <a href="http://sites.computer.org/debull/A09mar/grossman.pdf">On the Varieties of Clouds for Data Intensive Computing</a>.</p>
<p>The image is by <a href="http://www.flickr.com/photos/johnseb/3425464/">John Seb</a> and is available from Flickr under the Creative Commons license.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/09/06/revisiting-the-case-for-cloud-computing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cloud Computing Testbeds</title>
		<link>http://rgrossman.com/2009/07/29/cloud-computing-testbeds/</link>
		<comments>http://rgrossman.com/2009/07/29/cloud-computing-testbeds/#comments</comments>
		<pubDate>Wed, 29 Jul 2009 11:39:44 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[cloud computing test beds]]></category>
		<category><![CDATA[cloud computing testbeds]]></category>
		<category><![CDATA[cloud testbeds]]></category>
		<category><![CDATA[eucalyptus]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[intercloud]]></category>
		<category><![CDATA[KFS]]></category>
		<category><![CDATA[open cirrus testbed]]></category>
		<category><![CDATA[Open Cloud Consortium]]></category>
		<category><![CDATA[open cloud testbed]]></category>
		<category><![CDATA[Sector]]></category>
		<category><![CDATA[Sector/Sphere]]></category>
		<category><![CDATA[thrift]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=377</guid>
		<description><![CDATA[Cloud computing is still an immature field: there are lots of interesting research problems, no standards, few benchmarks, and very limited interoperability between different applications and services.

Currently, there are relatively few testbeds available to the research community for research in cloud computing and few resources available to developers for testing interoperability.  I expect this [...]]]></description>
			<content:encoded><![CDATA[<p>Cloud computing is still an immature field: there are lots of interesting research problems, no standards, few benchmarks, and very limited interoperability between different applications and services.</p>
<p><img src="http://rgrossman.files.wordpress.com/2009/07/opencloudtestbed-08-v4.jpg?w=300" alt="The network infrastructure for the Phase 1 of the Open Cloud Testbed." title="The network infrastructure for the Phase 1 of the Open Cloud Testbed." width="300" height="135" class="alignleft size-medium wp-image-393" /></p>
<p>Currently, there are relatively few testbeds available to the research community for research in cloud computing and few resources available to developers for testing interoperability.  I expect this will change over time, but below are the testbeds that I am aware of and a little bit about each of them.  If you know of any others, please let me know so that I can keep the list current (at least for a while until cloud computing testbeds become more common).</p>
<p>Before discussing the testbeds per se, I want to highlight one of the lessons that I have learned while working with one of the testbeds &#8212; the Open Cloud Testbed (OCT).</p>
<p><b>Disclaimer:</b> I am one of the technical leads for the OCT and one of the Directors of the Open Cloud Consortium.</p>
<p>Currently the OCT consists of 120 identical nodes and 480 cores.  All were purchased and assembled at the same time by the same team.  One thing that caught me by surprise is that there are enough small differences between the nodes that the results of some experimental studies can vary by 5%, 10%, 20%, or more, depending upon which nodes are used within the testbed.  This is because even one or two nodes with slightly inferior performance can impact the overall end-to-end performance of an application that uses some of today&#8217;s common cloud middleware.</p>
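<p>To see why one slow node matters so much, here is a rough Python sketch. It is a toy model, not a measurement from the OCT: it assumes every node gets an equal share of the work and that the job finishes only when the slowest node finishes. The function names and all the numbers (base time, jitter, slowdown factor) are made up for illustration.</p>
<pre>
```python
import random


def job_time(node_times):
    """A synchronized job finishes only when its slowest node finishes."""
    return max(node_times)


def simulate(num_nodes, slow_nodes, slowdown, base=100.0, jitter=0.01, seed=0):
    """Toy model: each node takes about `base` seconds, plus or minus `jitter`.

    The first `slow_nodes` nodes are additionally slowed by the factor `slowdown`.
    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    times = [base * (1 + rng.uniform(-jitter, jitter)) for _ in range(num_nodes)]
    for i in range(slow_nodes):
        times[i] *= slowdown
    return job_time(times)


healthy = simulate(num_nodes=32, slow_nodes=0, slowdown=1.0)
degraded = simulate(num_nodes=32, slow_nodes=1, slowdown=1.2)
print(f"healthy rack:  {healthy:.1f} s")
print(f"one slow node: {degraded:.1f} s ({degraded / healthy - 1:.0%} slower)")
```
</pre>
<p>In this model a single node that is 20% slower makes the whole job close to 20% slower, no matter how many healthy nodes are in the rack.</p>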
<p><b>Amazon Cloud. </b> Although not usually thought of as a testbed, <a href="http://aws.amazon.com">Amazon&#8217;s EC2, S3, SQS, EBS</a> and related services are economical enough that they can serve as the basis for an on-demand testbed for many experimental studies.   In addition, Amazon provides <a href="http://aws.amazon.com/education/">grants</a> so that their cloud services can be used for teaching and research.</p>
<p><b>Open Cloud Testbed (OCT). </b> The <a href="http://www.opencloudconsortium.org/testbed.html">Open Cloud Testbed</a> is a testbed managed by the <a href="http://www.opencloudconsortium.org">Open Cloud Consortium</a>.  The testbed currently consists of 4 racks of servers, located in 4 data centers at Johns Hopkins University (Baltimore), StarLight (Chicago), the University of Illinois (Chicago), and the University of California (San Diego). Each rack has 32 nodes and 128 cores.  Two Cisco 3750E switches connect the 32 nodes, which then connect to the outside by a 10Gb/s uplink.  In contrast to other cloud testbeds, the OCT utilizes wide area high performance networks, not the familiar commodity Internet.  There are 10Gb/s networks that connect the various data centers.  This network is provided by Cisco&#8217;s CWave national testbed infrastructure and through a partnership with the <a href="http://www.nlr.net/">National Lambda Rail</a>.  Over the next few months the OCT will double in size to 8 racks and over 1000 cores.  In the OCT, a variety of cloud systems and services are installed and available for research, including <a href="http://hadoop.apache.org/">Hadoop</a>, <a href="http://sector.sourceforge.net">Sector/Sphere</a>, <a href="http://kosmosfs.sourceforge.net/">CloudStore</a> (KosmosFS), <a href="http://www.eucalyptus.com/">Eucalyptus</a>, and <a href="http://incubator.apache.org/thrift/">Thrift</a>.  The OCT is a testbed designed to support systems-level, middleware and application level research in cloud computing, as well as the development of standards and interoperability frameworks.  A technical report describing the OCT is available from <a href="http://arxiv.org/abs/0907.4810">arxiv.org:0907.4810</a>.</p>
<p><b>Open Cirrus(tm) Testbed. </b>  The <a href="https://opencirrus.org/">Open Cirrus Testbed</a> is a joint initiative sponsored by HP, Intel and Yahoo! in collaboration with the NSF, the University of Illinois at Urbana-Champaign (UIUC), Karlsruhe Institute of Technology, and the Infocomm Development Authority (IDA) of Singapore.  Each of the six sites consists of at least 1000 cores and associated storage.  The Open Cirrus Testbed is a federated system designed to support systems-level research in cloud computing.  A technical report describing the testbed can be found <a href="http://www.hpl.hp.com/techreports/2009/HPL-2009-134.html">here</a>.</p>
<p><b>Eucalyptus Public Cloud. </b>  The <a href="http://open.eucalyptus.com">Eucalyptus Public Cloud</a>  is a testbed for Eucalyptus applications.  <a href="http://www.eucalyptus.com">Eucalyptus</a> shares the same APIs as Amazon&#8217;s <a href="http://aws.amazon.com">web services</a>.   Currently, users are limited to no more than 4 virtual machines and experimental studies that require 6 hours or less.</p>
<p><b>Google-IBM-NSF CLuE Resource. </b>  Another cloud computing testbed is the IBM-Google-NSF Cluster Exploratory or CLuE Resource.   The IBM-Google-NSF CLuE resource appears to be a testbed for cloud computing applications, in the sense that Hadoop applications can be run on the testbed, but the testbed does not support systems research and experiments involving cloud middleware and cloud services per se, as is possible with the OCT and the Open Cirrus Testbed.  (At least this was the case the last time I checked.  It may be different now.  If it is possible to do systems level research on the testbed, I would appreciate it if someone would let me know.)   NSF has awarded nearly $5 million in grants to <a href="http://www.nsf.gov/news/news_summ.jsp?cntn_id=114686&amp;govDel=USNSF_53">14 universities</a> through its Cluster Exploratory (CLuE) program to support research on this testbed.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/07/29/cloud-computing-testbeds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Large Data Clouds FAQ</title>
		<link>http://rgrossman.com/2009/07/16/large-data-clouds-faq/</link>
		<comments>http://rgrossman.com/2009/07/16/large-data-clouds-faq/#comments</comments>
		<pubDate>Thu, 16 Jul 2009 20:37:58 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[Aster MapReduce]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[Greenplum MapReduce]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hadoop word count]]></category>
		<category><![CDATA[large data]]></category>
		<category><![CDATA[large data clouds]]></category>
		<category><![CDATA[MalStone]]></category>
		<category><![CDATA[Pig]]></category>
		<category><![CDATA[Sector]]></category>
		<category><![CDATA[what is a large data cloud]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=355</guid>
		<description><![CDATA[This is a post that contains some questions and answers about large data clouds that  I expect to update and expand from time to time.
What is large data?  From the point of view of the infrastructure required to do analytics, data comes in three sizes:

Small data.   Small data fits into the [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post that contains some questions and answers about large data clouds that  I expect to update and expand from time to time.</p>
<p><b>What is large data?</b>  From the point of view of the infrastructure required to do analytics, data comes in three sizes:</p>
<ul>
<li><b>Small data. </b>  Small data fits into the memory of a single machine.  A good example of a small dataset is the dataset for the <a href="http://www.netflixprize.com/">Netflix Prize</a>.  The Netflix Prize dataset consists of over 100 million movie rating files by 480 thousand randomly-chosen, anonymous Netflix customers that rated over 17 thousand movie titles.   This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop.  I discuss some lessons in analytic strategy that you learn from this contest in this <a href="http://blog.rgrossman.com/2009/07/05/three-lessons-in-analytic-strategy-from-the-netflix-prize/">post</a>.
<p><a href="http://cdsweb.cern.ch/record/1087869"><img src="http://rgrossman.files.wordpress.com/2009/07/atlas-cern.jpg?w=199" alt="Building the ATLAS Detector at Cern&#39;s Large Hadron Collider" title="Building the ATLAS Detector at Cern&#39;s Large Hadron Collider" width="199" height="300" class="alignleft size-medium wp-image-372" /></a></p>
<li><b>Medium data. </b>  Medium data fits into a single disk or disk array and can be managed by a database.  It is becoming common today for companies to create 1 to 10 TB or larger data warehouses.
<li><b>Large data. </b> Large data is so large that it is challenging to manage it in a database and instead specialized systems are used.  We&#8217;ll discuss some examples of these specialized systems below.  Scientific experiments, such as the Large Hadron Collider (<a href="http://lcg.web.cern.ch/LCG/">LHC</a>), produce large datasets.  Log files produced by Google, Yahoo, Microsoft and similar companies are also examples of large datasets.
</ul>
<p>There have always been large datasets, but until recently, most large datasets were produced by the scientific and defense communities.  Two things have changed:  First, large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media.  Second, the ability to analyze these datasets is critical for advertising systems that produce the bulk of the revenue for these companies.  This provides a measure (dollars of online revenue produced) by which to gauge the effectiveness of analytic infrastructure and analytic models.  Using this metric, companies such as Google settled upon analytic infrastructure that was quite different than the <a href="http://www.globus.org/">grid-based infrastructure</a> that is generally used by the scientific community.</p>
<p><b>What is a large data cloud? </b>  There is no standard definition of a large data cloud, but a good working definition is that a large data cloud<br />
provides i) storage services and ii) compute services that are layered over the storage services that scale to a data center and that have the reliability associated with a data center.  You can find some background information on clouds on this page containing an <a href="http://blog.rgrossman.com/about-cloud-computing/">overview about clouds</a>.</p>
<p><b>What are some of the options for working with large data?</b>  There are several options, including:</p>
<ul>
<li>The most mature large data cloud application is the open source <a href="http://hadoop.apache.org/core/">Hadoop</a> system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop&#8217;s implementation of <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a>.  An important advantage of Hadoop is that it has a very robust community supporting it and there are a large number of Hadoop projects, including <a href="http://hadoop.apache.org/pig/">Pig</a>, which provides simple database-like operations over data managed by HDFS.</li>
<li>Another option is <a href="http://sector.sourceforge.net">Sector</a>, which consists of the Sector Distributed File System (SDFS) and a compute service called Sphere that allows users to execute arbitrary User Defined Functions (UDFs) over the data managed by SDFS.  Sector supports MapReduce as a special case of a user-defined Map UDF, followed by Shuffle and Sort UDFs provided by Sphere, followed by a user-defined Reduce UDF.  Sector is a C++ open source application.  Unlike Hadoop, Sector includes security.   There is a <a href="http://blog.rgrossman.com/2009/06/23/sector-public-cloud/">public Sector cloud</a>  for those interested in trying out Sector without downloading it and installing it.</li>
<li><a href="http://www.greenplum.com/technology/architecture/">Greenplum</a> uses a shared-nothing MPP (massively parallel processing) architecture based upon commodity hardware.  The Greenplum architecture also integrates <a href="http://www.greenplum.com/technology/mapreduce/">MapReduce-like functionality</a> into its platform.    </li>
<li>Aster has an MPP-based <a href="http://www.asterdata.com/product/appliance.php">data warehousing appliance</a> that supports MapReduce.  They have an entry level system that manages up to 1 TB of data and an enterprise level system that is designed to support up to 1 PB of data. </li>
</ul>
<p><b>How do I get started? </b>  The easiest way to get started is to download one of the applications and to work through some basic examples.   The example that most people work through is <a href="http://wiki.apache.org/hadoop/WordCount">word count</a>.  Another common example is the terasort example (sorting 10 billion 100 byte records where the first 10 bytes is the key that is sorted and the remaining 90 bytes is the payload).  A simple analytic to try is <a href="http://code.google.com/p/malgen">MalStone</a>, which I have described in another <a href="http://blog.rgrossman.com/2009/05/25/malstone-benchmark/">post</a>.</p>
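<p>To make the terasort record layout concrete, here is a small single-machine Python sketch. It is not the actual terasort code or its data generator; it just builds random 100 byte records and sorts them on the first 10 bytes. The names <code>make_records</code> and <code>sort_records</code> are made up for this illustration, and the record count is tiny (the full benchmark is 10 billion records, or about 1 TB).</p>
<pre>
```python
import os

KEY_BYTES = 10       # the first 10 bytes of each record are the sort key
PAYLOAD_BYTES = 90   # the remaining 90 bytes are the payload
RECORD_BYTES = KEY_BYTES + PAYLOAD_BYTES


def make_records(count):
    """Generate `count` random 100 byte records (a toy stand-in for a real generator)."""
    return [os.urandom(RECORD_BYTES) for _ in range(count)]


def sort_records(records):
    """Sort records by their 10 byte key; the payload simply travels with the key."""
    return sorted(records, key=lambda record: record[:KEY_BYTES])


records = make_records(1000)
ordered = sort_records(records)
keys = [record[:KEY_BYTES] for record in ordered]
print(len(ordered), "records sorted; keys in order:", keys == sorted(keys))
```
</pre>
<p>The hard part of the real benchmark is not the comparison itself but the fact that the records do not fit on one machine, so the sort has to be partitioned across many nodes and disks.</p>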
<p><b>What are some of the issues that arise with large data cloud applications? </b> The first issue is mapping your problem to the MapReduce or generalized MapReduce (like Sphere&#8217;s UDFs) frameworks.  Although this type of data parallel framework may seem quite special initially, it is surprising how many problems can be mapped to this framework with a bit of effort.</p>
<p>The second issue is that tuning Hadoop clusters can be challenging and time consuming.  This is not surprising, considering the power provided by Hadoop to tackle very large problems.</p>
<p>The third issue is that with medium (100 node) and large (1000 node) clusters, even a few underperforming nodes can impact the overall performance.  There can also be problems with switches that impact performance in subtle ways.  Dealing with these types of hardware issues can also be time consuming.  It is sometimes helpful to run a known benchmark such as terasort or MalStone to distinguish hardware issues from programming issues.</p>
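<p>One simple way to use a known benchmark for this purpose is to run the same small job on every node and compare the timings. The Python sketch below flags nodes that are noticeably slower than the median. The function name <code>flag_slow_nodes</code>, the 10% tolerance, the node names and the timings are all hypothetical; a real cluster would need repeated runs and a threshold chosen from experience.</p>
<pre>
```python
from statistics import median


def flag_slow_nodes(timings, tolerance=0.10):
    """Return the nodes whose benchmark time exceeds the median by more than `tolerance`.

    `timings` maps a node name to the seconds it took to run the same benchmark.
    The 10% default tolerance is an arbitrary illustrative choice.
    """
    typical = median(timings.values())
    cutoff = typical * (1 + tolerance)
    return sorted(node for node, seconds in timings.items() if seconds > cutoff)


# Hypothetical per-node timings, in seconds, for one benchmark run.
timings = {
    "node-01": 61.2,
    "node-02": 60.8,
    "node-03": 74.5,
    "node-04": 61.0,
    "node-05": 60.5,
}
print(flag_slow_nodes(timings))
```
</pre>
<p>If the flagged nodes stay the same when the application changes, the problem is more likely to be hardware than programming.</p>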
<p><b>What is the significance of large data clouds? </b>  Just a short time ago, it required specialized proprietary software to analyze 100 TB or more of data.  Today, a competent team should be able to do this relatively straightforwardly with a 100 node large data cloud powered by Hadoop, Sector or similar software.</p>
<p><b>Getting involved.  </b> I just set up a Google Group for large data clouds:<br />
<a href="http://groups.google.com/group/large-data-clouds">groups.google.com/group/large-data-clouds</a>.  Please use this group to discuss issues related to large data clouds, including lessons learned, questions, announcements, etc. (no advertising please).  In particular, if you have software you would like added to the list above, please comment below or send a note to the large data cloud google group.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/07/16/large-data-clouds-faq/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Test Drive the Sector Public Cloud</title>
		<link>http://rgrossman.com/2009/06/23/sector-public-cloud/</link>
		<comments>http://rgrossman.com/2009/06/23/sector-public-cloud/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 12:17:51 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[C++ cloud]]></category>
		<category><![CDATA[Google File System]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[high performance networks]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[open source cloud]]></category>
		<category><![CDATA[Sector]]></category>
		<category><![CDATA[Sector/Sphere]]></category>
		<category><![CDATA[Sphere]]></category>
		<category><![CDATA[User Defined Functions]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=259</guid>
		<description><![CDATA[Sector is an open source cloud written in C++ for storing, sharing and processing large data sets.   Sector is broadly similar to the Google File System and the Hadoop Distributed File System, except that it is designed to utilize wide area high performance  networks.
Sphere is middleware that is designed to process data managed by Sector.  [...]]]></description>
			<content:encoded><![CDATA[<p>Sector is an open source cloud written in C++ for storing, sharing and processing large data sets.   Sector is broadly similar to the <a href="http://labs.google.com/papers/gfs.html">Google File System</a> and the <a href="http://hadoop.apache.org/core/">Hadoop Distributed File System</a>, except that it is designed to utilize wide area high performance  networks.</p>
<p>Sphere is middleware that is designed to process data managed by Sector.  Sphere implements a framework for distributed computing that allows any User Defined Function (UDF) to be applied to a Sector dataset.</p>
<p>One way to think about this is as a generalized MapReduce.  With MapReduce, users work with &lt;key, value&gt; pairs and define a Map function and a Reduce function, and the MapReduce application creates a workflow consisting of a Map, Shuffle, Sort and Reduce.  With Sector, users can create a workflow consisting of any sequence of User Defined Functions (UDFs) and apply these to any datasets managed by Sector.  In particular, Sphere has predefined Shuffle and Sort UDFs that can be applied to datasets consisting of &lt;key, value&gt; pairs so that MapReduce applications can be implemented once a user defines a Map and Reduce UDF.</p>
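<p>The workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea and not the Sphere API (Sector/Sphere is written in C++): the functions <code>map_udf</code>, <code>shuffle_and_sort</code>, <code>reduce_udf</code> and <code>run_pipeline</code> are hypothetical names, and everything runs in memory on one machine. The example is word count, with a user-defined Map and Reduce wrapped around a stand-in for the predefined Shuffle and Sort.</p>
<pre>
```python
from itertools import groupby
from operator import itemgetter


def map_udf(line):
    """User-defined Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Stand-in for the predefined Shuffle and Sort: bring equal keys together, in order."""
    return sorted(pairs, key=itemgetter(0))


def reduce_udf(key, values):
    """User-defined Reduce: sum the counts for one key."""
    return key, sum(values)


def run_pipeline(lines):
    """Apply Map, then Shuffle and Sort, then Reduce to a small in-memory dataset."""
    pairs = [pair for line in lines for pair in map_udf(line)]
    ordered = shuffle_and_sort(pairs)
    return [
        reduce_udf(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


print(run_pipeline(["the cloud is the data center", "the data"]))
```
</pre>
<p>Because each stage is just a function applied to a dataset, the same pattern extends to workflows that are not Map followed by Reduce, which is the sense in which the UDF model is more general.</p>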
<p>Sector also implements security and we are currently using it to bring up a HIPAA-compliant private cloud.</p>
<p>Since Sector/Sphere is written in C++, it is straightforward to support C++ based data access tools and programming APIs.</p>
<p>If you have access to a high speed research network (for example, if your network can reach <a href="http://www.startap.net/starlight/">StarLight</a>, the <a href="http://www.nlr.net/">National Lambda Rail</a>, <a href="http://www.es.net">ESNet</a>, or <a href="http://www.internet2.edu">Internet2</a>), then you can try out the Sector Public Cloud.</p>
<p>You can reach the Sector Public Cloud from the Sector home page <a href="http://sector.sourceforge.net">sector.sourceforge.net</a>.</p>
<p>There is a technical report on the design of Sector on arXiv: <a href="http://arxiv.org/abs/0809.1181">arXiv:0809.1181v2</a>.</p>
<p>There is some information on the performance of Sector/Sphere in my post on the <a href="http://blog.rgrossman.com/2009/05/25/malstone-benchmark/">MalStone Benchmark</a>, a benchmark for clouds that support data intensive computing.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/06/23/sector-public-cloud/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Some Reasons to Consider Condominium Clouds (Condo Clouds)</title>
		<link>http://rgrossman.com/2009/06/08/condo-clouds/</link>
		<comments>http://rgrossman.com/2009/06/08/condo-clouds/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 22:10:23 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[condo clouds]]></category>
		<category><![CDATA[condominium clouds]]></category>
		<category><![CDATA[private clouds]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=203</guid>
		<description><![CDATA[In this post, I&#8217;ll introduce condominium clouds and discuss some of their potential for changing computing.  From an architectural point of view, condominium clouds are essentially the same as private clouds.  Condominium clouds have a different business model though, which, in certain circumstances provides some definite advantages.
I argue here that condominium clouds and [...]]]></description>
			<content:encoded><![CDATA[<p>In this post, I&#8217;ll introduce condominium clouds and discuss some of their potential for changing computing.  From an architectural point of view, condominium clouds are essentially the same as private clouds.  Condominium clouds have a different business model though, which, in certain circumstances, provides some definite advantages.</p>
<p>I argue here that condominium clouds and related offerings represent a fundamental shift in our computing platforms.  To explain this, I&#8217;ll take a short detour and recall a computing experience I had about a decade ago and the business model (condominium fiber) that made these types of experiences available to a broader community.</p>
<p><img class="alignleft size-medium wp-image-214" title="Some racks in data center. " src="http://rgrossman.files.wordpress.com/2009/06/data_center_rack.jpg?w=168" alt="Some racks in data center. " width="168" height="300" /></p>
<p>One of the most exciting technical experiences I have had occurred in 2000 when I ran a distributed data intensive computing application over a dedicated 155 Mbps network link connecting clusters located at NCAR in Boulder and the University of Michigan in Ann Arbor.  Prior to that I only had access to 1.5 Mbps networks and these networks were shared by the rest of the campus.  The application was able to perform sustained computation at about 96 Mbps, which was not bad considering that each computer was limited by a 100 Mbps NIC.   Reaching 96 Mbps over a wide area network was quite difficult at that time, but we did this using a new network protocol that was the precursor to <a href="http://udt.sourceforge.net">UDT</a>.  The reason for our excitement was that one day we were limited to distributed computations that rarely reached 1 Mbps, while the next day we reached 96 Mbps, almost two orders of magnitude improvement.</p>
<p>By 2003, with improved protocols and 10 Gbps networks, sustained distributed computations reached  6.8 Gbps.  Within a four year span, we had passed through an inflection point in which high performance distributed computing improved by over 3 orders of magnitude.  Three things were required:</p>
<ul>
<li>A new computing platform, in this case, clusters connected by wide area, high performance networks.</li>
<li>A new network protocol and associated libraries, since TCP was not effective at data intensive computing over wide area high performance networks.</li>
<li>A new business model, which made high performance wide area networks more broadly available.</li>
</ul>
<p>Let&#8217;s turn now to cloud computing.  Cloud computing has two faces: the most familiar face offers utility-based pricing, on-demand elastic availability, and infrastructure as a service.  There is no doubt that this combination is changing the face of computing.  On the other hand, the other side of cloud computing is just as important.  This side is about thinking of the data center as your unit of computing.  Previously you probably thought of computing as requiring a certain number of racks.  With cloud computing, you now think of computing as requiring a certain number of data centers.  This is computing measured with Data Center Units or DCUs.</p>
<p>The problem is that acquiring computing at the scale of data centers is prohibitively expensive except for a handful of companies (Google, Microsoft, Yahoo, IBM, &#8230;).</p>
<p>This is where the condominium clouds enter.  But first, here is a description of customer owned and condominium fiber from a <a href="http://www.canarie.ca/canet4/library/customer.html">2002 FAQ</a> titled &#8220;FAQ about Community Dark Fiber Networks&#8221; written by <a href="http://billstarnaud.blogspot.com/">Bill St Arnaud</a>:</p>
<blockquote><p>Dark fiber is optical fiber, dedicated to a single customer and where the customer is responsible for attaching the telecommunications equipment and lasers to &#8220;light&#8221; the fiber.  Traditionally optical fiber networks have been built by carriers where they take on the responsibility of lighting the fiber and provide a managed service to the customer.</p>
<p>Professional 3rd parties companies who specialize in dark fiber systems take care of the actual installation of the fiber and also maintain it on behalf of the customer.  Technically these companies actually own the fiber, but sell IRUs (Indefeasible Rights of Use) for up to 20 years for unrestricted use of the fiber.</p>
<p>&#8230;</p>
<p>All across North America businesses, school boards and municipalities are banding together to negotiate deals to purchase customer owned dark fiber.  A number of next generation service providers are now installing fiber networks and will sell strands of fiber to any organization who wish to purchase and manage their own dark fiber.</p>
<p>Many of these new fiber networks are built along the same model as a condominium apartment building.  The contractor advertises the fact that they intend to build a condominium fiber network and offers early participants special pricing before the construction begins.  That way the contractor is able to guarantee early financing for the project and demonstrate to bankers and other investors that there are some committed customers to the project.</p>
<p>The condominium fiber is operated like a condominium apartment building.  The individual owners of fiber strands can do whatever they want they want with their individual fiber strands.  They are free to carry any type of traffic and terminate the fiber any way they so choose. The company that installs the fiber network is responsible for overall maintenance and repairing the fiber in case of breaks, moves, adds or changes.  The &#8220;condominium manager&#8221; charges the owners of the individual strands of fiber a small annual maintenance fee which covers all maintenance and right of way costs.</p>
<p>&#8230;</p>
<p>The initial primary driver for dark fiber by individual customers is the dramatic savings in telecommunication costs.  The reduction in telecommunication costs can be in excess of 1000% depending on your current bandwidth requirements.
</p></blockquote>
<p>It is now easy to explain condominium clouds.   For those who cannot afford private clouds at the scale of data centers, condominium clouds provide a way to share the expense with other members of the condominium.</p>
<p>The condominium cloud model is also attractive if there are compliance issues or security issues that make a private cloud desirable, but your scale is such that justifying your own private cloud at the scale of a data center does not make sense.</p>
<p>As with condominium fiber, professionals would build and operate the data center.  One way of looking at condominium clouds is as more cost-effective private clouds for certain organizations or associations that might benefit from the scale and operational control that data centers offer.</p>
<p>Condominium clouds might make sense for companies in a regulated industry that belong to an association that can manage the condominium.  They would also make sense for  scientific collaborations, especially those with large data.  Also, although the business model would be slightly different, government organizations that couldn&#8217;t justify their own cloud could work together and jointly manage a condominium cloud.</p>
<p>The image above is courtesy of Cory Doctorow.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/06/08/condo-clouds/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing</title>
		<link>http://rgrossman.com/2009/05/25/malstone-benchmark/</link>
		<comments>http://rgrossman.com/2009/05/25/malstone-benchmark/#comments</comments>
		<pubDate>Mon, 25 May 2009 16:41:58 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[benchmarks for cloud analytics]]></category>
		<category><![CDATA[benchmarks for data intensive computing]]></category>
		<category><![CDATA[cloud analytics]]></category>
		<category><![CDATA[cloud computing benchmarks]]></category>
		<category><![CDATA[CloudStone]]></category>
		<category><![CDATA[drive-by exploits]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hadoop wins TeraSort]]></category>
		<category><![CDATA[log files]]></category>
		<category><![CDATA[MalStone]]></category>
		<category><![CDATA[site-entity]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=146</guid>
		<description><![CDATA[The TPC Benchmarks have played an important role in comparing databases and transaction processing systems.   Currently, there are no similar benchmarks for comparing two clouds.

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications.  In this note, we describe the MalStone Benchmark, which [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.tpc.org/">TPC Benchmarks</a> have played an important role in comparing databases and transaction processing systems.   Currently, there are no similar benchmarks for comparing two clouds.</p>
<p><img src="http://rgrossman.files.wordpress.com/2009/06/benchmark_bourbon_whiskey.jpg?w=225" alt="Benchmark" title="Benchmark" width="225" height="300" class="alignleft size-medium wp-image-229" /></p>
<p>The <a href="http://radlab.cs.berkeley.edu/wiki/Projects/Cloudstone">CloudStone Benchmark</a> is a first step towards a benchmark for clouds designed to support Web 2.0 type applications.  In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as <a href="http://hadoop.apache.org/core/">Hadoop</a> and <a href="http://sector.sourceforge.net">Sector</a>, designed to support data intensive computing.</p>
<p>MalStone is a stylized analytic computation of a type that is common in data intensive computing.   The open source code to generate data for MalStone and a technical report describing MalStone and providing some sample implementations can be found at: <a href="http://code.google.com/p/malgen">code.google.com/p/malgen</a> (look in the feature downloads section along the right hand side).</p>
<h3>Detecting Drive-By Exploits from Log Files</h3>
<p>We introduce MalStone with a simple example.  Consider visitors to web sites.  As described in the paper <a href="http://www.usenix.org/events/hotbots07/tech/full_papers/provos/provos.pdf">The Ghost in the Browser</a> by <a href="http://www.provos.org">Provos</a> et al., which was presented at <a href="http://www.usenix.org/events/hotbots07/tech/">HotBots &#8217;07</a>, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the web pages.  Sometimes these are called “drive-by exploits.”</p>
<p>The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:</p>
<pre>   | Timestamp | Web Site ID | User ID</pre>
<p>There is a further assumption that if a computer becomes infected, perhaps at a later time, then this is known.  That is, for each computer, which we assume is identified by the ID of the corresponding user, it is known whether that computer has become compromised at some later time:</p>
<pre>   | User ID | Compromise Flag</pre>
<p>Here the Compromise Flag field is a flag, with 1 denoting a compromise.  A very simple statistic that provides some insight into whether a web site is a possible source of compromises is, for each web site, the ratio of visits after which the computer becomes compromised to visits after which the computer remains uncompromised.</p>
<p>We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites.  Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging.  For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today.  On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.</p>
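<p>As a minimal sketch of this statistic for data small enough to fit in memory, the following Python joins the two tables described above and computes the ratio for each site.  The sample rows, the function name site_ratios and the choice to report None for a site with no uncompromised visits are illustrative assumptions and are not part of MalStone.</p>
<pre>
```python
from collections import defaultdict

# Hypothetical sample data in the two layouts described above.
visits = [
    # (timestamp, web site ID, user ID)
    ("2009-05-01 10:00:00", "site-a", "u1"),
    ("2009-05-01 10:05:00", "site-a", "u2"),
    ("2009-05-01 10:07:00", "site-a", "u3"),
    ("2009-05-02 09:30:00", "site-b", "u2"),
    ("2009-05-02 11:15:00", "site-b", "u3"),
]
compromised = {"u1": 1, "u2": 0, "u3": 1}  # user ID: compromise flag


def site_ratios(visits, compromised):
    """Return, per site, compromised visits divided by uncompromised visits."""
    counts = defaultdict(lambda: [0, 0])  # site: [uncompromised, compromised]
    for _timestamp, site, user in visits:
        # The flag (0 or 1) doubles as the index of the counter to increment.
        counts[site][compromised.get(user, 0)] += 1
    ratios = {}
    for site, (clean, bad) in counts.items():
        # The ratio is undefined when no visit was left uncompromised.
        ratios[site] = bad / clean if clean else None
    return ratios


print(site_ratios(visits, compromised))
```
</pre>
<p>A single dictionary of counters is enough here because everything fits on one machine.  The point of the benchmark is what happens when it does not.</p>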
<p>The MalStone benchmarks use records of the following form:</p>
<pre>   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID</pre>
<p>Here site abstracts web site and entity abstracts the possibly infected computer.   We assume that each record is 100 bytes long.</p>
<p>In the MalStone A Benchmark, for each site, the number of records for which an entity visited the site and subsequently became compromised is divided by the total number of records for which an entity visited the site.  The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest).  MalStone A-10 uses 10 billion records so that in total there is 1 TB of data.  Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records.   MalStone B-10, B-100 and B-1000 are defined in the same way.</p>
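<p>The two definitions can be sketched in Python on a handful of in-memory records.  This is only an illustration under stated assumptions: the sample records, the date-only timestamps, the function names and the simplification that the Compromise Flag on a record marks that visit as one followed by a compromise are all hypothetical, and the sketch is not one of the implementations timed below.</p>
<pre>
```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (event ID, timestamp, site ID, compromise flag, entity ID).
records = [
    (1, "2009-01-01", "s1", 1, "e1"),
    (2, "2009-01-03", "s1", 0, "e2"),
    (3, "2009-01-09", "s1", 1, "e3"),
    (4, "2009-01-10", "s1", 1, "e4"),
    (5, "2009-01-02", "s2", 0, "e1"),
    (6, "2009-01-12", "s2", 1, "e3"),
]


def malstone_a(records):
    """Per site: records flagged as compromised divided by all records."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for _event, _ts, site, flag, _entity in records:
        totals[site] += 1
        flagged[site] += flag
    return {site: flagged[site] / totals[site] for site in totals}


def malstone_b(records, start):
    """Per site and week: the same ratio over the window that runs from the
    start of the period to the end of that week."""
    weekly = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for _event, ts, site, flag, _entity in records:
        week = (datetime.strptime(ts, "%Y-%m-%d") - start).days // 7
        weekly[site][week][0] += flag  # flagged records in this week
        weekly[site][week][1] += 1     # all records in this week
    result = {}
    for site, weeks in weekly.items():
        flagged = total = 0
        for week in range(max(weeks) + 1):
            flagged += weeks[week][0]
            total += weeks[week][1]
            # No ratio is defined until the site has at least one record.
            result[(site, week)] = flagged / total if total else None
    return result


print(malstone_a(records))
print(malstone_b(records, start=datetime(2009, 1, 1)))
```
</pre>
<p>On the sizing, 10 billion records of 100 bytes each come to 10<sup>12</sup> bytes, which is the 1 TB quoted for MalStone A-10.</p>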
<h3>TeraSort Benchmark</h3>
<p>One of the motivations for choosing 10 billion 100-byte records is that the <a href="http://sortbenchmark.org/">TeraSort Benchmark</a> (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.</p>
<p>In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark.  It was able to sort 1 TB of data <a href="http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html">using 910 nodes in 209 seconds</a>, breaking the previous record of 297 seconds.   Hadoop set a new record in 2009 by sorting 100 TB of data at <a href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html">0.578 TB/minute using 3800 nodes</a>.  For some background about the TeraSort Benchmark, see the blog posting by James Hamilton, <a href="http://perspectives.mvdirona.com/2008/07/08/HadoopWinsTeraSort.aspx">Hadoop Wins Terasort</a>.</p>
<p>Note that the TeraSort Benchmark is now deprecated and has been replaced by the <a href="http://sortbenchmark.org/">Minute Sort Benchmark</a>.  Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.</p>
<h3>Generating Data for MalStone Using MalGen</h3>
<p>We have developed a generator of synthetic data for MalStone called MalGen.  MalGen is open source and available from <a href="http://code.google.com/p/malgen">code.google.com/p/malgen</a>.  Using MalGen, data can be generated with power law distributions, which is useful when modeling  web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).</p>
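<p>To illustrate what a power law distribution over sites looks like, here is a small Python sketch that draws site IDs with Zipf-like weights.  It is not MalGen&#8217;s algorithm, which is documented with the MalGen source.  The function name generate_visits, the exponent of 1.0 and the fixed seed are assumptions chosen only for the example.</p>
<pre>
```python
import random
from collections import Counter


def generate_visits(num_records, num_sites, exponent=1.0, seed=0):
    """Draw site IDs so that the site of rank r has weight r ** -exponent."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same data
    sites = list(range(1, num_sites + 1))
    weights = [rank ** -exponent for rank in sites]
    return rng.choices(sites, weights=weights, k=num_records)


visits = generate_visits(num_records=10000, num_sites=1000)
counts = Counter(visits)
# Show the five most visited sites and how many sites were visited at all.
print(counts.most_common(5))
print(len(counts), "distinct sites visited")
```
</pre>
<p>With these weights a few low-ranked site IDs receive a large share of the visits, while most sites receive only a few or none.</p>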
<h3>Using MalStone to Study Design Tradeoffs</h3>
<p>Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records.   The experiments were done on 20 nodes of the <a href="http://www.opencloudconsortium.org/testbed.html">Open Cloud Testbed</a>.  Each node was a Dell 1435 computer with 12 GB of memory, a 1 TB disk, two dual-core 2.0 GHz AMD Opteron 2212 processors, and a 1 Gb/s network interface card.</p>
<p>We compared three different implementations: 1) Hadoop HDFS with Hadoop&#8217;s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using <a href="http://arxiv.org/abs/0809.1181">Sphere User Defined Functions (UDFs)</a>.</p>
<table border="1">
<tr>
<th colspan="2"> MalStone A</th>
</tr>
<tr>
<td>Hadoop MapReduce </td>
<td>454m 13s</td>
</tr>
<tr>
<td>Hadoop Streams/Python</td>
<td>87m 29s </td>
</tr>
<tr>
<td>Sector/Sphere UDFs </td>
<td>33m 40s</td>
</tr>
<tr>
<th colspan="2"> MalStone B</th>
</tr>
<tr>
<td>Hadoop MapReduce </td>
<td>840m 50s</td>
</tr>
<tr>
<td>Hadoop Streams/Python</td>
<td>142m 32s </td>
</tr>
<tr>
<td>Sector/Sphere UDFs </td>
<td>43m 44s</td>
</tr>
</table>
<p><b>Please note that these timings are still preliminary and may be revised in the future as we better optimize the implementations. </b></p>
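<p>To make the timings in the table above easier to compare, the following Python sketch converts them to seconds and expresses each one as a multiple of the Sector/Sphere time.  The numbers are copied from the table.  The helper names and the choice of Sector/Sphere as the baseline are assumptions made for the example.</p>
<pre>
```python
def to_seconds(minutes, seconds):
    """Convert a timing given as minutes and seconds to seconds."""
    return 60 * minutes + seconds


# Timings copied from the table above, as (minutes, seconds).
timings = {
    "MalStone A": {
        "Hadoop MapReduce": (454, 13),
        "Hadoop Streams/Python": (87, 29),
        "Sector/Sphere UDFs": (33, 40),
    },
    "MalStone B": {
        "Hadoop MapReduce": (840, 50),
        "Hadoop Streams/Python": (142, 32),
        "Sector/Sphere UDFs": (43, 44),
    },
}


def relative_times(results, baseline="Sector/Sphere UDFs"):
    """Express each running time as a multiple of the baseline running time."""
    base = to_seconds(*results[baseline])
    return {name: to_seconds(*t) / base for name, t in results.items()}


for benchmark, results in timings.items():
    for name, factor in relative_times(results).items():
        print(f"{benchmark}: {name} took {factor:.1f}x the Sector/Sphere time")
```
</pre>
<p>On these preliminary numbers, Hadoop MapReduce took roughly 13 times as long as Sector/Sphere on MalStone A and roughly 19 times as long on MalStone B, while Hadoop Streams with Python took roughly 3 times as long on both.</p>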
<p>If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice.  What these preliminary benchmarks indicate though is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams.  In addition, you may also want to consider using <a href="http://sector.sourceforge.net">Sector</a>.</p>
<p>The image above is from <a href="http://www.flickr.com/photos/legeres/270126135/">Strolling everyday</a> and available via a Creative Commons license.</p>
<p><b>Disclaimer:</b>  I am involved in the development of Sector.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/05/25/malstone-benchmark/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>
