<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robert Grossman &#187; data intensive computing</title>
	<atom:link href="http://rgrossman.com/category/blog/data-intensive-computing/feed/" rel="self" type="application/rss+xml" />
	<link>http://rgrossman.com</link>
	<description>analytics, analytic strategy and analytic infrastructure</description>
	<lastBuildDate>Wed, 28 Jul 2010 02:49:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The Data Center as the Unit of Computing</title>
		<link>http://rgrossman.com/2010/07/27/the-data-center-as-the-unit-of-computing/</link>
		<comments>http://rgrossman.com/2010/07/27/the-data-center-as-the-unit-of-computing/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 02:49:33 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=672</guid>
		<description><![CDATA[I&#8217;m at the KDD 2010 conference this week in Washington, D.C.  On Sunday, I gave the keynote at the 2nd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010), which was one of the workshops co-located with the conference.  The title of my talk was &#8220;My Other Computer is a Data [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m at the <a href="http://www.kdd.org/kdd2010/">KDD 2010</a> conference this week in Washington, D.C.  On Sunday, I gave the keynote at the 2nd Workshop on Large-scale Data Mining: Theory and Applications (<a href="http://arnetminer.org/LDMTA2010">LDMTA 2010</a>), which was one of the workshops co-located with the conference.  The title of my talk was &#8220;My Other Computer is a Data Center: The Sector Perspective on Big Data.&#8221;  You can download the talk from <a href="http://www.slideshare.net/rgrossman/my-other-computer-is-a-data-center-the-sector-perspective-on-big-data">Slideshare</a>.</p>
<p><img src="http://rgrossman.com/files/2010/07/kdd10.png" alt="KDD 2010" title="KDD 2010" width="542" height="115" class="alignleft size-full wp-image-681" /></p>
<p>The first part of the talk argued that it may be useful to think of a data center as a &#8220;device&#8221; for extracting relationships from data, in broadly the same way that we view a telescope as a device for looking at things that are very far away and a microscope as a device for looking at things that are very small.   Continuing in this way, you can think of a supercomputer as a device for computing simulations.  </p>
<p>The table below is my rough &#8220;back of the envelope&#8221; computation of the scale up provided by each of these devices over what was possible before (these scale up numbers are very rough and if you have better numbers, please let me know).</p>
<p>In each of these cases, the device resulted in some pretty interesting new science.  So it is interesting to speculate what type of new science might arise when you think of a data center as a device for extracting patterns from very large collections of data.</p>
<p>In the second part of the talk, I described at a very high level some of the components and layers in a software stack for a data center device.</p>
<table border="1">
<tr>
<td><b>Instrument</b></td>
<td><b>Year</b></td>
<td><b>Scale up</b></td>
</tr>
<tr>
<td>Telescope</td>
<td>1609</td>
<td>30x</td>
</tr>
<tr>
<td>Microscope</td>
<td>1670</td>
<td>250x</td>
</tr>
<tr>
<td>Supercomputing</td>
<td>1976</td>
<td>10x-100x</td>
</tr>
<tr>
<td>Data center</td>
<td>2003</td>
<td>10x-100x</td>
</tr>
</table>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2010/07/27/the-data-center-as-the-unit-of-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building Your Own Large Data Clouds (Raywulf Clusters)</title>
		<link>http://rgrossman.com/2009/09/27/building-your-own-large-data-clouds/</link>
		<comments>http://rgrossman.com/2009/09/27/building-your-own-large-data-clouds/#comments</comments>
		<pubDate>Sun, 27 Sep 2009 16:59:19 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[large data clouds]]></category>
		<category><![CDATA[Sector Sphere]]></category>
		<category><![CDATA[Terasort]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=453</guid>
		<description><![CDATA[We recently added four new racks to the Open Cloud Testbed.  The racks are designed to support cloud computing, both clouds that support on demand VMs as well as those that support data intensive computing.  Since there is not a lot of information available describing how to put together these types of clouds, [...]]]></description>
			<content:encoded><![CDATA[<p>We recently added four new racks to the <a href="http://opencloudconsortium.org/testbed.html">Open Cloud Testbed</a>.  The racks are designed to support cloud computing, both clouds that support on demand VMs as well as those that support data intensive computing.  Since there is not a lot of information available describing how to put together these types of clouds, I thought I would share how we configured our racks.</p>
<div id="attachment_464" class="wp-caption alignleft" style="width: 194px"><img src="http://rgrossman.files.wordpress.com/2009/09/oct-gen2-09.jpg?w=184" alt="These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out.  Photograph by Michal Sabala." title="Two Racks from the Open Cloud Testbed" width="184" height="300" class="size-medium wp-image-464" /><p class="wp-caption-text">These are two of the four racks that were added to the Open Cloud Testbed as part of the Phase 2 build out.  Photograph by Michal Sabala.</p></div>
<p>These racks can be used as a basis for private clouds, hybrid clouds, or <a href="http://blog.rgrossman.com/2009/06/08/condo-clouds/">condo clouds</a>.</p>
<p>There is a lot of information about building Beowulf clusters, which are designed for compute intensive computing.  Here is one of the first <a href="http://www.cacr.caltech.edu/beowulf/tutorial/building.html">tutorials</a> and some more recent <a href="http://www.beowulf.org">information</a>.</p>
<p>In contrast, our racks are designed to support data intensive computing.  We sometimes call these Raywulf clusters.  Briefly, the goal is to make sure that there are enough spindles moving data in parallel with enough cores to process the data being moved.   (Our data intensive middleware is called Sector, Graywulf is already taken, and there are not many words that rhyme with Beo- left.  Other suggestions are welcome.  Please use the comments below.)</p>
<p>The racks cost about $85,000 (with standard discounts), consist of 32 nodes and 124 cores with 496 GB of RAM, 124 TB of disk &amp; 124 spindles, and consume about 10.3 kW of power (excluding the power required for cooling).</p>
<p>With 3x replication, there is about 40 TB of usable storage available, which means that the cost to provide balanced long term storage and compute power is about $2,000 per TB.   So, for example, a single rack could be used as a basis for a private cloud that can manage and analyze approximately 40 TB of data.  At the end of this note is some performance information about a single rack system.</p>
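<p>As a quick check of the arithmetic above, here is a minimal sketch that derives the usable storage and the cost per usable TB from the figures quoted in this post.  The variable names are only illustrative.</p>

```python
# Figures quoted in this post; the variable names are illustrative.
rack_cost_usd = 85_000
raw_disk_tb = 124
replication_factor = 3

# Every block is stored three times, so usable capacity is a third of raw.
usable_tb = raw_disk_tb / replication_factor
cost_per_usable_tb = rack_cost_usd / usable_tb

print(f"usable storage: {usable_tb:.1f} TB")
print(f"cost per usable TB: ${cost_per_usable_tb:,.0f}")
```

<p>This gives a little over 41 TB of usable storage at just over $2,000 per TB, which is consistent with the rounded numbers quoted above.</p>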
<p>In contrast, there are specialized <a href="http://blog.backblaze.com/category/cloud-storage/">configurations</a>, such as those designed by Backblaze, that provide 67 TB for $8,000.  This is about half the raw storage for about one tenth of the cost.   The difference is that Raywulf clusters are designed for data intensive computing using middleware such as Hadoop and Sector/Sphere, not just storage.</p>
<p>Each rack is a standard 42U computer rack and consists of a head node and 31 compute/storage nodes.  We installed Debian GNU/Linux 5.0 as the operating system.  Here is the configuration of the rack and of the compute/storage nodes.</p>
<p><b>Rack Configuration </b></p>
<ul>
<li>31 compute/storage nodes (see below)</li>
<li>1  head node (see below)</li>
<li>2 Force10 S50N switches, with two 10 Gbps uplinks so that the inter-rack bandwidth is 20 Gbps</li>
<li>1 10GE module </li>
<li>2 optics and stacking modules </li>
<li>1 3Com Baseline 2250 switch to provide additional cat5 ports for IPMI management interfaces. </li>
<li> cabling </li>
</ul>
<p><b>Compute/storage node. </b></p>
<ul>
<li>Intel Xeon 5410 Quad Core CPU with 16GB of RAM </li>
<li> SATA RAID controller </li>
<li> four (4) SATA 1TB hard drives in RAID-0 configuration </li>
<li> 1 Gbps NIC </li>
<li> IPMI management </li>
</ul>
<p><b>Benchmarks.</b>  We benchmarked these new racks using the Terasort Benchmark and version 0.20.1 of <a href="http://hadoop.apache.org/">Hadoop</a> and version 1.24a of <a href="http://sector.sourceforge.net">Sector/Sphere</a>.   Replication was turned off in both Hadoop and Sector.  All the racks were located within one data center.  It is clear from these tests that the new versions of Hadoop and Sector/Sphere are both faster than the previous versions.</p>
<table>
<tr>
<th>Configuration </th>
<th>Sector/Sphere</th>
<th>Hadoop</th>
</tr>
<tr>
<td>1 rack (32 nodes) </td>
<td>28m 25s </td>
<td>85m 49s</td>
</tr>
<tr>
<td>2 racks (64 nodes) </td>
<td>15m 20s </td>
<td>37m 0s</td>
</tr>
<tr>
<td>3 racks (96 nodes) </td>
<td>10m 19s </td>
<td>24m 14s</td>
</tr>
<tr>
<td>4 racks (128 nodes) </td>
<td>7m 56s </td>
<td>17m 45s</td>
</tr>
</table>
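<p>To make the table easier to read, here is a small sketch that converts the timings above to seconds and computes two derived numbers: the ratio of the Hadoop time to the Sector/Sphere time at each size, and the speedup of each system relative to its own single rack time.  The timings are copied from the table; the helper and variable names are only illustrative.</p>

```python
def seconds(minutes, secs):
    """Convert a 'Xm Ys' timing from the table into seconds."""
    return minutes * 60 + secs

# Terasort timings copied from the table above, keyed by number of racks.
timings = {
    1: {"sector": seconds(28, 25), "hadoop": seconds(85, 49)},
    2: {"sector": seconds(15, 20), "hadoop": seconds(37, 0)},
    3: {"sector": seconds(10, 19), "hadoop": seconds(24, 14)},
    4: {"sector": seconds(7, 56), "hadoop": seconds(17, 45)},
}

for racks in sorted(timings):
    t = timings[racks]
    hadoop_to_sector = t["hadoop"] / t["sector"]
    sector_speedup = timings[1]["sector"] / t["sector"]
    hadoop_speedup = timings[1]["hadoop"] / t["hadoop"]
    print(f"{racks} rack(s): Hadoop/Sector time ratio {hadoop_to_sector:.2f}, "
          f"speedup over 1 rack: Sector {sector_speedup:.2f}x, Hadoop {hadoop_speedup:.2f}x")
```

<p>On these numbers, Hadoop takes roughly two to three times as long as Sector/Sphere at each of the four sizes.</p>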
<p>The Raywulf clusters were designed by Michal Sabala and Yunhong Gu of the <a href="http://www.ncdm.uic.edu">National Center for Data Mining</a> at the University of Illinois at Chicago.</p>
<p>We are working on putting together more information on how to build a Raywulf cluster.</p>
<p>Sector/Sphere and our Raywulf Clusters were selected as one of the <a href="http://sc09.supercomputing.org/?pg=disrupttech.html">Disruptive Technologies</a> that will be highlighted at <a href="http://sc09.supercomputing.org">SC 09</a>.</p>
<p>The photograph above of two racks from the Open Cloud Testbed was taken by Michal Sabala.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/09/27/building-your-own-large-data-clouds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Test Drive the Sector Public Cloud</title>
		<link>http://rgrossman.com/2009/06/23/sector-public-cloud/</link>
		<comments>http://rgrossman.com/2009/06/23/sector-public-cloud/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 12:17:51 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[C++ cloud]]></category>
		<category><![CDATA[Google File System]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[high performance networks]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[open source cloud]]></category>
		<category><![CDATA[Sector]]></category>
		<category><![CDATA[Sector/Sphere]]></category>
		<category><![CDATA[Sphere]]></category>
		<category><![CDATA[User Defined Functions]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=259</guid>
		<description><![CDATA[Sector is an open source cloud written in C++ for storing, sharing and processing large data sets.   Sector is broadly similar to the Google File System and the Hadoop Distributed File System, except that it is designed to utilize wide area high performance  networks.
Sphere is middleware that is designed to process data managed by Sector.  [...]]]></description>
			<content:encoded><![CDATA[<p>Sector is an open source cloud written in C++ for storing, sharing and processing large data sets.   Sector is broadly similar to the <a href="http://labs.google.com/papers/gfs.html">Google File System</a> and the <a href="http://hadoop.apache.org/core/">Hadoop Distributed File System</a>, except that it is designed to utilize wide area high performance  networks.</p>
<p>Sphere is middleware that is designed to process data managed by Sector.  Sphere implements a framework for distributed computing that allows any User Defined Function (UDF) to be applied to a Sector dataset.</p>
<p>One way to think about this is as a generalized MapReduce.  With MapReduce, users work with &lt;key, value&gt; pairs and define a Map function and a Reduce function, and the MapReduce application creates a workflow consisting of a Map, Shuffle, Sort and Reduce.  With Sector, users can create a workflow consisting of any sequence of User Defined Functions (UDFs) and apply these to any datasets managed by Sector.  In particular, Sphere has predefined Shuffle and Sort UDFs that can be applied to datasets consisting of &lt;key, value&gt; pairs so that MapReduce applications can be implemented once a user defines a Map and Reduce UDF.</p>
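<p>The idea of MapReduce as one particular sequence of UDFs can be sketched in a few lines of Python.  This is only a single process illustration of the concept and is not the Sphere API (Sphere is written in C++); the function names <code>apply_udfs</code>, <code>map_udf</code>, <code>sort_udf</code> and <code>reduce_udf</code> are hypothetical.</p>

```python
from itertools import groupby
from operator import itemgetter

def apply_udfs(dataset, udfs):
    """Apply a sequence of user defined functions, each mapping records to records."""
    for udf in udfs:
        dataset = udf(dataset)
    return dataset

def map_udf(lines):
    # Emit a (key, value) pair for every word.
    return [(word, 1) for line in lines for word in line.split()]

def sort_udf(pairs):
    # Stands in for Shuffle and Sort: in one process, sorting by key
    # is enough to bring equal keys next to each other.
    return sorted(pairs, key=itemgetter(0))

def reduce_udf(pairs):
    # Sum the values of each run of equal keys.
    return [(key, sum(value for _, value in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

word_counts = apply_udfs(["a b a", "b c"], [map_udf, sort_udf, reduce_udf])
print(word_counts)  # [('a', 2), ('b', 2), ('c', 1)]
```

<p>Nothing in <code>apply_udfs</code> is specific to MapReduce: any other list of functions could be passed in, which is the sense in which a UDF workflow generalizes it.</p>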
<p>Sector also implements security and we are currently using it to bring up a HIPAA-compliant private cloud.</p>
<p>Since Sector/Sphere is written in C++, it is straightforward to support C++ based data access tools and programming APIs.</p>
<p>If you have access to a high speed research network (for example, if your network can reach <a href="http://www.startap.net/starlight/">StarLight</a>, the <a href="http://www.nlr.net/">National Lambda Rail</a>, <a href="http://www.es.net">ESNet</a>, or <a href="http://www.internet2.edu">Internet2</a>), then you can try out the Sector Public Cloud.</p>
<p>You can reach the Sector Public Cloud from the Sector home page <a href="http://sector.sourceforge.net">sector.sourceforge.net</a>.</p>
<p>There is a technical report on the design of Sector on arXiv: <a href="http://arxiv.org/abs/0809.1181">arXiv:0809.1181v2</a>.</p>
<p>There is some information on the performance of Sector/Sphere in my post on the <a href="http://blog.rgrossman.com/2009/05/25/malstone-benchmark/">MalStone Benchmark</a>, a benchmark for clouds that support data intensive computing.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/06/23/sector-public-cloud/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing</title>
		<link>http://rgrossman.com/2009/05/25/malstone-benchmark/</link>
		<comments>http://rgrossman.com/2009/05/25/malstone-benchmark/#comments</comments>
		<pubDate>Mon, 25 May 2009 16:41:58 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[benchmarks for cloud analytics]]></category>
		<category><![CDATA[benchmarks for data intensive computing]]></category>
		<category><![CDATA[cloud analytics]]></category>
		<category><![CDATA[cloud computing benchmarks]]></category>
		<category><![CDATA[CloudStone]]></category>
		<category><![CDATA[drive-by exploits]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hadoop wins TeraSort]]></category>
		<category><![CDATA[log files]]></category>
		<category><![CDATA[MalStone]]></category>
		<category><![CDATA[site-entity]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=146</guid>
		<description><![CDATA[The TPC Benchmarks have played an important role in comparing databases and transaction processing systems.   Currently, there are no similar benchmarks for comparing two clouds.

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications.  In this note, we describe the MalStone Benchmark, which [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.tpc.org/">TPC Benchmarks</a> have played an important role in comparing databases and transaction processing systems.   Currently, there are no similar benchmarks for comparing two clouds.</p>
<p><img src="http://rgrossman.files.wordpress.com/2009/06/benchmark_bourbon_whiskey.jpg?w=225" alt="Benchmark" title="Benchmark" width="225" height="300" class="alignleft size-medium wp-image-229" /></p>
<p>The <a href="http://radlab.cs.berkeley.edu/wiki/Projects/Cloudstone">CloudStone Benchmark</a> is a first step towards a benchmark for clouds designed to support Web 2.0 type applications.  In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as <a href="http://hadoop.apache.org/core/">Hadoop</a> and <a href="http://sector.sourceforge.net">Sector</a>, designed to support data intensive computing.</p>
<p>MalStone is a stylized analytic computation of a type that is common in data intensive computing.   The open source code to generate data for MalStone and a technical report describing MalStone and providing some sample implementations can be found at: <a href="http://code.google.com/p/malgen">code.google.com/p/malgen</a> (look in the featured downloads section along the right hand side).</p>
<h3>Detecting Drive-By Exploits from Log Files</h3>
<p>We introduce MalStone with a simple example.  Consider visitors to web sites.  As described in the paper <a href="http://www.usenix.org/events/hotbots07/tech/full_papers/provos/provos.pdf">The Ghost in the Browser</a> by <a href="http://www.provos.org">Provos</a> et al. that was presented at <a href="http://www.usenix.org/events/hotbots07/tech/">HotBots &#8216;07</a>, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the web pages.  Sometimes these are called “drive-by exploits.”</p>
<p>The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:</p>
<pre>   | Timestamp | Web Site ID | User ID</pre>
<p>There is a further assumption that if the computers become infected, at perhaps a later time, then this is known.  That is, for each computer, which we assume is identified by the ID of the corresponding user, it is known whether at some later time that computer has become compromised:</p>
<pre>   | User ID | Compromise Flag</pre>
<p>Here the Compromise field is a flag, with 1 denoting a compromise.  A very simple statistic that provides some insight into whether a web page is a possible source of compromises is to compute for each web site the ratio of visits in which the computer subsequently becomes compromised to those in which the computer remains uncompromised.</p>
<p>We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites.  Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging.  For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today.  On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.</p>
<p>The MalStone benchmarks use records of the following form:</p>
<pre>   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID</pre>
<p>Here site abstracts web site and entity abstracts the possibly infected computer.   We assume that each record is 100 bytes long.</p>
<p>In the MalStone A Benchmark, for each site, the number of records for which an entity visited the site and subsequently became compromised is divided by the total number of records for which an entity visited the site.  The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest).  MalStone A-10 uses 10 billion records so that in total there is 1 TB of data.  Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records.   MalStone B-10, B-100 and B-1000 are defined in the same way.</p>
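<p>To make the two definitions concrete, here is a minimal single machine sketch in Python.  It is not one of the benchmark implementations; the toy records, the timestamp format, the choice of period start and the function names <code>malstone_a</code> and <code>malstone_b</code> are all assumptions made for illustration.</p>

```python
from collections import defaultdict
from datetime import datetime

# Toy records: (event_id, timestamp, site_id, compromise_flag, entity_id).
# The timestamp format is an assumption made for this sketch.
records = [
    (1, "2009-01-05 10:00:00", "site-a", 1, "e1"),
    (2, "2009-01-06 11:00:00", "site-a", 0, "e2"),
    (3, "2009-01-13 09:30:00", "site-a", 1, "e3"),
    (4, "2009-01-14 12:00:00", "site-b", 0, "e1"),
]

def malstone_a(records):
    """Per site: records with the compromise flag set / all records."""
    totals = defaultdict(int)
    compromised = defaultdict(int)
    for _event, _ts, site, flag, _entity in records:
        totals[site] += 1
        compromised[site] += flag
    return {site: compromised[site] / totals[site] for site in totals}

def malstone_b(records, period_start):
    """Per site and week: the same ratio over the window from
    period_start to the end of that week (a running total)."""
    per_week = defaultdict(lambda: [0, 0])  # (site, week) to [compromised, total]
    for _event, ts, site, flag, _entity in records:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        week = (when - period_start).days // 7
        per_week[(site, week)][0] += flag
        per_week[(site, week)][1] += 1
    running = defaultdict(lambda: [0, 0])
    ratios = {}
    # Sorting by (site, week) visits each site's weeks in order. Weeks in
    # which a site has no records are skipped rather than carried forward.
    for site, week in sorted(per_week):
        running[site][0] += per_week[(site, week)][0]
        running[site][1] += per_week[(site, week)][1]
        ratios[(site, week)] = running[site][0] / running[site][1]
    return ratios

ratios_a = malstone_a(records)
ratios_b = malstone_b(records, datetime(2009, 1, 5))

# Size of MalStone A-10: 10 billion records of 100 bytes is 10**12 bytes (1 TB).
malstone_a10_bytes = 10_000_000_000 * 100
```

<p>At this scale the computation is trivial.  The point of the benchmark is that the same computation over 10 billion or more records no longer fits on a single machine.</p>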
<h3>TeraSort Benchmark</h3>
<p>One of the motivations for choosing 10 billion 100-byte records is that the <a href="http://sortbenchmark.org/">TeraSort Benchmark</a> (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.</p>
<p>In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark.  It was able to sort 1 TB of data <a href="http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html">using 910 nodes in 209 seconds</a>, breaking the previous record of 297 seconds.   Hadoop set a new record in 2009 by sorting 100 TB of data at <a href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html">0.578 TB/minute using 3800 nodes</a>.  For some background about the TeraSort Benchmark, see the blog posting by James Hamilton <a href="http://perspectives.mvdirona.com/2008/07/08/HadoopWinsTeraSort.aspx">Hadoop Wins Terasort</a>.</p>
<p>Note that the TeraSort Benchmark is now deprecated and has been replaced by the <a href="http://sortbenchmark.org/">Minute Sort Benchmark</a>.  Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.</p>
<h3>Generating Data for MalStone Using MalGen</h3>
<p>We have developed a generator of synthetic data for MalStone called MalGen.  MalGen is open source and available from <a href="http://code.google.com/p/malgen">code.google.com/p/malgen</a>.  Using MalGen, data can be generated with power law distributions, which is useful when modeling  web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).</p>
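<p>To illustrate what a power law distribution of visits looks like, here is a small Python sketch.  It is not MalGen and does not reflect how MalGen is implemented; the function name <code>sample_site_visits</code>, the rank based weights and the parameter values are assumptions chosen only for illustration.</p>

```python
import random
from collections import Counter

def sample_site_visits(num_sites, num_visits, exponent=1.0, seed=0):
    """Draw site IDs so that the site of rank r is visited with
    probability proportional to 1 / r**exponent (a power law)."""
    rng = random.Random(seed)
    sites = list(range(1, num_sites + 1))
    weights = [1.0 / rank ** exponent for rank in sites]
    return rng.choices(sites, weights=weights, k=num_visits)

visits = sample_site_visits(num_sites=1000, num_visits=100_000)
counts = Counter(visits)

# Fraction of all visits that go to the ten most popular sites.
top_10_share = sum(counts[site] for site in range(1, 11)) / len(visits)
print(f"share of visits going to the 10 most popular sites: {top_10_share:.2f}")
```

<p>With these settings a large share of the visits lands on 1% of the sites, while most of the remaining sites are visited only occasionally.</p>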
<h3>Using MalStone to Study Design Tradeoffs</h3>
<p>Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records.   The experiments were done on 20 nodes of the <a href="http://www.opencloudconsortium.org/testbed.html">Open Cloud Testbed</a>.  Each node was a Dell 1435 computer with 12 GB of memory, a 1 TB disk, two 2.0 GHz dual-core AMD Opteron 2212 processors, and 1 Gb/s network interface cards.</p>
<p>We compared three different implementations: 1) Hadoop HDFS with Hadoop&#8217;s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using <a href="http://arxiv.org/abs/0809.1181">Sphere User Defined Functions (UDFs)</a>.</p>
<table border="1">
<tr>
<th colspan="2"> MalStone A</th>
</tr>
<tr>
<td>Hadoop MapReduce </td>
<td>454m 13s</td>
</tr>
<tr>
<td>Hadoop Streams/Python</td>
<td>87m 29s </td>
</tr>
<tr>
<td>Sector/Sphere UDFs </td>
<td>33m 40s</td>
</tr>
<tr>
<th colspan="2"> MalStone B</th>
</tr>
<tr>
<td>Hadoop MapReduce </td>
<td>840m 50s</td>
</tr>
<tr>
<td>Hadoop Streams/Python</td>
<td>142m 32s </td>
</tr>
<tr>
<td>Sector/Sphere UDFs </td>
<td>43m 44s</td>
</tr>
</table>
<p><b>Please note that these timings are still preliminary and may be revised in the future as we better optimize the implementations. </b></p>
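<p>For readers who prefer ratios to raw timings, here is a short sketch that expresses each timing in the table above as a multiple of the Sector/Sphere UDF time for the same benchmark.  The timings are copied from the table; the dictionary keys are only illustrative labels.</p>

```python
def seconds(minutes, secs):
    """Convert a 'Xm Ys' timing from the table into seconds."""
    return minutes * 60 + secs

# Preliminary timings copied from the table above.
malstone_timings = {
    "A": {"mapreduce": seconds(454, 13),
          "streams_python": seconds(87, 29),
          "sphere_udf": seconds(33, 40)},
    "B": {"mapreduce": seconds(840, 50),
          "streams_python": seconds(142, 32),
          "sphere_udf": seconds(43, 44)},
}

relative = {}
for benchmark, t in malstone_timings.items():
    # Each timing as a multiple of the Sphere UDF timing.
    relative[benchmark] = {name: secs / t["sphere_udf"] for name, secs in t.items()}
    for name, factor in relative[benchmark].items():
        print(f"MalStone {benchmark} {name}: {factor:.1f}x the Sphere UDF time")
```

<p>On these preliminary numbers, Hadoop MapReduce takes about 13 times (MalStone A) and 19 times (MalStone B) as long as the Sphere UDF implementation, and Hadoop Streams with Python takes about 2.6 and 3.3 times as long.</p>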
<p>If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice.  What these preliminary benchmarks indicate though is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams.  In addition, you may also want to consider using <a href="http://sector.sourceforge.net">Sector</a>.</p>
<p>The image above is from <a href="http://www.flickr.com/photos/legeres/270126135/">Strolling everyday</a> and available via a Creative Commons license.</p>
<p><b>Disclaimer:</b>  I am involved in the development of Sector.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/05/25/malstone-benchmark/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Open Source Analytics Reaches Main Street (and Some Other Trends in Analytics)</title>
		<link>http://rgrossman.com/2009/05/11/open-source-analytics-reaches-main-street/</link>
		<comments>http://rgrossman.com/2009/05/11/open-source-analytics-reaches-main-street/#comments</comments>
		<pubDate>Mon, 11 May 2009 17:18:40 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[PMML]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[standards]]></category>
		<category><![CDATA[analytic standards]]></category>
		<category><![CDATA[cloud-based data services]]></category>
		<category><![CDATA[commoditization of data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[public datasets]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=78</guid>
		<description><![CDATA[This is the first of three posts about systems, applications, services and architectures for building and deploying analytics.   Sometimes this is called analytic infrastructure.  This post is primarily directed at the analytic infrastructure needs of companies.  Later posts will look at analytic infrastructure for the research community.
In this first post of [...]]]></description>
			<content:encoded><![CDATA[<p>This is the first of three posts about systems, applications, services and architectures for building and deploying analytics.   Sometimes this is called <em>analytic infrastructure</em>.  This post is primarily directed at the analytic infrastructure needs of companies.  Later posts will look at analytic infrastructure for the research community.</p>
<p>In this first post of the series, we discuss five important trends impacting analytic infrastructure.</p>
<p><strong>Trend 1.  Open source analytics has reached Main Street. </strong> <a href="http://www.r-project.org">R</a>, which was first released in 1996, is now the most widely deployed open source system for statistical computing.  A recent <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html">article</a> in the New York Times estimated that over 250,000 individuals use R regularly.  Dice News has created a video called &#8220;<a href="http://www.youtube.com/watch?v=ZwYQPtU2Pa0&amp;feature=channel_page">What&#8217;s Up with R</a>&#8221; to inform job hunters using their services about R.  In the language of Geoffrey A. Moore&#8217;s book <em>Crossing the Chasm</em>, R has reached &#8220;Main Street.&#8221;</p>
<p>Some companies still either ban the use of open source software or require an elaborate approval process before open source software can be used.  Today, if a company does not allow the use of R, it puts the company at a competitive disadvantage.</p>
<p><strong>Trend 2.   The maturing of open, standards based architectures for analytics. </strong> Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician.  It is usually a challenge to deploy the model produced by the application into operational systems.  Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.</p>
<p>The <a href="http://www.dmg.org">Predictive Model Markup Language</a> (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models.  PMML has become the dominant standard for statistical and data mining models.   Many applications now support PMML.</p>
<p>By using these applications,  it is possible to build an open, modular standards based environment for analytics.  With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.</p>
<p>Disclaimer: I&#8217;m one of the many people who have been involved in the development of the PMML standard.</p>
<p><strong>Trend 3.  The emergence of systems that simplify the analysis of large datasets. </strong> Analyzing large datasets is still very challenging, but with the introduction of <a href="http://hadoop.apache.org/core/">Hadoop</a>, there is now an open source system supporting <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a> that scales to thousands of processors.</p>
<p>The significance of Hadoop and MapReduce is not only the <em>scalability</em>, but also the <em>simplicity</em>.  Most programmers, with no prior experience, can have their first Hadoop job running on a large cluster within a day.  Most programmers find that it is much easier and much quicker to use MapReduce and some of its generalizations than it is to develop and implement an MPI job on a cluster, which is currently the most common programming model for clusters.</p>
<p><strong>Trend 4.   Cloud-based data services. </strong> Over the next several years, cloud-based services will begin to impact analytics significantly.   A later post in this series will show how simple it is to use R in a cloud, for example.  Although there are security, compliance and policy issues to work out before it becomes common to use clouds for analytics, I expect that these and related issues will all be worked out over the next several years.</p>
<p>Cloud-based services provide several advantages for analytics.  Perhaps the most important is elastic capacity &#8212; if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more.  This ability of clouds to handle surge capacity is important for many groups that do analytics.  With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense.  (Third party clouds provide computing capacity that is an operating and not a capital expense.)</p>
<p><strong>Trend 5.  The commoditization of data. </strong> Moore&#8217;s law applies not only to CPUs, but also to the chips that are used in all of the digital devices that produce data.  The result has been that the cost to produce data has been falling for some time.  Similarly, the cost to store data has also been falling for some time.</p>
<p>Indeed, more and more datasets are being offered for free.  For example, end of day stock <a href="http://finance.yahoo.com/q">quotes</a> from Yahoo, gene sequence data from <a href="http://www.ncbi.nlm.nih.gov/">NCBI</a>, and <a href="http://aws.amazon.com/publicdatasets/">public data sets</a> hosted by Amazon, including the U.S. Census Bureau, are all available now for free.</p>
<p>The significance to analytics is that the cost to enrich data with third party data, which often produces better models, is falling.  Over time, more and more of this data will be available in clouds, so that the effort to integrate this data into modeling will also decrease.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/05/11/open-source-analytics-reaches-main-street/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
