<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robert Grossman &#187; Robert Grossman</title>
	<atom:link href="http://rgrossman.com/author/rlg/feed/" rel="self" type="application/rss+xml" />
	<link>http://rgrossman.com</link>
	<description>Robert Grossman&#039;s home page</description>
	<lastBuildDate>Sun, 19 Feb 2012 13:33:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.1</generator>
		<item>
		<title>A Vision for Biomedical Clouds</title>
		<link>http://rgrossman.com/2012/01/17/a-vision-for-a-biomedical-clouds/</link>
		<comments>http://rgrossman.com/2012/01/17/a-vision-for-a-biomedical-clouds/#comments</comments>
		<pubDate>Tue, 17 Jan 2012 19:02:24 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[genomics]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=120</guid>
		<description><![CDATA[Kevin White and I wrote a paper about the impact of big data in biology, medicine and health care and about some of the enabling technology, such as science clouds. The paper is called &#8220;A Vision &#8230; <a href="http://rgrossman.com/2012/01/17/a-vision-for-a-biomedical-clouds/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Kevin White and I wrote a paper about the impact of big data in biology, medicine and health care and about some of the enabling technology, such as science clouds.</p>
<p>The paper is called &#8220;A Vision for Biomedical Clouds&#8221; and was published in the Journal of Internal Medicine (<a href="http://dx.doi.org/10.1111/j.1365-2796.2011.02491.x">doi:10.1111/j.1365-2796.2011.02491.x</a>).  The paper is open access.</p>
<p>You can also find an online version of the paper <a href="http://papers.rgrossman.com/journal-046.pdf">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2012/01/17/a-vision-for-a-biomedical-clouds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PMML version 4.1 and Augustus version 0.5 Released</title>
		<link>http://rgrossman.com/2011/12/30/pmml-version-4-1-and-augustus-version-0-5-released/</link>
		<comments>http://rgrossman.com/2011/12/30/pmml-version-4-1-and-augustus-version-0-5-released/#comments</comments>
		<pubDate>Fri, 30 Dec 2011 19:31:53 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic infrastructure]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=135</guid>
		<description><![CDATA[The Data Mining Group just released PMML Version 4.1. PMML is the leading standard for statistical and data mining models. Version 4.1 includes support for multiple models, such as segmented models and ensembles of models, and for new models, such &#8230; <a href="http://rgrossman.com/2011/12/30/pmml-version-4-1-and-augustus-version-0-5-released/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The Data Mining Group just released <a href="http://www.dmg.org">PMML Version 4.1</a>.  PMML is the leading standard for statistical and data mining models.  Version 4.1 includes support for multiple models, such as segmented models and ensembles of models, and for new models, such as baseline models, which are used in data quality, process control and change detection.</p>
<p>Open Data also just released a new version of <a href="http://augustus.googlecode.com">Augustus (version 0.5)</a>, which includes support for PMML 4.1.  Augustus is an open source, Python-based, PMML-compliant analytic application that can produce PMML-compliant models (a PMML Producer) and read PMML models and score data against them (a PMML Consumer).  This newest version of Augustus also adds support for streaming analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2011/12/30/pmml-version-4-1-and-augustus-version-0-5-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tutorial on Data Intensive Computing at SC 11</title>
		<link>http://rgrossman.com/2011/11/14/tutorial-on-data-intensive-computing-at-sc-11/</link>
		<comments>http://rgrossman.com/2011/11/14/tutorial-on-data-intensive-computing-at-sc-11/#comments</comments>
		<pubDate>Mon, 14 Nov 2011 18:47:49 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=115</guid>
		<description><![CDATA[Collin Bennett and I gave a three-hour tutorial at SC 11 in Seattle on data intensive computing. You can find the slides for the tutorial here. The titles of the talks were: An Introduction to Big Data (Chapter 1), &#8230; <a href="http://rgrossman.com/2011/11/14/tutorial-on-data-intensive-computing-at-sc-11/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Collin Bennett and I gave a three-hour tutorial at SC 11 in Seattle on data intensive computing.  You can find the slides for the tutorial <a href="http://www.slideshare.net/rgrossman">here</a>.</p>
<p>The titles of the talks were: An Introduction to Big Data (Chapter 1), Managing Big Data (Chapter 2), and Processing Big Data (Chapter 3).  You can also find the slides for the hands-on laboratory session that we led.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2011/11/14/tutorial-on-data-intensive-computing-at-sc-11/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What is Big Data?</title>
		<link>http://rgrossman.com/2011/10/01/what-is-big-data-2/</link>
		<comments>http://rgrossman.com/2011/10/01/what-is-big-data-2/#comments</comments>
		<pubDate>Sat, 01 Oct 2011 15:36:01 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=95</guid>
		<description><![CDATA[The discipline of data intensive computing has been growing in importance and in popularity recently. It has now become popular enough that the term &#8220;big data&#8221; is beginning to be used instead. The graph below is from Google Trends and &#8230; <a href="http://rgrossman.com/2011/10/01/what-is-big-data-2/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The discipline of data intensive computing has been growing in importance and in popularity recently.   It has now become popular enough that   the term &#8220;big data&#8221; is beginning to be used instead.  The graph below is from Google Trends and shows the growth of the term &#8220;big data&#8221; over the past couple of years.</p>
<p><img src="http://rgrossman.com/files/2011/11/big-data-trends.png"/></p>
<p>I used to think that data came in three sizes depending upon how you managed it: either small enough to fit into memory, small enough to fit into a database, or too big for a database.</p>
<p>During the last few years, I have changed my point of view with respect to how you measure the size of big data.  The most common point of view is to measure the size of data in terms of bytes: megabytes, gigabytes, terabytes, petabytes, and exabytes.  But over the past few years, I have noticed that people with very large amounts of data measure their data, and the computing power required to process it, in megawatts (MW).</p>
<p>Here are some examples: </p>
<ul>
<li> A good sweet spot for a data center is 15 MW. </li>
<li> Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW. </li>
<li> Facebook’s new Prineville data center is 30 MW. </li>
<li> Google’s computing infrastructure uses 260 MW. </li>
</ul>
<p>Today, the Open Science Data Cloud requires about 0.5 MW.   Our goal over the next 3 to 5 years is to develop and operate a 5 MW or so facility devoted to science.</p>
<p>The perspective when you measure data in MW is somewhat different.  You would like the facility to be uniform.  You would like to be able to add new racks and retire old racks with little if any manual intervention.   You would like to be able to optimize the amount of data you can manage and the amount of data you can process per MW.</p>
<p>Today, it takes too long for us to add and retire racks from the OSDC.  If you would like to join a research project to develop open source software to simplify this, please write us at info at opencloudconsortium.org.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2011/10/01/what-is-big-data-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Emergence of Genomics as a Data Intensive Science</title>
		<link>http://rgrossman.com/2011/09/20/the-emergence-of-genomics-as-a-data-intensive-science/</link>
		<comments>http://rgrossman.com/2011/09/20/the-emergence-of-genomics-as-a-data-intensive-science/#comments</comments>
		<pubDate>Tue, 20 Sep 2011 19:22:34 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[big data]]></category>
		<category><![CDATA[genomics]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=131</guid>
		<description><![CDATA[I gave a talk today at Bio-IT World in La Jolla discussing the impact of the emergence of genomics as a data intensive science. I discussed some of the infrastructure being developed to support data intensive science, such as the &#8230; <a href="http://rgrossman.com/2011/09/20/the-emergence-of-genomics-as-a-data-intensive-science/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>
I gave a talk today at <a href="http://www.bio-itcloudsummit.com/cld_content.aspx?id=106151">Bio-IT World</a> in La Jolla discussing the impact of the emergence of genomics as a data intensive science. I discussed some of the infrastructure being developed to support data intensive science, such as the Open Cloud Consortium’s <a href="http://www.opensciencedatacloud.org">Open Science Data Cloud</a>.  I also described the design and implementation of the <a href="http://www.bionimbus.org">Bionimbus</a> system, which  is a comprehensive bioinformatics system for archiving, managing, analyzing, re-analyzing, and sharing large collections of genome-wide datasets.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2011/09/20/the-emergence-of-genomics-as-a-data-intensive-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some Research Topics Related to Big Data Science and Cloud Computing</title>
		<link>http://rgrossman.com/2011/07/08/ieee-cloud-2011-plenary-panel/</link>
		<comments>http://rgrossman.com/2011/07/08/ieee-cloud-2011-plenary-panel/#comments</comments>
		<pubDate>Fri, 08 Jul 2011 14:39:40 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=83</guid>
		<description><![CDATA[IEEE Cloud 2011 took place in Washington DC from July 4 to 6, 2011. The full name of the conference is The 4th International Conference on Cloud Computing and it was co-located with three related conferences: 1) IEEE Services 2011 &#8230; <a href="http://rgrossman.com/2011/07/08/ieee-cloud-2011-plenary-panel/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.thecloudcomputing.org/2011/">IEEE Cloud 2011</a> took place in Washington DC from July 4 to 6, 2011.</p>
<p>The full name of the conference is The 4th International Conference on Cloud Computing and it was co-located with three related conferences: 1) IEEE Services 2011 (The 7th World Congress on Services), 2) IEEE SCC 2011 (The 8th International Conference on Services Computing), and 3) IEEE ICWS 2011 (The 9th International Conference on Web Services).</p>
<p>There were a lot of different technical topics covered.   The diagram below shows you some of them.  </p>
<p>In addition, all four conferences worked together and sponsored several plenary panels.  I participated in one of them called &#8220;Science in Cloud Computing.&#8221;  I have posted my slides on slideshare and you can find them <a href="http://www.slideshare.net/rgrossman/open-science-data-cloud-ieee-cloud-2011">here</a>.</p>
<p><a href="http://rgrossman.com/files/2011/10/SC-confs-Landscape.jpg"><img src="http://rgrossman.com/files/2011/10/SC-confs-Landscape-300x232.jpg" alt="" title="IEEE Cloud and Service Related Conferences 2011" width="300" height="232" class="alignnone size-medium wp-image-87" /></a></p>
<p>One of the topics that I work on these days is data intensive computing and in particular its impact on science.   The popular term is <em>big data science</em>.  Data intensive computing and big data have had an important impact on business over the past decade, but their impact on science is just beginning to be felt.</p>
<p>In my talk for the plenary panel, I described a project that I have been working on called the Open Science Data Cloud (OSDC).  The OSDC is sponsored by the not-for-profit Open Cloud Consortium <a href="http://www.opencloudconsortium.org">(OCC)</a>.   We are working with OCC partners and sponsors to stand up a cloud devoted to science.  Initially it will contain approximately 1 PB of data from a variety of scientific disciplines.  </p>
<p>We are looking for volunteers to help with the OSDC, so please contact us at  info at opencloudconsortium.org if you would like to get involved.  We are looking for help with loading and curating the data, the data intensive computing cloud infrastructure, the web site, and outreach. </p>
<p>Based upon my experience with the OSDC over the past year, I ended my presentation in the plenary panel with three research challenges related to data intensive computing and cloud computing:</p>
<ol>
<li>Develop technology to encapsulate a scientist’s data and analysis tools and to export, save and move these between clouds.</li>
<li>Develop protocols, utilities, and applications so that new racks and containers can be added to data clouds with minimal human involvement. </li>
<li>Develop technology to support the long term (20+ years), low cost preservation of data and metadata in clouds.</li>
</ol>
<p>Source: The diagram is from http://www.servicescongress.org/2011/.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2011/07/08/ieee-cloud-2011-plenary-panel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Small, Medium and Big Data</title>
		<link>http://rgrossman.com/2011/06/01/small-medium-big-data/</link>
		<comments>http://rgrossman.com/2011/06/01/small-medium-big-data/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 10:23:37 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://rgrossman.com/?p=72</guid>
		<description><![CDATA[What is big data? From the point of view of the infrastructure required to do analytics, data comes in three sizes: Small data. Small data fits into the memory of a single machine. A good example of a small dataset &#8230; <a href="http://rgrossman.com/2011/06/01/small-medium-big-data/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What is big data? From the point of view of the infrastructure required to do analytics, data comes in three sizes:</p>
<ul>
<li><b>Small data.</b> Small data fits into the memory of a single machine. A good example of a small dataset is the dataset for the <a href="http://www.netflixprize.com/">Netflix Prize</a>. The Netflix Prize dataset consists of over 100 million movie ratings from 480 thousand randomly-chosen, anonymous Netflix customers who rated over 17 thousand movie titles. This dataset (although challenging enough to keep anyone from winning the grand prize for over 2 years) is just 2 GB of data and fits into the memory of a laptop. </li>
<li><b>Medium data.</b> A good working definition is that data is medium size if it fits on a single disk or disk array and can be managed by a database. It is becoming common today for companies and organizations to create data warehouses of 10 to 100 TB or larger, so medium size data can grow quite large. </li>
<li><b>Big data.</b> Big data is so large that it is challenging to manage it in a database and instead specialized systems are used.  The most popular such system these days is <a href="http://hadoop.apache.org/">Hadoop</a>, although I expect we will have more choices in a few years.  What have become known as NoSQL databases can also be used to manage big data sets. </li>
</ul>
<p>There have always been large datasets, but until about 2000, most large datasets were produced by the scientific and defense communities.  For example,  the Large Hadron Collider (<a href="http://lhc.web.cern.ch/lhc/">LHC</a>) will produce a large data set.   </p>
<p>Two things have changed during the last decade: First, large datasets are now produced by a third community: companies that provide Internet services, such as search, on-line advertising and social media.  Second, the ability to analyze these datasets is critical for the advertising systems that produce the bulk of the revenue for these companies. This provides a metric (dollars of online revenue produced) by which to measure the effectiveness of analytic infrastructure and analytic models. Using this metric, companies such as Google settled upon analytic infrastructure that was quite different from the grid-based infrastructure that is generally used by the scientific community.</p>
<p>This is an update of a post that I originally wrote in 2009 and that is no longer available.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2011/06/01/small-medium-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What is Analytic Infrastructure and Why Should You Care?</title>
		<link>http://rgrossman.com/2011/05/01/analytic-infrastructure/</link>
		<comments>http://rgrossman.com/2011/05/01/analytic-infrastructure/#comments</comments>
		<pubDate>Sun, 01 May 2011 16:07:57 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic infrastructure]]></category>

		<guid isPermaLink="false">http://rgrossman.opendatagroup.net/?p=1</guid>
		<description><![CDATA[I have been building analytic models for over 20 years. The names have changed a lot over the years: 20 years ago we built statistical models, 10 years ago we built data mining models, and today we build analytic models. &#8230; <a href="http://rgrossman.com/2011/05/01/analytic-infrastructure/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I have been building analytic models for over 20 years. The names have changed a lot over the years: 20 years ago we built statistical models, 10 years ago we built data mining models, and today we build analytic models. The algorithms have changed some: classification and regression trees became common 20 years ago, support vector machines about 10 years ago, and today graph-based algorithms are popular.</p>
<p>Perhaps what has changed the most is my perspective.</p>
<p><b>Analytic algorithms and models.</b> Twenty years ago, I was focused on algorithms and was concerned with the different types of models that you could build using different types of algorithms on different types of data. This worked fine as long as the data fit into the memory of the computer.</p>
<p><b>Analytic infrastructure.</b> For better or worse I ran into problems that had so much data that the data was too big to fit into memory. Some projects required a disk, some required many disks, and a few required tertiary storage. I spent over two decades working on what you might call analytic infrastructure. I first worked on teams that developed specialized data management infrastructures for the high energy physics community. These infrastructures were optimized for efficient reads (instead of safe writes) and accessed the data by columns (instead of rows) in order to speed up numerical computations. They turned out to be some of the first examples of data warehouses (the name was not used at that time), increased by 1 to 3 orders of magnitude the size of data that we could model, and were heavily criticized by the database community. Of course, several years later the database community embraced data warehouses at least for reports, if not for data intensive computing and modeling.</p>
<p>About five years ago, I began working on what are today called cloud computing platforms. Again, these increase by 1 to 3 orders of magnitude the size of data that we can model, and again they have been heavily criticized by some in the database community as being a big step backwards.</p>
<p>In 2009, I edited a <a href="http://www.sigkdd.org/explorations/issue.php?volume=11&#038;issue=1&#038;year=2009&#038;month=07">special issue</a> of ACM SIGKDD Explorations about analytic infrastructure. In an <a href="http://www.sigkdd.org/explorations/issues/11-1-2009-07/p1V11n1.pdf">article</a> there, I define analytic infrastructure as the applications, services, utilities and systems that are used for either preparing data for modeling, estimating models, validating models, scoring, or related analytic activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling; it simply needs to be useful as part of the modeling process. The article is available as a PDF from the SIGKDD Explorations web site (it’s Issue 1 in Volume 11).</p>
<p>I don’t really like this definition and encourage you to provide a better one. What is important though is that using the appropriate analytic infrastructure is critical to building models for problems with so much data that simply putting it into memory and forgetting about it is not a viable solution.</p>
<p><b>Analytic Strategy.</b> Returning to how my perspective has evolved, over the past several years I have become increasingly concerned with what is usually called analytic strategy. Analytic strategy is concerned with making sure that you are asking the right analytic question, that you are building a model that can be deployed efficiently, that the output of the model is actionable, that the actions have a business impact, that the business impact is aligned with corporate strategy, that there is an appropriate governance process in place, and related questions.</p>
<p>My perspective these days is that analytics requires a firm foundation and that the foundation has three columns: 1) analytic strategy; 2) analytic infrastructure; and 3) analytic algorithms and models.</p>
<p>This is a slightly updated version of a post from February 16, 2010.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2011/05/01/analytic-infrastructure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

