<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robert Grossman &#187; analytics</title>
	<atom:link href="http://rgrossman.com/category/blog/analytics/feed/" rel="self" type="application/rss+xml" />
	<link>http://rgrossman.com</link>
	<description>analytics, analytic strategy and analytic infrastructure</description>
	<lastBuildDate>Wed, 28 Jul 2010 02:49:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Ten Years of the SC XY Bandwidth Challenge</title>
		<link>http://rgrossman.com/2009/11/30/ten-years-of-the-sc-xy-bandwidth-challenge/</link>
		<comments>http://rgrossman.com/2009/11/30/ten-years-of-the-sc-xy-bandwidth-challenge/#comments</comments>
		<pubDate>Mon, 30 Nov 2009 10:06:08 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[Bandwidth Challenge]]></category>
		<category><![CDATA[bandwidth delay products]]></category>
		<category><![CDATA[FAST TCP]]></category>
		<category><![CDATA[SC 09]]></category>
		<category><![CDATA[scinet]]></category>
		<category><![CDATA[UDT]]></category>
		<category><![CDATA[UDX]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=518</guid>
		<description><![CDATA[The SC 09 Conference took place early this month in Portland.  The Bandwidth Challenge (BWC) is an interesting and friendly rivalry between research groups to develop high performance network protocols and interesting applications that use them.  The Bandwidth Challenge was started ten years ago at SC 99, which also took place in Portland.
Some [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://sc09.supercomputing.org/">SC 09</a> Conference took place early this month in Portland.  The Bandwidth Challenge (BWC) is an interesting and friendly rivalry between research groups to develop high performance network protocols and interesting applications that use them.  The Bandwidth Challenge was started ten years ago at SC 99, which also took place in Portland.</p>
<p>Some of the history is available at the web site <a href="https://scinet.supercomputing.org/">scinet.supercomputing.org</a>.  For example, in 2000, there were 2 OC-48 (2.5 Gbps) circuits that connected the research exhibits at the conference to external research networks and the challenge was to develop network protocols and applications that could fill these circuits.   The winner of the BWC (called the Network Bandwidth Challenge in 2000) was a scientific visualization application called Visapult that reached 1.48 Gbps and transferred 262 GB in 1 hour (providing 582 Mbps of sustained bandwidth utilization).</p>
<p>This year, approximately twenty-four 10 GE circuits and one 40 GE circuit connected research exhibits to external exhibits, and one of the applications reached a bandwidth utilization of over 114 Gbps.</p>
<p>I have had an interest in the BWC over the years because you cannot analyze data without accessing it, and accessing and transporting large remote datasets has always been a challenge.  To put it slightly differently, for large datasets and high performance networks, network transport protocols are an important element of the analytic infrastructure.</p>
<p>It&#8217;s useful to know the bandwidth delay product of a network, which is the product of the network capacity (in Mbps, say) and the round trip time (RTT) of a packet (in sec).   This measures the amount of data on the network that has been transmitted but not yet acknowledged as received.  For wide area high performance networks, this can be many megabytes of data.  This data must be buffered so that it can be resent if a packet is not received.</p>
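<p>As a rough illustration, the bandwidth delay product can be computed directly from the two quantities above.  The sketch below is only a back-of-the-envelope calculation; the function name is hypothetical, and the 10 Gbps capacity and 200 ms RTT are illustrative values chosen to match the kind of wide area link discussed in this post.</p>

```python
def bandwidth_delay_product_bytes(capacity_bps, rtt_seconds):
    """Return the bandwidth delay product in bytes.

    capacity_bps is the network capacity in bits per second and
    rtt_seconds is the round trip time in seconds.
    """
    bits_in_flight = capacity_bps * rtt_seconds
    return bits_in_flight / 8  # 8 bits per byte


# Illustrative values (assumptions): a 10 Gbps path with a 200 ms RTT.
bdp = bandwidth_delay_product_bytes(10e9, 0.200)
print(f"{bdp / 1e6:.0f} MB in flight")
```

<p>With these values the sender has to be prepared to buffer roughly a quarter of a gigabyte, which is why protocols tuned for small buffers struggle on such links.</p>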
<p>Challenges that have been worked out over the past decade include:</p>
<ul>
<li>Improving TCP so that it is effective over networks with high bandwidth delay products.  One of the successes is the development of <a href="http://netlab.caltech.edu/FAST/">FAST TCP</a>, a variant of the TCP protocol. </li>
<li>Developing reliable and friendly UDP-based protocols that are effective over networks with high bandwidth delay products.  For example, the open source <a href="http://udt.sf.net">UDT</a> protocol has proved over time to be quite effective.  (<b>Disclosure: </b> I have been involved in the development of the UDT protocol.) </li>
<li>Developing architectures that are effective for high end-to-end performance for transporting large datasets, from disks at one end to disks at the other end. </li>
</ul>
<p>For the past several years, it has been relatively routine for applications using FAST TCP or UDT to fill a wide area 10 Gbps network link or multiple 10 Gbps network links, if these are available.</p>
<p>Today&#8217;s problems include:</p>
<ul>
<li>Connecting data intensive devices and applications to high performance networks.  For example, with high throughput sequencing, biology is becoming data intensive, yet very few high throughput sequencing devices are connected to high performance research networks.    </li>
<li>Incorporating the appropriate network protocols into data intensive applications.  For example, one of the reasons the <a href="http://sector.sf.net">Sector/Sphere cloud</a> is effective over wide area networks is that it is based upon UDT and not TCP.  (<b>Disclosure: </b> I have been involved in the development of the Sector/Sphere cloud.) </li>
</ul>
<p>I ran into the first problem just after I got back from SC 09.  At SC 09, we ran a number of wide area data intensive applications, and in fact won the 2009 BWC for these applications.  For example, a new variant of UDT called UDX reached 9.2 Gbps over a network link with 200 ms RTT.    In contrast, as soon as I got back to Chicago, I worked for a couple of days trying to get access to 200 GB of sequence data, since the sequencing instrument that produced it was not connected to a high performance network.  With the device connected to a high performance research network, the data would have been available in a few minutes.</p>
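<p>The arithmetic behind these figures is simple enough to sketch.  The code below assumes decimal units (1 GB = 10<sup>9</sup> bytes), ignores protocol overhead, and assumes the disks at both ends can keep up; the function names are hypothetical.</p>

```python
def transfer_time_seconds(size_bytes, rate_bps):
    """Time needed to move size_bytes at a sustained rate of rate_bps."""
    return size_bytes * 8 / rate_bps


def sustained_rate_bps(size_bytes, seconds):
    """Average rate achieved when size_bytes are moved in the given time."""
    return size_bytes * 8 / seconds


# 200 GB of sequence data at the 9.2 Gbps reached by UDX at SC 09.
minutes = transfer_time_seconds(200e9, 9.2e9) / 60
print(f"{minutes:.1f} minutes")

# The 2000 winner: 262 GB transferred in 1 hour.
mbps = sustained_rate_bps(262e9, 3600) / 1e6
print(f"{mbps:.0f} Mbps sustained")
```

<p>The first calculation gives a transfer time of about three minutes, and the second reproduces the 582 Mbps sustained rate quoted for the 2000 challenge.</p>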
<p>To summarize, today network experts are comfortable designing systems that can easily fill wide area 10 GE networks, but most analytic applications are not designed to use the required protocols or to take advantage of high performance networks, and most do not have access to the required networks, even if the applications could benefit from them.</p>
<p>In disciplines, like biology, that are becoming data intensive, this type of analytic infrastructure will provide distinct competitive advantages.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/11/30/ten-years-of-the-sc-xy-bandwidth-challenge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>He Said, She Said &#8211; Why Custom Models Take Time</title>
		<link>http://rgrossman.com/2009/08/06/why-custom-models-take-time/</link>
		<comments>http://rgrossman.com/2009/08/06/why-custom-models-take-time/#comments</comments>
		<pubDate>Thu, 06 Aug 2009 19:27:45 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic strategy]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[custom models]]></category>
		<category><![CDATA[data mining process]]></category>
		<category><![CDATA[rebuilding models]]></category>
		<category><![CDATA[refreshing models]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=397</guid>
		<description><![CDATA[In this post, I discuss some of the different options available when building analytic models.  For the purposes here, a good short definition of analytics is to view analytics as using data to make predictions.  The term predictive analytics is beginning to be applied (appropriately enough) to this type of analytics.   [...]]]></description>
			<content:encoded><![CDATA[<p>In this post, I discuss some of the different options available when building analytic models.  For the purposes here, a good short definition of analytics is to view analytics as using data to make predictions.  The term <em>predictive analytics</em> is beginning to be applied (appropriately enough) to this type of analytics.   A longer definition is to view predictive analytics as building statistically valid models from data that can be used to make predictions about future events, to take actions, and to make decisions.</p>
<p><img src="http://rgrossman.files.wordpress.com/2009/09/watch.jpg?w=300" alt="Pocket watch." title="Pocket watch." width="300" height="199" class="alignleft size-medium wp-image-438" /></p>
<p>In this post, the point of view is that a business owner of a problem in a company requires a model and is considering whether to build the model in-house, outsource the model to a vendor providing analytic services, or simply to give up on building a model and produce a report instead.  I don&#8217;t recommend the last option, but unfortunately, in practice, it is all too common.</p>
<p>Broadly speaking, from a business owner&#8217;s point of view, there are several phases required to build a model for a new project.  The process looks a bit different from the modeler&#8217;s point of view.  It is also a bit simpler if the same model has been built before and all that is required is to update the model using new data.  Here are the basic steps required to build a model from the business owner&#8217;s point of view.</p>
<ol>
<li>Working with IT to obtain all the data required for the project and making it available to the modeler.  </li>
<li>Answering questions from the modeler about the data. </li>
<li>Agreeing upon the output of the model. </li>
<li>Reviewing the first model with the modeler. </li>
<li>Reviewing the second and subsequent models with the modeler. </li>
<li>Working with IT to deploy the model. </li>
</ol>
<p>Although all steps except for Step 1 are collaborative between the business owner and the modeler, Step 6 is primarily the business owner&#8217;s responsibility, while Steps 2, 4, and 5 are primarily the modeler&#8217;s responsibility.  At the beginning of many projects, Step 3 looks obvious.  It turns out that it is often not so obvious until the project is towards the end, the data has been cleaned, and the deployment is well underway.  The reason is that one often doesn&#8217;t have a good understanding of the most appropriate output of a model until the data has been cleaned and there is a good understanding of how the model will be deployed in operational systems.</p>
<p>Let&#8217;s look at this same process now from the viewpoint of the modeler.  To simplify, the following steps are required:</p>
<ol>
<li>Waiting for the data. </li>
<li>Cleaning the data.  </li>
<li>Asking the business owner questions about the data. </li>
<li>Agreeing upon the output of the model. </li>
<li>Developing a set of features for the model.  </li>
<li>Estimating the parameters of the model. </li>
<li>Building a measure to evaluate the model. </li>
<li>Evaluating the model using the measure. </li>
<li>Developing post-processing rules for the scores produced by the model. </li>
<li>Repeating the steps above for the second and subsequent versions of the model until everyone is happy, or there is no more time or funding left.  </li>
<li>Deploying the model. </li>
</ol>
<p>Building a new model requires completing all the steps above.  Generally, a series of models (version 1 of the model, version 2 of the model, etc.) are produced and reviewed by the business owner and the modeler (Step 10).  The more time available for Step 10, the better the quality of the model.</p>
<p>To understand these steps a bit better, it might be helpful to review the post about the <a href="http://blog.rgrossman.com/2009/04/28/sams-methodology/">SAMS Methodology</a>.  The SAMS methodology explains how to think of models in terms of the <b>S</b>cores they produce, the <b>A</b>ctions these enable, the <b>M</b>easures used to evaluate the actions, and whether these actions support a targeted <b>S</b>trategy or not.</p>
<p>Sometimes a model has been built before and only some of these steps need to be repeated.  For example, <em>refreshing a model</em> only requires completing Steps 6 and 8 for a series of models.  <em>Rebuilding a model</em> usually only requires repeating Steps 5, 6, 8 and 9 for a series of models.</p>
<p>Sometimes, the data is supplied in a standard format (for example, it is provided by a third party) and the deployment uses a standard format (for example, only a list of names and corresponding offers is required).  In this case, after a model has been built once, all that is required when a business owner supplies new data is to perform Steps 6 and 8.   Call this a <em>standard model</em>.  Standard models are substantially less work to build than models that require completing all the steps above.  These more labor intensive models are often called <em>custom models</em>.</p>
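<p>One way to keep these distinctions straight is to write the modeler&#8217;s steps down as data and look up which ones each kind of request involves.  The sketch below is only an illustration of the step lists given in this post; the dictionary and function names are hypothetical.</p>

```python
# The modeler's steps, numbered as in the list above.
MODELER_STEPS = {
    1: "Wait for the data",
    2: "Clean the data",
    3: "Ask the business owner questions about the data",
    4: "Agree upon the output of the model",
    5: "Develop a set of features",
    6: "Estimate the parameters of the model",
    7: "Build a measure to evaluate the model",
    8: "Evaluate the model using the measure",
    9: "Develop post-processing rules for the scores",
    10: "Repeat for the second and subsequent versions",
    11: "Deploy the model",
}

# Which steps each kind of request requires, following the text.
STEPS_BY_REQUEST = {
    "custom model": list(range(1, 12)),  # all eleven steps
    "rebuild": [5, 6, 8, 9],
    "refresh": [6, 8],
    "standard model": [6, 8],  # once the model has been built the first time
}


def work_plan(request_kind):
    """Return the step descriptions needed for a given kind of request."""
    return [MODELER_STEPS[number] for number in STEPS_BY_REQUEST[request_kind]]


for kind in STEPS_BY_REQUEST:
    print(kind, "requires", len(work_plan(kind)), "steps")
```

<p>Seen this way, the gap between refreshing a model (two steps) and building a custom model (all eleven) makes the difference in effort easy to explain to a business owner.</p>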
<p>Most requests for models fit into some standard categories of models.  Examples include models that predict whether a prospect will respond to an offer (response models), whether a customer will remain a customer (attrition models), whether a customer will keep current with their payments (credit models), and whether a transaction is valid or fraudulent (fraud models).</p>
<p>Sometimes, models that don&#8217;t fit into these familiar categories of models are built.  Call these <em>new types of models</em>.  A new type of model also requires that the modeler develop new types of features, new types of measures for evaluating the models, etc.  New types of custom models are the most labor intensive to build.</p>
<p>In practice, it usually takes four to six months or longer to build a custom model, once the data has arrived.  As the size and complexity of the data grows, each of the steps usually requires more time.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/08/06/why-custom-models-take-time/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Three Lessons in Analytic Strategy from the Netflix Prize</title>
		<link>http://rgrossman.com/2009/07/05/three-lessons-in-analytic-strategy-from-the-netflix-prize/</link>
		<comments>http://rgrossman.com/2009/07/05/three-lessons-in-analytic-strategy-from-the-netflix-prize/#comments</comments>
		<pubDate>Sun, 05 Jul 2009 09:54:37 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic strategy]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[analytic metrics]]></category>
		<category><![CDATA[analytic strategies]]></category>
		<category><![CDATA[analytics measures]]></category>
		<category><![CDATA[anonymizing data]]></category>
		<category><![CDATA[case study]]></category>
		<category><![CDATA[Cinematch]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[Netflix Prize]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=284</guid>
		<description><![CDATA[The Netflix Prize requires developing a new rating algorithm that improves by over 10% on Cinematch, the current system that Netflix uses to suggest movies to its customers.  According to the contest&#8217;s Leaderboard, it looks like the $1,000,000 Grand Prize will be awarded shortly.

The Netflix Prize provides some interesting lessons in analytic strategy. [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.netflixprize.com/">Netflix Prize</a> requires developing a new rating algorithm that improves by over 10% on Cinematch, the current system that Netflix uses to suggest movies to its customers.  According to the contest&#8217;s <a href="http://www.netflixprize.com//leaderboard">Leaderboard</a>, it looks like the $1,000,000 Grand Prize will be awarded shortly.</p>
<p><a href="http://www.netflixprize.com"><img src="http://rgrossman.files.wordpress.com/2009/07/netflixprize.jpg?w=251" alt="Netflix Prize" title="Netflix Prize" width="251" height="300" class="alignleft size-medium wp-image-319" /></a></p>
<p>The Netflix Prize provides some interesting lessons in analytic strategy.  In addition to the Grand Prize, a $50,000 Progress Prize is awarded each year until the Grand Prize is awarded.   The Progress Prize was awarded in 2007 and 2008.   The Netflix Prize has become quite well known due to the prize money being offered.   It deserves to be just as well known for the analytic strategy Netflix chose.</p>
<p>It is relatively common for a company or an organization to spend a million dollars on an analytic project.  It is less common for something useful to result from it.   I don&#8217;t have any inside knowledge about the Netflix Prize, but I think that there are several valuable lessons about analytic strategy that the Netflix Prize illustrates.   Here are three lessons.</p>
<p><b>Lesson 1.  Agree upon a metric to measure the effectiveness of an analytic model and use it consistently. </b>  It is usually not possible to find a single metric that captures all the relevant information required when comparing two analytic systems.  It is certainly the case that any actual ratings system requires several metrics.  For example, one metric might measure how many stars a viewer would assign to a movie and another metric might measure how often a viewer selects recommended movies.  On the other hand, by singling out a single metric, it becomes straightforward to compare two recommendation algorithms.  Once this is possible, it becomes simple to use the metric to create a dashboard (the <a href="http://www.netflixprize.com//leaderboard">Netflix Leaderboard</a>) and then to use the dashboard to track progress.  Netflix chose to use the root mean squared error (RMSE) between the predictions of a proposed system and the actual ratings given by users in a validation dataset.  Over 49,000 contestants from over 180 different countries formed over 40,000 teams, entered the contest, and tried to develop a recommendation algorithm with a low enough RMSE to win the Grand Prize.   In my experience, most companies and organizations lack the discipline to use a single (lead) metric to compare two analytic systems and to track progress in improving an analytic system over time using a dashboard.  Having the discipline to do so is one sign of the analytic maturity of a company.</p>
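<p>To make the metric concrete, here is a minimal sketch of how RMSE and the relative improvement of one system over another might be computed.  The function names and the small rating lists are made up for illustration; they are not Netflix data and this is not the contest&#8217;s scoring code.</p>

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate lowers the baseline's RMSE."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical star ratings and predictions from two systems.
actual = [4, 3, 5, 2, 1]
baseline = [3.5, 3.5, 4.0, 3.0, 2.0]
candidate = [3.8, 3.2, 4.6, 2.4, 1.5]

improvement = relative_improvement(rmse(baseline, actual), rmse(candidate, actual))
print(f"candidate improves on the baseline by {improvement:.0%}")
```

<p>A single number like this is what makes a leaderboard possible: every submission is reduced to one value that can be ranked and tracked over time.</p>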
<p><b>Lesson 2.  Don&#8217;t be afraid to disclose analytic technology you develop if the advantages outweigh the disadvantages. </b>  In general, it makes sense for companies and organizations not to disclose the proprietary technology they use.   On the other hand, there are some important exceptions.</p>
<ul>
<li>One exception is patents.  Patents provide some important protections, but the trade-off is that the technology must be disclosed in the patent filing.  </li>
<li>Another exception is when the software of an internal analytic project is made open source or when an internal project decides to contribute to an existing open source software project.  Again, there is a trade off.  Some technology is disclosed, but the benefit is the community support that many open source projects engender. </li>
<li>Crowdsourcing is a similar type of exception.  The benefit is the innovation that crowdsourcing can provide.  The downside is that crowdsourcing discloses technology that may be critical to your business.  Netflix found that with Cinematch customers rented more movies and were less likely to cancel their subscriptions.   Cinematch was introduced in 2000 and improved each year until a plateau was reached in 2006.  In the summer of 2006, Reed Hastings, the CEO of Netflix, suggested a public contest to improve Cinematch.    According to an <a href="http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html">article</a> in the New York Times, &#8220;Cinematch suggestions&#8230; drive a surprising 60 percent of Netflix’s rentals.&#8221;   By setting a threshold for the prize of 10% or more improvement, Netflix would obtain enough incremental revenue from an improved Cinematch system to make up for any information that Netflix&#8217;s competitors might gain.  Again, this is a good analytic strategy.  </li>
</ul>
<p><b>Lesson 3.  Double and triple check any data before making it public.  </b>  No company or organization would knowingly make data public that contains personally identifiable information (PII) without permission.  On the other hand, even if data does not contain PII per se, PII can often be inferred from data, as was done when <a href="http://en.wikipedia.org/wiki/AOL_search_data_scandal">AOL released 3 months</a> of sample query logs in 2006.   For less obvious ways to break anonymization of data, see the paper <a href="http://portal.acm.org/citation.cfm?id=1242598">Wherefore art thou r3579x?</a>.  In some cases, it can be quite challenging to anonymize data so that it does not contain PII, especially if the data is being updated.  On the other hand, making data public enables a broad community to contribute to your problem.</p>
<p>Finally, it is interesting to think about the size of the data used for the prize.  The data consisted of over 100 million movie ratings by 480 thousand randomly-chosen, anonymous Netflix customers.  These customers rated over 17 thousand movie titles during the period October, 1998 to December, 2005.  In some sense, this is a lot of data.  Certainly there are a lot of degrees of freedom in the dataset.  On the other hand, it is less than 2 GB of data and easily fits <em>in the memory</em> of a modest size computer.   From this perspective, it is a <em>small</em> amount of data.   From the viewpoint of analytic infrastructure, it is useful to classify data as small (fits into the memory of a single computer), medium (fills the disks of a single storage device or fits into a database), or large (requires specialized infrastructure such as a cloud).</p>
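<p>The small, medium, and large classification can be sketched as a simple lookup against the capacity of the machine at hand.  Everything numeric below is an assumption made for illustration: the 16 bytes per rating, the 8 GB of memory, and the 4 TB of storage are hypothetical, as are the function names.</p>

```python
from bisect import bisect_left


def estimated_size_bytes(num_records, bytes_per_record):
    """Rough size of a dataset stored as fixed-width records."""
    return num_records * bytes_per_record


def size_class(size_bytes, memory_bytes, storage_bytes):
    """Classify a dataset as small, medium, or large.

    small:  fits into the memory of a single computer
    medium: fits on the disks of a single storage device
    large:  needs specialized infrastructure
    """
    labels = ["small", "medium", "large"]
    # bisect_left counts how many capacity limits the dataset exceeds.
    return labels[bisect_left([memory_bytes, storage_bytes], size_bytes)]


# About 100 million ratings at an assumed 16 bytes each.
netflix_like = estimated_size_bytes(100_000_000, 16)
print(netflix_like / 1e9, "GB:", size_class(netflix_like, 8e9, 4e12))
```

<p>Under these assumptions the prize data comes to about 1.6 GB, consistent with the observation that it is less than 2 GB and fits in memory.</p>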
<p><b>For more information:</b></p>
<ul>
<li> R. M. Bell and Y. Koren, Lessons from the Netflix prize challenge. SIGKDD Explorations Newsletter, Volume 9, Number 2 (Dec. 2007), pages 75-79. DOI= <a href="http://portal.acm.org/citation.cfm?id=1345448.1345465">http://doi.acm.org/10.1145/1345448.1345465</a> (subscription required) </li>
<li>Clive Thompson, The Screens Issue.  <a href="http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html">If You Liked This, You’re Sure to Love That</a>, New York Times, November 23, 2008 (registration required). </li>
<li>L. Backstrom, C. Dwork,  and J. Kleinberg,  <a href="http://portal.acm.org/citation.cfm?id=1242598">Wherefore art thou r3579x?</a>: anonymized social networks, hidden patterns, and structural steganography,  Proceedings of the 16th International Conference on World Wide Web (WWW &#8217;07),  ACM, New York, NY, 181-190.  (subscription required)  </li>
</ul>
<p><b>Upcoming Course.  </b>  I&#8217;ll be using this example in an <a href="http://blog.rgrossman.com/courses/">upcoming course</a>  I&#8217;m teaching in San Mateo on July 14, 2009.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/07/05/three-lessons-in-analytic-strategy-from-the-netflix-prize/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Three Most Important Interfaces in Analytics</title>
		<link>http://rgrossman.com/2009/06/17/important-analytic-inferfaces/</link>
		<comments>http://rgrossman.com/2009/06/17/important-analytic-inferfaces/#comments</comments>
		<pubDate>Wed, 17 Jun 2009 21:05:53 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[PMML]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[cloud analytics]]></category>
		<category><![CDATA[Data Mining Group]]></category>
		<category><![CDATA[data preprocessing]]></category>
		<category><![CDATA[model consumers]]></category>
		<category><![CDATA[model producers]]></category>
		<category><![CDATA[multiple analytic models]]></category>
		<category><![CDATA[PMML Version 4.0]]></category>
		<category><![CDATA[scoring]]></category>
		<category><![CDATA[time series models]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=236</guid>
		<description><![CDATA[If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements.  On the other hand, if your data is large, your model is complicated, [...]]]></description>
			<content:encoded><![CDATA[<p>If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements.  On the other hand, if your data is large, your model is complicated, your output is a model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit by using some of the infrastructure components, services, applications and systems that have been developed over the years to support analytics.   I use the term <em>analytic infrastructure</em> to refer to these components, services, applications and systems.</p>
<p><img src="http://rgrossman.files.wordpress.com/2009/06/dmg-logo.jpg?w=300" alt="The Data Mining Group, which develops the Predictive Model Markup Language." title="The Data Mining Group, which develops the Predictive Model Markup Language." width="300" height="119" class="alignleft size-medium wp-image-247" /></p>
<p>For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds.  Note that with this definition analytic infrastructure does not need to be used exclusively for modeling; it simply needs to be useful as part of the modeling process.</p>
<p>There are several fundamental steps when building and deploying analytic models that are directly relevant to analytic infrastructure:</p>
<table border="1">
<tbody>
<tr>
<th>Step</th>
<th>Inputs</th>
<th>Outputs</th>
</tr>
<tr>
<td>Preprocessing</td>
<td>dataset (data fields)</td>
<td>dataset of features</td>
</tr>
<tr>
<td>Modeling</td>
<td>dataset of features</td>
<td>model</td>
</tr>
<tr>
<td>Scoring</td>
<td>dataset (data fields), model</td>
<td>scores</td>
</tr>
<tr>
<td>Postprocessing</td>
<td>scores</td>
<td>actions</td>
</tr>
</tbody>
</table>
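<p>The four steps in the table can be sketched as four small functions whose inputs and outputs match the table&#8217;s columns.  This is only a toy illustration: the field names, the single feature, the &#8220;model&#8221; (an average), and the offer threshold are all hypothetical.</p>

```python
from statistics import mean


def preprocess(record):
    """Preprocessing: data fields in, features out."""
    visits = max(record["visits"], 1)  # guard against division by zero
    return {"spend_per_visit": record["total_spend"] / visits}


def build_model(feature_rows):
    """Modeling: dataset of features in, model out."""
    average = mean(row["spend_per_visit"] for row in feature_rows)
    return {"average_spend_per_visit": average}


def score(record, model):
    """Scoring: data fields and a model in, score out."""
    features = preprocess(record)
    return features["spend_per_visit"] / model["average_spend_per_visit"]


def postprocess(score_value, threshold=1.0):
    """Postprocessing: score in, action out."""
    return "send offer" if score_value >= threshold else "no action"


training = [
    {"total_spend": 120.0, "visits": 4},
    {"total_spend": 30.0, "visits": 3},
]
model = build_model([preprocess(record) for record in training])

new_customer = {"total_spend": 90.0, "visits": 2}
print(postprocess(score(new_customer, model)))
```

<p>Because each function only depends on the output of the previous one, each step could in principle run in a different system, which is what makes the interfaces between them worth standardizing.</p>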
<p>Perhaps the most important interface in analytics is the interface between components in the analytic infrastructure that produce models, such as statistical packages (which have a human in the loop), and components in the analytic infrastructure that score data using models and often reside in operational environments.  The former are examples of what are sometimes called <em>model producers</em>, while the latter are sometimes called <em>model consumers</em>.   The Predictive Model Markup Language or PMML is a widely deployed XML standard for describing statistical and data mining models so that model producers and model consumers can exchange models in an application independent fashion.</p>
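<p>To give a feel for the producer and consumer roles, the sketch below writes a small linear regression as a PMML-style document and then scores a record from that document alone.  It uses only Python&#8217;s standard XML library.  The element and attribute names follow my understanding of the PMML regression model, but the output has not been validated against the PMML schema, and the field names, coefficients, and function names are hypothetical.</p>

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_0"


def produce_model(intercept, coefficients):
    """Model producer: serialize a linear regression as PMML-style XML."""
    root = ET.Element("PMML", {"version": "4.0", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "hypothetical example"})
    dictionary = ET.SubElement(root, "DataDictionary")
    for name in list(coefficients) + ["response"]:
        ET.SubElement(dictionary, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})
    model = ET.SubElement(root, "RegressionModel", {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", {"name": name})
    ET.SubElement(schema, "MiningField", {"name": "response", "usageType": "predicted"})
    table = ET.SubElement(model, "RegressionTable", {"intercept": repr(intercept)})
    for name, value in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": repr(value)})
    return ET.tostring(root, encoding="unicode")


def consume_model(pmml_text, record):
    """Model consumer: score one record using only the exchanged document."""
    root = ET.fromstring(pmml_text)
    ns = {"pmml": PMML_NS}
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", ns)
    total = float(table.get("intercept"))
    for predictor in table.findall("pmml:NumericPredictor", ns):
        total += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return total


document = produce_model(1.0, {"x1": 2.0, "x2": 0.5})
print(consume_model(document, {"x1": 3.0, "x2": 4.0}))
```

<p>The point of the exercise is that the consumer never sees the producer&#8217;s code or internal data structures, only the XML document, so the two can be different applications running on different systems.</p>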
<p>On June 16, the <a href="http://www.dmg.org">Data Mining Group</a> released version 4.0 of the Predictive Model Markup Language or <a href="http://www.dmg.org/v4-0/GeneralStructure.html">PMML</a>.  Version 4.0 is the first release of PMML since Version 3.2 was released in May, 2007.</p>
<p>Version 4.0 of PMML adds the following new features:</p>
<ul>
<li>support for time series models;</li>
<li>support for multiple models, which includes support for both segmented models and ensembles of models;</li>
<li>improved support for preprocessing data, which will help simplify deployment of models;</li>
<li>new models, such as survival models;</li>
<li>support for additional information about models called model explanation, which includes information for visualization, model quality, gains and lift charts, confusion matrix, and related information.</li>
</ul>
<p>Since Version 2.0 of PMML, which was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML models.  Using these transformations, it would be possible to use PMML to define an interface between analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as models).  This is probably the second most important interface in analytics.</p>
<p>With Version 4.0 now released, the PMML working group is working on Version 4.1.  One of the goals is to enable PMML to describe postprocessing of scores.  This would allow PMML to be used as an interface between analytic infrastructure components and services that produce scores (such as modeling engines) and those that consume scores (such as recommendation engines).  This is probably the third most important interface in analytics.</p>
<p>Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems.  For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.</p>
<p>If you are interested in getting involved in the PMML working group, please visit the web site: <a href="http://www.dmg.org">www.dmg.org</a></p>
<p><strong>Disclaimer:</strong> I&#8217;m a member of the PMML working group and worked on PMML Version 4.0.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/06/17/important-analytic-inferfaces/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Five Common Mistakes in Analytic Projects</title>
		<link>http://rgrossman.com/2009/06/01/five-common-mistakes-in-analytic-projects/</link>
		<comments>http://rgrossman.com/2009/06/01/five-common-mistakes-in-analytic-projects/#comments</comments>
		<pubDate>Mon, 01 Jun 2009 16:45:13 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[PMML]]></category>
		<category><![CDATA[analytic strategy]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[analytic projects]]></category>
		<category><![CDATA[analytic strategies]]></category>
		<category><![CDATA[common mistakes]]></category>
		<category><![CDATA[deploying analytic models]]></category>
		<category><![CDATA[difficulties getting data for modeling]]></category>
		<category><![CDATA[refreshing models]]></category>
		<category><![CDATA[scores/actions/measures/strategy]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=168</guid>
		<description><![CDATA[Managing projects is often challenging.  Developing predictive models can be very challenging.  Managing projects that develop analytic models can present some especially difficult challenges.  In this post, I&#8217;ll describe some of the most common mistakes that occur when managing analytic projects.

Mistake 1. Underestimating the time required to get the data.  This [...]]]></description>
			<content:encoded><![CDATA[<p>Managing projects is often challenging.  Developing predictive models can be very challenging.  Managing projects that develop analytic models can present some especially difficult challenges.  In this post, I&#8217;ll describe some of the most common mistakes that occur when managing analytic projects.<br />
<img src="http://rgrossman.files.wordpress.com/2009/06/mistake.jpg?w=300" alt="Managing projects involving analytics can be difficult." title="Managing projects involving analytics can be difficult." width="300" height="177" class="alignleft size-medium wp-image-197" /><br />
<strong>Mistake 1. Underestimating the time required to get the data. </strong> This is probably the most common mistake in modeling projects.  Getting the data required for analytic projects usually requires a special request to the IT department.  Any special requests made to IT departments can take time.  Usually, several meetings are required between the business owners of the analytic problem, the statisticians building the models, and the IT department in order to decide what data is required and whether it is available. Once there is agreement on what data is required, then the special request to the IT department is made and the wait begins.  Project managers are sometimes under the impression that good models can be built without data, just as statisticians are sometimes under the impression that modeling projects can be managed without a project plan.</p>
<p><strong>Mistake 2. There is not a good plan for deploying the model. </strong> There are several phases in a modeling project.  In one phase, data is acquired from the IT department and the model is built.  A statistician is usually in charge of building the model.  In the next phase, the model is deployed.  This is the responsibility of the IT department.  This requires providing the model with the appropriate data, post-processing the scores produced by the model to compute the associated actions, and then integrating these actions into the required business processes.  Deploying models is in many cases just as complicated or more complicated than building the models and requires a plan.  A good standards-compliant architecture can help here.  It is often useful for the statistician to export the model as <a href="http://www.dmg.org">PMML</a>.  The model can then be imported by the application used in the operational system.</p>
<p><strong>Mistake 3.  Working backwards, instead of starting with an analytic strategy.</strong> To say it another way: first, decide on an analytic strategy; then, check that the data that is available supports the analytic strategy; then, make sure that there are modelers (or statisticians) available to develop the models; and, finally, make sure that the modelers have the right (software) tools. The most important factor affecting the success of an analytic project is choosing the right analytic project and approaching it in the right way.  This is a matter of analytic strategy.  Once the right project is chosen, the success of the project is most dependent on the data that is available; next on the talent of the modeler who is developing the models; and then on the software that is used.  In general, companies new to modeling proceed in precisely the opposite direction.  First, they buy software they don&#8217;t need (for many problems open source analytic software works just fine). Then, when the IT staff has trouble using the modeling software, they hire a statistician to build models.  Next, once a statistician is on board, someone looks at the data and realizes (often) that the data will not support the model required.  Finally, much later, the business owners of the problem realize they started with the wrong analytic problem.  This is usually because they didn&#8217;t start with an analytic strategy.</p>
<p><strong>Mistake 4. Trying to build the perfect model.</strong> Another common mistake is trying to build the perfect statistical model.  Usually, the impact of a model will be much higher if a model that is good enough is deployed and then a process is put in place that: i) reviews the effectiveness of the model frequently with the business owner of the problem; ii) refreshes the model on a regular basis with the most recent data; and, iii) rebuilds the model on a periodic basis with the lessons learned from the reviews.</p>
<p><strong>Mistake 5.  The predictions of the model are not actionable.</strong> This was the subject of a recent post about an approach that I call the <a href="http://blog.rgrossman.com/2009/04/28/sams-methodology/">SAMS methodology</a>.  Recall that SAMS is an abbreviation for Scores/Actions/Measures/Strategy.  From this point of view, the model is evaluated not just by its accuracy but also by measures that directly support a specified strategy.  For example, the strategy might be to increase sales by recommending another product after an initial product is selected.  Here the relevant measure might be the incremental revenue generated by the recommendations.  The action would be to present up to three additional products to the shopper.  The score for each product might be a number from 1 to 1000.  The three products with the highest scores are then presented.  This is a simple example.  Unfortunately, in most of the projects that I have been involved with, determining the appropriate actions and measures requires an iterative process to get it right.</p>
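<p>The recommendation example above can be sketched in a few lines of Python.  This is only an illustration: the product names, the scores, and the function names <code>recommend</code> and <code>incremental_revenue</code> are hypothetical and are not taken from any real system.</p>

```python
from typing import Dict, List


def recommend(scores: Dict[str, int], already_selected: str,
              max_offers: int = 3) -> List[str]:
    """Action: return up to max_offers products with the highest scores.

    scores maps a product name to a model score from 1 to 1000.
    The product the shopper already selected is never recommended.
    """
    candidates = {p: s for p, s in scores.items() if p != already_selected}
    ranked = sorted(candidates, key=lambda p: candidates[p], reverse=True)
    return ranked[:max_offers]


def incremental_revenue(revenue_with: float, revenue_without: float) -> float:
    """Measure: revenue attributable to showing the recommendations."""
    return revenue_with - revenue_without


# Hypothetical scores for a shopper who has just selected a tent.
example_scores = {"tent": 640, "stove": 910, "lantern": 875,
                  "boots": 120, "map": 455}
print(recommend(example_scores, already_selected="tent"))
```

<p>In practice the measure would be estimated by comparing shoppers who see the recommendations with a control group that does not, which is one reason the process tends to be iterative.</p>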
<p><strong>Please share by making comments below any lessons you have learned building analytic models. </strong> I would like to expand this list over time to include many of the common mistakes that occur in analytic projects.</p>
<p>The image above is from <a href="http://www.flickr.com/photos/doobybrain/360276843">www.flickr.com/photos/doobybrain/360276843</a> and is available under a Creative Commons license.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/06/01/five-common-mistakes-in-analytic-projects/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing</title>
		<link>http://rgrossman.com/2009/05/25/malstone-benchmark/</link>
		<comments>http://rgrossman.com/2009/05/25/malstone-benchmark/#comments</comments>
		<pubDate>Mon, 25 May 2009 16:41:58 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[benchmarks for cloud analytics]]></category>
		<category><![CDATA[benchmarks for data intensive computing]]></category>
		<category><![CDATA[cloud analytics]]></category>
		<category><![CDATA[cloud computing benchmarks]]></category>
		<category><![CDATA[CloudStone]]></category>
		<category><![CDATA[drive-by exploits]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hadoop wins TeraSort]]></category>
		<category><![CDATA[log files]]></category>
		<category><![CDATA[MalStone]]></category>
		<category><![CDATA[site-entity]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=146</guid>
		<description><![CDATA[The TPC Benchmarks have played an important role in comparing databases and transaction processing systems.   Currently, there are no similar benchmarks for comparing two clouds.

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications.  In this note, we describe the MalStone Benchmark, which [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.tpc.org/">TPC Benchmarks</a> have played an important role in comparing databases and transaction processing systems.   Currently, there are no similar benchmarks for comparing two clouds.</p>
<p><img src="http://rgrossman.files.wordpress.com/2009/06/benchmark_bourbon_whiskey.jpg?w=225" alt="Benchmark" title="Benchmark" width="225" height="300" class="alignleft size-medium wp-image-229" /></p>
<p>The <a href="http://radlab.cs.berkeley.edu/wiki/Projects/Cloudstone">CloudStone Benchmark</a> is a first step towards a benchmark for clouds designed to support Web 2.0 type applications.  In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as <a href="http://hadoop.apache.org/core/">Hadoop</a> and <a href="http://sector.sourceforge.net">Sector</a>, designed to support data intensive computing.</p>
<p>MalStone is a stylized analytic computation of a type that is common in data intensive computing.   The open source code to generate data for MalStone and a technical report describing MalStone and providing some sample implementations can be found at: <a href="http://code.google.com/p/malgen">code.google.com/p/malgen</a> (look in the featured downloads section along the right hand side).</p>
<h3>Detecting Drive-By Exploits from Log Files</h3>
<p>We introduce MalStone with a simple example.  Consider visitors to web sites.  As described in the paper <a href="http://www.usenix.org/events/hotbots07/tech/full_papers/provos/provos.pdf">The Ghost in the Browser</a> by <a href="http://www.provos.org">Provos</a> et al. that was presented at <a href="http://www.usenix.org/events/hotbots07/tech/">HotBots &#8216;07</a>, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the web pages.  Sometimes these are called “drive-by exploits.”</p>
<p>The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:</p>
<pre>   | Timestamp | Web Site ID | User ID</pre>
<p>There is a further assumption that if the computers become infected, at perhaps a later time, then this is known.  That is, for each computer, which we assume is identified by the ID of the corresponding user, it is known whether at some later time that computer has become compromised:</p>
<pre>   | User ID | Compromise Flag</pre>
<p>Here the Compromise field is a flag, with 1 denoting a compromise.  A very simple statistic that provides some insight into whether a web page is a possible source of compromises is to compute for each web site the ratio of visits in which the computer subsequently becomes compromised to those in which the computer remains uncompromised.</p>
<p>We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites.  Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging.  For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today.  On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.</p>
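<p>For data small enough to fit in memory, the ratio described above can be sketched in Python as follows.  The function name, the tuple layout, and the choice to report <code>None</code> for a site with no uncompromised visits are assumptions made for this illustration.</p>

```python
from collections import defaultdict


def compromise_ratio_by_site(visits, compromised_users):
    """For each site, divide visits by users who later became compromised
    by visits from users who remained uncompromised.

    visits: iterable of (timestamp, site_id, user_id) tuples.
    compromised_users: set of user IDs whose compromise flag is 1.
    """
    # site_id -> [compromised visits, uncompromised visits]
    counts = defaultdict(lambda: [0, 0])
    for _timestamp, site_id, user_id in visits:
        index = 0 if user_id in compromised_users else 1
        counts[site_id][index] += 1

    ratios = {}
    for site_id, (bad, good) in counts.items():
        # The ratio is undefined when every visitor was compromised.
        ratios[site_id] = bad / good if good else None
    return ratios


example_visits = [
    ("2009-05-01 10:00", "site-1", "user-1"),
    ("2009-05-01 10:05", "site-1", "user-2"),
    ("2009-05-02 09:30", "site-1", "user-3"),
    ("2009-05-02 11:15", "site-2", "user-2"),
]
print(compromise_ratio_by_site(example_visits, {"user-1"}))
```

<p>The point of the benchmark is that this single pass becomes hard once the log no longer fits on one machine, not that the computation itself is subtle.</p>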
<p>The MalStone benchmarks use records of the following form:</p>
<pre>   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID</pre>
<p>Here site abstracts web site and entity abstracts the possibly infected computer.   We assume that each record is 100 bytes long.</p>
<p>In the MalStone A Benchmark, for each site, the number of records for which an entity visited the site and subsequently became compromised is divided by the total number of records for which an entity visited the site.  The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest).  MalStone A-10 uses 10 billion records so that in total there is 1 TB of data.  Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records.   MalStone B-10, B-100 and B-1000 are defined in the same way.</p>
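<p>A minimal single-machine sketch of the two computations is given below.  It assumes that the compromise flag on a record already marks a visit that was followed by a compromise, that weeks are counted from a caller-supplied start date, and that a week with no records for a site is simply omitted; the function names are hypothetical and this is not the reference implementation from the technical report.</p>

```python
from collections import defaultdict
from datetime import datetime


def malstone_a(records):
    """Per site: flagged records divided by all records for that site."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for _event_id, _timestamp, site_id, flag, _entity_id in records:
        totals[site_id] += 1
        flagged[site_id] += flag
    return {site: flagged[site] / totals[site] for site in totals}


def malstone_b(records, start):
    """Per site and week: the same ratio over a window that runs from
    the start of the period to the end of that week."""
    # site_id -> week number -> [flagged count, total count]
    weekly = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for _event_id, timestamp, site_id, flag, _entity_id in records:
        week = (timestamp - start).days // 7
        cell = weekly[site_id][week]
        cell[0] += flag
        cell[1] += 1

    result = {}
    for site_id, weeks in weekly.items():
        flagged = total = 0
        for week in sorted(weeks):  # accumulate from the first week onward
            flagged += weeks[week][0]
            total += weeks[week][1]
            result[(site_id, week)] = flagged / total
    return result


period_start = datetime(2009, 1, 1)
example_records = [
    (1, datetime(2009, 1, 2), "site-1", 1, "entity-1"),
    (2, datetime(2009, 1, 3), "site-1", 0, "entity-2"),
    (3, datetime(2009, 1, 9), "site-1", 0, "entity-3"),
    (4, datetime(2009, 1, 10), "site-1", 0, "entity-4"),
    (5, datetime(2009, 1, 2), "site-2", 0, "entity-1"),
]
print(malstone_a(example_records))
print(malstone_b(example_records, period_start))
```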
<h3>TeraSort Benchmark</h3>
<p>One of the motivations for choosing 10 billion 100-byte records is that the <a href="http://sortbenchmark.org/">TeraSort Benchmark</a> (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.</p>
<p>In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark.  It was able to sort 1 TB of data <a href="http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html">using 910 nodes in 209 seconds</a>, breaking the previous record of 297 seconds.   Hadoop set a new record in 2009 by sorting 100 TB of data at <a href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html">0.578 TB/minute using 3800 nodes</a>.  For some background about the TeraSort Benchmark, see the blog posting by James Hamilton, <a href="http://perspectives.mvdirona.com/2008/07/08/HadoopWinsTeraSort.aspx">Hadoop Wins Terasort</a>.</p>
<p>Note that the TeraSort Benchmark is now deprecated and has been replaced by the <a href="http://sortbenchmark.org/">Minute Sort Benchmark</a>.  Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.</p>
<h3>Generating Data for MalStone Using MalGen</h3>
<p>We have developed a generator of synthetic data for MalStone called MalGen.  MalGen is open source and available from <a href="http://code.google.com/p/malgen">code.google.com/p/malgen</a>.  Using MalGen, data can be generated with power law distributions, which is useful when modeling  web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).</p>
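<p>To illustrate what a power law distribution over sites looks like, here is a small Python sketch that draws site IDs with probability proportional to 1/rank.  It is not MalGen's algorithm; the function name, the exponent, and the sizes are assumptions chosen only for this example.</p>

```python
import random
from collections import Counter


def power_law_site_ids(num_sites, num_visits, exponent=1.0, seed=0):
    """Draw num_visits site IDs so that the site of rank r is chosen
    with probability proportional to 1 / r**exponent."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_sites + 1)]
    return rng.choices(range(num_sites), weights=weights, k=num_visits)


visits = power_law_site_ids(num_sites=1000, num_visits=100_000)
visit_counts = Counter(visits)
# A few sites collect most of the visits; most sites collect few.
print(visit_counts.most_common(5))
```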
<h3>Using MalStone to Study Design Tradeoffs</h3>
<p>Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records.   The experiments were done on 20 nodes of the <a href="http://www.opencloudconsortium.org/testbed.html">Open Cloud Testbed</a>.  Each node was a Dell 1435 computer with 12 GB memory, 1TB disk, 2.0GHz dual dual-core AMD Opteron 2212, and 1 Gb/s network interface cards.</p>
<p>We compared three different implementations: 1) Hadoop HDFS with Hadoop&#8217;s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using <a href="http://arxiv.org/abs/0809.1181">Sphere User Defined Functions (UDFs)</a>.</p>
<table border="1">
<tr>
<th colspan="2"> MalStone A</th>
</tr>
<tr>
<td>Hadoop MapReduce </td>
<td>454m 13s</td>
</tr>
<tr>
<td>Hadoop Streams/Python</td>
<td>87m 29s </td>
</tr>
<tr>
<td>Sector/Sphere UDFs </td>
<td>33m 40s</td>
</tr>
<tr>
<th colspan="2"> MalStone B</th>
</tr>
<tr>
<td>Hadoop MapReduce </td>
<td>840m 50s</td>
</tr>
<tr>
<td>Hadoop Streams/Python</td>
<td>142m 32s </td>
</tr>
<tr>
<td>Sector/Sphere UDFs </td>
<td>43m 44s</td>
</tr>
</table>
<p><b>Please note that these timings are still preliminary and may be revised in the future as we better optimize the implementations. </b></p>
<p>If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice.  What these preliminary benchmarks indicate though is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams.  In addition, you may also want to consider using <a href="http://sector.sourceforge.net">Sector</a>.</p>
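<p>The timings in the table above are easier to compare as ratios.  The short Python sketch below converts each timing to seconds and divides by the fastest time for the same benchmark; the helper names are hypothetical and the numbers are copied from the table.</p>

```python
def to_seconds(minutes, seconds):
    """Convert a timing given as minutes and seconds to seconds."""
    return 60 * minutes + seconds


timings = {
    "MalStone A": {
        "Hadoop MapReduce": to_seconds(454, 13),
        "Hadoop Streams/Python": to_seconds(87, 29),
        "Sector/Sphere UDFs": to_seconds(33, 40),
    },
    "MalStone B": {
        "Hadoop MapReduce": to_seconds(840, 50),
        "Hadoop Streams/Python": to_seconds(142, 32),
        "Sector/Sphere UDFs": to_seconds(43, 44),
    },
}


def relative_to_fastest(times):
    """Express each timing as a multiple of the fastest timing."""
    fastest = min(times.values())
    return {name: t / fastest for name, t in times.items()}


for benchmark, times in timings.items():
    for name, ratio in relative_to_fastest(times).items():
        print(f"{benchmark}: {name} took {ratio:.1f}x the fastest time")
```

<p>On these preliminary numbers, Hadoop MapReduce took roughly 13 times as long as the fastest implementation on MalStone A and roughly 19 times as long on MalStone B.</p>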
<p>The image above is from <a href="http://www.flickr.com/photos/legeres/270126135/">Strolling everyday</a> and available via a Creative Commons license.</p>
<p><b>Disclaimer:</b>  I am involved in the development of Sector.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/05/25/malstone-benchmark/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Open Source Analytics Reaches Main Street (and Some Other Trends in Analytics)</title>
		<link>http://rgrossman.com/2009/05/11/open-source-analytics-reaches-main-street/</link>
		<comments>http://rgrossman.com/2009/05/11/open-source-analytics-reaches-main-street/#comments</comments>
		<pubDate>Mon, 11 May 2009 17:18:40 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[PMML]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[data intensive computing]]></category>
		<category><![CDATA[standards]]></category>
		<category><![CDATA[analytic standards]]></category>
		<category><![CDATA[cloud-based data services]]></category>
		<category><![CDATA[commoditization of data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[public datasets]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=78</guid>
		<description><![CDATA[This is the first of three posts about systems, applications, services and architectures for building and deploying analytics.   Sometimes this is called analytic infrastructure.  This post is primarily directed at the analytic infrastructure needs of companies.  Later posts will look at analytic infrastructure for the research community.
In this first post of [...]]]></description>
			<content:encoded><![CDATA[<p>This is the first of three posts about systems, applications, services and architectures for building and deploying analytics.   Sometimes this is called <em>analytic infrastructure</em>.  This post is primarily directed at the analytic infrastructure needs of companies.  Later posts will look at analytic infrastructure for the research community.</p>
<p>In this first post of the series, we discuss five important trends impacting analytic infrastructure.</p>
<p><strong>Trend 1.  Open source analytics has reached Main Street. </strong> <a href="http://www.r-project.org">R</a>, which was first released in 1996, is now the most widely deployed open source system for statistical computing.  A recent <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html">article</a> in the New York Times estimated that over 250,000 individuals use R regularly.  Dice News has created a video called &#8220;<a href="http://www.youtube.com/watch?v=ZwYQPtU2Pa0&amp;feature=channel_page">What&#8217;s Up with R</a>&#8221; to inform job hunters using their services about R.  In the language of Geoffrey A. Moore&#8217;s book <em>Crossing the Chasm</em>, R has reached &#8220;Main Street.&#8221;</p>
<p>Some companies still either ban the use of open source software or require an elaborate approval process before open source software can be used.  Today, if a company does not allow the use of R, it puts the company at a competitive disadvantage.</p>
<p><strong>Trend 2.   The maturing of open, standards based architectures for analytics. </strong> Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician.  It is usually a challenge to deploy the model produced by the application into operational systems.  Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.</p>
<p>The <a href="http://www.dmg.org">Predictive Model Markup Language</a> (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models.  PMML has become the dominant standard for statistical and data mining models.   Many applications now support PMML.</p>
<p>By using these applications,  it is possible to build an open, modular standards based environment for analytics.  With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.</p>
<p>Disclaimer: I&#8217;m one of the many people who have been involved in the development of the PMML standard.</p>
<p><strong>Trend 3.  The emergence of systems that simplify the analysis of large datasets. </strong> Analyzing large datasets is still very challenging, but with the introduction of <a href="http://hadoop.apache.org/core/">Hadoop</a>, there is now an open source system supporting <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a> that scales to thousands of processors.</p>
<p>The significance of Hadoop and MapReduce is not only the <em>scalability</em>, but also the <em>simplicity</em>.  Most programmers, with no prior experience, can have their first Hadoop job running on a large cluster within a day.  Most programmers find that it is much easier and much quicker to use MapReduce and some of its generalizations than it is to develop and implement an MPI job on a cluster, which is currently the most common programming model for clusters.</p>
<p><strong>Trend 4.   Cloud-based data services. </strong> Over the next several years, cloud-based services will begin to impact analytics significantly.   A later post in this series will show how simple it is to use R in a cloud, for example.  Although there are security, compliance and policy issues to work out before it becomes common to use clouds for analytics, I expect that these and related issues will all be worked out over the next several years.</p>
<p>Cloud-based services provide several advantages for analytics.  Perhaps the most important is elastic capacity &#8212; if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more.  This ability of clouds to handle surge capacity is important for many groups that do analytics.  With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense.  (Third party clouds provide computing capacity that is an operating and not a capital expense.)</p>
<p><strong>Trend 5.  The commoditization of data. </strong> Moore&#8217;s law applies not only to CPUs, but also to the chips that are used in all of the digital devices that produce data.  The result has been that the cost to produce data has been falling for some time.  Similarly, the cost to store data has also been falling for some time.</p>
<p>Indeed, more and more datasets are being offered for free.  For example, end of day stock <a href="http://finance.yahoo.com/q">quotes</a> from Yahoo, gene sequence data from <a href="http://www.ncbi.nlm.nih.gov/">NCBI</a>, and <a href="http://aws.amazon.com/publicdatasets/">public data sets</a> hosted by Amazon, including the U.S. Census Bureau, are all available now for free.</p>
<p>The significance to analytics is that the cost to enrich data with third party data, which often produces better models, is falling.  Over time, more and more of this data will be available in clouds, so that the effort to integrate this data into modeling will also decrease.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/05/11/open-source-analytics-reaches-main-street/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>In Analytics, It&#8217;s the Actions that Matter</title>
		<link>http://rgrossman.com/2009/04/28/sams-methodology/</link>
		<comments>http://rgrossman.com/2009/04/28/sams-methodology/#comments</comments>
		<pubDate>Tue, 28 Apr 2009 22:36:53 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytic strategy]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[analytic strategies]]></category>
		<category><![CDATA[SAMS]]></category>
		<category><![CDATA[sams methodology]]></category>
		<category><![CDATA[scores/actions/measures]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=53</guid>
		<description><![CDATA[In this note, let’s define analytics as the analysis of data in order to take actions.  (This is a narrow definition of analytics, but one that is useful here.)  If you don&#8217;t have day to day work experience with analytics, it is easy to have the mistaken impression that analytics is only about [...]]]></description>
			<content:encoded><![CDATA[<p>In this note, let’s define <em>analytics</em> as the analysis of data in order to take actions.  (This is a narrow definition of analytics, but one that is useful here.)  If you don&#8217;t have day to day work experience with analytics, it is easy to have the mistaken impression that analytics is only about data and statistical models.</p>
<p>Although understanding data and developing statistical models is certainly an important component of an analytic project, this is just one aspect of analytics.   This aspect includes cleaning data, enriching data, exploring data, developing features, building models, validating models, and iterating the process.   From a broad perspective, this is a process in which the input is data and the output is a statistical model.  When most people think of modeling, this is what they think of.   For many analytic projects, this is just a small part of what is required for a successful engagement.</p>
<p>The second aspect of analytics is what I am concerned with in this note.  This is the aspect of analytics concerned with:</p>
<ul>
<li>developing an appropriate score for a statistical model;</li>
<li>using the score to define useful actions;</li>
<li>determining which measures are best for evaluating the effectiveness of these actions;</li>
<li>tracking these measures (often with a dashboard) and making sure that they advance the strategic objectives of the company or organization.</li>
</ul>
<p>One way to remember this is using the mnemonic SAMS for <strong>S</strong>cores, <strong>A</strong>ctions, <strong>M</strong>easures and <strong>S</strong>trategies.</p>
<p>For example, with a response model, often a threshold is used.  If the score from the response model is above the threshold, an offer is made (this is the action); if not, no offer is made.</p>
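<p>As a minimal sketch of this threshold rule in Python, assuming scores between 0 and 1 and a hypothetical threshold of 0.5 (neither value comes from a real model):</p>

```python
def choose_action(score, threshold):
    """Map a response-model score to an action using a fixed threshold."""
    return "make offer" if score > threshold else "no offer"


def offer_rate(scores, threshold):
    """Fraction of scored visitors who would receive the offer."""
    if not scores:
        return 0.0
    actions = [choose_action(score, threshold) for score in scores]
    return actions.count("make offer") / len(actions)


# Hypothetical scores for four visitors.
example_scores = [0.12, 0.55, 0.81, 0.40]
example_threshold = 0.5
for score in example_scores:
    print(score, choose_action(score, example_threshold))
```

<p>Raising or lowering the threshold changes how many offers are made, so the threshold itself is usually tuned against the measure rather than fixed in advance.</p>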
<p>Here are some examples of SAMS:</p>
<table border="1">
<tbody>
<tr>
<th>Model</th>
<th>Score</th>
<th>Action</th>
<th>Measure</th>
<th>Strategy</th>
</tr>
<tr>
<td>on-line response model</td>
<td>likelihood to respond to an offer</td>
<td>display the offer to the visitor that has the highest likelihood of response and available inventory</td>
<td>revenue per day generated by the web site</td>
<td>increase revenue from a website by improving targeting of offers</td>
</tr>
<tr>
<td>fraud model</td>
<td>likelihood that a transaction is fraudulent</td>
<td>approve, decline, or obtain more information</td>
<td>detection and false positive rates</td>
<td>reduce costs and improve customer experience by lowering fraud rates</td>
</tr>
<tr>
<td>data quality model</td>
<td>likelihood that a data source has data quality problems</td>
<td>if the score is above a threshold, manually investigate the data to check whether there is in fact a data quality problem</td>
<td>detection and false positive rates</td>
<td>improve operational efficiencies by detecting data quality problems more quickly</td>
</tr>
</tbody>
</table>
<p>A successful analytics project requires a careful study of what actions are possible; of the possible actions, which can be deployed into operational systems; and how the systems can be instrumented so that the data required to compute the required measures is available.</p>
<p>The organizational challenge when developing and deploying analytics is that four groups must work together to complete a successful analytic project:</p>
<ul>
<li>The IT group must provide the required data to build the model.</li>
<li>The analytics group must build the appropriate models and develop the appropriate scores.</li>
<li>The operations group must decide which actions are possible and how these actions can be integrated with current systems and business processes.</li>
<li>An executive sponsor must make sure that the measures have strategic relevance and that the three groups above collaborate effectively.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/04/28/sams-methodology/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Sector &#8211; When You Really Need to Process 10 Billion Records</title>
		<link>http://rgrossman.com/2009/04/19/sector-when-you-really-need-to-process-10-billion-records/</link>
		<comments>http://rgrossman.com/2009/04/19/sector-when-you-really-need-to-process-10-billion-records/#comments</comments>
		<pubDate>Sun, 19 Apr 2009 15:20:29 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[standards]]></category>
		<category><![CDATA[cloud analytics]]></category>
		<category><![CDATA[CloudSlam '09]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MalStone]]></category>
		<category><![CDATA[Sector]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=37</guid>
		<description><![CDATA[As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center.  The stack consists of a storage service (the Google File System (GFS)), a compute service based upon MapReduce, and a table service (BigTable).
Although the Google stack [...]]]></description>
			<content:encoded><![CDATA[<p>As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center.  The stack consists of a storage service (the <a href="http://labs.google.com/papers/gfs.html">Google File System (GFS)</a>), a compute service based upon <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a>, and a table service (<a href="http://labs.google.com/papers/bigtable.html">BigTable</a>).</p>
<p>Although the Google stack of services is not directly available, the open source <a href="http://hadoop.apache.org/core/">Hadoop</a> system, which has a broadly similar architecture, is available.</p>
<p>The Google stack, consisting of GFS/MapReduce/Bigtable, and the Hadoop system, consisting of the Hadoop Distributed File System (HDFS) and Hadoop&#8217;s implementation of MapReduce, are examples of clouds designed for data intensive computing &#8212; these types of clouds provide computing capacity on demand, with capacity scaling all the way up to the size of a data center.</p>
<p>There are still many open questions about how best to design clouds for data intensive computing.  During the past several years, I have been involved with a cloud designed for data intensive computing called <a href="http://sector.sourceforge.net">Sector</a>.   The lead developer of Sector is <a href="http://www.lac.uic.edu/~yunhong">Yunhong Gu</a> of the University of Illinois at Chicago.  Sector was developed independently of Hadoop and the Google cloud services and makes several different design choices (see the table below).</p>
<p>To quantify the impact of some of these choices, I have been involved with the development of a benchmark for data intensive computing called MalStone.  I will talk more about MalStone in a future post, but briefly, MalStone is a stylized analytic computation that can be done simply using MapReduce, as well as variants and generalizations of MapReduce.  The open source MalStone code comes with a generator of synthetic records and one benchmark (called MalStone B) generates 10 billion 100-byte records (similar to TeraSort).</p>
<p><strong>MalStone B Benchmarks </strong></p>
<table border="1">
<tbody>
<tr>
<th>System</th>
<th>Time (min)</th>
</tr>
<tr>
<td>Hadoop MapReduce</td>
<td>799</td>
</tr>
<tr>
<td>Hadoop Streaming with Python</td>
<td>143</td>
</tr>
<tr>
<td>Sector</td>
<td>44</td>
</tr>
</tbody>
</table>
<p>Tests were done using 20 nodes on the <a href="http://www.opencloudconsortium.org">Open Cloud Testbed</a>.  Each node contained 500 million 100-byte records.</p>
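<p>As a quick sanity check on these numbers, the short Python sketch below recomputes the dataset size from the per-node figures and the relative running times from the benchmark table.  It uses only the values quoted in this post; the variable and function names are my own, and the terabyte figure assumes decimal units (10<sup>12</sup> bytes).</p>

```python
# Figures quoted in the post.
RECORD_BYTES = 100
RECORDS_PER_NODE = 500_000_000
NODES = 20

# Running times in minutes, from the MalStone B table.
TIMES_MIN = {
    "Hadoop MapReduce": 799,
    "Hadoop Streaming with Python": 143,
    "Sector": 44,
}

total_records = RECORDS_PER_NODE * NODES      # 10 billion records
total_bytes = total_records * RECORD_BYTES    # 10**12 bytes, i.e. 1 TB (decimal)


def speedup_vs(baseline, system):
    """Ratio of the baseline's running time to the given system's running time."""
    return TIMES_MIN[baseline] / TIMES_MIN[system]


if __name__ == "__main__":
    print(f"records: {total_records:,}")
    print(f"size: {total_bytes / 1e12:.1f} TB")
    for name in ("Hadoop MapReduce", "Hadoop Streaming with Python"):
        print(f"{name} / Sector: {speedup_vs(name, 'Sector'):.2f}x")
```

<p>The 20 nodes together hold the 10 billion records (about 1 TB), and on this test Sector finished roughly 18 times faster than Hadoop MapReduce and a little over 3 times faster than Hadoop Streaming with Python.</p>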
<p><strong>Comparing Sector and Hadoop</strong></p>
<table border="1">
<tbody>
<tr>
<th></th>
<th>Hadoop</th>
<th>Sector</th>
</tr>
<tr>
<td>Storage cloud</td>
<td>block-based file system</td>
<td>file-based</td>
</tr>
<tr>
<td>Programming model</td>
<td>MapReduce</td>
<td>user defined functions and MapReduce</td>
</tr>
<tr>
<td>Protocol</td>
<td>TCP</td>
<td>UDP</td>
</tr>
<tr>
<td>Security</td>
<td>NA</td>
<td>HIPAA capable</td>
</tr>
<tr>
<td>Replication</td>
<td>at time of writing</td>
<td>periodically</td>
</tr>
<tr>
<td>Language</td>
<td>Java</td>
<td>C++</td>
</tr>
</tbody>
</table>
<p>I&#8217;ll be giving a talk on Sector at <a href="http://www.cloudslam09.com">CloudSlam &#8217;09</a> on Monday, April 20, 2009 at 4pm ET.  CloudSlam is a virtual conference, so it is easy to listen to any of the talks that interest you.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/04/19/sector-when-you-really-need-to-process-10-billion-records/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Learning About Cloud Analytics</title>
		<link>http://rgrossman.com/2009/04/06/learning-about-cloud-analytics/</link>
		<comments>http://rgrossman.com/2009/04/06/learning-about-cloud-analytics/#comments</comments>
		<pubDate>Mon, 06 Apr 2009 12:42:04 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[PMML]]></category>
		<category><![CDATA[analytic infrastructure]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[courses]]></category>
		<category><![CDATA[cloud analytics]]></category>
		<category><![CDATA[cloud analytics courses]]></category>
		<category><![CDATA[cloud computing courses]]></category>
		<category><![CDATA[Hadoop courses]]></category>
		<category><![CDATA[introduction to cloud computing]]></category>
		<category><![CDATA[modeling environments]]></category>
		<category><![CDATA[public datasets]]></category>
		<category><![CDATA[R courses]]></category>
		<category><![CDATA[scoring engines and clouds]]></category>

		<guid isPermaLink="false">http://blog.rgrossman.com/?p=23</guid>
		<description><![CDATA[Clouds are changing the way that analytic models get built and the way they get deployed.
Neither analytics nor clouds have standard definitions yet.
A definition I like is to define analytics as the analysis of data to support decisions.   For example, analytics is used in marketing to develop statistical models for acquiring customers and [...]]]></description>
			<content:encoded><![CDATA[<p>Clouds are changing the way that analytic models get built and the way they get deployed.</p>
<p>Neither analytics nor clouds have standard definitions yet.</p>
<p>One definition I like is that analytics is the <em>analysis of data to support decisions</em>.  For example, analytics is used in marketing to develop statistical models for acquiring customers and predicting the future profitability of customers.  Analytics is used in risk management to identify fraud, to discover compromises in operations, and to reduce risk.  Analytics is used in operations to improve business and operational processes.</p>
<p>Cloud computing also doesn&#8217;t yet have a standard definition.  A good working definition is that clouds are racks of commodity computers that provide on-demand resources and services over a network, usually the Internet, with the scale and the reliability of a data center.</p>
<p>There are two different, but related, types of clouds: the first type provides computing <em>instances</em> on demand, while the second type provides computing <em>capacity</em> on demand.  Both use the same underlying hardware, but the first is designed to scale out by providing additional computing instances, while the second is designed to support data- or compute-intensive applications by scaling capacity.  Amazon&#8217;s <a href="http://aws.amazon.com/">EC2 and S3</a> services are examples of the first type of cloud.  The <a href="http://hadoop.apache.org/core/">Hadoop</a> system is an example of the second type of cloud.</p>
<p>Currently, as a platform for analytics, clouds offer several advantages:</p>
<ol>
<li><strong>Building analytic models on very large datasets. </strong> &#8220;Hadoop style clouds&#8221; provide a very effective platform for developing analytic models on very large datasets.</li>
<li><strong>Scoring data using analytic models.</strong> Given an analytic model and some data (either a file of data or a stream of data), &#8220;Amazon style clouds&#8221; provide a simple and effective platform for scoring data.  The Predictive Model Markup Language (<a href="http://www.dmg.org">PMML</a>) has proved to be a very effective mechanism for moving a statistical or analytic model built using one analytic system into a cloud for scoring.  Sometimes the term PMML Producer is used for the application that builds the model and the term PMML Consumer for the application that scores new data using the model.  Using this terminology, &#8220;Amazon style clouds&#8221; can be used to score data easily using PMML models built elsewhere.</li>
<li><strong>Simplifying modeling environments.</strong> Computing instances in a cloud can be built to incorporate all the analytic software required for building models, including preconfigured connections to all the data required for modeling.  At least for small to medium size datasets, preconfiguring computing instances in this way can simplify the development of analytic models.</li>
<li><strong>Easy access to data.</strong>  Clouds can also make it much easier to access data for modeling.  Amazon has recently made available a variety of <a href="http://aws.amazon.com/publicdatasets/">public datasets</a>.  For example, using <a href="http://aws.amazon.com/ebs/">Amazon&#8217;s EBS service</a>, the U.S. Census data can be accessed immediately.</li>
</ol>
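<p>To illustrate the PMML Producer and PMML Consumer split mentioned above, here is a minimal sketch of a consumer in Python.  The embedded document is a hypothetical, heavily simplified regression model: a real PMML file also carries a namespace, a header, a data dictionary and a mining schema, all omitted here.  The field names and coefficients are made up for the example, and this is not how any particular scoring engine is implemented.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML-style document; a real PMML file also has a
# namespace, Header, DataDictionary and MiningSchema, omitted for brevity.
PMML_DOC = """
<PMML version="3.2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.2"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_model(pmml_text):
    """Read the intercept and per-field coefficients from the regression table."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Score one record: intercept plus the sum of coefficient * field value."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


if __name__ == "__main__":
    model = load_model(PMML_DOC)
    print(score(model, {"age": 40, "income": 50000}))
```

<p>The point of the split is that the model file is the only thing that has to move: the application that builds the model never needs to run on the instances that do the scoring.</p>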
<p>I&#8217;ll be one of the lecturers in two upcoming courses that introduce clouds and cloud analytics.</p>
<p>The first course will be taught in Chicago on June 22, 2009 and the second one in San Mateo on July 14, 2009.   You can register for the Chicago course using this <a href="https://www.regonline.com/714251">registration link</a> and the San Mateo course using this <a href="https://www.regonline.com/712057">registration link</a>.</p>
<p>This one-day course will give a quick introduction to cloud computing and analytics.  It describes several different types of clouds and what is new about cloud computing, and discusses some of the advantages and disadvantages that clouds offer when building and deploying analytic models.  It includes three case studies, a survey of vendors, and information about setting up your first cloud.</p>
<p>The course syllabus can be found here: <a href="http://www.opendatagroup.com/courses.htm">www.opendatagroup.com/courses.htm</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://rgrossman.com/2009/04/06/learning-about-cloud-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
