analytics
Ten Years of the SC XY Bandwidth Challenge
Posted by admin in Blog, analytic infrastructure, analytics on November 30, 2009
The SC 09 Conference took place early this month in Portland. The Bandwidth Challenge (BWC) is an interesting and friendly rivalry between research groups to develop high performance network protocols and interesting applications that use them. The Bandwidth Challenge was started ten years ago at SC 99, which also took place in Portland.
Some of the history is available at the web site scinet.supercomputing.org. For example, in 2000, there were 2 OC-48 (2.5 Gbps) circuits that connected the research exhibits at the conference to external research networks and the challenge was to develop network protocols and applications that could fill these circuits. The winner of the BWC (called the Network Bandwidth Challenge in 2000) was a scientific visualization application called Visapult that reached 1.48 Gbps and transferred 262 GB in 1 hour (providing 582 Mbps of sustained bandwidth utilization).
This year, there were approximately twenty-four 10 GE circuits and one 40 GE circuit that connected research exhibits to external exhibits, and one of the applications reached a bandwidth utilization of over 114 Gbps.
I have had an interest in the BWC over the years, because you cannot analyze data without accessing it, and accessing and transporting large remote datasets has always been a challenge. To put it slightly differently, for large datasets and high performance networks, network transport protocols are an important element of the analytic infrastructure.
It’s useful to know the bandwidth delay product of a network, which is the network capacity (in Mbps, say) multiplied by the round trip time (RTT) of a packet (in seconds). This measures the amount of data on the network that has been transmitted but not yet acknowledged. For wide area high performance networks, this can be many megabytes of data. This data must be buffered by the sender so that it can be resent if a packet is not received.
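To make the definition concrete, here is a minimal sketch in Python. The function name is hypothetical, and the inputs (a 10 Gbps link with a 200 ms RTT) are illustrative values chosen to resemble a wide area research link, not measurements.

```python
def bandwidth_delay_product_bytes(capacity_bps: float, rtt_seconds: float) -> float:
    """Return the bandwidth delay product in bytes.

    capacity_bps: link capacity in bits per second
    rtt_seconds:  round trip time in seconds
    """
    bits_in_flight = capacity_bps * rtt_seconds
    return bits_in_flight / 8  # 8 bits per byte


# Illustrative inputs (assumptions): 10 Gbps capacity, 200 ms RTT.
bdp = bandwidth_delay_product_bytes(10e9, 0.200)
print(f"About {bdp / 1e6:.0f} MB must be buffered by the sender")
```

Under these assumptions the sender needs on the order of 250 MB of buffer to keep the link full, which is far beyond the default window of a standard TCP stack.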
Challenges that have been worked out over the past decade include:
- Improving TCP so that it is effective over networks with high bandwidth delay products. One of the successes is the development of FAST TCP, a variant of the TCP protocol.
- Developing reliable and friendly UDP-based protocols that are effective over networks with high bandwidth delay products. For example, the open source UDT protocol has proved over time to be quite effective. (Disclosure: I have been involved in the development of the UDT protocol.)
- Developing architectures that are effective for high end-to-end performance for transporting large datasets, from disks at one end to disks at the other end.
For the past several years, it has been relatively routine for applications using FAST TCP or UDT to fill a wide area 10 Gbps network link or multiple 10 Gbps network links, if these are available.
Today’s problems include:
- Connecting data intensive devices and applications to high performance networks. For example, with high throughput sequencing, biology is becoming data intensive, yet very few high throughput sequencing devices are connected to high performance research networks.
- Incorporating the appropriate network protocols into data intensive applications. For example, one of the reasons the Sector/Sphere cloud is effective over wide area networks is that it is based upon UDT and not TCP. (Disclosure: I have been involved in the development of the Sector/Sphere cloud.)
I ran into the first problem just after I got back from SC 09. At SC 09, we ran a number of wide area data intensive applications, and in fact won the 2009 BWC for these applications. For example, a new variant of UDT called UDX reached 9.2 Gbps over a network link with 200 ms RTT. In contrast, as soon as I got back to Chicago, I worked for a couple of days trying to get access to 200 GB of sequence data, since the sequencing instrument that produced it was not connected to a high performance network. With the device connected to a high performance research network, the data would have been available in a few minutes.
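As a rough check on the "few minutes" figure, the sketch below computes idealized transfer times for 200 GB at several link rates. The function name is hypothetical, the rates other than 10 Gbps are assumptions added for comparison, and the calculation ignores protocol overhead, disk speed, and congestion.

```python
def transfer_time_seconds(size_bytes: float, rate_bps: float) -> float:
    """Idealized time to move size_bytes over a link sustaining rate_bps."""
    return size_bytes * 8 / rate_bps


size = 200e9  # 200 GB of sequence data (decimal gigabytes)

# The 1 Gbps and 100 Mbps rates are assumed values for comparison only.
for label, rate in [("10 Gbps", 10e9), ("1 Gbps", 1e9), ("100 Mbps", 100e6)]:
    minutes = transfer_time_seconds(size, rate) / 60
    print(f"{label:>8}: {minutes:7.1f} minutes")
```

At a sustained 10 Gbps the idealized transfer takes under three minutes, while at 100 Mbps it takes several hours.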
To summarize, today network experts are comfortable designing systems that can easily fill wide area 10 GE networks, but most analytic applications are not designed to use the required protocols or to take advantage of high performance networks, and most do not have access to the required networks, even if the applications could benefit from them.
In disciplines, like biology, that are becoming data intensive, this type of analytic infrastructure will provide distinct competitive advantages.
He Said, She Said – Why Custom Models Take Time
Posted by admin in Blog, analytic strategy, analytics on August 6, 2009
In this post, I discuss some of the different options available when building analytic models. For the purposes here, a good short definition of analytics is to view analytics as using data to make predictions. The term predictive analytics is beginning to be applied (appropriately enough) to this type of analytics. A longer definition is to view predictive analytics as building statistically valid models from data that can be used to make predictions about future events, to take actions, and to make decisions.

In this post, the point of view is that a business owner of a problem in a company requires a model and is considering whether to build the model in-house, outsource the model to a vendor providing analytic services, or simply to give up on building a model and produce a report instead. I don’t recommend the last option, but unfortunately, in practice, it is all too common.
Broadly speaking, from a business owner’s point of view, there are several phases required to build a model for a new project. The process looks a bit different from the modeler’s point of view. It is also a bit simpler if the same model has been built before and all that is required is to update the model using new data. Here are the basic steps required to build a model from the business owner’s point of view.
1. Working with IT to obtain all the data required for the project and making it available to the modeler.
2. Answering questions from the modeler about the data.
3. Agreeing upon the output of the model.
4. Reviewing the first model with the modeler.
5. Reviewing the second and subsequent models with the modeler.
6. Working with IT to deploy the model.
Although all steps except for Step 1 are collaborative between the business owner and the modeler, Step 6 is primarily the business owner’s responsibility, while Steps 2, 4, and 5 are primarily the modeler’s responsibility. At the beginning of many projects, Step 3 looks obvious. It turns out that it is often not so obvious until the project is towards the end, the data has been cleaned, and the deployment is well underway. The reason is that one often doesn’t have a good understanding of the most appropriate output of a model until the data has been cleaned and there is a good understanding of how the model will be deployed in operational systems.
Let’s look at this same process now from the viewpoint of the modeler. To simplify, the following steps are required:
1. Waiting for the data.
2. Cleaning the data.
3. Asking the business owner questions about the data.
4. Agreeing upon the output of the model.
5. Developing a set of features for the model.
6. Estimating the parameters of the model.
7. Building a measure to evaluate the model.
8. Evaluating the model using the measure.
9. Developing post-processing rules for the scores produced by the model.
10. Repeating the steps above for the second and subsequent versions of the model until everyone is happy, or there is no more time or funding left.
11. Deploying the model.
Building a new model requires completing all the steps above. Generally, a series of models (version 1 of the model, version 2 of the model, etc.) is produced and reviewed by the business owner and the modeler (Step 10). The more time available for Step 10, the better the quality of the model.
To understand these steps a bit better, it might be helpful to review the post about the SAMS methodology. The SAMS methodology explains how to think of models in terms of the Scores they produce, the Actions these enable, the Measures used to evaluate the actions, and whether these actions support a targeted Strategy or not.
Sometimes a model has been built before and only some of these steps need to be repeated. For example, refreshing a model only requires completing Steps 6 and 8 for a series of models. Rebuilding a model usually only requires repeating Steps 5, 6, 8 and 9 for a series of models.
Sometimes, the data is supplied in a standard format (for example, it is provided by a third party) and the deployment uses a standard format (for example, all that is required is a list of names and corresponding offers). In this case, after a model has been built once, all that is required when a business owner supplies new data is to perform Steps 6 and 8. Call this a standard model. Standard models are substantially less work to build than models that require completing all the steps above. These more labor intensive models are often called custom models.
Most requests for models fit into some standard categories of models: for example, models that predict whether a prospect will respond to an offer (response models), whether a customer will remain a customer (attrition models), whether a customer will keep current with their payments (credit models), or whether a transaction is valid or fraudulent (fraud models).
Sometimes, models that don’t fit into these familiar categories of models are built. Call these new types of models. A new type of model also requires that the modeler develop new types of features, new types of measures for evaluating the models, etc. New types of custom models are the most labor intensive to build.
In practice, it usually takes four to six months or longer to build a custom model, once the data has arrived. As the size and complexity of the data grows, each of the steps usually requires more time.
Three Lessons in Analytic Strategy from the Netflix Prize
Posted by admin in Blog, analytic strategy, analytics on July 5, 2009
The Netflix Prize requires developing a new rating algorithm that improves by over 10% on the current system, called Cinematch, that is used by Netflix to suggest movies to its customers. According to the contest’s Leaderboard, it looks like the $1,000,000 Grand Prize will be awarded shortly.
The Netflix Prize provides some interesting lessons in analytic strategy. In addition to the Grand Prize, each year until the Grand Prize is awarded, a $50,000 Progress Prize is awarded. The Progress Prize was awarded in 2007 and 2008. The Netflix Prize has become quite well known due to the prize money being offered. It deserves to be just as well known for the analytic strategy Netflix chose.
It is relatively common for a company or an organization to spend a million dollars on an analytic project. It is less common for something useful to result from it. I don’t have any inside knowledge about the Netflix Prize, but I think that there are several valuable lessons about analytic strategy that the Netflix Prize illustrates. Here are three lessons.
Lesson 1. Agree upon a metric to measure the effectiveness of an analytic model and use it consistently. It is usually not possible to find a single metric that captures all the relevant information required when comparing two analytic systems. It is certainly the case that any actual ratings system requires several metrics. For example, one metric might measure how many stars a viewer would assign to a movie and another metric might measure how often a viewer selects the movies recommended. On the other hand, by singling out one metric, it becomes straightforward to compare two recommendation algorithms. Once this is possible, it becomes simple to use the metric to create a dashboard (the Netflix Leaderboard) and then to use the dashboard to track progress. Netflix chose to use the root mean squared error (RMSE) between the predictions of a proposed system and the actual ratings made by users in a validation dataset. Over 49,000 contestants from over 180 different countries formed over 40,000 teams, entered the contest, and tried to develop a recommendation algorithm with a low enough RMSE to win the Grand Prize. In my experience, most companies and organizations lack the discipline to use a single (lead) metric to compare two analytic systems and to use the metric, via a dashboard, to track progress in improving an analytic system over time. Having the discipline to do so is one sign of the analytic maturity of a company.
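For readers who want to see the lead metric spelled out, here is a minimal sketch of RMSE and of the relative improvement between two systems. The function names and the sample ratings are hypothetical and are not taken from the contest data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate lowers RMSE relative to the baseline."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Made-up star ratings for illustration only.
actual = [4, 3, 5, 2, 1]
baseline = [3.0, 3.0, 3.0, 3.0, 3.0]   # always predict the middle rating
candidate = [3.8, 3.1, 4.2, 2.5, 1.9]  # a hypothetical better system

gain = relative_improvement(rmse(baseline, actual), rmse(candidate, actual))
print(f"candidate improves RMSE by {gain:.1%}")
```

Because both systems are reduced to one number on the same validation data, ranking them on a leaderboard is a simple sort.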
Lesson 2. Don’t be afraid to disclose analytic technology you develop if the advantages outweigh the disadvantages. In general, it makes sense for companies and organizations not to disclose the proprietary technology they use. On the other hand, there are some important exceptions.
- One exception is patents. Patents provide some important protections, but the trade off is that the technology must be disclosed in the patent filing.
- Another exception is when the software of an internal analytic project is made open source or when an internal project decides to contribute to an existing open source software project. Again, there is a trade off. Some technology is disclosed, but the benefit is the community support that many open source projects engender.
- Crowdsourcing is a similar type of exception. The benefit is the innovation that crowdsourcing can provide. The downside is that crowdsourcing discloses technology that may be critical to your business. Netflix found that with Cinematch customers rented more movies and were less likely to cancel their subscriptions. Cinematch was introduced in 2000 and improved each year until a plateau was reached in 2006. In the summer of 2006, Reed Hastings, the CEO of Netflix, suggested a public contest to improve Cinematch. According to an article in the New York Times, “Cinematch suggestions… drive a surprising 60 percent of Netflix’s rentals.” By setting a threshold for the prize of 10% or more improvement, Netflix would obtain enough incremental revenue from an improved Cinematch system to make up for any information that Netflix’s competitors might gain. Again, this is a good analytic strategy.
Lesson 3. Double and triple check any data before making it public. No company or organization would knowingly make data public that contains personally identifiable information (PII) without permission. On the other hand, even if data does not contain PII per se, PII can often be inferred from data, as was done when AOL released 3 months of sample query logs in 2006. For less obvious ways to break anonymization of data, see the paper Wherefore art thou r3579x?. In some cases, it can be quite challenging to take data and to anonymize it so that it does not contain PII, especially if the data is being updated. On the other hand, making data public enables a broad community to contribute to your problem.
Finally, it is interesting to think about the size of the data used for the prize. The data consisted of over 100 million movie ratings by 480 thousand randomly-chosen, anonymous Netflix customers. They rated over 17 thousand movie titles during the period October, 1998 to December, 2005. In some sense, this is a lot of data. Certainly there are a lot of degrees of freedom in the dataset. On the other hand, it is less than 2 GB of data and easily fits in the memory of a modest size computer. From this perspective, it is a small amount of data. From the viewpoint of analytic infrastructure, it is useful to classify data as small (fits into the memory of a single computer), medium (fills the disks of a single storage device or fits into a database), or large (requires specialized infrastructure such as a cloud).
For more information:
- R. M. Bell and Y. Koren, Lessons from the Netflix prize challenge. SIGKDD Explorations Newsletter, Volume 9, Number 2 (Dec. 2007), pages 75-79. DOI= http://doi.acm.org/10.1145/1345448.1345465 (subscription required)
- Clive Thompson, The Screens Issue. If You Liked This, You’re Sure to Love That, New York Times, November 23, 2008 (registration required).
- L. Backstrom, C. Dwork, and J. Kleinberg, Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography, Proceedings of the 16th international Conference on World Wide Web (WWW ‘07), ACM, New York, NY, 181-190. (subscription required)
Upcoming Course. I’ll be using this example in an upcoming course I’m teaching in San Mateo on July 14, 2009.
The Three Most Important Interfaces in Analytics
Posted by admin in Blog, PMML, analytic infrastructure, analytics on June 17, 2009
If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements. On the other hand, if your data is large, your model is complicated, your output is a model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit by using some of the infrastructure components, services, applications and systems that have been developed over the years to support analytics. I use the term analytic infrastructure to refer to these components, services, applications and systems.

For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition, analytic infrastructure does not need to be used exclusively for modeling; it simply needs to be useful as part of the modeling process.
There are several fundamental steps when building and deploying analytic models that are directly relevant to analytic infrastructure:
| Step | Inputs | Outputs |
|---|---|---|
| Preprocessing | dataset (data fields) | dataset of features |
| Modeling | dataset of features | model |
| Scoring | dataset (data fields), model | scores |
| Postprocessing | scores | actions |
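The four steps in the table can be sketched as four small Python functions whose inputs and outputs mirror the table's columns. Everything specific here is an assumption made for illustration: the field names (`amount`, `visits`), the single derived feature, the least-squares line used as the model, and the 0.5 threshold.

```python
from statistics import mean


def preprocess(records):
    """Preprocessing: dataset (data fields) -> dataset of features."""
    return [r["amount"] / r["visits"] for r in records]


def build_model(features, targets):
    """Modeling: dataset of features -> model (a least-squares line)."""
    fx, ty = mean(features), mean(targets)
    sxx = sum((f - fx) ** 2 for f in features)
    sxy = sum((f - fx) * (t - ty) for f, t in zip(features, targets))
    slope = sxy / sxx
    return {"slope": slope, "intercept": ty - slope * fx}


def score(records, model):
    """Scoring: dataset (data fields), model -> scores."""
    features = preprocess(records)  # scoring reuses the same preprocessing
    return [model["intercept"] + model["slope"] * f for f in features]


def postprocess(scores, threshold=0.5):
    """Postprocessing: scores -> actions."""
    return ["send offer" if s >= threshold else "no action" for s in scores]


# Hypothetical training data: raw fields plus a 0/1 response.
training = [{"amount": 10, "visits": 5},
            {"amount": 40, "visits": 4},
            {"amount": 90, "visits": 3}]
responses = [0, 0, 1]

model = build_model(preprocess(training), responses)
new_records = [{"amount": 50, "visits": 2}, {"amount": 10, "visits": 2}]
actions = postprocess(score(new_records, model))
print(actions)
```

Each function boundary in this sketch corresponds to a place where two infrastructure components could be separated and run on different systems.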
Perhaps the most important interface in analytics is the interface between components in the analytic infrastructure that produce models, such as statistical packages (which have a human in the loop), and components in the analytic infrastructure that score data using models and often reside in operational environments. The former are examples of what are sometimes called model producers, while the latter are sometimes called model consumers. The Predictive Model Markup Language or PMML is a widely deployed XML standard for describing statistical and data mining models so that model producers and model consumers can exchange models in an application independent fashion.
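The producer and consumer roles can be illustrated with a small sketch that uses only Python's standard library. The XML it writes is a stripped-down document modeled on PMML's regression model elements. It has not been validated against the PMML schema, it omits the target field and other elements a real model would carry, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_0"  # namespace assumed for PMML 4.0


def produce_pmml(intercept, coefficients):
    """Model producer: serialize a linear regression as PMML-style XML."""
    root = ET.Element("PMML", {"version": "4.0", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "illustrative model"})
    fields = ET.SubElement(root, "DataDictionary",
                           {"numberOfFields": str(len(coefficients))})
    for name in coefficients:
        ET.SubElement(fields, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})
    model = ET.SubElement(root, "RegressionModel", {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", {"name": name})
    table = ET.SubElement(model, "RegressionTable", {"intercept": repr(intercept)})
    for name, coefficient in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": repr(coefficient)})
    return ET.tostring(root, encoding="unicode")


def consume_pmml(document):
    """Model consumer: parse the XML and return a scoring function."""
    ns = {"p": PMML_NS}
    table = ET.fromstring(document).find(".//p:RegressionTable", ns)
    intercept = float(table.get("intercept"))
    coefficients = {p.get("name"): float(p.get("coefficient"))
                    for p in table.findall("p:NumericPredictor", ns)}

    def score(record):
        return intercept + sum(c * record[n] for n, c in coefficients.items())

    return score


document = produce_pmml(1.0, {"x": 2.0})  # e.g. written by a statistical package
scorer = consume_pmml(document)           # e.g. loaded by a scoring engine
print(scorer({"x": 3.0}))
```

The consumer never sees the producer's code or data, only the XML document, which is what lets the two sides be built with different applications.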
On June 16, the Data Mining Group released version 4.0 of the Predictive Model Markup Language or PMML. Version 4.0 is the first release of PMML since Version 3.2 was released in May, 2007.
Version 4.0 of PMML adds the following new features:
- support for time series models;
- support for multiple models, which includes support for both segmented models and ensembles of models;
- improved support for preprocessing data, which will help simplify deployment of models;
- new models, such as survival models;
- support for additional information about models called model explanation, which includes information for visualization, model quality, gains and lift charts, confusion matrix, and related information.
Since Version 2.0 of PMML, which was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML models. Using these transformations, it would be possible to use PMML to define an interface between analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as models). This is probably the second most important interface in analytics.
With Version 4.0 now released, the PMML working group is working on Version 4.1. One of the goals is to enable PMML to describe postprocessing of scores. This would allow PMML to be used as an interface between analytic infrastructure components and services that produce scores (such as modeling engines) and those that consume scores (such as recommendation engines). This is probably the third most important interface in analytics.
Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems. For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.
If you are interested in getting involved in the PMML working group, please visit the web site: www.dmg.org
Disclaimer: I’m a member of the PMML working group and worked on PMML Version 4.0.

