analytic strategy
What is Analytic Infrastructure and Why Should You Care?
Posted by Robert Grossman in Blog, analytic infrastructure, analytic strategy on February 16, 2010
I have been building analytic models for over 20 years. The names have changed a lot over the years: 20 years ago we built statistical models, 10 years ago we built data mining models, and today we build analytic models. The algorithms have changed some: classification and regression trees became common 20 years ago, support vector machines about 10 years ago, and today graph-based algorithms are popular.

Perhaps what has changed the most is my perspective.
Analytic algorithms and models. Twenty years ago, I was focused on algorithms and was concerned with the different types of models that you could build using different types of algorithms on different types of data. This worked fine as long as the data fit into the memory of the computer.
Analytic infrastructure. For better or worse I ran into problems that had so much data that the data was too big to fit into memory. Some projects required a disk, some required many disks, and a few required tertiary storage. I spent over two decades working on what you might call analytic infrastructure. I first worked on teams that developed specialized data management infrastructures for the high energy physics community. These were optimized for efficient reads (instead of safe writes) and accessed the data by columns (instead of rows) in order to speed up numerical computations. They turned out to be some of the first examples of data warehouses (the name was not used at that time), increased by 1 to 3 orders of magnitude the size of data that we could model, and were heavily criticized by the database community. Of course, several years later the database community embraced data warehouses, at least for reports, if not for data intensive computing and modeling.
About five years ago, I began working on what are today called cloud computing platforms. Again, these increase by 1 to 3 orders of magnitude the size of data that we can model, and again they have been heavily criticized by some in the database community as being a big step backwards.
I recently edited a special issue of the ACM SIGKDD Explorations about analytic infrastructure. In an article there, I define analytic infrastructure as the applications, services, utilities and systems that are used for preparing data for modeling, estimating models, validating models, scoring, or related analytic activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling but simply needs to be useful as part of the modeling process. The article is available as a pdf from the SIGKDD Explorations web site (it’s Issue 1 in Volume 11).
I don’t really like this definition and encourage you to provide a better one. What is important though is that using the appropriate analytic infrastructure is critical to building models for problems with so much data that simply putting it into memory and forgetting about it is not a viable solution.
Analytic Strategy. Returning to how my perspective has evolved: over the past several years, I have become increasingly concerned with what is usually called analytic strategy. Analytic strategy is concerned with making sure that you are asking the right analytic question, that you are building a model that can be deployed efficiently, that the output of the model is actionable, that the actions have a business impact, that the business impact is aligned with corporate strategy, that there is an appropriate governance process in place, and related questions.
My perspective these days is that analytics requires a firm foundation and that the foundation has three columns: 1) analytic strategy; 2) analytic infrastructure; and 3) analytic algorithms and models.
The picture is by Alyson Hurt.
He Said, She Said – Why Custom Models Take Time
Posted by admin in Blog, analytic strategy, analytics on August 6, 2009
In this post, I discuss some of the different options available when building analytic models. For the purposes here, a good short definition of analytics is to view analytics as using data to make predictions. The term predictive analytics is beginning to be applied (appropriately enough) to this type of analytics. A longer definition is to view predictive analytics as building statistically valid models from data that can be used to make predictions about future events, to take actions, and to make decisions.

In this post, the point of view is that a business owner of a problem in a company requires a model and is considering whether to build the model in-house, outsource the model to a vendor providing analytic services, or simply to give up on building a model and produce a report instead. I don’t recommend the latter option, but unfortunately, in practice, it is all too common.
Broadly speaking, from a business owner’s point of view, there are several phases required to build a model for a new project. The process looks a bit different from the modeler’s point of view. It is also a bit simpler if the same model has been built before and all that is required is to update the model using new data. Here are the basic steps required to build a model from the business owner’s point of view.
1. Working with IT to obtain all the data required for the project and making it available to the modeler.
2. Answering questions from the modeler about the data.
3. Agreeing upon the output of the model.
4. Reviewing the first model with the modeler.
5. Reviewing the second and subsequent models with the modeler.
6. Working with IT to deploy the model.
Although all steps except for Step 1 are collaborative between the business owner and the modeler, Step 6 is primarily the business owner’s responsibility, while Steps 2, 4, and 5 are primarily the modeler’s responsibility. At the beginning of many projects, Step 3 looks obvious. It turns out that it is often not so obvious until the project is nearing its end, the data has been cleaned, and the deployment is well underway. One way to understand why this is so is that one often doesn’t have a good understanding of the most appropriate output of a model until the data has been cleaned and there is a good understanding of how the model will be deployed in operational systems.
Let’s look at this same process now from the viewpoint of the modeler. To simplify, the following steps are required:
1. Waiting for the data.
2. Cleaning the data.
3. Asking the business owner questions about the data.
4. Agreeing upon the output of the model.
5. Developing a set of features for the model.
6. Estimating the parameters of the model.
7. Building a measure to evaluate the model.
8. Evaluating the model using the measure.
9. Developing post-processing rules for the scores produced by the model.
10. Repeating the steps above for the second and subsequent versions of the model until everyone is happy, or there is no more time or funding left.
11. Deploying the model.
Building a new model requires completing all the steps above. Generally, a series of models (version 1 of the model, version 2 of the model, etc.) are produced and reviewed by the business owner and the modeler (Step 10). The more time available for Step 10, the better the quality of the model.
To understand these steps a bit better, it might be helpful to review the post about the SAMS Methodology. The SAMS methodology explains how to think of models in terms of the Scores they produce, the Actions these enable, the Measures used to evaluate the actions, and whether these actions support a targeted Strategy or not.
Sometimes a model has been built before and only some of these steps need to be repeated. For example, refreshing a model only requires completing Steps 6 and 8 for a series of models. Rebuilding a model usually only requires repeating Steps 5, 6, 8 and 9 for a series of models.
Sometimes, the data is supplied in a standard format (for example, it is provided by a third party) and the deployment uses a standard format (for example, all that is required is a list of names and corresponding offers). In this case, after a model has been built once, all that is required when a business owner supplies new data is to perform Steps 6 and 8. Call this a standard model. Standard models are substantially less work to build than models that require completing all the steps above. These more labor intensive models are often called custom models.
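The relationship between the kind of request and the work involved can be sketched as a small lookup table. This is only an illustration: the step numbers follow the order of the modeler's list above, and the names `MODELER_STEPS`, `STEPS_BY_REQUEST`, and `steps_for` are hypothetical labels introduced for this sketch.

```python
# Modeler's steps, numbered in the order they are listed above.
MODELER_STEPS = {
    1: "Waiting for the data",
    2: "Cleaning the data",
    3: "Asking the business owner questions about the data",
    4: "Agreeing upon the output of the model",
    5: "Developing a set of features for the model",
    6: "Estimating the parameters of the model",
    7: "Building a measure to evaluate the model",
    8: "Evaluating the model using the measure",
    9: "Developing post-processing rules for the scores",
    10: "Repeating the steps for subsequent versions of the model",
    11: "Deploying the model",
}

# Which steps each kind of request needs (hypothetical request labels).
STEPS_BY_REQUEST = {
    "custom": list(MODELER_STEPS),   # all steps
    "rebuild": [5, 6, 8, 9],
    "refresh": [6, 8],
    "standard": [6, 8],              # standard data and deployment formats
}

def steps_for(request_type):
    """Return the step descriptions needed for a given kind of request."""
    return [MODELER_STEPS[n] for n in STEPS_BY_REQUEST[request_type]]

for request_type in STEPS_BY_REQUEST:
    print(request_type, "->", len(steps_for(request_type)), "steps")
```

Seen this way, the difference in effort between a standard model and a custom model is simply the difference in how many steps have to be carried out.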
Most requests for models fit into some standard categories of models. Examples include models that predict whether a prospect will respond to an offer (response models), whether a customer will remain a customer (attrition models), whether a customer will keep current with their payments (credit models), whether a transaction is valid or fraudulent (fraud models), etc.
Sometimes, models that don’t fit into these familiar categories of models are built. Call these new types of models. A new type of model also requires that the modeler develop new types of features, new types of measures for evaluating the models, etc. New types of custom models are the most labor intensive to build.
In practice, it usually takes four to six months or longer to build a custom model, once the data has arrived. As the size and complexity of the data grows, each of the steps usually requires more time.
Three Lessons in Analytic Strategy from the Netflix Prize
Posted by admin in Blog, analytic strategy, analytics on July 5, 2009
The Netflix Prize requires developing a new rating algorithm that improves by 10% or more on Cinematch, the current system that Netflix uses to suggest movies to its customers. According to the contest’s Leaderboard, it looks like the $1,000,000 Grand Prize will be awarded shortly.
The Netflix Prize provides some interesting lessons in analytic strategy. In addition to the Grand Prize, each year until the Grand Prize is awarded, a $50,000 Progress Prize is awarded. The Progress Prize was awarded in 2007 and 2008. The Netflix Prize has become quite well known due to the prize money being offered. It deserves to be just as well known for the analytic strategy they chose.
It is relatively common for a company or an organization to spend a million dollars on an analytic project. It is less common for something useful to result from it. I don’t have any inside knowledge about the Netflix Prize, but I think that there are several valuable lessons about analytic strategy that the Netflix Prize illustrates. Here are three lessons.
Lesson 1. Agree upon a metric to measure the effectiveness of an analytic model and use it consistently. It is usually not possible to find a single metric that captures all the relevant information required when comparing two analytic systems. It is certainly the case that any actual ratings system requires several metrics. For example, one metric might measure how many stars a viewer would assign to a movie and another metric might measure how often a viewer selects the movies recommended. On the other hand, by singling out a single metric, it becomes straightforward to compare two recommendation algorithms. Once this is possible, it becomes simple to use the metric to create a dashboard (the Netflix Leaderboard) and then to use the dashboard to track progress. Netflix chose to use the root mean squared error (RMSE) between the predictions of a proposed system and the actual ratings made by users in a validation dataset. Over 49,000 contestants from over 180 different countries formed over 40,000 teams, entered the contest, and tried to develop a recommendation algorithm with a low enough RMSE to win the Grand Prize. In my experience, most companies and organizations lack the discipline to use a single (lead) metric to compare two analytic systems and to use the metric, together with a dashboard, to track progress in improving an analytic system over time. Having the discipline to do so is one sign of the analytic maturity of a company.
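To make the idea of a single lead metric concrete, here is a minimal sketch that computes RMSE for two rating predictors and the relative improvement of one over the other. The ratings and predictions are made-up values chosen for illustration, not Netflix data, and the function names are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate lowers the baseline RMSE."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse

# Illustrative values only (assumed for this sketch).
actual_ratings = [4, 3, 5, 2, 1]
baseline_predictions = [3, 3, 4, 3, 2]
candidate_predictions = [4, 3, 4, 2, 2]

baseline = rmse(baseline_predictions, actual_ratings)
candidate = rmse(candidate_predictions, actual_ratings)
improvement = relative_improvement(baseline, candidate)

print(f"baseline RMSE:  {baseline:.4f}")
print(f"candidate RMSE: {candidate:.4f}")
print(f"improvement:    {improvement:.1%}")
print("meets a 10% threshold:", improvement >= 0.10)
```

Because both systems are reduced to one number on the same validation data, ranking them (and tracking the ranking over time on a dashboard) becomes a one-line comparison.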
Lesson 2. Don’t be afraid to disclose analytic technology you develop if the advantages outweigh the disadvantages. In general, it makes sense for companies and organizations not to disclose the proprietary technology they use. On the other hand, there are some important exceptions.
- One exception is patents. Patents provide some important protections, but the trade off is that the technology must be disclosed in the patent filing.
- Another exception is when the software of an internal analytic project is made open source or when an internal project decides to contribute to an existing open source software project. Again, there is a trade off. Some technology is disclosed, but the benefit is the community support that many open source projects engender.
- Crowdsourcing is a similar type of exception. The benefit is the innovation that crowdsourcing can provide. The downside is that crowdsourcing discloses technology that may be critical to your business.
Netflix found that with Cinematch customers rented more movies and were less likely to cancel their subscriptions. Cinematch was introduced in 2000 and improved each year until a plateau was reached in 2006. In the summer of 2006, Reed Hastings, the CEO of Netflix, suggested a public contest to improve Cinematch. According to an article in the New York Times, “Cinematch suggestions… drive a surprising 60 percent of Netflix’s rentals.” By setting a threshold for the prize of 10% or more improvement, Netflix would obtain enough incremental revenue from an improved Cinematch system to make up for any information that Netflix’s competitors might gain. Again, this is a good analytic strategy.
Lesson 3. Double and triple check any data before making it public. No company or organization would knowingly make data public that contains personally identifiable information (PII) without permission. On the other hand, even if data does not contain PII per se, PII can often be inferred from the data, as was done when AOL released 3 months of sample query logs in 2006. For less obvious ways to break anonymization of data, see the paper Wherefore art thou r3579x?. In some cases, it can be quite challenging to take data and to anonymize it so that it does not contain PII, especially if the data is being updated. On the other hand, making data public enables a broad community to contribute to your problem.
Finally, it is interesting to think about the size of the data used for the prize. The data consisted of over 100 million movie ratings by 480 thousand randomly-chosen, anonymous Netflix customers. They rated over 17 thousand movie titles during the period October 1998 to December 2005. In some sense, this is a lot of data. Certainly there are a lot of degrees of freedom in the dataset. On the other hand, it is less than 2 GB of data and easily fits in the memory of a modest size computer. From this perspective, it is a small amount of data. From the viewpoint of analytic infrastructure, it is useful to classify data as small (fits into the memory of a single computer), medium (fills the disks of a single storage device or fits into a database), or large (requires specialized infrastructure such as a cloud).
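The small, medium, and large classification can be written down as a simple rule. The sketch below is illustrative only: the function name and the memory and storage capacities are assumptions chosen for the example, since what counts as "the memory of a single computer" changes from machine to machine and from year to year.

```python
GB = 10**9  # decimal gigabyte, used here only for round numbers

def classify_data_size(data_bytes, memory_bytes, storage_bytes):
    """Classify a dataset relative to one machine's memory and storage."""
    if data_bytes <= memory_bytes:
        return "small"    # fits into the memory of a single computer
    if data_bytes <= storage_bytes:
        return "medium"   # fits on a single storage device or in a database
    return "large"        # requires specialized infrastructure

# Assumed capacities for a hypothetical machine.
memory = 8 * GB
storage = 2000 * GB

print(classify_data_size(2 * GB, memory, storage))       # roughly the Netflix Prize data
print(classify_data_size(500 * GB, memory, storage))
print(classify_data_size(50_000 * GB, memory, storage))
```

The useful point of the rule is that the category depends on the infrastructure available, not on the row count alone.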
For more information:
- R. M. Bell and Y. Koren, Lessons from the Netflix prize challenge. SIGKDD Explorations Newsletter, Volume 9, Number 2 (Dec. 2007), pages 75-79. DOI= http://doi.acm.org/10.1145/1345448.1345465 (subscription required)
- Clive Thompson, The Screens Issue. If You Liked This, You’re Sure to Love That, New York Times, November 23, 2008 (registration required).
- L. Backstrom, C. Dwork, and J. Kleinberg, Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography, Proceedings of the 16th international Conference on World Wide Web (WWW ‘07), ACM, New York, NY, 181-190. (subscription required)
Upcoming Course. I’ll be using this example in an upcoming course I’m teaching in San Mateo on July 14, 2009.
In Analytics, It’s the Actions that Matter
Posted by admin in Blog, analytic strategy, analytics on April 28, 2009
In this note, let’s define analytics as the analysis of data in order to take actions. (This is a narrow definition of analytics, but one that is useful here.) If you don’t have day-to-day work experience with analytics, it is easy to have the mistaken impression that analytics is only about data and statistical models.
Although understanding data and developing statistical models is certainly an important component of an analytic project, this is just one aspect of analytics. This aspect includes cleaning data, enriching data, exploring data, developing features, building models, validating models, and iterating the process. From a broad perspective, this is a process in which the input is data and the output is a statistical model. When most people think of modeling, this is what they think of. For many analytic projects, this is just a small part of what is required for a successful engagement.
The second aspect of analytics is what I am concerned with in this note. This is the aspect of analytics concerned with:
- developing an appropriate score for a statistical model;
- using the score to define useful actions;
- determining which measures are best for evaluating the effectiveness of these actions;
- tracking these measures (often with a dashboard) and making sure that they advance the strategic objectives of the company or organization.
One way to remember this is using the mnemonic SAMS for Scores, Actions, Measures and Strategies.
For example, with a response model, often a threshold is used. If the score from the response model is above the threshold, an offer is made (this is the action); if not, no offer is made.
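The threshold rule just described fits in a few lines. This is a minimal sketch: the scores, the visitor names, and the threshold of 0.5 are made-up values, and `choose_action` is a hypothetical name.

```python
def choose_action(score, threshold=0.5):
    """Map a response-model score to an action using a fixed threshold."""
    # An offer is made only when the score is strictly above the threshold.
    return "make offer" if score > threshold else "no offer"

# Illustrative scores only (assumed for this sketch).
scores = {"visitor_a": 0.82, "visitor_b": 0.31, "visitor_c": 0.50}
actions = {visitor: choose_action(score) for visitor, score in scores.items()}

for visitor, action in actions.items():
    print(visitor, "->", action)
```

In practice the threshold itself is something to tune against the measure chosen for the project, rather than a constant fixed in advance.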
Here are some examples of SAMS:
| Model | Score | Action | Measure | Strategy |
|---|---|---|---|---|
| on-line response model | likelihood to respond to an offer | display the offer to the visitor that has the highest likelihood of response and available inventory | revenue per day generated by the web site | increase revenue from a website by improving targeting of offers |
| fraud model | likelihood that a transaction is fraudulent | approve, decline, or obtain more information | detection and false positive rates | reduce costs and improve customer experience by lowering fraud rates |
| data quality model | likelihood that a data source has data quality problems | if the score is above a threshold, manually investigate the data to check whether there is in fact a data quality problem | detection and false positive rates | improve operational efficiencies by detecting data quality problems more quickly |
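
The detection and false positive rates listed as measures in the table can be computed from scored, labeled cases. The sketch below assumes the usual confusion-matrix definitions (detection rate as the fraction of true problems that are flagged, false positive rate as the fraction of legitimate cases that are flagged); some fraud groups define the false positive rate differently. The scores, labels, threshold, and function name are all made up for illustration.

```python
def detection_and_false_positive_rates(scores, labels, threshold):
    """Return (detection rate, false positive rate) for a score threshold.

    labels[i] is True when case i really is a problem (e.g. fraud).
    """
    tp = fp = tn = fn = 0
    for score, is_problem in zip(scores, labels):
        flagged = score > threshold
        if flagged and is_problem:
            tp += 1          # problem correctly flagged
        elif flagged and not is_problem:
            fp += 1          # legitimate case flagged by mistake
        elif is_problem:
            fn += 1          # problem missed
        else:
            tn += 1          # legitimate case correctly passed
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_positive_rate

# Illustrative values only (assumed for this sketch).
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [True, True, True, False, False, False]

detection, false_positives = detection_and_false_positive_rates(scores, labels, 0.5)
print(f"detection rate:      {detection:.2f}")
print(f"false positive rate: {false_positives:.2f}")
```

Raising the threshold lowers both rates, which is why the two measures are normally tracked together.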
A successful analytics project requires a careful study of what actions are possible; which of the possible actions can be deployed into operational systems; and how the systems can be instrumented so that the data needed to compute the required measures is available.
The organizational challenge when developing and deploying analytics is that four groups must work together to complete a successful analytic project:
- The IT group must provide the required data to build the model.
- The analytics group must build the appropriate models and develop the appropriate scores.
- The operations group must decide which actions are possible and how these actions can be integrated with current systems and business processes.
- An executive sponsor must make sure that the measures have strategic relevance and that the three groups above collaborate effectively.

