Posts Tagged analytic strategy
What is Analytic Infrastructure and Why Should You Care?
Posted by Robert Grossman in Blog, analytic infrastructure, analytic strategy on February 16, 2010
I have been building analytic models for over 20 years. The names have changed a lot over the years: 20 years ago we built statistical models, 10 years ago we built data mining models, and today we build analytic models. The algorithms have changed some: classification and regression trees became common 20 years ago, support vector machines about 10 years ago, and today graph-based algorithms are popular.

Perhaps what has changed the most is my perspective.
Analytic algorithms and models. Twenty years ago, I was focused on algorithms and was concerned with the different types of models that you could build using different types of algorithms on different types of data. This worked fine as long as the data fit into the memory of the computer.
Analytic infrastructure. For better or worse, I ran into problems with so much data that it would not fit into memory. Some projects required a disk, some required many disks, and a few required tertiary storage. I spent over two decades working on what you might call analytic infrastructure. I first worked on teams that developed specialized data management infrastructures for the high energy physics community. These were optimized for efficient reads (instead of safe writes) and accessed the data by columns (instead of rows) in order to speed up numerical computations. They turned out to be some of the first examples of data warehouses (the name was not used at that time), increased by 1 to 3 orders of magnitude the size of data that we could model, and were heavily criticized by the database community. Of course, several years later the database community embraced data warehouses, at least for reports, if not for data intensive computing and modeling.
About five years ago, I began working on what are today called cloud computing platforms. Again, these increase by 1 to 3 orders of magnitude the size of data that we can model, and again they have been heavily criticized by some in the database community as being a big step backwards.
I recently edited a special issue of the ACM SIGKDD Explorations about analytic infrastructure. In an article there, I define analytic infrastructure as the applications, services, utilities and systems that are used for preparing data for modeling, estimating models, validating models, scoring, or related analytic activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling; it simply needs to be useful as part of the modeling process. The article is available as a PDF from the SIGKDD Explorations web site (it’s Issue 1 in Volume 11).
I don’t really like this definition and encourage you to provide a better one. What is important though is that using the appropriate analytic infrastructure is critical to building models for problems with so much data that simply putting it into memory and forgetting about it is not a viable solution.
Analytic Strategy. Returning to how my perspective has evolved, for the past several years I have become increasingly concerned with what is usually called analytic strategy. Analytic strategy is concerned with making sure that you are asking the right analytic question, that you are building a model that can be deployed efficiently, that the output of the model is actionable, that the actions have a business impact, that the business impact is aligned with corporate strategy, that there is an appropriate governance process in place, and related questions.
My perspective these days is that analytics requires a firm foundation and that the foundation has three columns: 1) analytic strategy; 2) analytic infrastructure; and 3) analytic algorithms and models.
The picture is by Alyson Hurt.
Three Lessons in Analytic Strategy from the Netflix Prize
Posted by admin in Blog, analytic strategy, analytics on July 5, 2009
The Netflix Prize requires developing a new rating algorithm that improves by over 10% on Cinematch, the current system that Netflix uses to suggest movies to its customers. According to the contest’s Leaderboard, it looks like the $1,000,000 Grand Prize will be awarded shortly.
The Netflix Prize provides some interesting lessons in analytic strategy. In addition to the Grand Prize, a $50,000 Progress Prize is awarded each year until the Grand Prize is awarded. The Progress Prize was awarded in 2007 and 2008. The Netflix Prize has become quite well known due to the prize money being offered. It deserves to be just as well known for the analytic strategy Netflix chose.
It is relatively common for a company or an organization to spend a million dollars on an analytic project. It is less common for something useful to result from it. I don’t have any inside knowledge about the Netflix Prize, but I think that there are several valuable lessons about analytic strategy that the Netflix Prize illustrates. Here are three lessons.
Lesson 1. Agree upon a metric to measure the effectiveness of an analytic model and use it consistently. It is usually not possible to find a single metric that captures all the relevant information required when comparing two analytic systems. It is certainly the case that any actual ratings system requires several metrics. For example, one metric might measure how many stars a viewer would assign to a movie and another metric might measure how often a viewer selects the movies recommended. On the other hand, by singling out one metric, it becomes straightforward to compare two recommendation algorithms. Once this is possible, it becomes simple to use the metric to create a dashboard (the Netflix Leaderboard) and then to use the dashboard to track progress. Netflix chose to use the root mean squared error (RMSE) between the predictions of a proposed system and the actual ratings given by users in a validation dataset. Over 49,000 contestants from over 180 different countries formed over 40,000 teams, entered the contest, and tried to develop a recommendation algorithm with a low enough RMSE to win the Grand Prize. In my experience, most companies and organizations lack the discipline to use a single (lead) metric to compare two analytic systems and to track progress in improving an analytic system over time using a dashboard. Having the discipline to do so is one sign of the analytic maturity of a company.
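To make the single-metric idea concrete, here is a minimal sketch of computing RMSE for two systems and checking whether one improves on the other by a 10% threshold. The function names, the sample ratings, and the way the threshold is applied are my own illustrative assumptions, not the official contest scoring code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fractional reduction in RMSE relative to the baseline system."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


def beats_threshold(baseline_rmse, candidate_rmse, threshold=0.10):
    """True if the candidate improves on the baseline by at least threshold."""
    return relative_improvement(baseline_rmse, candidate_rmse) >= threshold


# Hypothetical star ratings, invented for illustration only.
actual = [4, 3, 5, 2, 1]
baseline = [3.5, 3.5, 4.0, 3.0, 2.0]   # stand-in for the current system
candidate = [3.8, 3.2, 4.6, 2.4, 1.5]  # stand-in for a proposed system

baseline_rmse = rmse(baseline, actual)
candidate_rmse = rmse(candidate, actual)
print(f"baseline RMSE:  {baseline_rmse:.4f}")
print(f"candidate RMSE: {candidate_rmse:.4f}")
print(f"improvement:    {relative_improvement(baseline_rmse, candidate_rmse):.1%}")
print(f"beats 10% bar:  {beats_threshold(baseline_rmse, candidate_rmse)}")
```

Because every system is reduced to one number on the same validation data, ranking the entries for a leaderboard is just a sort on that number.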
Lesson 2. Don’t be afraid to disclose analytic technology you develop if the advantages outweigh the disadvantages. In general, it makes sense for companies and organizations not to disclose the proprietary technology they use. On the other hand, there are some important exceptions.
- One exception is patents. Patents provide some important protections, but the trade-off is that the technology must be disclosed in the patent filing.
- Another exception is when the software of an internal analytic project is made open source or when an internal project decides to contribute to an existing open source software project. Again, there is a trade-off. Some technology is disclosed, but the benefit is the community support that many open source projects engender.
- Crowdsourcing is a similar type of exception. The benefit is the innovation that crowdsourcing can provide. The downside is that crowdsourcing discloses technology that may be critical to your business.
Netflix found that with Cinematch customers rented more movies and were less likely to cancel their subscriptions. Cinematch was introduced in 2000 and improved each year until a plateau was reached in 2006. In the summer of 2006, Reed Hastings, the CEO of Netflix, suggested a public contest to improve Cinematch. According to an article in the New York Times, “Cinematch suggestions… drive a surprising 60 percent of Netflix’s rentals.” By setting a threshold for the prize of 10% or more improvement, Netflix would obtain enough incremental revenue from an improved Cinematch system to make up for any information that Netflix’s competitors might gain. Again, this is a good analytic strategy.
Lesson 3. Double and triple check any data before making it public. No company or organization would knowingly make data public that contains personally identifiable information (PII) without permission. On the other hand, even if data does not contain PII per se, PII can often be inferred from it, as was done when AOL released 3 months of sample query logs in 2006. For less obvious ways to break anonymization of data, see the paper Wherefore art thou r3579x?. In some cases, it can be quite challenging to anonymize data so that it does not contain PII, especially if the data is being updated. On the other hand, making data public enables a broad community to contribute to your problem.
Finally, it is interesting to think about the size of the data used for the prize. The data consisted of over 100 million movie ratings by 480 thousand randomly-chosen, anonymous Netflix customers. They rated over 17 thousand movie titles during the period October, 1998 to December, 2005. In some sense, this is a lot of data. Certainly there are a lot of degrees of freedom in the dataset. On the other hand, it is less than 2 GB of data and easily fits in the memory of a modest size computer. From this perspective, it is a small amount of data. From the viewpoint of analytic infrastructure, it is useful to classify data as small (fits into the memory of a single computer), medium (fills the disks of a single storage device or fits into a database), or large (requires specialized infrastructure such as a cloud).
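As a rough back-of-the-envelope sketch of this classification, the snippet below estimates the size of a ratings dataset and labels it small, medium, or large. The 16 bytes per rating, the 8 GB memory budget, and the 2,000 GB disk budget are assumptions picked for illustration; they are not figures from the Netflix data or from any particular machine.

```python
BYTES_PER_GB = 1024 ** 3


def estimated_size_gb(num_records, bytes_per_record):
    """Rough size of a dataset, ignoring indexes and storage overhead."""
    return num_records * bytes_per_record / BYTES_PER_GB


def classify(size_gb, memory_gb, disk_gb):
    """Label a dataset by where it fits: memory, one storage device, or neither."""
    if size_gb <= memory_gb:
        return "small"   # fits into the memory of a single computer
    if size_gb <= disk_gb:
        return "medium"  # fits on a single storage device or in a database
    return "large"       # needs specialized infrastructure such as a cloud


# Assumed record layout: about 16 bytes per rating (user, movie, stars, date).
ratings_gb = estimated_size_gb(num_records=100_000_000, bytes_per_record=16)

# Hypothetical budgets for a modest computer and a single storage device.
label = classify(ratings_gb, memory_gb=8, disk_gb=2000)
print(f"about {ratings_gb:.2f} GB, classified as {label}")
```

Under these assumptions the estimate comes in under 2 GB, which is consistent with the figure quoted above. Changing the assumed record size or the budgets moves the boundaries, so treat the labels as relative to the hardware you have.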
For more information:
- R. M. Bell and Y. Koren, Lessons from the Netflix prize challenge. SIGKDD Explorations Newsletter, Volume 9, Number 2 (Dec. 2007), pages 75-79. DOI= http://doi.acm.org/10.1145/1345448.1345465 (subscription required)
- Clive Thompson, The Screens Issue. If You Liked This, You’re Sure to Love That, New York Times, November 23, 2008 (registration required).
- L. Backstrom, C. Dwork, and J. Kleinberg, Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography, Proceedings of the 16th International Conference on World Wide Web (WWW ’07), ACM, New York, NY, 181-190. (subscription required)
Upcoming Course. I’ll be using this example in an upcoming course I’m teaching in San Mateo on July 14, 2009.
