Posts Tagged cloud analytics

The Three Most Important Interfaces in Analytics

If your data is small, your statistical model is simple, your only output is a report, and the work needs to be done just once, then are a quite a few statistical and data mining applications that will satisfy your requirements. On the other hand, if your data is large, your model is complicated, your output is a model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit by using some of the infrastructure components, services, applications and systems that have been developed over the years to support analytics. I use the term analytic infrastructure to refer to these components, services, applications and systems.

The Data Mining Group, which develops the Predictive Model Markup Language.

For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling but simply useful as part of the modeling process.

There are several fundamental steps when building and deploying analytic models that are directly relevant to analytic infrastructure:

Step Inputs Outputs
Preprocessing dataset (data fields) dataset of features
Modeling dataset of features model
Scoring dataset (data fields), model scores
Postprocessing scores actions

Perhaps, the most important interfaces in analytics is the interface between components in the analytic infrastructure that produce models, such as statistical packages (which have a human in the loop), and components in the analytic infrastructure that score data using models and often reside in operational environments. The former are examples of what are sometimes called model producers, while the latter are sometimes called model consumers. The Predictive Model Markup Language or PMML is a widely deployed XML standard for describing statistical and data mining models using XML so that model producers and model consumers can exchange models in an application independent fashion.

On June 16, the Data Mining Group released version 4.0 of the Predictive Model Markup Language or PMML. Version 4.0 is the first release of PMML since Version 3.2 was released in May, 2007.

Version 4.0 of PMML adds the following new features:

  • support for time series models;
  • support for multiple models, which includes support for both
    segmented models and ensembles of models;
  • improved support for preprocessing data, which will help simplify
    deployment of models;
  • new models, such as survival models;
  • support for additional information about models called model
    explanation, which includes information for visualization, model
    quality, gains and lift charts, confusion matrix, and related
    information.

Since Version 2.0 of PMML, which was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML models. Using these transformations, it would be possible to use PMML define an interface between analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as models). This is probably the second most important interface in analytics.

With Version 4.0 now released, the PMML working group is now working on Version 4.1. One of the goals is to enable PMML describe postprocessing of scores. This would allow PMML to be used as interface between analytic infrastructure components and services that produce scores (such as modeling engines) and those that consume scores (such as recommendation engines). This is probably the third most important interface in analytics.

Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems. For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.

If you are interested in getting involved in the PMML working group, please visit the web site: www.dmg.org

Disclaimer:I’m a member of the PMML working group and worked on PMML Version 4.0.

, , , , , , , , ,

No Comments

The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing

The TPC Benchmarks have played an important role in comparing databases and transaction processing systems. Currently, there are no similar benchmarks for comparing two clouds.

Benchmark

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications. In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as Hadoop and Sector, designed to support data intensive computing.

MalStone is a stylized analytic computation of a type that is common in data intensive computing. The open source code to generate data for MalStone and a technical report describing MalStone and providing some sample implementations can be found at: code.google.com/p/malgen (look in the feature downloads section along the right hand side).

Detecting Drive-By Exploits from Log Files

We introduce MalStone with a simple example. Consider visitors to web sites. As described in the paper The Ghost in the Browser by Provos et. al. that was presented at HotBot ‘07, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the web pages. Sometimes these are called “drive-by exploits.”

The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:

   | Timestamp | Web Site ID | User ID

There is a further assumption that if the computers become infected, at perhaps a later time, then this is known. That is for each computer, which we assume is identified by the ID of the corresponding user, it is known whether at some later time that computer has become compromised:

   | User ID | Compromise Flag

Here the Compromise field is a flag, with 1 denoting a compromise. A very simple statistic that provides some insight into whether a web page is a possible source of compromises is to compute for each web site the ratio of visits in which the computer subsequently becomes compromised to those in which the computer remains uncompromised.

We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites. Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging. For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today. On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.

The MalStone benchmarks use records of the following form:

   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

Here site abstracts web site and entity abstracts the possibly infected computer. We assume that each record is 100 bytes long.

In the MalStone A Benchmarks, for each site, the number of records for which an entity visited the site and subsequently becomes compromised is divided by the total number of records for which an entity visited the site. The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest). MalStone A-10 uses 10 billion records so that in total there is 1 TB of data. Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records. MalStone B-10, B-100 and B-1000 are defined in the same way.

TeraSort Benchmark

One of the motivations for choosing 10 billion 100-byte records is that the TeraSort Benchmark (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.

In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark. It was able to sort 1 TB of data using using 910 nodes in 209 seconds, breaking the previous record of 297 seconds. Hadoop set a new record in 2009 by sorting 100 TB of data at 0.578 TB/minute using 3800 nodes. For some background about the TeraSort Benchmark, see the blog posting by Jamie Hamilton Hadoop Wins Terasort.

Note that the TeraSort Benchmark is now deprecated and has been replaced by the Minute Sort Benchmark. Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.

Generating Data for MalStone Using MalGen

We have developed a generator of synthetic data for MalStone called MalGen. MalGen is open source and available from code.google.com/p/malgen. Using MalGen, data can be generated with power law distributions, which is useful when modeling web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).

Using MalStone to Study Design Tradeoffs

Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records. The experiments were done on 20 nodes of the Open Cloud Testbed. Each node was a Dell 1435 computer with 12 GB memory, 1TB disk, 2.0GHz dual dual-core AMD Opteron 2212, and 1 Gb/s network interface cards.

We compared three different implementations: 1) Hadoop HDFS with Hadoop’s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using Sphere User Defined Functions (UDFs).

MalStone A
Hadoop MapReduce 454m 13s
Hadoop Streams/Python 87m 29s
Sector/Sphere UDFs 33m 40s
MalStone B
Hadoop MapReduce 840m 50s
Hadoop Streams/Python 142m 32s
Sector/Sphere UDFs 43m 44s

Please note that these timings are still preliminary and may be revised in the future as we better optimize the implementations.

If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice. What these preliminary benchmarks indicate though is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams. In addition, you may also want to consider using Sector.

The image above is from Strolling everyday and available via a Creative Commons license.

Disclaimer: I am involved in the development of Sector.

, , , , , , , , , , , ,

8 Comments

Sector – When You Really Need to Process 10 Billion Records

As is well known by now, Google demonstrated the power of a layered stack of cloud services that are designed for commodity computers that fill a data center. The stack consists of a storage service (the Google File System (GFS)), a compute service based upon MapReduce, and a table service (BigTable).

Although the Google stack of services is not directly available, the open source Hadoop system, which has a broadly similar architecture, is available.

The Google stack, consisting of GFS/MapReduce/Bigtable, and the Hadoop system, consisting of the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce, are examples of clouds designed for data intensive computing — these types of clouds provide computing capacity on demand, with capacity scaling all the way up to the size of a data center.

There are still many open questions about how best to design clouds for data intensive computing. During the best several years, I have been involved with a cloud designed for data intensive computing called Sector. The lead developer of Sector is Yunhong Gu of the University of Illinois at Chicago. Sector was developed independently of Hadoop and the Google cloud services and makes several different design choices (see the table below).

To quantify the impact of some of these choices, I have been involved with the development of a benchmark for data intensive computing called MalStone. I will talk more about MalStone in a future post, but briefly, MalStone is a stylized analytic computing that can be done simply using MapReduce, as well as variants and generalizations of MapReduce. The open source MalStone code comes with a generator of synthetic records and one benchmark (called MalStone B) generates 10 billion 100-byte records (similar to terasort).

MalStone B Benchmarks

System Time (min)
Hadoop MapReduce 799 min
Hadoop Streaming with Python 143 min
Sector 44 min

Tests were done using 20 nodes on the Open Cloud Testbed. Each node contained 500 million 100-byte records.

Comparing Sector and Hadoop

Hadoop Sector
Storage cloud block-based file system file-based
Programming model MapReduce user defined functions and MapReduce
Protocol TCP UDP
Security NA HIPAA capable
Replication at time of writing periodically
Language Java C++

I’ll be giving a talk on Sector at CloudSlam ‘09 on Monday, April 20, 2009 at 4pm ET. CloudSlam is a virtual conference, so that it is easy to listen to any of the talks that interest you.

, , , ,

2 Comments

Learning About Cloud Analytics

Clouds are changing the way that analytic models get built and the way they get deployed.

Neither analytics nor clouds have standard definitions yet.

A definition I like is to define analytics as the analysis of data to support decisions. For example, analytics is used in marketing to develop statistical models for acquiring customers and predicting the future profitability of customers. Analytics is used in risk management to identify fraud, to discover compromises in operations, and to reduce risk. Analytics is used in operations to improve business and operational processes.

Cloud computing also doesn’t yet have a standard definition. A good working definition is to define clouds as racks of commodity computers that provide on-demand resources and services over a network, usually the Internet, with the scale and the reliability of a data center.

There are two different, but related, types of clouds: the first category of clouds provide computing instances on demand, while the second category of clouds provide computing capacity on demand. Both use the same underlying hardware, but the first is designed to scale out by providing additional computing instances, while the second is designed to support data- or compute-intensive applications by scaling capacity. Amazon’s EC2 and S3 services are an example of the first type of cloud. The Hadoop system is an example of the second type of cloud.

Currently, as a platform for analytics, clouds offer several advantages:

  1. Building analytic models on very large datasets. “Hadoop style clouds” provide a very effective platform for developing analytic models on very large datasets.
  2. Scoring data using analytic models. Given an analytic model and some data (either a file of data or a stream of data), “Amazon style clouds” provide a simple and effective platform for scoring data. The Predictive Model Markup Language (PMML) has proved to be a very effective mechanism for moving a statistical or analytic model built using one analytic system into a cloud for scoring. Sometimes the terminology PMML Producer is used for the application that builds the model and PMML Consumer is used for the application that scores new data using the model. Using this terminology, “Amazon style clouds” can be used to score data easily using PMML models built elsewhere.
  3. Simplifying modeling environments. Finally, computing instances in a cloud can be built that incorporate all the analytic software required for building models, including preconfigured connections to all the data required for modeling. At least for small to medium size datasets, preconfiguring computing instances in this way can simplify the development of analytic models.
  4. Easy access to data. Clouds can also make it much easier to access data for modeling. Amazon has recently made available a variety of public datasets. For example, using Amazon’s EBS service, the U.S. Census data can be accessed immediately.

I’ll be one of the lecturers in two up coming courses on cloud analytics that introduce clouds as well as cloud analytics.

The first course will be taught in Chicago on June 22, 2009 and the second one in San Mateo on July 14, 2009.   You can register for the Chicago course using this registration link and the San Mateo course using this registration link.

This one day course will give a quick introduction to cloud computing and analytics. It describes several different types of clouds and what is new about cloud computing, and discusses some of the advantages and disadvantages that clouds offer when building and deploying analytic models. It includes three case studies, a survey of vendors, and information about setting up your first cloud.

The course syllabus can be found here: www.opendatagroup.com/courses.htm.

, , , , , , , , ,

No Comments