Posts Tagged cloud computing

What is Analytic Infrastructure and Why Should You Care?

I have been building analytic models for over 20 years. The names have changed a lot over the years: 20 years ago we built statistical models, 10 years ago we built data mining models, and today we build analytic models. The algorithms have changed some: classification and regression trees became common 20 years ago, support vector machines about 10 years ago, and today graph-based algorithms are popular.

Three pillars

Perhaps what has changed the most is my perspective.

Analytic algorithms and models. Twenty years ago, I was focused on algorithms and was concerned with the different types of models that you could build using different types of algorithms on different types of data. This worked fine as long as the data fit into the memory of the computer.

Analytic infrastructure. For better or worse I ran into problems that had so much data that the data was too big to fit into memory. Some projects required a disk, some required many disks, and a few required tertiary storage. I spent over two decades working on what you might call analytic infrastructure. I first worked on teams that developed for the high energy physics community specialized data management infrastructures that were optimized for efficient reads (instead of safe writes) and accessed the data by columns (instead of rows) in order to speed up numerical computations. These turned out to be some of the first examples of data warehouses (the name was not used at that time), increased by 1 to 3 orders of magnitude the size of data that we could model, and were heavily criticized by the database community. Of course, several years later the database community embraced data warehouses at least for reports, if not for data intensive computing and modeling.

Beginning about five years ago, I began working on what are today called cloud computing platforms. Again, this increases by 1 to 3 orders of magnitude the size of data that we can model, and again these have been heavily criticized by some in the database community as being a big step backwards.

I recently edited a special issue of the ACM SIGKDD Explorations about analytic infrastructure. In an article there, I define analytic infrastructure as the applications, services, utilities and systems that are used for either preparing data for modeling, estimating models, validating models, scoring, or related analytic activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling but simply useful as part of the modeling process. The article is available as a pdf from the SIGKDD Explorations web site (it’s Issue 1 in Volume 11).

I don’t really like this definition and encourage you to provide a better one. What is important though is that using the appropriate analytic infrastructure is critical to building models for problems with so much data that simply putting it into memory and forgetting about it is not a viable solution.

Analytic Strategy. Returning to how my perspective has evolved, for the past several years, I have become increasingly concerned with what is usually called analytic strategy. Analytic strategy is concerned with making sure you are asking the right analytic question, that you are building a model that can be deployed efficiently, that the output of the model is actionable, that the actions have a business impact, the business impact is aligned with corporate strategy, that there is an appropriate governance process in place, and related questions.

My perspective these days is that analytics requires a firm foundation and that the foundation has three columns: 1) analytic strategy; 2) analytic infrastructure; and 3) analytic algorithms and models.

The picture is by Alyson Hurt.

, , , , , ,

No Comments

Open Source Cloud Computing Software at SC 09

SC 09 is in Portland this coming week from November 14 to 20. The Laboratory for Advanced Computing will have a booth and be showcasing a number of open source cloud computing technologies including:

Sector. Sector/Sphere is a high performance storage and compute cloud that scales to wide area networks. With Sector’s simplified parallel programming framework, you can easily apply a user defined function (UDF) to datasets that fill data centers. The current version of Sector is version 1.24 and includes support for streams and multiple master servers. Sector was the basis for an application that won the SC 08 Bandwidth Challenge. For more information, see sector.sourceforge.net.

As measured by the MalStone Benchmark, Sector was over 2x fast as Hadoop. Sector was one of six technologies selected by SC 09 as a disruptive technology.

How efficient is your cloud?

This snapshot is from the LAC Cloud Monitor monitoring a Sector computation on the Open Cloud Testbed.

Cistrack. The Chicago Utilities for Biological Science or CUBioS is a set of integrated utilities for managing, processing, analyzing and sharing biological data. CUBioS integrates databases with cloud computing to provide an infrastructure that scales to high throughput sequencing platforms. CUBioS uses the Sector/Sphere cloud to process images produced by high throughput sequencing platforms. Cistrack is a CUBioS instance for cis-regulatory data. For more information, see www.cistrack.org.

Canopy. With clouds, it is now possible with a portal to create, monitor, and migrate Virtual Machines (VMs). With the open source Canopy application, it is now possible to create, monitor and migrate Virtual Networks containing multiple VMs connected with virtualized network infrastructure. Canopy provides a standardized library of functions to programatically control switch VLAN assignments to create VNs at line speed. Canopy is an open source project with an alpha releases planned for 2010.

UDT. UDT is a widely deployed (with millions of deployed instances) application level network transport protocol designed for large data transfers over wide area high performance networks. For more information, see udt.sourceforge.net.

UDX. UDX is a version of UDT that is designed for wide area high performance research and corporate networks within a single security domain (UDX does not contain the code UDT uses for transversing fire walls). In recent tests, UDX was able to achieve over 9.2 Gbps on a 10 Gbps wide area testbed. For more information, see udt.sourceforge.net.

LAC Cloud Monitor (LACCM). The LAC Cloud Monitor is a low overhead monitor for clouds that gathers system performance for thousands of servers along multiple dimensions. It integrates with the Argus Monitoring System and Nagios for logging and alerting. LACCM is used to monitor the OCC Open Cloud Testbed. LACCM is open source.

LAC Cloud Scheduler (LACCS)The LAC Cloud Scheduler (LACCS) is a system for scheduling clouds for exclusive use by researchers. It is simple to use, scalable, and easy to deploy. Using LACCS, multiple groups can share easily a local or wide area cloud. LACCS is used for scheduling the Open Cloud Testbed. LACCS is open source.

This is a segment that aired on WTTW’s Chicago Matters about cloud computing that described the Sector/Sphere and the Open Cloud Testbed. You need to select the episode on the right hand side of the page dated November 10, 2009 and titled “Chicago Matters Beyond Burnham (9:40)”

, , , , , ,

3 Comments

Revisiting the Case for Cloud Computing

The backlash to the hype over cloud computing is in full swing. I have given a number of talks on cloud computing over the past few months and have been struck by a few things.

First, at an industry event that I attended, although there were quite a few talks on cloud computing (it was one of the tracks), it seems that only a small number of speakers had actually participated in a cloud computing project and I was was one of only a handful that had actually completed several cloud computing projects. Many of the other speakers were simply summarizing second and third hand reports about cloud computing. In my opinion, something was lost in the translation.

Rack of servers

Second, I think some of the backlash has gone to far. At one breakfast meeting I attended, there were essentially no acknowledgement of the potential today that clouds offer, simply emphasis on why “real companies” that have to worry about security could never use (public) clouds. Private and condo clouds were not mentioned as alternatives for companies whose security or compliance requirements preclude the use of today’s public clouds. The trade-off, which is always present, that balances potential breaches from performing certain operations in public clouds, from the productivity gains that such clouds can provide was also not mentioned.

Because of this backlash, I think it is a good time to revisit the case for cloud computing. There are three basic reasons for deploying certain operations to clouds:

Cost savings. By employing virtualization and making use of the economies of scale that cloud service providers can take advantage of, deploying certain operations to clouds can lead to improved efficiencies. This advantage seems to be well understood, and is, for example, one of the factors driving the Federal CIO’s push for cloud computing. See for example, the recent RFQ from the GSA for a cloud computing store front.

Productivity. The Elastic, virtualized services that clouds provide lead directly to productivity improvements. As a simple example, I was building an analytic model over the weekend to meet a deadline and the computation took over 4 hours. Since I was using a virtualized resource in a cloud, I was able to use the portal that controlled the various machine images to double the memory in my resource. Five minutes later, I had a new virtualized image and the computation now took less than 5 minutes. (By the way, this is typical of analytic computations. When the data is so large that a computation can no longer be done in memory and requires accessing the disk, the time required increases dramatically.) If, instead, I had gone through a standard procurement process to get a new machine with twice the memory, it would have been quite some time before the model would have been completed.

As another example, I work with a Fortune 500 client in which the analytic models are taking weeks to build instead of days because the modeling environment does not have enough disk space for the entire team to hold all the temporary files and datasets required when building analytic models nor powerful enough computers for models to be computed fast enough to provide timely feedback to the modeler. This is unfortunately fairly typical of modeling environments in Fortune 500 companies (I’ll discuss this situation in a later post). A simple cloud would dramatically improve the situation.

New capabilities. Clouds also provide new capabilities. For example, large data clouds enable the processing and analysis of large datasets that was simply not possible with architectures that manage the data using databases. As a simple example, the type of analytic computations abstracted by the MalStone Benchmark are relatively straightforward, even when there are 100 TB of data, using a Hadoop or Sector based cloud, but in practice not practical using a traditional database when the data is that size.

What’s new. Many of the ideas behind cloud computing are quite old. On the other hand, the combination of: 1) the scale, 2) the utility based pricing, and 3) the simplicity provided by cloud computing make cloud computing a disruptive technology. If you are interested in understanding cloud computing from this point of view, you might find a recent talk I gave for an IEEE Conference on New Technologies called My Other Computer is a Data Center interesting. There is also a written version of a portion of the that recently appeared in the IEEE Bulletin on Data Engineering called On the Varieties of Clouds for Data Intensive Computing.

The image is by John Seb and is available from Flickr under the Creative Commons license.

, , , , ,

1 Comment

Some Reasons to Consider Condominium Clouds (Condo Clouds)

In this post, I’ll introduce condominium clouds and discuss some of their potential for changing computing. From an architectural point of view, condominium clouds are essentially the same as private clouds. Condominium clouds have a different business model though, which, in certain circumstances provides some definite advantages.

I argue here that condominium clouds and related offerings represent a fundamental shift in our computing platforms. To explain this, I’ll take a short detour and recall a computing experience I had about a decade ago and the business model (condominium fiber) that made these types of experiences available to a broader community.

Some racks in data center.

One of most exciting technical experiences I have had occurred in 2000 when I ran a distributed data intensive computing application over a dedicated 155 Mbps network link connecting clusters located at NCAR in Boulder and the University of Michigan in Ann Arbor. Prior to that I only had access to 1.5 Mbps networks and these networks were shared by the rest of the campus. The application was able to perform sustained computation at about 96 Mbps, which was not bad considering that each computer was limited by a 100 Mbps NIC. Reaching a 96 Mbps over a wide area network was quite difficult at that time, but we did this using a new network protocol that was the precursor to UDT. The reason for our excitement was that one day we were were limited to distributed computations that rarely reached 1 Mbps, while the next day we reached 96 Mbps, almost two orders of magnitude improvement.

By 2003, with improved protcols and 10 Gbps networks, sustained distributed computations reached 6.8 Gbps. Within a four year span, we had passed through an inflection point in which high performance distributed computing improved by over 3 orders of magnitude. Three things were required:

  • A new computing platform, in this case, clusters connected by wide area, high performance networks.
  • A new network protocol and associated libraries, since TCP was not effective at data intensive computing over wide area high performance networks.
  • A new business model, which made high performance wide area networks more broadly available.

Let’s turn now to cloud computing. Cloud computing has two faces: the most familiar face offers utility-based pricing, on-demand elastic availability, and infrastructure as a service. There is no doubt that this combination is changing the face of computing. On the other hand, the other side of cloud computing is just as important. This side is about thinking of the data center as your unit of computing. Previously you probably thought of computing as requiring a certain number of racks. With cloud computing, you now think of computing as requiring a certain number of data centers. This is computing measured with Data Center Units or DCUs.

The problem is acquiring computing at the scale of data centers is prohibitive except for handful of companies (Google, Microsoft, Yahoo, IBM, …)

This is where the condominium clouds enter. But first, here is a description of customer owned and condominium fiber from a 2002 FAQ titled “FAQ about Community Dark Fiber Networks” written by Bill St Arnaud:

Dark fiber is optical fiber, dedicated to a single customer and where the customer is responsible for attaching the telecommunications equipment and lasers to “light” the fiber. Traditionally optical fiber networks have been built by carriers where they take on the responsibility of lighting the fiber and provide a managed service to the customer.

Professional 3rd parties companies who specialize in dark fiber systems take care of the actual installation of the fiber and also maintain it on behalf of the customer. Technically these companies actually own the fiber, but sell IRUs (Indefeasible Rights of Use) for up to 20 years for unrestricted use of the fiber.

All across North America businesses, school boards and municipalities are banding together to negotiate deals to purchase customer owned dark fiber. A number of next generation service providers are now installing fiber networks and will sell strands of fiber to any organization who wish to purchase and manage their own dark fiber.

Many of these new fiber networks are built along the same model as a condominium apartment building. The contractor advertises the fact that they intend to build a condominium fiber network and offers early participants special pricing before the construction begins. That way the contractor is able to guarantee early financing for the project and demonstrate to bankers and other investors that there are some committed customers to the project.

The condominium fiber is operated like a condominium apartment building. The individual owners of fiber strands can do whatever they want they want with their individual fiber strands. They are free to carry any type of traffic and terminate the fiber any way they so choose. The company that installs the fiber network is responsible for overall maintenance and repairing the fiber in case of breaks, moves, adds or changes. The “condominium manager” charges the owners of the individual strands of fiber a small annual maintenance fee which covers all maintenance and right of way costs.

The initial primary driver for dark fiber by individual customers is the dramatic savings in telecommunication costs. The reduction in telecommunication costs can be in excess of 1000% depending on your current bandwidth requirements.

It is now easy to explain condominium clouds. For those who cannot afford private clouds at the scale of data centers, condominium clouds became a way to share the expense with other members of the condominium.

The condominium cloud model is also attractive if there are compliance issues or security issues that make a private cloud desirable, but your scale is such that justifying your own private cloud at the scale of a data center does not make sense.

As with condominium fiber, professionals would build and operate the data center. One way of looking at condominium clouds is as a more cost effective private clouds for certain organizations or associations that might benefit from the scale and operational control that data centers offer.

Condominium clouds might make sense for companies in a regulated industry that belong to an association that can manage the condominium. They would also make sense for scientific collaborations, especially those with large data. Also, although the business model would be slightly different, government organizations that couldn’t justify their own cloud could work together and jointly manage a condominium cloud.

The image above is courtesy of Cory Doctorow.

, , , ,

2 Comments

The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing

The TPC Benchmarks have played an important role in comparing databases and transaction processing systems. Currently, there are no similar benchmarks for comparing two clouds.

Benchmark

The CloudStone Benchmark is a first step towards a benchmark for clouds designed to support Web 2.0 type applications. In this note, we describe the MalStone Benchmark, which is a first step towards a benchmark for clouds, such as Hadoop and Sector, designed to support data intensive computing.

MalStone is a stylized analytic computation of a type that is common in data intensive computing. The open source code to generate data for MalStone and a technical report describing MalStone and providing some sample implementations can be found at: code.google.com/p/malgen (look in the feature downloads section along the right hand side).

Detecting Drive-By Exploits from Log Files

We introduce MalStone with a simple example. Consider visitors to web sites. As described in the paper The Ghost in the Browser by Provos et. al. that was presented at HotBot ‘07, approximately 10% of web pages have exploits installed that can infect certain computers when users visit the web pages. Sometimes these are called “drive-by exploits.”

The MalStone benchmark assumes that there are log files that record the date and time that users visited web pages. Assume that the log files of visits have the following fields:

   | Timestamp | Web Site ID | User ID

There is a further assumption that if the computers become infected, at perhaps a later time, then this is known. That is for each computer, which we assume is identified by the ID of the corresponding user, it is known whether at some later time that computer has become compromised:

   | User ID | Compromise Flag

Here the Compromise field is a flag, with 1 denoting a compromise. A very simple statistic that provides some insight into whether a web page is a possible source of compromises is to compute for each web site the ratio of visits in which the computer subsequently becomes compromised to those in which the computer remains uncompromised.

We call MalStone stylized since we do not argue that this is a useful or effective algorithm for finding compromised sites. Rather, we point out that if the log data is so large that it requires large numbers of disks to manage it, then computing something as simple as this ratio can be computationally challenging. For example, if the data spans 100 disks, then the computation cannot be done easily with any of the databases that are common today. On the other hand, if the data fits into a database, then this statistic can be computed easily using a few lines of SQL.

The MalStone benchmarks use records of the following form:

   | Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

Here site abstracts web site and entity abstracts the possibly infected computer. We assume that each record is 100 bytes long.

In the MalStone A Benchmarks, for each site, the number of records for which an entity visited the site and subsequently becomes compromised is divided by the total number of records for which an entity visited the site. The MalStone B Benchmark is similar, but this ratio is computed for each week (a window is used from the beginning of the period to the end of the week of interest). MalStone A-10 uses 10 billion records so that in total there is 1 TB of data. Similarly, MalStone A-100 requires 100 billion records and MalStone A-1000 requires 1 trillion records. MalStone B-10, B-100 and B-1000 are defined in the same way.

TeraSort Benchmark

One of the motivations for choosing 10 billion 100-byte records is that the TeraSort Benchmark (sometimes called the Terabyte Sort Benchmark) also uses 10 billion 100-byte records.

In 2008, Hadoop became the first open source program to hold the record for the TeraSort Benchmark. It was able to sort 1 TB of data using using 910 nodes in 209 seconds, breaking the previous record of 297 seconds. Hadoop set a new record in 2009 by sorting 100 TB of data at 0.578 TB/minute using 3800 nodes. For some background about the TeraSort Benchmark, see the blog posting by Jamie Hamilton Hadoop Wins Terasort.

Note that the TeraSort Benchmark is now deprecated and has been replaced by the Minute Sort Benchmark. Currently, 1 TB of data can be sorted in about a minute given the right software and sufficient hardware.

Generating Data for MalStone Using MalGen

We have developed a generator of synthetic data for MalStone called MalGen. MalGen is open source and available from code.google.com/p/malgen. Using MalGen, data can be generated with power law distributions, which is useful when modeling web sites (a few sites have a lot of visitors, but most sites have relatively few visitors).

Using MalStone to Study Design Tradeoffs

Recently, we did several experimental studies comparing different implementations of MalStone on 10 billion 100-byte records. The experiments were done on 20 nodes of the Open Cloud Testbed. Each node was a Dell 1435 computer with 12 GB memory, 1TB disk, 2.0GHz dual dual-core AMD Opteron 2212, and 1 Gb/s network interface cards.

We compared three different implementations: 1) Hadoop HDFS with Hadoop’s implementation of MapReduce; 2) Hadoop HDFS using Streams and coding MalStone in Python; and 3) the Sector Distributed File System (SDFS) and coding the algorithm using Sphere User Defined Functions (UDFs).

MalStone A
Hadoop MapReduce 454m 13s
Hadoop Streams/Python 87m 29s
Sector/Sphere UDFs 33m 40s
MalStone B
Hadoop MapReduce 840m 50s
Hadoop Streams/Python 142m 32s
Sector/Sphere UDFs 43m 44s

Please note that these timings are still preliminary and may be revised in the future as we better optimize the implementations.

If you have 1000 nodes and want to run a data intensive or analytic computation, then Hadoop is a very good choice. What these preliminary benchmarks indicate though is that you may want to compare the performance of Hadoop MapReduce and Hadoop Streams. In addition, you may also want to consider using Sector.

The image above is from Strolling everyday and available via a Creative Commons license.

Disclaimer: I am involved in the development of Sector.

, , , , , , , , , , , ,

7 Comments