analytic infrastructure
The Data Center as the Unit of Computing
Posted by Robert Grossman in Blog, analytic infrastructure, cloud computing, data intensive computing on July 27, 2010
I’m at the KDD 2010 conference this week in Washington, D.C.. On Sunday, I gave the keynote in the The 2nd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010), which was one of the workshops co-located with conference. The title of my talk was “My Other Computer is a Data Center: The Sector Perspective on Big Data.” You can download the talk from Slideshare.

The first part of the talk argued that it may be useful to think of a data center as a “device” for extracting relationships from data, in broadly the same way that we view a telescope as a device for looking at things that are very far away and a microscope as a device for looking at things that are very small. Continuing in this way, you can think of a supercomputer as a device for computing simulations.
The table below is my rough “back of the envelope” computation of the scale up provided by each of these devices over what was possible before (these scale up numbers are very rough and if you have better numbers, please let me know).
In each of these cases, the device resulted in some pretty interesting new science. So it is interesting to speculate what type of new science might arise when you think of a data center for extracting patterns from very large collections of data.
In the second part of the talk, I described at a very high level some of the components and layers in a software stack for data center device.
| Instrument | Year | Scale up |
| Telescope | 1609 | 30x |
| Microscope | 1670 | 250x |
| Supercomputing | 1976 | 10x-100x |
| Data center | 2003 | 10x-100x |
What is Analytic Infrastructure and Why Should You Care?
Posted by Robert Grossman in Blog, analytic infrastructure, analytic strategy on February 16, 2010
I have been building analytic models for over 20 years. The names have changed a lot over the years: 20 years ago we built statistical models, 10 years ago we built data mining models, and today we build analytic models. The algorithms have changed some: classification and regression trees became common 20 years ago, support vector machines about 10 years ago, and today graph-based algorithms are popular.

Perhaps what has changed the most is my perspective.
Analytic algorithms and models. Twenty years ago, I was focused on algorithms and was concerned with the different types of models that you could build using different types of algorithms on different types of data. This worked fine as long as the data fit into the memory of the computer.
Analytic infrastructure. For better or worse I ran into problems that had so much data that the data was too big to fit into memory. Some projects required a disk, some required many disks, and a few required tertiary storage. I spent over two decades working on what you might call analytic infrastructure. I first worked on teams that developed for the high energy physics community specialized data management infrastructures that were optimized for efficient reads (instead of safe writes) and accessed the data by columns (instead of rows) in order to speed up numerical computations. These turned out to be some of the first examples of data warehouses (the name was not used at that time), increased by 1 to 3 orders of magnitude the size of data that we could model, and were heavily criticized by the database community. Of course, several years later the database community embraced data warehouses at least for reports, if not for data intensive computing and modeling.
Beginning about five years ago, I began working on what are today called cloud computing platforms. Again, this increases by 1 to 3 orders of magnitude the size of data that we can model, and again these have been heavily criticized by some in the database community as being a big step backwards.
I recently edited a special issue of the ACM SIGKDD Explorations about analytic infrastructure. In an article there, I define analytic infrastructure as the applications, services, utilities and systems that are used for either preparing data for modeling, estimating models, validating models, scoring, or related analytic activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that with this definition analytic infrastructure does not need to be used exclusively for modeling but simply useful as part of the modeling process. The article is available as a pdf from the SIGKDD Explorations web site (it’s Issue 1 in Volume 11).
I don’t really like this definition and encourage you to provide a better one. What is important though is that using the appropriate analytic infrastructure is critical to building models for problems with so much data that simply putting it into memory and forgetting about it is not a viable solution.
Analytic Strategy. Returning to how my perspective has evolved, for the past several years, I have become increasingly concerned with what is usually called analytic strategy. Analytic strategy is concerned with making sure you are asking the right analytic question, that you are building a model that can be deployed efficiently, that the output of the model is actionable, that the actions have a business impact, the business impact is aligned with corporate strategy, that there is an appropriate governance process in place, and related questions.
My perspective these days is that analytics requires a firm foundation and that the foundation has three columns: 1) analytic strategy; 2) analytic infrastructure; and 3) analytic algorithms and models.
The picture is by Alyson Hurt.
Ten Years of the SC XY Bandwidth Challenge
Posted by admin in Blog, analytic infrastructure, analytics on November 30, 2009
The SC 09 Conference took place early this month in Portland. The Bandwidth Challenge (BWC) is an interesting and friendly rivalry between research groups to develop high performance network protocols and interesting applications that use them. The Bandwidth Challenge was started ten years ago at SC 99, which also took place in Portland.
Some of the history is available at the web site scinet.supercomputing.org. For example, in 2000, there were 2 OC-48 (2.5 Gbps) circuits that connected the research exhibits at the conference to external research networks and the challenge was to develop network protocols and applications that could fill these circuits. The winner of the BWC (called the Network Bandwidth Challenge in 2000) was a scientific visualization application called Visapult that reached 1.48 Gbps and transferred 262 GB in 1 hour (providing 582 Mbps of sustained bandwidth utilization).
This year, there were approximately 24 10 GE circuits and one 40 GE circuit that connected research exhibits to external exhibits and one of the applications reached a bandwidth utilization of over 114 Gbps.
I have had an interest in the BWC over the years, because you cannot analyze data without accessing it and accessing and transporting large remote datasets has always been a challenge. To say it slightly different, for large datasets and high performance networks, network transport protocols are an important element of the analytic infrastructure.
It’s useful to know the bandwidth delay product of a network, which is the product of the network capacity (in Mbps, say) multiplied by the round trip time (RTT) of a packet (in sec). This measures the amount of data on the network that has been transmitted but not yet received. This can be MB of data for wide area high performance networks. This data must be buffered so that it can be resent if a packet is not received.
Challenges that have been worked out over the past decade include:
- Improving TCP so that it is effective over networks with high bandwidth delay products. One of the successes is the development of FAST TCP, a variant of the TCP protocol.
- Developing reliable and friendly UDP-based protocols that are effective over networks with high bandwidth delay products. For example, the open source UDT protocol has proved over time to be quite effective. (Disclosure: I have been involved in the development of the UDT protocol.)
- Developing architectures that are effective for high end-to-end performance for transporting large datasets, from disks at one end to disks at the other end.
For the past several years, it has been relatively routine for applications using FAST TCP or UDT to fill a wide area 10 Gbps network link or multiple 10 Gbps network links, if these are available.
Today’s problems include:
- Connecting data intensive devices and applications to high performance networks. For example, with high throughput sequencing, biology is becoming data intensive, yet very few high throughput sequencing devices are connected to high performance research networks.
- Incorporating the appropriate network protocols into data intensive applications. For example, one of the reasons, the Sector/Sphere cloud is effective over wide area networks is that it is based upon UDT and not TCP. (Disclosure: I have been involved in the development of the Sector/Sphere cloud.)
I ran into the first problem just after I got back from SC 09. At SC 09, we ran a number of wide area data intensive applications, and in fact won the 2009 BWC for these applications. For example, a new variant of UDT called UDX reached 9.2 Gbps over a network link with 200 ms RTT. In contrast, as soon as I got back to Chicago, I worked for a couple of days trying to get access to 200 GB of sequence data, since the sequencing instrument that produced it was not connected to a high performance network. With the device connected to a high performance research network, the data would have been available in a few minutes.
To summarize, today network experts are comfortable designing systems that can easily fill wide area 10 GE networks, but most analytic applications are not designed to use the required protocols or to to take advantage of high performance networks, and most do not have access to the required networks, even if the applications could benefit from them.
In disciplines, like biology, that are becoming data intensive, this type of analytic infrastructure will provide distinct competitive advantages.
Open Source Cloud Computing Software at SC 09
Posted by admin in Blog, analytic infrastructure, cloud computing on November 11, 2009
SC 09 is in Portland this coming week from November 14 to 20. The Laboratory for Advanced Computing will have a booth and be showcasing a number of open source cloud computing technologies including:
Sector. Sector/Sphere is a high performance storage and compute cloud that scales to wide area networks. With Sector’s simplified parallel programming framework, you can easily apply a user defined function (UDF) to datasets that fill data centers. The current version of Sector is version 1.24 and includes support for streams and multiple master servers. Sector was the basis for an application that won the SC 08 Bandwidth Challenge. For more information, see sector.sourceforge.net.
As measured by the MalStone Benchmark, Sector was over 2x fast as Hadoop. Sector was one of six technologies selected by SC 09 as a disruptive technology.

This snapshot is from the LAC Cloud Monitor monitoring a Sector computation on the Open Cloud Testbed.
Cistrack. The Chicago Utilities for Biological Science or CUBioS is a set of integrated utilities for managing, processing, analyzing and sharing biological data. CUBioS integrates databases with cloud computing to provide an infrastructure that scales to high throughput sequencing platforms. CUBioS uses the Sector/Sphere cloud to process images produced by high throughput sequencing platforms. Cistrack is a CUBioS instance for cis-regulatory data. For more information, see www.cistrack.org.
Canopy. With clouds, it is now possible with a portal to create, monitor, and migrate Virtual Machines (VMs). With the open source Canopy application, it is now possible to create, monitor and migrate Virtual Networks containing multiple VMs connected with virtualized network infrastructure. Canopy provides a standardized library of functions to programatically control switch VLAN assignments to create VNs at line speed. Canopy is an open source project with an alpha releases planned for 2010.
UDT. UDT is a widely deployed (with millions of deployed instances) application level network transport protocol designed for large data transfers over wide area high performance networks. For more information, see udt.sourceforge.net.
UDX. UDX is a version of UDT that is designed for wide area high performance research and corporate networks within a single security domain (UDX does not contain the code UDT uses for transversing fire walls). In recent tests, UDX was able to achieve over 9.2 Gbps on a 10 Gbps wide area testbed. For more information, see udt.sourceforge.net.
LAC Cloud Monitor (LACCM). The LAC Cloud Monitor is a low overhead monitor for clouds that gathers system performance for thousands of servers along multiple dimensions. It integrates with the Argus Monitoring System and Nagios for logging and alerting. LACCM is used to monitor the OCC Open Cloud Testbed. LACCM is open source.
LAC Cloud Scheduler (LACCS)The LAC Cloud Scheduler (LACCS) is a system for scheduling clouds for exclusive use by researchers. It is simple to use, scalable, and easy to deploy. Using LACCS, multiple groups can share easily a local or wide area cloud. LACCS is used for scheduling the Open Cloud Testbed. LACCS is open source.
This is a segment that aired on WTTW’s Chicago Matters about cloud computing that described the Sector/Sphere and the Open Cloud Testbed. You need to select the episode on the right hand side of the page dated November 10, 2009 and titled “Chicago Matters Beyond Burnham (9:40)”
What is the “Unit” of Cloud Computing? Virtual Machines, Virtual Networks, and Virtual Data Centers
Posted by admin in Blog, analytic infrastructure, cloud computing on October 21, 2009
This is a post that summarizes some conversations that Stuart Bailey (from Infoblox) and I have been having.
There is a lot of market clutter today about cloud computing and it can be challenging at times to identify the core technical issues. Sometimes it is helpful with an emerging technology to ask the question: “What is the ‘unit’ of deployment for the technology?” There are two important related questions: “How are the units named?” “How do the units communicate?”

Sometimes the perspective matters.
Before we think about the answers for cloud computing, let’s warm up with some other examples.
- For the web, the “unit” is the web page, web pages are identifid by URLs (or URIs), and the units “communicate” using HTTP and related protocols. Of course, web pages aggregate into web sites.
- In networking, the “unit” is the IP address (at Layer 3) or the MAC address (at Layer 2) and DNS is the link between URLs and IP addresses (allowing them to communicate), while ARP (or NDP in IPv6) is the link between MAC addresses and IP addresses.
- In grid computing, the “unit” is a computer in a cluster (”a grid resource”) and computers commnicate using the Message Passing Interface (MPI).
Depending upon your perspective and your role in the cloud computing eco-system, you could argue that any of the following are the units:
Infrastructure Perspective
- A virtual machine (VM).
- A virtual network (VN), consisting of multiple VMs and all required information to network the VMs.
- A virtual data center (VDC), consisting of one or more VNs.
Data/Content/Resource Perspective
- An identifier specifying the name of a resource for a cloud storage service. Examples include an object managed by Amazon’s S3 service, or a file managed by the Hadoop Distributed File System (HDFS).
- An identifier specifying the name of a data resource for a cloud data service. Examples include a domain (database table) manged by Amazon’s SimpleDB service or a table (or row) manged by a BigTable-like service.
Once we take this point of view, a number of issues become much easier to discuss.
Intercloud Protocols. Today with clouds, we are in the same situation that networking was before Internet protocols enabled internetworking by supporting communication between networks. Until TCP and related Internet protocols were developed, there were not agreed upon standards identifying the appropriate entities and layers nor for passing names of entities between layers. We can ask what are the appropriate mechanisms for naming VMs, VNs and VDCs, as well as cloud and tables services, how do we pass the names of objects between layers, and how do the objects in the infrastructure stack communicate with objects in the data stack.
Virtual networks also count. Most of the cloud virtualization discussion today focuses on VMs and their migration, but it is just as essential to support VNs and their migration. If we look to how IP addresses arose, then it is tempting to think about using names for VMs that include information about VNs. Today, depending upon the units we feel are important, we will need layers in the cloud for naming and linking VMs, VNs and VDCs, not just VMs.
Removing the distinction between clouds and large data clouds. There are two fundamentally different approaches to cloud services for storage or data. In the first, there is an implicit assumption that the storage or data service must fit in a single VM (S3) or other device (such as NAS). In the second, the whole point is to develop cloud storage and data services that span multiple VMs and devices (Google’s GFS/MapReduce/BigTable), Hadoop HDFS/MapReduce, Sector Distributed File System/Sphere UDFs, etc.).
Services that link virtual infrastructure and data. In many discussions, no effort is made to span the virtual infrastructure perspective entities (VMs, VNs) with the data perspective. One simple approach is to provide a dynamic infrastructure service so that data/content/resource services could easily determine which VMs and VNs support their service (there is usally done with static configuration files today). With this approach, large data cloud services are simply data/content/resource services that are engineered to scale to multiple VMs (and perhaps VNs).
Scaling to services to data centers. One of attributes that I think is a core attribute of certain types of clouds, is for a service to scale beyond a single machine or VM to an entire data center or VDC. Defining these types of scalable services is something that is relatively easy to do from the perspective here.
Acknowledgements: The photograph is from the Flickr photostream of bourget_82 and was posted with a Attribution-No Derivative Works 2.0 Generic Creative Commons License.