Open Source Software Projects

Augustus. Augustus is an open source Python-based application for building and deploying data mining and statistical models. The first release of Augustus was in 2005. Augustus is compliant with the Predictive Model Markup Language (PMML). Augustus supports vectorized operations. See

Sector/Sphere. Sector/Sphere is an open source cloud computing platform, first released in 2008, designed to manage and compute with large data. It was used to build the application that won the SC 08 and SC 09 Bandwidth Challenge. Unlike most other cloud computing platforms, Sector/Sphere is designed to operate not only within a data center but also across multiple geographically distributed data centers. See

UDT. From 1999 to 2003, I led the development of a high performance network protocol called SABUL. SABUL used a UDP-based data channel and a TCP-based control channel, and it set a number of milestones for high performance data transport on wide area OC-3 and OC-12 networks during this period. From 2003 through 2010, I co-led the development of a successor to SABUL called UDT. UDT is implemented entirely over UDP and provides reliable, fair, and friendly data transport for high volume data flows. UDT is open source and available on SourceForge. UDT is the basis for several commercial products and is widely deployed. See

PMML. From 1998 to 2010, I was the chair of the Data Mining Group’s working group on the Predictive Model Markup Language (PMML). PMML has now been adopted by most vendors of data mining and statistical software, including IBM, SAS, SPSS, and over ten other vendors. See

Previous Software Projects

DataSpace. From 1998 to 2004, I led the development of open source clients and servers to create an internet of linked data. These tools scaled to big data and provided a lightweight data integration framework that scaled across the Internet.

PATTERN. From 1996 to 2000, I led the development of Magnify’s PATTERN data mining system. PATTERN was a scalable data mining system sold and marketed by Magnify and used by a variety of financial and insurance companies, as well as in-house by Magnify. PATTERN employed a layered architecture consisting of a scalable column-oriented data warehouse, a scalable data mining system, and an XML-based infrastructure for quickly deploying predictive models. PATTERN was the first commercial data mining system to use ensemble-based modeling techniques. It was also the first to use taxonomy-based modeling techniques. See

PTool. From 1992 to 1996, I led the development of PTool, a scalable high performance persistent object manager designed for warehousing large data sets. Variants of PTool were later adopted by various scientific collaborations, including high energy physicists at Fermilab. PTool was used to create some of the earliest terabyte-size data warehouses of scientific and engineering data. PTool was also used as an infrastructure for the data analysis and data mining of very large data sets.