A Tool for Keeping Big Data in Sync

The rsync utility is a wonderfully useful tool for keeping two datasets synchronized, but it was never designed for large datasets that are separated by a long distance. Over the past couple of years, we at the Laboratory for Advanced Computing at the University of Chicago have developed a utility called UDR, which integrates rsync with the high-performance network protocol UDT.

UDT is a reliable UDP-based protocol that was designed to move large datasets over wide-area, high-performance networks. UDT is open source and has been used as the basis for more than six commercial products.

UDR is open source and available from GitHub.

Here are the results of some tests conducted by Erich Weiler of the University of California at Santa Cruz, moving genomic data:

Source      Destination  UDR       rsync
Santa Cruz  Milwaukee    500 Mb/s  160 Mb/s
Santa Cruz  Detroit      600 Mb/s  150 Mb/s
Santa Cruz  Bielefeld    600 Mb/s  6 Mb/s
Santa Cruz  Aarhus       350 Mb/s  6 Mb/s
Santa Cruz  Brisbane     550 Mb/s  3 Mb/s
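
To make the gap easier to read, here is a small Python sketch that computes how many times faster UDR was than rsync for each destination. The throughput figures are copied from the results above; the names `results`, `speedup`, and `speedups` are only illustrative and are not part of UDR.

```python
# Throughput in Mb/s as (UDR, rsync), copied from the results above.
# Every transfer originates in Santa Cruz.
results = {
    "Milwaukee": (500, 160),
    "Detroit": (600, 150),
    "Bielefeld": (600, 6),
    "Aarhus": (350, 6),
    "Brisbane": (550, 3),
}


def speedup(udr_mbps, rsync_mbps):
    """Return the ratio of UDR throughput to rsync throughput."""
    return udr_mbps / rsync_mbps


# Map each destination to its UDR-over-rsync ratio.
speedups = {dest: speedup(udr, rsync) for dest, (udr, rsync) in results.items()}

for dest, ratio in speedups.items():
    print(f"{dest:<10} {ratio:6.1f}x")
```

The ratio is about 3x to 4x for the two U.S. destinations and grows to between roughly 58x and 183x for the overseas ones, which is where plain rsync throughput collapses.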

Allison Heath is the Project Lead for UDR.
