Speeding Up Science Data Transfers Between Department of Energy Facilities
May 18, 2009
Contact: Linda Vu, CSnews@lbl.gov
As scientists conduct cutting-edge research with ever more sophisticated techniques, instruments, and supercomputers, the data sets that they must move, analyze, and manage are increasing in size to unprecedented levels. The ability to move and share data is essential to scientific collaboration, and in support of this activity, network and systems engineers from the Department of Energy's (DOE) Energy Sciences Network (ESnet), National Energy Research Scientific Computing Center (NERSC) and Oak Ridge Leadership Computing Facility (OLCF) are teaming up to optimize wide-area network (WAN) data transfers.
OLCF, located at Oak Ridge National Laboratory in Tennessee, and NERSC, located at Lawrence Berkeley National Laboratory in California, are home to some of the fastest supercomputers in the world. OLCF is one of two DOE Leadership Computing Facilities, and NERSC provides computing resources to 3,000 researchers supported by the DOE Office of Science. A number of research groups use resources at both centers. ESnet, DOE's high-speed network, connects the two centers, as well as other national labs and universities around the country.
With the installation and deployment of new dedicated data transfer nodes at NERSC and OLCF linked by ESnet, researchers are now able to move large data sets between each facility's mass storage systems at a rate of 200 megabytes per second (MB/sec). At this rate, 74 terabytes of information in the U.S. Library of Congress' digital collection could be transferred in approximately four days.
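The arithmetic behind that comparison is straightforward; a short sketch (using decimal units, 1 TB = 10^12 bytes, as is conventional for network rates) confirms the figure quoted above:

```python
def transfer_time_days(data_bytes, rate_bytes_per_sec):
    """Return how many days a transfer takes at a sustained rate."""
    return data_bytes / rate_bytes_per_sec / 86_400  # 86,400 seconds per day

TB = 10**12  # terabyte (decimal)
MB = 10**6   # megabyte (decimal)

# 74 TB (the Library of Congress digital collection) at 200 MB/sec
days = transfer_time_days(74 * TB, 200 * MB)
print(f"{days:.1f} days")  # prints "4.3 days" -- approximately four days
```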
“Our goal is to enable the scientists to rapidly move large-scale data sets between supercomputer centers as dictated by the needs of the science. High-performance networking has become critical to science due to the size of the data sets and the wide scope of collaboration that are characteristic of today's large science projects such as climate research and high energy physics,” said Eli Dart, a network engineer for ESnet, which is managed by Lawrence Berkeley National Laboratory.
According to Jason Hick, NERSC Mass Storage Group lead, WAN transfers between NERSC and OLCF increased by a factor of 20 with the new dedicated nodes that are tuned and optimized specifically for wide-area transfers. Prior to this installation, wide-area data transfers between the two sites used infrastructure and tools that were tuned and optimized for local-area transfers. This slowed data movement between the two supercomputing centers, creating a bottleneck to scientific progress.
“The data transfer effort will enable breakthroughs for researchers of both centers,” said Josh Lothian, of OLCF’s HPC Operations group, and the key liaison between OLCF and NERSC. “The researchers using these powerful systems need to transfer data at the extreme scale. The ability to quickly and easily share data between supercomputing centers will streamline workflows and free scientists to focus on their science, rather than on the transfer technology.”
“Collaboration is critical to science, and sharing information in a timely manner is critical to a successful collaboration,” says Hai Ah Nam, a computational scientist in the OLCF Scientific Computing Group who is currently researching the fundamental nuclear properties of carbon-14, in collaboration with scientists from Lawrence Livermore National Laboratory (LLNL) and Iowa State University. They are studying the reason for the anomalously long half-life of carbon-14, useful in dating organic remains from geological and archeological samples.
“When ideas are flying and emails allow us to have nearly immediate communications despite our dispersed locations, a sure-fire way to slow the momentum in a project is to expect your collaborators to wait days, even weeks for data to transfer. I admit to waiting more than an entire workday for a 33 GB input file to scp, and feeling extremely discouraged knowing I had 20 more to transfer,” she adds.
Nam says she now transfers about 40 terabytes of information between NERSC and OLCF for each of the nuclei she studies. After her collaborators at LLNL finish computing at NERSC, she uses their inputs for additional calculations at OLCF and sends back results for post-processing. With a 200 MB/sec transfer rate, she can move all 40 terabytes of data between Berkeley and Oak Ridge in less than three days.
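A quick back-of-the-envelope check, sketched below, bears out both figures. (The 8-hour workday used to estimate the old scp rate is an assumption for illustration, not a number stated in the article.)

```python
TB, GB, MB = 10**12, 10**9, 10**6  # decimal units, as used for network rates
DAY = 86_400                       # seconds per day

# 40 TB per nucleus at the new sustained 200 MB/sec rate
new_days = (40 * TB) / (200 * MB) / DAY
print(f"{new_days:.1f} days")  # prints "2.3 days" -- less than three days

# The earlier scp anecdote: 33 GB in "more than an entire workday".
# Assuming an 8-hour workday, the implied sustained rate is roughly 1 MB/sec,
# about 200x slower than the dedicated transfer nodes.
implied_rate_mb = (33 * GB) / (8 * 3600) / MB
print(f"{implied_rate_mb:.1f} MB/sec")  # prints "1.1 MB/sec"
```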
“Having a high speed network to quickly transfer data allows me to spend more time on the science, not the logistics of getting to the science,” says Nam, whose work is supported by the DOE Office of Advanced Scientific Computing Research (ASCR). She is a Petascale Early Science user of the Jaguar supercomputer at ORNL, working on the “Ab Initio Nuclear Structure of Carbon-14” project led by David Dean of ORNL.
In addition to building the infrastructure, engineers from ESnet, OLCF and NERSC have been collaborating on strategies for optimizing bandwidth performance between the various data storage systems at the supercomputing sites. Both sites deployed perfSONAR network monitoring applications on their servers during the testing phase to identify the transfer “choke points,” where data stalled between the two facilities. The perfSONAR findings allowed staff at both sites to make the necessary adjustments to alleviate congestion. The engineers were also able to identify a variety of user-specific tuning parameters that will enable the best transfer rates possible between the two facilities. These tips are published at http://fasterdata.es.net.
The engineers call their collaborative effort the Data Transfer Working Group. In addition to ESnet, OLCF and NERSC, engineers from the Leadership Computing Facility (LCF) at Argonne National Laboratory in Illinois also participated in the collaboration and are currently working on deploying their own dedicated gateway.
“The Data Transfer Working Group is seeing a marked increase in user demand for higher performance data transfer capabilities between the centers,” said NERSC’s Hick. “Combined with record levels of new data generation at each center, the timing for improving on existing capabilities couldn’t be better for the user or the science goals they are working to achieve.”
PerfSONAR is a joint collaboration between ESnet, GÉANT2, Internet2, and Rede Nacional de Ensino e Pesquisa (RNP). GÉANT2 and RNP are the research and education networks of Europe and Brazil, respectively. Internet2 is an advanced networking consortium for research and education, comprising more than 200 U.S. universities, 70 corporations and 45 government agencies. ESnet interconnects more than 40 DOE research facilities and dozens of universities across the United States, also providing network connections to research networks, experimental facilities and research institutions around the globe.