Berkeley Lab Computing Sciences

SIDEBAR: Managing and Analyzing Petabytes of Data

The high spatial and temporal resolution of GCRM simulations will produce very large volumes of output, expected to be on the order of 1 terabyte per hourly snapshot, or 8.6 petabytes per year of continuous simulation time. Rerunning the model every time a particular analysis needs its output would be impractical. Therefore, it is necessary to store model results throughout the computation and provide flexible tools to extract the subsets of data required for a wide range of analyses. Developing those tools is the goal of the SciDAC "Community Access to Global Cloud Resolving Model and Data" Scientific Application Partnership (SAP), led by Karen Schuchardt of Pacific Northwest National Laboratory.
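The yearly figure follows from the hourly one by simple arithmetic. The short Python sketch below reproduces it under stated assumptions: exactly 1 terabyte per snapshot (the article says only "on the order of"), one snapshot per simulated hour, and a 365-day year. The function names are illustrative only. The article does not say whether it counts 1,000 or 1,024 terabytes to the petabyte, so the sketch takes that as a parameter.

```python
# Back-of-the-envelope check of the yearly output volume.
# Assumptions (not stated in the article): exactly 1 TB per snapshot,
# one snapshot per simulated hour, and a 365-day year.

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def snapshots_per_year() -> int:
    """Number of hourly snapshots in one simulated year."""
    return HOURS_PER_DAY * DAYS_PER_YEAR


def yearly_volume_pb(tb_per_snapshot: float = 1.0,
                     tb_per_pb: float = 1024.0) -> float:
    """Yearly output in petabytes for a given snapshot size.

    tb_per_pb selects the unit convention: 1024 (binary) or 1000 (decimal).
    """
    return snapshots_per_year() * tb_per_snapshot / tb_per_pb


if __name__ == "__main__":
    print("snapshots per year:", snapshots_per_year())
    print("binary prefixes  (PB):", round(yearly_volume_pb(), 2))
    print("decimal prefixes (PB):", round(yearly_volume_pb(tb_per_pb=1000.0), 2))
```

With 8,760 snapshots per year, the binary convention gives roughly 8.55 petabytes and the decimal convention 8.76 petabytes, so the quoted 8.6 petabytes is consistent with the first of these after rounding.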

"We cannot easily generate this data every time someone needs it, so we view each dataset as an extremely valuable resource and want to make it available to as many collaborators as possible," Schuchardt says.


Figure 1. The Global Cloud Resolving Model requires high-performance, highly coupled compute, storage, and analysis resources, which are accessed by the climate science community through tools and services provided by the Scientific Application Partnership.

The main tasks of this partnership include developing a web portal that enables users to browse, search, and request specific data subsets; developing tools to efficiently access, analyze, and visualize subsets of data; and developing a high-performance input/output (I/O) application program interface (API) and data format definition. As illustrated in Figure 1, paradigm-changing models such as a GCRM require coupled compute, storage, and analysis resources. The software services that provide data access to the broad community are a vital link in the flow of information.

These goals are a perfect match for NERSC's Science Gateways project, which is developing custom web interfaces for computing, data distribution, collaboration, and analytics. NERSC's Outreach, Software, and Programming Group has collaborated with Schuchardt's team on portal development. In addition, the NERSC Analytics Group and the SciDAC Visualization and Analytics Center for Enabling Technologies (VACET) are assisting with troubleshooting and improving I/O and with evaluating and developing visualization tools for GCRM data.

When a high-resolution model like the GCRM is running on 40,000 processors of a fast computer like Franklin, outputting the high volume of data and writing it to disk can become a bottleneck, slowing down the entire computation. The I/O API being developed for GCRM allows the data to be efficiently output in parallel streams to local storage on Franklin, in a data format (netCDF) that is common in the climate modeling community.

So far the researchers have achieved an effective aggregate I/O bandwidth of 5 gigabytes per second for writing GCRM output on Franklin. The increased bandwidth was achieved by consolidating I/O on an optimal number of processors, aggregating writes into large chunks of data, and making additional improvements in the filesystem and parallel I/O libraries. From Franklin, the data is copied to the NERSC Global Filesystem (NGF), where it can be shared with or transferred to scientists around the world.
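Two of the techniques named above, consolidating I/O on a subset of processors and aggregating writes into large chunks, can be illustrated with a minimal Python sketch. This is not the GCRM I/O API, whose interface the article does not describe. The names aggregator_for and ChunkedWriter, the count of 80 aggregators, and the chunk size are all hypothetical, chosen only to show the idea.

```python
import io


def aggregator_for(rank: int, n_ranks: int, n_aggregators: int) -> int:
    """Map a compute rank to the aggregator that writes on its behalf.

    Contiguous blocks of ranks share one aggregator, so only
    n_aggregators processes ever touch the filesystem.
    """
    if not 1 <= n_aggregators <= n_ranks:
        raise ValueError("need 1 <= n_aggregators <= n_ranks")
    if not 0 <= rank < n_ranks:
        raise ValueError("rank out of range")
    return rank * n_aggregators // n_ranks


class ChunkedWriter:
    """Buffer many small writes and pass them to the sink in large chunks."""

    def __init__(self, sink, chunk_bytes: int):
        self.sink = sink                # any object with a write(bytes) method
        self.chunk_bytes = chunk_bytes  # size of each write sent to the sink
        self.sink_writes = 0            # how many writes actually reached the sink
        self._buf = bytearray()

    def write(self, data: bytes) -> None:
        self._buf += data
        # Emit full chunks only; keep any remainder buffered.
        while len(self._buf) >= self.chunk_bytes:
            self._emit(self.chunk_bytes)

    def close(self) -> None:
        # Flush whatever is left as one final, possibly short, chunk.
        if self._buf:
            self._emit(len(self._buf))

    def _emit(self, n: int) -> None:
        self.sink.write(bytes(self._buf[:n]))
        del self._buf[:n]
        self.sink_writes += 1


if __name__ == "__main__":
    # 40,000 ranks (the figure from the article), 80 aggregators (hypothetical).
    print(aggregator_for(39_999, 40_000, 80))

    sink = io.BytesIO()
    writer = ChunkedWriter(sink, chunk_bytes=1024)
    for _ in range(100):
        writer.write(b"x" * 100)        # 100 small writes of 100 bytes each
    writer.close()
    print(writer.sink_writes, len(sink.getvalue()))
```

In this toy run, 100 small writes reach the sink as 10 larger ones. On a parallel filesystem the right chunk size and aggregator count depend on the system and are normally found by measurement, which is presumably what "an optimal number of processors" refers to. For scale, at 5 gigabytes per second a 1-terabyte snapshot takes about 200 seconds to write, assuming decimal units.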

The large datasets generated by the GCRM also require new analysis and visualization capabilities, including parallel processing and rendering. The NERSC Analytics/VACET team has developed a GCRM plug-in for the VisIt visualization tool that supports the geodesic grid used by the GCRM (Figure 4). Rather than transferring huge datasets between sites, scientists can choose to keep their simulation output at NERSC and use VisIt’s client/server architecture to do remote visualization.