Berkeley Lab Workshop Seeks Standards for Making Mountains of Data More Accessible
 

By John Bashor

June 28, 1997

The information age has not only spawned a never-ending torrent of data flooding our lives, it's also led to huge libraries of electronic information stored in computers everywhere.

Although such information is a valuable resource, the sheer volume of data piling up makes it increasingly difficult to retrieve needle-sized files of useful information from these virtual haystacks. From July 8-12, experts in the field will attend a four-day workshop organized by Berkeley Lab computer scientists. Its aim is to recommend standards and practices that improve access and make it easier for organizations to share electronic information.

The issue is so large that it has generated its own terminology. The data about mountains of information is called "metadata," and the techniques used to find valuable nuggets are called "data mining."

According to program committee chairman Frank Olken of Berkeley Lab, metadata can facilitate access, use and sharing of data across cyberspace and time by systematically describing the content, structure and semantics of data residing in information systems, databases or files.

The main sponsor of the workshop is the U.S. Environmental Protection Agency, which has amassed volumes of environmental data. The information was usually collected by EPA program offices on a media-specific basis (such as air, water, solid waste), making it difficult to assemble into a complete picture of environmental conditions for any specific place. The data also lost meaning as the staff and contractors who knew about it - how and why it was collected - moved on to other work. To make and defend policies today, the EPA needs to access data from many sources, ensure its validity and integrate many perspectives, such as air quality, land use, water quality and chemical toxicity.

The EPA is also opening access to its databases to the public via the World Wide Web. This makes the information available to decision-makers in government and private enterprise, and it addresses the general public's right to know about environmental conditions in their communities.

The workshop is being held under the auspices of the International Organization for Standardization's (ISO) Joint Technical Committee on Information Standards. The goal of the workshop, said Bruce Bargmeyer, chair of ISO's subcommittee on data engineering and manager of EPA's Information and Data Management Program, is to bring together metadata experts from a variety of fields and try to find common ways to share data. Among the hurdles to be overcome are many separate standards created by different disciplines and user communities, overlapping standards and limited software tools.

The list of other organizations participating in the workshop at UC Berkeley's Clark Kerr Campus illustrates the extent and importance of the issue: the U.S. Census Bureau, Boeing, Xerox, AT&T Laboratories, the National Institute of Standards and Technology, UC Berkeley, Stanford University, University of Michigan, Rutgers University, the University of Maryland, and Lawrence Berkeley and Los Alamos national laboratories. Also attending will be representatives of traditional libraries who have extensive experience in classifying information sources within larger collections.

Featured speakers include Clifford Lynch, who is just stepping down as head of the University of California Division of Library Automation to lead the Coalition for Networked Information, and Phil Bernstein, head of database development at Microsoft.

"There are not only mountains of data to be conquered, but those mountains come in different varieties," said John McCarthy, a computer scientist in the Lab's Computing Sciences organization and chairman of the workshop. "Some data, like that transmitted by satellites, have very large numbers of observations for relatively few variables. Other data libraries, like those on the genetic makeup of humans and other organisms, have many, many related and complex variables.

"The problem common to all of these vast libraries is that it is very difficult to find exactly what you're looking for and to relate one data set to another," said McCarthy, one of the first researchers to coin the term metadata based on work at Berkeley Lab 25 years ago. "Many organizations still haven't come to grips with the extent of the problem - the data side of an organization's work is typically underestimated, under-budgeted and understaffed."

Until recently, what metadata standards did exist, Olken said, were oriented toward people and used "natural language." This approach doesn't work for the new generation of "intelligent search agents," which require more formal descriptions and automation.

A key approach for improving the situation is the creation of metadata registries - facilities following specified national and international standards for storing and registering detailed metadata from multiple databases and diverse organizations in a common, structured framework. Such registries help people and computer programs find relevant data more easily, give information providers a means to catalog and preserve their expertise for future use, facilitate integration of data from diverse sources and make better analysis possible. Metadata standards will also further commerce and health care through electronic data interchange, or EDI.
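To make the registry idea concrete, here is a minimal sketch in Python of how descriptions of data from several databases might be stored in one common structure and then searched. It is an illustration only, not any standard's actual design: the class names, the field names and the sample entries (`air_db`, `water_db`, `o3_ppb` and so on) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataElement:
    """Hypothetical metadata record describing one field in a source database."""
    name: str        # field name as it appears in the source
    definition: str  # plain-language meaning (semantics)
    datatype: str    # structure, e.g. "float" or "string"
    unit: str        # unit of measure, "" if none
    source: str      # database or organization holding the data


@dataclass
class MetadataRegistry:
    """Toy in-memory registry; a real one would follow a published standard."""
    elements: dict = field(default_factory=dict)

    def register(self, element: DataElement) -> str:
        # Key on source plus name so the same field name can come
        # from several databases without colliding.
        key = f"{element.source}:{element.name}"
        if key in self.elements:
            raise ValueError(f"{key} is already registered")
        self.elements[key] = element
        return key

    def find(self, term: str) -> list:
        # Case-insensitive substring match on the name and the definition.
        term = term.lower()
        return sorted(
            key for key, e in self.elements.items()
            if term in e.name.lower() or term in e.definition.lower()
        )


# Sample entries are invented for illustration.
registry = MetadataRegistry()
registry.register(DataElement("o3_ppb", "Ozone concentration in ambient air", "float", "ppb", "air_db"))
registry.register(DataElement("do_mgl", "Dissolved oxygen in surface water", "float", "mg/L", "water_db"))
registry.register(DataElement("site_id", "Identifier of the air monitoring site", "string", "", "air_db"))

air_hits = registry.find("air")
```

Because every entry carries a definition, a unit and a source alongside its name, a person or a program can search across databases that were built separately, which is the kind of cross-source lookup the registries are meant to support.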

"More and more, people are becoming frustrated that so many groups are pursuing different standards for finding and using data," Olken said. "Our goal is to bring together experts in various fields, get them talking to each other and then come up with a small set of recommendations regarding standards and practices to address this global issue."

   