By John Bashor
June 28, 1997
The information age has not only spawned a never-ending torrent
of data flooding our lives, it's also led to huge libraries of electronic
information stored in computers everywhere.
Although such information represents a valuable resource, the
total volume of data stacking up is making it increasingly difficult
to retrieve needle-sized files of useful information from these
virtual haystacks. From July 8-12, experts in the field will attend
a four-day workshop, organized by Berkeley Lab computer scientists,
that aims to develop recommended standards and practices to improve
access and make it easier for various organizations to share electronic
information.
The issue is so large that it has generated its own terminology.
The data about these mountains of information is called "metadata,"
and the techniques used to find valuable nuggets are called
"data mining."
According to program committee chairman Frank Olken of Berkeley
Lab, metadata can facilitate access, use and sharing of data across
cyberspace and time by systematically describing the content, structure
and semantics of data residing in information systems, databases
or files.
The main sponsor of the workshop is the U.S. Environmental Protection
Agency, which has amassed volumes of environmental data. The information
was usually collected by EPA program offices on a media-specific
basis (such as air, water, solid waste), making it difficult to
draw together into a whole picture of environmental conditions for
any specific place. The data also lost meaning as the staff and
contractors who knew about it - how and why it was collected - moved
on to other work. To make and defend policies today, the EPA needs
to access data from many sources, ensure its validity and integrate
many perspectives, such as air quality, land use, water quality
and chemical toxicity.
The EPA is also opening its databases to the public via the World
Wide Web. This makes the information available to decision-makers
in government and private enterprise, and it addresses the general
public's right to know about environmental conditions in their
communities.
The workshop is being held under the auspices of the International
Organization for Standardization's (ISO) Joint Technical Committee
on Information Standards. The goal of the workshop, said Bruce Bargmeyer,
chair of ISO's subcommittee on data engineering and manager of EPA's
Information and Data Management Program, is to bring together metadata
experts from a variety of fields and try to find common ways to
share data. Among the hurdles to be overcome are many separate standards
created by different disciplines and user communities, overlapping
standards and limited software tools.
The list of other organizations participating in the workshop
at UC Berkeley's Clark Kerr Campus illustrates the extent and importance
of the issue: the U.S. Census Bureau, Boeing, Xerox, AT&T Laboratories,
the National Institute of Standards and Technology, UC Berkeley,
Stanford University, University of Michigan, Rutgers University,
the University of Maryland, and Lawrence Berkeley and Los Alamos
national laboratories. Also attending will be representatives of
traditional libraries who have extensive experience in classifying
information sources within larger collections.
Featured speakers include Clifford Lynch, who is just stepping
down as head of the University of California Division of Library
Automation to lead the Coalition for Networked Information, and Phil
Bernstein, head of database development at Microsoft.
"There are not only mountains of data to be conquered, but those
mountains come in different varieties," said John McCarthy, a computer
scientist in the Lab's Computing Sciences organization and chairman
of the workshop. "Some data, like that transmitted by satellites,
have very large numbers of observations for relatively few variables.
Other data libraries, like those on the genetic makeup of humans
and other organisms, have many, many related and complex variables.
"The problem common to all of these vast libraries is that it
is very difficult to find exactly what you're looking for and to
relate one data set to another," said McCarthy, one of the first
researchers to coin the term metadata based on work at Berkeley
Lab 25 years ago. "Many organizations still haven't come to grips
with the extent of the problem - the data side of an organization's
work is typically underestimated, underbudgeted and understaffed."
Until recently, what metadata standards did exist, Olken said,
were oriented toward people and used "natural language." This approach
doesn't work for the new generation of "intelligent search agents,"
which require more formal descriptions and automation.
A key approach for improving the situation is the creation of
metadata registries - facilities following specified national and
international standards for storing and registering detailed metadata
from multiple databases and diverse organizations in a common, structured
framework. Such registries help people and computer programs find
relevant data more easily, give information providers a means to
catalog and preserve their expertise for future use, facilitate
integration of data from diverse sources and make better analysis
possible. Metadata standards will also further commerce and health
care through electronic data interchange, or EDI.
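To make the registry idea concrete, here is a minimal sketch in Python
of how such a facility might store structured descriptions from several
databases and let a person or program search them. It is an illustration
only, not any standard's actual design: the class names, field names,
database names and sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """One registered description of a field held in some database.

    All field names here are assumptions made for illustration.
    """
    name: str
    definition: str
    source: str            # database or organization holding the data
    unit: str = ""
    keywords: tuple = ()


class MetadataRegistry:
    """A toy registry: a common, structured catalog of data elements."""

    def __init__(self):
        self._elements = {}

    def register(self, element):
        # An element is identified by its source together with its name,
        # so two databases may use the same field name without clashing.
        key = (element.source, element.name)
        if key in self._elements:
            raise ValueError(f"already registered: {key}")
        self._elements[key] = element

    def find(self, term):
        # Case-insensitive match on name, definition or exact keyword.
        term = term.lower()
        return [
            e for e in self._elements.values()
            if term in e.name.lower()
            or term in e.definition.lower()
            or any(term == k.lower() for k in e.keywords)
        ]


# Made-up sample entries from three hypothetical databases.
registry = MetadataRegistry()
registry.register(DataElement(
    "ozone_ppb", "Ground-level ozone concentration",
    "air_db", "ppb", ("air", "ozone")))
registry.register(DataElement(
    "nitrate_mg_l", "Nitrate concentration in surface water",
    "water_db", "mg/L", ("water",)))
registry.register(DataElement(
    "o3_reading", "Hourly ozone reading at a monitoring site",
    "state_air_db", "ppb", ("air", "ozone")))

# One query locates related data held by different organizations.
matches = registry.find("ozone")
sources = sorted(e.source for e in matches)
print(sources)
```

Because each entry records its source, definition and unit in the same
structure, a single search can surface related measurements that were
collected separately, which is the kind of integration the registries
are meant to support.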
"More and more, people are becoming frustrated that so many groups
are pursuing different standards for finding and using data," Olken
said. "Our goal is to bring together experts in various fields,
get them talking to each other and then come up with a small set
of recommendations regarding standards and practices to address
this global issue."