July 15, 1998
A newly funded computer research program at Lawrence Berkeley
National Laboratory could revolutionize the way scientific instruments,
computers and humans work together to gather, analyze and use data.
The program, funded by the U.S. Department of Energy, will build
on efforts over the past 10 years to gather, store and make information
available over computer networks. The program is called "China
Clipper," in reference to the 1930s commercial air service
which spanned the Pacific Ocean and opened the door to the reliable,
global air service taken for granted today.
"We believe that our China Clipper project epitomizes the research environment
we will see in the future," says Bill Johnston, leader of the
Imaging and Distributed Computing Group at Berkeley Lab. "It
will provide an excellent model for on-line scientific instrumentation.
Data are fundamental to analytical science, and one of my professional
goals is to greatly improve the routine access to scientific data
- especially very large datasets - by widely distributed collaborators,
and to facilitate its routine computer analysis."
The idea behind China Clipper, like the pioneering air service,
is to bring diverse resources closer together. In this case, scientific
instruments such as electron microscopes and accelerators would
be linked by networks to data storage "caches" and computers.
China Clipper will provide the "middleware" to allow these
separate components, often located hundreds or thousands of miles
apart, to function as a single system. Johnston is scheduled to discuss
the work of the Lab in this area at an IEEE symposium on High Performance
Distributed Computing next week.
Data Intensive Computing
Modern scientific computing involves organizing, moving, visualizing,
and analyzing massive amounts of data from around the world, as
well as employing large-scale computation. The distributed systems
that solve large-scale problems involve aggregating and scheduling
many resources. Data must be located and staged, and cache and network
capacity must be available at the same time as computing capacity.
Every aspect of such a system is dynamic: locating and scheduling
resources, adapting running application systems to availability
and congestion in the middleware and infrastructure, responding
to human interaction, etc. The technologies, the middleware services,
and the architectures that are used to build useful high-speed,
wide area distributed systems constitute the field of data intensive computing.
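The co-scheduling problem described above can be illustrated with a toy check: a data-intensive job can start only when the dataset has been staged and cache, network, and compute capacity are all free in the same window. All names here are illustrative, not part of the China Clipper software.

```python
# Toy co-allocation check for a data-intensive job. A job can only
# start when its data is staged AND every required resource (cache,
# network, compute) is available in the same time window.
def can_start(staged, resources):
    """Return True when the data is staged and all resources are free."""
    return staged and all(resources.values())

resources = {"cache": True, "network": True, "compute": False}
print(can_start(True, resources))   # compute slot missing, job waits

resources["compute"] = True
print(can_start(True, resources))   # everything aligned, job runs
```

A real scheduler must also handle the dynamics the article mentions, re-checking availability and adapting as congestion changes, but the core constraint is this conjunction of resources.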
Enhancing data intensive computing will make research facilities
and instruments at various DOE sites available to a wider group
of users. Berkeley Lab scientists are developing China Clipper in
collaboration with their counterparts at the Stanford Linear Accelerator
Center (SLAC), Argonne National Laboratory and the Department of
Energy's Energy Sciences Network, or ESnet.
"This will lead to a substantial increase in the capabilities
of experimental facilities," predicts Johnston.
As an example of the benefits, Johnston cites a Cooperative Research
and Development Agreement project called "WALDO" (for
Wide Area Large Data Object). In this project, Johnston's group,
Pacific Bell, the NTON optical network testbed project at Lawrence
Livermore National Lab and others worked with Kaiser Permanente
to produce a prototype on-line, distributed, high-data-rate medical
imaging system. The project allowed cardio-angiography data to be
collected directly from a scanner in a San Francisco hospital. The
system was connected to a high-speed Bay Area network and data was
collected, processed, and stored at Berkeley Lab, and accessed by
cardiologists at the Kaiser Oakland hospital.
One result of the Kaiser project was a demonstration that physicians
could have immediate access to the numerous medical images from
each patient. Currently, such images are processed and kept by a
central office, and doctors at the referring hospitals only see
one or two images after a couple of weeks. However, with the WALDO
real-time acquisition and cataloguing approach, they had access
in a few hours.
The vision guiding this work is that getting faster access to
data will allow scientists to conduct their work more efficiently
and gain new insights. Much research involves starting out with
a scientific model of what's supposed to occur, then conducting
an experiment and comparing the actual results with what was expected.
Figuring out the how and why of this difference is where the real
science happens, Johnston says. China Clipper is expected to lead
to better utilization of instrumentation for experiments and provide
fast comparisons of actual experiments and computational models,
thereby giving researchers better tools for testing scientific theories.
Because the test-and-compare procedure must often be conducted
over and over to obtain reliable results, streamlining the process
each time around could significantly increase the rate of scientific discovery.
Evolution of an Idea
According to Johnston, China Clipper is the culmination of a decade
of research and development of high-speed, wide area, data intensive
computing. The first demonstration of the project's potential came
during 1989 hearings held by then-Senator Al Gore on his High Performance
Computing and Communications legislation. Because the Senate room
had no network connections at the time, a simulated transmission
of images over a network at various speeds was put together. The
successful effort introduced legislators to the implications of
high-speed computer networking.
Johnston's group continued its work, evolving from scientific
visualization to the idea of operating scientific instruments on
line. This work is led by Bahram Parvin in collaboration with Berkeley
Lab's Materials Sciences and Life Sciences divisions. Last year,
several group members patented their system which provides automatic
computerized control of microscopic experiments. The system collects
video data, analyzes the data and then sends a signal to the instruments
to carry out such delicate tasks as cleaving DNA molecules and controlling
the shape of growing micro-crystals.
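The acquire-analyze-actuate loop described above can be sketched in a few lines. This is a simplified illustration of the control pattern, not the patented system; the function names and the simulated measurement are hypothetical.

```python
def acquire_frame(step):
    """Stand-in for grabbing one video frame from the instrument."""
    return {"feature_size": 10 - step}  # simulated shrinking feature

def analyze(frame):
    """Decide whether the feature has reached the target size."""
    return frame["feature_size"] <= 7

commands = []  # record of signals sent to the instrument controller
for step in range(10):
    frame = acquire_frame(step)
    if analyze(frame):
        commands.append("stop")      # target reached: halt the process
        break
    commands.append("continue")      # otherwise keep the process going

print(commands)  # ['continue', 'continue', 'continue', 'stop']
```

The essential point is that the analysis result, not a human operator, closes the loop back to the instrument, which is what makes delicate real-time tasks feasible.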
One key aspect of successful data-intensive computing -- accessing
data cached at various sites -- was developed by Berkeley Lab for
a project funded by DARPA, the Defense Advanced Research Projects
Agency. Called Distributed-Parallel Storage System, or DPSS, this
technology successfully provided an economical, high performance
and highly scalable design for caching large amounts of data for
use by many different users. Brian Tierney continues this project
with his team in NERSC's Future Technologies Group.
In May, a team from Berkeley Lab and SLAC conducted an experiment
using DPSS to support high energy physics data analysis. The team
achieved a sustained data transfer rate of 57 MBytes per second,
demonstrating that high-speed data storage systems could use distributed
caches to make data available to systems running analysis codes.
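The idea behind a distributed cache like DPSS is that data blocks striped across several servers can be fetched in parallel, so the aggregate transfer rate exceeds what any single server could sustain. The following is a minimal sketch of that access pattern with simulated servers; the names and block layout are assumptions for illustration, not the DPSS interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cache servers, each holding a stripe of the dataset.
CACHES = ["cache-a", "cache-b", "cache-c"]

def fetch_block(block_id):
    """Fetch one data block from the cache that holds it (simulated)."""
    server = CACHES[block_id % len(CACHES)]
    return f"{server}:block{block_id}"

# Request many blocks concurrently so transfers from different servers
# overlap; overlapping transfers are what let a distributed cache feed
# an analysis code at high sustained rates.
with ThreadPoolExecutor(max_workers=len(CACHES)) as pool:
    blocks = list(pool.map(fetch_block, range(6)))

print(blocks)
```

In the real system each fetch is a network transfer, and the client reassembles the stripes into the logical dataset the analysis code reads.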
Overcoming the Hurdles
With the development of various components necessary for data
intensive computing, the number of obstacles has dwindled. One of
the last remaining issues, that of scheduling and allocating resources
over networks, is being addressed by "differentiated services."
This technology, resulting from work by Van Jacobson's Network Research
Group, specially marks some data packets for priority service as
they move across networks. A demonstration by Berkeley Lab in April
showed that priority-marked packets arrived at eight times the speed
of regular packets when sent through congested network connections.
Differentiated services would allow designated projects to reserve
sufficient network capacity so their work could proceed on schedule.
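Packet marking of this kind survives in today's Differentiated Services standard, where a code point in the IP header's TOS byte tells routers which packets get priority. As a modern illustration (using today's standard values, not the 1998 demonstration's exact mechanism), a Python program can mark a UDP socket's packets like this:

```python
import socket

# DSCP "Expedited Forwarding" is code point 46; the six DSCP bits sit
# at the top of the IP TOS byte, so the byte value is 46 << 2 = 0xB8.
EF_DSCP = 46
tos = EF_DSCP << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
# Every datagram sent on this socket now carries the priority mark;
# routers configured for differentiated services can forward such
# packets ahead of ordinary best-effort traffic.
marked = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
sock.close()
print(hex(marked))  # typically 0xb8
```

Marking alone does nothing, of course; the routers along the path must be configured to honor the code point, which is why the April demonstration over congested links was significant.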
The next big step, says Johnston, is to integrate the various
components and technologies into a cohesive and reliable package
-- a set of "middleware services" that let applications
easily use these new capabilities.
"We see China Clipper not so much as a 'system,' but rather
as a coordinated collection of services that may be flexibly used
for a variety of applications," says Johnston. "Once it
takes off, we see it opening new routes and opportunities for scientific
research."
For more information see http://www-itg.lbl.gov/WALDO
and the papers at http://www-itg.lbl.gov/~johnston/.