| |
Graph Data Management for Biology Tutorial
Scope
Many kinds of data arising in bioinformatics can be
usefully represented as graphs.
Biopathways (metabolic pathways, signaling pathways, or gene
regulatory networks) are among the most important examples
of such graph data.
Other examples include: taxonomies, e.g., of enzymes or organisms
(directed acyclic graphs),
protein interaction networks (undirected graphs),
DNA, RNA or protein sequences (linear graphs),
chemical structure graphs (undirected graphs),
sequence fragment overlap graphs (interval graphs)
for shotgun sequence assembly,
experimental protocols (directed graphs),
bibliographic citations (directed graphs),
bibliographic co-citations (undirected graphs),
gene co-expression (undirected graphs),
genetic maps (partial orders),
multiple sequence alignments (partial orders),
contact graphs for protein structures (undirected graphs),
etc.
Graph data models have been studied
in the database community for semantic data modeling,
hypertext, geographic information systems, semi-structured
data, XML, multi-media, etc.
Graph data management has long formed the basis for
chemical information retrieval systems.
This tutorial is concerned with techniques for the modeling
and management of such graph data.
The tutorial will cover a variety of biological graph data examples,
but concentrate on issues related to biopathways.
We will discuss the details of various graph data models
directed vs. unidrected graphs, nested vs. unnested graphs,
graphs vs. hypergraphs. labeling and type systems on edges and nodes,
specification of attributes.
We will discuss both edge list and incidence (adjacency)
matrix representations of graphs.
We will also discuss related work on binary relational data models
(NIAM), and on mappings between relational (or entity-relationship)
schemas and graph data model schemas.
We will discuss various measures used to
characterize graphs: average node degrees, graph diameter,
number of connected components, various topological indices.
We will also explain a variety of graph queries:
path queries,
Boolean graph queries,
subgraph matching queries,
graph characterization queries,
approximate subgraph matching queries.
Then we will compare various proposals for graph query languages.
We will introduce some of the algorithms for processing
graphs queries: e.g., transitive closures,
subgraph isomorphism queries, etc.
We will introduce basic concepts of query plan specification
and optimization.
We will discuss the relative merits of several proposed implementation
strategies:
building a graph data manager on top of a relational
backend,
extending object-relational DBMS with graph data types and functions,
etc.
We will briefly survey several graph data exchange formats.
Audience
This tutorial is intended primarily for
computer scientists, bioinformaticists,
database designers, and DBMS developers
concerned with graph data management for
biological applications.
Computer literate biologists working with
graph data will also find the course useful.
Prerequisites
It will be useful for students to have some
previous exposure to concepts of relational
databases (e.g., relational or entity-relation schemas,
SQL, relational algebra). No previous knowledge of graph
theory is assumed. See also discussion of the related morning
tutorial below.
Related Tutorial
Some students may also wish to take the
The Pathway Tools Software Tutorial
by
Randy Gobbel and Suzane Paley
from SRI in the morning tutorials at
IEEE Computer Society Bioinformatics Conference
Both tutorials are concerned with storing and querying biopathways
information.
Our tutorial approaches these topics from a database management perspective,
while their tutorial approaches similar topics from
a knowledge representation perspective.
Location, Directions, Parking
The tutorial will be offered at the site of the
IEEE Computer Society Bioinformatics Conference
at Stanford University
in Stanford, California.
Building and Room number are yet to be assigned.
Information on directions, parking, maps, etc. can be found
at
IEEE CSB Conference Site Information Web Page.
Date, Time and Duration
The tutorial will be given
Monday, August 11, 2003 of the first day of
IEEE Computer Society Bioinformatics Conference
The time for the tutorial is currently set for the afternoon.
This will be a half-day tutorial, approximately 3-3.5 hours long.
Registration and Fees
See the
registration web page
for the
IEEE Computer Society Bioinformatics Conference
for information on registration and fees for the tutorial.
Notes
Enrolled tutorial attendees will be provided with hardcopies
of the tutorial slides.
Instructor
This instructor will be
Frank Olken .
He holds a Ph.D. from U.C. Berkeley in Computer Science and works in
the scientific data management group at Lawrence Berkeley National Laboratory
in Berkeley, California.
He has taught data management for UC Berkeley, and tutorials
on XML Schema Language and Work Flow Management Systems for
UC Berkeley Extension.
He also served on the W3C Committees for standardization of
RDF Schemas, and XML Schema Language.
He has worked on various topics in file migration,
statistical and scientific data management, bioinformatics
and computational biology, and workflow management.
He curently leads the
Biopathways Graph Data Manager Project
at LBNL for the
Arkin Lab.
This page last updated on April 1, 2003.
Contact: Frank Olken
concerning this web page.
|
|