In this talk we shall discuss the use of graph data management for a variety of biological applications. Simple graphs are composed of a set of nodes and a binary relation comprising the edges which connect pairs of nodes.
We shall emphasize applications related to the representation and querying of biopathways databases, e.g., metabolic pathways, signal transduction pathways, and genetic regulatory networks. Other potential applications of graph data management to biology include: chemical structure graphs, protein interaction networks, phylogenetic trees, taxonomies of chemicals, proteins, enzymes and diseases, partonomies (e.g., in anatomy), topological adjacency relations (in image analysis), contact graphs (for 3D protein structure), bibliographic citation graphs, food webs, biogeochemical cycles, gene clusterings, partial order graphs for DNA multiple sequence alignments, genetic maps, operon and regulon structures, sequence overlap graphs for shotgun DNA sequencing, database schemas and mappings among schemas, data provenance (lineage), hypertext, semantic web applications, laboratory protocols, etc.
Graph data models for biology come in a number of variants: undirected and directed graphs, simple graphs, nested graphs, multigraphs, and hypergraphs. We will mention these variants and illustrate their applications.
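As a rough illustration of how some of these variants differ, the following Python sketch encodes the same small fragment of glycolysis four ways. The encodings and the helper `successors` are assumptions made for illustration, not the data model of any particular system; nested graphs are omitted for brevity.

```python
# Illustrative toy data only; the encodings below are assumptions, not a real schema.

# Simple directed graph: a node set plus a binary relation (a set of ordered pairs).
nodes = {"glucose", "G6P", "F6P"}
edges = {("glucose", "G6P"), ("G6P", "F6P")}

# Undirected graph: each edge is an unordered pair of nodes.
undirected_edges = {frozenset(pair) for pair in edges}

# Multigraph: parallel edges between the same two nodes are allowed,
# so each edge carries a distinguishing label (here, the catalyzing enzyme).
multi_edges = [
    ("glucose", "G6P", "hexokinase"),
    ("glucose", "G6P", "glucokinase"),
]

# Directed hypergraph: one edge joins a set of source nodes to a set of
# target nodes, which suits a reaction with several substrates and products.
hyper_edges = [
    (frozenset({"glucose", "ATP"}), frozenset({"G6P", "ADP"})),
]


def successors(node, edge_set):
    """Return the nodes directly reachable from `node` in a simple directed graph."""
    return {dst for src, dst in edge_set if src == node}
```

For example, `successors("glucose", edges)` yields the single node `"G6P"`, while the hyperedge keeps ATP and ADP attached to the same reaction instead of splitting them across separate binary edges.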
Graph data management offers two major advantages for biopathways applications: naturalness of representation of pathway data, and ease of querying pathway data. The latter is the more important of the two.
Graph data management systems permit users to frame queries in terms of graph operations (e.g., subgraph isomorphism and shortest paths) that would be difficult to express or compute in a conventional relational DBMS. We discuss a number of graph queries in the talk, e.g., subgraph homomorphism queries.
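To make one such graph operation concrete, here is a minimal sketch of a shortest-path query using breadth-first search over an adjacency-list graph. The function name `shortest_path` and the toy `pathway` data are hypothetical; this is not the query interface of any specific graph data manager.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one path with the fewest edges from source to target, or None."""
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical toy pathway: A converts to B or C, both convert to D, D to E.
pathway = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
}
```

Calling `shortest_path(pathway, "A", "E")` returns a three-edge path from A to E, and a query in the reverse direction returns `None` because the edges are directed. Expressing the same unbounded traversal in plain SQL would require recursion or a chain of self-joins.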
Graph data management systems typically treat individual fragments of the database more homogeneously (as either nodes or edges) than relational databases, which partition the database into many specialized relations. In a GDMS the analog of relation structure is encoded as edges which indicate the types of nodes. Although relational storage structures offer performance advantages on fixed-structure queries, the homogeneous graph data model is much easier to use when posing queries whose paths may span many different possible relations (node types). It is this storage homogeneity which facilitates pattern matching and path queries in graph databases. In contrast, similar queries in a relational DBMS involve large numbers of union queries over the various relations which might participate in a path or subgraph pattern match. We will discuss this issue and (briefly) some related comparisons to logic and object-oriented database management systems.
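The following sketch illustrates the homogeneity argument under simple assumptions: every fact is stored as a labelled edge in one list, node types are themselves recorded as `is_a` edges, and a single traversal crosses genes, proteins, enzymes, and metabolites without knowing their types in advance. The entity names, edge labels, and functions are hypothetical.

```python
# One homogeneous edge list: (source, label, target). All names are hypothetical.
EDGES = [
    ("geneX", "is_a", "Gene"),
    ("proteinX", "is_a", "Protein"),
    ("enzymeY", "is_a", "Enzyme"),
    ("metaboliteZ", "is_a", "Metabolite"),
    ("geneX", "encodes", "proteinX"),
    ("proteinX", "activates", "enzymeY"),
    ("enzymeY", "produces", "metaboliteZ"),
]


def reachable(start, edges, skip_labels=("is_a",)):
    """Nodes reachable from `start` along any non-type edge, whatever the node types."""
    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for src, label, dst in edges:
            if src == node and label not in skip_labels and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


def node_type(node, edges):
    """Look up a node's type by following its `is_a` edge, if it has one."""
    for src, label, dst in edges:
        if src == node and label == "is_a":
            return dst
    return None
```

Here `reachable("geneX", EDGES)` collects the protein, the enzyme, and the metabolite in one pass. With one relational table per relationship (encodes, activates, produces), each step of the same traversal would have to union over every table that might supply the next hop.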
We will illustrate the talk with references to some major biopathways databases. Time permitting, we will also mention the role of W3C's RDF (Resource Description Framework) as a graph data model. We will conclude with a brief survey of some alternative approaches to the implementation of graph database management systems.
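RDF can be read as a graph data model because each statement is a (subject, predicate, object) triple, i.e., a labelled directed edge. The sketch below models triples as plain Python tuples rather than using an RDF library; the `http://example.org/pathway#` namespace, the property names, and the `match` helper are assumptions for illustration, while the `rdf:type` URI is the standard one.

```python
EX = "http://example.org/pathway#"  # hypothetical namespace for illustration
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # standard rdf:type URI

# Each triple is one labelled, directed edge: subject --predicate--> object.
triples = {
    (EX + "hexokinase", RDF_TYPE, EX + "Enzyme"),
    (EX + "hexokinase", EX + "catalyzes", EX + "reaction1"),
    (EX + "reaction1", EX + "hasSubstrate", EX + "glucose"),
    (EX + "reaction1", EX + "hasProduct", EX + "G6P"),
}


def match(pattern, store):
    """Return the triples matching a pattern; None acts as a wildcard in any position."""
    return {
        triple
        for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    }
```

A pattern such as `(EX + "reaction1", None, None)` returns every edge leaving the reaction node, which is the basic building block for the path and subgraph queries discussed earlier.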
The talk will be held in Room 306 of Soda Hall on the UC Berkeley campus. Here is a link to a map of the vicinity of Soda Hall. Soda Hall is located just north of the main campus on Hearst Avenue, at the corner of Hearst and Leroy Street. It is adjacent to (and east of) Etcheverry Hall, due north of Davis Hall on campus, northwest of Cory Hall, and due west of the Goldman School of Public Policy. The closest LBNL Hearst Ave. shuttle stops are at Cory Hall inbound to LBNL from BART and at the Goldman School of Public Policy (on Hearst Ave. between Leroy St.) outbound from LBNL to BART. The UCB campus perimeter shuttle also stops at Cory Hall. Entrance to the third floor is from the west side of the building, opposite Etcheverry Hall. Here is a picture of Soda Hall. The third floor entrance is on the left underneath the trellis. Contact the conference organizers about parking permits. Street parking is usually extremely scarce, but may be possible since school will not be in session during the meeting.
This is joint work with Kevin D. Keck and Vijaya Natarajan (both at LBNL). Further information on the Biopathways Graph Data Manager Project may be found at http://www.lbl.gov/~olken/graphdm/graphdm.htm. The work is funded by the DARPA Biocomp Program (via the Biospice Project at LBNL (PI: A. Arkin)) and the DOE Genomes to Life Program (via the VIMSS GTL project at LBNL (PIs: A. Arkin and T. Hazen) and the Synechococcus GTL project at Sandia National Lab (PI: G. Heffelfinger, LBNL PI: A. Shoshani)). Submitted by Frank Olken on 2003-12-08.