Graph Data Management for Biology Tutorial
 

Graph Data Management for Biology Tutorial

Frank Olken
Lawrence Berkeley National Laboratory

Monday, August 11, 2003
IEEE Computer Society Bioinformatics Conference
Stanford University, Stanford, CA

Scope

Many kinds of data arising in bioinformatics can be usefully represented as graphs. Biopathways (metabolic pathways, signaling pathways, or gene regulatory networks) are among the most important examples of such graph data. Other examples include: taxonomies, e.g., of enzymes or organisms (directed acyclic graphs), protein interaction networks (undirected graphs), DNA, RNA or protein sequences (linear graphs), chemical structure graphs (undirected graphs), sequence fragment overlap graphs (interval graphs) for shotgun sequence assembly, experimental protocols (directed graphs), bibliographic citations (directed graphs), bibliographic co-citations (undirected graphs), gene co-expression (undirected graphs), genetic maps (partial orders), multiple sequence alignments (partial orders), contact graphs for protein structures (undirected graphs), etc. Graph data models have been studied in the database community for semantic data modeling, hypertext, geographic information systems, semi-structured data, XML, multi-media, etc. Graph data management has long formed the basis for chemical information retrieval systems. This tutorial is concerned with techniques for the modeling and management of such graph data.

The tutorial will cover a variety of biological graph data examples, but concentrate on issues related to biopathways. We will discuss the details of various graph data models directed vs. unidrected graphs, nested vs. unnested graphs, graphs vs. hypergraphs. labeling and type systems on edges and nodes, specification of attributes. We will discuss both edge list and incidence (adjacency) matrix representations of graphs. We will also discuss related work on binary relational data models (NIAM), and on mappings between relational (or entity-relationship) schemas and graph data model schemas. We will discuss various measures used to characterize graphs: average node degrees, graph diameter, number of connected components, various topological indices.

We will also explain a variety of graph queries: path queries, Boolean graph queries, subgraph matching queries, graph characterization queries, approximate subgraph matching queries. Then we will compare various proposals for graph query languages.

We will introduce some of the algorithms for processing graphs queries: e.g., transitive closures, subgraph isomorphism queries, etc. We will introduce basic concepts of query plan specification and optimization.

We will discuss the relative merits of several proposed implementation strategies: building a graph data manager on top of a relational backend, extending object-relational DBMS with graph data types and functions, etc.

We will briefly survey several graph data exchange formats.

Audience

This tutorial is intended primarily for computer scientists, bioinformaticists, database designers, and DBMS developers concerned with graph data management for biological applications. Computer literate biologists working with graph data will also find the course useful.

Prerequisites

It will be useful for students to have some previous exposure to concepts of relational databases (e.g., relational or entity-relation schemas, SQL, relational algebra). No previous knowledge of graph theory is assumed. See also discussion of the related morning tutorial below.

Related Tutorial

Some students may also wish to take the The Pathway Tools Software Tutorial by Randy Gobbel and Suzane Paley from SRI in the morning tutorials at IEEE Computer Society Bioinformatics Conference Both tutorials are concerned with storing and querying biopathways information. Our tutorial approaches these topics from a database management perspective, while their tutorial approaches similar topics from a knowledge representation perspective.

Location, Directions, Parking

The tutorial will be offered at the site of the IEEE Computer Society Bioinformatics Conference at Stanford University in Stanford, California. Building and Room number are yet to be assigned. Information on directions, parking, maps, etc. can be found at IEEE CSB Conference Site Information Web Page.

Date, Time and Duration

The tutorial will be given Monday, August 11, 2003 of the first day of IEEE Computer Society Bioinformatics Conference The time for the tutorial is currently set for the afternoon. This will be a half-day tutorial, approximately 3-3.5 hours long.

Registration and Fees

See the registration web page for the IEEE Computer Society Bioinformatics Conference for information on registration and fees for the tutorial.

Notes

Enrolled tutorial attendees will be provided with hardcopies of the tutorial slides.

Instructor

This instructor will be Frank Olken . He holds a Ph.D. from U.C. Berkeley in Computer Science and works in the scientific data management group at Lawrence Berkeley National Laboratory in Berkeley, California. He has taught data management for UC Berkeley, and tutorials on XML Schema Language and Work Flow Management Systems for UC Berkeley Extension. He also served on the W3C Committees for standardization of RDF Schemas, and XML Schema Language. He has worked on various topics in file migration, statistical and scientific data management, bioinformatics and computational biology, and workflow management. He curently leads the Biopathways Graph Data Manager Project at LBNL for the Arkin Lab.


This page last updated on April 1, 2003. Contact: Frank Olken concerning this web page.