ISMB-97: Intelligent Systems for Molecular Biology 1997 Gelfand Tutorial

The Fifth International Conference on Intelligent Systems for Molecular Biology (ISMB-97)

return to main index for ISMB97

Bayesian Inferences and Sampling Algorithms
A Second Generation in Statistical Bioinformatics
a tutorial
Chip Lawrence

Biometrics Lab, Wadsworth Labs, Albany, NY 12201
Phone (518) 473-3382 FAX: (518) 474-7992
Internet: chip.lawrence@wadsworth.org
Over the last few years statistical algorithms in the form of expectation maximization (EM) algorithms (Bailey and Elkans, 1994) (Lawrence and Reilly, 1996) hidden Markov models (HMM) (Baldi et. al, 1994)(Krough et.al. 1994), and Gibbs sampling algorithms (Lawrence et. al, 1993) (Liu, Nuewald, and Lawrence, 1995) have demonstrated their utility in biopolymer sequences analysis. The algorithmic advantages of these methods in solving some of bioinformatics` difficult combinatorial problems have been the basis for their success. However, only limited use has been made of two other key features of these approaches: statistical inference and generalizablity.

This tutorial will illustrate how Bayesian statistics give access to these features. We will begin with an overview of Bayesian statistics and associated probability theory. Next we will illustrate how nearly all existing single sequence and pairwise sequence methods can be formulated as stochastic models appropriate to Bayesian statistics. With this background as a starting point, we describe how Bayesian inferences can be used to solve some of the fields outstanding problems. For example, this approach eliminates the need to specify gap penalties and scoring matrices for pairwise alignment. We will describe how the exact posterior distribution of the number of gaps including the optimal number can be determined based on the data alone, and extend this derivation to selection of scoring matrices. Furthermore, we will illustrate how the "evidence" against the null, the Bayesian equivalent of a p-value, is obtained exactly. Generalizations of "dynamic programming" recursive algorithms which complete the sums required for these exact posteriors and yield samples from them will described.

Building on this foundation, the tutorial will also illustrate how nearly all existing single sequences and sequence pair methods can be generalized to multiple sequence algorithms using a unified Gibbs sampling approach (Liu and Lawrence, 1996). Just as algorithms for the determination of exact optimum for a multiple alignment are computational intractable, exact Bayesian inferences are also intractable. To address inferences for multiple sequences, approximate Bayesian inference criteria will be discussed. Applications of these approaches to sequence alignment, database searches, and structure prediction will be used through out to illustrate these concepts.