Since their introduction to the computational biology community [Haussler et al., 1993, Krogh et al., 1994a], hidden Markov models (HMMs) have gained increasing acceptance as a means of sequence modeling, multiple alignment, and profiling [Baldi et al., 1994, Eddy et al., 1995, Eddy, 1995]. A hidden Markov model is a statistical model similar to a profile [Gribskov et al., 1987, Bucher & Bairoch, 1994], but can also be estimated from unaligned sequences.
A linear HMM for a family of nucleotide or amino acid sequences is a set of positions that roughly correspond to columns in a multiple alignment. Typically, each position will have three states: match, insert, and delete, corresponding to matching a single character to a column of the multiple alignment, emitting characters not modeled by the HMM, or skipping over that column and proceeding to the next. Probabilities or costs (negative log-probabilities) are associated with each character emission and each transition between states. The alignment of a sequence is simply the highest-probability, or lowest-cost, path through the HMM.
The primary advantage of HMMs over, for example, profiles, is that they can be automatically estimated, or trained, from unaligned sequences. The most basic mathematical method of training an HMM does not work well without the addition of several mathematical extensions and heuristics. The body of this paper is a study of three of the most important extensions to the basic model presented previously [Krogh et al., 1994a]: regularizers, dynamic model modification, and `free insertion modules'. After reviewing these in turn, we present an experimental evaluation of the utility of these and other extensions.